Wednesday, September 29, 2010
Problem solving versus new techniques
This is a guest post by Mikkel Thorup:
--------------------------
I think there is nothing more inhibiting for problem solving than referees looking for new general techniques.
When I go to STOC/FOCS, I hope to see some nice solutions to important problems and some new general techniques. I am not interested in semi-new techniques for semi-important problems. A paper winning both categories is a wonderful but rare event.
Thus I propose a max-evaluation rather than a sum. If the strength of a paper is that it solves an important problem, then speculations on the generality of the approach are of secondary importance. Conversely, if the strength of the paper is some new general techniques, then I can forgive that it doesn't solve anything new and important.
One of the nice things about TCS is that we have problems that are important, not just as internal technical challenges, but because of their relation to computing. At the end of the day, we hope that our techniques will end up solving important problems.
Important problems should be solved in whatever way comes naturally. It may be deep problem-specific understanding, and it may build on previous techniques. Why would we be disappointed if an old problem got solved by a surprising reuse of an old technique?
Posted by Mihai at 4:47 PM 11 comments
Monday, September 27, 2010
Retrieval-Only Dictionaries
We saw two cool applications of dictionaries without membership; now it's time to construct them. Remember that we are given a set S of n keys, where each element x∈S has some associated data[x], a k-bit value. We want a data structure of O(nk) bits which retrieves data[x] for any x∈S and may return garbage when queried for x∉S.
The construction (in the style of Majewski, Wormald, Havas, and Czech) picks two hash functions h1, h2 : U → [m], for m slightly above 2n (say m = 2.1n), and stores an array A of m cells of k bits each, defining data[x] = A[h1(x)] ⊕ A[h2(x)]. Think of the array positions as the vertices of a graph, with an edge (h1(x), h2(x)) for each x∈S. If this graph is acyclic, the cells can be filled in by peeling: repeatedly pick an element whose edge has an endpoint of degree one, and later set that free cell to make the element's equation hold. Two facts make this efficient (a toy sketch follows the list):
- the graph is acyclic with some constant probability. Thus, the construction algorithm can rehash until it finds an acyclic graph, taking O(n) time in expectation.
- the total length of all cycles is O(lg n) with high probability. Thus we can make the graph acyclic by storing O(lg n) special elements in a stash. This gives construction time O(n) w.h.p., but the query algorithm is slightly more complicated (for instance, it can handle the stash by a small hash table on the side).
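Here is a toy sketch of the rehash-until-acyclic variant. I model the hash functions as tables of random values indexed by the keys, so unlike a real retrieval-only dictionary this toy can only be evaluated on x∈S; a real implementation would use hash functions defined on all of U. The stash variant is omitted.

```python
import random

def build(data):
    # data: dict mapping each key x in S to a k-bit value data[x]
    n = len(data)
    m = max(4, int(2.1 * n))
    while True:  # expected O(1) attempts: rehash until the graph is acyclic
        h1 = {x: random.randrange(m) for x in data}
        h2 = {x: random.randrange(m) for x in data}
        incident = [[] for _ in range(m)]     # elements touching each cell
        for x in data:
            incident[h1[x]].append(x)
            incident[h2[x]].append(x)
        deg = [len(c) for c in incident]
        removed, order = set(), []
        queue = [v for v in range(m) if deg[v] == 1]
        while queue:                          # peel cells of degree 1
            v = queue.pop()
            if deg[v] != 1:
                continue
            x = next(y for y in incident[v] if y not in removed)
            removed.add(x)
            order.append((x, v))              # v is x's "free" cell
            for u in (h1[x], h2[x]):
                deg[u] -= 1
                if deg[u] == 1:
                    queue.append(u)
        if len(removed) < n:
            continue                          # a cycle survived; rehash
        A = [0] * m
        for x, v in reversed(order):          # assign in reverse peeling order
            other = h2[x] if v == h1[x] else h1[x]
            A[v] = data[x] ^ A[other]
        return A, h1, h2

def query(A, h1, h2, x):
    # Correct for x in S; garbage (but no error) for anything else.
    return A[h1[x]] ^ A[h2[x]]
```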
Posted by Mihai at 2:17 PM 2 comments
Tuesday, September 21, 2010
Static 1D Range Reporting
Method 4 for implementing van Emde Boas with linear space, described in my last post, is due to [Alstrup, Brodal, Rauhe: STOC'01]. They worked on static range reporting in 1 dimension: preprocess a set of integers S, and answer query(a,b) = report all points in S ∩ [a,b]. This is no harder than predecessor search: first find the predecessor of a, then output points in sorted order until you exceed b. Using van Emde Boas, we would achieve a linear-space data structure with query time O(lglg u + k), where k is the number of points reported.
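As a sanity check, here is a minimal sketch of that reduction, with a sorted list and binary search (Python's bisect) standing in for the predecessor structure; van Emde Boas would replace the O(lg n) search with O(lglg u):

```python
from bisect import bisect_left

def report(points, a, b):
    # points: the set S, as a sorted list (sorted order plays the role of
    # successor pointers). Locate the first point >= a, then walk until b.
    out = []
    i = bisect_left(points, a)
    while i < len(points) and points[i] <= b:
        out.append(points[i])
        i += 1
    return out   # total time: one search plus O(k)
```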
Static 1D range reporting can be solved with O(n) space and O(1+k) query time.
The query works in the trie of S, in the terminology of my last post (branching nodes and active nodes), with each branching node storing the min and max of its subtree. First compute LCA(a,b), the lowest common ancestor of the leaves a and b (a small helper for this appears below). Then:
- if LCA(a,b) is a branching node, look at the two descendant branching nodes. If the interval [a,b] is nonempty, it must contain either the max in the tree of the left child, or the min in the tree of the right child.
- if LCA(a,b) is an active node, go to its lowest branching ancestor, and do something like the above.
- if LCA(a,b) is not an active node, the interval [a,b] is certainly empty!
To support these steps, the structure stores the same active nodes as Method 4:
- active nodes at depth i·√lg u;
- active nodes less than √lg u levels below a branching node.
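For concreteness, here is the standard constant-time LCA computation in the trie of w-bit keys (a sketch; encoding a node as a (depth, prefix) pair is my own convention, not notation from the paper):

```python
def lca(a, b, w):
    # The LCA of leaves a and b is their longest common prefix: its depth is
    # w minus the position of the most significant bit where a and b differ.
    if a == b:
        return (w, a)
    depth = w - (a ^ b).bit_length()
    return (depth, a >> (w - depth))
```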
Posted by Mihai at 1:20 PM 0 comments
vEB Space: Method 4
In the previous post I described 3 ways of making the "van Emde Boas data structure" take linear space. I use quotes since there is no unique vEB structure, but rather a family of data structures inspired by the FOCS'75 paper of van Emde Boas. By the way, if you're curious who van Emde Boas is, you can find a portrait on his webpage.
- Find v, the lowest active ancestor of the query at some depth of the form i·√lg u (binary search on i; a sketch of this search appears below). Say v is on the path x→y between branching nodes x and y. If y is not an ancestor of the query, return x.
- Otherwise, the lowest branching ancestor of the query is at some depth in [ i·√lg u , (i+1)·√lg u ]. Binary search to find the lowest active ancestor of the query in this range, and follow its pointer to the lowest branching ancestor.
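Here is a sketch of the first binary search, under my own encoding assumptions: keys are w-bit integers (w = lg u), and ACTIVE is a set containing each stored active node as a (depth, prefix) pair, the prefix being the node's high-order bits.

```python
import math

def ancestor(x, depth, w):
    # The ancestor of leaf x at a given depth is x's high-order `depth` bits.
    return (depth, x >> (w - depth))

def lowest_active_multiple(x, w, ACTIVE):
    s = max(1, math.isqrt(w))   # block size, roughly sqrt(lg u)
    lo, hi = 0, w // s
    # Binary search for the largest i such that the ancestor of x at depth
    # i*s is active. The root (depth 0) is always active.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if ancestor(x, mid * s, w) in ACTIVE:
            lo = mid
        else:
            hi = mid - 1
    return lo * s   # depth of v, the node from the first step above
```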
The space is dominated by the hash table of active nodes, which can hold Θ(n·√lg u) entries, so we cannot afford to spend Θ(lg u) bits on a key and a pointer for each. Recall that a dictionary query has two components:
- membership: is x in the set?
- retrieval: assuming x is in the set, return data[x].
Two observations bring the space down to O(n) words:
- We only need k=O(lglg u) bits of associated data. Instead of storing a pointer to the lowest branching ancestor, we can just store the height difference (a number between 1 and lg u, hence lglg u bits). This is effectively a pointer: we can compute the branching ancestor by zeroing out that many bits of the node (see the one-line helper below).
- We only need to store them in a retrieval-only dictionary. Say we query some node v and find a height difference δ to the lowest branching ancestor. We can verify whether v was real by looking up the δ-levels-up ancestor of v in the hash table of branching nodes, and checking that v lies on one of the two paths descending from this branching node.
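To illustrate the height-difference trick, with nodes encoded as (depth, prefix) pairs as in the earlier sketch:

```python
def branching_ancestor(node, delta):
    # Following the "pointer" delta: moving up delta levels just zeroes out
    # (drops) the last delta bits of the node's prefix.
    depth, prefix = node
    return (depth - delta, prefix >> delta)
```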
Posted by Mihai at 11:11 AM 1 comments
Sunday, September 19, 2010
Van Emde Boas and its space complexity
In this post, I want to describe 3 neat and very different ways of making the space of the van Emde Boas (vEB) data structure linear. While this is not hard, it is subtle enough to confuse even seasoned researchers at times. In particular, it is the first bug I ever encountered in a class: Erik Demaine was teaching Advanced Data Structures at MIT in spring of 2003 (the first grad course I ever took!), and his solution for getting linear space was flawed.
predecessor(q) = max { x ∈ S | x ≤ q }
Recall the structure. Split each key x into hi(x), the high half of its bits, and lo(x), the low half. In its simplest form, the data structure consists of (a compact sketch follows the list):
- a hash table H storing hi(x) for all x ∈ S.
- a top structure solving predecessor search among { hi(x) | x ∈ S }. This is the same as the original data structure, i.e. use recursion.
- for each element α∈H, a recursive bottom structure solving predecessor search inside the α segment, i.e. among the keys { lo(x) | x ∈ S and hi(x)=α }.
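For concreteness, here is a compact sketch of this recursion (my own rendering, assuming distinct integer keys). It stores the min and max in every structure so that each query recurses on one side only, giving O(lglg u) query time; it does not implement any of the space-saving methods, which is exactly the problem discussed below.

```python
class VEB:
    # Sketch of the structure above. keys: nonempty list of distinct w-bit ints.
    def __init__(self, keys, w):
        self.w, self.min, self.max = w, min(keys), max(keys)
        self.H = None
        if w <= 1 or len(keys) == 1:
            return
        self.half = half = w // 2            # lo gets half bits, hi the rest
        buckets = {}
        for x in keys:
            buckets.setdefault(x >> half, []).append(x & ((1 << half) - 1))
        self.H = {a: VEB(b, half) for a, b in buckets.items()}  # bottom structures
        self.top = VEB(list(buckets), w - half)                 # recurse on segments

    def pred(self, q):
        # predecessor(q) = max { x in S | x <= q }, or None if q < min(S)
        if q < self.min:
            return None
        if q >= self.max:
            return self.max
        if self.H is None:                   # base case: 1-bit keys or a single key
            return self.min
        a, b = q >> self.half, q & ((1 << self.half) - 1)
        if a in self.H and b >= self.H[a].min:
            return (a << self.half) | self.H[a].pred(b)   # answer lies in segment a
        t = self.top.pred(a - 1)             # else: the max of the preceding segment
        return (t << self.half) | self.H[t].max
```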
Another method works on the decomposition of the trie into paths, cut at depth ½ lg u. The structure consists of:
- a top structure, with all paths that begin and end above depth ½ lg u.
- a hash table H with the nodes at depth ½ lg u of every path crossing this depth.
- for each α∈H, a bottom structure with all paths starting below depth ½ lg u which have α as prefix.
The query for a key q:
- look in H for q's ancestor at depth ½ lg u. If not found, recurse in the top structure.
- if found, let p be the bottom node of the path crossing depth ½ lg u which we just found in the hash table. If p is below depth ½ lg u, recurse in the appropriate bottom structure. (We have no work to do on this level.)
Posted by Mihai at 2:34 PM 3 comments
Tuesday, September 7, 2010
IOI Wrap-up
In the past 2 years, I have been a member of the Host Scientific Committee (HSC) of the IOI. This is the body that comes up with problems and test data. While it consists primarily of people from the host country (Bulgaria in 2009, Canada in 2010), typically the host will have a call-for-problems and invite the authors of problems they intend to use.
This year, I was elected a member of the International Scientific Committee (ISC). This committee works together with the HSC on the scientific aspects, the hope being that a perennial body will maintain similar standards of quality from one year to the next. There are 3 elected members in the ISC, each serving a 3-year term (one position opens each year).
I anticipate this will be a lot of fun, and you will probably hear more about the IOI during this time. When a call for problems comes up (will be advertised here), do consider submitting!
I will end with an unusual problem from this IOI:
Consider the largest 50 languages on Wikipedia. We picked 200 random articles in each language, and extracted an excerpt of 100 consecutive characters from each. You will receive these 10000 texts one at a time in random order, and for each you have to guess its language. After each guess, your algorithm learns the correct answer. The score is the percentage of correct guesses.
Considering the tiny amount of training data and the real-time nature of guessing, one might not expect very good solutions. However, it turns out that one can get around 90% accuracy with relatively simple ideas.
To discourage students from coding a lot of special rules, a random permutation is applied to the Unicode alphabet, and the language IDs are random values. So, essentially, you start with zero knowledge.
My own approach was to define Score(text, language) as the minimal number of substrings seen previously in this language that compose the text. This can be computed efficiently by maintaining a suffix tree for each language, and using it to answer longest common prefix queries.
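As a toy sketch of this idea: plain substring search over a language's accumulated text stands in for the suffix tree (so this is far too slow for the real contest), and the treatment of never-before-seen characters is my own assumption.

```python
def score(text, corpus_text):
    # Minimal number of pieces of `text` that each occur in `corpus_text`.
    # Greedy longest-match is optimal here because the set of substrings of
    # corpus_text is closed under taking substrings.
    pieces, i = 0, 0
    while i < len(text):
        j = i + 1
        while j <= len(text) and text[i:j] in corpus_text:
            j += 1
        # text[i:j-1] is the longest piece seen before; a never-seen character
        # counts as its own piece (an assumption of this sketch).
        i = max(j - 1, i + 1)
        pieces += 1
    return pieces

def guess(text, corpus):
    # corpus: dict mapping language id -> all text seen so far in that language;
    # after the judge reveals the answer, append `text` to corpus[answer].
    return min(corpus, key=lambda lang: score(text, corpus[lang]))
```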
Posted by Mihai at 2:23 PM 10 comments