Tuesday, September 7, 2010

IOI Wrap-up

In the past 2 years, I have been a member of the Host Scientific Committee (HSC) of the IOI. This is the body that comes up with problems and test data. While it consists primarily of people from the host country (Bulgaria in 2009, Canada in 2010), typically the host will have a call-for-problems and invite the authors of problems they intend to use.

This year, I was elected member of the International Scientific Committee (ISC). This committee works together with the HSC on the scientific aspects, the hope being that a perenial body will maintain similar standards of quality from one year to another. There are 3 elected members in the ISC, each serving 3-year terms (one position is open each year).

I anticipate this will be a lot of fun, and you will probably hear more about the IOI during this time. When a call for problems comes up (will be advertised here), do consider submitting!

I will end with an unusual problem from this IOI:

Consider the largest 50 languages on Wikipedia. We picked 200 random articles in each language, and extracted an excerpt of 100 consecutive characters from each. You will receive these 10000 texts one at a time in random order, and for each you have to guess its language. After each guess, your algorithm learns the correct answer. The score is the percentage of correct guesses.

To discourage students from coding a lot of special rules, a random permutation is applied to the Unicode alphabet, and the language IDs are random values. So, essentially, you start with zero knowledge.
Considering the tiny amount of training data and the real-time nature of guessing, one might not expect too good solutions. However, it turns out that one can get around 90% accuracy with relatively simple ideas.

My own approach was to define Score(text, language) as the minimal number of substrings seen previously in this language that compose the text. This can be computed efficiently by maintaining a suffix tree for each language, and using it to answer longest common prefix queries.

10 comments:

11011110 said...

It's not really true that one starts with zero knowledge in this problem, though, is it? The substitution cipher is trivial to break, probably even given a single example, after which you can match against known dictionaries, and the language ids can be deduced once you've found an example in each language. So I'd expect that with more care it should be possible to get around 99.5% accuracy: give up on getting the first of the 200 examples in each language, because until you guess wrong you don't know the language id, and after that get them all right.

gvc said...

Thanks, Mihai, for your contribution to Canda's HSC. I've enjoyed your paraphrased presentation of the problems, too.

1011110: here's an on-line version with no cypher to break.

Mihai said...

David, you only have 200 examples in each language, each of 100 characters. You certainly require several to break the permutation of the alphabet (substitution cypher). Are you going to break it statistically? I would guess you need 1000 characters in each language or so, but maybe I'm overestimating.

Anonymous said...

when is soda out btw?

Anonymous said...

Hi Mihai,
for curiosity, did you try it against 3- and 4-gram analysis?
On that data 4-gram scored 98 points..

Mihai said...

SODA results will be out in about a week. Hold your horses.

Yes, the most common solutions were based on q-grams. Note that the scores were scaled up, so if you got 98 points, it means you had about 88% accuracy.

My intuition is that the suffix tree approach should be a generalization of q-grams, since it adapts the "q" to the entropy of the text... If some letters are frequent (low entry) you will catch longer patterns containing those letters.

Still, it's not behaving significantly better than q-grams, so maybe this effect is not strong enough. Or maybe I'm doing something wrong in the tuning.

Anonymous said...

Here the "edit distance with moves" metric of Muthukrishnan and Cormode can be helpful. It can be computed in linear time.

This metric is always within log n factor of what Mihai proposed, but is known to perform very well for natural languages.

Mihai said...

I don't understand. You apply "edit distance with moves" to which 2 strings?

daveagp said...

Awesome! Congratulations and good luck with your continued participation in IOI-running.

Anonymous said...

Hi Mihai,
I am anonymous September 12.

Sorry my description was imprecise.

Let x be the unknown article and y be the concatenation of all articles that belong to say English.

Your suffix tree method is often denoted LZ(x | y).

You can also define appropriately EDM(x | y), which also gives a valid Lempel Ziv parsing. (Of course your algorithm gives the optimal LZ parsing. Still, one can show that EDM(x | y) is within log factor of it)

The good thing is EDM metric is highly related to finding a small context free grammar that generates the string, which somehow works great for natural languages.