Comments on WebDiarios de Motocicleta: IOI Wrap-up

Hi Mihai, I am anonymous September 12. Sorry my d...

2010-09-22T06:46:00.335-04:00

Hi Mihai,
I am anonymous September 12.

Sorry my description was imprecise.

Let x be the unknown article and y be the concatenation of all articles that belong to, say, English.

Your suffix tree method is often denoted LZ(x | y).

You can also define EDM(x | y) appropriately, which also gives a valid Lempel-Ziv parsing. (Of course, your algorithm gives the optimal LZ parsing. Still, one can show that EDM(x | y) is within a log factor of it.)

The good thing is that the EDM metric is closely related to finding a small context-free grammar that generates the string, which somehow works great for natural languages.
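
To make the LZ(x | y) notation above concrete, here is a minimal sketch. It assumes LZ(x | y) means the number of phrases obtained when x is greedily cut into the longest pieces that occur as substrings of y, with a single literal character as the fallback. The function names are made up for illustration, and the naive substring search stands in for the suffix tree, so this is far slower than the real thing.

```python
def lz_relative_phrases(x: str, y: str) -> list[str]:
    """Greedily parse x into phrases that are substrings of y (or single literals)."""
    phrases = []
    i = 0
    while i < len(x):
        # Extend the candidate phrase while x[i:j] still occurs somewhere in y.
        j = i + 1
        while j <= len(x) and x[i:j] in y:
            j += 1
        # x[i:j-1] is the longest match; if it is empty, emit one literal character.
        end = max(j - 1, i + 1)
        phrases.append(x[i:end])
        i = end
    return phrases


def classify_by_lz(x: str, corpora: dict[str, str]) -> str:
    """Pick the language whose concatenated corpus parses x into the fewest phrases."""
    return min(corpora, key=lambda lang: len(lz_relative_phrases(x, corpora[lang])))
```

For example, with y = "abcd" the string "abcabd" is parsed as "abc", "ab", "d", so the phrase count is 3.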

Awesome! Congratulations and good luck with your c...

2010-09-14T03:13:19.909-04:00

Awesome! Congratulations and good luck with your continued participation in IOI-running.

I don't understand. You apply "edit dista...

2010-09-12T14:29:07.783-04:00

I don't understand. You apply "edit distance with moves" to which 2 strings?

Here the "edit distance with moves" metr...

2010-09-12T00:54:47.367-04:00

Here the "edit distance with moves" metric of Muthukrishnan and Cormode can be helpful. It can be computed in linear time.

This metric is always within a log n factor of what Mihai proposed, but is known to perform very well for natural languages.

SODA results will be out in about a week. Hold you...

2010-09-09T10:55:29.634-04:00

SODA results will be out in about a week. Hold your horses.

Yes, the most common solutions were based on q-grams. Note that the scores were scaled up, so if you got 98 points, it means you had about 88% accuracy.

My intuition is that the suffix tree approach should be a generalization of q-grams, since it adapts the "q" to the entropy of the text... If some letters are frequent (low entropy) you will catch longer patterns containing those letters.

Still, it's not behaving significantly better than q-grams, so maybe this effect is not strong enough. Or maybe I'm doing something wrong in the tuning.
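
For comparison, a q-gram classifier can be sketched roughly as follows. This is only an illustration, not the contestants' actual code: the scoring rule (count how many q-grams of the text were seen in each language's training samples) and all names are assumptions.

```python
from collections import Counter


def qgrams(text: str, q: int) -> list[str]:
    """All overlapping substrings of length q."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def train(samples: dict[str, list[str]], q: int = 4) -> dict[str, Counter]:
    """Build one q-gram frequency profile per language from its sample texts."""
    return {
        lang: Counter(g for s in texts for g in qgrams(s, q))
        for lang, texts in samples.items()
    }


def classify_by_qgrams(text: str, profiles: dict[str, Counter], q: int = 4) -> str:
    """Pick the language whose profile contains the most q-grams of the text."""
    def score(lang: str) -> int:
        profile = profiles[lang]
        return sum(1 for g in qgrams(text, q) if g in profile)
    return max(profiles, key=score)
```

Unlike the suffix tree approach, q is fixed here, so long patterns made of frequent letters count the same as any other match of length q.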

Hi Mihai, for curiosity, did you try it against 3...

2010-09-09T05:11:27.685-04:00

Hi Mihai,
out of curiosity, did you try it against 3- and 4-gram analysis?
On that data, 4-grams scored 98 points.

when is soda out btw?

2010-09-09T03:28:51.679-04:00

when is soda out btw?

David, you only have 200 examples in each language...

2010-09-08T13:44:17.808-04:00

David, you only have 200 examples in each language, each of 100 characters. You certainly require several to break the permutation of the alphabet (substitution cypher). Are you going to break it statistically? I would guess you need 1000 characters in each language or so, but maybe I'm overestimating.

Thanks, Mihai, for your contribution to Canada'...

2010-09-07T20:49:40.756-04:00

Thanks, Mihai, for your contribution to Canada's HSC. I've enjoyed your paraphrased presentation of the problems, too.

1011110: here's an on-line version with no cypher to break.

It's not really true that one starts with zero...

2010-09-07T18:52:17.005-04:00

It's not really true that one starts with zero knowledge in this problem, though, is it? The substitution cipher is trivial to break, probably even given a single example, after which you can match against known dictionaries, and the language ids can be deduced once you've found an example in each language. So I'd expect that with more care it should be possible to get around 99.5% accuracy: give up on getting the first of the 200 examples in each language, because until you guess wrong you don't know the language id, and after that get them all right.