User:Hagman/Marathon

General Method

In short: Find M-blocks, extend them to M/I-blocks, merge these to M/I/D-blocks.

M-blocks

Find large identical blocks. Detect blocks by hash values of four consecutive characters (=one unsigned int). Accept only blocks of length at least B+1, or increase this limit until there are at most 2000 blocks left. Some articles (Osaka University, Philippine Airlines) seem to produce plenty of small blocks, possibly because of large sequences of question marks from non-ASCII characters. Sort blocks, early destination first, in case of tie longer first. Try M-blocks in this order to produce...

MI-blocks

An MI-block (initially, M-block) that produces $[dst_{0},dst_{1})$ from $[src_{0},src_{1})$ of cost $C=B+(dst_{1}-dst_{0})-(src_{1}-src_{0})$ influences costs as follows: Let $f(x)$ be the cost to produce the destination string up to (excluding) $x$ , initially $f(x)=x$ . Then we get a new constraint $f(x)\leq f(dst_{0})+C$ if $x\leq dst_{1}$ and $f(x)\leq (x-dst_{1})+(f(dst_{0})+C)$ if $x\geq dst_{1}$ .

Apply the constraints given by an M-block. For each block also try to extend it to the left by finding the preceeding character somewhat further left in the destination string. This effectively prepends someting like "MM..MIII..I". Apply the corresponding constraint as well. Repeat a few times. If at least one of these really changed $f$ , take the best of them (i.e. the one responsible for $f(dst_{1})$ ) and try to extend it to the right (with "II..IMM..M") repeatedly.

MID-blocks

Find the best sequence of MI-blocks. Note that this at best finds (or approximates) the optimal sequence possible without deletions 'D'. Only as an afterthought, consecutive blocks are joined into MID-blocks with sequences of 'D' (or - with some luck - 'M') if the change is worth it. In principle, starting with a worse sequence of MI-blocks might produce a better sequence of MID-blocks. However, I assume that with real life wikipedia data, the gain is only marginal compared to the added complexity.

Time Management

The sigmoidal characteristic of the time dependency made it essential to check the clock. I observed long delays of up to 14 seconds(!) between creation of static data and the call to my constructor, as well as several milliseconds between the constructor and the method. Fortunately, the first interval seems not to count against me, while the second does.

With big test cases, one might run out of time. Therefore I thought it essential to start with the big points, i.e. treat versions from later to earlier as earlier versions differ a lot from the target. Also, I look for edit-wars and reversions such that versions occurring repeatedly are treated once and the result is directly copied to the others.

When to Stop

In case we are running out of time, stop when I expect that less than three versions can be treated in the time remaining. To stop means that all remaining versions get a trivial "III...I" solution. I hope I got the time measurement very good, for otherwise I might suffer from timeouts. In debug mode, I could make Martial Arts successfully eat about 59900 ms without ever timing out.

Alternatively, stop if the next step won't bring improvement, i.e. the expected raw score times the expected time penalty is less than the current product. The expected raw score and execution time are extrapolated from the previous 10 strings. Again, I hope the time measurement was accurate as otherwise I might think the steep part of the sigmoidal curve is somewhere else.

Unexpected Improvements

I had learned from Intel Marathons that one should always improve the algorithm whereas manual "optimization" (including assembler) always worsens the result.

However, this time I climbed several places by

traversing one vector from back to front instead of forward for a certain search. (Doing the same by binary search was even worse than forward; the "hit" was simply distributed such that it was often near the end.
getting rid of a sentinel. Additional checks were cheaper than pushing the bytes around when something got inserted.
replacing an erase() of n followed by an insert with an erase() of n-1 followed by an assignment
unfolding a string comparison from char-wise to unsigned-wise (plus trailing)
replacing one vector<foo> with a handwritten lookalike. The time to erase a range dropped from dominant to neglectible!

I really haven't changed my approach from the beginning. All I did was start with M-blocks, then add MI-blocks and finally MID-blocks, but only as a matter of implementation order, not as additional ideas. However, a lot of finetuning was added with respect to time management, deciding what to ignore, optimizing for speed.

Lessons Learned

Know your input!

Wikipedia articles grow from stub to encyclopedic. I take this to justify my asymmetric preference of 'I' over 'D'.
Of course this also justifies starting with more recent versions.
There are edit wars and reverts. Thus looking for identical strings is worth while.

Tests are valuable

Many have suffered from mismatched 'M' in position x in version 0 , which was usually caused my misunderstanding the problem statement. I observed a mismatched 'M' in position x in version 1000-and-something with all other examples passing ok. Such things show that there is really a bug in your code. Fortunately, I could verify the mismatch at home, but it took me a while to find the bug because it looked totally weird that the code should decide to produce the output observed.

Optimize with care

Selfexplanatory.

Trivia

Mork was always fun, even though it is just a nano-nano example.
Jahwist sometimes surprised me with unbelievable results.
I could learn a lot from Osaka University.
I'm glad that Philippine Airlines never crashed my code.
Most of the time I fought with Martial Arts.