Rather than make yet another set in the 90×90 range, I decided to do something a bit different for the 89×89 board size.
For the 91×90 and 90×89 galleries I experimented with making the level generator purposely add phantom dots (and later, phantom rectangles) before drawing the first line. Though my program rated these levels a bit better than the previous levels, I didn’t really know whether or not this actually helped the levels – though they certainly took a lot longer to generate.
So, I made two different batches of 89×89 images – both starting with the usual 12,000 levels but with different settings:
- The first batch used the fastest reasonable settings (the same settings I used for 120×120).
- The 2nd batch expanded on the phantom dot and rectangle experiments. I also made a couple of other smaller changes to the settings, designed to make the levels somewhat more random.
My level generator averaged 279 levels/hour for the first batch compared to 81 levels/hour for the 2nd batch. The average number of lines increased from 103 in the first to 134 in the 2nd.
One Goal Achieved
Part of the motivation for this experiment was that I wanted to make my program think the levels it selected – at least the ones from the 2nd batch – were better than those from any other set.
Here is a table of how my program evaluated each set since 100×100. The sets preceding 100×100 used a slightly different formula, so those scores aren’t on the same scale. A lower score is better.
| Set | Wgt Score | Last Score |
| --- | --- | --- |
| 100×100 | 41.55% | 36.97% |
| 95×85 | 41.96% | 36.92% |
| 101×99 | 42.23% | 37.33% |
| 111×99 | 45.26% | 40.21% |
| 111×109 | 43.07% | 38.32% |
| 120×110 | 43.00% | 38.07% |
| 100×100 New | 42.71% | 37.74% |
| 120×120 | 43.01% | 38.09% |
| 90×90 | 42.39% | 37.38% |
| 91×91 | 42.99% | 38.33% |
| 91×90 | 41.96% | 37.17% |
| 90×89 | 42.05% | 37.28% |
| 89×89 1st | 43.20% | 38.35% |
The last score is how my program scored the top 30 levels in the 9th and final iteration of similarity filtering. The weighted score also covers the top 30 levels, but it includes the scores from previous iterations as well, with each iteration’s score weighted at 60% of the following iteration’s. E.g., the 9th iteration has a weight of 1, the 8th a weight of 0.6, the 7th a weight of 0.36, and so on. This score is a lot higher because the scores are calculated differently for certain iterations – the average score for the 8th iteration is about 10 points higher than for the 9th.
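As a rough illustration of that weighting, here is a minimal Python sketch. The function name, and the assumption that the result is normalized by the total weight, are mine rather than anything from the actual program:

```python
def weighted_score(iteration_scores):
    """iteration_scores: per-iteration scores for the top 30 levels,
    ordered from the 1st iteration through the 9th (final) iteration."""
    total, total_weight = 0.0, 0.0
    # The final iteration gets weight 1; each earlier iteration's weight
    # is 60% of the iteration that follows it (1, 0.6, 0.36, ...).
    for i, score in enumerate(reversed(iteration_scores)):
        weight = 0.6 ** i
        total += weight * score
        total_weight += weight
    return total / total_weight  # normalizing by the total weight is my assumption
```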
111×99 is the outlier – the levels were generated shortly after I discovered and removed the phantom dots bug – which made my program 3x as fast, but also made the levels much worse according to the program. The scores after 111×99 are generally around 38% and about a point higher than 100×100 and 95×85 – while I added the phantom dots bug back after 111×99, I made some other changes to increase the speed at a “lower cost”.
91×90 and 90×89 both scored nearly as well as 100×100 due to my first experiments with purposely adding phantom dots, and I was expecting the 2nd batch of 89×89 to score even better.
It scored 0.9 points better than 100×100 by both measures, with a weighted score of 40.65% and 36.07% for the last iteration – exceeding my expectations by about half a point. Admittedly, a human is going to have a hard time seeing the difference between a level that scores 36% and one that scores 37%. Part of my plan for the 89×89 experiment was to see how well I could guess how the program rated each level.
Still, I was quite happy with this result.
More Bugs – Past, Present, and Future
The Original Phantom Dots Bug
The phantom dots bug was the 2nd major bug I discovered in my level generator that had a significant impact on similarity scores.
I discovered the 1st bug after the 70×56 set and before the 64×64 set. The bug effectively enforced a hard-coded limit of 500 cells for any individual line. According to the “no-touch” rule, a single line can cover at most 50% of a board, so this bug didn’t have any effect until the board size exceeded 1,000 cells (over 4x the size of any level in my games).
500 cells is 12.7% of the cells for a 70×56 board – roughly 2/3rds of the levels selected by my program have a line exceeding 12.7% of the cells (1005 cells for 89×89). I haven’t seen a level without at least 1 line above 500 cells since 80×80.
Up until and including the 70×56 set, my program scored each set better than the previous as I increased the board size. After fixing this bug, the 64×64 set scored about 1.5 points worse than the 70×56 – a final round score of 32.53% compared to 31.02%. The scores stopped improving with board size: the 7 sets between 64×64 and 80×80 had scores ranging from 32.45% to 33.07% – a range of less than half the difference between 70×56 and 64×64.
Interestingly, the main reason the scores suffered so much was indirectly caused by the phantom dots bug rather than being a direct result of the 500-cell limit – if the program tried to draw a line over the limit, it would induce the phantom dots bug for the next line it tried to draw.
In this case, however, I could not leave the bug in and keep increasing the board sizes the way I had before. Instead, I added a setting that lets me express the limit as a percentage of the board’s size, which I set to 24.9%. I also disagreed with my program and thought the 64×64 set was better than 70×56 despite the scores.
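For concreteness, here is a minimal sketch of a percentage-based line limit like the one described above; the function and parameter names are hypothetical, and 24.9% is just the value mentioned:

```python
def max_line_cells(board_w: int, board_h: int, max_line_fraction: float = 0.249) -> int:
    """Longest allowed line, expressed as a fraction of the board's total cells."""
    return int(board_w * board_h * max_line_fraction)

# The old hard-coded limit of 500 cells was 12.7% of a 70x56 board (3,920 cells);
# with the 24.9% setting, an 89x89 board (7,921 cells) allows lines up to 1,972 cells.
print(max_line_cells(70, 56))  # 976
print(max_line_cells(89, 89))  # 1972
```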
An All-Time Record?
Between the 80×80 and 100×100 sets, I changed the similarity score formula, which increased the average score in the last round by about 5 points. 70×56 scored 31.02% in the last round, so with a last-round score of 36.07%, the latest 89×89 batch might have been the top-scoring set of all time. To find out, I reran the final round of filtering for 70×56 to get its last-round score under the new formula, and was disappointed to find that its score of 36.02% kept it as the top set of all time, according to the final round score.
More importantly, while I was doing this – I discovered yet another fairly major bug in my program. Besides greatly damaging my confidence in my programming ability, this bug also impacted the similarity scores for the 1st through 6th and especially the 9th (or last) iterations.
Oops
Similarity filtering splits a set of levels into small groups (between 10 and 50 levels, depending on the iteration) and then compares each level within a group to every other level in that group.
Within each group, the code loops through the levels and compares the 1st level to all the other levels, the 2nd level to all the levels except the 1st (that comparison was already made), and so on.
Unfortunately, my program wasn’t actually applying each comparison’s score to both levels involved – the later level in each pair never received the scores from earlier passes through the loop.
For example – with a group size of 10, the 1st level’s similarity score would count 9 comparisons, the 2nd level would count 8, and the 9th would count only a single comparison (the 10th would count all 9, which I could but won’t explain).
Worse – one component of a level’s total similarity score is the maximum similarity score within the group – which is usually going to be much higher if a level is compared to 9 other levels as opposed to just 1.
Even worse (but also better) – the group size is 50 levels for the final iteration (compared to 10 for the 1st 6 iterations and 25 for the 7th and 8th), so the max score component was calculated from anywhere between 1 and 49 comparisons depending on each level’s random position in the order. This meant that the final round score that I was using to proclaim the latest set of levels the greatest thing ever didn’t mean as much as I thought it did. Part of the reason I used the final round score instead of the weighted score is that the final round was supposed to be the only one not impacted by randomness, since there is only 1 group of 50 levels.
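Here is a minimal sketch of the pairwise loop and the missing accumulation as I understand it; the names and structure are my own guesses, not the generator’s actual code:

```python
def score_group(group, similarity):
    """Accumulate each level's comparison count, total, and max similarity."""
    n = len(group)
    totals = [0.0] * n
    counts = [0] * n
    maxes = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):        # each pair is compared exactly once
            s = similarity(group[i], group[j])
            totals[i] += s
            counts[i] += 1
            maxes[i] = max(maxes[i], s)
            # The bug: updates like these for the 2nd level of the pair were
            # effectively missing, so level j never saw the comparisons made
            # during earlier passes through the outer loop.
            totals[j] += s
            counts[j] += 1
            maxes[j] = max(maxes[j], s)
    return totals, counts, maxes
```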
Fortunately – though this last iteration is a holdover from the days when a set was 30 levels instead of 60 – the scores for that iteration have no impact on the levels that are selected for imaging. Additionally, the 7th and 8th iterations were not impacted by the bug since they include a line uniqueness calculation in their formula, which follows a different code path.
The scores from the 1st 6 iterations (where the program eliminates 11,800 of the 12,000 levels) were impacted and do matter, though.
Get To The Point Already
- 89×89 almost broke the scoring record
- I found a bug that made this record less meaningful (not that it really had any meaning to begin with) and made similarity filtering less accurate.
- To evaluate the impact of the bug, I reran similarity filtering not once but twice for the 2nd batch of 89×89 levels.
- 67 of the 90 levels selected for imaging were the same after fixing the bug. More importantly, the 34 highest-ranked levels from the 2nd pass were all included (though ranked somewhat differently) in the 1st pass.
- I ran it a 3rd time to see how much variance to expect normally after fixing the bug. 72/90 levels were the same this time.
- As expected, the scores for the earlier iterations were much more consistent after fixing the bug.
- The final result wasn’t impacted much though – the bug is much more likely to give a “bad” level a good score than a good level a bad score, and the iterative process would weed those levels out eventually anyway.
- I estimate that the bug made levels score about 0.25 points higher in the final iteration. The 2nd batch of 89×89 improved by a bit more than that, scoring 34.14% and 34.21% the 2nd and 3rd times around, most likely because fixing the bug also made the program do a slightly better job of picking the best levels.
- My plan for the experiment was to take 60 levels from the 1st batch and 90 levels from the 2nd batch and randomly mix them together. Then I would do some subjective evaluations without knowing which batch a level came from or how my program rated it.
- As a result of the above sequence, I wrote the small script to mix the levels at 9:00 PM rather than 2:00 PM (a sketch of what it was supposed to do appears after this list).
- Instead of picking 60 levels from the 1st batch and 90 levels from the 2nd batch, it picked the top 60 levels from the 2nd batch and mixed them with all 90 levels from that same batch. Thus, there were 60 duplicate levels within the 150 levels – and all of them were from the 2nd “better” batch. This is the future (though now past) bug.
- I (finally) noticed a duplicate when I looked at the images for the 61st level.
- I discovered that I had already looked at 13 duplicate levels (with random rotation and randomized colors, at least that is my excuse) – and that all of them were from the “better” set.
- I remade the set as it was supposed to be done, throwing out the 900+ images I previously generated and looked at.
- My “blindness”, and with it the experiment, was ruined, however.
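For reference, here is a minimal sketch of what the mixing script was supposed to do; the function and argument names are hypothetical rather than taken from the actual script:

```python
import random

def mix_batches(batch1_levels, batch2_levels, n1=60, n2=90, seed=None):
    """Take the top n1 levels from batch 1 and the top n2 from batch 2,
    shuffled so I can't tell which batch a level came from while evaluating."""
    rng = random.Random(seed)
    mixed = batch1_levels[:n1] + batch2_levels[:n2]
    rng.shuffle(mixed)
    return mixed

# The 9:00 PM bug amounted to passing the same batch for both arguments,
# e.g. mix_batches(batch2_levels, batch2_levels), producing 60 duplicates.
```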
What About The Images?
The mistake I made after fixing the bug and re-running similarity filtering not only ruined the experiment, it ruined my expectations for the set. While I had 150 “levels” to look at, I had already seen 47 of them and wasn’t particularly impressed. And these were the levels that were supposed to be good.
Psychology plays a big role in how I evaluate any image at any given time – though I will admit that I had a hard time taking anything positive out of the 150 levels that I did eventually look at.
That being said – it only takes one really great level/image to make a set worthwhile. With a few exceptions, I am usually disappointed the first time I look at the images in a new set and don’t see a level that is immediately stunning – even though about half of my current favorites came from levels that were not instant favorites.
After I had gone through 145 of the 150 levels and failed to see that stunner, the 146th level provided it. Or at least I thought it did. After I generated 1000 images of the level with randomized colors and failed to see one that I really liked – I realized that the level just isn’t that good aesthetically. This completely killed any enthusiasm I had for the set.
Another Numbers Success (Kind Of)
Usually I wait until I finish a set before looking at how my program rated each level – but after the disappointment with the 146th level, I was “done” with the set even though it had no images, so I looked at the rankings.
By this time I had narrowed the 150 levels down to 8 potential levels for the gallery. It turns out that I picked the levels ranked #1, #2, and #3 out of 90 in the “better” batch. I have never agreed with my program’s ranking of the levels so much – despite strongly disagreeing with its assessment of the set in general. The infamous level 146 wasn’t one of these levels though – in fact it came from the “weak” batch, and was only ranked 12/60 for that batch.
One Image Survives
Out of the 150 levels and thousands of images I looked at, this is the only image that made the cut:
The image might make it into my current “best of the 90s” set, even though, unlike the rest of that set, it’s difficult to pick out a single distinctive feature.
And as a further example of cherry-picking the numbers when they suit my purposes – the level ranked #1 once and #2 twice in the 2nd best scoring set of all-time, making it one of the very best levels ever according to my program. Admittedly – I was extremely surprised to find this out given the pattern’s relative simplicity compared to other levels in the set.
One thing I did notice while editing the line colors – and it might be why my program scored the level so highly – is that just about every medium-to-long line (20+ cells) does something interesting, even if none of them is individually all that crazy. Of the 26 lines of 20+ cells, there is only 1 exception.