December Adventure Day 19

Published as part of 'December Adventure 2025' series.

Yesterday was basically a day off. I looked a little bit at the current state of Scheme implementations... I don't know what I'm waiting for, but I think I'm waiting a little longer.

Today I wrote notes toward a new ebook chunking algorithm for @tomes.

The idea is for the splitting to work from the whole instead of accumulating from the beginning, to balance chunks better.

As an example, a series of chapters with the following character counts:

  1. 12,000
  2. 8,000
  3. 12,000

Currently, the accumulation strategy grows each chunk until hitting around 8,000 characters, and would result in chunks approximately like so:

  1. 8,000
  2. 8,000 (spanning the chapter 1-2 border)
  3. 8,000 (spanning the chapter 2-3 border)
  4. 8,000
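The accumulation strategy might be sketched like this (a minimal illustration, not the actual @tomes code; the function name and greedy slicing are my assumptions):

```python
def chunk_by_accumulation(chapters, target=8_000):
    """Greedy sketch: concatenate all chapter text, then slice it
    into target-sized chunks, ignoring chapter borders entirely."""
    text = "".join(chapters)
    return [text[i:i + target] for i in range(0, len(text), target)]

# The example above: 12,000 + 8,000 + 12,000 characters.
chapters = ["a" * 12_000, "b" * 8_000, "c" * 12_000]
print([len(c) for c in chunk_by_accumulation(chapters)])
# four 8,000-character chunks, two of them straddling borders
```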

Instead, look first for any chapters within the threshold, and make their chunks (nearly?) indivisible. Then chunk the remainders "evenly":

  1. 6,000 (chapter 1, front half)
  2. 6,000 (chapter 1, back half)
  3. 8,000 (the entirety of chapter 2)
  4. 6,000 (chapter 3, front half)
  5. 6,000 (chapter 3, back half)
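The whole-first approach could be sketched as follows (again a hypothetical illustration: chapters at or under the threshold stay whole, and longer ones are split into the fewest roughly equal pieces that fit):

```python
import math

def chunk_by_whole(chapters, threshold=8_000):
    """Chapters at or under the threshold become single chunks;
    longer chapters split into the fewest pieces that each fit
    under the threshold, sized as evenly as possible."""
    chunks = []
    for text in chapters:
        pieces = max(1, math.ceil(len(text) / threshold))
        size = math.ceil(len(text) / pieces)
        chunks.extend(text[i:i + size] for i in range(0, len(text), size))
    return chunks

chapters = ["a" * 12_000, "b" * 8_000, "c" * 12_000]
print([len(c) for c in chunk_by_whole(chapters)])
# [6000, 6000, 8000, 6000, 6000]
```

No chunk spans a chapter border, matching the example above.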

Additionally, apply some penalties to certain pieces of markup to try to avoid breaking up sections that would likely suffer.

For example, near the middle of a long chapter:

<p>
    .... long paragraph ...
</p>
<p>"Quick dialog," she said.</p>
<p>"Witty retort," he replied.</p>
<p>"Devastating comeback."</p>
<p>
    .... long paragraph ...
</p>

It'd be nice to penalize breaking a chunk between short paragraphs, which are likely dialog that's better kept together. And to apply large penalties near the beginning and end of a chapter, to avoid cutting too close to natural seams.

Then find the lowest-penalty cut points near where a naïve even split would land.
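Picking a cut point might look something like this (a sketch under my own assumptions: candidate positions are paragraph borders, penalties are precomputed from the markup, and distance from the ideal split trades off against penalty via an arbitrary weight):

```python
def best_cut(candidates, penalties, ideal, distance_weight=0.5):
    """Pick the cut position that balances closeness to the ideal
    even-split offset against markup penalties. `candidates` are
    character offsets of allowed cut points (paragraph borders);
    `penalties` maps offsets to a cost for cutting there."""
    return min(candidates,
               key=lambda pos: penalties.get(pos, 0)
               + distance_weight * abs(pos - ideal))

# Hypothetical numbers: cuts between short dialog paragraphs
# (5,900 and 6,100) and near chapter edges (500, 11,500) cost more.
candidates = [500, 5_200, 5_900, 6_100, 6_400, 11_500]
penalties = {500: 1_000, 5_900: 800, 6_100: 800, 11_500: 1_000}
print(best_cut(candidates, penalties, ideal=6_000))  # → 6400
```

The cut lands just past the dialog run rather than in the middle of it, even though 5,900 and 6,100 are closer to the ideal point.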