Posts from 'december-adventure-2025' series - 2

I took a break! I got to enjoy the fruits of the past 10 days when I read O Pioneers! part 2 of 39 this morning.

Some things I might pick up when I continue the adventure:

  • Error handling.
  • Rewrite in another language.
  • Investigate why the chunking script failed on the Standard Ebooks edition of Frankenstein.
  • Do something else! My little e-ink dashboard could use a Message Of The Day.

See you tomorrow!

When I tried cutting up Standard Ebooks' Frankenstein as a possible book for the book-posting bot, the script failed with an error suggesting there was a "navigation item" with a "target" that wasn't in the "manifest". I looked into that today and discovered that a nav item's target can include an anchor. In this case there's a subsection in chapter 24 for "Walton, in Continuation" that points to text/chapter-24.xhtml#walton-in-continuation.

The manifest only lists the files themselves, so there was nothing matching exactly text/chapter-24.xhtml#walton-in-continuation.

So I did the easiest thing I could think of: I strip any anchors off and track which files have already been processed (so I don't end up repeating the content of text/chapter-24.xhtml).

This will work as long as I'm dealing with linear narrative, which I think should be a safe assumption for a while.
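A minimal sketch of that idea in Python. The function name and the shape of its inputs (a list of nav targets and a set of manifest paths) are assumptions for illustration, not the bot's actual code:

```python
from urllib.parse import urldefrag


def files_in_reading_order(nav_targets, manifest_files):
    """Map nav targets to manifest files, dropping anchors and repeats."""
    seen = set()
    ordered = []
    for target in nav_targets:
        # "text/a.xhtml#b" -> ("text/a.xhtml", "b")
        path, _fragment = urldefrag(target)
        if path not in manifest_files:
            raise ValueError(f"nav target not in manifest: {target}")
        if path in seen:
            continue  # this file was already queued by an earlier nav item
        seen.add(path)
        ordered.append(path)
    return ordered
```

Skipping repeats is what leans on the linear-narrative assumption: a second nav item pointing into the same file is treated as "already covered" rather than as a separate piece of content.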

This morning I continued my adjustments to make Frankenstein work.

Each XHTML file in a Standard Ebooks epub basically looks like this:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops"
      lang="en-GB"
      epub:prefix="z3998: http://www.daisy.org/z3998/2012/vocab/structure/, se: https://standardebooks.org/vocab/1.0" xml:lang="en-GB">
	<head>
		<title>Chapter XXIII</title>
		<link href="../css/core.css" rel="stylesheet" type="text/css"/>
		<link href="../css/local.css" rel="stylesheet" type="text/css"/>
	</head>
	<body epub:type="bodymatter z3998:fiction">
		<section id="chapter-23" role="doc-chapter" epub:type="chapter">
			<h2>
				<span epub:type="label">Chapter</span>
				<span epub:type="ordinal z3998:roman">XXIII</span>
			</h2>
			<p>It was eight o’clock when we landed...</p>
			<p>...</p>
			<p>...</p>
			...
		</section>
	</body>
</html>

One assumption of the script that splits up the XHTML files is that each immediate child of the <body>'s <section> will be "small". So all it does is take each child, check whether adding it would keep the current chunk within the size threshold, and either add it or start a new chunk.
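Roughly, in Python. This is a sketch under the assumption that each child has already been serialized to a string. The name accumulate_chunks is made up for the example:

```python
def accumulate_chunks(children, threshold):
    """Greedily pack serialized children into chunks of at most `threshold` chars."""
    chunks = []
    current = ""
    for child in children:
        if current and len(current) + len(child) > threshold:
            # Adding this child would overflow, so close the chunk and start anew.
            chunks.append(current)
            current = child
        else:
            current += child
    if current:
        chunks.append(current)
    return chunks
```

Note that a single child larger than the threshold still ends up as one oversized chunk, which is exactly where the "small children" assumption matters.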

However, Frankenstein proved that assumption false. There were two examples of children that were big: another <section> and <blockquote>. The <blockquote> examples are kind of fun, because Frankenstein is a story-within-a-story (-within-a-story...).

Today's solution was to "unwrap" <section>s. A <section> doesn't do anything in a fediverse post, so it seems safe to do that.

<blockquote>s do affect rendering, so I'm unwrapping them too, but re-wrapping each child in its own <blockquote>. So this:

<blockquote>
  <p>Three be the things I shall have till I die:</p>
  <p>Laughter and hope and a sock in the eye.</p>
</blockquote>

Becomes this:

<blockquote>
  <p>Three be the things I shall have till I die:</p>
</blockquote>
<blockquote>
  <p>Laughter and hope and a sock in the eye.</p>
</blockquote>

Semantically bad, structurally questionable, but renders fine.

Good enough for me.
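Both unwrapping steps could look something like this with Python's xml.etree.ElementTree. It is a sketch, not the bot's code: the flatten name is made up, tags are assumed to be un-namespaced (real epub XHTML would carry the XHTML namespace in each tag), and any text sitting directly inside a <section> or <blockquote> rather than in a child element is ignored.

```python
import xml.etree.ElementTree as ET


def flatten(parent):
    """Return parent's children with sections unwrapped and blockquotes split."""
    out = []
    for child in parent:
        if child.tag == "section":
            # A section means nothing in a post, so splice in its children.
            out.extend(flatten(child))
        elif child.tag == "blockquote":
            # Keep the quote styling by giving each child its own wrapper.
            for inner in child:
                wrapper = ET.Element("blockquote")
                wrapper.append(inner)
                out.append(wrapper)
        else:
            out.append(child)
    return out
```

Every element that comes back is then a candidate "small" child for the chunker, at the cost of the semantics noted above.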

I think today will just be the quick morning update to handle ebooks with multiple title entries by simply picking the first one. In the case in front of me,

["Frankenstein", "Or, the Modern Prometheus", "Frankenstein, or the Modern Prometheus"]

it will choose Frankenstein.

Meanwhile, O Pioneers! is proving to be a good read. It wasn't even on my list!

All my thoughts about cleaning up the code that powers @tomes@phantasmal.work got sidetracked by starting on Tumble Forth, which promises, "Starting from bare metal on the PC platform, we build a Forth from scratch..."

I came across it from a sequence of clicks that started from browsing the #DecemberAdventure hashtag and ended up derailing my morning.

Now here I am, looking at wikis and trying to write assembly.

  mov al, 0     ; clear lines
  mov cl, 0     ; starting at the left
  mov ch, 0     ; starting at the top
  mov dh, 7     ; until line 8 (which is enough)
  mov ah, 0x06  ; Scroll up window
  int 0x10      ; Interrupt

I wrote a few things in Uxn earlier this year, which primed me for getting captured by a tutorial for implementing Forth from scratch.

Yesterday was basically a day off. I looked a little bit at the current state of Scheme implementations... I don't know what I'm waiting for, but I think I'm waiting a little longer.

Today I wrote notes toward a new ebook chunking algorithm for @tomes.

The idea is that the splitting works from the whole instead of accumulating from the beginning, to balance the chunks better.

As an example, a series of chapters with the following character counts:

  1. 12,000
  2. 8,000
  3. 12,000

Currently, the accumulation strategy grows each chunk until hitting around 8,000 characters, and would result in chunks approximately like so:

  1. 8,000
  2. 8,000 (spanning the chapter 1-2 border)
  3. 8,000 (spanning the chapter 2-3 border)
  4. 8,000

Instead, look first for any chapters within the threshold, and make their chunks (nearly?) invincible. Then chunk the remainders "evenly":

  1. 6,000 (chapter 1, front half)
  2. 6,000 (chapter 1, back half)
  3. 8,000 (the entirety of chapter 2)
  4. 6,000 (chapter 3, front half)
  5. 6,000 (chapter 3, back half)
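The arithmetic behind that list, as a small Python sketch. It only computes target sizes per chapter and says nothing yet about where the cuts actually land. The function name is hypothetical:

```python
import math


def balanced_sizes(chapter_sizes, threshold):
    """Target chunk sizes: small chapters stay whole, large ones split evenly."""
    sizes = []
    for size in chapter_sizes:
        if size <= threshold:
            sizes.append(size)  # fits in one chunk, so leave it intact
            continue
        parts = math.ceil(size / threshold)
        base, extra = divmod(size, parts)
        # Spread any remainder over the first `extra` parts.
        sizes.extend(base + 1 if i < extra else base for i in range(parts))
    return sizes


print(balanced_sizes([12_000, 8_000, 12_000], 8_000))
# [6000, 6000, 8000, 6000, 6000]
```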

Additionally, apply some penalties to certain pieces of markup to try to avoid breaking up sections that would likely suffer.

For example, near the middle of a long chapter:

<p>
    .... long paragraph ...
</p>
<p>"Quick dialog," she said.</p>
<p>"Witty retort," he replied.</p>
<p>"Devastating comeback."</p>
<p>
    .... long paragraph ...
</p>

It'd be nice to penalize breaking a chunk between short paragraphs, which are likely to be dialog that is better kept together. It should also apply large penalties near the beginning and end of a chapter, to avoid cutting too close to natural seams.

Then find the lowest penalty cut points near the point where a naïve even split would go.
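One way the penalty search might be sketched in Python. Everything numeric here is invented for illustration: the window around the ideal cut, the "short paragraph" cutoff, the edge margin, and the penalty weights are all assumptions, as is the best_cut name.

```python
def best_cut(paragraph_lengths, ideal, window=1500, short=120, edge=1000):
    """Return the index of the paragraph to cut before, or None if none is in range.

    `ideal` is the character offset where a naive even split would fall.
    """
    total = sum(paragraph_lengths)
    candidates = []
    offset = 0
    for i in range(1, len(paragraph_lengths)):
        offset += paragraph_lengths[i - 1]  # boundary between paragraphs i-1 and i
        distance = abs(offset - ideal)
        if distance > window:
            continue  # too far from the naive split point to consider
        penalty = 0
        if paragraph_lengths[i - 1] < short and paragraph_lengths[i] < short:
            penalty += 10  # two short neighbours: probably mid-dialog
        if offset < edge or total - offset < edge:
            penalty += 50  # too close to the start or end of the chapter
        candidates.append((penalty, distance, i))
    # Lowest penalty wins; ties go to the boundary nearest the ideal offset.
    return min(candidates)[2] if candidates else None
```

For paragraph lengths [3000, 40, 40, 40, 3000] and an ideal offset of 3060, the boundaries inside the dialog are closest but penalized, so the cut lands just before the first short paragraph instead.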

Yesterday and today I started a Rust project where I'm re-implementing the ebook-splitting code and where I'll write the new algorithm I mentioned the other day.

Part of picking up Rust again is getting my bearings, especially around organizing code into modules. I think I finally have the rule I need to remember that always trips me up:

When a package has both binary and library code, they are different crates. The library crate (rooted at lib.rs) owns all the Rust source except the one file that makes the binary, main.rs. That's why in main.rs you use PACKAGE_NAME::... and everywhere else you use crate::.... The binary has to import the library. The library gets to talk about itself as the crate.