Page - 3

Looking at yesterday's script's output, I noticed a bug. It's unexpected behavior, at least. Some entries of the table of contents were being duplicated. I didn't see any entries duplicated in the epub/toc.ncx file, which I assumed was the sole source the table of contents was being built from. But there's also epub/toc.xhtml, which I finally noticed. That has an extra section for "landmarks", like "bodymatter", "loi" (list of illustrations), and "endnotes". The "bodymatter" landmark points to text/book-1.xhtml, one of the duplicated entries. Ah ha.

I didn't see a method for getting the table of contents without the landmarks, so as I build the list of pages in order, I'm skipping any entry with an item_type, which landmark items have and "regular" items don't.
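
In rough terms, the filtering looks something like the sketch below. It's hypothetical, not the actual script: it assumes each table-of-contents entry is an object with an optional item_type attribute (which landmark items carry) and an href pointing at the content file. The real field names in epub-utils may differ.

    # Hypothetical sketch of the landmark-skipping logic, not the real script.
    # Assumes each TOC entry has an optional `item_type` (set on landmark items
    # like "bodymatter", "loi", "endnotes") and an `href`; the actual attribute
    # names in epub-utils may differ.
    def pages_in_order(toc_entries):
        pages = []
        for entry in toc_entries:
            if getattr(entry, "item_type", None):
                continue  # landmark entry, e.g. "bodymatter" -> text/book-1.xhtml
            pages.append(entry.href)
        return pages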

That works for the epub file I'm starting with¹. Who knows if it'll be a reliable way to handle this.

  1. Return of the Native by Thomas Hardy, from Standard Ebooks

I didn't write any code yesterday. Heck, I didn't even know what language I was going to use. Today I had to pick something.

I decided to use Python, mostly because of uv. It's really smooth. Python environments have felt fragile in the past, but now I have confidence everything's going to work and stay working.

The other contenders were Go, Rust, and a Scheme. I don't really like Go, but I use it a lot at work, so it's comfortable. Rust I like, but I still don't feel like I could be as immediately productive in it as in Python. If I get stuck for things to do, I might rewrite parts in Rust for fun. Scheme? It's probably a good fit, and my Python code often looks like it was written by someone who spent a lot of time writing Lisp. Maybe I'll do some rewrites in Chicken, too.

But, given Python, the next step was looking for a good library for futzing with ePub files. After giving epub-utils a spin, I had my winner.

I wrote a little script to kick the tires: take an ePub file and spit out the paths to each xhtml file in order. Not a big thing, just traversing the table of contents tree.
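
For flavor, here's a rough standard-library sketch of that kind of traversal. It isn't the actual script (that one goes through epub-utils), and it hard-codes the epub/toc.xhtml path that Standard Ebooks happens to use:

    import sys
    import zipfile
    import xml.etree.ElementTree as ET

    XHTML = "{http://www.w3.org/1999/xhtml}"
    OPS = "{http://www.idpf.org/2007/ops}"

    def toc_paths(epub_path):
        # Read the nav document straight out of the zip, find the nav with
        # epub:type="toc", and return its hrefs in reading order.
        with zipfile.ZipFile(epub_path) as z:
            root = ET.fromstring(z.read("epub/toc.xhtml"))
        for nav in root.iter(f"{XHTML}nav"):
            if nav.get(f"{OPS}type") == "toc":
                return [a.get("href") for a in nav.iter(f"{XHTML}a")]
        return []

    if __name__ == "__main__":
        for href in toc_paths(sys.argv[1]):
            print(href)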

That's all I had time for in the morning.

Tomorrow I'll think about re-chunking the text from the xhtml files into the size I want. If a content file is too big, it should be split. If a content file is too small, it should be combined with a neighbor (and checked if that got too big). That should be a nice step.
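
None of that exists yet, but the shape of it is roughly the sketch below. The size limits are made-up placeholders, and splitting at paragraph breaks is just one option:

    # A sketch of the re-chunking idea, not written code. Sizes are in
    # characters and these limits are placeholders, not settled values.
    MIN_SIZE = 3_000
    MAX_SIZE = 8_000

    def rechunk(sections):
        # First pass: combine too-small sections with their previous neighbor.
        merged = []
        for text in sections:
            if merged and (len(text) < MIN_SIZE or len(merged[-1]) < MIN_SIZE):
                merged[-1] += "\n\n" + text
            else:
                merged.append(text)

        # Second pass: split anything that's still (or now) too big, preferring
        # a paragraph break near the limit and falling back to a hard cut.
        chunks = []
        for text in merged:
            while len(text) > MAX_SIZE:
                cut = text.rfind("\n\n", 0, MAX_SIZE)
                if cut <= 0:
                    cut = MAX_SIZE
                chunks.append(text[:cut])
                text = text[cut:].lstrip()
            chunks.append(text)
        return chunks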

I'm going on a December Adventure!

My plan is to recreate a version of DailyLit for myself. I used it a lot from 2007–2015 or so to read a bunch of books in the public domain via email: The Mayor of Casterbridge, Frankenstein, The Time Machine, some Sherlock Holmes, and others.

These days I'd rather do RSS than email, so I'll do that half of the service. And I'll be making it for just me, so I can ignore so many of the complications the DailyLit team would have had to deal with.

Might I be able to cobble this together very quickly? Yes. But I'm happy to pick something smallish. I'd rather spend December leisurely on polish and quality of life than struggling on something hard.

It occurs to me that finding good places to split a work into chunks is not something smoothly automatable. That is to say: it's Hard.

Today's progress

I have picked a terrible name: metamoRSS.

Thinking about where to get the texts from, I naturally thought of Project Gutenberg, which I've used very happily for similar projects. But since I'm going to be more carefully considering presentation, I wondered if I could source from Standard Ebooks. The answer is yes! They only provide various ebook formats, but what is an ebook but a zipped pile of xml and xhtml?
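
As in, you can already poke at one with nothing but the standard library (the file name here is just illustrative):

    import zipfile

    # An .epub is a zip archive; listing it shows the xml and xhtml inside.
    with zipfile.ZipFile("some-book.epub") as z:
        for name in z.namelist():
            print(name)  # mimetype, META-INF/container.xml, epub/..., ...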

This is as far as I get today: a regrettable name and a plan to chop up ebooks.