This morning I fixed the problem where I was skipping content from an ebook. I was filtering it out based on a faulty idea of what was valid content versus empty content! Oops.

I replaced prints with logging.
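
Nothing fancy, just the usual module-level setup. Something roughly like this, where the logger name and the messages are made up for the example:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("rechunker")  # stand-in name, not the real module

# Before: print("skipping empty section")
logger.info("skipping empty section")
# Chatty details can drop to debug instead of getting deleted:
logger.debug("chunk sizes so far: %s", [1200, 3400, 900])
```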

And I improved the chunking algorithm to honor chapter breaks when they appear within 30% of the size limit.
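
The shape of the heuristic is roughly this sketch, where the (text, is_chapter_break) blocks and the character-count sizing are stand-ins for whatever the real code tracks:

```python
# Rough sketch of the idea, not the actual implementation.
def chunk_blocks(blocks, limit=10_000, slack=0.3):
    chunks, current, size = [], [], 0
    for text, is_chapter_break in blocks:
        # If a chapter starts while the current chunk is already within
        # `slack` of the limit, cut early so the chapter opens a fresh chunk.
        if is_chapter_break and current and size >= limit * (1 - slack):
            chunks.append("".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
        if size >= limit:
            chunks.append("".join(current))
            current, size = [], 0
    if current:
        chunks.append("".join(current))
    return chunks
```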

I'm going to swap out the ebook I've been testing with, Return of the Native by Thomas Hardy, for something else. I started reading my copy of it, so I'll be too far ahead by the time this is ready to read via RSS. I have to find something else. Not a bad problem!

That's probably it for today. I'm headed to electronics recycling and the library. Then: relaxation.

Short (and late) day of reading ebook content and re-chunking it into the approximate size I want.

Except for the part where I'm accidentally skipping huge swaths of text. Gonna have to debug that tomorrow.

Today I got as far as being able to go from items in my flattened table of contents to getting the xhtml content. I had to go on a bit of a detour that I suspect I wouldn't have needed if I were feeling sharper today.

The items out of the table of contents have a target property, which is the path to their xhtml file inside the ePub. But to pull the content out, you need an id for the find_content_by_id method on the Document object, and the id isn't part of the table of contents items. I had to match each one to an entry in the manifest, which has both the path to the xhtml and the id, but the manifest doesn't have ordering.

It's just a little dance. It's fine.
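
The dance looks something like this. The attribute names on the manifest items (and the exact way the flattened table of contents and the manifest come out of the Document) are my shorthand for the example, not epub-utils' exact API:

```python
# Sketch only: `doc`, `toc_items`, and `manifest_items` are assumed inputs,
# and `.href` / `.id` / `.target` are guesses at the objects' shape.
def ordered_content(doc, toc_items, manifest_items):
    # Map manifest path -> id so TOC targets can be looked up by path.
    path_to_id = {item.href: item.id for item in manifest_items}
    for toc_item in toc_items:
        # TOC targets sometimes carry a #fragment; match on the bare path.
        path = toc_item.target.split("#", 1)[0]
        content_id = path_to_id.get(path)
        if content_id is None:
            continue  # not in the manifest; nothing to fetch
        yield doc.find_content_by_id(content_id)
```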

I also put a # type: ignore directive on a line where I just didn't want to bother figuring out how to please pyright this morning. (I use pyright.)

Tomorrow I'm betting I'll get to making fresh xhtml files, re-slicing the content into consistently sized files.

Looking at the output of yesterday's script, I noticed a bug. It's unexpected behavior, at least. Some entries of the table of contents were being duplicated. I didn't see any entries duplicated in the epub/toc.ncx file, which I assumed was the sole source the table of contents was being built from. But there's also epub/toc.xhtml, which I finally noticed. That has an extra section for "landmarks", like "bodymatter", "loi" (list of illustrations), and "endnotes". The "bodymatter" landmark points to text/book-1.xhtml, one of the duplicated entries. Ah ha.

I didn't see a method for getting the table of contents without the landmarks, so as I build the list of pages in order, I'm skipping any entry with an item_type, which landmark items have and "regular" items don't.
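
The filter itself is nothing much, roughly this, where toc_entries stands in for however the flattened table of contents actually arrives from epub-utils:

```python
# Illustrative filter; `toc_entries` and the attribute access are assumptions.
def pages_in_order(toc_entries):
    pages = []
    for entry in toc_entries:
        # Landmark entries ("bodymatter", "loi", "endnotes", ...) carry an
        # item_type; regular reading-order entries don't, so skip anything
        # that has one.
        if getattr(entry, "item_type", None):
            continue
        pages.append(entry)
    return pages
```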

That works for the epub file I'm starting with¹. Who knows if it'll be a reliable way to handle this.

  1. Return of the Native by Thomas Hardy, from Standard Ebooks

I didn't write any code yesterday. Heck, I didn't even know what language I was going to use. Today I had to pick something.

I decided to use Python, mostly because of uv. It's really smooth. Python environments have felt fragile in the past, but now I have confidence everything's going to work and stay working.

The other contenders were Go, Rust, and a Scheme. I don't really like Go, but I use it a lot at work, so it's comfortable. Rust I like, but I still don't feel like I could be as immediately productive in it as I am in Python. If I get stuck for things to do, I might rewrite parts in Rust for fun. Scheme? It's probably a good fit, and my Python code often looks like it was written by someone who spent a lot of time writing Lisp. Maybe I'll do some rewrites in Chicken, too.

But, given Python, the next step was looking for a good library for futzing with ePub files. After giving epub-utils a spin, I had my winner.

I wrote a little script to kick the tires: take an ePub file and spit out the paths to each xhtml file in order. Not a big thing, just traversing the table of contents tree.
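
It was something in the neighborhood of this sketch. The Document(path) loading and the .toc / .children attribute names are my approximation of epub-utils' shape rather than something copied from its docs; only the target property comes straight from what I was working with:

```python
import sys

from epub_utils import Document  # import path is a guess at the library's layout

def walk(item, depth=0):
    # Print this entry's xhtml path, then recurse into any nested entries.
    print("  " * depth + item.target)
    for child in getattr(item, "children", []):
        walk(child, depth + 1)

def main(path):
    doc = Document(path)
    for item in doc.toc:  # assumed: the table of contents tree lives on .toc
        walk(item)

if __name__ == "__main__":
    main(sys.argv[1])
```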

That's all I had time for in the morning.

Tomorrow I'll think about re-chunking the text from the xhtml files into the size I want. If a content file is too big, it should be split. If a content file is too small, it should be combined with a neighbor (and checked if that got too big). That should be a nice step.
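
In pseudocode-ish Python, the plan is something like this two-pass approach, with plain strings and character counts standing in for whatever content and size measure I end up using:

```python
# Sketch of the plan, not working code against real xhtml files.
def rechunk(pieces, target=10_000, minimum=3_000):
    # First pass: merge pieces that are too small into their next neighbor.
    merged, buffer = [], ""
    for piece in pieces:
        buffer += piece
        if len(buffer) >= minimum:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # a small tail chunk can survive; fine for now
    # Second pass: split anything too big (including merges that overshot)
    # into target-sized slices.
    out = []
    for piece in merged:
        for start in range(0, len(piece), target):
            out.append(piece[start:start + target])
    return out
```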

I'm going on a December Adventure!

My plan is to recreate a version of DailyLit for myself. I used it a lot from 2007–2015 or so to read a bunch of books in the public domain via email: The Mayor of Casterbridge, Frankenstein, The Time Machine, some Sherlock Holmes, and others.

These days I'd rather do RSS than email, so I'll do that half of the service. And I'll be making it for just me, so I can ignore so many of the complications the DailyLit team would have had to deal with.

Might I be able to cobble this together very quickly? Yes. But I'm happy to pick something smallish. I'd rather spend December leisurely on polish and quality of life than struggling on something hard.

It occurs to me that finding good places to split a work into chunks is not something smoothly automatable. That is to say: it's Hard.

Today's progress

I have picked a terrible name: metamoRSS.

Thinking about where to get the texts from, I naturally thought of Project Gutenberg, which I've used very happily for similar projects. But since I'm going to be more carefully considering presentation, I wondered if I could source from Standard Ebooks. The answer is yes! They only provide various ebook formats, but what is an ebook but a zipped pile of xml and xhtml?
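
You can see the pile for yourself with nothing but the standard library ("book.epub" being whatever file you downloaded):

```python
import zipfile

# List the xml-ish files inside an ePub; the filename is a placeholder.
with zipfile.ZipFile("book.epub") as z:
    for name in z.namelist():
        if name.endswith((".xhtml", ".xml", ".ncx", ".opf")):
            print(name)
```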

This is as far as I get today: a regrettable name and a plan to chop up ebooks.