Posts from 'december-adventure-2025' series

I'm going on a December Adventure!

My plan is to recreate a version of DailyLit for myself. I used it a lot from 2007–2015 or so to read a bunch of books in the public domain via email: The Mayor of Casterbridge, Frankenstein, The Time Machine, some Sherlock Holmes, and others.

These days I'd rather do RSS than email, so that's how I'll do that half of the service. And I'll be making it for just me, so I can ignore so many of the complications the DailyLit team would have had to deal with.

Might I be able to cobble this together very quickly? Yes. But I'm happy to pick something smallish. I'd rather spend December leisurely on polish and quality of life than struggling on something hard.

It occurs to me that finding good places to split a work into chunks is not something smoothly automatable. That is to say: it's Hard.

Today's progress

I have picked a terrible name: metamoRSS.

Thinking about where to get the texts from, I naturally thought of Project Gutenberg, which I've used very happily for similar projects. But since I'm going to be more carefully considering presentation, I wondered if I could source from Standard Ebooks. The answer is yes! They only provide various ebook formats, but what is an ebook but a zipped pile of xml and xhtml?

This is as far as I get today: a regrettable name and a plan to chop up ebooks.

I didn't write any code yesterday. Heck, I didn't even know what language I was going to use. Today I had to pick something.

I decided to use Python, mostly because of uv. It's really smooth. Python environments have felt fragile in the past, but now I have confidence everything's going to work and stay working.

The other contenders were Go, Rust, and a Scheme. I don't really like Go, but I use it a lot at work, so it's comfortable. Rust I like, but I still don't feel I'd be as immediately productive in it as in Python. If I get stuck for things to do I might rewrite parts in Rust for fun. Scheme? It's probably a good fit, and my Python code often looks like it was written by someone who spent a lot of time writing Lisp. Maybe I'll do some rewrites in Chicken, too.

But, given Python, the next step was looking for a good library for futzing with ePub files. After giving epub-utils a spin, I had my winner.

I wrote a little script to kick the tires: take an ePub file and spit out the paths to each xhtml file in order. Not a big thing, just traversing the table of contents tree.
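For illustration only (the real script leans on epub-utils, whose API I won't try to reproduce from memory), here's a stdlib-only sketch of the same tire-kick, assuming the epub/toc.ncx layout Standard Ebooks uses:

import sys
import zipfile
import xml.etree.ElementTree as ET

NCX = "{http://www.daisy.org/z3986/2005/ncx/}"

def toc_paths(epub_path):
    # Read the NCX table of contents straight out of the zip.
    with zipfile.ZipFile(epub_path) as zf:
        root = ET.fromstring(zf.read("epub/toc.ncx"))
    paths = []
    # navPoints come back in document order, which is reading order.
    for nav_point in root.iter(NCX + "navPoint"):
        src = nav_point.find(NCX + "content").get("src")
        path = src.split("#")[0]  # drop fragment identifiers
        if path not in paths:
            paths.append(path)
    return paths

if __name__ == "__main__":
    for path in toc_paths(sys.argv[1]):
        print(path)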

That's all I had time for in the morning.

Tomorrow I'll think about re-chunking the text from the xhtml files into the size I want. If a content file is too big, it should be split. If a content file is too small, it should be combined with a neighbor (and checked if that got too big). That should be a nice step.
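Something like this, sketched over plain text for simplicity (the real thing has to carve up xhtml, and the size limit here is arbitrary):

LIMIT = 6000  # characters per part; a made-up number for the sketch

def rechunk(sections, limit=LIMIT):
    parts = []
    for section in sections:
        if parts and len(parts[-1]) + len(section) <= limit:
            # Small neighbors get combined, as long as the result stays under the limit.
            parts[-1] = parts[-1] + "\n\n" + section
        elif len(section) <= limit:
            parts.append(section)
        else:
            # Oversized sections get split on paragraph boundaries.
            current = ""
            for para in section.split("\n\n"):
                if current and len(current) + len(para) > limit:
                    parts.append(current)
                    current = para
                else:
                    current = current + "\n\n" + para if current else para
            if current:
                parts.append(current)
    return parts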

Looking at yesterday's script's output, I noticed a bug. It's unexpected behavior, at least. Some entries of the table of contents were being duplicated. I didn't see any entries duplicated in the epub/toc.ncx file, which I had assumed was the sole source the table of contents was being built from. But there's also epub/toc.xhtml, which I finally noticed. That has an extra section for "landmarks", like "bodymatter", "loi" (list of illustrations), and "endnotes". The "bodymatter" landmark points to text/book-1.xhtml, one of the duplicated entries. Ah ha.

I didn't see a method for getting the table of contents without the landmarks, so as I build the list of pages in order, I'm skipping any entry with an item_type, which landmark items have and "regular" items don't.
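The filter itself is a one-liner; toc_items and the exact attribute shape are my stand-ins for how epub-utils presents things, as described above:

# Keep only "regular" entries: landmark entries carry an item_type, regular ones don't.
pages = [item for item in toc_items if not getattr(item, "item_type", None)]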

That works for the epub file I'm starting with.[1] Who knows if it'll be a reliable way to handle this.

  [1] Return of the Native by Thomas Hardy, from Standard Ebooks

Today I got as far as being able to go from the items in my flattened table of contents to their xhtml content. I had to take a bit of a detour that I suspect I wouldn't have needed if I were feeling sharper today.

Each item out of the table of contents has a target property, which is the path to its xhtml file inside the ePub. But to actually pull the content out, you need the id for the find_content_by_id method on the Document object, and the id isn't part of the table of contents items. So I had to match each one to an entry in the manifest, which has both the path to the xhtml and the id, but doesn't have ordering.

It's just a little dance. It's fine.
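For the record, the dance looks roughly like this. Only target and find_content_by_id come straight from the text above; doc.manifest, entry.href, and entry.id are stand-ins for whatever epub-utils actually calls those things:

# Build a path -> id lookup from the (unordered) manifest...
href_to_id = {entry.href: entry.id for entry in doc.manifest}

# ...then walk the flattened, landmark-free table of contents in order.
contents = []
for item in ordered_items:
    path = item.target.split("#")[0]  # strip any fragment from the target
    contents.append(doc.find_content_by_id(href_to_id[path]))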

I also put a # type: ignore directive on a line where I just didn't want to bother figuring out how to please pyright this morning. (I use pyright.)

Tomorrow I'm betting I'll get to making fresh xhtml files, re-slicing the content into consistently sized files.

Short (and late) day of reading ebook content and re-chunking it into the approximate size I want.

Except for the part where I'm accidentally skipping huge swaths of text. Gonna have to debug that tomorrow.

This morning I fixed the problem where I was skipping content from an ebook. I was filtering it out based on a faulty idea of what was valid content versus empty content! Oops.

I replaced prints with logging.
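Nothing fancy, just the usual swap (the data here is made up):

import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger(__name__)

parts = ["part-1.xhtml", "part-2.xhtml"]   # placeholder data
logger.info("wrote %d parts", len(parts))  # instead of print(f"wrote {len(parts)} parts")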

And I improved the chunking algorithm to honor chapter breaks when they appear within 30% of the size limit.
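One way to read that rule, as a tiny predicate the splitter can consult when it hits a chapter boundary:

TOLERANCE = 0.30  # "within 30% of the size limit"

def break_at_chapter(chunk_len, limit):
    # If the chunk being built is already within 30% of the limit, a chapter
    # break is a good enough place to cut, even though there's room left.
    return chunk_len >= limit * (1 - TOLERANCE)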

I'm going to swap out the ebook I've been testing with, Return of the Native by Thomas Hardy, for something else. I started reading my copy of it, so I'm now too far ahead of where this will be when it's ready to read via RSS. Not a bad problem!

That's probably it for today. I'm headed to electronics recycling and the library. Then: relaxation.

I've half changed my mind about building an RSS feed. The RSS reader I use, self-hosted DanB/RSS, truncates content, and I don't feel like migrating to another reader or updating DanB/RSS to selectively not truncate content. I could just put up pages and link to them, but I'd rather have the content where I am, not a click away.

So the natural choice is to make a bot on the Fediverse.

Today's quick work, then, was creating an account on my GoToSocial server, configuring it as a bot, and setting up an "application" for statuses to be posted through.

I did turn on the RSS feed for the bot, so if I ever get my RSS reader to behave the way I want, I could read there.

Next session I'll do some test posts.
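A test post goes through the usual Mastodon-style client API, which GoToSocial implements. Something like this, with the instance URL, token, and text as placeholders:

import requests

INSTANCE = "https://social.example"  # placeholder
TOKEN = "..."                        # access token from the application set up above

resp = requests.post(
    f"{INSTANCE}/api/v1/statuses",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={
        "status": "Test post from the book bot.",
        "spoiler_text": "A daily chunk of a book",  # placeholder; the real posts sit behind a spoiler/CW
        "visibility": "unlisted",                   # flipped to public later on
    },
)
resp.raise_for_status()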

A rough plan for what's to come:

  • Add a header of some sort to posts (like, "Title, by Author, part M of N")
  • Pick a book
  • Split it up
  • Schedule posts
  • Share and celebrate
  • Set a reminder close to when the book will finish to pick another

Today's goal was to post sections of a book to the GoToSocial bot I set up earlier in the adventure.

This morning was preparation: cleaning up the code that splits ebooks, cleaning up the directory structure where book parts get dumped, and storing a little json file alongside each book with the information needed when it's time to post (like the title, author, and number of parts).
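The file is tiny. Something along these lines, with the directory layout a guess and the values taken from the book announced below:

import json
from pathlib import Path

meta = {
    "title": "O Pioneers!",
    "author": "Willa Cather",
    "parts": 39,
}
out = Path("books/o-pioneers/meta.json")  # directory layout is a guess
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(meta, indent=2))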

In the evening, I made a bunch of test posts. I found some inconsistencies after checking on the posts in Elk (the web frontend I use most), Pinafore (a popular web frontend), and Tusky (the Android app I use). They all treated whitespace a little differently.

For example:

<p>
  This ebook is the product of
  <a href="https://standardebooks.org/">
    Standard Ebooks
  </a>
</p>

One displayed it as you would expect. One treated the newline between "of" and "Standard" as meaningful and put the two words on different lines. And the other collapsed all the whitespace between "of" and "Standard" down to "ofStandard".

Bizarre.

It was an easy enough fix. I just stopped pretty printing the html. Now they all display things consistently.
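To illustrate with the standard library (the real code may well use a different xml/html library), the difference is just whether an indenting pass runs before serializing:

import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p>This ebook is the product of '
    '<a href="https://standardebooks.org/">Standard Ebooks</a></p>'
)

# ET.indent(p)  # the pretty-printing step that introduced the stray whitespace
print(ET.tostring(p, encoding="unicode"))
# <p>This ebook is the product of <a href="https://standardebooks.org/">Standard Ebooks</a></p>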

So, test posts are done. Tomorrow I'll try deploying the code to my little server, set up a timer to post on a schedule, and let people know they can read O Pioneers! by Willa Cather with me over the course of 39 days.

Just a quick morning of:

  1. Seeing that the bot posted at 8am just as I expected.
  2. Validating the text. There was a jarring jump in the middle of a paragraph that made me worry I had accidentally dropped a bunch of text. But it was all good. Willa Cather made a choice!
  3. Seeing that the RSS feed wasn't working and trying to look into it, but making no progress.
  4. Deleting the post and re-running the posting script after changing the visibility of posts to public. (I think that since I put the text behind a spoiler/CW and it only posts once a day, I'm not violating good bot practice.)

I'm excited. Reading a book slowly via fediverse posts is gonna work!