This morning I fixed the problem where I was skipping content from an
ebook. I was filtering it out based on a faulty idea of what was valid
content versus empty content! Oops.
I replaced prints with logging.
And I improved the chunking algorithm to honor chapter breaks when
they appear within 30% of the size limit.
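Roughly, the rule looks like this (a sketch of the idea, not the actual code, and it reads "within 30% of the size limit" as "in the last 30% before the cap"):

```python
# Sketch of the chapter-break rule, not the real implementation.
# `limit` is the target chunk size in characters; `chapter_breaks` are
# offsets into `text` where chapters start.
def split_point(text: str, chapter_breaks: list[int], limit: int) -> int:
    """Pick where to cut the next chunk out of `text`."""
    if len(text) <= limit:
        return len(text)
    # Prefer a chapter break in the last 30% of the limit, i.e. anywhere
    # past 70% of the way to the cap but not over it.
    candidates = [b for b in chapter_breaks if 0.7 * limit <= b <= limit]
    if candidates:
        return max(candidates)  # cut at the latest qualifying break
    return limit  # otherwise just cut at the size limit
```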
I'm going to swap out the ebook I've been testing with, Return of
the Native by Thomas Hardy, for something else. I started
reading my copy of it, so by the time this is ready to read via RSS
I'll be too far ahead, and I have to find something else. Not a bad
problem!
That's probably it for today. I'm headed to electronics recycling and
the library. Then: relaxation.
Today I got as far as being able to go from items in my flattened
table of contents to the xhtml content. I had to take a bit of a
detour that I suspect I wouldn't have needed if I were feeling sharper
today.
The items out of the table of contents have a target property, which
is the path to its xhtml file inside the ePub. But in order to pull it
out, you need the id for the find_content_by_id method on the
Document object. The id isn't part of the table of contents
items. I had to match each one to an entry in the manifest, which has
both the path to the xhtml and id, but the manifest doesn't have
ordering.
It's just a little dance. It's fine.
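In code, the dance is something like this. find_content_by_id is the method from epub-utils mentioned above; the target, path, and id attribute names on the toc and manifest items are my shorthand here, not necessarily the library's exact spelling.

```python
# Sketch of the toc-to-manifest matching described above. find_content_by_id
# is the epub-utils Document method named in the prose; the .target, .path,
# and .id attribute names are stand-ins, not the library's documented API.
def xhtml_in_order(doc, toc_items, manifest_items):
    """Yield the xhtml content for each flattened table-of-contents entry."""
    # Map each manifest path to its id so toc targets can be looked up.
    id_by_path = {m.path: m.id for m in manifest_items}
    for item in toc_items:
        item_id = id_by_path[item.target]      # the toc gives the path...
        yield doc.find_content_by_id(item_id)  # ...the manifest gives the id
```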
I also put a # type: ignore directive on a line where I just didn't
want to bother figuring out how to please pyright this morning. (I use
pyright.)
Tomorrow I'm betting I'll get to making fresh xhtml files, re-slicing
the content into consistently sized files.
Looking at yesterday's script's output, I noticed a bug. It's
unexpected behavior, at least. Some entries of the table of contents
were being duplicated. I didn't see any entries duplicated in the
epub/toc.ncx file, which I assumed was the sole source the table of
contents was being built from. But there's also epub/toc.xhtml,
which I finally noticed. That has an extra section for "landmarks",
like "bodymatter", "loi" (list of illustrations), and "endnotes". The
"bodymatter" landmark points to text/book-1.xhmtl, one of the
duplicated entries. Ah ha.
I didn't see a method for getting the table of contents without the
landmarks, so as I build the list of pages in order, I'm skipping any
entry with an item_type, which landmark items have and "regular"
items don't.
That works for the epub file I'm starting with. Who knows if it'll be
a reliable way to handle this.
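The filter itself is about as small as it sounds (a sketch; item_type is the attribute from above, and flattened_toc is just a stand-in name for the ordered entries):

```python
# Sketch of the landmark filter: keep entries in order, drop anything that
# carries an item_type. Treating a missing or empty item_type as "regular"
# is the assumption this particular epub happens to satisfy.
pages = [
    entry
    for entry in flattened_toc                # flattened toc entries, in order
    if not getattr(entry, "item_type", None)  # landmark items have item_type set
]
```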
I didn't write any code yesterday. Heck, I didn't even know what
language I was going to use. Today I had to pick something.
I decided to use Python, mostly because of uv. It's really
smooth. Python environments have felt fragile in the past, but now I
have confidence everything's going to work and stay working.
The other contenders were Go, Rust, and a Scheme. I don't really like
Go, but I use it at work a lot, so it's comfortable. Rust I like, but
still don't feel like I could be as immediately productive in it as in
Python. If I get stuck on things to do, I might rewrite parts in Rust
for fun. Scheme? It's probably a good fit, and my Python code often
looks like it was written by someone who spent a lot of time in Lisp. Maybe I'll do
some rewrites in Chicken, too.
But, given Python, the next step was looking for a good library for
futzing with ePub files. After giving epub-utils a
spin, I had my winner.
I wrote a little script to kick the tires: take an ePub file and spit
out the paths to each xhtml file in order. Not a big thing, just
traversing the table of contents tree.
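The script was on the order of this sketch. Document is the epub-utils object; the import path and the .toc, .children, and .target attribute names are my shorthand here and may not match the library exactly.

```python
# Sketch of the tire-kicking script: print each xhtml path in reading order
# by walking the table-of-contents tree. The epub_utils import path and the
# .toc / .children / .target attribute names are assumptions.
import sys

from epub_utils import Document


def walk(items):
    """Yield targets depth-first, in reading order."""
    for item in items:
        yield item.target               # path to the xhtml inside the ePub
        yield from walk(item.children)  # recurse into any nested entries


if __name__ == "__main__":
    doc = Document(sys.argv[1])
    for path in walk(doc.toc):
        print(path)
```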
That's all I had time for in the morning.
Tomorrow I'll think about re-chunking the text from the xhtml files
into the size I want. If a content file is too big, it should be
split. If a content file is too small, it should be combined with a
neighbor (and checked if that got too big). That should be a nice
step.
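The rough shape of the plan, working on sizes only (the limit and minimum here are placeholders, not decisions):

```python
# Rough sketch of the re-chunking plan, operating on plain strings.
# LIMIT and MIN_SIZE are placeholder thresholds, not settled values.
LIMIT = 4000
MIN_SIZE = 1000


def rechunk(chunks: list[str]) -> list[str]:
    out: list[str] = []
    for chunk in chunks:
        # Too big: split into limit-sized pieces.
        while len(chunk) > LIMIT:
            out.append(chunk[:LIMIT])
            chunk = chunk[LIMIT:]
        # Too small: fold into the previous chunk, but only if the
        # combination stays under the limit.
        if out and len(chunk) < MIN_SIZE and len(out[-1]) + len(chunk) <= LIMIT:
            out[-1] += chunk
        else:
            out.append(chunk)
    return out
```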
My plan is to recreate a version of DailyLit for myself. I
used it a lot from 2007–2015 or so to read a bunch of books in the
public domain via email: The Mayor of Casterbridge, Frankenstein, The
Time Machine, some Sherlock Holmes, and others.
These days I'd rather do RSS than email, so I'll do that half of the
service. And I'll be making it for just me, so I can ignore so many of
the complications the DailyLit team would have had to deal with.
Might I be able to cobble this together very quickly? Yes. But I'm
happy to pick something smallish. I'd rather spend December leisurely
on polish and quality of life than struggling on something hard.
It occurs to me that finding good places to split a work into chunks
is not something smoothly automatable. That is to say: it's Hard.
Today's progress
I have picked a terrible name: metamoRSS.
Thinking about where to get the texts from, I naturally thought of
Project Gutenberg, which I've used very happily for
similar projects. But since I'm going to be more carefully considering
presentation, I wondered if I could source from Standard
Ebooks. The answer is yes! They only provide various
ebook formats, but what is an ebook but a zipped pile of xml and
xhtml?
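Cracking one open to check takes nothing but the standard library (the filename below is a placeholder for whichever download you grab):

```python
# An ePub really is just a zip archive: list the xml/xhtml files inside one.
# The filename is a placeholder for whatever Standard Ebooks file you download.
import zipfile

with zipfile.ZipFile("some-standard-ebooks-title.epub") as epub:
    for name in epub.namelist():
        if name.endswith((".xhtml", ".xml", ".ncx", ".opf")):
            print(name)
```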
This is as far as I get today: a regrettable name and a plan to chop up
ebooks.