Converting Feeds (RSS/Atom/etc.) to Maildir
For many years I’ve been using Emacs to read my emails, and I use the same setup to read news as well (RSS, Atom, etc.). For all this time, I’ve struggled to get a nice workflow, something which satisfies the following:
- Reliable: I don’t want to be restarting or recovering broken/stuck processes, or losing/corrupting messages.
- Works offline: I shouldn’t require an Internet connection to read my messages.
- Native UI: I don’t want to be stuck inside a Web browser.
- Fast: I shouldn’t have to wait for mail to download/open, or for programs to start/stop/refresh.
- Keyboard driven: I don’t want to use a pointing device.
- Arbitrary extensibility (optional): This isn’t necessary if the tool behaves exactly as I want it to; since that’s unlikely, a scripting language or similar would be nice.
I haven’t found a setup which satisfies all of these nicely; my previous setup was almost there, and today I finally bit the bullet and implemented the missing pieces. So far, I’m quite happy with the result!
tl;dr I’ve made my own version of feed2maildir, which converts RSS feeds into maildir messages, without requiring any extra state (e.g. config files or “last checked” databases) and avoiding duplicates.
Previous Setup
Mail Reader
I used to use Gnus, but more recently I switched to mu4e since it’s much faster. Gnus implements most of its functionality in Emacs Lisp, including downloading mail/posts, searching/organising messages, etc. This means the Emacs process itself does all of the work, which can be quite slow, since Emacs Lisp is not a fast language. Also, since Emacs doesn’t support any concurrency mechanisms, any time we have to wait for something (either a calculation, or just waiting for a server to send us a response), Emacs cannot do anything else: the UI will freeze, and we can’t perform any further interactions until that task has finished.
To work around this, I used to run two Emacs processes: one for all of my programming, commandlines, etc. and another just for Gnus. Since Gnus is pretty slow, I would keep it running all the time, and manually refresh the messages before starting to read them.
On the other hand, mu4e doesn’t do much in Emacs Lisp at all. Instead, most of the work is performed by a commandline program, mu; mu4e just calls out to mu as necessary and displays the results. This means I can use a single Emacs process for everything, starting and stopping mu4e however I like.
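For example, mu’s search can be invoked straight from a shell; an illustrative query (not from my actual workflow) looks like:

mu find from:alice subject:maildir

which prints the matching messages, one per line.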
Fetching Mail
One way to speed up Gnus is to avoid going online: rather than having Gnus log into IMAP servers or download RSS feeds, it’s quicker to offload that into a background process, storing messages on disk in a format like maildir. This has the added advantages that it works offline, and gives me an up-to-date backup of my email archive. Coincidentally, mu operates on maildir archives, which made the transition from Gnus completely painless!
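For anyone unfamiliar with the format: a maildir is just a directory (the path below is illustrative) containing three subdirectories, with one file per message; writers deliver into tmp and rename into new, which makes delivery atomic:

~/Mail/example/
├── tmp/  # messages in the process of being delivered
├── new/  # delivered messages, not yet seen by the reader
└── cur/  # messages the reader has already seen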
For this, I’ve found the mbsync command provided by isync works quite nicely; this can be run repeatedly using e.g. cron or systemd (I’m currently using the latter, due to the tight integration in NixOS).

One thing to keep in mind is that the mbsync process can occasionally hang, e.g. if it gets confused by network connectivity during a resume/suspend cycle. Hence it’s wise to always run it under timeout; a long timeout (e.g. an hour) is fine, as long as hung processes eventually get killed.
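For example (using mbsync’s -a flag to sync all channels; GNU timeout accepts suffixes like 1h):

timeout 1h mbsync -a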
I’m currently waiting 10 minutes between checks of my inboxen, and an hour between full downloads of all IMAP folders. These timers start after the previous check has finished, which prevents multiple instances from clobbering each other.
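A minimal systemd sketch of that arrangement might look like the following (the unit names and binary paths are hypothetical; only the options shown are doing the work):

# mbsync-inbox.service
[Service]
Type=oneshot
ExecStart=/usr/bin/timeout 1h /usr/bin/mbsync -a

# mbsync-inbox.timer
[Timer]
# Fire once shortly after boot, then re-fire relative to when the
# service last *finished*; runs can therefore never overlap.
OnBootSec=2min
OnUnitInactiveSec=10min

[Install]
WantedBy=timers.target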
Fetching News
I read “news”, in various forms, from many places. As with my mail, I want everything to be in maildir format, but this requires converting from formats like RSS and Atom.
There are several programs for doing this, e.g. feed2maildir, universal aggregator, imm and rssdrop. Unfortunately, none quite fitted my use case; in particular, most are concerned with periodically downloading feeds and checking against a database of known entries, last-checked times, etc., which I don’t need or want. All I want is a robust RSS-to-maildir converter which doesn’t create duplicates; in fact, universal aggregator seems to offer tools for doing this, but I couldn’t figure out how to build software written in Go :(
For a while I struggled along with imm, since it’s written in Haskell and therefore easy for me to hack on. Some problems I encountered were:
- Taking a list of feeds from a config file, rather than when invoked. Thankfully the config file allows arbitrary Haskell to be executed, so I had it build the list dynamically by reading whatever was in the RSS cache.
- It doesn’t allow accessing local files. To work around this, I would start a local Web server in the RSS cache directory before invoking imm, have imm download feeds from localhost, then kill the server. This worked, but was a pretty ridiculous situation to be in!
- imm is very fragile. In particular, it bails out when accessing HTTPS, or when certain punctuation appears in certain elements (e.g. titles), etc. To work around this, I’d download HTTPS feeds to the RSS cache and serve them locally. I’d also pass all feeds through a script for stripping out punctuation, etc.
- Despite maintaining a database on disk, imm often outputs duplicate entries. It also mangles some fields, e.g. it doesn’t seem to handle spaces in the author field, causing many messages to have uninformative From fields like The.
Whilst some of these workarounds are pretty silly, it was the last bullet point that finally made me decide enough is enough: I wouldn’t keep putting up with seeing the same posts again and again.
New Setup
My requirements were actually pretty straightforward: parse RSS data, generate maildir entries, but skip any which are already present. This is a subset of what these existing programs already do! The problem is, they don’t expose that functionality:
- They parse RSS data, but only from URLs.
- They can fetch data from arbitrary places, but only when they’re written in some config file.
- They perform redundancy checks, but against a database kept separate from the actual maildir entries.
Since these are all Free Software, I decided to pick one which was closest to doing what I wanted, and fork the project into my own tool.
I chose feed2maildir, since that seemed to be more robust at handling arbitrary input than imm (presumably thanks to its use of feedparser). It also doesn’t have a needlessly complicated approach to configuration (allowing arbitrary Haskell as config is great; but why compile it into the program, when you could just eval it or read the stdout of a runhaskell process instead?).
The first change I made to the feed2maildir code was to remove the requirement for a configuration file. Since we’re calling feed2maildir from a script (to handle scheduling, etc.), we can use that script to read/generate the required feed details, in any way we like, and pass them on to the process via commandline arguments. This meant the addition of a -n option for specifying a feed’s name, which is used as the From field of the resulting messages.
This takes us from a configuration file like:
{"Feed1": "http://example.com/feed.rss",
"Useful News": "http://example.org/feed.rss",
"Gossip": "http://example.net/feed.rss",
...
}
and an invocation like:
feed2maildir -s -m ~/Mail/feeds
to no configuration file, and an invocation like:
feed2maildir -s -m ~/Mail/feeds -n "Feed1" "http://example.com/feed.rss" \
-n "Useful News" "http://example.org/feed.rss" \
-n "Gossip" "http://example.net/feed.rss" \
...
Next, I threw away the ability to process multiple feeds at once. This functionality can be implemented with a loop, so there’s no need for the tool to include it. This takes us to an invocation like:
while read -r LINE
do
NAME=$(echo "$LINE" | cut -f1)
URL=$(echo "$LINE" | cut -f2)
feed2maildir -s -m ~/Mail/feeds/"$NAME" -n "$NAME" "$URL"
done < ~/.feeds
This uses a configuration file, but it has nothing to do with the tool; it’s just a convenience for our script. It also allows each feed to have a different maildir path.
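For reference, the ~/.feeds file here is just tab-separated name/URL pairs (tab being cut’s default delimiter), along these lines (entries taken from the earlier example):

Feed1	http://example.com/feed.rss
Useful News	http://example.org/feed.rss
Gossip	http://example.net/feed.rss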
Next, I ripped out all of the feed fetching code. If we’re calling the tool once per feed, we might as well send the data in on standard input, since that lets us fetch the data any way we like: from disk, from the Web (e.g. via wget or curl), programmatically, or from any other source we can imagine.
This gives us an invocation like:
while read -r LINE
do
NAME=$(echo "$LINE" | cut -f1)
URL=$(echo "$LINE" | cut -f2)
curl "$URL" | feed2maildir -s -m ~/Mail/feeds/"$NAME" -n "$NAME"
done < ~/.feeds
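This also means feeds already sitting on disk can be converted directly, with no Web server involved (unlike the imm workaround above). For example, with a hypothetical cached feed:

feed2maildir -s -m ~/Mail/feeds/Cached -n "Cached" < ~/rss-cache/feed.rss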
The next issue is preventing duplicates. feed2maildir attempts this using a separate database, which stores the last-checked time for each configured feed: when checking for updates, we update the time in the database; when we read the feed, we ignore any posts from earlier than the last-checked time.
Storing a separate database seems wasteful to me; it can also fall out of sync with the maildir, and last-checked times may bear no relation to whether we’ve seen a post before. For example, if a film is available on iPlayer then its release date is used as the post date. In this case, feed2maildir would only acknowledge films released since the iPlayer feed was last checked, which is very unlikely to match anything.
Instead, I ripped out the database in favour of reading the maildir directory itself. Each message is given an extra header field, which I’ve called X-feed2maildirsimple-hash. This contains the SHA256 of various identifying fields, like the feed name, author, ID tag and title. When we process a feed, we begin by reading all of the existing hash information from the given maildir, then we skip any posts with matching hashes. This mechanism is extensible: each field’s hash is stored separately, a field is only included if it’s present in that post, and we only skip posts where the fields present on both have matching hashes; this means we can add new fields without breaking the detection of old messages which lack those fields.
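To make that concrete, here’s a rough Python sketch of the duplicate check; the header layout (space-separated field=hash pairs) and all of the names are my own illustration, not necessarily what the real code does:

import hashlib
import mailbox

FIELDS = ("feed", "author", "id", "title")

def field_hashes(post):
    # Hash each identifying field the post actually has
    return {f: hashlib.sha256(post[f].encode("utf-8")).hexdigest()
            for f in FIELDS if post.get(f)}

def parse_header(value):
    # Assumed layout: "field=hash field=hash ..."
    return dict(pair.split("=", 1) for pair in value.split())

def already_seen(post, maildir_path):
    new = field_hashes(post)
    for msg in mailbox.Maildir(maildir_path):
        header = msg.get("X-feed2maildirsimple-hash")
        if not header:
            continue
        stored = parse_header(header)
        shared = new.keys() & stored.keys()
        # Skip only when every field present on *both* messages
        # matches; fields missing on either side are ignored, so new
        # fields can be added without breaking detection of old posts.
        if shared and all(new[f] == stored[f] for f in shared):
            return True
    return False

With a check like this, re-running the converter over the same feed is idempotent: already-delivered posts are skipped rather than duplicated.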
This is working for me in place of the complicated imm-based setup I used to use, and as always I’ve released all code as Free Software, in case it’s useful to anyone.