Simple CouchDB: Feed Caching
February 25, 2009
For those of you who haven’t heard of CouchDB, it is a document-oriented database system. Unlike MySQL or PostgreSQL, it aims to solve certain problems differently, and often more elegantly, than its RDBMS counterparts. I won’t go into any more detail than that.
While designing and mocking up a site that consumes RSS/Atom feeds, I wanted a very simple solution for deployment that didn’t involve caching in the same way I had been caching some other data objects. I didn’t want a user’s request to trigger a cache replacement, for a few reasons that I’ll touch on below. First, a few solutions you might come up with:
Just pull the feed when the page gets hit.
This is not really an appropriate solution unless the feed you are consuming is inside your organization or something to that effect. I think this realization is obvious to most people, but it’s certainly the easiest way of doing things, so I can’t help but wonder how many sites try this.
Pull the feed once and cache it for some length of time.
This works better than the first solution. I think it’s passable, but it still suffers from some weaknesses that are dealbreakers for many people. First of all, what if the target site goes down? Aside from the usual error-handling code, you’ll need to make sure you don’t stick failed results in your cache. One nasty issue that might bite you here: if the cache TTL expires while the target site is down, the first cache miss triggers a fetch that fails, and suddenly all of your other visitors are going to miss against the cache as well. Depending on your backend, you might not even be able to pull stale results, because the expired TTL has already invalidated them.
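To make those two pitfalls concrete, here is a minimal sketch of a TTL cache that never stores a failed fetch and falls back to the stale copy when the remote site is down. It is not the code from my site; the `FeedCache` class, the injected `fetch` callable, and the in-memory dictionary are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FeedCache:
    """TTL cache that keeps stale entries and never caches a failed fetch."""

    def __init__(
        self,
        fetch: Callable[[str], str],  # hypothetical fetcher; raises on failure
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        # url -> (time stored, feed body)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        entry = self._entries.get(url)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit, no network access
        try:
            body = self._fetch(url)
        except Exception:
            # Remote is down: leave the old entry in place and serve it
            # stale, or return None if we never had a copy.
            return entry[1] if entry is not None else None
        self._entries[url] = (now, body)
        return body
```

Note what this does not fix: while the remote site is down, every request after the TTL expires still attempts a fetch, so each visitor pays for the failed request. That is one reason I didn’t want page hits driving the refresh at all.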
Pull the feed periodically and fetch it from your local store.
This is the situation for which I’m employing CouchDB.
What’d I do?
- Wrote a short script that can be run from cron to fetch the feeds I want. This ended up being 60 lines with comments and error handling, without any CouchDB libraries.
- Used _show to set up a “view” whereby my client could simply request the feed from CouchDB in exactly the same way as it would request it from the actual URI. Not even simple JSON decoding is needed by the client; just point your feed-consuming library at the CouchDB URI. I can’t stress enough how cool this is.
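On the client side, the only work is building the URI to hand to the feed library. The sketch below assumes the show-function path layout documented for CouchDB 1.x through 3.x (`/db/_design/ddoc/_show/func/docid`); the release I used may lay the path out differently, and the host, database, design document, and function names here are all hypothetical.

```python
import urllib.parse


def show_url(couch: str, db: str, ddoc: str, func: str, doc_id: str) -> str:
    """Build the URI of a show function applied to one document.

    Every path segment is percent-encoded, so a document id that
    contains slashes (such as a feed URI) stays a single segment.
    """
    def quote(part: str) -> str:
        return urllib.parse.quote(part, safe="")

    return "%s/%s/_design/%s/_show/%s/%s" % (
        couch.rstrip("/"),
        quote(db),
        quote(ddoc),
        quote(func),
        quote(doc_id),
    )


# Hypothetical names: a "feeds" database with a "raw" show function.
url = show_url("http://localhost:5984", "feeds", "feeds", "raw",
               "http://example.com/rss")
```

The resulting string is what you would pass to your feed-consuming library in place of the remote feed’s URI.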
Because URIs for feeds are unique, each of my feeds could even use its URI as the id in the database! Perfect. When a remote feed goes down, the cron fetcher skips updating that feed and the feed simply doesn’t get refreshed for the length of the downtime. No extra error handling needed beyond the usual. The implementation ends up being dead simple, fast, and (because of CouchDB) scalable if you need it.
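For the curious, the fetcher can be sketched roughly as follows. This is a reconstruction under assumptions rather than my actual script: the database location, the `body` field name, and the function names are made up for illustration. It uses only the standard library and CouchDB’s plain HTTP document API (GET to read the current `_rev`, PUT to write).

```python
import json
import urllib.error
import urllib.parse
import urllib.request

COUCH = "http://localhost:5984/feeds"  # assumed CouchDB host and database


def doc_url(feed_url: str, couch: str = COUCH) -> str:
    """Document URL that uses the feed URI (percent-encoded) as the id."""
    return couch + "/" + urllib.parse.quote(feed_url, safe="")


def fetch_feed(feed_url: str, timeout: float = 10.0) -> str:
    """Download the raw feed; raises OSError if the site is unreachable."""
    with urllib.request.urlopen(feed_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def store_feed(feed_url: str, body: str) -> None:
    """Create or update the document holding this feed."""
    url = doc_url(feed_url)
    doc = {"body": body}  # "body" is an assumed field name
    try:
        # An update must carry the current revision of the document.
        with urllib.request.urlopen(url) as resp:
            doc["_rev"] = json.load(resp)["_rev"]
    except urllib.error.HTTPError as err:
        if err.code != 404:  # 404 just means this is the first fetch
            raise
    req = urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()


def refresh(feed_urls, fetch=fetch_feed, store=store_feed):
    """Refresh every feed; a feed that fails to download is skipped,
    so the copy already in the database is left untouched."""
    skipped = []
    for feed_url in feed_urls:
        try:
            body = fetch(feed_url)
        except (OSError, ValueError):
            skipped.append(feed_url)
            continue
        store(feed_url, body)
    return skipped
```

A small script that calls `refresh([...])` with your list of feed URIs can then be scheduled from cron. The `fetch` and `store` parameters exist only so the skip-on-failure logic can be exercised without a network or a running CouchDB.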