Friday, April 15, 2011

RSS Aggregation

For the longest time I have wanted to take advantage of the RSS/Atom feeds of my favorite podcasts/videocasts, but I always found the aggregators either lacking in features, because they were really meant to grab news headlines, or saddled with an overly complicated and uncustomizable configuration. An example of this is the Miro application, which not only requires a significant amount of resources, but contains features I don't want and lacks features I do want. For example, none of Icepodder, GPodder, Miro, or iTunes will allow me to specify a download location on a per-feed basis, none of them will allow me to specify a time interval for checking, and all of them pick some complicated XML-based storage mechanism to remember their state between runs. Let's face it, none of the named applications are ones that you really want to run in the background.

This has long been something I've wanted to script out, but conventional shell languages don't have the primitives to do the job on their own. What I was really looking for in an RSS podcast aggregator was something that was:


  • Easy to understand, in only a few lines of code
  • Reliant on simple logic based on fair assumptions
  • Not going to break when files it thought it had downloaded were missing
  • Easy to configure
  • Able to email me when a new podcast was downloaded

What I came up with was a very simple snippet of Python which relies on the sqlite and feedparser libraries. With a few caveats, this program, feedread.py, downloads the latest published episode of each feed, assuming it hasn't already been fetched. For one, it relies on specific RSS tags being present in each entry. It also expects the date field to match a specific format string (currently GMT 24-hour time, formatted to fit revision3's feeds). The code is relatively simple and can easily be expanded to handle more conditions.
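To make the "has this already been fetched?" logic concrete, here is a minimal sketch of the idea, not the actual feedread.py. The table layout (a history table with title and lastfetch columns), the function names, and the DATE_FORMAT string are all assumptions made for illustration; the real format string has to match whatever the feed actually publishes.

```python
import sqlite3
from datetime import datetime

# Assumed date format: GMT, 24-hour time, e.g. "Fri, 15 Apr 2011 18:30:00 GMT".
# The real script's format string may differ.
DATE_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"

# Hypothetical schema: one row per feed, keyed by the feed title.
SCHEMA = "CREATE TABLE IF NOT EXISTS history (title TEXT PRIMARY KEY, lastfetch TEXT)"


def parse_published(text):
    """Turn a feed entry's date string into a datetime for comparison."""
    return datetime.strptime(text, DATE_FORMAT)


def is_new(conn, feed_title, published):
    """Return True if this entry is newer than the last one recorded for the feed."""
    row = conn.execute(
        "SELECT lastfetch FROM history WHERE title = ?", (feed_title,)
    ).fetchone()
    if row is None:
        return True  # feed never seen before, so anything is new
    return parse_published(published) > parse_published(row[0])


def record_fetch(conn, feed_title, published):
    """Remember the timestamp of the entry that was just downloaded."""
    conn.execute(
        "INSERT OR REPLACE INTO history (title, lastfetch) VALUES (?, ?)",
        (feed_title, published),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # the real thing would open history.db
    conn.execute(SCHEMA)
    stamp = "Fri, 15 Apr 2011 18:30:00 GMT"
    if is_new(conn, "Some Show", stamp):
        record_fetch(conn, "Some Show", stamp)
        print("New episode of Some Show downloaded")
```

Because the decision depends only on the stored timestamp, a deleted or moved video file never confuses it, which is the behavior asked for in the list above.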

While this program could be made more general, it currently suits my needs and eliminates the need for me to use an application such as Miro or iTunes. No longer do I have to check each show's page on revision3, copy the link, and download the episode. Instead, the programming is now automatically downloaded with similar convenience to a DVR. Cron runs the application, emails me a notification if a new episode is downloaded, and mythvideo is able to open the appropriate path on the NFS share and stream the video just fine.


How to Use

Usage is pretty straightforward. The config file (~/.feedread/feedlist), which is sensitive to newlines, is read in pairs of lines: the first line of each pair is the RSS link, and the second line is the download path where that feed's downloads are stored. The filename of each download is pulled straight from revision3's web servers instead of being some weird crazy hash. Once you put the blank sqlite db uploaded here in place, the Python script can automatically find new podcast entries and enter their lastfetch timestamp and feed title into the defined schema. Then, as your user, run the script from a cronjob (and be sure to have your mail transfer agent configured properly so that cron knows how to email you, via a system mail alias, at your real-world address).
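As an illustration of the feedlist layout (feed URL on one line, download path on the next), here is a small sketch of how such a file could be read. The function name parse_feedlist and the example URL and path are made up for this example. Unlike the real script, this sketch also skips blank lines, which is an assumption on my part and makes it a bit less sensitive to stray newlines.

```python
import os


def parse_feedlist(lines):
    """Pair up alternating lines into (feed_url, download_path) tuples.

    Blank lines are ignored here; the real feedread.py may be stricter.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 2 != 0:
        raise ValueError("feedlist needs a download path after every feed URL")
    urls = cleaned[0::2]   # lines 1, 3, 5, ... are the RSS links
    paths = cleaned[1::2]  # lines 2, 4, 6, ... are the download paths
    return [(url, os.path.expanduser(path)) for url, path in zip(urls, paths)]


if __name__ == "__main__":
    # Hypothetical feedlist contents, for illustration only.
    example = [
        "http://example.com/someshow/feed.xml\n",
        "/mnt/nfs/videos/someshow\n",
    ]
    for url, path in parse_feedlist(example):
        print(url, "->", path)
```

With the real file, the same function could be fed an open file handle, for example `parse_feedlist(open(os.path.expanduser("~/.feedread/feedlist")))`.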

feedread.py
blankhistory.db (you'll want to rename this to history.db)
feedlist

Put the files listed above into a directory in your home directory called .feedread (actually, only feedlist and history.db need to live there). Be sure to rename blankhistory.db to history.db. Also be sure that the Python library feedparser is installed, as well as the sqlite3 Python bindings.