Over on Eric Edelstein's blog, in a discussion about building another blog aggregator (to compete with Afrigator and Amatomu - listed in alphabetical order so as not to denote preference), I boasted about how easy it is to build the aggregator portion of it.  (Well, I'm fairly certain I did.  It seems the comments have all disappeared.  I'm not the only one pretty sure there were comments there...)

I did it quite differently before, but I'm now building another version of the aggregator, hopefully in a way that gets rid of most of the tedium of aggregation while letting the results be stored in whatever way the developer-user wants, and capturing more about the feeds and posts than a generic aggregator would.

While I want to abstract away the particular model, I started with models for Feed and Post, using Elixir and drawing inspiration for the fields and definitions from the FeedJack aggregator for Django.


from elixir import *

class Feed(Entity):
    has_field('feed_id', Integer, primary_key=True)

    # Where the feed is fetched from; each URL is stored only once
    has_field('feed_url', String(255), unique=True)

    # Descriptive values taken from the feed itself
    has_field('title', Unicode(200))
    has_field('description', Unicode())
    has_field('link', String(255))

    # HTTP caching values from the last fetch, used to avoid refetching
    has_field('etag', String(50))
    has_field('last_modified', DateTime)

    # When the aggregator last polled this feed
    has_field('last_checked', DateTime)

    has_many('posts', of_kind='Post')

    using_options(tablename='a2d_feed')

class Post(Entity):
    has_field('post_id', Integer, primary_key=True)

    # The post itself
    has_field('link', String(255))
    has_field('title', Unicode(255))
    has_field('content', Unicode())

    has_field('date_created', DateTime)
    has_field('date_modified', DateTime)

    # Identifier from the feed, used to avoid storing the same post twice
    has_field('guid', String(200), unique=True)

    has_field('author', Unicode(100))
    has_field('author_email', String(255))

    has_field('comments_url', String(255))

    # Every post must belong to a feed
    belongs_to('feed', of_kind="Feed", colname='feed_id', required=True)

    using_options(tablename='a2d_post')

These are pretty much the minimum required fields for an aggregator to be both meaningful and efficient.

The system can populate all the feed values the first time it downloads the feed at the feed URL, and update them every time after that.  The etag and last_modified values should be used to avoid refetching a feed that hasn't changed.  The last_checked value can be used in future as part of a strategy to try to predict the most opportune schedule for picking up changes to the feed - or it can just be used for diagnostics.
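As a rough sketch of how the stored etag and last_modified values could avoid a refetch: they are sent back to the server as If-None-Match and If-Modified-Since headers, and a 304 response means nothing has changed.  The helper name conditional_headers is hypothetical and uses only the standard library; it also assumes that naive datetimes in the database are in UTC.  The Universal Feed Parser's parse() accepts etag and modified arguments for the same purpose.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(etag=None, last_modified=None):
    """Build HTTP headers for a conditional GET from stored feed values.

    Hypothetical helper: etag is the string saved from the last response,
    last_modified is a datetime (naive values are assumed to be UTC).
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified is not None:
        if last_modified.tzinfo is None:
            last_modified = last_modified.replace(tzinfo=timezone.utc)
        # usegmt=True produces the "... GMT" date form that HTTP expects
        headers["If-Modified-Since"] = format_datetime(
            last_modified.astimezone(timezone.utc), usegmt=True
        )
    return headers
```

A feed that has never been fetched has neither value, so the helper returns an empty dict and the request is an ordinary GET.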

Posts carry a title, content, and link, plus metadata like the dates they were created and modified, and who wrote them.  They have a guid, which can be used to avoid adding the same post twice through multiple sources - or just to help make sure that the same post isn't added twice from the same source if the link or title or whatever changes.
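The guid check could look something like the sketch below.  The function name new_entries and the plain-dict entries are assumptions made for illustration, as is the fallback to the link when a feed supplies no guid; in the real aggregator the set of seen guids would come from the a2d_post table.

```python
def new_entries(entries, seen_guids):
    """Yield only the entries whose guid has not been stored yet.

    entries: iterable of dicts with optional 'guid' and 'link' keys
    seen_guids: set of guids already stored (updated in place)
    """
    for entry in entries:
        # Assumption: fall back to the link when the feed omits a guid
        guid = entry.get("guid") or entry.get("link")
        if not guid or guid in seen_guids:
            continue
        seen_guids.add(guid)
        yield entry
```

Because the comparison is on the guid alone, a post whose title or link is edited later is still recognised as the same post.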

The next step is to use the Universal Feed Parser to download the feed and parse it into a reasonably portable structure for accessing information from the various syndication formats, and then to store that information in the database.  After that, there's looking at what other information the Universal Feed Parser makes available beyond the basics of the syndication formats.
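To give an idea of the mapping step, here is a minimal sketch that turns one parsed entry into values for the Post fields above.  The function entry_to_post_fields is hypothetical, and it takes any dict-like entry so it can be tried without a network fetch.  The keys it reads (id, summary, content, author_detail, comments, published_parsed, updated_parsed) are the Universal Feed Parser's normalised entry names as I understand them; older releases may expose the dates under different names.

```python
from datetime import datetime


def _to_datetime(parsed):
    """Convert a struct_time-style 9-tuple to a naive datetime, or None."""
    return datetime(*parsed[:6]) if parsed else None


def entry_to_post_fields(entry):
    """Map one parsed feed entry (dict-like) onto the Post field names."""
    # Prefer the full content if the feed has it, else the summary
    content = entry.get("content") or []
    body = content[0].get("value") if content else entry.get("summary")
    author_detail = entry.get("author_detail") or {}
    created = _to_datetime(entry.get("published_parsed"))
    modified = _to_datetime(entry.get("updated_parsed"))
    return {
        # Assumption: use the link as the guid when the entry has no id
        "guid": entry.get("id") or entry.get("link"),
        "link": entry.get("link"),
        "title": entry.get("title"),
        "content": body,
        "date_created": created or modified,
        "date_modified": modified or created,
        "author": entry.get("author"),
        "author_email": author_detail.get("email"),
        "comments_url": entry.get("comments"),
    }
```

The resulting dict could then be passed as keyword arguments when creating a Post, along with the Feed it belongs to.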

5 old-style comments

  1. Eric Edelstein, April 25, 2007 at 11:17 PM.

    it's a short sad story.
    i definitely didn't delete your comment on purpose!!!!
  2. Paul Jacobson, April 25, 2007 at 11:42 PM.

    The construction of something like an aggregator fascinates me. I'd love to see what comes up next.
  3. Neville Newey, April 26, 2007 at 03:06 PM.

    Great to see others in SA using Python. I have some throw away code too that crawls the blogosphere and stores feeds in a sqlite database. There is also a basic web front end. It was used for the now defunct muti-blogs project but if people are interested in it I am willing to share the code. I certainly do NOT plan on creating YASABA! (yet another SA blog aggregator). Although from time to time people may see the crawler in their referrer logs as I experiment with it.

    Regards

  4. ilAn, April 30, 2007 at 03:48 PM.

    The difficult part in creating an aggregator site is NOT in collecting the data. That's easy. All you need to do is occasionally check the xml feed for posts, and store the data in a database with the relevant descriptor fields. You have shown how simple that is in this post.

    The difficult part, or rather I would say, the genius, lies in how one interacts with the user and how one presents that data.

    I have endless irritations with Afrigator's interface. And although I quite like Amatomu, it still needs some improvements. But admittedly it's in Alpha; and I can see Amatomu becoming nice. Afrigator looks like it has underlying design problems.

    Nevertheless, aggregation itself is just a service that runs through the xml feeds it has registered and updates a database.

    But how one presents that data. Wow! That is everything. Aggregation. Easy. Presentation. Priceless.
  5. Neil Blakey-Milner, April 30, 2007 at 06:02 PM.

    Yeah - that's exactly what I was saying on Eric's post. The picking up of feeds is the trivial part. The hard part is the added value - making it more likely for people to encounter content that they will find of interest to them.

    There's still quite a lot you can do without getting into that. A "planet" is a very useful community tool - witness the GeekDinner planet [planet.geekdinner.org.za], which really makes it easy to stay in touch with other people who attend Geek Dinners.

    So, allowing users to create their own mini-blogosphere is quite a valuable exercise - making it easy to create a "Web 2.0 security" planet with a few clicks is very cool. That and a mailing list or two and a calendar and so forth, and you're really going someplace.