Over at Eric Edelstein's, in a discussion about building another blog aggregator (to compete with Afrigator and Amatomu - listed in alphabetical order so as not to denote preference), I boasted about how easy it is to build the aggregator portion of it.  (Well, I'm fairly certain I did.  It seems the comments have all disappeared.  I'm not the only one pretty sure there were comments there...)

I built one quite differently before, but I'm now building another version of the aggregator, hopefully in a way that gets rid of most of the tedium of aggregation while letting the developer-user store the results however they want and capture more about the feeds and posts than a generic aggregator would.

While I want to abstract away the particular model, I started with models for Feed and Post, using Elixir, and drew inspiration for the fields and definitions from the FeedJack aggregator for Django.


from elixir import *

class Feed(Entity):
    has_field('feed_id', Integer, primary_key=True)

    has_field('feed_url', String(255), unique=True)

    has_field('title', Unicode(200))
    has_field('description', Unicode())
    has_field('link', String(255))

    has_field('etag', String(50))
    has_field('last_modified', DateTime)

    has_field('last_checked', DateTime)

    has_many('posts', of_kind='Post')

    using_options(tablename='a2d_feed')

class Post(Entity):
    has_field('post_id', Integer, primary_key=True)

    has_field('link', String(255))
    has_field('title', Unicode(255))
    has_field('content', Unicode())

    has_field('date_created', DateTime)
    has_field('date_modified', DateTime)

    has_field('guid', String(200), unique=True)

    has_field('author', Unicode(100))
    has_field('author_email', String(255))

    has_field('comments_url', String(255))

    belongs_to('feed', of_kind="Feed", colname='feed_id', required=True)

    using_options(tablename='a2d_post')

These are pretty much the minimum required fields for an aggregator to be both meaningful and efficient.

The system can populate all the feed values the first time it downloads the feed at feed_url, and update them every time after that.  The etag and last_modified values should be sent back to the server on later requests so that an unchanged feed is not fetched again.  The last_checked value can be used in future as part of a strategy to try to predict the most opportune schedule for picking up changes to the feed - or it can just be used for diagnostics.
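
As a rough illustration of how the stored etag and last_modified values avoid a refetch, here is a minimal sketch of an HTTP conditional request using only the standard library.  The function names are my own invention for this example, and treating a naive datetime as UTC is an assumption about how the value was stored.  A server that honours these headers answers 304 Not Modified when nothing has changed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that let the server reply 304 Not Modified."""
    headers = {}
    if etag:
        # Send back the ETag exactly as the server gave it to us.
        headers["If-None-Match"] = etag
    if last_modified is not None:
        # Assumption: naive datetimes in the database are stored as UTC.
        if last_modified.tzinfo is None:
            last_modified = last_modified.replace(tzinfo=timezone.utc)
        else:
            last_modified = last_modified.astimezone(timezone.utc)
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = format_datetime(last_modified, usegmt=True)
    return headers


def is_unchanged(status):
    """True when the server says the feed has not changed since last time."""
    return status == 304
```

The headers can be passed to whatever HTTP client does the download; when is_unchanged returns True, only last_checked needs updating.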

Posts have their content in terms of title, content, and link, and metadata like the dates they were created and modified, and who wrote them.  They have a guid, which can be used to avoid adding the same post twice through multiple sources - or just to make sure that the same post isn't added twice from the same source if the link or title or whatever changes.
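
The guid check can be sketched in plain Python, with a set standing in for the unique guid column.  The entries here are ordinary dicts and the function names are hypothetical; falling back to the link when a post has no guid is my own assumption, not something the models above enforce.

```python
def post_key(entry):
    """Return the guid if the entry has one, otherwise fall back to its link."""
    return entry.get("guid") or entry.get("link")


def add_new_posts(known_keys, entries):
    """Return the entries not seen before, recording their keys in known_keys."""
    added = []
    for entry in entries:
        key = post_key(entry)
        # Skip posts we cannot identify, and posts we have already stored.
        if key is None or key in known_keys:
            continue
        known_keys.add(key)
        added.append(entry)
    return added
```

Because the key is the guid rather than the title or link, a post whose title is edited later is still recognised as the same post.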

The next step is to use the Universal Feed Parser to download the feed and parse it into a reasonably portable structure for accessing information from the various syndication formats, and then to store that information in the database.  After that, there's looking at what other information the Universal Feed Parser makes available beyond the basics of the syndication formats.
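
To give an idea of what that storing step might look like, here is a minimal sketch that maps one parsed entry onto the Post fields above.  It assumes the entry behaves like a dict and uses the key names the Universal Feed Parser normally exposes (id, summary, comments, author_detail, published_parsed, updated_parsed); the function names and the fallbacks between fields are my own assumptions.

```python
from datetime import datetime


def to_datetime(parsed_time):
    """Convert a time tuple (year, month, day, hour, minute, second, ...) to a datetime."""
    if parsed_time is None:
        return None
    return datetime(*parsed_time[:6])


def post_fields(entry):
    """Map a parsed feed entry onto the column names of the Post model."""
    author_detail = entry.get("author_detail") or {}
    created = to_datetime(entry.get("published_parsed"))
    modified = to_datetime(entry.get("updated_parsed"))
    return {
        "link": entry.get("link"),
        "title": entry.get("title"),
        "content": entry.get("summary"),
        # Many feeds carry only one of the two dates, so each falls back to the other.
        "date_created": created or modified,
        "date_modified": modified or created,
        # Assumption: use the link as the guid when the feed supplies none.
        "guid": entry.get("id") or entry.get("link"),
        "author": entry.get("author"),
        "author_email": author_detail.get("email"),
        "comments_url": entry.get("comments"),
    }
```

The resulting dict can then be handed to whatever storage the developer-user prefers, which keeps the parsing separate from the particular model.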