Once you've modeled the core objects and written the core aggregation logic for feeds and posts, you still have pretty much everything else to do.  At the very least, you're going to want to put this together in some sort of interface, either for users to view or for programs to check against.  And if you're trying to build something like Amatomu, you've got a lot more work to do.

There are a number of additional standards beyond the original RSS (which a non-trivial number of feeds probably still use), and those standards let you capture more metadata about the posts.  Atom has a standard for categories, and there are a few other ways people indicate categories in their feeds as well (and they may just call them "tags").  Someone might "geotag" their posts using GeoRSS.  There are ways to indicate the license of your work, especially with Creative Commons licenses.  There are podcasts using enclosures, and vodcasts which may or may not be using Yahoo!'s Media RSS, and all sorts of other stuff in the feeds you're aggregating.


And that's just in the feed itself.  You could follow an entry's link, grab the page, and parse it.  You might look for links to other weblogs and build up some sort of reputation system, or collect all the links to find out what is being talked about most in the feeds you're aggregating.  Or you might parse the page for "technorati" tags or other microformats, for example to find out about events that people are attending using hCalendar.  There's a lot more to building an aggregator than the boring collection of feeds and posts.
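As a rough sketch of the page-parsing idea (this is not something pyaggregator does), the snippet below collects every link on a page and, separately, the tag names from rel="tag" links.  The LinkCollector class is made up for illustration, it uses only the Python 3 standard library, and it assumes the rel-tag convention that the tag name is the last path segment of the link's URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers all hrefs plus rel="tag" tag names."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []  # every href found on the page
        self.tags = []   # tag names taken from rel="tag" links

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        href = attrs.get('href')
        if not href:
            return
        self.links.append(href)
        # rel can hold several space-separated values, e.g. "tag nofollow"
        rel_values = (attrs.get('rel') or '').lower().split()
        if 'tag' in rel_values:
            # The tag name is the last segment of the URL path
            path = urlparse(href).path.rstrip('/')
            self.tags.append(unquote(path.rsplit('/', 1)[-1]))


page = ('<p><a href="http://example.org/blog/">a blog</a> '
        '<a rel="tag" href="http://technorati.com/tag/python">python</a></p>')
collector = LinkCollector()
collector.feed(page)
print(collector.links)
print(collector.tags)
```

Counting how often each entry in links turns up across posts would be a starting point for the "what is being talked about" idea.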

If you can think of anything I've missed (other things that aggregators might do with the feed and post data they collect), I'd love to hear from you.

Anyway, back to the code...

I've posted pyaggregator 0.1 on the Python Cheeseshop.  It relies on the Universal Feed Parser, which you'll have to download separately for now.  (It doesn't look like the Universal Feed Parser makes it easy to add checking for attributes on additional elements, so I might need to maintain a branch for a bit until I can come up with a patch to make it easier to adjust its behaviour externally.  Or I might be missing something.  But that's later...)

In there, I added back the tag checking support that was in FeedJack originally.  Further post-processing of entries after they've been created or updated is done in post_process_entry, whose default implementation looks for methods with names starting with process_entry_ and calls each one:

    # Hook to post-process the entry after it has been created or
    # modified.
    #
    # Default implementation finds methods prefixed with process_entry_
    # and executes these after an entry has been created or modified.
    def post_process_entry(self, post, feed, entry, posts, parsed_data):
        funcnames = [a for a in dir(self) if a.startswith('process_entry_')]
        for funcname in funcnames:
            func = getattr(self, funcname)
            func(post, feed, entry, posts, parsed_data)
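To see the hook discovery on its own, here is a minimal standalone sketch.  HookedAggregator and GeoAggregator are hypothetical stand-ins rather than pyaggregator classes, and the hooks only record that they ran.  One thing worth knowing: dir() returns names in alphabetical order, so that is the order the hooks run in.

```python
class HookedAggregator(object):
    """Hypothetical stand-in for the aggregator; only the hook lookup matters."""

    def __init__(self):
        self.calls = []  # records which hooks ran, in order

    def post_process_entry(self, post, feed, entry, posts, parsed_data):
        # dir() lists attribute names alphabetically, so hooks run in name order
        for funcname in dir(self):
            if funcname.startswith('process_entry_'):
                getattr(self, funcname)(post, feed, entry, posts, parsed_data)

    def process_entry_tags(self, post, feed, entry, posts, parsed_data):
        self.calls.append('tags')

    def process_entry_enclosures(self, post, feed, entry, posts, parsed_data):
        self.calls.append('enclosures')


class GeoAggregator(HookedAggregator):
    # A subclass adds a hook just by defining a suitably named method
    def process_entry_geo(self, post, feed, entry, posts, parsed_data):
        self.calls.append('geo')


agg = GeoAggregator()
agg.post_process_entry(None, None, {}, [], None)
print(agg.calls)  # ['enclosures', 'geo', 'tags']
```

If one hook ever depends on another having run first, the alphabetical ordering is something to keep in mind when naming it.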

So, adding tag checking is done by creating process_entry_tags:

    # Process entry to find and save the tags associated with the post
    def process_entry_tags(self, post, feed, entry, posts, parsed_data):
        if not hasattr(self, "set_tags_for_post"):
            return

        if not hasattr(self, "get_or_create_tag"):
            return

        entry_tags = self.get_tags(entry)
        self.log.debug('%s - tags: %s', post.link, [tag.name for tag in entry_tags])
        self.set_tags_for_post(post, entry_tags)

    # Gets a list of tag objects for the tags on the entry
    #
    # Copied wholesale, but with variable renaming, from FeedJack -
    #     http://www.feedjack.org/
    #
    # Copyright (c) 2006, Gustavo Picon (with an accent on the o, but
    # I'm an encoding-Python-files-newbie)
    def get_tags(self, entry):
        """Returns a list of tag objects from an entry."""

        if 'tags' not in entry:
            return []

        entry_tags = []
        for tag in entry.tags:
            if tag.label is not None:
                terms = tag.label
            else:
                terms = tag.term
            terms = terms.strip()
            if ',' in terms or '/' in terms:
                terms = terms.replace(',', '/').split('/')
            else:
                terms = [terms]

            for tagname in terms:
                tagname = tagname.lower()
                while '  ' in tagname:
                    tagname = tagname.replace('  ', ' ')
                if not tagname or tagname == ' ':
                    continue
                if tagname not in self.tags:
                    tagobj = self.get_or_create_tag(tagname)
                    self.tags[tagname] = tagobj
                entry_tags.append(self.tags[tagname])
        return entry_tags
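The string handling in get_tags is easier to follow when pulled out on its own.  The function below is a sketch that mirrors the same steps (strip, split on commas and slashes, lowercase, collapse double spaces) without touching any tag objects; split_tag_terms is a name I've made up for illustration.

```python
def split_tag_terms(raw):
    """Split one raw category string into normalised tag names.

    Mirrors the string handling in get_tags, minus the tag objects.
    """
    names = []
    # Treat ',' and '/' as equivalent separators
    for name in raw.strip().replace(',', '/').split('/'):
        name = name.lower()
        # Collapse runs of spaces down to a single space
        while '  ' in name:
            name = name.replace('  ', ' ')
        if not name or name == ' ':
            continue
        names.append(name)
    return names


print(split_tag_terms("Python, Django/Web  Dev"))
# ['python', ' django', 'web dev']
```

Note the leading space on ' django': only the whole string is stripped, not each piece, so "python, django" and "python,django" produce different tag names.  Calling name.strip() inside the loop would be one way to avoid that, at the cost of diverging from the FeedJack behaviour.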

Picking up enclosures for podcasts is similarly done by creating process_entry_enclosures:

    # Process entry to find and save the enclosures within the post
    def process_entry_enclosures(self, post, feed, entry, posts, parsed_data):
        if not hasattr(self, "save_enclosure"):
            return

        if not hasattr(self, "clear_enclosures_for_post"):
            return

        self.clear_enclosures_for_post(post)

        if 'enclosures' in entry:
            for e in entry.enclosures:
                ed = dict(url=e.href, size=e.length, type=e.type, post=post)
                ed['date_created'] = post.date_created
                ed['date_modified'] = post.date_modified
                if hasattr(entry, "itunes_duration"):
                    # Placeholder: the duration isn't parsed yet, so
                    # store 0 seconds for now.
                    ed['duration'] = 0
                self.save_enclosure(**ed)
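The duration is stored as 0 whenever the entry has an itunes_duration.  A possible next step, sketched here rather than taken from pyaggregator, is to convert that value into seconds.  This assumes the value arrives as a string in one of the forms the iTunes podcast tags allow (HH:MM:SS, MM:SS, or a plain number of seconds); parse_itunes_duration is a hypothetical helper name.

```python
def parse_itunes_duration(value):
    """Convert an itunes:duration string to whole seconds.

    Returns None when the value isn't in a recognised form.
    """
    parts = value.strip().split(':')
    if not 1 <= len(parts) <= 3:
        return None
    try:
        numbers = [int(part) for part in parts]
    except ValueError:
        return None
    # Fold left to right: each step shifts the running total up one unit
    seconds = 0
    for number in numbers:
        seconds = seconds * 60 + number
    return seconds


print(parse_itunes_duration("1:02:03"))  # 3723
print(parse_itunes_duration("12:34"))    # 754
print(parse_itunes_duration("90"))       # 90
```

Returning None for unparseable values keeps "unknown" distinct from a genuine zero-length enclosure.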

Both hooks first check that the storage methods they need are available, and quietly do nothing if they aren't.  For Elixir, the models and those methods look like this:

    # Elixir re-exports the SQLAlchemy column types (Integer, Unicode, ...)
    from elixir import *

    class Tag(Entity):
        has_field('tag_id', Integer, primary_key=True)
        has_field('name', Unicode(50))

        has_and_belongs_to_many('posts', of_kind='Post', inverse='tags')
        has_and_belongs_to_many('feeds', of_kind='Feed', inverse='tags')

        using_options(tablename='a2d_tag')

    class Enclosure(Entity):
        has_field('enclosure_id', Integer, primary_key=True)

        has_field('url', String(255))
        has_field('type', Unicode(255))
        has_field('size', Integer)

        has_field('duration', Integer) # in seconds

        has_field('date_created', DateTime)
        has_field('date_modified', DateTime)

        belongs_to('post', of_kind="Post", colname='post_id', inverse='enclosures')

        using_options(tablename='a2d_enclosures')


    class ElixirAggregatorMixin(object):

        ...

        def set_tags_for_post(self, post, tags):
            post.tags = tags

        def get_or_create_tag(self, tagname):
            tags = Tag.select(Tag.c.name == tagname)
            if not tags:
                return Tag(name=tagname)
            return tags[0]

        def save_enclosure(self, **kw):
            Enclosure(**kw).save()

        def clear_enclosures_for_post(self, post):
            # delete() only builds the statement; execute() runs it
            # (assumes the metadata is bound to an engine)
            Enclosure.table.delete(Enclosure.c.post_id == post.post_id).execute()
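Because the hooks only look for these four method names, any storage backend can supply them.  As an illustration, here is a hypothetical in-memory version that is handy for trying the hooks out without a database.  InMemoryAggregatorMixin, SimpleTag and SimplePost are invented for this sketch and aren't part of pyaggregator.

```python
class SimpleTag(object):
    """Hypothetical tag object; only needs a name attribute."""
    def __init__(self, name):
        self.name = name


class SimplePost(object):
    """Hypothetical post object used to exercise the mixin."""
    def __init__(self, link):
        self.link = link
        self.tags = []


class InMemoryAggregatorMixin(object):
    """Hypothetical storage mixin that keeps everything in memory."""

    def __init__(self):
        self.tag_store = {}        # tag name -> SimpleTag
        self.enclosure_store = []  # one dict per saved enclosure

    def set_tags_for_post(self, post, tags):
        post.tags = list(tags)

    def get_or_create_tag(self, tagname):
        if tagname not in self.tag_store:
            self.tag_store[tagname] = SimpleTag(tagname)
        return self.tag_store[tagname]

    def save_enclosure(self, **kw):
        self.enclosure_store.append(kw)

    def clear_enclosures_for_post(self, post):
        # Keep only the enclosures that belong to other posts
        self.enclosure_store = [e for e in self.enclosure_store
                                if e.get('post') is not post]


store = InMemoryAggregatorMixin()
post = SimplePost('http://example.org/1')
store.set_tags_for_post(post, [store.get_or_create_tag('python')])
store.save_enclosure(url='http://example.org/ep1.mp3', post=post)
store.clear_enclosures_for_post(post)
print([t.name for t in post.tags], len(store.enclosure_store))
```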

Time permitting, I'll continue to look at other information that one can gain from the feeds that are being aggregated.