Once you've modeled the core objects and written the core aggregation logic for feeds and posts, you still have pretty much everything else to do. At the very least, you're going to want to put this together in some sort of interface, either for users to view or for programs to check against. And if you're trying to build something like Amatomu, you've got a lot more work to do.
There are a number of additional standards above the original RSS (which a non-trivial number of feeds probably still use), and those additional standards allow you to capture more metadata about the posts. Atom has a standard around categories, and there are a few other ways people indicate categories in their feeds as well (and they may just call them "tags"). Someone might "geotag" their posts using GeoRSS. There are ways to indicate the license of your work, especially with Creative Commons licenses. There are podcasts using enclosures, and vodcasts which may or may not be using Yahoo!'s Media RSS, and all sorts of other stuff in the feeds you're aggregating.
And that's just in the feed itself. You could follow an entry's link, grab the page, and parse it for links to other web logs to build up some sort of reputation system, or parse it for all links to find out what is being talked about a lot in the feeds you're aggregating. Or you might parse it for Technorati tags or other microformats, for example to find out about events that people are attending using hCalendar. There's a lot more to building an aggregator than the boring collection of feeds and posts.
If you can think of anything I've missed (other things that aggregators might do with the feed and post data they collect), I'd love to hear from you.
Anyway, back to the code...
I've posted pyaggregator 0.1 on the Python Cheeseshop. It relies on the Universal Feed Parser, which you'll have to download separately for now. (It doesn't look like the Universal Feed Parser makes it easy to add checking for attributes on additional elements, so I might need to maintain a branch for a bit until I can come up with a patch to make it easier to adjust its behaviour externally. Or I might be missing something. But that's later...)
In there, I added back the tag checking support that was in FeedJack originally. Further post processing of entries after they've been created or updated is done in post_process_entry, the default implementation of which checks for methods that start with process_entry_:
# Hook to post-process the entry after it has been created or
# modified.
#
# Default implementation finds methods prefixed with process_entry_
# and executes these after an entry has been created or modified.
def post_process_entry(self, post, feed, entry, posts, parsed_data):
    funcnames = [a for a in dir(self) if a.startswith('process_entry_')]
    for funcname in funcnames:
        func = getattr(self, funcname)
        func(post, feed, entry, posts, parsed_data)
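To make the dispatch concrete, here is a minimal, self-contained sketch of the same pattern outside pyaggregator. The class names, the two hooks, and the dictionary standing in for a post are all made up for illustration; only the process_entry_ prefix and the method signature come from the code above.

```python
class HookedAggregator(object):
    """Toy stand-in for the aggregator; only the hook dispatch is shown."""

    def post_process_entry(self, post, feed, entry, posts, parsed_data):
        # Same discovery logic as above: call every process_entry_* method.
        # dir() returns names sorted alphabetically, so that is the call order.
        for funcname in dir(self):
            if funcname.startswith('process_entry_'):
                getattr(self, funcname)(post, feed, entry, posts, parsed_data)


class TitleAndWordCount(HookedAggregator):
    """Hypothetical subclass that adds two hooks just by naming them."""

    def __init__(self):
        self.calls = []

    def process_entry_wordcount(self, post, feed, entry, posts, parsed_data):
        post['words'] = len(entry.get('summary', '').split())
        self.calls.append('wordcount')

    def process_entry_title(self, post, feed, entry, posts, parsed_data):
        post['title'] = entry.get('title', '').strip()
        self.calls.append('title')


agg = TitleAndWordCount()
post = {}  # a plain dict standing in for a real post object
entry = {'title': ' Hello ', 'summary': 'one two three'}
agg.post_process_entry(post, None, entry, [], None)

print(agg.calls)  # ['title', 'wordcount']
print(post)       # {'title': 'Hello', 'words': 3}
```

Because dir() returns names in alphabetical order, the hooks run in that order, so they shouldn't depend on each other unless the names are chosen with that in mind.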
So, adding tag checking is done by creating process_entry_tags:
# Process entry to find and save the tags associated with the post
def process_entry_tags(self, post, feed, entry, posts, parsed_data):
    if not hasattr(self, "set_tags_for_post"):
        return
    if not hasattr(self, "get_or_create_tag"):
        return
    entry_tags = self.get_tags(entry)
    self.log.debug('%s - tags: %s', post.link, [tag.name for tag in entry_tags])
    self.set_tags_for_post(post, entry_tags)
# Gets a list of tag objects for the tags on the entry
#
# Copied wholesale, but with variable renaming, from FeedJack -
# http://www.feedjack.org/
#
# Copyright (c) 2006, Gustavo Picon (with an accent on the o, but
# I'm an encoding-Python-files-newbie)
def get_tags(self, entry):
    """Returns a list of tag objects from an entry."""
    if 'tags' not in entry:
        return []
    entry_tags = []
    for tag in entry.tags:
        # Prefer the human-readable label; fall back to the term
        if tag.label is not None:
            terms = tag.label
        else:
            terms = tag.term
        terms = terms.strip()
        # Treat both ',' and '/' as separators between tags
        if ',' in terms or '/' in terms:
            terms = terms.replace(',', '/').split('/')
        else:
            terms = [terms]
        for tagname in terms:
            tagname = tagname.lower()
            # Collapse runs of spaces into a single space
            while '  ' in tagname:
                tagname = tagname.replace('  ', ' ')
            if not tagname or tagname == ' ':
                continue
            # Cache tag objects by name so each is looked up only once
            if tagname not in self.tags:
                tagobj = self.get_or_create_tag(tagname)
                self.tags[tagname] = tagobj
            entry_tags.append(self.tags[tagname])
    return entry_tags
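To see what that splitting and normalising does to real-looking input, here is a small standalone sketch of just the string handling, with no tag objects or cache involved. The function name split_tag_terms is my own, and it trims each name a little more aggressively than get_tags does, so treat it as an approximation rather than the exact behaviour.

```python
def split_tag_terms(raw):
    """Split a raw category string into normalised tag names (a sketch)."""
    terms = raw.strip()
    # Same separator handling as get_tags: ',' and '/' both split
    if ',' in terms or '/' in terms:
        parts = terms.replace(',', '/').split('/')
    else:
        parts = [terms]
    names = []
    for part in parts:
        # Lower-case, collapse whitespace runs, and trim both ends
        name = ' '.join(part.lower().split())
        if name:
            names.append(name)
    return names


print(split_tag_terms('Python, Django / Web  Dev'))  # ['python', 'django', 'web dev']
print(split_tag_terms('Open Source'))                # ['open source']
print(split_tag_terms('TCP/IP'))                     # ['tcp', 'ip']
```

The last line shows the cost of the approach: a term that legitimately contains one of the separators gets split as well.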
Picking up enclosures for podcasts is similarly done by creating process_entry_enclosures:
# Process entry to find and save the enclosures within the post
def process_entry_enclosures(self, post, feed, entry, posts, parsed_data):
    if not hasattr(self, "save_enclosure"):
        return
    if not hasattr(self, "clear_enclosures_for_post"):
        return
    self.clear_enclosures_for_post(post)
    if 'enclosures' in entry:
        for e in entry.enclosures:
            ed = dict(url=e.href, size=e.length, type=e.type, post=post)
            ed['date_created'] = post.date_created
            ed['date_modified'] = post.date_modified
            if hasattr(entry, "itunes_duration"):
                ed['duration'] = 0
            self.save_enclosure(**ed)
Both hooks first check that the model methods they need to persist the data are available. For Elixir, the models and those methods look like this:
class Tag(Entity):
    has_field('tag_id', Integer, primary_key=True)
    has_field('name', Unicode(50))
    has_and_belongs_to_many('posts', of_kind='Post', inverse='tags')
    has_and_belongs_to_many('feeds', of_kind='Feed', inverse='tags')
    using_options(tablename='a2d_tag')

class Enclosure(Entity):
    has_field('enclosure_id', Integer, primary_key=True)
    has_field('url', String(255))
    has_field('type', Unicode(255))
    has_field('size', Integer)
    has_field('duration', Integer)  # in seconds
    has_field('date_created', DateTime)
    has_field('date_modified', DateTime)
    belongs_to('post', of_kind="Post", colname='post_id', inverse='enclosures')
    using_options(tablename='a2d_enclosures')
class ElixirAggregatorMixin(object):
    ...

    def set_tags_for_post(self, post, tags):
        post.tags = tags

    def get_or_create_tag(self, tagname):
        tags = Tag.select(Tag.c.name == tagname)
        if not tags:
            return Tag(name=tagname)
        return tags[0]

    def save_enclosure(self, **kw):
        Enclosure(**kw).save()

    def clear_enclosures_for_post(self, post):
        Enclosure.table.delete(Enclosure.c.post_id == post.post_id)
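Since the hooks only care that these four methods exist, any storage backend can supply them. As a rough illustration (not something that ships with pyaggregator), here is a hypothetical in-memory mixin with the same method names. The class name, its attributes, and the use of plain dictionaries and strings in place of model objects are all assumptions for the sketch.

```python
class InMemoryAggregatorMixin(object):
    """Hypothetical storage mixin that keeps everything in plain Python objects."""

    def __init__(self):
        self.tags_by_name = {}  # tag name -> tag record
        self.tags_by_post = {}  # post key -> list of tag records
        self.enclosures = []    # list of enclosure dicts

    def set_tags_for_post(self, post, tags):
        self.tags_by_post[post] = list(tags)

    def get_or_create_tag(self, tagname):
        # setdefault returns the existing record, or stores and returns a new one
        return self.tags_by_name.setdefault(tagname, {'name': tagname})

    def save_enclosure(self, **kw):
        self.enclosures.append(kw)

    def clear_enclosures_for_post(self, post):
        self.enclosures = [e for e in self.enclosures if e.get('post') != post]


store = InMemoryAggregatorMixin()
first = store.get_or_create_tag('python')
second = store.get_or_create_tag('python')
print(first is second)  # True

# Strings stand in for post objects here
store.save_enclosure(url='http://example.com/a.mp3', post='post-1')
store.save_enclosure(url='http://example.com/b.mp3', post='post-2')
store.clear_enclosures_for_post('post-1')
print(len(store.enclosures))  # 1
```

Something like this could be handy for testing the hooks without a database.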
Time permitting, I'll continue to look at other information that one can gain from the feeds that are being aggregated.
I don't know how likely this situation is but:
could be problematic. If one were to have a term with a comma in it but with / separators you would end up splitting the term with a comma in it like this:
Also, if you haven't seen it, you should check out http://cleverdevil.org/computing/52/; it seems to me like you could maybe do some really sexy Elixir statements for enclosures and tags.
Anyway, keep up the cool posts!
Alexander: Yeah, I'm not happy with that code myself. I thought of trying other stuff, but I just left it as it was for FeedJack (just renaming the variables to make it easier to understand). If the feed uses Atom categories, then the aggregator should just use the categories as given. For others, at best you could use some sort of heuristic to determine the separators on a feed-by-feed basis.
I personally kind of hate them, but I figure you could probably use a regex to determine the separator and then split on that. I'm not entirely sure just how it would present, but it seems reasonable.
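A rough sketch of what that might look like, assuming the heuristic is simply "whichever candidate separator shows up most often across a feed's raw tag strings wins". The candidate set, the function names, and the sample data are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical set of characters a feed might use between tags
CANDIDATE_SEPARATORS = re.compile(r'[,/;|]')


def guess_separator(raw_terms):
    """Return the most frequent candidate separator in a feed's raw tag strings, or None."""
    counts = Counter()
    for raw in raw_terms:
        counts.update(CANDIDATE_SEPARATORS.findall(raw))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def split_with(raw, separator):
    """Split one raw tag string on the separator guessed for its feed."""
    if separator is None:
        return [raw.strip()]
    return [part.strip() for part in raw.split(separator) if part.strip()]


feed_terms = ['python, django', 'tcp/ip, networking', 'linux, bsd, unix']
sep = guess_separator(feed_terms)
print(repr(sep))                              # ','
print(split_with('tcp/ip, networking', sep))  # ['tcp/ip', 'networking']
```

It would still guess wrong on feeds that mix separators or only ever have one tag per post, so it's a heuristic at best.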
I guess my blog actually does a separate category tag for each tag, though that's just what PyRSS2Gen does when you pass it a list.
I'm really interested to see what kind of interface you intend to do for this. It would be even cooler if your aggregator could be extended to have functionality like Plagger or Yahoo! Pipes :) (though that's a bit of a pipe dream, no pun intended).
I've started to look through the actual code release you did, and the little snippets make a lot more sense now. :) One thing I noticed, though: you raise RuntimeError in the various functions that you implement in ElixirAggregatorMixin. It would probably be better to raise NotImplementedError, since that's exactly what the case is.
I feel like the transparency of the model is a little inconsistent though. You seem to try and proxy a lot of the model access away into separate functions in your ElixirAggregatorMixin, but then you basically directly expose the Elixir/SQLAlchemy mapper API when dealing with feeds. Might be better to either add some feed handling functions to the Mixin or to more directly expose the API for all of the other data types (enclosures, tags, posts, etc).
A big reason to try and move all of the database access into the Mixin is that it makes it a lot easier to do some kind of message-based concurrency, which would make it easier to run multiple instances of the Aggregator class without worrying about going through the Aggregator code and adding locking. The mixin functions could simply be methods which append a message to the database thread's queue to be handled later (a bit harder for the read operations, but still easier than running through and going locking crazy). I think for situations where you're doing a lot of aggregation (like a Google Reader type site) a multi-threaded approach would really be imperative, but it would be nice to be able to use a package like this that already does a lot of the leg work for you.
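To sketch the idea (only the write side, and assuming Python 3's queue module): the mixin methods below just put messages on a queue, and a single worker thread is the only thing that would ever touch the database. The class name, the message format, and the list standing in for the database are all hypothetical.

```python
import queue
import threading


class QueueingStoreMixin(object):
    """Hypothetical mixin: write methods enqueue messages instead of touching the database."""

    def __init__(self, messages):
        self.messages = messages

    def set_tags_for_post(self, post, tags):
        self.messages.put(('set_tags_for_post', {'post': post, 'tags': list(tags)}))

    def save_enclosure(self, **kw):
        self.messages.put(('save_enclosure', kw))


def database_worker(messages, handled):
    """Drain the queue on one thread; only this thread would talk to the database."""
    while True:
        message = messages.get()
        if message is None:      # sentinel: stop the worker
            break
        handled.append(message)  # stand-in for the real database write


messages = queue.Queue()
handled = []
worker = threading.Thread(target=database_worker, args=(messages, handled))
worker.start()

store = QueueingStoreMixin(messages)
store.set_tags_for_post('post-1', ['python'])
store.save_enclosure(url='http://example.com/a.mp3', post='post-1')

messages.put(None)  # tell the worker to finish
worker.join()
print([name for name, payload in handled])  # ['set_tags_for_post', 'save_enclosure']
```

Several aggregator instances could share the same queue, since queue.Queue does its own locking; reads would need something more, such as a reply queue per request.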
Hey Alex. For some reason, NotImplementedError was not behaving the way I expected when I was originally using it. I'll try moving back to it and see what happens.
Maybe I should do "saveFeed(feed)" instead of feed.save(), though - that way the model, as you say, can handle concurrency/threading. The "feed" object can just be a dictionary - should make it a lot easier to do a lot of things.
Thanks for all the comments! Definitely keeping my thinking...