One piece of custom aggregator code I mentioned earlier was parsing the post (or the content located at the post's link) for microformats.  Microformats are basically a way to embed semantic information or structure into web pages using ordinary HTML.  They can provide additional structured information about people or companies (hCard), events (hCalendar), reviews (hReview) or locations (hCard again) contained in the web page, or they may provide additional metadata for the current page: a set of words that describe the page ("technorati" tags, or rel-tag) or the license under which the page falls (rel-license).

If the feed provides an HTML version of the content (type="html" or type="xhtml", or just by convention), there may be microformats in that HTML.  If only the text is available, you have to decide whether to fetch the page at the link given in the feed for the post.  There are some trade-offs here: you can't be sure which data on that page belongs exclusively to the post (for example, a blog roll that appears on every page has nothing to do with this post in particular).

 


Anyway, on to the code... 

Parsing microformats "just" means parsing HTML.  If you're not lucky enough to have 100% valid XHTML, this can be tricky to do reliably.  I've just relied on Genshi's HTML function to convert broken HTML into a stream of events.  This can be rendered into a well-formed XML/XHTML document, and dealing with that is easier.  I'm not sure quite how well Genshi's HTML function converts really broken HTML into something useful - it was just handy at the time.  Alternatively, one can use various tidy-based options - utidylib, pytidy, mxTidy, and I'm sure more.
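
To make the "broken HTML in, well-formed tree out" step concrete, here is a minimal sketch that uses only the Python 3 standard library. It is a stand-in for illustration, not what Genshi does internally: the class SoupToTree, the function parse_tag_soup and the VOID set are all made-up names, and the repair strategy (auto-close whatever is still open, drop stray closing tags) is an assumption that handles only mild breakage.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML (partial list, an assumption).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupToTree(HTMLParser):
    """Build an ElementTree from sloppy HTML, wrapped in a single <div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        self.builder.start("div", {})

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; XML needs a string.
        attrs = {k: (v if v is not None else "") for k, v in attrs}
        self.builder.start(tag, attrs)
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def close(self):
        super().close()  # flushes any buffered trailing text
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("div")
        return self.builder.close()


def parse_tag_soup(html):
    parser = SoupToTree()
    parser.feed(html)
    return parser.close()


root = parse_tag_soup('<p class="vevent">Talk <b>bold<br>text</p><span>after')
xhtml = ET.tostring(root)  # serialised, well-formed markup
```

The serialised result can be handed to any XML parser, which is the property the rest of this post relies on. A tidy-based tool or Genshi will cope with far nastier input than this sketch does.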

I might've missed a few, but most properties of a microformat are identified by a class name.  As such, any time we encounter a class, we can check whether it is one used by the microformat under consideration, and add its value to our current data.  I used a simple "pattern" of handle_-prefixed functions named after the relevant classes to handle each element (it could just as well be a dictionary dispatch). flatten_element yields the current element and everything below it, to simplify the basic logic.

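import xml.etree.ElementTree as ET
import iso8601

class HCalendar(object):
    def __init__(self, content):
        # content is a file-like object holding well-formed XHTML
        self.content = content

    def get_events(self):
        vevents = []
        for event, elem in ET.iterparse(self.content):
            # class is a space-separated list; match whole names only,
            # so that e.g. "vevents" is not mistaken for "vevent"
            classes = elem.get('class', '').split()
            if 'vevent' in classes:
                vevents.append(elem)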

        for vevent in vevents:
            vevent_data = {}
            for elem in self.flatten_element(vevent):
                classes = elem.get('class', None)
                if classes:
                    for class_ in classes.split(" "):
                        funcname = "handle_%s" % (class_,)
                        func = getattr(self, funcname, None)
                        if func:
                            func(vevent_data, elem)

            if self.has_enough_data(vevent_data):
                vevent_data['description'] = ET.tostring(vevent)
                yield vevent_data

    def flatten_element(self, elem, yieldself = True):
        if yieldself:
            yield elem
        for e in elem:
            yield e
            for e1 in self.flatten_element(e, yieldself = False):
                yield e1
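
The dictionary dispatch mentioned above could look roughly like the following sketch (Python 3, standard library only). Everything named here (HANDLERS, set_text, set_href, extract, SAMPLE) is hypothetical and not part of the class above. The dates are kept as raw strings to keep the example short, and Element.iter() is used to walk the element and its descendants in document order.

```python
import xml.etree.ElementTree as ET


def text_for_elem(elem):
    # abbr design pattern: prefer the machine-readable title attribute
    if elem.tag == "abbr":
        return elem.get("title", elem.text)
    return elem.text


def set_text(key):
    """Return a handler that stores the element's text under key."""
    def handler(data, elem):
        data[key] = text_for_elem(elem)
    return handler


def set_href(data, elem):
    data["url"] = elem.get("href")


# class name -> handler; only names listed here are ever acted on
HANDLERS = {
    "summary": set_text("summary"),
    "location": set_text("location"),
    "dtstart": set_text("dtstart"),
    "dtend": set_text("dtend"),
    "url": set_href,
}


def extract(vevent):
    data = {}
    for elem in vevent.iter():
        for class_ in (elem.get("class") or "").split():
            handler = HANDLERS.get(class_)
            if handler:
                handler(data, elem)
    return data


SAMPLE = """<div class="vevent">
  <a class="url summary" href="http://example.org/talk">Example talk</a>
  <abbr class="dtstart" title="2007-04-24T16:30:00">4:30pm</abbr>
  <span class="location">Room 1</span>
</div>"""

event = extract(ET.fromstring(SAMPLE))
```

One practical difference from the getattr approach: a dictionary key can be any class name, including hyphenated ones such as hCard's street-address, which cannot be written as a normal method name.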

The handle functions tend to be pretty simple.  They use some common functions - mostly to deal with the abbr design pattern (handled by text_for_elem) and the datetime design pattern (by parse_date) in microformats:

    def text_for_elem(self, elem):
        if elem.tag != "abbr":
            return elem.text
        else:
            return elem.get('title', elem.text)

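    def parse_date(self, date_str):
        # Treat anything unparseable (including None) as "no date",
        # without swallowing KeyboardInterrupt and friends
        try:
            return iso8601.parse_date(date_str)
        except Exception:
            return None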

    def handle_dtstart(self, vevent_data, elem):
        date_str = self.text_for_elem(elem)
        date_obj = self.parse_date(date_str)
        if date_obj:
            vevent_data['dtstart'] = date_obj

    def handle_dtend(self, vevent_data, elem):
        date_str = self.text_for_elem(elem)
        date_obj = self.parse_date(date_str)
        if date_obj:
            vevent_data['dtend'] = date_obj

    def handle_url(self, vevent_data, elem):
        vevent_data['url'] = elem.get('href')

    def handle_summary(self, vevent_data, elem):
        vevent_data['summary'] = self.text_for_elem(elem)

    def handle_location(self, vevent_data, elem):
        vevent_data['location'] = self.text_for_elem(elem)

Finally, we just need to check we have all the relevant information before passing on the information to be saved:

    def has_enough_data(self, vevent_data):
        fields_to_check = [
            'summary',
            'dtstart',
            'dtend',
            'url',
            'location',
        ]
        for field in fields_to_check:
            if not vevent_data.get(field, None):
                print("Does not have field: %s" % (field,))
                return False
        return True

The output of all this is a list of events for a particular post's content.  Just need to hook it up into the aggregator, and create a model to store the information:

from elixir import *
class Event(Entity):
    has_field('event_id', Integer, primary_key=True)

    has_field('summary', Unicode(200))
    has_field('location', Unicode(200))
    has_field('description', Unicode())
    has_field('url', String(255))

    has_field('dtstart', DateTime)
    has_field('dtend', DateTime)

    belongs_to('post', of_kind="Post",
        colname='post_id', inverse='events')

    using_options(tablename='a2d_event')


from StringIO import StringIO  # Python 2; use io.StringIO on Python 3
import genshi

class HCalendarAggregatorMixin(object):
    def process_entry_hcalendar(self, post, feed, entry,
        posts, parsed_data):

        if not hasattr(self, "saveEvent"):
            return
        content = StringIO('<div>' +
            genshi.HTML(post.content).render('xhtml') + 
            '</div>')
        hc = HCalendar(content)
        for event in hc.get_events():
            print("Received: %s" % (event,))
            self.saveEvent(**event)

    def saveEvent(self, **event):
        e = Event(**event)
        e.save()

Just need to add HCalendarAggregatorMixin to our class list, and we're done.  (I'm not sure about this mixin stuff, but it works for now...)

I haven't noticed much use of hCalendar in South Africa, but it looks like it somewhat works on Charl's and my feeds:

mysql> SELECT summary, location, url, dtstart, dtend FROM a2d_event\G
*************************** 1. row ***************************
 summary: CLUG Talk: Erlang
location: Chemical Engineering Lecture Theatre
     url: http://wiki.clug.org.za/wiki/Talks_2007
 dtstart: 2007-04-24 16:30:00
   dtend: 2007-04-24 18:00:00
*************************** 2. row ***************************
 summary: myXchange 25 April 2007 Meeting
location: Upstairs at Harrys above Spar in York Street, George
     url: http://myxchange.pbwiki.com/
 dtstart: 2007-04-25 18:30:00
   dtend: 2007-04-25 21:30:00
2 rows in set (0.00 sec)

Of course, I'm ignoring quite a few pretty large issues (like timezones...), and there's quite a bit more to getting this right for production quality and to handle all the weird combinations (hCalendar locations that are hCards themselves, &c.).  Ignorance is bliss, though...
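
As a rough idea of what dealing with timezones might involve, here is a sketch (Python 3, standard library only) that normalises parsed dates to naive UTC before they are stored. The helper names to_utc and to_naive_utc are hypothetical, and the default of UTC+2 for dates without an offset is purely an assumption based on the South African feeds above; a real aggregator would need a per-feed setting or similar.

```python
from datetime import datetime, timedelta, timezone

# Assumption: dates without an explicit offset are South African time (UTC+2)
SAST = timezone(timedelta(hours=2), "SAST")


def to_utc(dt, default_tz=SAST):
    """Return dt as an aware UTC datetime."""
    if dt.tzinfo is None:
        # No offset given in the markup: fall back to the assumed zone
        dt = dt.replace(tzinfo=default_tz)
    return dt.astimezone(timezone.utc)


def to_naive_utc(dt, default_tz=SAST):
    """Return dt in UTC with tzinfo stripped, for a plain DateTime column."""
    return to_utc(dt, default_tz).replace(tzinfo=None)


aware = datetime(2007, 4, 24, 16, 30, tzinfo=SAST)
naive = datetime(2007, 4, 24, 16, 30)

stored_aware = to_naive_utc(aware)
stored_naive = to_naive_utc(naive)
```

Both values come out as 14:30 UTC on 24 April 2007, so events from feeds with and without explicit offsets become comparable once stored.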