
I finally remembered to add the blog of Andy Rabagliati (aka wizzy) to my feed.  He covers a variety of local (South African) and regional (rest of Africa) topics, but his coverage of two Zulu cultural events he's attended has stuck with me the most so far - one on a Zulu wedding, and another on a ceremony held by an important Zulu sangoma.

Not every post on the entire My Digital Life mass blog site is a technology post.  In fact, at the moment, I'd say less than 5% are.  Please remove your silly blanket membership in "Technology" on Amatomu, which you seem to have claimed simply because you are on the Intarwebs.

The Management

While Amatomu has been pretty good at regularly releasing small improvements to their aggregator, Afrigator has gone for a big bang, releasing their new version with new features today.

Afrigator quite cleverly also created an additional channel for community participation, having started a mailing list for people to hear about their new version beforehand, and to discuss it afterwards.  Amatomu suddenly feels like a corporate in comparison (although we know otherwise).

Of interest to me are the OpenID login support and a "private" aggregator of the feeds you are particularly interested in.

But, nothing dramatically exciting yet - Amatomu still leads in terms of the information they provide and the visualisation thereof.

Another important aspect of building an aggregator, besides tags/keywords, is being able to find posts based on particular words in the text.  In the "assigned keywords" vs. "words in the page" battle, "words in the page" still has the lead.

This is typically done using a "full text index" of the content - of a post in our case.  There are two main ways to do this - either in your standard database (MySQL's MATCH ... AGAINST construct, or tsearch2 for PostgreSQL), or in an external indexing system just for this sort of indexing and searching, using something like Lucene or Xapian.
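To make the idea of a "full text index" concrete, here is a toy inverted index in plain Python.  It is only a sketch of the concept (a mapping from each word to the posts that contain it), not how MySQL, tsearch2, Lucene or Xapian actually implement it, and the names `TinyIndex` and `tokenize` are made up for this illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyIndex:
    """A toy inverted index: word -> set of ids of the posts containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, post_id, text):
        """Index every word of a post's text under the post's id."""
        for word in tokenize(text):
            self.postings[word].add(post_id)

    def search(self, query):
        """Return the ids of posts that contain every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        # Start with the posts for the first word, then narrow down.
        results = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            results &= self.postings.get(word, set())
        return results


index = TinyIndex()
index.add(1, "A Zulu wedding in KwaZulu-Natal")
index.add(2, "Afrigator releases a new version with OpenID login")
index.add(3, "Amatomu adds OpenID support")
```

With this, `index.search("openid login")` narrows three posts down to the one that has both words.  A real indexer adds stemming, stop words and ranking on top of the same basic structure.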

There are some interesting trade-offs here...

One item I mentioned in terms of custom aggregator code might be parsing the post (or the content located at the post's link) for microformats.  Microformats are basically a way to embed semantic information or structure into web pages using existing HTML.  Microformats can provide additional structured information about people or companies (hCard) or events (hCalendar) or reviews (hReview) or locations (hCard again) that are contained in the web page, or they may provide additional metadata for the current page - a set of words that describe the page ("technorati" tags or rel-tag) or the license under which the page falls (rel-license).

If the feed provides an HTML version of the content (type="html" or type="xhtml", or just by convention), it may be that there are microformats in that HTML.  It may be that only the text is available, and then you have some decisions to make as to whether to get the information from the link given in the feed for the post.  There are some trade-offs here - you can't be sure which data on that page has to do exclusively with the post (for example, there may be data in a blog roll that is on every page, and doesn't have to do with this post in particular).
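As a small illustration of the simplest case, here is a sketch that pulls rel-tag tags out of a chunk of HTML using only the standard library.  It assumes the usual rel-tag convention that the tag is the last path segment of the link's href.  The names `RelTagParser` and `extract_rel_tags` are invented for this example, and a proper microformats parser would handle far more than this.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class RelTagParser(HTMLParser):
    """Collect tags from <a rel="tag" href="..."> links (rel-tag)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel can hold several space-separated values, e.g. "tag nofollow".
        rel_values = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href") or ""
        if "tag" not in rel_values or not href:
            return
        # Drop any query string or fragment, then take the last path segment.
        path = href.split("?", 1)[0].split("#", 1)[0]
        segment = path.rstrip("/").rsplit("/", 1)[-1]
        if segment:
            self.tags.append(unquote(segment))


def extract_rel_tags(html):
    """Return the rel-tag tags found in an HTML fragment, in document order."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags
```

Run against the HTML in a feed entry, this only sees tags that belong to the post.  Run against the full page behind the link, it will also pick up any rel-tag links in sidebars, which is exactly the trade-off described above.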


Once you've modeled the core objects, and then done the core aggregation logic for feeds and posts, you have pretty much everything else to do.  At the very least, you're going to want to put this together in some sort of interface - either for users to view, or for programs to check against.  And, if you're trying to build something like Amatomu, you've got a lot more work to do.

There are a number of additional standards beyond the original RSS (which a non-trivial number of feeds probably still use), and those additional standards allow you to capture more metadata about the posts.  Atom has a standard for categories, and there are a few other ways people indicate categories in their feeds as well (and they may just call them "tags").  Someone might "geotag" their posts using GeoRSS.  There are ways to indicate the license of your work - especially with Creative Commons licenses.  There are podcasts using enclosures, and vodcasts which may or may not be using Yahoo!'s Media RSS, and all sorts of other stuff in the feeds you're aggregating.
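As one example of that extra metadata, here is a rough sketch of reading Atom categories with the standard library's ElementTree.  The function name `entry_categories` and the sample feed are made up for illustration.  In practice a feed-parsing library would normally do this for you and smooth over the differences between RSS and Atom.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 elements.
ATOM = "{http://www.w3.org/2005/Atom}"


def entry_categories(atom_xml):
    """Map each Atom entry's <id> to the list of its category terms."""
    root = ET.fromstring(atom_xml)
    result = {}
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id", default="")
        terms = [
            category.get("term")
            for category in entry.findall(ATOM + "category")
            if category.get("term")
        ]
        result[entry_id] = terms
    return result


SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <id>tag:example.com,2007:post-1</id>
    <title>Building an aggregator</title>
    <category term="python"/>
    <category term="aggregator"/>
  </entry>
  <entry>
    <id>tag:example.com,2007:post-2</id>
    <title>No categories here</title>
  </entry>
</feed>"""
```

Calling `entry_categories(SAMPLE_FEED)` gives one list of terms per entry id, with an empty list for entries that carry no categories.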

In my previous aggregator post, I set up Feed and Post models to capture the core information about these items.  We can now store the information from the aggregation process in a persistent location, and this can be used by some external program to view the aggregation of content.  Now we just need to get the information from somewhere...

The core aggregation logic is quite simple:

  1. We'll fetch the feed.
  2. We'll parse it.
  3. We'll save various bits of feed information to the database.
  4. For each entry in the feed, either create or update the information in the database.
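The four steps above can be sketched roughly as follows.  This is not my actual aggregator code: the `fetch` and `parse` callables are passed in so the example stays self-contained, a plain dict stands in for the database, and the shape of the parsed feed (a `title` plus a list of `entries` with an `id`) is an assumption made for the illustration.

```python
def aggregate(feed_url, fetch, parse, store):
    """Run the four aggregation steps for one feed.

    fetch(url) -> raw feed text
    parse(raw) -> {"title": ..., "entries": [{"id": ..., ...}, ...]}
    store      -> dict standing in for the database, keyed by feed URL
    Returns (created, updated) counts for the entries seen.
    """
    raw = fetch(feed_url)                      # 1. fetch the feed
    parsed = parse(raw)                        # 2. parse it

    # 3. save feed-level information
    feed_record = store.setdefault(feed_url, {"title": None, "posts": {}})
    feed_record["title"] = parsed["title"]

    # 4. create or update each entry
    created = updated = 0
    for entry in parsed["entries"]:
        if entry["id"] in feed_record["posts"]:
            updated += 1
        else:
            created += 1
        feed_record["posts"][entry["id"]] = dict(entry)
    return created, updated
```

Running it twice against the same feed creates the posts the first time and updates them the second time.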

There are a few optimisations we can do there for various levels of winnage.  For example, a pretty big win is that there's no point parsing the feed if we can be sure the feed hasn't changed.  An even bigger win, given the rate of change of individual feeds on the Internet, is that there's no point fetching the full feed if it hasn't changed - use of ETag and If-Modified-Since saves us not only unnecessary work but also unnecessary traffic (important in South Africa), and it is simply part of being a good Internet citizen.  A small win is that we don't need to update the database if the entry in the feed hasn't changed.
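Here is a minimal sketch of those optimisations using only the standard library.  The function names are invented for this example.  The stored ETag is sent back in an If-None-Match header and the stored Last-Modified value in If-Modified-Since; a server that supports conditional requests then answers 304 Not Modified with no body, which urllib reports as an HTTPError.  The digest helper is one assumed way of noticing that an entry hasn't changed.

```python
import hashlib
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers for a conditional GET."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not modified: skip parsing and keep the validators we had.
            return None, etag, last_modified
        raise


def entry_digest(title, content):
    """Fingerprint an entry so unchanged entries can skip the database write."""
    data = (title + "\n" + content).encode("utf-8")
    return hashlib.sha1(data).hexdigest()
```

The aggregator would store the returned ETag and Last-Modified values with the feed, and the digest with each post, then compare them on the next run.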

Over on Eric Edelstein's blog, in a discussion about building another blog aggregator (to compete with Afrigator and Amatomu - listed in alphabetical order so as not to denote preference), I boasted about how easy it is to build the aggregator portion of it.  (Well, I'm fairly certain I did.  It seems the comments have all disappeared.  I'm not the only one pretty sure there were comments there...)

I did it quite differently before, but I'm now building another version of the aggregator, hopefully in a way that gets rid of most of the tedium of aggregation while allowing the results to be stored in whatever way the developer-user wants, and capturing more about the feeds and posts than a generic aggregator will.

While I wanted to abstract away the particular model, I started with models for Feed and Post, using Elixir and drawing inspiration for fields and definitions from the FeedJack aggregator for Django.
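To give a feel for the shape of those models without depending on Elixir, here is a stand-in using plain dataclasses.  The field names are assumptions chosen for illustration and are not copied from my Elixir models or from FeedJack; a real version would map these to database columns.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Post:
    """One entry from a feed."""
    guid: str
    title: str
    link: str
    content: str = ""
    date_modified: Optional[datetime] = None


@dataclass
class Feed:
    """A feed being aggregated, with the validators from the last fetch."""
    feed_url: str
    title: str = ""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    posts: List[Post] = field(default_factory=list)

    def find_post(self, guid):
        """Return the post with this guid, or None if we haven't seen it."""
        for post in self.posts:
            if post.guid == guid:
                return post
        return None
```

Keeping the ETag and Last-Modified values on the feed is what makes conditional fetching possible on the next run.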


Oh dear, I'm "blogging about blogging" again.  I apologise in advance.  And although I know it's silly to talk about "true blogs", I'm going to do it anyway.  More apologies. 

The "technology" section of Amatomu (maybe I should suggest mod_rewrite to them for prettier URLs...) is starting to get a bit crowded.  If it isn't gadgets, it's games.  Both of which I think should be considered "lifestyle" more than "technology".  Sure, there's the occasional useful bit of technical information about the gadgets or games (or their platforms), but the primary point of interest on those sites is the gadget or game, which while made from technology, is mostly about the experience of owning it or playing it.

The last time I was talking code (I mean, besides the little example usage snippet when I was announcing TGOpenIDLogin, in between my reporting back about the GeekDinner, the Western Cape Linux User Group meeting, the Cape Town Python User Group meeting, and so forth), I was showing how I converted the WordPress Sociable plugin to a TurboGears widget (innovatively named TGSociable).

The ultimate purpose of this was to make a plugin for Gibe (my little weblog engine) which would add the sociable icons with the correct URLs to blog entry pages, without having to put anything sociable-specific in the code.

When writing the comment format "plugin system" so that Gibe could use something other than the built-in TinyMCE, I used pkg_resources to create an "entry point" so that other packages could provide functionality to Gibe, and Gibe would know about them without anything but installing the package.

So, I started with the near-simplest possible Plugin class:

class Plugin(object):
    def post_top_widgets(self, post, widgets, context):
        # widgets to put above the post
        pass