Another important aspect of building an aggregator, besides tags/keywords, is being able to find posts based on particular words in the text.  In the "assigned keywords" vs. "words in the page" battle, "words in the page" still has the lead.

This is typically done using a "full text index" of the content (of a post, in our case).  There are two main ways to do this: either in your standard database (MySQL's MATCH ... AGAINST construct, or tsearch2 for PostgreSQL), or in an external system built just for this sort of indexing and searching, such as Lucene or Xapian.
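To make the idea concrete, here is a minimal sketch of what a full text index does underneath: an inverted index mapping each word to the posts that contain it. It is a toy written in plain Python, not how MySQL, tsearch2, Lucene or Xapian actually implement it, and the names (tokenize, InvertedIndex) are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each word to the set of post ids whose text contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, post_id, text):
        """Index every word of a post's text under the post's id."""
        for word in tokenize(text):
            self.postings[word].add(post_id)

    def search(self, query):
        """Return the ids of posts containing every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


index = InvertedIndex()
index.add(1, "Building a blog aggregator in Python")
index.add(2, "Full text search with Xapian")
index.add(3, "A Python binding for Xapian")

matches = index.search("python xapian")  # only post 3 contains both words
```

A real engine adds stemming, stop words, ranking and persistent storage on top of this, which is where most of the trade-offs come from.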

There are some interesting trade-offs here...

One item I mentioned in terms of custom aggregator code might be parsing the post (or the content located at the post's link) for microformats.  Microformats are basically a way to embed semantic information or structure into web pages using plain, existing HTML.  Microformats can provide additional structured information about people or companies (hCard), events (hCalendar), reviews (hReview) or locations (hCard again) that are contained in the web page, or they may provide additional metadata for the current page - a set of words that describe the page ("technorati" tags or rel-tag) or the license under which the page falls (rel-license).

If the feed provides an HTML version of the content (type="html" or type="xhtml", or just by convention), it may be that there are microformats in that HTML.  It may be that only the text is available, and then you have some decisions to make as to whether to get the information from the link given in the feed for the post.  There are some trade-offs here - you can't be sure which data on that page has to do exclusively with the post (for example, there may be data in a blog roll that is on every page, and doesn't have to do with this post in particular).
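As a rough illustration of the simplest case, rel-tag, here is a sketch that pulls tags out of a chunk of HTML using only the standard library. It assumes the tag name is the last path segment of the link's href and that "+" stands for a space; the class and function names are hypothetical, and a real aggregator would probably lean on a dedicated microformats parser instead.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class RelTagParser(HTMLParser):
    """Collects tag names from <a rel="tag" href="..."> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. "tag nofollow"
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "tag" in rel_values and href:
            # Assumption: the tag is the last segment of the URL path
            path = urlparse(href).path.rstrip("/")
            name = unquote(path.rsplit("/", 1)[-1]).replace("+", " ")
            if name:
                self.tags.append(name)


def extract_rel_tags(html):
    """Return the rel-tag names found in an HTML fragment, in order."""
    parser = RelTagParser()
    parser.feed(html)
    parser.close()
    return parser.tags


sample = (
    '<p><a href="http://technorati.com/tag/python" rel="tag">Python</a> '
    '<a href="/tags/full+text/" rel="tag nofollow">Full text</a> '
    '<a href="http://example.com/about">About</a></p>'
)
tags = extract_rel_tags(sample)
```

Note that this runs over whatever HTML it is given, so feeding it the whole linked page rather than the post content will also pick up tags from sidebars and the like.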

 

Once you've modeled the core objects, and then done the core aggregation logic for feeds and posts, you have pretty much everything else to do.  At the very least, you're going to want to put this together in some sort of interface - either for users to view, or for programs to check against.  And, if you're trying to build something like Amatomu, you've got a lot more work to do.

There are a number of additional standards beyond the original RSS (which a non-trivial number of feeds probably still use), and those additional standards allow you to capture more metadata about the posts.  Atom has a standard around categories, and there are a few other ways people indicate categories in their feeds as well (and they may just call them "tags").  Someone might "geotag" their posts using GeoRSS.  There are ways to indicate the license of your work - especially with Creative Commons licenses.  There are podcasts using enclosures, and vodcasts which may or may not be using Yahoo!'s Media RSS, and all sorts of other stuff in the feeds you're aggregating.
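To give a feel for what capturing that extra metadata looks like, here is a small sketch that reads categories, license links and enclosures out of an Atom entry with the standard library's ElementTree. The sample feed and the entry_metadata helper are invented for the example, and it only covers Atom; RSS categories, GeoRSS and Media RSS each need their own handling.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace
ATOM = "{http://www.w3.org/2005/Atom}"

# A made-up feed, just for illustration
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>First post</title>
    <category term="python"/>
    <category term="aggregators" label="Aggregators"/>
    <link rel="license" href="http://creativecommons.org/licenses/by/3.0/"/>
    <link rel="enclosure" type="audio/mpeg" href="http://example.com/ep1.mp3"/>
  </entry>
</feed>
"""


def entry_metadata(entry):
    """Pull the title, categories, licenses and enclosures from one entry."""
    categories = [
        category.get("term")
        for category in entry.findall(ATOM + "category")
        if category.get("term")
    ]
    links = {}
    for link in entry.findall(ATOM + "link"):
        # A link without a rel attribute is treated as "alternate"
        rel = link.get("rel", "alternate")
        links.setdefault(rel, []).append(link.get("href"))
    return {
        "title": entry.findtext(ATOM + "title", default=""),
        "categories": categories,
        "licenses": links.get("license", []),
        "enclosures": links.get("enclosure", []),
    }


root = ET.fromstring(SAMPLE_FEED)
posts = [entry_metadata(entry) for entry in root.findall(ATOM + "entry")]
```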

In my previous aggregator post, I set up Feed and Post models to capture the core information about these items.  We can now store the information from the aggregation process in a persistent location, and this can be used by some external program to view the aggregation of content.  Now we just need to get the information from somewhere...

The core aggregation logic is quite simple:

  1. We'll fetch the feed.
  2. We'll parse it.
  3. We'll save various bits of feed information to the database.
  4. For each entry in the feed, either create or update the information in the database.
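The four steps above can be sketched roughly as follows. This is only an outline under some assumptions: the fetching and parsing are passed in as functions (fake ones here, so the example runs without a network), a plain dict stands in for the database, and each entry is assumed to carry an "id" to match on.

```python
def aggregate(feed_url, fetch, parse, store):
    """Run the four aggregation steps for one feed.

    fetch(url) returns the raw feed document, parse(raw) returns
    (feed_info, entries), and store is a dict standing in for the database.
    Returns the number of posts created and updated.
    """
    raw = fetch(feed_url)                       # 1. fetch the feed
    feed_info, entries = parse(raw)             # 2. parse it
    feed_record = store.setdefault(feed_url, {"info": {}, "posts": {}})
    feed_record["info"].update(feed_info)       # 3. save the feed information
    created = updated = 0
    for entry in entries:                       # 4. create or update each entry
        key = entry["id"]
        if key in feed_record["posts"]:
            updated += 1
        else:
            created += 1
        feed_record["posts"][key] = entry
    return created, updated


# Stand-ins for a real HTTP fetch and a real feed parser
def fake_fetch(url):
    return "the raw feed document would go here"


def fake_parse(raw):
    feed_info = {"title": "Example"}
    entries = [{"id": "a", "title": "A"}, {"id": "b", "title": "B"}]
    return feed_info, entries


store = {}
first_run = aggregate("http://example.com/feed", fake_fetch, fake_parse, store)
second_run = aggregate("http://example.com/feed", fake_fetch, fake_parse, store)
```

On the first run both entries are created; on the second they are found and updated in place.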

There are a few optimisations we can do there for various levels of winnage.  For example, a pretty big win is that there's no point parsing the feed if we can be sure the feed hasn't changed.  An even bigger win, given the rate of change of individual feeds on the Internet, is that there's no point fetching the full feed if it hasn't changed - sending the stored ETag back as If-None-Match, and the stored Last-Modified date back as If-Modified-Since, can save us not only unnecessary work but also unnecessary traffic (important in South Africa), and it's just part of being a good Internet citizen.  A small win is that we don't need to update the database if the entry in the feed hasn't changed.
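Here is one way those optimisations might look, using only the standard library. Treat it as a sketch: the function names are made up, it assumes the ETag and Last-Modified values from the previous fetch were stored somewhere, and the digest fields ("title", "link", "content") are just an assumed shape for an entry.

```python
import hashlib
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the last fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        # urllib reports "304 Not Modified" as an HTTPError
        if error.code == 304:
            return None, etag, last_modified
        raise


def entry_digest(entry):
    """Hash the fields we care about, to skip database writes when unchanged."""
    text = "\n".join(entry.get(name, "") for name in ("title", "link", "content"))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


headers = conditional_headers('"abc123"', "Sat, 01 Mar 2008 10:00:00 GMT")
old = entry_digest({"title": "Hello", "link": "http://example.com/1"})
same = entry_digest({"title": "Hello", "link": "http://example.com/1"})
edited = entry_digest({"title": "Hello again", "link": "http://example.com/1"})
```

Storing the digest alongside each post means the "has this entry changed?" check becomes a simple string comparison.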

Over on Eric Edelstein's blog, in a discussion about building another blog aggregator (to compete with Afrigator and Amatomu - listed in alphabetical order so as not to denote preference), I boasted about how easy it is to build the aggregator portion of it.  (Well, I'm fairly certain I did.  It seems the comments have all disappeared.  I'm not the only one pretty sure there were comments there...)

I did it quite differently before, but I'm now building another version of the aggregator, hopefully in a way that gets rid of most of the tedium of aggregation, while allowing the results to be stored in whatever way the developer-user wants, and allowing more to be captured about the feeds and posts than a generic aggregator will.

While I wanted to abstract away the particular model, I started with models for Feed and Post, using Elixir, and drew inspiration for the fields and definitions from the FeedJack aggregator for Django.
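For a rough idea of the shape of such models, here is a sketch using plain dataclasses rather than Elixir, so that it runs without a database. The field names are guesses at the sort of thing an aggregator wants to keep, not the actual definitions from my models or from FeedJack.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Post:
    """One entry from a feed (hypothetical fields)."""
    guid: str
    title: str = ""
    link: str = ""
    content: str = ""
    author: str = ""
    date_modified: Optional[datetime] = None
    tags: List[str] = field(default_factory=list)


@dataclass
class Feed:
    """A feed being aggregated, plus what we need to re-fetch it politely."""
    feed_url: str
    title: str = ""
    link: str = ""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    last_checked: Optional[datetime] = None
    posts: List[Post] = field(default_factory=list)


feed = Feed(feed_url="http://example.com/feed", title="Example")
feed.posts.append(
    Post(guid="tag:example.com,2008:1", title="Hello", tags=["python"])
)
```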