Another important aspect of building an aggregator, besides tags/keywords, is being able to find posts based on particular words in the text.  In the "assigned keywords" vs. "words in the page" battle, "words in the page" still has the lead.

This is typically done using a "full text index" of the content - of a post in our case.  There are two main ways to do this - either in your standard database (MySQL's MATCH ... AGAINST construct, or tsearch2 for PostgreSQL), or in an external indexing system just for this sort of indexing and searching, using something like Lucene or Xapian.
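To make the idea concrete, here is a minimal sketch of what a full text index does underneath: it maps each word to the set of posts containing it. This is only a toy illustration in plain Python, not how MySQL, tsearch2, Lucene or Xapian actually implement it; the TinyIndex class and its method names are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps each word to the ids of posts containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, post_id, text):
        """Index every word of the given post text under post_id."""
        for word in tokenize(text):
            self.postings[word].add(post_id)

    def search(self, query):
        """Return the ids of posts that contain every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


index = TinyIndex()
index.add(1, "Building an aggregator with TurboGears")
index.add(2, "Full text search for an aggregator")
```

A real indexer adds stemming, stop words and relevance ranking on top of this, which is a large part of what the trade-off between the database and an external system is about.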

There are some interesting trade-offs here...

One item I mentioned in terms of custom aggregator code might be parsing the post (or the content located at the post's link) for microformats.  Microformats are basically a way to embed semantic information or structure into web pages using plain existing HTML.  Microformats can provide additional structured information about people or companies (hCard), events (hCalendar), reviews (hReview) or locations (hCard again) that are contained in the web page, or they may provide additional metadata for the current page - a set of words that describe the page ("technorati" tags, or rel-tag) or the license under which the page falls (rel-license).

If the feed provides an HTML version of the content (type="html" or type="xhtml", or just by convention), it may be that there are microformats in that HTML.  It may be that only the text is available, and then you have some decisions to make as to whether to get the information from the link given in the feed for the post.  There are some trade-offs here - you can't be sure which data on that page has to do exclusively with the post (for example, there may be data in a blog roll that is on every page, and doesn't have to do with this post in particular).
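As a rough sketch of the simplest case, rel-tag, the snippet below pulls tags out of a chunk of HTML using only the standard library. It assumes the rel-tag convention that the tag is the last path segment of the link's href; the RelTagParser class and extract_rel_tags function are names invented for this example, and a real aggregator would probably want a more forgiving HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import unquote_plus, urlparse


class RelTagParser(HTMLParser):
    """Collects tags from <a rel="tag" href="..."> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "tag" in rels and href:
            # The tag is the last path segment of the href, not the link text.
            path = urlparse(href).path.rstrip("/")
            last = path.rsplit("/", 1)[-1]
            if last:
                self.tags.append(unquote_plus(last))


def extract_rel_tags(html):
    """Return the rel-tag tags found in the given HTML, in document order."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags
```

Running this over the feed's own HTML content avoids the blog roll problem; running it over the linked page picks up every rel-tag link on that page, whether or not it belongs to the post.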

 

Once you've modeled the core objects, and then done the core aggregation logic for feeds and posts, you have pretty much everything else to do.  At the very least, you're going to want to put this together in some sort of interface - either for users to view, or for programs to check against.  And, if you're trying to build something like Amatomu, you've got a lot more work to do.

There are a number of additional standards above the original RSS (which a non-trivial number of feeds probably still use), and those additional standards allow you to capture more metadata about the posts.  Atom has a standard around categories, and there are a few other ways people indicate categories in their feeds as well (and they may just call them "tags").  Someone might "geotag" their posts using GeoRSS.  There are ways to indicate the license of your work - especially with Creative Commons licenses.  There are podcasts using enclosures, and vodcasts which may or may not be using Yahoo!'s Media RSS, and all sorts of other stuff in the feeds you're aggregating.
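For the Atom case, categories are category elements with a term attribute inside each entry.  A small sketch of pulling them out with the standard library follows; the entry_categories function and the sample feed are made up for illustration, and the other ways of indicating categories would each need their own handling.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# A tiny made-up Atom feed, just enough to show the category elements.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry>
    <id>urn:example:1</id>
    <title>First post</title>
    <category term="python"/>
    <category term="turbogears"/>
  </entry>
  <entry>
    <id>urn:example:2</id>
    <title>Second post</title>
  </entry>
</feed>"""


def entry_categories(feed_xml):
    """Map each Atom entry id to the list of its category terms."""
    root = ET.fromstring(feed_xml)
    result = {}
    for entry in root.iter(ATOM_NS + "entry"):
        entry_id = entry.findtext(ATOM_NS + "id")
        terms = [
            category.get("term")
            for category in entry.findall(ATOM_NS + "category")
            if category.get("term")
        ]
        result[entry_id] = terms
    return result


categories = entry_categories(SAMPLE_FEED)
```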

In my previous aggregator post, I set up Feed and Post models to capture the core information about these items.  We can now store the information from the aggregation process in a persistent location, and this can be used by some external program to view the aggregation of content.  Now we just need to get the information from somewhere...

The core aggregation logic is quite simple:

  1. We'll fetch the feed.
  2. We'll parse it.
  3. We'll save various bits of feed information to the database.
  4. For each entry in the feed, either create or update the information in the database.
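The four steps above can be sketched roughly as follows.  This is an outline under stated assumptions rather than the real code: fetch and parse are hypothetical callables passed in by the caller, the entry dictionaries are assumed to carry an "id" key, and a plain dict stands in for the database.

```python
def aggregate(feed_url, fetch, parse, store):
    """Run one aggregation pass for a single feed.

    fetch(url) returns the raw feed text and parse(raw) returns a dict like
    {"title": ..., "entries": [{"id": ..., ...}, ...]}.  Both are hypothetical
    callables supplied by the caller; store is a dict standing in for the
    database.  Returns a (created, updated) pair of counts.
    """
    raw = fetch(feed_url)                    # 1. fetch the feed
    parsed = parse(raw)                      # 2. parse it
    feed = store.setdefault(feed_url, {"title": None, "posts": {}})
    feed["title"] = parsed["title"]          # 3. save feed information
    created = updated = 0
    for entry in parsed["entries"]:          # 4. create or update each entry
        if entry["id"] in feed["posts"]:
            updated += 1
        else:
            created += 1
        feed["posts"][entry["id"]] = entry
    return created, updated
```

Passing fetch and parse in keeps the loop itself easy to test without touching the network.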

There are a few optimisations we can do there for various levels of winnage.  For example, a pretty big win is that there's no point parsing the feed if we can be sure the feed hasn't changed.  An even bigger win, given the rate of change of individual feeds on the Internet, is that there's no point fetching the full feed if it hasn't changed - use of ETag (sent back as If-None-Match) and If-Modified-Since saves us not only unnecessary work but also unnecessary traffic (important in South Africa), and makes us a good Internet citizen.  A small win is that we don't need to update the database if the entry in the feed hasn't changed.
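A minimal sketch of these optimisations using only the standard library might look like this.  The function names are invented for the example, the ETag and Last-Modified values are assumed to have been saved from the previous fetch of the same feed, and the fingerprint fields (title, link, content) are an assumption about what an entry holds.

```python
import hashlib
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a request that asks the server to skip the body if unchanged."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Nothing changed: no body to parse, keep the old validators.
            return None, etag, last_modified
        raise


def entry_fingerprint(entry):
    """Hash the fields we store, so unchanged entries can skip the update."""
    parts = [entry.get("title", ""), entry.get("link", ""), entry.get("content", "")]
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()
```

Storing the fingerprint alongside each post means the small win is a single string comparison per entry.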

After much poking, cajoling, and downright finger pointing and laughing by Bryn and a bunch of other "friends", I now have Gibe, the little project saving my sanity from the tedium of burn-out and under-stimulation, in a publicly accessible place for more people to do more finger pointing and laughing.

Gibe is just your standard web log software - people can log in and add posts, other people can read the posts and make comments on them, there's anti-spam (using Akismet), and there are also the beginnings of a plugin architecture for people who want to expand it beyond what it does now.

 

Part of the magic of TurboGears Widgets (and carried on in ToscaWidgets) is that you have a bundle of resources that come along with the widget object you add to your page.  Generally, this is a couple of JavaScript files, some CSS files, and some images.  This is certainly the case if you're using a Widget as part of creating a theme for a TurboGears application.

The resources are registered as a static directory in the framework (in TG, accessible as /tg_widgets/%(resource_name)s/ - usually resource_name is the module name), and you can then use JSLink and CSSLink and friends to provide the correct URL when referencing those resources.  pkg_resources is used to provide the base path for the resources, thus allowing resources enclosed in .egg files to be found.

The widgets.py file for a TurboGears Widget generally contains this by default:

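import pkg_resources
from turbogears import widgets

# Find the package's "static" directory (works inside .egg files too)
resource_dir = pkg_resources.resource_filename("tgsociable",
                                         "static")
# Serve it under /tg_widgets/tgsociable/
widgets.register_static_directory("tgsociable", resource_dir)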

Having the request go through the framework - through an interpreter at all - can slow things down quite a bit.  It's not so much a throughput issue as a latency issue - it takes longer to actually start sending the data for the resources.  As you can imagine, reading from the .egg file format isn't as fast as reading from the filesystem either.

So far, I've been looking at modifying existing pages in Gibe (my still as-yet-unreleased TurboGears blog application) - adding widgets to post pages, dynamically adding the comment field and handling it for different comment formats, and adding additional fields at blog entry create/edit-time and handling these fields to add tagging (or whatever).  Adding new pages (or replacing the default ones) is pretty much necessary - for example, to add a page where there is a list of all pages with a particular tag.

I use Routes for dispatching incoming URLs to functions in Gibe.  It's not the default dispatcher in TurboGears, but it's pretty easy to set up (there's a TurboGears/Routes integration recipe on the TurboGears wiki).

Why go through the bother?

It makes adding new pages easy - no matter how complicated the URL structure is and where the dynamic portions are.  It also makes it easy to pass through the dynamic portions, and also to pass through defaults if the dynamic portions don't exist.  The killer feature is named routes, which allow me to look up where something is (i.e., generate the URL for it) rather than hard-code the link to where the page is.  That means that I can totally change the URL structure of the site without changing any of the code that links to it.
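To illustrate the idea of named routes without depending on the Routes package, here is a toy registry in plain Python.  It is deliberately not the Routes API: the NamedRoutes class, its connect and url_for methods, and the "tag_posts" route are all invented for this sketch.

```python
class NamedRoutes:
    """Toy registry mapping route names to URL templates with defaults."""

    def __init__(self):
        self.templates = {}

    def connect(self, name, template, **defaults):
        """Register a URL template under a name, with optional defaults."""
        self.templates[name] = (template, defaults)

    def url_for(self, name, **params):
        """Generate the URL for a named route, filling in the dynamic parts."""
        template, defaults = self.templates[name]
        values = dict(defaults)
        values.update(params)
        return template.format(**values)


routes = NamedRoutes()
routes.connect("tag_posts", "/tags/{tag}/page/{page}", page=1)

# Calling code only ever refers to the route by name.
first_page = routes.url_for("tag_posts", tag="python")
third_page = routes.url_for("tag_posts", tag="python", page=3)
```

If the template later becomes "/posts/tagged/{tag}/{page}", only the connect call changes; every url_for call keeps working.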

In terms of dynamism, the worst cases I've explained so far in Gibe are the little comment format hack to allow the use of Postmarkup instead of HTML in comments, and the adding of little trivial plugins to add visual widgets at the top and bottom of blog entries.  Certainly not rocket science.  And, well, neither is this...

My next task was to look at how one could add additional fields to the blog entry create/edit screen - to allow plugins to ask for additional information like tags, geographical location, your mood, what you're currently listening to, and other vain things that nobody really cares about (I mean, it's a blog, it's not like it's useful...).

I used tags as my test case, since that is at least something I can see some value in, and it's something that's already around except for the actual entry of the tags.  Until now, I've been manually typing in things like:

INSERT INTO post_tag
    SELECT 625 AS post_id, tag_id FROM tag
        WHERE name IN
            ('gibe', 'python', 'code','amatomu',
            'blogs','me','tgsociable');
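What the entry screen has to end up doing is roughly the same statement with parameters instead of hand-typed values.  Here is a sketch using sqlite3 so it runs anywhere; the table and column names come from the query above, but the schema itself and the tag_post function are assumptions made for the example.

```python
import sqlite3


def tag_post(conn, post_id, tag_names):
    """Link a post to every existing tag whose name is in tag_names."""
    tag_names = list(tag_names)
    if not tag_names:
        return
    placeholders = ", ".join("?" for _ in tag_names)
    conn.execute(
        "INSERT INTO post_tag (post_id, tag_id) "
        "SELECT ?, tag_id FROM tag WHERE name IN (%s)" % placeholders,
        [post_id] + tag_names,
    )


# Assumed minimal schema, just enough to run the statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("CREATE TABLE post_tag (post_id INTEGER, tag_id INTEGER)")
conn.executemany(
    "INSERT INTO tag (name) VALUES (?)",
    [("gibe",), ("python",), ("code",)],
)

tag_post(conn, 625, ["gibe", "python", "not-a-tag-yet"])
```

Note that, like the hand-typed version, this silently ignores names that don't exist in the tag table yet; a real form handler would probably create them first.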

The last time I was talking code (I mean, besides the little example usage snippet when I was announcing TGOpenIDLogin, in between my reporting back about the GeekDinner, the Western Cape Linux User Group meeting, the Cape Town Python User Group meeting, and so forth), I was showing how I converted the WordPress Sociable plugin to a TurboGears widget (innovatively named TGSociable).

The ultimate purpose of this was to make a plugin for Gibe (my little weblog engine) which would add the sociable icons with the correct URLs to blog entry pages, without having to put anything sociable-specific in the code.

When writing the comment format "plugin system" so that Gibe could use something other than the built-in TinyMCE, I used pkg_resources to create an "entry point" so that other packages could provide functionality to Gibe, and Gibe would know about them without anything but installing the package.
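For reference, discovering entry points looks something like the sketch below.  It uses importlib.metadata, the standard library's newer counterpart to pkg_resources, rather than the pkg_resources calls Gibe itself uses, and the "gibe.plugins" group name is purely hypothetical.

```python
from importlib.metadata import entry_points


def load_plugins(group):
    """Load every object advertised under the given entry point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Pythons: entry points are filtered with select().
        matches = eps.select(group=group)
    else:
        # Older Pythons return a plain dict of group name -> entry points.
        matches = eps.get(group, [])
    return {ep.name: ep.load() for ep in matches}


# "gibe.plugins" is a made-up group name for this sketch; with no
# installed package advertising it, the result is just an empty dict.
plugins = load_plugins("gibe.plugins")
```

A package opts in by listing its object under that group in its setup metadata, which is why installing the package is all that's needed.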

So, I started with the near-simplest possible Plugin class:

class Plugin(object):
    def post_top_widgets(self, post, widgets, context):
        # widgets to put above the post
        pass
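A sketch of how a class like this might be used follows.  It assumes (the snippet above doesn't say) that plugins add to the widgets list they are handed; BannerPlugin, collect_top_widgets and the dict-shaped post are all hypothetical, and strings stand in for real widget objects.

```python
class Plugin(object):
    def post_top_widgets(self, post, widgets, context):
        # widgets to put above the post
        pass


class BannerPlugin(Plugin):
    """Hypothetical plugin that puts one widget above every post."""

    def post_top_widgets(self, post, widgets, context):
        # Assumption: plugins append to the list passed in.
        widgets.append("banner for %s" % post["title"])


def collect_top_widgets(plugins, post, context):
    """Ask every plugin in turn for its widgets above the post."""
    widgets = []
    for plugin in plugins:
        plugin.post_top_widgets(post, widgets, context)
    return widgets


top_widgets = collect_top_widgets(
    [Plugin(), BannerPlugin()], {"title": "Hello"}, {}
)
```

The do-nothing base class means a plugin only has to override the hooks it cares about.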

I was innocently trying to add OpenID to an application, following the advice in Damian Cugley's article on using OpenID with TurboGears and in the TurboGears documentation site's article on OpenID with Identity, when I realised that it was way too much like hard work to implement.  Edit this page, put this there, and so forth.  Why couldn't I just import something and have it Just Work?

Well, it seems, that's not hard at all, actually.  Thus, TGOpenIDLogin, which Just Works (well, python-openid Just Works, and I just use that) when you hand login and so forth over to it.  It's a TurboGears controller that you can hook up into any TurboGears application and have it take care of logging people in using OpenID.

It can't be simpler:

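from turbogears import controllers
from tgopenidlogin.controllers import OpenIDLoginController

# User and VisitIdentity come from your application's own model module

class Root(controllers.RootController):
    ...

    openid = OpenIDLoginController(User, VisitIdentity)
    login = openid.login
    logout = openid.logout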

It remembers where you were trying to go, and comes with a simple OpenID form to put on any page which will remember what page you were on when you tried to log in - it's tgopenidlogin.widgets.OpenIDLoginForm.

You need to pass in the model for your User and VisitIdentity objects so that it can create users and update user details from their OpenID server, and so that it can log them on. Your User model needs to support usernames up to 255 characters long. You can also pass in the web path to the OpenIDLoginController (it defaults to "/openid" relative to your web app base). You can pass in your own OpenID store, or it'll use a SQLite store (well, if you have pysqlite2 installed). You can also set your OpenID trust_root, or it'll default to the base of your web application.

Nothing stops you from keeping plain username and password login if you still want that. Your current login page can have a separate form (using widgets.OpenIDLoginForm, for simplicity) or a link to the OpenIDLoginController - just don't put login = openid.login in your controller...

Still quite a bit to do:

  • It doesn't save original_parameters like the standard TurboGears login does.  This will require storing the parameters somewhere - probably the session.
  • The User/VisitIdentity stuff might only work with SQLAlchemy with assign_mapper and Elixir and maybe SQLObject - non-assign_mapper SQLAlchemy will need a separate handler.  This is probably easiest handled by making it easy to inherit from the controller.
  • The post-authentication action, redirecting to the target page, might not be useful for places that want full registration.  Again, probably best to stub it out with a default implementation, and let people inherit from the controller and override.
  • Oh, I haven't really tested interoperability.  I just used the example server from python-openid, and one that just failed, and a few pages without OpenID server links.