Another important aspect of building an aggregator, besides tags/keywords, is being able to find posts based on particular words in the text.  In the "assigned keywords" vs. "words in the page" battle, "words in the page" still has the lead.

This is typically done using a "full text index" of the content (a post, in our case).  There are two main ways to do this: either in your standard database (MySQL's MATCH ... AGAINST construct, or tsearch2 for PostgreSQL), or in an external system dedicated to this sort of indexing and searching, such as Lucene or Xapian.

There are some interesting trade-offs here...


If you're doing this from your standard database, then a single query can generate the list of all objects (posts in this case) that contain a particular set of words and that are also visible to the current user under the permission system of the application (which we don't have in this case).

If you're doing this from an external indexing system, then you have to filter its results with a second call that checks each result (or a batch of results) to see whether it should be visible to a particular user, or whether it matches additional search criteria (date last viewed by this particular user, for example).
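
To make that second pass concrete, here is a minimal sketch of filtering index hits in batches.  Everything in it is hypothetical: filter_visible is a name I made up, and visible_ids_for stands in for whatever permission query your application would run against its own database (one call per batch).

```python
def filter_visible(hits, visible_ids_for, user, batch_size=50):
    """Keep only the hits `user` may see, preserving the index's ranking.

    hits: iterable of post ids from the external index, best match first.
    visible_ids_for: hypothetical callable (user, ids) -> set of visible ids,
        standing in for one permission query per batch.
    """
    results = []
    batch = []

    def flush():
        # One permission check for the whole batch, then keep ranking order.
        allowed = visible_ids_for(user, batch)
        results.extend(post_id for post_id in batch if post_id in allowed)
        del batch[:]

    for post_id in hits:
        batch.append(post_id)
        if len(batch) == batch_size:
            flush()
    if batch:
        flush()
    return results
```

The batching is the only design choice here: checking a bunch of ids per call keeps the number of round trips to the database down, at the cost of sometimes checking more hits than you end up displaying.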

The built-into-the-database index does have downsides.  For example, it generally has a small subset of the query capability of the external indexing systems.  Your application also has to handle the text index portion of the query separately from the "database" portion.  So, "python programming" may be the query passed to the text indexing portion, but you're on your own to generate the "D.date_created BETWEEN x AND y" in your SQL, to make that 'D' alias available via a join, and so forth.
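
As a rough sketch of that split for MySQL: the words go to MATCH ... AGAINST as one parameter, and the date range is ordinary SQL that you write yourself.  The table name (post), the column names, and the build_mysql_search function are assumptions for illustration, and the query assumes a FULLTEXT index over (title, content).

```python
def build_mysql_search(words, date_from, date_to):
    """Return (sql, params) for a full-text search restricted to a date range.

    Hypothetical schema: a `post` table aliased as D, with a FULLTEXT index
    on (title, content) and a date_created column.
    """
    sql = (
        "SELECT D.post_id, D.title "
        "FROM post AS D "
        # The text-index portion of the query...
        "WHERE MATCH (D.title, D.content) AGAINST (%s) "
        # ...and the plain database portion, generated separately.
        "AND D.date_created BETWEEN %s AND %s"
    )
    params = (words, date_from, date_to)
    return sql, params
```

With a DB-API driver that uses the %s parameter style (MySQLdb does), you would then call something like cursor.execute(sql, params).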

On the other hand, an external indexing system only needs to be passed one query.  You can just generate "python programming date_created:[20070401 TO 20070501]".  If you don't have the concept of additional permissions that would prevent a user from being able to see any of the search results, you have the exact answer in a single query.  (Actually, you probably should build a RangeQuery programmatically rather than do the date in the query string as I did.)
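
Generating that query string is a one-liner.  The sketch below assumes the date field was indexed as an untokenized YYYYMMDD string, so that the range compares correctly as text; lucene_query is just a name I picked for the example.

```python
from datetime import date


def lucene_query(words, field, start, end):
    """Build a Lucene query-parser string: free text plus an inclusive date range.

    Assumes `field` holds dates indexed as YYYYMMDD strings.
    """
    return "%s %s:[%s TO %s]" % (
        words,
        field,
        start.strftime("%Y%m%d"),
        end.strftime("%Y%m%d"),
    )


query = lucene_query("python programming", "date_created",
                     date(2007, 4, 1), date(2007, 5, 1))
```

Note that this does no escaping of the user's words, so anything typed by a user that contains query-parser syntax will be interpreted as such.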

In the case of a public aggregator (i.e., something like Planet for a user-specific set of feeds, or like Amatomu where people subscribe their feeds), we can thus use an external indexing system.  If we were building a private aggregator - a multi-user feed reader, for example - then it might be easier to go with an in-database indexing system.

PyLucene seems to have quite a bit of mind-share, despite being a seemingly inelegant combination: Java code compiled with gcj into a dynamic library, which Python then makes calls into.  It happens to be very easy to use from the Ubuntu binary package, though, so I ended up using it.

Our process_entry_lucene method is quite straightforward, actually.  (It would be a lot nicer if it were easy to ask for an ElementTree of our content - I'll probably add that to the base pyaggregator.):

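from StringIO import StringIO

import genshi
# Assumed import: the original listing does not show where ET comes from.
import xml.etree.ElementTree as ET
from PyLucene import IndexWriter, StandardAnalyzer, Document, Field

class PyLuceneAggregatorMixin(object):
    def process_entry_lucene(self, post, feed, entry, posts, parsed_data):
        # Skip indexing when no IndexWriter has been configured.
        if not getattr(self, 'writer', None):
            return

        # Clean the post up as XHTML and wrap it in a single root element
        # so that it parses as one tree.
        content = StringIO('<div>' +
            genshi.HTML(post.content).render('xhtml') +
            '</div>')
        e = ET.ElementTree()
        e.parse(content)
        e = e.getroot()
        # flatten() is a helper defined elsewhere; it reduces the tree to plain text.
        text = flatten(e)

        doc = Document()
        # The id is stored untokenized so a hit can be mapped back to its post.
        doc.add(Field("id", str(post.post_id), Field.Store.YES, Field.Index.UN_TOKENIZED))
        doc.add(Field("title", post.title, Field.Store.YES, Field.Index.TOKENIZED))
        # The body text is indexed for searching but not stored in the index.
        doc.add(Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED))
        self.writer.addDocument(doc)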
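
The flatten helper used above isn't shown in this post.  A minimal sketch of what such a helper could look like, assuming all it needs to do is concatenate the text inside an element and its descendants (the real one may well differ):

```python
import xml.etree.ElementTree as ET


def flatten(element):
    """Return the text of `element` and all its descendants, markup removed."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        # Text inside the child, then the text that follows its closing tag.
        parts.append(flatten(child))
        if child.tail:
            parts.append(child.tail)
    return ''.join(parts)


root = ET.fromstring('<div><p>Hello <b>big</b> world</p><p>Bye</p></div>')
text = flatten(root)
```

Joining with an empty string keeps words that are split by inline tags intact, but it also runs adjacent block elements together, so a version meant for indexing might want to add whitespace after block-level tags.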

We just need to pass in a writer when we set up our aggregator:

indexDir = '/tmp/foo'
# The final True creates a new index, replacing any existing one at indexDir.
writer = IndexWriter(indexDir, StandardAnalyzer(), True)

class MyAggregator(PyLuceneAggregatorMixin, ElixirAggregatorMixin, Aggregator):
    writer = writer

options = {
    'verbose': True,
    'reraiseentryexceptions': True,
}
processor = MyAggregator(**options)
for feed in Feed.select():
    processor.process_feed(feed)

# Merge the index segments for faster searching, then flush everything to disk.
writer.optimize()
writer.close()

When you introduce threading into the environment (for example, by using Django or TurboGears), this gets a lot harder - mostly when we're doing the search.  TurboLucene solves a lot of these issues, but it does expect your project to be set up and laid out in a particular way.  (Don't worry, I'm putting together patches now...)

What is quite nice is how simple the searching code can become.  Here's a TurboGears controller for a planet-like public aggregator I'm mocking up:

    @expose(template="genshi:tgaggregator.templates.postlist")
    @add_plugin_data()
    @paginate('posts')
    def search(self, q, submit = None):
        posts = turbolucene.search(q)
        return dict(posts = posts, query = q)

Can't get much simpler than that...