Another important aspect of building an aggregator, besides tags/keywords, is being able to find posts based on particular words in the text. In the "assigned keywords" vs. "words in the page" battle, "words in the page" still has the lead.
This is typically done using a "full text index" of the content - of a post in our case. There are two main ways to do this - either in your standard database (MySQL's MATCH ... AGAINST construct, or tsearch2 for PostgreSQL), or in an external indexing system just for this sort of indexing and searching, using something like Lucene or Xapian.
There are some interesting trade-offs here...
If you're doing this from your standard database, then a single query can generate the list of all objects (posts in this case) that contain a particular set of words and that are also visible to the current user under the application's permission system (which we don't have in this case).
If you're doing this from an external indexing system, then you'd have to filter the results from your external indexing system with a second call to check each result (or a bunch of results) to see if this result should be visible to a particular user, or be visible based on additional search criteria (date last viewed by this particular user, for example).
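As a rough sketch of that second filtering step, assuming the external index hands back a ranked list of post ids: the `visible_hits` helper, the `allowed_ids_for` callback, and the `ACL` dictionary below are all hypothetical names made up for illustration, and the dictionary stands in for a real permission lookup in the database.

```python
def visible_hits(hit_ids, allowed_ids_for, user, batch_size=50):
    """Yield the ids from hit_ids that `user` may see, preserving rank order."""
    for start in range(0, len(hit_ids), batch_size):
        batch = hit_ids[start:start + batch_size]
        # One permission lookup per batch rather than one per hit.
        allowed = set(allowed_ids_for(user, batch))
        for post_id in batch:
            if post_id in allowed:
                yield post_id


# Hypothetical stand-in for a database permission check.
ACL = {"alice": {1, 3, 4}, "bob": {2}}


def allowed_ids_for(user, post_ids):
    """Return the subset of post_ids that `user` is allowed to see."""
    return ACL.get(user, set()).intersection(post_ids)


print(list(visible_hits([4, 2, 3, 1], allowed_ids_for, "alice", batch_size=2)))
```

Checking a batch at a time keeps the number of extra calls down, but pagination gets awkward: you cannot know how many hits survive the filter until you have checked them.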
The built-into-the-database index does have downsides. For example, it generally offers only a small subset of the query capability of the external indexing systems. Your application also has to handle the text-index portion of the query separately from the "database" portion. So, "python programming" may be the query passed to the text indexing portion, but you're on your own to generate the "D.date_created BETWEEN x AND y" in your SQL, and to make that 'D' alias available via a join and so forth.
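To make the split concrete, here is a minimal sketch of building such a query for MySQL. The `post` table, its `title`, `content`, and `date_created` columns, and the `build_post_search` helper are assumptions for illustration; the `%s` placeholders follow the style used by MySQLdb, and a FULLTEXT index over `(title, content)` is assumed to exist.

```python
def build_post_search(words, start, end):
    """Return (sql, params) for a MySQL full-text search limited to a date range."""
    sql = (
        "SELECT D.post_id, D.title "
        "FROM post AS D "
        # The text-index portion: the user's words go in as one parameter.
        "WHERE MATCH (D.title, D.content) AGAINST (%s) "
        # The "database" portion, which the application has to write itself.
        "AND D.date_created BETWEEN %s AND %s"
    )
    return sql, (words, start, end)


sql, params = build_post_search("python programming", "2007-04-01", "2007-05-01")
print(sql)
print(params)
```

The pair would then be handed to something like `cursor.execute(sql, params)`; the function itself never touches a database.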
On the other hand, an external indexing system only needs to be passed one query. You can just generate "python programming date_created:[20070401 TO 20070501]". If you don't have the concept of additional permissions that would prevent a user from being able to see any of the search results, you have the exact answer in a single query. (Actually, you probably should use a TermQuery rather than do the date as I did.)
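A small sketch of generating that single query string. The `lucene_query` helper is a hypothetical name, and it assumes the index has a `date_created` field stored as a `YYYYMMDD` string, which the indexing code in this post does not add.

```python
from datetime import date


def lucene_query(words, field, start, end):
    """Combine free-text words with an inclusive date range in Lucene query syntax."""
    return "%s %s:[%s TO %s]" % (
        words,
        field,
        start.strftime("%Y%m%d"),
        end.strftime("%Y%m%d"),
    )


print(lucene_query("python programming", "date_created",
                   date(2007, 4, 1), date(2007, 5, 1)))
# prints: python programming date_created:[20070401 TO 20070501]
```

Storing dates as zero-padded `YYYYMMDD` strings matters here because the range is compared as text, so the lexical order has to match the chronological order.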
In the case of a public aggregator (i.e., something like Planet for a user-specific set of feeds, or like Amatomu where people subscribe their feeds), we can thus use an external indexing system. If we were building a private aggregator - a multi-user feed reader, for example - then it might be easier to go with an in-database indexing system.
PyLucene, despite being a seemingly inelegant combination (Java code compiled with gcj into a dynamic library that Python then calls into), seems to have quite a bit of mind-share. It also happens to be very easy to use from the Ubuntu binary package, so I ended up using it.
Our process_entry_lucene method is quite straightforward, actually. (It would be a lot nicer if it were easy to ask for an ElementTree of our content - I'll probably add that to the base pyaggregator.):
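from StringIO import StringIO              # Python 2 module, matching this era of PyLucene
from xml.etree import ElementTree as ET    # assumed source of the ET name
import genshi
from PyLucene import IndexWriter, StandardAnalyzer, Document, Field

# flatten() is assumed to be a helper (not shown here) that returns the
# plain text of an element tree.

class PyLuceneAggregatorMixin(object):
    def process_entry_lucene(self, post, feed, entry, posts, parsed_data):
        # Skip indexing when no writer has been configured.
        if not getattr(self, 'writer', None):
            return
        # Wrap the content in a single root element so it parses as one tree.
        content = StringIO('<div>' +
                           genshi.HTML(post.content).render('xhtml') +
                           '</div>')
        e = ET.ElementTree()
        e.parse(content)
        text = flatten(e.getroot())
        doc = Document()
        # The id is stored untokenized so it can be matched exactly.
        doc.add(Field("id", str(post.post_id), Field.Store.YES, Field.Index.UN_TOKENIZED))
        doc.add(Field("title", post.title, Field.Store.YES, Field.Index.TOKENIZED))
        # The body is searchable but not stored in the index.
        doc.add(Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED))
        self.writer.addDocument(doc)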
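The flatten helper called in the method above is not shown in this post. A minimal sketch of what such a helper might look like, assuming all it has to do is collect the text inside the element and tidy up the whitespace (the real helper may well do more):

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Return all text inside elem as one whitespace-normalized string."""
    # itertext() walks the element and its descendants in document order.
    pieces = elem.itertext()
    # Join with spaces so adjacent paragraphs don't run together,
    # then collapse any run of whitespace to a single space.
    return " ".join(" ".join(pieces).split())


root = ET.fromstring("<div><p>Hello <b>world</b></p><p>again</p></div>")
print(flatten(root))
# prints: Hello world again
```

Joining with spaces does split a word that has inline markup in the middle of it, which is usually an acceptable trade for a search index.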
We just need to pass in a writer when we set up our aggregator:
indexDir = '/tmp/foo'
# True creates a fresh index, overwriting any existing one in indexDir.
writer = IndexWriter(indexDir, StandardAnalyzer(), True)

class MyAggregator(PyLuceneAggregatorMixin, ElixirAggregatorMixin, Aggregator):
    writer = writer

options = {
    'verbose': True,
    'reraiseentryexceptions': True,
}
processor = MyAggregator(**options)
for feed in Feed.select():
    processor.process_feed(feed)
# Merge the index segments, then flush everything to disk.
writer.optimize()
writer.close()
When you introduce threading into the environment (for example, by using Django or TurboGears), this gets a lot harder, mostly on the search side. TurboLucene solves a lot of these issues, but it does expect things to be set up and laid out in a particular way. (Don't worry, I'm putting together patches now...)
What is quite nice is how simple the searching code can become. Here's a TurboGears controller for a planet-like public aggregator I'm mocking up:
@expose(template="genshi:tgaggregator.templates.postlist")
@add_plugin_data()
@paginate('posts')
def search(self, q, submit=None):
    posts = turbolucene.search(q)
    return dict(posts=posts, query=q)
Can't get much simpler than that...
The code is all available in the pyaggregator entry on the Python Cheese Shop (cheeseshop.python.org).
I did a bit of work with Lucene as well a while back....
Most of the work I had to do was around improving the quality of the search results. Your task is a lot easier if the content you're indexing has a decent semantic structure (like your case, where you can safely assume you're working with well-formed XHTML, and you can modify the XHTML to index better). I ended up having to do a fair bit of "sanitizing" of the textual representation of the HTML documents, PDFs, and Word documents that I was indexing, though.
HTML especially was a pain in the ass. There's so much stuff you have to strip out, and figuring out which parts of the content you are processing are meaningful is not trivial: what constitutes metadata like navigation, which you may not necessarily want to index, and what constitutes actual content? You also want a parser that accepts the horrible, horrible markup you may be faced with, even if it technically shouldn't parse :)
Word wasn't a walk in the park either; the Word->Text conversion had, shall we say, issues :P
This is one of the hard decisions to make when you get a partial feed. With a full feed, you have all the interesting stuff in your RSS feed. With a partial feed, some "interesting" stuff later on in the posting is on an HTML page.
Unfortunately, there's also "anti-interesting" stuff on that HTML page - as you say, stuff like navigation, temporary quicklinks based on the last comments made on the entire site, and so forth.
And then there are the comments being made on the post - which may or may not be "interesting" to us. (Especially if we start talking about spambots.)
There is at least one Python program (I forget the name) that you can run against a whole bunch of pages on a site and it'll determine the "interesting" zone on the page, and return that for you when passing new pages in. It does totally fall apart when the styling on the site changes, though.
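For the much simpler job of stripping obvious "anti-interesting" markup before indexing, here is a rough sketch using the lenient parser in the Python 3 standard library. It is not the zone-detecting program mentioned above; the `ContentText` class, the `extract_text` function, and the list of tags to skip are all assumptions for illustration, and a real site would need its own rules.

```python
from html.parser import HTMLParser

# Tags whose contents are assumed not to be worth indexing.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class ContentText(HTMLParser):
    """Collect text that sits outside any of the SKIP_TAGS elements."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def extract_text(html):
    """Return the indexable text of an HTML page, tolerating sloppy markup."""
    parser = ContentText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = ('<html><body><nav><a href="/">Home</a></nav>'
        '<p>Real <b>content</b> here<script>var x = 1;</script></body>')
print(extract_text(page))
# prints: Real content here
```

The parser does not complain about the unclosed `<p>` or the missing `</html>`, but the approach is naive in one important way: a skipped tag that is never closed will swallow everything after it.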