The recent long weekends and public holidays have allowed me to start writing some code again, and nope (Neil's Object Publishing Experiment) is the codename for what I've been playing with. It uses the Twisted/ZODB integration from The Shuttleworth Foundation's SchoolTool (headed by Zope developer Steve Alexander) to provide a really simple object publishing environment, stealing many ideas/concepts from Zope 3.

I'm just waiting on some licensing information from The Shuttleworth Foundation (SchoolTool is GPL, and I would like the Twisted/ZODB integration to be available under the LGPL or BSD/ZPL), and then I'll release some code.

Hopefully someone'll come up with a better name for nope, as for obvious reasons it sounds a little too much like Zope...

The building blocks of nope are just starting to settle - it uses Zope interfaces and adapters to publish objects (currently only to web), and has the basics of object paths and traversal.

It uses Twisted as the encompassing framework, and this allows a telnet interface (manhole) to manipulate the object database. Currently, that's the only real way to create objects once nope's running.

I'm writing a simple blog system as the test product for nope, and I found I missed Zope's Catalog for searching (just like me to look at searching before working out how to add entries via the web...). So, I decided to write one for nope, equipped with little knowledge of how one writes a catalog.

I broke the catalog into two modules. The first deals with breaking up some text (or HTML) into words, and returning a dictionary with words as keys and, as values, the list of positions (word numbers) at which each word occurs in the text. That's catalogutil.py:

"""Text Catalog"""

import re
import sgmllib
import StringIO

word_splitter = re.compile('''[- .,\n\(\)/_]''')

shortest_word = 1
longest_word = 16

common_words = (
    "and",
    "are",
    "but",
    "can",
    "don't",
    "for",
    "have",
    "the",
    "that",
    "this",
    "who",
    "will",
    "you",
    "your",
)

suffixes = (
    "'d",
    "'re",
    "'ll",
    "'ve",
)

class StrippingParser(sgmllib.SGMLParser):
    def __init__(self):
        self.text = StringIO.StringIO()
        sgmllib.SGMLParser.__init__(self)

    def handle_data(self, data):
        self.text.write(data)
        # Ensure space between data segments separated by tags
        self.text.write(' ')

def getWordsHTML(paragraph):
    sp = StrippingParser()
    sp.feed(paragraph)
    # Flush any data the parser is still buffering
    sp.close()
    return getWords(sp.text)

def getWordsText(paragraph):
    return getWords(StringIO.StringIO(paragraph))

def split(words):
    return [w for w in word_splitter.split(words) if w]

def getWords(text):

    wordnum = 0
    pos = 0
    chunk = 128
    words = {}

    while 1:
        text.seek(pos)
        data = text.read(chunk)
        if not data:
            # Nothing left to read
            break
        if len(data) == chunk:
            # More text may follow, so cut at the last space to avoid
            # splitting a word across two chunks
            last = data.rfind(' ')
            if last > 0:
                data = data[:last]
        pos = pos + len(data)

        for word in split(data):
            word = word.strip()
            if len(word) == 0:
                continue
            wordnum += 1
            used_suffix = 0

            if len(word) < shortest_word:
                continue
            if len(word) > longest_word:
                continue
            word = word.lower()

            # Drop a trailing or leading apostrophe (ie, quote marks)
            if word.endswith("'"):
                word = word[:-1]
            if word.startswith("'"):
                word = word[1:]
            if len(word) == 0:
                continue
            suffix_start = word.rfind("'")

            if suffix_start > 0:
                suffix = word[suffix_start:]
                if suffix in suffixes:
                    used_suffix = 1
                    word = word[:suffix_start]

            #if word in common_words:
            #    continue

            if used_suffix:
                word = word + suffix
            words.setdefault(word, [])
            words[word].append(wordnum)

    return words

testdata = """
You should create a localconfig.py file in the directory from which
you're going to run sisynala.  Or, set PYTHONPATH to a directory in
which your localconfig.py lives.  This allows you to easily have
multiple configurations for different log files.  Or just stick it in
the sisynala python directory if you're certain you're going to only
have one.

There's an example localconfig.py in the examples/ directory of the
source distribution, and in share/examples/sisynala once you've
installed.

EXCLUDEDBYFUNC is a function that is called to exclude certain types of
traffic for reasons that can't be covered by the other exclusion
mechanisms.  MY_AGENTS and PLACES_I_VISIT are helper lists used by the
example EXCLUDEDBYFUNC function, and can be tossed if you're not using
them.  Return 0 from the function if the line should not be ignored;
return 1 if it should be.

Change MY_REFERERS to your web site URL, and any other local names that
might apply to your web site.

Change EXCLUDEIPS to anything that's almost certainly coming from your
own bots (like I exclude my planet and some other automated scripts) and
for people who screw you around and can't be blocked in any way but IP
address.

EXCLUDESEARCHES are exact search phrases (ie, not partial matches) to
exclude from the searches page.

IP_TO_NAME is for if you don't use name resolution (I don't, personally)
and want to identify certain IPs by name.  You can identify multiple IPs
as the same name, and they'll be grouped.

REFERERSPAMMERS are for domains which have been used as fake referers to
generate links back to that domain in your statistics pages.

Change HTMLEXCLUDEPATTERNS to string patterns (currently, not needed
regex yet) applied to the full path of the request.  I don't care about
people going to '/stats/', for example, so I exclude it.

HTMLPATTERNS (which only things that pass HTMLEXCLUDEPATTERNS get passed
through) lists string patterns applied to the full path of the request
which _will_ be considered valid HTML files (ie, not downloads).  You
can just put '/' in there to include everything.

DOWNLOADEXCLUDEPATTERNS is analogous to HTMLEXCLUDEPATTERNS, except it
affects what gets considered a download.

DOWNLOADPATTERNS is analogous to HTMLPATTERNS.  You might want to put
something like '.tar.gz' or whatever in there if you don't have specific
download directories (ie, like /files/ and /dist/ on my site).
"""

def main():
    import pprint
    pprint.pprint(getWordsText(testdata))

if __name__ == "__main__":
    main()

I'm not sure how important the suffixes stuff is - it doesn't seem that Google breaks them down. I tend to use constructs like ``Neil'll've fixed it...'', so it might be useful for my writing!
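To make the shape of that dictionary concrete, here's a minimal standalone sketch. It is not catalogutil.py itself: it reuses the same splitter pattern but skips the chunked reading, the length limits and the suffix handling, and the name word_positions is made up for the illustration. It's written for a current Python rather than the 2.3 used above.

```python
import re

# Same separator set as word_splitter in catalogutil.py
word_splitter = re.compile(r"[- .,\n()/_]")

def word_positions(text):
    """Map each lower-cased word to the list of its word numbers (1-based)."""
    positions = {}
    wordnum = 0
    for word in word_splitter.split(text):
        if not word:
            continue
        wordnum += 1
        positions.setdefault(word.lower(), []).append(wordnum)
    return positions

index = word_positions("The big fat chief dwarf sat on the tin roof.")
print(index["the"])    # [1, 8]
print(index["dwarf"])  # [5]
```

Keeping the word numbers, rather than just a count, is what makes phrase searching possible later on.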

The second half is the catalog object that'll live in ZODB. It's still hard-coded for my blog entries, as I'm not entirely sure how I'm going to deal with objects explaining to the catalog how to catalog them. It'll probably be using adapters, but I don't want to enter YAGNI territory; nope wants to stay simple.

The catalog (catalog.py) is a tiny bit reliant on some Zope conventions (__parent__, getPath) and on ZODB-related modules (persistence, IOBTree), but the former can easily be faked, and the latter can easily be converted to straight dictionaries (ie, PersistentDict = dict, Persistent = object, IOBTree = dict; the IOBTree is keyed by entry id, so it needs a mapping rather than a list).
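As a rough sketch of what the catalog looks like with those stand-ins, the following uses plain dictionaries for both the word index and the per-entry word list kept for deletion. The function names (create_catalog, index_entry, unindex_entry) are invented for the illustration and aren't part of catalog.py; IOBTree is stood in for by dict here because it is indexed by entry id. It targets a current Python.

```python
# Stand-ins for the ZODB classes; assumptions for running without ZODB.
Persistent = object
PersistentDict = dict
IOBTree = dict   # a mapping, since it is keyed by entry id

catalog = PersistentDict()      # catalog name -> word -> entry id -> positions
catalogdel = PersistentDict()   # catalog name -> entry id -> words

def create_catalog(name):
    catalog[name] = PersistentDict()
    catalogdel[name] = PersistentDict()

def index_entry(name, entry_id, word_positions):
    """Store one document; word_positions maps word -> list of positions."""
    for word, positions in word_positions.items():
        catalog[name].setdefault(word, IOBTree())[entry_id] = positions
    catalogdel[name][entry_id] = list(word_positions)

def unindex_entry(name, entry_id):
    """Remove one document, using catalogdel to find its words."""
    for word in catalogdel[name].pop(entry_id):
        del catalog[name][word][entry_id]
        if not catalog[name][word]:
            # No entries left for this word, so drop the word itself
            del catalog[name][word]

create_catalog("data")
index_entry("data", 54, {"tin": [9], "roof": [10]})
index_entry("data", 55, {"tin": [6]})
print(catalog["data"]["tin"])
unindex_entry("data", 54)
print(catalog["data"])
```

The second dictionary costs some space, but it means removing an entry doesn't require walking every word in the index.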

I haven't created the equivalent of the Zope ObjectHub yet; again I'm not sure of how to implement it yet. So, my Blog class is currently providing its features. The ObjectHub provides a mechanism to turn a path to an object into a unique hub id. This means the catalog just stores hub ids, not paths or references.
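Something along these lines is probably all the catalog needs from a hub for now. This is only a sketch: SimpleHub, register and hubid_to_path are hypothetical names, and only path_to_hubid matches what the catalog code actually looks up. It's written for a current Python.

```python
class SimpleHub(object):
    """Hypothetical stand-in for an object hub: path <-> integer id."""

    def __init__(self):
        self.path_to_hubid = {}
        self.hubid_to_path = {}
        self._next_id = 1

    def register(self, path):
        """Return the hub id for path, allocating one on first use."""
        if path not in self.path_to_hubid:
            hubid = self._next_id
            self._next_id += 1
            self.path_to_hubid[path] = hubid
            self.hubid_to_path[hubid] = path
        return self.path_to_hubid[path]

hub = SimpleHub()
foo_id = hub.register("/blog/foo/")
bar_id = hub.register("/blog/bar/")
print(foo_id, bar_id, hub.hubid_to_path[foo_id])
```

Because the ids are small integers, they fit naturally as keys in an IOBTree, and moving an object only means updating the hub rather than every word in the catalog.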

"""Text Catalog"""

import os.path
from sets import Set

from persistence import Persistent
from persistence.dict import PersistentDict
#from persistence.list import PersistentList

from zodb.btrees.IOBTree import IOBTree

from catalogutil import getWordsHTML, getWordsText, split

"""
Catalog layout

catalog is a PersistentDict, with catalog name as keys.

catalog["name"] is a PersistentDict, with words as keys.
    (OOBTree worthwhile?)

catalog["name"]["word"] is an IOBTree, with entry ids as keys.

catalog["name"]["word"][54] is a list, containing the word number in the
document.
    (List because it never gets updated - so not a PersistentList)
    (Could be a tuple - worth it?)

    (Should we have a dictionary of entry to list of words for ease of deletion?)


catalogdel is a PersistentDict, with catalog name as keys.

catalogdel["name"] is a PersistentDict, with entry ids as keys
    (IOBTree worthwhile?)

catalogdel["name"][54] is a list, containing the words in the document.
    (Should we get into word ids?)

"""

class Catalog(Persistent):
    def __init__(self):
        self.catalog = PersistentDict()
        self.catalogdel = PersistentDict()
        self.times = PersistentDict()
        self.nameToAttr = PersistentDict()

        self.createCatalog("data", "extended")
        #self.catalog["data"] = PersistentDict()
        #self.catalog["data"] = PersistentDict()

    def add(self, entry):
        entrypath = os.path.join(entry.__parent__.getPath(), entry.__name__, "")
        id = self.findHub().path_to_hubid[entrypath]

        self.times[entry.posted] = id

        for name, attribute in self.nameToAttr.items():
            catalog = self.catalog[name]
            catalogdel = self.catalogdel[name]
            data = getattr(entry, attribute)
            # One word list per catalog, so the lists don't run together
            wordlist = []
            for word, occurrences in getWordsHTML(data).items():
                if not catalog.has_key(word):
                    catalog[word] = IOBTree()
                catalog[word][id] = occurrences
                wordlist.append(word)

            catalogdel[id] = wordlist

    def remove(self, entry):
        del self.times[entry.posted]

    def createCatalog(self, name, attribute):
        self.catalog[name] = PersistentDict()
        self.catalogdel[name] = PersistentDict()
        self.nameToAttr[name] = attribute

    def findHub(self):
        # Um, yeah.
        return self.__parent__

    def findPhrase(self, name, words):
        #print "findPhrase: %s: %s" % (name, words)
        catalog = self.catalog[name]
        entries = None
        for word in words:
            #print "Finding entries for %s: " % (word),
            if not word in catalog:
                #print "Not in catalog: %s" % (word)
                return Set()

            word_results = Set(catalog[word])
            #print list(word_results)

            # Compare against None: an empty intersection must stay empty
            if entries is not None:
                entries = entries.intersection(word_results)
            else:
                entries = word_results

        return_entries = []

        """For each entry, get the list of positions of the first word.

        Then, for each of those positions, check to see if each of the
        remaining words have their respective positions in their
        position lists.

        If we get through all the words and they have the correct
        positions in their position list, we've got a hit for this
        entry!"""

        for entry in entries:
            #print "Looking in entry %s" % (entry)
            word = words[0]
            positions = list(catalog[word][entry])
            #print "Position for first word (%s): %s" % (word, positions)
            for position in positions:
                found = 1
                for i, word in enumerate(words[1:]):
                    #print "Positions for word #%d (%s): %s" % (i, word, list(catalog[word][entry]))
                    if position + i + 1 not in catalog[word][entry]:
                        found = None
                        break
                if found:
                    return_entries.append(entry)
                    """We're returning a list of entries, so no need to
                    look further in this entry"""
                    break

        return return_entries

    def search(self, name, words):
        catalog = self.catalog[name]
        return_entries = None
        for word in words:
            if len(split(word)) > 1:
                word_results = self.findPhrase(name, split(word))
            else:
                if catalog.has_key(word):
                    word_results = catalog[word]
                else:
                    return Set()

            # Compare against None: an empty intersection must stay empty
            if return_entries is not None:
                return_entries = return_entries.intersection(word_results)
            else:
                return_entries = Set(word_results)

        if return_entries is None:
            return Set()

        return return_entries
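
The phrase matching is the only mildly clever part, so here is the idea on its own, stripped of ZODB. This is a sketch using plain dictionaries and a current Python; find_phrase and the sample index are made up for the illustration, with positions taken from the first and third sentences in my test data.

```python
def find_phrase(index, words):
    """Return sorted entry ids in which words occur consecutively.

    index maps word -> {entry id -> list of word positions}.
    """
    if not words or any(w not in index for w in words):
        return []
    # Only entries containing every word can contain the phrase
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])

    hits = []
    for entry in sorted(candidates):
        for start in index[words[0]][entry]:
            # Word i of the phrase must sit at position start + i
            if all(start + i in index[w][entry] for i, w in enumerate(words)):
                hits.append(entry)
                break
    return hits

index = {
    "dwarf": {1: [5], 3: [4]},
    "chief": {1: [4], 3: [5]},
}
print(find_phrase(index, ["chief", "dwarf"]))   # [1]
print(find_phrase(index, ["dwarf", "chief"]))   # [3]
```

Both entries contain both words, so the set intersection alone can't tell them apart; it's the position check that gets the word order right.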

It was relatively easy to turn this into an on-disk (or web-based) catalog, albeit imperfectly. catalogweb.py catalogs the files from my web page:

#!/usr/bin/env python

import sys

sys.path.append('/home/nbm/MyProjects/nope/src')
sys.path.append('/home/nbm/Publishing/Zope3/lib/python2.3/site-packages')

from zodb.db import DB
from zodb.storage.file import FileStorage

from transaction import get_transaction

from nope.catalog import Catalog

class TestEntry:
    def __init__(self, name, extended):
        import time
        self.extended = extended
        self.posted = time.time()
        self.__parent__ = self
        self.__name__ = name

    def getPath(self):
        return '/'

class ParentFaker:
    path_to_hubid = {
    }

files = (
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/starting-tnb.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/vacancies.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff1.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/words.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff2.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/building-communities-with-weblogs/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/masks.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/development-good-practise-using-oss/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/opensource-digitaldivide.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/perfect.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/bannergrab/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/sisynala/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/stikiwiki/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/tnntprss/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/nbm/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/books/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index-nocss.html',
)

def search(c, searchterms):
    print "Searching for %s: " % (searchterms),
    print [files[i] for i in c.search("data", searchterms)]

def main():
    db = DB(FileStorage('Index.fs'))
    conn = db.open()
    root = conn.root()

    c = Catalog()
    c.__parent__ = ParentFaker()
    for i, f in enumerate(files):
        print f
        c.__parent__.path_to_hubid[f + '/'] = i
        c.add(TestEntry(f, open(f).read()))

    root['catalog'] = c

    get_transaction().commit()


    #search(c, ["others can find"])
    #print [files[i] for i in c.search("data", ["others can find"])]

if __name__ == "__main__":
    main()

Since the catalog only stores hub ids, I've unfortunately had to replicate the files list in the script that searches the stored catalog, but that could trivially be stored in ZODB instead. I also need to rewrite the catalog to take a file-like object instead of a string, so that I can index larger documents.

#!/usr/bin/env python

import sys

sys.path.append('/home/nbm/MyProjects/nope/src')
sys.path.append('/home/nbm/Publishing/Zope3/lib/python2.3/site-packages')

from zodb.db import DB
from zodb.storage.file import FileStorage

from nope.catalog import Catalog

class TestEntry:
    def __init__(self, name, extended):
        import time
        self.extended = extended
        self.posted = time.time()
        self.__parent__ = self
        self.__name__ = name

    def getPath(self):
        return '/'

class ParentFaker:
    path_to_hubid = {
    }

files = (
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/starting-tnb.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/vacancies.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff1.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/words.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff2.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/building-communities-with-weblogs/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/masks.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/development-good-practise-using-oss/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/opensource-digitaldivide.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/perfect.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/bannergrab/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/sisynala/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/stikiwiki/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/tnntprss/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/nbm/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/books/index.html',
        '/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index-nocss.html',
)

def search(c, searchterms):
    print "Searching for %s: " % (searchterms),
    print [files[i] for i in c.search("data", searchterms)]

def main():
    db = DB(FileStorage('Index.fs'))
    conn = db.open()
    root = conn.root()

    c = root['catalog']
    search(c, ["others can find"])

if __name__ == "__main__":
    main()

Of course, this example won't make any sense, but it seems to work.

Oh, I also have a test script for the catalog (test_catalog.py), though it doesn't use unittest:

#!/usr/bin/env python

import catalog

class TestEntry:
    def __init__(self, name, extended):
        import time
        self.extended = extended
        self.posted = time.time()
        self.__parent__ = self
        self.__name__ = name

    def getPath(self):
        return '/'

class ParentFaker:
    path_to_hubid = {
        '/foo/': 1,
        '/bar/': 2,
        '/baz/': 3,
    }

def test1():
    c = catalog.Catalog()
    c.__parent__ = ParentFaker()
    c.add(TestEntry('foo', "The big fat chief dwarf sat on the tin roof."))
    c.add(TestEntry('bar', "I really don't know why tin is better."))
    c.add(TestEntry('baz', "Balin was a dwarf chief in moria."))

    #print c.catalog["data"].keys()
    print "Should be 1, 3"
    print list(c.search("data", ["dwarf"]))
    print "Should be 1, 2"
    print list(c.search("data", ["tin"]))
    print "Should be 1"
    print list(c.search("data", ["dwarf","tin"]))
    print "Should be 2"
    print list(c.search("data", ["really"]))
    print "Should be empty"
    print list(c.search("data", ["asdf"]))
    print "Should be 1"
    print list(c.search("data", ["chief dwarf"]))
    print "Should be 3"
    print list(c.search("data", ["chief in moria", "was a dwarf"]))

def test2():
    print "\n\nTest 2:\n"
    c = catalog.Catalog()
    c.__parent__ = ParentFaker()
    c.add(TestEntry('foo', open("../../test/index.html").read()))

    print list(c.search("data", ["peter"]))
    print list(c.search("data", ["peter * hamilton"]))

def main():
    test1()
    test2()

if __name__ == "__main__":
    main()

As you can see, I'm working on letting a wildcard stand in for a word, so I can search for ``Peter F. Hamilton'' even if I can't recall his middle initial.
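One way the wildcard could work, given that the catalog already stores word positions, is to skip the "*" slots when checking positions while still counting them as one word each. The sketch below is an assumption about how this might be done, not code from nope; find_phrase_wildcard and the sample index are invented, it assumes the phrase starts with a real word, and it targets a current Python.

```python
WILDCARD = "*"

def find_phrase_wildcard(index, words):
    """Like a phrase search, but "*" matches any single word.

    index maps word -> {entry id -> list of word positions}.
    Assumes the phrase starts with a real word, not a wildcard.
    """
    real = [(i, w) for i, w in enumerate(words) if w != WILDCARD]
    if not real or real[0][0] != 0 or any(w not in index for _, w in real):
        return []
    candidates = set(index[words[0]])
    for _, word in real[1:]:
        candidates &= set(index[word])

    hits = []
    for entry in sorted(candidates):
        for start in index[words[0]][entry]:
            # Wildcard slots are skipped, but still take up one position
            if all(start + i in index[w][entry] for i, w in real):
                hits.append(entry)
                break
    return hits

index = {
    "peter": {1: [3], 2: [1]},
    "f": {1: [4]},
    "hamilton": {1: [5], 2: [2]},
}
print(find_phrase_wildcard(index, ["peter", "*", "hamilton"]))   # [1]
print(find_phrase_wildcard(index, ["peter", "hamilton"]))        # [2]
```

Here entry 1 stands for a page containing ``Peter F. Hamilton'' and entry 2 for one containing just ``Peter Hamilton'', so the wildcard query finds only the first.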

1 old-style comments

  1. Tom Hoffman, October 20, 2004 at 11:52 PM:

    Hi Neil, I was pretty surprised to come across your comments about SchoolTool. I'm the manager of that project now that Steve is working on Ubuntu. I came across your blog while looking for info about doing SMS with Twisted. Did you ever get any response from the Foundation about licensing? Let me know if you're interested in doing some paid work on SchoolTool at some point in the future. We don't currently have any South African developers, and I'd like to change that eventually.