Introducing nope; the naive catalog
19 Apr 2004
The recent long weekends and public holidays have allowed me to start writing some code again, and nope (Neil's Object Publishing Experiment) is the codename for what I've been playing with. It uses the Twisted/ZODB integration from The Shuttleworth Foundation's SchoolTool (headed by Zope developer Steve Alexander) to provide a really simple object publishing environment, stealing many ideas and concepts from Zope 3.
I'm just waiting on some licensing information from The Shuttleworth Foundation (SchoolTool is GPL, and I would like the Twisted/ZODB integration to be available under the LGPL or BSD/ZPL), and then I'll release some code.
Hopefully someone'll come up with a better name for nope, as for obvious reasons it sounds a little too much like Zope...
The building blocks of nope are just starting to settle: it uses Zope interfaces and adapters to publish objects (currently only to the web), and has the basics of object paths and traversal.
It uses Twisted as the encompassing framework, and this allows a telnet interface (manhole) to manipulate the object database. Currently, that's the only real way to create objects once nope's running.
I'm writing a simple blog system as the test product for nope, and I found I missed Zope's Catalog for searching (just like me to look at searching before working out how to add entries via the web...). So, I decided to write one for nope, equipped with little knowledge of how one writes a catalog.
I broke the catalog into two modules. The first deals with breaking up some text (or HTML) into words, and returns a dictionary with words as keys and, as values, the list of positions (word numbers) at which each word occurs in the text. That's catalogutil.py:
"""Text Catalog"""
import re
import sgmllib
import StringIO
word_splitter = re.compile('''[- .,\n\(\)/_]''')
shortest_word = 1
longest_word = 16
common_words = (
"and",
"are",
"but",
"can",
"don't",
"for",
"have",
"the",
"that",
"this",
"who",
"will",
"you",
"your",
)
suffixes = (
"'d",
"'re",
"'ll",
"'ve",
)
class StrippingParser(sgmllib.SGMLParser):
    def __init__(self):
        self.text = StringIO.StringIO()
        sgmllib.SGMLParser.__init__(self)

    def handle_data(self, data):
        self.text.write(data)
        # Ensure space between data segments separated by tags
        self.text.write(' ')


def getWordsHTML(paragraph):
    sp = StrippingParser()
    sp.feed(paragraph)
    return getWords(sp.text)


def getWordsText(paragraph):
    return getWords(StringIO.StringIO(paragraph))


def split(words):
    return [w for w in word_splitter.split(words) if w]


def getWords(text):
    wordnum = 0
    pos = 0
    chunk = 128
    words = {}
    while 1:
        text.seek(pos)
        data = text.read(chunk)
        at_end = text.tell() == text.len
        if at_end:
            # Last chunk: index all of it, including the final word
            last = len(data)
        else:
            # Cut at the last space so a word isn't split across chunks
            last = data.rfind(' ')
            if last <= 0:
                # No usable space in this chunk; skip it
                pos = pos + chunk
                continue
        pos = pos + last
        data = data[:last]
        for word in split(data):
            word = word.strip()
            if len(word) == 0:
                continue
            # Count every word, even skipped ones, so positions stay true
            wordnum += 1
            used_suffix = 0
            if len(word) < shortest_word:
                continue
            if len(word) > longest_word:
                continue
            word = word.lower()
            suffix_start = word.rfind("'")
            # Trailing apostrophe (rfind returns at most len(word) - 1)
            if suffix_start == len(word) - 1:
                word = word[:-1]
                suffix_start = word.rfind("'")
            # Leading apostrophe
            if suffix_start == 0:
                word = word[1:]
            if suffix_start >= 0:
                suffix = word[suffix_start:]
                if suffix in suffixes:
                    used_suffix = 1
                    word = word[:suffix_start]
            #if word in common_words:
            #    continue
            if used_suffix:
                word = word + suffix
            words.setdefault(word, [])
            words[word].append(wordnum)
        if at_end:
            break
    return words
testdata = """
You should create a localconfig.py file in the directory from which
you're going to run sisynala. Or, set PYTHONPATH to a directory in
which your localconfig.py lives. This allows you to easily have
multiple configurations for different log files. Or just stick it in
the sisynala python directory if you're certain you're going to only
have one.
There's an example localconfig.py in the examples/ directory of the
source distribution, and in share/examples/sisynala once you've
installed.
EXCLUDEDBYFUNC is a function that is called to exclude certain types of
traffic for reasons that can't be covered by the other exclusion
mechanisms. MY_AGENTS and PLACES_I_VISIT are helper lists used by the
example EXCLUDEDBYFUNC function, and can be tossed if you're not using
them. Return 0 from the function if the line should not be ignored;
return 1 if it should be.
Change MY_REFERERS to your web site URL, and any other local names that
might apply to your web site.
Change EXCLUDEIPS to anything that's almost certainly coming from your
own bots (like I exclude my planet and some other automated scripts) and
for people who screw you around and can't be blocked in any way but IP
address.
EXCLUDESEARCHES are exact search phrases (ie, not partial matches) to
exclude from the searches page.
IP_TO_NAME is for if you don't use name resolution (I don't, personally)
and want to identify certain IPs by name. You can identify multiple IPs
as the same name, and they'll be grouped.
REFERERSPAMMERS are for domains which have been used as fake referers to
generate links back to that domain in your statistics pages.
Change HTMLEXCLUDEPATTERNS to string patterns (currently, not needed
regex yet) applied to the full path of the request. I don't care about
people going to '/stats/', for example, so I exclude it.
HTMLPATTERNS (which only things that pass HTMLEXCLUDEPATTERNS get passed
through) lists string patterns applied to the full path of the request
which _will_ be considered valid HTML files (ie, not downloads). You
can just put '/' in there to include everything.
DOWNLOADEXCLUDEPATTERNS is analogous to HTMLEXCLUDEPATTERNS, except it
affects what gets considered a download.
DOWNLOADPATTERNS is analogous to HTMLPATTERNS. You might want to put
something like '.tar.gz' or whatever in there if you don't have specific
download directories (ie, like /files/ and /dist/ on my site).
"""
def main():
    import pprint
    pprint.pprint(getWordsText(testdata))


if __name__ == "__main__":
    main()
I'm not sure how important the suffixes stuff is - it doesn't seem that Google breaks them down. I tend to use constructs like ``Neil'll've fixed it...'', so it might be useful for my writing!
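To make the shape of that dictionary concrete, here's a minimal sketch of the same idea, written in current Python 3 rather than the Python 2 of catalogutil.py. It leaves out the chunked reading, the length limits and the suffix handling, and `word_positions` is a made-up name for illustration only.

```python
import re

# Same separator characters as word_splitter in catalogutil.py
WORD_SPLITTER = re.compile(r"[- .,\n()/_]")


def word_positions(text):
    """Map each lower-cased word to the list of its 1-based word numbers."""
    positions = {}
    words = [w for w in WORD_SPLITTER.split(text) if w]
    for number, word in enumerate(words, 1):
        positions.setdefault(word.lower(), []).append(number)
    return positions


index = word_positions("The big fat chief dwarf sat on the tin roof.")
print(index["the"])    # [1, 8]
print(index["dwarf"])  # [5]
```

Keeping word numbers rather than character offsets is what makes phrase searching cheap later: two words are adjacent exactly when their numbers differ by one.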
The second half is the catalog object that'll live in ZODB. It's still hard-coded for my blog entries, as I'm not entirely sure how I'm going to deal with objects explaining to the catalog how to catalog them. It'll probably use adapters, but I don't want to enter YAGNI territory; nope wants to stay simple.
The catalog (catalog.py) is a tiny bit reliant on some Zope conventions (__parent__, getPath) and on ZODB-related modules (persistence, IOBTree), but the former can easily be faked, and the latter can easily be converted to straight dictionaries (ie, PersistentDict = dict, Persistent = object, IOBTree = dict, since the IOBTree is indexed by entry id).
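As a rough illustration of that substitution, the sketch below (Python 3, nothing persisted) swaps in plain built-ins and builds the name -> word -> entry id -> positions nesting the catalog uses. I use a dict for IOBTree here because the catalog indexes it by entry id. `TinyCatalog` and `add_word` are hypothetical names, not part of nope.

```python
# Stand-ins for the ZODB classes; nothing here is written to disk.
Persistent = object
PersistentDict = dict
IOBTree = dict  # entry id -> list of word positions


class TinyCatalog(Persistent):
    """Hypothetical in-memory catalog: name -> word -> entry id -> positions."""

    def __init__(self):
        self.catalog = PersistentDict()

    def add_word(self, name, word, entry_id, positions):
        words = self.catalog.setdefault(name, PersistentDict())
        entries = words.setdefault(word, IOBTree())
        entries[entry_id] = positions


c = TinyCatalog()
c.add_word("data", "dwarf", 1, [5])
c.add_word("data", "dwarf", 3, [4])
print(c.catalog["data"]["dwarf"])  # {1: [5], 3: [4]}
```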
I haven't created the equivalent of the Zope ObjectHub yet; again, I'm not sure how to implement it. So, my Blog class is currently providing its features. The ObjectHub provides a mechanism to turn a path to an object into a unique hub id. This means the catalog just stores hub ids, not paths or references.
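For a feel of what the catalog needs from a hub, here's a minimal sketch in Python 3. It is only a guess at the smallest thing that would do the job, not Zope's ObjectHub API: `TinyHub`, `register` and `hubid_to_path` are hypothetical, and only the `path_to_hubid` name comes from the catalog code.

```python
class TinyHub:
    """Hypothetical stand-in for an object hub: paths in, integer ids out."""

    def __init__(self):
        self.path_to_hubid = {}
        self.hubid_to_path = {}
        self._next_id = 1

    def register(self, path):
        """Return the id for path, allocating a new one on first sight."""
        if path not in self.path_to_hubid:
            self.path_to_hubid[path] = self._next_id
            self.hubid_to_path[self._next_id] = path
            self._next_id += 1
        return self.path_to_hubid[path]


hub = TinyHub()
foo_id = hub.register("/blog/foo/")
bar_id = hub.register("/blog/bar/")
print(foo_id, bar_id)              # 1 2
print(hub.register("/blog/foo/"))  # 1
print(hub.hubid_to_path[bar_id])   # /blog/bar/
```

Integer ids are what let the per-word mapping be an IOBTree (integer keys) instead of something keyed on path strings.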
"""Text Catalog"""
import os.path
from sets import Set
from persistence import Persistent
from persistence.dict import PersistentDict
#from persistence.list import PersistentList
from zodb.btrees.IOBTree import IOBTree
from catalogutil import getWordsHTML, getWordsText, split
"""
Catalog layout
catalog is a PersistentDict, with catalog name as keys.
catalog["name"] is a PersistentDict, with words as keys.
(OOBTree worthwhile?)
catalog["name"]["word"] is an IOBTree, with entry ids as keys.
catalog["name"]["word"][54] is a list, containing the word number in the
document.
(List because it never gets updated - so not a PersistentList)
(Could be a tuple - worth it?)
(Should we have a dictionary of entry to list of words for ease of deletion?)
catalogdel is a PersistentDict, with catalog name as keys.
catalogdel["name"] is a PersistentDict, with entry ids as keys
(IOBTree worthwhile?)
catalogdel["name"][54] is a list, containing the words in the document.
(Should we get into word ids?)
"""
class Catalog(Persistent):
    def __init__(self):
        self.catalog = PersistentDict()
        self.catalogdel = PersistentDict()
        self.times = PersistentDict()
        self.nameToAttr = PersistentDict()
        self.createCatalog("data", "extended")
        #self.catalog["data"] = PersistentDict()

    def add(self, entry):
        entrypath = os.path.join(entry.__parent__.getPath(), entry.__name__, "")
        id = self.findHub().path_to_hubid[entrypath]
        self.times[entry.posted] = id
        wordlist = []
        for name, attribute in self.nameToAttr.items():
            catalog = self.catalog[name]
            catalogdel = self.catalogdel[name]
            data = getattr(entry, attribute)
            for word, occurrences in getWordsHTML(data).items():
                if not catalog.has_key(word):
                    catalog[word] = IOBTree()
                catalog[word][id] = occurrences
                wordlist.append(word)
            catalogdel[id] = wordlist

    def remove(self, entry):
        del self.times[entry.posted]

    def createCatalog(self, name, attribute):
        self.catalog[name] = PersistentDict()
        self.catalogdel[name] = PersistentDict()
        self.nameToAttr[name] = attribute

    def findHub(self):
        # Um, yeah.
        return self.__parent__

    def findPhrase(self, name, words):
        #print "findPhrase: %s: %s" % (name, words)
        catalog = self.catalog[name]
        entries = None
        for word in words:
            #print "Finding entries for %s: " % (word),
            if not word in catalog:
                #print "Not in catalog: %s" % (word)
                return Set()
            word_results = Set(catalog[word])
            #print list(word_results)
            # Compare against None: an empty intersection must stay empty
            if entries is not None:
                entries = entries.intersection(word_results)
            else:
                entries = word_results
        return_entries = []
        # For each entry, get the list of positions of the first word.
        # Then, for each of those positions, check to see if each of the
        # remaining words have their respective positions in their
        # position lists.
        # If we get through all the words and they have the correct
        # positions in their position list, we've got a hit for this
        # entry!
        for entry in entries:
            #print "Looking in entry %s" % (entry)
            word = words[0]
            positions = list(catalog[word][entry])
            #print "Position for first word (%s): %s" % (word, positions)
            for position in positions:
                found = 1
                for i, word in enumerate(words[1:]):
                    #print "Positions for word #%d (%s): %s" % (i, word, list(catalog[word][entry]))
                    if position + i + 1 not in catalog[word][entry]:
                        found = None
                        break
                if found:
                    return_entries.append(entry)
                    # We're returning a list of entries, so no need to
                    # look further in this entry
                    break
        return return_entries

    def search(self, name, words):
        catalog = self.catalog[name]
        return_entries = None
        for word in words:
            if len(split(word)) > 1:
                word_results = self.findPhrase(name, split(word))
            else:
                if catalog.has_key(word):
                    word_results = catalog[word]
                else:
                    return Set()
            # Compare against None: an empty intersection must stay empty
            if return_entries is not None:
                return_entries = return_entries.intersection(word_results)
            else:
                return_entries = Set(word_results)
        if return_entries is None:
            return Set()
        return return_entries
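The phrase matching is the only subtle part, so here's a stripped-down sketch of it in Python 3, using plain dicts and the built-in set in place of IOBTree and sets.Set. The `find_phrase` name and the hand-built `index` are illustrative assumptions; the positions are the word numbers from two of my test sentences.

```python
def find_phrase(index, words):
    """Return the ids of entries in which words occur consecutively.

    index maps word -> {entry id: [word positions]}.
    """
    if not words or any(word not in index for word in words):
        return set()
    # Only entries containing every word can contain the phrase.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    hits = set()
    for entry in candidates:
        for start in index[words[0]][entry]:
            # The i-th word of the phrase must sit at position start + i.
            if all(start + i in index[word][entry]
                   for i, word in enumerate(words)):
                hits.add(entry)
                break
    return hits


index = {
    "chief": {1: [4], 3: [5]},
    "dwarf": {1: [5], 3: [4]},
}
print(find_phrase(index, ["chief", "dwarf"]))  # {1}
print(find_phrase(index, ["dwarf", "chief"]))  # {3}
```

Both entries contain both words, so the intersection alone can't tell them apart; it's the position check that separates "chief dwarf" from "dwarf chief".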
I relatively easily turned this into an on-disk (or web-based) catalog (albeit imperfectly). catalogweb.py catalogs the files from my web page:
#!/usr/bin/env python
import sys
sys.path.append('/home/nbm/MyProjects/nope/src')
sys.path.append('/home/nbm/Publishing/Zope3/lib/python2.3/site-packages')
from zodb.db import DB
from zodb.storage.file import FileStorage
from transaction import get_transaction
from nope.catalog import Catalog
class TestEntry:
    def __init__(self, name, extended):
        import time
        self.extended = extended
        self.posted = time.time()
        self.__parent__ = self
        self.__name__ = name

    def getPath(self):
        return '/'


class ParentFaker:
    path_to_hubid = {
    }
files = (
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/starting-tnb.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/vacancies.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff1.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/words.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff2.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/building-communities-with-weblogs/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/masks.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/development-good-practise-using-oss/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/opensource-digitaldivide.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/perfect.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/bannergrab/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/sisynala/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/stikiwiki/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/tnntprss/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/nbm/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/books/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index-nocss.html',
)
def search(c, searchterms):
    print "Searching for %s: " % (searchterms),
    print [files[i] for i in c.search("data", searchterms)]


def main():
    db = DB(FileStorage('Index.fs'))
    conn = db.open()
    root = conn.root()
    c = Catalog()
    c.__parent__ = ParentFaker()
    for i, f in enumerate(files):
        print f
        c.__parent__.path_to_hubid[f + '/'] = i
        c.add(TestEntry(f, open(f).read()))
    root['catalog'] = c
    get_transaction().commit()
    #search(c, ["others can find"])
    #print [files[i] for i in c.search("data", ["others can find"])]


if __name__ == "__main__":
    main()
Since the catalog only stores hub ids, I've unfortunately had to replicate the files list in the script that searches the stored catalog, but that list could trivially be stored in ZODB instead. I also need to rewrite the catalog to take a file-like object instead of a string, so that I can index larger documents. The search script:
#!/usr/bin/env python
import sys
sys.path.append('/home/nbm/MyProjects/nope/src')
sys.path.append('/home/nbm/Publishing/Zope3/lib/python2.3/site-packages')
from zodb.db import DB
from zodb.storage.file import FileStorage
from nope.catalog import Catalog
class TestEntry:
    def __init__(self, name, extended):
        import time
        self.extended = extended
        self.posted = time.time()
        self.__parent__ = self
        self.__name__ = name

    def getPath(self):
        return '/'


class ParentFaker:
    path_to_hubid = {
    }
files = (
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/starting-tnb.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/vacancies.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff1.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/words.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/ff2.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/building-communities-with-weblogs/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/masks.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/development-good-practise-using-oss/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/opensource-digitaldivide.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/writings/perfect.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/bannergrab/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/sisynala/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/stikiwiki/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/tnntprss/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/code/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/nbm/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/books/index.html',
'/home/nbm/MyProjects/nope/test/mithrandr.moria.org/index-nocss.html',
)
def search(c, searchterms):
    print "Searching for %s: " % (searchterms),
    print [files[i] for i in c.search("data", searchterms)]


def main():
    db = DB(FileStorage('Index.fs'))
    conn = db.open()
    root = conn.root()
    c = root['catalog']
    search(c, ["others can find"])


if __name__ == "__main__":
    main()
Of course, this example won't make any sense, but it seems to work.
Oh, I also have a test script for the catalog that doesn't use unittest (test_catalog.py):
#!/usr/bin/env python
import catalog
class TestEntry:
    def __init__(self, name, extended):
        import time
        self.extended = extended
        self.posted = time.time()
        self.__parent__ = self
        self.__name__ = name

    def getPath(self):
        return '/'


class ParentFaker:
    path_to_hubid = {
        '/foo/': 1,
        '/bar/': 2,
        '/baz/': 3,
    }


def test1():
    c = catalog.Catalog()
    c.__parent__ = ParentFaker()
    c.add(TestEntry('foo', "The big fat chief dwarf sat on the tin roof."))
    c.add(TestEntry('bar', "I really don't know why tin is better."))
    c.add(TestEntry('baz', "Balin was a dwarf chief in moria."))
    #print c.catalog["data"].keys()
    print "Should be 1, 3"
    print list(c.search("data", ["dwarf"]))
    print "Should be 1, 2"
    print list(c.search("data", ["tin"]))
    print "Should be 1"
    print list(c.search("data", ["dwarf","tin"]))
    print "Should be 2"
    print list(c.search("data", ["really"]))
    print "Should be empty"
    print list(c.search("data", ["asdf"]))
    print "Should be 1"
    print list(c.search("data", ["chief dwarf"]))
    print "Should be 3"
    print list(c.search("data", ["chief in moria", "was a dwarf"]))


def test2():
    print "\n\nTest 2:\n"
    c = catalog.Catalog()
    c.__parent__ = ParentFaker()
    c.add(TestEntry('foo', open("../../test/index.html").read()))
    print list(c.search("data", ["peter"]))
    print list(c.search("data", ["peter * hamilton"]))


def main():
    test1()
    test2()


if __name__ == "__main__":
    main()
As you can see, I'm working on letting a wildcard stand in for a word, so I can search for ``Peter F. Hamilton'' even if I can't recall his middle initial.
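One way the wildcard could work, sketched in Python 3 with plain dicts: treat "*" as a slot that any single word may fill, and check positions only for the real words. This is an assumption about where the feature might go, not what catalog.py does today; `find_phrase_wildcard` and the sample `index` are made up for the example.

```python
WILDCARD = "*"


def find_phrase_wildcard(index, words):
    """Like a phrase search, but "*" matches any single word.

    index maps word -> {entry id: [word positions]}. The phrase is
    assumed to start with a real word, not a wildcard.
    """
    real = [(i, w) for i, w in enumerate(words) if w != WILDCARD]
    if not real or real[0][0] != 0 or any(w not in index for _, w in real):
        return set()
    candidates = set(index[words[0]])
    for _, word in real[1:]:
        candidates &= set(index[word])
    hits = set()
    for entry in candidates:
        for start in index[words[0]][entry]:
            # Wildcard slots are skipped, so any word may occupy them.
            if all(start + i in index[w][entry] for i, w in real):
                hits.add(entry)
                break
    return hits


index = {
    "peter": {1: [3], 2: [1]},
    "f": {1: [4]},
    "hamilton": {1: [5], 2: [2]},
}
print(find_phrase_wildcard(index, ["peter", "*", "hamilton"]))  # {1}
print(find_phrase_wildcard(index, ["peter", "hamilton"]))       # {2}
```

Because the splitter already breaks on ".", the initial in "Peter F. Hamilton" is indexed as the one-letter word "f", which is exactly the slot the wildcard skips over.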