Minimalist Searching
In this episode of our minimalist series: the search support …
- A user should be able to write a query string and see a list of results sorted by relevance.
- The simplest form of the query string will be a list of words.
- Each result will contain a link to a page in Basie.
- The information used during the searches will come from:
  - Tickets’ numbers and subjects.
  - Wiki pages’ names and content.
  - Emails’ subjects and bodies.
- The search results will be filtered by project and content type (wiki page, ticket, email).
Current search app
Basie’s search app does not use any external library for indexing text and then searching the index. Instead it:
- (index) Defines an InvertedIndex model which stores, for each term in an instance of a registered model: the term, a reference to the instance, and the relevance (tf-idf) of that term in the instance. Terms are the lower-cased words found in specific fields, such as a ticket's subject, and the models are registered by other apps.
- (indexing) Allows a full index per registered model: each instance of the model is processed to extract its terms, and then a relevance value is computed for every term of the model. The relevance value takes into account the number of instances of that model, the number of occurrences of the term in all the instances of that model, the number of terms in the instance, and the number of occurrences of the term in the instance.
- (re-indexing) To update the index when an instance of a particular model is modified, that model is fully reindexed. This slow operation takes place, for example, when a user edits a ticket.
- (searching) Defines two kinds of queries: ModelQuery and AdvancedQuery. The first receives a string, splits it into terms, and returns a list of all the instances that contain at least one of the terms, sorted by the number of occurrences of the terms in the instance (note that the tf-idf value is not used). The second is more complex: it also lets you specify terms that must appear in every result instance and terms that must not appear in any result instance, but the relevance of an instance is computed in the same way.
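The indexing and searching steps above can be sketched in a few lines of plain Python. This is only an illustration, not Basie's code: the names build_index and model_query are hypothetical, instances are reduced to a dict of id to text, and the relevance uses a common tf-idf variant (term frequency in the instance times the log of the number of instances over the number of instances containing the term), which may differ from the formula the search app actually uses.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lower-case word terms."""
    return re.findall(r"\w+", text.lower())


def build_index(instances):
    """Return {term: {instance_id: (occurrences, tf_idf)}} for one model.

    `instances` maps a hypothetical instance id to its indexed text.
    """
    counts = {key: Counter(tokenize(text)) for key, text in instances.items()}
    # Number of instances in which each term appears at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    total = len(instances)

    index = defaultdict(dict)
    for key, counter in counts.items():
        length = sum(counter.values())  # number of terms in this instance
        for term, occurrences in counter.items():
            tf = occurrences / length
            idf = math.log(total / doc_freq[term])
            index[term][key] = (occurrences, tf * idf)
    return index


def model_query(index, query):
    """Return ids matching at least one term, most occurrences first.

    Like the ModelQuery described above, this ignores the tf-idf value.
    """
    hits = Counter()
    for term in tokenize(query):
        for key, (occurrences, _tf_idf) in index.get(term, {}).items():
            hits[key] += occurrences
    return [key for key, _ in hits.most_common()]


# Made-up ticket subjects, used only to exercise the sketch.
tickets = {
    1: "Search is slow",
    2: "Slow search, slow index",
    3: "Fix login page",
}
index = build_index(tickets)
results = model_query(index, "slow index")
```

Here results lists ticket 2 before ticket 1, because ticket 2 contains three occurrences of the query terms and ticket 1 only one. Sorting by the stored tf-idf value instead would only require summing the second element of each tuple.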
The other features of the search app are:
- The registration mechanism (other apps use it to specify which fields of their models will be processed to obtain the terms).
- A custom command to do a full index of all the registered models.
- The web interface (views, urls, forms, templates) to use ModelQuery and AdvancedQuery.
Whoosh
Whoosh is a library, like Lucene or Xapian, for indexing text and then searching the index. It is implemented in Python and it can be installed from PyPI with easy_install Whoosh.
To understand Whoosh, there are a few important terms and concepts:
- Documents: The individual pieces of content you want to make searchable (Ex. a wiki page, an email or a ticket).
- Fields: Each document contains a set of fields (Ex. “title”, “content”, “url”, …). Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.
- Schema: object defining the fields that are indexed for each document. Some of the available field types are: ID (this type simply indexes, and optionally stores, the entire value of the field as a single unit; that is, it doesn’t break it up into individual words) and TEXT (this type is for body text; it indexes, and optionally stores, the text and stores term positions to allow phrase searching).
- Storage: object that represents the medium in which the index will be stored. Currently the two options are storing the index as a set of files in a directory or in RAM.
- Queries: a query string with a syntax similar to Lucene’s. It lets you connect terms with AND or OR, it lets you eliminate terms with NOT, it lets you group terms together into clauses with parentheses, and it lets you specify different fields to search. By default it joins clauses together with AND (so by default, all terms you specify must be in the document for the document to match).
An example of the use of Whoosh to replace many of the search app capabilities:
import whoosh.index as index
from whoosh.fields import ID, TEXT
from whoosh.qparser import QueryParser

# Create an index stored as a set of files in a directory.
ix = index.create_in("index_dir",
                     title=TEXT(stored=True),
                     content=TEXT,
                     url=ID(stored=True, unique=True),
                     project=ID,
                     kind=ID)

# Indexing documents.
writer = ix.writer()
writer.add_document(title=u"My Wiki Page",
                    content=u"A normal page with some text.",
                    url=u"basie/wiki/my-wiki-page",
                    project=u"Basie", kind=u"wiki")
writer.add_document(title=u"Re: my message",
                    content=u"Hi! a normal email message.",
                    url=u"basie-summer/mail/56",
                    project=u"Basie Summer", kind=u"mail")
writer.add_document(title=u"#456",
                    content=u"This ticket is a TODO.",
                    url=u"basie/mail/456",
                    project=u"Basie", kind=u"ticket")
writer.commit()

# Updating a document. commit() finishes the writer, so a new one is needed.
# update_document() replaces the whole document that has the same unique
# field (url), so every field is passed again, not only the changed one.
writer = ix.writer()
writer.update_document(title=u"My Wiki Page",
                       content=u"New text for my normal wiki page.",
                       url=u"basie/wiki/my-wiki-page",
                       project=u"Basie", kind=u"wiki")
writer.commit()

# Searching documents using "content" as the default field and BM25F as the
# scoring algorithm.
ix = index.open_dir("index_dir")
searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)

# Will find:
# {'url': u'basie-summer/mail/56', 'title': u'Re: my message'}
# {'url': u'basie/wiki/my-wiki-page', 'title': u'My Wiki Page'}
query = parser.parse(u"Normal")
for document in searcher.search(query):
    print document

# Will find only {'url': u'basie/wiki/my-wiki-page', 'title': u'My Wiki Page'}
query = parser.parse(u"Normal project:Basie")
Though using this library will make almost everything in the current search app obsolete, I think that it is worthwhile.
Thoughts?
Looks very promising. If multiple words are specified in a search are they implicitly ANDed or ORed? And does Whoosh give us richer Boolean operations automatically?
Greg Wilson
1 Jun 09 at 8:47 pm
They are implicitly ANDed. Whoosh gives AND, OR, NOT and () to group.
zuzelvp
1 Jun 09 at 8:58 pm
There is also an example showing support for prefixes (e.g. aa*). Though the documentation of the query syntax is not complete, the getting-started guide says that it has a syntax similar to Lucene’s (http://lucene.apache.org/java/2_3_2/queryparsersyntax.html). I am not sure whether everything from Lucene’s syntax is implemented.
zuzelvp
1 Jun 09 at 9:06 pm
I just found that, though they are implicitly ANDed by default, we can change that:
In [43]: from whoosh.query import Or
In [44]: parser = QueryParser("content", schema = ix.schema, conjunction=Or)
In [45]: query = parser.parse(u"normal page")
In [46]: for document in searcher.search(query):
   ....:     print document
   ....:
   ....:
{'url': u'basie/wiki/my-wiki-page', 'title': u'My Wiki Page'}
{'url': u'basie-summer/mail/56', 'title': u'Re: my message'}
zuzelvp
1 Jun 09 at 9:23 pm
Big question:
So when does the reindexing happen, and is it lightning fast?
Christian Muise
2 Jun 09 at 7:27 am
The first review request is at http://review.basieproject.org/r/229/. The documents in the index are created, updated or deleted in the post_save and post_delete callbacks. My TODO for today is to test how fast Whoosh is, to see if that is a good solution.
zuzelvp
2 Jun 09 at 8:30 am
@Christian
After the tests I did today I concluded that the reindexing happens very fast. I loaded the mail in beta (407 messages from our list) from a JSON file in 27s, then I ran a loop in the Django shell to create wiki pages with the subjects as names and the bodies as content. With all that information in the database, edits to wiki pages were lightning fast. I also ran some small queries that had both wiki pages and mails as results.
Tomorrow I will add unit tests to r229 so it can be reviewed and committed. After that we can run more performance tests on the new wsearch app.
zuzelvp
2 Jun 09 at 5:29 pm
Have you looked at Sphinx? http://www.sphinxsearch.com/
Heard really good things about it. Excellent Python bindings, too.
Andrey Petrov
5 Jun 09 at 1:38 pm