Contents

Introduction

This document will give you an overview of how the indexing and searching processes work within Beagle and present the general architecture and design.

Beagle is written in C# and runs on top of the Mono framework. The core of Beagle, the backends and filters are all written in C#. Because Beagle has a client-server design and exposes an XML-based messaging system, external applications may search Beagle's indexes and index additional information using any language they like. At the time of this writing, there are Mono, C, and Python APIs for this.

Beagle essentially provides two services: indexing of the user's data and allowing the user to search that data. We often refer to the data as the user's personal information space. This includes not only files on disk but also emails, viewed web pages, instant messenger conversations, and anything else you can think of. One's personal information space is ever increasing, and Beagle aims to make all of it available at the user's fingertips easily.

In this guide we'll take a look at some of the component parts that make up Beagle, including backends, filters, and indexables. We'll then take a look in depth at the indexing process and the steps data must go through before it's indexed. In future versions, we'll take a look at how exactly queries work.

This guide was written by Joe Shaw, with parts taken from existing documents by Daniel Drake and Debjyoti Bera.

Backends

The most important component to Beagle is the backend. At the highest level, the backend connects a data source to the Beagle core. A data source might be the file system in the user's home directory, it might be a mbox file for emails, it could be structured data accessed through library APIs, or it could be stored entirely on a remote computer. The following are some backends shipped in Beagle and their data sources:

  • File system backend - the user's home directory
  • Evolution mail backend - local and IMAP mail from Evolution, stored in ~/.evolution
  • Gaim backend - instant messenger conversations stored in ~/.gaim/logs
  • Evolution data server backend - addressbook contacts and calendar items from Evolution stored in the Evolution data server
  • Google backend - Google web search API

Generally, there are two models for backends: querying backends and indexing backends.

Querying backends

Querying backends implement the low-level backend interfaces and are completely responsible for searching data stores. These backends do not store any data in Beagle's indexes. The Google backend is an example of a querying backend. Since it is not possible to index the whole of Google's content locally like an indexing backend, it is necessary to query Google. These backends are much less common than indexing backends, and we won't go into much detail on them here.

Indexing backends

These backends index information into Beagle's Lucene indexes. The data is usually (but needn't necessary be) local data. Indexing backends usually crawl initially to populate the indexes and set up watches to notice changes as they happen in real-time later. The file system backend, the Evolution mail backend, and the Gaim IM log backend are examples of indexing backends.

Advantages of backends

The backend design is one of Beagle's greatest strengths because it empowers the developer to search and index data in the best way they see fit. Backends are dynamically loaded, which makes it possible for people to easily develop and deploy their own backends outside of the upstream Beagle tree.

Implementing backends

At the lowest level, all backends must implement the IQueryable interface. However, since nearly all backends are indexing backends, they should derive from the LuceneFileQueryable class (for file-based backends) or the LuceneQueryable class (for non-file base backends). These classes implement IQueryable themselves. All the nasty details of searching the low-level indexes are quite fortunately hidden from the programmer. See our Backend Tutorial page for a step-by-step guide to building indexing backends.

On the other hand, if you are not indexing data (for example, as with the Google backend), you must implement IQueryable directly.

Indexing Process

Most backends will build on top of LuceneQueryable to index data from its data source. Data is indexed in one of two ways: by crawling the data and indexing each item one by one; and by watching for changes to that data and indexing a changed item in real-time. In both cases, however, the backend creates one or more indexable objects and schedules them for indexing. Indexables are then often passed to a filter which extracts the plain text content from a structured file and attaches additional metadata based on the file type.

Indexables

The indexable object represents a single item to be indexed, updated, or removed from the index. This could be a single file on disk, an IM conversation, an email, or an addressbook entry. Every indexable has a URI associated with it.

Indexables created for the purpose of adding to the index can have properties, which are key-value pairs, associated with them. These properties usually represent metadata somehow associated with the item, and (depending on the property's type) can be stored in the index and searched. These indexables can also have a file or stream attached to them, which is the content of the item. For a file, for example, this would be the contents of the file. For an IM conversation, the contents of the conversation. Contents aren't required, however; there is no content in an addressbook contact, for example, only metadata.

Indexables can be created which change only properties on an existing item in the index. This is common in backends where metadata changes, but the contents of the indexable don't. Examples include a renamed file or an email that was marked as read.

Lastly, indexables can also be created for removing an item from the index. The only thing necessary is the URI of the item being removed.

Advantages of indexables

One important thing to note about indexables is that they don't necessary have a one-to-one relationship to files on disk. This is a very important distinction because it means that backends can adapt very easily to however data is stored.

For example, Evolution stores its local mailboxes in mbox format. A one-to-one relationship does not work with mbox files, since each email message is a distinct, individual item that should be indexed separately. Our one-to-many relationship makes this possible.

Filters

Filters are associated with one or more MIME types and are responsible for parsing the structured content of an indexable and converting that into metadata and textual content. The filter is essential in the indexing process, as it gives more context to the document being indexed and ensures that only text is stored in the text fields of the index.

For example, if we are asked to index an HTML file, the HTML filter should strip out all of the HTML tags so that they are not indexed as content. Meanwhile it should also pull out any useful metadata such as the hyperlinks, the author of the document, and anything else it can understand.

As another example, if we are asked to index an image file, there is no textual content there, so we shouldn't pass any to the indexer. However, this possibly is a lot of useful metadata: the dimensions of the image, comments added by image manipulation software, EXIF data about the image and camera, tags added by photo library software, and much more.

To see how to implement filters, view our Filter Tutorial page.

The difference between backends and filters

If you are looking to add support for a new application or data type to Beagle, you may wonder whether you should create a backend or a filter.

Both backends and filters can set metadata on indexables. The general rule of thumb is that backends represent data sources like application-specific data stores or data bases. Filters, on the other hand, are used to process the content of an indexable and as a result are usually specific to a certain type of file or data. In some cases it makes sense to create both backends and filters.

Put more simply, backends get the data from somewhere. Filters just process the data; they don't care where the data came from.

Examples of backends include: the file system (usually the user's home directory), Evolution email, Evolution calendar and addressbook, Gaim IM logs, Tomboy notes.

Examples of filters include: OpenOffice documents; HTML; audio, video, and images; plain text; email.

In some cases you will want to create both a backend and filter. In the backend you will extract the data from the store and add any backend-specific metadata to the indexable. Then the indexable will be passed off to the filter where it will extract content-specific metadata.

A good existing example of this are the email filter and backends. There are (at least) three email backends: one of Evolution, one of Thunderbird, and one for KMail. Each of these applications store email and the associated data in different file and database formats. They each have their own set of folders and have different types of metadata like tags, labels, etc. The backends extract this information and attach it to the indexable. At the core of each email, however, is an RFC 822-compliant blob of email data. The email filter takes that blob and is able to extract the name and address of the person who sent the email, the date and time the email was sent, the subject, the names and addresses of the recipients, any text and attachments included in the mail and turn it into a useful item in the index.

Flow of the indexing process

The indexing process generally works like this:

Image:indexing-flow.png

1. The data sits in the data source, such as the file system or a mailbox.

2. The backend extracts an item from the data source (such as a file or individual email) and creates an indexable object to represent it. It usually sets some backend-specific metadata on it (such as the file name or mailbox name). In some cases the backend already knows the MIME type of the object and sets it. The backend has the option of turning off the filtering step; if this is the case, we go to step 5.

3. If the MIME type of the indexable isn't known, the filter factory attempts to determine what it is. If there is a filter which matches the MIME type, we go to step 4. Otherwise we ignore the content of the file and go to step 5.

4. The content is filtered based on MIME type. The filter extracts metadata based on the content as well as the plain text content which will be indexed.

5. We index all the metadata on the indexable (from the backend and the filter) and if we have plain text content from the filter, we index that as well.

Crawling

As mentioned before, indexing usually happens two ways: by crawling the data and watching for real-time changes to that data.

Crawling is very similar to how the web is indexed. Data is crawled one file, web page, or email at a time, and items are added, updated, or removed as it goes along. When finished, this provides a static snapshot of the user's data.

To ensure that the user's data will always be indexed, it is important for every backend to crawl all its data when it is started. The first time Beagle is run, this crawl will index everything from this data source. On subsequent starts, this crawl is needed to index any changes since the last time Beagle was run.

It is important to note that although data must be crawled each time Beagle is started, it is not necessarily indexed. Backends should ensure that data that has been previously indexed and still up to date is not needlessly reindexed, because indexing is an expensive operation.

The crawl at daemon startup is often called the initial crawl and is usually the only crawl performed. However, some backends may continue to crawl after the initial crawl has finished. This is entirely up to the backend author.

Real-time changes

Usually as part of the startup process or the initial crawl, backends set up real-time watches on data and get notification whenever that data changes. How exactly this is done varies depending on the backend and data source. Backends whose data sources are files -- such as the file system and Gaim backends -- use inotify to get notification of changes. Other backends like the Evolution data server backend get notification through library APIs. Regardless of how these notifications are sent, the backend watches for them and indexes data as it is changed in real-time.

It is generally a good practice to set up watches as you initially crawl the data; that way a second pass isn't necessary to add watches.

Scheduling indexing tasks

Indexing all of the user's data is a time consuming task, and it's not uncommon for it to take several hours if this is the first time Beagle has been run. Of course, the user probably doesn't want to wait for this to happen, she probably wants to use her computer!

To try to be as unobtrusive to the user as possible, Beagle has a scheduler which determines when to run certain tasks. As tasks are run, system load and time to execute are factors in determining how often to run tasks. Tasks can be given different priorities which also affect when they are run. A task with a priority of "Immediate" is generally run as soon as possible, while a task with a "Idle" priority would only be run when there is nothing else to be run first. The following is a list of priorities in order from highest to lowest:

  • Immediate (do it right now)
  • Delayed (do it soon)
  • Maintenance (do it when there are no higher priority tasks from the same source)
  • Idle (do it only when there are no other higher priority tasks from any source)
  • Shutdown (do it on shutdown)

In general, backends run the initial crawl task with a priority of "Delayed" but real-time events are "Immediate".

Indexable generators

Normally, when real-time events come in, backends create individual indexable objects for the updated item, create a scheduler task for that indexable, and send the task to the scheduler. When crawling, however, there is the very real potential to create dozens, hundreds or even thousands of indexables, especially when starting with an empty index. Doing so taxes the scheduler and prevents other tasks from running.

To solve this problem, we have indexable generators. An indexable generator is an object which creates indexable objects on demand when the scheduler asks for them. Instead of creating indexables, individual tasks for the indexables and adding them one by one to the scheduler, we create a single scheduler task for the indexable generator and pass it to the scheduler instead. When it comes time to process the indexable generator, it creates indexables as necessary.

An indexable generator must implement the IIndexableGenerator interface.

Child indexables

Some data formats can act as containers for other data. An example of this is email, which can contain not only metadata and text but also attachments in other formats. If a document is sent in email, it's important that we index the content of that document as well as the content of the email. To deal with this, filters can create child indexables.

Child indexables have a special ParentUri field which points to the indexable that created it. They also inherit all the properties of its parent, but placed in the "parent" namespace. After the filter has finished running on the parent, tasks are created for each of the children and they are scheduled to run. Filters run on the child indexables just like any other indexable. If a parent indexable is removed from the index, all of the children are also removed.

The beagle daemon and beagle helper process

Beagle has a difficult job. Essentially, its mission is to index every piece of information in every format on your computer. Beagle must deal with a wide variety of formats of data, and to do this it uses as much existing software as possible. However, no piece of software is completely robust, not all data is well-formed, and things can occasionally go wrong. To help make Beagle more robust in the face of failures, the filtering and storing steps of Beagle's indexing process take place out-of-process from the main Beagle daemon in the index helper process.

Recall the indexing process from earlier:

1. Data source -> 2. backend -> 3. filter factory -> 4. filter -> 5. index

The backends (1-2 above) run in the Beagle daemon while filtering and storing the data in the index (3-5 above) happen in the index helper process. That way, if a helper process crashes due to a malformed file or bug in a filter, the daemon can continue running and start a replacement index helper. It also has the side effect of enforcing the backend/filter split. Filters are entirely transient and operate only on the data passed to it. Backends can (and usually must) maintain state, such as its progress while crawliing and the watches it has set up for real-time updates.


This page was last modified 20:54, 31 December 2006. This page has been accessed 9,234 times.

  
MediaWiki

Copyright © 2004-2007