
An Algebraic DB&IR Approach to
Personal Information Management

 

Motivation

Most Personal Information Management Systems (PIMS) today face the daunting task of dealing with large collections of data from diverse sources. These data are not limited to plain, unstructured text files or to structured data that fits easily into a conventional Database Management System (DBMS). For example, a personal desktop typically contains an extremely heterogeneous collection of data, including text, pictures, music, emails, XML, LaTeX and Microsoft Office documents, scattered across a hierarchy of folders (see Fig. 1). What we lack today is a means of managing and searching these data in a convenient, unified fashion.

Fig. 1 A personal desktop dataspace

Recently, this issue has gained considerable attention from both industry and the research community. Several major vendors, such as Microsoft, Apple and Google, offer keyword-based desktop search tools, but the search range of these tools is limited to the file system managed by the underlying operating system (OS). They also lack the ability to retrieve a particular segment of a document's contents. For example, if we wish to search for a particular section of a LaTeX file, these tools return the name of the file (with its full contents) instead of only the desired section. However naive their approach, keyword-based desktop search tools are nevertheless an important first step towards searching a mixed collection of data.

In the database research community, there has been considerable emphasis on the need for new principles for managing a heterogeneous collection of data. Recently, a graph data model and a new XPath-like query language have been proposed for managing and accessing one’s personal data scattered across various data sources such as desktop PCs, email servers and so on. However, like several other database languages such as SQL and XQuery, the proposed query language is very complex. Moreover, it inherits one of the serious drawbacks of XPath and XPath-like query languages: users are expected to know the underlying structure of the data that they are going to query. This requirement discourages a large section of naive (desktop) users, who are already overwhelmed by a huge volume of data with no fixed schema or structure.

We argue that, as in several other new applications, the challenges in desktop search must be understood from a wider perspective rather than as mere database issues, if they are to be met effectively for the benefit of a wider audience. This project takes an integrated ‘database/information retrieval’ (DB/IR) approach to searching a desktop dataspace, that is, the heterogeneous collection of data in a personal desktop. We identify not only the general requirements and challenges of this approach, but also several specific ones. A particularly important issue that we highlight is how to achieve DB-like performance gains in this integrated DB/IR query platform.