Project Home

An Algebraic DB&IR Approach to
Personal Information Management

 

Schedule

The project will be carried out over the period of two years. In the first year, we shall focus on developing a formal framework for a suitable data model and a query model. In the following year, we shall carry out experiments based on this framework. The details of our research program and approach are given below:

First Year
The research program for this fiscal year can be divided into following subjects:

  1. Define a unified data model for enabling seamless search across a heterogeneous data collection in a unified fashion. A rooted ordered tree is in consideration as it can represent most personal information (folder hierarchy, structural contents of a file, unstructured data such as music, image files and so on). This representation may be somewhat less versatile but, nonetheless, simpler than a graph model. Our model may be unable to represent shortcuts (links) of files/folders, and references in LaTeX/XML files. However, this will have little effect on most practical search applications. The simplicity of our model, on the other hand, enables us to develop a flexible and powerful query model.
  2. Develop a unified query framework that offers ease-of-use, and at the same time guarantees effective and efficient retrieval of personal information. The core components of our query framework will be:
    Figure 2
    Fig.2 An integrated DB/IR query framework.

    • IR-like query interface: We intend to retain a simple IR-like query interface - similar to the one most naive users are familiar with. A query is expressed in terms a set of keywords with an additional set of filter conditions. We shall explore a class of filter conditions with a special property as a means for reducing a large number of potentially irrelevant candidates, thereby achieving a significant performance gain. We shall explore and define what this special property should be.
    • DB-like query processing engine: Instead of developing a new query engine from scratch, we shall extend our past query model that we developed for non-traditional data. We shall investigate how this extension can be accomplished with minimum effort. Our approach to query processing has always been algebraic, which has a number of advantages over other approaches such as algorithmic; the most important one being its ability to provide opportunities for logical query optimization. Our main interest will be how to achieve logical optimization. We shall investigate what kind of equivalence rules are available for algebraic manipulation of our query operations.
    • IR-like post-processor: This component is intended for computing the score of each candidate in an answer set, ranking the candidate answers in the order of relevance, and then presenting the final answer to the user. This will be developed in the following year.
  3. We shall also start developing a prototype system based on our formal model.

Second Year
The research program for this fiscal year can be divided into following subjects:

  1. Develop a novel effective means of presenting an answer set. Answers to queries posed over our dataspace will be different from traditional ones. As our dataspace usually consists of extremely heterogeneous collections of data, an answer set to a query is highly likely to include candidate answers comprising units of various data sources. We intend to deal this issue by setting up:
    • support for integrated search result: As both our data model and our query processing engine, will treat each and every unit of our data source uniformly, we can expect that this kind of integrated result will be well-supported in our framework.
    • a novel ranking model: This is necessary for graceful handling of heterogeneity in answer set, thereby providing the best ranked result. Several unconventional factors (e.g. type and size/length of data unit, keywords proximity among data units) are in consideration for this ranking model.
  2. Carry out empirical evaluation of our model. The experimentations will be carried out on a real personal dataspace. A Java-based full system implementation is under consideration. A relational database will be used for storing and updating dataspace indices. Three main objectives of this evaluation are:
    1. how much performance gain can be obtained by our logical optimization;
    2. how effective is the model in achieving a high recall without severe loss in precision, and
    3. how elegantly our ranking model can handle data heterogeneity.
  3. Publish our findings in a database-related international conference (e.g. SIGMOD or VLDB).

We intend to broaden the scope of our research agendas concerning information explosion even after completion of this project. In particular, we shall focus on several modern application areas such as enterprise and government data management, digital libraries, WWW and so on, where data heterogeneity will be of great concern.