Text databases are becoming larger and larger, the best example being the
World Wide Web (or just Web). For this reason, the importance of information
retrieval (IR) and related topics, such as text mining, is increasing every
day [Baeza-Yates & Ribeiro-Neto, 1999]. However, doing experiments on large
text collections is not easy unless the Web is used. In fact, although reference
collections such as TREC [Harman, 1995] are very useful, their size is several
orders of magnitude smaller than that of large databases. Therefore, scaling is an
important issue. One partial solution to this problem is to have good models
of text databases to be able to analyze new indices and searching algorithms
before making the effort of trying them at a large scale, particularly if our
application is searching the Web. The goals of this article are twofold: (1) to
present in an integrated manner many different results on how to model natural
language text and document collections, and (2) to show their relations,
consequences, advantages, and drawbacks.
We can distinguish three types of models: (1) models for static databases,
(2) models for dynamic databases, and (3) models for queries and their answers.
Models for static databases are the classical ones for natural language
text. They are based in empirical evidence and include the number of different
words or vocabulary (Heaps’ law), word distribution (Zipf’s law), word
length, distribution of document sizes, and distribution of words in documents.
We formally relate Heaps' and Zipf's empirical laws and show that they
can be explained from a simple finite state model.
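The two laws mentioned above can be observed empirically. As an illustrative sketch (not code from the article), the following measures vocabulary growth, which Heaps' law models as V(n) ≈ K·n^β with β < 1, and the rank-frequency distribution, which Zipf's law models as f(r) ∝ 1/r^θ, over a synthetic token stream; the Zipfian generator, lexicon size, and sample size are all assumptions chosen for the demonstration.

```python
import collections
import math
import random

def heaps_curve(tokens):
    """Return (n, V(n)) pairs: vocabulary size after each token."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((i, len(seen)))
    return curve

def zipf_ranks(tokens):
    """Return word frequencies sorted in decreasing rank order."""
    counts = collections.Counter(tokens)
    return sorted(counts.values(), reverse=True)

# Synthetic text: draw words with probability proportional to 1/rank
# (a Zipfian source with theta = 1) over an assumed 10,000-word lexicon.
random.seed(42)
lexicon = 10_000
weights = [1.0 / r for r in range(1, lexicon + 1)]
tokens = random.choices(range(lexicon), weights=weights, k=50_000)

curve = heaps_curve(tokens)
n, V = curve[-1]
n_mid, V_mid = curve[len(curve) // 2]
# Heaps' law predicts V(n) ~ K * n**beta with beta < 1: the vocabulary
# grows sublinearly, so doubling the text far less than doubles it.
# A crude two-point estimate of beta:
beta = math.log(V / V_mid) / math.log(n / n_mid)
print(f"tokens={n}, vocabulary={V}, estimated beta={beta:.2f}")

freqs = zipf_ranks(tokens)
# Zipf's law predicts f(r) ~ c / r**theta; with theta = 1 the ratio
# f(1)/f(10) should be close to 10.
print(f"f(1)/f(10) = {freqs[0] / freqs[9]:.1f}")
```

The sublinear vocabulary growth (β well below 1) and the roughly 1/r frequency decay are exactly the regularities the static models formalize.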
Dynamic databases can be handled by extensions of static models, but there
are several issues that have to be considered. The models for queries and their
answers have not been formally developed until now. What are the correct
assumptions? What is a random query? How many occurrences of a query are
found? We propose specific models to answer these questions.
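To make the question "what is a random query?" concrete, here is a purely illustrative sketch (these are naive baselines, not necessarily the models proposed in the article): a query word can be drawn uniformly from the vocabulary, or proportionally to its frequency in the text, and the expected number of occurrences found differs sharply between the two. The Zipfian text source and all sizes are assumptions for the demonstration.

```python
import collections
import random

random.seed(7)
lexicon = 1_000
weights = [1.0 / r for r in range(1, lexicon + 1)]  # Zipfian text source
text = random.choices(range(lexicon), weights=weights, k=100_000)
counts = collections.Counter(text)

def expected_occurrences(sample_query, trials=2_000):
    """Average number of occurrences in the text of a sampled query word."""
    return sum(counts[sample_query()] for _ in range(trials)) / trials

uniform_query = lambda: random.randrange(lexicon)  # uniform over vocabulary
text_query = lambda: random.choice(text)           # proportional to frequency

print("uniform over vocabulary:", expected_occurrences(uniform_query))
print("proportional to text frequency:", expected_occurrences(text_query))
# A frequency-weighted query tends to hit common words, so on average it
# finds far more occurrences than a uniformly chosen vocabulary word.
```

The large gap between the two averages shows why the choice of query model matters when predicting answer sizes.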