A range of electronic corpora has become increasingly accessible via the WWW and CD-ROM. This development coincided with improvements in the standards governing the collecting, encoding and archiving of such data. Less attention, however, has been paid to making other types of digital data available. This is especially true of data that one might describe as 'unconventional', namely dialect, child language and bilingual databases. This book is a first step toward developing similar standards for enriching and preserving these neglected resources.
Only two or three decades ago, those of us who had the patience and the wherewithal to construct a computerized corpus of recorded speech, however clunky, were the envy of our colleagues. In those days, linguists interested in quantitative analysis simply slogged through their audio-tapes, extracting unfathomable quantities of data by hand. Cedergren, to name but one notable example, analyzed 53,038(!) tokens of phonological variables, culled individually from her tapes, in her 1973 analysis of Panamanian Spanish.
The gold standard for transcribed corpora at the time was the concordance, possessed by a fortunate few, and coveted by all who were doomed to manual extraction. Of course the vintage concordance was largely limited to lexically-based retrieval, but at least it was searchable. The papers that Joan Beal, Karen Corrigan and Hermann Moisl have assembled in these companion volumes are eloquent testimony to how far the field of corpus linguistics – now rife with electronic corpora – has come in so short a time.