It was not long ago that database systems were revolutionized by the birth of
relational concepts and theory, which are now materialized in most commercial
database management systems. It was then thought that data should be structured
in a rather simple way to support day-to-day business operations. This
has been made possible and supported by mature transaction management
concepts and mathematically grounded query languages and processing.
Whilst this remains the backbone of businesses today, it cannot be ignored that
most data in the world is not structured, or at least not easily and conveniently
structured, in a relational way.
The rise of the Internet has undeniably contributed to the explosion of
data in the digital universe. It was reported not long ago that there were
over 1 trillion web pages – roughly 150 web pages for every man,
woman, and child on Earth – and the number of web pages grows by billions per
day. Another good example of data explosion in the digital universe is
Facebook: the number of Facebook users has surpassed
the 500-million mark and is still growing strongly. If Facebook users were
citizens of a country called Facebook, it would be the third-largest
country in the world, after China and India. And this kind of data is not easily
categorized as structured data, since it is mostly free of rigid and constraining
structures.
The rapidly growing volume of digital documents in various formats, and the possibility of accessing them through Internet-based technologies, have led to the need for solid methods to properly organize and structure documents in large digital libraries and repositories. Because of the extremely large volumes of documents and their unstructured form, most research efforts in this direction are dedicated to automatically inferring structure and schemas that can help to better organize huge collections of documents and data.

This book covers the latest advances in structure inference in heterogeneous collections of documents and data. It brings a comprehensive view of the state of the art in the area, presents lessons learned, and identifies new research issues, challenges, and opportunities for the further research agenda and developments. The selected chapters cover a broad range of research issues, from theoretical approaches to case studies and best practices in the field. Researchers, software developers, practitioners, and students interested in the field of learning structure and schemas from documents will find the comprehensive coverage of this book useful for their research, academic, development, and practice activities.