The goal of this book is to provide an overview of the current state of knowledge of educational
data mining (EDM). The primary goal of EDM is to use large-scale educational data
sets to better understand learning and to provide information about the learning process.
Although researchers have been studying human learning for over a century, EDM differs in
that it does not rely on experimental subjects learning a contrived task for 20 minutes in
a lab setting; rather, it typically uses data from students learning school subjects,
often over the course of an entire school year. For example, it is possible to observe students
learning a skill over an eight-month interval and make discoveries about what types of
activities result in better long-term learning, to learn how the time at which students
start their homework affects classroom performance, or to understand how the length of
time students spend reading feedback on their work impacts the quality of their later efforts.
In order to conduct EDM, researchers use a variety of sources of data such as intelligent
computer tutors, classic computer-based educational systems, online class discussion
forums, electronic teacher gradebooks, school-level data on student enrollment, and standardized
tests. Many of these sources have existed for decades or, in the case of standardized
testing, about 2000 years. What has recently changed is the rapid improvement in
storage and communication provided by computers, which greatly simplifies the task of
collecting and collating large data sets. This explosion of data has revolutionized the way
we study the learning process.
In many ways, this change parallels that of bioinformatics 20 years earlier: an explosion
of available data revolutionized the way much research in biology was conducted. However,
the larger volume of data was only part of the story. It was also necessary to discover,
adapt, or invent computational techniques for analyzing and understanding this new,
vast quantity of data. Bioinformatics did this by applying computer science techniques
such as data mining and pattern recognition to the data, and the result has revolutionized
research in biology. Similarly, EDM has the necessary sources of data. More and more
schools are using educational software that is capable of recording for later analysis every
action by the student and the computer. Within the United States, an emphasis on educational
accountability and high stakes standardized tests has resulted in large electronic
databases of student performance. In addition to these data, we need the appropriate computational
and statistical frameworks and techniques to make sense of the data, as well as
researchers to ask the right questions of the data.