In the early days of pattern recognition and statistical data analysis life was
rather simple: datasets were relatively small, collected from well-designed experiments,
analyzed using a few methods that had good theoretical background.
Explosive growth of the use of computers led to the creation of huge amounts of
data of all kinds, coming from business, finance, scientific experiments, medical
evaluations, Web-related data, multimedia, various imaging techniques, sensor
networks, and many others sources. Human experts are not capable of deep
exploration of large amounts of data, especially when an expertize in several
different areas is necessary for this purpose.
The need for scalable algorithms applicable for massive data sets, discovering
novel patterns in data, contributed to the growing interest in data mining and
knowledge discovery. It led to the development of new machine learning techniques,
including the use of inspirations from nature to develop evolutionary,
neural and fuzzy algorithms for data analysis and understanding. Computational
intelligence became a field in its own, combining all these methods, and
the number of algorithms available for data analysis was rapidly growing. These
algorithms started to be collected in larger data mining packages, with a free
package called Weka starting the trend, and many others that soon followed.
These packages added better user interfaces, environments to build data flow
schemes by connecting various modules and test various algorithms. However,
the number of modules at each stage: pre-processing, data acquisition, feature
selection and construction, instance selection, classification, association and approximation
methods, optimization techniques, pattern discovery, clusterization,
visualization and post-processing became unmanageably large. A large data mining
package allows for billions of combinations, making the process of knowledge
discovery increasingly difficult. Gone are the days when the life of a data miner
was simple and a background course in multivariate statistics was all that was
needed to do the job.