There are already many computer vision textbooks, and it is reasonable to question the need for another. Let me explain why I chose to write this volume.
Computer vision is an engineering discipline; we are primarily motivated by the real-world concern of building machines that see. Consequently, we tend to categorize our knowledge by the real-world problem that it addresses. For example, most existing vision textbooks contain chapters on object recognition and stereo vision. The sessions at our research conferences are organized in the same way. The role of this book is to question this orthodoxy: Is this really the way that we should organize our knowledge?
Consider the topic of object recognition. A wide variety of methods have been applied to this problem (e.g., subspace models, boosting methods, bag of words models, and constellation models). However, these approaches have little in common. Any attempt to describe the grand sweep of our knowledge devolves into an unstructured list of techniques. How can we make sense of it all for a new student? I will argue for a different way to organize our knowledge, but first let me tell you how I see computer vision problems.
We observe an image and from this we extract measurements. For example, we might use the RGB values directly or we might filter the image or perform some more sophisticated preprocessing. The vision problem or goal is to use the measurements to infer the world state. For example, in stereo vision we try to infer the depth of the scene. In object detection, we attempt to infer the presence or absence of a particular class of object.
To accomplish the goal, we build a model. The model describes a family of statistical relationships between the measurements and the world state. The particular member of that family is determined by a set of parameters. In learning we choose these parameters so they accurately reflect the relationship between the measurements and the world. In inference we take a new set of measurements and use the model to tell us about the world state. The methods for learning and inference are embodied in algorithms. I believe that computer vision should be understood in these terms: the goal, the measurements, the world state, the model, the parameters, and the learning and inference algorithms.
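To make these terms concrete, here is a small illustrative sketch (mine, not from the book) that casts the simplest possible example, one-dimensional linear regression, in this vocabulary: the measurements are scalars x, the world state is a scalar w, the model family is w ≈ ax + b, and the parameters are θ = (a, b). Learning fits θ to observed pairs by least squares; inference applies the fitted model to a new measurement.

```python
# Illustrative sketch only: a minimal "model" in the sense described
# above, using 1-D linear regression as a toy example. Measurements x,
# world state w; the model family is w = a*x + b + noise, with
# parameters theta = (a, b).

def learn(xs, ws):
    """Learning: choose the parameters (a, b) that best explain the
    observed measurement/world-state pairs (least-squares fit)."""
    n = len(xs)
    mx = sum(xs) / n
    mw = sum(ws) / n
    a = sum((x - mx) * (w - mw) for x, w in zip(xs, ws)) / \
        sum((x - mx) ** 2 for x in xs)
    b = mw - a * mx
    return a, b

def infer(theta, x_new):
    """Inference: use the learned model to predict the world state
    from a new measurement."""
    a, b = theta
    return a * x_new + b

# Toy data in which the world state is twice the measurement plus one.
theta = learn([0, 1, 2, 3], [1, 3, 5, 7])
print(infer(theta, 4))  # prints 9.0
```

The same two-step structure (fit parameters, then query the model) recurs throughout the book; only the family of statistical relationships changes from chapter to chapter.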
We could choose to organize our knowledge according to any of these quantities, but in my opinion what is most critical is the model itself: the statistical relationship between the world and the measurements. There are three reasons for this. First, the model type often transcends the application (the same model can be used for diverse vision tasks). Second, the models naturally organize themselves neatly into distinct families (e.g., regression, Markov random fields, camera models) that can be understood in relative isolation. Finally, discussing vision on the level of models allows us to draw connections between algorithms and applications that initially appear unrelated. Accordingly, this book is organized so that each main chapter considers a different family of models.
On a final note, I should say that I found most of the ideas in this book very hard to grasp when I was first exposed to them. My goal was to make the process easier for the students who follow the same path; I hope that this book achieves that aim and inspires the reader to learn more about computer vision.