You may be reading this book for many reasons. It could be because you heard all about
Hadoop and what it can do to crunch petabytes of data in a reasonable amount of time.
While reading into Hadoop you found that, for random access to the accumulated data,
there is something called HBase. Or perhaps it was the hype, prevalent these days,
around a new kind of data storage architecture, one that strives to solve large-scale
data problems where traditional solutions may be either too involved or cost-prohibitive.
A common term used in this area is NoSQL.

No matter how you have arrived here, I presume you want to know and learn—like I
did not too long ago—how you can use HBase in your company or organization to
store a virtually endless amount of data. You may have a background in relational
database theory or you want to start fresh and this “column-oriented thing” is something
that seems to fit the bill. You may also have heard that HBase can scale without much
effort, and that alone is reason enough to look at it, since you are building the next
web-scale system.

I was at that point in late 2007 when I was facing the task of storing millions of documents
in a system that needed to be fault-tolerant and scalable while still being maintainable
by just me. I had decent skills in managing a MySQL database system, and was
using the database to store data that would ultimately be served to our website users.
This database was running on a single server, with another as a backup. The issue was
that it would not be able to hold the amount of data I needed to store for this new
project. I would have to either invest in serious RDBMS scalability skills or find
something else.

Obviously, I took the latter route, and since my mantra has always been “How does
someone like Google do it?”, I came across Hadoop. After a few attempts to use
Hadoop directly, I was faced with implementing a random access layer on top of it—
but that problem had been solved already: in 2006, Google had published a paper
titled “Bigtable”* and the Hadoop developers had an open source implementation of it
called HBase (the Hadoop Database). That was the answer to all my problems. Or so
it seemed...