It’s been four years since, via a post to the Apache JIRA, the first version of Sqoop was
released to the world as an addition to Hadoop. Since then, the project has taken several
turns, most recently landing as a top-level Apache project. I’ve been amazed at how
many people use this small tool for a variety of large tasks. Sqoop users have imported
everything from humble test data sets to mammoth enterprise data warehouses into the
Hadoop Distributed Filesystem, HDFS. Sqoop is a core member of the Hadoop ecosystem,
and plug-ins are provided and supported by several major SQL and ETL vendors.
And Sqoop is now part of integral ETL and processing pipelines run by some of
the largest users of Hadoop.

The software industry moves in cycles. At the time of Sqoop’s origin, a major concern
was in “unlocking” data stored in an organization’s RDBMS and transferring it to Hadoop.
Sqoop enabled users with vast troves of information stored in existing SQL tables
to use new analytic tools like MapReduce and Apache Pig. As Sqoop matures, a renewed
focus on SQL-oriented analytics continues to make it relevant: systems like Cloudera
Impala and Dremel-style analytic engines offer powerful distributed analytics with SQL-based
languages, using the common data substrate offered by HDFS.

The variety of data sources and analytic targets presents a challenge in setting up effective
data transfer pipelines. Data sources can have a variety of subtle inconsistencies:
different DBMS providers may use different dialects of SQL, treat data types differently,
or use distinct techniques to offer optimal transfer speeds. Depending on whether you’re
importing to Hive, Pig, Impala, or your own MapReduce pipeline, you may want to use
a different file format or compression algorithm when writing data to HDFS. Sqoop
helps the data engineer tasked with scripting such transfers by providing a compact but
powerful tool that flexibly negotiates the boundaries between these systems and their
data layouts.
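
By way of illustration only, a single Sqoop invocation can select both the on-disk format and the compression codec for an import; the connection string, credentials, table, and target directory below are placeholders, not examples drawn from this book:

    # Hypothetical import: database URL, user, table, and paths are placeholders.
    # Writes Avro files compressed with Snappy into an HDFS directory.
    sqoop import \
      --connect jdbc:mysql://db.example.com/corp \
      --username analyst \
      --table employees \
      --target-dir /data/warehouse/employees \
      --as-avrodatafile \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec

Swapping the format or codec options is enough to tailor the same transfer to a different downstream consumer, which is the kind of flexibility the chapters that follow explore in detail.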