may 2013 hug: apache sqoop 2 - a next generation of data transfer tools

14
1 Apache Sqoop 2: The next generation of data transfer Hari Shreedharan Software Engineer, Cloudera Contributor, Apache Sqoop Committer/PMC Member, Apache Flume

Upload: yahoo-developer-network

Post on 26-Jan-2015

113 views

Category:

Technology


3 download

DESCRIPTION

Apache Sqoop 2 is the next generation of the massively successful open source tool designed to transfer data between traditional SQL databases and warehouses into Apache Hadoop. Sqoop 2 is designed as a client-server system with a repository which stores connection and job information. Sqoop 2 is designed to support secure job submission and multiple different roles for users. In this talk, we will discuss the issues users faced in Sqoop 1, and the design of Sqoop 2 and how the issues faced in Sqoop 1 are being handled in Sqoop 2. Presenter(s): Hari Shreedharan, Software Engineer, Cloudera

TRANSCRIPT

Page 1: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

1

Apache Sqoop 2: The next generation of data transfer

Hari Shreedharan

Software Engineer, ClouderaContributor, Apache SqoopCommitter/PMC Member, Apache Flume

Page 2: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

2

What is Sqoop?

• Apache Top-Level Project• SQl to hadOOP• Tool to transfer data from relational databases

• Teradata, MySQL, PostgreSQL, Oracle, Netezza• To Hadoop ecosystem

• HDFS (text, sequence file), Hive, HBase, Avro• And vice versa

Page 3: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Why Sqoop?

• Efficient/Controlled resource utilization• Concurrent connections, Time of operation

• Datatype mapping and conversion• Automatic, and User override

• Metadata propagation• Sqoop Record• Hive Metastore• Avro

3

Page 4: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Sqoop 1

4

Page 5: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Sqoop 1

• Based on Connectors• Responsible for Metadata lookups, and Data Transfer• Majority of connectors are JDBC based• Non-JDBC (direct) connectors for optimized data transfer

• Connectors responsible for all supported functionality• HBase Import, Avro Support, ...

5

Page 6: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

6

Sqoop 1 Challenges

• Cryptic, contextual command line arguments• Security concerns• Type mapping is not clearly defined• Client needs access to Hadoop binaries/configuration

and database• JDBC model is enforced

Page 7: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Sqoop 1 Challenges

• Non-uniform functionality• Different connectors support different capabilities

• Overlapped/Duplicated functionality• Different connectors may implement same capabilities

differently• High coupling with Hadoop

• Database vendors required to understand Hadoop idiosyncrasies in order to build connectors.

7

Page 8: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Sqoop 2

8

Page 9: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Sqoop 2 – Design Goals

• Security and Separation of Concerns• Role based access and use

• Ease of extension• No low-level Hadoop knowledge needed • No functional overlap between Connectors

• Ease of Use• Uniform functionality• Domain specific interactions

9

Page 10: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

10

Sqoop 2: Connection vs Job metadata

There are two distinct sets of options to pass into Sqoop:

Job (distinct per table)Connection (distinct per database)

Page 11: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Sqoop 2: Workings

• Connectors register metadata• Metadata enables creation of Connections and Jobs• Connections and Jobs stored in Metadata Repository• Operator runs Jobs that use appropriate connections• Admins set policy for connection use

11

Page 12: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

12

Sqoop 2: Security

• Support for secure access to external systems via role-based access to connection objects

• Administrators create/edit/delete connections• Operators use connections

Page 13: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

13

Current Status: Sqoop 2

• Primary focus of the Sqoop Community • Current Release: 1.99.2

• bits and docs: http://sqoop.apache.org/

Page 14: May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

14