Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Sqoop 2 Refactoring for generic data transfer Abraham Elmahrek

Upload: aaamase

Post on 01-Dec-2014



DESCRIPTION

Sqoop2 is Sqoop as a service. Its focus is on ease of use, ease of extensibility, and security. Recently, Sqoop2 was refactored to handle generic data transfer needs.

TRANSCRIPT

Page 1: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Sqoop 2: Refactoring for generic data transfer

Abraham Elmahrek

Page 2: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Cloudera Ingest!

Page 3: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Introduction to Sqoop 2

Ease of use: Provide a REST API and Java API for easy integration. Existing clients include a Hue UI and a command line client.

Extensible: Provide a connector SDK and focus on pluggability. Existing connectors include Generic JDBC connector and HDFS connector.

Security: Emphasize separation of responsibilities. Eventually have ACLs or RBAC.

Page 4: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Life of a Request

• Client
  – Talks to server over REST + JSON
  – Does nothing but send requests

• Server
  – Extracts metadata from data source
  – Delegates to execution engine
  – Does all the heavy lifting really

• MapReduce
  – Parallelizes execution of the job

Page 5: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Workflow

Page 6: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Job Types

IMPORT into Hadoop and EXPORT out of Hadoop

Page 7: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Responsibilities

Connector responsibilities Sqoop framework responsibilities

Transfer data from Connector A to Hadoop

Page 8: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Connector Definitions

• Connectors define:
  – How to connect to a data source
  – How to extract data from a data source
  – How to load data to a data source

public Importer getImporter(); // Supply extract method

public Exporter getExporter(); // Supply load method

public Class getConnectionConfigurationClass();

public Class getJobConfigurationClass(MJob.Type type); // MJob.Type is IMPORT or EXPORT
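The four methods on this slide can be read as a small contract that each connector fulfils. The following is only a minimal sketch of that idea, not the Sqoop source: `Importer`, `Exporter`, `JobType`, `SketchConnector`, the three config classes and `InMemoryConnector` are all hypothetical stand-ins defined here so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the types named on the slide (not the real Sqoop classes).
interface Importer { String extract(); }          // supplies the extract method
interface Exporter { void load(String record); }  // supplies the load method
enum JobType { IMPORT, EXPORT }                   // stands in for MJob.Type

// The connector contract, mirroring the four methods listed on the slide.
abstract class SketchConnector {
    abstract Importer getImporter();
    abstract Exporter getExporter();
    abstract Class<?> getConnectionConfigurationClass();
    abstract Class<?> getJobConfigurationClass(JobType type);
}

// Hypothetical configuration holders a connector might expose.
class ConnectionConfig { String connectionString; }
class ImportJobConfig { String tableName; }
class ExportJobConfig { String tableName; }

// A toy connector that "extracts" a fixed record and "loads" into a list.
class InMemoryConnector extends SketchConnector {
    final List<String> loaded = new ArrayList<>();

    @Override Importer getImporter() { return () -> "1,alice"; }

    @Override Exporter getExporter() { return record -> loaded.add(record); }

    @Override Class<?> getConnectionConfigurationClass() { return ConnectionConfig.class; }

    @Override Class<?> getJobConfigurationClass(JobType type) {
        // Each job type gets its own configuration class.
        return type == JobType.IMPORT ? ImportJobConfig.class : ExportJobConfig.class;
    }
}
```

In this sketch the framework would only ever talk to `SketchConnector`, which is what lets new data sources be plugged in without changing the framework.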

Page 9: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Intermediate Data Format

• Describe a single record as it moves through Sqoop
• Currently available:
  – CSV

col1,col2,col3,...
col1,col2,col3,...
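To make the CSV form of a record concrete, here is a deliberately simplified sketch. `CsvRecordSketch` and its methods are hypothetical names, and the code assumes field values never contain commas, quotes or newlines; a real intermediate data format would need quoting and escaping rules that this example leaves out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: one record <-> one comma-separated line.
class CsvRecordSketch {
    // Join the fields of a single record into one CSV line (no escaping).
    static String toCsv(List<String> fields) {
        return String.join(",", fields);
    }

    // Split a CSV line back into fields; the -1 limit keeps trailing empty fields.
    static List<String> fromCsv(String line) {
        return Arrays.asList(line.split(",", -1));
    }
}
```

The point of a shared text form is that an extractor can emit such lines and a loader can consume them without either side knowing about the other.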

Page 10: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

What’s Wrong w/ Current Implementation?

• Treating Hadoop as a first-class citizen prevents transfers between components in the Hadoop ecosystem
  – HBase to HDFS not supported
  – HDFS to Accumulo not supported

• Hadoop ecosystem not well defined
  – Accumulo was not considered part of the Hadoop ecosystem
  – What’s next? Kafka?

Page 11: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Refactoring

• Connectors already defined extractors and loaders
  – Refactor the connector SDK

• Pull out HDFS integration to a connector
• Improve Schema integration

Transfer data from Connector A to Connector B

Page 12: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Connector SDK

• Connectors assume all roles
• Add Direction for FROM and TO
• Initializers and destroyers for both directions

Connector responsibilities
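One way to picture the refactored SDK is a connector that reports which directions it supports and hands out an initializer and a destroyer per direction. This is a sketch under assumptions, not the actual SDK: `Direction` is taken from the slide, while `Initializer`, `Destroyer`, `DirectionalConnector`, `LoggingConnector` and `TransferSketch` are hypothetical types written for this example.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum Direction { FROM, TO }

// Hypothetical lifecycle hooks, one pair per direction.
interface Initializer { void initialize(); }
interface Destroyer { void destroy(); }

abstract class DirectionalConnector {
    abstract Set<Direction> supportedDirections();
    abstract Initializer getInitializer(Direction direction);
    abstract Destroyer getDestroyer(Direction direction);
}

// Toy connector that records its lifecycle events instead of touching a data source.
class LoggingConnector extends DirectionalConnector {
    final List<String> events = new ArrayList<>();

    @Override Set<Direction> supportedDirections() {
        return EnumSet.of(Direction.FROM, Direction.TO);
    }
    @Override Initializer getInitializer(Direction d) { return () -> events.add("init " + d); }
    @Override Destroyer getDestroyer(Direction d) { return () -> events.add("destroy " + d); }
}

class TransferSketch {
    // Runs the lifecycle for a transfer from one connector to another.
    static void run(DirectionalConnector from, DirectionalConnector to) {
        if (!from.supportedDirections().contains(Direction.FROM)
                || !to.supportedDirections().contains(Direction.TO)) {
            throw new IllegalArgumentException("Connector does not support the requested direction");
        }
        from.getInitializer(Direction.FROM).initialize();
        to.getInitializer(Direction.TO).initialize();
        try {
            // Extraction and loading would happen here.
        } finally {
            from.getDestroyer(Direction.FROM).destroy();
            to.getDestroyer(Direction.TO).destroy();
        }
    }
}
```

Because neither side is hard-wired to Hadoop in this shape, the same `run` call covers any pairing of connectors, which is the "Connector A to Connector B" goal of the refactoring.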

Page 13: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

HDFS Connector

• Move Hadoop role to connector
• Schemaless
• Data formats
  – Text (CSV)
  – Sequence
  – etc.

Page 14: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Schema Improvements

• Schema per connector
• Intermediate data format (IDF) has a Schema
• Introduce matcher
• Schema represents data as it moves through the system

Page 15: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Matcher

• Matcher ensures data goes to right place
• Combinations
  – FROM and TO schema
  – FROM schema
  – TO schema
  – No schema = Error
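The four combinations above can be written as a small decision function. The sketch below is an interpretation, not the Sqoop implementation: it assumes that when only one side supplies a schema, that schema alone drives the transfer, and it models a schema as a plain list of column names. `SchemaMode` and `SchemaCombinationSketch` are hypothetical names.

```java
import java.util.List;

// Hypothetical labels for the three valid combinations on the slide.
enum SchemaMode { MATCH_FROM_TO, FROM_ONLY, TO_ONLY }

class SchemaCombinationSketch {
    // A schema is modelled as a list of column names; null or empty means "no schema".
    static SchemaMode resolve(List<String> fromSchema, List<String> toSchema) {
        boolean hasFrom = fromSchema != null && !fromSchema.isEmpty();
        boolean hasTo = toSchema != null && !toSchema.isEmpty();
        if (hasFrom && hasTo) {
            return SchemaMode.MATCH_FROM_TO;   // both sides must be matched against each other
        }
        if (hasFrom) {
            return SchemaMode.FROM_ONLY;       // assumption: the FROM schema is used as-is
        }
        if (hasTo) {
            return SchemaMode.TO_ONLY;         // assumption: the TO schema is used as-is
        }
        // "No schema = Error" on the slide.
        throw new IllegalStateException("Neither FROM nor TO provides a schema");
    }
}
```

A schemaless connector such as HDFS would fall into one of the single-schema cases when paired with a connector that does provide a schema.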

Page 16: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Matcher

Matcher types: Location, Name, User defined

Location: Ensure that FROM schema matches TO schema by index location of Schema
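Matching by location can be sketched as pairing the i-th field of a FROM record with the i-th column of the TO schema. This is a minimal illustration with hypothetical names (`LocationMatcherSketch`, `match`); rejecting records whose field count differs from the TO schema is an assumption of this sketch, and an actual matcher may treat that case differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LocationMatcherSketch {
    // Pairs each FROM field with the TO column at the same index.
    static Map<String, Object> match(List<String> toColumns, Object[] fromRecord) {
        if (toColumns.size() != fromRecord.length) {
            // Assumption: a size mismatch is treated as an error here.
            throw new IllegalArgumentException(
                "Record has " + fromRecord.length + " fields but TO schema has "
                + toColumns.size() + " columns");
        }
        Map<String, Object> matched = new LinkedHashMap<>(); // keeps column order
        for (int i = 0; i < fromRecord.length; i++) {
            matched.put(toColumns.get(i), fromRecord[i]);
        }
        return matched;
    }
}
```

Column names play no part in this pairing; only positions do, which is what distinguishes it from matching by name.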

Page 17: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Check out http://ingest.tips for general ingest

Page 18: Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup

Thank you