Sqoop 2 Refactoring for Generic Data Transfer - Hadoop Strata Sqoop Meetup


DESCRIPTION

Sqoop2 is Sqoop as a service. Its focus is on ease of use, ease of extensibility, and security. Recently, Sqoop2 was refactored to handle generic data transfer needs.

TRANSCRIPT

Sqoop 2: Refactoring for generic data transfer

Abraham Elmahrek

Cloudera

Ingest!

Introduction to Sqoop 2

Provide a REST API and Java API for easy integration. Existing clients include a Hue UI and a command line client.

Provide a connector SDK and focus on pluggability. Existing connectors include Generic JDBC connector and HDFS connector.

Emphasize separation of responsibilities. Eventually have ACLs or RBAC.

Ease of use, extensibility, security

Life of a Request

• Client
  – Talks to the server over REST + JSON
  – Does nothing but send requests (a sketch follows after this list)

• Server
  – Extracts metadata from the data source
  – Delegates to the execution engine
  – Does all the heavy lifting, really

• MapReduce
  – Parallelizes execution of the job
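A minimal sketch of what such a request looks like on the wire, assuming a Sqoop 2 server at localhost:12000 with the default /sqoop context and a version resource (the exact host, port, and path depend on the deployment); the Java API and the Hue UI wrap calls like this.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SqoopRestSketch {
  public static void main(String[] args) throws Exception {
    // Assumed server location and resource path; adjust for the actual deployment.
    URL url = new URL("http://localhost:12000/sqoop/version");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");

    // The client stays thin: it only sends the request and prints the JSON reply;
    // metadata extraction and job execution happen on the server side.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}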

Workflow

Job Types

IMPORT into Hadoop and EXPORT out of Hadoop

Responsibilities

Responsibilities are split between the connector and the Sqoop framework.

Transfer data from Connector A to Hadoop

Connector Definitions

• Connectors define:
  – How to connect to a data source
  – How to extract data from a data source
  – How to load data to a data source

public Importer getImporter(); // Supply extract method

public Exporter getExporter(); // Supply load method

public Class getConnectionConfigurationClass();

public Class getJobConfigurationClass(MJob.Type type); // MJob.Type is IMPORT or EXPORT
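As a rough illustration of how these four methods fit together, here is a standalone sketch of a connector. The class names (MyConnector, MyConnectionConfig, and so on) and the JobType enum are stand-ins invented for this example, not the actual SDK types.

// Illustrative stand-ins for the SDK types named above; not the real Sqoop classes.
class Importer { /* wraps an extractor: how to read from the data source */ }
class Exporter { /* wraps a loader: how to write to the data source */ }
class MyConnectionConfig { String url; String username; String password; }
class MyImportJobConfig { String tableName; }
class MyExportJobConfig { String targetTable; }

enum JobType { IMPORT, EXPORT } // stand-in for MJob.Type

public class MyConnector {
  public Importer getImporter() { return new Importer(); } // supplies the extract side
  public Exporter getExporter() { return new Exporter(); } // supplies the load side

  public Class getConnectionConfigurationClass() {
    return MyConnectionConfig.class;
  }

  public Class getJobConfigurationClass(JobType type) {
    // IMPORT and EXPORT jobs can expose different configuration options.
    return type == JobType.IMPORT ? MyImportJobConfig.class : MyExportJobConfig.class;
  }
}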

Intermediate Data Format

• Describe a single record as it moves through Sqoop
• Currently available:
  – CSV, where each record is one delimited text line (sketched below):

col1,col2,col3,...
col1,col2,col3,...
...
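For illustration only, a small sketch of what a CSV-style intermediate format boils down to: each record is carried as one delimited text line. This is not the actual CSV intermediate data format class, and real CSV handling also needs quoting and escaping.

import java.util.Arrays;

public class CsvRecordSketch {
  // Serialize one record's fields into a single CSV text line.
  static String toCsv(Object[] fields) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(fields[i]);
    }
    return sb.toString();
  }

  // Split a CSV text line back into field values (no quoting/escaping for brevity).
  static String[] fromCsv(String line) {
    return line.split(",", -1);
  }

  public static void main(String[] args) {
    String line = toCsv(new Object[] {1, "alice", "2014-12-01"});
    System.out.println(line);                       // 1,alice,2014-12-01
    System.out.println(Arrays.toString(fromCsv(line)));
  }
}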

What’s Wrong w/ Current Implementation?

• Hadoop as a first class citizen prevents transfers between components in the Hadoop ecosystem
  – HBase to HDFS not supported
  – HDFS to Accumulo not supported

• Hadoop ecosystem not well defined
  – Accumulo was not considered part of the Hadoop ecosystem
  – What’s next? Kafka?

Refactoring

• Connectors already defined extractors and loaders
  – Refactor the connector SDK

• Pull out HDFS integration to a connector
• Improve Schema integration

Transfer data from Connector A to Connector B

Connector SDK

• Connectors assume all roles
• Add Direction for FROM and TO
• Initializers and destroyers for both directions (a sketch follows below)

(Diagram: connector responsibilities)
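A sketch of the refactored shape under assumed names: the Direction enum and the From/To holders below are modeled on the bullets above rather than copied from the SDK. The point is that any connector's FROM side can now be paired with any connector's TO side.

// Assumed, illustrative types; not the SDK's actual classes.
enum Direction { FROM, TO }

class From { /* initializer, extractor, and destroyer for reading out of a data source */ }
class To   { /* initializer, loader, and destroyer for writing into a data source */ }

class JdbcConnectorSketch {
  From getFrom() { return new From(); } // this connector can play the FROM role
  To getTo()     { return new To(); }   // ... and the TO role
}

class HdfsConnectorSketch {
  From getFrom() { return new From(); }
  To getTo()     { return new To(); }
}

public class DirectionSketch {
  public static void main(String[] args) {
    // With explicit directions, the framework can pair any FROM with any TO,
    // e.g. JDBC -> HDFS here, or HDFS -> JDBC by swapping the two calls.
    From from = new JdbcConnectorSketch().getFrom();
    To to = new HdfsConnectorSketch().getTo();
    Direction first = Direction.FROM; // each side is initialized and destroyed separately
    System.out.println("transfer: " + from + " (" + first + ") -> " + to);
  }
}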

HDFS Connector

• Move Hadoop role to connector
• Schemaless
• Data formats (a sketch follows below):
  – Text (CSV)
  – Sequence
  – etc.
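A hedged sketch of the data-format choice; the enum values and names here are illustrative stand-ins, not the HDFS connector's actual configuration classes.

// Illustrative only: the HDFS connector lets a job choose how records are written out.
enum HdfsOutputFormat { TEXT_CSV, SEQUENCE_FILE }

public class HdfsFormatSketch {
  static String describe(HdfsOutputFormat format) {
    switch (format) {
      case TEXT_CSV:      return "write each record as a delimited text line";
      case SEQUENCE_FILE: return "write records into a Hadoop SequenceFile";
      default:            return "unknown format";
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(HdfsOutputFormat.TEXT_CSV));
  }
}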

Schema Improvements

• Schema per connector
• Intermediate data format (IDF) has a Schema
• Introduce matcher
• Schema represents data as it moves through the system (an illustrative sketch follows)
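An illustrative sketch (plain classes, not the Sqoop Schema API) of a schema that travels with the records as they move from the FROM side to the TO side.

import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins: a schema is an ordered, named list of columns that
// accompanies the intermediate data format as records move through the system.
class ColumnSketch {
  final String name;
  final String type;
  ColumnSketch(String name, String type) { this.name = name; this.type = type; }
}

class SchemaSketch {
  final String name;
  final List<ColumnSketch> columns = new ArrayList<>();
  SchemaSketch(String name) { this.name = name; }
  SchemaSketch addColumn(String colName, String type) {
    columns.add(new ColumnSketch(colName, type));
    return this;
  }
}

public class SchemaExample {
  public static void main(String[] args) {
    // The FROM connector reports the schema of the data it extracts...
    SchemaSketch from = new SchemaSketch("employees")
        .addColumn("id", "fixed_point")
        .addColumn("name", "text");
    // ...and the IDF carries it along so the TO side and the matcher can use it.
    System.out.println(from.name + " has " + from.columns.size() + " columns");
  }
}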

Matcher

• Matcher ensures data goes to the right place
• Combinations:
  – FROM and TO schema
  – FROM schema
  – TO schema
  – No schema = Error

Matcher

Ensures that the FROM schema matches the TO schema by column index (location) within the Schema (a sketch follows below).


• Matcher types: Location, Name, User defined
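A self-contained sketch of what location (index-based) matching amounts to; names and behavior here are illustrative, and the real matcher also handles the name-based and user-defined cases plus the no-schema error.

import java.util.Arrays;

public class LocationMatcherSketch {
  // Map a FROM record onto the TO schema purely by column position (index).
  // Illustrative only: extra TO columns stay null, extra FROM columns are dropped.
  static Object[] matchByLocation(Object[] fromRecord, int toColumnCount) {
    Object[] toRecord = new Object[toColumnCount];
    for (int i = 0; i < toColumnCount && i < fromRecord.length; i++) {
      toRecord[i] = fromRecord[i]; // column i of FROM feeds column i of TO
    }
    return toRecord;
  }

  public static void main(String[] args) {
    Object[] fromRecord = {1, "alice", "2014-12-01"};
    System.out.println(Arrays.toString(matchByLocation(fromRecord, 2)));
    // prints [1, alice]
  }
}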

Check out http://ingest.tips for general ingest topics

Thank you
