apache sqoop: unlocking hadoop for your relational database
DESCRIPTION
Kathleen Ting, Technical Account Manager @ Cloudera and Sqoop Committer Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we'll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.TRANSCRIPT
Unlocking Hadoop for Your Rela4onal DB
Kathleen Ting | @kate_ting Technical Account Manager, Cloudera | Sqoop PMC Member Hadoop User Group UK 10 April 2014
Who Am I?
• Started 3 yr ago as 1st Cloudera Support Eng • Now manages Cloudera’s 2 largest customers
• Sqoop CommiJer, PMC Member • Co-‐Author of the Apache Sqoop Cookbook
What is Sqoop?
• Apache Top-‐Level Project • SQl to hadOOP • Tool to transfer data from rela4onal databases
• Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To/From Hadoop ecosystem • HDFS (text, sequence file), Hive, HBase, Avro
3
Why Sqoop?
• Efficient/Controlled resource u4liza4on • Concurrent connec4ons, Time of opera4on
• Datatype mapping and conversion • Automa4c, and User override
• Metadata propaga4on • Sqoop Record • Hive Metastore • Avro
Agenda
Sqoop 1 • Sqoop 1 Architecture • Sqoop 1 Command Line • Sqoop 1 Examples • Sqoop 1 Challenges • Troubleshoo4ng Sqoop 1 • Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Sqoop 2 • Sqoop 2 Architecture • Sqoop 2 Design Goals • Sqoop 2 UI in Hue Resources
Agenda
Sqoop 1 • Sqoop 1 Architecture • Sqoop 1 Command Line • Sqoop 1 Examples • Sqoop 1 Challenges • Troubleshoo4ng Sqoop 1 • Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Sqoop 2 • Sqoop 2 Architecture • Sqoop 2 Design Goals • Sqoop 2 UI in Hue Resources
Sqoop 1 Architecture
7
Sqoop 1 Command Line
sqoop TOOL PROPS ARG [-- EXTRA] • TOOL: import, export • PROPS
• Hadoop (java) proper4es • -Dwhatever.whenever=yes
• ARG • Generic SQOOP arguments • --table, --connect, ...
• EXTRA • connector specific • --schema (PostgreSQL and Microsoa SQL Server)
Sqoop 1 Example
sqoop import \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop --password sqoop \ --table cities
sqoop export \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop --password sqoop \ --table cities \ --export-dir /temp/cities
Sqoop 1 Challenges
• Cryp4c, contextual command line arguments • Security concerns • Type mapping is not clearly defined • Client needs access to Hadoop binaries/configura4on and database
• JDBC model is enforced
10
Troubleshoo4ng Sqoop 1
• Versions: Sqoop, Hadoop, OS, JDBC • Console log aaer running with the --verbose flag
• Capture the en4re output via sqoop import … &> sqoop.log • En4re Sqoop command including the op4ons-‐file if applicable • Expected output and actual output • Table defini4on • Small input data set that triggers the problem
• Especially with export, malformed data is oaen the culprit • Hadoop task logs
• Oaen the task logs contain further informa4on describing the problem • Permissions on input files
Troubleshoo4ng Sqoop 1
Imported table has more rows than source table? • Data contains char used as Hive’s delimiters
• Clean up data • --hive-drop-import-delims
• Removes \n, \t, and \01 char
• --hive-delims-replacement “SPECIAL” • Replaces \n, \t, and \01 char with string SPECIAL
• Not restricted to Hive -‐ any import job using text files • Ensure output files have one line per imported row
Agenda
Sqoop 1 • Sqoop 1 Architecture • Sqoop 1 Command Line • Sqoop 1 Examples • Sqoop 1 Challenges • Troubleshoo4ng Sqoop 1 • Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Sqoop 2 • Sqoop 2 Architecture • Sqoop 2 Design Goals • Sqoop 2 UI in Hue Resources
Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Protec4ng Your Password
sqoop import \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop \ --table cities \ -P
sqoop import \ --connect jdbc:mysql://mysql.example.com/sqoop \ --username sqoop \ --table cities \ --password-file my-sqoop-password
Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Sqoop Works on CLI Not in Oozie
Character parameter '|' has multiple characters; only the first will be used.
Got error creating database manager: java.io.IOException:
No manager for connect string: "jdbc:teradata...”
Sqoop Works on CLI Not in Oozie
sqoop import --password "spEci@l\$" \ –connect 'jdbc:x:/yyy;db=sqoop’
• Remove all escaping that you’ve added for the shell • Use <arg> vs <command> tags as content is considered to be one parameter
• Put all -‐D parameters into configura4on sec4on • Install driver into workflow’s lib/ directory or shared ac4on library /user/oozie/share/lib/sqoop/
Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Choosing Proper Connector
• JDBC driver is dependency for all three connectors
• Sqoop automa4cally chooses most op4mal connector (OraOoop, built-‐in,
Generic JDBC Connector) • Or explicitly chose: --connection-manager com.quest.oraoop.OraOopConnManager
Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Overriding Type Mapping
-‐-‐map-‐column-‐java parameter • comma separated list of key-‐value pairs
• key = exact column name • value = target Java type
sqoop import \
--map-column-java \
c1=Float,c2=String,c3=String ...
Agenda
Sqoop 1 • Sqoop 1 Architecture • Sqoop 1 Command Line • Sqoop 1 Examples • Sqoop 1 Challenges • Troubleshoo4ng Sqoop 1 • Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Sqoop 2 • Sqoop 2 Architecture • Sqoop 2 Design Goals • Sqoop 2 UI in Hue Resources
Sqoop 2 Architecture
25
Sqoop 2 Design Goals
• Security and Separa4on of Concerns • Role based access and use
• Ease of extension • No low-‐level Hadoop knowledge needed • No func4onal overlap between Connectors
• Ease of Use • Uniform func4onality • Domain specific interac4ons
Sqoop 2 UI in Hue
• Troubleshoo4ng • sqoop.log file is located in @LOGDIR@ and the rest should be in server/logs/*
• Look for catalina.out, catalina.log, localhost-‐*.log
28
29
30
31
32
33
34
35
36
37
Agenda
Sqoop 1 • Sqoop 1 Architecture • Sqoop 1 Command Line • Sqoop 1 Examples • Sqoop 1 Challenges • Troubleshoo4ng Sqoop 1 • Common Sqoop 1 Issues
• Protec4ng Your Password • Sqoop Works on CLI Not in Oozie • Choosing Proper Connector • Overriding Type Mapping
Sqoop 2 • Sqoop 2 Architecture • Sqoop 2 Design Goals • Sqoop 2 UI in Hue Resources
Resources
39
Sqoop 2 http://archive-primary.cloudera.com/cdh5/cdh/5/sqoop2/
Sqoop 1