bridging the gap of relational to hadoop using sqoop @ expedia

39
Bridging the gap of Relational to Hadoop using Sqoop@Expedia (Enhancing Sqoop for Synchronization) Shashank Tandon, Expedia Kopal Niranjan, Expedia

Upload: dataworks-summithadoop-summit

Post on 13-Apr-2017

492 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

Bridging the gap of Relational to Hadoop using Sqoop@Expedia(Enhancing Sqoop for Synchronization)

Shashank Tandon, Expedia

Kopal Niranjan, Expedia

Page 2: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential2

Agenda• Problem statement

• Why- Sqoop

• Expedia Enhancements for Sqoop.

• New Tool : Hive Merge

• Data Synchronization

• Demo

Page 3: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential3

Data Synchronization

Page 4: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential4

Problem Statement• Import huge amount of data available on RDBMS to Hive

table

• Support multiple partitions on Hive while importing.

• Regular updates happening on RDBMS.–Merge the new/updated data to hive tables.–Merge the data in parallel.

Page 5: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential5

Community Solution - Sqoop• Sqoop is an open source tool designed to efficiently

transfer bulk data between Hadoop and structured data stores such as relational databases.

• Support various relational databases like Teradata, SQL Server, Oracle,Mysql,DB2 etc.

Page 6: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential6

Enhanced Sqoop Features • Enhanced Sqoop Features for community business needs.

- Hive Merge - Merges the incremental data migrated to hdfs into your

existing hive tables.- Supports merge based on composite keys- Merges older partitions as well as add new partitions.

Page 7: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential7

Enhanced Sqoop Features - Hive Dynamic Partition

- Hive Dynamic Partition with Partition Format

- Hive External Table

- Compression like Snappy

Page 8: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential8

Hcatalog for Hive- Hcatalog is a java wrapper on top of Hive metastore.

- Sqoop supports all the latest Hive features using Hcatalog.

Page 9: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential9

External tables with HCatalog

Page 10: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential10

Sqoop Import to Hive Managed Table

• Sqoop connects to mysql database test

• Import table MYTABLE in a hive managed table test_part1

• The hive managed table is located in /apps/hive/warehouse

Page 11: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential11

Page 12: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential12

New Enhancement :Import to Hive External Table

• The above command creates a hive table in the user managedDirectory /user/root/test_part2

Page 13: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential13

Page 14: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential14

Dynamic Partitioning with HCatalog

Page 15: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential15

Sqoop Import to Hive Static Partition• Can pass only 1 static partition as sqoop argument

Page 16: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential16

Sqoop Import to Hive Static Partition• Check Hive Partition

Page 17: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential17

Sqoop Import to Hive Static Partition on Date column

• Can pass only 1 static partition as sqoop argument with date value specified manually.

Page 18: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential18

Questions

How to Import Data if there are more than 200 partitions ? Should I manually run these jobs again and again ?

How to Import Data if the date format is month or day or year?Is there any way that I can pass the format ?

Page 19: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential19

New Enhancement : Import to Hive Dynamic Partition

• A new argument is passed –hcatalog-dynamic-partition-keys in sqoop.

• It works along with current static partition key.

• If both are passed then it will give more preference to static partition key.

Page 20: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential20

Page 21: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential21

New Enhancement : Import to Hive Dynamic Partition with Date Format

• A new argument is passed –hcatalog-dynamic-partition-key-format with argument –hcatalog-dynamic-partition-keys.

• Check the Hive Partitions after the Sqoop Import.

• The partitions created will be in the user-specified format.

Page 22: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential22

Page 23: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential23

Password encrypted in Sqoop Metastore• Password will now be saved in Sqoop metastore in

encrypted manner.

• The logic is same as done in file encryption where generic passkey and algorithm is passed in command line.

Page 24: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential24

Issues with Sqoop Merge Tool• Designed to merge two directories on HDFS. Will need

modification to support merging of Hive tables.

• The output directory must be specified while performing the merge.

• Supports merge based on a single column.

• To merge many partitions, each will require separate sequential Sqoop jobs.

Page 25: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential25

Merge Incremental data using Sqoop and Hive External Table

• Import records from base table to a HDFS directory.

• Import updates using incremental imports to another HDFS directory.

• Create a hive external table for both the directories.

• Create a view that combines record sets from both the Base (base_table) and Change (incremental_table) tables.

Page 26: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential26

Merge Incremental data using Sqoop and Hive External Table

• The view now contains the most up-to-date set of records.

• Generate a table from the view created in above step.

• Replace the base table with the entries from the above generated table.

Page 27: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential27

New Tool: Hive Merge• Import original base table into Hive

Page 28: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential28

New Tool : Hive merge • Import incremental data into Hive

Page 29: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential29

• Finally merge data using tool hive-merge.

New Tool : Hive merge

Page 30: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential30

Acquiring locks during Hive Merge• In order to allow only single Hive merge happen on same

table, tool acquire lock in the start and release lock once it finishes.

Page 31: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential31

Performance metrics : Hive Merge tool

Page 32: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential32

Other Key Enhancements• Save encrypted password in Sqoop Metastore

• Teradata varchar/char support

• Teradata current timestamp support

• Sqoop Job runs for Incremental Import

• Snappy compression support in Hcatalog

Page 33: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential33

Apache Sqoop Jiras These are the few jiras for which the patch has been provided by us:

• SQOOP-2332: Dynamic Partition in Sqoop HCatalog- if Hive table does not exists & add support for Partition Date Format

• SQOOP-2335 :Support for Hive External Table in Sqoop – Hcatalog

Page 34: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential34

• SQOOP-2585: Merging hive tables using sqoop

• SQOOP-2596:Precision of varchar/char column cannot be retrieved from teradata database during sqoop import

• SQOOP-2801: Secure RDBMS password in Sqoop Metastore in a encrypted form.

• SQOOP-2331: Snappy Compression Support in Sqoop-Hcatalog

Page 35: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

35

Demo

Page 36: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential36

Questions

Page 37: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential37

Hive Merge Internal ArchitectureStep 1: Identify partitions to update. Skip this step for non-partitioned tables.

Page 38: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential38

Hive Merge Internal ArchitectureStep 2: Merge the new partitions with the old partitions(only for partitioned tables).

Page 39: Bridging the gap of Relational to Hadoop using Sqoop @ Expedia

| Expedia Inc. Proprietary & Confidential39

Hive Merge Internal ArchitectureStep 3: Delete older versions.