Paper presentation transcript, 2013-10-22
JackHare: a framework for SQL to NoSQL translation using MapReduce
Wu-Chun Chung · Hung-Pin Lin · Shih-Chang Chen · Mon-Fong Jiang · Yeh-Ching Chung
Received: 15 December 2012 / Accepted: 6 September 2013 © Springer Science+Business Media New York 2013
Presented by 康志強, 2013-10-22
Outline
• Introduction
• Related work
• The JackHare framework architecture
• Unstructured data processing in HBase
• Experimental results
• Conclusions
Introduction
• Problems of big data (massive data)
– Data access speed
– Data combination: keeping data timely and correct under parallel processing
• Hadoop MapReduce
– Processes the massive data in parallel
• Hadoop Distributed File System (HDFS)
– Difficult to update data frequently
• HBase
– Places the data over a scale-out storage system
– Manipulates changeable data in a transparent way
– The HBase interface is not user-friendly
• JackHare
– An API compliant with the ANSI-SQL and JDBC 4.0 specifications for operating Apache HBase
– Uses the MapReduce framework to process the unstructured data in HBase
• Data access speed
– In 1990, a hard disk stored about 1,370 MB and transferred data at 4.4 MB/s
– Today, a 1 TB disk transfers at about 100 MB/s
– Reading and writing data in parallel speeds up access
• Hadoop Distributed File System
– Difficult to update data frequently in such a file system
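The access-speed gap above can be made concrete with a quick calculation using the capacity and bandwidth figures from the slide: a full sequential read of the 1990 disk took about five minutes, while a modern 1 TB disk needs hours, which is exactly why Hadoop spreads reads and writes across many disks in parallel.

```java
// Rough full-disk sequential read times, using the figures from the slide.
public class DiskReadTime {
    static double readSeconds(double capacityMB, double mbPerSec) {
        return capacityMB / mbPerSec;
    }

    public static void main(String[] args) {
        double t1990 = readSeconds(1_370, 4.4);       // 1990: 1,370 MB at 4.4 MB/s
        double tNow  = readSeconds(1_000_000, 100.0); // today: ~1 TB at 100 MB/s
        System.out.printf("1990: %.0f s (~%.0f min)%n", t1990, t1990 / 60);
        System.out.printf("now : %.0f s (~%.1f h)%n", tNow, tNow / 3600);
        // Spreading the same 1 TB over 100 disks read in parallel: ~100 s
        System.out.printf("100 disks in parallel: %.0f s%n", tNow / 100);
    }
}
```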
• Data combination problem
– Correctness
• MapReduce
– A distributed programming framework
– Map distributes one job across multiple nodes
– Reduce recombines the per-node results into the final result
– Exploits data locality
– Supports higher-level query languages (Pig, Hive)
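The Map/Reduce split described above can be illustrated with a minimal word count in plain Java (no Hadoop dependency): the map step emits (word, 1) pairs, the shuffle groups the pairs by key, and the reduce step sums each group — the same structure Hadoop distributes across nodes.

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map: split each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + Reduce: group the pairs by key and sum each group's values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, Collectors.summingInt(Map.Entry::getValue)));
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // "hello" appears twice, the other words once.
        System.out.println(wordCount(List.of("hello world", "hello mapreduce")));
    }
}
```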
• MapReduce (figure)
• HBase
– A distributed database built on top of HDFS
– Uses rows and columns as indexes to access data values
– Every datum carries a timestamp, so the same cell can hold multiple versions over time
– An HBase table consists of many rows and several column families
– Can serve as a data source or storage target for MapReduce programs
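The row/column-family/timestamp model above can be sketched as nested sorted maps, which is roughly how an HBase table is organized logically (row key → column family → qualifier → timestamp → value). This is an illustrative model only — the class and method names here are not the HBase API.

```java
import java.util.*;

// Illustrative model of an HBase table:
// row -> family -> qualifier -> (timestamp -> value).
// Timestamps are kept in descending order so the newest version comes first.
public class MiniHTable {
    private final NavigableMap<String,
            NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>>
            rows = new TreeMap<>();

    public void put(String row, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version of a cell, like a default HBase Get.
    public String get(String row, String family, String qualifier) {
        var versions = rows.getOrDefault(row, new TreeMap<>())
                .getOrDefault(family, new TreeMap<>())
                .get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        MiniHTable t = new MiniHTable();
        t.put("lot-001", "info", "status", 100L, "queued");
        t.put("lot-001", "info", "status", 200L, "processed"); // newer version, same cell
        System.out.println(t.get("lot-001", "info", "status")); // newest version wins
    }
}
```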
• HBase (figure)
• NoSQL databases
– http://www.ithome.com.tw/itadm/article.php?c=63360&s=5
• JackHare
– Allows users to issue ANSI-SQL queries to manipulate large-scale data
– An API compliant with the ANSI-SQL and JDBC 4.0 specifications for operating Apache HBase
– Uses the MapReduce framework to process the unstructured data in HBase
Related work
• Pig
– Runs in HDFS and MapReduce cluster environments
– Pig Latin: a simpler procedural language
– http://pig.apache.org/docs/r0.12.0/basic.html#nestedblock
• Hive
– Provides an SQL-like query language for querying data (HiveQL)
– Can manage data stored in HDFS
– https://cwiki.apache.org/confluence/display/Hive/Tutorial
• YSmart
– An SQL-to-MapReduce translator
– http://ysmart.cse.ohio-state.edu/
• S2MART
– Smart SQL to Map-Reduce Translators
• HadoopDB
– An architectural hybrid of MapReduce and DBMS technologies for analytical workloads
– Provides SQL queries via a translation layer called SQL-MR-SQL (SMS), based on Hive
– http://db.cs.yale.edu/hadoopdb/hadoopdb.html
• Clydesdale
– Structured data processing on MapReduce
– Focuses on processing data that fits a star schema
• SQL queries are translated into MapReduce jobs
• HBase
– Satisfies frequent data updates
– Maintains the scalability and reliability of NoSQL databases
The JackHare framework architecture
• The user submits an ANSI-SQL query through an SQL client application.
• The compiler scans and parses the ANSI-SQL query.
• The compiler looks up the related HBase table name, column families and column qualifiers.
• MapReduce code is generated according to the query commands and metadata.
• JackHare accesses HBase and executes the MapReduce job.
• The results are wrapped and passed back from the back-end.
• The returned results are shown in the SQL client application according to the RDB schema.
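The workflow just described — parse the query, look up HBase metadata, then emit a MapReduce job — can be sketched end to end in plain Java. This is not JackHare's actual code: the regex-based parser, the hard-coded metadata map, and the plan string are simplifications standing in for the compiler, the metadata lookup, and the generated job.

```java
import java.util.*;
import java.util.regex.*;

public class MiniCompiler {
    // Metadata mapping relational columns to HBase family:qualifier (illustrative).
    static final Map<String, String> METADATA = Map.of(
            "lot_id", "info:lot_id",
            "status", "info:status");

    // Parse a simple SELECT ... FROM ... WHERE col = 'val' query,
    // resolve the columns through the metadata, and emit a job description.
    public static String compile(String sql) {
        Matcher m = Pattern.compile(
                "SELECT\\s+(\\w+)\\s+FROM\\s+(\\w+)\\s+WHERE\\s+(\\w+)\\s*=\\s*'([^']*)'",
                Pattern.CASE_INSENSITIVE).matcher(sql.trim());
        if (!m.matches()) throw new IllegalArgumentException("unsupported query: " + sql);
        String selectCol = METADATA.get(m.group(1));
        String whereCol  = METADATA.get(m.group(3));
        return String.format("Scan table=%s, emit=%s, mapFilter[%s='%s']",
                m.group(2), selectCol, whereCol, m.group(4));
    }

    public static void main(String[] args) {
        System.out.println(compile("SELECT status FROM LOT WHERE lot_id = 'L01'"));
    }
}
```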
• Architecture figure; SQuirreL serves as the SQL client
Unstructured data processing in HBase
• Remap the data in a relational database to HBase
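A minimal sketch of that remapping, assuming the common scheme in which the relational primary key becomes the HBase row key and the remaining columns become qualifiers under a single column family (the family name "cf" and the LOT-style sample data are assumptions, not taken from the paper):

```java
import java.util.*;

public class RowRemapper {
    // Remap one relational row into HBase cells: the primary-key value becomes
    // the row key; every other column becomes family:qualifier -> value.
    public static Map<String, String> remap(String pkColumn, String family,
                                            Map<String, String> row) {
        String rowKey = row.get(pkColumn);
        Map<String, String> cells = new TreeMap<>();
        for (var e : row.entrySet()) {
            if (!e.getKey().equals(pkColumn)) {
                cells.put(rowKey + "/" + family + ":" + e.getKey(), e.getValue());
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> lotRow =
                Map.of("lot_id", "L01", "status", "done", "owner", "fab1");
        System.out.println(remap("lot_id", "cf", lotRow));
    }
}
```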
• Analysis of SQL clauses
– SELECT, FROM and WHERE clauses
– Extended clauses
  • GROUP BY
  • HAVING
  • ORDER BY
  • JOIN
  • Aggregate functions
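How these clauses map onto MapReduce phases can be sketched with Java streams: WHERE is a map-side filter, GROUP BY chooses the shuffle key, the aggregate function runs reduce-side, and HAVING filters the reduced output. The DIE-style schema and the sample data here are invented for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class ClausePhases {
    record Die(String wafer, boolean pass) {}

    // Equivalent of: SELECT wafer, COUNT(*) FROM DIE WHERE pass = true
    //                GROUP BY wafer HAVING COUNT(*) >= 2
    public static Map<String, Long> query(List<Die> dies) {
        Map<String, Long> grouped = dies.stream()
                .filter(Die::pass)                          // WHERE: map-side filter
                .collect(Collectors.groupingBy(Die::wafer,  // GROUP BY: shuffle key
                        Collectors.counting()));            // COUNT(*): reduce-side aggregate
        return grouped.entrySet().stream()
                .filter(e -> e.getValue() >= 2)             // HAVING: filter on reduced output
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<Die> dies = List.of(
                new Die("W1", true), new Die("W1", true),
                new Die("W2", true), new Die("W2", false));
        System.out.println(query(dies)); // only W1 survives the HAVING filter
    }
}
```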
Experimental results
• Experimental environment
– Two Intel Xeon L5640 CPUs, 24 GB RAM and a 3 TB hard disk
– A 16-node virtual-machine cluster on four physical machines
– Hadoop 0.20.203 (as of 15 October 2013, release 2.2.0 is available)
– HBase 0.92.0 (current snapshot as of 2013-09-20: 0.97.0-SNAPSHOT)
– Hive 0.9.0
– Java 1.6.0, maximum heap size 512 MB
• Experimental environment (cont.)
– Each node: two cores at 2 GHz with 4 GB RAM and 400 GB of storage space
– MySQL server: two cores at 2 GHz, 4 GB RAM and an 800 GB hard disk
– Three tables: LOT, WAFER and DIE
• Results (figures)
Conclusions
• End of presentation.