Paper presentation transcript, 2013-10-22
JackHare: a framework for SQL to NoSQL translation using MapReduce
Wu-Chun Chung · Hung-Pin Lin · Shih-Chang Chen · Mon-Fong Jiang · Yeh-Ching Chung
Received: 15 December 2012 / Accepted: 6 September 2013 © Springer Science+Business Media New York 2013
Presented by 康志強, 2013-10-22
Outline
• Introduction
• Related work
• The JackHare framework architecture
• Unstructured data processing in HBase
• Experimental results
• Conclusions
Introduction
• Problems of big data (massive data)
– Data access speed
– Data combination: keeping data timely and correct under parallel processing
• Hadoop MapReduce
– Processes the massive data in parallel
• Hadoop Distributed File System (HDFS)
– Difficult to update data frequently
• HBase
– Places the data over a scale-out storage system
– Manipulates changeable data in a transparent way
– The HBase interface is not user-friendly
• JackHare
– An API compliant with the ANSI-SQL and JDBC 4.0 specifications for operating Apache HBase
– Uses the MapReduce framework to process the unstructured data in HBase
• Data access speed
– In 1990, a hard disk stored about 1,370 MB and transferred data at 4.4 MB/s
– Today, a 1 TB disk transfers at about 100 MB/s
– Reading and writing data in parallel speeds up access
• Hadoop Distributed File System
– Difficult to update data frequently in such a file system
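The access-speed gap above can be made concrete with a quick calculation using the capacity and bandwidth figures from the slide: a full sequential read of the 1990 disk took about five minutes, while a modern 1 TB disk needs hours, which is exactly why Hadoop spreads reads and writes across many disks in parallel.

```java
// Rough full-disk sequential read times, using the figures from the slide.
public class DiskReadTime {
    static double readSeconds(double capacityMB, double mbPerSec) {
        return capacityMB / mbPerSec;
    }

    public static void main(String[] args) {
        double t1990 = readSeconds(1_370, 4.4);       // 1990: 1,370 MB at 4.4 MB/s
        double tNow  = readSeconds(1_000_000, 100.0); // today: ~1 TB at 100 MB/s
        System.out.printf("1990: %.0f s (~%.0f min)%n", t1990, t1990 / 60);
        System.out.printf("now : %.0f s (~%.1f h)%n", tNow, tNow / 3600);
        // Spreading the same 1 TB over 100 disks read in parallel: ~100 s
        System.out.printf("100 disks in parallel: %.0f s%n", tNow / 100);
    }
}
```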
• Data combination problem
– Correctness
• MapReduce
– A distributed programming framework
– Map distributes one job across multiple nodes
– Reduce recombines the per-node results into the final result
– Exploits data locality
– Supports higher-level query languages (Pig, Hive)
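The Map/Reduce split described above can be illustrated with a minimal word count in plain Java (no Hadoop dependency): the map step emits (word, 1) pairs, the shuffle groups the pairs by key, and the reduce step sums each group — the same structure Hadoop distributes across nodes.

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map: split each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + Reduce: group the pairs by key and sum each group's values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, Collectors.summingInt(Map.Entry::getValue)));
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // "hello" appears twice, the other words once.
        System.out.println(wordCount(List.of("hello world", "hello mapreduce")));
    }
}
```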
• MapReduce (figure)
• HBase
– A distributed database built on top of HDFS
– Uses rows and columns as indexes to access data values
– Every datum carries a timestamp, so the same cell can hold multiple versions over time
– An HBase table consists of many rows and several column families
– Can serve as a data source or storage target for MapReduce programs
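The row/column-family/timestamp model above can be sketched as nested sorted maps, which is roughly how an HBase table is organized logically (row key → column family → qualifier → timestamp → value). This is an illustrative model only — the class and method names here are not the HBase API.

```java
import java.util.*;

// Illustrative model of an HBase table:
// row -> family -> qualifier -> (timestamp -> value).
// Timestamps are kept in descending order so the newest version comes first.
public class MiniHTable {
    private final NavigableMap<String,
            NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>>
            rows = new TreeMap<>();

    public void put(String row, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Latest version of a cell, like a default HBase Get.
    public String get(String row, String family, String qualifier) {
        var versions = rows.getOrDefault(row, new TreeMap<>())
                .getOrDefault(family, new TreeMap<>())
                .get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        MiniHTable t = new MiniHTable();
        t.put("lot-001", "info", "status", 100L, "queued");
        t.put("lot-001", "info", "status", 200L, "processed"); // newer version, same cell
        System.out.println(t.get("lot-001", "info", "status")); // newest version wins
    }
}
```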
• HBase (figure)
• NoSQL databases
– http://www.ithome.com.tw/itadm/article.php?c=63360&s=5
• JackHare
– Allows users to issue ANSI-SQL queries to manipulate large-scale data
– An API compliant with the ANSI-SQL and JDBC 4.0 specifications for operating Apache HBase
– Uses the MapReduce framework to process the unstructured data in HBase
Related work
• Pig
– Runs in HDFS and MapReduce cluster environments
– Pig Latin: a simpler procedural language
– http://pig.apache.org/docs/r0.12.0/basic.html#nestedblock
• Hive
– Provides an SQL-like query language for querying data (HiveQL)
– Can manage data stored in HDFS
– https://cwiki.apache.org/confluence/display/Hive/Tutorial
• YSmart
– An SQL-to-MapReduce translator
– http://ysmart.cse.ohio-state.edu/
• S2MART
– Smart SQL to Map-Reduce Translators
• HadoopDB
– An architectural hybrid of MapReduce and DBMS technologies for analytical workloads
– Provides SQL queries via a translation layer called SQL-MR-SQL (SMS), based on Hive
– http://db.cs.yale.edu/hadoopdb/hadoopdb.html
• Clydesdale
– Structured data processing on MapReduce
– Focuses on processing data that fits a star schema
• SQL queries are translated into MapReduce jobs
• HBase
– Satisfies frequent data updates
– Maintains the scalability and reliability of NoSQL databases
The JackHare framework architecture
• The user submits an ANSI-SQL query through an SQL client application.
• The compiler scans and parses the ANSI-SQL query.
• The compiler looks up the related HBase table name, column families and column qualifiers.
• MapReduce code is generated according to the query commands and metadata.
• JackHare accesses HBase and executes the MapReduce job.
• The results are wrapped and passed back from the back-end.
• The returned results are shown in the SQL client application according to the RDB schema.
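The workflow just described — parse the query, look up HBase metadata, then emit a MapReduce job — can be sketched end to end in plain Java. This is not JackHare's actual code: the regex-based parser, the hard-coded metadata map, and the plan string are simplifications standing in for the compiler, the metadata lookup, and the generated job.

```java
import java.util.*;
import java.util.regex.*;

public class MiniCompiler {
    // Metadata mapping relational columns to HBase family:qualifier (illustrative).
    static final Map<String, String> METADATA = Map.of(
            "lot_id", "info:lot_id",
            "status", "info:status");

    // Parse a simple SELECT ... FROM ... WHERE col = 'val' query,
    // resolve the columns through the metadata, and emit a job description.
    public static String compile(String sql) {
        Matcher m = Pattern.compile(
                "SELECT\\s+(\\w+)\\s+FROM\\s+(\\w+)\\s+WHERE\\s+(\\w+)\\s*=\\s*'([^']*)'",
                Pattern.CASE_INSENSITIVE).matcher(sql.trim());
        if (!m.matches()) throw new IllegalArgumentException("unsupported query: " + sql);
        String selectCol = METADATA.get(m.group(1));
        String whereCol  = METADATA.get(m.group(3));
        return String.format("Scan table=%s, emit=%s, mapFilter[%s='%s']",
                m.group(2), selectCol, whereCol, m.group(4));
    }

    public static void main(String[] args) {
        System.out.println(compile("SELECT status FROM LOT WHERE lot_id = 'L01'"));
    }
}
```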
• Architecture figure; SQuirreL serves as the SQL client
Unstructured data processing in HBase
• Remap the data in a relational database to HBase
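A minimal sketch of that remapping, assuming the common scheme in which the relational primary key becomes the HBase row key and the remaining columns become qualifiers under a single column family (the family name "cf" and the LOT-style sample data are assumptions, not taken from the paper):

```java
import java.util.*;

public class RowRemapper {
    // Remap one relational row into HBase cells: the primary-key value becomes
    // the row key; every other column becomes family:qualifier -> value.
    public static Map<String, String> remap(String pkColumn, String family,
                                            Map<String, String> row) {
        String rowKey = row.get(pkColumn);
        Map<String, String> cells = new TreeMap<>();
        for (var e : row.entrySet()) {
            if (!e.getKey().equals(pkColumn)) {
                cells.put(rowKey + "/" + family + ":" + e.getKey(), e.getValue());
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> lotRow =
                Map.of("lot_id", "L01", "status", "done", "owner", "fab1");
        System.out.println(remap("lot_id", "cf", lotRow));
    }
}
```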
• Analysis of SQL clauses
– SELECT, FROM and WHERE clauses
– Extended clauses
  • GROUP BY
  • HAVING
  • ORDER BY
  • JOIN
  • Aggregate functions
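How these clauses map onto MapReduce phases can be sketched with Java streams: WHERE is a map-side filter, GROUP BY chooses the shuffle key, the aggregate function runs reduce-side, and HAVING filters the reduced output. The DIE-style schema and the sample data here are invented for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class ClausePhases {
    record Die(String wafer, boolean pass) {}

    // Equivalent of: SELECT wafer, COUNT(*) FROM DIE WHERE pass = true
    //                GROUP BY wafer HAVING COUNT(*) >= 2
    public static Map<String, Long> query(List<Die> dies) {
        Map<String, Long> grouped = dies.stream()
                .filter(Die::pass)                          // WHERE: map-side filter
                .collect(Collectors.groupingBy(Die::wafer,  // GROUP BY: shuffle key
                        Collectors.counting()));            // COUNT(*): reduce-side aggregate
        return grouped.entrySet().stream()
                .filter(e -> e.getValue() >= 2)             // HAVING: filter on reduced output
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<Die> dies = List.of(
                new Die("W1", true), new Die("W1", true),
                new Die("W2", true), new Die("W2", false));
        System.out.println(query(dies)); // only W1 survives the HAVING filter
    }
}
```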
Experimental results
• Experimental environment
– Two Intel Xeon L5640 CPUs, 24 GB RAM and a 3 TB hard disk
– A 16-node virtual-machine cluster on four physical machines
– Hadoop 0.20.203 (as of 15 October 2013, release 2.2.0 is available)
– HBase 0.92.0 (current snapshot as of 2013-09-20: 0.97.0-SNAPSHOT)
– Hive 0.9.0
– Java 1.6.0, maximum heap size 512 MB
• Experimental environment (cont.)
– Each node: two cores at 2 GHz with 4 GB RAM and 400 GB of storage space
– MySQL server: two cores at 2 GHz, 4 GB RAM and an 800 GB hard disk
– Three tables: LOT, WAFER and DIE
• Results (figures)
Conclusions
• End of presentation.