HBase
HBase Overview
• HBase is a distributed, column-oriented database built on top of HDFS.
  – Easy to scale to demand.
• HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
  – Use MapReduce to search.
• HBase depends on ZooKeeper; by default it manages a ZooKeeper instance as the authority on cluster state.
Data Model
• A data model similar to Bigtable.
  – A data row has a sortable row key and an arbitrary number of columns.
  – The table is stored sparsely; rows in the same table can have widely varying numbers of columns.

Conceptual View

Physical Storage View
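To make the sparse model concrete, here is a minimal sketch of two rows of the same table carrying different column sets, using plain Python dictionaries rather than the HBase API; the table contents and column names are hypothetical:

```python
# Sketch: HBase-style sparse rows as nested dicts (not the real HBase API).
# Each row maps "family:qualifier" -> value; rows need not share columns.
table = {
    "row1": {"info:name": "alice", "info:email": "a@example.com"},
    "row2": {"info:name": "bob", "stats:logins": 17},  # no email, extra column
}

# Row keys are kept in sorted order, as HBase stores rows sorted by row key.
for row_key in sorted(table):
    print(row_key, table[row_key])
```

Because only the columns a row actually has are stored, "missing" columns cost nothing, which is what lets rows in one table vary widely in width.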
Example
• Capture network packets into HDFS, saving them to a new file every minute.
• Run a MapReduce application to estimate flow status:
  – count the number of TCP, UDP, and ICMP packets
  – compute TCP, UDP, or total packet flow
• Save the results to HBase.
  – The row key and timestamp are the capture time.
| Row-key | Timestamp | Tcp:count | Tcp:flow | Udp:count | Udp:flow | … |
|---|---|---|---|---|---|---|
| 201003291011 | 126985290000 | 2423432 | 7989010927 | 387897 | 8991645466 | … |
| 201003291012 | 126985296000 | 2899787 | 10939993009 | 481241 | 8163769889 | … |
| … | … | … | … | … | … | … |
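A minimal sketch of how the per-minute aggregation above might be computed, in plain Python rather than the actual MapReduce job; the packet records are hypothetical, and the row-key format yyyymmddHHMM is an assumption read off the example row keys:

```python
import time
from collections import defaultdict

# Hypothetical captured packets: (epoch_seconds, protocol, bytes).
packets = [
    (1269857460, "tcp", 1500),
    (1269857461, "udp", 300),
    (1269857465, "tcp", 900),
]

def minute_row_key(epoch_seconds):
    """Row key is the capture time truncated to the minute: yyyymmddHHMM."""
    return time.strftime("%Y%m%d%H%M", time.gmtime(epoch_seconds))

# Aggregate counts and flow (bytes) per minute and protocol, mirroring
# the Tcp:count / Tcp:flow column layout of the table above.
rows = defaultdict(lambda: defaultdict(int))
for ts, proto, size in packets:
    key = minute_row_key(ts)
    rows[key][f"{proto.capitalize()}:count"] += 1
    rows[key][f"{proto.capitalize()}:flow"] += size
```

All three sample packets fall in the same capture minute, so they collapse into a single HBase row keyed by that minute.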
Display
• Specify a start time and a stop time to scan the table, then aggregate the data and display it as a flow graph.
• Sample output
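The time-range scan can be sketched as follows, over an in-memory dict standing in for the HBase table; in HBase this would be a scan bounded by start and stop row keys, which works because yyyymmddHHMM row keys sort chronologically. The first two rows come from the example table above; the third is hypothetical:

```python
# Rows keyed by capture minute (yyyymmddHHMM), as in the example table.
table = {
    "201003291011": {"Tcp:flow": 7989010927, "Udp:flow": 8991645466},
    "201003291012": {"Tcp:flow": 10939993009, "Udp:flow": 8163769889},
    "201003291013": {"Tcp:flow": 5000000000, "Udp:flow": 4000000000},  # hypothetical
}

def scan_flow(start, stop, column):
    """Sum one flow column over [start, stop); plain string comparison
    is enough because the row keys sort in time order."""
    return sum(cols[column]
               for key, cols in sorted(table.items())
               if start <= key < stop)

total = scan_flow("201003291011", "201003291013", "Tcp:flow")
```

The resulting per-minute sums are what get plotted as the flow graph.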
The performance of accessing files in HDFS directly and through an HDFS-based FTP server
• ssh into the namenode and issue commands:
  – Upload a file to HDFS:
    • hadoop fs -Ddfs.block.size=<block size in bytes> -Ddfs.replication=<replication count> -put <local file> <HDFS directory>
  – Download a file from HDFS:
    • hadoop fs -get <file on HDFS> <local directory>

Accessing files in HDFS directly (1/7)
• By adjusting HDFS parameters, we observe HDFS file read/write performance under different conditions. In the slide titles that follow, R=1 denotes the replication (backup) count of a file in HDFS.

Accessing files in HDFS directly (2/7)
Accessing files in HDFS directly (3/7, R=1)

Time to fully write one file into HDFS, in seconds, by block size:

| Block size | 100 MB file | 3.3 GB file |
|---|---|---|
| 1M | 4 | 120.5 |
| 2M | 3.7 | 103.2 |
| 4M | 3.4 | 93.5 |
| 8M | 3.2 | 81.7 |
| 16M | 2.7 | 84.8 |
| 32M | 2.5 | 65.9 |
| 64M | 1.8 | 65.1 |
| 128M | 1.9 | 51.6 |
| 256M | 1.9 | 62.7 |
| 512M | 1.9 | 77 |

(In the original chart, the x-axis was the block size in bytes and the y-axis the write time in seconds.)
Accessing files in HDFS directly (4/7, R=1)

Time to fully read one file from HDFS, in seconds, by block size:

| Block size | 100 MB file | 3.3 GB file |
|---|---|---|
| 1M | 2.3 | 61.5 |
| 2M | 2.2 | 63 |
| 4M | 2.1 | 62.4 |
| 8M | 2.1 | 61.8 |
| 16M | 2.1 | 61.7 |
| 32M | 2.1 | 62.2 |
| 64M | 2.1 | 60.1 |
| 128M | 2 | 61.8 |
| 256M | 2 | 63.1 |
| 512M | 2 | 62.1 |

(In the original chart, the x-axis was the block size in bytes and the y-axis the read time in seconds.)
Accessing files in HDFS directly (5/7, R=2)

Time to fully write one file into HDFS, in seconds, by block size:

| Block size | 100 MB file | 3.3 GB file |
|---|---|---|
| 1M | 3.8 | 224.5 |
| 2M | 3.4 | 190.1 |
| 4M | 3.2 | 147 |
| 8M | 3 | 131.2 |
| 16M | 2.8 | 133.5 |
| 32M | 3.1 | 124.9 |
| 64M | 3.2 | 118.7 |
| 128M | 3.2 | 120.5 |
| 256M | 3.3 | 143.3 |
| 512M | 3.3 | 124.9 |

(In the original chart, the x-axis was the block size in bytes and the y-axis the write time in seconds.)
Accessing files in HDFS directly (6/7, R=2)

Time to fully read one file from HDFS, in seconds, by block size:

| Block size | 100 MB file | 3.3 GB file |
|---|---|---|
| 1M | 2.3 | 63.4 |
| 2M | 2.3 | 61.5 |
| 4M | 2.2 | 61.5 |
| 8M | 2.1 | 60.1 |
| 16M | 2 | 60 |
| 32M | 2.1 | 58.4 |
| 64M | 1.9 | 58 |
| 128M | 2.1 | 61.7 |
| 256M | 2 | 61.5 |
| 512M | 2 | 58.5 |

(In the original chart, the x-axis was the block size in bytes and the y-axis the read time in seconds.)
• Conclusions:
  – When uploading and downloading files directly on the namenode server (the machine running the NameNode daemon), block sizes of 64 MB or 128 MB generally perform best.
  – A higher replication count makes writes take longer, but slightly improves read speed.

Accessing files in HDFS directly (7/7)
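As a sanity check on the tables above, the fastest 3.3 GB write (128 MB blocks, R=1, 51.6 s) corresponds to roughly 65 MB/s of effective write throughput; the arithmetic, assuming 1 GB = 1024 MB:

```python
# Effective write throughput for the 3.3 GB file at 128 MB blocks, R=1.
file_mb = 3.3 * 1024      # file size in MB (assuming 1 GB = 1024 MB)
seconds = 51.6            # measured write time from the R=1 write table
throughput = file_mb / seconds
print(round(throughput, 1))  # ~65.5 MB/s
```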
• After a user connects to the FTP server with an FTP client:
  – "lfs" denotes an ordinary FTP server daemon accessing the local file system directly.
  – "HDFS" denotes our own FTP server daemon, which accesses HDFS by communicating with the NameNode daemon running on the same server.
• Every upload/download time below is the average of three measurements.
• Network bandwidth stayed at roughly 10 Mb/s to 12 Mb/s throughout.

Accessing files through an HDFS-based FTP server (1/3)
Accessing files through an HDFS-based FTP server (2/3)

Total upload time in seconds by file size (HDFS: block size 128 MB, replication = 2):

| File size | lfs | HDFS |
|---|---|---|
| 0.5 GB | 46.33 | 62.33 |
| 1.0 GB | 95 | 138.67 |
| 1.5 GB | 141.67 | 199 |
| 2.0 GB | 188.67 | 270 |
| 2.5 GB | 237 | 346 |
| 3.0 GB | 288.33 | 400 |
| 3.5 GB | 345 | 471 |
| 4.0 GB | 383.67 | 472 |

(In the original chart, the x-axis was the size of the single uploaded file in GB and the y-axis the total upload time in seconds.)
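The upload table shows the HDFS-backed path carrying a noticeable write overhead over the plain local file system; a quick computation of the relative slowdown at a few sizes from the table:

```python
# Upload times (seconds) from the table: local fs vs. HDFS-backed FTP server.
lfs  = {0.5: 46.33, 1.0: 95, 2.0: 188.67, 4.0: 383.67}
hdfs = {0.5: 62.33, 1.0: 138.67, 2.0: 270, 4.0: 472}

# Relative slowdown of the HDFS path for each file size.
overhead = {gb: round(hdfs[gb] / lfs[gb] - 1, 2) for gb in lfs}
print(overhead)  # e.g. the 4 GB upload is ~23% slower via HDFS
```

The overhead shrinks at the largest size, consistent with the replication cost being amortized over a longer transfer.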
Accessing files through an HDFS-based FTP server (3/3)

Total download time in seconds by file size (HDFS: block size 128 MB, replication = 2):

| File size | lfs | HDFS |
|---|---|---|
| 0.5 GB | 48 | 45 |
| 1.0 GB | 92 | 91 |
| 1.5 GB | 141 | 137.33 |
| 2.0 GB | 192 | 185.67 |
| 2.5 GB | 236.67 | 226.33 |
| 3.0 GB | 273.33 | 278 |
| 3.5 GB | 322 | 320.33 |
| 4.0 GB | 380.33 | 378.67 |

(In the original chart, the x-axis was the size of the single downloaded file in GB and the y-axis the total download time in seconds.)
Hadoop Authentication Analysis
• The name node has no notion of the identity of the real user.
• User identity:
  – The user name is the equivalent of `whoami`.
  – The group list is the equivalent of `bash -c groups`.
• The super-user is the user with the same identity as the name node process itself. If you started the name node, then you are the super-user.
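A rough sketch of how a client-side identity of this kind can be derived, in plain Python rather than Hadoop's actual implementation (which shelled out to `whoami` and `bash -c groups`); the point is that identity comes entirely from the client's own OS environment:

```python
import getpass
import os

# Equivalent of `whoami`: the OS user the client process runs as.
user = getpass.getuser()

# Rough stand-in for `bash -c groups`: the numeric group IDs of the process.
group_ids = os.getgroups()

print(user, group_ids)
```

Since the name node simply trusts whatever identity the client reports, anyone who controls a client machine can claim any user name, which motivates the proxy-based setup below.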
Why Use a Proxy to Connect to the Name Node
• DataNodes do not enforce any access control on their data blocks: a client can connect to a datanode directly and read or write a block just by supplying its Block ID.
• Any Hadoop client (any user) can access HDFS or submit MapReduce jobs.
• Hadoop only works with SOCKS v5 (on the client side, for ClientProtocol and SubmissionProtocol).
• Conclusion: Hadoop (a private-IP cluster) + RADIUS + SOCKS proxy.
Architecture

(architecture diagrams)
Hadoop SOCKS
• Only the Hadoop client needs to be configured for the SOCKS connection; the Namenode needs no configuration.
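In Hadoop of this era, routing client RPC through a SOCKS proxy was indeed a client-side-only setting, along the lines of the following `core-site.xml` fragment; the proxy host and port are placeholders:

```xml
<!-- Client-side core-site.xml: send Hadoop RPC through a SOCKS proxy. -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <!-- hypothetical proxy host:port -->
  <value>proxy.example.com:1080</value>
</property>
```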
User Authentication
• The SOCKS protocol's username/password authentication method is used to decide who may transfer through the proxy.
• A RADIUS server records whether each user may access Hadoop (user-group).
• Users access Hadoop under the identity the Hadoop client runs as (`whoami`).
SOCKS Proxy: Pros and Cons
• Pros:
  – Supports user authentication.
  – Can filter by IP range, restricting which networks may use the proxy.
  – Does not store the transferred packets; it simply forwards them.
• Cons:
  – The client must support the SOCKS protocol.
  – The proxy may become a bottleneck; transfer speed depends on the hardware and the chosen SOCKS software.