Dancing With The Elephant
Persistence with HBase: Part 1
www.smart-platform.com | @smartplatf
Event Sponsors
We will discuss
• Introduction to Hadoop
• HBase: Definition, Storage Model, Use Cases
• Basic Data Access from the shell
• Hands-on with the HBase API
What is Hadoop
• Framework for distributed processing of large datasets (Big Data)
• HDFS + MapReduce
• HDFS (Data):
  Distributed filesystem responsible for storing data across the cluster
  Provides replication on cheap commodity hardware
  NameNode and DataNode processes
• MapReduce (Processing): may be covered in a future session
HBase: What
• A sparse, distributed, persistent, multidimensional, sorted map (as defined by Google’s paper on BigTable)
• Distributed NoSQL Database designed on top of HDFS
RDBMS Woes (with massive data)
• Scaling is hard and expensive
• Relational features and secondary indexes must be turned off to scale
• Hard to do quick reads at larger table sizes (500 GB)
• Single points of failure
• Schema changes
HBase: Why
• Scalable: just add nodes as your data grows
• Distributed: leverages the advantages of Hadoop’s HDFS
• Built on top of Hadoop: being part of the ecosystem, it can be integrated with multiple tools
• High performance for reads and writes
  Short-circuit reads
  Single reads: 1 to 10 ms; scans: 100s of rows in 10 ms
• Schema-less
• Production-ready where data is on the order of petabytes
HBase: Storage Model 1
HTable
• Tables are split into regions
• Region: data with a continuous range of RowKeys [Start, End), kept in sorted order
• Regions split as the table grows (region size can be configured)
• Table schema defines the column families
• (Table, RowKey, ColumnFamily, ColumnName, Timestamp) → Value
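Because every region covers a half-open key range [Start, End) and regions are sorted, finding the region that owns a row key amounts to finding the greatest start key that is less than or equal to that row key. The sketch below illustrates the idea with a plain `TreeMap`. The class `RegionLookup`, its method names, and the region names are hypothetical; it also compares Java strings, whereas HBase compares raw bytes. It is not how HBase itself locates regions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: each region is identified by its start key.
class RegionLookup {
    // Start key -> region name. The empty string stands for the open
    // start of the first region.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // A row belongs to the region with the greatest start key <= rowKey,
    // because a region covers [its start key, the next region's start key).
    String regionFor(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("No region covers row key: " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        RegionLookup table = new RegionLookup();
        table.addRegion("", "region-1");   // covers ["", "g")
        table.addRegion("g", "region-2");  // covers ["g", "p")
        table.addRegion("p", "region-3");  // covers ["p", end of table)
        System.out.println(table.regionFor("orange"));
    }
}
```

A region split in this model would simply be one more `addRegion` call with a start key inside an existing range.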
HTable (Data Structure)
• SortedMap(RowKey, List(
    SortedMap(Column, List(
      Value, Timestamp)
    ))
  )
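The nested structure above can be approximated in plain Java with nested sorted maps. This is a minimal in-memory sketch, assuming string keys and values, a combined "family:qualifier" column name, and timestamps sorted oldest to newest for readability. The class `SortedMapTable` and its methods are made up for illustration and are not part of the HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of: rowKey -> column -> timestamp -> value
class SortedMapTable {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Writing a cell adds a new version; older versions stay in place.
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // All versions of one cell, or null when the cell was never written.
    // Absent cells take no space, which is what "sparse" means here.
    private NavigableMap<Long, String> versions(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(column);
    }

    // Newest version of a cell, or null if the cell does not exist.
    String getLatest(String rowKey, String column) {
        NavigableMap<Long, String> v = versions(rowKey, column);
        return v == null ? null : v.lastEntry().getValue();
    }

    // Newest version whose timestamp is <= asOf, or null if there is none.
    String getAsOf(String rowKey, String column, long asOf) {
        NavigableMap<Long, String> v = versions(rowKey, column);
        if (v == null) return null;
        Map.Entry<Long, String> entry = v.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }

    // Row keys always come back in sorted order, whatever the insert order.
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        SortedMapTable table = new SortedMapTable();
        table.put("row2", "colfam1:qual1", 1L, "other");
        table.put("row1", "colfam1:qual1", 1L, "val1");
        table.put("row1", "colfam1:qual1", 2L, "val1-new");
        System.out.println(table.rowKeys());
        System.out.println(table.getLatest("row1", "colfam1:qual1"));
    }
}
```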
HBase: Data Read/Write
• Get: Random read
• Scan: Sequential read
• Put: Write/Update
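To make the three access patterns concrete, here is a small sketch on top of a sorted map. It assumes one string value per row and ignores column families and versions, so it is a deliberately simplified stand-in: in HBase a Put adds a new cell version rather than overwriting in place. The class `AccessPatterns` is hypothetical and only shows the shape of each operation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical single-value-per-row model of Get, Scan and Put.
class AccessPatterns {
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Put: one operation serves as both insert and update.
    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Get: random read of exactly one row by its key.
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Scan: sequential read of the rows in [startRow, stopRow),
    // returned in row-key order.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).values());
    }

    public static void main(String[] args) {
        AccessPatterns table = new AccessPatterns();
        table.put("row3", "c");
        table.put("row1", "a");
        table.put("row2", "b");
        System.out.println(table.get("row2"));
        System.out.println(table.scan("row1", "row3"));
    }
}
```

Because rows are kept sorted by key, a scan is a walk over a contiguous range, which is why row key design matters so much for scan performance.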
HBase: Data Access Clients
• Demo of the HBase shell
• Java API
HBase: API
• Connection
• DDL
• DML
• Filters
• Hands-On
HBase: API
• Configuration: holds the details of where to find the cluster, plus tunable settings.
• HConnection: represents a connection to the cluster.
• HBaseAdmin: handles DDL operations (create, list, drop, alter).
• HTable (HTableInterface): a handle on a single HBase table; sends “commands” to the table (Put, Get, Scan, Delete, Increment).
HBase: API:DDL
Group name: ddl (Data Definition Language)
Commands: alter, create, describe, disable, drop, enable, exists, is_disabled, is_enabled, list
HBase: API:DDL
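import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Point the client at the cluster (60000 is the default master RPC port in this API generation; 60010 is the web UI)
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.master", "localhost:60000");
HBaseAdmin hbase = new HBaseAdmin(conf);
// Describe a table with two column families, then create it
HTableDescriptor desc = new HTableDescriptor("testtable");
HColumnDescriptor meta = new HColumnDescriptor("colfam1".getBytes());
HColumnDescriptor prefix = new HColumnDescriptor("colfam2".getBytes());
desc.addFamily(meta);
desc.addFamily(prefix);
hbase.createTable(desc);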
HBase: API:DML
Group name: dml (Data Manipulation Language)
Commands: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
HBase: API:DML PUT
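import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

HTable table = new HTable(conf, "testtable");
// A single Put carries every cell written to row "row1"
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual2"), Bytes.toBytes("val2"));
table.put(put);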
HBase: API:DML GET
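import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
// Restrict the Get to one column so only that cell is returned
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
Result result = table.get(get);
byte[] val = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
System.out.println("Value: " + Bytes.toString(val));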
HBase: API:DML SCAN
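import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

// An empty Scan reads every row of the table in row-key order
Scan scan1 = new Scan();
ResultScanner scanner1 = table.getScanner(scan1);
for (Result res : scanner1) {
    System.out.println(res);
}
scanner1.close();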
Other Projects around HBase
• SQL Layer: Phoenix, Hive, Impala
• Object Persistence: Lily, Kundera
FollowUp
• Part 2: Building a key-value data store in HBase
  Challenges we faced in SMART
• {Rahul, vinay}@briotribes.com
Shoutout To
HBase: Usecase (Facebook)
• Facebook Messaging: Titan
  1.5 M ops per second at peak
  6B+ messages per day
  16 columns per operation across different families
• Facebook Insights: Puma
  Provides developers and Page owners with metrics about their content
  > 1 M counter increments per second