kudu austin oct 2015.pptx
TRANSCRIPT
1 © Cloudera, Inc. All rights reserved.
David Alves on behalf of the Kudu team
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
2 © Cloudera, Inc. All rights reserved.
Kudu: Storage for Fast Analytics on Fast Data
• New updating column store for Hadoop
• Apache-licensed open source
• Beta now available
3 © Cloudera, Inc. All rights reserved.
Motivation and Goals: Why build Kudu?
4 © Cloudera, Inc. All rights reserved.
Motivating Questions
• Are there user problems that we can't address because of gaps in Hadoop ecosystem storage technologies?
• Are we positioned to take advantage of advancements in the hardware landscape?
5 © Cloudera, Inc. All rights reserved.
Current Storage Landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts of data
• Accumulating data with high throughput
HBase excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously
6 © Cloudera, Inc. All rights reserved.
Changing Hardware landscape
• Spinning disk → solid-state storage
  • NAND flash: up to 450k read / 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant: 64 → 128 → 256 GB over the last few years
• Takeaway 1: the next bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind.
• Takeaway 2: column stores are feasible for random access
7 © Cloudera, Inc. All rights reserved.
Kudu Design Goals
• High throughput for big scans (columnar storage and replication). Goal: within 2x of Parquet
• Low latency for short accesses (primary key indexes and quorum replication). Goal: 1 ms read/write on SSD
• Database-like semantics (initially single-row ACID)
• Relational data model
  • SQL queries
  • "NoSQL"-style scan/insert/update (Java client)
8 © Cloudera, Inc. All rights reserved.
Kudu Design Goals
Integration with Impala, Hive, and Spark SQL. As Kudu is not an isolated solution, it was very important to provide the best possible integration with the other query engines in the Hadoop ecosystem. For example, Impala uses Kudu not only as a new storage engine: column filters can be pushed down directly to Kudu, improving scan performance and reducing the amount of data that needs to be exchanged between Kudu and Impala during query execution. First tests show that Kudu's performance in such an integration scenario depends on the use case, and can be up to 10 times faster than scanning a comparable table in HBase while only up to 30% slower than querying a table on HDFS. The performance depends on a series of factors such as caching strategies, data size, and how effectively primary key filters can be pushed down to Kudu.
What do I use Kudu for? We talked about how Kudu is made for SQL, allows fast scans, and allows fast mutability at scale. With that context, let's look at the variety of use cases handled in Hadoop today and see where Kudu fits in.
If we look at Kudu in the above figure, we see that many of the traditional SQL use cases can be simplified and improved with Kudu instead of a combination of HDFS and HBase. However, it is very important to note that HDFS and HBase will not disappear. They will be replaced only in use cases where they cannot play to their strengths, and will now focus more on the use cases they excel at. It is of course possible to build a SQL solution without Kudu, but it will come either with a more complex data processing and storage pipeline (combining HDFS and HBase) or with performance issues (HBase alone); storing data in Kudu is simply easier.
9 © Cloudera, Inc. All rights reserved.
Kudu Usage
• Table has a SQL-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java and C++ "NoSQL"-style APIs
  • Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
  • more to come!
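As an illustration of these operation semantics, here is a toy in-memory sketch of a table keyed by a possibly-composite primary key. This is not the real Kudu Java/C++ client; every name here is invented for illustration:

```python
# Toy model of Kudu's NoSQL-style operations: insert/update/delete/scan,
# keyed by a possibly-composite primary key. Illustrative only.

class ToyTable:
    def __init__(self, key_cols):
        self.key_cols = key_cols          # subset of columns forming the primary key
        self.rows = {}                    # primary-key tuple -> row dict

    def _key(self, row):
        return tuple(row[c] for c in self.key_cols)

    def insert(self, row):
        k = self._key(row)
        if k in self.rows:
            raise KeyError("duplicate primary key: %r" % (k,))
        self.rows[k] = dict(row)

    def update(self, row):
        self.rows[self._key(row)].update(row)   # row must already exist

    def delete(self, **key_vals):
        del self.rows[tuple(key_vals[c] for c in self.key_cols)]

    def scan(self, predicate=lambda r: True):
        # Scans iterate in primary-key order.
        return [self.rows[k] for k in sorted(self.rows) if predicate(self.rows[k])]

t = ToyTable(["host", "metric", "timestamp"])
t.insert({"host": "a", "metric": "cpu", "timestamp": 1, "value": 0.9})
t.insert({"host": "a", "metric": "cpu", "timestamp": 2, "value": 0.7})
t.update({"host": "a", "metric": "cpu", "timestamp": 1, "value": 0.5})
print(t.scan(lambda r: r["value"] < 0.6))   # only the updated row matches
```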
10 © Cloudera, Inc. All rights reserved.
Use cases and architectures
11 © Cloudera, Inc. All rights reserved.
Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes
● Time Series
  ○ Examples: streaming market data; fraud detection & prevention; risk monitoring
  ○ Workload: inserts, updates, scans, lookups
● Machine Data Analytics
  ○ Examples: network threat detection
  ○ Workload: inserts, scans, lookups
● Online Reporting
  ○ Examples: ODS
  ○ Workload: inserts, updates, scans, lookups
12 © Cloudera, Inc. All rights reserved.
Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity
Considerations:
● How do I handle failure during this process?
● How often do I reorganize data streaming in into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren't interrupted by maintenance?
[Diagram: Incoming data (messaging system) lands in HBase (the new partition). Once enough data has accumulated, the most recent partition is reorganized from HBase into a Parquet file: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file. Reporting requests are served by Impala on HDFS over the historic data plus the recent partitions.]
13 © Cloudera, Inc. All rights reserved.
Real-Time Analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No cron jobs or background processes
● Handle late arrivals or data corrections with ease
● New data available immediately for analytics or operations
[Diagram: Incoming data (messaging system) is written to Kudu, which stores historical and real-time data together; reporting requests are served directly from Kudu.]
14 © Cloudera, Inc. All rights reserved.
How it works: Replication and distribution
15 © Cloudera, Inc. All rights reserved.
Tables and Tablets
• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • e.g. PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Allows reads from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
  • Store data on local disks (no HDFS)
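As a sketch of how hash partitioning assigns rows to buckets: hash the partition column, take it modulo the bucket count, and all rows with the same value land in the same tablet. The hash function here is an invented stand-in, not Kudu's actual partitioning code:

```python
# Toy hash partitioner matching the DISTRIBUTE BY HASH(timestamp) INTO 100
# BUCKETS example. The md5-based hash is illustrative, not Kudu's.
import hashlib

NUM_BUCKETS = 100

def bucket_for(timestamp):
    # Deterministic hash of the partition column picks one of 100 tablets.
    h = hashlib.md5(str(timestamp).encode()).hexdigest()
    return int(h, 16) % NUM_BUCKETS

# Rows with the same timestamp always land in the same tablet:
assert bucket_for(1445212800) == bucket_for(1445212800)
assert 0 <= bucket_for(1445212801) < NUM_BUCKETS
```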
16 © Cloudera, Inc. All rights reserved.
Metadata
• Replicated master*
  • Acts as a tablet directory ("META" table)
  • Acts as a catalog (table schemas, etc.)
  • Acts as a load balancer (tracks tablet server liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
  • 80-node load test, GetTableLocations RPC performance:
    • 99th percentile: 68 µs; 99.99th percentile: 657 µs
    • <2% peak CPU usage
• Client configured with master addresses
  • Asks the master for tablet locations as needed and caches them
17 © Cloudera, Inc. All rights reserved.
18 © Cloudera, Inc. All rights reserved.
Raft consensus
[Diagram: a Client and three tablet servers. TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER). Each tablet server has its own WAL.]
1a. Client → Leader: Write() RPC
2a. Leader → Followers: UpdateConsensus() RPC
2b. Leader writes its local WAL
3. Each follower writes its WAL
4. Follower → Leader: success
5. Leader has achieved majority
6. Leader → Client: Success!
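The commit rule behind steps 3-6 is standard Raft majority counting: the leader acknowledges the write once a strict majority of replicas (itself included) have persisted it to their WALs. A toy sketch, not Kudu's implementation:

```python
# Toy Raft commit rule: an operation commits once a strict majority of the
# N replicas (leader + followers) have written it to their WALs.

def committed(num_replicas, acks_from_followers, leader_wrote_wal=True):
    votes = acks_from_followers + (1 if leader_wrote_wal else 0)
    return votes > num_replicas // 2   # strict majority

assert committed(3, 1)       # leader + 1 follower = 2 of 3 -> success
assert not committed(3, 0)   # leader alone is not a majority of 3
assert committed(5, 2)       # leader + 2 followers = 3 of 5
```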
19 © Cloudera, Inc. All rights reserved.
Fault tolerance
• Transient FOLLOWER failure:
  • Leader can still achieve majority
  • Restart the follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
  • Followers expect to hear a heartbeat from their leader every 1.5 seconds
  • 3 missed heartbeats: leader election!
  • A new LEADER is elected from the remaining nodes within a few seconds
  • Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N−1)/2 failures
20 © Cloudera, Inc. All rights reserved.
Fault tolerance (2)
• Permanent failure:
  • The leader notices that a follower has been dead for 5 minutes
  • Evicts that follower
  • The master selects a new replica
  • The leader copies the data over to the new one, which joins as a new FOLLOWER
21 © Cloudera, Inc. All rights reserved.
How it works: Storage engine internals
22 © Cloudera, Inc. All rights reserved.
Tablet design
• Inserts buffered in an in-memory store (like HBase's memstore)
  • Flushed to disk
  • Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  • Allows "SELECT AS OF <timestamp>" queries and consistent cross-tablet scans
• Near-optimal read path for "current time" scans
  • No per-row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on the number of recent updates
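The MVCC behavior above can be sketched with a toy versioned cell: each update is tagged with a timestamp rather than overwriting in place, and an "AS OF" read returns the latest version at or before that time. Illustrative only; Kudu's actual timestamp and storage machinery is far more involved:

```python
# Toy MVCC cell: updates append (timestamp, value) versions; a read "AS OF"
# time t sees the newest value with timestamp <= t.

class MVCCCell:
    def __init__(self, ts, value):
        self.versions = [(ts, value)]      # (timestamp, value), kept sorted

    def update(self, ts, value):
        self.versions.append((ts, value))
        self.versions.sort()

    def read_as_of(self, ts):
        latest = None
        for vts, val in self.versions:     # versions are in timestamp order
            if vts <= ts:
                latest = val
        return latest

cell = MVCCCell(10, "$1000")
cell.update(20, "$1M")
assert cell.read_as_of(15) == "$1000"      # snapshot read before the update
assert cell.read_as_of(25) == "$1M"        # "current time" read
```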
23 © Cloudera, Inc. All rights reserved.
LSM vs Kudu
• LSM – Log-Structured Merge (Cassandra, HBase, etc.)
  • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
  • Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  • Shares some traits (memstores, compactions)
  • More complex
  • Slower writes in exchange for faster reads (especially scans)
24 © Cloudera, Inc. All rights reserved.
LSM Insert Path
[Diagram: an INSERT lands in the MemStore (Row=r1 col=c1 val="blah", Row=r1 col=c2 val="1"); a flush writes HFile 1 containing the same rows.]
25 © Cloudera, Inc. All rights reserved.
LSM Insert Path
[Diagram: a second INSERT lands in the MemStore (Row=r2 col=c1 val="blah2", Row=r2 col=c2 val="2"); a flush writes HFile 2 containing those rows, alongside the existing HFile 1 (Row=r1 col=c1 val="blah", Row=r1 col=c2 val="1").]
26 © Cloudera, Inc. All rights reserved.
LSM Update path
[Diagram: an UPDATE (Row=r2 col=c1 val="newval") lands in the MemStore; on disk, HFile 1 holds Row=r1 (col=c1 val="blah", col=c2 val="2") and HFile 2 holds Row=r2 (col=c1 val="v2", col=c2 val="5").]
Note: all updates are "fully decoupled" from reads. A random-write workload is transformed into fully sequential I/O!
27 © Cloudera, Inc. All rights reserved.
LSM Read path
[Diagram: the MemStore (Row=r2 col=c1 val="newval") plus HFile 1 (Row=r1 col=c1 val="blah", Row=r1 col=c2 val="2") and HFile 2 (Row=r2 col=c1 val="v2", Row=r2 col=c2 val="5") are merged based on string row keys, yielding R1: c1=blah c2=2; R2: c1=newval c2=5; …]
• CPU intensive! Must always read row keys
• Any given row may exist across multiple HFiles: must always merge!
• The more HFiles to merge, the slower it reads
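The merge the bullets describe can be sketched as follows. This is toy code: a real LSM merge streams over sorted files with iterators rather than materializing dicts, but the cost model is the same, since every read must consult every store and compare string keys:

```python
# Toy LSM read path: merge the memstore and all on-disk HFiles by string row
# key; the newest store's value for each (row, column) wins.

def lsm_read(stores):
    # 'stores' is ordered newest-first: [memstore, HFile 2, HFile 1, ...].
    # Each store maps (row_key, col) -> value.
    merged = {}
    for store in stores:
        for key, val in store.items():
            merged.setdefault(key, val)   # keep the first (newest) value seen
    # Results come back sorted by row key: the CPU-heavy string compares.
    return sorted(merged.items())

memstore = {("r2", "c1"): "newval"}
hfile2 = {("r2", "c1"): "v2", ("r2", "c2"): "5"}
hfile1 = {("r1", "c1"): "blah", ("r1", "c2"): "2"}
rows = dict(lsm_read([memstore, hfile2, hfile1]))
assert rows[("r2", "c1")] == "newval"   # memstore shadows HFile 2
```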
28 © Cloudera, Inc. All rights reserved.
Kudu storage – Inserts and Flushes
[Diagram: INSERT ("todd", "$1000", "engineer") lands in the MemRowSet; a flush writes it out as DiskRowSet 1 with columns name, pay, role.]
29 © Cloudera, Inc. All rights reserved.
Kudu storage – Inserts and Flushes
[Diagram: a second INSERT ("doug", "$1B", "Hadoop man") lands in the MemRowSet; a later flush creates DiskRowSet 2 (columns name, pay, role) alongside DiskRowSet 1.]
30 © Cloudera, Inc. All rights reserved.
Kudu storage – Updates
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns name, pay, role) plus its own Delta MS.]
Each DiskRowSet has its own DeltaMemStore to accumulate updates
31 © Cloudera, Inc. All rights reserved.
Kudu storage – Updates
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns name, pay, role) plus its own Delta MS.]
UPDATE SET pay="$1M" WHERE name="todd"
1. Is the row in DiskRowSet 2? (check bloom filter) → Bloom says: no!
2. Is the row in DiskRowSet 1? (check bloom filter) → Bloom says: maybe!
3. Search the key column to find the row's offset: rowid = 150
4. Record the delta in DiskRowSet 1's DeltaMemStore: 150: col 1 = $1M
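A toy sketch of that lookup sequence: probe each DiskRowSet's bloom filter, and only on a "maybe" do the binary search of the key column for the row's ordinal offset. The single-hash bloom filter here is a deliberately simple stand-in for Kudu's real one:

```python
# Toy update path: bloom-filter probe per DiskRowSet, then a key-column
# search to find the ordinal rowid, then record the change as a delta.
import bisect
import hashlib

class ToyBloom:
    def __init__(self, keys, nbits=256):
        self.nbits = nbits
        self.mask = 0
        for k in keys:
            self.mask |= 1 << self._h(k)

    def _h(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % self.nbits

    def maybe_contains(self, key):
        # May return a false positive, never a false negative.
        return bool(self.mask >> self._h(key) & 1)

class ToyDiskRowSet:
    def __init__(self, sorted_keys):
        self.keys = sorted_keys               # sorted primary-key column
        self.bloom = ToyBloom(sorted_keys)
        self.deltas = {}                      # rowid -> {col: value}

    def find_rowid(self, key):
        if not self.bloom.maybe_contains(key):
            return None                       # bloom says: no!
        i = bisect.bisect_left(self.keys, key)
        return i if i < len(self.keys) and self.keys[i] == key else None

drs1 = ToyDiskRowSet(["andrew", "todd", "zach"])
rowid = drs1.find_rowid("todd")
drs1.deltas[rowid] = {"pay": "$1M"}           # record the update as a delta
assert rowid == 1
```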
32 © Cloudera, Inc. All rights reserved.
Kudu storage – Read path
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns name, pay, role) plus a Delta MS; DiskRowSet 1's delta store contains 150: pay=$1M.]
• Read rows in DiskRowSet 2, then read rows in DiskRowSet 1
• Any row lives in exactly one DiskRowSet: no need to merge across DiskRowSets!
• Updates are merged based on ordinal offset within the DiskRowSet: array indexing, no string compares
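That ordinal-offset merge can be sketched as follows (illustrative only): base columns are read positionally, and deltas are applied by array index, never by key comparison:

```python
# Toy DiskRowSet scan: read base columns positionally and apply any deltas
# by ordinal offset (array indexing, no string compares).

def scan_diskrowset(base_columns, deltas):
    # base_columns: {col_name: list of values, index = rowid}
    # deltas: {rowid: {col_name: new_value}}
    n = len(next(iter(base_columns.values())))
    rows = []
    for rowid in range(n):
        row = {c: vals[rowid] for c, vals in base_columns.items()}
        row.update(deltas.get(rowid, {}))     # O(1) array-style lookup
        rows.append(row)
    return rows

base = {"name": ["andrew", "todd"], "pay": ["$900", "$1000"]}
rows = scan_diskrowset(base, {1: {"pay": "$1M"}})
assert rows[1] == {"name": "todd", "pay": "$1M"}
```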
33 © Cloudera, Inc. All rights reserved.
Kudu storage – Delta flushes
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2, each holding base data (columns name, pay, role) plus a Delta MS. A flush writes a DeltaMemStore's accumulated updates (e.g. 0: pay=foo) out to a REDO DeltaFile.]
A REDO delta indicates how to transform between the 'base data' (columnar) and a later version.
34 © Cloudera, Inc. All rights reserved.
Kudu storage – Major delta compaction
[Diagram: DiskRowSet (pre-compaction) holds base data (name, pay, role), a Delta MS, and three accumulated REDO DeltaFiles → DiskRowSet (post-compaction) holds rewritten base data, a Delta MS, unmerged REDO deltas, and UNDO deltas.]
• Many deltas accumulate: lots of delta-application work on reads
• Merge updates for columns with a high update percentage
• If a column has few updates, it doesn't need to be rewritten: those deltas are maintained in a new DeltaFile
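A toy sketch of the column-by-column decision: heavily-updated columns are rewritten with their deltas merged in, while rarely-updated columns keep their deltas in a new DeltaFile. The rewrite threshold here is invented for illustration:

```python
# Toy major delta compaction: merge deltas into the base for "hot" columns,
# keep deltas unmerged for "cold" columns. Threshold is illustrative.

def major_delta_compact(base_columns, redo_deltas, rewrite_threshold=0.1):
    n = len(next(iter(base_columns.values())))
    per_col = {}                                   # col -> {rowid: value}
    for rowid, changes in redo_deltas.items():
        for col, val in changes.items():
            per_col.setdefault(col, {})[rowid] = val
    leftover = {}
    for col, changes in per_col.items():
        if len(changes) / n >= rewrite_threshold:  # hot column: merge into base
            for rowid, val in changes.items():
                base_columns[col][rowid] = val
        else:                                      # cold column: keep as deltas
            leftover[col] = changes
    return base_columns, leftover

base = {"pay": ["$1", "$2", "$3", "$4"], "role": ["a", "b", "c", "d"]}
deltas = {0: {"pay": "$9"}, 2: {"pay": "$8", "role": "x"}}
base, leftover = major_delta_compact(base, deltas, rewrite_threshold=0.5)
assert base["pay"] == ["$9", "$2", "$8", "$4"]   # pay rewritten (2 of 4 rows updated)
assert leftover == {"role": {2: "x"}}            # role kept as unmerged deltas
```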
35 © Cloudera, Inc. All rights reserved.
Kudu storage – RowSet compactions
[Diagram: DRS 1 (32MB) [PK=alice, PK=joe, PK=linda, PK=zach], DRS 2 (32MB) [PK=bob, PK=jon, PK=mary, PK=zeke], and DRS 3 (32MB) [PK=carl, PK=julie, PK=omar, PK=zoe] are compacted into DRS 4 [alice, bob, carl, joe], DRS 5 [jon, julie, linda, mary], and DRS 6 [omar, zach, zeke, zoe] (32MB each).]
Reorganize rows to avoid rowsets with overlapping key ranges
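The reorganization can be sketched as a merge-and-resplit (toy code; real compactions also rewrite deltas and respect size budgets): all keys from the overlapping rowsets are merged in sorted order and cut into new rowsets with disjoint key ranges, so any key can live in at most one rowset:

```python
# Toy rowset compaction: merge-sort keys from overlapping rowsets, then
# re-split into rowsets with disjoint key ranges.

def compact_rowsets(rowsets, target_size):
    merged = sorted(k for rs in rowsets for k in rs)   # merge all keys in order
    return [merged[i:i + target_size]                  # re-split into disjoint runs
            for i in range(0, len(merged), target_size)]

drs1 = ["alice", "joe", "linda", "zach"]
drs2 = ["bob", "jon", "mary", "zeke"]
drs3 = ["carl", "julie", "omar", "zoe"]
out = compact_rowsets([drs1, drs2, drs3], target_size=4)
assert out[0] == ["alice", "bob", "carl", "joe"]       # matches the slide's DRS 4
```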
36 © Cloudera, Inc. All rights reserved.
Kudu storage – Compaction policy
• Solves an optimization problem (a knapsack-style problem)
  • Minimize the average "height" of rowsets for a key lookup
  • Bounds the number of seeks for a write or random read
• Restrict the total IO of any compaction to a budget (128MB)
  • No long compactions, ever
  • No "minor" vs. "major" distinction
  • Always be compacting or flushing
  • Low-IO-priority maintenance threads
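A greedy stand-in for that budgeted selection (illustrative; Kudu's actual policy solves the optimization more carefully): pick the rowsets with the best height-reduction per megabyte until the IO budget is spent:

```python
# Toy compaction policy: choose rowsets maximizing benefit (height reduction)
# under a fixed IO budget. Greedy knapsack approximation, for illustration.

def choose_compaction(candidates, budget_mb=128):
    # candidates: list of (size_mb, benefit) per candidate rowset
    chosen, spent = [], 0
    for size, benefit in sorted(candidates, key=lambda c: c[1] / c[0], reverse=True):
        if spent + size <= budget_mb:      # never exceed the IO budget
            chosen.append((size, benefit))
            spent += size
    return chosen, spent

chosen, spent = choose_compaction([(32, 10), (32, 9), (64, 4), (32, 8), (64, 20)])
assert spent <= 128                        # no long compactions, ever
```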
37 © Cloudera, Inc. All rights reserved.
Kudu trade-offs
• Random updates will be slower
  • The HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before an update, and a bloom lookup before an insert
• Single-row reads may be slower
  • The columnar design is optimized for scans
  • Future: may introduce "column groups" for applications where single-row access is more important
  • Especially slow at reading a row that has had many recent updates (e.g. YCSB "zipfian")
38 © Cloudera, Inc. All rights reserved.
Benchmarks
39 © Cloudera, Inc. All rights reserved.
TPC-H (analytics benchmark)
• 75 tablet servers + 1 master cluster
• 12 (spinning) disks each, enough RAM to fit the dataset
• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey AND r_name = 'ASIA'
  AND o_orderdate >= date '1994-01-01' AND o_orderdate < date '1995-01-01'
GROUP BY n_name ORDER BY revenue desc;
40 © Cloudera, Inc. All rights reserved.
• Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
• Parquet is likely to outperform Kudu for HDD-resident data (larger IO requests)
41 © Cloudera, Inc. All rights reserved.
What about Apache Phoenix?
• 10-node cluster (9 workers, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)
[Chart: time in seconds per workload, log scale:]
           Load    TPC-H Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix    2152    219        76         131               0.04
Kudu       1918    13.2       1.7        0.7               0.15
Parquet    155     9.3        1.4        1.5               1.37
42 © Cloudera, Inc. All rights reserved.
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10-node cluster (9 workers, 1 master)
• HBase 1.0
• 100M rows, 10M ops
43 © Cloudera, Inc. All rights reserved.
But don’t trust me (a vendor)…
44 © Cloudera, Inc. All rights reserved.
About Xiaomi: Mobile Internet company, founded in 2010
• Smartphones
• Software: MIUI, Cloud Services, App Store/Game, Payment/Finance, …
• E-commerce
• Smart Home
• Smart Devices
45 © Cloudera, Inc. All rights reserved.
Big Data Analytics Pipeline Before Kudu
• Long pipeline: high latency (1 hour ~ 1 day), data conversion pains
• No ordering: log arrival (storage) order is not exactly logical order, e.g. reading 2-3 days of logs to get 1 day of data
46 © Cloudera, Inc. All rights reserved.
Big Data Analysis Pipeline Simplified with Kudu
• ETL pipeline (0~10s latency): for apps that need to prevent backpressure or require ETL
• Direct pipeline (no latency): for apps that don't require ETL and have no backpressure issues
• Kudu also serves OLAP scans, side-table lookups, and the result store
47 © Cloudera, Inc. All rights reserved.
Use Case 1: Mobile service monitoring and tracing tool
Gathers important RPC tracing events from mobile apps and backend services; a service monitoring & troubleshooting tool.
Requirements:
• High write throughput: >5 billion records/day and growing
• Query latest data with quick response: identify and resolve issues quickly
• Can search for individual records: easy troubleshooting
48 © Cloudera, Inc. All rights reserved.
Use Case 1: Benchmark
Environment:
• 71-node cluster
• Hardware: CPU E5-2620 2.1GHz × 24 cores; 64GB memory; 1Gb network; 12 HDDs
• Software: Hadoop 2.6 / Impala 2.1 / Kudu
Data:
• 1 day of server-side tracing data: ~2.6 billion rows, ~270 bytes/row, 17 columns (5 key columns)
49 © Cloudera, Inc. All rights reserved.
Use Case 1: Benchmark Results
Query latency in seconds (Q1–Q6):
          Q1    Q2    Q3    Q4    Q5    Q6
Kudu      1.4   2.0   2.3   3.1   1.3   0.9
Parquet   1.3   2.8   4.0   5.7   7.5   16.7

Bulk load using Impala (INSERT INTO):
          Total Time (s)   Throughput (total)   Throughput (per node)
Kudu      961.1            2.8M records/s       39.5k records/s
Parquet   114.6            23.5M records/s      331k records/s

* HDFS Parquet file replication = 3, Kudu table replication = 3
* Each query run 5 times, then take the average
50 © Cloudera, Inc. All rights reserved.
Use Case 1: Result Analysis
• Lazy materialization: ideal for search-style queries; Q6 returns only a few records (of a single user) with all columns
• Scan-range pruning using the primary index: predicates on the primary key mean Q5 only scans 1 hour of data
• Future work:
  • Primary index: speed up ORDER BY and DISTINCT
  • Hash partitioning: speed up COUNT(DISTINCT), no need for a global shuffle/merge
51 © Cloudera, Inc. All rights reserved.
Use Case 2: OLAP PaaS for the ecosystem cloud
• Provide big data services for smart-hardware startups (Xiaomi's ecosystem members)
• OLAP database with some OLTP features
• Manage, ingest, and query your data and serve results in one place
• Sources: backend / mobile app / smart device / IoT …
52 © Cloudera, Inc. All rights reserved.
What Kudu is not
53 © Cloudera, Inc. All rights reserved.
Kudu is…
• NOT a SQL database
  • "BYO SQL"
• NOT a filesystem
  • data must have tabular structure
• NOT a replacement for HBase or HDFS
  • Cloudera continues to invest in those systems
  • Many use cases where they're still more appropriate
• NOT an in-memory database
  • Very fast for memory-sized workloads, but can operate on larger data too!
54 © Cloudera, Inc. All rights reserved.
Getting started
55 © Cloudera, Inc. All rights reserved.
Getting started as a user
• http://getkudu.io
• [email protected]
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster
56 © Cloudera, Inc. All rights reserved.
Getting started as a developer
• http://github.com/cloudera/kudu
  • All commits go here first
• Public gerrit: http://gerrit.cloudera.org
  • All code reviews happen here
• Public JIRA: http://issues.cloudera.org
  • Includes bugs going back to 2013. Come see our dirty laundry!
• [email protected]
• Apache 2.0 licensed open source
  • Contributions are welcome and encouraged!
57 © Cloudera, Inc. All rights reserved.
Demo? (if we have time, and the internet gods are willing)
58 © Cloudera, Inc. All rights reserved.
http://getkudu.io/ | @getkudu