Post on 10-Oct-2020
Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited
Uncovering the mysteries of data access in Hadoop
Costin Leau (@costinl)
Elasticsearch
Elasticsearch = OSS Search & Analytics engine
Provides native integration with Hadoop
Elasticsearch Hadoop
Native integration w/ Hadoop eco-system
Hadoop
Hadoop Distributed File System (HDFS)
Map Reduce Framework (M/R)
Map/Reduce
Data Access
Reading Data
Focusing on Data Access
Data Locality
Data
- Critical
- Persistent
- Big
Code
- Small
- Stateless
- Transient
I/O is expensive; CPU/RAM are cheap
Compression & Codecs
Saves disk space (by using the free CPU)
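Name     Extension   Splittable
Gzip     .gz         No
Bzip2    .bz2        Yes
Snappy   .snappy     No
LZO      .lzo        No (yes once indexed)
LZ4      .lz4        No
Snappy and LZ4 data becomes splittable only inside a container format (e.g. SequenceFile)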
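As a rough illustration of spending CPU to save disk space, the sketch below compresses a made-up, repetitive payload with two codecs from Python's standard library (gzip and bz2). The sample data and variable names are assumptions for the example, not part of the talk.

```python
import bz2
import gzip

# Hypothetical sample data: repetitive log-like lines compress well.
payload = b"2013-01-01 INFO request served in 12ms\n" * 10_000

# Compress the same bytes with two different codecs.
compressed = {
    "gzip": gzip.compress(payload),
    "bzip2": bz2.compress(payload),
}

for name, blob in compressed.items():
    ratio = len(blob) / len(payload)
    print(f"{name}: {len(payload)} -> {len(blob)} bytes ({ratio:.2%})")

# Decompression restores the original bytes exactly.
restored_gzip = gzip.decompress(compressed["gzip"])
restored_bzip2 = bz2.decompress(compressed["bzip2"])
```

The actual ratios depend heavily on the data; highly repetitive input like this is close to a best case.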
Serialization
Converts Objects to byte streams (and back)
Hadoop Writable
JDK serialization
Avro
Protocol Buffers
Thrift
Kryo
MsgPack
JSON/Smile
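A minimal sketch of the round trip described above (object to bytes and back), using one text format (JSON) and one hand-rolled binary layout. The `Reading` type and the binary layout are hypothetical and only serve the example; real frameworks such as Avro or Protocol Buffers define their own wire formats.

```python
import json
import struct
from dataclasses import asdict, dataclass


@dataclass
class Reading:  # hypothetical record type used only for this example
    sensor: str
    value: float


def to_json_bytes(r: Reading) -> bytes:
    """Text serialization: self-describing but verbose."""
    return json.dumps(asdict(r), sort_keys=True).encode("utf-8")


def from_json_bytes(data: bytes) -> Reading:
    return Reading(**json.loads(data.decode("utf-8")))


def to_binary(r: Reading) -> bytes:
    """Binary serialization: 2-byte name length, UTF-8 name, 8-byte double."""
    name = r.sensor.encode("utf-8")
    return struct.pack(">H", len(name)) + name + struct.pack(">d", r.value)


def from_binary(data: bytes) -> Reading:
    (n,) = struct.unpack_from(">H", data, 0)
    name = data[2:2 + n].decode("utf-8")
    (value,) = struct.unpack_from(">d", data, 2 + n)
    return Reading(name, value)


original = Reading(sensor="s1", value=21.5)
json_wire = to_json_bytes(original)
binary_wire = to_binary(original)
print(len(json_wire), len(binary_wire))
```

The binary form is smaller but both sides must agree on the layout in advance, which is the trade-off the schema-based formats in the list manage for you.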
Main APIs
OutputFormat
RecordWriter
Allow Hadoop to read/write data
Tied to the data source
Handle data format (serialization/protocol/etc..)
Are bundled with the Hadoop job
- restrictions on size, state, configuration
Input/Output Format
Fundamental for data retrieval/store
Handle splitting
Understand the data format
- File based (relies on the Hadoop FS): HDFS, S3, webfs, etc.
- Protocol based: HTTP/REST, JDBC/RDBMS, ...
InputSplits
Divide the data into fragments
- format / application-specific bounds
- processed separately
- self-contained
- have no dependency on one another
Critical for scalability (and performance)
Drive the number of tasks running in parallel
Ideally are data aware
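The idea of independent, self-contained fragments can be sketched as plain byte ranges over an input of known size. The `Split` type and `compute_splits` function are hypothetical names for this illustration and are not the Hadoop API.

```python
from typing import List, NamedTuple


class Split(NamedTuple):
    start: int   # byte offset where the fragment begins
    length: int  # number of bytes in the fragment


def compute_splits(total_size: int, split_size: int) -> List[Split]:
    """Cut [0, total_size) into consecutive ranges of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < total_size:
        length = min(split_size, total_size - offset)
        splits.append(Split(offset, length))
        offset += length
    return splits


# Each split can be handed to a separate task; the count drives parallelism.
splits = compute_splits(total_size=10, split_size=4)
print(splits)
```

A data-aware implementation would choose boundaries from the format or the store (for example block or shard boundaries) rather than a fixed size.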
Record Reader/Writer
Translate the split into objects (map of k/v)
Responsible for:
- object creation
- data structure parsing
- progress monitoring
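To make the split-to-records translation concrete, here is a hedged sketch of a line-oriented reader over a byte-range split. It assumes the convention that a line belongs to the split in which it starts, so a reader skips a partial first line and may read past the end of its range to finish its last line. `read_records` is a hypothetical function, not Hadoop's RecordReader, and progress reporting is left out.

```python
from typing import Iterator, Tuple


def read_records(data: bytes, start: int, length: int) -> Iterator[Tuple[int, str]]:
    """Yield (byte offset, line) records for the split [start, start + length)."""
    end = start + length
    pos = start
    if start > 0:
        # The split may begin mid-line; that line belongs to the previous
        # split, so move forward to the first line that starts in this range.
        newline = data.find(b"\n", start - 1)
        pos = len(data) if newline == -1 else newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # The key is the byte offset, the value is the decoded line.
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


data = b"Short sentence.\nThe quick brown fox jumps over the lazy dog.\n"
cut = 30  # arbitrary boundary that falls in the middle of the second line
first = list(read_records(data, 0, cut))
second = list(read_records(data, cut, len(data) - cut))
print(first)
print(second)
```

Because the second line starts before the boundary, the first split emits both records and the second split emits none; no record is lost or duplicated wherever the boundary falls.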
Formats and Records
[Diagram: text input ("Short sentence." and "The quick brown fox jumps over the lazy dog.") passes through an InputFormat and a RecordReader and comes out as records.]
Out of the box
Type             Format   Records
File             Text     Line
SequenceFile     Binary   Key/Value
RCFile/ORCFile   Binary   Column Groups
Adding a data store
Sharding is critical
External data stores features
Scalable / Sharding ?
Data locality ?
Streaming ?
Collocation (with Hadoop) ?
Data Source Example - RDBMS
Sharding – no
Data Locality – none
Streaming – supported by some
Collocation – no
DB Formats and Records
InputFormat
RecordReader
Using an RDBMS w/ Hadoop
JDBC-based (I|O)Format/Record(R|W)
- available in Hadoop out of the box
Usage discouraged due to:
- Lack of batching
- Multiple, short queries
- Multiple, short transactions
Alternatives
RDBMS love batching
Export data to HDFS (Apache Sqoop)
Import data from HDFS to RDBMS
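The batching point can be sketched with Python's built-in sqlite3 module: many rows go through one prepared statement inside a single transaction, instead of one short query and transaction per row. The table and sample rows are made up for the example, and an in-memory SQLite database stands in for a real RDBMS.

```python
import sqlite3

# Hypothetical sample data standing in for records produced by a job.
rows = [(i, f"user-{i}") for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# One transaction, one parameterized statement, many rows.
# The 'with' block commits on success and rolls back on error.
with conn:
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)
```

Bulk transfer tools apply the same idea at a larger scale, which is why exporting to and importing from HDFS tends to suit an RDBMS better than per-record access from tasks.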
Data Source Example - HBase
Distributed, scalable column datastore
Excellent for high rates of row-level updates
Based on HDFS (can use Map/Reduce)
Data Source Example - HBase
Sharding – yes
Data Locality – yes
Streaming – partial
Collocation – yes (no need; it's all HDFS)
Elasticsearch Hadoop
Sharding – yes
Data Locality – yes
Streaming – partial
Collocation – yes
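How sharding and data locality can feed into splits is sketched below, under the assumption that each shard becomes one split and that the host holding the shard is passed along as a scheduling hint. `ShardSplit`, `splits_for_index`, the index name and the node names are all hypothetical; this is not the Elasticsearch Hadoop API.

```python
from typing import Dict, List, NamedTuple


class ShardSplit(NamedTuple):
    index: str
    shard_id: int
    preferred_host: str  # locality hint for the scheduler, not a guarantee


def splits_for_index(index: str, shard_hosts: Dict[int, str]) -> List[ShardSplit]:
    """Assumed mapping: one split per shard, tagged with the host that holds it."""
    return [
        ShardSplit(index, shard_id, host)
        for shard_id, host in sorted(shard_hosts.items())
    ]


# Hypothetical cluster layout: shard id -> node currently holding that shard.
layout = {0: "node-a", 1: "node-b", 2: "node-a"}
splits = splits_for_index("logs", layout)
for s in splits:
    print(s)
```

With collocated clusters, a scheduler can use the host hint to run each task next to the shard it reads.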
Elasticsearch Hadoop
InputFormat
RecordReader
{ "key":"value", "key":"value" } {...}
Hadoop Eco-system
Library     Format                       Record
Cascading   Tap                          Scheme
Pig         Load/StoreFunction           Record
Hive        HiveStorageHandler / SerDe   SerDe / Record
Built upon (I|O)Format and Record(R|W)
Handle type conversion
Implement complex operations
Wrap-up
Focus on data store capabilities first
Dump data to HDFS for stores w/o sharding