Big data: current technology scope
![Page 1: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/1.jpg)
Roman Nikitchenko, 09.10.2014
BIG.DATA Technology scope
![Page 2: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/2.jpg)
www.vitech.com.ua
Any real big data is just about our DIGITAL LIFE FOOTPRINT
![Page 3: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/3.jpg)
BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
![Page 4: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/4.jpg)
Arguments for meetings with management ;-)
But we are always special, aren't we?
What is our stack of big data technologies?
Our stack
Some of our specifics
Couple of buzz words
![Page 5: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/5.jpg)
YARN
Linear scalability: 2 times more power costs 2 times more money.
No natural keys, so load balancing is perfect.
No 'special' hardware, so staging is closer to production.
![Page 6: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/6.jpg)
HADOOP magic is here!
![Page 7: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/7.jpg)
What is it?
What is HADOOP?
● Hadoop is an open source framework for big data, covering both distributed storage and processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
![Page 8: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/8.jpg)
Why Hadoop?
[Figure: "x MAX + = BIG DATA" graphic]
What is HADOOP INDEED?
![Page 9: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/9.jpg)
SIMPLE BUT RELIABLE
● A really large amount of data is stored in a reliable manner.
● Storage is simple, recoverable and (relatively) cheap.
● The same applies to processing power.
![Page 10: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/10.jpg)
COMPLEX INSIDE, SIMPLE OUTSIDE
● Complexity is buried inside. Most of the really complex operations are handled by the engine.
● The interface is remote and compatible between versions, so clients are relatively safe against implementation changes.
![Page 11: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/11.jpg)
DECENTRALIZED
● No single point of failure (almost).
● Scales as close to linearly as possible.
● No manual actions are needed to recover from failures.
![Page 12: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/12.jpg)
Hadoop historical top view
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are a lot of alternatives.
● This is only the initial architecture; it is now more complex.
![Page 13: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/13.jpg)
HDFS top view
● The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
● Actual work is performed by the data nodes.
HDFS is... scalable
![Page 14: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/14.jpg)
● Files are stored in fairly large blocks. Every block is replicated to several data nodes.
● Replication is tracked by the namenode. Clients only locate blocks through the namenode; the actual load is taken by the datanodes.
● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
HDFS is... reliable
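The block and replication ideas on this slide can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: the block size, replication factor, node names and round-robin placement are assumptions for illustration (real HDFS placement is rack-aware).

```python
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Toy namenode 'directory': block index -> distinct datanodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    ring = cycle(datanodes)
    return {b: [next(ring) for _ in range(replication)] for b in range(num_blocks)}


def recover(directory, failed, datanodes, replication=REPLICATION):
    """Re-replicate every block that lost a copy on the failed datanode."""
    alive = [d for d in datanodes if d != failed]
    for holders in directory.values():
        holders[:] = [d for d in holders if d != failed]
        candidates = [d for d in alive if d not in holders]
        while len(holders) < replication and candidates:
            holders.append(candidates.pop(0))
    return directory
```

Clients would only ask the directory where a block lives; reading and writing the block itself goes to the datanodes.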
![Page 15: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/15.jpg)
NO BACKUPS
![Page 16: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/16.jpg)
● A 2-step data processing model: transform (map) and then reduce. Really nice for doing things in a distributed manner.
● A large class of jobs can be adapted to it, but not all of them.
MapReduce is...
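The two-step model can be illustrated with the classic word count, written here as a single-process Python sketch. The function names are made up for this example and are not the Hadoop API; the `shuffle` step stands in for the grouping the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input line into (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, then reduce each group of values."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread over many nodes.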
![Page 17: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/17.jpg)
BIG DATA processing: requirements
● Work is to be balanced.
● Work can be shared in accordance with data placement.
● Work is to be balanced to reflect resource balance.
DISTRIBUTION: LOAD HAS TO BE SHARED
![Page 18: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/18.jpg)
DATA LOCALITY: TOOLS ARE TO BE CLOSE TO WORK PLACE
● With MapReduce, process data on the same nodes it is stored on.
● Distributed storage means distributed processing.
![Page 19: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/19.jpg)
DISTRIBUTION + LOCALITY: TOGETHER THEY GO!
[Diagram: YOUR DATA (BIG DATA) is split into partitions; the WORK TO DO is shared across them and done locally on each partition; the partial results form the JOINED RESULT.]
Data partitioning drives work sharing. Good partitioning means good scalability.
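A small Python sketch of "partition, do it locally, join the result". Hash partitioning with `crc32` and the per-key sum are assumptions chosen for illustration; the slides do not prescribe a partitioning function.

```python
from collections import Counter
from zlib import crc32


def partition_of(key, num_partitions):
    """Stable hash partitioning: the same key always lands in the same partition."""
    return crc32(key.encode("utf-8")) % num_partitions


def partition(records, num_partitions):
    """Split (key, value) records into num_partitions buckets."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_work(part):
    """Work done locally on one partition: sum values per key."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals


def joined_result(partials):
    """Merge the per-partition results into the final answer."""
    result = Counter()
    for partial in partials:
        result.update(partial)
    return result
```

If one partition receives far more records than the others, that partition's node becomes the bottleneck, which is why partitioning quality decides scalability.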
![Page 20: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/20.jpg)
● A new component (YARN) forms the resource management layer and completes a real distributed data OS.
● From now on, MapReduce is only one among other YARN applications.
Now with resource management
![Page 21: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/21.jpg)
● Better resource balance for heterogeneous clusters and multiple applications.
● Dynamic applications over static services.
● A much wider application model than simple MapReduce: things like Spark or Tez.
Why is YARN SO important?
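To make "resource balance for heterogeneous clusters and multiple applications" concrete, here is a toy allocator in Python. It is only a sketch under stated assumptions: the `Node` class, the memory-only resource model and the "most free memory first" rule are invented for this example and are not YARN's scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


def allocate(nodes, app, container_mb, count):
    """Place up to `count` containers, each time on the node with most free memory."""
    placed = []
    for _ in range(count):
        node = max(nodes, key=lambda n: n.free_mb)
        if node.free_mb < container_mb:
            break  # cluster is full; a real scheduler would queue the request
        node.free_mb -= container_mb
        node.containers.append(app)
        placed.append(node.name)
    return placed
```

Different applications (a MapReduce job, a Spark job) simply ask the same allocator for containers, which is the point of a shared resource management layer.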
![Page 22: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/22.jpg)
World's first ever DATA OS
A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy and fault tolerance.
![Page 23: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/23.jpg)
Hadoop: don't do it yourself
![Page 24: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/24.jpg)
● Hortonworks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet.
● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move!
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality and their distribution costs money. For cases where node performance is a high priority.
Choose your destiny! We did.
![Page 25: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/25.jpg)
HBase motivation
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. There is an issue with lots of small files.
● Write once, read many times ideology.
● MapReduce is not so flexible, and neither is any database built on top of it.
● How about realtime?
But Hadoop is...
![Page 26: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/26.jpg)
HBase motivation
BUT WE OFTEN NEED...
LATENCY, SPEED and all Hadoop properties.
![Page 27: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/27.jpg)
High layer applications
Resource management
Distributed file system
YARN
![Page 28: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/28.jpg)
[Diagram: a Table is split into Regions; each Row has a Key and columns grouped into Family #1, Family #2, ...]
● Data is placed in tables.
● Tables are split into regions based on row key ranges.
● Columns are grouped into families.
● Every table row is identified by a unique row key.
● Every row consists of columns.
Logical data model
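The logical model maps naturally onto nested dictionaries, and "regions based on row key ranges" onto a sorted list of start keys. The following Python sketch uses hypothetical region names, start keys and rows; it is not the HBase client API.

```python
from bisect import bisect_right

# Hypothetical table layout: each region covers [start_key, next start_key).
REGION_START_KEYS = ["", "g", "p"]  # "" means "from the very beginning"
REGION_NAMES = ["region-1", "region-2", "region-3"]


def region_for(row_key):
    """Find the region whose key range contains row_key."""
    return REGION_NAMES[bisect_right(REGION_START_KEYS, row_key) - 1]


# One row: row key -> {family: {column qualifier: value}}
table = {
    "alice": {"profile": {"name": "Alice"}, "stats": {"logins": 3}},
    "peter": {"profile": {"name": "Peter"}, "stats": {}},
}


def read_cell(row_key, family, qualifier):
    """Return one cell value, or None if the row or column does not exist."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)
```

Because row keys are sorted and regions are key ranges, the choice of row key decides which region, and so which server, takes the load.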
![Page 29: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/29.jpg)
[Diagram: Table, Regions and Rows (Key, Family #1, Family #2, ...); each family maps to its own HFile.]
HFile: family #1 (Row key, Column, Value, TS)
HFile: family #2 (Row key, Column, Value, TS)
● Data is stored in HFiles.
● Families are stored on disk in separate files.
● Row keys are indexed in memory.
● A column record includes key, qualifier, value and timestamp.
● No column limit.
● Storage is block based.
● Delete is just another marker record.
● Periodic compaction is required.
Real data model
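Why deletes are marker records and why compaction is needed becomes clearer with a toy append-only store. This Python sketch is an illustration only: the `FamilyStore` class and its methods are hypothetical, it keeps a single version per cell after compaction, and real HBase compaction rules are richer.

```python
DELETE = object()  # tombstone, standing in for a delete marker record


class FamilyStore:
    """Append-only store for one column family: records are never updated in place."""

    def __init__(self):
        self.records = []  # (row_key, qualifier, timestamp, value)

    def put(self, row_key, qualifier, ts, value):
        self.records.append((row_key, qualifier, ts, value))

    def delete(self, row_key, qualifier, ts):
        # A delete is just another record carrying a marker instead of a value.
        self.records.append((row_key, qualifier, ts, DELETE))

    def get(self, row_key, qualifier):
        """Return the newest value, or None if the newest record is a delete marker."""
        versions = [(ts, v) for r, q, ts, v in self.records if (r, q) == (row_key, qualifier)]
        if not versions:
            return None
        _, value = max(versions, key=lambda tv: tv[0])
        return None if value is DELETE else value

    def compact(self):
        """Keep only the newest live record per cell; drop old versions and markers."""
        newest = {}
        for r, q, ts, v in self.records:
            if (r, q) not in newest or ts > newest[(r, q)][0]:
                newest[(r, q)] = (ts, v)
        self.records = sorted(
            (r, q, ts, v) for (r, q), (ts, v) in newest.items() if v is not DELETE
        )
```

Until `compact` runs, overwritten values and delete markers keep occupying space, which is the cost of never modifying files in place.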
![Page 30: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/30.jpg)
[Diagram: Client, Master, ZooKeeper and four region servers (RS), connected by DATA and META flows.]
● ZooKeeper coordinates distributed elements and is the primary contact point for clients.
● The master server keeps metadata and manages data distribution over region servers.
● Region servers manage data table regions.
● Clients communicate directly with region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
HBase: infrastructure view
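The lookup path described on this slide (ZooKeeper, then master, then region server) can be traced with a toy model in Python. All classes here are hypothetical stand-ins that follow the slide's simplified description; a real HBase client also caches region locations, and the catalog lookup details differ between versions.

```python
from bisect import bisect_right


class RegionServer:
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, row_key, value):
        self.rows[row_key] = value

    def get(self, row_key):
        return self.rows.get(row_key)


class Master:
    """Keeps the metadata: which region server holds which row key range."""

    def __init__(self, assignments):
        # assignments: list of (start_key, RegionServer), sorted by start_key
        self.start_keys = [start for start, _ in assignments]
        self.servers = [server for _, server in assignments]

    def locate(self, row_key):
        return self.servers[bisect_right(self.start_keys, row_key) - 1]


class Client:
    def __init__(self, zookeeper):
        self.zookeeper = zookeeper  # plain dict standing in for the ZooKeeper ensemble

    def _region_server(self, row_key):
        master = self.zookeeper["master"]  # step 1: find the master via ZooKeeper
        return master.locate(row_key)      # step 2: find the region via the master

    def put(self, row_key, value):
        # step 3: data goes directly to the region server, not through the master
        self._region_server(row_key).put(row_key, value)

    def get(self, row_key):
        return self._region_server(row_key).get(row_key)
```

Only metadata lookups touch ZooKeeper and the master; the data path runs between client and region server.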
![Page 31: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/31.jpg)
[Diagram: three racks, each with HDFS data nodes (DN) and region servers (RS), plus NameNode, Client, Master and ZooKeeper, connected by DATA and META flows.]
● ZooKeeper coordinates distributed elements and is the primary contact point for clients.
● The master server keeps metadata and manages data distribution over region servers.
● Region servers manage data table regions.
● The actual data storage service, including replication, is on the HDFS data nodes.
● Clients communicate directly with region servers for data.
● Clients locate the master through ZooKeeper, then the needed regions through the master.
Together with HDFS
![Page 32: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/32.jpg)
DATA LAKE
Take as much data about your business processes as you can take. The more data you have, the more value you could get from it.
![Page 33: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/33.jpg)
ZooKeeper
… because coordinating distributed systems is a Zoo
Apache ZooKeeper
![Page 34: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/34.jpg)
Apache ZooKeeper
We use this guy:
● As a part of the Hadoop / HBase infrastructure
● To coordinate MapReduce job tasks
![Page 35: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/35.jpg)
Apache Spark
● A better MapReduce, with at least some MapReduce elements able to be reused.
● Dynamic, faster to start up and does not need anything from the cluster.
● New job models, not only Map and Reduce.
● Results can be passed through memory, including the final one.
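"Not only Map and Reduce" and "results passed through memory" can be pictured as a chain of stages that hand data to each other without intermediate files. The Python sketch below is loosely modeled on that style; the `Dataset` class is hypothetical and is not the Spark API.

```python
class Dataset:
    """Toy in-memory dataset with chainable, lazily evaluated stages."""

    def __init__(self, source):
        self._source = source  # any iterable; nothing is computed yet

    def map(self, fn):
        return Dataset(fn(x) for x in self._source)

    def filter(self, pred):
        return Dataset(x for x in self._source if pred(x))

    def reduce_by_key(self, fn):
        def grouped():
            acc = {}
            for key, value in self._source:
                acc[key] = fn(acc[key], value) if key in acc else value
            yield from acc.items()
        return Dataset(grouped())

    def collect(self):
        """Run the whole chain and return the final result as a list."""
        return list(self._source)
```

A job is then a pipeline such as `Dataset(events).filter(...).map(...).reduce_by_key(...).collect()`, with each stage streaming into the next.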
![Page 36: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/36.jpg)
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. The search result is a document ID.
[Diagram: INDEX UPDATE and INDEX QUERY requests go in; search responses come back.]
An index update request is analyzed, tokenized, transformed... and the same goes for queries.
SOLR is just about search
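The "index, not storage" idea can be shown with a minimal inverted index in Python. This is a sketch, not how SOLR is implemented: the `analyze` function and `Index` class are made up for this example, and the analysis chain is reduced to lowercasing and splitting on non-alphanumerics.

```python
import re
from collections import defaultdict


def analyze(text):
    """The same analysis chain is applied to indexed text and to queries."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document IDs

    def update(self, doc_id, text):
        """Index a document; only tokens and the ID are kept, not the text."""
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def query(self, text):
        """Return IDs of documents that contain every query token."""
        tokens = analyze(text)
        if not tokens:
            return set()
        hits = [self.postings.get(token, set()) for token in tokens]
        return set.intersection(*hits)
```

The query returns document IDs only; fetching the documents themselves is the job of the actual storage.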
![Page 37: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/37.jpg)
● HBase handles online requests to change user data.
● The NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests.
● Indexes are built in SOLR, so HBase data is searchable.
![Page 38: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/38.jpg)
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.
![Page 39: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/39.jpg)
HBase: Data and search integration
[Diagram, reconstructed from its labels:]
● Client: the user just puts (or deletes) data (data update) and sends search requests over HTTP.
● HBase cluster (HBase regions): replication can be set up down to column family level.
● Lily HBase NRT indexer: fed by REPLICATION, translates data changes into SOLR index updates.
● SOLR cloud: finally provides search (search responses).
● Apache ZooKeeper does all coordination.
● HDFS serves as the low-level file system.
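The indexer's role, translating a stream of data changes into index updates, can be sketched as a pure function in Python. The event and request shapes below are hypothetical and chosen only for illustration; they are not the Lily indexer's or SOLR's real formats.

```python
def to_index_update(change):
    """Translate one replicated data change into an index update request."""
    if change["op"] == "put":
        return {"action": "add", "id": change["row"], "fields": change["columns"]}
    if change["op"] == "delete":
        return {"action": "delete", "id": change["row"]}
    raise ValueError(f"unknown operation: {change['op']}")


def apply_updates(index, updates):
    """Toy search side: keep the latest indexed fields per document ID."""
    for update in updates:
        if update["action"] == "add":
            index[update["id"]] = update["fields"]
        else:
            index.pop(update["id"], None)
    return index
```

The client never talks to the indexer: it only puts or deletes data, and the index follows the change stream.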
![Page 40: Big data: current technology scope](https://reader036.vdocuments.us/reader036/viewer/2022062406/55a0fe021a28ab0d2e8b4593/html5/thumbnails/40.jpg)
Questions and discussion