Hadoop World Overview: Trends and Topics


Upload: valentin-kropov

Post on 25-Jan-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Hadoop world overview trends and topics

Trends and Topics

Valentyn Kropov
Solutions Architect, SAG, SoftServe

Page 2: Hadoop world overview trends and topics

Agenda

1. Conference Overview

2. Bright Future of Hadoop MapReduce

3. Apache Spark Data Frames

4. Cloudera Kudu

5. Most Popular Reference Architecture

6. Use Cases

Page 3: Hadoop world overview trends and topics

#1 Conference Overview

Page 4: Hadoop world overview trends and topics
Page 5: Hadoop world overview trends and topics
Page 6: Hadoop world overview trends and topics
Page 7: Hadoop world overview trends and topics

#2 Bright Future of Hadoop MapReduce

Page 8: Hadoop world overview trends and topics
Page 9: Hadoop world overview trends and topics

Spark is a Future

Cloudera Announces One Platform Initiative (Sep 9, 2015)

Page 10: Hadoop world overview trends and topics

Spark is a Present

It appeared in 72% of presentations and use cases at the Hadoop World conference

Page 11: Hadoop world overview trends and topics

Spark is Easier to Code

MapReduce / Java vs. Spark / Scala

Page 12: Hadoop world overview trends and topics

Spark is Faster

Up to 100x faster!

Page 13: Hadoop world overview trends and topics

Spark is Interactive

Page 14: Hadoop world overview trends and topics

Spark is Real-Time

Page 15: Hadoop world overview trends and topics

And they have Power

• 400 contributors

• From 100+ companies

• Databricks (1 y.o., 30->100 people, $47 million)

• Cloudera (370 patches, 43k lines of code)

Page 16: Hadoop world overview trends and topics

Cloudera One Platform: Read More

http://goo.gl/jSK0h6

Page 17: Hadoop world overview trends and topics

#3 Spark Data Frames

Page 18: Hadoop world overview trends and topics

Most of Data is Still Structured!

• No Sorting?

• No Joins?

• No Aggregations?

• No Filtering?

• No cross-DB connections?

Page 19: Hadoop world overview trends and topics

Data Frame is…

• API

• like a Table (RDBMS)

• or Data Frame (Python/R)

• Abstraction layer over RDD

Page 20: Hadoop world overview trends and topics

Construct Data Frame

# Constructs a DataFrame from Hive
users = context.table("users")

# from JSON files in S3
logs = context.load("s3n://data.json", "json")

Page 21: Hadoop world overview trends and topics

Filtering

# Create a new DataFrame that contains “young users” only
young = users.filter(users.age < 21)

Page 22: Hadoop world overview trends and topics

Group By

# Count the number of young users by gender
young.groupBy("gender").count()

Page 23: Hadoop world overview trends and topics

Joins!

# Join users with another DataFrame called logs
users.join(logs, logs.userId == users.userId, "left_outer")
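To make the filter, group-by and join slides concrete without a cluster, here is a minimal plain-Python sketch of what those three operations compute. The sample rows, the `page` field and the helper `left_outer_join` are made up for illustration; the sketch mimics the semantics only and says nothing about how Spark executes them.

```python
from collections import Counter

# Hypothetical sample data standing in for the "users" and "logs" tables.
users = [
    {"userId": 1, "age": 19, "gender": "F"},
    {"userId": 2, "age": 34, "gender": "M"},
    {"userId": 3, "age": 20, "gender": "M"},
]
logs = [
    {"userId": 1, "page": "/home"},
    {"userId": 1, "page": "/cart"},
    {"userId": 2, "page": "/home"},
]

# filter: keep only rows matching the predicate (age < 21).
young = [u for u in users if u["age"] < 21]

# groupBy("gender").count(): number of rows per distinct gender.
young_by_gender = Counter(u["gender"] for u in young)


def left_outer_join(left, right, key):
    """Keep every left row; pair it with each matching right row, or with None."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for l_row in left:
        for r_row in index.get(l_row[key], [None]):
            joined.append((l_row, r_row))
    return joined


joined = left_outer_join(users, logs, "userId")
```

A user with no log entries (userId 3 here) still appears in the joined result, paired with `None`; that is the difference between a left outer join and an inner join.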

Page 24: Hadoop world overview trends and topics

Spark Languages

Spark Survey 2015

Page 25: Hadoop world overview trends and topics

Why Not Python + RDD?

Page 26: Hadoop world overview trends and topics

Data Frames and Python

• Compiled into JVM bytecode

• Data Never Leaves the JVM

• Python passes commands only

• Commands are pushed down
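The idea that Python only passes commands can be illustrated with a toy lazy plan. In this sketch the classes `Column` and `LazyFrame` and the function `run` are hypothetical names, not Spark APIs; `run` stands in for the engine on the JVM side. The Python expressions build a description of the work, and the data is touched only when the plan is executed.

```python
class Column:
    """A named column; comparisons build a description instead of evaluating."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, value):
        return ("lt", self.name, value)


class LazyFrame:
    """Records operations as a plan; nothing touches the data until run()."""

    def __init__(self, source, plan=()):
        self.source = source
        self.plan = tuple(plan)

    def filter(self, predicate):
        return LazyFrame(self.source, self.plan + (("filter", predicate),))

    def count(self):
        return LazyFrame(self.source, self.plan + (("count",),))


def run(frame, tables):
    """Stand-in for the execution engine: walks the plan over the real rows."""
    rows = tables[frame.source]
    for step in frame.plan:
        if step[0] == "filter":
            _, name, value = step[1]
            rows = [r for r in rows if r[name] < value]
        elif step[0] == "count":
            return len(rows)
    return rows


age = Column("age")
plan = LazyFrame("users").filter(age < 21).count()  # only a description so far
tables = {"users": [{"age": 19}, {"age": 34}, {"age": 20}]}
result = run(plan, tables)
```

Because the plan is plain data, the engine is free to inspect and reorder it before running, which is what makes pushing commands down possible.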

Page 27: Hadoop world overview trends and topics

Data Frames Performance

Page 28: Hadoop world overview trends and topics

Data Frames: Read More

http://www.slideshare.net/JonHaddad/enter-the-snake-pit-for-fast-and-easy-spark

Page 29: Hadoop world overview trends and topics

#4 Cloudera’s Kudu

Page 30: Hadoop world overview trends and topics

What’s Kudu?

• Columnar Storage for Hadoop

• Not just a file-format

• Supports low-latency random access (ms)

• Good alternative for Impala + Parquet

• Integrates with Spark, Hadoop, Impala

• It’s in Beta now
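The combination of columnar storage and low-latency random access can be pictured with a toy in-memory table. This is a sketch under strong simplifications, not Kudu's design or API: the class `ColumnarTable` and the `metrics` data are invented for illustration. Values are stored one list per column, and a primary-key index gives direct access to a single row.

```python
class ColumnarTable:
    """Toy column store: one list per column plus a primary-key index."""

    def __init__(self, key, columns):
        self.key = key
        self.columns = {name: [] for name in columns}
        self.row_of = {}  # primary key value -> row position

    def upsert(self, row):
        """Insert a new row, or overwrite the existing row with the same key."""
        pos = self.row_of.get(row[self.key])
        if pos is None:
            self.row_of[row[self.key]] = len(self.columns[self.key])
            for name, values in self.columns.items():
                values.append(row[name])
        else:
            for name, values in self.columns.items():
                values[pos] = row[name]

    def get(self, key_value):
        """Random access: fetch one row by primary key."""
        pos = self.row_of[key_value]
        return {name: values[pos] for name, values in self.columns.items()}

    def scan(self, column):
        """Analytic scan: read a single column without touching the others."""
        return list(self.columns[column])


metrics = ColumnarTable(key="id", columns=["id", "host", "cpu"])
metrics.upsert({"id": 1, "host": "a", "cpu": 0.25})
metrics.upsert({"id": 2, "host": "b", "cpu": 0.75})
metrics.upsert({"id": 1, "host": "a", "cpu": 0.5})  # update in place
average_cpu = sum(metrics.scan("cpu")) / len(metrics.scan("cpu"))
```

An immutable file format supports the `scan` half well but not the in-place `upsert`, which is roughly the gap the "not just a file-format" bullet points at.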

Page 31: Hadoop world overview trends and topics

Faster than Parquet

Page 32: Hadoop world overview trends and topics

Kudu: Architecture

Page 33: Hadoop world overview trends and topics

Kudu: use-cases

• Write: newly arrived data is immediately available to users

• Time-series applications that need to support both random and scattered reads

Page 34: Hadoop world overview trends and topics

Kudu: Read More

http://getkudu.io/

Page 35: Hadoop world overview trends and topics

#5 Most Popular Reference Architecture

Page 36: Hadoop world overview trends and topics

Reference Architecture

YARN (90%)

Mesos (10%)

Page 37: Hadoop world overview trends and topics

Kafka

• Highly-scalable

• Fault-tolerant (commit-log)

• Partition-based load-balancing
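The commit-log and partitioning bullets can be sketched as a toy topic made of append-only lists. The class `Topic` and the sensor keys are hypothetical, and `zlib.crc32` is only a stand-in for a real partitioner; the point is that a key always maps to the same partition and that records are appended, never modified.

```python
import zlib


class Topic:
    """Toy topic: a fixed number of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Deterministic hash so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append to the end of the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Consumers can re-read from any offset; records are never changed."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=3)
first = topic.append("sensor-1", "t=20.5")
second = topic.append("sensor-1", "t=20.7")
other = topic.append("sensor-2", "t=18.1")
```

Ordering is preserved per key because both `sensor-1` records sit in one partition, while different keys can spread over partitions and be consumed in parallel.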

Page 38: Hadoop world overview trends and topics

Spark Streaming

• Processes data in micro-batches (DStream, sliding windows)

• Supports data locality with Cassandra

• Real-time data science (Data Frames, MLlib)

• BI Support (Spark SQL)
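Micro-batching with a sliding window can be sketched in plain Python. The functions `micro_batches` and `sliding_window_counts` and the sample events are assumptions made for illustration, not Spark Streaming APIs: events are cut into fixed-length batches, and each result covers the last `window` batches, advancing `slide` batches at a time.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, key) events into consecutive fixed-length batches."""
    batches = {}
    for ts, key in events:
        batches.setdefault(int(ts // batch_seconds), []).append(key)
    last = max(batches) if batches else -1
    return [batches.get(i, []) for i in range(last + 1)]


def sliding_window_counts(batches, window, slide):
    """Count keys over the last `window` batches, emitted every `slide` batches."""
    results = []
    for end in range(slide, len(batches) + 1, slide):
        recent = batches[max(0, end - window):end]
        results.append(Counter(k for batch in recent for k in batch))
    return results


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "a"), (3.0, "b")]
batches = micro_batches(events, batch_seconds=1)
counts = sliding_window_counts(batches, window=2, slide=1)
```

With a window of two batches and a slide of one, each batch contributes to two consecutive results, so the counts overlap rather than partition the stream.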

Page 39: Hadoop world overview trends and topics

Cassandra

• No SPOF

• Masterless (easy operations and scaling)

• Replicates data across data-centers

• Most mature and fast-growing

• Evolves into NewSQL (transactions)

• SQL-like CQL
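The masterless, replicated design can be illustrated with a simplified token ring. This sketch assumes one token per node, a replication factor of 3, and `zlib.crc32` as a stand-in hash; the class `Ring` and the node names are hypothetical, and real clusters use a different partitioner and many tokens per node. Any node can compute the owners of a key, so no master is needed to route a request.

```python
import bisect
import zlib


def token(value):
    """Map a string onto the ring; crc32 is a stand-in for a real partitioner."""
    return zlib.crc32(value.encode("utf-8"))


class Ring:
    """Toy token ring with one token per node."""

    def __init__(self, nodes, replication_factor):
        self.rf = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key):
        """Walk clockwise from the key's token and take the next rf nodes."""
        start = bisect.bisect_right(self.tokens, token(partition_key))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
```

Losing any single node still leaves two of the three replicas reachable, which is the sense in which there is no single point of failure.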

Page 40: Hadoop world overview trends and topics

Spark

• Is Awesome for Analytics (both real-time and batch)

Page 41: Hadoop world overview trends and topics

Reference Architecture: Read More

http://www.datastax.com/dev/blog/streaming-big-data-with-spark-spark-streaming-kafka-cassandra-and-akka

Page 42: Hadoop world overview trends and topics

#6 Netflix Big Data Platform

Page 43: Hadoop world overview trends and topics

Netflix: Size

• 20PB DW on S3

• Read ~10% of data daily

• Write ~10% of read data daily

• 500 billion events daily
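A quick back-of-the-envelope calculation shows what these percentages mean in absolute terms. It assumes decimal petabytes and reads "write ~10% of read data" as 10% of the daily read volume; both are interpretations, since the slide gives only rounded figures.

```python
PB = 10**15  # decimal petabyte (assumption; the slide does not say PB vs PiB)
TB = 10**12

warehouse_bytes = 20 * PB
read_per_day = warehouse_bytes * 10 // 100   # ~10% of the warehouse
write_per_day = read_per_day * 10 // 100     # ~10% of what is read
events_per_second = 500 * 10**9 / 86_400     # 500 billion events per day

print(f"read   ~{read_per_day / PB:.0f} PB/day")
print(f"write  ~{write_per_day / TB:.0f} TB/day")
print(f"events ~{events_per_second / 10**6:.1f} million/s")
```

Under those assumptions the platform reads about 2 PB and writes about 200 TB per day, and ingests roughly 5.8 million events per second on average.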

Page 44: Hadoop world overview trends and topics

Netflix: Analyze

• 300 Data Scientists

• Python, R, Scala, etc.

Page 45: Hadoop world overview trends and topics

Netflix: Compute and Storage

• Separate compute and storage (S3)

• To have heterogeneous clusters

• And no-downtime upgrades

Page 46: Hadoop world overview trends and topics

Netflix: Architecture

Page 47: Hadoop world overview trends and topics

Netflix: Read More

http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43373

Page 48: Hadoop world overview trends and topics

#7 Big Data Mission to Mars

Page 49: Hadoop world overview trends and topics

Mission Orion

Page 50: Hadoop world overview trends and topics

Mission Orion: Size

• 350k measurands

• 2TB / hour

• 1200 telemetry sensors

• 3 x 1GB networks busy

• Data retention is 25 years
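Converting the hourly volume into a sustained rate gives a feel for the ingest load. The sketch assumes decimal terabytes and a steady 2 TB every hour, which the slide does not state; the per-measurand figure is a plain average and not a property of any individual signal.

```python
TB = 10**12  # decimal terabyte (assumption)

bytes_per_hour = 2 * TB
bytes_per_second = bytes_per_hour / 3600
gigabits_per_second = bytes_per_second * 8 / 10**9
avg_bytes_per_measurand_per_second = bytes_per_second / 350_000

print(f"~{bytes_per_second / 10**6:.0f} MB/s")
print(f"~{gigabits_per_second:.2f} Gbit/s")
print(f"~{avg_bytes_per_measurand_per_second:.0f} B/s per measurand on average")
```

That works out to roughly 556 MB/s, or about 4.4 Gbit/s, if the rate is sustained.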

Page 51: Hadoop world overview trends and topics

[Diagram: Mach-5 Data Ingest for Orion (Lockheed Martin Proprietary Information)]

Ingest: Data Reader/Simulator; Splitter + Decom (GDS), a C++ client that reads packets and decommutates Tlm data; Packet Measurands (GPBs); Kafka message bus

Packet Measurands GPB file: represents a packet (or packets) and contains decommutated measurands, as header metadata followed by apid:seqctr:time: value1 … apid:seqctr:time: valueN

Spark jobs: mach5-sample, Deduplication, HBase Writer, Aggregation, Alerting, Limit Checking

Storage: HDFS, HBase, HFiles (HBase-RDD)

Analytics and Web/UI: Tomcat, Glassfish, etc.; TraceFOSS widgets
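Two of the stages named in the diagram, deduplication and limit checking, can be sketched in plain Python. This is only an illustration of the general idea, not the Orion implementation: it assumes that the apid:seqctr:time triple identifies a measurand sample, that limits are a (low, high) range per apid, and all field values and limits shown are invented.

```python
def deduplicate(records):
    """Keep the first record seen for each (apid, seqctr, time) key."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["apid"], rec["seqctr"], rec["time"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def check_limits(records, limits):
    """Return records whose value is outside the (low, high) range for its apid."""
    alerts = []
    for rec in records:
        low, high = limits.get(rec["apid"], (float("-inf"), float("inf")))
        if not low <= rec["value"] <= high:
            alerts.append(rec)
    return alerts


# Hypothetical samples; the second record is an exact duplicate of the first.
records = [
    {"apid": 101, "seqctr": 1, "time": 1000, "value": 21.5},
    {"apid": 101, "seqctr": 1, "time": 1000, "value": 21.5},
    {"apid": 101, "seqctr": 2, "time": 1001, "value": 99.0},
    {"apid": 202, "seqctr": 1, "time": 1000, "value": 3.3},
]
limits = {101: (0.0, 50.0)}  # hypothetical limit table

unique = deduplicate(records)
alerts = check_limits(unique, limits)
```

Running deduplication before limit checking means a repeated packet raises at most one alert.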

Page 52: Hadoop world overview trends and topics

Orion: Read More

http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43181

Page 53: Hadoop world overview trends and topics

Thanks!

[email protected]