2014 sept 26_thug_lambda_part1

37
LAMBDA ARCHITECTURE: AN EMERGING PATTERN OF USE WITH HADOOP AND NEARREALTIME FRAMEWORKS PART 1 – SEPT 26 2014 Adam Muise – Principal Architect, Hortonworks

Upload: adam-muise

Post on 28-Nov-2014

266 views

Category:

Technology


0 download

DESCRIPTION

Sept 26, 2014 - Intro to Storm, Kafka, and the Lambda Architecture.

TRANSCRIPT

Page 1: 2014 sept 26_thug_lambda_part1

LAMBDA  ARCHITECTURE:  AN  EMERGING  PATTERN  OF  USE  WITH  HADOOP  AND  NEAR-­‐REALTIME  FRAMEWORKS  PART  1  –  SEPT  26  2014  

Adam  Muise  –  Principal  Architect,  Hortonworks  

Page 2: 2014 sept 26_thug_lambda_part1

Who  is                                        ?  

Page 3: 2014 sept 26_thug_lambda_part1

We  do  Hadoop  

The  leaders  of  Hadoop’s  development  

Community  driven,    Enterprise  Focused  

Drive  InnovaEon  in  the  plaForm  –  We  lead  the  roadmap    

100%  Open  Source  –  DemocraEzed  Access  to  Data  

Page 4: 2014 sept 26_thug_lambda_part1

What  is  Lambda  Architecture?  

Page 5: 2014 sept 26_thug_lambda_part1

Batch  Processing  +  

Stream  Processing  =  Lambda  Architecture  

Page 6: 2014 sept 26_thug_lambda_part1

Batch  Layer  -­‐  Handles  ETL  -­‐  TradiEonal  IntegraEon  -­‐  OTen  System  of  Record  -­‐  Archive  -­‐  Large-­‐scale  analyEcs  

Page 7: 2014 sept 26_thug_lambda_part1

Speed  Layer  -­‐  Handles  Event  streams  -­‐  Near-­‐RealEme  predicEve  analyEcs  -­‐  AlerEng/Trending  -­‐  Processing/Parsing  for  Micro-­‐Batch  ETL  -­‐  OTen  an  ingest  layer  for  NoSQL  DB  data  or  Search  indexes  (Solr,  ES,  etc)  

Page 8: 2014 sept 26_thug_lambda_part1

Lambda  Architecture  

Page 9: 2014 sept 26_thug_lambda_part1

Example  ImplementaEon  of  Lambda  Architecture  

Page 10: 2014 sept 26_thug_lambda_part1

Cisco OpenSOC Conceptual Architecture

Raw Network Stream

Network Metadata Stream

Netflow

Syslog

Raw Application Logs

Other Streaming Telemetry

Hive HBase

Raw Packet Store

Long-Term Store

Elastic Search

Real-Time Index

Network Packet Mining

and PCAP Reconstruction

Log Mining and Analytics

Big Data Exploration, Predictive Modeling

Applications + Analyst Tools Pars

e +

Fo

rmat

En

rich

Ale

rt

Threat Intelligence Feeds

Enrichment Data

Page 11: 2014 sept 26_thug_lambda_part1

OpenSOC Deployment at Cisco

Hardware footprint (40u)

§ 14 Data Nodes (UCS C240 M3)

§ 3 Cluster Control Nodes (UCS C220 M3)

§ 2 ESX Hypervisor Hosts (UCS C220 M3)

§ 1 PCAP Processor (UCS C220 M3 +

Napatech NIC)

§ 2 SourceFire Threat alert processors

§ 1 Anue Network Traffic splitter

§ 1 Router

§ 1 48 Port 10GE Switch

Software Stack

§ HDP 2.1

§ Kafka 0.8

§ Elastic Search 1.1

§ MySQL 5.5

Page 12: 2014 sept 26_thug_lambda_part1

OpenSOC - Stitching Things Together

Access Messaging System Data Collection Source Systems Storage Real Time Processing

Storm Kafka

B Topic

N Topic

Elastic Search

Index

Web Services

Search

PCAP Reconstruction

HBase

PCAP Table

Analytic Tools

R / Python

Power Pivot

Tableau

Hive

Raw Data

ORC

Passive Tap

PCAP Topic

DPI Topic

A Topic

Telemetry Sources

Syslog

HTTP

File System

Other

Flume

Agent A

Agent B

Agent N

B Topology

N Topology

A Topology

PCAP

Traffic Replicator PCAP Topology

DPI Topology

Page 13: 2014 sept 26_thug_lambda_part1

PCAP Topology

Storage Real Time Processing

Storm

Elastic Search

Index

HBase

PCAP Table

Hive

Raw Data

ORC

Kafka Spout

Parser Bolt

HDFS Bolt

HBase Bolt

ES Bolt

Page 14: 2014 sept 26_thug_lambda_part1

DPI Topology & Telemetry Enrichment

Storage Real Time Processing

Storm

Elastic Search

Index

HBase

PCAP Table

Hive

Raw Data

ORC

Kafka Spout

Parser Bolt

GEO Enrich

Whois Enrich

CIF Enrich

HDFS Bolt

ES Bolt

Page 15: 2014 sept 26_thug_lambda_part1

Enrichments

Parser Bolt

GEO Enrich

RAW Message

{!“msg_key1”: “msg value1”,!“src_ip”: “10.20.30.40”,!“dest_ip”: “20.30.40.50”,!“domain”: “mydomain.com”!}!

Who Is

Enrich

"geo":[ {"region":"CA",!"postalCode":"95134",!"areaCode":"408",!"metroCode":"807",!"longitude":-121.946,!"latitude":37.425,!"locId":4522,!"city":"San Jose",!"country":"US"! }]!

CIF Enrich

"whois":[ {!"OrgId":"CISCOS",!"Parent":"NET-144-0-0-0-0",!"OrgAbuseName":"Cisco Systems Inc",!"RegDate":"1991-01-171991-01-17",!"OrgName":"Cisco Systems",!"Address":"170 West Tasman Drive",!"NetType":"Direct Assignment"!} ],!“cif”:”Yes”!

Enriched Message

Cache

MySQL

Geo Lite Data

Cache

HBase

Who Is Data

Cache

HBase

CIF Data

Page 16: 2014 sept 26_thug_lambda_part1

Refresh  on  YARN  

Page 17: 2014 sept 26_thug_lambda_part1

YARN  =  Yet  Another  Resource  NegoEator  

Page 18: 2014 sept 26_thug_lambda_part1

Resource  Manager  +  

Node  Managers  =  YARN  

Resource  Manager  

AppMaster  

Node  Manager  

Scheduler  

AppMaster  

AppMaster  

Node  Manager  

Node  Manager  

Node  Manager  

Container  

Container  

MapReduce  

Container  

Storm  

Container  

Container  

Container  

Pig  

Container  

Container  

Container  

Page 19: 2014 sept 26_thug_lambda_part1

YARN  abstracts  resource  management  so  you  can  run  all  sorts  

of  distributed  applicaEons  

HDFS  

MapReduce  V2  

YARN  MapReduce  V?   STORM  

MPI  Giraph   HBase  Tez   …  and  more  Spark  

Page 20: 2014 sept 26_thug_lambda_part1

YARN  can  provide  long-­‐running  containers*  for  applicaEons  like  

Storm,  Hbase,  JBoss,  etc    

*  -­‐  With  the  help  of  Apache  Slider:    hdp://wiki.apache.org/incubator/SliderProposal  

Page 21: 2014 sept 26_thug_lambda_part1

Refresh  on  Storm  

Page 22: 2014 sept 26_thug_lambda_part1

Storm  is  a  distributed  execuEon  engine  that  handles  streaming  data  

Page 23: 2014 sept 26_thug_lambda_part1

Storm  processes  streaming  event  data  as  tuples.  Each  event  is  

generated/ingested  through  a  spout  and  processed  in  series  of  bolts.  The  spouts  and  bolts  form  a  topology.  

Page 24: 2014 sept 26_thug_lambda_part1

Refresh  on  Summingbird  

Page 25: 2014 sept 26_thug_lambda_part1

Summingbird  is  a  processing  framework  that  runs  over  Storm.  It  uses  Scala  and  has  MapReduce-­‐like  features.  Technically  it’s  Scalding  

(Cascading  with  Scala).  

Page 26: 2014 sept 26_thug_lambda_part1

Refresh  on  Spark  

Page 27: 2014 sept 26_thug_lambda_part1

Spark  is  a  framework  designed  to  handle  in-­‐memory  compuEng.  If  you  are  using  Hadoop,  you  are  typically  

running  Spark  on  YARN.  

Page 28: 2014 sept 26_thug_lambda_part1

Spark  uses  RDDs  (Resilient  Distributed  Datasets)  as  a  primiEve  to  enable  in-­‐memory  processing.  RDDs  can  be  created  from  all  sorts  

of  data,  like  HDFS  files:  

scala>  val  distFile  =  sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")    distFile:  spark.RDD[String]  =  spark.HadoopRDD@1d4cee08  

Page 29: 2014 sept 26_thug_lambda_part1

Spark  Streaming  constructs  DStream  primiEves  from  a  streaming  data  

source.  DStreams  are  actually  made  up  of  many  RDDs  and  use  the  

common  Spark  Engine.    

Page 30: 2014 sept 26_thug_lambda_part1

Spark  Streaming  has  an  expressive  API  to  allow  typical  transformaEons  

on  RDDs  or  DStreams.  This  is  comparable  to  MapReduce.  

Page 31: 2014 sept 26_thug_lambda_part1

Some  common  implementaEons  in  the  field…  

Page 32: 2014 sept 26_thug_lambda_part1

1.  Kaha  +  Storm  +  HBase||HDFS  +  SolrCloud||ElasEcSearch  

Page 33: 2014 sept 26_thug_lambda_part1

2.  Kaha  +  Spark  Streaming  +  HDFS  

Page 34: 2014 sept 26_thug_lambda_part1

Let’s  walk  through  some  code.  

Page 35: 2014 sept 26_thug_lambda_part1
Page 36: 2014 sept 26_thug_lambda_part1

Let’s  walk  through  some  code.    

Super  simple:  hdps://github.com/amuise/kappaeg  

 Super  real:  

hdps://github.com/OpenSOC  hdp://www.slideshare.net/JamesSirota/cisco-­‐opensoc  

Page 37: 2014 sept 26_thug_lambda_part1

Thanks  THUGs.