2014 sept 26_thug_lambda_part1

Post on 28-Nov-2014

DESCRIPTION

Sept 26, 2014 - Intro to Storm, Kafka, and the Lambda Architecture.

TRANSCRIPT

LAMBDA ARCHITECTURE: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-REALTIME FRAMEWORKS PART 1 – SEPT 26 2014

Adam Muise – Principal Architect, Hortonworks

Who is Hortonworks?

We do Hadoop

The leaders of Hadoop’s development

Community driven, Enterprise Focused

Drive Innovation in the platform – We lead the roadmap

100% Open Source – Democratized Access to Data

What is Lambda Architecture?

Batch Processing +

Stream Processing = Lambda Architecture

Batch Layer
- Handles ETL
- Traditional Integration
- Often System of Record
- Archive
- Large-scale analytics

Speed Layer
- Handles Event streams
- Near-Realtime predictive analytics
- Alerting/Trending
- Processing/Parsing for Micro-Batch ETL
- Often an ingest layer for NoSQL DB data or Search indexes (Solr, ES, etc)
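The split between the two layers can be illustrated with a minimal Python sketch. It is not taken from the talk: the per-IP event count, the sample events, and the function names (build_batch_view, update_speed_view, query) are all assumptions chosen for illustration. The batch layer recomputes a view from the full archive, the speed layer keeps an incremental view of events the batch has not yet covered, and a query adds the two.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute a view from the full archive of events."""
    return Counter(event["src_ip"] for event in master_dataset)

def update_speed_view(speed_view, event):
    """Speed layer: fold one newly arrived event into the incremental view."""
    speed_view[event["src_ip"]] += 1

def query(batch_view, speed_view, key):
    """Serving side: batch result plus the delta the batch has not seen yet."""
    return batch_view[key] + speed_view[key]

# Hypothetical events already archived and covered by the last batch run.
archived = [
    {"src_ip": "10.20.30.40"},
    {"src_ip": "10.20.30.40"},
    {"src_ip": "20.30.40.50"},
]
batch_view = build_batch_view(archived)

# Hypothetical events that arrived after that batch run started.
speed_view = Counter()
for event in [{"src_ip": "10.20.30.40"}, {"src_ip": "30.40.50.60"}]:
    update_speed_view(speed_view, event)

print(query(batch_view, speed_view, "10.20.30.40"))  # 3
```

In a real deployment the speed view would be discarded once the next batch run has absorbed the same events; that hand-off is left out of this sketch.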

Lambda Architecture

Example Implementation of Lambda Architecture

Cisco OpenSOC Conceptual Architecture

Raw Network Stream

Network Metadata Stream

Netflow

Syslog

Raw Application Logs

Other Streaming Telemetry

Hive HBase

Raw Packet Store

Long-Term Store

Elastic Search

Real-Time Index

Network Packet Mining and PCAP Reconstruction

Log Mining and Analytics

Big Data Exploration, Predictive Modeling

Applications + Analyst Tools

Parse + Format

Enrich

Alert

Threat Intelligence Feeds

Enrichment Data

OpenSOC Deployment at Cisco

Hardware footprint (40u)

§ 14 Data Nodes (UCS C240 M3)

§ 3 Cluster Control Nodes (UCS C220 M3)

§ 2 ESX Hypervisor Hosts (UCS C220 M3)

§ 1 PCAP Processor (UCS C220 M3 + Napatech NIC)

§ 2 SourceFire Threat alert processors

§ 1 Anue Network Traffic splitter

§ 1 Router

§ 1 48 Port 10GE Switch

Software Stack

§ HDP 2.1

§ Kafka 0.8

§ Elastic Search 1.1

§ MySQL 5.5

OpenSOC - Stitching Things Together

[Diagram labels, grouped by stage]

§ Source Systems: Passive Tap, Traffic Replicator; Telemetry Sources (Syslog, HTTP, File System, Other)

§ Data Collection: PCAP; Flume (Agent A, Agent B, Agent N)

§ Messaging System: Kafka (PCAP Topic, DPI Topic, A Topic, B Topic, N Topic)

§ Real Time Processing: Storm (PCAP Topology, DPI Topology, A Topology, B Topology, N Topology)

§ Storage: Hive (Raw Data, ORC); Elastic Search (Index); HBase (PCAP Table)

§ Access: Analytic Tools (R / Python, Power Pivot, Tableau); Web Services (Search, PCAP Reconstruction)

PCAP Topology

[Diagram labels, grouped by stage]

§ Real Time Processing: Storm (Kafka Spout, Parser Bolt, HDFS Bolt, HBase Bolt, ES Bolt)

§ Storage: Hive (Raw Data, ORC); HBase (PCAP Table); Elastic Search (Index)

DPI Topology & Telemetry Enrichment

[Diagram labels, grouped by stage]

§ Real Time Processing: Storm (Kafka Spout, Parser Bolt, GEO Enrich, Whois Enrich, CIF Enrich, HDFS Bolt, ES Bolt)

§ Storage: Hive (Raw Data, ORC); HBase (PCAP Table); Elastic Search (Index)

Enrichments

Parser Bolt

GEO Enrich

RAW Message

{
  "msg_key1": "msg value1",
  "src_ip": "10.20.30.40",
  "dest_ip": "20.30.40.50",
  "domain": "mydomain.com"
}

Who Is Enrich

"geo": [
  {
    "region": "CA",
    "postalCode": "95134",
    "areaCode": "408",
    "metroCode": "807",
    "longitude": -121.946,
    "latitude": 37.425,
    "locId": 4522,
    "city": "San Jose",
    "country": "US"
  }
]

CIF Enrich

"whois": [
  {
    "OrgId": "CISCOS",
    "Parent": "NET-144-0-0-0-0",
    "OrgAbuseName": "Cisco Systems Inc",
    "RegDate": "1991-01-17",
    "OrgName": "Cisco Systems",
    "Address": "170 West Tasman Drive",
    "NetType": "Direct Assignment"
  }
],
"cif": "Yes"

Enriched Message

§ GEO Enrich: Cache in front of MySQL (Geo Lite Data)

§ Who Is Enrich: Cache in front of HBase (Who Is Data)

§ CIF Enrich: Cache in front of HBase (CIF Data)
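The enrichment chain on this slide can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not OpenSOC code: the in-memory dictionaries stand in for the MySQL and HBase stores, the CachedLookup class and all function names are hypothetical, and which message field each enricher keys on (src_ip for geo, domain for whois, either IP for CIF) is a guess, as is the "No" value for an unflagged message.

```python
import json

# Hypothetical reference data; the slide keeps these in MySQL and HBase.
GEO_DATA = {"10.20.30.40": {"city": "San Jose", "region": "CA", "country": "US"}}
WHOIS_DATA = {"mydomain.com": {"OrgName": "Cisco Systems", "NetType": "Direct Assignment"}}
CIF_DATA = {"20.30.40.50"}

class CachedLookup:
    """Wraps a backing-store lookup with an in-memory cache."""
    def __init__(self, fetch):
        self.fetch = fetch
        self.cache = {}
        self.misses = 0

    def get(self, key):
        if key not in self.cache:
            self.misses += 1          # only a miss reaches the backing store
            self.cache[key] = self.fetch(key)
        return self.cache[key]

geo_lookup = CachedLookup(GEO_DATA.get)
whois_lookup = CachedLookup(WHOIS_DATA.get)

def parse(raw):
    """Parser step: raw JSON text to a dict."""
    return json.loads(raw)

def geo_enrich(msg):
    geo = geo_lookup.get(msg["src_ip"])
    return {**msg, "geo": [geo] if geo else []}

def whois_enrich(msg):
    whois = whois_lookup.get(msg["domain"])
    return {**msg, "whois": [whois] if whois else []}

def cif_enrich(msg):
    flagged = msg["src_ip"] in CIF_DATA or msg["dest_ip"] in CIF_DATA
    return {**msg, "cif": "Yes" if flagged else "No"}

def enrich(raw):
    """Run the message through each enrichment step in order."""
    msg = parse(raw)
    for step in (geo_enrich, whois_enrich, cif_enrich):
        msg = step(msg)
    return msg

raw = ('{"msg_key1": "msg value1", "src_ip": "10.20.30.40", '
       '"dest_ip": "20.30.40.50", "domain": "mydomain.com"}')
enriched = enrich(raw)
print(json.dumps(enriched, indent=2))
```

Each step returns a new dict with one extra key, which mirrors how the enriched message on the slide is the raw message plus the "geo", "whois", and "cif" fields. The cache means a repeated IP or domain does not hit the backing store again.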

Refresh on YARN

YARN = Yet Another Resource Negotiator

Resource Manager +

Node Managers = YARN

[Diagram: the Resource Manager, with its Scheduler, alongside several Node Managers. The Node Managers host the AppMasters and Containers of three applications: MapReduce, Storm, and Pig.]

YARN abstracts resource management so you can run all sorts of distributed applications
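As a rough illustration of that abstraction, here is a toy Python model of a resource manager handing out containers on node managers. It is not YARN's API: the class names only echo the slide, the memory figures are made up, and the first-fit placement is a deliberately simple stand-in for YARN's pluggable schedulers.

```python
class NodeManager:
    """Toy node: tracks free memory and the containers placed on it."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

class ResourceManager:
    """Toy resource manager: places container requests on nodes."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, memory_mb):
        """First-fit: use the first node with enough free memory, else None."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((app, memory_mb))
                return node.name
        return None

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])

# Different frameworks ask the same resource manager for containers.
placements = [
    rm.allocate("MapReduce", 3072),
    rm.allocate("Storm", 2048),
    rm.allocate("Pig", 1024),
    rm.allocate("Storm", 4096),
]
print(placements)  # ['node1', 'node2', 'node1', None]
```

The point of the sketch is that the allocator knows nothing about MapReduce, Storm, or Pig; it only sees resource requests, which is what lets different frameworks share one cluster.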

[Diagram labels: HDFS; YARN; MapReduce V2; MapReduce V?; Storm; MPI; Giraph; HBase; Tez; Spark; … and more]

YARN can provide long-running containers* for applications like Storm, HBase, JBoss, etc

* - With the help of Apache Slider: http://wiki.apache.org/incubator/SliderProposal

Refresh on Storm

Storm is a distributed execution engine that handles streaming data

Storm processes streaming event data as tuples. Each event is generated/ingested through a spout and processed in a series of bolts. The spouts and bolts form a topology.
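The spout, bolt, and topology vocabulary can be mimicked with Python generators. This is a single-process sketch, not Storm's API: the function names and the word-count example are assumptions, and it ignores everything that makes Storm distributed (parallelism, stream groupings, acking).

```python
def sentence_spout():
    """Spout: the source of the stream, emitting one tuple per event."""
    for sentence in ["storm handles streams", "kafka feeds storm"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

def run_topology():
    """Topology: the wiring of the spout into the series of bolts."""
    return list(count_bolt(split_bolt(sentence_spout())))

results = run_topology()
print(results[-1])  # ('storm', 2)
```

Each tuple flows through the whole chain as soon as it is emitted, one event at a time, which is the behavior the slide describes.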

Refresh on Summingbird

Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-like features. Its batch side runs on Scalding (Cascading with Scala).

Refresh on Spark

Spark is a framework designed to handle in-memory computing. If you are using Hadoop, you are typically running Spark on YARN.

Spark uses RDDs (Resilient Distributed Datasets) as a primitive to enable in-memory processing. RDDs can be created from all sorts of data, like HDFS files:

scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08

Spark Streaming constructs DStream primitives from a streaming data source. DStreams are actually made up of many RDDs and use the common Spark Engine.

Spark Streaming has an expressive API to allow typical transformations on RDDs or DStreams. This is comparable to MapReduce.
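The idea that a DStream is a sequence of small batches, each processed with ordinary batch transformations, can be sketched in plain Python. This is not the Spark Streaming API: the function names, the timestamps, and the one-second batch interval are assumptions, and each Python list stands in for one RDD.

```python
def to_micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive time windows."""
    batches = {}
    for timestamp, value in events:
        window = int(timestamp // batch_interval)
        batches.setdefault(window, []).append(value)
    # Windows with no events are simply absent in this sketch.
    return [batches[window] for window in sorted(batches)]

def word_count(batch):
    """The same map/reduce-style transformation, applied to one batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical events as (arrival time in seconds, text).
events = [(0.2, "kafka storm"), (0.9, "storm"), (1.4, "spark"), (2.1, "spark spark")]

per_batch = [word_count(batch) for batch in to_micro_batches(events, batch_interval=1.0)]
print(per_batch)
# [{'kafka': 1, 'storm': 2}, {'spark': 1}, {'spark': 2}]
```

Compared with the tuple-at-a-time model of Storm, results here only appear once per batch interval, which is the trade-off behind the "micro-batch" label.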

Some common implementations in the field…

1. Kafka + Storm + HBase||HDFS + SolrCloud||ElasticSearch

2. Kafka + Spark Streaming + HDFS

Let’s walk through some code.


Super simple: https://github.com/amuise/kappaeg

Super real:

https://github.com/OpenSOC

http://www.slideshare.net/JamesSirota/cisco-opensoc

Thanks THUGs.
