2014 sept 26_thug_lambda_part1

LAMBDA ARCHITECTURE: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-‐REALTIME FRAMEWORKS PART 1 – SEPT 26 2014

Adam Muise – Principal Architect, Hortonworks

Who is ?

We do Hadoop

The leaders of Hadoop’s development

Community driven, Enterprise Focused

Drive InnovaEon in the plaForm – We lead the roadmap

100% Open Source – DemocraEzed Access to Data

What is Lambda Architecture?

Batch Processing +

Stream Processing = Lambda Architecture

Batch Layer -‐ Handles ETL -‐ TradiEonal IntegraEon -‐ OTen System of Record -‐ Archive -‐ Large-‐scale analyEcs

Speed Layer -‐ Handles Event streams -‐ Near-‐RealEme predicEve analyEcs -‐ AlerEng/Trending -‐ Processing/Parsing for Micro-‐Batch ETL -‐ OTen an ingest layer for NoSQL DB data or Search indexes (Solr, ES, etc)

Lambda Architecture

Example ImplementaEon of Lambda Architecture

Cisco OpenSOC Conceptual Architecture

Raw Network Stream

Network Metadata Stream

Netflow

Syslog

Raw Application Logs

Other Streaming Telemetry

Hive HBase

Raw Packet Store

Long-Term Store

Elastic Search

Real-Time Index

Network Packet Mining

and PCAP Reconstruction

Log Mining and Analytics

Big Data Exploration, Predictive Modeling

Applications + Analyst Tools Pars

Threat Intelligence Feeds

Enrichment Data

OpenSOC Deployment at Cisco

Hardware footprint (40u)

§ 14 Data Nodes (UCS C240 M3)

§ 3 Cluster Control Nodes (UCS C220 M3)

§ 2 ESX Hypervisor Hosts (UCS C220 M3)

§ 1 PCAP Processor (UCS C220 M3 +

Napatech NIC)

§ 2 SourceFire Threat alert processors

§ 1 Anue Network Traffic splitter

§ 1 Router

§ 1 48 Port 10GE Switch

Software Stack

§ HDP 2.1

§ Kafka 0.8

§ Elastic Search 1.1

§ MySQL 5.5

OpenSOC - Stitching Things Together

Access Messaging System Data Collection Source Systems Storage Real Time Processing

Storm Kafka

B Topic

N Topic

Elastic Search

Web Services

Search

PCAP Reconstruction

PCAP Table

Analytic Tools

R / Python

Power Pivot

Tableau

Raw Data

Passive Tap

PCAP Topic

DPI Topic

A Topic

Telemetry Sources

Syslog

File System

Agent A

Agent B

Agent N

B Topology

N Topology

A Topology

Traffic Replicator PCAP Topology

DPI Topology

PCAP Topology

Storage Real Time Processing

Elastic Search

PCAP Table

Raw Data

Kafka Spout

Parser Bolt

HDFS Bolt

HBase Bolt

ES Bolt

DPI Topology & Telemetry Enrichment

Storage Real Time Processing

Elastic Search

PCAP Table

Raw Data

Kafka Spout

Parser Bolt

GEO Enrich

Whois Enrich

CIF Enrich

HDFS Bolt

ES Bolt

Enrichments

Parser Bolt

GEO Enrich

RAW Message

{!“msg_key1”: “msg value1”,!“src_ip”: “10.20.30.40”,!“dest_ip”: “20.30.40.50”,!“domain”: “mydomain.com”!}!

Who Is

Enrich

"geo":[ {"region":"CA",!"postalCode":"95134",!"areaCode":"408",!"metroCode":"807",!"longitude":-121.946,!"latitude":37.425,!"locId":4522,!"city":"San Jose",!"country":"US"! }]!

CIF Enrich

"whois":[ {!"OrgId":"CISCOS",!"Parent":"NET-144-0-0-0-0",!"OrgAbuseName":"Cisco Systems Inc",!"RegDate":"1991-01-171991-01-17",!"OrgName":"Cisco Systems",!"Address":"170 West Tasman Drive",!"NetType":"Direct Assignment"!} ],!“cif”:”Yes”!

Enriched Message

Geo Lite Data

Who Is Data

CIF Data

Refresh on YARN

YARN = Yet Another Resource NegoEator

Resource Manager +

Node Managers = YARN

Resource Manager

AppMaster

Node Manager

Scheduler

AppMaster

Node Manager

Container

MapReduce

Container

YARN abstracts resource management so you can run all sorts

of distributed applicaEons

MapReduce V2

YARN MapReduce V? STORM

MPI Giraph HBase Tez … and more Spark

YARN can provide long-‐running containers* for applicaEons like

Storm, Hbase, JBoss, etc

* -‐ With the help of Apache Slider: hdp://wiki.apache.org/incubator/SliderProposal

Refresh on Storm

Storm is a distributed execuEon engine that handles streaming data

Storm processes streaming event data as tuples. Each event is

generated/ingested through a spout and processed in series of bolts. The spouts and bolts form a topology.

Refresh on Summingbird

Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-‐like features. Technically it’s Scalding

(Cascading with Scala).

Refresh on Spark

Spark is a framework designed to handle in-‐memory compuEng. If you are using Hadoop, you are typically

running Spark on YARN.

Spark uses RDDs (Resilient Distributed Datasets) as a primiEve to enable in-‐memory processing. RDDs can be created from all sorts

of data, like HDFS files:

scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt") distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08

Spark Streaming constructs DStream primiEves from a streaming data

source. DStreams are actually made up of many RDDs and use the

common Spark Engine.

Spark Streaming has an expressive API to allow typical transformaEons

on RDDs or DStreams. This is comparable to MapReduce.

Some common implementaEons in the field…

1. Kaha + Storm + HBase||HDFS + SolrCloud||ElasEcSearch

2. Kaha + Spark Streaming + HDFS

Let’s walk through some code.

Super simple: hdps://github.com/amuise/kappaeg

Super real:

hdps://github.com/OpenSOC hdp://www.slideshare.net/JamesSirota/cisco-‐opensoc

Thanks THUGs.

2014 sept 26_thug_lambda_part1

Technology

rme 2014 sept

sept calendar 2014

trackrecord - sept 2014

9th sept 2014

acm sept 2014

newsletter, sept 2014

sept 22, 2014

17th sept 2014

sept oct 2014

sept. 29, 2014

sept & oct 2014

quantum.effects.in.biology - sept 2014

statcon 2014 sept

sept 2014 passport

newsletter- apr 2014.- sept 2014

3 sept 2014

ohio ahead – business meeting. treasurer’s report sept....

marinenews (sept,2014)

laserworld sept 2014

sept. 22, 2014