2014 sept 26_thug_lambda_part1
Post on 28-Nov-2014
266 Views
Preview:
DESCRIPTION
TRANSCRIPT
LAMBDA ARCHITECTURE: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-‐REALTIME FRAMEWORKS PART 1 – SEPT 26 2014
Adam Muise – Principal Architect, Hortonworks
Who is ?
We do Hadoop
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive InnovaEon in the plaForm – We lead the roadmap
100% Open Source – DemocraEzed Access to Data
What is Lambda Architecture?
Batch Processing +
Stream Processing = Lambda Architecture
Batch Layer -‐ Handles ETL -‐ TradiEonal IntegraEon -‐ OTen System of Record -‐ Archive -‐ Large-‐scale analyEcs
Speed Layer -‐ Handles Event streams -‐ Near-‐RealEme predicEve analyEcs -‐ AlerEng/Trending -‐ Processing/Parsing for Micro-‐Batch ETL -‐ OTen an ingest layer for NoSQL DB data or Search indexes (Solr, ES, etc)
Lambda Architecture
Example ImplementaEon of Lambda Architecture
Cisco OpenSOC Conceptual Architecture
Raw Network Stream
Network Metadata Stream
Netflow
Syslog
Raw Application Logs
Other Streaming Telemetry
Hive HBase
Raw Packet Store
Long-Term Store
Elastic Search
Real-Time Index
Network Packet Mining
and PCAP Reconstruction
Log Mining and Analytics
Big Data Exploration, Predictive Modeling
Applications + Analyst Tools Pars
e +
Fo
rmat
En
rich
Ale
rt
Threat Intelligence Feeds
Enrichment Data
OpenSOC Deployment at Cisco
Hardware footprint (40u)
§ 14 Data Nodes (UCS C240 M3)
§ 3 Cluster Control Nodes (UCS C220 M3)
§ 2 ESX Hypervisor Hosts (UCS C220 M3)
§ 1 PCAP Processor (UCS C220 M3 +
Napatech NIC)
§ 2 SourceFire Threat alert processors
§ 1 Anue Network Traffic splitter
§ 1 Router
§ 1 48 Port 10GE Switch
Software Stack
§ HDP 2.1
§ Kafka 0.8
§ Elastic Search 1.1
§ MySQL 5.5
OpenSOC - Stitching Things Together
Access Messaging System Data Collection Source Systems Storage Real Time Processing
Storm Kafka
B Topic
N Topic
Elastic Search
Index
Web Services
Search
PCAP Reconstruction
HBase
PCAP Table
Analytic Tools
R / Python
Power Pivot
Tableau
Hive
Raw Data
ORC
Passive Tap
PCAP Topic
DPI Topic
A Topic
Telemetry Sources
Syslog
HTTP
File System
Other
Flume
Agent A
Agent B
Agent N
B Topology
N Topology
A Topology
PCAP
Traffic Replicator PCAP Topology
DPI Topology
PCAP Topology
Storage Real Time Processing
Storm
Elastic Search
Index
HBase
PCAP Table
Hive
Raw Data
ORC
Kafka Spout
Parser Bolt
HDFS Bolt
HBase Bolt
ES Bolt
DPI Topology & Telemetry Enrichment
Storage Real Time Processing
Storm
Elastic Search
Index
HBase
PCAP Table
Hive
Raw Data
ORC
Kafka Spout
Parser Bolt
GEO Enrich
Whois Enrich
CIF Enrich
HDFS Bolt
ES Bolt
Enrichments
Parser Bolt
GEO Enrich
RAW Message
{!“msg_key1”: “msg value1”,!“src_ip”: “10.20.30.40”,!“dest_ip”: “20.30.40.50”,!“domain”: “mydomain.com”!}!
Who Is
Enrich
"geo":[ {"region":"CA",!"postalCode":"95134",!"areaCode":"408",!"metroCode":"807",!"longitude":-121.946,!"latitude":37.425,!"locId":4522,!"city":"San Jose",!"country":"US"! }]!
CIF Enrich
"whois":[ {!"OrgId":"CISCOS",!"Parent":"NET-144-0-0-0-0",!"OrgAbuseName":"Cisco Systems Inc",!"RegDate":"1991-01-171991-01-17",!"OrgName":"Cisco Systems",!"Address":"170 West Tasman Drive",!"NetType":"Direct Assignment"!} ],!“cif”:”Yes”!
Enriched Message
Cache
MySQL
Geo Lite Data
Cache
HBase
Who Is Data
Cache
HBase
CIF Data
Refresh on YARN
YARN = Yet Another Resource NegoEator
Resource Manager +
Node Managers = YARN
Resource Manager
AppMaster
Node Manager
Scheduler
AppMaster
AppMaster
Node Manager
Node Manager
Node Manager
Container
Container
MapReduce
Container
Storm
Container
Container
Container
Pig
Container
Container
Container
YARN abstracts resource management so you can run all sorts
of distributed applicaEons
HDFS
MapReduce V2
YARN MapReduce V? STORM
MPI Giraph HBase Tez … and more Spark
YARN can provide long-‐running containers* for applicaEons like
Storm, Hbase, JBoss, etc
* -‐ With the help of Apache Slider: hdp://wiki.apache.org/incubator/SliderProposal
Refresh on Storm
Storm is a distributed execuEon engine that handles streaming data
Storm processes streaming event data as tuples. Each event is
generated/ingested through a spout and processed in series of bolts. The spouts and bolts form a topology.
Refresh on Summingbird
Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-‐like features. Technically it’s Scalding
(Cascading with Scala).
Refresh on Spark
Spark is a framework designed to handle in-‐memory compuEng. If you are using Hadoop, you are typically
running Spark on YARN.
Spark uses RDDs (Resilient Distributed Datasets) as a primiEve to enable in-‐memory processing. RDDs can be created from all sorts
of data, like HDFS files:
scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt") distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
Spark Streaming constructs DStream primiEves from a streaming data
source. DStreams are actually made up of many RDDs and use the
common Spark Engine.
Spark Streaming has an expressive API to allow typical transformaEons
on RDDs or DStreams. This is comparable to MapReduce.
Some common implementaEons in the field…
1. Kaha + Storm + HBase||HDFS + SolrCloud||ElasEcSearch
2. Kaha + Spark Streaming + HDFS
Let’s walk through some code.
Let’s walk through some code.
Super simple: hdps://github.com/amuise/kappaeg
Super real:
hdps://github.com/OpenSOC hdp://www.slideshare.net/JamesSirota/cisco-‐opensoc
Thanks THUGs.
top related