data engineering quick guide

DATA ENGINEERING QUICK GUIDE
ASIM JALIS

GALVANIZE

BIG DATA

WHY HADOOP?
How can we create a supercomputer using cheap Linux boxes?

WHAT IS HADOOP?
Operating system for a cluster of machines
Combines small, weak computers to create a Big Data system
Unified disk and processing power

HADOOP

WHY HDFS?
How can we store a petabyte-sized file using cheap Linux boxes?

WHAT IS HDFS?
Split a petabyte file into 128 MB blocks
Distribute the blocks across the Hadoop cluster
Make 3 copies of each block for insurance
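The block arithmetic above can be checked with a short sketch. It assumes a petabyte means 1024^5 bytes, and the `place_replicas` helper is a hypothetical round-robin scheme used only for illustration; it is not how HDFS actually chooses nodes.

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB blocks, as on the slide
REPLICATION = 3              # 3 copies of each block, as on the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file; the last block may be partial."""
    return -(-file_size_bytes // block_size)  # integer ceiling division


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Hypothetical round-robin placement of each block on distinct nodes."""
    assert replication <= len(nodes), "need at least one node per replica"
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


petabyte = 1024**5                      # assumption: binary petabyte
blocks = block_count(petabyte)          # blocks in a one-petabyte file
copies = blocks * REPLICATION           # block copies stored cluster-wide
layout = place_replicas(4, ["node1", "node2", "node3", "node4", "node5"])
print(blocks, copies, layout[3])
```

Under these assumptions a one-petabyte file becomes about 8.4 million blocks, and about 25 million block copies after replication.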

WHY MAPREDUCE?
How can we process the data in HDFS without pulling it out and pushing the result back?

WHAT IS MAPREDUCE?
Send the program to where the data is on HDFS
Process a petabyte file by processing each block
Then combine the results
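The "process each block, then combine" idea can be sketched in plain Python. This is a single-machine illustration, not the Hadoop API: the three strings stand in for HDFS blocks, and the function names `map_block` and `combine` are made up for this example.

```python
from collections import Counter
from functools import reduce


def map_block(block_text):
    """Map step: count words in one block (on a cluster, this runs where the block lives)."""
    return Counter(block_text.split())


def combine(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


# Stand-ins for the blocks of one large file.
blocks = ["to be or not", "to be", "that is the question"]

partials = [map_block(block) for block in blocks]   # one partial result per block
total = reduce(combine, partials, Counter())        # combine the partial results
print(total.most_common(2))
```

Only the small partial counts need to move between machines; the blocks themselves stay where they are.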

MAPREDUCE

WHY HIVE?
How can people who don’t know Java write MapReduce jobs?

WHAT IS HIVE?
Hive translates SQL to MapReduce jobs

HIVE
SELECT *
FROM sales
WHERE amount > 400;

WHY PIG?
How can people who don’t know Java or SQL write MapReduce jobs?

WHAT IS PIG?
Pig translates PigLatin to MapReduce jobs
PigLatin is a scripting language comparable to SQL

PIG
high_sales = FILTER sales_data BY amount > 400;

WHY SPARK?
How can we make MapReduce faster and the API less clunky?

WHAT IS SPARK?
Spark is like MapReduce
Spark has a cleaner API and is faster
The speedup comes from keeping intermediate results in memory

SPARK
sc.textFile("shakespeare.txt").
  flatMap(line => line.split("\\W+")).
  map(word => (word, 1)).
  reduceByKey((count1, count2) => count1 + count2).
  saveAsTextFile("output")
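To see what each stage of the word-count pipeline above produces, here is a plain-Python trace of the same steps on two sample lines. It does not use Spark: `flat_map` and `reduce_by_key` are hypothetical helpers that imitate the Spark operations on ordinary lists, and empty strings from the split are dropped for simplicity.

```python
import re


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Merge the values of pairs that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["to be or not to be", "that is the question"]   # stand-in for the text file

# flatMap: split every line into words on runs of non-word characters.
words = flat_map(lambda line: [w for w in re.split(r"\W+", line) if w], lines)
# map: pair each word with the count 1.
pairs = [(word, 1) for word in words]
# reduceByKey: add up the counts for each distinct word.
counts = reduce_by_key(lambda count1, count2: count1 + count2, pairs)
print(counts)
```

In Spark the same three steps run in parallel across the cluster, with the intermediate lists held in memory rather than written to disk.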

WHY SPARK SQL?
How can people who don’t know Scala, Python, or Java write Spark code?

WHAT IS SPARK SQL?
Spark SQL is like Hive for Spark
Hive translates SQL to MapReduce
Spark SQL translates SQL to Spark

SPARK SQL
SELECT *
FROM sales
WHERE amount > 400;
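The query is the same one shown for Hive; only the engine underneath changes. To try it without a cluster, the sketch below runs it against an in-memory SQLite database from Python. SQLite is only a stand-in here, and the `sales` schema and rows are invented for the example.

```python
import sqlite3

# In-memory database as a stand-in for a Hive or Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")  # hypothetical schema
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 250.0), (2, 400.0), (3, 975.0)],                      # made-up rows
)

# Same query as on the slide: keep rows with amount strictly greater than 400.
rows = conn.execute("SELECT * FROM sales WHERE amount > 400").fetchall()
conn.close()
print(rows)
```

The row with an amount of exactly 400 is excluded because the comparison is strict.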

SPARK SQL

WHAT IS SPARK MLLIB?
Machine Learning algorithms on Spark
Analyze data to extract insights

WHAT IS MACHINE LEARNING?
Technique       Question
Regression      Predict revenue next month
Classification  Is the tumor cancerous or benign?
Clustering      Which customers are similar to each other?
Recommendation  Which movie will you like?
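As a concrete instance of the regression row, the sketch below fits a straight line to four months of revenue and extrapolates to the next month. It uses ordinary least squares in plain Python rather than MLlib, and the revenue figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


months = [1, 2, 3, 4]
revenue = [10.0, 12.0, 14.0, 16.0]   # made-up monthly revenue figures

slope, intercept = fit_line(months, revenue)
forecast = slope * 5 + intercept     # predicted revenue for month 5
print(slope, intercept, forecast)
```

The other three techniques follow the same pattern: learn from existing data, then answer the question for a new case.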

REAL-TIME TECHNOLOGIES

WHAT IS THE DIFFERENCE BETWEEN REAL-TIME AND BATCH?
Term       Means                         Example
Real-Time  Process data when it arrives  Reject credit card transaction
Batch      Process data periodically     Flag suspicious transaction at night
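The two rows of the table can be contrasted in a few lines of Python. This is a toy sketch: the `LIMIT` threshold, the transaction records, and both function names are hypothetical.

```python
LIMIT = 1000  # hypothetical amount above which a transaction looks suspicious


def on_arrival(txn):
    """Real-time: decide while the transaction is happening."""
    return "reject" if txn["amount"] > LIMIT else "accept"


def nightly_batch(transactions):
    """Batch: scan the whole day's transactions later and flag the suspicious ones."""
    return [txn["id"] for txn in transactions if txn["amount"] > LIMIT]


day = [
    {"id": "t1", "amount": 40},
    {"id": "t2", "amount": 5000},
    {"id": "t3", "amount": 120},
]

decisions = [on_arrival(txn) for txn in day]   # one decision per event, immediately
flagged = nightly_batch(day)                   # one pass over everything, afterwards
print(decisions, flagged)
```

The rule is the same in both cases; what differs is when it runs and therefore what action is still possible.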

BATCH
Processing Layer  SQL Layer
MapReduce         Hive, Pig
Spark             Spark SQL

REAL-TIME
HBase
Kafka
Spark Streaming
Lambda Architecture

WHY HBASE?
How can we store petabytes of data on HDFS and do fast reads and writes like a database?

WHAT IS HBASE?
HBase is a NoSQL database on top of HDFS
Can store petabytes of data
Reads/writes much faster than a traditional database and HDFS

WHY KAFKA?
How can we hold onto incoming data and not lose it when we are getting a million messages per second?

WHAT IS KAFKA?
Kafka is TiVo for the cluster
It stores real-time data as it comes in
Can store a week of data
Queuing system for the Hadoop cluster
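The TiVo analogy (record everything as it arrives, keep it for a while, replay it later) can be sketched as a toy append-only log. This is not the Kafka API: the `RetainedLog` class and its methods are hypothetical, and the sketch leaves out topics, partitions, and replication.

```python
class RetainedLog:
    """Toy append-only log with time-based retention and replay by offset."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []        # (offset, timestamp, message) tuples
        self.next_offset = 0

    def append(self, message, now):
        """Store a message as it arrives, then expire records past retention."""
        offset = self.next_offset
        self.records.append((offset, now, message))
        self.next_offset += 1
        cutoff = now - self.retention_seconds
        self.records = [r for r in self.records if r[1] >= cutoff]
        return offset

    def read_from(self, offset):
        """Replay every retained message at or after the given offset."""
        return [(o, m) for o, _, m in self.records if o >= offset]


WEEK = 7 * 24 * 3600                 # keep a week of data, as on the slide
log = RetainedLog(WEEK)
log.append("click-1", now=0)
log.append("click-2", now=10)
log.append("click-3", now=WEEK + 5)  # click-1 is now older than a week and expires
replay = log.read_from(0)
print(replay)
```

A consumer that falls behind can resume from its last offset, as long as the data is still inside the retention window.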

WHY SPARK STREAMING?
How can we process data as it comes in instead of every night (using Spark or MapReduce)?

WHAT IS SPARK STREAMING?
Spark Streaming is a library on top of Spark
It allows processing data as soon as it comes in
Typically sits downstream of Kafka, reading the data Kafka has queued

SPARK STREAMING

WHY LAMBDA ARCHITECTURE?
How can we watch historical trends and what is happening right now?
How can we show bestsellers from this year and from the last hour?

WHAT IS LAMBDA ARCHITECTURE?
Big Data system which can handle both batch and real-time
Uses historical data as well as real-time data
Best of both worlds
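The bestseller question can be answered by keeping two views and merging them at query time. The sketch below is a minimal illustration with made-up sales counts; `batch_view`, `speed_view`, and `bestsellers` are hypothetical names, and in a real system the two views would come from a batch job and a streaming job.

```python
from collections import Counter

# Batch view: totals for the year so far, recomputed periodically (made-up numbers).
batch_view = Counter({"book-a": 900, "book-b": 750, "book-c": 20})
# Real-time view: sales in the last hour, not yet included in the batch view.
speed_view = Counter({"book-c": 45, "book-a": 3})


def bestsellers(view, n=1):
    """Return the n best-selling items in a view."""
    return [item for item, _ in view.most_common(n)]


this_year = bestsellers(batch_view + speed_view)   # history plus what just happened
last_hour = bestsellers(speed_view)                # only what is happening right now
print(this_year, last_hour)
```

With these numbers the yearly bestseller and the last-hour bestseller differ, which is the kind of answer neither layer could give alone.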

LAMBDA ARCHITECTURE

REVIEW

BATCH REVIEW
Technology  Description
Hadoop      Cluster operating system
HDFS        Stores petabytes of data on 100s or 1000s of machines
MapReduce   Processes data in HDFS
Hive        Translates SQL to MapReduce
Pig         Translates PigLatin to MapReduce
Spark       Faster MapReduce
Spark SQL   Translates SQL to Spark

REAL-TIME REVIEW
Technology           Description
HBase                Fast NoSQL database on top of HDFS
Kafka                Queues incoming data into the cluster
Spark Streaming      Processes data in real time
Lambda Architecture  Combines real-time and batch
