DATA ENGINEERING QUICK GUIDE
ASIM JALIS
GALVANIZE

BIG DATA

WHY HADOOP?
How can we create a supercomputer using cheap Linux boxes?

WHAT IS HADOOP?
Operating system for a cluster of machines
Combines small, weak computers to create a Big Data system
Unified disk and processing power

HADOOP

WHY HDFS?
How can we store a petabyte-sized file using cheap Linux boxes?

WHAT IS HDFS?
Split the petabyte file into 128 MB blocks
Distribute the blocks across the Hadoop cluster
Make 3 copies of each block for insurance

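The arithmetic behind these three steps can be sketched in a few lines of plain Python. This is an illustration only, not how HDFS is implemented: the function names, the node names, and the round-robin placement are assumptions made for the example, and real HDFS applies its own placement policy.

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MB, the block size named on the slide
REPLICATION = 3              # 3 copies of each block


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    if len(nodes) < replication:
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


# A petabyte (2**50 bytes) splits into 2**23 blocks of 128 MB each.
petabyte_blocks = block_count(1024**5)

# Place 4 blocks on a hypothetical 4-node cluster.
placement = place_replicas(4, ["n1", "n2", "n3", "n4"])
```

Because every block lives on three different machines, losing any single machine leaves at least two copies of each block it held.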
HDFS

WHY MAPREDUCE?
How can we process the data in HDFS without pulling it out and pushing the result back?

WHAT IS MAPREDUCE?
Send the program to where the data is on HDFS
Process the petabyte file by processing each block, then combining the results

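The "process each block, then combine" idea can be sketched in plain Python. This is a single-process illustration under assumed data (sales records of the form `(region, amount)`, with hypothetical function names), not the Hadoop API; on a real cluster the map step would run on the machines that hold each block.

```python
from collections import defaultdict


def map_block(block):
    """Map phase: emit (key, value) pairs for the records in one block."""
    return [(region, amount) for region, amount in block]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce phase: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(blocks):
    emitted = [pair for block in blocks for pair in map_block(block)]
    return dict(reduce_group(k, v) for k, v in shuffle(emitted).items())


# Three hypothetical blocks of one sales file.
blocks = [
    [("east", 100), ("west", 250)],
    [("east", 500)],
    [("west", 75), ("east", 25)],
]
totals = run_job(blocks)
```

Each block is mapped independently, so the map work can be spread over as many machines as there are blocks.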
MAPREDUCE

WHY HIVE?
How can people who don’t know Java write MapReduce jobs?

WHAT IS HIVE?
Hive translates SQL to MapReduce jobs

HIVE

    SELECT *
    FROM sales
    WHERE amount > 400;

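One way to picture the work behind this query is a filter applied to every block of the `sales` table, with no reduce step needed. The Python sketch below is conceptual only and does not show Hive's actual query plan; the row layout (`id`, `amount`) and the function name are assumptions for the example.

```python
def filter_mapper(rows, threshold=400):
    """Map-only step: keep the rows whose amount exceeds the threshold."""
    for row in rows:
        if row["amount"] > threshold:
            yield row


# Two hypothetical blocks of the sales table.
sales_blocks = [
    [{"id": 1, "amount": 120}, {"id": 2, "amount": 450}],
    [{"id": 3, "amount": 400}, {"id": 4, "amount": 900}],
]

# Equivalent of: SELECT * FROM sales WHERE amount > 400;
result = [row for block in sales_blocks for row in filter_mapper(block)]
```

The row with `amount` equal to 400 is excluded because the comparison is strict.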
HIVE

WHY PIG?
How can people who don’t know Java or SQL write MapReduce jobs?

WHAT IS PIG?
Pig translates Pig Latin to MapReduce jobs
Pig Latin is a scripting language comparable to SQL

PIG

    high_sales = FILTER sales_data BY amount > 400;

PIG

WHY SPARK?
How can we make MapReduce faster and the API less clunky?

WHAT IS SPARK?
Spark is like MapReduce
Spark has a cleaner API and is faster
It is faster because it keeps intermediate results in memory

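SPARK

    // Word count: split each line into words, pair each word with 1,
    // sum the 1s per word, and write the totals to the "output" directory.
    sc.textFile("shakespeare.txt").
      flatMap(line => line.split("\\W+")).
      map(word => (word, 1)).
      reduceByKey((count1, count2) => (count1 + count2)).
      saveAsTextFile("output")
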
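For readers who do not have a Spark shell at hand, the same pipeline can be traced step by step in plain Python. This is a single-machine sketch of what each stage computes, not PySpark; the function name is an assumption, and unlike the Scala version it drops empty strings produced by the split.

```python
import re
from collections import Counter


def word_count(lines):
    # flatMap: split every line on runs of non-word characters
    words = (w for line in lines for w in re.split(r"\W+", line) if w)
    # map: pair each word with a count of 1
    pairs = ((w, 1) for w in words)
    # reduceByKey: add up the counts that share the same word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


counts = word_count(["to be, or not to be"])
```

The generators mean no intermediate list is written out between steps, which loosely mirrors how Spark chains transformations in memory.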
SPARK

WHY SPARK SQL?
How can people who don’t know Scala, Python, or Java write Spark code?

WHAT IS SPARK SQL?
Spark SQL is like Hive for Spark
Hive translates SQL to MapReduce
Spark SQL translates SQL to Spark

SPARK SQL

    SELECT *
    FROM sales
    WHERE amount > 400;

SPARK SQL

WHAT IS SPARK MLLIB?
Machine Learning algorithms on Spark
Analyze data to extract insights

WHAT IS MACHINE LEARNING?

| Technique | Question |
| --- | --- |
| Regression | What will revenue be next month? |
| Classification | Is the tumor cancerous or benign? |
| Clustering | Which customers are similar to each other? |
| Recommendation | Which movie will you like? |

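To make the first row concrete, here is a minimal sketch of regression: fit a straight line to past monthly revenue and extend it one month ahead. It uses plain Python rather than MLlib, and the revenue figures and function name are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical revenue for months 1 to 4.
months = [1, 2, 3, 4]
revenue = [100.0, 120.0, 140.0, 160.0]

slope, intercept = fit_line(months, revenue)
forecast = slope * 5 + intercept   # predicted revenue for month 5
```

With this perfectly linear sample the fitted line grows by 20 per month, so the month-5 forecast continues that trend.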
REAL-TIME TECHNOLOGIES

WHAT IS THE DIFFERENCE BETWEEN REAL-TIME AND BATCH?

| Term | Means | Example |
| --- | --- | --- |
| Real-Time | Process data when it arrives | Reject credit card transaction |
| Batch | Process data periodically | Flag suspicious transaction at night |

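The two rows of the table can be contrasted in code. In this sketch the real-time function decides on one transaction the moment it arrives, while the batch function runs later over everything collected during the day. The spending limit, the "three or more transactions" rule, and the field names are assumptions chosen for the example.

```python
LIMIT = 1000  # hypothetical per-transaction limit


def handle_realtime(txn):
    """Real-time: decide as soon as a single transaction arrives."""
    return "reject" if txn["amount"] > LIMIT else "accept"


def nightly_batch(transactions, min_count=3):
    """Batch: scan the whole day's data and flag unusually busy cards."""
    counts = {}
    for txn in transactions:
        counts[txn["card"]] = counts.get(txn["card"], 0) + 1
    return sorted(card for card, n in counts.items() if n >= min_count)


day = [
    {"card": "A", "amount": 50},
    {"card": "A", "amount": 20},
    {"card": "B", "amount": 5000},
    {"card": "A", "amount": 30},
]

decisions = [handle_realtime(txn) for txn in day]   # one decision per arrival
flagged = nightly_batch(day)                        # one pass at night
```

The real-time path sees only the current transaction, whereas the batch path can use patterns that only show up across many records.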
BATCH

| Processing Layer | SQL Layer |
| --- | --- |
| MapReduce | Hive, Pig |
| Spark | Spark SQL |

REAL-TIME
HBase
Kafka
Spark Streaming
Lambda Architecture

WHY HBASE?
How can we store petabytes of data on HDFS and do fast reads and writes like a database?

WHAT IS HBASE?
HBase is a NoSQL database on top of HDFS
Can store petabytes of data
Random reads/writes are much faster than on plain HDFS, at a scale traditional databases struggle to reach

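The access pattern this enables is lookup by row key rather than scanning whole files. The toy table below keeps its keys sorted so that both a single-row get and a key-range scan are cheap. It is an in-memory illustration of that idea only, not the HBase client API; the class and method names are assumptions.

```python
import bisect


class TinySortedTable:
    """Toy key-value table that keeps row keys in sorted order."""

    def __init__(self):
        self._keys = []   # row keys, always sorted
        self._rows = {}   # row key -> value

    def put(self, row_key, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def get(self, row_key):
        """Point read: fetch one row directly by its key."""
        return self._rows.get(row_key)

    def scan(self, start, stop):
        """Range read: rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = TinySortedTable()
table.put("user#002", "Bo")
table.put("user#001", "Al")
table.put("user#010", "Cy")

one_row = table.get("user#001")
some_rows = table.scan("user#001", "user#010")
```

Neither read touches rows outside the requested keys, which is the contrast with reading an entire HDFS file to find one record.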
HBASE

WHY KAFKA?
How can we hold onto incoming data and not lose it when we are getting a million messages per second?

WHAT IS KAFKA?
Kafka is TiVo for the cluster
It stores real-time data as it comes in
Can store a week of data (the retention period is configurable)
Queuing system for Hadoop cluster

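The TiVo analogy comes down to an append-only log: messages are recorded in arrival order, kept for a retention period, and can be replayed from any stored position. The class below is a minimal in-memory sketch of that idea, not the Kafka client API; the class name, method names, and timestamps are assumptions.

```python
WEEK = 7 * 24 * 3600  # retention period in seconds


class TinyLog:
    """Toy append-only log with offsets and time-based retention."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self._records = []       # (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message, now):
        """Store a message as it arrives and return its offset."""
        offset = self._next_offset
        self._records.append((offset, now, message))
        self._next_offset += 1
        return offset

    def expire(self, now):
        """Drop records older than the retention period."""
        self._records = [r for r in self._records if now - r[1] <= self.retention]

    def read_from(self, offset):
        """Replay every retained message at or after the given offset."""
        return [(o, m) for o, _, m in self._records if o >= offset]


log = TinyLog(WEEK)
log.append("a", now=0)
log.append("b", now=100)
log.append("c", now=WEEK + 50)

replay = log.read_from(1)        # a consumer rewinds to offset 1
log.expire(now=WEEK + 60)        # "a" is now older than a week
remaining = log.read_from(0)
```

Reading does not remove anything, so several consumers can replay the same data at their own pace until it expires.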
KAFKA

WHY SPARK STREAMING?
How can we process data as it comes in instead of every night (using Spark or MapReduce)?

WHAT IS SPARK STREAMING?
Spark Streaming is a library on top of Spark
It allows processing data as soon as it comes in
Sits downstream of Kafka, consuming the data that Kafka stores

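Spark Streaming's original API handles "as soon as it comes in" by cutting the stream into small time windows (micro-batches) and running an ordinary Spark job on each one. The plain-Python sketch below shows only that grouping step; the event format `(timestamp, value)` and the function name are assumptions, and windows with no events are simply skipped here.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into windows of `interval` seconds."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(timestamp // interval, []).append(value)
    return [batches[window] for window in sorted(batches)]


# Hypothetical events: (arrival time in seconds, amount)
events = [(0, 5), (1, 7), (2, 1), (5, 4), (9, 9)]

# Process each 2-second micro-batch with the same logic a batch job would use.
sums = [sum(batch) for batch in micro_batches(events, interval=2)]
```

A shorter interval lowers the delay before results appear, at the cost of running more, smaller jobs.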
SPARK STREAMING

WHY LAMBDA ARCHITECTURE?
How can we watch historical trends and what is happening right now?
How can we show bestsellers from this year and from the last hour?

WHAT IS LAMBDA ARCHITECTURE?
Big Data system which can handle both batch and real-time
Uses historical data as well as real-time data
Best of both worlds

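The bestseller question can be answered by keeping two views and merging them at query time: one computed in batch over historical sales, one updated from sales that arrived since the last batch run. The sketch below uses plain Python counters to show the merge; the product names, sample data, and function names are assumptions.

```python
from collections import Counter


def batch_view(historical_sales):
    """Batch layer: counts recomputed periodically over all historical data."""
    return Counter(historical_sales)


def speed_view(recent_sales):
    """Speed layer: counts for data that arrived after the last batch run."""
    return Counter(recent_sales)


def bestsellers(batch, speed, top_n=2):
    """Serving step: merge both views to answer a query."""
    return (batch + speed).most_common(top_n)


historical = ["book", "book", "lamp", "pen", "book", "lamp"]
recent = ["pen", "pen", "pen"]

batch = batch_view(historical)
speed = speed_view(recent)

overall = bestsellers(batch, speed)   # historical and recent sales combined
last_hour = speed.most_common(1)      # only what is happening right now
```

The pen's surge shows up in the merged answer immediately, without waiting for the next batch run to recount the historical data.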
LAMBDA ARCHITECTURE
REVIEW

BATCH REVIEW

| Technology | Description |
| --- | --- |
| Hadoop | Cluster operating system |
| HDFS | Stores petabytes of data on 100s or 1000s of machines |
| MapReduce | Processes data in HDFS |
| Hive | Translates SQL to MapReduce |
| Pig | Translates Pig Latin to MapReduce |
| Spark | Faster MapReduce |
| Spark SQL | Translates SQL to Spark |

REAL-TIME REVIEW

| Technology | Description |
| --- | --- |
| HBase | Fast NoSQL database on top of HDFS |
| Kafka | Queues incoming data into cluster |
| Spark Streaming | Process in real-time |
| Lambda Architecture | Combines real-time and batch |