![Page 1: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/1.jpg)
CC5212-1
PROCESAMIENTO MASIVO DE DATOS
AUTUMN 2017
Lecture 1: Introduction
Aidan Hogan
![Page 2: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/2.jpg)
THE VALUE OF DATA
![Page 3: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/3.jpg)
Soho, London, 1854
![Page 4: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/4.jpg)
Cholera: What we know now …
![Page 5: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/5.jpg)
Cholera: What we knew in 1854
![Page 6: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/6.jpg)
1854: Galen’s miasma theory of cholera
![Page 7: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/7.jpg)
1854: The hunt for the invisible cholera
![Page 8: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/8.jpg)
John Snow: 1813–1858
![Page 9: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/9.jpg)
John Snow: 1813–1858
![Page 10: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/10.jpg)
The Survey of Soho
![Page 11: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/11.jpg)
The Survey of Soho
![Page 12: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/12.jpg)
What the data showed …
![Page 13: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/13.jpg)
What the data showed …
![Page 14: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/14.jpg)
616 deaths, 8 days later …
![Page 15: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/15.jpg)
Cholera: What we knew in 1855
![Page 16: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/16.jpg)
Cholera boil notice ca. 1866
![Page 17: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/17.jpg)
Cholera boil notice ca. 1866
![Page 18: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/18.jpg)
Thirty years before discovery of V. cholerae
![Page 19: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/19.jpg)
John Snow: Father of Epidemiology
![Page 20: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/20.jpg)
Epidemiology’s Success Stories
![Page 21: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/21.jpg)
Value of data: Not just epidemiology
![Page 22: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/22.jpg)
(Paper) Notebooks no longer good enough
![Page 23: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/23.jpg)
THE GROWTH OF DATA
![Page 24: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/24.jpg)
“Big Data”
1 Wiki = 1 Wikipedia
English Wikipedia
≈ 51 GB of data
(2015 dump)(Text; No edit history)(XML, uncompressed)
![Page 25: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/25.jpg)
“Big Data”
Wikimedia Commons
≈ 24 TB of data
≈ 470.6 Wiki
(2014 dump)
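The "Wiki" unit on these slides is a simple ratio against the 51 GB English Wikipedia dump. A minimal sketch of the conversion follows. It assumes decimal units (1 TB = 1,000 GB), the convention that reproduces the 470.6 figure above. The name `to_wiki` is illustrative and does not come from the slides.

```python
# Size of the English Wikipedia dump quoted in the slides, in GB.
WIKI_GB = 51

# Decimal multipliers: an assumption, chosen because it matches the
# slide's figure of 470.6 Wiki for 24 TB.
GB_PER_UNIT = {"GB": 1, "TB": 1_000, "PB": 1_000_000}


def to_wiki(amount, unit):
    """Express a data volume as a multiple of one English Wikipedia dump."""
    return amount * GB_PER_UNIT[unit] / WIKI_GB


commons = to_wiki(24, "TB")
print(f"Wikimedia Commons: {commons:.1f} Wiki")  # 470.6
```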
![Page 26: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/26.jpg)
“Big Data”
Sloan Digital Sky Survey
≈ 200 GB / day
≈ 4 Wiki / day
(2013, generated by SDSS)
![Page 27: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/27.jpg)
“Big Data”
≈ 8 TB / day
≈ 157 Wiki / day
(2013, generated)
![Page 28: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/28.jpg)
“Big Data”
Large Hadron Collider
≈ 68 TB / day
≈ 1,370 Wiki / day
(2012, collision data generated)
![Page 29: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/29.jpg)
“Big Data”
≈ 600 TB / day
≈ 11,764 Wiki / day
(2014, incoming Hive data)
![Page 30: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/30.jpg)
“Big Data”
NSA Surveillance
≈ 29 PB / day
≈ 568,627 Wiki / day
(2013, processed)
![Page 31: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/31.jpg)
“Big Data”
≈ 100 PB / day
≈ 2,000,000 Wiki / day
(2014, processed)
![Page 32: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/32.jpg)
“Big Data”
Internet Traffic
≈ 2,417 PB / day
≈ 47,000,000 Wiki / day
(2014, Cisco estimates)
![Page 33: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/33.jpg)
Data: A Modern-day Bottleneck?
![Page 34: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/34.jpg)
The ‘V’s of “Big Data”
![Page 35: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/35.jpg)
“BIG DATA” IN ACTION …
![Page 36: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/36.jpg)
Getting Home (Waze)
“What’s the fastest route to get home right now?”
• Processes journeys as background knowledge
• “Participatory Sensing”
![Page 37: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/37.jpg)
Predicting Pre-crime (PredPol)
“What areas of the city are most in need of police patrol at
13:55 on Mondays?”
• PredPol system used by Santa Cruz (US) police patrols
• Predictions based on 8 years of historical crime data
![Page 38: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/38.jpg)
Getting Elected President (Narwhal)
“Who are the undecided voters and how can I convince
them to vote for me?”
• User profiles built and integrated from online sources
• Targeted messages sent to voters based on profile
![Page 39: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/39.jpg)
Winning Jeopardy (IBM Watson)
“Can a computer beat human experts at Jeopardy?”
• Indexed 200 million pages of content
• An ensemble of 100 processing techniques
![Page 40: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/40.jpg)
“BIG DATA” NEEDS
“MASSIVE DATA PROCESSING” …
![Page 41: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/41.jpg)
Every Application is Different …
• Data can be
– (Semi-)Structured data
• (Relational DBs, JSON, XML, CSV, HTML form data)
– Unstructured data
• (text documents, comments, tweets)
– And everything in-between!
![Page 42: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/42.jpg)
Every Application is Different …
• Processing can involve:
– Database Management/Analytics
• (indexing, querying, joins, aggregation)
– Natural Language Processing
• (keyword search, topic extraction, entity recognition,
machine translation, sentiment analysis, etc.)
– Data Mining and Statistics
• (pattern recognition, classification, event detection,
recommendations, etc.)
– Or something else / A mix
![Page 43: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/43.jpg)
So where to start?
![Page 44: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/44.jpg)
Scale is a Common Factor …
I have an algorithm.
I have a machine that can
process 1,000 input items
in an hour.
If I buy a machine that is
n times as powerful, how
many input items can I
process in an hour?
Note: Not the
same machine!
Quadratic O(n²)
often too much
Depends on what the
algorithm is!!
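To make "depends on what the algorithm is" concrete, here is a rough sketch. It assumes the running time is exactly proportional to a cost function of the input size, ignoring constants, memory and I/O limits, which is an idealisation. A machine n times as powerful then has n times the work budget per hour, so the question becomes: what is the largest input size whose cost fits that budget? The function name `items_per_hour` and the three cost functions are illustrative choices.

```python
import math


def items_per_hour(cost, n, baseline=1_000):
    """Largest input size m with cost(m) <= n * cost(baseline).

    Assumes cost() is non-decreasing and that running time is
    proportional to it.
    """
    budget = n * cost(baseline)
    lo, hi = baseline, baseline
    # Grow an upper bound that no longer fits the budget ...
    while cost(hi) <= budget:
        hi *= 2
    # ... then binary-search the largest size that still fits.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


costs = {
    "O(n)": lambda m: m,
    "O(n log n)": lambda m: m * math.log2(m),
    "O(n^2)": lambda m: m * m,
}

for name, cost in costs.items():
    print(name, items_per_hour(cost, n=100))
```

Under these assumptions, a machine 100 times as powerful handles 100,000 items per hour for the linear algorithm but only 10,000 for the quadratic one.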
![Page 45: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/45.jpg)
Scale is a Common Factor …
• One machine that’s n times as powerful?
vs.
• n machines that are equally powerful?
![Page 46: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/46.jpg)
Scale is a Common Factor …
• Data-intensive (our focus!)
– Inexpensive algorithms / Large inputs
– e.g., Google, Facebook, Twitter
• Compute-intensive (not our focus!)
– More expensive algorithms / Smaller inputs
– e.g., climate simulations, chess games, combinatorial problems
• Not black and white!
![Page 47: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/47.jpg)
“MASSIVE DATA PROCESSING” NEEDS
“DISTRIBUTED COMPUTING” …
![Page 48: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/48.jpg)
Distributed Computing
• Need more than one machine!
• Google ca. 1998:
![Page 49: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/49.jpg)
Distributed Computing
• Need more than one machine!
• Google ca. 2014:
![Page 50: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/50.jpg)
Data Transport Costs
• Need to divide tasks over many machines
– Machines need to communicate
… but not too much!
– Data transport costs (simplified):
(Figure labels: Main Memory, Hard-disk, Solid-state Disk, Network)
Need to minimise network costs!
![Page 51: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/51.jpg)
Data Placement
• Need to think carefully about where to put
what data!
I have four machines to run a
website. I have 10 million users.
Each user has personal profile
data, photos, friends and games.
How should I split the data up
over the machines?
Depends on the application!
But some general principles and
design choices apply.
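One common design choice is to partition by user, so that everything belonging to one user lives on one machine. The sketch below shows hash partitioning for the four-machine scenario. It is only one option among several, and the function name `machine_for` and the choice of MD5 as a stable hash are assumptions made for illustration.

```python
import hashlib

MACHINES = 4  # from the scenario on the slide


def machine_for(user_id, machines=MACHINES):
    """Map a user id to a machine index using a stable hash."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % machines


# Count how a sample of user ids spreads across the machines.
counts = [0] * MACHINES
for user_id in range(100_000):
    counts[machine_for(user_id)] += 1
print(counts)
```

With this scheme a profile page can be served by a single machine, but anything involving a user's friends may have to contact several machines, which is where the application starts to matter.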
![Page 52: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/52.jpg)
Network/Node Failures
• Need to think about failures!
![Page 53: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/53.jpg)
Network/Node Failures
• Need to think (even more!) carefully about
where to put what data!
I have four machines to run a
website. I have 10 million users.
Each user has personal profile
data, photos, friends and games.
How should I split the data up
over the machines?
(Again)
Depends on the application!
But some general principles and
design choices apply.
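Once failures are in the picture, a single copy of each user's data is no longer enough. A minimal sketch of one possible approach is to keep each user's data on a primary machine plus the next machine in ring order. The replication factor of 2 and the names `replicas_for` and `readable_from` are hypothetical, chosen only for illustration.

```python
MACHINES = 4
REPLICAS = 2  # hypothetical replication factor


def replicas_for(user_id, machines=MACHINES, replicas=REPLICAS):
    """Return the machines holding copies of a user's data.

    The primary is user_id modulo machines; further copies go to the
    next machines in ring order.
    """
    primary = user_id % machines
    return [(primary + i) % machines for i in range(replicas)]


def readable_from(user_id, failed):
    """Machines that can still serve the user when some have failed."""
    return [m for m in replicas_for(user_id) if m not in failed]


print(replicas_for(7))            # [3, 0]
print(readable_from(7, {3}))      # [0]
print(readable_from(7, {3, 0}))   # []
```

The trade-off is that every write now has to reach two machines, and the data becomes unavailable only when both copies are lost at once.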
![Page 54: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/54.jpg)
Human Distributed Computation
![Page 55: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/55.jpg)
“DISTRIBUTED COMPUTING”
LIMITS & CHALLENGES …
![Page 56: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/56.jpg)
Distribution Not Always Applicable!
![Page 57: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/57.jpg)
Distributed Development Difficult
• Distributed systems can be complex
• Multiple machines; need to take care of:
– Data in different locations
– Logs and messages in different places
– Different users with different priorities
– Different network capabilities
– Need to balance load!
– Need to handle failures!
• Tasks may take a long time!
– Bugs may not become apparent for hours
– Lots of data = lots of counter-examples
![Page 58: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/58.jpg)
Frameworks/Abstractions can Help
• For Distrib. Processing
• For Distrib. Storage
![Page 59: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/59.jpg)
HOW DOES TWITTER WORK?
![Page 60: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/60.jpg)
Based on 2013 slides by Twitter lead
architect: Raffi Krikorian
“Twitter Timelines at Scale”
![Page 61: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/61.jpg)
Big Data at Twitter
• 150 million active worldwide users
• 400 million tweets per day
– mean: 4,600 tweets/second
– max: 150,000 tweets/second
• 300,000 queries/second for user timelines
• 6,000 queries/second for custom search
Which aspect is most important to optimise?
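A quick back-of-the-envelope check of the figures above helps answer the question. The variable names are illustrative; the numbers are the ones quoted on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 400_000_000
timeline_reads_per_sec = 300_000
search_reads_per_sec = 6_000

# Mean write rate implied by the daily tweet volume.
mean_writes_per_sec = tweets_per_day / SECONDS_PER_DAY

# How many timeline reads arrive for each tweet written, on average.
read_write_ratio = timeline_reads_per_sec / mean_writes_per_sec

print(f"mean writes/sec: {mean_writes_per_sec:,.0f}")        # 4,630
print(f"timeline reads per write: {read_write_ratio:.0f}")   # 65
```

Timeline reads outnumber writes by roughly 65 to 1, and outnumber search queries by 50 to 1, which suggests that reading timelines is the path most worth optimising.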
![Page 62: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/62.jpg)
Supporting timelines: write
• mean: 4,600 tweets/second
![Page 63: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/63.jpg)
High-fanout
![Page 64: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/64.jpg)
Supporting timelines: read
• 300,000 queries/second
1ms @p50
4ms @p99
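The slides only name the idea of high fan-out, so the following is a sketch of the general fan-out-on-write pattern and not Twitter's actual implementation. Each tweet is pushed into a precomputed timeline for every follower, which makes writes expensive and reads cheap. All names, and the cap of 800 entries per timeline, are hypothetical.

```python
from collections import defaultdict, deque

TIMELINE_LENGTH = 800  # hypothetical cap on stored entries per timeline

followers = defaultdict(set)  # author -> users who follow them
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LENGTH))


def follow(user, author):
    """Record that `user` follows `author`."""
    followers[author].add(user)


def post(author, tweet_id):
    """Write path: one tweet costs one insert per follower."""
    for user in followers[author]:
        timelines[user].appendleft(tweet_id)


def read_timeline(user, limit=20):
    """Read path: a lookup in a precomputed list, newest first."""
    return list(timelines[user])[:limit]


follow("bob", "alice")
follow("carol", "alice")
post("alice", 1)
post("alice", 2)
print(read_timeline("bob"))  # [2, 1]
```

The design moves work from the frequent operation (reads) to the rarer one (writes). The cost is that an author with many followers triggers many inserts per tweet.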
![Page 65: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/65.jpg)
Supporting text search
• Information retrieval
– Earlybird: Lucene clone
– Write once
– Query many
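"Write once, query many" is the access pattern an inverted index is built for. The sketch below is a toy inverted index that illustrates the principle only. It is not how Earlybird or Lucene is implemented, and the function names and whitespace tokenisation are assumptions.

```python
from collections import defaultdict

index = defaultdict(set)  # term -> ids of documents containing it


def add_document(doc_id, text):
    """Write once: tokenise on whitespace and record each term."""
    for term in text.lower().split():
        index[term].add(doc_id)


def search(query):
    """Query many: ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


add_document(1, "cholera outbreak in Soho")
add_document(2, "Soho water pump")
print(sorted(search("soho")))        # [1, 2]
print(sorted(search("soho pump")))   # [2]
```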
![Page 66: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/66.jpg)
Timeline vs. Search
Timeline: 4,600 requests/sec (write), 300,000 requests/sec (read)
Search: 4,600 requests/sec (write), 6,000 requests/sec (read)
![Page 67: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/67.jpg)
Twitter: Full Architecture
![Page 68: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/68.jpg)
“PROCESAMIENTO MASIVO DE DATOS”
ABOUT THE COURSE …
![Page 69: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/69.jpg)
What the Course Is/Is Not
• Data-intensive, not compute-intensive
• Distributed tasks, not networking
• Commodity hardware, not supercomputers
• General methods, not specific algorithms
• Practical methods with a little theory
![Page 70: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/70.jpg)
What the Course Is
• Principles of Distributed Computing [1 week]
• Distributed Processing Frameworks [4 weeks]
• Information Retrieval [3 weeks]
• Principles of Distributed Databases [3 weeks]
• Projects [1–2 weeks]
![Page 71: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/71.jpg)
Course Structure
• ~1.5 hours of lectures per week [Monday]
• 1.5 hours of labs per week [Wednesday]
– To be turned in by next Monday evening
– Mostly Java
– In B08; on laptops
http://aidanhogan.com/teaching/cc5212-1-2017/
![Page 72: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/72.jpg)
Course Marking
• 50% for Weekly Labs (~5% a lab!)
• 15% for Small Class Project
• 35% for Exam(s)
![Page 73: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/73.jpg)
Outcomes!
![Page 74: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/74.jpg)
Outcomes!
![Page 75: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/75.jpg)
Outcomes!
![Page 76: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/76.jpg)
Outcomes!
![Page 77: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/77.jpg)
Outcomes!
![Page 78: CC5212-1 Procesamiento Masivo de Datos 2017 - …aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017-01.pdf · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 1: Introduction](https://reader031.vdocuments.us/reader031/viewer/2022022615/5ba1cdba09d3f2b16a8d109d/html5/thumbnails/78.jpg)
Questions?