Background Material (douglas/Classes/bigdata/lectures/2019su/background-grad.pdf)
Schedule
• Undergraduates
  o June 30, 2:00-6:00, Background material
  o July 2, 2:00-6:00, Data finding
  o July 4, 2:00-6:00, Data finding and machine learning
  o July 6, 2:00-6:00, Machine learning
• Graduates
  o July 1, 2:30-5:30, Introduction and data finding
  o July 3, 8:30-11:30, Data finding and machine learning
  o July 3, 2:30-5:30, Machine learning
Introduction
Useful References
• http://www.mgnet.org/~douglas/Classes/bigdata/2019su-index.html
• Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, Mining of Massive Datasets, 2nd ed. (version 2.1), Stanford University, 2014. The most up to date version is online at http://www.mmds.org. I will lecture from the 3rd edition draft as well.
• Andriy Burkov, The Hundred-Page Machine Learning Book, http://themlbook.com/wiki/doku.php, 2019.
Useful References
• Wooyoung Kim, Parallel Clustering Algorithms: Survey, http://grid.cs.gsu.edu/~wkim/index_files/SurveyParallelClustering.html, 2009.
• Deep Learning exercises using TensorFlow, https://www.coursera.org/learn/intro-to-deep-learning/home/welcome.
  o https://github.com/hse-aml/intro-to-dl
Useful Software
• TensorFlow
  o Version 1.13 is stable. Version 2.0.0-beta is not.
  o Anaconda or Miniconda environments
  o Additional Python packages: jupyter, matplotlib, pandas
• Tableau
• MapReduce, Spark, and workflow systems
• Many problems run 1000X faster on a GPU
Some Sources of Big Data
• Interactions with dynamic databases
• Internet data
• City or regional transportation flow control
• Environment and disaster management
• Oil/gas fields or pipelines, seismic imaging
• Government or industry regulation/statistics
• Closed circuit camera identification
Oil/Gas Pipelines
Picture courtesy of Merriam-Webster Dictionary
Pipeline Network Properties
• Pipe diameters range from 2 inches to 5 feet.
• Rarely straight and level.
• Contain
  – Possibly different grades of oil or gas simultaneously.
  – Pigs as separators.
  – Sensors (inside and outside).
• Not restricted to oil/gas pipelines (water, etc.).
1970’s Modeling
• Problem modeled mathematically based on time-dependent, nonlinear, coupled partial differential equations (two models).
  – Sensors on all pipeline components (recall the cartoon).
  – Distributed GRID computing with scattered phone booths:
    • 2 minicomputers, 4 array processors, a heat pump on top, and a U.S. nickel soldered in place to allow “free” calls for telemetry.
• Sensors provided data (temperature, pressure, and velocity) dynamically, based on need and anomalies, and were controlled by the environment and the running model.
• No central computing, just central and distributed control sites.
• 2,000 pieces of telemetry/minute in the complete KSA network (1978).
Current Modeling
• 3D math models of pipelines with topography.
• Central computing and fiber optic TCP/IP with Gigabit Ethernet backup near pipelines.
• Many more sensors, plus ones to measure pipe (shape) changes, internal pollutants, and external gas leakages.
• When the 1978 system in KSA was replaced in 1998, telemetry rose to 100,000 times as much per minute. In 2014, a tsunami of uncountable data.
Monitoring Site Evolution
• In the 1970s, a primitive center where “what if” scenarios were run, in parallel with regular monitoring, to keep pipelines from breaking.
• Now, large-scale visualization is used to monitor pipelines in a multiscale framework. Individual high-resolution monitors (1080p and 4K+) are used for “what if” scenarios.
• Always trying to find anomalies in the data streams to avoid pipeline problems.
Computer Science Techniques
Hash Tables
• A hash table is a data structure with N buckets.
  – N is usually a prime number and may be quite large.
  – Each bucket contains data.
  – Accessed using a hash function Key = h(x).
    • h(x) must be inexpensive to evaluate.
    • Key is an index 0, 1, …, N-1 into the hash table.
    • Data x can be found only in bucket h(x).
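The structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are non-negative integers, and the hash function is taken to be h(x) = x mod N, which the slide does not prescribe.

```python
# Minimal sketch: a hash table stored as a list of N buckets.
# Assumptions (not from the slides): integer data and h(x) = x mod N.

N = 7  # number of buckets; a prime


def h(x: int) -> int:
    """Hash function: cheap to evaluate, returns an index in 0..N-1."""
    return x % N


# Each bucket is a list holding every item that hashes to that index.
table = [[] for _ in range(N)]

for x in (102, 203, 41103):
    table[h(x)].append(x)

# An item x can only ever be found in bucket h(x).
print(203 in table[h(203)])
```

A Python list per bucket stands in for the linked list that a lower-level language would use.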
Storing a Hash Table
• If the data is very simple (numbers or short strings), then a spreadsheet may be optimal.
• If the data is arbitrary, then dynamically allocated memory techniques are common.
  – Common to use linked lists inside of each bucket.
  – Can be error prone.
  – Must remember to deallocate all of the hash table when done, which can also be error prone.
  – Must decide if duplicates are allowed in a bucket.
Common Data Structure
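[Figure: an array of buckets 0, 1, 2, …, N-2, N-1. Each bucket points to a linked list holding the data for that bucket; each list ends with a null pointer (0).]
Variations:
• doubly linked lists
• nested tables
• spreadsheet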
Hash Table Functionalities
• Search
• Add
  – Uses Search
• Delete
  – Uses Search
• Modify (optional)
  – Uses Search
• Change order of data in a bucket (optional)
  – Uses Search and possibly Delete and Add
Functionality
• Search(x)
  – Compute Key = h(x)
  – For each data item stored in bucket Key, compare x to the data.
    • If a match, then return something that allows the data to be accessed.
    • If there is no match, return a Failure notice.
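A possible Python rendering of Search(x) follows. It is a sketch, not the lecture's code: it assumes integer data, the hypothetical hash h(x) = x mod N with N = 7, a position inside the bucket as the "something that allows the data to be accessed", and `None` as the Failure notice.

```python
from typing import List, Optional

N = 7  # number of buckets (assumed)
table: List[List[int]] = [[] for _ in range(N)]


def h(x: int) -> int:
    """Assumed hash function for integer data: h(x) = x mod N."""
    return x % N


def search(x: int) -> Optional[int]:
    """Return the position of x inside bucket h(x), or None as the Failure notice."""
    key = h(x)
    for position, item in enumerate(table[key]):
        if item == x:
            return position  # the data is then table[key][position]
    return None


# Two items that collide in bucket 0.
table[h(203)].append(203)
table[h(10101)].append(10101)

print(search(10101), search(12516))
```

Only bucket h(x) is scanned, so the cost of a search depends on how many items share that bucket, not on the size of the whole table.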
Functionality
• Add(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • If no duplicates are allowed, return something that allows the data to be accessed (and indicates that it is already in the hash table).
  – Otherwise,
    • Probably make a copy of x and add it to bucket h(x).
      – Usually added as the first or last element in bucket h(x).
      – Usually have to modify the linked list for bucket h(x).
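The Add(x) steps might look as follows in Python. The names `ALLOW_DUPLICATES`, `add`, and the `(position, added)` return value are illustrative choices, and the hash h(x) = x mod 7 is again an assumption; appending to a Python list plays the role of relinking a linked list.

```python
import copy

N = 7                     # number of buckets (assumed)
ALLOW_DUPLICATES = False  # policy decision for this table (assumed)
table = [[] for _ in range(N)]


def h(x):
    """Assumed hash function for integer data."""
    return x % N


def search(x):
    """Position of x in bucket h(x), or None for Failure."""
    for position, item in enumerate(table[h(x)]):
        if item == x:
            return position
    return None


def add(x):
    """Add x to bucket h(x); return (position, added)."""
    found = search(x)
    if found is not None and not ALLOW_DUPLICATES:
        # Already stored: report where it is and that nothing was added.
        return found, False
    bucket = table[h(x)]
    bucket.append(copy.copy(x))  # store a copy as the last element
    return len(bucket) - 1, True
```

Calling `add(203)` twice with duplicates disallowed stores the item once; the second call returns the existing position with `added` set to `False`.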
Functionality
• Delete(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • Remove the data from bucket h(x). This usually means deleting the copy of x and relinking inside the linked list. There may be other bookkeeping, too.
    • Return Success.
  – Otherwise,
    • Return Failure.
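Delete(x) can be sketched the same way. As before, integer data and h(x) = x mod 7 are assumptions, and `True`/`False` stand in for Success/Failure. In Python, `del` on a list element takes care of the relinking that a hand-written linked list would need.

```python
N = 7  # number of buckets (assumed)
table = [[] for _ in range(N)]


def h(x):
    """Assumed hash function for integer data."""
    return x % N


def search(x):
    """Position of x in bucket h(x), or None for Failure."""
    for position, item in enumerate(table[h(x)]):
        if item == x:
            return position
    return None


def delete(x):
    """Remove x from bucket h(x); return True for Success, False for Failure."""
    found = search(x)
    if found is None:
        return False
    del table[h(x)][found]  # the list closes the gap, like relinking
    return True


# Small table to experiment with.
for item in (203, 10101, 102):
    table[h(item)].append(item)
```

If duplicates are allowed, this version removes only the first matching copy; removing all copies would need a loop.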
Simple Examples
• Dataset D consists of strings of exactly length 3 over the letters a, b, c, …, x, y, z.
• We encode each letter by 00, 01, 02, ..., 23, 24, 25. So, abz is 000125 = 125.
• Consider two hash functions:
  – h1(x) = x mod 7
  – h2(x) = leading encoded letter in x
• We get two very different hash tables.
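The encoding and the two hash functions can be written out in Python as a quick check. The function names are illustrative; `h2` relies on the strings having exactly three letters, so the leading letter code is the encoded value divided by 10000.

```python
def encode(word: str) -> int:
    """Encode a lowercase word using two digits per letter: a=00, b=01, ..., z=25."""
    digits = "".join(f"{ord(c) - ord('a'):02d}" for c in word)
    return int(digits)


def h1(x: int) -> int:
    """h1(x) = x mod 7."""
    return x % 7


def h2(x: int) -> int:
    """Leading encoded letter of a 3-letter word (digits above the lowest four)."""
    return x // 10000


x = encode("abz")  # "000125" read as an integer
print(x, h1(x), h2(x))
```

Leading zeros vanish when the digit string becomes an integer, which is why words starting with `a` encode to numbers below 10000 and land in bucket 0 under `h2`.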
Example Dataset D
• D = { abc, def, acd, zaa, bbb, bzq, zxw, faq, cap, eld, ssa, bab }, or encoded
• D = { 102, 30405, 203, 250000, 10101, 12516, 252322, 50016, 20015, 41103, 181800, 10001 }
h1(x) for D
• The number of buckets is 7 (a prime).
• This is not necessarily a well balanced hash table since too many members of D go into bucket 0.
• We can store the hash table using linked lists.

x h1(x) x h1(x) x h1(x)
102 4 30405 4 203 0
250000 2 10101 0 12516 0
252322 0 50016 1 20015 2
41103 6 181800 3 10001 5

Hash Table for h1(x)
Bucket: data for each bucket (linked list; the final 0 is the null pointer)
0: 203 → 10101 → 12516 → 252322 → 0
1: 50016 → 0
2: 250000 → 20015 → 0
3: 181800 → 0
4: 102 → 30405 → 0
5: 10001 → 0
6: 41103 → 0
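The h1(x) table can be rebuilt directly from the strings in D, which is a useful check on hand arithmetic. The helper names `encode` and `build_table` below are illustrative, and Python lists stand in for the linked lists.

```python
D = ["abc", "def", "acd", "zaa", "bbb", "bzq",
     "zxw", "faq", "cap", "eld", "ssa", "bab"]


def encode(word):
    """Two digits per letter: a=00, b=01, ..., z=25."""
    return int("".join(f"{ord(c) - ord('a'):02d}" for c in word))


def build_table(items, h, n_buckets):
    """Place each item in bucket h(item), keeping insertion order."""
    table = [[] for _ in range(n_buckets)]
    for x in items:
        table[h(x)].append(x)
    return table


encoded = [encode(w) for w in D]
h1_table = build_table(encoded, lambda x: x % 7, 7)

for key, bucket in enumerate(h1_table):
    print(key, bucket)
```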
h2(x) for D
• The number of buckets is 26 (not a prime).
• This is a very different distribution of data than for h1(x) and more balanced for our particular D.
• We can store it as a table or spreadsheet.
x h2(x) x h2(x) x h2(x)
102 0 30405 3 203 0
250000 25 10101 1 12516 1
252322 25 50016 5 20015 2
41103 4 181800 18 10001 1
Hash Table for h2(x)
key value value value
0 102 203
1 10101 12516 10001
2 20015
3 30405
4 41103
5 50016
6
7
8
9
10
11
12
13
14
15
16
17
18 181800
19
20
21
22
23
24
25 250000 252322
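One rough way to compare the balance of the two tables is to look at bucket sizes: the largest bucket bounds the worst-case search, and empty buckets are unused space. The sketch below does this for the encoded D; the helper name `bucket_sizes` and the two summary measures are illustrative choices, not part of the lecture.

```python
from collections import Counter

ENCODED_D = [102, 30405, 203, 250000, 10101, 12516,
             252322, 50016, 20015, 41103, 181800, 10001]


def bucket_sizes(items, h, n_buckets):
    """Number of items that land in each bucket 0..n_buckets-1."""
    counts = Counter(h(x) for x in items)
    return [counts.get(k, 0) for k in range(n_buckets)]


h1_sizes = bucket_sizes(ENCODED_D, lambda x: x % 7, 7)
h2_sizes = bucket_sizes(ENCODED_D, lambda x: x // 10000, 26)

print("h1: largest bucket", max(h1_sizes), "empty buckets", h1_sizes.count(0))
print("h2: largest bucket", max(h2_sizes), "empty buckets", h2_sizes.count(0))
```

The trade-off is visible in both directions: h2 spreads these twelve items over more buckets but leaves most of its 26 buckets empty.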
Fracking Data Example
• Open database maintained by the Pennsylvania State government based on the fractured oil and gas wells in the Marcellus Basin.
• There are about 8,000 wells that have been drilled and information is maintained about each in this database.
• Each state in the United States has at least one public database about fracking wells.
• 15.3 million Americans live within 1 mile (1.6 km) of a well drilled since 2000.
• Spreadsheets in comma-separated values format (.csv) or PDF files are common.
Fracking Data File Information
• Each file contains information for a period of time during 2000-2014
  o Locations of wells
  o Owner of property
  o Approximate latitude and longitude of each well
  o Drilling company
  o Production information
    § Potential production
    § Actual production (units: barrels for oil, 1000 cubic feet for gas)
    § Active/Inactive
  o Much more information with some cells blank
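Reading such a file mostly comes down to handling blank cells sensibly. The sketch below uses Python's standard `csv` module on a tiny made-up sample: the column names (`well_id`, `driller`, `latitude`, `longitude`, `gas_mcf`, `status`) and both rows are hypothetical and do not reflect the actual Pennsylvania files.

```python
import csv
import io

# Hypothetical sample; the real files have other columns and many more rows.
SAMPLE_CSV = """well_id,driller,latitude,longitude,gas_mcf,status
W-001,Driller A,41.23,-77.14,1500,Active
W-002,Driller B,41.27,-77.11,,Inactive
"""


def to_float(cell):
    """Convert a cell to float, mapping a blank cell to None."""
    cell = cell.strip()
    return float(cell) if cell else None


def load_wells(handle):
    """Read well records from an open CSV file or file-like object."""
    wells = []
    for row in csv.DictReader(handle):
        wells.append({
            "well_id": row["well_id"],
            "driller": row["driller"],
            "latitude": to_float(row["latitude"]),
            "longitude": to_float(row["longitude"]),
            "gas_mcf": to_float(row["gas_mcf"]),  # units: 1000 cubic feet
            "active": row["status"].strip().lower() == "active",
        })
    return wells


wells = load_wells(io.StringIO(SAMPLE_CSV))
print(wells[1]["gas_mcf"])
```

Keeping a blank cell as `None` rather than 0 matters here, since a missing production figure is not the same as zero production.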
Interesting Questions
• What are the production curves?
  o Are they uniform in regions or do they vary a lot?
• How long is there a good payout? (0, 12, 39-40, …, 120 months?)
• Are there some drillers whose wells are more likely to not be in production after some period of time?
• Where are clusters of wells?
• How do you visualize the data?
• How do you put the data into the right format in order to ask the right questions and get answers quickly?
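For the question of where wells cluster, a crude first pass is grid binning: hash each well's coordinates to a grid cell and look for crowded cells. This is a stand-in for a real clustering algorithm, not a replacement. The cell size (0.1 degree) and the three coordinate pairs are made up for illustration.

```python
import math
from collections import defaultdict

CELLS_PER_DEGREE = 10  # assumed grid resolution: cells of 0.1 degree


def cell_of(lat, lon):
    """Map a coordinate pair to the integer index of its grid cell."""
    return (math.floor(lat * CELLS_PER_DEGREE),
            math.floor(lon * CELLS_PER_DEGREE))


def bin_wells(coords):
    """Group coordinates by grid cell; the dict acts as the hash table."""
    cells = defaultdict(list)
    for lat, lon in coords:
        cells[cell_of(lat, lon)].append((lat, lon))
    return cells


# Hypothetical well locations (latitude, longitude).
coords = [(41.23, -77.14), (41.27, -77.11), (40.55, -79.92)]
cells = bin_wells(coords)

# Cells holding at least two wells are candidate clusters.
dense = {cell: pts for cell, pts in cells.items() if len(pts) >= 2}
print(dense)
```

Wells near a cell boundary can be split across neighboring cells, so a fuller analysis would also merge adjacent crowded cells or move to a proper clustering method.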
Data Files
• Approximately 574 MB of files.
• First things to do:
  o Determine how to use the data (Excel, MongoDB, Hadoop, Matlab, R, etc.).
  o Use the data to answer some simple, but interesting questions.
  o Visualize the results (Excel, Matlab, R, Tableau, etc.).
• Thereafter,
  o Determine how to answer general, complex questions.
  o Use a general database approach that uses all of your computer’s cores and GPUs.