
The Importance of Data Caroline Matthews Cloud Solution Architect


Post on 19-Jul-2020



The Importance of Data

Caroline Matthews

Cloud Solution Architect


80 billion connected “things” by 2025

180 zettabytes of digital data by 2025

One zettabyte is 10^21 bytes

That is 1,000,000,000,000,000,000,000 bytes

• a thousand exabytes
• a billion terabytes
• a trillion gigabytes
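The unit conversions above can be checked with a few lines of integer arithmetic. This is a minimal sketch assuming decimal (SI) units; the `UNITS` table and the `zettabytes_in` helper are illustrative names, not part of the original slides.

```python
# Decimal (SI) storage units expressed in bytes (assumption: powers of 10, not 2).
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def zettabytes_in(unit: str, zettabytes: int = 1) -> int:
    """Return how many of `unit` make up the given number of zettabytes."""
    return zettabytes * UNITS["zettabyte"] // UNITS[unit]


for unit in ("exabyte", "terabyte", "gigabyte"):
    print(f"1 zettabyte = {zettabytes_in(unit):,} {unit}s")

# The 2025 projection from the slide, expressed in terabytes.
print(f"180 zettabytes = {zettabytes_in('terabyte', 180):,} terabytes")
```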


So what can we do?


Historical Data

Data Understanding

Prepare Data

Train Model

Test the Model

Deploy the Model

Model is used on Current/Future Data

Do it all again!


Look for outliers/missing data
Descriptive Statistics | Visualization

Cleansing techniques
Remove | Substitute | Estimate

Feature Selection
What features are most important?

Feature Engineering
Be creative! What additional features can we calculate?
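The data preparation steps on this slide can be sketched with the Python standard library alone. Everything below is an assumption made for illustration: the sample `readings`, the 1.5 × IQR outlier rule, and the "mean of nearest good neighbours" estimate are common choices, not techniques named in the slides.

```python
import statistics

# Hypothetical sensor readings: None marks a missing value, 250.0 is a suspicious spike.
readings = [21.5, 22.0, None, 21.8, 250.0, 22.4, 21.9, None, 22.1]


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: the quartiles widened by k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


present = [v for v in readings if v is not None]
low, high = iqr_bounds(present)


def is_bad(value):
    """A value is bad when it is missing or falls outside the IQR fences."""
    return value is None or not (low <= value <= high)


# Remove: drop bad values entirely.
removed = [v for v in readings if not is_bad(v)]

# Substitute: replace bad values with the median of the good ones.
median = statistics.median(removed)
substituted = [median if is_bad(v) else v for v in readings]


def estimate_from_neighbours(values):
    """Estimate: replace each bad value with the mean of its nearest good neighbours.

    Assumes at least one good value exists in the list.
    """
    good = [i for i, v in enumerate(values) if not is_bad(v)]
    result = list(values)
    for i, v in enumerate(values):
        if is_bad(v):
            before = [values[j] for j in good if j < i][-1:]
            after = [values[j] for j in good if j > i][:1]
            neighbours = before + after
            result[i] = sum(neighbours) / len(neighbours)
    return result


estimated = estimate_from_neighbours(readings)

# Feature engineering: one extra calculated feature, the change since the previous reading.
deltas = [b - a for a, b in zip(estimated, estimated[1:])]

print("removed:    ", removed)
print("substituted:", substituted)
print("estimated:  ", estimated)
```

Which cleansing option fits depends on the data: removal shrinks the data set, while substitution and estimation keep every row at the cost of inventing values.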


Algorithm Categories
✓ Classify – predict yes/no
✓ Regression – estimate numerical values
✓ Clustering – create similar-looking groups of observations

Train with a subset of prepared data

Train many models – Experiment!
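Training on a subset and trying several models might look like the sketch below. It is a toy illustration under stated assumptions: the `rows` data set is made up, and the two model types (a majority-class baseline and a one-feature threshold "stump") are stand-ins chosen because they fit in a few lines, not algorithms recommended by the slides.

```python
import random

# Hypothetical data: features are (hours_studied, hours_slept), label is 1 for "pass".
rows = [((h, s), int(h >= 5)) for h in range(10) for s in (6, 8)]


def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority(train):
    """Baseline model: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = int(sum(labels) * 2 >= len(labels))
    return lambda x: majority


def train_stump(train, feature):
    """Threshold model: predict 1 when x[feature] >= the best training threshold."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x[feature] for x, _ in train}):
        correct = sum((x[feature] >= threshold) == bool(y) for x, y in train)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda x: int(x[feature] >= best_threshold)


def accuracy(model, data):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in data) / len(data)


train, test = train_test_split(rows)

# Experiment: train several candidate models on the same training subset.
models = {
    "majority": train_majority(train),
    "stump_hours": train_stump(train, feature=0),
    "stump_sleep": train_stump(train, feature=1),
}

for name, model in models.items():
    print(f"{name}: training accuracy {accuracy(model, train):.2f}")
```

The `test` rows are deliberately left untouched here so they can serve as the holdout set when the models are graded.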


Run model with holdout/test data set

Measure (Grade your models!)

Compare Models
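Grading and comparing models on a holdout set can be sketched as follows. The `actual` labels and the two prediction lists are invented for illustration, and accuracy, precision, and recall are assumed here as the grading metrics; the slides do not name specific measures.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fp, fn, tn


def grade(actual, predicted):
    """Compute accuracy, precision, and recall for one model's predictions."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical holdout labels and the predictions of two candidate models.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 1],
    "model_b": [1, 1, 1, 1, 1, 0, 1, 1],
}

grades = {name: grade(actual, predicted) for name, predicted in predictions.items()}
for name, scores in grades.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})

# Compare models on a single chosen metric (accuracy here).
best = max(grades, key=lambda name: grades[name]["accuracy"])
print("best by accuracy:", best)
```

The choice of metric matters: in this made-up example one model wins on accuracy while the other has perfect recall, so "best" depends on which kind of error is more costly.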


Gather model results and compare to actual outcomes

Monitor over Time

Feed data back into the process

And repeat!
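One way to monitor a deployed model is to log each prediction next to the outcome that later occurred and track accuracy per time period. The sketch below assumes made-up `records`, monthly periods, and a hypothetical 0.7 accuracy threshold for deciding when to retrain; none of these come from the slides.

```python
from collections import defaultdict

# Hypothetical log of (period, predicted, actual) gathered after deployment.
records = [
    ("month-1", 1, 1), ("month-1", 0, 0), ("month-1", 1, 1), ("month-1", 0, 0),
    ("month-2", 1, 1), ("month-2", 0, 1), ("month-2", 1, 1), ("month-2", 0, 0),
    ("month-3", 1, 0), ("month-3", 0, 1), ("month-3", 1, 1), ("month-3", 0, 0),
]


def accuracy_by_period(log):
    """Return {period: accuracy}, comparing model results to actual outcomes."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, predicted, actual in log:
        totals[period] += 1
        hits[period] += int(predicted == actual)
    return {period: hits[period] / totals[period] for period in sorted(totals)}


def periods_needing_retrain(accuracies, threshold=0.7):
    """List the periods whose accuracy fell below the (assumed) threshold."""
    return [period for period, acc in accuracies.items() if acc < threshold]


accuracies = accuracy_by_period(records)
for period, acc in accuracies.items():
    print(f"{period}: accuracy {acc:.2f}")

# Periods flagged here are candidates for feeding fresh data back into training.
print("retrain after:", periods_needing_retrain(accuracies))
```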


We are going to need some data!


Data Repos
✓ Kaggle: https://www.kaggle.com/datasets
✓ UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html
✓ UN: http://www.un.org/en/databases/index.html
✓ World Health Organization: http://apps.who.int/gho/data/node.resources
✓ CDC: https://wonder.cdc.gov/Welcome.html
✓ Federal Highway Administration: https://nhts.ornl.gov/
✓ Datahub Collections: https://datahub.io/collections
✓ Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets


Image / NLP Repos
✓ MS Coco: http://cocodataset.org/#home
✓ ImageNet: http://www.image-net.org/
✓ Open Images: https://storage.googleapis.com/openimages/web/index.html
✓ Twenty Newsgroups (UCI): https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
✓ Wikipedia Corpus: https://nlp.cs.nyu.edu/wikipedia-data/
✓ Spoken Digit: https://github.com/Jakobovski/free-spoken-digit-dataset
✓ Sentiment Analysis: http://help.sentiment140.com/for-students/