© 2017 The MathWorks, Inc.
Cleaning Up and Managing Dirty Data in MATLAB
Siddharth Sundar, Application Engineer
Agenda
▪ Typical Challenges with Data Handling and Management
▪ A Fundamental Valuation Example
▪ A Text Analytics Example
▪ What about Cleaning Large Datasets?
▪ Summary and Resources
Typical Challenges in Data Cleaning, Management
▪ We are Drowning in Data
– Data Volume and Variety
– Different sources, types, sizes
– Garbage-in garbage-out
So many Data Sources
Spark+Hadoop
Local disk
Shared folders
Databases
Flat files/Excel
Datafeeds
Webpages
So many kinds of Data
Poor Data Quality
Typical Challenges in Data Cleaning, Management
▪ Drowning in Data
– Data Volume and Variety
– Different sources, types, sizes
– Garbage-in garbage-out
▪ Poor Data Quality
– Poorly formatted files
– Irregularly sampled data
– Redundant, Missing data, Outliers
▪ Need for more customized analytics
– No one size fits all
Demo: Fundamental Valuation of S&P100 securities
Goal:
▪ Fundamental valuation for ranking stocks
based on historical EPS trends
Approach
▪ Access data from CSV files
▪ Preprocess to clean-up text (missing data and
outliers)
▪ Calculate strength of historical EPS trends
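The approach above can be sketched in a few lines. This is a minimal illustration, not the demo's code: the table layout (`Ticker`, `Year`, `EPS`), the sample values, and the use of a least-squares slope as the "strength" of the EPS trend are all assumptions, since the slides do not define the measure. In practice the table would come from `readtable` on the CSV files.

```matlab
% Hypothetical EPS history for two tickers (illustrative values only).
% In practice: eps_tbl = readtable('eps_history.csv');  % hypothetical file name
Ticker = categorical({'AAA';'AAA';'AAA';'AAA';'AAA';'BBB';'BBB';'BBB';'BBB';'BBB'});
Year   = [2012;2013;2014;2015;2016;2012;2013;2014;2015;2016];
EPS    = [1.0; NaN; 1.4; 1.6; 1.8; 2.0; 1.9; 50; 1.7; 1.6];
eps_tbl = table(Ticker, Year, EPS);

tickers = categories(eps_tbl.Ticker);
slope   = zeros(numel(tickers), 1);
for k = 1:numel(tickers)
    rows = eps_tbl.Ticker == tickers{k};
    x = eps_tbl.Year(rows);
    y = eps_tbl.EPS(rows);
    y = fillmissing(y, 'linear');    % interpolate missing EPS values
    y = filloutliers(y, 'linear');   % replace outliers (default median rule)
    p = polyfit(x - mean(x), y, 1);  % least-squares line; p(1) is the slope
    slope(k) = p(1);                 % assumed measure of trend strength
end

% Rank tickers from strongest upward trend to weakest.
ranked = sortrows(table(tickers, slope), 'slope', 'descend');
disp(ranked)
```

Cleaning is done per ticker so that one company's values never fill another's gaps.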
How do we handle Missing Data?
▪ Does missing data have meaning?
– Yes: depends on the type of data
– Numerical: convert missing values to a meaningful number
– Categorical: missing values become their own category
▪ If not:
– Dataset is big and little data is missing at random: remove instances with missing data
– Large, temporally ordered data: replace value with value of preceding instance
– Otherwise: does the data follow a simple distribution?
– Yes: impute missing values with column mean or column median
– No: impute with simple ML model
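Several of the options in this decision chart map onto built-in functions. The sketch below is illustrative only: the sample values and the category name `Missing` are assumptions, and the ML-based imputation branch is not shown.

```matlab
% Illustrative numeric column with gaps (values are made up).
x = [10; NaN; 12; 13; NaN; 15];

xPrev   = fillmissing(x, 'previous');                        % value of preceding instance
xMedian = fillmissing(x, 'constant', median(x, 'omitnan'));  % column median
xMean   = fillmissing(x, 'constant', mean(x, 'omitnan'));    % column mean
xZero   = fillmissing(x, 'constant', 0);                     % a "meaningful number" (0 is an assumption)
xDrop   = rmmissing(x);                                      % remove instances with missing data

% Categorical data: missing values become their own category.
c = categorical({'buy'; ''; 'sell'; 'buy'});   % '' becomes <undefined>
c = addcats(c, 'Missing');                     % hypothetical category name
c(isundefined(c)) = 'Missing';
```

Which branch applies depends on the data, so it helps to count gaps with `ismissing` before choosing a method.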
Summary: Fundamental Valuation Example
▪ Interactive tools to import, visualize
data
▪ Code generation from interactive
tools
▪ Built-in clean up functions
▪ Align and calculate group stats
▪ Save time
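As a rough illustration of "align and calculate group stats", the sketch below aligns two series with `synchronize` and computes a per-group mean with `findgroups` and `splitapply`. The dates, values, variable names, and sector labels are invented for the example; the demo may have used different functions.

```matlab
% Two hypothetical quarterly series; B is missing the September observation.
t1 = datetime(2016, [3;6;9;12], [31;30;30;31]);
t2 = datetime(2016, [3;6;12],   [31;30;31]);
A = timetable(t1, [1.0;1.1;1.2;1.3], 'VariableNames', {'EPS'});
B = timetable(t2, [20;22;26],        'VariableNames', {'Price'});

% Align on the union of row times; gaps take the previous available value.
TT = synchronize(A, B, 'union', 'previous');

% Group statistics: mean EPS per (hypothetical) sector.
sector  = categorical({'Tech'; 'Tech'; 'Energy'; 'Energy'});
epsVals = [2; 4; 1; 3];
[G, sectorNames] = findgroups(sector);
meanEPS = splitapply(@mean, epsVals, G);
disp(table(sectorNames, meanEPS))
```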
Demo: Sentiment Analysis of SEC filings
Goal:
▪ Analyze the sentiment of SEC filings
for S&P 100 companies to use as a
stock picking/ranking indicator
Approach
▪ Access data directly from HTML/PDF
▪ Preprocess to clean-up text and deal
with domain-specific terms
▪ Predict sentiment
Summary: Sentiment Analysis Example
▪ String Class in MATLAB
▪ Easy to use/read functions to do
text processing
▪ Text visualization functions
▪ Less regexp, more built-in
commands
▪ More processing, less time
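To give a flavor of "less regexp, more built-in commands", here is a small sketch that cleans one sentence with string functions only. The sample sentence is invented, and double-quoted string literals assume R2017a or later.

```matlab
% Hypothetical snippet of a filing, with HTML remnants and uneven spacing.
raw = "<p>The Company's  revenue INCREASED 12% in fiscal 2016.</p>";

txt = erase(raw, ["<p>", "</p>"]);   % drop the paragraph tags
txt = lower(txt);                    % normalize case
txt = replace(txt, "  ", " ");       % collapse the double space
txt = strip(txt);                    % trim leading/trailing whitespace

words      = split(txt);             % tokenize on whitespace
hasRevenue = contains(txt, "revenue");
```

Each step reads as a plain verb (`erase`, `replace`, `strip`, `split`), which is the readability point the slide makes.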
Demo: Technicals calculation to time the market
▪ Objective
– Calculate technical indicators on Big
intraday data
▪ Data
– Intraday tick data scraped from the web
– Missing data, outliers etc.
▪ Approach
– Preprocess data
– Explore data
– Calculate technicals
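The slides do not say which technical indicators the demo computes. As one possible example, the sketch below calculates a trailing simple moving average and Bollinger-style bands on an already cleaned price vector; the prices, the 3-sample window, and the 2-standard-deviation band width are all assumptions.

```matlab
% Hypothetical cleaned intraday prices (illustrative values only).
price = [10; 10.2; 10.2; 10.1; 10.1; 10.3; 10.4; 10.5];

win = [2 0];                      % current sample plus the 2 before it
sma = movmean(price, win);        % trailing simple moving average
sd  = movstd(price, win);         % trailing standard deviation

upperBand = sma + 2*sd;           % Bollinger-style bands (2 std is an assumption)
lowerBand = sma - 2*sd;
```

At the start of the series the window shrinks to the samples available, so the first values are based on fewer than three points.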
How do you work with tall arrays in MATLAB?
▪ datastore
– Points to the data
▪ Tall array
– Variable representation of
the data in your workspace
▪ Functions
– Operate on tall arrays
>> fileLoc = '.\datasets\*.csv';
>> ds = datastore(fileLoc);
>> tt = tall(ds);
>> tt = fillmissing(tt,'nearest');
tall
Spark + Hadoop
Local disk
Shared folders
Databases
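The datastore, tall array, and function steps can be tried end to end on a small file. The sketch below writes its own CSV to a temporary folder so that it is self-contained; the file name, column names, and values are made up. Operations on tall arrays are deferred until `gather` is called.

```matlab
% Write a small illustrative CSV so the sketch runs on its own.
folder = tempname;
mkdir(folder);
T = table((1:6)', [10; NaN; 12; 13; NaN; 15], 'VariableNames', {'Tick', 'Price'});
writetable(T, fullfile(folder, 'ticks.csv'));   % hypothetical file name

ds = datastore(fullfile(folder, '*.csv'));      % points to the data
tt = tall(ds);                                  % tall table backed by the datastore

tt.Price = fillmissing(tt.Price, 'previous');   % same call as for in-memory data
avgPrice = mean(tt.Price);                      % not evaluated yet
avgPrice = gather(avgPrice);                    % triggers evaluation, returns a double
```

The same code should work unchanged when the datastore points at a larger collection of files, which is the main appeal of the tall workflow.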
Summary: Technicals Demo
▪ Big Data handled just like data that
fits in memory (Tall)
▪ No need to use MapReduce or
other Big Data
technologies/frameworks
▪ Easy Big Data visualization
▪ Scalability of MATLAB models
Revisiting the Challenges with Handling Dirty Data
▪ Drowning in Data
– Data Volume and Variety
– Different sources, types, sizes
– Garbage-in garbage-out
▪ Poor Data Quality
– Poorly formatted files
– Irregularly sampled data
– Redundant, Missing data, Outliers
▪ No one size fits all solution for Data cleaning
Get Training
Accelerate your learning curve:
- Customized curriculum
- Learn best practices
- Practice on real-world examples
Options to fit your needs:
- Self-paced (online)
- Instructor led (online and in-person)
- Customized curriculum (on-site)
CPE Approved
Provider
Training Roadmap
MATLAB for Financial Applications
Programming Techniques
Interactive User Interfaces
Parallel Computing
Time-Series Modeling (Econometrics)
Statistical Methods
Optimization Techniques
Data Analysis and Modeling
Application Development
Risk Management
Machine Learning
Asset Allocation
Interfacing with Databases
Interfacing with Excel
Content for On-site Customization
Want to Learn More?
Engineers
Marshall Alphonso
Senior Finance Engineer, New York
Siddharth Sundar
Finance Application Engineer
Account Managers
Mike DeLucia
Senior Account Manager
Chuck Castricone
Account Manager