© 2014 The MathWorks, Inc.

Data Analytics with MATLAB

Tackling the Challenges of Big Data

Adrienne James, PhD

MathWorks

7th October 2014

2

Big Data in Industry

ENERGY: Asset Optimization

FINANCE: Market Risk, Regulatory

AUTO: Fleet Data Analysis

AERO: Maintenance, Reliability

MEDICAL DEVICES: Patient Outcomes

3

PROCESSING OPTIONS

• MATLAB RESTful interface to Cluster

• MATLAB Hadoop Streaming

• NoSQL connector (e.g. mongo)

• MATLAB / Java App accessing Cluster

• MATLAB Map-Reduce Components

4

Key takeaways

New functions for analysing data that does not fit in memory on your desktop

– datastore

– mapreduce

& that can scale for use with Hadoop

Additional techniques for predictive modelling with large data

– Work with large data in memory on a cluster (spmd)

Deploy predictive models

– Bring MATLAB analytics to the Web

– Share analytics with a wider community of users

5

How big is big? What characterises “big” data?

Wikipedia: “Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.”

Volume : amount of data

Velocity : speed at which data is generated or needs to be analysed

Variety : range of data types/data sources

6

Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is your data?

3. What hardware do you have access to?

4. Analysis Characteristics?

7

Example: Airline Delay Analysis

Data

– BTS/RITA Airline On-Time Statistics

– 123.5M records, 29 fields

Analysis Tasks

– Calculate delay patterns

– Visualize summaries

– Estimate & evaluate predictive models

8

Considerations: Large Data Analytics

Airline Data Characteristics

1. Size & type of data?

CSV data, 22 files, 12 GB

9

Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is my data?

• Small subset available locally

• Entire data set stored elsewhere

10

Big Data Analysis with MATLAB – start on the desktop

Access, Explore, Prototype, Scale, Share/Deploy

Work on your desktop

Start “simple”

Basic statistics

Explore data

11

Demo: Exploring departure delays using datastore

Explore approaches pre- & post-datastore

Start with a small subset …

What happens as the data size grows?

… until eventually it does not fit in memory on your desktop machine

12

Access & explore bigger data on the desktop more easily

Easily specify data set

– Single text file (or collection of text files)

– Database (using Database Toolbox)

Preview data structure and format

Customise data to import using column names

Incrementally read subsets of the data

airdata = datastore('*.csv');

airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};

data = read(airdata);

datastore
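The incremental read described on this slide can be sketched as a loop over chunks. This is a minimal illustration, not the code used in the demo: it assumes the sample file airlinesmall.csv that ships with MATLAB stands in for the full airline data set, and the variable names total, n and meanDelay are made up for the example.

```matlab
% Sketch: running mean of ArrDelay, computed one chunk at a time so the
% whole file never has to fit in memory.
% Assumption: airlinesmall.csv (shipped with MATLAB) stands in for the
% 22-file airline data set described in the slides.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'Distance', 'ArrDelay'};

preview(ds)                    % peek at the first few rows without a full read

total = 0;                     % running sum of valid delays
n = 0;                         % running count of valid delays
reset(ds);                     % make sure we start from the beginning
while hasdata(ds)
    chunk = read(ds);          % next block of rows, returned as a table
    valid = ~isnan(chunk.ArrDelay);
    total = total + sum(chunk.ArrDelay(valid));
    n = n + nnz(valid);
end
meanDelay = total / n;
fprintf('Mean arrival delay: %.2f minutes over %d flights\n', meanDelay, n);
```

Only the running sum and count are kept between iterations, so memory use is bounded by the chunk size rather than by the size of the data set.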

13

datastore extends Data Access Landscape

[Diagram: data access functions grouped by data type and arranged from SMALL to increasing data size; “pre-” and “post-” mark the landscape before and after datastore; other labels: API]

Text files: Import Tool, readtable, textscan (+programming), datastore

Databases: database, database.ODBCConnection

.MAT files: load, matfile

Binary files: fread, …, memmapfile

Images: imread, …, ImageAdapter

Streaming data: System Objects

14

Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is your data?

• Small subset available locally

• Entire data set stored elsewhere

3. What hardware do you have access to?

4. Analysis characteristics? Initially, simple statistics & data exploration

15

Big Data Analysis with MATLAB

Access, Explore, Prototype, Scale, Share/Deploy

Scale to a cluster

Start locally and then …..

16

What is Hadoop?

A Big Data Platform

[Diagram: a datastore pointing at HDFS; the data is spread across several nodes, and Map and Reduce tasks run on each node]

17

A bit of audience participation – mapreduce ….

18

Introducing the mapreduce programming framework

Example: National popularity contest

Input files → Intermediate files (local disk) → Output files

Input: newspaper pages

For each page, how many times do “Steve”, “Emily” and “David” get mentioned?

Total mentions: Steve 11%, Emily 58%, David 31%

19

mapreduce concept – group counts

Input files → Map → Intermediate files (local disk) → Reduce → Output files
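The group-count pattern can be sketched with the mapreduce function. This is an illustration under assumptions rather than the demo code: it counts flights per carrier in airlinesmall.csv (the sample file shipped with MATLAB) instead of names in newspaper pages, and countMapper and countReducer are hypothetical helper names. Local functions at the end of a script need R2016b or later; on older releases they would go in separate files.

```matlab
% Sketch: group counts with mapreduce, reported as percentage shares.
% Assumption: carrier codes in airlinesmall.csv play the role of the
% names being counted in the newspaper example.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'UniqueCarrier'};

result = mapreduce(ds, @countMapper, @countReducer);
tbl = readall(result);                    % table with variables Key and Value

counts = cell2mat(tbl.Value);
tbl.Share = 100 * counts / sum(counts);   % share of all flights, in percent
disp(tbl)

function countMapper(data, ~, intermKVStore)
    % Map: count each carrier within one chunk of rows.
    [carriers, ~, idx] = unique(data.UniqueCarrier);
    chunkCounts = accumarray(idx, 1);
    addmulti(intermKVStore, carriers, num2cell(chunkCounts));
end

function countReducer(key, intermValIter, outKVStore)
    % Reduce: add up the per-chunk counts for one carrier.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end
```

The mapper sees only one chunk at a time and the reducer sees only the values for one key, which is what lets the same two functions run unchanged on a larger data set.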

20

Demo: Exploring mapreduce

21

Explore and Analyze Data on Hadoop

[Diagram: MATLAB MapReduce code and MATLAB Distributed Computing Server working with a Hadoop cluster; each node holds data in HDFS and runs Map and Reduce tasks; a datastore points at the HDFS data]

ds = datastore('hdfs://myserver:7867/data/file1.txt');

22

Considerations: Large Data Analytics

Data Characteristics

1. Size & type of data?

2. Where is your data?

3. What hardware do you have access to? Cluster

4. Analysis characteristics? Explore predictive modelling

23

Big Data Analysis with MATLAB

Access, Explore, Prototype, Scale, Share/Deploy

Scale to a cluster

Options for more involved algorithms …

• may require all data in memory

• multiple iterations …

24

Data Analytics Landscape

[Diagram: techniques placed by data size (SMALL to increasing data size) and algorithm complexity (SIMPLE to COMPLEX); label: “More programming effort required”]

Problem types: easily partitioned, independent tasks; iterative; all data needed in memory at once

Techniques shown: built-in numerical & statistical algorithms, vectorisation, parfor, gpuArray, spmd, distributed arrays, mapreduce

25

Working with more “complex” algorithms with data in memory on a cluster

[Diagram: a Client and an MDCS cluster; labels: Instructions, Reduced Data; data labelled 1987, 1988, 1989, 1990, 1991, 1992]

26

Demo: Predictive Modelling

Logistic Regression & Neural Networks

10 busiest airport origins & 7 largest airline carriers

Explore & compare prediction quality of two models to predict flights delayed for more than 20 minutes

– Randomly partition data into test and training sets (cvpartition)

– Model #1: Logistic Regression

– Model #2: Neural Network

Predictor Variables: DayOfWeek,Origin,Airline,DepTime,Distance
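The partition-then-fit steps listed above can be sketched for the logistic regression model as follows. This is a simplified illustration, not the demo itself: it assumes airlinesmall.csv (shipped with MATLAB) and the Statistics and Machine Learning Toolbox, uses departure delay as the delay measure, leaves out the Origin and Airline predictors and the airport/carrier filtering for brevity, and the 30% hold-out fraction and 0.5 threshold are arbitrary choices.

```matlab
% Sketch: hold-out partition plus logistic regression for "delayed > 20 min".
% Assumptions: airlinesmall.csv as data, DepDelay as the delay measure,
% only three of the five predictors from the slide.
rng(0);                                        % reproducible partition
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'DayOfWeek', 'DepTime', 'Distance', 'DepDelay'};
T = readall(ds);
T = T(~any(ismissing(T), 2), :);               % drop rows with missing values

T.DayOfWeek = categorical(T.DayOfWeek);        % treat weekday as a category
T.Late = double(T.DepDelay > 20);              % response: 1 if delayed > 20 min

cv = cvpartition(height(T), 'HoldOut', 0.3);   % 70% training, 30% test
trainT = T(training(cv), :);
testT  = T(test(cv), :);

% Model #1: logistic regression (binomial GLM with the default logit link)
mdl = fitglm(trainT, 'Late ~ DayOfWeek + DepTime + Distance', ...
    'Distribution', 'binomial');

p = predict(mdl, testT);                       % predicted probability of delay
acc = mean((p > 0.5) == testT.Late);           % fraction classified correctly
fprintf('Hold-out accuracy: %.3f\n', acc);
```

Because most flights are not delayed by more than 20 minutes, raw accuracy can look good even for a weak model, so a real comparison between the two models would also look at measures such as the confusion matrix or ROC curve.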

27

Single Program, Multiple Data

Lab 1

>> mycode

Lab 2

>> mycode

Lab 3

>> mycode

Lab 4

>> mycode

28

Single Program, Multiple Data

Client:

spmd

a = rand;

end

Parallel Pool (Cluster):

Lab 1: a = rand;

Lab 2: a = rand;

Lab 3: a = rand;

Lab 4: a = rand;
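A runnable version of this spmd block might look like the sketch below. It assumes Parallel Computing Toolbox is installed and that a default local pool can be started; the names pool and vals are illustrative only.

```matlab
% Sketch: the same statement runs on every worker ("lab") in the pool,
% and each worker ends up with its own value of a.
pool = gcp;                  % current pool; starts one with default settings if none is open

spmd
    a = rand;                % executed once on each worker
end

% On the client, a is a Composite: a{k} is the value held by worker k.
vals = zeros(1, pool.NumWorkers);
for k = 1:pool.NumWorkers
    vals(k) = a{k};
end
disp(vals)
```

The values stay on the workers until they are indexed from the client, which is the property that makes spmd suitable for data that is too large to bring back in one piece.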

29

Explore Big Data

Access, Explore, Prototype, Scale, Share/Deploy

Subset data by filtering or variable selection and gain insight with visualization

30

Highlights: Airline Delay Analysis

Start small

Scale up

Quick prototyping on large data

Interactive exploration

Interspersed visualizations

Predictive modelling with large data

31

Deploy

Access, Explore, Prototype, Scale, Share/Deploy

Hadoop

Enterprise

Web

Desktop

32

Web Analytics: Analysis of traffic around Paris

http://rumeur.bruitparif.fr/

33

Predictive Data Analytics – Load Demand Forecasting

34

Demo

Station:

35

MATLAB on Hadoop

Two modes of operation

Execute mapreduce on Hadoop from your MATLAB desktop using MATLAB Distributed Computing Server

– Extends your desktop environment for use with Hadoop

– Execute algorithms within Hadoop MapReduce on data stored in HDFS

Create standalone applications or libraries for deploying to production instances of Hadoop

– Locked down package for use in production environments

– Integration of MATLAB analytics with operational systems

36

Key takeaways

New functions for analysing data that does not fit in memory on your desktop

– datastore

– mapreduce

& that can scale for use with Hadoop

Additional techniques for predictive modelling with large data

– Work with large data in memory on a cluster (spmd)

Deploy predictive models

– Bring MATLAB analytics to the Web

– Share analytics with a wider community of users

37

New Big Data Capabilities in MATLAB

Memory and Data Access

64-bit processors

Memory Mapped Variables

Disk Variables

Databases

Datastores

Platforms

Desktop (Multicore, GPU)

Clusters

Cloud Computing (MDCS on EC2)

Hadoop

Programming Constructs

Streaming

Block Processing

Parallel-for loops

GPU Arrays

SPMD and Distributed Arrays

MapReduce

38

Additional Resources

MathWorks Web Site

Big Data with MATLAB: http://www.mathworks.com/discovery/big-data-matlab.html

MapReduce & Hadoop: http://www.mathworks.com/discovery/matlab-mapreduce-hadoop.html

Machine Learning with MATLAB: http://www.mathworks.com/machine-learning/index.html

A selection of user stories

LiquidNet: Lean Data Analysis: The Awesome Data Dexterity of MATLAB Desktop

Ruukki Metals: Steel Manufacturing Process Analytics

CEESAR: Data Processing Framework Supporting Large Scale Driving Data Analysis

Daimler AG: Analyzing Test Data from a Worldwide Fleet of Fuel Cell Vehicles

39

Thank You