Analyzing Fleet Data with MATLAB and Spark

Post on 27-Jul-2020

TRANSCRIPT

Page 1: Analyzing Fleet Data with - Matlab · Analyzing Fleet Data with MATLAB and Spark Rory Adams Matthew Elliot ... 1300 trip log files ... Apache Spark is an open source cluster computing

1

Page 2:

© 2015 The MathWorks, Inc.

Analyzing Fleet Data with

MATLAB and Spark

Rory Adams

Matthew Elliot

Paul Peeling

Page 3:

Analyzing Fleet Data with MATLAB and Spark

▪ The MathWorks Fleet

▪ Organizing Big Data

▪ Event Detection and Studies

▪ Optimizing and Deploying Calculations

Page 4:

The MathWorks Fleet

▪ 1300 trip log files

▪ 21 unique vehicles

▪ Approx. 39 unique channels

▪ Data collected over 1.5 years

Page 5:

MathWorks Automotive Fleet – Data Collection

[Diagram: data collection path. Labels shown: OBDII, Bluetooth, Phone, 4G LTE, MATLAB Production Server (enrich data, file creation), Data Warehouse, Server, Engineers]

Page 6:

Automotive Vehicle Fleet - Intrinsic Hierarchy

Vehicles

Signals

Messages

Trips (files)

Time – Value pairs

Page 7:

Questions

▪ How do different factors affect how a particular driver drives?

▪ What is the real-world vehicle performance in areas such as fuel economy, emissions, vehicle dynamics, ride and handling, prognostics, and durability?

▪ How do you work with terabytes of data to distill out critical information?

▪ Once you have the critical information, how do you iterate back through your terabytes of data to extract relevant (time) slices for further study or analysis?

Page 8:

Challenges

▪ Traditional tools and approaches won’t work

– Accessing the data is hard; processing it is even harder

– Need to learn new tools and new coding styles

– Have to rewrite algorithms, often at a lower level of abstraction

▪ Quality of your results can be impacted

– e.g., by being forced to work on a subset of your data

– Learning new tools and rewriting algorithms can hurt productivity

▪ Time required to conduct analysis

– Need to leverage parallel computing on desktop and cluster

Page 9:

Organizing Big Data

1. Represent your data as a datastore object

2. Efficiency still matters

3. Compressed file formats reduce data transfer time

4. Avoid redundant data

5. Prefer stacked tables

6. write checkpoints to save intermediate results
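A small illustration of item 5 may help. The slides refer to MATLAB tables; the plain-Python sketch below does not use any MATLAB API and only shows the idea. The channel names, the values, and the `stack_rows` helper are hypothetical. A wide table needs one column per channel, so rows carry placeholders for channels that were not logged at that time; a stacked table keeps one (trip, time, channel, value) row per actual measurement.

```python
# Hypothetical wide-format records: one column per channel, with None where
# a channel was not sampled at that timestamp.
wide = [
    {"trip": "t1", "time": 0.0, "speed": 12.0, "rpm": 1500.0, "fuel": None},
    {"trip": "t1", "time": 1.0, "speed": 13.5, "rpm": None, "fuel": 0.8},
]


def stack_rows(rows, id_cols=("trip", "time")):
    """Turn wide rows into stacked (ids..., channel, value) rows, skipping gaps."""
    stacked = []
    for row in rows:
        ids = tuple(row[c] for c in id_cols)
        for channel, value in row.items():
            if channel in id_cols or value is None:
                continue  # identifier column or missing sample: nothing to store
            stacked.append(ids + (channel, value))
    return stacked


stacked = stack_rows(wide)
for record in stacked:
    print(record)
```

Adding a new channel to the stacked form only adds rows; the set of columns stays the same, which is one reason a stacked layout tends to suit data with many sparsely logged channels.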

Page 10:

Common Big Data File Formats

▪ Characteristics of Big Data files and compression techniques

– “Splittable”: Can files be split if larger than HDFS block size

– Storage and I/O Efficiency: How efficiently can data be stored, read and/or written

Type     | “Splittable” | Storage format              | Notes
---------|--------------|-----------------------------|------------------------------------------------------------
CSV/text | ✓            | Delimited ASCII text        | Human readable, but not storage efficient
JSON     |              | ASCII text, attribute-pairs | Human readable, but not storage efficient
AVRO     | ✓            | Row based                   | Serialization technology, good storage efficiency
Parquet  | ✓            | Column oriented             | Very efficient I/O when accessing columns
ORC/RC   | ✓            | Stored collections of rows  | Highly efficient for parallel processing collections of rows
bzip2    | ✓            | Compression                 | Very good storage efficiency
LZO      |              | Compression                 | Good I/O performance for frequently accessed data
Snappy   |              | Compression                 | Good I/O performance for frequently accessed data
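The difference between row-based and column-oriented storage can be sketched in a few lines. This is a plain-Python illustration with made-up data, not a reader for any of the formats listed; the `fields_read_*` counters are hypothetical and only stand in for I/O cost.

```python
# The same three hypothetical records stored two ways.
rows = [("t1", 0.0, 12.0), ("t1", 1.0, 13.5), ("t2", 0.0, 9.0)]  # row based
columns = {                                                      # column oriented
    "trip": ["t1", "t1", "t2"],
    "time": [0.0, 1.0, 0.0],
    "speed": [12.0, 13.5, 9.0],
}

# Row based: every record is read in full even though only one field is used.
fields_read_rows = sum(len(r) for r in rows)
mean_speed_rows = sum(r[2] for r in rows) / len(rows)

# Column oriented: only the requested column is read.
speed = columns["speed"]
fields_read_columns = len(speed)
mean_speed_columns = sum(speed) / len(speed)

print(mean_speed_rows, mean_speed_columns)    # same result either way
print(fields_read_rows, fields_read_columns)  # different amounts of data touched
```

Both layouts give the same mean; the column layout simply touches fewer values when only one channel is needed.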

Page 11:

Event Detection and Studies

1. Prototype with tall datatypes

2. gather as late as possible

3. Understand why multiple passes may be necessary

4. Perform joins with in-memory data to supply metadata

5. Use groupsummary to perform parallel computations over subsets
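Items 2, 4 and 5 can be sketched together. The slides mean MATLAB's tall arrays, gather, and groupsummary; the Python below uses none of those and is only an assumption-based illustration of the same pattern: build a deferred pipeline, join each record against a small in-memory lookup, and pull the data through once at the end while computing a per-group summary. The record layout, `vehicle_model`, and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical stacked records: (vehicle_id, channel, value).
records = [
    ("v1", "speed", 10.0),
    ("v1", "speed", 20.0),
    ("v2", "speed", 30.0),
    ("v2", "rpm", 2000.0),
]

# Small in-memory metadata table, joined onto each record by vehicle_id.
vehicle_model = {"v1": "sedan", "v2": "truck"}


def speeds_with_model(rows):
    """Deferred pipeline: no record is touched until the result is iterated."""
    for vehicle_id, channel, value in rows:
        if channel == "speed":
            yield vehicle_model[vehicle_id], value


def group_mean(pairs):
    """One pass over the data, keeping only a running sum and count per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, value in pairs:
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


pipeline = speeds_with_model(records)  # deferred: no work has happened yet
summary = group_mean(pipeline)         # the single point where data is pulled through
print(summary)
```

Only the small per-group result is ever held in memory, which is the reason for collecting results as late as possible.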

Page 12:

What is Hadoop?

Framework for distributed processing of large data sets across clusters of computers

Hadoop MapReduce use decreasing… …but Hadoop is alive and well

HDFS:

• Scalable and economical file system

YARN (Yet Another Resource Negotiator):

• Efficient resource manager

• Minimizes data movement

Ecosystem of Applications

Page 13:

What is Spark?

Spark Core (Batch Processing)

• Parallel execution engine

• Generalized execution model

• In-memory computing – speed

• Out-of-memory computing – size

• Java, Scala, Python, and R APIs

Apache Spark is an open source cluster computing framework originally developed in the AMPLab at University of California, Berkeley

Efficient for iterative algorithms – Machine Learning

Page 14:

Spark Execution Model

[Diagram: an Edge Node hosts the Client Application and Client Libraries. The Master hosts the Name Node and YARN (Resource Manager). Four Worker Nodes each run an Executor with a Cache and Tasks, alongside an HDFS Data Node.]

▪ Jobs submitted via client libraries

▪ YARN assigns/manages resources

▪ Executor accesses local data

▪ Caching minimizes disk writes/reads

▪ Can “spill” to disk if data too big

Page 15:

Optimizing and Deploying Calculations

1. Use parpool to distribute computations on a multi-core machine

2. Use Spark on Linux to distribute computations on a cluster

3. Distribute data on HDFS, or S3 etc. (use datastore.AlternateFileSystemRoots)

4. For ad-hoc and interactive use, use tall against YARN

5. For detailed control and automated use, use the MATLAB API for Spark
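The common idea behind items 1 and 2 is that each trip can be processed independently and the small per-trip results combined afterwards. The sketch below is a Python stand-in, not parpool or Spark: it uses a thread pool only to stay self-contained, and the trip names, samples, and `max_speed` function are hypothetical. A real CPU-bound workload would use separate worker processes or cluster executors instead of threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-trip speed samples, keyed by trip file name.
trips = {
    "trip_001": [10.0, 20.0, 30.0],
    "trip_002": [5.0, 15.0],
    "trip_003": [40.0],
}


def max_speed(item):
    """Per-trip computation; it needs nothing from any other trip."""
    name, samples = item
    return name, max(samples)


# Hand the independent per-trip tasks to a pool of workers, then combine
# the small results on the client side.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(max_speed, trips.items()))

print(results)
```

Because no task depends on another, the same function could be handed to a larger pool without changing the per-trip logic.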