TRANSCRIPT
© 2015 The MathWorks, Inc.
Analyzing Fleet Data with
MATLAB and Spark
Rory Adams
Matthew Elliot
Paul Peeling
Analyzing Fleet Data with MATLAB and Spark
▪ The MathWorks Fleet
▪ Organizing Big Data
▪ Event Detection and Studies
▪ Optimizing and Deploying Calculations
The MathWorks Fleet
▪ 1300 trip log files
▪ 21 unique vehicles
▪ Approx. 39 unique channels
▪ Data collected over 1.5 years
MathWorks Automotive Fleet – Data Collection
[Diagram: a phone connects to the vehicle's OBD-II port via Bluetooth; trip data travels over 4G LTE to a MATLAB Production Server (enrich data, file creation) and into a data warehouse, which engineers access from MATLAB.]
Automotive Vehicle Fleet – Intrinsic Hierarchy
Vehicles → Trips (files) → Messages → Signals → Time–value pairs
Questions
▪ How do different factors affect how a particular driver drives?
▪ What is real-world vehicle performance in areas such as fuel economy, emissions, vehicle dynamics, ride and handling, prognostics, and durability?
▪ How do you work with terabytes of data to distill out critical information?
▪ Once you do have the critical information, how do you iterate back through your terabytes of data to extract relevant (time) slices for further study or analysis?
Challenges
▪ Traditional tools and approaches won’t work
– Accessing the data is hard; processing it is even harder
– Need to learn new tools and new coding styles
– Have to rewrite algorithms, often at a lower level of abstraction
▪ Quality of your results can be impacted
– e.g., by being forced to work on a subset of your data
– Learning new tools and rewriting algorithms can hurt productivity
▪ Time required to conduct analysis
– Need to leverage parallel computing on desktop and cluster
Organizing Big Data
1. Represent your data as a datastore object
2. Efficiency still matters
3. Compressed file formats reduce data transfer time
4. Avoid redundant data
5. Prefer stacked tables
6. Write checkpoints to save intermediate results
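Step 1 above can be sketched in MATLAB as follows; this is a minimal illustration, and the file location and channel names are hypothetical, not from the talk:

```matlab
% Represent a whole folder of trip logs as one datastore object.
% Selecting only the needed channels keeps I/O efficient (step 2)
% and avoids hauling redundant data around (step 4).
ds = datastore("trips/*.csv", ...
    "SelectedVariableNames", ["Time", "VehicleSpeed", "EngineRPM"]);
preview(ds)   % inspect a few rows without reading the whole fleet
```

The datastore acts as a lazy pointer to the files; nothing is read in bulk until a computation demands it.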
Common Big Data File Formats
▪ Characteristics of Big Data files and compression techniques
– "Splittable": can files be split if larger than the HDFS block size?
– Storage and I/O efficiency: how efficiently can data be stored, read, and/or written?

Type      Splittable  Storage                     Notes
CSV/text  Yes         Delimited ASCII text        Human readable, but not storage efficient
JSON      No          ASCII text, attribute pairs Human readable, but not storage efficient
AVRO      Yes         Row based                   Serialization technology, good storage efficiency
Parquet   Yes         Column oriented             Very efficient I/O when accessing columns
ORC/RC    Yes         Stored collections of rows  Highly efficient for parallel processing collections of rows
bzip2     Yes         Compression                 Very good storage efficiency
LZO       No          Compression                 Good I/O performance for frequently accessed data
Snappy    No          Compression                 Good I/O performance for frequently accessed data
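To illustrate the column-oriented advantage noted for Parquet, MATLAB's parquetDatastore (available in more recent releases than this talk; the file name and channel names here are hypothetical) can pull just the columns it needs:

```matlab
% Reading two columns from a Parquet file touches only those column
% chunks on disk; a CSV reader would have to scan every row in full.
pds = parquetDatastore("trips.parquet", ...
    "SelectedVariableNames", ["Time", "VehicleSpeed"]);
T = readall(pds);   % table containing only the requested channels
```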
Event Detection and Studies
1. Prototype with tall datatypes
2. Call gather as late as possible
3. Understand why multiple passes may be necessary
4. Perform joins with in-memory data to supply metadata
5. Use groupsummary to perform parallel computations over subsets
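A minimal sketch of steps 1, 2, and 5, assuming a speed channel named VehicleSpeed, a VehicleID column, and an arbitrary speeding threshold (all hypothetical, not from the talk):

```matlab
ds = datastore("trips/*.csv");                 % hypothetical location
t  = tall(ds);                                 % prototype with tall (step 1)
fast = t(t.VehicleSpeed > 120, :);             % deferred event filter
stats = groupsummary(fast, "VehicleID", ...    % per-vehicle summary (step 5)
    "mean", "VehicleSpeed");
stats = gather(stats);                         % gather as late as possible (step 2)
```

Because every operation before gather is deferred, MATLAB can fuse them into as few passes over the files as possible; calling gather early forfeits that optimization.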
What is Hadoop?
Framework for distributed processing of large data sets across clusters of computers.
▪ HDFS: scalable and economical file system
▪ YARN (Yet Another Resource Negotiator): efficient resource manager that minimizes data movement
▪ Ecosystem of applications
MapReduce use is decreasing, but Hadoop is alive and well.
What is Spark?
Apache Spark is an open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley.
Spark Core (batch processing):
▪ Parallel execution engine
▪ Generalized execution model
▪ In-memory computing for speed
▪ Out-of-memory computing for size
▪ Java, Scala, Python, and R APIs
Efficient for iterative algorithms such as machine learning.
Spark Execution Model
[Diagram: a client application on an edge node submits jobs through client libraries to the master/name node; YARN acts as resource manager; each worker node runs an executor with a cache, co-located with an HDFS data node, and tasks run inside the executors.]
▪ Jobs are submitted via client libraries
▪ YARN assigns and manages resources
▪ Each executor accesses local data
▪ Caching minimizes disk writes/reads; data can "spill" to disk if it is too big
Optimizing and Deploying Calculations
1. Use parpool to distribute computations on a multi-core machine
2. Use Spark on Linux to distribute computations on a cluster
3. Distribute data on HDFS, S3, etc. (use datastore.AlternateFileSystemRoots)
4. For ad hoc and interactive use, run tall arrays against YARN
5. For detailed control and automated use, use the MATLAB API for Spark
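Steps 1–3 above can be sketched as follows; the install folders, HDFS path, and S3 bucket name are hypothetical placeholders:

```matlab
% Local multicore execution (step 1):
parpool("local");

% Or point the same tall-array code at Spark on a Hadoop cluster (step 2):
cluster = parallel.cluster.Hadoop( ...
    "HadoopInstallFolder", "/opt/hadoop", ...
    "SparkInstallFolder",  "/opt/spark");
mapreducer(cluster);

ds = datastore("hdfs:///fleet/trips/*.csv");
% Let identical code locate the data under another root, e.g. S3 (step 3):
ds.AlternateFileSystemRoots = {"hdfs:///fleet", "s3://fleet-bucket/fleet"};
t = tall(ds);   % subsequent gather calls now execute on the cluster
```

The point of this pattern is that the analysis code is unchanged; only the execution environment set by parpool or mapreducer moves from desktop to cluster.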