TRANSCRIPT
© 2015 The MathWorks, Inc.
Analyzing Fleet Data with
MATLAB and Spark
Rory Adams
Matthew Elliot
Paul Peeling
Analyzing Fleet Data with MATLAB and Spark
▪ The MathWorks Fleet
▪ Organizing Big Data
▪ Event Detection and Studies
▪ Optimizing and Deploying Calculations
The MathWorks Fleet
▪ 1300 trip log files
▪ 21 unique vehicles
▪ Approx. 39 unique channels
▪ Data collected over 1.5 years
MathWorks Automotive Fleet – Data Collection
[Diagram: a phone connects to the vehicle's OBD-II port via Bluetooth; trip data travels over 4G LTE to a MATLAB Production Server (enrich data, file creation) and into a data warehouse, which engineers access from MATLAB.]
Automotive Vehicle Fleet – Intrinsic Hierarchy
Vehicles → Trips (files) → Messages → Signals → Time–value pairs
Questions
▪ How do different factors affect how a particular driver drives?
▪ What is real-world vehicle performance in areas such as fuel economy, emissions, vehicle dynamics, ride and handling, prognostics, and durability?
▪ How do you work with terabytes of data to distill out critical information?
▪ Once you do have the critical information, how do you iterate back through your terabytes of data to extract relevant (time) slices for further study or analysis?
Challenges
▪ Traditional tools and approaches won’t work
– Accessing the data is hard; processing it is even harder
– Need to learn new tools and new coding styles
– Have to rewrite algorithms, often at a lower level of abstraction
▪ Quality of your results can be impacted
– e.g., by being forced to work on a subset of your data
– Learning new tools and rewriting algorithms can hurt productivity
▪ Time required to conduct analysis
– Need to leverage parallel computing on desktop and cluster
Organizing Big Data
1. Represent your data as a datastore object
2. Efficiency still matters
3. Compressed file formats reduce data transfer time
4. Avoid redundant data
5. Prefer stacked tables
6. Write checkpoints to save intermediate results
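Step 1 above can be sketched in MATLAB as follows; this is a minimal illustration, and the file location and channel names are hypothetical, not from the talk:

```matlab
% Represent a whole folder of trip logs as one datastore object.
% Selecting only the needed channels keeps I/O efficient (step 2)
% and avoids hauling redundant data around (step 4).
ds = datastore("trips/*.csv", ...
    "SelectedVariableNames", ["Time", "VehicleSpeed", "EngineRPM"]);
preview(ds)   % inspect a few rows without reading the whole fleet
```

The datastore acts as a lazy pointer to the files; nothing is read in bulk until a computation demands it.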
Common Big Data File Formats
▪ Characteristics of Big Data files and compression techniques
– "Splittable": can files be split if larger than the HDFS block size?
– Storage and I/O efficiency: how efficiently can data be stored, read, and/or written?

Type      Splittable  Storage                     Notes
CSV/text  Yes         Delimited ASCII text        Human readable, but not storage efficient
JSON      No          ASCII text, attribute pairs Human readable, but not storage efficient
AVRO      Yes         Row based                   Serialization technology, good storage efficiency
Parquet   Yes         Column oriented             Very efficient I/O when accessing columns
ORC/RC    Yes         Stored collections of rows  Highly efficient for parallel processing collections of rows
bzip2     Yes         Compression                 Very good storage efficiency
LZO       No          Compression                 Good I/O performance for frequently accessed data
Snappy    No          Compression                 Good I/O performance for frequently accessed data
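To illustrate the column-oriented advantage noted for Parquet, MATLAB's parquetDatastore (available in more recent releases than this talk; the file name and channel names here are hypothetical) can pull just the columns it needs:

```matlab
% Reading two columns from a Parquet file touches only those column
% chunks on disk; a CSV reader would have to scan every row in full.
pds = parquetDatastore("trips.parquet", ...
    "SelectedVariableNames", ["Time", "VehicleSpeed"]);
T = readall(pds);   % table containing only the requested channels
```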
Event Detection and Studies
1. Prototype with tall datatypes
2. Call gather as late as possible
3. Understand why multiple passes may be necessary
4. Perform joins with in-memory data to supply metadata
5. Use groupsummary to perform parallel computations over subsets
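A minimal sketch of steps 1, 2, and 5, assuming a speed channel named VehicleSpeed, a VehicleID column, and an arbitrary speeding threshold (all hypothetical, not from the talk):

```matlab
ds = datastore("trips/*.csv");                 % hypothetical location
t  = tall(ds);                                 % prototype with tall (step 1)
fast = t(t.VehicleSpeed > 120, :);             % deferred event filter
stats = groupsummary(fast, "VehicleID", ...    % per-vehicle summary (step 5)
    "mean", "VehicleSpeed");
stats = gather(stats);                         % gather as late as possible (step 2)
```

Because every operation before gather is deferred, MATLAB can fuse them into as few passes over the files as possible; calling gather early forfeits that optimization.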
What is Hadoop?
Framework for distributed processing of large data sets across clusters of computers.
▪ HDFS: scalable and economical file system
▪ YARN (Yet Another Resource Negotiator): efficient resource manager that minimizes data movement
▪ Ecosystem of applications
MapReduce use is decreasing, but Hadoop is alive and well.
What is Spark?
Apache Spark is an open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley.
Spark Core (batch processing):
▪ Parallel execution engine
▪ Generalized execution model
▪ In-memory computing for speed
▪ Out-of-memory computing for size
▪ Java, Scala, Python, and R APIs
Efficient for iterative algorithms such as machine learning.
Spark Execution Model
[Diagram: a client application on an edge node submits jobs through client libraries to the master/name node; YARN acts as resource manager; each worker node runs an executor with a cache, co-located with an HDFS data node, and tasks run inside the executors.]
▪ Jobs are submitted via client libraries
▪ YARN assigns and manages resources
▪ Each executor accesses local data
▪ Caching minimizes disk writes/reads; data can "spill" to disk if it is too big
Optimizing and Deploying Calculations
1. Use parpool to distribute computations on a multi-core machine
2. Use Spark on Linux to distribute computations on a cluster
3. Distribute data on HDFS, S3, etc. (use datastore.AlternateFileSystemRoots)
4. For ad hoc and interactive use, run tall arrays against YARN
5. For detailed control and automated use, use the MATLAB API for Spark
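Steps 1–3 above can be sketched as follows; the install folders, HDFS path, and S3 bucket name are hypothetical placeholders:

```matlab
% Local multicore execution (step 1):
parpool("local");

% Or point the same tall-array code at Spark on a Hadoop cluster (step 2):
cluster = parallel.cluster.Hadoop( ...
    "HadoopInstallFolder", "/opt/hadoop", ...
    "SparkInstallFolder",  "/opt/spark");
mapreducer(cluster);

ds = datastore("hdfs:///fleet/trips/*.csv");
% Let identical code locate the data under another root, e.g. S3 (step 3):
ds.AlternateFileSystemRoots = {"hdfs:///fleet", "s3://fleet-bucket/fleet"};
t = tall(ds);   % subsequent gather calls now execute on the cluster
```

The point of this pattern is that the analysis code is unchanged; only the execution environment set by parpool or mapreducer moves from desktop to cluster.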