Time-Series Databases and Machine Learning · Jimmy Bates · November 2017 · © 2015 MapR Technologies


Page 1: Time-Series Databases and Machine Learning · © 2015 MapR Technologies 3 Top-Ranked NoSQL Top-Ranked Hadoop Distribution ®


© 2015 MapR Technologies

Time-Series Databases and Machine Learning Jimmy Bates November 2017

Page 2:

Top-Ranked Hadoop Distribution

1 Read Write File System

2 Ease of Data Ingestion

3 World Record Performance

4 Complete Data Protection

5 High Availability

6 Real Multi-tenancy

7 Enterprise-grade Security

8 Unbiased Open Source

Page 3:

Top-Ranked NoSQL

Top-Ranked Hadoop Distribution

Page 4:

Top-Ranked NoSQL

1 Real-Time Operational Analytics

2 Unified Data Access

3 Performance

4 Scalability and Reliability

5 Security

6 Architectural and Administrative Simplicity

Page 5:

Top-Ranked NoSQL

Top-Ranked Hadoop Distribution

Top-Ranked SQL-on-Hadoop Solution

Page 6:

Top-Ranked SQL-on-Hadoop Solution

1 The Most SQL Options on Hadoop

2 Including our work with Apache Drill

Page 7:

Agenda

1 What is Time Series

2 A Basic Workflow

3 The Data Science User

4 A Quick Look at OpenTSDB

5 Resources for Getting Started

Page 8:

Time Series Data

What is Time Series Data?

•  Any set of data points that have a time associated with them.
•  Examples of time series data include:
   –  weather
   –  web clickstream
   –  product sales
   –  stock trades
   –  machine logs
   –  fleets
   –  sensors
   –  devices
   –  network traffic
   –  system logins

Page 9:

Value of Time Series Analytics

Use Cases Cut Across Different Industries

Forecasting

Seasonality Correlation

Exec Dashboards

Trend Analysis

Preventative Maintenance

Environmental Monitoring (Climate)

Cell Tower Data Analysis

Stock Trade Data Analysis

IOT Sensor Data

Operational Behavioral Analysis

Page 10:

Typical Time Series Analytics Flow

Data Store

Data Sources

Batch ETL

Prep

Extract

Aggregate Reports

Environment: IT teams / Application Development groups / Analytics groups

Persona: Enterprise Architect, NoSQL Developer, Database Developer, Analyst

Task: Conduct analyses or provide reports to management

Page 11:

Typical Time Series Analytics Flow - Challenges

Data Store

Data Sources

Batch Processing

Prep

Extract

Aggregate Reports

•  Larger Volumes of data from newer sources

•  Expensive to store

•  Takes a lot of time to process information and do simple aggregations

•  Not Real-time

Page 12:

Challenges with Status Quo

•  Existing systems aren’t well suited to store high volumes of time-series data
•  Building aggregations and statistical computations from time-series data is currently done in batch

Need:

•  Cost-effective and reliable way to store and analyze large amounts of time-series data from various sources
•  Ability to deploy real-time dashboarding and monitoring capabilities on aggregated data
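As an illustration of the kind of aggregation these slides refer to, here is a minimal Python sketch that downsamples raw points into fixed windows and computes simple statistics per window. The function name, the 60-second window, and the sample data are assumptions made for this example; they do not come from the talk.

```python
from collections import defaultdict

def downsample(points, window_s=60):
    """Group (timestamp_s, value) pairs into fixed windows and summarize each window."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its window.
        buckets[ts - ts % window_s].append(value)
    return {
        start: {
            "count": len(vs),
            "min": min(vs),
            "max": max(vs),
            "mean": sum(vs) / len(vs),
        }
        for start, vs in sorted(buckets.items())
    }

# Hypothetical sample data: (seconds since start, reading).
points = [(0, 1.0), (30, 3.0), (60, 5.0), (119, 7.0), (120, 9.0)]
summary = downsample(points)
```

Run as a batch job, this scans every raw point each time. A real-time dashboard would more likely keep such per-window aggregates up to date as points arrive.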

Page 13:

Time Series Solution Example

MapR Data Platform

Page 14:

Self-Service Data Exploration

Data Agility with Less IT Required

Single SQL Interface for Structured and Semi-Structured Data

MapR Data Exploration Advantages for IoT


Page 15:

Squaring the Circle

•  Enter Apache Drill
•  Drill is SQL compliant
   –  Uses standard syntax and semantics
•  Drill extends SQL
   –  First-class treatment of objects, lists
   –  Full support for destructuring, flattening
   –  Full power of relational model can be applied to complex data
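To make "flattening" concrete, the sketch below shows in plain Python what it means to turn a nested list inside a record into ordinary relational rows. This is not Drill code and does not use any Drill API; the record layout and field names are hypothetical.

```python
def flatten(records, list_field):
    """Yield one flat row per element of record[list_field], copying the parent fields."""
    for record in records:
        parent = {k: v for k, v in record.items() if k != list_field}
        for item in record.get(list_field, []):
            yield {**parent, **item}

# Hypothetical semi-structured input: one document per sensor with nested samples.
readings = [
    {"sensor": "s1", "samples": [{"t": 0, "v": 1.5}, {"t": 1, "v": 1.7}]},
    {"sensor": "s2", "samples": [{"t": 0, "v": 0.2}]},
]
rows = list(flatten(readings, "samples"))
```

Once the data is in this row form, ordinary relational operations such as filtering, grouping, and joining apply to it directly.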

Page 16:

Drill Provides Scalable and Extended SQL

Page 17:

Introduction to OpenTSDB

HBase or MapR-DB
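For orientation, an OpenTSDB data point consists of a metric name, a timestamp, a value, and one or more tags. The sketch below builds the JSON body that OpenTSDB's HTTP `/api/put` endpoint accepts. The metric name, tag, and helper function are made up for illustration, and nothing is actually sent over the network.

```python
import json

def make_put_payload(metric, timestamp_s, value, tags):
    """Build one data point in the shape expected by OpenTSDB's /api/put endpoint."""
    if not tags:
        # OpenTSDB requires at least one tag on every data point.
        raise ValueError("at least one tag is required")
    return {
        "metric": metric,
        "timestamp": int(timestamp_s),
        "value": value,
        "tags": dict(tags),
    }

# Hypothetical metric and tag values.
point = make_put_payload("sys.cpu.user", 1510000000, 42.5, {"host": "web01"})
body = json.dumps([point])  # /api/put accepts a single point or a list of points
# `body` could then be POSTed to a TSD, e.g. http://<tsd-host>:4242/api/put
# (<tsd-host> is a placeholder; 4242 is the default TSD port).
```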

Page 18:

Speeding up OpenTSDB

20,000 data points per second per node in the cluster

Why can’t it be faster?

Page 19:

Speeding up OpenTSDB: open source MapR extensions

Available on GitHub: https://github.com/mapr-demos/opentsdb

Page 20:

Speeding up OpenTSDB: open source MapR extensions

[Diagram components: Message queue, Samples, Users, Collector, MapR table, Web service, Log]

•  Buffering data for 1 hour in the collector allows a >1000x decrease in insertion rate
•  Logging the latest hour of data allows a clean restart of the collector (lambda + epsilon architecture)
•  Web service queries both the database and the collector
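The buffering idea can be sketched as follows. This is a simplified, in-memory illustration written for this transcript, not the code from the MapR extensions: the `Collector` class, its method names, and the dictionary standing in for the database table are all assumptions, and the restart log is not modelled.

```python
from collections import defaultdict

WINDOW_S = 3600  # buffer one hour of data per series

class Collector:
    def __init__(self):
        self.buffer = defaultdict(list)  # (series_id, window_start) -> [(offset_s, value)]
        self.table = {}                  # stand-in for the database table
        self.inserts = 0                 # number of row writes issued to the table

    def add(self, series_id, ts, value):
        """Buffer one sample in memory instead of writing it to the table."""
        window = ts - ts % WINDOW_S
        self.buffer[(series_id, window)].append((ts - window, value))

    def flush(self, now):
        """Write each closed window to the table as a single row."""
        closed = [key for key in self.buffer if key[1] + WINDOW_S <= now]
        for key in closed:
            self.table[key] = self.buffer.pop(key)
            self.inserts += 1

    def query(self, series_id, window):
        """Serve reads from the table, falling back to the in-memory buffer."""
        key = (series_id, window)
        return self.table.get(key) or self.buffer.get(key, [])

collector = Collector()
for ts in range(7200):            # two hours of 1 Hz samples for one series
    collector.add("s1", ts, float(ts))
collector.flush(now=7200)
```

In this toy run, 7,200 samples result in 2 row writes, a 3,600x reduction. The `query` method reflects the point that reads must consult both the table and the collector, since the most recent hour exists only in the buffer.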

Page 21:

Direct Blob Loading for Testing

[Diagram components: Server, Collector, HBase or MapR-DB, TSD (two instances), Grafana, OpenTSDB UI, Scripts, Blob Loader]

Page 22:

A first example: Time-series data

Page 23:

Column names as data

•  When column names are not pre-defined, they can convey information
•  Examples
   –  Time offsets within a window for time series
   –  Top-level domains for web crawlers
   –  Vendor IDs for customer purchase profiles
•  Predefined schema is impossible for this idiom
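A minimal sketch of the "time offsets as column names" idiom, using a Python dictionary as a stand-in for a wide-column table. The one-hour window, the row-key format, and the helper names are assumptions made for this example.

```python
WINDOW_S = 3600  # one row per series per hour (an assumed window size)

def row_key(series_id, ts):
    """Row key = series ID plus the start of the time window."""
    return f"{series_id}:{ts - ts % WINDOW_S}"

def put_point(table, series_id, ts, value):
    """Store the sample under a column whose *name* is the offset within the window."""
    row = table.setdefault(row_key(series_id, ts), {})
    row[ts % WINDOW_S] = value

table = {}
for ts, value in [(3600, 1.16), (3664, 0.04), (3668, 0.08), (7200, 0.5)]:
    put_point(table, 101, ts, value)
```

The column names (0, 64, 68) are only known once the data arrives, which is why a predefined schema cannot describe this layout.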

Page 24:

Relational Model for Time-series

Sample time    Time series ID    value
15:51:03       101               1.16
15:52:07       101               0.04
15:52:11       101               0.08

[Figure labels: Sample time, Time series ID, Row key, value, Data values]

Page 25:

Table Design: Point-by-Point

Page 26:

Table Design: Hybrid Point-by-Point + Sub-table

After the window closes, the data in the row is restated as a column-oriented tabular value in a different column family.
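The restating step might look like the sketch below. The two "column families" are modelled as nested dictionaries named `points` and `blob`, and the blob is JSON compressed with zlib. All of these choices are assumptions for illustration; the talk does not specify the encoding.

```python
import json
import zlib

def close_window(row):
    """Restate the per-point columns as one column-oriented, compressed blob."""
    points = sorted(row["points"].items())
    columns = {
        "offsets": [offset for offset, _ in points],
        "values": [value for _, value in points],
    }
    blob = zlib.compress(json.dumps(columns).encode("utf-8"))
    # The point-by-point columns are dropped once the blob has been written.
    return {"points": {}, "blob": {"data": blob}}

def read_window(row):
    """Read a row whether it is still open (points) or already closed (blob)."""
    result = dict(row["points"])
    if "data" in row["blob"]:
        columns = json.loads(zlib.decompress(row["blob"]["data"]))
        result.update(zip(columns["offsets"], columns["values"]))
    return result

open_row = {"points": {0: 1.16, 64: 0.04, 68: 0.08}, "blob": {}}
closed_row = close_window(open_row)
```

Readers call `read_window` in both cases, so they do not need to know whether a given window has been closed yet.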

Page 27:

Compression Results

Samples are 64-bit time, 16-bit sample

•  Sample time at 10 kHz
•  Sample time jitter makes it important to keep the original time-stamp
•  How much overhead to retain the time-stamp?
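One way to reason about the overhead question is to store each time-stamp as a small deviation from the expected 100 µs sample period instead of as a full 64-bit value. The sketch below is an illustrative encoding chosen for this transcript, not the scheme measured in the talk. It assumes microsecond time-stamps and jitter small enough that each deviation fits in a signed byte.

```python
import random
import struct

PERIOD_US = 100  # nominal sample period at 10 kHz

def encode(timestamps_us, samples):
    """First time-stamp as 64 bits, later ones as 8-bit deviations, then 16-bit samples."""
    out = bytearray(struct.pack("<q", timestamps_us[0]))
    prev = timestamps_us[0]
    for t in timestamps_us[1:]:
        # Deviation of the actual interval from the nominal period (assumed within -128..127).
        out += struct.pack("<b", t - prev - PERIOD_US)
        prev = t
    out += struct.pack(f"<{len(samples)}h", *samples)
    return bytes(out)

def decode(blob, n):
    """Recover the exact original time-stamps and samples."""
    t = struct.unpack_from("<q", blob, 0)[0]
    timestamps = [t]
    for deviation in struct.unpack_from(f"<{n - 1}b", blob, 8):
        t += PERIOD_US + deviation
        timestamps.append(t)
    samples = list(struct.unpack_from(f"<{n}h", blob, 8 + n - 1))
    return timestamps, samples

rng = random.Random(0)  # synthetic data with a few microseconds of jitter
n = 10_000
timestamps = [i * PERIOD_US + rng.randint(-5, 5) for i in range(n)]
samples = [rng.randint(-32768, 32767) for _ in range(n)]

blob = encode(timestamps, samples)
raw_bytes = n * (8 + 2)  # 64-bit time plus 16-bit sample, uncompressed
```

Under these assumptions each sample costs about 3 bytes instead of 10, and the original time-stamps are still recovered exactly. A general-purpose compressor applied to the deviations could reduce the time-stamp cost further.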

Page 28:

Insertion Speeds

•  Inserting pre-bundled data allows humongous data rates
   –  >100 M samples/s on 4 nodes
   –  >200 M samples/s on 8 nodes
•  Linear scaling if enough data sources

Page 29:

How fast is OpenTSDB?

•  OpenTSDB can scale to writing millions of data points per second on commodity servers with regular spinning hard drives
•  What if we needed 100-1000x faster writing speed?

Page 30:

Problem: How Do You Load Test Data at Large Scale?

•  Testing at large scale is a more realistic measure of performance than testing on a small sample
•  But with high-velocity data, how do you set up the test?
   –  For a sample equivalent to long-term data, you need ingest rates 100 to 1000x faster than normal production
   –  OR
   –  You must wait years to load up test data

What’s the solution?

Page 31:

The need for rapid data loading

•  Let’s say you have 1M samples / second
•  Suppose you want to test your system
•  Perhaps with a year of data
•  And you want to load that data in << 1 year
•  100x real-time = 100M samples / second
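The arithmetic behind these numbers, worked through in Python. The rates and the one-year horizon are the ones on the slide; the variable names, and the choice of a 365-day year, are ours.

```python
SECONDS_PER_DAY = 86_400

production_rate = 1_000_000  # samples per second in production
speedup = 100                # load at 100x real time
days_of_data = 365           # one year of test data

total_samples = production_rate * SECONDS_PER_DAY * days_of_data
load_rate = production_rate * speedup              # samples per second while loading
load_days = total_samples / load_rate / SECONDS_PER_DAY

print(f"total samples: {total_samples:,}")
print(f"load rate:     {load_rate:,} samples/s")
print(f"load time:     {load_days:.2f} days")
```

At 100x real time, a year of data (about 3.15 × 10^13 samples) loads in roughly 3.65 days, and at 1000x in under 9 hours.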


Page 33:

Short Books by Ted Dunning & Ellen Friedman

•  Published by O’Reilly in 2014 and 2015
•  For sale from Amazon or O’Reilly
•  Free e-books currently available courtesy of MapR

http://bit.ly/ebook-real-world-hadoop

http://bit.ly/mapr-tsdb-ebook

http://bit.ly/ebook-anomaly

http://bit.ly/recommendation-ebook

Page 34:

Thank You @mapr maprtech


MapRTechnologies

maprtech

mapr-technologies