

Physical Data Storage

Stephen Dawson-Haggerty

Data Sources

Diagram: sMAP sources feed applications and back-end storage (Applications, StreamFS, Hadoop/HDFS).

- Data exploration/visualization
- Control loops
- Demand response
- Analytics
- Mobile feedback
- Fault detection

Time-Series Databases

• Expected workload
• Related work
• Server architecture
• API
• Performance
• Future directions

Figure: Dent circuit meters exposed through sMAP sources.

Write Workload

• sMAP sources (see the polling sketch below)
  – HTTP/REST protocol for exposing physical information
  – Data trickles in as it is generated
  – Typical data rates: 1 reading / 1-60 s
• Bulk imports
  – Existing databases
  – Migrations
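To make the write path concrete, here is a minimal polling sketch against a sMAP-style HTTP/REST source. The endpoint URL, the JSON field names ("uuid", "Readings"), and the 30-second poll interval are illustrative assumptions, not the exact sMAP schema.

# Minimal sketch: polling a sMAP-style HTTP/REST source and collecting readings.
# The URL, JSON layout ("uuid", "Readings"), and polling interval are
# illustrative assumptions, not the exact sMAP schema.
import json
import time
import urllib.request

SOURCE_URL = "http://example-smap-source:8080/data/meter0"  # hypothetical endpoint

def poll_once(url):
    """Fetch one JSON report and return (uuid, [(timestamp, value), ...])."""
    with urllib.request.urlopen(url) as resp:
        report = json.load(resp)
    return report["uuid"], [tuple(r) for r in report["Readings"]]

if __name__ == "__main__":
    # Data trickles in at roughly one reading per 1-60 s, so a slow loop is enough.
    while True:
        uuid, readings = poll_once(SOURCE_URL)
        print(uuid, readings[-1] if readings else "no data yet")
        time.sleep(30)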

Read Workload

• Plotting engine
• Matlab & Python adaptors for analysis
• Mobile apps
• Batch analysis

Dominated by range queries.

Latency is important for interactive data exploration.

Server architecture diagram: the readingdb time-series interface (insert, query, resample, aggregate) sits on bucketing, RPC, and compression layers above a key-value store with a page cache, lock manager, and storage allocation; results flow back through a streaming pipeline, with a storage mapper and SQL (MySQL) alongside.

Time series interface

db_open()
db_query(streamid, start, end): query points in a range
db_next(streamid, ref), db_prev(...): query points near a reference time
db_add(streamid, vector): insert points into the database
db_avail(streamid): retrieve storage map
db_close()

All data is part of a stream, identified only by streamid

A stream is a series of tuples: (timestamp, sequence, value, min, max)
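A minimal in-memory sketch of these semantics in Python (not the real readingdb implementation, whose bindings may differ): streams identified only by streamid, points as (timestamp, sequence, value, min, max) tuples, and range plus nearest-point queries.

# In-memory sketch of the time-series interface semantics above.
# This is an illustration of the API's behaviour, not readingdb itself.
import bisect

class TinyTSDB:
    def __init__(self):
        self.streams = {}  # streamid -> sorted list of point tuples

    def db_add(self, streamid, vector):
        """Insert points (timestamp, sequence, value, min, max) into a stream."""
        pts = self.streams.setdefault(streamid, [])
        pts.extend(vector)
        pts.sort(key=lambda p: p[0])

    def db_query(self, streamid, start, end):
        """Return all points with start <= timestamp <= end."""
        pts = self.streams.get(streamid, [])
        keys = [p[0] for p in pts]
        lo, hi = bisect.bisect_left(keys, start), bisect.bisect_right(keys, end)
        return pts[lo:hi]

    def db_prev(self, streamid, ref):
        """Return the nearest point strictly before the reference time."""
        before = self.db_query(streamid, 0, ref - 1)
        return before[-1] if before else None

db = TinyTSDB()
db.db_add(7, [(t, 0, 20.0, 20.0, 20.0) for t in range(0, 600, 60)])
print(db.db_query(7, 120, 300))   # points with 120 <= timestamp <= 300
print(db.db_prev(7, 150))         # nearest point before t = 150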

Storage Manager: BDB

• Berkeley Database: embedded key-value store
• Store binary blobs using B+ trees (see the sketch below)
• Very mature: around since 1992; supports transactions, free-threading, replication
• We use version 4
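For reference, storing and range-scanning binary blobs in a Berkeley DB B+ tree looks roughly like this from Python, assuming the third-party bsddb3 binding; the file name and keys are illustrative.

# Sketch: Berkeley DB as an embedded key-value store with B+ tree ordering.
# Assumes the third-party bsddb3 package; keys and values are opaque byte blobs.
from bsddb3 import db

d = db.DB()
d.open("buckets.bdb", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

# Store two binary blobs under ordered keys.
d.put(b"stream0007:1300000000", b"<compressed bucket bytes>")
d.put(b"stream0007:1300003600", b"<compressed bucket bytes>")

# B+ tree cursors support ordered scans: position at the first key >= prefix
# and walk forward, which is what makes range queries over one stream cheap.
cur = d.cursor()
rec = cur.set_range(b"stream0007:")
while rec is not None and rec[0].startswith(b"stream0007:"):
    key, blob = rec
    print(key, len(blob), "bytes")
    rec = cur.next()
cur.close()
d.close()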

RPC Evolution

• First: shared memory
  – Low latency
• Move to threaded TCP
• Google protocol buffers
  – Zig-zag integer representation, multiple language bindings (sketch below)
  – Extensible for multiple versions
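Zig-zag encoding is what lets protocol buffers store small negative deltas in few bytes. A minimal sketch of the 64-bit mapping and its inverse (standard protobuf behaviour, not readingdb-specific code):

# Sketch of protobuf-style zig-zag encoding for 64-bit signed integers:
# small magnitudes, positive or negative, map to small unsigned values,
# which then varint-encode into few bytes.

def zigzag_encode(n: int) -> int:
    """Map a signed integer to an unsigned one: 0->0, -1->1, 1->2, -2->3, ..."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    """Inverse mapping back to a signed integer."""
    return (z >> 1) ^ -(z & 1)

if __name__ == "__main__":
    for n in (0, -1, 1, -2, 2, -1000, 1000):
        z = zigzag_encode(n)
        assert zigzag_decode(z) == n
        print(n, "->", z)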

On-Disk Format

• All data stores perform poorly with one key per reading
  – Index size is high
  – Unnecessary
• Solution: bucket readings (key-construction sketch below)
• Excellent locality of reference with B+ tree indexes
  – Data sorted by streamid and timestamp
  – Range queries translate into mostly large sequential IOs

Diagram: buckets keyed by (streamid, timestamp).
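A minimal sketch of how bucketed keys keep a B+ tree range query sequential: pack (streamid, bucket start time) big-endian so byte-wise key order matches (streamid, timestamp) order. The bucket size and key layout here are illustrative choices, not readingdb's actual on-disk format.

# Sketch: bucket keys that sort by (streamid, timestamp) in a B+ tree.
# BUCKET_SECONDS and the 4+4 byte key layout are illustrative assumptions.
import struct

BUCKET_SECONDS = 3600  # one bucket per stream per hour (assumption)

def bucket_key(streamid: int, timestamp: int) -> bytes:
    """Big-endian packing so byte-wise ordering equals numeric ordering."""
    bucket_start = timestamp - (timestamp % BUCKET_SECONDS)
    return struct.pack(">II", streamid, bucket_start)

def bucket_range(streamid: int, start: int, end: int):
    """Inclusive key range covering all buckets a [start, end] query can touch."""
    return bucket_key(streamid, start), bucket_key(streamid, end)

# A range query over one stream becomes a scan between two nearby keys,
# i.e. a mostly sequential read of neighbouring B+ tree leaf pages.
lo, hi = bucket_range(7, 1_300_000_000, 1_300_086_400)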

• Represent in memory with a materialized structure (32 bytes/rec)
  – Inefficient on disk: lots of repeated data, missing fields
• Solution: compression (sketch below)
  – First: delta-encode each bucket in a protocol buffer
  – Second: Huffman tree or run-length encoding (zlib)
• Combined compression is 2x better than gzip or either stage alone
• 1M rec/second compress/decompress on modest hardware
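A minimal sketch of the two-stage idea on a single bucket: delta-encode timestamps so repeated data becomes small integers, then zlib-compress the byte stream. The fixed struct layout here stands in for the protocol-buffer serialization and is purely illustrative.

# Sketch of two-stage bucket compression: delta encoding + zlib.
# Plain struct packing is used instead of a protocol buffer to keep the
# example self-contained; it is not readingdb's actual wire format.
import struct
import zlib

def compress_bucket(points):
    """points: list of (timestamp, value) pairs sorted by timestamp."""
    deltas, prev_t = [], 0
    for t, v in points:
        deltas.append((t - prev_t, v))   # timestamps become small deltas
        prev_t = t
    raw = b"".join(struct.pack(">qd", dt, v) for dt, v in deltas)
    return zlib.compress(raw)

def decompress_bucket(blob):
    raw = zlib.decompress(blob)
    points, prev_t = [], 0
    for off in range(0, len(raw), 16):
        dt, v = struct.unpack(">qd", raw[off:off + 16])
        prev_t += dt
        points.append((prev_t, v))
    return points

pts = [(1_300_000_000 + 60 * i, 20.5) for i in range(100)]
assert decompress_bucket(compress_bucket(pts)) == pts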

On-Disk Format

Diagram: delta-encoded buckets are compressed and packed into BDB pages.

Other Services: Storage Mapping

• What is in the database?
  – Compute a set of tuples (start, end, n)
• The desired interpretation is “the data source was alive”
• Different data sources have different ways of maintaining this information and maintaining confidence
  – Sometimes you have to infer it from the data (sketch below)
  – Sometimes data sources give you liveness/presence guarantees: “I haven’t heard from you in an hour, but I’m still alive!”

dead or alive?
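When a source offers no explicit liveness signal, the storage map can be inferred from the data by splitting on large gaps. A minimal sketch, where the 5-minute gap threshold is an arbitrary assumption:

# Sketch: infer (start, end, n) storage-map segments from raw timestamps by
# splitting wherever the gap exceeds a threshold. The 5-minute threshold is
# an arbitrary assumption; some sources report liveness explicitly instead.
MAX_GAP = 300  # seconds without data before we consider the source "dead"

def storage_map(timestamps, max_gap=MAX_GAP):
    """timestamps: sorted list of ints; returns list of (start, end, n)."""
    segments = []
    if not timestamps:
        return segments
    start = prev = timestamps[0]
    count = 1
    for t in timestamps[1:]:
        if t - prev > max_gap:
            segments.append((start, prev, count))
            start, count = t, 0
        prev = t
        count += 1
    segments.append((start, prev, count))
    return segments

# Two live periods separated by a long outage:
ts = list(range(0, 600, 60)) + list(range(4000, 4600, 60))
print(storage_map(ts))  # [(0, 540, 10), (4000, 4540, 10)]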

readingdb6

• Up since December, supporting Cory Hall, SDH Hall, and most other LoCal deployments
  – Behind www.openbms.org
• > 2 billion points in 10k streams
  – 12 GB on disk, ~5 bytes/rec including the index
  – So... we fit in memory!
• Import at around 300k points/sec
  – We maxed out the NIC

Performance figures: low-latency RPC, compression ratios, and write load.

Importing old data: 150k points/sec. Continuous write load: 300-500 pts/sec.

Future thoughts

• A component of a cloud storage stack for physical data
• Hadoop adaptor: improve MapReduce performance over the HBase solution
• The data is small: 2 billion points in 12 GB
  – We can go a long time without distributing this very much
  – Probably necessary for reasons other than performance

THE END
