firehose engineering
DESCRIPTION
TRANSCRIPT
![Page 1: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/1.jpg)
Firehose Engineeringdesigninghigh-volumedata collection systems
Josh BerkusFOSS4G-NA
Wash. DC 2012
![Page 2: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/2.jpg)
![Page 3: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/3.jpg)
Firehose Database Applications (FDA)
(1) very high volume of data input from many automated producers
(2) continuous processing and aggregation of incoming data
![Page 4: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/4.jpg)
Mozilla Socorro
![Page 5: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/5.jpg)
Upwind
![Page 6: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/6.jpg)
GPS Applications
![Page 7: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/7.jpg)
Firehose Challenges
![Page 8: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/8.jpg)
1. Volume
● 100's to 1000's facts/second● GB/hour
![Page 9: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/9.jpg)
1. Volume
● spikes in volume● multiple uncoorindated sources
![Page 10: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/10.jpg)
1. Volume
volume always grows over time
![Page 11: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/11.jpg)
2. Constant flow
since data arrives 24/7 …
while the user interface can be down, data collection can never be down
![Page 12: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/12.jpg)
ETL
2. Constant flow● can't stop
receiving to process
● data can arrive out of order
![Page 13: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/13.jpg)
3. Database size
● terabytes to petabytes● lots of hardware● single-node DBMSes aren't enough● difficult backups, redundancy,
migration● analytics are resource-consumptive
![Page 14: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/14.jpg)
3. Database size
● database growth● size grows quickly● need to expand storage● estimate target data size● create data ageing policies
![Page 15: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/15.jpg)
3. Database size
“We will decide on a data retention policy when we run out
of disk space.”– every business user everywhere
![Page 16: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/16.jpg)
4.
![Page 17: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/17.jpg)
many components= many failures
![Page 18: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/18.jpg)
4. Component failure
● all components fail● or need scheduled downtime● including the network
● collection must continue● collection & processing must
recover
![Page 19: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/19.jpg)
solving firehose problems
![Page 20: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/20.jpg)
socorro project
![Page 21: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/21.jpg)
http://crash-stats.mozilla.com
![Page 22: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/22.jpg)
![Page 23: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/23.jpg)
Mozilla Socorro
collectors
processors
webservers
reports
![Page 24: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/24.jpg)
![Page 25: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/25.jpg)
socorro data volume
● 3000 crashes/minute● avg. size 150K
● 40TB accumulated raw data● 500GB accumulated metadata /
reports
![Page 26: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/26.jpg)
dealing with volume
load balancers collectors
![Page 27: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/27.jpg)
dealing with volume
monitor processors
![Page 28: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/28.jpg)
dealing with size
data40TB
expandible
metadataviews500GBfixed size
![Page 29: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/29.jpg)
dealing with component failure
● 30 Hbase nodes
● 2 PostgreSQL servers
● 6 load balancers
● 3 ES servers● 6 collectors● 12 processors● 8 middleware &
web servers
… lots of failures
Lots of hardware ...
![Page 30: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/30.jpg)
load balancing & redundancy
load balancers collectors
![Page 31: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/31.jpg)
elastic connections
● components queue their data● retain it if other nodes are down
● components resume work automatically
● when other nodes come back up
![Page 32: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/32.jpg)
elastic connections
collector
reciever local file queue
crash mover
![Page 33: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/33.jpg)
server management
● puppet● controls configuration of all servers● makes sure servers recover● allows rapid deployment of
replacement nodes
![Page 34: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/34.jpg)
Upwind
![Page 35: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/35.jpg)
Upwind
● speed● wind speed● heat● vibration● noise● direction
![Page 36: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/36.jpg)
Upwind
1. maximize power generation
2. make sure turbine isn't damaged
![Page 37: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/37.jpg)
dealing with volume
each turbine:
90 to 700 facts/second
windmills per farm: up to 100
number of farms: 40+
est. total: 300,000 facts/second
(will grow)
![Page 38: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/38.jpg)
dealing with volume
localstorage
historian analyticdatabase
reports
localstorage
historian analyticdatabase
reports
![Page 39: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/39.jpg)
dealing with volume
localstorage
historian analyticdatabase
localstorage
historian analyticdatabase
masterdatabase
![Page 40: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/40.jpg)
multi-tenant partitioning
● partition the whole application● each customer gets their own
toolchain
● allows scaling with the number of customers
● lowers efficiency● more efficient with virtualization
![Page 41: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/41.jpg)
dealing with:constant flow and size
historianminutebuffer
hourstable
daystable
monthstable
yearstable
historianminutebuffer
hourstable
daystable
monthstable
historianminutebuffer
hourstable
daystable
![Page 42: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/42.jpg)
time-based rollups
● continuously accumulate levels of rollup
● each is based on the level below it● data is always appended, never
updated● small windows == small resources
![Page 43: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/43.jpg)
time-based rollups
● allows:● very rapid summary reports for
different windows● retaining different summaries for
different levels of time● batch/out-of-order processing● summarization in parallel
![Page 44: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/44.jpg)
firehose tips
![Page 45: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/45.jpg)
data collection must be:
● continuous● parallel● fault-tolerant
![Page 46: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/46.jpg)
data processing must be:
● continuous● parallel● fault-tolerant
![Page 47: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/47.jpg)
every component must be able to fail
● including the network● without too much data loss● other components must
continue
![Page 48: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/48.jpg)
5 tools to use
1. queueing software
2. buffering techniques
3. materialized views
4. configuration management
5. comprehensive monitoring
![Page 49: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/49.jpg)
4 don'ts
1. use cutting-edge technology
2. use untested hardware
3. run components to capacity
4. do hot patching
![Page 50: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/50.jpg)
master your own
![Page 51: Firehose Engineering](https://reader033.vdocuments.us/reader033/viewer/2022051514/5495b0c5b47959474d8b4dc9/html5/thumbnails/51.jpg)
Contact● Josh Berkus: [email protected]
● blog: blogs.ittoolbox.com/database/soup
● PostgreSQL: www.postgresql.org● pgexperts: www.pgexperts.com
● Postgres BOF 3:30PM Today● pgCon, Ottawa, May 2012● Postgres Open, Chicago, Sept. 2012
The text and diagrams in this talk is copyright 2011 Josh Berkus and is licensed under the creative commons attribution license. Title slide image and other fireman images are licensed from iStockPhoto and may not be reproduced or redistributed separate from this presentation. Socorro images are copyright 2011 Mozilla Inc.