A Holistic View of Operational Capabilities: Roy Rapoport, Insight Engineering at Netflix
Post on 11-Apr-2017
(One way to think of) Operational Insight
Roy Rapoport
@royrapoport or rsr@netflix.com
March 30, 2017
Insight Engineering @ Netflix
• 17 + 3 people
• Two groups
• Largest Single Cloud Spend Item
Real-Time Operational Insight
What does that mean?
What does it all mean?
John Boyd
OODA
• Observe
• Orient
• Decide
• Act
“This approach favors agility over raw power in dealing with human opponents in any endeavor” - Wikipedia
This Is What We Do
OODA
• Observe: Telemetry platforms
• Orient: Data Viz, Graphs, Dashboards
• Decide: Alerting, Decision-Output Systems
• Act: Remediation
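The mapping above can be pictured as a four-stage pipeline. The following is a minimal sketch only: the function names, the error-rate samples, and the "latest value exceeds twice the mean" rule are all hypothetical and stand in for whatever telemetry, visualization, alerting, and remediation systems actually fill each stage.

```python
from statistics import mean


def observe(source):
    """Observe: collect raw telemetry (here, a list of numeric samples)."""
    return list(source)


def orient(samples):
    """Orient: condense raw data into a summary a person or system can read."""
    return {"mean": mean(samples), "latest": samples[-1]}


def decide(summary, factor=2.0):
    """Decide: flag the latest value if it exceeds `factor` times the mean.

    The rule and the default factor are illustrative assumptions.
    """
    return summary["latest"] > factor * summary["mean"]


def act(should_remediate):
    """Act: return the name of the (hypothetical) action to take."""
    return "remediate" if should_remediate else "no-op"


def ooda_step(source):
    """Run one pass through Observe, Orient, Decide, Act."""
    return act(decide(orient(observe(source))))
```

Calling `ooda_step([1, 1, 1, 1, 10])` yields `"remediate"`, while a flat series such as `[1, 1, 1, 1, 1]` yields `"no-op"`.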
OODA* AT
Observe
• Atlas
• Time Series Database
• Up to ~3B time series
• 6-hour, 3-day, 18-day tiers
• Indefinite Retention
• Custom
• Expensive
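One way to picture the tiers listed above is as retention windows of increasing size, with a query routed to the smallest tier that covers its look-back period. This is only a sketch under that assumption: the slides do not say how Atlas selects a tier, and `TIERS` and `pick_tier` are hypothetical names.

```python
from datetime import timedelta

# Tier windows taken from the slide; the routing rule below is an assumption.
TIERS = [
    ("6h", timedelta(hours=6)),
    ("3d", timedelta(days=3)),
    ("18d", timedelta(days=18)),
]


def pick_tier(lookback):
    """Return the smallest tier whose window covers `lookback`.

    Anything older than the largest window falls through to the
    indefinite-retention store.
    """
    for name, window in TIERS:
        if lookback <= window:
            return name
    return "indefinite"
```

For example, a query over the last hour would map to the `6h` tier and a query over the last 30 days to the indefinite store.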
Observe
• Chronos
• What the Hell Changed?
• Change Reporting System
• Entirely Automated
• ~18 months*
• REST on top of Elasticsearch
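The question a change reporting system answers ("what changed?") amounts to a time-window filter over change events. The sketch below works on in-memory records rather than a real REST or Elasticsearch call; the field names (`time`, `app`, `what`) and the 30-minute default window are hypothetical, not the actual Chronos schema.

```python
from datetime import datetime, timedelta


def changes_before(events, incident_time, window=timedelta(minutes=30)):
    """Return change events in the `window` leading up to `incident_time`,
    most recent first."""
    start = incident_time - window
    hits = [e for e in events if start <= e["time"] <= incident_time]
    return sorted(hits, key=lambda e: e["time"], reverse=True)


# Made-up sample events, for illustration only.
EVENTS = [
    {"time": datetime(2017, 3, 30, 11, 0), "app": "api", "what": "deploy"},
    {"time": datetime(2017, 3, 30, 11, 45), "app": "auth", "what": "config change"},
    {"time": datetime(2017, 3, 30, 11, 55), "app": "api", "what": "scaling event"},
]

recent = changes_before(EVENTS, datetime(2017, 3, 30, 12, 0))
```

Here `recent` holds the scaling event followed by the config change; the 11:00 deploy falls outside the window.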
Observe
• Salp
• Based on Dapper
• Distributed Trace
• Dependency Discovery
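Dependency discovery from Dapper-style traces can be sketched as follows: each span records its parent span, so whenever a child span's service differs from its parent's, the parent service depends on the child service. The span field names (`trace_id`, `span_id`, `parent_id`, `service`) and the sample data are assumptions for illustration, not Salp's actual data model.

```python
from collections import defaultdict


def service_dependencies(spans):
    """Derive caller -> callee service edges from parent/child span links."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    deps = defaultdict(set)
    for span in spans:
        parent = by_id.get((span["trace_id"], span["parent_id"]))
        # Only cross-service calls count as dependencies.
        if parent is not None and parent["service"] != span["service"]:
            deps[parent["service"]].add(span["service"])
    return {caller: sorted(callees) for caller, callees in deps.items()}


# Made-up spans from a single trace.
SPANS = [
    {"trace_id": "t1", "span_id": 1, "parent_id": None, "service": "edge"},
    {"trace_id": "t1", "span_id": 2, "parent_id": 1, "service": "api"},
    {"trace_id": "t1", "span_id": 3, "parent_id": 2, "service": "playback"},
    {"trace_id": "t1", "span_id": 4, "parent_id": 2, "service": "auth"},
    {"trace_id": "t1", "span_id": 5, "parent_id": 2, "service": "api"},
]

graph = service_dependencies(SPANS)
```

With these spans, `graph` maps `edge` to `api`, and `api` to `auth` and `playback`; the `api`-to-`api` span is ignored because it stays within one service.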
Orient
• Already Showed Some
• Lumen: Extensible Dashboards
• JSON Construction
• Surprising Use Cases
• Easily Extensible
• Data Sources
• Visualizations
Decide
• Alerting: Simple
• Alerting: Not Simple
sps_all,nf.region,(,us-east-1,),:in,name,COUNTER-playback-action-APIPlaybackStartAction,:eq,:and,isSupplementalVideo,false,:eq,:and,operation,onComplete,:eq,:and,device.operationalName,streaming_stick,,:eq,:and,:sum,:set,sps_1h_offset,sps_all,:get,1h,:offset,:set,entering_trough,sps_1h_offset,:get,0.95,:mul,sps_all,:get,:gt,:set,smoothed,sps_all,:get,10,0.1,0.02,:des,:set,low_volume,smoothed,:get,-0.005,:mul,0.1,:add,:set,mid_volume,smoothed,:get,-0.00125,:mul,0.1,:add,:set,base,0.06,:set,min_pct,1,smoothed,:get,20,:lt,low_volume,:get,:mul,smoothed,:get,80,:lt,mid_volume,:get,:mul,:add,entering_trough,:get,0.05,:mul,:add,base,:get,:add,:sub,10,0.1,0.02,:des,:set,sps_all,:get,min_pct,:get,smoothed,:get,:mul,:lt,10,:rolling-count,5,:ge,$device.operationalName Alert Trigger,:legend
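As a reading aid, the tail of the expression above appears to say: trigger when the signal is below `min_pct` times its smoothed value (`:des`, double exponential smoothing) in at least 5 of the last 10 datapoints (`:rolling-count`, `:ge`). The sketch below illustrates that idea with a textbook Holt smoother and a fixed `min_pct`. Both are simplifying assumptions: the expression computes `min_pct` dynamically from volume, and Atlas's own `:des` implementation, including its training window, may differ.

```python
from collections import deque


def des(values, alpha=0.1, beta=0.02):
    """Holt double exponential smoothing.

    Returns, for each point, the prediction made before seeing that point.
    """
    level, trend = values[0], 0.0
    preds = [values[0]]
    for x in values[1:]:
        pred = level + trend
        preds.append(pred)
        new_level = alpha * x + (1 - alpha) * pred
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return preds


def alert_flags(values, min_pct=0.5, window=10, min_hits=5):
    """Flag each point where the signal was below min_pct * prediction
    in at least `min_hits` of the last `window` datapoints."""
    recent = deque(maxlen=window)
    flags = []
    for x, pred in zip(values, des(values)):
        recent.append(x < min_pct * pred)
        flags.append(sum(recent) >= min_hits)
    return flags


# Made-up signal: steady at 100, then a sustained drop to 10.
signal = [100.0] * 20 + [10.0] * 10
flags = alert_flags(signal)
```

With this made-up signal the alert does not fire on the first four low datapoints and starts firing on the fifth, which is the point of requiring several violations in the window rather than reacting to a single dip.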
Decide
• Alerting: Simple
• Alerting: Not Simple
• Kepler: Outlier Detection
• Automated Canary Analysis
• Active:
• UVAD
• MVAD
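To make the outlier detection item above concrete: the idea is to find the instance in a group that behaves unlike its peers. The slides do not describe the algorithm Kepler uses, so the sketch below substitutes a common, simple technique, the modified z-score based on median absolute deviation (MAD). The function name, the 3.5 threshold, and the sample values are all illustrative assumptions.

```python
from statistics import median


def mad_outliers(values_by_instance, threshold=3.5):
    """Return instance names whose modified z-score exceeds `threshold`.

    Modified z-score = 0.6745 * |x - median| / MAD.
    """
    values = list(values_by_instance.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical; nothing to compare against.
        return []
    return sorted(
        name
        for name, v in values_by_instance.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Made-up per-instance error rates.
cluster = {"i-1": 10, "i-2": 11, "i-3": 9, "i-4": 10, "i-5": 50}
outliers = mad_outliers(cluster)
```

Here only `i-5` is reported. A median-based measure is used in this sketch because a single bad instance cannot drag the baseline toward itself the way it would with a mean and standard deviation.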
Act
OODA
• Observe: Telemetry platforms
• Orient: Data Viz, Graphs, Dashboards
• Decide: Alerting, Outlier Detection, Anomaly Detection, Canary Analysis
• Act: Remediation
?