A Holistic View of Operational Capabilities: Roy Rapoport, Insight Engineering at Netflix

Post on 11-Apr-2017

(One Way to Think of) Operational Insight

Roy Rapoport

@royrapoport or rsr@netflix.com

March 30, 2017

Insight Engineering @ Netflix

• 17 + 3 people

• Two groups

• Largest Single Cloud Spend Item

Real-Time Operational Insight

What does that mean?

What does it all mean?

John Boyd

OODA

• Observe

• Orient

• Decide

• Act

“This approach favors agility over raw power in dealing with human opponents in any endeavor” - Wikipedia

This Is What We Do

OODA

• Observe: Telemetry platforms

• Orient: Data viz, graphs, dashboards

• Decide: Alerting, decision-output systems

• Act: Remediation

OODA* AT

Observe

• Atlas

• Time Series Database

• Up to ~3B time series

• 6H, 3D, 18D tiers

• Indefinite Retention

• Custom

• Expensive
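One way to read the tier list above is that a query is served from the smallest tier whose window covers its time range, with anything older falling through to long-term storage. The sketch below assumes that reading. The tier names, the `pick_tier` helper, and the "archive" fallback are hypothetical illustrations, not Atlas's actual API or storage layout.

```python
from datetime import timedelta

# Tier windows taken from the slide (6H, 3D, 18D); the names are made up.
TIERS = [
    ("6h", timedelta(hours=6)),
    ("3d", timedelta(days=3)),
    ("18d", timedelta(days=18)),
]


def pick_tier(lookback: timedelta) -> str:
    """Return the first (smallest) tier whose window covers the lookback."""
    if lookback < timedelta(0):
        raise ValueError("lookback must be non-negative")
    for name, window in TIERS:
        if lookback <= window:
            return name
    # Assumption: older data lives in indefinite-retention storage.
    return "archive"
```

Under these assumptions, `pick_tier(timedelta(hours=1))` selects the 6-hour tier, while a 30-day lookback falls through to the archive.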

Observe

• Chronos

• What the Hell Changed?

• Change Reporting System

• Entirely Automated

• ~18 months*

• REST on top of ElasticSearch
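The question a change reporting system answers ("what changed?") can be sketched as a time-window filter over recorded change events. The example below is an in-memory stand-in, not Chronos's REST interface: the `ChangeEvent` fields and the `what_changed` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical record shape; the real system's schema is not shown in the slides.
    timestamp: datetime
    app: str
    kind: str          # e.g. "deploy", "config", "scaling"
    description: str


def what_changed(events, app, at, window=timedelta(minutes=30)):
    """Return events for `app` within `window` before `at`, newest first."""
    start = at - window
    hits = [e for e in events if e.app == app and start <= e.timestamp <= at]
    return sorted(hits, key=lambda e: e.timestamp, reverse=True)
```

During an incident, the caller would pass the time the problem started as `at` and scan the returned events for a likely trigger.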

Observe

• Salp

• Based on Dapper

• Distributed Trace

• Dependency Discovery
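Dependency discovery from distributed traces can be illustrated by walking parent/child span links and recording which service called which. This is a minimal sketch in the spirit of Dapper-style tracing; the span dictionary keys (`span_id`, `parent_id`, `service`) and the `discover_dependencies` function are assumptions, not Salp's data model.

```python
from collections import defaultdict


def discover_dependencies(spans):
    """Build a caller -> sorted list of callees map from trace spans.

    Each span is a dict with 'span_id', 'parent_id' (None for a root span),
    and 'service'. An edge is recorded when a child span's service differs
    from its parent's service.
    """
    service_of = {s["span_id"]: s["service"] for s in spans}
    deps = defaultdict(set)
    for s in spans:
        parent = s.get("parent_id")
        if parent is None or parent not in service_of:
            continue  # root span, or parent was not captured
        caller, callee = service_of[parent], s["service"]
        if caller != callee:
            deps[caller].add(callee)
    return {caller: sorted(callees) for caller, callees in deps.items()}
```

Aggregating these edges over many traces would yield a service dependency graph without anyone having to declare dependencies by hand.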

Orient

• Already Showed Some

• Lumen: Extensible Dashboards

• JSON Construction

• Surprising Use Cases

• Easily Extensible

• Data Sources

• Visualizations
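The idea of dashboards built from JSON, with pluggable data sources and visualizations, can be sketched with two small registries. Everything here is hypothetical: the JSON keys (`panels`, `title`, `source`, `viz`, `params`), the decorators, and the `render` function are illustrative assumptions, not Lumen's actual schema or plugin API.

```python
import json

DATA_SOURCES = {}      # name -> function(params) returning a list of numbers
VISUALIZATIONS = {}    # name -> function(values) returning a string


def data_source(name):
    """Decorator that registers a data source plugin under `name`."""
    def register(fn):
        DATA_SOURCES[name] = fn
        return fn
    return register


def visualization(name):
    """Decorator that registers a visualization plugin under `name`."""
    def register(fn):
        VISUALIZATIONS[name] = fn
        return fn
    return register


@data_source("static")
def static_source(params):
    return params["values"]


@visualization("minmax")
def minmax(values):
    return f"min={min(values)} max={max(values)}"


def render(dashboard_json):
    """Render each panel of a JSON dashboard definition to a line of text."""
    dashboard = json.loads(dashboard_json)
    lines = []
    for panel in dashboard["panels"]:
        values = DATA_SOURCES[panel["source"]](panel.get("params", {}))
        lines.append(f'{panel["title"]}: {VISUALIZATIONS[panel["viz"]](values)}')
    return lines
```

In this design, supporting a new backend or chart type means registering one more function; the dashboard definitions themselves stay plain JSON.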

Decide

• Alerting: Simple

• Alerting: Not Simple

sps_all,nf.region,(,us-east-1,),:in,name,COUNTER-playback-action-APIPlaybackStartAction,:eq,:and,isSupplementalVideo,false,:eq,:and,operation,onComplete,:eq,:and,device.operationalName,streaming_stick,,:eq,:and,:sum,:set,
sps_1h_offset,sps_all,:get,1h,:offset,:set,
entering_trough,sps_1h_offset,:get,0.95,:mul,sps_all,:get,:gt,:set,
smoothed,sps_all,:get,10,0.1,0.02,:des,:set,
low_volume,smoothed,:get,-0.005,:mul,0.1,:add,:set,
mid_volume,smoothed,:get,-0.00125,:mul,0.1,:add,:set,
base,0.06,:set,
min_pct,1,smoothed,:get,20,:lt,low_volume,:get,:mul,smoothed,:get,80,:lt,mid_volume,:get,:mul,:add,entering_trough,:get,0.05,:mul,:add,base,:get,:add,:sub,10,0.1,0.02,:des,:set,
sps_all,:get,min_pct,:get,smoothed,:get,:mul,:lt,10,:rolling-count,5,:ge,
$device.operationalName Alert Trigger,:legend
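The core of the expression above is: smooth the signal with double exponential smoothing, flag points where the raw value drops below a fraction of the smoothed value, and fire when at least 5 of the last 10 points are flagged. The sketch below reproduces only that core, under simplifying assumptions: it uses textbook Holt smoothing rather than Atlas's own `:des` implementation (no training period), and it replaces the volume-dependent `min_pct` with a fixed constant. All function names are hypothetical.

```python
from collections import deque


def double_exponential_smoothing(values, alpha, beta):
    """Holt's linear method: one-step-ahead forecast for each input point."""
    if len(values) < 2:
        raise ValueError("need at least two points to seed level and trend")
    level, trend = values[0], values[1] - values[0]
    forecasts = []
    for v in values:
        forecasts.append(level + trend)  # forecast made before seeing v
        prev_level = level
        level = alpha * v + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts


def rolling_count(flags, window):
    """For each point, count the true flags among the last `window` points."""
    recent = deque(maxlen=window)
    counts = []
    for flag in flags:
        recent.append(bool(flag))
        counts.append(sum(recent))
    return counts


def alert_trigger(values, min_pct=0.8, alpha=0.1, beta=0.02, window=10, needed=5):
    """True where the value was below min_pct * forecast in >= `needed` of the last `window` points."""
    smoothed = double_exponential_smoothing(values, alpha, beta)
    below = [v < min_pct * s for v, s in zip(values, smoothed)]
    return [count >= needed for count in rolling_count(below, window)]
```

Requiring several flagged points within a window, rather than a single one, keeps a brief dip from paging anyone while still catching a sustained drop within a few intervals.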

Decide

• Alerting: Simple

• Alerting: Not Simple

• Kepler: Outlier Detection

• Automated Canary Analysis

• Active:

• UVAD

• MVAD
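Outlier detection, as listed above, asks which member of a group of supposedly identical instances behaves differently from its peers. The slides do not describe Kepler's algorithm, so the sketch below swaps in a simple, well-known technique: the modified z-score based on the median absolute deviation (MAD). The `find_outliers` function and its threshold of 3.5 are illustrative assumptions.

```python
from statistics import median


def find_outliers(metric_by_instance, threshold=3.5):
    """Return instance ids whose metric deviates strongly from the group median."""
    values = list(metric_by_instance.values())
    if len(values) < 3:
        return []  # too few peers to define "normal"
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Design choice: when most instances agree exactly, flag any that differ.
        return sorted(k for k, v in metric_by_instance.items() if v != med)
    return sorted(
        k for k, v in metric_by_instance.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )
```

For example, given per-instance error counts of 10, 12, 11, 9, and 250, only the instance reporting 250 is flagged.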

Act

OODA

• Observe: Telemetry platforms

• Orient: Data viz, graphs, dashboards

• Decide: Alerting, outlier detection, anomaly detection, canary analysis

• Act: Remediation

?
