![Page 1: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/1.jpg)
Next Generation of DevOps: AIOps in Practice @Baidu
Xianping Qu & Jingjing Ha
![Page 2: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/2.jpg)
About Baidu
![Page 3: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/3.jpg)
Agenda
• History of Baidu SRE team
• Next generation of DevOps – AIOps
• Best practice based on AIOps
• Future
![Page 4: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/4.jpg)
Tools (2007-2009)
Dev QA OP
• DevOps
– Deployment
– Monitoring
– Budgeting
– Consulting
– …
• Problems
– Human labor
![Page 5: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/5.jpg)
Systems (2009-2012)
Operation System
Dev QA OP
• Building operation systems
– Service management system
– Monitoring system
– Deployment system
– Traffic scheduling system
– Naming service
– …
• Problems
– Human labor (GUI, configuration)
![Page 6: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/6.jpg)
Platforms (2012-2014)
Operation Service
Dev QA OP
Operation Platform
• Building operation platforms
– API
– Configurable
– Executable
– …
• Problems
– Reusability
– Scalability
![Page 7: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/7.jpg)
Standardization (2014-)
• Building operation standards
– Unified language
– Unified method
– Unified solution
– …
• Problems
– Need a brain
Services, IDCs, Clusters, Servers
![Page 8: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/8.jpg)
AIOps (2014-)
• Intelligent Operation Platforms
– Development framework
– Big data
– Algorithm
• Data mining, machine learning…
![Page 9: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/9.jpg)
Development framework
[Figure: layered development framework; labels listed below]
• User Code 1, User Code 2, User Code …
• User Code Management
• Develop Kit
• Operation Abstraction Layer
• Interface
• Driver, Runtime Environment
• Cloud/PaaS Scheduler
• Develop Tool-chain: IDE, Build, Debug, Simulation, Test, Prof
![Page 10: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/10.jpg)
Operation Knowledge Database
[Figure: Operation Knowledge Database; labels listed below]
• Data source: Management platforms, Monitoring platforms, Operation platforms
• Data production process: raw data, cleaning, mapping, calculating, mining, structural data, resulting data, feedback
• Unified data model: product, app, service, instance, host, IDC, person, network
• Meta data, Metric data, Event data
– throughput, latency, cpu, mem, io, bandwidth, error, disk, rtt, ...
– anomaly, root cause, change, remediation, ...
• service management model, metaDB, TSDB, eventDB
• API & View
• authority, quota controlling
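The three data types on this slide (meta data, metric data, event data) can be pictured as records attached to entities of the unified data model. The following is a minimal sketch only: the class names, fields, and the `events_for` helper are hypothetical and are not Baidu's schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Meta data: one node of the unified data model."""
    kind: str  # for example "service", "instance", "host", "IDC"
    name: str


@dataclass(frozen=True)
class MetricPoint:
    """Metric data: a timestamped measurement for an entity."""
    entity: Entity
    metric: str  # for example "latency" or "cpu"
    timestamp: int
    value: float


@dataclass(frozen=True)
class Event:
    """Event data: something that happened to an entity."""
    entity: Entity
    kind: str  # for example "anomaly" or "change"
    timestamp: int


@dataclass
class KnowledgeBase:
    """In-memory stand-in for the metric and event stores."""
    metrics: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def events_for(self, entity, since):
        """Return events for `entity` with timestamp >= `since`."""
        return [e for e in self.events
                if e.entity == entity and e.timestamp >= since]


web = Entity("service", "web")
kb = KnowledgeBase()
kb.metrics.append(MetricPoint(web, "latency", 100, 12.5))
kb.events.append(Event(web, "change", 90))
kb.events.append(Event(web, "anomaly", 120))
recent = kb.events_for(web, since=100)
print([e.kind for e in recent])
```

Keeping every record keyed by the same `Entity` is what lets a later step correlate, for example, a change event with an anomaly on the same service.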
![Page 11: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/11.jpg)
Solution
Operation Knowledge Database
Deployment Management Incident Management
General Components and Tools (transmission, storage, scheduling…)
…
Operation Development Framework ( SDK, RE )
Ops Algorithm Development
Solution Development
Ops Platform Development
Anomaly detection
Traffic scheduling
Root cause analysis
Trend forecasting
Other data mining & machine learning algorithms
![Page 12: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/12.jpg)
Best practice based on AIOps
• Incident Management
– Single cluster stop-loss by traffic shifting
• Deployment Management
– Unattended deployment with automated checker
• Consulting
– Chatbots handle consultation
![Page 13: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/13.jpg)
When will a failure occur?
• Infrastructure issue
• Program defects
• Change exception
• Dependent service unavailable
![Page 14: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/14.jpg)
How to stop loss?
• Limit failure to one cluster
– Deployment isolation
– Dependency decoupling
– Reduce global risk
When a single cluster fails, perform traffic shifting
• Capacity redundancy
– Availability and cost trade-off
– N+M redundancy
– Service degradation
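The N+M trade-off can be made concrete with a small check. The sketch below assumes every cluster has a known capacity and current load in the same unit (for example queries per second). The `Cluster` type, the cluster names, and the numbers are hypothetical, not figures from the talk. It tests whether the surviving clusters can absorb the total load when any M clusters fail at once.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Cluster:
    name: str
    capacity: float  # maximum load the cluster can serve
    load: float      # load currently served


def survives_failures(clusters, m):
    """Return True if the total load still fits after ANY m clusters fail."""
    total_load = sum(c.load for c in clusters)
    for failed in combinations(clusters, m):
        remaining = sum(c.capacity for c in clusters if c not in failed)
        if remaining < total_load:
            return False
    return True


clusters = [
    Cluster("dc1", capacity=100.0, load=60.0),
    Cluster("dc2", capacity=100.0, load=60.0),
    Cluster("dc3", capacity=100.0, load=60.0),
]
# Total load is 180. One failure leaves capacity 200, two failures leave 100.
print(survives_failures(clusters, 1))
print(survives_failures(clusters, 2))
```

When the check fails for the chosen M, the remaining options on the slide are more redundancy (higher cost) or service degradation (lower load).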
![Page 15: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/15.jpg)
Two-layer traffic shifting @Baidu
[Figure: user sets reach three data centers (Data Center 1, 2, 3) over the Internet. Each data center contains a BGW (L4 LB), a BFE cluster (L7 LB), a web service cluster, and a dependent service cluster.]
• Shift traffic between user set and the edge node
• Shift traffic between front end and back end
BGW: Baidu Gate Way, layer-4 load balancer
BFE: Baidu Front End, layer-7 load balancer
![Page 16: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/16.jpg)
Two-layer traffic shifting @Baidu
• Shift traffic between user set and edge node
– 10 minutes to shift 80% of traffic to a healthy edge node, because of DNS caching on the client side and the ISP side
• Shift traffic between front end and back end
– 10 seconds to shift 100% of traffic to a healthy back end by changing BFE's routing configuration
![Page 17: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/17.jpg)
Shift traffic between front end and back end
Concerns:
• Service capacity
• Intranet bandwidth
Scenarios:
• Web service cluster
• Dependent service cluster
• Internal network switch
[Figure: user sets, Internet, and three data centers (Data Center 1, 2, 3), each with BGW (L4 LB), BFE cluster (L7 LB), web service cluster, and dependent service cluster.]
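The service capacity concern can be illustrated with a small planning function. This is a sketch under stated assumptions, not the traffic shifting algorithm from the talk: it assumes per-cluster load and capacity are known in the same unit, spreads the failed cluster's load over healthy clusters in proportion to their headroom, and ignores intranet bandwidth. The name `plan_shift` and all numbers are hypothetical.

```python
def plan_shift(load, capacity, failed):
    """Move the failed cluster's load to healthy clusters by headroom.

    `load` and `capacity` map cluster name -> number in the same unit.
    Returns the planned load for each healthy cluster.
    """
    healthy = [name for name in load if name != failed]
    # Headroom is spare capacity; an overloaded cluster contributes none.
    headroom = {n: max(capacity[n] - load[n], 0.0) for n in healthy}
    total_headroom = sum(headroom.values())
    to_move = load[failed]
    if to_move > total_headroom:
        raise ValueError("not enough spare capacity to absorb the shift")
    plan = {}
    for name in healthy:
        share = headroom[name] / total_headroom if total_headroom else 0.0
        plan[name] = load[name] + to_move * share
    return plan


load = {"dc1": 50.0, "dc2": 40.0, "dc3": 30.0}
capacity = {"dc1": 100.0, "dc2": 100.0, "dc3": 100.0}
print(plan_shift(load, capacity, failed="dc1"))
```

Splitting by headroom means no healthy cluster is pushed past its capacity as long as the total headroom covers the moved load; the `ValueError` branch is where a real system would fall back to service degradation.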
![Page 18: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/18.jpg)
Shift traffic between user set and the edge node
Concerns:
• Bandwidth
• BGW/BFE capacity
• Delay
Scenarios:
• BGW & BFE
• External network switch
[Figure: user sets, Internet, and three data centers (Data Center 1, 2, 3), each with BGW (L4 LB), BFE cluster (L7 LB), web service cluster, and dependent service cluster.]
![Page 19: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/19.jpg)
Shift traffic between user set and the back end
Concerns:
• Bandwidth
• BGW/BFE capacity
• Delay
• Service capacity
• Intranet bandwidth
Scenarios:
• Entire data center failure
[Figure: user sets, Internet, and three data centers (Data Center 1, 2, 3), each with BGW (L4 LB), BFE cluster (L7 LB), web service cluster, and dependent service cluster.]
![Page 20: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/20.jpg)
Single cluster stop-loss before AIOps
• Perception: traditional monitoring
– Handcrafted anomaly detection
– Lots of false negatives and false positives
• Judgment: manual decision making
– Depends on personal experience
– Partial information
– Wrong or slow decision making
• Execution: scattered scripts
– Poor code quality
– Low availability
– Unreliable at critical moments
– No real-time feedback, which leads to more serious failures
![Page 21: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/21.jpg)
Single cluster stop-loss after AIOps
• Perception: intelligent monitoring
– Intelligent anomaly detection
– Abnormal events stored in Operation Knowledge Database
– High precision and recall
• Judgment: automated decision making
– Relies on Algorithm Platform
– Global information
– Accurate and quick decision making
• Execution: standard framework
– Development framework
– Deployment framework
– High development efficiency
– High availability
– Feedback control
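Precision and recall are the two measures behind the "false positives" and "false negatives" wording on these slides. As a minimal sketch, assuming detected and real incidents can be compared by identifier, they can be computed as follows. The event identifiers are made up for illustration.

```python
def precision_recall(detected, actual):
    """Compare detected event ids with the ids of real incidents.

    precision = true positives / all detections (penalizes false positives)
    recall    = true positives / all real incidents (penalizes false negatives)
    """
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


detected = {"e1", "e2", "e3", "e4"}        # what the detector reported
actual = {"e2", "e3", "e4", "e5", "e6"}    # what really happened
precision, recall = precision_recall(detected, actual)
print(precision, recall)  # 3 of 4 detections are real; 3 of 5 incidents found
```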
![Page 22: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/22.jpg)
The architecture of single cluster stop-loss
[Figure: three-layer architecture; labels listed below]
• Perception: Network monitoring, Business monitoring, Application monitoring
• Judgment: Algorithm Platform (Anomaly detection algorithm, Traffic shifting algorithm)
• Execution: Service Health Control, BFE Configuration Management, Service Degradation, VIP Health Control, DNS Configuration Management, Internal Traffic Load Balancer
• Operation Knowledge Database: Abnormal event DB, Capacity DB, Metric DB
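The perception, judgment, and execution layers can be sketched as three small functions. Everything here is an assumption made for illustration: the talk does not say which detection algorithm is used, so a median absolute deviation (MAD) rule stands in for it, the "exactly one abnormal cluster" rule is a guess at what single cluster stop-loss implies, and execution only returns a description instead of changing any routing.

```python
from statistics import median


def is_anomalous(history, latest, threshold=5.0):
    """Perception: flag `latest` if it is far from the recent median."""
    center = median(history)
    mad = median(abs(x - center) for x in history)
    if mad == 0:
        return latest != center
    return abs(latest - center) / mad > threshold


def judge(anomalies):
    """Judgment: shift traffic only when exactly one cluster is abnormal."""
    abnormal = [name for name, bad in anomalies.items() if bad]
    if len(abnormal) == 1:
        return {"action": "shift_traffic", "away_from": abnormal[0]}
    if abnormal:
        return {"action": "escalate", "clusters": abnormal}
    return {"action": "none"}


def execute(decision):
    """Execution: describe the step a real system would perform."""
    if decision["action"] == "shift_traffic":
        return f"shift traffic away from {decision['away_from']}"
    return decision["action"]


# Hypothetical errors-per-minute history and latest value per cluster.
observations = {
    "dc1": ([10, 12, 11, 9, 10], 200),
    "dc2": ([20, 22, 21, 19, 20], 21),
    "dc3": ([8, 9, 10, 9, 8], 9),
}
anomalies = {name: is_anomalous(hist, latest)
             for name, (hist, latest) in observations.items()}
decision = judge(anomalies)
print(execute(decision))
```

Keeping the three steps separate mirrors the slide: each layer can be replaced (a better detector, a capacity-aware planner, a real routing update) without touching the others.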
![Page 23: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/23.jpg)
Unattended deployment with automated checker
[Figure: deployment pipeline with a confirmation after each check; labels listed below]
• Pipeline: begin, pre-check, deploy, check, deploy, check, …, post-check, end
• Each check is confirmed by a person or a robot
• Check process optimization
– Manual multiple metric dashboards check
– Manual single anomaly event dashboard check
– Automated API check and notify
• Intelligent anomaly detection
• Operation Knowledge Database
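The pipeline on this slide (pre-check, then deploy and check per stage, then post-check) can be sketched as a loop that stops at the first failed automated check. This is an illustration only: `run_pipeline`, the `deploy` and `check` hooks, and the stage names are hypothetical, and a real checker would call the anomaly detection API rather than a lambda.

```python
def run_pipeline(stages, deploy, check):
    """Deploy stage by stage and stop at the first failed check.

    deploy(stage) performs the rollout for one stage.
    check(name) returns True when the named step looks healthy.
    Returns (completed_stages, failed_step_or_None).
    """
    completed = []
    if not check("pre-check"):
        return completed, "pre-check"
    for stage in stages:
        deploy(stage)
        if not check(stage):
            # The robot refuses to confirm, so the pipeline halts here.
            return completed, stage
        completed.append(stage)
    if not check("post-check"):
        return completed, "post-check"
    return completed, None


stages = ["canary", "dc1", "dc2"]
unhealthy = {"dc1"}   # pretend the check after deploying to dc1 fails
deployed = []
completed, failed_at = run_pipeline(
    stages, deployed.append, lambda name: name not in unhealthy
)
print(completed, failed_at, deployed)
```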
![Page 24: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/24.jpg)
Chatbots handle consultation
• Key points of building chatbots

| Query example | Intention | Slots |
| --- | --- | --- |
| Has module xx been deployed from last night until now? | Change query | Module: xx; Time: from last night until now |
| Did module xx get a full-traffic deployment today? | Change query | Module: xx; Time: today; Stage: full traffic |

• Change consultation scenario
1. Accumulate manually labeled queries
2. Train an NLP model offline to understand the questions
3. Translate natural language questions into structured questions
4. Query the Operation Knowledge Database
5. Display results on the SRE service desk
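Step 3, turning a natural language question into a structured one, can be illustrated with a toy parser. The talk uses a trained NLP model; the keyword and regular expression rules below are only a stand-in, and the phrase lists and English wording of the queries are assumptions made for the example.

```python
import re


def parse_change_query(text):
    """Map a question to an intention and slots using simple rules."""
    lowered = text.lower()
    slots = {}
    module = re.search(r"module\s+(\w+)", lowered)
    if module:
        slots["Module"] = module.group(1)
    # Illustrative time phrases; a trained model would generalize these.
    for phrase in ("from last night until now", "today"):
        if phrase in lowered:
            slots["Time"] = phrase
            break
    if "full-traffic" in lowered or "full traffic" in lowered:
        slots["Stage"] = "full traffic"
    is_change = re.search(r"deploy|release", lowered) is not None
    return {"intention": "Change query" if is_change else "unknown",
            "slots": slots}


first = parse_change_query(
    "Has module xx been deployed from last night until now?")
second = parse_change_query(
    "Did module xx get a full-traffic deployment today?")
print(first)
print(second)
```

The structured result (intention plus slots) is what step 4 would turn into a lookup against the Operation Knowledge Database.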
![Page 25: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/25.jpg)
Future
• Dynamic resource allocation
• Capacity management
• Identification of performance problems
• …
![Page 26: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –](https://reader030.vdocuments.us/reader030/viewer/2022040409/5ec66c8fd9efa86407282528/html5/thumbnails/26.jpg)