Post on 24-Jul-2020

AI Driven Day2 Operation

Lai Kwai Seng

Technical Solution Architect, Cisco Systems

Agenda

© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public

• Introduction to Data Center Telemetry

• Data Center Telemetry Use Cases

• Operationalizing Telemetry

• Network Insights Resources

• Network Insights Advisor

• Network Assurance

• Key Takeaways

syslog

SNMP

CLI

Hard to Operationalize

Incomplete

Unstructured

Device-Specific

Slow

How to manage Network?

Network Telemetry Frees the Data

As Much Useful Data, As Efficiently as Possible

Where Data Is Created: Sensing & measurement

Where Data Is Useful: Storage & analysis

Key Telemetry Characteristics

Efficient Delivery

Tool-Chain consumption and Integration

Structure and Automation

Data-model Driven

Consistent format

Push not Pull

Analytics-ready Data

UDP
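
The "push not pull" and "data-model driven" characteristics above can be pictured with a minimal sketch. Everything in it is hypothetical: the `InterfaceCounters` record, the `TelemetryPublisher` class, and the sample values are illustrative assumptions, not a Cisco API. It shows a device-side publisher emitting one consistently structured record per interval to each subscriber, rather than waiting to be polled.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class InterfaceCounters:
    """Hypothetical data model: every record carries the same typed fields."""
    device: str
    interface: str
    timestamp_ms: int
    rx_bytes: int
    tx_bytes: int


class TelemetryPublisher:
    """Hypothetical device-side publisher that pushes records to subscribers."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, record: InterfaceCounters) -> None:
        # Serialize once in a consistent format, then push to every subscriber.
        payload = json.dumps(asdict(record), sort_keys=True)
        for callback in self._subscribers:
            callback(payload)


received: List[dict] = []
publisher = TelemetryPublisher()
publisher.subscribe(lambda payload: received.append(json.loads(payload)))

# The device pushes on its own cadence; the collector never polls.
for tick in range(3):
    publisher.publish(
        InterfaceCounters("leaf-1", "Ethernet1/1", 1000 * tick, 500 * tick, 250 * tick)
    )

print(len(received), received[-1]["rx_bytes"])
```

Because every record follows the same model, the collector can load it straight into an analytics tool chain without per-device parsing.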

Why This Matters Now

What hasn’t changed: Use Cases

• Network Health

• Anomaly detection

• Troubleshooting / Remediation

• SLAs, Performance Tuning

• Capacity Planning

• Security

What has changed: Trends, Capabilities

• Real time statistics

• Centralized / Software-defined

• Speed

• Scale

Data Center Visibility Use Cases

Network Health

• CPU and memory utilization

• Forwarding table utilization

• Protocol state and events

• Environmental data

Path and Latency Measurement

• End-to-end visibility

• Path tracing over time

• Flow latency monitoring

Network Performance

• Interface utilization

• Buffer monitoring

• Microburst detection

• Drop event correlation
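
As a rough illustration of the microburst detection item above, the sketch below scans buffer-occupancy samples for short excursions above a threshold. The function name, the sample format, the threshold, and the duration limit are all assumptions made for this example; real platforms expose buffer data in their own formats.

```python
from typing import List, Tuple


def find_microbursts(
    samples: List[Tuple[int, int]],
    threshold: int,
    max_duration_us: int,
) -> List[Tuple[int, int, int]]:
    """Return (start_us, end_us, peak) for short excursions above threshold.

    samples: (timestamp_us, buffer_occupancy) pairs sorted by time.
    Excursions longer than max_duration_us are treated as sustained
    congestion rather than microbursts and are not reported.
    """
    bursts = []
    start = None
    last_above = 0
    peak = 0
    for ts, occupancy in samples:
        if occupancy > threshold:
            if start is None:
                start, peak = ts, occupancy
            peak = max(peak, occupancy)
            last_above = ts
        elif start is not None:
            if last_above - start <= max_duration_us:
                bursts.append((start, last_above, peak))
            start = None
    # Close an excursion that is still open when the data ends.
    if start is not None and last_above - start <= max_duration_us:
        bursts.append((start, last_above, peak))
    return bursts


# Hypothetical samples: one 100 us spike, then 400 us of sustained congestion.
samples = [
    (0, 10), (100, 900), (200, 950), (300, 20), (400, 15),
    (500, 800), (600, 820), (700, 810), (800, 830), (900, 805), (1000, 12),
]
bursts = find_microbursts(samples, threshold=500, max_duration_us=200)
print(bursts)
```

The sampling interval matters: a burst shorter than the gap between samples is invisible to this approach, which is one reason fine-grained streaming is preferred over slow polling.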

Memory

Power

Temperature

CPU

TCAM

System Info and Environmentals

Are my switches healthy?

Alert: Neighbor Lost!

OSPF Routes over Time

Protocol State and Events

OSPF Process State

Process ID 10

Router ID 10.1.1.1

Area 0.0.0.0

OSPF Interfaces

105

Hypervisor Hypervisor

Is routing working as expected?

Monitoring Buffer Utilization and Drops

Incast or other oversubscription

Packet drops!

I see queue drops – but who’s affected?!

Network Path and Latency Measurement

Application performance is slow between Server A & Server B!

Server A Server B

Network Insights Resources - Customer Benefits

Network Insights Resources

Resource Utilization

Fabric-Wide Capacity Planning, Trend Monitoring

Troubleshoot Application Latency

Identify Traffic/Protocol behavior

Identify/Predict Failing Devices

Operations

Event Analytics

Endpoint Analytics

Avoid Environmental (CPU, Power, Memory, Fan, Storage) Related Failures

Identify Subtle Path-Related issues

Track endpoint details and moves

Statistics

Environmental Monitoring

Flow Analytics

NIR Architecture

Data Lake

Data Lake Connector

Telemetry Sources: ACI/NX-OS

Hardware & Software

Message Bus (Kafka)

REST APIs

Anomaly & Correlation Engines

Telemetry Collectors

REST Client

NIR GUI

NIR

Correlation Engine

Correlate normalized telemetry data streams from Transformation Receiver

LLDP

Buffer and Queue stats

Flow details

End-to-end Flow Path

End-to-end Path Latency

Buffer Occupancy and drops along Flow Path

Correlation based on timestamp and matching 5-tuple

Pipelines

Configs
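
The correlation step on this slide (timestamp plus matching 5-tuple) can be sketched as follows. This is an illustrative assumption of how such a join might look, not the NIR implementation: the `FlowRecord` fields, the `correlate` function, and the sample records are hypothetical, and the sketch assumes switch clocks are synchronized so timestamps are comparable.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

# src_ip, dst_ip, src_port, dst_port, protocol
FiveTuple = Tuple[str, str, int, int, str]


class FlowRecord(NamedTuple):
    """Hypothetical per-switch flow record after normalization."""
    switch: str
    flow: FiveTuple
    timestamp_us: int
    queue_drops: int


def correlate(records: List[FlowRecord]) -> Dict[FiveTuple, dict]:
    """Group records by 5-tuple, then order each group by timestamp."""
    by_flow = defaultdict(list)
    for rec in records:
        by_flow[rec.flow].append(rec)

    result = {}
    for flow, recs in by_flow.items():
        recs.sort(key=lambda r: r.timestamp_us)
        result[flow] = {
            # Switches in the order the flow was seen: the end-to-end path.
            "path": [r.switch for r in recs],
            # First-seen to last-seen time: the end-to-end path latency.
            "latency_us": recs[-1].timestamp_us - recs[0].timestamp_us,
            # Drops observed along the path, keyed by switch.
            "drops": {r.switch: r.queue_drops for r in recs if r.queue_drops},
        }
    return result


flow_a: FiveTuple = ("10.0.0.1", "10.0.1.1", 40000, 443, "tcp")
flow_b: FiveTuple = ("10.0.0.2", "10.0.1.9", 40001, 80, "tcp")
records = [
    FlowRecord("leaf-2", flow_a, 1030, 0),
    FlowRecord("leaf-1", flow_a, 1000, 0),
    FlowRecord("spine-1", flow_a, 1012, 4),
    FlowRecord("leaf-1", flow_b, 2000, 0),
]
summary = correlate(records)
print(summary[flow_a])
```

Joining on the 5-tuple is what lets a queue drop on one switch be attributed to the specific flows, and therefore the endpoints, that crossed it.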

Operational Intelligence Engine for Network Insights

Dynamic Correlation: Correlate information across data sources

Failure Prediction & Corrective Action: Ability to predict failure and provide corrective action

Intelligent Insights: Ability to discover information with ease

Proactive Alerts: See problems before end users do and alert

Increase Availability and Performance

Network Insights Advisor - Customer Benefits

Network Insights Advisor

Software/Hardware Recommendations

Workarounds

Avoid multiple TAC calls

Significant CAPEX and OPEX Savings

Remove Complexity

Avoid Outages

Faster Deployment times

Anomalies

Forwarding State Check

Network Anomaly Detection

Keep Network up to date

Adhere to Cisco policies

Recommendations

Prevent traffic black holing

Avoid downtimes

Known Bugs/PSIRTs

Unknown runtime

Config anomalies

EOL/EOS

Field Notices

SMUs

Version Scale

Limits/Hardening

Check

Configuration

Network Insights Advisor Targeted Use Cases

Proactive supportability insights

Fabric wide analysis

Advisories

Provides advisories based on anomalies, bugs, PSIRTs and field notices. Measure upgrade impact

Dashboard: “Give me a summary of issues”

Anomalies

Hardening checks, scale checks

Bugs and PSIRTs

Known bugs and vulnerabilities in the system

Network

Provides:

• Running config of all devices

• “show tech” from all devices (including APIC)

Cisco

Provides:

• Best practices updates

• PSIRTs, FNs, EOS/EOL

• Software release notifications

• Digitized signatures of known defects

First, We Need Data!

NIA

Cisco

Every 24h

Cloud Data

User-specified interval

Network On-prem Data

Known Bugs

Use Case – Notify About Issues

Fabric

NIA

Insight DB

Weekly Sync

1 Monitor

2 Detect

3 Alert / Inform

4 Implement

Detected: CSCDT2396 SAL1820SDRE

Recommend: Upgrade S/W to NXOS 7.0(3)I7(3)

Detect, Alert, Remediate

Network Insights issue detection

Hardening Check

Signature Matching

Advisory Services

NIA – Core

Storage

Tech Support and ‘show run’ collection

Data Sources

Interacting with Cisco Services via NIA-PROXY

NIA – GUI

Tech support files collected from each switch are matched against signatures of externally known caveats

The hardening guide is digitized into signatures, which are matched against the ‘show run’ output of each switch

Insights DB

Bugs/PSIRTs detection

Updated periodically with signatures from the cloud
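
The idea of digitizing a hardening guide into signatures and matching them against each switch's ‘show run’ output can be sketched as below. This is only an illustration under stated assumptions: the `Signature` structure, the signature IDs, the rules themselves, and the sample configuration text are invented for the example and do not reflect NIA's actual signature format or Cisco's hardening guide.

```python
import re
from typing import Dict, List, NamedTuple


class Signature(NamedTuple):
    """Hypothetical digitized hardening rule."""
    sig_id: str
    description: str
    pattern: str      # regular expression applied per config line
    must_match: bool  # True: line must be present; False: line must be absent


# Invented rules for illustration only.
SIGNATURES = [
    Signature("HARD-001", "Telnet server should be disabled", r"^feature telnet$", False),
    Signature("HARD-002", "SSH server should be enabled", r"^feature ssh$", True),
]


def check_config(running_config: str, signatures: List[Signature]) -> List[str]:
    """Return the IDs of signatures that the configuration violates."""
    violations = []
    for sig in signatures:
        found = re.search(sig.pattern, running_config, flags=re.MULTILINE) is not None
        if found != sig.must_match:
            violations.append(sig.sig_id)
    return violations


# Illustrative config text, not captured from a real device.
configs: Dict[str, str] = {
    "leaf-1": "hostname leaf-1\nfeature telnet\nfeature ssh\n",
    "leaf-2": "hostname leaf-2\nfeature ssh\n",
}
report = {switch: check_config(cfg, SIGNATURES) for switch, cfg in configs.items()}
print(report)
```

Keeping the rules as data rather than code is what allows the signature set to be refreshed periodically from the cloud without changing the matching engine.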

Use Case – Notify Me About Recommended Releases

Fabric

NIA

Insight DB

1 Monitor

2 Identify Switches

3 Alert / Inform

4 Implement

Push Notification

Notifications

Affected devices: 3 (Leaf 1, Leaf 2, Leaf 3)

With BUG ID: XYZ

Recommend: Upgrade S/W to NXOS 7.0(3)I7(3)

Detect, Alert, Remediate

Affected devices

S/W Notify

Network Assurance Engine: How it Works

Data Collection: Capture DC Wide Intent, Policy, Control/State across Forwarding & Security

Formal Modeling of Network: Precise Mathematical Models that codify Cisco’s 30+ Years of Networking and Cross Customer Domain Knowledge

Continuous Analysis: Models verify that Network operates per Intent and accurately tell what is wrong, where, why, impact and how to fix

Reactive Troubleshooting to Proactive Operations - continuously, network wide

Continuous Assurance Workflows

Is my network compliant with Governance Rules?

Compliance Analysis

Did something change in my network?

Epoch Delta Analysis

Can A talk to B?

Connectivity Analysis

Smart Events & Compliance Score for Compliance

COMPLIANCE SATISFIED SMART EVENT

• Identify compliant policy

• Identify requirements satisfied

• Identify compliant EPGs

COMPLIANCE VIOLATED SMART EVENT

• Identify non-compliant policy

• Identify requirements violated

• Identify non-compliant EPGs

COMPLIANCE SCORE

Epoch Delta Analysis

Correlated Ad hoc Analysis Workflow

4 Qs, correlated answers…

• What changed?

• Who was impacted?

• Was it due to config changes?

• What happened as a result?

Use Cases

• Change Management

• Root-cause analysis

• Migration

• Maintenance Upgrades

• Capacity Management

Before / Baseline

After / Current

Health Delta - Summary

Change in the health of the Fabric
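
A minimal sketch of the before/after comparison behind an epoch delta, under loud assumptions: each epoch is modeled here as a flat dictionary of object name to state, and the `epoch_delta` function and sample objects are hypothetical. The real Network Assurance Engine works on far richer models; this only shows the "what changed?" step.

```python
from typing import Dict, NamedTuple, Tuple


class EpochDelta(NamedTuple):
    added: Dict[str, str]
    removed: Dict[str, str]
    changed: Dict[str, Tuple[str, str]]


def epoch_delta(baseline: Dict[str, str], current: Dict[str, str]) -> EpochDelta:
    """Compare two snapshots and report what was added, removed, or changed."""
    added = {k: current[k] for k in current.keys() - baseline.keys()}
    removed = {k: baseline[k] for k in baseline.keys() - current.keys()}
    changed = {
        k: (baseline[k], current[k])
        for k in baseline.keys() & current.keys()
        if baseline[k] != current[k]
    }
    return EpochDelta(added, removed, changed)


# Hypothetical snapshots taken before and after a maintenance window.
before = {"epg-web": "healthy", "epg-db": "healthy", "contract-web-db": "permit tcp/3306"}
after = {"epg-web": "healthy", "epg-db": "degraded", "epg-cache": "healthy"}

delta = epoch_delta(before, after)
print(delta)
```

Answering "who was impacted?" and "was it due to config changes?" would then mean joining this delta against event and audit records for the same time window.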

Epoch Delta Workflow – Policy Delta

Impact, Change, Operator

What got impacted?

Who made the changes?

What has changed?

Details of impact, if any

Forwarding Connectivity Analysis

Use Cases

• Forwarding Communication Issues across entire fabric

• Visibility into Route Leakage

• Visibility into Fabric Communication with External Network

• Policy and Forwarding Inconsistencies

Key Takeaways

• Nexus leads the industry in telemetry capabilities

• Combination of software and hardware streaming provides the deepest level of network visibility

• Platforms for consuming, analyzing, and visualizing telemetry data are available or being developed for both ACI and standalone

• Both Cisco turnkey solutions and custom/third-party integrations exist today
