Kentik Detect Engine - Network Field Day 2017
KDE (Kentik Detect Engine) Quick Stats
NetFlow in the Cloud
• 125+ billion flows/day stored
• 1,000,000+ FPS
• 50 “large” queries/s, thousands of sub-qps
• 75+ TB flow data stored/day (25+ compressed)
SNMP, BGP, network performance too!
KDE High-Level
• KDE is a hybrid system:
  ○ Fusing / ingest layer
  ○ Distributed column store DB / query engine
  ○ Realtime stream processing for anomaly detection
• We evaluated various existing engines: ES, Hadoop, Cassandra, Storm, Spark, SiLK, Druid, Kafka...
• Couldn’t find the combination of performance, multi-tenancy, and network savvy, so we wrote our own...
[Diagram: three layers: ingest & fusion layer, storage layer (flow specific), and query layer (query engine and UI). Query interfaces for data sources and clients: SQL, WWW, REST. Example query: SELECT flow FROM router WHERE …]
Each layer has separate and different scaling characteristics
KDE Architecture
[Diagram: a BGP VIP and the enKryptor sit in front of the KDE ingest layer, which contains relays, proxies, and clients (C) and feeds the storage layer and the streaming layer. Link labels: NetFlow (UDP), kFlow (HTTPS), kFlow (HTTP).]
VIP + Relay
• One IP bound to multiple servers
• Sharded by source IP
• Validate sender as Kentik customer
• Pass flow on (raw UDP socket) to correct proxy
• Relay handles load balancing (Kentik specific, UDP+TCP)
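The relay steps above (validate the sender, shard by source IP, hand off to a proxy) can be sketched roughly as follows. This is a minimal illustration, not Kentik's relay: the proxy names, the allowlist `KNOWN_SENDERS`, and the hash-modulo scheme are all assumptions, since the slides do not say how a source IP is mapped to a proxy.

```python
import hashlib
import ipaddress

# Hypothetical proxy pool and customer allowlist; neither appears in the slides.
PROXIES = ["proxy-0", "proxy-1", "proxy-2"]
KNOWN_SENDERS = {
    ipaddress.ip_address("192.0.2.10"),
    ipaddress.ip_address("198.51.100.7"),
}


def pick_proxy(source_ip, proxies=PROXIES):
    """Map a source IP to one proxy so a device always lands on the same shard."""
    digest = hashlib.sha256(ipaddress.ip_address(source_ip).packed).digest()
    return proxies[int.from_bytes(digest[:8], "big") % len(proxies)]


def route(source_ip):
    """Return the proxy for a validated sender, or None to drop the packet."""
    if ipaddress.ip_address(source_ip) not in KNOWN_SENDERS:
        return None  # sender is not a known customer device
    return pick_proxy(source_ip)
```

Hashing the source IP keeps all flow from one device on one proxy, which matters because the proxy runs a dedicated client process per device.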
Proxy
• Inspect flow & determine type: V5, V9, IPFIX, sFlow, kFlow
• Need to resample?
• Configured sample rate
• Launch client process for each device
• Poll for device changes
• Monitor health
• Relaunch on client crash
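The "inspect flow & determine type" step can be illustrated by reading the version field at the start of each UDP payload. NetFlow v5, NetFlow v9, and IPFIX start with a 16-bit version (5, 9, and 10 respectively), while an sFlow v5 datagram starts with a 32-bit version equal to 5. The function name `detect_flow_type` is hypothetical, and kFlow is not handled because its wire format is not described in the slides.

```python
import struct


def detect_flow_type(datagram):
    """Guess the export format from the leading version field of a UDP payload."""
    if len(datagram) < 4:
        return "unknown"
    (version16,) = struct.unpack_from("!H", datagram, 0)
    if version16 == 5:
        return "netflow-v5"
    if version16 == 9:
        return "netflow-v9"
    if version16 == 10:
        return "ipfix"
    # sFlow v5 uses a 32-bit version field, so its first two bytes are zero.
    (version32,) = struct.unpack_from("!I", datagram, 0)
    if version32 == 5:
        return "sflow-v5"
    return "unknown"
```

A real proxy would validate the rest of the header as well; checking only the version field is enough to pick a decoder in this sketch.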
Client (where the magic happens)
• One per device configured to send flow
• * (any flow format) goes in, kFlow comes out
[Diagram: NetFlow, sFlow, and IPFIX enter the client (C); kFlow comes out.]
Step 2: Enrichment
• BGP - Route data for xxx
• GeoIP - Where does my traffic start and end
• SNMP - Interface names and descriptions
• Tagging - Business classification: cost centers, user info, peering info
• App-specific data - URL/DNS requests, MySQL query
• Performance data (NPM) - Retransmits, network latency, application latency
• Coming soon:
  ○ Timestamped event data (syslog)
  ○ Threat feeds
DATA FUSION in CLIENT
[Diagram: decoder modules (NetFlow v5, NetFlow v9, IPFIX, sFlow) take flow from the router, and PCAP from a PCAP agent via the proxy. Mem tables (BGP RIB, custom tags, Geo ←→ IP, ASN ←→ IP) are fed by the SNMP poller, BGP daemon, and enrichment DB. Data fusion combines both into a single fused flow row that is sent to storage, the flow-friendly datastore.]
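The fusion of a decoded flow with in-memory lookup tables can be sketched as below. Everything here is illustrative: the table contents, the field names (`src_ip`, `in_if`, and so on), and the longest-prefix lookup are assumptions standing in for the real GeoIP, ASN, and SNMP data.

```python
import ipaddress

# Hypothetical in-memory tables standing in for GeoIP, ASN, and SNMP data.
GEO = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("192.0.2.128/25"): "CA",
    ipaddress.ip_network("198.51.100.0/24"): "DE",
}
ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}
IF_NAMES = {3: "xe-0/0/3"}


def lookup(table, ip):
    """Return the value of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]


def fuse(flow):
    """Return a copy of the flow row with geo, ASN, and interface name added."""
    fused = dict(flow)
    for side in ("src", "dst"):
        fused[f"{side}_geo"] = lookup(GEO, flow[f"{side}_ip"])
        fused[f"{side}_asn"] = lookup(ASN, flow[f"{side}_ip"])
    fused["in_if_name"] = IF_NAMES.get(flow["in_if"])
    return fused
```

Doing the join at ingest time means each stored row already carries its context, so queries do not need to join against lookup tables later.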
Step 3: Resampling & Unification
• Long term (>1 month)
• What a process (device) said over an hour
• Two tricks:
  ○ Flow unification
  ○ Resampling
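The slides do not define the two tricks, so the following is only one plausible reading. It assumes unification means merging rows that share a flow key within the same hour, and resampling means keeping one row in N while scaling its counters by N. Field names and both functions are hypothetical.

```python
from collections import defaultdict

HOUR = 3600


def unify(flows):
    """Merge rows that share a flow key within the same hour, summing counters."""
    merged = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for f in flows:
        key = (f["ts"] // HOUR * HOUR, f["src_ip"], f["dst_ip"], f["dst_port"])
        merged[key]["bytes"] += f["bytes"]
        merged[key]["packets"] += f["packets"]
    return [
        {"ts": k[0], "src_ip": k[1], "dst_ip": k[2], "dst_port": k[3], **v}
        for k, v in sorted(merged.items())
    ]


def resample(flows, keep_one_in):
    """Keep every Nth row and scale its counters by N so totals stay comparable."""
    return [
        {**f, "bytes": f["bytes"] * keep_one_in, "packets": f["packets"] * keep_one_in}
        for f in flows[::keep_one_in]
    ]
```

Both steps trade per-flow detail for a smaller long-term footprint while keeping hourly totals approximately intact.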
Storage Layer
• Fused kFlow as input... Cap'n Proto (like Protocol Buffers)
• Shard data into small chunks
• HTTP to N distributed storage nodes
• Metadata supervisor DB handles shard locations
• Row oriented to column oriented
• Compressed using ZFS
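The "shard into small chunks" and "row oriented to column oriented" steps can be pictured with a small sketch. The helper names `chunk` and `rows_to_columns` and the dict-of-lists layout are assumptions for illustration; the actual on-disk format is not described in the slides.

```python
def chunk(rows, size):
    """Split a list of rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def rows_to_columns(rows):
    """Pivot a list of row dicts into one equal-length list per field."""
    columns = {}
    for i, row in enumerate(rows):
        for name, value in row.items():
            # A field first seen at row i is back-filled with None for rows 0..i-1.
            columns.setdefault(name, [None] * i).append(value)
        # Pad columns absent from this row so all columns stay the same length.
        for col in columns.values():
            if len(col) <= i:
                col.append(None)
    return columns
```

Storing each field contiguously lets a query read only the columns it needs, and similar values sitting next to each other tend to compress well.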
Multi-Tenancy DB
Needed multi-tenancy for a large-scale SaaS product
Could not find other DBs at scale with it
We succeeded by building in:
● Fairness: queries are chopped into small chunks, users are rate limited and prioritized
● Security: data is isolated between “users” down to the thread level
● Multiuser caching with fairness: built a cache that cannot be monopolized by any one user
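The fairness idea (chop queries into small chunks, then interleave users) can be sketched with a round-robin scheduler. This is a simplified illustration under assumed names (`FairScheduler`, `submit`, `next_chunk`); it leaves out the rate limiting and prioritization the slide also mentions.

```python
from collections import OrderedDict, deque


class FairScheduler:
    """Round-robin over users so one large query cannot starve the others."""

    def __init__(self):
        self.queues = OrderedDict()  # user -> deque of pending query chunks

    def submit(self, user, chunks):
        """Queue the chunks of one query for a user."""
        self.queues.setdefault(user, deque()).extend(chunks)

    def next_chunk(self):
        """Pop one chunk from the user at the head, then move that user to the back."""
        if not self.queues:
            return None
        user, queue = next(iter(self.queues.items()))
        chunk = queue.popleft()
        del self.queues[user]
        if queue:
            self.queues[user] = queue  # re-insert at the back of the rotation
        return user, chunk
```

With a three-chunk query from one user and a one-chunk query from another, the small query runs second instead of waiting behind all three chunks.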
Query engine and UI
[Diagram: the three layers again (ingest & fusion, storage, query), with query interfaces SQL, WWW, and REST. Example query: SELECT flow FROM router WHERE …]
● SQL interface: PSQL FDW
● UI/UX featuring advanced data viz
● REST API based interface: build your own
Anomaly Detection
● Network + NPM specific
● Policy based, customizable
● Granular itemization and metrics
  ○ Look at top-100 country, IP, port, ASN, site, path, ...
  ○ Unique senders, bps, pps, rxmits, latency
● Over/under static thresholds
● Over/under what’s “normal” (baselining)
● Perform actions
  ○ E-mail, Slack, JSON, PagerDuty
  ○ Mitigation (A10, Radware, BGP)
• DDoS is a simple use case of anomaly detection
• V1 anomaly detection relied on KDE queries, which proved abusive
• V2 needed stream processing and in-RAM baseline storage
• Typically avoided streaming DBs due to aggregation
• Streaming DBs for anomaly detection + our long-term flow storage is a powerful combination
• Evaluated Spark, Storm, Samza, PipelineDB. All failed
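The two comparison styles mentioned for anomaly detection, static thresholds and deviation from what is "normal", can be sketched as follows. The rolling-mean baseline, the `ratio` parameter, and all names are assumptions; the slides only say that V2 keeps baselines in RAM.

```python
from collections import deque
from statistics import mean


def check_static(value, low, high):
    """Compare a value to fixed bounds; return 'over', 'under', or None."""
    if value > high:
        return "over"
    if value < low:
        return "under"
    return None


class Baseline:
    """Keep recent values in RAM and flag values far from their mean."""

    def __init__(self, window=60, ratio=2.0):
        self.history = deque(maxlen=window)
        self.ratio = ratio

    def check(self, value):
        """Return 'over', 'under', or None, then record the value."""
        verdict = None
        if self.history:
            normal = mean(self.history)
            if value > normal * self.ratio:
                verdict = "over"
            elif value < normal / self.ratio:
                verdict = "under"
        self.history.append(value)
        return verdict
```

A production baseline would more likely account for time of day and day of week; a plain rolling mean is used here only to keep the example short.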
Detecting Anomalies
Streaming layer
[Diagram: kFlow (multiple kFPS) enters the streaming layer and is matched against policies (Policy #1, Policy #2). Three aggregation layers (#1, #2, #3) sum (Σ) the data over 1 s, 1 min, and 1 hour windows. Each policy has an aggregation filter plus thresholds and actions; a threshold comparator triggers the action.]
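The cascade of aggregation windows feeding a threshold comparator can be sketched as below. The policy structure (`metric`, `filter`, `threshold`), the field names, and the choice to compare at the 1 min level are all assumptions made for illustration.

```python
from collections import defaultdict


def rollup(buckets, width):
    """Sum (timestamp, value) pairs into buckets of `width` seconds."""
    out = defaultdict(int)
    for ts, value in buckets:
        out[ts // width * width] += value
    return sorted(out.items())


def evaluate(flows, policy):
    """Filter flows by the policy, aggregate 1 s -> 1 min, and return triggers."""
    matching = [(f["ts"], f[policy["metric"]]) for f in flows if policy["filter"](f)]
    per_second = rollup(matching, 1)
    per_minute = rollup(per_second, 60)
    # Threshold comparator: every 1 min bucket above the threshold is a trigger.
    return [(ts, total) for ts, total in per_minute if total > policy["threshold"]]
```

Because each level is built from the one below it, a 1 hour view can be produced by calling `rollup(per_minute, 3600)` without touching the raw flows again.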