Stabilizing the Jenga tower: Scaling out Ceilometer
Gordon Chung & Pradeep Kilambi
Engineers @ Red Hat, Inc.


Posted on 12-Aug-2015



Our Mission

“To reliably collect measurements of the utilization of physical & virtual resources comprising deployed clouds, persist this data for subsequent retrieval & analysis, and trigger actions when defined criteria are met.”

Overview

● Collect physical and virtual resource data
● Transform data into something measurable
● Publish data to various targets
● Persist data to storage
● Retrieve data via the API for further analysis, billing, triggering actions, etc.

Collect → Transform → Publish → Persist → Retrieve
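The five stages above can be sketched as a toy, in-process flow. This is a minimal Python illustration, not Ceilometer's code: every function name is hypothetical, a plain list stands in for the message bus, a dict stands in for the database, and the transform step only loosely mirrors the rate-of-change conversion Ceilometer applies to turn cumulative `cpu` time into `cpu_util`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    name: str
    unit: str
    volume: float
    resource_id: str
    timestamp: datetime


def collect(resource_id, cpu_time_ns, when):
    """Collect: wrap a raw cumulative CPU-time reading (ns) as a sample."""
    return Sample("cpu", "ns", float(cpu_time_ns), resource_id, when)


def transform(prev, curr, cpu_number=1):
    """Transform: rate of change of cumulative CPU time -> utilisation in %."""
    elapsed_ns = (curr.timestamp - prev.timestamp).total_seconds() * 1e9
    used_ns = curr.volume - prev.volume
    util = 100.0 * used_ns / (elapsed_ns * cpu_number)
    return Sample("cpu_util", "%", util, curr.resource_id, curr.timestamp)


def publish(sample, queue):
    """Publish: hand the sample to a transport (a list stands in for the bus)."""
    queue.append(sample)


def persist(queue, store):
    """Persist: a collector drains the queue into storage, grouped by meter."""
    while queue:
        sample = queue.pop(0)
        store.setdefault(sample.name, []).append(sample)


def retrieve(store, name, start, end):
    """Retrieve: return the samples of one meter with start <= timestamp < end."""
    return [s for s in store.get(name, []) if start <= s.timestamp < end]


# Two readings taken 10 s apart; the second value is made up for illustration.
t0 = datetime(2015, 3, 23, 16, 22, 43)
t1 = t0 + timedelta(seconds=10)
queue, store = [], {}
first = collect("inst-3", 22_450_000_000, t0)
second = collect("inst-3", 27_450_000_000, t1)
publish(transform(first, second), queue)
persist(queue, store)
print(retrieve(store, "cpu_util", t0, t1 + timedelta(seconds=1))[0].volume)  # 50.0
```

Keeping each stage a separate function mirrors why the stages can be scaled independently: the transport and the storage can be swapped without touching collection.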

Architecture (Icehouse)

[Diagram: OpenStack Services publish notifications to the Notification Bus, consumed by Notification Agents (Agent1..AgentN); Polling Agents (Agent1..AgentN) poll resources directly. Both agent types run a Pipeline that feeds Collectors (Collector1..CollectorN), which write Events, Meters and Alarms to a single Database. Also shown: API, External Systems, Alarm Evaluator, Alarm Notifier. Legend: Partial HA support; Active-Active HA support.]

Ceilometer as it’s perceived

Ceilometer

Cloud Admin

“API response too slow”

“When Ceilometer dies, Glance dies.”

“Ceilometer is leaking memory”

“Ceilometer doesn’t scale”

“HAProxy is messing with MongoDB replica-sets”

“Ceilometer is not Production Ready”

Evolution of Ceilometer

Architecture (Juno)

[Diagram: same components as Icehouse (OpenStack Services, Notification Bus, Notification Agents and Polling Agents with their Pipelines, Collectors, API, External Systems, Alarm Evaluator, Alarm Notifier), but storage is split into Databases: Alarms in one, Events and Meters in another. Legend: Partial HA support; Active-Active HA support.]

Active/Active Workload Partitioning
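Active/active workload partitioning lets several polling agents share one set of resources without polling anything twice. The sketch below is hypothetical: it assumes each agent already knows the same list of live group members (Ceilometer gets group membership through the tooz coordination library) and uses a simple stable hash modulo the member count, which is a stand-in rather than Ceilometer's exact assignment scheme.

```python
import hashlib


def owner(resource_id, members):
    """Return the member responsible for a resource.

    Stand-in assignment: a stable SHA-256 hash of the resource id, modulo the
    number of live members. Every agent computes the same answer as long as
    it sees the same membership list.
    """
    ordered = sorted(members)
    digest = hashlib.sha256(resource_id.encode("utf-8")).hexdigest()
    return ordered[int(digest, 16) % len(ordered)]


def my_subset(my_id, members, resources):
    """Filter the full resource list down to the share this agent should poll."""
    return [r for r in resources if owner(r, members) == my_id]


agents = ["agent-1", "agent-2", "agent-3"]  # hypothetical group members
resources = [f"instance-{i:08x}" for i in range(12)]
shares = {a: my_subset(a, agents, resources) for a in agents}
for agent, share in shares.items():
    print(agent, len(share))
```

Plain modulo hashing moves most resources to a new owner whenever an agent joins or leaves; a consistent hash ring limits that reshuffling to roughly the arriving or departing agent's share, which is why ring-based schemes are usually preferred.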

Architecture (Kilo)

[Diagram: same agents, Pipelines, Collectors, API, External Systems, Alarm Evaluator and Alarm Notifier, with Databases now split three ways: Meters, Events and Alarms each stored separately. Legend: Active-Active HA support.]

Best Practices

Best Practices (Data Collection)

● Modify your pipeline to match your requirements
  ○ Collect only the meters you need by tuning pipeline.yaml
  ○ Tweak the polling interval as needed

● Enable jitter for polling (Kilo+)
● Scale out: add agents as load increases (Juno+)
● Use the notifier publisher instead of the rpc publisher (Juno+)
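Two of the collection practices, trimming the meter list and jittering polling, are sketched below in Python. The matching rules are an assumption modelled on the wildcard (`*`) and exclusion (`!`) patterns a pipeline.yaml source can list, simplified here; `meter_wanted` and `jittered_start_times` are hypothetical helpers, not Ceilometer functions.

```python
import fnmatch
import random


def meter_wanted(meter, patterns):
    """Decide whether a meter passes a pipeline-style meter list.

    Assumed, simplified rules: a pattern starting with "!" excludes matching
    meters; any exclusion match drops the meter; otherwise the meter must
    match an inclusion pattern, and a list with only exclusions includes
    everything else.
    """
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    includes = [p for p in patterns if not p.startswith("!")] or ["*"]
    if any(fnmatch.fnmatchcase(meter, p) for p in excludes):
        return False
    return any(fnmatch.fnmatchcase(meter, p) for p in includes)


def jittered_start_times(interval, max_jitter, rounds, rng):
    """Start time (seconds) of each polling cycle, shifted by a random delay.

    The random offset spreads agents that share an interval so they do not
    all poll hypervisors and publish to the bus at the same moment.
    """
    return [i * interval + rng.uniform(0, max_jitter) for i in range(rounds)]


wanted = ["cpu", "network.*", "!network.outgoing.*"]  # hypothetical meter list
for m in ["cpu", "disk.read.bytes", "network.incoming.bytes",
          "network.outgoing.bytes"]:
    print(m, meter_wanted(m, wanted))
print(jittered_start_times(600, 30, 3, random.Random(42)))
```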

Best Practices (Data Storage)

● Avoid open-ended queries; query on a time range
● Install the API behind mod_wsgi
● Tweak WSGIDaemonProcess settings such as threads and processes
● Set a TTL and expire data to minimise database size
● Run MongoDB on a separate node
  ○ Use sharding and replica-sets
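Avoiding open-ended queries means every request to the API carries a time window. The sketch below builds such a request URL in Python, assuming the Ceilometer v2 `GET /v2/meters/<meter>` endpoint with its `q.field` / `q.op` / `q.value` filter triples; the host name `controller` and port 8777 are placeholders for your own API endpoint, and `meter_query_url` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode


def meter_query_url(base, meter, start, end, resource_id=None):
    """Build a v2 API sample query bounded to the window start <= t < end."""
    filters = [("timestamp", "ge", start.isoformat()),
               ("timestamp", "lt", end.isoformat())]
    if resource_id is not None:
        filters.append(("resource_id", "eq", resource_id))
    params = []
    for field, op, value in filters:
        # Each filter is expressed as a q.field / q.op / q.value triple.
        params += [("q.field", field), ("q.op", op), ("q.value", value)]
    return f"{base}/v2/meters/{meter}?{urlencode(params)}"


end = datetime(2015, 3, 23, 17, 0, 0)
print(meter_query_url("http://controller:8777", "cpu_util",
                      end - timedelta(hours=1), end,
                      resource_id="d7f94857-a0d8-4864-8ab1-124055950973"))
```

Bounding both ends of the window lets the database use its timestamp index instead of scanning every sample ever stored for that meter.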

Different Strokes for Different Folks

Deployment Scenarios (Lambda Design)

[Diagram: Polling/Notification Agents publish to Queue1 and Queue2. A pool of short-term Collectors writes to a Short-Term Database, while a long-term Collector writes to an Archive Database.]
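In the lambda layout, agents publish every sample twice, once per queue, so a fast short-term store and a complete archive can be fed independently. A minimal in-process Python sketch, assuming deques stand in for Queue1 and Queue2, lists stand in for the two databases, and a one-day window is an arbitrary example of short-term retention; all function names are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def publish_fanout(sample, queues):
    """Agent side: put a copy of every sample on each configured queue."""
    for q in queues:
        q.append(sample)


def short_term_collector(queue, db, now, window):
    """Drain a queue into a short-term store, dropping data older than `window`."""
    while queue:
        db.append(queue.popleft())
    db[:] = [s for s in db if now - s["timestamp"] < window]


def long_term_collector(queue, archive):
    """Drain a queue into the archive, keeping every sample."""
    while queue:
        archive.append(queue.popleft())


queue1, queue2 = deque(), deque()
short_db, archive_db = [], []
now = datetime(2015, 3, 23, 17, 0)
for hours_ago in (30, 5, 1):
    publish_fanout({"name": "cpu_util", "volume": 10.0,
                    "timestamp": now - timedelta(hours=hours_ago)},
                   [queue1, queue2])
short_term_collector(queue1, short_db, now, timedelta(days=1))
long_term_collector(queue2, archive_db)
print(len(short_db), len(archive_db))  # 2 3
```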

Deployment Scenarios (Data Segregation)

[Diagram: Polling/Notification Agents publish to Queue1 and Queue2. Collectors (labelled short-term and public) write to a Database, while a separate audit Collector writes to an Audit Database.]
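Data segregation differs from the lambda layout in that each record goes to exactly one destination. This Python sketch makes the routing decision on the publishing side; the rule that `identity.*` and `audit.*` event types are audit data, and the `route` helper itself, are purely hypothetical.

```python
from collections import deque

# Hypothetical rule: which event types count as audit data.
AUDIT_PREFIXES = ("identity.", "audit.")


def route(record, public_queue, audit_queue):
    """Send each record to exactly one queue, so audit data never lands in
    the general-purpose database and vice versa."""
    if record["event_type"].startswith(AUDIT_PREFIXES):
        audit_queue.append(record)
    else:
        public_queue.append(record)


public_q, audit_q = deque(), deque()
for event_type in ("compute.instance.create.end", "identity.authenticate",
                   "image.upload"):
    route({"event_type": event_type}, public_q, audit_q)
print([r["event_type"] for r in public_q])
print([r["event_type"] for r in audit_q])
```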

Deployment Scenarios (JSON Files)

[Diagram: Polling/Notification Agents publish to Queue1; Collectors write the data out as JSON files, which are processed by Apache Spark.]
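Writing one JSON object per line keeps the files easy to append to and easy to split for batch tools; Spark's JSON reader, for example, expects that line-delimited layout by default. The Python sketch below is illustrative only: the file name, the helper names and the aggregation job are hypothetical, and the field names are borrowed from Ceilometer's `counter_name` / `counter_volume` sample fields.

```python
import json
import os
import tempfile


def write_json_lines(path, samples):
    """Append samples to `path`, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(json.dumps(sample, sort_keys=True) + "\n")


def total_volume_by_meter(path):
    """Offline job: sum counter_volume per meter, the kind of batch
    aggregation one might hand to Spark instead."""
    totals = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            sample = json.loads(line)
            name = sample["counter_name"]
            totals[name] = totals.get(name, 0) + sample["counter_volume"]
    return totals


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.json")
    write_json_lines(path, [
        {"counter_name": "network.incoming.bytes", "counter_volume": 1200},
        {"counter_name": "network.incoming.bytes", "counter_volume": 800},
        {"counter_name": "disk.read.bytes", "counter_volume": 4096},
    ])
    totals = total_volume_by_meter(path)
print(totals)
```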

Deployment Scenarios (Fraud Detection)

[Diagram: Polling/Notification Agents publish to a Queue; Collectors forward the data over HTTP to a Proprietary Alerting System.]
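For the fraud-detection layout, the collector side only needs to decide which samples look suspicious and POST them to the alerting system. A Python sketch using only the standard library; the endpoint URL, the threshold table and the 10 GiB rule are hypothetical, and the request is built but not sent so the example runs offline.

```python
import json
import urllib.request

# Hypothetical endpoint of the proprietary alerting system.
ALERT_URL = "http://alerts.example.com/ingest"
# Hypothetical rule: flag any single sample above these volumes.
THRESHOLDS = {"network.outgoing.bytes": 10 * 1024 ** 3}  # 10 GiB


def is_suspicious(sample, thresholds=THRESHOLDS):
    """Return True if the sample's meter has a threshold and exceeds it."""
    limit = thresholds.get(sample["counter_name"])
    return limit is not None and sample["counter_volume"] > limit


def build_alert_request(sample, url=ALERT_URL):
    """Wrap a sample as a JSON POST; pass it to urllib.request.urlopen to send."""
    return urllib.request.Request(
        url,
        data=json.dumps(sample, sort_keys=True).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


sample = {"counter_name": "network.outgoing.bytes",
          "counter_volume": 50 * 1024 ** 3,
          "resource_id": "d7f94857-a0d8-4864-8ab1-124055950973"}
if is_suspicious(sample):
    req = build_alert_request(sample)
    print(req.get_method(), req.full_url)
```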

Deployment Scenarios (Custom consumers)

[Diagram: Polling/Notification Agents publish to Kafka, which is consumed by Apache Storm.]

Deployment Scenarios (Debugging)

[Diagram: Polling/Notification Agents publish to an Event Queue; Collectors store the events in ElasticSearch, which Kibana visualises.]
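Loading events into ElasticSearch for Kibana usually goes through the `_bulk` endpoint, whose body is newline-delimited JSON: an action line followed by the document, with a trailing newline. This Python sketch only builds that body; `to_bulk_body` and the index name are hypothetical, and the event fields follow the general shape of Ceilometer events (`message_id`, `event_type`, `generated`, `traits`).

```python
import json


def to_bulk_body(events, index):
    """Build an Elasticsearch _bulk request body (newline-delimited JSON).

    Each event becomes two lines: an action line naming the target index and
    document id, then the event itself. The body must end with a newline.
    """
    lines = []
    for event in events:
        action = {"index": {"_index": index, "_id": event["message_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(event, sort_keys=True))
    return "\n".join(lines) + "\n"


events = [
    {"message_id": "d559f244-d178-11e4-9fa9-28b2bd01ed52",
     "event_type": "compute.instance.create.end",
     "generated": "2015-03-23T16:22:43.412Z",
     "traits": [{"name": "instance_id", "type": "string",
                 "value": "d7f94857-a0d8-4864-8ab1-124055950973"}]},
]
print(to_bulk_body(events, "ceilometer-events-2015.03.23"), end="")
```

Using `message_id` as the document id makes re-sending a batch idempotent: a duplicate overwrites the same document instead of creating a second copy.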

Deployment Scenarios (Noisy Services)

[Diagram: OpenStack Services publish to two separate Notification Buses; Notification Agents (Agent1..AgentN) running a Pipeline feed Collectors (Collector1..CollectorN), which write Meters, Events and Alarms to separate Databases.]

Continual Evolution


Liberty

● Gnocchi integration
● Building up events
● Declarative data collection
● Minimise the bloat
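Declarative data collection means meters are described as data (which notification to match, which fields hold the volume and resource id) instead of being hard-coded in Python plugins. The sketch below is a hypothetical, simplified take on the idea: `METER_DEFINITIONS`, `lookup` and `samples_from_notification` are illustrative names, definitions are plain dicts, and field lookups use dotted paths rather than the JSONPath expressions a real definition file would likely use.

```python
import fnmatch

# Hypothetical definition, loosely shaped like a declarative meter entry.
METER_DEFINITIONS = [
    {"name": "image.size", "event_type": "image.upload", "type": "gauge",
     "unit": "B", "volume": "payload.size", "resource_id": "payload.id"},
]


def lookup(document, dotted_path):
    """Follow a dotted path such as 'payload.size' into nested dicts."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value


def samples_from_notification(notification, definitions=METER_DEFINITIONS):
    """Turn one notification into samples using data-only meter definitions."""
    samples = []
    for d in definitions:
        if fnmatch.fnmatchcase(notification["event_type"], d["event_type"]):
            samples.append({
                "name": d["name"], "type": d["type"], "unit": d["unit"],
                "volume": lookup(notification, d["volume"]),
                "resource_id": lookup(notification, d["resource_id"]),
            })
    return samples


notification = {"event_type": "image.upload",
                "payload": {"id": "490af6b0-2402-45d8-bcb1-c81376326e8d",
                            "size": 13167616}}
print(samples_from_notification(notification))
```

Adding a new meter then means adding a dict (or a YAML entry), not writing and deploying new plugin code.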

Gnocchi: Resource Metering as a Service

● Lightweight time-series metadata
● Separate storage and data models for resources and time-series data
● Indexer for metrics and resources
● Eagerly pre-aggregates metric data
● Supports restricted cross-metric aggregation
● Per-time-series configurable retention policy
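Eager pre-aggregation with a per-series retention policy can be sketched in a few lines: each point is folded into running aggregates for every granularity the moment it arrives, and old buckets are dropped once a granularity holds its configured number of points. This Python class is a hypothetical simplification, not Gnocchi's storage code; the (granularity, points) policy format and the mean/max aggregates are illustrative assumptions.

```python
class PreAggregatedSeries:
    """Eagerly aggregate incoming points per granularity, with bounded retention.

    `policy` is a list of (granularity_seconds, points_to_keep) pairs, a
    hypothetical, simplified stand-in for an archive policy.
    """

    def __init__(self, policy):
        self.policy = policy
        # granularity -> {bucket_start: [count, total, maximum]}
        self.buckets = {granularity: {} for granularity, _ in policy}

    def add(self, timestamp, value):
        """Fold one point (epoch seconds, value) into every granularity."""
        for granularity, points in self.policy:
            series = self.buckets[granularity]
            start = timestamp - timestamp % granularity
            agg = series.setdefault(start, [0, 0.0, value])
            agg[0] += 1
            agg[1] += value
            agg[2] = max(agg[2], value)
            # Retention: keep only the newest `points` buckets.
            for old_start in sorted(series)[:-points]:
                del series[old_start]

    def mean(self, granularity):
        return {start: total / count
                for start, (count, total, _) in sorted(self.buckets[granularity].items())}

    def maximum(self, granularity):
        return {start: peak
                for start, (_, _, peak) in sorted(self.buckets[granularity].items())}


series = PreAggregatedSeries([(60, 2), (3600, 1)])
for ts, value in [(0, 1.0), (30, 3.0), (60, 5.0), (120, 7.0)]:
    series.add(ts, value)
print(series.mean(60))       # {60: 5.0, 120: 7.0}
print(series.mean(3600))     # {0: 4.0}
print(series.maximum(3600))  # {0: 7.0}
```

Because aggregates are computed on write, reading a series costs at most `points` bucket lookups per granularity rather than a scan over raw samples at query time.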


Size matters

ceilometer datapoint (mongodb):

{
  "_id": ObjectId("55103dd3bf4d2c7a7de6e319"),
  "counter_name": "cpu",
  "user_id": "72bd0799d496476f9eed16d49e0b86e9",
  "resource_id": "d7f94857-a0d8-4864-8ab1-124055950973",
  "timestamp": ISODate("2015-03-23T16:22:43Z"),
  "message_signature": "539736605d14c0aa8c85058e6e9e67a078146f2e80a218d8dc6711c8d6875ae5",
  "message_id": "d559f244-d178-11e4-9fa9-28b2bd01ed52",
  "source": "openstack",
  "counter_unit": "ns",
  "counter_volume": NumberLong("22450000000"),
  "recorded_at": ISODate("2015-03-23T16:22:43.412Z"),
  "project_id": "99fb96cb63624163975dcbf95d7d2d6f",
  "resource_metadata": {
    "status": "active",
    "cpu_number": 1,
    "ephemeral_gb": 0,
    "display_name": "inst-3",
    "name": "instance-00000003",
    "disk_gb": 0,
    "kernel_id": "4e303a91-ae5b-43c7-b823-fd6f2cceab4e",
    "image": {
      "id": "490af6b0-2402-45d8-bcb1-c81376326e8d",
      "links": [
        {
          "href": "http://10.162.32.175:8774/837660dc95324be594a0607d80a22c53/images/490af6b0-2402-45d8-bcb1-c81376326e8d",
          "rel": "bookmark"
        }
      ],
      "name": "cirros-0.3.2-x86_64-uec"
    },
    "ramdisk_id": "7112ea15-3ece-4805-9f23-f6141a6f27b0",
    "vcpus": 1,
    "memory_mb": 64,
    "instance_type": "42",
    …..}

Vs

gnocchi datapoint:

{"2015-03-23T16:22:43Z": 1}

Gnocchi Benchmarks

[Figure: Gnocchi benchmark charts]

Architecture (Gnocchi)

[Diagram: OpenStack Services, Notification Bus, API and External Systems as before; Notification Agents and Polling Agents (Agent1..AgentN) with their Pipelines feed Collectors (Collector1..CollectorN). Events and Alarms remain in Databases, while metric data goes to Gnocchi, shown as its own API in front of Metric and Resources storage. Alarm Evaluator and Alarm Notifier are also shown. Legend: Active-Active HA support.]

Discussions

● Operators session - May 19, 2015 (12:05pm), Rm 306
● Design track - May 20, 2015 (9:00am - 3:30pm)
  ○ Event alarms; Ceilometer componentisation
● Design track - May 21, 2015 (9:00am - 12:30pm)
● Speaker session:
  ○ The Anatomy of an Action - May 21, 2015 (1:30pm)
● IRC: #openstack-ceilometer
● Mailing list: [email protected]

Resources

● https://wiki.openstack.org/wiki/ReleaseNotes/Juno
● https://wiki.openstack.org/wiki/ReleaseNotes/Kilo
● http://nejc.saje.info/ceilometer-central-agent.html
● https://julien.danjou.info/blog/2015/openstack-gnocchi-first-release
● https://blog.sileht.net/writing-a-gnocchi-storage-driver-for-ceph.html

Thank You