
Application Operations Management

Service Overview

Issue 01

Date 2019-04-03

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2019. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

HUAWEI and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.


Contents

1 What Is AOM?

2 Architecture

3 Functions

4 Application Scenarios

5 Metric Overview
5.1 Introduction
5.2 Network Metrics and Dimensions
5.3 Disk Metrics and Dimensions
5.4 File System Metrics and Dimensions
5.5 Host Metrics and Dimensions
5.6 Cluster Metrics and Dimensions
5.7 Container Metrics and Dimensions
5.8 Process Metrics and Dimensions
5.9 Instance Metrics and Dimensions
5.10 Service Metrics and Dimensions
5.11 SLA Metrics and Dimensions

6 Usage Restrictions

7 Relationships Between AOM and Other Services

8 Basic Concepts

9 Permissions Management

10 Feature Description


1 What Is AOM?

Background and Challenges

Previously, most enterprises purchased infrastructure resources and built clusters by themselves. They implemented O&M by focusing on hosts, and built their own application and database monitoring systems. With the popularization of container technologies, more and more enterprises develop applications through the microservice framework. As the number of cloud services increases, enterprises gradually turn to cloud O&M. Cloud O&M poses the following challenges:

• Lots of O&M tools, resulting in high usage and maintenance costs
  Cloud O&M places high demands on personnel skills, O&M tools are hard to configure, and multiple systems need to be maintained at the same time. In addition, distributed tracing systems are costly to learn and use, and often unstable.

• Difficult problem analysis for distributed applications on the cloud
  Lots of problems need to be solved: how to visualize the dependency between microservices, how to improve user experience, how to associate scattered logs for analysis, and how to quickly trace problems.

AOM

Application Operations Management (AOM) is a one-stop and multi-dimensional O&M management platform for cloud applications. It monitors applications and related cloud resources in real time, collects and associates resource metrics, logs, and events to analyze application health status, and provides flexible alarm reporting and data visualization. With AOM, you can detect faults in a timely manner and monitor running statuses of applications, resources, and services in real time.

AOM can monitor and manage cloud hosts, storage devices, networks, web containers, and applications hosted in Docker and Kubernetes in a centralized, unified, and visualized manner. This effectively prevents problems and helps O&M personnel locate faults, reducing O&M costs. In addition, AOM provides unified APIs for connecting self-developed monitoring or reporting systems. Unlike traditional monitoring systems, AOM monitors services from the perspective of applications. It meets enterprises' requirements for high efficiency and fast iteration, effectively supports their services through IT, and protects and optimizes their IT assets, so that enterprises can achieve strategic goals.


Advantages

• Multi-dimensional O&M
  Provides a one-stop and multi-dimensional O&M platform for mobile apps, application services, middleware, and cloud resources, improving O&M efficiency.

• Health check
  Monitors service health status in real time, and detects exceptions or performance bottlenecks in minutes. When a fault occurs, AOM helps you determine the resource, application, or service code that causes the fault.

• Intelligent analysis
  Analyzes root causes using Artificial Intelligence (AI)-powered threshold detection and machine learning based on historical data.

• Ease of use
  Connects to applications without having to modify code, and collects data in a non-intrusive way. Proactively discovers and monitors applications based on the application running environment, visualizes application data in real time, and facilitates O&M.

• Massive log management
  Supports high-performance search and service analysis, automatically associates logs, and quickly filters logs by application, host, file name, or instance.

• Association analysis
  Automatically associates applications and resources, and displays data in a panorama view. Metric and alarm data about applications, services, instances, hosts, and transactions is associated for analysis, so that you can easily locate faults.

• Open ecosystem
  Opens O&M data query APIs and collection standards, and supports independent development.


2 Architecture

Application Operations Management (AOM) is a multi-dimensional O&M platform that focuses on resource data and associates log, metric, resource, alarm, and event data. It consists of the data collection and access layer, transmission and storage layer, and service computing layer. The following figure shows the architecture.

Figure 2-1 Architecture

• Data collection and access layer
  AOM supports two data access modes:

  – Mode 1: Use the ICAgent (a data collector plug-in) to collect data.
    The ICAgent must be installed on the host so that O&M data can be reported through it. For Cloud Container Engine (CCE) users, the ICAgent is installed on each host by default when CCE clusters are created. For common Elastic Cloud Server (ECS) or Bare Metal Server (BMS) users, the ICAgent needs to be installed manually. For details, see Installing the ICAgent.

  – Mode 2: Use APIs to access data.
    For cloud services such as Intelligent EdgeFabric (IEF) and FunctionGraph, you do not need to perform any operations. Such services are automatically connected to AOM, and then monitored and maintained by AOM. In addition, service metrics can be set to custom metrics and connected to AOM through open or exporter APIs.

• Transmission and storage layer

  – Data transmission: AOM Access is a proxy for receiving O&M data. After O&M data is received, it is placed in the Kafka queue. Kafka then transmits the data to the service computing layer in real time based on its high-throughput capability.

  – Data storage: After being processed by the AOM backend, O&M data is written into a database. Cassandra stores sequential data, Redis is used for cache query, ETCD stores AOM configuration data, and Elasticsearch stores resources, logs, alarms, and events.

• Service computing layer
  AOM provides basic O&M services such as alarms, logs, monitoring, and metrics, and Artificial Intelligence (AI) services such as exception detection and analysis.
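The storage assignment described for the transmission and storage layer can be pictured as a simple lookup from data kind to store. The following is a minimal sketch only: the `DataKind` enum and the `route` function are hypothetical names introduced for illustration, and nothing here reflects how the AOM backend is actually implemented.

```python
from enum import Enum


class DataKind(Enum):
    """Kinds of O&M data named in the architecture description (hypothetical enum)."""
    SEQUENTIAL = "sequential"   # assumed to mean time-ordered metric samples
    CACHE = "cache"
    CONFIG = "config"
    RESOURCE = "resource"
    LOG = "log"
    ALARM = "alarm"
    EVENT = "event"


# Store per data kind, taken from the data storage bullet above.
STORE_FOR_KIND = {
    DataKind.SEQUENTIAL: "Cassandra",
    DataKind.CACHE: "Redis",
    DataKind.CONFIG: "ETCD",
    DataKind.RESOURCE: "Elasticsearch",
    DataKind.LOG: "Elasticsearch",
    DataKind.ALARM: "Elasticsearch",
    DataKind.EVENT: "Elasticsearch",
}


def route(kind: DataKind) -> str:
    """Return the name of the store that holds the given kind of data."""
    return STORE_FOR_KIND[kind]
```

For example, `route(DataKind.LOG)` returns `"Elasticsearch"` and `route(DataKind.CONFIG)` returns `"ETCD"`.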


3 Functions

Application Monitoring

Application monitoring allows you to view application resource usage, trends, and alarms in real time, so that you can respond quickly and keep applications running smoothly.

This function adopts a hierarchical drill-down design. The hierarchy is as follows: Application list > Application details > Service details > Instance details > Container details > Process details. That is, applications, services, instances, containers, and processes are associated and their relationships are directly displayed on the UI.

On the details page of each layer, Application Operations Management (AOM) associates related alarm, log, and host information, and displays the detailed alarm and host information, and the next-level resource list for further analysis. In addition, AOM allows you to customize monitoring views. Specifically, you can add metric graphs and select all the metrics related to the current resource.

Host Monitoring

Host monitoring allows you to view host resource usage, trends, and alarms in real time, so that you can respond quickly and keep hosts running smoothly.

Like application monitoring, this function also adopts a hierarchical drill-down design. The hierarchy is as follows: Host list > Host details. The details page contains all the instances, GPUs, NICs, disks, and file systems discovered on the current host.

Metric Monitoring

Metric monitoring allows you to search for metrics, which are displayed in the metric tree according to the resource hierarchy. The hierarchy is as follows: Cluster > Service > Instance > Container/Process. On the metric monitoring page, you can compare different metrics of the same resource or compare the same metric of different resources. A maximum of 12 metrics can be added to a metric graph. You can also quickly add a metric graph to a dashboard and export metric data to a local PC in CSV or TXT format.
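To make the 12-metric limit and the CSV export concrete, here is a minimal sketch of a metric graph held in memory. The `MetricGraph` class, its method names, and the CSV column layout are all assumptions made for illustration; the format that the AOM console actually exports is not specified in this document.

```python
import csv
import io

MAX_METRICS_PER_GRAPH = 12  # limit stated in the text above


class MetricGraph:
    """Hypothetical in-memory metric graph; not an AOM API."""

    def __init__(self, title):
        self.title = title
        self.series = {}  # metric label -> list of (timestamp, value) pairs

    def add_metric(self, label, points):
        """Add or replace one metric series, enforcing the per-graph limit."""
        if label not in self.series and len(self.series) >= MAX_METRICS_PER_GRAPH:
            raise ValueError(
                f"a metric graph holds at most {MAX_METRICS_PER_GRAPH} metrics"
            )
        self.series[label] = list(points)

    def to_csv(self):
        """Serialize all series as CSV text (column names are an assumption)."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["metric", "timestamp", "value"])
        for label, points in self.series.items():
            for timestamp, value in points:
                writer.writerow([label, timestamp, value])
        return buf.getvalue()
```

Adding a thirteenth distinct metric raises `ValueError`, while re-adding an existing label simply replaces its data points.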

Application Panorama

Application panorama focuses on applications and associates services, instances, hosts, and middleware for multi-dimensional analysis. Through analysis of metric and alarm data about applications, services, instances, hosts, and transactions, you can easily locate faults.


Figure 3-1 Data model

Automatic Discovery of Applications

After you deploy applications on hosts, the ICAgent installed on the hosts automatically collects information, including names of processes, applications, containers, and Kubernetes pods. The applications are automatically discovered and then displayed graphically. You can set aliases and groups for better resource management.

Dashboard

With a dashboard, different graphs can be displayed on the same screen. Various graphs, such as line graphs, digital graphs, and top N resource graphs, can display resource data, enabling you to monitor data comprehensively.

For example, you can add key metrics of important resources to a dashboard for real-time monitoring. You can also compare the same metric of different resources on one UI. In addition, by adding routine O&M metrics to a dashboard, you can perform routine checks without re-selecting metrics each time you open the AOM console.

Alarm Center

Alarm center is a platform for managing alarms and events. It supports custom notification actions, that is, you can obtain alarm information by email or Short Message Service (SMS) message. In this way, you can detect and handle exceptions as early as possible. You can create threshold rules for key resource metrics. When metric data meets threshold criteria, AOM generates threshold alarms. In addition, by enabling the alarm subscription function, you can connect threshold alarms to an O&M platform for analysis.
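The idea of a threshold rule (a metric, a comparison, and a threshold value) can be sketched as follows. This is an illustrative model only: `ThresholdRule`, `evaluate`, and the shape of the alarm records are hypothetical, and real AOM threshold rules may carry further settings that this document does not describe.

```python
import operator
from dataclasses import dataclass

# Supported comparison operators (an assumption for this sketch).
_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical threshold rule: metric name, comparator, threshold."""
    metric: str
    comparator: str
    threshold: float

    def breached(self, value: float) -> bool:
        """Return True if the value meets the threshold criteria."""
        return _COMPARATORS[self.comparator](value, self.threshold)


def evaluate(rule, samples):
    """Return one alarm record per (timestamp, value) sample that breaches the rule."""
    return [
        {"metric": rule.metric, "timestamp": ts, "value": value}
        for ts, value in samples
        if rule.breached(value)
    ]
```

With `ThresholdRule("cpuUsage", ">", 80.0)`, the samples `[(0, 50.0), (60, 91.0)]` yield a single alarm record, for the sample at timestamp 60.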

Log Management

• Log search: You can view logs in real time and use the retrieval function to quickly find the logs you need. You can also use log source information and raw context data to facilitate fault locating, and query and analyze raw logs and structured logs using SQL syntax.

• Log dump: AOM dumps logs to the bucket in Object Storage Service (OBS) for long-term storage. To store logs for a longer time, add log dumps.

• Statistical rule: Logs contain information about system performance and services. For example, the number of ERROR keywords indicates the system health, and that of BUY keywords indicates the service trading volume. To obtain such information, create a statistical rule. After a statistical rule is created, AOM periodically counts the number of keywords and generates metric data, so that you can obtain system performance and service information in real time. You can also set threshold rules for metrics. AOM reports a threshold alarm when the value of a metric reaches the preset threshold. In this way, you can detect and handle exceptions as early as possible.

• Log subscription: To connect logs to your O&M platform in real time, enable the log subscription function. After the function is enabled, AOM can call the transmission API of Distributed Message Service (DMS) to send logs to a specified Kafka queue. In this case, you can retrieve logs in the DMS Kafka queue.

• Delimiter configuration: You can separate log contents into multiple words by using delimiters, and then search for logs based on these words.
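The statistical rule and delimiter configuration described above can be illustrated together: log lines are split into words using a set of delimiters, and keyword occurrences are then counted to produce a metric value. The sketch below assumes a particular default delimiter set and exact, case-sensitive keyword matching; both are illustrative choices, and the function names are hypothetical.

```python
import re
from collections import Counter

DEFAULT_DELIMITERS = " ,;:=[]"  # assumed delimiter set for this sketch


def split_words(line, delimiters=DEFAULT_DELIMITERS):
    """Split one log line into words on any run of delimiter characters."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [word for word in re.split(pattern, line) if word]


def count_keywords(lines, keywords, delimiters=DEFAULT_DELIMITERS):
    """Count exact, case-sensitive keyword occurrences across log lines."""
    counts = Counter({keyword: 0 for keyword in keywords})
    for line in lines:
        for word in split_words(line, delimiters):
            if word in counts:
                counts[word] += 1
    return dict(counts)
```

Running `count_keywords` once per collection period over the logs received in that period would give one data point per keyword, which is the kind of metric data a threshold rule could then be applied to.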

Figure 3-2 Log management

APM

In addition to basic functions, AOM integrates the functions of Application Performance Management (APM), such as topology display, tracing, device-side analysis, and abnormal SQL analysis. In this way, AOM can implement more advanced monitoring, so that O&M personnel can quickly resolve problems and performance bottlenecks in the distributed architecture, ensuring a premium user experience.

To use AOM to monitor application performance, you need to enable APM and configure application startup scripts. For details, see APM Getting Started. The fees generated by APM are settled according to standard APM pricing. For details, see APM Pricing Details.


4 Application Scenarios

Application Operations Management (AOM) is widely used. You can learn how to use AOM based on the following four typical scenarios.

Device-Cloud Full-Link Monitoring

If a performance problem such as slow page loading or frame freezing occurs and cannot be reproduced, it is difficult to quickly detect performance bottlenecks and locate causes. For example, when a user reports that page loading is slow, is the problem caused by a faulty network, resource loading, or Document Object Model (DOM) parsing? Is it related to the province or country where the user is located, or to the user's browser or device? As another example, when a user reports that frame freezing occurs, is the problem caused by a faulty network between the user terminal and the server? Is the server or database overloaded? It is also difficult to quickly locate root causes in code.

AOM provides full-link monitoring capabilities, covering the browser, mobile, network, web service, and data center. You can view the latency and throughput from the mobile or browser end to the data center on the topology. In addition, you can analyze application performance monitoring data, such as Application Performance Index (Apdex), throughput, error count, frame freezing and crash, and users' geographic distribution through device-side analysis. In this way, you are able to view the running status of applications in real time and quickly diagnose faults.
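Apdex is an open standard that condenses response times into a score between 0 and 1: requests at or below a target time T count as satisfied, those between T and 4T count as tolerating (weighted by one half), and slower ones count as frustrated. The sketch below implements that standard formula; the function name is hypothetical, and the threshold that AOM or APM applies is not stated in this document.

```python
def apdex(response_times, threshold):
    """Compute the Apdex score for a list of response times.

    threshold is the target time T, in the same unit as response_times.
    Score = (satisfied + tolerating / 2) / total samples.
    """
    if not response_times:
        raise ValueError("at least one response time is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)
```

For response times of 0.1 s, 0.4 s, 1.0 s, and 3.0 s with T = 0.5 s, two requests are satisfied, one is tolerating, and one is frustrated, so the score is (2 + 0.5) / 4 = 0.625.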

Advantages

• E2E full-link tracing: You can reproduce problems using the distributed tracing technology to quickly locate performance bottlenecks in code.

• Intelligent Root Cause Analysis (RCA): AOM can analyze O&M data in real time and identify success and error patterns to locate root causes.

• Non-intrusive access: Applications can be quickly connected through non-intrusive tracing points instead of SDKs.


Figure 4-1 Device-cloud full-link monitoring

Intelligent Analysis of O&M Metrics

For massive services, there is rich but unassociated application O&M data, such as hundreds of monitoring metrics, KPI data, and tracing data. How can metric and alarm data be associated from multiple perspectives (such as applications, services, instances, hosts, and transactions) so that RCA is completed automatically? How can possible causes be provided during intelligent exception analysis based on learned historical data and an O&M experience library?

AOM uses Artificial Intelligence (AI) algorithms to analyze trends of O&M metrics, so that you can predict exceptions, including regular changes and abrupt increases in metrics.

Advantages

• Intelligent scenario identification: Optimal algorithms are selected based on O&M metric characteristics, helping you identify abrupt status changes and periodical exceptions.

• Adaptive algorithms: When lots of alarms are reported, algorithm parameters are automatically adjusted to suppress alarms.

• Automatic filtering of glitch signals: Occasional and discrete glitch signals can be filtered automatically, reducing unnecessary alarm reporting.
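One common way to filter occasional, discrete glitches is to raise an alarm only after a metric has breached its threshold for several consecutive samples. The sketch below shows that simple technique as an illustration of the idea; it is not AOM's algorithm, which this document does not describe, and the function name and default window of three samples are assumptions.

```python
def filter_glitches(breaches, min_consecutive=3):
    """Suppress isolated breaches.

    breaches is a sequence of booleans, one per sample, where True means
    the sample exceeded its threshold. The result marks a sample True only
    when it and the preceding samples form a run of at least
    min_consecutive breaches.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    result = []
    for breached in breaches:
        run = run + 1 if breached else 0
        result.append(run >= min_consecutive)
    return result
```

A larger `min_consecutive` suppresses more noise but delays the alarm by the same number of samples, so the value is a trade-off between sensitivity and false alarms.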

Figure 4-2 Intelligent analysis of O&M metrics

Problem Inspection and Demarcation

During routine O&M, it is hard to locate faults and obtain logs. Therefore, a monitoring platform is required to monitor resources, logs, and application performance.

AOM interconnects with application services, and collects O&M data of infrastructures, middleware, and application instances in one stop. Through metric monitoring, log analysis, and event/alarm reporting, AOM enables you to monitor the application running status and resource usage easily, and detect and demarcate problems in a timely manner.


Advantages

• Automatic discovery of applications: Collectors are deployed to proactively discover and monitor applications based on different runtime environments.

• Monitoring of distributed applications: AOM serves as a unified O&M platform that enables you to implement multi-dimensional monitoring over distributed applications with multiple cloud services.

• Notification of events and alarms: Multiple exception detection policies, event and alarm trigger modes, and APIs are provided.

Figure 4-3 Problem inspection and demarcation

Multi-Dimensional O&M

You need to monitor the overall system running status and respond quickly to various problems.

AOM provides multi-dimensional O&M capabilities from the cloud level to the resource level and from application monitoring to microservice tracing.

Advantages

• User experience assurance: Service health status KPIs are monitored in real time, and root causes of exceptions are analyzed.

• Fast fault diagnosis: Distributed call tracing enables you to locate faults quickly.

• Resource running assurance: Hundreds of O&M metrics about resources such as containers, disks, and networks are monitored in real time, and clusters, VMs, applications, and containers are associated for analysis.


Figure 4-4 Multi-dimensional O&M


5 Metric Overview

5.1 Introduction

5.2 Network Metrics and Dimensions

5.3 Disk Metrics and Dimensions

5.4 File System Metrics and Dimensions

5.5 Host Metrics and Dimensions

5.6 Cluster Metrics and Dimensions

5.7 Container Metrics and Dimensions

5.8 Process Metrics and Dimensions

5.9 Instance Metrics and Dimensions

5.10 Service Metrics and Dimensions

5.11 SLA Metrics and Dimensions

5.1 Introduction

Metrics reflect resource performance data or status. A metric consists of the namespace, dimension, name, and unit. Metrics can be divided into the following types:

• System metrics: Basic metrics provided by Application Operations Management (AOM), such as CPU usage and used CPU cores.

• Custom metrics: Metrics defined by you. Custom metrics can be reported using the following methods:

  – Method 1: Use AOM APIs. For details, see Adding Monitoring Data and Querying Monitoring Data.

  – Method 2: Connect to Prometheus when creating container applications on the Cloud Container Engine (CCE) console. For details, see Interconnection with Prometheus (Monitoring).


Metric Namespaces

A metric namespace is a container for storing metrics. Metrics in different namespaces are independent of each other, so that metrics of different applications are not aggregated into the same statistics.

• Namespaces of system metrics: These are fixed and start with PAAS., as shown in Table 5-1.

Table 5-1 Namespaces of system metrics

| Namespace | Description |
| --- | --- |
| PAAS.AGGR | Namespace of cluster metrics |
| PAAS.NODE | Namespace of host, network, disk, and file system metrics |
| PAAS.CONTAINER | Namespace of service, instance, process, and container metrics |
| PAAS.SLA | Namespace of SLA metrics |

• Namespaces of custom metrics: These must be in the XX.XX format and contain 3 to 32 characters, starting with a letter. Only digits, letters, and underscores (_) are allowed. A custom namespace cannot start with "PAAS.", "SYS.", or "SRE.".
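The naming rules for custom namespaces can be checked before metrics are reported. The following is a minimal sketch of such a check. It rests on several assumptions that this document does not confirm: that XX.XX means exactly one dot separating two non-empty parts, that the length limit includes the dot, and that the reserved-prefix check is case-sensitive. The function name is hypothetical.

```python
import re

# One letter-led part, a dot, then a second part (assumed reading of "XX.XX").
_NAMESPACE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*\.[A-Za-z0-9_]+")

# Prefixes reserved for system use, as listed above.
RESERVED_PREFIXES = ("PAAS.", "SYS.", "SRE.")


def is_valid_custom_namespace(namespace: str) -> bool:
    """Return True if the namespace satisfies the custom namespace rules."""
    if not 3 <= len(namespace) <= 32:
        return False
    if namespace.startswith(RESERVED_PREFIXES):
        return False
    return _NAMESPACE_RE.fullmatch(namespace) is not None
```

Under these assumptions, `is_valid_custom_namespace("MyApp.orders")` is accepted, while `"PAAS.NODE"` (reserved prefix), `"1abc.def"` (starts with a digit), and `"nodot"` (no dot) are rejected.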

Metric Dimensions

Metric dimensions indicate the categories of metrics. Each metric has certain features, and a dimension may be considered as a category of such features.

• Dimensions of system metrics: These are fixed. Different types of metrics have different dimensions. For more details, see 5.2 Network Metrics and Dimensions to 5.11 SLA Metrics and Dimensions.

• Dimensions of custom metrics: These are defined by you and must be 1 to 32 characters long.

5.2 Network Metrics and Dimensions

Table 5-2 Network metrics

| Metric | Description | Value Range | Unit |
| --- | --- | --- | --- |
| Downlink rate (recvBytesRate) | Inbound network traffic rate of the measured object | ≥ 0 | Byte Per Second (BPS) |
| Downlink rate (recvPackRate) | Number of data packets received by the NIC per second | ≥ 0 | Packet Per Second (PPS) |
| Downlink error rate (recvErrPackRate) | Number of error packets received by the NIC per second | ≥ 0 | PPS |
| Uplink rate (sendBytesRate) | Outbound network traffic rate of the measured object | ≥ 0 | BPS |
| Uplink error rate (sendErrPackRate) | Number of error packets sent by the NIC per second | ≥ 0 | PPS |
| Uplink rate (sendPackRate) | Number of data packets sent by the NIC per second | ≥ 0 | PPS |
| Total rate (totalBytesRate) | Total inbound and outbound network traffic rate of the measured object | ≥ 0 | BPS |

Table 5-3 Dimensions of network metrics

| Dimension | Description |
| --- | --- |
| clusterId | Cluster ID |
| hostID | Host ID |
| nameSpace | Cluster namespace |
| netDevice | NIC name |
| nodeIP | Host IP address |
| nodeName | Host name |
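The byte-rate metrics above relate to one another in a simple way: the total rate is the sum of the inbound and outbound rates. The sketch below derives the three rates from two readings of cumulative byte counters. How the ICAgent actually samples the NIC is not described in this document, so the counter-based approach, the function names, and the `recv_bytes` and `send_bytes` keys are assumptions made for illustration.

```python
def rate(previous_count, current_count, interval_seconds):
    """Per-second rate between two cumulative counter readings.

    A counter that went backwards (for example after a reset) yields 0
    rather than a negative rate, matching the value range of >= 0.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return max(current_count - previous_count, 0) / interval_seconds


def network_rates(previous, current, interval_seconds):
    """Compute byte rates from two counter snapshots (hypothetical dict keys)."""
    recv = rate(previous["recv_bytes"], current["recv_bytes"], interval_seconds)
    send = rate(previous["send_bytes"], current["send_bytes"], interval_seconds)
    return {
        "recvBytesRate": recv,
        "sendBytesRate": send,
        "totalBytesRate": recv + send,
    }
```

For example, if 6000 bytes were received and 3000 bytes were sent over a 60-second interval, the rates are 100, 50, and 150 bytes per second respectively.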

5.3 Disk Metrics and Dimensions

Table 5-4 Disk metrics

| Metric | Description | Value Range | Unit |
| --- | --- | --- | --- |
| Disk read rate (diskReadRate) | Volume of data read from a disk per second | ≥ 0 | KB/s |
| Disk write rate (diskWriteRate) | Volume of data written into a disk per second | ≥ 0 | KB/s |

Table 5-5 Dimensions of disk metrics

| Dimension | Description |
| --- | --- |
| clusterId | Cluster ID |
| diskDevice | Disk name |
| hostID | Host ID |
| nameSpace | Cluster namespace |
| nodeIP | Host IP address |
| nodeName | Host name |

5.4 File System Metrics and Dimensions

Table 5-6 File system metrics

| Metric | Description | Value Range | Unit |
| --- | --- | --- | --- |
| Available disk space (diskAvailableCapacity) | Disk space that has not been used | ≥ 0 | MB |
| Total disk space (diskCapacity) | Total disk space | ≥ 0 | MB |
| Disk read/write status (diskRWStatus) | Read or write status of a disk | 0 or 1 (0: read/write; 1: read-only) | N/A |
| Disk usage (diskUsedRate) | Percentage of the used disk space to the total disk space | ≥ 0 | % |

Table 5-7 Dimensions of file system metrics

| Dimension | Description |
| --- | --- |
| clusterId | Cluster ID |
| clusterName | Cluster name |
| fileSystem | File system |
| hostID | Host ID |
| mountPoint | Mount point |
| nameSpace | Cluster namespace |
| nodeIP | Host IP address |
| nodeName | Host name |
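The file system metrics are related arithmetically: disk usage is the used space expressed as a percentage of the total space. The sketch below shows that calculation and a lookup for the read/write status codes. It assumes that used space equals total space minus available space, which this document does not state and which can differ on file systems that reserve blocks; the function names are hypothetical.

```python
# Status codes as given for diskRWStatus.
RW_STATUS = {0: "read/write", 1: "read-only"}


def disk_used_rate(disk_capacity_mb, disk_available_mb):
    """Return disk usage as a percentage of total capacity.

    Assumes used = total - available, which is a simplification.
    """
    if disk_capacity_mb <= 0:
        raise ValueError("disk capacity must be positive")
    used_mb = disk_capacity_mb - disk_available_mb
    return 100.0 * used_mb / disk_capacity_mb


def describe_rw_status(code):
    """Translate a diskRWStatus value into its documented meaning."""
    return RW_STATUS[code]
```

A file system with 1000 MB in total and 250 MB available is therefore 75% used under this assumption.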

5.5 Host Metrics and Dimensions

Table 5-8 Host metrics

| Metric | Description | Value Range | Unit |
| --- | --- | --- | --- |
| Total CPU cores (cpuCoreLimit) | Total number of CPU cores that the measured object has applied for | ≥ 1 | Cores |
| Used CPU cores (cpuCoreUsed) | Number of CPU cores used by the measured object | ≥ 0 | Cores |
| CPU usage (cpuUsage) | CPU usage of the measured object | 0%–100% | % |
| Available physical memory (freeMem) | Available physical memory of the measured object | ≥ 0 | MB |
| Available virtual memory (freeVirMem) | Available virtual memory of the measured object | ≥ 0 | MB |
| Total GPU memory (gpuMemCapacity) | Total GPU memory of the measured object | > 0 | MB |
| GPU memory usage (gpuMemUsage) | Percentage of the used GPU memory to the total GPU memory | 0%–100% | % |
| Used GPU memory (gpuMemUsed) | GPU memory used by the measured object | ≥ 0 | MB |
| GPU usage (gpuUtil) | GPU usage of the measured object | 0%–100% | % |
| Total NPU memory (npuMemCapacity) | Total NPU memory of the measured object | > 0 | MB |
| NPU memory usage (npuMemUsage) | Percentage of the used NPU memory to the total NPU memory | 0%–100% | % |
| Used NPU memory (npuMemUsed) | NPU memory used by the measured object | ≥ 0 | MB |
| NPU usage (npuUtil) | NPU usage of the measured object | 0%–100% | % |
| NPU temperature (temperature) | NPU temperature of the measured object | - | °C |
| Physical memory usage (memUsedRate) | Percentage of the used physical memory to the total physical memory | 0%–100% | % |
| Host status (nodeStatus) | Host status | 0: Normal; other values: Abnormal | N/A |
| NTP offset (ntpOffset) | Offset between the local time of the host and the NTP server time. When the NTP offset is closer to 0, the local time of the host is closer to the time of the NTP server. | N/A | ms |
| NTP server status (ntpServerStatus) | Whether the host is connected to the NTP server | 0 or 1 (0: Connected; 1: Unconnected) | N/A |

NTP sync status(ntpStatus)

Whether the local time of the host issynchronized with the NTP server time

0 or 1l 0:

Synchronous

l 1:Asynchronous

N/A

Processes (processNum) Number of processes on the measuredobject

≥ 0 N/A

GPU temperature(temperature)

GPU temperature of the measured object - °C

Total physical memory(totalMem)

Total physical memory that the measuredobject has applied for

≥ 0 MB

Total virtual memory(totalVirMem)

Total virtual memory that the measuredobject has applied for

≥ 0 MB

Application Operations ManagementService Overview 5 Metric Overview

Issue 01 (2019-04-03) Copyright © Huawei Technologies Co., Ltd. 17

Page 21: Service Overview - support.huaweicloud.com · generates threshold alarms. In addition, by enabling the alarm subscription function, you can connect threshold alarms to an O&M platform

Metric Description ValueRange

Unit

Virtual memory usage(virMemUsedRate)

Percentage of the used virtual memory tothe total virtual memory

0%–100%

%
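The numeric status metrics in Table 5-8 are easier to read once mapped to their labels. The following is a minimal sketch of such a mapping. The value meanings are copied from the table, while the dictionary and function names are hypothetical and not part of any AOM SDK.

```python
# Value meanings copied from Table 5-8. The names used here are
# illustrative only and do not come from an AOM API.
STATUS_LABELS = {
    "ntpServerStatus": {0: "Connected", 1: "Unconnected"},
    "ntpStatus": {0: "Synchronous", 1: "Asynchronous"},
}


def describe_status(metric: str, value: int) -> str:
    """Translate a numeric host status metric value into its label."""
    if metric == "nodeStatus":
        # Table 5-8: 0 is Normal, any other value is Abnormal.
        return "Normal" if value == 0 else "Abnormal"
    labels = STATUS_LABELS.get(metric)
    if labels is None:
        raise KeyError(f"unknown status metric: {metric}")
    if value not in labels:
        raise ValueError(f"unexpected value {value} for {metric}")
    return labels[value]


print(describe_status("ntpStatus", 1))
```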

Table 5-9 Dimensions of host metrics

Dimension | Description
clusterId | Cluster ID
clusterName | Cluster name
gpuName | GPU name
gpuID | GPU ID
npuName | NPU name
npuID | NPU ID
hostID | Host ID
nameSpace | Cluster namespace
nodeIP | Host IP address
nodeName | Host name

5.6 Cluster Metrics and Dimensions

Table 5-10 Cluster metrics

Metric | Description | Value Range | Unit
Total CPU cores (cpuCoreLimit) | Total number of CPU cores that the measured object has applied for | ≥ 1 | Cores
Used CPU cores (cpuCoreUsed) | Number of CPU cores used by the measured object | ≥ 0 | Cores
CPU usage (cpuUsage) | CPU usage of the measured object | 0%–100% | %
Available disk space (diskAvailableCapacity) | Disk space that has not been used | ≥ 0 | MB
Total disk space (diskCapacity) | Total disk space | ≥ 0 | MB
Disk usage (diskUsedRate) | Percentage of the used disk space to the total disk space | ≥ 0 | %
Available physical memory (freeMem) | Available physical memory of the measured object | ≥ 0 | MB
Available virtual memory (freeVirMem) | Available virtual memory of the measured object | ≥ 0 | MB
Total GPU memory (gpuMemCapacity) | Total GPU memory of the measured object | > 0 | MB
GPU memory usage (gpuMemUsage) | Percentage of the used GPU memory to the total GPU memory | 0%–100% | %
Used GPU memory (gpuMemUsed) | GPU memory used by the measured object | ≥ 0 | MB
GPU usage (gpuUtil) | GPU usage of the measured object | 0%–100% | %
Physical memory usage (memUsedRate) | Percentage of the used physical memory to the total physical memory | 0%–100% | %
Downlink rate (recvBytesRate) | Inbound network traffic rate of the measured object | ≥ 0 | Byte Per Second (BPS)
Uplink rate (sendBytesRate) | Outbound network traffic rate of the measured object | ≥ 0 | BPS
Total physical memory (totalMem) | Total physical memory that the measured object has applied for | ≥ 0 | MB
Total virtual memory (totalVirMem) | Total virtual memory that the measured object has applied for | ≥ 0 | MB
Virtual memory usage (virMemUsedRate) | Percentage of the used virtual memory to the total virtual memory | 0%–100% | %

Table 5-11 Dimensions of cluster metrics

Dimension | Description
clusterId | Cluster ID
clusterName | Cluster name
projectId | Project ID

5.7 Container Metrics and Dimensions

Table 5-12 Container metrics

Metric | Description | Value Range | Unit
Total CPU cores (cpuCoreLimit) | Total number of CPU cores that the measured object has applied for | ≥ 1 | Cores
Used CPU cores (cpuCoreUsed) | Number of CPU cores used by the measured object | ≥ 0 | Cores
CPU usage (cpuUsage) | CPU usage of the measured object. Percentage of the used CPU cores to the total CPU cores | 0%–100% | %
Disk read rate (diskReadRate) | Volume of data read from a disk per second | ≥ 0 | KB/s
Disk write rate (diskWriteRate) | Volume of data written into a disk per second | ≥ 0 | KB/s
Available file system (filesystemAvailable) | Available file system capacity of the measured object. Only containers using the device mapper in the Kubernetes cluster of 1.11 or a later version are supported. | ≥ 0 | MB
Total file system (filesystemCapacity) | Total file system capacity of the measured object. Only containers using the device mapper in the Kubernetes cluster of 1.11 or a later version are supported. | ≥ 0 | MB
File system usage (filesystemUsage) | File system usage of the measured object. Percentage of the used file system to the total file system. Only containers using the device mapper in the Kubernetes cluster of 1.11 or a later version are supported. | 0%–100% | %
Total GPU memory (gpuMemCapacity) | Total GPU memory of the measured object | > 0 | MB
GPU memory usage (gpuMemUsage) | Percentage of the used GPU memory to the total GPU memory | 0%–100% | %
Used GPU memory (gpuMemUsed) | GPU memory used by the measured object | ≥ 0 | MB
GPU usage (gpuUtil) | GPU usage of the measured object | 0%–100% | %
Total NPU memory (npuMemCapacity) | Total NPU memory of the measured object | > 0 | MB
NPU memory usage (npuMemUsage) | Percentage of the used NPU memory to the total NPU memory | 0%–100% | %
Used NPU memory (npuMemUsed) | NPU memory used by the measured object | ≥ 0 | MB
NPU usage (npuUtil) | NPU usage of the measured object | 0%–100% | %
Total physical memory (memCapacity) | Total physical memory that the measured object has applied for | ≥ 0 | MB
Physical memory usage (memUsage) | Percentage of the used physical memory to the total physical memory | 0%–100% | %
Used physical memory (memUsed) | Used physical memory of the measured object | ≥ 0 | MB
Downlink rate (recvBytesRate) | Inbound network traffic rate of the measured object | ≥ 0 | Byte Per Second (BPS)
Downlink rate (recvPackRate) | Number of data packets received by the NIC per second | ≥ 0 | Packet Per Second (PPS)
Downlink error rate (recvErrPackRate) | Number of error packets received by the NIC per second | ≥ 0 | PPS
Error packets (rxPackErrors) | Number of error packets received by the measured object | ≥ 0 | Packets
Uplink rate (sendBytesRate) | Outbound network traffic rate of the measured object | ≥ 0 | BPS
Uplink error rate (sendErrPackRate) | Number of error packets sent by the NIC per second | ≥ 0 | PPS
Uplink rate (sendPackRate) | Number of data packets sent by the NIC per second | ≥ 0 | PPS
Container status (status) | Docker container status | 0 or 1 (0: Normal; 1: Abnormal) | N/A

Table 5-13 Dimensions of container metrics

Dimension | Description
appID | Service ID
appName | Service name
clusterId | Cluster ID
clusterName | Cluster name
containerID | Container ID
containerName | Container name
deploymentName | Kubernetes deployment name
kind | Application type
nameSpace | Cluster namespace
podID | Instance ID
podName | Instance name
serviceID | Inventory ID
gpuID | GPU ID
npuName | NPU name
npuID | NPU ID

5.8 Process Metrics and Dimensions

Table 5-14 Process metrics

Metric | Description | Value Range | Unit
Total CPU cores (cpuCoreLimit) | Total number of CPU cores that the measured object has applied for | ≥ 1 | Cores
Used CPU cores (cpuCoreUsed) | Number of CPU cores used by the measured object | ≥ 0 | Cores
CPU usage (cpuUsage) | CPU usage of the measured object. Percentage of the used CPU cores to the total CPU cores | 0%–100% | %
Handles (handleCount) | Number of handles used by the measured object | ≥ 0 | N/A
Total physical memory (memCapacity) | Total physical memory that the measured object has applied for | ≥ 0 | MB
Physical memory usage (memUsage) | Percentage of the used physical memory to the total physical memory | 0%–100% | %
Used physical memory (memUsed) | Used physical memory of the measured object | ≥ 0 | MB
Process status (status) | Process status | 0 or 1 (0: Normal; 1: Abnormal) | N/A
Threads (threadsCount) | Number of threads used by the measured object | ≥ 0 | N/A
Total virtual memory (virMemCapacity) | Total virtual memory that the measured object has applied for | ≥ 0 | MB

Table 5-15 Dimensions of process metrics

Dimension | Description
appName | Service name
clusterId | Cluster ID
clusterName | Cluster name
nameSpace | Cluster namespace
processID | Process ID
processName | Process name
serviceID | Inventory ID

5.9 Instance Metrics and Dimensions

Instance metrics consist of container or process metrics. The dimensions of instance metrics are the same as those of container or process metrics. For details, see 5.7 Container Metrics and Dimensions and 5.8 Process Metrics and Dimensions.

5.10 Service Metrics and Dimensions

Service metrics consist of instance metrics. The dimensions of service metrics are the same as those of instance metrics. For details, see 5.9 Instance Metrics and Dimensions.

5.11 SLA Metrics and Dimensions

NOTE

SLA stands for service level agreement. The SLA metrics described in this section are product features of Application Operations Management (AOM). They do not describe the service level of AOM itself.

Table 5-16 SLA metrics

Metric | Description | Value Range | Unit
Success rate (successRate) | Success rate of API calls in a statistical period | 0%–100% | %
Average latency (TP99) | Minimum time within which 99% of requests are processed. For example, suppose that processing four requests takes 10 ms, 100 ms, 500 ms, and 20 ms, respectively. 99% of the four requests is 4 × 99% = 3.96, which rounds to 4. The minimum time within which all four requests are processed is 500 ms, so the TP99 latency is 500 ms. | ≥ 0 | ms
Error calls (errors) | Number of failed API calls in a statistical period | ≥ 0 | Count
Throughput (throughput) | Total API calls in a specified period | ≥ 0 | Transaction Per Minute (TPM)
Apdex (apdex) | User satisfaction with application performance. A larger value indicates a higher satisfaction level. | 0–1 | N/A
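The TP99 example in Table 5-16 can be reproduced with a nearest-rank percentile calculation. The sketch below is one way to do this, assuming the request count is rounded up to the next whole number (which agrees with the table's 4 × 99% example). The function name is hypothetical, and the source does not confirm that AOM computes TP99 exactly this way.

```python
import math


def tp_latency(latencies_ms, percent=99):
    """Return the nearest-rank percentile latency.

    Assumption: the number of requests covered is rounded up (ceil),
    which matches the worked example in Table 5-16.
    """
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    ordered = sorted(latencies_ms)
    # Number of requests that must finish within the returned time.
    rank = math.ceil(len(ordered) * percent / 100)
    return ordered[rank - 1]


# The four requests from the table: 10 ms, 100 ms, 500 ms, and 20 ms.
print(tp_latency([10, 100, 500, 20]))  # 500
```

With only four samples, TP99 is simply the slowest request. The percentile becomes more informative as the number of requests in the statistical period grows.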

Table 5-17 Dimensions of SLA metrics

Dimension | Description
appId | Application ID
appName | Service name
clusterId | Cluster ID
monitoringGroup | Application name
nameSpace | Cluster namespace
transactionType | Transaction type
tier | Application layer name

6 Usage Restrictions

OS Usage Restrictions

Application Operations Management (AOM) supports multiple operating systems (OSs). When purchasing a host, select an OS that meets the requirements in Table 6-1. Otherwise, the host cannot be monitored by AOM.

Table 6-1 OSs and versions supported by AOM

OS | Version
SUSE | SUSE Enterprise 11 SP4 64-bit; SUSE Enterprise 12 SP1 64-bit; SUSE Enterprise 12 SP2 64-bit; SUSE Enterprise 12 SP3 64-bit
openSUSE | 13.2 64-bit; 42.2 64-bit; 15.0 64-bit (Currently, syslog logs cannot be collected.)
EulerOS | 2.2 64-bit; 2.3 64-bit
CentOS | 6.3 64-bit; 6.5 64-bit; 6.8 64-bit; 6.9 64-bit; 6.10 64-bit; 7.1 64-bit; 7.2 64-bit; 7.3 64-bit; 7.4 64-bit; 7.5 64-bit; 7.6 64-bit
Ubuntu | 14.04 server 64-bit; 16.04 server 64-bit; 18.04 server 64-bit
Fedora | 24 64-bit; 25 64-bit; 29 64-bit
Debian | 7.5.0 32-bit; 7.5.0 64-bit; 8.2.0 64-bit; 8.8.0 64-bit; 9.0.0 64-bit
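Before purchasing a host, the OS list in Table 6-1 can be checked programmatically. The following is a minimal sketch of such a lookup. The version data is copied from the table, but the structure and the function name is_supported are hypothetical and are not an AOM API.

```python
# Versions copied from Table 6-1 (64-bit unless noted otherwise).
# All names here are illustrative only.
SUPPORTED_64_BIT = {
    "SUSE": {"Enterprise 11 SP4", "Enterprise 12 SP1",
             "Enterprise 12 SP2", "Enterprise 12 SP3"},
    "openSUSE": {"13.2", "42.2", "15.0"},
    "EulerOS": {"2.2", "2.3"},
    "CentOS": {"6.3", "6.5", "6.8", "6.9", "6.10",
               "7.1", "7.2", "7.3", "7.4", "7.5", "7.6"},
    "Ubuntu": {"14.04 server", "16.04 server", "18.04 server"},
    "Fedora": {"24", "25", "29"},
    "Debian": {"7.5.0", "8.2.0", "8.8.0", "9.0.0"},
}

# Table 6-1 lists exactly one 32-bit entry.
SUPPORTED_32_BIT = {("Debian", "7.5.0")}


def is_supported(os_name: str, version: str, bits: int = 64) -> bool:
    """Return True if the OS, version, and word size appear in Table 6-1."""
    if bits == 32:
        return (os_name, version) in SUPPORTED_32_BIT
    if bits != 64:
        return False
    return version in SUPPORTED_64_BIT.get(os_name, set())


print(is_supported("CentOS", "7.6"))           # True
print(is_supported("Ubuntu", "20.04 server"))  # False
```

Note that openSUSE 15.0 is listed as supported, but per the table its syslog logs cannot currently be collected. A fuller check might surface that caveat separately.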

Resource Usage Restrictions

When using AOM, pay attention to the restrictions in Table 6-2. Resource usage restrictions include some quota restrictions. For details, see Quotas.

Table 6-2 Resource usage restrictions

Category | Object | Usage Restrictions
Dashboard | Dashboard | A maximum of 50 dashboards can be created in one region, for example, CN North-Beijing1.
Dashboard | Graph in a dashboard | A maximum of 20 graphs can be added to a dashboard.
Dashboard | Number of resources, threshold rules, services, or hosts in a graph | (1) A maximum of 100 resources can be added to a line graph, and resources can be selected across clusters. (2) Only one resource can be added to a digital graph. (3) A maximum of 10 threshold rules can be added to a threshold status graph. (4) A maximum of 10 hosts can be added to a host status graph. (5) A maximum of 10 services can be added to a service status graph.
Metric | Metric data | (1) Basic edition: Metric data can be stored in the database for a maximum of 30 days. (2) Professional edition: Metric data can be stored in the database for a maximum of one year.
Metric | Metric item | After resources such as clusters, services, and hosts are deleted, their related metrics can be stored in the database for a maximum of 30 days.
Metric | Dimension | A maximum of 20 dimensions can be configured for a metric.
Metric | Metric query API | A maximum of 20 metrics can be queried at a time.
Metric | Statistical period | The maximum statistical period is 1 hour.
Metric | Metric data returned for a single query | A maximum of 1440 data points can be returned for a metric in a single query.
Metric | Custom metric | No restrictions.
Metric | Custom metric to be reported | A maximum of 40 KB of data can be reported in a single request.
Metric | Application metric | (1) When the number of containers on a host exceeds 1000, the ICAgent stops collecting application metrics and sends the ALM-34105 ICAgent Stopped Collecting Application Metrics alarm. (2) When the number of containers on a host is reduced to less than 1000, the ICAgent resumes the collection of application metrics and the ALM-34105 ICAgent Stopped Collecting Application Metrics alarm is cleared.
Metric | Job metric | A job automatically exits after it is completed. To monitor metrics of a job, ensure that its survival time is greater than 90s so that its metric data can be collected.
Metric | Resources consumed by the collector | When the collector collects basic metrics, the resources it consumes depend on factors such as the number of containers and the number of processes. On a VM without any services, the collector consumes 30 MB of memory and 1% of CPU. To ensure collection reliability, ensure that fewer than 1000 containers run on a single node.
Threshold rule (regions except CN North-Beijing1 and CN East-Shanghai2) | Threshold rule | A maximum of 1000 threshold rules can be created in a project.
Threshold rule (regions except CN North-Beijing1 and CN East-Shanghai2) | Number of topics that can be selected | A maximum of five topics can be selected for each threshold rule.
Threshold rule (CN North-Beijing1 and CN East-Shanghai2) | Threshold rule | (1) A maximum of 1000 static threshold rules can be created. (2) A maximum of 10 intelligent threshold rules can be created.
Threshold rule (CN North-Beijing1 and CN East-Shanghai2) | Threshold template | (1) A maximum of 50 static templates can be created. (2) A maximum of 10 intelligent templates can be created.
Threshold rule (CN North-Beijing1 and CN East-Shanghai2) | Number of topics that can be selected | A maximum of five topics can be selected for each threshold rule.
Notification rule | Number of topics that can be selected | A maximum of five topics can be selected for each notification rule.
Log | Size of a log | The maximum size of each log is 10 KB. If a log exceeds 10 KB, the ICAgent does not collect it. That is, the log will be discarded.
Log | Log traffic | A maximum of 10 MB/s is supported for each tenant in a region. If the log traffic exceeds 10 MB/s, logs may be lost. If you require more log traffic, submit a service ticket. For details, see Submitting a Service Ticket.
Log | Historical log | The storage duration of log data and the storage fees vary according to editions. For details, see AOM Pricing Details.
Log | Log file | The ICAgent can collect a maximum of 20 log files from a volume mounting directory. The ICAgent can collect a maximum of 1000 standard container output log files. These files must be in JSON format.
Log | Resources consumed during log file collection | The resources consumed during log file collection are closely related to the log volume, number of files, network bandwidth, and backend service processing capability.
Log | Log loss | The collector uses multiple mechanisms to ensure log collection reliability and prevent data loss. However, logs may be lost in the following scenarios: (1) The log rotation policy of Cloud Container Engine (CCE) is not used. (2) Log files are rotated at a high speed, for example, once per second. (3) Logs cannot be forwarded due to improper system security settings or syslog. (4) The container running time is extremely short. (5) A single node generates logs at a high speed, exceeding the allowed transmit bandwidth or log collection speed. It is recommended that the log generation speed of a single node be lower than 5 MB/s.
Log | Repetitive logs | When the collector is restarted, repetitive data may be collected around the restart time.
Alarm center | Alarm | You can query the alarms generated in the last 30 days.
Alarm center | Event | You can query the events generated in the last 30 days.
Second-level monitoring | Metric | (1) Second-level monitoring takes effect for 60 minutes. After you click Start Second-Level Monitoring, the Stop Second-Level Monitoring option becomes available and the system starts to count down. An hour later, AOM automatically stops second-level monitoring. (2) Second-level monitoring does not support immediate start and stop. The minimum interval between a start and a stop is 1 minute. To stop second-level monitoring right after it is started, wait for 1 minute. To start second-level monitoring right after it is stopped, wait for 1 minute. (3) When second-level monitoring is started, the statistical period of the metrics on the Metric Monitoring and Dashboard pages is automatically changed to the statistical period you set. When second-level monitoring is stopped, the default statistical period is used. (4) Currently, custom metrics, SLA metrics, and cluster metrics do not support second-level monitoring. For details, see SLA Metrics and Metric Dimensions and Cluster Metrics and Metric Dimensions.
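The metric query limits in Table 6-2 (20 metrics per query, 1440 data points per metric, and a statistical period of at most 1 hour) can be checked on the client side before a request is sent. The sketch below shows one way to do so. It assumes that the number of returned data points equals the queried time range divided by the statistical period, which the source does not state explicitly. All function and constant names are hypothetical and do not belong to the AOM API.

```python
import math

# Limits copied from Table 6-2.
MAX_METRICS_PER_QUERY = 20
MAX_POINTS_PER_QUERY = 1440
MAX_PERIOD_SECONDS = 3600  # maximum statistical period: 1 hour


def batch_metrics(metric_names, batch_size=MAX_METRICS_PER_QUERY):
    """Split a list of metric names into batches that fit one query each."""
    return [metric_names[i:i + batch_size]
            for i in range(0, len(metric_names), batch_size)]


def points_in_range(range_seconds, period_seconds):
    """Estimate data points per metric (assumption: one point per period)."""
    if not 0 < period_seconds <= MAX_PERIOD_SECONDS:
        raise ValueError("statistical period must be within (0, 3600] seconds")
    return math.ceil(range_seconds / period_seconds)


def fits_single_query(range_seconds, period_seconds):
    """Return True if the estimated point count stays within the limit."""
    return points_in_range(range_seconds, period_seconds) <= MAX_POINTS_PER_QUERY


batches = batch_metrics([f"metric{i}" for i in range(45)])
print([len(b) for b in batches])     # [20, 20, 5]
print(fits_single_query(86400, 60))  # True: 24 hours at 1-minute periods
```

Under this assumption, a 24-hour range at a 1-minute period yields exactly 1440 points, so any longer range at that period would need to be split into several queries.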

7 Relationships Between AOM and Other Services

Application Operations Management (AOM) can work with Simple Message Notification (SMN), Distributed Message Service (DMS), Cloud Trace Service (CTS), and other services. For example, when you subscribe to SMN, AOM can inform related personnel of threshold rule status changes by email or Short Message Service (SMS) message. When AOM interconnects with middleware services such as Virtual Private Cloud (VPC) and Elastic Load Balance (ELB), you can monitor such services through AOM. When AOM interconnects with Cloud Container Engine (CCE) or Cloud Container Instance (CCI), you can monitor their basic resources and applications, and view related logs and alarms.

Figure 7-1 Relationships between AOM and other services

SMN

SMN can push notifications based on requirements, and you can receive notifications by SMS message, email, or app. You can also integrate application functions through SMN to reduce system complexity.

AOM uses the message transmission mechanism of SMN. When it is inconvenient for you to query threshold rule status changes on site, AOM sends such changes to you by email or SMS message. In this way, you can obtain resource status and other information in real time and take necessary measures to avoid service loss. For details, see Creating a Static Threshold Rule and Creating an Intelligent Threshold Rule.

OBS

Object Storage Service (OBS) is a secure, reliable, and cost-effective cloud storage service. With OBS, you can easily create, modify, and delete buckets, as well as upload, download, and delete objects.

AOM allows you to dump logs to OBS buckets for long-term storage. For details, see Adding a Log Dump.

DMS

DMS is a fully managed, high-performance message queuing service that supports standard, FIFO, Kafka, and ActiveMQ queues. It is compatible with HTTP, TCP, and AMQP, and provides a flexible and reliable asynchronous communication mechanism for distributed applications.

AOM calls the transmission API of DMS to send log or threshold alarm data to specified DMS Kafka queues. In this way, your applications can retrieve log or threshold alarm data from these queues. For details, see Subscribing to Logs and Subscribing to Threshold Alarms.

CTS

CTS records operations on cloud resources in your account. You can use the records to perform security analysis, track resource changes, conduct compliance audits, and locate faults. To store operation records for a longer time, you can subscribe to OBS and synchronize operation records to OBS in real time.

With CTS, you can record operations associated with AOM for future query, audit, and tracing. For details, see Key Operations on AOM.

IAM

Identity and Access Management (IAM) provides identity authentication, permission management, and access control.

IAM can implement authentication and fine-grained authorization for AOM.

Cloud Eye

Cloud Eye provides a multi-dimensional monitoring platform for resources such as Elastic Cloud Server (ECS) and bandwidth. With Cloud Eye, you can view the resource usage and service running status in the cloud, and respond to exceptions in a timely manner to ensure smooth running of services.

By calling Cloud Eye APIs, AOM can obtain and display monitoring data of ECS, VPC, Relational Database Service (RDS), and Distributed Cache Service (DCS), so that you can monitor these services on the AOM console.

APM

Application Performance Management (APM) monitors and manages the performance of cloud applications in real time. APM provides performance analysis of distributed applications, helping O&M personnel quickly locate and resolve faults and performance bottlenecks.

AOM integrates APM functions to better monitor and manage applications.

VPC

VPC is a logically isolated virtual network. It is created for ECS servers, and supports custom configuration and management, improving resource security and simplifying network deployment.

After subscribing to VPC, you can monitor VPC running status and metrics on the AOM page without installing other plug-ins.

ELB

ELB distributes access traffic to multiple backend ECS servers based on forwarding policies. By distributing traffic, ELB expands the capabilities of application systems to provide services externally. By preventing single points of failure, ELB improves the availability of application systems.

After subscribing to ELB, you can monitor ELB running status and metrics on the AOM page without installing other plug-ins.

RDS

RDS is a cloud-based web service that is reliable, scalable, easy to manage, and ready to use out of the box.

After subscribing to RDS, you can monitor RDS running status and metrics on the AOM page without installing other plug-ins.

DCS

DCS is an online, distributed, in-memory cache service compatible with Redis, Memcached, and In-Memory Data Grid (IMDG). It is reliable, scalable, ready to use out of the box, and easy to manage, meeting your requirements for high read/write performance and fast data access.

After subscribing to DCS, you can monitor DCS running status and metrics on the AOM page without installing other plug-ins.

CCE

CCE is a high-performance and scalable container service through which enterprises can build reliable containerized applications. It integrates network and storage capabilities, and is compatible with Kubernetes and Docker container ecosystems. CCE enables you to create and manage diverse containerized workloads easily. It also provides efficient O&M capabilities, such as container fault self-healing, monitoring log collection, and auto scaling.

You can monitor basic resources, applications, logs, and alarms about CCE on the AOM page.

CCI

CCI is a serverless container engine that allows you to run containers without creating and managing server clusters.

You can monitor basic resources, applications, logs, and alarms about CCI on the AOM page.

AOS

Application Orchestration Service (AOS) provides a graphical designer, enabling you to provision cloud service resources and deploy applications intuitively and easily. By compiling templates, you can provision and copy cloud service resources and applications with just a few clicks. AOS also provides a large number of free sample templates, covering common cloud service and application scenarios. You can directly use these templates or customize your own templates.

You can monitor basic resources, applications, logs, and alarms about AOS on the AOM page.

ServiceStage

ServiceStage is a one-stop PaaS service that provides cloud-based application hosting, simplifying application lifecycle management from deployment and monitoring to O&M and governance. It provides a microservice framework compatible with mainstream open-source ecosystems and enables quick building of distributed applications.

You can monitor basic resources, applications, logs, and alarms about ServiceStage on the AOM page.

FunctionGraph

FunctionGraph hosts and computes functions in a serverless context. It automatically scales up/down resources during peaks and spikes without requiring the reservation of dedicated servers or capacities. Resources are billed on a pay-per-use basis.

You can monitor basic resources, applications, logs, and alarms about FunctionGraph on the AOM page.

IEF

Intelligent EdgeFabric (IEF) manages edge nodes of users, extends cloud applications to edge nodes, and associates edge and cloud data, meeting customer requirements for remote control, data processing, analysis, decision-making, and intelligence of edge computing resources. IEF also provides unified on-cloud O&M capabilities, such as device/application monitoring and log collection, to offer a complete edge computing solution that contains integrated services under edge and cloud collaboration.

You can monitor resources (such as edge nodes, applications, and functions), logs, and alarms about IEF on the AOM page without installing other plug-ins.

ECS

An Elastic Cloud Server (ECS) is a computing server consisting of CPUs, memory, an image, and Elastic Volume Service (EVS) disks. It supports on-demand allocation and auto scaling. ECSs integrate VPC, virtual firewall, and multi-data-copy capabilities to create an efficient, reliable, and secure computing environment. This ensures stable and uninterrupted running of services. After creating an ECS server, you can use it as you would use a local computer or physical server.

When purchasing an ECS server, ensure that its operating system (OS) meets the requirements in Table 1 OSs and versions supported by AOM. In addition, install the ICAgent on the server according to Installing the ICAgent. Otherwise, the server cannot be monitored by AOM. You can monitor basic resources, applications, logs, and alarms about ECS on the AOM page.

BMS

Bare Metal Server (BMS) is a dedicated physical server in the cloud. It provides high-performance computing and ensures data security for core databases, key application systems, and big data. With the advantage of scalable cloud resources, you can apply for BMS servers flexibly and they are billed on a pay-per-use basis.

When purchasing a BMS server, ensure that its OS meets the requirements in Table 1 OSs and versions supported by AOM. In addition, install the ICAgent on the server according to Installing the ICAgent. Otherwise, the server cannot be monitored by AOM. You can monitor basic resources, applications, logs, and alarms about BMS on the AOM page.

8 Basic Concepts

Metrics

Metrics reflect resource performance data or status. A metric consists of the namespace, dimension, name, and unit.

Metric namespaces indicate containers for storing metrics. Metrics in different namespaces are independent of each other so that metrics of different applications will not be aggregated to the same statistics information. Each metric has certain features, and a dimension may be considered as a category of such features. Figure 8-1 describes the relationships among namespaces, dimensions, and cluster metrics.

Figure 8-1 Cluster metrics
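The relationship among namespaces, dimensions, names, and units can be illustrated with a minimal sketch. It is not AOM's data model: the class names, the namespace strings (APP.A, APP.B), the metric name cpuUsage, and the dimension key clusterName are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Identity of a metric: namespace, dimensions, name, and unit."""
    namespace: str           # isolates the metrics of one application or resource type
    name: str                # metric name (hypothetical values are used below)
    unit: str                # unit of the reported value
    dimensions: tuple = ()   # sorted (key, value) pairs that categorize the metric


def make_metric(namespace, name, unit, **dimensions):
    """Build a hashable Metric from keyword dimensions."""
    return Metric(namespace, name, unit, tuple(sorted(dimensions.items())))


class MetricStore:
    """Keeps samples per metric; statistics never cross namespace boundaries."""

    def __init__(self):
        self._samples = defaultdict(list)

    def put(self, metric, value):
        self._samples[metric].append(value)

    def average(self, namespace, name):
        """Average one metric name over all dimensions within a single namespace."""
        values = [
            value
            for metric, samples in self._samples.items()
            if metric.namespace == namespace and metric.name == name
            for value in samples
        ]
        return sum(values) / len(values) if values else None


store = MetricStore()
store.put(make_metric("APP.A", "cpuUsage", "Percent", clusterName="c1"), 40.0)
store.put(make_metric("APP.A", "cpuUsage", "Percent", clusterName="c2"), 60.0)
store.put(make_metric("APP.B", "cpuUsage", "Percent", clusterName="c1"), 90.0)
```

Averaging cpuUsage in namespace APP.A combines the two cluster dimensions of that namespace only. The sample stored under APP.B is never mixed in, which mirrors the independence of namespaces described above.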

The metric storage duration and billing mode are different in Application Operations Management (AOM) basic and pay-per-use editions. For details, see AOM-Pricing Details.

Hosts

Each host of AOM corresponds to a VM or physical machine. A host can be your own VM or physical machine, or a VM (for example, an Elastic Cloud Server (ECS)) or a physical machine (for example, a Bare Metal Server (BMS)) that you purchase from the cloud. A host can be connected to AOM for monitoring only when its operating system (OS) meets the requirements in OS Usage Restrictions and the ICAgent is installed on the host.

ICAgent

ICAgent is the collector of AOM. It runs on each host to collect metrics, logs, and application performance data in real time. Before using AOM, install the ICAgent. Otherwise, AOM cannot be used.

Logs

AOM supports querying and analyzing massive logs; collecting, downloading, dumping, and searching for logs; and analyzing reports, querying logs using SQL statements, monitoring in real time, and reporting alarms based on keyword statistics.

The log storage duration, size, and billing mode are different in AOM basic and pay-per-use editions. For details, see AOM-Pricing Details.

Log Buckets

Log buckets are logical groups of log files. Before creating statistical rules or querying bucket logs, you need to add a log bucket.

Bucket Logs

Bucket logs support fine-grained query. You can view logs by bucket to obtain key service data, and quickly understand and locate problems.

Alarms

Alarms are reported when AOM or an external service such as Application Orchestration Service (AOS), ServiceStage, Cloud Container Engine (CCE), or Application Performance Management (APM) is abnormal or may cause exceptions. Because alarms indicate actual or potential service exceptions, they need to be handled.

There are two alarm clearance modes:

• Automatic clearance: After a fault is rectified, AOM automatically clears the corresponding alarm, for example, a threshold alarm. You do not need to perform any operations.

• Manual clearance: After a fault is rectified, AOM does not automatically clear the corresponding alarm, for example, an ICAgent installation failure alarm. You need to manually clear the alarm.
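The difference between the two clearance modes can be sketched as follows. This is an illustrative model only: the class and function names are hypothetical and do not correspond to any AOM API.

```python
from dataclasses import dataclass
from enum import Enum


class ClearanceMode(Enum):
    AUTOMATIC = "automatic"   # cleared by the system once the fault is rectified
    MANUAL = "manual"         # stays active until a user clears it


@dataclass
class Alarm:
    name: str
    mode: ClearanceMode
    active: bool = True


def on_fault_rectified(alarm):
    """Handle a rectified fault; only automatic-clearance alarms are cleared."""
    if alarm.mode is ClearanceMode.AUTOMATIC:
        alarm.active = False
    return alarm.active


def clear_manually(alarm):
    """User action that clears an alarm regardless of its clearance mode."""
    alarm.active = False


threshold_alarm = Alarm("threshold alarm", ClearanceMode.AUTOMATIC)
install_alarm = Alarm("ICAgent installation failure", ClearanceMode.MANUAL)
on_fault_rectified(threshold_alarm)   # cleared automatically
on_fault_rectified(install_alarm)     # remains active until cleared by hand
```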

Events

Events generally carry some important information. They are reported when AOM or an external service, such as AOS, ServiceStage, CCE, or APM, encounters some changes. Such changes do not necessarily cause service exceptions. Events do not need to be handled.

Threshold Rules

Threshold rules include static and intelligent threshold rules.

• Static threshold rules: You can set threshold criteria for resource metrics. AOM reports a threshold alarm when the value of a metric reaches the preset threshold, or reports an "insufficient data" event when no metric data is reported. AOM interconnects with Simple Message Notification (SMN). When the static threshold rule status (Exceeded, OK, or Insufficient) changes, a notification is sent by email or Short Message Service (SMS) message. In this way, you can detect and handle exceptions at the earliest time.

• Intelligent threshold rules: Based on the defined threshold sensitivity and historical trend of metrics, AOM intelligently predicts the range of metric data. If the data deviates from the range, a threshold alarm is reported. AOM interconnects with SMN. When the rule status (Exceeded, OK, or Insufficient) changes, a notification is sent by email or SMS message. In this way, you can detect and handle exceptions at the earliest time. When an intelligent threshold rule is used, you do not need to set thresholds. AOM detects exceptions through machine learning. For large systems, intelligent threshold rules effectively reduce manual configuration costs, and avoid repeated adjustments.
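The status behavior of both rule types can be approximated with a short sketch. Several points are assumptions rather than documented AOM behavior: the comparison for a static rule is taken to be "greater than or equal to", and the intelligent rule is replaced by a simple stand-in that treats the expected range as the historical mean plus or minus a sensitivity multiple of the standard deviation. AOM's actual machine learning method is not described in this document.

```python
from statistics import mean, pstdev


def static_status(value, threshold):
    """Status of a static threshold rule for one sample (None means no data)."""
    if value is None:
        return "Insufficient"
    return "Exceeded" if value >= threshold else "OK"


def intelligent_status(value, history, sensitivity=3.0):
    """Stand-in for an intelligent rule: range = mean +/- sensitivity * std."""
    if value is None or len(history) < 2:
        return "Insufficient"
    center, spread = mean(history), pstdev(history)
    low, high = center - sensitivity * spread, center + sensitivity * spread
    return "OK" if low <= value <= high else "Exceeded"


def status_changes(statuses):
    """Return (old, new) pairs; each change is where a notification would be sent."""
    changes = []
    previous = None
    for status in statuses:
        if previous is not None and status != previous:
            changes.append((previous, status))
        previous = status
    return changes


samples = [50, 85, 90, None, 40]
statuses = [static_status(sample, threshold=80) for sample in samples]
notifications = status_changes(statuses)
```

Only status changes produce notifications, so the two consecutive Exceeded samples above result in a single transition from OK to Exceeded.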

Notification Rules

AOM provides the notification function. When an alarm is reported due to an exception in AOM or an external service, alarm information will be sent to the specified personnel by email or SMS message. Therefore, such personnel can rectify faults in time to avoid service loss.

Statistical Rules

AOM can periodically collect statistics about keywords or SQL statements, and generate metric data, enabling you to monitor system performance and service information in real time. You can also set threshold rules for metrics. AOM reports a threshold alarm when the value of a metric reaches the preset threshold. In this way, you can detect and handle exceptions at the earliest time.
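As a rough illustration of a keyword-based statistical rule, the sketch below counts the log lines that contain a keyword in each period and then applies a threshold to the resulting metric. The log line layout (epoch seconds, a space, then the message), the function names, and the 60-second period are assumptions made for this example, not AOM's actual log format or rule settings.

```python
from collections import Counter


def keyword_counts(log_lines, keyword, period_seconds=60):
    """Count lines containing the keyword, grouped by the start of each period."""
    counts = Counter()
    for line in log_lines:
        stamp, _, message = line.partition(" ")
        if keyword in message:
            period_start = int(stamp) // period_seconds * period_seconds
            counts[period_start] += 1
    return dict(counts)


def periods_reaching(counts, threshold):
    """Periods whose keyword count reaches the threshold (alarm candidates)."""
    return sorted(start for start, count in counts.items() if count >= threshold)


lines = [
    "0 ERROR connection refused",
    "30 INFO request served",
    "59 ERROR timeout",
    "60 ERROR timeout",
    "125 WARN retry after ERROR",
]
error_metric = keyword_counts(lines, "ERROR")
alarm_periods = periods_reaching(error_metric, threshold=2)
```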

Topologies

Topologies show the call and dependency relationships between services. A topology view consists of circles, lines with arrows, and resources. Each circle represents a service, and each segment in the circle represents an instance. The fraction in each circle indicates the number of active instances/total number of instances. The values below a fraction respectively indicate the call count, latency, and error count. Each line with an arrow represents a call relationship. Thicker lines indicate more calls. The values above a line respectively indicate the throughput and total latency. Throughput indicates the call count within the selected period. Application Performance Index (Apdex) is used in topologies to quantify user satisfaction with application performance. Different colors indicate different Apdex ranges, helping you quickly detect and locate faults.

Figure 8-2 Topology
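For reference, the Apdex value mentioned above is conventionally computed from response times as (satisfied + tolerating / 2) / total, where a request is satisfied if it completes within a threshold T and tolerating if it completes within 4T. The sketch below follows that general definition. The 500 ms threshold is an assumption for illustration; this document does not state the threshold or the color ranges that AOM uses.

```python
def apdex(latencies_ms, threshold_ms=500.0):
    """Apdex score in [0, 1] computed from a list of response times."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms)
    # Frustrated requests (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(latencies_ms)


score = apdex([100, 200, 600, 1900, 2500])
```

With two satisfied, two tolerating, and one frustrated request, the score is (2 + 2 / 2) / 5 = 0.6.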

Transactions

In real life, a transaction is a one-time task. A user completes a task by using an application. For example, a commodity query in an e-commerce application is a transaction, and a payment is also a transaction. A transaction is usually an HTTP request (complete process: request > web server > database > web server > request).

Tracing

By tracing and recording service calls, AOM visually restores the execution track and status of service requests in distributed systems, so that you can quickly demarcate performance bottlenecks and faults.
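The idea of restoring an execution track from recorded calls can be sketched with a generic span model. This is not AOM's trace format: the Span fields, the service names, and the assumption that child calls run sequentially inside their parent are all hypothetical and used only to show how a bottleneck can be located.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None for the entry call of a request
    service: str
    duration_ms: float


def call_path(spans, span_id):
    """Service names from the entry call down to the given span."""
    by_id = {span.span_id: span for span in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current.service)
        current = by_id.get(current.parent_id)
    return list(reversed(path))


def self_times(spans):
    """Time spent in each span excluding its direct children."""
    result = {span.span_id: span.duration_ms for span in spans}
    for span in spans:
        if span.parent_id in result:
            result[span.parent_id] -= span.duration_ms
    return result


def bottleneck(spans):
    """Service of the span with the largest self time."""
    times = self_times(spans)
    slowest = max(times, key=times.get)
    return next(span.service for span in spans if span.span_id == slowest)


trace = [
    Span("a", None, "web", 120.0),
    Span("b", "a", "order", 100.0),
    Span("c", "b", "database", 80.0),
]
```

In this trace the web and order services each account for 20 ms of their own time, while the database call accounts for 80 ms, so the database is reported as the bottleneck.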

9 Permissions Management

If you need to assign different permissions to employees in your enterprise to access your Application Operations Management (AOM) resources, Identity and Access Management (IAM) is a good choice for fine-grained permissions management. IAM provides identity authentication, permissions management, and access control, helping you secure access to your AOM resources.

With IAM, you can use your HUAWEI CLOUD account to create IAM users for your employees, and assign permissions to the users to control their access to specific resource types. For example, some software developers in your enterprise need to use AOM resources but must not delete them or perform any high-risk operations. To achieve this result, you can create IAM users for the software developers and grant them only the permissions required for using AOM resources.

If your HUAWEI CLOUD account does not need individual IAM users for permissions management, you may skip over this chapter.

IAM can be used free of charge. You pay only for the resources in your account. For more information about IAM, see IAM Service Overview.

Supported System Policies

A policy is a set of permissions defined in JSON format. By default, new IAM users do not have any permissions assigned. You need to add a user to one or more groups, and assign permissions policies to these groups. The user then inherits permissions from the groups it is a member of. This process is called authorization. After authorization, the user can perform specified operations on AOM based on the permissions. IAM provides system policies that define the common permissions for AOM, such as administrator and read-only permissions. You can directly use these system policies to assign permissions.
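The group-based authorization described above can be modeled with a simplified sketch. The policy layout (Version, Statement, Effect, Action), the action names such as aom:metric:get, the wildcard matching, and the rule that a Deny overrides an Allow are assumptions made for illustration. For the actual policy format and the actions that AOM supports, see Policy Syntax and Introduction.

```python
from fnmatch import fnmatchcase

# Hypothetical read-only policy; the action names are illustrative only.
VIEWER_POLICY = {
    "Version": "1.1",
    "Statement": [
        {"Effect": "Allow", "Action": ["aom:*:get", "aom:*:list"]},
    ],
}


def is_authorized(user, action, memberships, group_policies):
    """Check an action against the policies inherited from the user's groups."""
    statements = [
        statement
        for group in memberships.get(user, ())
        for policy in group_policies.get(group, ())
        for statement in policy["Statement"]
    ]
    effects = [
        statement["Effect"]
        for statement in statements
        if any(fnmatchcase(action, pattern) for pattern in statement["Action"])
    ]
    # Assumed evaluation order: an explicit Deny wins; no match means no access.
    return "Allow" in effects and "Deny" not in effects


memberships = {"developer": ["aom-viewers"], "new-user": []}
group_policies = {"aom-viewers": [VIEWER_POLICY]}
```

A user without any group, such as new-user above, has no permissions at all, which matches the default state of new IAM users.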

AOM is a project-level service deployed in specific physical regions. Therefore, AOM permissions are assigned to users in specific regions (such as CN North-Beijing1) and only take effect for these regions. If you want the permissions to take effect for all regions, you need to assign the permissions to users in each region. When accessing AOM, the users need to switch to a region where they have been authorized to use this service.

Table 9-1 lists all the system policies supported by AOM. There are fine-grained policies and role-based access control (RBAC) policies. AOM supports fine-grained policies only. A fine-grained policy consists of API-based permissions for operations on specific resource types. Fine-grained policies, as the name suggests, allow for more fine-grained control than RBAC policies. Users with such a policy assigned are allowed or denied specific operations on AOM. For example, a policy can restrict users to basic operations on AOM, such as querying metrics and threshold rule lists. For the API actions supported by AOM, see Introduction. Fine-grained policies are currently available for open beta testing. You can apply to use the fine-grained access control function free of charge. For more information, see Fine-grained Policy.

Table 9-1 Supported system policies

Policy Name Description Policy Type

AOM Admin Administrator permissions for AOM. Users granted these permissions can operate and use AOM. Fine-grained policy

AOM Viewer Read-only permissions for AOM. Users granted these permissions can only view AOM data. Fine-grained policy

Table 9-2 lists the common operations supported by each system policy of AOM. Choose proper system policies according to this table.

Table 9-2 Common operations supported by each system policy

Operation AOM Admin AOM Viewer

Creating a threshold rule √ x

Modifying a threshold rule √ x

Deleting a threshold rule √ x

Creating a threshold template √ x

Modifying a threshold template √ x

Deleting a threshold template √ x

Creating a dashboard √ x

Modifying a dashboard √ x

Deleting a dashboard √ x

Creating a notification rule √ x

NOTE: First assign Simple Message Notification (SMN) permissions to IAM users. For details, see What Can I Do If I Do Not Have the Permission to Access SMN.

Modifying a notification rule √ x

NOTE: First assign SMN permissions to IAM users. For details, see What Can I Do If I Do Not Have the Permission to Access SMN.

Deleting a notification rule √ x

Creating a service discovery rule √ x

Modifying a service discovery rule √ x

Deleting a service discovery rule √ x

Subscribing to threshold alarms √ x

Exporting a monitoring report √ √

Starting second-level monitoring √ √

Configuring a VM log collection path √ x

Adding a log bucket √ x

Modifying a log bucket √ x

Deleting a log bucket √ x

Adding an extraction rule √ x

Viewing bucket logs √ √

Adding a log dump √ x

Modifying a log dump √ x

Deleting a log dump √ √

Starting periodical dump √ √

Stopping periodical dump √ √

Creating a statistical rule √ x

Modifying a statistical rule √ x

Deleting a statistical rule √ x

Configuring a delimiter √ x

Subscribing to logs √ x

Installing the ICAgent √ √

Upgrading the ICAgent √ x

Uninstalling the ICAgent √ x

Helpful Links

• IAM Service Overview
• Adding Users and Assigning AOM Permissions
• Policy Syntax
• Introduction

10 Feature Description

Table 10-1 Feature description

Released On Description

2019-08-30
• Supported access to Prometheus O&M data.
• Added three container metrics: available file system, total file system, and file system usage.
• Added NPU metrics to host and container metrics.
• Supported log collection using the ICAgent in the Windows Operating System (OS).

2019-08-21
• Supported measurement and analysis of user operation data of apps, helping you conduct operation activities more easily.
• Supported measurement and analysis of network interaction data of apps, helping you optimize networks and improve app user experience.

2019-08-09 Added AOM collection configuration, and supported collection of both metrics and logs.

2019-01-31
• Supported structuring of original logs, and query and analysis of structured logs using SQL statements. Learn more
• Removed the function of configuring container log collection paths.

2018-12-05 Supported the function of configuring container log collection paths. Learn more

2018-11-27 Supported query of bucket logs. Learn more

2018-10-31 Supported configuration of delimiters. You can separate log contents into multiple words by using delimiters, and then search for logs based on these words. Learn more

2018-10-24 Supported creation of statistical rules. You can periodically collect the number of keywords in log files and generate metrics data. Learn more

2018-09-26 Supported log dump. You can dump log files in log buckets to Object Storage Service (OBS) buckets. Learn more

2018-09-13 Supported fine-grained authorization, precisely allowing or disallowing you to perform a specific operation on a specific resource.

2018-09-05
• Multi-dimensional cloud application O&M: Provides a full-link, multi-layer, and one-stop O&M platform for resources, applications, and application experience.
• Intelligent O&M: Provides an intelligent threshold mechanism, which supports dynamic threshold alarm reporting based on machine learning, improving the monitoring efficiency.
• Device-side analysis: Supports performance metric and crash analysis of browser and mobile applications, achieving full control over applications.
• Transaction insights: Supports automatic discovery of transaction performance problems, intelligent filtering, and Root Cause Analysis (RCA).
• Middleware monitoring: Supports monitoring of the statuses and metrics of middleware such as Relational Database Service (RDS) and Distributed Cache Service (DCS) on the Application Operations Management (AOM) page without installing other plug-ins.

2018-08-15 Supported creation of notification rules, enabling alarm information to be sent to specified personnel by Short Message Service (SMS) message or email.

2018-08-05 Supported batch creation of static and intelligent threshold rules based on templates in North China.

2018-07-12
• Supported second-level monitoring. Learn more
• Supported disk and file system monitoring.

2018-05-16 Supported collection of metrics of VM applications (using Java, Node.js, or Python language). Supported discovery of applications based on custom rules. Learn more

2018-04-15 Provided both basic and professional editions, meeting your different demands for metric and log storage.

2018-03-15 This issue is the first official release.
