
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty-free right to publish this paper in CMG India Annual Conference Proceedings.

Conflux: Distributed, real-time actionable insights on high-volume data streams

Vinay Eswara, VMware India, Bangalore, [email protected]

Jai Krishna, VMware India, Bangalore, [email protected]

Gaurav Srivastava, VMware India, Bangalore, [email protected]

Abstract

Systems and applications are most useful when they can react to events and data in real time, because the utility of the reaction typically decreases with time. In cloud environments with high data volumes, identifying and filtering relevant data, processing it, and reacting in real time become challenging.

This paper presents the design and implementation of Conflux, a system for merging high-volume data streams in a scalable, fault-tolerant way. It allows users to declaratively specify complex merge functions over data streams and evaluates these functions in real time. It also provides a callout facility through which applications can specify conditional actions in the context of the merged result stream. This allows applications to delegate stream merging and evaluation and focus on the resulting application-specific reactions.

Consider a stock-market use case where news feeds, stock quotes, risk appetite, and currency values must be continuously monitored. When a favourable situation occurs, the system must automatically place buy or sell orders to maximize profit within the specified risk bounds. With Conflux, the operator only needs to express application actions as a function over the current and historical stock-quote, risk, and currency streams, and provide stub code for buying, selling, or paging a trader. Similar use cases can be realized in domains such as IoT and fraud detection, where real-time reaction at scale is critical.

Conflux itself is deployed on a cluster so as to scale linearly with data volume. It tolerates node failures by transparently redistributing streams on the fly onto survivor nodes using consistent hashing [1], while compensating for any data loss.

1.   Introduction

Stream processing systems are critical for making accurate decisions in rapidly changing environments. In scenarios with a multitude of disparate data streams that need to be grouped and merged on the fly, the following issues need to be addressed:

Grouping: Users should be able to specify the grouping of data streams before any merge functions are applied.

Merging: Streams of data need to be merged based on pluggable, domain-specific rules, specified by users of Conflux, that compute the result.

Change: Changes to the user-specified merge function, and group updates as individual data sources enter or leave the group, should be reflected immediately in the result of the merge.

Consider an example of automating Dev-Ops responses when running a critical SaaS application:

Operators’ definition of ‘critical workload’ may vary with time: additional mid-tier machines may be added to the cluster or decommissioned, and completely new application clusters may be marked critical. Conflux needs to track the current snapshot of ‘criticality’ for the automatic Dev-Ops system to be effective, i.e. it should allow users to specify a group of ‘critical data sources’ and guarantee that the group is updated in real time.

Developers, maintainers, and operators of a system usually know what constitutes a problem, expressed as a relationship between data streams. Since the users of the system know best, Conflux should allow developers, maintainers, and operators to specify the relationship in a way that does not require recompiling or redeploying Conflux itself. A secondary objective is that specifying these merge relationships should be as simple as possible.

If either the grouping or the merge function is changed, the change should be reflected in the composite merged stream immediately; otherwise, reactions would be based on a stale view until the new declarations take effect.

Conflux is designed to address these high-volume, low-latency stream-merging use cases. Typical use cases include IoT, Dev-Ops, faceted dashboarding, and monitoring. It insulates users from scale, availability, and performance concerns and allows them to focus on their application logic.

Conflux users can specify application-specific aggregates, define thresholds, and provide callout actions. This application code is written in JavaScript and runs on the JVM, which allows declarations to be modified dynamically without recompiling or redeploying the system.
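To illustrate the shape of such a threshold-plus-callout declaration: production declarations are JavaScript per the text above, but the following minimal Java sketch conveys the idea. All names in it (the class, the metric name, the callout URL) are hypothetical, not part of Conflux's API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical illustration of a user-declared threshold + callout.
// Names (Callout, evaluate, CALLOUT_URL) are ours, not Conflux's API.
public class Callout {
    private static final String CALLOUT_URL = "https://example.org/hooks/scale-up"; // hypothetical
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Fire the HTTP callout when the aggregated metric crosses the threshold.
    static void evaluate(String streamId, String metric, double value) throws Exception {
        if ("cpu.avg".equals(metric) && value > 90.0) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(CALLOUT_URL))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(
                            String.format("{\"id\":\"%s\",\"metric\":\"%s\",\"value\":%.2f}",
                                    streamId, metric, value)))
                    .build();
            HTTP.send(req, HttpResponse.BodyHandlers.ofString());
        }
    }
}
```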

Conflux does not address the problem of building models using machine learning or other methods; it focuses on evaluating the results of an existing model function with low latency. It supports adding, deleting, and modifying models in real time. It is hosted as a SaaS platform and supports a multi-tenant model. It uses consistent hashing to partition data streams, based on the ID of the data source emitting the stream, in order to parallelize memory and compute resources across the cluster and to cope with node failure.

In this paper, we propose:

i.   A scalable, fault-tolerant, low-latency way to evaluate models expressed as relationships between different high-volume streams.

ii.   A way to define new models, redefine existing models, and delete models dynamically in real time.

iii.   A way to threshold results and invoke external HTTP callouts for application-specific actions.

The rest of the paper is organized as follows: Section 2 presents the background for Conflux and related work; Section 3 presents the system design; Section 4 describes the implementation of Conflux; Section 5 evaluates behavior and performance under different scenarios; finally, Section 6 contains concluding remarks and directions for future work.

2.   Motivation and related work

Conflux originated as a general monitoring system [12] for virtual machines running critical workloads. The following requirements in this system led to the conceptualization of Conflux:

Figure 1. The first part of the figure represents routing to Conflux nodes using a consistent hash of the stream IDs. The second shows ease of deployment/maintenance of Conflux versus a typical distributed application.


i.   Metrics were collected at the VM level, and there was a need for real-time rollups to provide useful real-time dashboarding.

ii.   It was useful to allow the customer to synthesize new metrics by merging metric streams from applications and the underlying infrastructure with resource billing and allocation metrics, providing real-time, low-granularity economic ROI dashboards.

iii.   A logical extension was to invoke HTTP callouts, where tenants can embed advisory or remedial actions such as notifications and auto-scaling respectively.

There are several existing stream merging systems available [2,3,4,5,6,7,8,9]. Twitter’s Heron, Google MillWheel, and Google Photon are the closest in intent to Conflux, but are closed source. Other systems that have inspired Conflux are Apache Spark, Apache Storm, Apache Samza, etc. The critical reason for building Conflux instead of using these systems was twofold:

I.   It was an explicit requirement that all nodes of the cluster processing system be identical with respect to the software stack, i.e. there are no ‘master’ or ‘slave’ nodes. We have found that this greatly simplifies deployment and upgrades in production, since the VM snapshots used for deployment are identical; the tuning of the different nodes and their CPU, RAM, and disk configurations are all identical. This requirement also implies that when the Conflux cluster deployment is under high load, all nodes should be almost equally loaded: there should be no single point of failure or hot-spotting, and adding new nodes to the cluster should uniformly reduce load across it. This requirement eliminated all systems with master-slave topologies and systems requiring nodes with heterogeneous software components. Homogeneous node types allow us to leverage established technology like VM snapshotting, storage, and deployment to monitor and manage the cluster.

II.   Another requirement was the ability to truly operate on streams ‘on the fly’. Thus, frameworks that do any form of in-memory caching, e.g. the micro-batching in Spark Streaming, were not considered.

3.   System Design

a)   Definitions

Agents: external systems that push data to Conflux in a format that it understands.

Packet: the atomic unit of input and output in Conflux. A packet is a set of (ID, Metric, Timestamp, Value) tuples, which allows for composition and pipelining of data operations. The ID is a universally unique identifier corresponding to each data source.

Stream: a logically unbounded sequence of tuples bearing the same ID.

Routing: the process of consistent-hashing the ID in each packet against the set of live nodes in the Conflux cluster to decide which node the packet is delivered to.
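As an illustration of this definition, here is a minimal, self-contained consistent-hash ring in Java. The virtual-node count and MD5 hash are our assumptions (the deployed system delegates routing to RabbitMQ's consistent-hash plugin, per Section 3.g), but the behavior is the same: when a node leaves, only streams homed to that node move.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent-hash ring sketch for routing stream IDs to nodes.
// Virtual-node count and hash choice are assumptions, not Conflux internals.
public class Ring {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VNODES = 128; // replicas per node, assumed

    public void addNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < VNODES; i++) ring.remove(hash(node + "#" + i));
    }

    // Route a stream ID to the first node clockwise on the ring.
    // Assumes at least one node has been added.
    public String route(String streamId) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(streamId));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff); // fold first 8 bytes
            return h;
        } catch (Exception ex) { throw new RuntimeException(ex); }
    }
}
```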

Data Source: the entity being monitored for observable phenomena. It could be a virtual machine in the case of monitoring, a process in the case of log-file streams, or a sensor in IoT. A single data source generates a single stream.

Metric: an individual, time-stamped, measurable property of a phenomenon being observed. A data source can emit multiple metrics; for example, a data source corresponding to a VM can emit CPU, RAM, and IOPS metrics identified by the VM ID, and a stock ticker can emit stock quotes identified by the stock symbol as the ID. Metric values are not restricted to numbers; strings and Booleans are also allowed.

Tags: empty metrics without timestamp/value tuples, used for logical grouping of streams.

Feed forward: when a node receives a packet with some ID ‘X’, it looks up all the groups that ‘X’ belongs to and, for each group, retransmits the packet after applying the group-specific transformation. Each transformed packet’s ID ‘X’ is overwritten with the respective group ID. This process of retransmission is called feed forward.

Since the same group ID is used for routing by all members of a given group, the result of the feed forward is always homed to one specific node, although individual members may be homed to different nodes. This allows pre-processing to be parallelized before the streams are merged [Figure 2].
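The following Java sketch illustrates the feed-forward step. The interfaces and names (the groups map, Transform, Transport, Packet) are hypothetical stand-ins; the real system retransmits over RabbitMQ with the group ID as the routing key.

```java
import java.util.List;
import java.util.Map;

// Sketch of feed forward; names are hypothetical, and transport in the
// real system is RabbitMQ with the group ID as routing key.
public class FeedForward {
    interface Transform { Packet apply(Packet p); }               // per-group transform
    interface Transport { void send(String routingKey, Packet p); }

    private final Map<String, List<String>> groupsOf;  // streamId -> group IDs (node-local cache)
    private final Map<String, Transform> transformOf;  // groupId -> transform (assumed present)
    private final Transport transport;

    FeedForward(Map<String, List<String>> groupsOf,
                Map<String, Transform> transformOf, Transport transport) {
        this.groupsOf = groupsOf;
        this.transformOf = transformOf;
        this.transport = transport;
    }

    // For each group the stream belongs to, transform the packet, overwrite
    // its ID with the group ID, and retransmit. Because the group ID is the
    // routing key, output from all members of a group lands on one node.
    void onPacket(Packet p) {
        for (String groupId : groupsOf.getOrDefault(p.id(), List.of())) {
            Packet t = transformOf.get(groupId).apply(p);
            transport.send(groupId, t.withId(groupId));
        }
    }

    // One (ID, Metric, Timestamp, Value) tuple, per the Packet definition above.
    record Packet(String id, String metric, long timestamp, Object value) {
        Packet withId(String newId) { return new Packet(newId, metric, timestamp, value); }
    }
}
```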

Merging: the process of joining multiple streams to create a single composite stream. Comparing feed forward and merging to MapReduce: feed forward is like map, in that it operates over individual streams and can be parallelized; merging is analogous to reduce, which aggregates multiple values into one metric.

Ingestion: Conflux expects agents to emit streams in a specific format, with an ID uniquely identifying each stream. The stream-to-node mapping is decided by a consistent hash of the stream ID against the nodes in the Conflux cluster, which ensures that in steady state a particular stream always lands on a particular node. Streams are roughly equally partitioned between the nodes, so each node can cache metadata about the stream IDs homed to it. This provides horizontal scale, since the fire hose of streams is split among the Conflux nodes in a clustered deployment. When nodes enter or leave the cluster, streams are consistent-hashed against the current snapshot of cluster nodes, which retains the load-balancing characteristics of the cluster.

Groups: Conflux is unique in the way groups are treated. It exposes a synchronous API to create a group and to add or delete constituent members. The create API expects a group name and a list of stream IDs that belong to the group, and returns the group ID. The add and delete APIs expect the group ID and the list of elements to be added to or deleted from the group.

Any group operation is guaranteed to disseminate across the cluster quickly. This is again done by consistent-hashing IDs against nodes. Consider a group created with member IDs A, B, C, returning the group ID G. Apart from persisting this information in the persistent store, Conflux sends out, for each member ID, a message notifying the member that it now belongs to group ‘G’. The message is routed the same way as an agent message: by consistent-hashing the member ID as the key. This ensures that the group membership message reaches the same node where the corresponding stream is homed, and since the message is subject to acknowledgements and retries just like stream messages, it eventually reaches the correct node. To illustrate the reverse case, consider a member ‘X’ deleted from group ‘G’. Conflux persists the information and sends a message using ‘X’ as the routing key, specifying that the X-to-G mapping no longer holds. The ‘X’ routing key ensures that the group membership change message lands on the same node that services ‘X’, and that node updates its cache, removing X from group G. The same sequence of events happens on member addition and group creation. If many members are added or deleted, Conflux sends the messages out in parallel; the remaining steps are identical.
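The synchronous group API described above might look as follows. This is a sketch of its shape only, with method names and types of our choosing rather than Conflux's published interface.

```java
import java.util.List;

// Hypothetical shape of the synchronous group API described above;
// method names and types are ours, not Conflux's published interface.
public interface GroupApi {
    // Creates a group from member stream IDs; returns the new group ID.
    String createGroup(String groupName, List<String> memberStreamIds);

    // Adds members; each addition is persisted, and a membership message is
    // routed (by member ID) to the node where the member's stream is homed.
    void addMembers(String groupId, List<String> memberStreamIds);

    // Removes members; routed the same way, so the owning node's cache
    // drops the member-to-group mapping.
    void deleteMembers(String groupId, List<String> memberStreamIds);
}
```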

Figure 2. Every node maintains a cached list of the groups that a stream homed on it belongs to. All group operations update the list, as shown in the first two diagrams; this leads to fast group updates. The third (‘feed forward’) figure shows how this list is used for routing, enabling merging of grouped streams at one node. Since the node owning a group is determined by consistent hashing and groups have unique IDs, groups themselves are evenly spread across the cluster.



b)   Load Balancing, Scalability, Availability

Load balancing in Conflux is probabilistic, based on consistent hashing. For a given stream, i.e. an unbounded sequence of tuples with the same ID, routing is done by consistent-hashing the ID against the live Conflux nodes in the cluster. Once a packet lands on the cluster, the same ID is used for persisting the data into Cassandra, which ensures that writes to Cassandra are always local: the number of network hops a packet needs from emission by an agent to persistence in Cassandra is exactly one. The probabilistic consistent-hash algorithm ensures that all ingested streams are roughly equally distributed throughout the Conflux clustered deployment. If a node enters or leaves a Conflux cluster, message routing in RMQ changes and Cassandra selects new nodes to write to, both per the latest cluster state. Since the method of node selection is the same in both cases, i.e. consistent-hashing the ID against the set of live nodes, both select the same nodes for the same IDs; thus writes always remain local. For the other nodes in steady state, which are not affected by the cluster state change, the consistent-hash algorithm ensures that streams homed to them remain homed as before, so the caches of these nodes are minimally impacted.

c)   Deployment Architecture

Conflux is deployed in a 5-node cluster that spans 3 racks, allowing for one rack to fail. Connectivity is through the gigabit backplane IP of the data center, to prevent network hops from becoming the bottleneck [Figure 3]. Each node is configured with 8 virtual CPUs, 32 GB RAM, and 1 TB of disk. RMQ is configured for batch acknowledgement, i.e. Conflux consumes messages in batches and acknowledges them; details are provided in the implementation section. Due to the nature of consistent hashing, load tests without groups distribute packets almost identically across nodes, as expected: the standard deviation in CPU percentage is less than 3 when the average CPU is 95% across all 5 nodes in the cluster.

d)   Reactive monitoring of individual streams managed by a group

Grouping allows users of Conflux to address and manage a large, transient list of streams with one group ID. Entries to and exits from the group are automatically taken into account, provided group updates are made accurately by an external system watching group memberships.

Figure 3. A typical deployment of Conflux consists of an odd-numbered set of nodes to allow for quorum election in Cassandra. The nodes are always local to a data center and spread across racks to guard against hardware failure. TCP/IP connectivity is through the data center backplane. In practice a set of 5 nodes is used.


Consider medical IoT devices streaming data into Conflux, where the problem is to page a doctor when there is no heartbeat for a specified time for a given patient. The Conflux user forms a group consisting of the heartbeat data sources (there would be several other data sources in the system measuring blood pressure, temperature, etc.), formulates a ‘no heartbeat’ condition, and specifies the paging callout for notifying a doctor. When any stream in the group of heartbeat data sources satisfies the ‘no heartbeat’ condition, the correct doctor is paged, based on the ID of the data source that caused the problem. This example can be extended to machine-learning-type use cases by specifying cost functions on individual metrics in a packet and calling out when a condition changes. In this example the operation is on the individual stream; the group merely functions as a ‘tag’ classifying classes of streams.

e)   Merging streams based on groups

Groups also allow streams to be merged based on group ID; this is the more interesting use case of Conflux. Since a node maintains the mapping of the groups a stream belongs to, the node can apply a transform to the metrics and retransmit the data using the group ID as the routing key. Note that all members of a group retransmit on the same group ID; as a result, all the retransmitted data from the children lands on the one node on which the group ID is homed. These streams are then merged with the merge function specified for the group, and the result of the group merge can again be fed forward to compose further. Members of a group are expected to have the same sample frequency. There may be cases where groups contain merges of many hundreds of streams, which would create hotspots. Conflux deals with this by:

i.   Limiting group membership to 20.

ii.   Recursively breaking groups larger than 20 into subgroups of 20 elements or fewer, so that very large groups are broken down into a tree-like containment hierarchy.

iii.   This works for cases where the merge functions are associative, e.g. SUM, MAX, MIN, AVG, etc.

iv.   For non-associative merge functions like STDDEV, TOP-10, etc., the group size is restricted to 30.

v.   In practice, large non-associative group merges are roughly expressed using a combination of associative merges over the large group and a final non-associative merge. For example, the TOP-10 over a very large group containing many hundreds of elements can be expressed approximately as the TOP-10 of 30 large groups of MAX.

This is a limitation of Conflux, since such a value may not always be accurate. The other limitation of this design is that load balancing is achieved at the cost of extra network hops, on the order of the logarithm of the group size; this is not very significant in practice. The sketch below illustrates the hierarchical breakdown for an associative merge.
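A minimal sketch of the tree-like breakdown for associative merges, assuming a fan-out equal to the group-membership limit of 20; the names and structure are illustrative, not Conflux internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Sketch of recursive sub-grouping: an associative merge (SUM, MAX, ...)
// over a large group is computed as a tree of merges over sub-groups of
// at most FANOUT members, giving O(log_FANOUT(n)) extra hops of depth.
public class TreeMerge {
    private static final int FANOUT = 20; // group membership limit

    static double merge(List<Double> values, BinaryOperator<Double> op, double identity) {
        if (values.size() <= FANOUT) {
            return values.stream().reduce(identity, op);
        }
        // Merge each sub-group of <= FANOUT values, then recurse on the
        // sub-group results (the "parent group" one level up the tree).
        List<Double> partials = new ArrayList<>();
        for (int i = 0; i < values.size(); i += FANOUT) {
            List<Double> chunk = values.subList(i, Math.min(i + FANOUT, values.size()));
            partials.add(chunk.stream().reduce(identity, op));
        }
        return merge(partials, op, identity);
    }

    public static void main(String[] args) {
        List<Double> vals = new ArrayList<>();
        for (int i = 1; i <= 500; i++) vals.add((double) i);
        System.out.println(merge(vals, Double::sum, 0.0));                    // 125250.0
        System.out.println(merge(vals, Math::max, Double.NEGATIVE_INFINITY)); // 500.0
    }
}
```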

Figure 4. Transforming, grouping, and merging streams.

The stream that results from a group merge is indistinguishable from an incoming stream, i.e. group streams, sub-group streams, and incoming streams can be told apart only by their unique IDs. Conflux treats all streams identically, persisting every incoming packet in Cassandra; the intermediate stages of calculation from sub-groups are persisted as well. Since sub-groups are treated identically to groups, group change notifications travel up the tree and group membership is correctly reflected.

f)   Aligning time in streams

We propose the following approaches for aligning time in feed forward and merge functions (a windowing sketch follows the list):


i.  Wall-clock window, which produces an aggregate packet every time the wall-clock interval is crossed, considering all packets in the interval.

ii.  Sliding window, which produces an aggregate packet for every incoming packet, considering all past packets for the same (ID, metric).

iii.  Timeout: for a given ID, all metrics with timestamps older than this timeout value are not considered for merge or feed forward.
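As a sketch of approach (i), the wall-clock window, the following Java fragment buckets packets by window index and discards packets older than a cutoff (approach iii). The names and the SUM aggregate are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of wall-clock windowing: packets are bucketed by window index,
// a late packet triggers re-aggregation of its window, and packets older
// than a lateness cutoff are discarded. Names are illustrative.
public class WallClockWindow {
    private final long windowMs;
    private final long maxLatenessMs;
    // (id|metric|windowIndex) -> running aggregate (SUM here for simplicity)
    private final Map<String, Double> aggregates = new HashMap<>();

    WallClockWindow(long windowMs, long maxLatenessMs) {
        this.windowMs = windowMs;
        this.maxLatenessMs = maxLatenessMs;
    }

    // Returns the updated aggregate for the packet's window, or null if the
    // packet is older than the lateness cutoff and is discarded (timeout).
    Double ingest(String id, String metric, long timestampMs, double value) {
        if (System.currentTimeMillis() - timestampMs > maxLatenessMs) return null;
        long window = timestampMs / windowMs; // window index for this timestamp
        String key = id + "|" + metric + "|" + window;
        return aggregates.merge(key, value, Double::sum);
    }
}
```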

g)   Software stack

The software stack consists of:

Web layer: Play Framework

Message bus: RabbitMQ, with the consistent-hash plugin for message routing

Persistent store: Cassandra

The unit of deployment of Conflux is a node, which can be either a physical box or a VM (virtual machine). If VMs are used, care must be taken to place them on different physical hardware to mitigate the possibility of catastrophic hardware failure. The physical hardware, however, is assumed to be part of a single data centre connected by redundant, high-bandwidth cabling; i.e. nodes cannot span geographical locations.

h)   Handling failure

We consider two broad classes of failure:

i.  Node failures – we rely on consistent hashing to re-distribute streams to survivor nodes upon node failure. As described in section 3.b, traffic gets re-routed to available nodes. Batched acknowledgements and the corresponding retransmits in RMQ ensure that packets are not lost if a node goes down before processing them.

ii.  Packet re-transmits and delays – are handled naturally by wall-clock time alignment. When a packet arrives late, all calculations are triggered for the wall-clock window the packet belongs to, and feed forward ensures that these changes are reflected in all group aggregates that the ID belongs to. Out-of-sequence packets do not matter, since the aggregate values are eventually all correct. The system can recover from failure by replaying persisted packets from the time of failure.

Packets older than a certain threshold are discarded by the system since they can trigger cascading re-calculations up to the present time.

i)   Tradeoffs and Limitations

It is important to note that, since Conflux is designed for low-latency stream operations, a conscious design tradeoff is to assume the absence of network partitions. Conflux is not designed for a multi-site deployment.

4.   Implementation

Monitoring Insight [12], a vCloud Air alpha service offering, is in production with a subset of Conflux features. RMQ [10] is used as the message bus and Cassandra [11] as the persistent NoSQL store. The system was tested monitoring 80+ metrics from each of 3,000 VMs, collecting data every 5 minutes at 20-second granularity. This was simply too much data, so a new ‘subscription’ feature was introduced whereby customers subscribe, paying per VM ID, with different subscriptions resulting in different granularities; the default became 5 minutes instead of 20 seconds. This dealt with the data volume well. The subscribe action uses the same feed forward, routing on the VM ID to which the customer subscribes, so it comes into effect immediately. This is currently in production. A basic JavaScript API for surfacing JMX metrics and log files and for accessing the Cassandra database has been tested; it has been invaluable for programmatically running checks when a problem occurs on a running system. Data was written into Cassandra with a default TTL of 30 days. A subscription option, analogous to the granularity option, made it possible to store metrics for 45 or 60 days; for example, very high-granularity metrics could be stored longer for important monitored elements. Cassandra compaction was run weekly across the entire cluster to reclaim tombstone data.
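The TTL-based retention described above can be illustrated with the DataStax Java driver (3.x assumed); the keyspace, table, and schema below are hypothetical, with ts modeled as an epoch-milliseconds bigint column.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Sketch of TTL-based retention, assuming the DataStax Java driver 3.x and
// a hypothetical conflux.metrics table (id text, metric text, ts bigint,
// value double); schema and names are ours, not the production ones.
public class MetricWriter {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("conflux")) {
            // 30-day default TTL (in seconds); a longer-retention
            // subscription would simply use a larger TTL, e.g. 60 days.
            int ttlSeconds = 30 * 24 * 3600;
            session.execute(
                "INSERT INTO metrics (id, metric, ts, value) VALUES (?, ?, ?, ?)"
                    + " USING TTL " + ttlSeconds,
                "vm-1234", "cpu.usage", System.currentTimeMillis(), 42.0);
        }
    }
}
```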

5.   Evaluation and Performance

These performance numbers are for a single VM that is part of a cluster. Ingestion capacity increases linearly as new nodes are added, due to the consistent-hashing property. The chart in Figure 5 is based on 20-second-granularity data from 2 VMs. Initially, nodes acknowledged each individual packet, which led to poor ingestion performance and increased latency.

Turning off individual acknowledgement and enabling batch acknowledgement dramatically pushed up ingestion rates, as shown in Figure 6.
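A sketch of such a batched-acknowledgement consumer using the RabbitMQ Java client follows; the queue name is hypothetical, and the pre-fetch and batch sizes mirror the figures reported in the next paragraph.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

// Sketch of a batched-acknowledgement consumer with the RabbitMQ Java
// client; the queue name "conflux.ingest" is hypothetical.
public class BatchAckConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        channel.basicQos(512); // pre-fetch: keep up to 512 unacked messages in flight

        final int[] sinceAck = {0};
        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            // ... persist the packet to Cassandra here ...
            if (++sinceAck[0] >= 256) {
                // multiple=true acknowledges this delivery and all earlier ones
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), true);
                sinceAck[0] = 0;
            }
        };
        channel.basicConsume("conflux.ingest", false /* manual ack */, onDeliver, tag -> { });
    }
}
```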

As expected, failure recovery is batched as well. The pre-fetch size is set to 512, acknowledging 256 packets at a time, so there is always data in memory; this sped up consumption from 335 packets per second to 1,277. These figures are from the RMQ monitoring console; the blue lines in Figure 6 indicate messages being consumed and persisted to Cassandra.

Figure 5. Ingestion performance with individual message acknowledgements.

Figure 6. Ingestion performance with batched acknowledgements.

Figure 7. Data collected at 20-second intervals for 2 VMs over an hour, displayed in a web UI. This could be any monitored element or IoT device.

6.   Conclusion

Conflux is a critical enabler for the data-mining services that an increasing number of companies are investing in. It allows users to react in real time to events defined as functions over data streams.

Allowing customers to define actions that monitor and execute on live streaming data is a vital requirement across domains. This could be used to automate Dev-Ops, avert disaster scenarios, monitor IoT, or even execute buy/sell decisions on the stock market.

After historical data has been mined and collated into meaningful models, companies often have no clear way to act on it in an automated manner. Conflux not only provides the means to act fast, but also to update models with the latest learning in real time.

7.   Future work

Handling large groups, and handling data sources with very different packet-emission frequencies, leads to skew, since all keys do not correspond to the same load; load balancing is therefore a subject for future work. Cassandra and RMQ pick different nodes for the same ID even though both use consistent hashing, which leads to an additional network hop; tuning the hash functions across the two systems is future work. Purely dynamic groups, whose members are defined by a function, are also targeted: if the function returns ‘true’, the packet at the child is fed forward to the group ID. Finally, error handling in sliding windows remains to be addressed.

8.   References

[1] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, May 04-06, 1997). STOC '97. ACM Press, New York, NY, 654-663.

[2] Apache Samza. http://samza.incubator.apache.org

[3] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033-1044 (2013)

[4] Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, Shivakumar Venkataraman: Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams. SIGMOD 2013: 577-588

[5] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja: Twitter Heron: Stream Processing at Scale. SIGMOD 2015: 239-250

[6] Kestrel: A Simple, Distributed Message Queue System. http://robey.github.com/kestrel

[7] S4 Distributed Stream Computing Platform. http://incubator.apache.org/s4/

[8] Spark Streaming. https://spark.apache.org/streaming/

[9] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy V. Ryaboy: Storm@Twitter. SIGMOD 2014: 147-156

[10] RabbitMQ: http://www.rabbitmq.com/

[11] Apache Cassandra: http://cassandra.apache.org/