Bridging the Gap between Serving and Analytics in Scalable Web Applications

Panagiotis Garefalakis, Raul Castro Fernandez, Peter Pietzuch
Imperial College London

Abstract. Web applications that include personalised recommendations, targeted advertising, and other analytics functions must maintain complex prediction models, which are trained over large datasets. Such applications typically separate stored data into offline and online data based on its time to compute and freshness requirements. To serve requests robustly and with low latency, applications then cache data from the analytics layer, constructing responses from this data; to train models and offer analytics, they use asynchronous offline computation in the analytics layer, which leads to stale data being served to clients.

Instead, our goal is to offer a unified model to developers when writing web applications that serve data while using big data analytics. Our idea is to express the online and offline logic of a web application as a single stateful distributed dataflow graph. The state of the dataflow computation is then expressed as In-memory Web Objects (IWOs), which are directly accessible as persistent objects by the application. This means that the application can exploit data-parallel processing for compute-intensive requests, e.g. when training complex models, while serving results with low latency from IWOs.

Motivation. Modern web applications must offer low-latency responses, which typically means that they pre-compute computationally expensive analytics tasks such as personalised recommendations using asynchronous back-end systems. These tasks are decoupled from the critical path of serving web requests, and, for serving, developers must load pre-computed results into scalable stores such as distributed key-value stores. The pre-computation is performed by data-parallel frameworks such as Hadoop or Spark.

Challenges. Despite achieving low latency, the above approach has a number of limitations: (i) decoupling data analytics from serving means that results can be stale, which has a negative impact on time-critical data such as advertisements (more than 70% of all Hadoop jobs running at LinkedIn [3] use key-value stores as an egress mechanism); (ii) caching data, while reducing the read load, requires the construction of complex queries involving multiple back-end stores; and (iii) a variety of stores and back-end systems must be managed and scaled out independently, which has proven to be hard, error-prone and inefficient.

Approach. Our goal is to provide a unified model for web applications by exposing a common object-based interface to developers for handling offline and online data. A web application framework, such as Play for Java, can directly manipulate persistent objects, which we extend to become In-memory Web Objects (IWOs). IWOs are computed in a data-parallel fashion, either on demand when a web request is handled, or asynchronously when representing previously computed, cached data. They are implemented as in-memory state in a stateful distributed dataflow model [1]. Since IWOs are maintained in memory but manipulated in a data-parallel fashion, they can satisfy the requirements of offline and online data processing in web applications.

We describe an initial design of a framework that supports stateful dataflow graphs with IWOs based on SEEP, an open-source, data-parallel processing platform. We show that, using a source-to-source compiler [2], it is possible to automatically synthesise the dataflow graphs for existing Java-based web applications, thus benefiting from data-parallelism for serving computationally expensive requests while maintaining analytics over large amounts of data.
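The idea of serving and analytics sharing one piece of in-memory state can be illustrated with a minimal single-process sketch. It assumes an item co-occurrence recommender as the example workload. The class `RecommenderIwo`, its methods, and the convention that ratings are positive (0 means "unrated") are hypothetical and are not part of SEEP, Play, or the authors' implementation. A Java parallel stream stands in for data-parallel execution across dataflow nodes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.IntStream;

// Hypothetical single-process stand-in for an In-memory Web Object (IWO).
// All names are illustrative; this is not the SEEP or Play API.
class RecommenderIwo {
    private final int numItems;
    // userItem state: one rating row per user, kept in memory.
    private final Map<String, double[]> userItem = new HashMap<>();
    // coOcc state: coOcc[i][j] counts users who rated both item i and item j.
    private final long[][] coOcc;

    RecommenderIwo(int numItems) {
        this.numItems = numItems;
        this.coOcc = new long[numItems][numItems];
    }

    // Write path: a new rating updates the live state directly.
    // Assumption: ratings are strictly positive, so 0.0 marks "unrated".
    synchronized void rateItem(String user, int item, double rating) {
        if (rating <= 0.0) throw new IllegalArgumentException("rating must be > 0");
        double[] row = userItem.computeIfAbsent(user, u -> new double[numItems]);
        boolean firstRating = row[item] == 0.0;
        row[item] = rating;
        if (!firstRating) return; // counts already include this item for this user
        for (int other = 0; other < numItems; other++) {
            if (other != item && row[other] != 0.0) {
                coOcc[item][other]++;
                coOcc[other][item]++;
            }
        }
    }

    // Read path: scores = coOcc x userRow, one matrix row per parallel task.
    synchronized double[] view(String user) {
        double[] row = userItem.getOrDefault(user, new double[numItems]);
        return IntStream.range(0, numItems).parallel()
                .mapToDouble(i -> {
                    double score = 0.0;
                    for (int j = 0; j < numItems; j++) score += coOcc[i][j] * row[j];
                    return score;
                })
                .toArray();
    }

    public static void main(String[] args) {
        RecommenderIwo iwo = new RecommenderIwo(3);
        iwo.rateItem("alice", 0, 5.0);
        iwo.rateItem("alice", 1, 3.0);
        iwo.rateItem("bob", 0, 4.0);
        iwo.rateItem("bob", 2, 2.0);
        System.out.println(Arrays.toString(iwo.view("alice"))); // [3.0, 5.0, 5.0]
    }
}
```

Because `rateItem` and `view` operate on the same state, a rating submitted by any user is reflected in the very next `view` call. A distributed realisation would partition this state across nodes, which this sketch does not attempt.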
References

[1] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management. In ACM SIGMOD, 2013.

[2] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Making State Explicit for Imperative Big Data Processing. In USENIX ATC, 2014.

[3] R. Sumbaly, J. Kreps, and S. Shah. The "Big Data" Ecosystem at LinkedIn. In ACM SIGMOD, 2013.

2015/4/10



[Figure: dataflow-based web application with In-memory Web Objects (IWOs)]

Components:
• Front-End (Web): Play Framework controller code
• Java2SDG dataflow compiler: controller code to Stateful Dataflow Graph (SDG)
• Back-End: SDG distributed processing system; relational store (MySQL)
• Transparent state held as In-Memory Web Objects (IWOs): userItem, Cooccurrence

Labels:
• ORM interface: authenticate user through SQL API
• IWO interface: read datasource, write datasource, direct access, low-latency interface, batch processing
• Recommendation algorithm synchronously reads and updates IWOs from the distributed SDG system through a low-latency interface

Controller code:

    login(user, password){
      if(!User.authenticate(user, password))
        return "Invalid credentials"
    }

    view(user){
      //Access dataflow live state
      DataSource ds = DB.getDatasource()
      userRow = ds.get(userItem).getRow(user)
      coOcc.multiply(userRow)
    }

    rateItem(user, item, rating){
      //Write directly to dataflow state
      DataSource ds = DB.getDatasource()
      ds.updateUserItem(user, item, rating)
      ds.updateCoOc(userItem)
      return OK
    }

[Figure: typical web application today]

Components:
• Front-End (Web): Play Framework
• Data Transport Layer
• Back-End: relational store (MySQL); non-relational store (Cassandra) holding userItem and Cooccurrence; queueing system (Kafka) holding ratings; scheduler (Java app); Hadoop cluster running the MapReduce job (batch processing, high latency, high throughput)
• Interfaces: ORM interface, queue interface, key-value interface

Flow labels (steps numbered 1 to 8 in the figure, each marked as a synchronous or an asynchronous task):
• sync authenticate
• get recommendations
• sync add new rating
• async fetch ratings
• async fetch data
• read userItem data
• batch process for recommendation data
• write Cooccurrence data
• update data

Controller code:

    login(user, password){
      if(!User.authenticate(user, password))
        return "Invalid credentials"
    }

    view(user){
      //Constructing recommendation
      userRow = userItem.getRow(user)
      coOcc.multiply(userRow)
    }

    rateItem(user, item, rating){
      //Pushing new rating in the queue
      Queue.publish(user, item, rating)
    }

MapReduce job:

    map(key, Matrix ratings){
      for each rating in ratings:
        //update user ratings
        userItem.setElement(user, item, rating)
        userRow = userItem.getRow(user)
        //emit the updated user rating row
        EmitIntermediate(rating, userRow)
    }

    reduce(key, Iterator userRows){
      for each row in userRows:
        //Generate user recommendations
        userRec.add(Cooccurrence.multiply(row))
      Emit(merge(userRec))
    }
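The staleness that results from decoupling serving from analytics in the conventional architecture can be sketched in a few lines. This is a hypothetical in-process model, not real client code: an `ArrayDeque` stands in for the Kafka queue, `runBatchJob` for the scheduled MapReduce job, and a `HashMap` for the key-value serving store. The class and method names are assumptions made for illustration, and ratings are assumed to be positive so that 0 can mean "unrated".

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of the decoupled design; none of this is a real
// Kafka, Hadoop, or Cassandra API.
class DecoupledPipeline {
    private static final class Rating {
        final String user;
        final int item;
        final double value;
        Rating(String user, int item, double value) {
            this.user = user;
            this.item = item;
            this.value = value;
        }
    }

    private final int numItems;
    private final Queue<Rating> ratingQueue = new ArrayDeque<>();   // queueing system
    private final Map<String, double[]> userItem = new HashMap<>(); // analytics-side state
    private final Map<String, double[]> served = new HashMap<>();   // key-value serving store

    DecoupledPipeline(int numItems) {
        this.numItems = numItems;
    }

    // Serving path: a new rating is only enqueued.
    void rateItem(String user, int item, double rating) {
        ratingQueue.add(new Rating(user, item, rating));
    }

    // Serving path: only pre-computed results are read.
    double[] view(String user) {
        return served.getOrDefault(user, new double[numItems]);
    }

    // Batch path: drain the queue, rebuild co-occurrence counts, publish scores.
    void runBatchJob() {
        for (Rating r = ratingQueue.poll(); r != null; r = ratingQueue.poll()) {
            userItem.computeIfAbsent(r.user, u -> new double[numItems])[r.item] = r.value;
        }
        long[][] coOcc = new long[numItems][numItems];
        for (double[] row : userItem.values()) {
            for (int i = 0; i < numItems; i++) {
                for (int j = 0; j < numItems; j++) {
                    if (i != j && row[i] != 0.0 && row[j] != 0.0) coOcc[i][j]++;
                }
            }
        }
        for (Map.Entry<String, double[]> e : userItem.entrySet()) {
            double[] scores = new double[numItems];
            for (int i = 0; i < numItems; i++) {
                for (int j = 0; j < numItems; j++) scores[i] += coOcc[i][j] * e.getValue()[j];
            }
            served.put(e.getKey(), scores);
        }
    }

    public static void main(String[] args) {
        DecoupledPipeline p = new DecoupledPipeline(3);
        p.rateItem("alice", 0, 5.0);
        p.rateItem("alice", 1, 3.0);
        p.rateItem("bob", 0, 4.0);
        p.rateItem("bob", 2, 2.0);
        System.out.println(Arrays.toString(p.view("alice"))); // [0.0, 0.0, 0.0]
        p.runBatchJob();
        System.out.println(Arrays.toString(p.view("alice"))); // [3.0, 5.0, 5.0]
    }
}
```

Until `runBatchJob` runs again, `view` keeps returning the scores from the previous run, however many ratings have been enqueued in the meantime.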

Bridging the Gap between Serving and Analytics in Scalable Web Applications

Panagiotis Garefalakis, Raul Castro Fernandez, Peter Pietzuch Large-Scale Distributed Systems (LSDS) Group, Department of Computing

Imperial College London

prp@doc.ic.ac.uk

Motivation
• Most modern web and mobile applications today offer highly personalised services, generating large amounts of data
• Stored data is separated into offline and online data, based on its generation cost and freshness requirements
• Data resides on different storage layers and is processed by different systems to hide the underlying complexity
• Serving and analytics layers are decoupled in order to serve requests with minimal processing overhead

Challenges
1. Serving stale data can have a negative impact on time-critical tasks
2. Building web responses using multiple systems involves joining information from complex back-end systems

Typical Web Application Today

Dataflow-Based Web Application using In-memory Web Objects (IWOs)

Acknowledgments
This work was partially supported by a PhD CASE Award funded by EPSRC/BAE Systems and by the High-Performance and Embedded Distributed Systems Centre for Doctoral Training (CDT) HiPEDS programme.


Summary
• Need for web frameworks to handle the entire lifecycle of web requests, including data serving and analytics
• Unified model for web applications by exposing a common object-based interface for handling offline and online data
• Extends popular web application frameworks such as Play for Java to directly manipulate In-memory Web Objects (IWOs) instead of plain persistent objects

Challenges (continued)
3. Increased developer effort to learn and program these systems
4. High complexity of maintaining, monitoring and scaling a variety of systems independently