Bridging the Gap between Serving and Analytics
in Scalable Web Applications
Panagiotis Garefalakis, Imperial College London, [email protected]
Raul Castro Fernandez, Imperial College London, [email protected]
Peter Pietzuch, Imperial College London
Abstract. Web applications that include personalised recommendations, targeted advertising, and other analytics functions must maintain complex prediction models, which are trained over large datasets. Such applications typically separate stored data into offline and online data based on its time to compute and freshness requirements. To serve requests robustly and with low latency, applications then cache data from the analytics layer, constructing responses from this data; to train models and offer analytics, they use asynchronous offline computation in the analytics layer, which leads to stale data being served to clients.
Instead, our goal is to offer a unified model to developers when writing web applications that serve data while using big data analytics. Our idea is to express the online and offline logic of a web application as a single stateful distributed dataflow graph. The state of the dataflow computation is then expressed as In-memory Web Objects (IWOs), which are directly accessible as persistent objects by the application. This means that the application can exploit data-parallel processing for compute-intensive requests, e.g. when training complex models, while serving results with low latency from IWOs.
Motivation. Modern web applications must offer low-latency responses, which typically means that they pre-compute computationally expensive analytics tasks, such as personalised recommendations, using asynchronous back-end systems. These tasks are decoupled from the critical path of serving web requests, and, for serving, developers must load pre-computed results into scalable stores such as distributed key/value stores. The pre-computation is performed by data-parallel frameworks such as Hadoop or Spark.
Challenges. Despite achieving low latency, the above approach has a number of limitations: (i) decoupling data analytics from serving means that results can be stale, which has a negative impact on time-critical data such as advertisements (more than 70% of all Hadoop jobs running at LinkedIn [3] use key-value stores as an egress mechanism); (ii) caching data, while reducing the read load, requires the construction of complex queries involving multiple back-end stores; and (iii) a variety of stores and back-end systems must be managed and scaled out independently, which has proven to be hard, error-prone and inefficient.
Approach. Our goal is to provide a unified model for web applications by exposing a common object-based interface to developers for handling offline and online data. A web application framework, such as Play for Java, can directly manipulate persistent objects, which we extend to become In-memory Web Objects (IWOs). IWOs are computed in a data-parallel fashion, either on demand when a web request is handled, or asynchronously when they represent previously computed, cached data. They are implemented as in-memory state in a stateful distributed dataflow model [1]. Since IWOs are maintained in memory but manipulated in a data-parallel fashion, they can satisfy the requirements of both offline and online data processing in web applications.
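To make the two computation modes concrete, the following is a minimal single-process sketch, not the SEEP-based implementation. The class `WebObject` and its methods `get` and `refreshAsync` are hypothetical names introduced only for illustration; a real IWO would be partitioned and computed in a data-parallel fashion across a cluster.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class IwoSketch {
    /** Hypothetical stand-in for an IWO: in-memory state plus the logic that produces it. */
    static final class WebObject<T> {
        private final AtomicReference<T> state = new AtomicReference<>();
        private final Supplier<T> compute;

        WebObject(Supplier<T> compute) {
            this.compute = compute;
        }

        /** On-demand mode: compute inside the request path if nothing is cached yet. */
        T get() {
            T cached = state.get();
            if (cached != null) {
                return cached;
            }
            state.compareAndSet(null, compute.get());
            return state.get();
        }

        /** Asynchronous mode: recompute off the request path and swap in the result. */
        CompletableFuture<Void> refreshAsync() {
            return CompletableFuture.supplyAsync(compute).thenAccept(state::set);
        }
    }

    public static void main(String[] args) {
        int[] ratings = {4, 5, 3};
        WebObject<Double> average =
                new WebObject<>(() -> Arrays.stream(ratings).average().orElse(0.0));

        System.out.println(average.get()); // computed on demand: 4.0
        ratings[2] = 0;                    // underlying data changes
        average.refreshAsync().join();     // background recomputation
        System.out.println(average.get()); // served from memory: 3.0
    }
}
```

In this sketch, requests always read from memory, and the only design question is when the state is recomputed; that is the property the approach relies on to serve both online and offline data through one interface.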
We describe an initial design of a framework that supports stateful dataflow graphs with IWOs based on SEEP, an open-source, data-parallel processing platform. We show that, using a source-to-source compiler [2], it is possible to automatically synthesise the dataflow graphs for existing Java-based web applications, thus benefiting from data-parallelism for serving computationally expensive requests while maintaining analytics over large amounts of data.
References
[1] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management. In ACM SIGMOD, 2013.
[2] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Making State Explicit for Imperative Big Data Processing. In USENIX ATC, 2014.
[3] R. Sumbaly, J. Kreps, and S. Shah. The "Big Data" Ecosystem at LinkedIn. In ACM SIGMOD, 2013.
1 2015/4/10
[Figure: Dataflow-Based Web Application using In-memory Web Objects (IWOs)]
Front-End (Web): Play Framework controller code
  view(user) {
    // Access dataflow live state
    DataSource ds = DB.getDatasource()
    userRow = ds.get(userItem).getRow(user)
    return coOcc.multiply(userRow)
  }
  rateItem(user, item, rating) {
    // Write directly to dataflow state
    DataSource ds = DB.getDatasource()
    ds.updateUserItem(user, item, rating)
    ds.updateCoOc(UserItem)
    return OK;
  }
  login(user, password) {
    if (!User.authenticate(user, password))
      return "Invalid credentials"
  }
Back-End:
• Relational Store (MySQL), reached through the ORM interface: authenticates the user through the SQL API.
• SDG Distributed Processing System, reached through the IWO interface: holds the userItem and Cooccurrence state as In-Memory Web Objects (IWOs). This transparent state is read and written directly (read datasource, write datasource) through a low-latency interface and is also used for batch processing.
• Java2SDG dataflow compiler: translates the controller code into a Stateful Dataflow Graph.
The recommendation algorithm synchronously reads and updates IWOs from the distributed SDG system through a low-latency interface.
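The controller pseudocode in this figure can be approximated in a single process as follows. This is only a sketch under stated assumptions: the class name `DataflowStateSketch` is hypothetical, users and items are plain integer indices, a rating of 0 means "not rated", and the co-occurrence matrix counts how many users rated both items of a pair. In the system described here this state would instead live in distributed IWOs.

```java
import java.util.Arrays;

class DataflowStateSketch {
    private final int[][] userItem; // userItem[u][i] = rating of item i by user u (0 = unrated)
    private final int[][] coOcc;    // coOcc[i][j] = number of users who rated both i and j

    DataflowStateSketch(int users, int items) {
        userItem = new int[users][items];
        coOcc = new int[items][items];
    }

    /** Writes directly to the in-memory state, so the next view() sees the rating. */
    synchronized void rateItem(int user, int item, int rating) {
        boolean firstRating = userItem[user][item] == 0;
        userItem[user][item] = rating;
        if (!firstRating) {
            return; // co-occurrence counts only change on a user's first rating of an item
        }
        for (int other = 0; other < coOcc.length; other++) {
            if (userItem[user][other] != 0) {
                coOcc[item][other]++;
                if (other != item) {
                    coOcc[other][item]++; // keep the matrix symmetric
                }
            }
        }
    }

    /** Recommendation scores: co-occurrence matrix multiplied by the user's rating row. */
    synchronized int[] view(int user) {
        int[] scores = new int[coOcc.length];
        for (int i = 0; i < coOcc.length; i++) {
            for (int j = 0; j < coOcc.length; j++) {
                scores[i] += coOcc[i][j] * userItem[user][j];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        DataflowStateSketch state = new DataflowStateSketch(2, 3);
        state.rateItem(0, 0, 5);
        state.rateItem(0, 1, 3);
        state.rateItem(1, 0, 4);
        // User 1 has only rated item 0; item 1 is scored because user 0 rated both.
        System.out.println(Arrays.toString(state.view(1))); // [8, 4, 0]
    }
}
```

Because `rateItem` and `view` operate on the same in-memory state, there is no window in which a request observes results that predate an accepted rating.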
[Figure: Typical Web Application Today]
Front-End (Web): Play Framework controller code
  login(user, password) {
    if (!User.authenticate(user, password))
      return "Invalid credentials"
  }
  view(user) {
    // Construct the recommendation
    userRow = userItem.getRow(user)
    return coOcc.multiply(userRow)
  }
  rateItem(user, item, rating) {
    // Push the new rating into the queue
    Queue.publish(user, item, rating)
  }
Data Transport Layer and Back-End:
• Relational Store (MySQL), reached through the ORM interface: synchronous user authentication.
• Non-Relational Store (Cassandra), reached through the key-value interface: holds the userItem and Cooccurrence data that is read to get recommendations.
• Queueing System (Kafka), reached through the queue interface: new ratings are added synchronously by the front-end and fetched asynchronously by the back-end.
• Scheduler (Java App): triggers batch processing (high latency, high throughput) on a Hadoop cluster.
Map Reduce Job (batch process for recommendation data; reads userItem data, writes Cooccurrence data):
  map(key, Matrix ratings) {
    for each rating in ratings:
      // Update user ratings
      userItem.setElement(user, item, rating)
      userRow = userItem.getRow(user)
      // Emit the updated user rating row
      EmitIntermediate(rating, userRow)
  }
  reduce(key, Iterator userRows) {
    for each row in userRows:
      // Generate user recommendations
      userRec.add(Cooccurrence.multiply(row))
    Emit(merge(userRec))
  }
The figure numbers these interactions as steps 1 to 8 and marks each one as a synchronous or an asynchronous task.
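The staleness that this architecture introduces can be illustrated with a small in-process sketch. All names here (`DecoupledPipelineSketch`, `Rating`, `runBatchJob`) are hypothetical; the queue stands in for Kafka, the map for the key-value serving store, and the batch job is reduced to a per-item sum of ratings rather than the co-occurrence computation shown in the figure.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class DecoupledPipelineSketch {
    /** A single rating event as published by the front-end. */
    static final class Rating {
        final String user;
        final String item;
        final int value;

        Rating(String user, String item, int value) {
            this.user = user;
            this.item = item;
            this.value = value;
        }
    }

    private final Queue<Rating> queue = new ArrayDeque<>();              // stand-in for the queueing system
    private final Map<String, Integer> servingStore = new HashMap<>();   // stand-in for the key-value store

    /** Front-end write path: only enqueues, the serving store is not touched. */
    void rateItem(String user, String item, int value) {
        queue.add(new Rating(user, item, value));
    }

    /** Front-end read path: serves whatever the last batch run produced. */
    int view(String item) {
        return servingStore.getOrDefault(item, 0);
    }

    /** Back-end batch job: drains the queue and updates the serving store. */
    void runBatchJob() {
        Rating rating;
        while ((rating = queue.poll()) != null) {
            servingStore.merge(rating.item, rating.value, Integer::sum);
        }
    }

    public static void main(String[] args) {
        DecoupledPipelineSketch app = new DecoupledPipelineSketch();
        app.rateItem("alice", "item-a", 5);
        System.out.println(app.view("item-a")); // 0: the rating is still in the queue
        app.runBatchJob();
        System.out.println(app.view("item-a")); // 5: visible only after the batch run
    }
}
```

Between `rateItem` and the next `runBatchJob`, every `view` call returns data that ignores the new rating, which is the stale-data window described under Challenges.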
Bridging the Gap between Serving and Analytics in Scalable Web Applications
Panagiotis Garefalakis, Raul Castro Fernandez, Peter Pietzuch
Large-Scale Distributed Systems (LSDS) Group, Department of Computing
Imperial College London
[email protected], [email protected], [email protected]
Motivation
• Most modern web and mobile applications today offer highly personalised services, generating large amounts of data
• Stored data is separated into offline and online data, based on its generation cost and freshness requirements
• Data resides on different storage layers and is processed by different systems to hide the underlying complexity
• Serving and analytics layers are decoupled in order to serve requests with minimum processing overhead
Challenges
1. Serving stale data can have a negative impact on time-critical tasks
2. Building web responses using multiple systems involves joining information from complex back-end systems
Typical Web Application Today
Dataflow-Based Web Application using In-memory Web Objects (IWOs)
Acknowledgments
This work was partially supported by a PhD CASE Award funded by EPSRC/BAE Systems and by the High-Performance and Embedded Distributed Systems Centre for Doctoral Training (CDT) HiPEDS programme.
References
R. Sumbaly, J. Kreps, and S. Shah. The "Big Data" Ecosystem at LinkedIn. In SIGMOD, 2013.
R. C. Fernandez, M. Migliavacca, et al. Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management. In SIGMOD, 2013.
R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Making State Explicit for Imperative Big Data Processing. In USENIX ATC, 2014.
Summary
• Need for web frameworks to handle the entire lifecycle of web requests, including data serving and analytics
• Unified model for web applications by exposing common object-based interface for handling offline and online data
• Extends popular web application frameworks such as Play for Java to directly manipulate In-Memory Web Objects (IWOs) instead of plain persistent objects
Challenges (continued)
3. Increased developer effort to learn and program these systems
4. High complexity to maintain, monitor and scale a variety of systems independently