feisu: fast query execution over heterogeneous data...

Feisu: Fast Query Execution over HeterogeneousData Sources on Large-Scale Clusters

An Qin∗, Yuan Yuan†, Dai Tan∗, Pengyu Sun∗, Xiang Zhang∗, Hao Cao∗, Rubao Lee†, Xiaodong Zhang†∗Baidu, Inc, {qinan, tandai02, sunpengyu, zhangxiang02, caohao}@baidu.com

†The Ohio State University, {yuanyu, liru, zhang}@cse.ohio-state.edu

Abstract—Fast data analytics at an increasingly large scalehas become a critical task in any Internet service company.For example, in Baidu, the major search engine company inChina, large volumes of Web and business data in PB-scale aretimely and constantly acquired and analyzed for the purposesof evaluating product revenue, tracking product demandingactivities on market, predicting user behavior, upgrading productrankings, and diagnosing spam cases, and many others. Responsetime for queries of various data analytics not only affects userexperiences, but also has a serious impact on productivity ofbusiness operations.

In this paper, to meet the challenge of fast data analytics, wepresent Feisu (meaning fast in Chinese), a data integration systemover heterogeneous storage systems, which has been widely usedin Baidu’s critical and daily business analytics applications afterour R&D efforts. Feisu is designed and implemented to co-worktogether with several heterogeneous storage systems, and exploitthe query similarity embedded in complex query workloads. Ourexperiments using real world workloads show that Feisu cansignificantly improve query performance in Baidu. Feisu has beenin production use in Baidu for two years to effectively manageover dozens of petabytes of data for various applications.

I. INTRODUCTION

In the big data era, it is a challenging and timely task tosupport various interactive data analytics jobs over increasinglylarge scale data sets that are archived and managed by differenttypes of storage systems. The search engine of Baidu [1], theleading Chinese online Web service provider serving billionsof users, is such a system platform to address these challengesin its daily operations. A large number of Baidu services,such as Web search, map and navigation, cloud service andonline encyclopedia, require fast accesses over huge amountsof data sets. In practice, strategy engineers need to quicklymodel new product ideas, diagnose and optimize their modelsbased on information sources collected from multiple systems;data engineers periodically query dozens of terabytes of userbusiness log data to produce statistical reports; system engi-neers have to figure out malfunctions hidden in the searchengine in petabytes-scale data sets. Processing various typesof interactive data analysis jobs in such a complex environmentin an efficient way has become increasingly demanding.

In the Baidu search engine, data sets are generated from avariety of sources and managed by different business-specificstorage systems (e.g., local files system, HDFS [2] and cold-data distributed filesystem [3]). An interactive data analyticstask may access data sources from all these systems, but cannotaffect the service quality of any business application runningon top of these systems. For this reason, it is unrealistic forBaidu to adopt commonly used systems of general purposes,

HDFSFatmanLocal FS

PaddleFeisu

Crawler Indexer Ranking

Query Processor

Web Search

Encyclo-pedia

Music Search

Scholar Search

…Map Online Services

Business Systems

Infrastructure Systems

…

Fig. 1: Baidu’s software stack

such as Hadoop [4], Apache Hive [5], Apache Spark [6], andApache Impala [7] to process these tasks.

To make interactive data analytics tasks highly efficientin large scale and heterogeneous cluster systems, we havedesigned and implemented Feisu, a columnar data processingsystem that is optimized for geo-distributed storage systems.Feisu handles the geographical distribution via the cross-domain mechanism to share the data schema and access rights.It organizes data sets into partitions using a compression-friendly columnar format. To speedup ad-hoc queries, Feisuintroduces an effective adaptive indexing mechanism, calledSmartIndex, which utilizes query similarity behind our inter-nal user’s behaviors. SmartIndex can effectively reduce dataretrieval time without affecting the service quality of differentstorage systems.

Feisu has been in production use in Baidu for two years,supporting more than one hundred products. Figure 1 illus-trates Feisu’s position in the Baidu’s software stack. Fromrapid product prototyping to revenue reporting, from deeplearning model training to system malfunction debugging,Feisu has significantly improved the efficiency of data explo-ration and supports a large number of products. Now morethan fifteen thousand of Feisu instances have been deployedacross six data centers, serving five thousands of queries onaverage every day. The total size of data managed by Feisuhas reached dozens of petabytes.

In general, we have made the following contributions inthis paper:

• We show the necessity to build an independent dataprocessing system over heterogeneous data sources,such as Feisu.

• We identify query locality and similarity based onanalysis of query characteristics in Baidu, which canbe used to significantly improve query performance.

2017 IEEE 33rd International Conference on Data Engineering

2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1175


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1175


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1175


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1161


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1161


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1161


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1161


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1161


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1161


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1173


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1173


2375-026X/17 $31.00 © 2017 IEEE

DOI 10.1109/ICDE.2017.162

1173

• We present the detailed design and implementation ofFeisu, a parallel columnar data processing system withan adaptive indexing mechanism over heterogeneousstorage systems.

• We evaluate Feisu with real-world workloads andillustrate our experiences in building Feisu, whichwould shed some light on data management systemson large clusters.

The rest of the paper is organized as follows. Section IIexplains Baidu’s data heterogeneity and the motivation ofFeisu. Section III presents the overall design of Feisu. Sec-tion IV describes Feisu’s optimizations based on our analysisof user logs. After describing implementation issues in buildingFeisu in Section V, We evaluate the performance of Feisuin Section VI. Section VIII introduces the related work andSection IX concludes the work.

II. BAIDU’S REQUIREMENTS IN DISTRIBUTED DATA

INTEGRATION

In Baidu, data are generated from a variety of sources (seethe pipeline of web data processing in Figure 2), which canbe categorized into the following three types.

• Log data, which include user activity logs and systemrunning logs, are generated on tens of thousands ofonline service machines.

• Business data, which include web links, web pages,and indices (web page’s digest, index and inverted in-dex), are generated by offline processes cross differentclusters.

• Labeled data, generated by engineers for model train-ing or other tasks, such as ranking enhancement orbad case intervention.

These data are usually stored on different storage systemsin Baidu. Log data are stored in the local file systems ofthe online machines; business data are mainly stored in theglobal file systems (e.g., HDFS); and labeled data can be storedin key-value stores or Baidu’s internal distributed file systemFatman [3]. Many data analytics tasks need to access data fromthese storage systems. We use the following three applicationcases in Baidu to illustrate the characteristics of these tasksand motivate our work.

Case 1: Debugging search engine.

Debugging search engine is a common but challenging taskfor system engineers to find and solve problems (e.g., ineffi-cient designs, data inconsistency issues and etc.) in Baidu’ssystem stack. Usually engineers need to access data that maybe placed in multiple data sources such as local filesystems andstandalone distributed filesystems to find the the root cause ofa problem. The process is expensive and error-prone becauseengineers must learn and understand the complex data flowsinvolving different storage systems. Data inconsistency andschema mismatching can happen across storage systems, whichsignificantly delays the engineers from giving insights.

Case 2: Rapid product prototyping.

Rapid product prototyping is important for testing newideas and launching new products. Before Feisu, a large portion

of product engineers’ time is spent on preparing data setsfrom several application storage systems. They have to learnthe interfaces of these storage systems and collaborate withsystem engineers to figure out the necessary data sets to use forproduct prototyping. One-round of the data preparation processwould cost almost one week. Product prototyping usually takesmultiple rounds of data preparation and evaluation no matterwhether the new product is finally proved to be launched.For example, in our design of voice search product, we needto evaluate the benefited user domain so we have to extractthe user behavior data again and again to demarcate thespecific user set. Moreover, data freshness is very important fornew product research and development. The delay of productprototyping caused by the complexity of different storagesystems can result in business losses.

Case 3: Product Analysis.

Product analysis is critical to figure out the user andbusiness trends for our products. Some analysis tasks need toprocess past one-or-two years of data to analyze the industrytendency or to analyze specific product activities. In this case,historical data stored in cold storage will be loaded togetherwith the latest data for computation. It is also common forengineers to periodically analyze sampled hot data to checkthe indicators in a revenue report. These tasks access differentstorage systems and have different performance requirements.

A straightforward approach to solve the problems is todeploy one global file system (e.g., HDFS) to manage thesedata or loading the data from different storage systems into onestorage system and then run standard data processing enginessuch as Hadoop and Spark for data analytics. However, thisapproach doesn’t work for Baidu.

Deploying one global file system (e.g., HDFS) to manageall data is impossible in Baidu because of the different servicerequirements for different applications. For example, log dataare stored on online service machines, which run legacyretrieval service for Baidu search engine. The retrieval serviceis critical to the search engine and thus other heavy servicessuch as global file system cannot co-run on the same nodes.

It is also impossible to load the data from these differentstorage systems to one global file system for data analyticsdue to the high cost of storage and network data transferring,which is caused by two facts: (1) Internet search engine isusually geo-distributed; and (2) data are generated in a highspeed in Baidu (e.g., 2.3 GB log data per hour per node).Collecting the data into one storage system will consume largeamounts of network resources, which can severely affect thebusiness service traffic in hot regions. This approach is alsoinefficient because analytic tasks cannot access the data in atimely manner to generate product insights, which may blockfast product development [8].

As a result, we need to develop a new data processingsystem that efficiently manage and query all the data to meetBaidu’s needs. There are two major requirements for the newsystem. First, the system needs to consider data heterogeneityand different service level requirements when accessing thedata on different storage systems. Each storage system worksin an independent domain. Data on different systems havedifferent storage layouts, and cannot be shared among systems.The new system must efficiently track the data information

117611761176116211621162116211621162117411741174

Crawler IndexerQuery

ProcessorLink Page Index

Link Storage

Page Storage

Index Storage

Ranking

Strategies Log

Web Users

Web

Ranking Model Training Product Analysis

Revenue ReportRapid Prototyping / Debugging / Case Tracking / System Diagnose

Anti-Spam / Page Quality

Log Data

Local FS Local FS…… Label Storage

Label Storage

Archival Data

FatmanHeterogeneous

Storage

Applications

Fig. 2: Data pipeline in Baidu’s search engine

across different storage systems. Second, the system shouldexecute queries fast enough to meet the business requirements.We observe that users in Baidu access different types of datadifferently, which can help design the data access method ondifferent systems. For example, for log data, many analyticapplications only need partial information hidden in the data.In this case, we can design a light weight process running onthese nodes for information extraction.

In the following sections, we will describe our designof Feisu, a columnar data processing system optimized forheterogeneous storage systems.

III. FEISU DESIGN OVERVIEW

A. Data Model and Query Model

Feisu stores data in columnar format, similar to otherhigh performance data processing engines (e.g., [9], [10]).The reason is that data in Baidu usually contain hundreds ofattributes but only a small subset of them are actually queried.Storing data in columnar format can effectively reduce the I/Ocost. Feisu also supports nested data format such as json, whichwill be flatten into columns when the data are processed.

Currently Feisu supports star schema like queries in thefollowing format.

SELECT expr1 [[AS] expr_alias1] [...][aggr_func(expr3) WITHIN expr4]FROM table1 [, table2, ...][[INNER|[RIGHT|LEFT] OUTER|CROSS] JOIN

table3 [[AS] table_alias3]ON join_cond_1 [AND join_cond_N ...]]

[WHERE cond][GROUP BY (field1 | alias1) [...]][HAVING cond][ORDER BY field1|alias1 [DESC|ASC] [...]][LIMIT n];

B. System Architecture

Figure 3 shows the architecture of Feisu. Feisu’s design ismotivated by the architecture of modern web search engine.As can be seen, Feisu organizes servers in the cluster usingthe tree structure. There are three types of servers in the tree,the master server, the stem server and the leaf server. Themaster server is the root of the tree, which accepts ad-hoc

queries from clients and generates optimized query executionplans using a cost-based approach. It dissects a query plan intosub-plans based on the information of available stem serversand dispatch the sub-plans to them. The stem server is theinternal node in the tree. It accepts a sub-query plan from themaster and further dissects the plan to the leaf servers. The leafserver is the leaf in the tree, which actually executes the queryplan and computes the results. After the results are generated,they are summarized in a bottom-up way and sent back to theclients.

Feisu’s tree-structured organization can efficiently utilizeavailable computing resources. To maximize the performance,Feisu schedules a query based on data location, the cluster’snetwork structure, and the load statistics on the leaf servers.Feisu always schedules a task to the leaf server that containsthe data if the server are available. If the leaf server is notavailable, Feisu will either schedule the task to the availableleaf server that contains the data replica or to an availableserver that has a low network transfer overhead based on thenetwork structure.

To support heterogeneous storage systems, each storagenode in a specific storage system is deployed a light-weightprocess, which monitors the storage for newly generated data(e.g., log data) and converts the data into Feisu in columnarformat when new data arrive. Each storage node acts as aleaf server in Feisu. To guarantee that Feisu doesn’t affectthe business critical applications on top of the storage system,Feisu controls the amount of resources that can be used forquery processing on the leaf server. The resources will alwaysbe granted to business-critical applications first.

C. System Implementation

Master. The Feisu’s master is a key service and is builtwith the following main components:

• Job manager. Job manager maintains the runninginformation of user query jobs. When users submita query, job manager will analyze query executionsemantics and verify accessed right of specific dataset. If the request is legal, job manager will create anexecution plan based on data partition information andcluster utilizations. After that, a new job with dozensof tasks is created. Before the new job is put into acandidate job queue, job manager tries to reuse otherrunning job’s task result if tasks are identical.

117711771177116311631163116311631163117511751175

Local Filesystem

0 11 0

0 11 0

0 11 0

block1 idx1 block2 idx2 blockn idxn

…

…

…

Leaf Server

Leaf Server

Leaf Server

Stem Server

Stem Server

Master

Distributed Filesystem

1. Create execution plan by user ad-hoc queries

S S S …L L L …

2. Generate scheduling plan based on cluster state

4. Drive IO by subplan

E’ 0 11 0

5. Update existing indexes

0 11 0

E’

3. Rewrite subplanequivalently based on SmartIndex

E

E E

0 11 0

E

E E

E E E E

Client

Parallel Execution Engine

Columnar-oriented Storage

Index Execution

Fig. 3: Feisu’s architecture

• Cluster manager. Cluster manager manages runtimeinformation of workers (i.e. leaf server and stemserver). It communicates with the job manager usingperiodic RPC to keep information updated. Feisu doesnot adopt systems like Zookeeper [11] for survivaldetection because the number of workers is too largeand the workers are geographical distributed.

• Job Scheduler. Job Scheduler creates the schedulingplan for candidate jobs after collecting the informationfrom job manager and cluster manager. Then candidatejobs with scheduling plans will be put into an emittingqueue, and tasks will be dispatched to correspondingworkers.

• Entry Guard. It is the entry point of whole system,executing the security checking of access flows anddispatching the incoming traffics. It is also responsiblefor capability protection to avoid malicious attacks.

To support large clusters that contain tens of thousandsof nodes, master components are implemented as services,communicated via an RPC channel and deployed in standalonemachines. With a large number of workers in the cluster,the network connection for worker heartbeat will reach theupper limit of a single machine. Our design of separatedcluster management components can easily solve this issueby horizontal-scaling the cluster manager. Job manager isdesigned in a similar way.

For reliability, components (the primary) are running withbackups, which don’t provide service until the primary onescrash. The backup components get checkpoint and operationslog from the primary in realtime, so that they will reach thesame running state as the primary. Since the backup ones areshadows of the primary, they can provide functionalities suchas monitoring running information to reduce the burdens onthe primary.

Stem and Leaf servers. Feisu’s workers can work as bothstem servers and leaf severs, which is determined when the

service starts. Stem servers aggregate task execution resultsgenerated by leaf servers or by other stem servers. The dataflow information is defined in the execution plan, which is sentto all related workers.

Feisu achieves task fault-tolerance using backup tasks,which are activated by job scheduler when status update ofa specific task times out or service crash is detected. Feisuschedules backup tasks considering data locations, responsetimes, and workloads on candidate workers. To enhance in-teractive response, user can optionally configure the processedratio of total data sets to avoid long-tail influence, or directlylimit the total elapse time of whole query running. If processeddata ratio reaches or response time limitation is elapsed out,the whole job will abandon the unfinished tasks (includingbackup tasks) and return back.

Client. The client-end is a versatile component with plug-gable framework to support command-line tool, website-basedservice, and third-party tools. It has two major functionalities:query syntax checking and access right verification. Querysyntax checking verifies syntax legitimacy and guides usersto write the proper SQL-like query command. Access rightverification checks user identity, accessed resource right andquota before submitting a query to the servers. The client-endalso collects user query histories to personalize data indexingand caching. Differently from the query collection in mastercomponent, collection on the client side is used for SmartIndexto build private index for specific users or user groups.

Common Storage layer. The common storage layerprovides unified data view for other components. To makeheterogeneous storage systems work together smoothly, alldata files are given full paths with prefix flags to activatedifferent storage plugins. For example, the file path in Hadoopfilesystem will be ”/hdfs/path/to/filename”, and in Fatmanfilesystem the path will be ”/ffs/path/to/filename”. If a prefixstring can not be recognized, local filesystem is activated bydefault. Besides unifying naming, Single-Sign-On (SSO) is im-plemented to allow cross-domain access all storage systems inFeisu, by mapping their authentication information to running

117811781178116411641164116411641164117611761176

job credential.

IV. SYSTEM OPTIMIZATION

In this section we present our optimizations in Feisu basedon the study of a two-month query log traces after Feisuwas deployed in Baidu. We find that queries show an accesspattern of similarities in a short time span. Based on thisdiscovery, we implement a caching mechanism in Feisu anddesign SmartIndex, which utilizes the evaluation results ofquery predicates to improve the performance of later queries.

A. Query Similarity and Data Locality

We collected a two-month user query log and analyzedthe query features in the log. We find that in a short timeperiod: (1) a small set of query columns are likely to berepeatedly accessed (i.e. data locality); and (2) a small set ofquery predicates are likely to be reused (i.e. query similarity).

Fig. 4: Number of accessed identical columns with differenttime spans

Data Locality. We split the query log traces based on fixedtime span (e.g., 1-hour, 2-hour) and analyzed the number ofrepeated accessed columns in the time span. The results areshown in Figure 4. As can be seen, there is a small set ofcolumns that are repeatedly accessed in a given time span.The number increases when the time span becomes larger. Thisindicates that Feisu can cache the hot columns to improve thequery performance.

Query Similarity. Similar to the analysis of data locality,we also split the query log traces based on fixed time spans andanalyzed the characteristics of query predicates in the log. Thereason is that when executing a query plan, each leaf serverin Feisu only executes a sub-plan. In this case, the evaluationof query predicates has a significant impact on the data accessperformance. In the analysis the predicates are converted tothe conjunctive form. Figure 5 shows the ratio of queries thathave at least one exact same query predicate with different timespans. As can be seen, in a given time span, a large number ofqueries have at least one same query predicate. This indicatesthat we can improve Feisu’s performance if the evaluation ofhot query predicates can be optimized.

Fig. 5: Ratio of queries that have at least one same querypredicate with different time spans

Data locality and query similarity reflect some representa-tive access patterns of Feisu users. Human users usually ex-plore the data in a trial-and-error approach. For example, whendiagnosing bugs, a user is likely to first issue an aggregationquery without query predicates and then add predicates oneby one based on the query results. In this case, same columnswill be repeatedly accessed and query predicates are repeatedlyevaluated.

Next, we describe how we utilize data locality and querysimilarity to improve Feisu’s query performance.

B. Data Cache

To utilize data locality, we implement a cache layer inFeisu’s storage system using SSDs. The SSD cache is managedusing LRU. Currently not all query’s data will be cached. Onlydata that may be accessed by critical business applicationscan utilize SSDs. We manually set the cache preferences fordifferent data based on practical knowledge.

The reason is that even if hot columns exist in user queries,it is very difficult to keep the hot data in SSD cache becauseof the large amounts of ad-hoc queries. We evaluated differentcache management methods, all of which incur more than80% of cache miss rates. Without a manual interference, SSDresources would not be efficiently used in Feisu.

C. SmartIndex

To utilize query similarity, we design SmartIndex forefficiently reusing the evaluating results of query predicates.Each SmartIndex is a 0-1 vector, which stores the evaluationresults of a query predicate. Feisu stores all SmartIndices inthe memory of leaf servers. In this case, the leaf servers canavoid accessing the data and evaluate the query predicate if acorresponding SmartIndex exists.

1) Index Format: The SmartIndex is stored in the cor-responding leaf server’s memory in a format similar to thebitmap index, as shown in Figure 6. It stores the evaluationresults of a query predicate and meta information of thecorresponding data in Feisu’s storage. Feisu can compress theindex to improve memory efficiency.

117911791179116511651165116511651165117711771177

Block id

Op/colname/colvalue/offset

Compress type magic

Cond value r d misc

Index schema

range bloom

Fig. 6: The structure of SmartIndex

2) Index Management: Feisu creates a SmartIndex eachtime a query predicate is evaluated in a leaf server. Feisumanages the indices based on the size of the cache memory inthe leaf servers and the time the index has been in the cachesince creation. An index will be deleted from the cache if:(1) the cache memory is full (by a LRU based approach); or(2) the index has been in the cache for too long. Current theTime-To-Live (TTL) for each index is set to 72 hours basedon our experiences.

Feisu also provides interfaces for users to set preferencesand retire strategies on indices to increase the possibility thatthey are cached in the memory for better performance. Forexample, indices with preferences can remain in the memorywhen their TTL expire if the cache memory is not full.

3) Query Execution with SmartIndex : With SmartIndex,Feisu’s leaf servers will transform the predicates in querysub-plans into conjunctive forms and check if there exist aSmartIndex for each data block it will process. If a SmartIndexexists, the scan of the data block and the evaluation of thepredicate are avoided, which can significantly improve thethroughput. Otherwise the data are read from the storagesystem and a new SmartIndex will be created.

Q1: SELECT COUNT(*) FROM TWHERE (c2 > 0) AND (c2 <= 5)

Figure 7 shows how an aggregation query with two querypredicates is executed on a leaf server when the indicesfor both predicates exist. In this case, all computations areconducted in memory. No scan operation is actually needed.

V. OTHER IMPLEMENTATION ISSUES

A. Authentication and Authorization

Feisu manages heterogeneous storages, each of which isin an independent storage domain. Each domain has its ownaccess control and resource management strategy. To grantFeisu to access data on each storage system, each system mustsupport Single-Sign-On (SSO) access [12]. Authentication andAuthorization (AA) are offline executed on X509-based certi-fication system [13][14]. To make the authentication processtransparent, we implement the AA functionalities as standardPAM plugins in the storage system to quickly support SSOaccess [15][16][17]. To guarantee that Feisu doesn’t affectthe service quality of the business application on top of eachstorage system, we define a resource consumption agreementbetween Feisu and each storage system. Each storage systemmust synchronize its agreement to Feisu such that Feisudoesn’t over-schedule tasks to the storage system.

B. Hardware Resource Utilization Control

Hardware has become increasingly powerful in recentyears, which makes it possible for each machine to servemore than one application. To increase hardware utilization,we consolidate servers by deploying multiple containers inthe same sever [18][19]. All containers are generally sched-uled by our central cluster manager. It is inevitable thatsome of low-priority containers have to give up resourcesto guarantee the provision of high-priority online servicesor for balancing the workload via instance migration. Thisaffects system throughput and latency, particularly for storagesystems. In this case, Feisu’s design of task scheduling andfault tolerance must consider the fluctuation and avoids thetemporary unavailability via dynamically communicating (e.g.,heartbeat) with the cluster manager. Sometimes the clustermanager may not timely detect server crashes. To reduce theimpact on performance, Feisu communicates with applicationsto schedule more resources.

C. Traffic Flow Management

Traffic control is important for Feisu’s performance be-cause of the large amounts of severs in the cluster. Feisudivides traffic control into the following three types: controland state flow, write date flow and read data flow. The first typeis control and state flow, which includes control informationsuch as cluster-level operation commands and heartbeat. Thecontrol flow has the highest priority because these informationitems are used to control the cluster state in Feisu. Besidesthe communication priority designed in Feisu system, we alsoactivate the type-of-service (TOS) flag in switch device toreserve bandwidth for priority communication. The secondtraffic type is write data flow. Although queries on Feisuare read-only, Feisu still needs to write data (e.g., temporarydata and intermediate results) during query execution. Thesewritten data are transmitted in a bypass channel to a globaldistributed storage. The third traffic type is the read data flowfor collecting back the analyzed data. If the data are too big,it will be dumped to global storage and only the locationinformation is passed. Read data flow has the lowest priority inFeisu because the cost of read bandwidth is cheaper than writeflow and re-try mechanism is more acceptable and flexiblewhen data are stored on persistent storage with replicas.

VI. EVALUATION

A. Experiment Setup

We have measured the experiments on one of onlineclusters in Baidu, which contains 4,000 nodes. Each node inthe cluster is equipped with a 4-core 2.4 GHz Xeon processor,64 GB of memory, four 3-TB SATA disks, and one 500GBSSD disk. All nodes are connected through 1 Gbps full-duplexEthernet. In the experiments, Feisu uses 512 MB of memoryby default to store SmartIndex. The cluster has two HDFSstorage systems managed by Feisu. Each data block in thestorage system has three replicas.

The experiments use three datasets based from real-worldapplications, as shown in Table I. Datasets T1 and T2 are fromuser business log data with the same schema, carrying URL-clicked information and query attributes. Dataset T3 is from asample set of traced webpage URLs downloaded from Baidu’s

118011801180116611661166116611661166117811781178

Scan planb0: C2

Filter planC2 > 0 AND C2<=5

Projection plan

bi = {/fs/bi}

Index0

0,1,1,0,0 0,1C2>0

0,0,0,0,0 0,1C2>5

Index1

0,1,1,0,0 0,1C2>0

0,0,0,0,0 0,1C2>5

0,1,1,0,0 0,0Y4=11,1,1,1,1 1,0C2<=5

0,1,1,0,0 0,1C2>0

bit-NOT

bit-ADD

Q10: Select … from T where C2 > 0 and C2 <= 5

Q11: Select … from T where C2 > 0 and !(C2 > 5)

Scan planb0: C2

Filter planC2 > 0 AND !(C2<=5)

Projection plan

bi = {/fs/bi}

Q12: Select … from T where C2 > 0 and !(C2 > 5)

Filter planC2 > 0 AND !(C2<=5)

Projection plan

bi = {/fs/bi}

Fig. 7: Plan transformation and index calculation during query execution

Fig. 8: Keyword frequency

webpage database. T3’s attributes are a subset of T1’s attributesand T2’s attributes.

TABLE I: Experimental datasets

Tablename

Number ofrecords

UncompressedSize

Numberof fields

Storage

T1 30 billion 62 TB 200 AT2 130 billion 200 TB 200 BT3 10 billion 7 TB 57 A

We use scan query to evaluate Feisu’s performance. Thereason is that scan queries (including aggregation) are mostfrequent in Feisu, based on our analysis of a three-month userquery log, as shown in Figure 8. They occupy more than 99%of all queries in Feisu and thus it is important to show howthey perform in Feisu.

B. Performance Results

1) Scan Performance on One Storage System: The work-load we use to evaluate the scan performance is in thefollowing format.

SELECT a FROM T1WHERE b OP1 value1 [[AND | OR] c OP2 value2](OP is comparison operator, or CONTAINS)

We randomly generate the query parameters to run thequery on Feisu. For a comparison, we also implemented B-tree index in Feisu. The results are shown in Figure 9.

Figure 9(a) shows the query performance with SmartIndexand without SmartIndex. This demonstrates SmartIndex’s effi-ciency because query performance improves as more queriesare processed by Feisu. When the number of queries processedgoes above 4,000, the performance is improved by more than3x compared to the case when SmartIndex is disabled. Theperformance improvement comes from SmartIndex’s reductionof I/O when a query predicate has SmartIndex.

SmartIndex performs differently compared to B-tree index,as shown in Figure 9(b). The query performance when usingB-tree index remains almost constant as more queries are pro-cessed by Feisu, but it is not as effective as SmartIndex becauseSmartIndex not only reduces I/O but also the computationexecution time for predicate evaluation.

2) Scan Performance on Multiple Storage Systems: Thescan queries used in these experiments are in the same formatas previous experiments. The only difference is that each scanquery in the current experiments will scan both T2 and T3,which are stored on different storage systems. Recall that T2’sattributes are a subset of T3’s attributes.

SELECT a FROM TWHERE b OP1 value1 [[AND | OR] c OP2 value2](OP is comparison operator, or CONTAINS)

We randomly generate query parameters to run the queryon Feisu. We measure the averaged scan throughput of eachserver in the cluster and the results are shown in Figure 10.

As can be seen, after SmartIndex is enabled, the averagedthroughput on a single server can be improved by up to 1.5x.This is constant with previous experiments because SmartIndexcan effectively reduce I/O and computation.

3) The Impact of Memory Size on the Performance ofSmartIndex: Feisu’s SmartIndex is stored in the memory ofeach leaf server. The memory size on each server used by Feisusignificantly affects the overall throughput. To illustrate theimpact, we use the workload for evaluating scan performance

118111811181116711671167116711671167117911791179

(a) Performance ith and without SmartIndex (b) Comparison of SmartIndex and B-tree

Fig. 9: Scan performance on one storage system

Fig. 10: Averaged scan throughput of a single server ondifferent storage systems

on multiple servers and measure its performance when varyingthe size of memory on each server used by Feisu. The resultsare shown in Figure 11.

As expected, Feisu’s performance increases with the in-crease of available memory on each server. This is what wehave expected since the more indices be loaded into cache, thebetter the performance will be. Another observation we canobtain is that the performance of Feisu with 512 MB memoryis comparable to that with 2GB memory. This shows theeffectiveness of Feisu’s index management, and Feisu doesn’tconsume too much memory on each server.

C. Scalability

We use the workloads in the previous experiment andmeasure its performance using different number of nodes inthe cluster to evaluate Feisu’s scalability. The results are shownin Figure 12. As can be seen, Feisu’s performance increaseslinearly with the number of nodes. This is contributed byFeisu’s scale-out design and indicates Feisu’s strong scalability.

VII. ONLINE SERVICE IN PRODUCTION SYSTEM

Feisu has been used for large number of data analyticsapplications for more than two years. In the earlier period, itsdeployment was only on several single clusters with aroundfive hundreds of machines and several hundreds TB of datasets. Soon, data sharing among these clusters requires Feisuto be able to cross-storage data processing. Therefore, the

Fig. 12: Response time with different number of nodes

common storage layer is developed to deal with the issue. Asdata sets become more and more huge, the response latencyof Feisu engine would be hard to meet user requirementunder the limited resources. Thus, we have encapsulated Feisuworker in container and elastically launched them in the onlineservice machine to share their idle CPU and memory resources.When the number of worker reaches more than five thousands,master’s memory becomes rare resources. Hence, we separatedjob manager from master as standalone service to guarantee itsresource provision. One year ago, Feisu was used to analyzethe full log data of whole search service. Data growth requiresmore resources put into Feisu to meet the response latency, andthe increasing number of workers continues put new challengesto Feisu’s architecture. Especially when the worker numberreaches eight thousands, the network overhead of internalcommunication (e.g. heartbeat, task dispatch) began affectingexternal user experience (job submission, monitoring informa-tion access, etc.). Feisu’s master was too busy in handlingthe internal requests and can not respond to user’s request intime. So we have to further separate another components ofscheduler and cluster management from master and make eachof them more scalable. Another challenge in full-log analyticsis that data can not be copied to central global file systembecause we don’t have such big global filesystem to holddozens of petabytes of data. Designing proper protocols tomanage the data location in each disks is a non-trivial task.Currently Feisu’s protocol only carries the location information(e.g. , file name, and offset) during computation to narrowdown the data workload. Feisu has already released eight majorversions during the past two years, and has witnessed the above

118211821182116811681168116811681168118011801180

(a) Miss ratio with different memory size (b) Throughput with different memory size

Fig. 11: The impact of memory size on Feisu’s performance

evolution.

Currently, Feisu system has been deployed with about tensof thousands of workers, managing dozens of petabytes of datacross mostly product data. The total number of product usingFeisu as data analytics platform is above one hundred. Thenumber of users has doubled in the past half year.

• Above 150 users access the system in recent halfyear averagely. Most of them are doing the rapidprototyping and product analytics. The total number ofqueries in one day can reach six thousands. Comparedto Hadoop MapReduce framework, no training isneeded for users to exploit the data treasure.

• More than 93% queries focus on those data sets areless than 200 TB. And, their response times are alwaysbelow 20 seconds. Most of these queries simply focuson statistic applications by filtering specific columnardata. It is thus useful to find the similarity in specificquery sequence.

• Product iteration productivity has been improved dra-matically. The product iteration period has been de-creased from months to days; product analysis time tofigure out problems has reduced from days to hours.

VIII. RELATED WORK

There are a number of recent big data systems devel-oped for large-scale data analysis running on large clusters.MapReduce-based systems [20], such as Apache Pig [21],Apache Hive [22], had been widely used in the Hadoop ecosys-tem. However, due to high overheads of starting and executingMapReduce jobs, these MapReduce-based systems are moresuitable for batch processing instead of satisfying the require-ments of interactive data analysis [10]. Shark [23] and SparkSQL [24] are recent SQL engines running on Apache Spark[25], which is a general cluster computing engine optimizedfor in-memory computing and iterative algorithms. There arealso other big data analytic systems which do not need aMapReduce-like general engine [26], such as Dremel [10],Impala [27], HAWQ [28],Presto [29], and VectorH [30]. Thesesystems executes SQL queries directly on cluster file systems(e.g., Hadoop File System) in order to support interactive dataanalysis. The major difference between our solution Feisu andthese existing systems is that Feisu is mainly designed for

the purpose of data integration in order to provide a unifieddata view for various developers and engineers in the team ofBaidu search engine. To the best of our knowledge, none ofthe above mentioned systems can be directly used to addressour challenges.

Traditional database systems have been extended to pro-vide data integration functionalities, such as IBM DB2 [31],Microsoft SQL Server [32], and PostgreSQL [33]. A recentdata integration system is the Data Tamer System [34], whichfocuses more on the schema mapping [35] issue across hetero-geneous data sets. It is unclear whether these data integrationsystems can run on a large cluster with thousands nodes, suchas ours in Baidu. More importantly, unlike our solution Feisu,these systems do not provide execution optimizations based onquery similarity that is an important feature and optimizationopportunity in our scenario. In addition, the request windowtechnique for data integration [36] exploits the request locality[37] to combine a group of similar remote queries. Differentfrom such a batch-processing technique, Feisu exploits theidea of query result reusing to accelerate executions of similarqueries.

IX. CONCLUSION

Interactive data analysis over heterogeneous data sourceshas become important to improve the quality of the Baidusearch engine. Due to the cost of storage and networking,distributed data sets cannot be collected into a single system ina conventional warehousing way. A data integration solutionnot only requires the ability of handling heterogeneous chal-lenges in the storage layer, but also needs minimized intervenesapplied to the data sources on which other production work-loads are co-running. Addressing these challenges, we havedesigned and implemented Feisu, a SQL query engine withcolumnar storage across heterogeneous storage systems. A keyoptimization in Feisu is to exploit the opportunity of querysimilarity in user queries and accordingly utilize a SmartIndextechnique to accelerate query executions.

Feisu has been widely used in Baidu’s critical businessrunning on more than tens of thousands of cluster nodes.We believe this work also benefits to system researchers andpractitioners on data management in large clusters.

118311831183116911691169116911691169118111811181

X. ACKNOWLEDGMENTS

We thank the following people: Haifeng Wu and JianfengZhan gave us valuable suggestions. Ran Zheng and Ketie Cendirected our integration test in Baidu’s biggest cluster man-agement system. Zheng Li took change in the online systemoperation, and cooperated with Ying Lian and Luming Cai onthe testing framework. The team at The Ohio State Universitywas supported in part by the National Science Foundationunder grants OCI-114752, CNS-1162165, and CCF-1513944.

REFERENCES

[1] “Baidu, inc.” http://www.baidu.com.

[2] “Hadoop distributed filesystem,” http://hadoop.apache.org/docs/stable/hdfs design.html.

[3] A. Qin, D. Hu, J. Liu, W. Yang, and D. Tan, “Fatman: Cost-saving andreliable archival storage based on volunteer resources,” Proceedings ofthe VLDB Endowment, vol. 7, no. 13, pp. 1748–1753, 2014.

[4] “Apache hadoop,” http://hadoop.apache.org.

[5] “Apache hive,” https://hive.apache.org.

[6] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J.Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing,”in Proceedings of the 9th USENIX Conference on NetworkedSystems Design and Implementation, ser. NSDI’12. Berkeley, CA,USA: USENIX Association, 2012, pp. 2–2. [Online]. Available:http://dl.acm.org/citation.cfm?id=2228298.2228301

[7] “Apache impala,” http://impala.apache.org.

[8] P. Pedreira, C. Croswhite, and L. Bona, “Cubrick: Indexing millions ofrecords per second for interactive analytics,” Proceedings of the VLDBEndowment, vol. 9, no. 13, 2016.

[9] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack,M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil,A. Rasin, N. Tran, and S. Zdonik, “C-store: A column-oriented dbms,”in Proceedings of the 31st International Conference on Very LargeData Bases, ser. VLDB ’05. VLDB Endowment, 2005, pp. 553–564.[Online]. Available: http://dl.acm.org/citation.cfm?id=1083592.1083658

[10] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton,and T. Vassilakis, “Dremel: interactive analysis of web-scale datasets,”Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 330–339,2010.

[11] “Apache zookeeper,” http://zookeeper.apache.org.

[12] A. Pashalidis and C. Mitchell, “A taxonomy of single sign-on systems,”in Proc. ACISP’03, 2003, pp. 249–257.

[13] F. W. Housley R., Polk W. and D. Solo, “Internet X.509 Public KeyInfrastructure Certificate and Certificate Revocation List (CRL) Profile[RFC 3280],” 2002.

[14] V. Welch1, I. Foster, C. Kesselman, O. Mulmo, L. Pearlman, S. Tuecke,J. Gawor, S. Meder, and F. Siebenlist, “X.509 Proxy Certificates forDynamic Delegation,” in 3rd Annual PKI R&D Workshop, 2004.

[15] A. Qin, H. Yu, C. Shu, and B. Xu, “Xos-ssh: A lightweight user-centrictool to support remote execution in virtual organizations,” in Proc. FirstUSENIX Workshop on Large-Scale Computing (Lasco’07), 2008.

[16] A. Qin, H. Yu, C. Shu, X. Yu, Y. Jegou, and C. Morin, “Operatingsystem-level virtual organization support in xtreemos,” in Proceedingsof the 9th International Conference on Parallel and Distributed Com-puting, Applications and Technologies (PDCAT’08), 2008, pp. 234–243.

[17] V. Samar, “Unified login with pluggable authentication modules(PAM),” Proceedings of the 3rd ACM conference on Computer andcommunications security, pp. 1–10, 1996.

[18] S. Soltesz, H. Potzl, M. E. Fiuczynski, A. C. Bavier, and L. L.Peterson, “Container-based operating system virtualization: a scalable,high-performance alternative to hypervisors,” in EuroSys’07, 2007, pp.275–287.

[19] “Kenel cgroups,” http://www.mjmwired.net/kernel/Documentation/cgroups/.

[20] A. Floratou, U. F. Minhas, and F. Ozcan, “Sql-on-hadoop: Full circleback to shared-nothing database architectures,” Proceedings of theVLDB Endowment, vol. 7, no. 12, pp. 1295–1306, 2014.

[21] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayana-murthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, “Buildinga high-level dataflow system on top of map-reduce: the pig experience,”Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425,2009.

[22] Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. N. Hanson,O. OMalley, J. Pandey, Y. Yuan, R. Lee, and X. Zhang, “Major technicaladvancements in apache hive,” in Proceedings of the 2014 ACMSIGMOD international conference on Management of data. ACM,2014, pp. 1235–1246.

[23] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, andI. Stoica, “Shark: Sql and rich analytics at scale,” in Proceedings ofthe 2013 ACM SIGMOD International Conference on Management ofData, ser. SIGMOD ’13. New York, NY, USA: ACM, 2013, pp. 13–24.[Online]. Available: http://doi.acm.org/10.1145/2463676.2465288

[24] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley,X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi et al., “Spark sql:Relational data processing in spark,” in Proceedings of the 2015 ACMSIGMOD International Conference on Management of Data. ACM,2015, pp. 1383–1394.

[25] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley,M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing,” in Pro-ceedings of the 9th USENIX conference on Networked Systems Designand Implementation. USENIX Association, 2012, pp. 2–2.

[26] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing onlarge clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[27] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi,J. Erickson, M. Grund, D. Hecht, M. Jacobs et al., “Impala: A modern,open-source sql engine for hadoop.” in CIDR, 2015.

[28] L. Chang, Z. Wang, T. Ma, L. Jian, L. Ma, A. Goldshuv, L. Lonergan,J. Cohen, C. Welton, G. Sherry et al., “Hawq: a massively parallelprocessing sql engine in hadoop,” in Proceedings of the 2014 ACMSIGMOD international conference on Management of data. ACM,2014, pp. 1223–1234.

[29] “Presto query engine.” https://prestodb.io/.

[30] A. Costea, A. Ionescu, B. Raducanu, M. Switakowski, C. Barca,J. Sompolski, A. Luszczak, M. Szafranski, G. de Nijs, and P. Boncz,“Vectorh: Taking sql-on-hadoop to the next level,” in Proceedingsof the 2016 International Conference on Management of Data, ser.SIGMOD ’16. New York, NY, USA: ACM, 2016, pp. 1105–1117.[Online]. Available: http://doi.acm.org/10.1145/2882903.2903742

[31] V. Josifovski, P. Schwarz, L. Haas, and E. Lin, “Garlic: A new flavorof federated query processing for db2,” in Proceedings of the 2002ACM SIGMOD International Conference on Management of Data,ser. SIGMOD ’02. New York, NY, USA: ACM, 2002, pp. 524–532.[Online]. Available: http://doi.acm.org/10.1145/564691.564751

[32] J. A. Blakeley, C. Cunningham, N. Ellis, B. Rathakrishnan, andM.-C. Wu, “Distributed/heterogeneous query processing in microsoftsql server,” in 21st International Conference on Data Engineering(ICDE’05). IEEE, 2005, pp. 1001–1012.

[33] R. Lee and M. Zhou, “Extending postgresql to support dis-tributed/heterogeneous query processing,” in International Conferenceon Database Systems for Advanced Applications. Springer, 2007, pp.1086–1097.

[34] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack,S. B. Zdonik, A. Pagan, and S. Xu, “Data curation at scale: The datatamer system.” in CIDR, 2013.

[35] R. J. Miller, L. M. Haas, and M. A. Hernandez, “Schema mapping asquery discovery.” in VLDB, vol. 2000, 2000, pp. 77–88.

[36] R. Lee, M. Zhou, and H. Liao, “Request window: an approach toimprove throughput of rdbms-based data integration system by utilizingdata sharing across concurrent distributed queries,” in Proceedings ofthe 33rd international conference on Very large data bases. VLDBEndowment, 2007, pp. 1219–1230.

[37] R. Lee and Z. Xu, “Exploiting stream request locality to improvequery throughput of a data integration system,” IEEE Transactions onComputers, vol. 58, no. 10, pp. 1356–1368, 2009.

118411841184117011701170117011701170118211821182

feisu: fast query execution over heterogeneous data...

Documents