

Understanding Processors Design Decisions for Data Analytics in Homogeneous Data Centers

Zhen Jia, Wanling Gao, Yingjie Shi, Sally A. McKee, Senior Member, IEEE, Zhenyan Ji, Jianfeng Zhan, Lei Wang, Lixin Zhang, Senior Member, IEEE

Abstract—Our global economy increasingly depends on our ability to gather, analyze, link, and compare very large data sets. Keeping up with such big data poses challenges in terms of both computational performance and energy efficiency, and motivates different approaches to exploring data center systems and architectures. To better understand processor design decisions in the context of data analytics in data centers, we conduct comprehensive evaluations using representative data analytics workloads on representative conventional multi-core and many-core processors. After a comprehensive analysis of performance, power, energy efficiency, and performance-cost efficiency, we make the following observation: in contrast to the conventional wisdom that wimpy many-core processors improve energy efficiency, brawny multi-core processors with SMT (simultaneous multithreading) and dynamic overclocking technologies outperform their wimpy counterparts not only in execution time but also in energy efficiency for most of the data analytics workloads in our experiments.

Index Terms—Data Analytics, Performance, Processor Evaluation, Energy Efficiency


1 INTRODUCTION

The amount of data in our world is exploding. Digital multimedia continues to proliferate; networked sensors track, create, and communicate data from our physical world; and businesses capture copious data about their customers, suppliers, and operations. The enormity of these data and the speed with which they are produced not only challenge our ability to keep up in terms of processing performance, but also put data center systems under great pressure to increase the performance achieved per dollar, which determines system design decisions at data center scale [1]. At the scale of modern data centers (Facebook, for example, ran 60,000 servers in 2010 [2]), the infrastructure is often the largest capital expense [11]. Additionally, energy and related costs can account for more than a third of the TCO (Total Cost of Ownership), and these costs are projected to increase faster than compute-related costs [3], [4], [5]. Even worse, energy costs limit the scalability of data center systems, making energy another key design constraint apart from the initial setup cost.

These constraints lead to discussions of different design decisions for modern data center systems. In this paper, we focus on processor design decisions in the context of data analytics workloads in data centers. On one hand, Internet service providers (e.g., Facebook [6], [7]) exhibit great interest in low-power wimpy cores 1.

• Zhen Jia is with the Computer Science Department, Princeton University, Princeton, NJ. E-mail: [email protected]. He performed this work when he was at the Institute of Computing Technology, Chinese Academy of Sciences.

• Wanling Gao, Jianfeng Zhan, Lei Wang and Lixin Zhang are with the Institute of Computing Technology, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, China. E-mail: {gaowanling, zhanjianfeng, wanglei 2011, zhanglixin}@ict.ac.cn.

• Yingjie Shi is with Beijing Institute of Fashion Technology. E-mail: [email protected].

• Sally A. McKee is with Rambus. E-mail: [email protected].

• Zhenyan Ji is with Beijing Jiaotong University.

The corresponding author is Jianfeng Zhan.

And many researchers advocate wimpy many-core systems as a promising architecture for big data workloads [9]. On the other hand, many researchers question the efficiency of using small, wimpy cores for service workloads or key-value data stores in data centers [5], [8]. However, for data analytics applications, a typical class of workloads in data centers, no detailed evaluation has been performed to investigate different processor design decisions. Such an evaluation would guide architects in making the right design decisions, which may yield substantial cost benefits at the scale of data centers [10].

Our work seeks to address this question. In this paper we take a pragmatic, experimental approach: we run ten typical data analytics workloads in data centers, including both Hadoop based and Spark based workloads, on two state-of-the-practice platforms equipped with different kinds of processors that reflect state-of-the-art architectures in different directions. Our analysis shows that the brawny multi-core processor is the winner for most data analytics workloads, compared to using only the wimpy many-core processor, from the perspectives of execution time, energy efficiency and performance-cost efficiency. Further, we investigate two important and frequently used technologies in modern processors (simultaneous multithreading and dynamic overclocking technology 2) to find whether they benefit data analytics workloads. Combining the observations in this paper, we have the following key findings:

• The brawny multi-core processors need less time and fewer processor cycles to complete the same amount of work than wimpy many-core processors. The brawny multi-core pipeline is also more efficient than the wimpy many-core pipeline for data analytics workloads in data centers.

1. In this paper we use the definitions from Google [8] to describe the brawny core processor and the wimpy core processor: a brawny core processor is one whose single-core performance is fairly high, whereas a wimpy core processor is one whose single-core performance is low.

2. A technology that enables a processor core to run above its base operating frequency dynamically.



• The wimpy many-core processor is more energy efficient than the brawny multi-core processor only for extremely I/O-intensive workloads, which have many I/O operations in each stage of the whole job and simple computing logic. The extremely I/O-intensive workload achieves energy savings with only slight performance degradation (about 8%) on the wimpy many-core processor. However, when considering performance per dollar, the extremely I/O-intensive workload running on the wimpy many-core processor does not achieve much gain (about 3% in our fixed-cost TCO model 3).

• For the other data analytics workloads (nine out of ten in our experiments), the higher performance (2.2× on average) achieved on the brawny multi-core processor offsets its higher energy consumption and even its higher cost.

• In the comparison of performance-cost efficiency between the brawny multi-core and the wimpy many-core processor, fluctuation in energy cost has little impact on the comparison results, while fluctuation in server cost is a sensitive factor that can change them. However, only if the brawny multi-core server were more than 8 times as expensive as the wimpy many-core server would the wimpy many-core processor outweigh the brawny multi-core processor for most data analytics workloads in terms of performance per dollar.

• The dynamic overclocking technology on the brawny multi-core processors can improve performance but incurs extra energy overheads (8% on average).

• The SMT technique on the brawny multi-core processor improves performance for all the data analytics workloads we investigate in this paper (18% performance improvement on average). For some workloads, it even saves energy (up to 16%) due to the shortened execution time.

• SMT and dynamic overclocking have mutually enhancing effects. When both are enabled, most of the data analytics workloads achieve their best performance and best performance-cost efficiency.

With those findings, we conclude that brawny multi-core processors with SMT and dynamic overclocking technologies suit most data analytics workloads in data centers, while a cluster equipped only with wimpy many-core processors does not suit data analytics workloads in most situations. Heterogeneous data centers [10] may achieve better performance for data analytics workloads by assigning different kinds of workloads to diverse architectures; this also points toward our future work, but it is out of the scope of this paper.

We believe that our work will be useful both to data center system architects, as it demonstrates the benefits and costs of different design decisions, and to software designers, since it provides insights into the benefits and overheads of those technologies for data analytics workloads.

3. The TCO model is based on the work by Lim et al. [11] and is explained in Section 3.4.4.

Since the raw data may be useful to other researchers pursuing their own analyses, this paper is accompanied by a public release of all data, available from [12].

The rest of this paper is organized as follows. In Section 2, we provide an overview of state-of-the-art processor technologies and data analytics frameworks in data centers. We describe our benchmarking methodology in Section 3. We present our results and corresponding analysis in Section 4. We summarize related work in Section 5 and conclude in Section 6.

2 MODERN PROCESSOR TECHNOLOGIES AND DATA ANALYTICS FRAMEWORKS

The architecture of modern data-center-scale processors has progressed mainly along two approaches. The first closely follows technology trends and continues to spend resources on improving single-core performance with each processor generation, at the expense of area and power efficiency. The other approach puts its effort into delivering more energy-efficient performance, focusing on a processor's overall performance and parallelism instead of single-core performance.

At the software layer, data analytics frameworks, which allow programmers to write applications without considering the messy details of data partitioning, task distribution, load balancing, failure handling and other data-center-wide system details [13], have been proposed to facilitate the development of analytics applications in data centers.

2.1 Dominant Processor Technologies

2.1.1 Brawny Core

In order to achieve high performance, modern high-performance processors comprise brawny cores equipped with deep pipelines. The deep pipeline increases the potential parallelism among instructions, i.e., instruction-level parallelism (ILP). To further increase the potential amount of ILP, brawny cores also adopt dynamic multiple issue, i.e., superscalar execution. Multiple-issue pipelines can fetch and issue more than one instruction in parallel. Nearly all superscalars extend the basic framework of dynamic issue decisions to include dynamic pipeline scheduling (i.e., out-of-order execution) to achieve better performance. Dynamic pipeline scheduling can choose independent instructions to execute in an order different from the program's in a given clock cycle, so as to avoid hazards and stalls.

To further decrease the number of dependent instructions in the pipeline, simultaneous multithreading (SMT) has been part of modern processor designs for over a decade [14]. Many major chip manufacturers, e.g., IBM, ARM and Intel, have their own implementations. With simultaneous multithreading, instructions from more than one thread can be executed in any given pipeline stage at a time.


SMT improves the parallelization of computation as well as thread-level parallelism (TLP). Multiple hardware threads in each physical processor core share most execution resources, which improves the efficiency of instruction scheduling, increases ILP, and keeps the execution resources occupied during pipeline stalls.

In recent years, dynamic overclocking technology has been implemented by major chip manufacturers, e.g., Intel's Turbo Boost technology and AMD's Turbo Core technology. This technology enables the processor to run above its base operating frequency via dynamic control of the CPU's clock rate. It is activated when the operating system requests the highest performance state of the processor. When a workload calls for better performance and the processor is below its limits, the processor's clock increases the operating frequency in regular increments as required to meet demand [15], [16].

All the technologies mentioned above aim at achieving high performance. A brawny core architecture comprising all of them can deliver very good performance, but it is power hungry as well. Each of these technologies requires a great deal of circuitry to implement, which occupies die area and power budget. Because of power and die-area limitations, only a few brawny cores equipped with aggressive out-of-order pipelines can be assembled into one socket.

2.1.2 Wimpy Core

In order to overcome the power and die-area limitations, especially the power limitation, which has joined performance as a first-class design constraint in the past decade [17], another solution comes into view: the wimpy core architecture. Different from a brawny core, the design of a wimpy core processor focuses on delivering energy-efficient performance. Wimpy core processors usually adopt simple pipelines with lower performance but less circuitry in the limited die area so as to achieve low power. For instance, the Intel Atom employs an in-order pipeline, which reduces hardware complexity: the in-order pipeline saves the circuitry that detects interdependencies, schedules instructions and maintains in-order instruction retirement. Some chip manufacturers even offer customizable designs, e.g., ARM [18] and Tensilica [19], for a better performance to power consumption ratio.

Many wimpy cores are assembled into one socket to improve the overall performance and parallelism of a single processor. One representative is the Tilera Tile-Gx processor, which comprises up to 100 cores in one socket [20]. The Tilera Tile-Gx processors use static multiple issue (i.e., VLIW) instead of dynamic issue decisions in the pipeline to reduce design complexity and energy consumption. Different from a superscalar, the VLIW (very long instruction word) approach Tilera adopted depends on the programs themselves to provide all the decisions regarding which instructions are to be executed simultaneously and how conflicts are to be resolved, which reduces hardware complexity and power consumption.

Here we choose the state-of-the-practice Intel Xeon processor as the example of brawny core processors and the Tilera Tile-Gx processor as the example of wimpy core processors.

TABLE 1
Design decisions for state-of-the-practice architectures.

Design decision       | Tile-Gx  | Xeon
----------------------|----------|----------------
Pipeline Ordering     | In-Order | Out-of-Order
Multiple Issue        | Static   | Dynamic
Dynamic Overclocking  | N/A      | Turbo Boost
SMT                   | N/A      | Hyper-Threading

Table 1 illustrates the design decisions that differ between them.

2.2 Dominant Data Analytics Frameworks

There are two dominant lines of data analytics frameworks in data centers. The MapReduce model proposed by Google [21] is influential. The MapReduce runtime system provides two user-defined operators: a map function generates a set of intermediate key/value pairs, and a reduce function then merges or aggregates all the intermediate values. Hadoop [22], an open source implementation of the MapReduce model, has attracted a large number of users and companies in a short period of time [23] and has become one of the best-known state-of-the-practice data analytics frameworks.
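To make the two operators concrete, here is a minimal single-process sketch in Python (illustrative only: it mimics the programming model, not Hadoop's actual API, and the function names are our own):

from collections import defaultdict

# map: emit intermediate key/value pairs (a word-count example).
def map_fn(line):
    for word in line.split():
        yield word, 1

# reduce: merge all intermediate values that share a key.
def reduce_fn(key, values):
    return key, sum(values)

# The runtime's job: apply map, group by key, then apply reduce.
lines = ["big data analytics", "big data centers"]
groups = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        groups[key].append(value)
print([reduce_fn(k, v) for k, v in sorted(groups.items())])
# [('analytics', 1), ('big', 2), ('centers', 1), ('data', 2)]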

Spark [24] is an emerging in-memory computing engine. The basic idea of Spark is to keep the working set of data in memory. Its proposed abstraction, Resilient Distributed Datasets (RDDs), provides a restricted form of shared memory based on coarse-grained transformations rather than fine-grained updates to shared state, so as to achieve fault tolerance efficiently.
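For comparison, the same word count expressed as coarse-grained RDD transformations in Spark's Python API might look like the sketch below (it assumes an existing SparkContext sc and a placeholder input path):

# Each step is a coarse-grained transformation over a whole RDD.
counts = (sc.textFile("hdfs://input/path")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.cache()          # keep the working set in memory
print(counts.collect())

The cache() call is where Spark's basic idea shows up: the RDD's working set stays in memory across subsequent actions instead of being re-read from storage.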

3 METHODOLOGY

Our experimental methodology, including the choice of hardware platforms, representative workloads, and system configurations, is guided by the following considerations:

• We embark on a study to compare different design decisions in modern processors. Factors that are orthogonal to the processor designs but may affect application performance should therefore be separated out.

• We compare different pipeline architectures. The platforms should be comparable, and it is unfair to compare platforms from vastly different time frames.

• Our target application domain is data analytics in data centers. The workloads should cover the most representative applications and frameworks at present.

3.1 Hardware Platforms

We choose two hardware platforms equipped with processors that have different microarchitectures to evaluate data analytics workloads.

We select the Intel Xeon E5645 (code name Westmere) processor as the state-of-the-practice brawny multi-core processor; it is targeted at the high-end workstation and server system markets. The Xeon E5645 processor has six 2.4 GHz cores per socket [25], and each physical core supports two hardware threads. The E5645 has three levels of cache, the last of which is a 12 MB cache shared by all cores.


TABLE 2
Platform Configurations

Processor        | Xeon E5645  | Tile-Gx36
-----------------|-------------|--------------
Number of Cores  | 6           | 36
Frequency        | 2.4 GHz     | 1.2 GHz
L1 Cache (I/D)   | 32 KB/32 KB | 32 KB/32 KB
L2 Cache         | 256 KB × 6  | 256 KB × 36
L3 Cache         | 12 MB       | 9 MB (pseudo)
TDP              | 80 W        | 45 W
Pipeline Depth   | 16          | 5
Issue Width      | 4           | 3
Technology Node  | 32 nm       | 40 nm
Memory           | 32 GB, DDR3 | 32 GB, DDR3
Ethernet         | 1 Gb        | 1 Gb

We select the Tilera Tile-Gx36 as the state-of-the-practice wimpy many-core processor; it takes the concept of a multi-core to a logical extreme. The Tile-Gx36 integrates 36 tiles (cores) on one processor, and each tile is equipped with two levels of private caches. All the L2 caches form a large pseudo L3 cache with a total volume of 9 MB. The three-way, five-stage VLIW, in-order pipeline contributes to a low-power profile. The VLIW tiles (i.e., cores) are connected by a fully coherent NoC interconnect.

The processors we choose (i.e., Xeon E5645 and Tile-Gx36) can be considered products of the same age: they have the same L1 and L2 cache sizes per core and similar technology nodes (32 nm and 40 nm), but different pipeline architectures. Tables 1 and 2 list the design decisions and detailed configurations of each processor.

We conduct our comparisons between these two kinds of processors at the granularity of a socket, since it is at the socket level that design trade-offs must be made for both power and area constraints. In order to factor out NUMA effects, we configure each server with a single-socket processor and the same amount of memory, 32 GB.

3.2 System Configurations

We deploy a cluster of one master and one slave for each kind of processor, to avoid massive data exchange among slaves and reduce network overhead. There is only a small amount of network communication between the master and the slave, so we use 1 Gb Ethernet, which is sufficient and irrelevant to the overall performance [26], to connect all nodes.

The operating systems used by the Tilera server and the Xeon server are Tilera Enterprise Linux and CentOS, respectively. Both derive from the same upstream source, Red Hat Enterprise Linux, and both run stable Linux 2.6 kernels.

Since this paper focuses on evaluating the processor instead of the whole server, we isolate processor behavior and saturate the CPUs in order to reveal each processor's best performance. In our experimental setup, we overprovision the disk I/O subsystem by using a RAM disk to avoid an I/O bottleneck, as Ferdman et al. did [27]. This approach places the entire data set of our data analytics applications in local memory.

3.3 Applications

In order to cover representative workloads running on well-known data analytics frameworks, we select 10 data analytics workloads in data centers from BigDataBench [28], covering both Hadoop based and Spark based representative applications. Table 3 lists the details of each workload. Considering that our systems are all equipped with 32 GB of memory and that each application's input, intermediate and output data are stored in the RAM disk, we use data sets of about 4 GB to drive each application, so as to leave enough memory for processing. Since Spark is an in-memory computing engine, we have to reduce the input for some applications to a smaller data set, so as to keep the input, intermediate and output data in memory and attain reasonable experiment times. For both the Xeon based cluster and the Tilera based cluster, we use Hadoop version 1.0.2 and Spark version 1.0.2.

TABLE 3
Representative Workloads

Framework | Workload    | Problem Size  | Data Type
----------|-------------|---------------|--------------
Hadoop    | Sort        | 4 GB          | sequence file
Hadoop    | Grep        | 4 GB          | text
Hadoop    | Naive Bayes | 4 GB          | text
Hadoop    | Kmeans      | 4 GB          | text
Hadoop    | PageRank    | 4.4 GB        | web page
Spark     | Sort        | 3 GB          | sequence file
Spark     | Grep        | 4 GB          | text
Spark     | Naive Bayes | 4 GB          | text
Spark     | Kmeans      | 4 GB          | text
Spark     | PageRank    | 2^20 vertices | graph

3.4 Metrics and Tools

3.4.1 Execution Time and Microarchitecture Events

We use wall-clock time to determine execution time. We also use hardware performance counters to understand each technology's influence on application performance. Perf, a profiling tool for Linux 2.6+ based systems [29], is used to access the hardware performance counters.
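For example, raw cycle and instruction counts for a command can be read through perf stat; the Python wrapper below is an illustrative sketch (it assumes perf's CSV output mode and skips events reported as "<not counted>" or "<not supported>"):

import subprocess

def count_events(cmd, events="cycles,instructions"):
    # -x, selects CSV output; perf stat writes counter rows to stderr.
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", events, "--"] + cmd,
        capture_output=True, text=True)
    counts = {}
    for row in result.stderr.strip().splitlines():
        fields = row.split(",")
        if len(fields) > 2 and fields[0].isdigit():
            counts[fields[2]] = int(fields[0])   # event name -> raw count
    return counts

print(count_events(["sleep", "1"]))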

3.4.2 Power

Our evaluation also compares and reports energy characteristics at the processor level. To quantify the energy consumed by the processor, we measure power as follows. We use a Hall-effect current clamp (FLUKE i30s) to measure the current across the 12 V cables that connect the motherboard's voltage regulator module to the processor, and we collect the current measurements with a multimeter (FLUKE norma4000). We compute the power consumed from the stable voltage and the measured current.

3.4.3 Technology Scaling and Projections

The technology node is strongly related to energy and performance. The Xeon E5645 and Tile-Gx36 use different technology nodes (32 nm vs. 40 nm). We adopt the projection method proposed by Blem et al. [30], which is based on the technology characteristics from the 2007 ITRS tables [30], [31] and neglects differences in device type.


In this paper, we normalize the technology node of the Tile-Gx36 (40 nm) to 32 nm (the technology node of the Xeon E5645) with the projection method mentioned above. For the 32 nm projection, the power of the Tile-Gx36 is scaled by 0.8×. The technology scaling may introduce some error, but it is reasonable and does not affect our primary findings.
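As a worked example (the wattage is illustrative, not a measurement from our experiments): the projection is simply P_32nm = 0.8 × P_40nm, so a Tile-Gx36 workload measured at 40 W at the 40 nm node is treated as 0.8 × 40 W = 32 W in all energy comparisons.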

3.4.4 Total Cost of Ownership

Data center systems are very much cost driven, and performance-cost efficiency directly affects design decisions in modern data centers [1]. The performance-cost efficiency metric (i.e., performance per dollar) for a data center is the sustainable performance divided by the total cost of ownership (TCO). Inspired by Polfliet et al. [10], we build a TCO model based on the work by Lim et al. [11], assuming a three-year depreciation cycle. Hardware costs are associated with individual components per server, including CPU, memory, disk, board, power and cooling. Power and cooling costs are computed as follows. We sum the power consumption at the rack level (P_consumed), which is derived from the power consumption of the above-mentioned individual components. The burdened power and cooling costs are then computed using Equation 1, following [11], [32], where K_1 is the infrastructure cost for power delivery, L_1 is the electricity cost for cooling, K_2 is the infrastructure cost for cooling, and U_{$,grid} is the direct electricity cost:

Cost_{power&cooling} = (1 + K_1 + L_1 + K_2 × L_1) × U_{$,grid} × P_{consumed}    (1)

We model two server types: a brawny multi-core type with the Xeon E5645 and a wimpy many-core type with the Tile-Gx36. The cost and energy data originate from diverse sources, including different vendors [33], [34], [35], [36] as well as the corresponding official websites [25], [37]. For the computation of power and cooling costs, we adopt an activity factor of 0.75 to account for the discrepancy between actual power consumption and TDP [11]. In addition, default values are used for three parameters according to [32]: the infrastructure cost for power delivery (K_1 = 1.33), the electricity cost for cooling (L_1 = 0.8) and the infrastructure cost for cooling (K_2 = 0.667). For the direct electricity cost (U_{$,grid}), we use a default value of $100/MWh. Note that we can measure the processor power for each workload; consequently, for the CPU component we use the actual measured power instead of TDP for accuracy. Table 4 describes the fixed values we collected, including the hardware cost and power of each component except the CPU's power. Here we face an issue that many studies also share (e.g., [10], [38] 4): the memory types of the Tilera and E5645 clusters are different. We therefore investigated the prices of the two memory types at different vendors and use the average value across vendors for each type.

The total hardware cost of one Xeon based server is $1610 and that of one Tilera based server is $766. We estimate the total server power, excluding the CPU, to be 105 W for the Xeon based server and 66 W for the Tilera based server. With the measured CPU power, we can compute the three-year power costs and total costs.

4. Those studies also use different memory types.

TABLE 4
Cost Models of Server.

Details                 | Xeon server     | Tilera server
------------------------|-----------------|----------------
Total hardware cost ($) | 1610            | 766
  CPU                   | 593             | 171
  Memory                | 254             | 176
  Disk                  | 238             | 238
  Board + mngmnt        | 275             | 80
  Power + fans          | 250             | 101
Server Power (Watt)     | 105 + CPU power | 66 + CPU power
  CPU                   | Measured        | Measured
  Memory                | 12              | 10
  Disk                  | 16              | 16
  Board + mngmnt        | 42              | 20
  Power + fans          | 35              | 20
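To make the model concrete, the following Python sketch combines Equation 1, the default parameters above and Table 4's fixed values into a three-year performance-per-TCO figure. The function names, the application of the 0.75 activity factor to the non-CPU components only, and the example CPU power and performance numbers are our own illustrative assumptions:

# Sketch of the three-year TCO model (Equation 1 plus Table 4).
K1, L1, K2 = 1.33, 0.8, 0.667   # power/cooling infrastructure parameters [32]
U_GRID = 100.0 / 1e6            # $100/MWh expressed in $/Wh
HOURS_3Y = 3 * 365 * 24         # three-year depreciation cycle
ACTIVITY = 0.75                 # activity factor for TDP-based components [11]

def power_cooling_cost(p_consumed):
    # Burdened three-year power and cooling cost (Equation 1).
    return (1 + K1 + L1 + K2 * L1) * U_GRID * p_consumed * HOURS_3Y

def perf_per_tco(hw_cost, non_cpu_power, cpu_power, perf):
    # Performance (e.g., MB of data processed per second) over three-year TCO.
    p_consumed = ACTIVITY * non_cpu_power + cpu_power  # CPU power is measured
    return perf / (hw_cost + power_cooling_cost(p_consumed))

# Table 4's fixed values; the CPU power and performance numbers below are
# placeholders, not results from this paper.
xeon   = perf_per_tco(hw_cost=1610, non_cpu_power=105, cpu_power=60.0, perf=40.0)
tilera = perf_per_tco(hw_cost=766,  non_cpu_power=66,  cpu_power=35.0, perf=18.0)
print(xeon / tilera)  # ratio > 1 means the Xeon server wins on performance per dollar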

3.5 Limitations

We are aware of this work's limitations, listed below. 1) For a fair comparison of processors of the same generation with different architectures, we could not include the latest and greatest brawny processors, but our choice of processor architectures reflects several distinct trends in state-of-the-art data center processor research. 2) We focus on evaluating homogeneous data centers, and leave it to future work to explore the benefits of heterogeneous architectures, which include a wide range of processing elements and cannot be addressed sufficiently in this paper. 3) Our current work mainly evaluates different processor architectures by running data analytics benchmarks from BigDataBench, and does not cover all scenarios in data centers, for instance, co-locating data analytics workloads with latency-critical workloads as in [39], [40], [41], [42].

Many factors may affect the TCO model. Hardware prices fluctuate over time and always depend on the purchase quantity; sometimes commercial factors are also involved. Energy cost likewise varies from one location to another. Our TCO model is built on the work by Lim et al. [11], and we obtained the cost data from diverse, mostly public, sources to mitigate those factors. Although we analyze the sensitivity to server cost and energy cost in this paper, no single model can meet everybody's needs, so we release the data we measured, publicly available from [12], to allow other researchers to pursue their own studies.

4 EVALUATIONS AND ANALYSIS

In this section we first compare the brawny multi-core processor to the wimpy many-core processor in terms of performance and energy consumption (Section 4.1). Then we investigate whether the frequently used technologies in modern processors (i.e., SMT and dynamic overclocking) benefit data analytics workloads (Sections 4.2 to 4.4). Because data centers are strongly cost driven, we also evaluate the design decisions using the metric of performance per dollar (Section 4.5). Finally, we summarize our findings (Section 4.6).

4.1 Pipeline Architecture Comparison

In this subsection, we investigate the performance, power and energy of the different pipeline architectures in modern processors (the brawny multi-core pipeline architecture and the wimpy many-core pipeline architecture).


Fig. 1. Execution time normalized to Xeon.

In order to compare the basic pipeline architectures purely and investigate which architecture suits data analytics workloads, we disable the additional advanced technologies (i.e., Turbo Boost and Hyper-Threading); they are discussed further in the following subsections.

4.1.1 Performance

We compare pipeline performance comprehensively, covering multiple aspects: execution time, cycle counts and derived pipeline efficiency.

[Execution Time] Figure 1 shows the execution time normalized to Xeon. There are large execution time gaps between the Xeon E5645 and the Tile-Gx36 for all data analytics workloads except the Hadoop based Sort, whose gap is about 1.08×. For the other workloads, gaps of more than 2× exist between Xeon and Tilera. So from the perspective of execution time, the Xeon processor (i.e., Xeon E5645) always beats the Tilera processor (i.e., Tile-Gx36).

[Cycle Counts] Considering that the Xeon E5645 and the Tile-Gx36 have different frequencies (2.4 GHz vs. 1.2 GHz), which affects performance, we normalize away the frequency's impact by using cycle counts, since CPU frequency determines the cycle duration but does not affect the cycle count. Figure 2 presents the cycle counts normalized to the Xeon processor as the baseline. There are huge cycle count gaps between the Xeon E5645 and the Tile-Gx36, ranging from 5.3× to 14×. This implies that the wimpy many-core architecture needs more cycles to complete the same amount of work for all the data analytics workloads we investigate. We can conclude that even when the frequency's impact is normalized away by using cycle counts, the wimpy many-core processor still performs poorly in comparison with the brawny multi-core processor.

[Pipeline Efficiency] The Xeon processor and the Tilera processor have distinct pipeline architectures, with different ISAs (instruction set architectures), design philosophies and purposes. The Xeon E5645 implements dynamic multiple issue, out-of-order cores that can decode, issue, execute and commit up to four instructions per cycle in theory, so its theoretical IPC 5 (instructions per cycle) is 4.

Fig. 2. Cycle counts normalized to Xeon.

The Tile-Gx36, in contrast, uses a Very Long Instruction Word (VLIW) engine in each tile (core), which employs static multiple issue in-order pipelines. Different from dynamic multiple issue out-of-order pipelines, VLIW cores require the program itself to specify explicitly which instructions will be executed in parallel. With a 3-issue pipeline, the Tilera compiler can place up to 3 instructions in each bundle, which indicates a theoretical IPC of 3.

However, application IPC can never reach the theoretical value because of pipeline stalls and data or instruction dependencies. We adopt a pipeline efficiency metric to quantify how close they are: the closer the actual application IPC is to the theoretical one, the more efficient the pipeline is for the application. Equation 2 is used to calculate the pipeline efficiency.

Pipeline Efficiency = Application IPC / Theoretical IPC    (2)

Since the Tile-Gx36 processor only provides a hardware event that profiles instruction bundles rather than instructions, we calculate the Tile-Gx36's pipeline efficiency by dividing the application's instruction bundles per cycle by the theoretical instruction bundles per cycle. The Tile-Gx36's pipeline efficiency is therefore overestimated, since the instructions in each bundle cannot always occupy all available slots (i.e., 3 slots in the 3-issue VLIW pipeline).
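As a concrete illustration of Equation 2 (a sketch with hypothetical counter values; for the Tile-Gx36 we assume one bundle per cycle as the theoretical peak, which is what makes its figure an overestimate):

def pipeline_efficiency(retired, cycles, theoretical_per_cycle):
    # Application IPC (or bundles per cycle) over the theoretical peak (Equation 2).
    return (retired / cycles) / theoretical_per_cycle

# Xeon E5645: instructions retired against the 4-wide theoretical issue width.
print(pipeline_efficiency(retired=8.0e11, cycles=2.0e12, theoretical_per_cycle=4))  # 0.1

# Tile-Gx36: counted in bundles, so partially filled 3-slot bundles inflate the result.
print(pipeline_efficiency(retired=3.0e11, cycles=1.5e12, theoretical_per_cycle=1))  # 0.2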

Figure 3 illustrates the pipeline efficiency of the Xeon and Tilera processors for the different data analytics workloads. There are gaps between application IPC and theoretical IPC for both processors. Even though the Tile-Gx36's pipeline efficiency is overestimated (as discussed above), it is less efficient than the Xeon E5645 for most data analytics workloads. This indicates that the dynamic multiple issue out-of-order pipeline is more suitable for data analytics workloads. The only exception is the Hadoop based Bayes, which may be caused by more idle slots in its instruction bundles, making its value more overestimated than those of the other workloads.

5. IPC refers to the number of instructions that can be executed in a processor cycle in the pipeline. The more instructions complete in a cycle, the better the performance the pipeline can attain.


Fig. 3. Pipeline efficiency of Xeon and Tilera processors for data analytics workloads.

Fig. 4. Power normalized to the Tilera processor.

4.1.2 Power and Energy

Figure 4 shows power normalized to Tilera for each data analytics workload. The Xeon processor draws more power than the Tilera processor, with gaps ranging from 1.18× to 1.66×, which verifies the design intention that Tilera is power optimized. However, the power gaps are much smaller than the execution time gaps, which can result in higher energy consumption for some workloads running on the Tilera processor.

Figure 5 illustrates the energy consumption (the product of time and power) normalized to the Xeon processor for each workload; the Tilera bar shows the Tilera energy consumption. As expected, even though the Tile-Gx36 has low power consumption (Figure 4), it consumes more energy than the Xeon E5645 to complete the same amount of work for most data analytics workloads. The only exception is the Hadoop based Sort. For the other workloads, the energy consumption gaps range from 1.8× to 3.3×. This is because the Tilera processor needs a longer time to finish the work; the longer execution time offsets the lower power and leads to higher total energy.

We note that the Xeon and Tilera processors use different technology nodes: the Xeon E5645 is 32 nm and the Tile-Gx36 is 40 nm. In order to isolate the impact of the technology node, we perform a projection that normalizes the 40 nm technology node to the 32 nm technology node. For the 32 nm projection, the Tile-Gx36's power is scaled by 0.8×, as analyzed in Section 3.4.3.

Fig. 5. Energy consumption normalized to the Xeon processor. (Bars: Xeon, Tilera, Tilera projection.)

The Tilera projection bar in Figure 5 presents the normalized energy consumption after this projection is applied to the Tile-Gx36. If we assume the same technology node, i.e., 32 nm, the Hadoop based Sort gains even more energy savings, increasing from 18% to 35%. But for the other data analytics workloads, the energy consumption gaps shrink only to a slightly smaller range (the largest gap falls from 3.3× to 2.65×). That is to say, even with a technology projection, the conclusion that the Tile-Gx36 consumes more energy than the Xeon E5645 for most data analytics workloads remains the same.

4.1.3 Observations and Implications

From the above phenomena, we can conclude that the wimpy core processor is not as capable as the brawny core processor in most instances. For most data analytics workloads, the static multiple issue in-order pipeline is less efficient than the dynamic multiple issue out-of-order pipeline, which is reasonable: the out-of-order superscalar (i.e., dynamic multiple issue) pipeline executes instructions more efficiently by decreasing pipeline stalls, executing instructions in an order different from the program's, at the cost of increased hardware complexity and power consumption. However, the single core's inefficiency caused by the static multiple issue in-order design cannot be made up for by increasing the core count: for all data analytics workloads, the 36-core Tilera processor still needs longer to finish the work that the 6-core Xeon processor does. Even when we eliminate the impact of frequency by using cycle counts, this conclusion does not change.

Considering energy consumption, the Tilera processor has low power but consumes more energy than the Xeon processor because of its longer execution time. The only exception is the Hadoop based Sort, for which the Tilera consumes less energy than the Xeon with only a slight performance degradation. However, even though the Spark based Sort implements the same algorithm, we do not observe a similar phenomenon. This can be explained by two unique characteristics of the Hadoop based Sort.


Fig. 6. Execution time normalized to workloads running on Xeon without Turbo Boost and Hyper-Threading. (Bars: Turbo & HT disabled, Turbo enabled, HT enabled, HT & Turbo enabled.)

The first characteristic is that, different from most data analytics workloads, Sort's input data size equals its output data size, so each stage of the MapReduce job generates a large amount of data. This gives Sort more I/O operations than the other workloads. The second unique characteristic is that Sort has simple computing logic, only comparisons, so it can process a large amount of data in a short period of time. These characteristics make the Hadoop based Sort perform frequent I/O operations and become an extremely I/O-intensive workload 6. Note that we use a RAM disk to reduce the latency of I/O operations, but this does not change the workload's behavior. The Spark based Sort is different: Spark is an in-memory computing framework, and unlike in Hadoop, most of the intermediate data are stored in memory instead of on disk, which eliminates many I/O operations and makes the Spark based Sort less I/O intensive than the Hadoop based Sort.

We find a similar phenomenon in the work by Lim et al. [11]: Hadoop distributed file system write operations, which are also extremely I/O-intensive workloads, are more energy efficient on wimpy core processors than on brawny core processors. From the above discussion, we conclude that the wimpy core processor is more suitable for extremely I/O-intensive workloads, which perform a large number of I/O operations in each stage of the whole job and have simple computing logic. Such workloads can save energy (up to 35% in our Hadoop based Sort case) with only slight performance degradation (about 8%) on wimpy many-core processors. For the other data analytics workloads, the brawny multi-core processor is preferred.

4.2 Does Dynamic Overclocking Help?

Most high performance processors integrate dynamic overclocking technology so as to gain a performance boost when the operating system requests the highest processor performance. The Intel Xeon E5645's dynamic overclocking feature is called Turbo Boost technology. With Turbo Boost, the processor can opportunistically adjust the frequency of its cores.

6. We observe that the Hadoop based Sort performs about 2× to 4× more I/O operations than the other workloads in our experiments, which is consistent with [43].

processor’s scaling available frequency ranges from 1.6 GHzto 2.4 GHz. With Turbo Boost enabled, the frequency can beincreased up to 2.67 GHz based on the processor’s power,current and thermal limits, the number of cores currently inuse and the maximum frequency of the active cores. In thissubsection, we would like to investigate the Turbo Boosttechnology’s impact on performance and energy consump-tion. We run all the workloads mentioned in Section 3.3 onall six cores with and without Turbo Boost.

4.2.1 Performance

We use the workload execution times with neither Turbo Boost nor Hyper-Threading as the baseline and present the normalized execution times in Figure 6, where the data measure the speedup due to enabling the corresponding technology. The Turbo enabled bar in Figure 6 shows the normalized execution time for data analytics workloads running with Turbo Boost enabled.

We find that not all of the workloads favor Turbo Boost: some see performance gains while others do not. The Hadoop based Kmeans has the highest speedup among the ten workloads we investigate because of its good locality, visible in Figure 7. An application with good locality does not spend much time on memory accesses, which favors the Turbo Boost technology [44]. Figure 7 presents the L2 cache and L3 cache misses per kilo instructions (MPKI) of the data analytics workloads running on the Xeon processor. We do not measure the L1 data cache's statistics because its miss penalty can be hidden by out-of-order cores in modern processors. The Hadoop based Kmeans has the lowest L2 and L3 cache MPKI, which indicates good locality.

We also find from Figure 6 that the Hadoop based workloads achieve larger performance gains with Turbo Boost than the Spark based workloads on average. This is because the Spark based workloads are more memory intensive and most of their memory accesses are bursty, as observed by Jiang et al. [45]. The data in Figure 7 indicate that the Hadoop based workloads have better locality than the Spark based workloads: the Spark based workloads have higher L2 and L3 cache MPKI on average, which is consistent with [46].

4.2.2 Energy Consumption

When Turbo Boost is enabled, the data analytics workloads may gain performance from the increased core frequency. However, the performance improvements come at the cost of more energy consumed. Figure 8 shows the energy consumption normalized to the workloads running with neither Turbo Boost nor Hyper-Threading; the Turbo enabled bar shows each workload's normalized energy consumption with Turbo Boost. We find that Turbo Boost increases the energy consumption for most data analytics workloads, even when a workload does not achieve a performance gain. The only workload that consumes less energy with Turbo Boost enabled is the Hadoop based Kmeans, which saves about 3% of energy thanks to its shorter execution time.


Fig. 7. L2 cache and L3 cache MPKIs on Xeon.

4.2.3 Observations and Implications

From the data shown in Sections 4.2.1 and 4.2.2, we conclude that Turbo Boost does not favor all data analytics workloads. For poor-locality workloads with many outstanding long-latency memory accesses, the increased core frequency from Turbo Boost cannot offset the penalty of those accesses, and the performance improvement is limited. For workloads with good locality, Turbo Boost can improve performance and even reduce energy consumption by shortening execution time.

4.3 Does SMT Help?

Simultaneous multithreading (SMT) is a processor design that combines hardware multithreading with superscalar processor technology to allow instructions from more than one thread to be executed in any given pipeline stage at a time, sharing the execution engine, the caches, the system-bus interface and the firmware. SMT has the potential to hide memory latency, increase efficiency, and correspondingly increase the amount of work done in a given period of time. SMT has been shown to provide performance gains for many traditional workloads for the above reasons; however, it may also cause performance degradation for some workloads because of shared-resource conflicts [47], [48], [49]. We investigate whether SMT benefits data analytics workloads. The Intel Xeon E5645 processor implements two-way Hyper-Threading, Intel's proprietary simultaneous multithreading implementation. That is, when we enable Hyper-Threading, each Xeon E5645 core appears as two CPUs to the operating system.

We run all the workloads mentioned in Section 3.3 with and without Hyper-Threading. When Hyper-Threading is enabled, the OS sees twice as many CPUs as with Hyper-Threading disabled, so we correspondingly double the thread counts of the Hadoop and Spark based workloads in order to take advantage of Hyper-Threading.

4.3.1 Performance

The HT enabled bar in Figure 6 shows the speedup due to enabling Hyper-Threading. We find that all the data analytics workloads achieve performance gains through Hyper-Threading.

Fig. 8. Energy consumption normalized to workloads running on Xeon without Turbo Boost and Hyper-Threading. (Bars: Turbo & HT disabled, Turbo enabled, HT enabled, HT & Turbo enabled.)

The speedup ratio ranges from 1.003 to 1.55. The performance improvements arise because the data analytics workloads' characteristics match the features Hyper-Threading requires. Data analytics applications in data centers are designed for parallel execution. Whether Hadoop based or Spark based, each job is split, by partitioning the input data set, into many independent tasks that work on parts of the data in parallel. Each task can be given to a software thread that executes in parallel with the other software threads, with no data races among them. The data analytics applications thus exhibit partitioned parallelism [21] and good scalability, scaling with the number of hardware threads. The partitioned parallelism means two hardware threads on the same core have no dependency, so if one thread stalls, e.g., on a cache miss, instructions from the other thread can still be issued to the pipeline. The good scalability gives us the opportunity to accelerate the applications by providing more hardware threads. All these features match the characteristics under which Intel Hyper-Threading Technology achieves performance improvements [50]. This explains why data analytics workloads gain performance from enabling Hyper-Threading.

In Figure 6, workloads obtain different performance gains according to their diverse characteristics. The Hadoop based Sort acquires the largest performance boost among the ten data analytics workloads because it has many I/O operations (explained in Section 4.1.3) and poor locality (many cache misses in Figure 7). This means the Hadoop based Sort spends a great deal of time on memory accesses, much of whose latency Hyper-Threading can hide, so it obtains a large performance boost. However, Hyper-Threading cannot hide latency endlessly. When there are too many memory accesses or the latency is too long, e.g., many bursty memory accesses, many outstanding memory access instructions are in flight and new instructions cannot be issued for lack of entries in the reorder buffer, the reservation stations or the load-store buffer. In that situation, both hardware threads on one core can be blocked and Hyper-Threading can obtain no further performance improvement. Some Spark based workloads seem to encounter such a situation because of their poor locality and bursty memory accesses 7.


This explains why the Spark based workloads, including Sort, do not gain as much performance as the Hadoop based workloads.

4.3.2 Energy Consumption

The HT enabled bar in Figure 8 shows the normalized energy consumption of each workload running with Hyper-Threading. We find that even though Hyper-Threading accelerates all data analytics applications and may keep the execution resources occupied, it does not necessarily consume extra energy. Among the ten workloads, Hyper-Threading incurs an increase in energy consumption of at most 5%. For more than half of the workloads (6 out of 10), Hyper-Threading yields energy savings (up to 17%), because the shortened execution time offsets the extra energy consumed.
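The arithmetic behind such savings is simple (the percentages here are illustrative, not measured): since energy is the product of power and time, a run whose average power rises by 20% but whose execution time falls by 35% consumes 1.2 × 0.65 ≈ 0.78 of the baseline energy, a 22% saving.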

4.3.3 Observations and Implications

Hyper-Threading is a useful feature for data analytics workloads, especially those with a large number of memory accesses of limited latency. Hyper-Threading not only improves application performance but can also cut energy consumption by shrinking execution time. However, for workloads that exhibit a large number of long-latency memory accesses, Hyper-Threading has limited benefit, because those memory accesses may block both threads on the same core.

4.4 How about Enabling Both SMT and Dynamic Overclocking?

The results that enabling Turbo Boost or Hyper-Threading alone yields performance improvements inspire us to investigate the effect of enabling both.

4.4.1 Performance

The HT & Turbo enabled bar in Figure 6 presents the normalized execution time when both technologies are enabled. When we enable both Turbo Boost and Hyper-Threading, most workloads achieve additional performance gains compared with enabling only one of them. Some workloads that gain little when only Turbo Boost is enabled obtain a considerable performance boost with both Turbo Boost and Hyper-Threading. For instance, the Hadoop based Grep has only about a 1.7% performance boost with Turbo Boost alone and a 9.4% boost with Hyper-Threading alone, but with both enabled it achieves a 22.4% boost. Taking Hadoop based Sort as another example: it achieves no performance gain with Turbo Boost alone and about a 55% boost with Hyper-Threading alone, but with both enabled it gains further, reaching a 67.4% improvement over the baseline.

7. Jiang et al. [45] find that the burst memory accesses of Spark workloads contribute even more than 90% of memory bus traffic.

Fig. 9. Performance per TCO normalized to the Tile-Gx36 processor (bars: Tilera; Xeon with Turbo & HT disabled; Turbo enabled; HT enabled; Turbo & HT enabled).

From the analysis in Section 4.2.1, we know that some data analytics workloads gain little from Turbo Boost because they perform many memory accesses. Hyper-Threading, however, can hide memory latency, which alleviates the very factor that limits Turbo Boost's effect on data analytics workloads. This should be why the two technologies enhance each other.
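A quick sanity check of this mutual enhancement, using the Hadoop based Grep numbers reported above: if the two features acted independently, their speedups would compose roughly multiplicatively, yet the measured combined speedup is clearly larger.

# Speedup factors from the Hadoop based Grep results above.
turbo_only = 1.017   # +1.7% with Turbo Boost alone
ht_only    = 1.094   # +9.4% with Hyper-Threading alone
both       = 1.224   # +22.4% with both enabled

expected_if_independent = turbo_only * ht_only   # ~1.113
print(both > expected_if_independent)            # True: super-multiplicative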

4.4.2 Energy Consumption

The HT & Turbo enabled bar in Figure 8 shows the normalized energy consumption of each workload running with both Hyper-Threading and Turbo Boost. With Hyper-Threading already enabled, most workloads consume additional energy once Turbo Boost is also enabled. The only exception is the Spark based Naive Bayes, for which the extra power is offset by the reduced execution time. For most of the data analytics workloads we investigate in this paper, enabling both Hyper-Threading and Turbo Boost consumes less energy than enabling Turbo Boost alone. Only the Hadoop based Grep and Pagerank and the Spark based Grep show slightly increased energy consumption; considering the performance improvements they obtain, the small additional energy cost (no more than 2%) is acceptable. The Hadoop based Kmeans consumes more energy with both Turbo Boost and Hyper-Threading enabled than with Turbo Boost alone. This is most probably because the logic that supports Hyper-Threading (e.g., the replicated renamed return stack buffer) is occupied and consumes extra energy, while Hadoop based Kmeans gains little performance owing to its good locality.

4.4.3 Observations and Implications

Enabling both Turbo Boost and Hyper-Threading improves performance for nearly all data analytics workloads, with only a slight increase in energy consumption; for some workloads it even saves energy. So enabling both Turbo Boost and Hyper-Threading is recommended for most data analytics workloads.

4.5 Performance-cost Efficiency

The design decisions for data center systems are always driven by both performance and cost [1]. So here we investigate the performance-cost efficiency of each design decision mentioned above. We use the metric performance per TCO to represent performance-cost efficiency, based on the TCO (Total Cost of Ownership) model built in Section 3.4.4.

Fig. 10. Normalized performance per TCO as a function of server cost ratio (one curve per Hadoop and Spark workload).

4.5.1 Performance per TCO

Figure 9 quantifies the performance per TCO, normalized to the Tilera based server. Because the Xeon processor has several configurations with different technologies enabled or disabled, we show the performance per TCO for each of them in Figure 9. Performance per TCO is defined as performance (i.e., data processed per second) divided by TCO, so it is a higher-is-better metric. For nearly all data analytics workloads, the Xeon based server delivers higher performance per TCO. The only exception is the Hadoop based Sort, which achieves higher performance per TCO on the Tilera based server; even then, compared with the best Xeon case (the Turbo & HT enabled bar of Hadoop based Sort in Figure 9), the Tilera based server gains only about 3% in performance per TCO.
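As a sketch of how this metric can be computed (the actual cost breakdown comes from the TCO model in Section 3.4.4; the simplified three-year lifetime cost below, and all the numbers in the example, are our own assumptions):

def performance_per_tco(records_per_sec, server_cost_dollars,
                        avg_power_watts, cents_per_kwh,
                        lifetime_years=3.0):
    # Performance per TCO: data processed per second divided by the total
    # cost of owning the server over its lifetime (higher is better).
    # Simplified stand-in for the model of Section 3.4.4.
    hours = lifetime_years * 365 * 24
    energy_dollars = (avg_power_watts / 1000.0) * hours * cents_per_kwh / 100.0
    return records_per_sec / (server_cost_dollars + energy_dollars)

# A faster but pricier server can still lose on this metric:
print(performance_per_tco(2.0e6, 3000, 300, 6.7))   # brawny-like: ~567
print(performance_per_tco(1.0e6, 1000, 50, 6.7))    # wimpy-like:  ~919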

Comparing the Xeon based server's performance per TCO with and without Turbo Boost and Hyper-Threading, the configuration with Hyper-Threading is always the winner: for all workloads, the Xeon based server running with Hyper-Threading obtains higher performance per TCO (the HT enabled bar in Figure 9). Enabling both Hyper-Threading and Turbo Boost further achieves the highest performance per TCO for most workloads; the exceptions are the Hadoop based Kmeans and the Spark based Kmeans, due to their limited performance gains when both technologies are enabled.

4.5.2 Sensitivity Analysis

So far we have used the TCO model with the default parameters listed in Table 4. However, calculating the TCO involves uncertainties. Server costs fluctuate case by case and are affected by time and by the number of servers bought, while energy cost varies by location. To deal with these uncertainties, we perform a sensitivity analysis focused on two main factors: server purchase cost and energy cost.

[Server Cost Sensitivity]

Fig. 11. Performance per TCO normalized to the Tile-Gx36 processor, as a function of energy cost (cents per kWh; one curve per Hadoop and Spark workload).

Given the fluctuation of server costs, it is more meaningful to discuss quantitatively which kind of server is more performance-cost efficient. We therefore report normalized performance per TCO as a function of the server cost ratio. Figure 10 shows the result: the X-axis is the cost ratio between the Xeon and Tilera servers, and the Y-axis is the Xeon server's performance per TCO normalized to the Tilera server, so a value greater than one means the Xeon server is the more performance-cost-efficient of the two. The normalized performance per TCO is very sensitive to the cost ratio, and the Tilera server gains an advantage as the ratio increases. If a Xeon server costs more than 8 times as much as a Tilera server, the Tilera server outperforms the Xeon server for most workloads (eight out of ten) from the perspective of performance-cost efficiency. When the ratio exceeds 10, the Tilera server is a clear winner.
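The break-even behavior in Figure 10 can be reproduced with a simple sweep. In the sketch below every input is a placeholder rather than a measured value (we picked them so the curve crosses 1.0 near the observed break-even ratio of about 8); the real figure is driven by the per-workload measurements:

# Sweep the Xeon/Tilera purchase-cost ratio; a normalized value above 1
# means the Xeon server is the more performance-cost-efficient choice.
def normalized_perf_per_tco(perf_ratio, cost_ratio, energy_cost_ratio,
                            tilera_cost=1.0, tilera_energy=0.3):
    xeon = perf_ratio / (cost_ratio * tilera_cost
                         + energy_cost_ratio * tilera_energy)
    tilera = 1.0 / (tilera_cost + tilera_energy)
    return xeon / tilera

for ratio in range(2, 13):
    print(ratio, round(normalized_perf_per_tco(6.6, ratio, 2.0), 2))
# With these placeholders the value drops below 1.0 at a cost ratio of ~8.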

[Energy Cost Sensitivity]

The energy cost also fluctuates with location and time.

Different cities may have quite different energy prices. Figure 11 illustrates the normalized performance per TCO as a function of the energy cost per kilowatt-hour. The X-axis shows the price per kilowatt-hour in cents, and the Y-axis quantifies the Xeon server's performance per TCO normalized to the Tilera server. The curves in the figure are very flat, which means the performance-cost-efficiency comparison between the Xeon and Tilera servers is not sensitive to energy cost. The curves decline slightly as the price per kilowatt-hour decreases. Even under fluctuating energy cost, the Xeon server still outperforms the Tilera server for most workloads in the figure from the perspective of performance-cost efficiency. Unless the cost of electricity dropped to a small fraction of one cent 8, we would not see the Tilera server outperform the Xeon server for most workloads.

8. This cost is far below the average U.S. industrial electricity price, which was 6.7 cents/kWh in 2012 [1].
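The flatness of these curves has a straightforward explanation: over a server's lifetime the electricity bill is a small slice of the TCO, so even a large price swing barely moves the comparison. An illustrative calculation (the power figure and lifetime are our assumptions):

# A ~300 W server over a three-year lifetime.
power_kw, hours = 0.3, 3 * 365 * 24
for cents_per_kwh in (4, 15):
    energy_dollars = power_kw * hours * cents_per_kwh / 100.0
    print(cents_per_kwh, round(energy_dollars))
# 4 c/kWh -> ~$315 vs. 15 c/kWh -> ~$1183: a nearly 4x price swing that
# is small against a multi-thousand-dollar server purchase cost.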

4.5.3 Observations and Implications

From the discussion using the fixed costs in Table 4, we find that the brawny multi-core is more performance-cost efficient than the wimpy many-core for most data analytics workloads in data centers. The only exception is the extremely I/O-intensive workload (i.e., the Hadoop based Sort in this paper), which achieves slightly higher performance-cost efficiency on the wimpy many-core server than on the brawny multi-core server.

We also consider the uncertainties in the TCO model. In the comparison of performance-cost efficiency between the brawny multi-core and the wimpy many-core processor, the fluctuation of energy cost has little impact on the result, whereas the fluctuation of server cost is a sensitive factor that can change it. The benefit of the wimpy many-core grows as the server cost ratio increases, but only when the cost ratio exceeds 8 does the wimpy many-core outweigh the brawny multi-core for most data analytics workloads in terms of performance-cost efficiency.

4.6 Summary

We can state the following findings based on the observations discussed above:

1) The brawny multi-core processors need less time and fewer processor cycles to complete the same amount of work than the wimpy many-core processors. The brawny multi-core pipeline is also more efficient than the wimpy many-core pipeline for data analytics workloads in data centers.

2) The wimpy many-core processor is more energy efficient than the brawny multi-core processor only for extremely I/O-intensive data analytics workloads, which perform many I/O operations in every stage of the job and have simple computational logic. For the other data analytics workloads (nine out of ten in this paper), the higher performance achieved on the brawny multi-core processor offsets its higher energy consumption and may even offset its higher cost. So the brawny multi-core processor is more suitable for data analytics workloads in terms of both performance and energy consumption.

3) The dynamic overclocking technology (Intel's implementation is called Turbo Boost) can improve performance for data analytics workloads with good locality, but it consumes extra energy even when there is no performance improvement. If a workload has good locality, enabling dynamic overclocking is preferable; otherwise, the potential energy overhead should be kept in mind.

4) The SMT technology (Intel's implementation is called Hyper-Threading) improves the performance of all the data analytics workloads we investigated in this paper and can even save energy for some workloads owing to the reduced execution time. So SMT is always preferred.

5) SMT and dynamic overclocking have mutually enhancing effects. Enabling both of them is recommended for most data analytics workloads in data centers.

6) From the perspective of performance-cost efficiency, the brawny multi-core server achieves higher performance per TCO than the wimpy many-core server for most data analytics workloads under our fixed-cost TCO model. When server costs vary, only if the brawny multi-core server costs more than 8 times as much as the wimpy many-core server can the wimpy many-core server outweigh the brawny multi-core server for most data analytics workloads.

From the above findings, we conclude that brawny multi-core processors with SMT and dynamic overclocking technologies suit most data analytics workloads.

5 RELATED WORK

The design decisions for data center systems have drawn much discussion in recent years. Much work studies constructing data centers from low-power wimpy cores to improve energy efficiency in data processing [7], [51], [52], [53], [54]. Vasudevan et al. [51] propose a cluster architecture called FAWN to provide fast and cost-effective data access for key-value stores. Similarly, Berezecki et al. [7] propose using wimpy-core processors for key-value stores. Reddi et al. [52] evaluate web search services and quantify the efficiency of wimpy-core processors. The efficiency of wimpy servers for internet-scale services is evaluated in [53]. In the field of DBMS, wimpy servers have also been recommended for constructing energy-proportional clusters [54]. However, some studies question the efficiency of using small, wimpy cores in data center systems [5], [8]. Hölzle [8] analyzes the disadvantages of switching to wimpy cores in terms of additional software development costs. Lang et al. [5] find that complex queries may make scale-out with low-end nodes an expensive and lower-performance solution.

All of the above work focuses on service workloads (web search, data management systems) rather than data analytics workloads in data centers. The most closely related work is by Anwar et al. [55], who evaluate the efficacy of ARM based microservers in supporting Hadoop applications with a metric they define, PerfEC. Compared with their work, we adopt a different evaluation granularity and a different experimental methodology to compare brawny multi-core with wimpy many-core processors. We perform evaluations at the granularity of the processor socket instead of the server, so our methodology separates out factors that are orthogonal to the processor design: for instance, we configure each server with a single-socket processor and the same amount of memory, and we isolate processor behaviors to saturate processor performance. Moreover, we employ more state-of-the-art data analytics workloads in data centers. Different conclusions are drawn for these reasons.

Lim et al. [11] propose a TCO model, on which we build ours, and study a new solution using internet-sector workloads. Polfliet et al. [10] also evaluate data-centric workloads with a TCO model. Unlike our work, most of the workloads they investigate are service workloads or traditional CPU-intensive workloads. Moreover, our evaluations are based on realistic state-of-the-practice platforms instead of simulators: we run real data analytics workloads on real hardware, so the performance and power data are more convincing.

Prior work also studies the performance of the SMT and Turbo Boost technologies with traditional workloads rather than data analytics workloads in data centers. Huang et al. [48] and Mathis et al. [47] examine the characteristics of traditional applications running on an Intel Hyper-Threading processor and an SMT-capable IBM Power5 processor, respectively. Charles et al. [44] evaluate the Intel Core i7 Turbo Boost feature with SPEC CPU2006.

6 CONCLUSION AND FUTURE WORK

The great value hidden in very large data sets is driving the deployment of data center systems. Various approaches have been proposed, in both industry and academia, to improve the performance or reduce the energy consumption of data centers. In this paper, we conduct comprehensive evaluations of data analytics applications in data centers on two state-of-the-practice platforms: the Intel Xeon E5645 and the Tilera Tile-Gx36. Our evaluations reveal that brawny multi-core processors with SMT and dynamic overclocking technologies outperform their counterparts in terms of not only execution time but also energy efficiency and performance-cost efficiency for most of the data analytics workloads we investigated in this paper.

Heterogeneous data centers [56], [57], [58], with a wide range of processing elements (e.g., GPUs and FPGAs in addition to CPUs), hold promise for both performance and energy efficiency, and they set the direction of our future research. We plan to investigate the benefits of heterogeneous architectures in the future.

ACKNOWLEDGMENTS

We are very grateful to the anonymous reviewers. This work is supported by the State Key Development Program for Basic Research of China (Grant No. 2014CB340402), the Major Program of the National Natural Science Foundation of China (Grant No. 61432006), and the Natural Science Foundation of China (Grant No. 61502279).

REFERENCES

[1] L. A. Barroso, J. Clidaras, and U. Hölzle, "The datacenter as a computer: An introduction to the design of warehouse-scale machines," Synthesis Lectures on Computer Architecture, vol. 8, no. 3, pp. 1–154, 2013.

[2] "Facebook server count: 60,000 or more," http://www.datacenterknowledge.com/archives/2010/06/28/facebook-server-count-60000-or-more/.

[3] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt, "Understanding and designing new server architectures for emerging warehouse-computing environments," in Computer Architecture, 2008. ISCA '08. 35th International Symposium on. IEEE, 2008, pp. 315–326.

[4] J. Hamilton, "Overall data center costs," http://perspectives.mvdirona.com/2010/09/18/OverallDataCenterCosts.aspx.

[5] B.-G. Chun, G. Iannaccone, G. Iannaccone, R. Katz, G. Lee, and L. Niccolini, "An energy case for hybrid datacenters," ACM SIGOPS Operating Systems Review, vol. 44, no. 1, pp. 76–80, 2010.

[6] W. Lang, J. M. Patel, and S. Shankar, "Wimpy node clusters: What about non-wimpy workloads?" in DaMoN 2010, 2010, pp. 47–55.

[7] B. Jia, "Facebook: Using emerging hardware to build infrastructure at scale," BPOE 2013, October 2013.

[8] M. Berezecki, E. Frachtenberg, M. Paleczny, and K. Steele, "Many-core key-value store," in IGCC 2011. IEEE, 2011, pp. 1–8.

[9] U. Hölzle, "Brawny cores still beat wimpy cores, most of the time," IEEE Micro, vol. 30, no. 4, 2010.

[10] D. Fan, H. Zhang, D. Wang, X. Ye, F. Song, G. Li, and N. Sun, "Godson-T: An efficient many-core processor exploring thread-level parallelism," IEEE Micro, vol. 32, no. 2, pp. 38–47, 2012.

[11] S. Polfliet, F. Ryckbosch, and L. Eeckhout, "Optimizing the datacenter for data-centric workloads," in Proceedings of the International Conference on Supercomputing. ACM, 2011, pp. 182–191.

[12] http://prof.ict.ac.cn/bdb uploads/Processor ZhenJia/TBAdata.zip.

[13] K. Ren, Y. Kwon, M. Balazinska, and B. Howe, "Hadoop's adolescence: An analysis of Hadoop usage in scientific workloads," Proceedings of the VLDB Endowment, vol. 6, no. 10, pp. 853–864, 2013.

[14] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor," in ACM SIGARCH Computer Architecture News, vol. 24, no. 2. ACM, 1996, pp. 191–202.

[15] Intel Corporation, "Intel Turbo Boost Technology 2.0."

[16] E. Rotem, A. Naveh, A. Ananthakrishnan, D. Rajwan, and E. Weissmann, "Power-management architecture of the Intel microarchitecture code-named Sandy Bridge," IEEE Micro, no. 2, pp. 20–27, 2012.

[17] G. Blake, R. G. Dreslinski, and T. Mudge, "A survey of multicore processors," Signal Processing Magazine, IEEE, vol. 26, no. 6, pp. 26–37, 2009.

[18] ARM Ltd., "The ARM Cortex-A9 processors," White Paper, 2007.

[19] Tensilica, "Configurable processors: What, why, how?" Tensilica White Paper.

[20] C. Ramey, "Tile-GX100 manycore processor: Acceleration interfaces and architecture," Tilera Corporation, 2011.

[21] D. DeWitt and J. Gray, "Parallel database systems: The future of high performance database systems," Communications of the ACM, vol. 35, no. 6, pp. 85–98, 1992.

[22] T. White, Hadoop: The Definitive Guide. O'Reilly Media, 2009.

[23] "Powered by Hadoop," http://wiki.apache.org/hadoop/PoweredBy.

[24] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in USENIX Conference on Networked Systems Design and Implementation, 2012.

[25] "Intel Xeon processor E5645," http://ark.intel.com/products/48768/.

[26] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun, "Making sense of performance in data analytics frameworks," in NSDI, 2015.

[27] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, "Clearing the clouds: A study of emerging scale-out workloads on modern hardware," in Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012), 2012.

[28] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang et al., "BigDataBench: A big data benchmark suite from internet services," in IEEE International Symposium on High Performance Computer Architecture, 2014.

[29] "Perf: Linux profiling with performance counters," https://perf.wiki.kernel.org/index.php/Main_Page.

[30] E. Blem, J. Menon, and K. Sankaralingam, "Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures," in High Performance Computer Architecture (HPCA 2013), 2013 IEEE 19th International Symposium on. IEEE, 2013, pp. 1–12.

[31] S. I. Association et al., "International technology roadmap for semiconductors, 2007," http://public.itrs.net, 2004.

[32] C. D. Patel and A. J. Shah, "Cost model for planning, development and operation of a data center," 2005.

[33] http://www.micron.com.

[34] http://www.newegg.com.

[35] http://www.seagate.com.

[36] http://www.buildcomputers.net/.

[37] http://www.tilera.com.

[38] K. T. Malladi, B. C. Lee, F. A. Nothaft, C. Kozyrakis, K. Periyathambi, and M. Horowitz, "Towards energy-proportional datacenter memory with mobile DRAM," in Computer Architecture (ISCA), 2012 39th Annual International Symposium on, vol. 40, no. 3. IEEE Computer Society, 2012, pp. 37–48.

[39] G. Lu, J. Zhan, C. Tan, X. Lin, D. Kong, T. Hao, L. Wang, F. Tang, and C. Zheng, ""Isolate first, then share": A new OS architecture for the worst-case performance," arXiv preprint arXiv:1604.01378v4, 2017.

[40] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at Google with Borg," in Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015, p. 18.

[41] A. Goder, A. Spiridonov, and Y. Wang, "Bistro: Scheduling data-parallel jobs against live production systems," in USENIX Annual Technical Conference, 2015, pp. 459–471.

[42] Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura, I. Goiri, and R. Bianchini, "History-based harvesting of spare cycles and storage in large-scale datacenters," in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 2016, pp. 755–770.

[43] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo, "Characterizing data analysis workloads in data centers," in Workload Characterization (IISWC), 2013 IEEE International Symposium on. IEEE, 2013.


[44] J. Charles, P. Jassi, N. S. Ananth, A. Sadat, and A. Fedorova, "Evaluation of the Intel® Core i7 Turbo Boost feature," in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 2009, pp. 188–197.

[45] T. Jiang, Q. Zhang, R. Hou, L. Chai, S. A. McKee, Z. Jia, and N. Sun, "Understanding the behavior of in-memory computing workloads," in Workload Characterization (IISWC), 2014 IEEE International Symposium on. IEEE, 2014.

[46] Z. Jia, J. Zhan, L. Wang, R. Han, S. A. McKee, Q. Yang, C. Luo, and J. Li, "Characterizing and subsetting big data workloads," in Workload Characterization (IISWC), 2014 IEEE International Symposium on. IEEE, 2014.

[47] H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel, "Characterization of simultaneous multithreading (SMT) efficiency in POWER5," IBM Journal of Research and Development, vol. 49, no. 4.5, pp. 555–564, 2005.

[48] W. Huang, J. Lin, Z. Zhang, and J. M. Chang, "Performance characterization of Java applications on SMT processors," in Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE International Symposium on. IEEE, 2005, pp. 102–111.

[49] J. Funston, K. El Maghraoui, J. Jann, P. Pattnaik, and A. Fedorova, "An SMT-selection metric to improve multithreaded applications' performance," in Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, May 2012, pp. 1388–1399.

[50] A. Valles, M. Gillespie, and G. Drysdale, "Performance insights to Intel® Hyper-Threading Technology," Intel Software Network, November 20, 2009. http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threadingtechnology, 2010.

[51] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan, "FAWN: A fast array of wimpy nodes," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 2009, pp. 1–14.

[52] V. J. Reddi, B. C. Lee, T. Chilimbi, and K. Vaid, "Web search using mobile cores: Quantifying and mitigating the price of efficiency," in ISCA 2010, 2010, pp. 314–325.

[53] J. R. Hamilton, "Internet-scale data center power efficiency," in CIDR 2009, 2009.

[54] D. Schall and V. Hudlet, "WattDB: An energy-proportional cluster of wimpy nodes," in SIGMOD 2011, 2011, pp. 1229–1232.

[55] A. Anwar, K. Krish, and A. Butt, "On the use of microservers in supporting Hadoop applications," in Cluster Computing (CLUSTER), 2014 IEEE International Conference on, Sept 2014, pp. 66–74.

[56] J. Zhu, J. Li, E. Hardesty, H. Jiang, and K.-C. Li, "GPU-in-Hadoop: Enabling MapReduce across distributed heterogeneous platforms," in Computer and Information Science (ICIS), 2014 IEEE/ACIS 13th International Conference on. IEEE, 2014, pp. 321–326.

[57] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. IEEE, 2014, pp. 13–24.

[58] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.

Zhen Jia is a Postdoctoral Research Associate in the Computer Science Department, Princeton University. He obtained his Ph.D. degree in 2016 from the Institute of Computing Technology, Chinese Academy of Sciences and the University of Chinese Academy of Sciences, Beijing, China. His research focuses on parallel and distributed systems, benchmarks, and data center workload characterization. He received his B.S. degree in 2010 from Dalian University of Technology in China.

Wanling Gao is a Ph.D. candidate in computer science at the Institute of Computing Technology, Chinese Academy of Sciences and the University of Chinese Academy of Sciences. Her research interests focus on big data benchmarks and big data analytics. She received her B.S. degree in 2012 from Huazhong University of Science and Technology.

Yingjie Shi received the B.S. degree from Shandong University in 2005, the M.S. degree from Huazhong University of Science and Technology in 2007, and the Ph.D. degree from Renmin University of China in 2013, all in computer science and technology. She is currently an associate professor at the Beijing Institute of Fashion Technology. Her research interests include cloud data management, online aggregation of big data, and data management in wearable computing.

Sally A. McKee received a BA degree from Yale University in 1985, an MSE degree from Princeton University in 1990, and a PhD degree from the University of Virginia in 1995, all in Computer Science. She is a Professor of Computer Science and Engineering at Chalmers University of Technology in Gothenburg, Sweden. She is currently working on the design of cryogenic and heterogeneous memory subsystems at Rambus Labs while on sabbatical leave from Chalmers. Her research interests include workload characterization, power and performance modeling and analysis, and the design of efficient, adaptable memory systems. She is a senior member of the IEEE, the IEEE Computer Society, and the ACM.


Zhenyan Ji, Ph.D., is an Associate Professor and a member of the Theoretical Computer Science Technical Committee, CCF. She received her Ph.D. degree from the Institute of Software, Chinese Academy of Sciences. She worked for NTNU (Norwegian University of Science and Technology) and Mid Sweden University for many years, and now works for Beijing Jiaotong University. Her main research interests include data mining, image registration, and distributed computing.

Jianfeng Zhan is a Full Professor and Deputy Director at the Computer Systems Research Center, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). His research work is driven by interesting problems. He enjoys building new systems and collaborating with researchers from different backgrounds. He founded the BPOE workshop, which focuses on big data benchmarks, performance optimization, and emerging hardware.

Lei Wang received the master's degree in computer engineering from the Chinese Academy of Sciences, Beijing, China, in 2006. He is currently a senior engineer at the Institute of Computing Technology, Chinese Academy of Sciences. His current research interests include resource management of cloud systems.

Lixin Zhang is a Professor at the Institute of Computing Technology, Chinese Academy of Sciences. His main research areas include computer architecture, data center computing, high performance computing, advanced memory systems, and workload characterization. Dr. Zhang received his BS in Computer Science from Fudan University in 1993 and his PhD in Computer Science from the University of Utah in 2001. He was previously a Research Staff Member at IBM Austin Research Lab and a Master Inventor of IBM.