
Regularities Considered Harmful: Forcing Randomness to Memory Accesses to Reduce Row Buffer Conflicts for Multi-Core, Multi-Bank Systems

Heekwon Park*
Computer Science Department, University of Pittsburgh

Seungjae Baek*
Computer Science Department, University of Pittsburgh

Jongmoo Choi
Department of Computer Science, Dankook University

Donghee Lee
School of Computer Science, University of Seoul

Sam H. Noh
School of Computer & Information Engineering, Hongik University

Abstract

We propose a novel kernel-level memory allocator, called M3 (M-cube, Multi-core Multi-bank Memory allocator), that has the following two features. First, it introduces and makes use of the notion of a memory container, which is defined as a unit of memory that comprises the minimum number of page frames that can cover all the banks of the memory organization, and exclusively assigns a container to a core so that each core achieves as much bank parallelism as possible. Second, it orchestrates page frame allocation so that the pages that threads access are dispersed randomly across multiple banks, randomizing each thread's access pattern. The development of M3 is based on a tool that we develop to fully understand the architectural characteristics of the underlying memory organization. Using an extension of this tool, we observe that an application that accesses pages in a random manner outperforms the same application accessing pages in a regular pattern such as sequential or same-ordered accesses. This is because randomized accesses reduce inter-thread access interference on the row buffer in memory banks. We implement M3 in the Linux kernel version 2.6.32 on an Intel Xeon system that has 16 cores and 32GB DRAM. Performance evaluation with various workloads shows that M3 improves the overall performance for memory intensive benchmarks by up to 85% with an average of about 40%.

Categories and Subject Descriptors B.3.1 [Memory Structures]: Semiconductor Memories—Dynamic memory (DRAM); D.4.2 [Operating Systems]: Storage Management—Allocation/deallocation strategies; D.4.2 [Operating Systems]: Storage Management—Main memory; D.4.8 [Operating Systems]: Performance—Measurements

* This work was done when these authors were doctorate students at Dankook University in the Republic of Korea (South Korea).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASPLOS'13, March 16–20, 2013, Houston, Texas, USA.
Copyright © 2013 ACM 978-1-4503-1870-9/13/03...$15.00

General Terms Algorithms, Design, Experimentation, Measurement, Performance

Keywords Row-buffer conflict, Memory management, Randomized algorithm, Memory container, Analysis tool

1. Introduction

A recent distinct trend in computer architecture is that of multi-core support, as exemplified by the Intel Core i7 [3], AMD Opteron [1] and ARM Cortex-A15 [2]. In order to take advantage of the increased computation of multi-core chips, computer systems commonly come with a vast amount of main memory. These trends require us to revisit the internal policies and mechanisms of today's operating systems.

In this paper, we rethink the memory allocation issue in light of these architectural trends. In particular, we look into how memory can be allocated more efficiently by alleviating the row-buffer conflict problem from an operating system standpoint. The row-buffer conflict problem has been looked into primarily as an architecture problem [9, 31, 34]. To the best of our knowledge, this study is the first to present an operating system based solution.

Before presenting an outline of our solution, let us look at the source of the row-buffer conflict. A row-buffer is essentially a cache area in a memory bank [34]. A row-buffer conflict occurs when a core alternately accesses different page frames mapped to the same bank, or when several cores access the bank simultaneously. This leads to significant performance degradation [14, 20, 26, 28, 33].

Our operating system solution to the row-buffer conflict is a novel kernel-level memory allocator, which we refer to as M3 (M-cube, Multi-core Multi-bank Memory allocator), that takes two distinct approaches. The first is to allocate page frames to a core from as many different banks as possible. To this end, we introduce the notion of a memory container, which, for now, can be thought of as a unit of memory formed from a collection of distinct, non-overlapping pages. Each core is assigned a memory container such that when a core requests memory, the request is serviced from within the assigned memory container.


[Figure 1. Memory organization consisting of multiple channels, ranks, and banks. (a) Conceptual memory organization: 3 channels, 4 ranks per channel, and 8 banks per rank; each bank consists of 32K rows and a 4KB row buffer, giving a bank size of 128MB (32K entries × 4KB row size). (b) Physical memory module: a DIMM of 8 chips (Chip 0 through Chip 7) connected to the memory controller through the Addr/Cmd and Data buses; the chips of a DIMM comprise a rank, and each bank has a row buffer on which the precharge (close) and activate (open) operations are performed.]

The second approach is to randomize memory accesses. This may, at first, sound counter-intuitive, but we show that sequential memory accesses, or accesses that show regularities such as references with patterned intervals, tend to aggravate the row-buffer conflict. To alleviate such effects, we propose a memory allocation algorithm that randomizes the actual accesses at the memory bank level. Note that the memory references generated by the threads are not the ones being randomized; rather, randomization is forced only at the memory bank level. We devise a downward search and an individual page frame management scheme to implement such an algorithm in an efficient manner.

To understand the effect of regularities and randomness and to realize the notion of a memory container, a full understanding of the underlying memory organization is necessary. To this end, we develop an analysis tool that allows us to infer the relationships between page frames and the various components that make up the physical main memory.

We implement M3 in the Linux kernel version 2.6.32 on a system that consists of two Intel XEON x5570 processors with four cores on each processor (a total of 16 cores with the hyperthreading technology enabled), 32GB DDR3 main memory, and 3.6TB SAS disks. Using six benchmarks, we conduct various experiments that show that M3 can improve the total execution time by up to 85.2% with an average of 39.7% for memory-intensive benchmarks, and by up to 7.8% with an average of 1.3% for CPU or I/O intensive benchmarks. Also, we conduct experiments using the PARSEC benchmark that demonstrate the spectrum of performance improvements of M3 according to the characteristics of applications.

The contributions of our work can be summarized as follows:

• Randomness for the Better: We make the non-intuitive observation that in multi-core and multi-bank systems, a randomized¹ memory access pattern yields better performance than a regular pattern.
• M3 design: The notion of a memory container and a randomizing algorithm are devised so that the operating system can efficiently randomize page frame accesses.
• Memory organization analysis tool: A tool that exposes the mappings between page frames and memory components such as channels, ranks, and banks is presented.
• Real implementation: Performance gains of our proposal are quantitatively evaluated through real-implementation based experiments with various types of workloads.

¹ Henceforth, unless specifically mentioned, when we refer to randomizing or randomized memory accesses, this refers to memory accesses being randomized at the memory bank level and not at the thread level.

The rest of the paper is organized as follows. In the next section, we describe the architectural background and related work. The memory organization analysis tool is presented in Section 3, and the effects of access patterns on the row-buffer conflict are discussed in Section 4. The design and implementation details of M3 are elaborated in Section 5. In Section 6, we present the performance evaluation results. Finally, conclusions are presented in Section 7.

2. Background

In this section, we first describe the general structure of memory organizations adopted in modern PCs and server systems. Then, we discuss how the page frame allocation decision of operating systems affects memory parallelism. Finally, we survey previous research related to our work.

2.1 Memory Organization and Row-Buffer

Figure 1(a) depicts the conceptual structure of a memory organization. It consists of multiple channels, each of which is divided into multiple ranks. Each rank is further divided into multiple banks, and each bank consists of a set of rows and a row-buffer. For instance, in our experimental system, the IBM X3650 M2 system, there are 3 channels, 4 ranks per channel, and 8 banks per rank. Each bank has 32K rows, each of size 4KB, making the size of each bank 128MB.

In reality, the memory organization complexity goes even a step further, as shown in Figure 1(b) [11, 12]. A memory module such as a DIMM (Dual In-line Memory Module), which is directly mounted on a printed circuit board, consists of multiple memory chips, and the chips within a DIMM comprise a rank. A rank, in turn, contains multiple banks. In our experimental system, a rank is composed of 8 chips, similar to the one in Figure 1(b). Also, data stored in a bank of the rank is distributed across the 8 chips, which can be referenced in parallel. In other words, the aggregation of the bank 0s in the 8 chips of Figure 1(b) corresponds to bank 0 in Figure 1(a). Hereafter, all discussions are presented using the organization of Figure 1(a), since the issues tackled in this paper can be fully discussed using this conceptual structure.

[Figure 2. Row-buffer hit and conflict overhead: bank busy time (nsec) in DDR3-1600 for every combination of [previous request type : current request type : row-buffer state], where requests are writes (W) or reads (R) and the row-buffer state is a miss (M) or a hit (H).]

As previously mentioned, there is a row-buffer for each bank, the size of which is the same as a row. A row-buffer is a cache area used to exploit spatial locality. When a memory request is triggered by a core, the DRAM controller first checks the row-buffer of the bank related to the requested address. If the data is cached in the row-buffer, the request can be served immediately from the row buffer. This is called a row-buffer hit. Otherwise, upon a row-buffer miss, the DRAM controller performs the precharge and activate operations. The precharge operation writes back the data of the row-buffer into the original row. The activate operation loads the entire data from the row that contains the requested address into the row-buffer. After the two operations, the request is served from the row-buffer. This whole scenario is called a row-buffer conflict. Note that even though the request size from a core is a CPU cache line width (for instance, 64B), the DRAM controller loads the entire row into the row buffer (for instance, 4KB), anticipating that the next request will be serviced from the same row.

Row-buffer conflicts eliminate the caching effect of the row-buffer and incur significant precharging and activating overhead, resulting in substantial delays and energy consumption. To assess the overhead more quantitatively, we plot Figure 2 using raw data borrowed from previous papers [9, 13]. Note that the bank busy time, which is defined as the time spent by a bank to service the memory request issued to that bank, depends not only on the row-buffer state, but also on the previous/current request types. For example, [W:W:H] takes 5ns, while [W:R:H] and [W:W:M] take 22.5 and 50ns, respectively. Although there are some variations, this figure consistently shows that the row-buffer conflict degrades memory access latency by 2 to 5 times. The eventual goal of our study is to reduce row-buffer conflicts through an operating system approach.
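To make the hit/miss mechanics concrete, the following toy model (our own sketch, not from the paper; the struct and function names are illustrative) tracks which row a bank holds open and charges the two cited back-to-back-write latencies from Figure 2. The remaining latency combinations would need the full measured table.

#include <stdio.h>

/* Toy model of one bank's row buffer (Section 2.1). Only the cited
 * [W:W:H] = 5ns and [W:W:M] = 50ns values are from the paper. */
struct bank { int open_row; };          /* row currently in the row buffer */

enum { HIT_NS = 5, MISS_NS = 50 };      /* back-to-back write latencies */

/* Service a write to `row`; return the bank busy time in ns. */
static int access_row(struct bank *b, int row)
{
    if (b->open_row == row)
        return HIT_NS;                  /* row-buffer hit */
    /* Conflict: precharge writes the open row back, activate loads the
     * requested 4KB row into the row buffer, then the request is served. */
    b->open_row = row;
    return MISS_NS;
}

int main(void)
{
    struct bank b = { .open_row = 0 };
    int total = 0;
    /* Alternating between two rows of the same bank misses every time. */
    for (int i = 0; i < 10; i++)
        total += access_row(&b, i % 2 ? 7 : 3);
    printf("alternating rows, busy time: %d ns\n", total);  /* 10 misses: 500 ns */
    return 0;
}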

2.2 Operating System Challenges

For main memory management, traditional operating systems use two abstractions, virtual memory and physical memory. As NUMA (Non-Uniform Memory Access) architectures such as Intel's Nehalem QPI (QuickPath Interconnect) [3] and AMD's hypertransport technology [1] gain attention, some operating systems are incorporating the local and remote memory concepts into their abstractions. For instance, the recent Linux kernel employs the notion of a node, which is simply a range of physical memory considered local, and it tries to allocate page frames to a core from a local node, to some extent taking into account the physical memory layout in its allocation decision [7, 10].

[Figure 3. Effects of the page frame allocation decision on the row-buffer conflict. A thread alternates between two virtual pages, which the memory allocator maps to physical memory behind the memory controller. (a) Case 1: the page frames are allocated from different banks (Bank 0 and Bank 1), so no conflict occurs. (b) Case 2: the page frames are allocated from the same bank, so the first and second accesses conflict.]

We argue, through this paper, that to decrease row-buffer conflicts and to make full use of the memory parallelism supported by recent memory organizations, the allocator must understand and take into account the physical memory organization when making its allocation decisions. As an example, consider Figure 3, where we assume that there is an application that alternately accesses two virtual pages. Allocating page frames for this application could be done in one of two ways: one, by allocating the pages from different banks, as depicted in Figure 3(a) as Case 1, and the other, by allocating from the same bank, as depicted in Figure 3(b) as Case 2. Then, naturally, Case 2 is expected to cause more row-buffer conflicts than Case 1. In this manner, though both cases are allocating for the same virtual pages, understanding the physical memory organization may allow the allocator to allocate in such a way that row-buffer conflicts are alleviated.

Note that the mapping between virtual pages and physical page frames is governed by the kernel-level memory allocator, while the mapping between page frames and banks is managed by the memory controller. Developing a row-buffer conflict aware memory allocator is the main objective of this paper. This is discussed in detail in Sections 4 and 5. Understanding the characteristics of the memory organization is an essential prerequisite for our allocator design, and this is discussed in Section 3.

2.3 Related Work

A large body of research has been conducted with the aim of increasing the effect of the row-buffer and enhancing memory parallelism. One approach is based on memory partitioning at the channel level [19], rank level [4, 35], and bank level [14]. More specifically, Jeong et al. propose to partition the internal memory banks between cores to isolate their access streams and eliminate locality interference [14]. Muralidhara et al. suggest application-aware memory channel partitioning to decrease inter-application interference in the memory system [19].

Another approach is using memory request scheduling. Rixner et al. design the FR-FCFS (First Ready, First Come First Serve) scheduling algorithm that prioritizes memory requests that hit in the row-buffers of DRAM banks over other requests to exploit the locality of the row-buffer [25, 26]. Mutlu and Moscibroda present the STFM (Stall-Time Fair Memory) scheduler to equalize the DRAM-related slowdown experienced by each thread due to interference from other threads sharing the DRAM memory system [20]. Kaseridis et al. describe the complex structure of memory organization in many-core systems and propose a fair memory hashing scheme to control the maximum number of row-buffer hits [15].

Other notable approaches have also been studied. Zhang et al. propose a permutation-based page interleaving scheme to mitigate row-buffer conflicts by exploiting data access locality in the row-buffer [34]. Yoon et al. design an adaptive-granularity memory system that effectively combines fine-grained and coarse-grained memory accesses to improve performance and energy savings [33]. They also suggest an OS support that augments the virtual memory interface to allow software to specify the preferred access granularity for each page.

Sudan et al. introduce a new data placement scheme based on micro-pages, where chunks from different pages are collocated in a row-buffer to improve the overall utilization of the row buffer contents, consequently reducing memory energy consumption and access time [28]. Lee et al. propose mechanisms to maximize DRAM bank-level parallelism by arranging memory requests to be serviced in parallel in different DRAM banks [17]. Kim et al. design TCM (Thread Cluster Memory) scheduling that employs multiple different memory scheduling policies for different threads based on the threads' memory access and interference characteristics [16].

Our approach differs from all previous studies in that it is a purely OS-level approach requiring no hardware modifications or special support. Though Jeong et al. propose an approach that may be realized at the operating system level, their study is limited to a simulator and is not conducted on a real system [14]. More importantly, however, to the best of our knowledge, our paper is the first to apply randomness in the page frame allocation decision to reduce row-buffer conflicts.

[Figure 4. Analysis results: elapsed time measured at each iteration, plotted at four zoom levels (a) through (d), from the finest-grained view in (a) to the full 0 to 2,000,000 iteration range in (d).]

3. Memory Organization Analysis

In this section, we first explain the pseudo code for our memory organization analysis tool. Then, we discuss the analysis results and elaborate on the mappings between page frames and the various DRAM components such as the channels, ranks, and banks.

3.1 Analysis Algorithm

To reduce row-buffer conflicts from the operating system level, we need to understand how page frames are mapped onto the memory organization. However, such information is not readily available. Hence, we develop an analysis tool that can help explore the detailed internal structure of the memory organization. While designing the tool, we consider its portability so that it may be used not only for our experimental system, but with other systems as well.

Algorithm 1: Pseudo code of analysis tool

 1: procedure ANALYZER_CORE(start_of_mem, end_of_mem)
 2:   *CL_1 ← first cache line from start_of_mem
 3:   *CL_2 ← second cache line from start_of_mem
 4:   CACHE_CONTROL(start_of_mem, end_of_mem, Uncached)
        ▷ set uncached mode from start_of_mem to end_of_mem
 5:   repeat                      ▷ iterate through all test memory area
 6:     i ← 0
 7:     Begin_time ← STOPWATCH()
 8:     while i < 500 do          ▷ repeat a sufficient number of times
 9:       *CL_1 ← *CL_2 + i
10:       *CL_2 ← *CL_1 + i
            ▷ mutual dependence prevents out-of-order execution at the
            ▷ CPU and memory controller
11:       i ← i + 1
12:     end while
13:     End_time ← STOPWATCH()
14:     RECORD_TIME(End_time − Begin_time)
15:     CL_2 ← next cache line
16:   until CL_2 > end_of_mem
17: end procedure

Algorithm 1 presents the pseudo code used by the tool. Basically, it repeatedly accesses two variables, namely CL_1 and CL_2. The first variable has a fixed physical address (the first available physical address of node 1, which is 16GB in size in this experiment), while the second has a non-fixed address that is incremented by the cache line size (64B) at each iteration. The elapsed time of each iteration is measured, as shown in lines 7 and 13. The elapsed time is used to infer the memory organization and, in particular, the row-buffer conflict.

In addition, we employ the following three techniques to enhance the accuracy of the analysis. First, we set the CPU cache mode as uncacheable, as expressed in line 4, to guarantee that each access is serviced at the memory module. This setting is done only for the memory area from start_of_mem to end_of_mem, while all other variables and instructions remain cacheable, so that these other factors have minimal effect on the execution of the algorithm. Second, at each iteration, we access the two variables numerous times to cancel out measurement noise such as sudden bus contention and inter-command delay. Specifically, for our experiments, we access them 500 times, as shown in line 8. Third, we make the two variables mutually dependent, as denoted in lines 9 and 10. This avoids various optimizations, such as out-of-order execution in the core or request scheduling in the memory controller, so that the instructions in the code are executed as we want them executed.

[Figure 5. Relationship between physical memory and memory organization. (a) Physical memory perspective: page frames PFN 0, 1, 2, ... (b) Memory organization perspective, with locations denoted [channel# : bank# : rank#]: channel interleaving after each cache line (64B), bank interleaving after 3 pages (12KB), and rank interleaving after 768 pages (3MB), over 3 channels, 8 banks, and 4 ranks (Rank 0 through Rank 3).]
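As a concrete illustration, here is a minimal user-level C sketch of the measurement loop of Algorithm 1. It assumes the probed region has already been mapped uncacheable (the CACHE_CONTROL step, done elsewhere, e.g., via PAT/MTRR settings) and uses rdtsc as the stopwatch on x86; all names are illustrative, not the paper's actual code.

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64          /* bytes per cache line */
#define REPEAT     500         /* repetitions to cancel measurement noise */

static inline uint64_t stopwatch(void)         /* cycle counter (x86) */
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence; rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Assumes [start, end) is already mapped uncacheable so that every
 * access reaches the DRAM module. */
void analyzer_core(volatile uint64_t *start, volatile uint64_t *end)
{
    volatile uint64_t *cl1 = start;                       /* fixed line  */
    volatile uint64_t *cl2 = start + CACHE_LINE / 8;      /* moving line */

    while (cl2 < end) {
        uint64_t begin = stopwatch();
        for (uint64_t i = 0; i < REPEAT; i++) {
            *cl1 = *cl2 + i;   /* mutually dependent accesses defeat      */
            *cl2 = *cl1 + i;   /* out-of-order and controller reordering  */
        }
        printf("%llu\n", (unsigned long long)(stopwatch() - begin));
        cl2 += CACHE_LINE / 8;            /* advance by one cache line */
    }
}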

3.2 Analysis Results

Figure 4 shows the analysis results. In the figure, the x-axis is the iteration number of the repeat statement, while the y-axis is the measured elapsed time. Note that the physical address of the second variable, that is, CL_2, is incremented by the cache line size of 64 bytes. Hence, the x-axis also represents the sequence of physical memory addresses. That is, the ith iteration corresponds to the physical memory address of "the address of the second cache line + 64 × i".

Figure 4 depicts four graphs, each depicting, from bottom to top, smaller and smaller ranges. Specifically, Figure 4(d), the bottom-most figure, ranges from 0 to 2,000,000 (roughly 128MB in size), and Figure 4(c) zooms in on the 590,000-640,000 portion of Figure 4(d). Again, Figure 4(b) shows a portion of the iterations of Figure 4(c), while Figure 4(a) shows that of Figure 4(b). In other words, we zoom in on the results from Figure 4(d) to Figure 4(a) to examine them from a finer-grained viewpoint.

Now let us first discuss Figure 4(a). This figure reveals that the measured elapsed time has a distinct pattern for every 3 addresses. Specifically, the first address shows a higher elapsed time than the latter two addresses due to the row-buffer conflict. It implies that at the first address, CL_1 and CL_2 are located in the same bank, while at the second and third addresses, they are in different banks. Note that our experimental system has 3 channels. Hence, we can infer that channel interleaving is performed at the cache line unit.

Figure 4(b) shows that the pattern observed in Figure 4(a) occurs continuously 64 times. When we convert this number into memory size, we get 12KB from a calculation of 64B × 3 × 64. Since the row size is 4KB in our system [18], we can infer that three consecutive rows are managed in the same manner.

Now let us turn the discussion to Figure 4(c). This figure shows that the 12KB pattern described in Figure 4(b) reoccurs at 96KB memory size intervals. Note that our system has 8 banks, and as 12×8=96, we can deduce that bank interleaving is configured in three-row units. In addition, Figure 4(c) shows that the 96KB pattern is repeated 32 times within a memory size of 3MB (96KB×32).

Finally, Figure 4(d) shows that the 3MB pattern of Figure 4(c) recurs at 12MB intervals. Since our system has 4 ranks, we can deduce that rank interleaving is performed in 3MB units. The 12MB pattern is repeated until the end of memory is reached.


[Figure 6. Row-buffer conflicts with 4 cores: sequential access pattern. Average elapsed time (usec) over the 4 cores, fluctuating between roughly 1800 and 2400, versus iteration number (0 to 200,000).]

3.3 Implication

From the analysis results discussed in Section 3.2, we can figure out how page frames are mapped into rows, as illustrated in Figure 5. The observation of Figure 4(a) indicates that page frame 0 is distributed across three banks, namely [0:0:0], [1:0:0] and [2:0:0], where the notation refers to [channel number : bank number : rank number]. In other words, channel interleaving is done at the cache line unit. Also, from Figure 4(b), we can conjecture that page frames 1 and 2 are mapped in the same way as page frame 0.

The bank interleaving discussed in Figure 4(c) suggests that page frames 3, 4, 5 are mapped into banks [0:1:0], [1:1:0], [2:1:0], page frames 6, 7, 8 into banks [0:2:0], [1:2:0], [2:2:0], and so on, until page frames 21, 22, and 23 are mapped to bank 7. Then, page frames 24, 25, 26 wrap back to bank 0 and are mapped again into banks [0:0:0], [1:0:0], [2:0:0] in round-robin fashion. This round-robin is repeated 32 times, as discussed in Figure 4(c). After 32 repetitions, rank interleaving is utilized. So, page frames 768, 769, 770 are mapped into banks [0:0:1], [1:0:1], [2:0:1]. These arrangements continue up to page frame 3071, which is mapped into banks [0:7:3], [1:7:3], [2:7:3], the last bank of each channel. The page frames that are mapped into every bank up to this point make up the 12MB interval observed in Figure 4(d). The next page frame, 3072, is then again mapped into banks [0:0:0], [1:0:0], [2:0:0], starting a new 12MB interval.

Figure 5 makes it possible to discuss the results observed in Figure 4 more clearly. Assume that the CL_1 variable is located in page frame 0, while CL_2 is in page frame 24. Then, execution of the code generates the pattern observed in Figure 4(a). On the other hand, when CL_2 is in page frame 27 or 768, the code does not cause a row-buffer conflict, leading to the lower elapsed times ranging from 1150 to 1200.

The findings from the analysis can be summarized as follows. First, channel interleaving is conducted at the CPU cache line unit, not the page frame unit. Of course, one can change this configuration so that the page frame is used as the channel interleaving unit. However, modern computer systems prefer cache line interleaving, since it can obtain memory parallelism even within a single page frame. Second, some consecutive page frames are mapped into the same row; 3 consecutive frames in our system. Put another way, a single page frame is distributed across several rows, 3 in our particular case, leading to higher parallelism and thus better performance. Third, rank interleaving is performed in a relatively large unit; 3MB in our system. Finally, a memory area of 12MB can cover all banks with a minimum number of page frames in our system. These findings are used as guidelines when designing the memory container, as discussed in the next section.

[Figure 7. Row-buffer conflicts with 4 cores: random access pattern. Average elapsed time (usec) versus iteration number, on the same axes as Figure 6.]
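The inferred layout can be captured in a few lines of arithmetic. The sketch below (our own illustration, with constants specific to this platform) maps a physical address to its [channel:bank:rank] location; note that the channel varies per 64B line even within one page frame, so the bank and rank are the per-frame invariants.

#include <stdint.h>
#include <stdio.h>

typedef struct { unsigned channel, bank, rank; } dram_loc;

/* Interleaving inferred in Section 3.3: 64B channel unit across 3
 * channels, 12KB (3-page) bank unit across 8 banks, 3MB rank unit
 * across 4 ranks. */
static dram_loc map_paddr(uint64_t paddr)
{
    dram_loc loc;
    loc.channel = (unsigned)((paddr / 64) % 3);            /* cache-line unit */
    loc.bank    = (unsigned)((paddr / (3 * 4096)) % 8);    /* 3-page unit     */
    loc.rank    = (unsigned)((paddr / (3ull << 20)) % 4);  /* 3MB unit        */
    return loc;
}

int main(void)
{
    /* Page frames 0, 3, 24, and 768 reproduce the mappings in the text:
     * [0:0:0], [0:1:0], [0:0:0] (wrap), and [0:0:1] (rank step). */
    uint64_t pfns[] = { 0, 3, 24, 768 };
    for (int i = 0; i < 4; i++) {
        dram_loc l = map_paddr(pfns[i] * 4096);
        printf("PFN %4llu -> [%u:%u:%u]\n",
               (unsigned long long)pfns[i], l.channel, l.bank, l.rank);
    }
    return 0;
}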

4. Regularity Considered Harmful

As the number of cores increases, the row-buffer conflict problem becomes more serious. In a single core system, a row-buffer conflict occurs only when an application accesses two different page frames located in the same bank at the same time. However, in a multiple core system, the conflict also happens when two or more cores access such page frames concurrently. This means that row-buffer aware page frame allocation becomes even more important in a multi-core system.

To evaluate the behavior of the row-buffer under multiple cores, we modify the code presented in Algorithm 1. Specifically, we make the following two changes. First, we set the two variables so that they are located in the same cache line. Then, their addresses are incremented by the cache line size at each iteration. With this change, a row-buffer conflict will not occur should the code be executed by itself on a single core. The other change is that we now run the modified code concurrently on 4 cores, with each instance of the code having a different starting physical address, incremented by 12MB from the previous core. For instance, if core 0 has a start address of X, then core 1 has a start address of X + 12MB, core 2, X + 24MB, and so on. Recall from Figure 5 that 12MB is the memory range that covers all banks from [0:0:0] to [2:7:3]. Consequently, all four threads start accessing memory from the same bank, and the memory accesses also progress in the same order.

Figure 6 presents the experimental results. The x-axis is the iteration number, while the y-axis is the average of the elapsed times measured for the 4 cores. The results show that row-buffer conflicts occur in a large number of iterations. When all 4 cores are involved in a conflict, the elapsed time becomes the highest.

Figure 6 also shows that some ranges of iterations result in higher conflicts than other ranges. For instance, conflicts are rare in the 70,000 to 90,000 range, while most of the iterations in the 90,000 to 110,000 range result in high conflicts. Overall, we see a fluctuation in the intensity of conflicts. This phenomenon, however, is counter-intuitive. To elaborate, the threads executing on the four cores start referencing from the same bank, which should result in a row-buffer conflict, and since the threads execute at the same rate of progress, the conflicts should, with high probability, be occurring over and over. Hence, we should be seeing an almost flat graph, not the fluctuating graph shown in Figure 6.

The explanation for this fluctuation is the interference induced by external events such as interrupt handling and scheduling. These external events trigger delays at the cores handling them, disrupting the progress of the threads. This leads to asynchrony among the cores, relieving them of conflicts for some time period. However, as these events continue to occur, the timing among the cores becomes synchronized again, leading to conflicts. Hence, this pattern of synchrony and asynchrony repeats itself as the execution continues.

One feasible solution to avoid conflicts among cores is memory partitioning. For instance, from Figure 5, we can partition memory vertically (banks 0 and 1 to core 0, banks 2 and 3 to core 1, and so on) or horizontally (rank 0 to core 0, rank 1 to core 1, and so on). The essential benefit of this partitioning is that it can completely remove conflicts among cores.

However, there are problems in applying memory partitioning to operating systems. First, it reduces memory parallelism such as bank interleaving or rank interleaving. Second, it raises a scalability issue as the number of cores increases. Third, it brings up an availability issue, since partitioning cannot support the large runs of consecutive page frames that may be requested by the kernel, say, for example, by a device driver. For instance, if we partitioned banks 0 and 1 to core 0 on the experimental platform that we use, we would not be able to service requests demanding more than 6 consecutive frames, as each bank retains only 3 successive pages. Finally, when an application is migrated from one core to another for CPU load balancing, which is frequently done in modern operating systems, conflicts among the cores will again occur. In this case, conflicts can be more serious, as the total number of banks accessible by a core has now been reduced via partitioning.

Let us now see how random memory access patterns affect row-buffer conflicts. For this purpose, we again modify the code used for the previous experiment such that a random number generator is used to make sure that each core has a different access pattern and that each address is accessed no more than once. Note that in this case, we are randomizing memory references at the application level.

The experimental results using the randomly accessing code are presented in Figure 7. The results show that most iterations do not cause row-buffer conflicts and that conflicts where all 4 cores are involved are non-existent. The average elapsed time over all iterations of Figure 7 is 1925, while that of Figure 6 is 2052. This implies that random accesses mitigate conflicts better than sequential patterns.

Our sensitivity analysis reveals that the worse elapsed time of the sequential pattern is due to correlated conflicts, which we define as a set of successive conflicts. In the sequential pattern, once a reference iteration causes a conflict, the next reference iteration is likely to cause another conflict. In contrast, in the random pattern, since the addresses accessed by consecutive iterations are independent of each other, the probability of a next conflict is constant at 1/(total number of banks). Note that correlated conflicts occur not only in sequential patterns, but also in any pattern with regularities, such as accesses with the same stride or patterned intervals.
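A toy simulation (ours, not from the paper) makes the distinction visible: two threads stepping over 8 banks in lockstep, with occasional random slips standing in for interrupt-induced drift, suffer a comparable total number of conflicts to two randomly accessing threads, but clustered into long correlated runs rather than short independent ones.

#include <stdio.h>
#include <stdlib.h>

#define NBANKS 8
#define STEPS  1000000
#define SLIP   0.001     /* per-step chance that thread b stalls once */

static void tally(int conflict, long *conflicts, long *runs, long *run)
{
    if (conflict) { (*conflicts)++; (*run)++; }
    else if (*run) { (*runs)++; *run = 0; }
}

int main(void)
{
    long conflicts = 0, runs = 0, run = 0;
    int a = 0, b = 0;
    srand(1);

    /* Sequential pattern: both threads walk the banks in order; rare
     * slips shift their relative phase, as external events do. */
    for (long t = 0; t < STEPS; t++) {
        a = (a + 1) % NBANKS;
        if ((double)rand() / RAND_MAX >= SLIP) b = (b + 1) % NBANKS;
        tally(a == b, &conflicts, &runs, &run);
    }
    printf("sequential: %ld conflicts, avg run %.1f\n",
           conflicts, runs ? (double)conflicts / runs : (double)conflicts);

    conflicts = runs = run = 0;
    /* Random pattern: each step hits an independently chosen bank, so a
     * conflict recurs with constant probability 1/NBANKS. */
    for (long t = 0; t < STEPS; t++) {
        a = rand() % NBANKS;
        b = rand() % NBANKS;
        tally(a == b, &conflicts, &runs, &run);
    }
    printf("random:     %ld conflicts, avg run %.1f\n",
           conflicts, runs ? (double)conflicts / runs : (double)conflicts);
    return 0;
}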

This observation is also applicable to the page frame allocation pattern in operating systems. An operating system can govern the allocation pattern to make it regular or random, and we expect that the former will cause a larger number of correlated conflicts than the latter.

5. M3 Design and Implementation

The observations discussed in the previous sections inspire us to develop a novel kernel-level memory allocator called M3 (M-cube, Multi-core Multi-bank Memory allocator). In this section, we elaborate on the two main features of M3, specifically the memory container and the randomized algorithm.

5.1 Memory Container

To reduce row-buffer conflicts, the page frames allocated to each individual thread should come from distinct banks as much as possible. For this purpose, we introduce the notion of a memory container, which is defined as a unit of memory that comprises the minimum number of page frames that can cover all the banks of the memory organization. (Recall that we were able to find this using the analysis tool described in Section 3, and that this size is 12MB for our experimental platform.)

M3, then, divides the entire memory into multiple memory containers, with each memory container exclusively assigned to a core. When an application requests a page frame, M3 serves the request from the memory container of the core that is executing the application. Requests for page frames from other applications running on different cores are allocated from the memory containers assigned to those cores. Deallocation is also performed within the memory container.

Note that the notion of a memory container combined with random allocation will tend to disperse page frame allocations among the banks that comprise the memory container. This is because when a page is allocated from a particular bank, the probability of allocating a page from that same bank decreases on the next allocation attempt. The probability decreases because the number of pages belonging to a particular bank is constant within a single memory container, and with random allocation, each free page has an equal probability of being allocated. Hence, once a page is allocated from a particular bank, the probability of that same bank being selected on the next allocation attempt is reduced by the probability of a single page being selected.

When a core uses up all the page frames in its assigned memory container, M3 assigns a new available memory container to the core. When allocated page frames are freed, they are released to the memory container that owns them. If the number of available memory containers becomes insufficient, the swapping mechanism is triggered to free page frames so that more memory containers can be generated. In our current implementation, the swapping mechanism provided in Linux is used as is. Hence, the pages reclaimed may be parts of several different memory containers. It may be possible to modify the swap mechanism so that pages belonging to the same memory container are reclaimed together, for better management of memory containers. This, however, is left for future work.

As mentioned previously, the ideal memory container size, by definition, is 12MB for our experimental system, since 12MB is the size that covers all banks with the minimum number of page frames, as already discussed in Figure 5. In the experimental study conducted in Section 6, however, we set the size of the memory container to 4MB. This choice was made to make M3 as generic as possible, so that it can be deployed in a wide range of existing systems. Hence, there are roughly 8000 memory containers in our system.
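A schematic of the container bookkeeping follows; it is a hedged sketch with hypothetical names and the swap path omitted (the real M3 sits inside the Linux page allocator), and the random pick at the end stands in for the randomized freelist of Section 5.2.

#include <stdlib.h>

#define NCORES          16
#define PAGES_PER_CONT  1024    /* 4MB container / 4KB pages */

/* One memory container: a run of page frames covering the banks. */
struct container {
    struct container *next;        /* link in the pool of available ones */
    unsigned long     base_pfn;    /* first page frame of the container  */
    unsigned long     free_pages;  /* pages still allocatable            */
};

static struct container *pool;            /* available containers        */
static struct container *owned[NCORES];   /* container assigned per core */

/* Allocate one page frame for the thread running on `core`. */
unsigned long m3_alloc_pfn(int core)
{
    struct container *c = owned[core];
    if (!c || c->free_pages == 0) {   /* exhausted: take a fresh container */
        c = pool;                     /* (if the pool is empty, swapping   */
        pool = c->next;               /*  would be triggered; omitted)     */
        owned[core] = c;
    }
    c->free_pages--;
    /* Stand-in for the randomized pick of Section 5.2: any free frame of
     * this core's container; a real freelist tracks exactly which. */
    return c->base_pfn + (unsigned long)rand() % PAGES_PER_CONT;
}

/* Freeing returns a frame to the container that owns it, found from the
 * pfn, since containers are fixed, aligned ranges. */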

5.2 Randomized Algorithm

Even though the memory container can reduce row-buffer conflicts within a single core, conflicts among multiple cores still exist. As discussed in Section 4, random allocation can have a positive effect on reducing conflicts among cores. To realize randomness, M3 replaces the buddy algorithm [8, 21, 22], which is utilized in various modern operating systems including the Linux kernel considered in this paper, with a randomizing algorithm that we describe below.

[Figure 8. Comparison between the buddy algorithm (a) and our randomized algorithm (b). Each panel walks an 8-page-frame freelist with bins 2^0 through 2^3 from an initial state (state 1) through a series of single page frame allocations and deallocations to a final state (state 7).]

Figure 8(a) presents the structure of the buddy algorithm and an operational example. In the figure, we assume that physical memory consists of 8 page frames. As shown in Figure 8(a), the buddy algorithm uses a set of doubly linked list data structures called the freelist. The freelist is constructed in multiple levels labeled (bin) 2^0, 2^1, ..., 2^i, ..., where, at level 2^i, 2^i consecutive page frames are treated as a single unit. We refer to this single unit as a buddy group. The buddy algorithm utilizes two key operations, splitting and coalescing. The former is used to satisfy a memory request as suitably as possible, while the latter is used to reconstruct a buddy group.

At the initial stage of our study, we considered using the buddy algorithm for managing each memory container. However, we detected its tendency to allocate page frames in a regular pattern. Let us discuss this issue using a walk-through example, as demonstrated through states 1 to 7 in Figure 8(a). State 1 shows the initial state, where all 8 page frames are free. Assume that five requests arrive, where each request wants a single page frame. Then, the buddy algorithm searches upward from the lowest level, recursively splitting if necessary. In this case, the allocation sequence of page frames becomes 0 → 1 → 2 → 3 → 4, and the resulting memory state is shown in state 4.

Now, assume that the allocated page frames are freed in order. Then, the buddy algorithm, again, coalesces recursively as necessary, leading to state 7. If five allocation requests are made again, the allocation sequence at this time again becomes 0 → 1 → 2 → 3 → 4. Indeed, the buddy algorithm allocates page frames in a regular manner. Even if the initial state is different, it shows partial regularities, since it searches upward from the lowest level, making the possibility of allocating a buddy page frame of a recently allocated one high. If we chose the buddy algorithm to manage each memory container, each core would have a tendency to access memory in a regular pattern, incurring more correlated conflicts, as discussed in Section 4.

One straightforward way to integrate randomness into the buddy algorithm is to use a random number generating function to select a candidate page frame for allocation. However, as such a function would need to exist independent of the buddy layer, it could become a source of considerable operational overhead, as page frame allocation is one of the most frequently invoked operations in the kernel. Furthermore, extra effort must be made to manage consecutive pages at the buddy layer as randomness is introduced. Finally, forced randomness will cause more splitting and coalescing operations to be invoked, resulting in significant overhead.

To overcome these problems, we introduce two new concepts into the buddy algorithm: one is individual page frame management and the other is downward search. To elaborate on our proposed randomized algorithm, we make use of Figure 8(b), which shows that we use the same data structure as the buddy algorithm. Downward search refers to the fact that, to find a candidate page frame for allocation, the randomized algorithm searches downward from the highest level, not upward as the buddy algorithm does. For example, upon a single page allocation request at initial state 1, page 7, which is at the highest level, would be allocated, resulting in state 2.

Let us now discuss individual page frame management. Note in Figure 8(b) that every page frame in the freelist is managed individually. This is in contrast to the buddy algorithm, where the freelist is managed per buddy group, which is composed of one or more page frames, as shown in Figure 8(a). Management of the freelist as single page frames tells us that each page may be allocated individually. However, the location of a page allows us to convey a double meaning. Specifically, each bin (2^0, 2^1, and so on, represented as solid line boxes in the freelist) of the freelist represents the same meaning as that of the conventional buddy algorithm; that is, it tells of the existence of consecutive multiple pages that can be allocated. For example, at state 2 in Figure 8(b), bin 2^2 with page 3 on its list indicates that 2^2 consecutive pages (pages 0 to 3) are allocatable, while bin 2^1 with pages 5 and 1 on its list indicates that two consecutive pages (pages 0, 1 and pages 4, 5) can be allocated together. Note that this meaning conveyed by the bin is in addition to the fact that each of the pages may be allocated as a single page frame.

Let us discuss the characteristics of the randomized algorithm using the examples presented in Figure 8(b). State 1 is the initial state, where all free page frames are managed individually. Assume that five requests for a single page frame arrive. The randomized algorithm searches downward and allocates page frames from the highest level. Hence, the allocation sequence becomes 7 → 3 → 5 → 1 → 6, where state 4 is the memory state after three allocations.

Now, assume that the allocated page frames are freed in order. Then, the randomized algorithm puts each released page frame at the lowest level and ascends it to a higher level if its buddy is free. The ascending operation works just like the coalescing operation, except that it manages each frame individually, not merging them into a buddy group. State 7 shows the final state after serving the five deallocation requests. Note that even if the memory state is the same (page frames 0 to 7 are all free), the data structures shown in states 1 and 7 are different. From state 7, if the same five requests arrive again, the allocation sequence this time becomes 6 → 1 → 5 → 3 → 7, which is not the same as the sequence from state 1. Note also that the freeing sequence becomes a seed for a different allocation sequence. For example, if the freeing sequence were 6 → 7 → 1 → 3 → 5 for the above example, the subsequent allocation sequence would become 6 → 3 → 1 → 5 → 7, which also differs from the previous allocation sequences.

The walk-through example shows the merits of downward search and individual management. Downward search enables the allocated page frames to be dispersed in a random fashion. Furthermore, it allows the allocation order to be determined not only by the programming logic but also by the deallocation sequence, increasing randomness even further. Individual management makes it possible to efficiently integrate randomness into the buddy algorithm without paying additional overhead for splitting and coalescing. Note that in the randomized algorithm, the coalescing/splitting operation is still carried out when an allocation/deallocation of multiple page frames is requested.
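The following self-contained sketch (ours; the kernel implementation differs in its data structures) reproduces the behavior of Figure 8(b): single frames live individually in bins, allocation searches downward from the highest bin, and freeing re-ascends a frame while its buddy block is free. Run on the initial state of Figure 8(b), it prints the paper's two sequences, 7 3 5 1 6 and then 6 1 5 3 7.

#include <stdio.h>

#define LEVELS  4                       /* bins 2^0 .. 2^3 for 8 frames */
#define NFRAMES 8

static int freelist[LEVELS][NFRAMES];   /* each entry: one page frame */
static int count[LEVELS];

static int is_free(int pfn)             /* frame present in any bin? */
{
    for (int lv = 0; lv < LEVELS; lv++)
        for (int i = 0; i < count[lv]; i++)
            if (freelist[lv][i] == pfn) return 1;
    return 0;
}

static int buddy_block_free(int pfn, int lv)  /* buddy of pfn's 2^lv block */
{
    int size = 1 << lv;
    int buddy = (pfn & ~(size - 1)) ^ size;
    for (int i = 0; i < size; i++)
        if (!is_free(buddy + i)) return 0;
    return 1;
}

/* Downward search: take a frame from the highest non-empty bin. */
static int alloc_one(void)
{
    for (int lv = LEVELS - 1; lv >= 0; lv--)
        if (count[lv] > 0) return freelist[lv][--count[lv]];
    return -1;
}

/* Free at the lowest bin, then ascend while the buddy block is free
 * (coalescing bookkeeping only; frames stay individually allocatable). */
static void free_one(int pfn)
{
    int lv = 0;
    while (lv + 1 < LEVELS && buddy_block_free(pfn, lv)) lv++;
    freelist[lv][count[lv]++] = pfn;
}

int main(void)
{
    /* Initial state of Figure 8(b): 0,2,4,6 in bin 2^0; 1,5 in 2^1;
     * 3 in 2^2; 7 in 2^3. */
    int lv0[] = { 0, 2, 4, 6 }, lv1[] = { 1, 5 };
    for (int i = 0; i < 4; i++) freelist[0][count[0]++] = lv0[i];
    for (int i = 0; i < 2; i++) freelist[1][count[1]++] = lv1[i];
    freelist[2][count[2]++] = 3;
    freelist[3][count[3]++] = 7;

    int got[5];
    for (int i = 0; i < 5; i++) printf("%d ", got[i] = alloc_one());
    printf(" <- first allocation order\n");
    for (int i = 0; i < 5; i++) free_one(got[i]);   /* free in order */
    for (int i = 0; i < 5; i++) printf("%d ", alloc_one());
    printf(" <- after freeing, the order differs\n");
    return 0;
}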

6. Performance Evaluation

In this section, we first describe the experimental setup and the seven benchmarks that represent a variety of workload characteristics. Then, we present the overall performance results and discuss how those characteristics affect the performance of M3.

6.1 Experiment Environment

Our experimental system is an IBM x3650 M2 server system that consists of two Intel XEON x5570 quad core processors (when we turn on the hyperthreading capability, each processor can have at most 8 cores), 32GB DDR3 main memory, and eight 2.5 inch SAS disks of 450GB capacity each. On this hardware platform, we implement the M3 allocator on the Linux kernel version 2.6.32.

To evaluate the performance of M3, we use the following seven benchmarks, which can be categorized into three groups. The first group consists of memory intensive benchmarks, specifically Stream, Sysbench-memory and Ramspeed. The Stream benchmark is a synthetic benchmark used for measuring sustainable memory bandwidth in high performance systems [27]. The Sysbench-memory benchmark is a memory allocation and transfer speed benchmark [29]. The Ramspeed benchmark is a utility to measure cache and memory performance of computer systems [24].

The second group consists of CPU or I/O intensive benchmarks, which include Kernel Compile, Dbench and Unixbench. Kernel Compile builds a new kernel image by compiling the Linux kernel version 2.6.32 and linking the compiled objects. The Dbench benchmark is a file system benchmark that generates a workload similar to the commercial Netbench benchmark [5]. The Unixbench benchmark is a general-purpose benchmark designed to evaluate the performance of a Unix-like system, integrating CPU and file I/O tests as well as system behavior under various user loads [30].

The final group is the PARSEC benchmark suite, which was designed to represent diverse application domains [6]. It is composed of 12 programs, whose characteristics are summarized in Table 1 [6, 23, 32]. In terms of intensity, canneal, facesim, fluidanimate and streamcluster are categorized as memory intensive, while the others are CPU or I/O intensive.

Table 1. Characteristics of programs in PARSEC

  Program        Application Domain      Working Set
  blackscholes   Financial analysis      Small
  bodytrack      Computer vision         Medium
  canneal        Engineering             Unbounded
  dedup          Enterprise storage      Unbounded
  facesim        Computer animation      Large
  ferret         Similarity search       Unbounded
  fluidanimate   Computer animation      Large
  freqmine       Data mining             Unbounded
  streamcluster  Machine learning        Medium
  swaptions      Computational finance   Medium
  vips           Media processing        Medium
  x264           Media processing        Medium

[Figure 9. Performance impact of different workload characteristics, as % performance improvement with 1 to 32 threads (T1-T32). (a) Memory intensive benchmark results: RAMspeed, STREAM, Sysbench-Memory. (b) CPU or I/O intensive benchmark results: Kernel, Unixbench, Dbench.]

6.2 Experimental Results

Figure 9(a) shows the performance improvements owing to M3 for the three memory intensive benchmarks. To evaluate multi-core effects, we run the benchmarks with various numbers of threads. Depending on the configuration, performance is improved by as little as 6.5% to as much as 85.2% compared to the original Linux kernel that uses the buddy algorithm. Specifically, the average improvements are 26.6%, 50.4%, and 42.1% for the Stream, Sysbench-memory and Ramspeed benchmarks, respectively.

The performance improvement is obtained from two sources.The first is by applying the randomized algorithm. As already dis-

189

Page 10: Regularities considered harmful: forcing randomness to ...csl.skku.edu/uploads/SDE5007M16/park13.pdf · Regularities Considered Harmful: Forcing Randomness to Memory Accesses to Reduce

[Figure 10. PARSEC benchmark results: percentage performance improvement (y-axis from -5% to 25%) for blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, streamcluster, swaptions, vips, and x264, each run with 1, 2, 4, 8, 16, and 32 threads.]

As already discussed in Figures 7 and 8, random access patterns have a positive effect on performance. The second source is the memory container. Without the memory container, page frames would be allocated to cores in an interleaved manner, failing to achieve memory parallelism within a core while still generating references with regularities.
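The effect of interleaving can be seen with a toy model. Assuming, purely for illustration, that the bank index is the page frame number modulo the number of banks, the snippet below prints the bank sequence each core observes under one shared interleaved stream versus under per-core containers; the geometry and the mapping formula are assumptions, not the measured mapping of our platform.

```c
/* Toy model (assumed geometry: bank = pfn % BANKS) contrasting a
 * shared interleaved allocation stream with per-core containers.
 * With 4 cores striding through one stream, each core touches only
 * 2 of the 8 banks, and every core repeats the same regular pattern;
 * a container gives each core all 8 banks to itself.  M3 further
 * randomizes the allocation order within a container (not shown). */
#include <stdio.h>

#define BANKS  8
#define CORES  4
#define ALLOCS 8

int main(void)
{
    printf("interleaved (one shared stream):\n");
    for (int c = 0; c < CORES; c++) {
        printf("  core %d banks:", c);
        for (int i = 0; i < ALLOCS; i++) {
            unsigned long pfn = (unsigned long)(c + i * CORES);
            printf(" %lu", pfn % BANKS);    /* e.g. core 0: 0 4 0 4 */
        }
        printf("\n");
    }

    printf("per-core containers:\n");
    for (int c = 0; c < CORES; c++) {
        unsigned long base = (unsigned long)c * BANKS; /* container */
        printf("  core %d banks:", c);
        for (int i = 0; i < ALLOCS; i++)
            printf(" %lu", (base + i) % BANKS); /* covers all banks */
        printf("\n");
    }
    return 0;
}
```

In this model, interleaving leaves each core cycling over only two of the eight banks in a fixed order, while a container restores full per-core bank coverage; M3's randomized allocation then removes the remaining regularity within each container.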

Experimental results for the three CPU and I/O intensive benchmarks are shown in Figure 9(b). As expected, the improvements are not as considerable, since memory references account for only a small portion of the overall performance of these benchmarks. However, from these results we can deduce that memory container management, individual page frame management, and downward search work efficiently without incurring noticeable overhead.

Figure 10 presents the PARSEC benchmark results. The performance improvements differ according to the characteristics of the programs. For programs that intensively access a large memory range, such as canneal, facesim, fluidanimate, and streamcluster, the improvements are up to 21.7% with an average of 10.4%. The results of the other programs can be classified into two groups. The first group, which includes blackscholes, freqmine, swaptions, and x264, does not show noticeable performance differences, since these programs already have irregular access patterns at the application level [28]. The second group, comprising the remaining programs, shows improvements of up to 8.2% with an average of 2.2%.

While analyzing the results, we found an interesting comparison between our results and those presented by Sudan et al. [28]. In their work, Sudan et al. proposed micro-pages, which collocate hot cache blocks from different page frames into a row-buffer to improve the overall utilization of the row-buffer contents. Their approach differs from ours in that it optimizes cases where small data are accessed heavily, while ours optimizes cases where large data are accessed in regular patterns. Hence, depending on the characteristics of the programs, the two approaches show contrasting improvements. Specifically, blackscholes, swaptions, and vips show relatively better improvements with the approach of Sudan et al., while canneal, facesim, and streamcluster improve more with our approach. Since the two approaches are orthogonal, we speculate that integrating them would have an even more positive effect on performance.

7. Conclusions

In this paper, we proposed M3, a novel kernel-level memory allocator for multi-core, multi-bank systems. It was developed with two goals in mind. The first is to dedicate multiple banks to a core as much as possible so as to maximize memory parallelism, while the second is to reduce cases where multiple cores access the same bank at the same time. To achieve the first goal, we devised the notion of a memory container. For the second goal, we introduced a randomizing memory allocation algorithm. To the best of our knowledge, this is the first study that enforces randomness in page frame allocation decisions to reduce row-buffer conflicts.

We are considering two research directions as future work. One direction is to analyze the randomized algorithm more formally. We believe that a single page frame allocation can be done in O(1) time, whereas it takes O(log n) in the original buddy algorithm. We are also investigating other issues related to M3, such as fragmentation and lock contention. The other direction is a more thorough study of memory isolation techniques. We believe that, with the cooperation of the memory controller, operating systems can partition memory more effectively so as to completely eliminate row-buffer conflicts between cores.

Acknowledgments

We thank the anonymous ASPLOS reviewers for comments that helped improve this paper. This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korea government (MEST) (No. 2012R1A2A2A01014233 and No. 2012R1A2A2A01045733).

References

[1] AMD Multi-core. http://www.amd.com.

[2] ARM Cortex-A9 Processor. http://www.arm.com.

[3] Intel Multi-core Technology. http://www.intel.com.

[4] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber. Future Scaling of Processor-Memory Interfaces. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 42:1–42:12, 2009.

[5] Benchmark. Linux Benchmark Suite Home Page. http://lbs.sourceforge.net/.

[6] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 72–81, 2008.

[7] M. J. Bligh, M. Dobson, D. Hart, and G. Huizenga. Linux on NUMA Systems. In Proceedings of the Linux Symposium, pages 295–306, 2004.

[8] G. S. Brodal, E. D. Demaine, and J. I. Munro. Fast Allocation and Deallocation with an Improved Buddy System. Acta Informatica, 41:273–291, March 2005.

[9] H. Choi, J. Lee, and W. Sung. Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems. In Proceedings of the 2011 IEEE International Symposium on Performance Analysis of Systems & Software, ISPASS '11, pages 66–75, 2011.

[10] M. Correa, A. Zorzo, and R. Scheer. Operating System Multilevel Load Balancing. In Proceedings of the 2006 ACM Symposium on Applied Computing, SAC '06, pages 1467–1471, 2006.

[11] G. Balakrishnan and R. M. Begun. Understanding Intel Xeon 5600 Series Memory Performance and Optimization in IBM System x and BladeCenter Platforms. White paper, IBM, May 2010.

[12] Hewlett-Packard. DDR3 Memory Technology, Technology Brief, 3rd edition. White paper, HP, April 2012.

[13] JEDEC. JEDEC Standard: DDR3 SDRAM Specification. White paper, JEDEC, July 2012. http://www.jedec.org/standards-documents/docs/jesd-79-3d.

[14] M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. In Proceedings of the IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA-18 '12, pages 1–12, 2012.

[15] D. Kaseridis, J. Stuecheli, and L. K. John. Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pages 24–35, 2011.

[16] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-43 '10, pages 65–76, 2010.

[17] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt. Improving Memory Bank-level Parallelism in the Presence of Prefetching. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-42 '09, pages 327–336, 2009.

[18] Micron Technology. DDR3 SDRAM RDIMM: MT18JSF25672PD, 2GB. White paper, Micron, July 2010. http://www.micron.com/products/dram-modules/.

[19] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pages 374–385, 2011.

[20] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-40 '07, pages 146–160, 2007.

[21] I. P. Page and J. Hagins. Improving the Performance of Buddy Systems. IEEE Transactions on Computers, 35:441–447, May 1986.

[22] J. L. Peterson and T. A. Norman. Buddy Systems. Communications of the ACM, 20:421–431, June 1977.

[23] K. K. Pusukuri, R. Gupta, and L. N. Bhuyan. Thread Tranquilizer: Dynamically Reducing Performance Variation. ACM Transactions on Architecture and Code Optimization, 8(4):46:1–46:21, January 2012.

[24] Ramspeed. Ramspeed Benchmark. http://alasir.com/software/ramspeed/.

[25] S. Rixner. Memory Controller Optimizations for Web Servers. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-37 '04, pages 355–366, 2004.

[26] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA-27 '00, pages 128–138, 2000.

[27] STREAM. STREAM Benchmark. http://www.cs.virginia.edu/stream.

[28] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian, and A. Davis. Micro-pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-15 '10, pages 219–230, 2010.

[29] SysBench. SysBench: A System Performance Benchmark. http://sysbench.sourceforge.net/.

[30] UnixBench. UnixBench: A Fundamental High-level Linux Benchmark Suite. http://www.tux.org/pub/tux/benchmarks/System/unixbench/.

[31] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMsim: A Memory System Simulator. ACM SIGARCH Computer Architecture News, 33(4):100–107, November 2005.

[32] W. Wang, T. Dey, J. Mars, L. Tang, J. W. Davidson, and M. L. Soffa. Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources. In Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, ISPASS '12, pages 156–167, 2012.

[33] D. H. Yoon, M. K. Jeong, and M. Erez. Adaptive Granularity Memory Systems: A Tradeoff between Storage Efficiency and Throughput. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA-38 '11, pages 295–306, 2011.

[34] Z. Zhang, Z. Zhu, and X. Zhang. A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO-33 '00, pages 32–41, 2000.

[35] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-41 '08, pages 210–221, 2008.
