Research Issues/Challenges to Systems Software for Multicore and Data-Intensive Applications

Research Issues/Challenges to Systems Software for Multicore and Data-Intensive Applications. Xiaodong Zhang, Ohio State University. In collaboration with F. Chen, X. Ding, Q. Lu, P. Sadayappan (Ohio State), S. Jiang (Wayne State University), Z. Zhang (Iowa State), Q. Lu (Intel), J. Lin (IBM), and Kei Davis (Los Alamos National Lab).



Page 1: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

1

Research Issues/Challenges to Systems Software for Multicore and Data-Intensive Applications

Xiaodong Zhang Ohio State University

In collaboration with F. Chen, X. Ding, Q. Lu, P. Sadayappan, Ohio State

S. Jiang, Wayne State University, Z. Zhang, Iowa State, Q. Lu, Intel, J. Lin, IBM

Kei Davis, Los Alamos National Lab

Page 2: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

2

Footsteps of Challenges in HPC

1970s-80s: killer applications demand a lot of CPU cycles; a single processor was very slow (below 1 MHz); the beginning of parallel processing (PP): algorithms, architecture, software.

1980s: communication bottlenecks and the burden of PP. Challenge I: fast interconnection networks. Challenge II: automatic PP, and shared virtual memory.

1990s: the "Memory Wall" and utilization of commodity processors. Challenge I: cache design and optimization. Challenge II: Networks of Workstations for HPC.

2000s and now: the "Disk Wall" and multicore processors.

Page 3: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

3

Moore’s Law Driven Computing Research (IEEE Spectrum, May 2008)

25 years of the golden age of parallel computing.

10 years of the dark age of parallel computing; the CPU-memory gap was the major concern.

New era of multicore computing; the memory problem continues.

Page 4: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

4

[Figure: "Unbalanced System Improvements" - latencies of cache, DRAM, and disk in CPU cycles (SRAM access time, DRAM access time, disk seek time), 1980-2000. Disk seek time grows from roughly 87,000 CPU cycles in 1980 to 5,000,000 cycles in 2000.]

Bryant and O'Hallaron, "Computer Systems: A Programmer's Perspective", Prentice Hall, 2003

The disks in 2000 are 57 times "SLOWER" than their ancestors in 1980, increasingly widening the speed gap between peta-scale computing and peta-byte accesses.

Page 5: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

5

Dropping Prices of Solid State Disks (SSDs) (CACM, 7/08)

[Figure: flash cost ($) per GB over time.]

Page 6: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

6

Prices & Performance: Disks, DRAMs, and SSDs (CACM, 7/08)

Page 7: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

7

Power Consumption for Typical Components (CACM, 7/08)

Page 8: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

8

Opportunities of Technology Advancements

• Single-core CPUs reached their peak performance
  – 1971 (2,300 transistors on the Intel 4004 chip): 0.4 MHz
  – 2005 (1 billion+ transistors on the Intel Pentium D): 3.75 GHz
  – After a 10,000-fold improvement, clock rates stopped rising and dropped
  – CPU improvement is now reflected in the number of cores per chip

• Increased DRAM capacity enables large working sets
  – 1971 ($400/MB) to 2006 (0.09 cent/MB): 444,444 times lower
  – The buffer cache is increasingly important to break the "disk wall"

• SSDs (flash memory) can further break the "wall"
  – Low power (6-8X lower than disks, 2X lower than DRAM)
  – Fast random reads (200X faster than disks, 25X slower than DRAM)
  – Slow writes (300X slower than DRAM, 12X faster than disks)
  – Relatively expensive (8X more than disks, 5X cheaper than DRAM)
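One way to read these device trade-offs is as a placement heuristic across the storage hierarchy. The rules and thresholds below are illustrative assumptions, not from the slide:

```python
# Hypothetical placement heuristic derived from the trade-offs above:
# hot data in DRAM; random-read-heavy data on SSD (fast random reads,
# low power); write-heavy or cold data on disk (cheap capacity, and it
# avoids slow SSD writes). Thresholds are arbitrary illustrations.

def place(read_ratio, random_ratio, hot):
    """Pick a tier for a data set given its access profile."""
    if hot:
        return "DRAM"
    if random_ratio > 0.5 and read_ratio > 0.5:
        return "SSD"    # exploit fast random reads at low power
    return "disk"       # cheap capacity; sidesteps slow SSD writes

print(place(0.9, 0.8, hot=False))  # SSD
print(place(0.2, 0.7, hot=False))  # disk (write-heavy)
print(place(0.9, 0.1, hot=True))   # DRAM
```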

Page 9: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

9

Research and Challenges

• New issues in multicore
  – Utilizing parallelism in multicore is much more complex
  – Resource competition in multicore causes new problems
  – OS scheduling is multicore- and shared-cache-unaware
  – Challenge: caches are not in the scope of OS management

• Fast data access is most desirable
  – Sequential locality in disks is not effectively exploited
  – Where should flash memory be in the storage hierarchy?
  – How can flash memory and the buffer cache improve disk performance/energy?
  – Challenge: disks are not in the scope of OS management

Page 10: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

10

Data-Intensive Scalable Computing (DISC)

Massively accessing/processing data sets in parallel; drafted by R. Bryant at CMU, endorsed by industry (Intel, Google, Microsoft, Sun) and by scientists in many areas.

Applications in science, industry, and business.

Special requirements for the DISC infrastructure: a Top 500 for DISC, ranked by data throughput as well as FLOPS.

Frequent interactions between parallel CPUs and distributed storage make scalability challenging.

DISC is not an extension of SC, but demands new technology advancements.

Page 11: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

11

Systems Comparison: (courtesy of Bryant)

Conventional computers:
– Disk data stored separately
  • No support for collection or management
– Data brought in for computation
  • Time consuming
  • Limits interactivity

DISC:
– System collects and maintains data
  • Shared, active data set
– Computation co-located with disks
  • Faster access

Page 12: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

12

Outline

– Why is multicore the only choice?
– Performance bottlenecks in multicores
– OS plays a strong supportive role
– A case study of DBMS on multicore
– OS-based cache partitioning in multicores
– Summary

Page 13: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

13

Multicore is the Only Choice to Continue Moore's Law

                       Performance   Power
Baseline Frequency       1.00 x      1.00 x
Over-Clocked (1.2x)      1.13 x      1.73 x
Under-Clocked (0.8x)     0.87 x      0.51 x
Dual-Core (0.8x)         1.73 x      1.02 x

Dual-core: much better performance with similar power consumption.

R.M. Ramanathan, Intel Multi-Core Processors: Making the Move to Quad-Core and Beyond, white paper
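The arithmetic behind these numbers can be sketched with a standard dynamic-power model. The assumptions below (performance roughly linear in frequency, voltage scaling with frequency, dynamic power ~ V^2 * f, so per-core power ~ f^3) are a textbook model, not taken from the white paper:

```python
# Sketch of the frequency/voltage scaling arithmetic behind the table.
# Assumed model: performance ~ frequency; dynamic power ~ V^2 * f with
# V ~ f, hence per-core power ~ f^3. The 0.87x per-core performance for
# the dual-core case is taken from the table, not derived here.

def perf_power(freq_scale, cores=1, perf_per_core=None):
    """Return (performance, power) relative to one core at 1.0x clock."""
    if perf_per_core is None:
        perf_per_core = freq_scale          # linear performance model
    performance = cores * perf_per_core
    power = cores * freq_scale ** 3         # power ~ V^2 * f, V ~ f
    return performance, power

print(perf_power(1.2))                               # ~ (1.2, 1.73)
print(perf_power(0.8))                               # ~ (0.8, 0.51)
print(perf_power(0.8, cores=2, perf_per_core=0.87))  # ~ (1.74, 1.02)
```

The dual-core row falls out directly: two under-clocked cores give roughly 2 x 0.87 = 1.74x the performance at about 2 x 0.8^3 = 1.02x the power.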

Page 14: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

14

Shared Resource Conflicts in Multicores

[Diagram: two cores with private caches sharing a memory bus and memory; jobs are cache-sensitive, computation-intensive, or streaming.]

• Scheduling two cache-sensitive jobs - causing cache conflicts

Page 15: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

15

Shared Resource Conflicts in Multicores

[Diagram: two streaming jobs saturating the shared memory bus.]

• Scheduling two cache-sensitive jobs - causing cache conflicts
• Scheduling two streaming jobs - causing memory bus congestion

Page 16: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

16

Shared Resource Conflicts in Multicores

[Diagram: two CPU-intensive jobs leaving the caches and the bus underutilized.]

• Scheduling two cache-sensitive jobs - causing cache conflicts
• Scheduling two streaming jobs - causing memory bus congestion
• Scheduling two CPU-intensive jobs - underutilizing cache and bus

Page 17: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

17

Shared Resource Conflicts in Multicores

[Diagram: a streaming job pollutes the cache and increases memory activity for the co-running cache-sensitive job.]

• Scheduling two cache-sensitive jobs - causing cache conflicts
• Scheduling two streaming jobs - causing memory bus congestion
• Scheduling two CPU-intensive jobs - underutilizing cache and bus
• Scheduling cache-sensitive & streaming jobs - conflicts & congestion

Page 18: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

18

Many Cores, Limited Cache, Single Bus

[Diagram: many cores, each with a small cache, sharing one memory bus to memory.]

• Many cores - oversupplying computational power
• Limited cache - lowering cache capacity per process and per core
• Single bus - increasing bandwidth sharing by many cores

Page 19: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

19

Moore’s Law Driven Relational DBMS Research

1970: Relational data model (E. F. Codd)

1976: DBMS: System R and Ingres (IBM, UC Berkeley)

1977-1997: Parallel DBMS: DIRECT, Gamma, Paradise (Wisconsin-Madison)

Architecture-optimized DBMS (MonetDB): "Database Architecture Optimized for the New Bottleneck: Memory Access" (VLDB'99)

What should we do as DBMS meets multicores?

Page 20: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

20

Good/Bad News for "Hash Join"

[Diagram: Query1 on Core 1 and Query2 on Core 2 run over a shared cache and memory.]

select * from Ta, Tb where Ta.x = Tb.y

Query1 builds hash table H1 on Ta and probes it with Tb: HJ(H1, Tb). Query2 builds hash table H2 on Tb and probes it with Ta: HJ(H2, Ta). The scans of Ta and Tb are shared in the cache (sharing!), while the two hash tables H1 and H2 compete for cache space (conflict!).

The shared cache provides both data sharing and cache contention.
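For reference, a hash join is a build phase that hashes one table followed by a probe phase that streams the other table through it; the build table is what each query wants to keep in the shared cache. The sketch below is a generic in-memory illustration, not the DBMS's actual operator:

```python
# Minimal in-memory hash join. Tables are modeled as lists of dicts;
# Ta/Tb and their columns mirror the query on the slide, but the data
# values are made up for illustration.

def hash_join(build_rows, build_key, probe_rows, probe_key):
    # Build phase: hash table over one table (ideally cache-resident)
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: stream the other table through the hash table
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

Ta = [{"x": 1, "a": "p"}, {"x": 2, "a": "q"}]
Tb = [{"y": 2, "b": "r"}, {"y": 3, "b": "s"}]
print(hash_join(Ta, "x", Tb, "y"))  # rows where Ta.x = Tb.y
```

When two such joins co-run on two cores, their build tables play the role of H1 and H2 above: each is reused on every probe, so whichever is evicted from the shared cache pays repeated misses.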

Page 21: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

21

Consequence of Cache Contention

During concurrent query executions on a multicore, cache contention causes new concerns:

• Suboptimal query plans - due to unawareness of the multicore shared cache

• Confused scheduling - generating unnecessary cache conflicts

• Cache is allocated by demand, not by locality - weak-locality blocks, such as one-time-accessed blocks, pollute and waste cache space

Page 22: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

22

Suboptimal Query Plan

• The query optimizer selects the "best" plan for a query.

• The query optimizer is not shared-cache aware.

• Some query plans are cache sensitive and some are not.

• A set of cache-sensitive/non-sensitive queries would confuse the scheduler.

Page 23: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

24

[Figure: average response time (s) of HashJoin and TableScan under default scheduling vs. CMP-aware scheduling.]

Multicore Unaware Scheduling

• Default scheduling takes a FIFO order, causing cache conflicts
• Multicore-aware optimization: co-schedule queries
  – "hashjoins" (cache sensitive), "table scans" (insensitive)
  – Default: co-schedule "hashjoins" (cache conflict!)
  – Multicore-aware: co-schedule "hashjoin" and "tablescan"

30% improvement!
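The co-scheduling idea can be sketched as a simple pairing policy. This is a hypothetical user-level model: the `cache_sensitive` flags are assumed inputs (e.g., hints from the query optimizer), not something the scheduler measures here:

```python
# Sketch of multicore-aware co-scheduling: pair each cache-sensitive
# query with a cache-insensitive one on the two cores sharing a cache,
# instead of FIFO order (which may pair two cache-sensitive queries).

def coschedule(queries):
    sensitive = [q for q in queries if q["cache_sensitive"]]
    insensitive = [q for q in queries if not q["cache_sensitive"]]
    pairs = []
    while sensitive and insensitive:          # mix the two classes
        pairs.append((sensitive.pop(0), insensitive.pop(0)))
    leftovers = sensitive or insensitive      # whichever class remains
    while len(leftovers) >= 2:
        pairs.append((leftovers.pop(0), leftovers.pop(0)))
    if leftovers:
        pairs.append((leftovers.pop(0), None))
    return pairs

qs = [{"name": "hashjoin1", "cache_sensitive": True},
      {"name": "hashjoin2", "cache_sensitive": True},
      {"name": "scan1", "cache_sensitive": False},
      {"name": "scan2", "cache_sensitive": False}]
for a, b in coschedule(qs):
    print(a["name"], b["name"] if b else "-")
```

With FIFO order the two hashjoins would share a cache and conflict; the pairing above instead yields (hashjoin1, scan1) and (hashjoin2, scan2).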

Page 24: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

25

[Figure: query response time (s) of HashJoin and TableScan under default allocation vs. cache partitioning.]

Locality-based Cache Allocation

• Different queries have different cache utilization.
• Cache allocation is demand-based by default.
• Weak-locality queries should be allocated small space.

Co-schedule "hashjoin" (strong locality) and "tablescan" (one-time accesses), allocating more cache to "hashjoin".

16% improvement!

Page 25: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

26

A DBMS Framework for Multicores

[Diagram: queries pass through the query optimizer and the query scheduler to cores sharing the last-level cache; cache partitioning sits below, at the OS level.]

• Query optimization (DB level): the query optimizer generates optimal query plans based on usage of the shared cache
• Query scheduling (DB level): group co-running queries to minimize access conflicts in the shared cache
• Cache partitioning (OS level): allocate cache space to maximize cache utilization for co-scheduled queries

Page 26: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

27

Challenges and Opportunities

• Challenges to DBMS
  – A DBMS running in user space is not able to directly control cache allocation in multicores
  – Scheduling: predict potential cache conflicts among co-running queries
  – Partitioning: determine access locality for different query operations (join, scan, aggregation, sorting, ...)

• Opportunities
  – The query optimizer can provide hints of data access patterns and estimate working-set sizes during query executions
  – The operating system can manage cache allocation by using page coloring during virtual-physical address mapping.

Page 27: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

30

OS-Based Cache Partitioning

• Static cache partitioning
  – Predetermines the amount of cache blocks allocated to each program at the beginning of its execution
  – Divides the shared cache into multiple regions and partitions cache regions through OS page address mapping

• Dynamic cache partitioning
  – Adjusts cache quotas among processes dynamically
  – Dynamically changes processes' cache usage through OS page address re-mapping

Page 28: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

31

Page Coloring

[Diagram: a virtual address (virtual page number + page offset) is translated to a physical address (physical page number + page offset); a physically indexed cache splits the address into cache tag, set index, and block offset. The bits shared by the physical page number and the set index are the page color bits, which the OS controls.]

• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).

The OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
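A minimal sketch of how the color bits fall out of the cache geometry. The parameters here (4 KB pages, a 4 MB 16-way cache with 64-byte lines) are assumptions for illustration, not the machine used in this work:

```python
# Page coloring arithmetic for a physically indexed cache (assumed
# geometry: 4 KB pages, 4 MB capacity, 16-way, 64-byte lines).
# The color is the overlap between the set-index bits and the
# physical page number.

PAGE_SHIFT = 12             # 4 KB pages
CACHE_SIZE = 4 * 1024**2    # 4 MB shared last-level cache
ASSOC = 16
LINE = 64

sets = CACHE_SIZE // (ASSOC * LINE)      # 4096 sets
set_index_bits = sets.bit_length() - 1   # 12 bits
line_bits = LINE.bit_length() - 1        # 6 bits
color_bits = line_bits + set_index_bits - PAGE_SHIFT  # bits above the page offset
num_colors = 1 << color_bits

def page_color(phys_addr):
    # The color is the low bits of the physical page number.
    return (phys_addr >> PAGE_SHIFT) & (num_colors - 1)

print(num_colors)               # 64 colors
print(page_color(0x0001_3000))  # physical page 0x13 -> color 19
```

By handing a process only physical pages of certain colors, the OS confines that process to the corresponding fraction of the cache sets.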

Page 29: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

32

Enhancement for Static Cache Partitioning

[Diagram: physical pages are grouped into page bins according to their page color (1, 2, 3, 4, ..., i, i+1, i+2, ...); OS address mapping directs the bins of Process 1 and Process 2 to disjoint regions of the physically indexed cache.]

The shared cache is partitioned between two processes through address mapping.

Cost: main memory space needs to be partitioned too (co-partitioning).

Page 30: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

34

Page Re-Coloring for Dynamic Partitioning

[Diagram: a page-links table with one linked list of pages per color (0, 1, 2, 3, ..., N-1); pages in the allocated colors are re-colored.]

• Page re-coloring:
  – Allocate a page in the new color
  – Copy the memory contents
  – Free the old page

Pages of a process are organized into linked lists by their colors.

Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.
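The three re-coloring steps can be sketched in a small user-level model. Pages are simulated as bytearrays and the per-color free lists are hypothetical; a kernel would of course work with page frames and update page tables:

```python
# User-level model of page re-coloring: allocate in the new color,
# copy contents, free the old page. free_lists maps each color to a
# list of free (simulated) pages.

PAGE_SIZE = 4096
free_lists = {color: [bytearray(PAGE_SIZE) for _ in range(4)]
              for color in range(4)}

def recolor(page, old_color, new_color):
    new_page = free_lists[new_color].pop()  # 1. allocate a page in the new color
    new_page[:] = page                      # 2. copy the memory contents
    free_lists[old_color].append(page)      # 3. free the old page
    return new_page

p = free_lists[0].pop()
p[:4] = b"data"
p2 = recolor(p, 0, 3)
print(bytes(p2[:4]))  # b'data'
```

The copy in step 2 is exactly the overhead that the migration-frequency control and lazy migration on the next slides try to reduce.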

Page 31: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

35

Reduce Page Migration Overhead

• Control the frequency of page migration
  – Frequent enough to capture phase changes
  – Reduce the frequency of large page migrations

• Lazy migration: avoid unnecessary migration
  – Observation: not all pages are accessed between their two migrations
  – Optimization: do not migrate a page until it is accessed

Page 32: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

36

Lazy Page Migration

• With the optimization
  – Only 2% page migration overhead on average
  – Up to 7%

[Diagram: process page links per color (0 to N-1); pages that are never accessed between two migrations stay put, avoiding unnecessary page migration.]

Page 33: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

37

Research in Cache Partitioning

• Merits of page-color-based cache partitioning
  – Static partitioning has low overhead, although memory has to be co-partitioned among processes
  – A measurement-based platform can be built to evaluate cache partitioning methods on real machines

• Limits and research issues
  – The overhead of dynamic cache partitioning is high
  – The cache is managed indirectly, at the page level
  – Can the OS directly partition the cache with low overhead?
  – We have proposed a hybrid method for this purpose

Page 34: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

38

“Disk Wall” is a Critical Issue

Many data-intensive applications generate huge data sets on disks worldwide at very high speed.

LANL turbulence simulation: processing 100+ TB.

Google searches and accesses over 10 billion web pages and tens of TB of data on the Internet.

Internet traffic is expected to increase from 1 to 16 million TB/month due to multimedia data.

We carry very large digital data: films, photos, ...

The data's home is the cost-effective and reliable disk.

Slow disk data access is the major bottleneck.

Page 35: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

39

Data-Intensive Scalable Computing (DISC)

Massively accessing/processing data sets in parallel; drafted by R. Bryant at CMU, endorsed by industry (Intel, Google, Microsoft, Sun) and by scientists in many areas.

Applications in science, industry, and business.

Special requirements for the DISC infrastructure: a Top 500 for DISC, ranked by data throughput as well as FLOPS.

Frequent interactions between parallel CPUs and distributed storage make scalability challenging.

DISC is not an extension of SC, but demands new technology advancements.

Page 36: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

40

Systems Comparison: (courtesy of Bryant)

Conventional computers:
– Disk data stored separately
  • No support for collection or management
– Data brought in for computation
  • Time consuming
  • Limits interactivity

DISC:
– System collects and maintains data
  • Shared, active data set
– Computation co-located with disks
  • Faster access

Page 37: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

42

Sequential Locality is Unique in Disks

Sequential locality: disk accesses in sequence are fastest.

Disk speed is limited by mechanical constraints: seek/rotation (high latency and power consumption).

The OS can guess the sequential disk layout, but is not always right.

Page 38: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

43

Weak OS Ability to Exploit Sequential Locality

The OS is not exactly aware of the disk layout. Sequential data placement has been implemented since the Fast File System in BSD (1984):
– put files in one directory in sequence on disks
– follow the execution sequence to place data on disks.

This assumes temporal sequence = disk layout sequence. The assumption is not always right, and performance suffers:
– data is accessed in both sequential and random patterns
– an application accesses multiple files
– buffer caching/prefetching know little about the disk layout.

Page 39: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

44

IBM Ultrastar 18ZX Specification *

Seq. Read: 4,700 IO/s

Rand. Read: < 200 IO/s

* Taken from IBM “ULTRASTAR 9LZX/18ZX Hardware/Functional Specification” Version 2.4

Our goal: to maximize opportunities of sequential accesses for high speed and high I/O throughput

Page 40: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

47

Existing Approaches and Limits

Programming for disk performance:
– hiding disk latency by overlapping computing, e.g., sorting large data sets (SIGMOD'97)
– application dependent and a programming burden.

Transparent and Informed Prefetching (TIP):
– applications issue hints on their future I/O patterns to guide prefetching/caching (SOSP'99)
– not general enough to cover all applications.

Collective I/O: gather multiple I/O requests to make contiguous disk accesses for parallel programs.

Page 41: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

48

Our Objectives

Exploit sequential locality in disks:
– by minimizing random disk accesses
– by making caching and prefetching disk-aware
– by utilizing both the buffer cache and SSDs.

An application-independent approach: putting disk access information on the OS map.

Exploiting DUal LOcalities (DULO):
– temporal locality of program execution
– sequential locality of disk accesses.

Page 42: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

50

What is the Buffer Cache Aware and Unaware of?

[Diagram: application I/O requests pass through the buffer cache (caching & prefetching) and the I/O scheduler to the disk driver and the disk.]

The buffer cache is an agent between I/O requests and disks. It is:
– aware of access patterns in time sequence (a good position to exploit temporal locality)
– not clear about the physical layout (limited ability to exploit sequential locality in disks).

Existing functions:
– send unsatisfied requests to disks
– LRU replacement by temporal locality
– prefetch under a sequential-access assumption.

Ineffectiveness of the I/O scheduler: sequential locality in disks is not open to buffer management.

Page 43: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

51

Limits of Hit-Ratio-Based Buffer Cache Management

Average access time = (Hit time x Hit rate) + (Miss penalty x Miss rate)

Minimizing the cache miss ratio exploits only temporal locality; the miss penalty matters too:
– sequentially accessed blocks have a small miss penalty (sequential locality)
– randomly accessed blocks have a large miss penalty.
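The point can be illustrated numerically with the average-access-time formula on this slide. The device timings below are assumptions for illustration (hit ~100 ns, sequential disk miss ~0.5 ms, random disk miss ~5 ms): two policies with the same hit ratio can differ almost 3x depending on which blocks they miss on:

```python
# Average access time = hit_rate * hit_time + miss_rate * miss_penalty.
# Assumed timings: DRAM hit 100 ns, sequential disk miss 0.5 ms,
# random disk miss 5 ms (seek + rotation). Not measured values.

def avg_access_time(hit_rate, hit_time, miss_penalty):
    return hit_rate * hit_time + (1 - hit_rate) * miss_penalty

HIT, SEQ, RAND = 100e-9, 0.5e-3, 5e-3

# Both policies achieve a 90% hit ratio, but policy A misses mostly on
# sequential blocks while policy B misses mostly on random blocks.
a = avg_access_time(0.90, HIT, 0.8 * SEQ + 0.2 * RAND)
b = avg_access_time(0.90, HIT, 0.2 * SEQ + 0.8 * RAND)
print(f"{a*1e3:.3f} ms vs {b*1e3:.3f} ms")  # 0.140 ms vs 0.410 ms
```

This is exactly why a disk-layout-aware policy prefers to keep "expensive" random blocks cached and to evict long sequences first, even at an equal hit ratio.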

Page 44: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

52

[Diagram: a hard disk drive track holding a sequential run of blocks A, B, C, D and scattered random blocks X1-X4.]

Unique and critical roles of the buffer cache: the buffer cache can influence request stream patterns in disks. If the buffer cache is disk-layout-aware, the OS is able to:
– distinguish sequentially and randomly accessed blocks
– give "expensive" random blocks high caching priority in DRAM/SSD
– replace long sequential data blocks to disks in a timely way.

Disk accesses become more sequential.

Page 45: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

53

Prefetching Efficiency is Performance Critical

• Prefetching may incur non-sequential disk access
  – Non-sequential accesses are much slower than sequential accesses
  – Disk layout information must be introduced into prefetching policies.

[Diagram: with synchronous requests the process idles while the disk serves each request; with prefetch requests, disk I/O overlaps computation.]

It is increasingly difficult to hide disk accesses behind computation.

Page 46: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

54

File-level Prefetching is Disk Layout Unaware

• Multiple files sequentially allocated on disks cannot be prefetched at once.

• Metadata are allocated separately on disks, and cannot be prefetched.

• Sequentiality at the file abstraction may not translate to sequentiality on the physical disk.

• Deep access history information is usually not recorded.

[Diagram: blocks A, B, C, D of files X, Y, Z, R and the metadata of files X, Y, Z scattered across the disk.]

Page 47: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

55

Opportunities and Challenges With Disk Spatial Locality (Disk-Seen)

Exploit DULO for fast disk accesses.

Challenges in building the Disk-Seen system infrastructure (disk layout information is increasingly hidden in disks):
– analyze and utilize disk-layout information
– accurately and timely identify long disk sequences
– consider trade-offs of temporal and spatial locality (buffer cache hit ratio vs. miss penalty: not necessarily following LRU)
– manage its data structures with low overhead
– implement it in the OS kernel for practical usage.

Page 48: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

56

Disk-Seen Task 1: Make Disk Layout Info. Available

Which disk layout information to use?
– Logical block number (LBN): the location mapping provided by firmware (each block is given a sequence number).
– Accesses of contiguous LBNs perform close to accesses of contiguous blocks on disk (except where bad blocks occur).
– The LBN interface is highly portable across platforms.

How to efficiently manage the disk layout information?
– LBN is only used to identify disk locations for reads/writes.
– We want to track access times of disk blocks and search for access sequences via LBNs.
– Disk block table: a data structure for efficient tracking of disk blocks.

Page 49: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

57

Disk-Seen TASK 2: Exploiting Dual Localities (DULO)

[Diagram: an LRU stack with a staging section (correlation buffer and sequencing bank) above an evicting section; blocks go through sequence forming before sorting.]

Sequence: a number of blocks whose disk locations are adjacent and which have been accessed during a limited time period.

Sequence sorting: based on recency (temporal locality) and size (spatial locality).
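Sequence forming, as defined above, can be sketched as grouping accesses whose LBNs are adjacent and whose times fall within a limited window. This is a simplified model (the real sequencing bank operates inside the LRU stack); the function name and window parameter are assumptions:

```python
# Group block accesses into sequences: disk locations (LBNs) adjacent,
# and accessed within a limited time period of each other.

def form_sequences(accesses, max_gap_time=1.0):
    """accesses: list of (lbn, timestamp). Returns lists of LBNs."""
    sequences = []
    for lbn, t in sorted(accesses):
        if (sequences
                and lbn == sequences[-1][-1][0] + 1          # adjacent LBN
                and t - sequences[-1][-1][1] <= max_gap_time):  # close in time
            sequences[-1].append((lbn, t))
        else:
            sequences.append([(lbn, t)])
    return [[lbn for lbn, _ in s] for s in sequences]

acc = [(100, 0.1), (101, 0.2), (102, 0.3), (500, 0.15), (103, 9.0)]
print(form_sequences(acc))  # [[100, 101, 102], [103], [500]]
```

Block 103 is adjacent to 102 on disk but accessed much later, so it forms its own sequence; block 500 is a random singleton.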

Page 50: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

58

Disk-Seen Task 3: DULO-Caching

Adapted GreedyDual algorithm: a global inflation value L, and a value H for each sequence.
– Calculate H values for sequences in the sequencing bank: H = L + 1 / Length(sequence). Random blocks have larger H values.
– When a sequence s is replaced, L = H value of s. L increases monotonically, making future sequences have larger H values.
– Sequences with smaller H values are placed closer to the bottom of the LRU stack.

[Diagram: with L = L0, a four-block sequence gets H = L0 + 0.25 while random blocks get H = L0 + 1; after the sequence is replaced, L = L1.]
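A user-level sketch of this adapted GreedyDual rule. For simplicity, capacity is counted in sequences and a heap stands in for the LRU stack ordering; the kernel version instead positions sequences within the stack:

```python
# Adapted GreedyDual: each sequence gets H = L + 1/len(sequence); when a
# sequence is evicted, L becomes its H value. Random (length-1) blocks
# get large H and survive; long sequences get small H and go first.

import heapq

class DuloCache:
    def __init__(self, capacity_seqs):
        self.capacity = capacity_seqs
        self.L = 0.0
        self.heap = []      # entries: (H, insertion_counter, sequence)
        self.counter = 0

    def insert(self, sequence):
        if len(self.heap) >= self.capacity:
            # Evict the smallest-H entry and inflate L to its H value.
            self.L, _, _ = heapq.heappop(self.heap)
        H = self.L + 1.0 / len(sequence)
        heapq.heappush(self.heap, (H, self.counter, sequence))
        self.counter += 1

cache = DuloCache(2)
cache.insert(list(range(100)))  # long sequence: H = 0.01
cache.insert([7])               # random block: H = 1.0
cache.insert([42])              # evicts the long sequence (smallest H)
print(sorted(s for _, _, s in cache.heap))  # [[7], [42]]
```

Note how the eviction inflates L, so sequences inserted later automatically get larger H values than long-gone ones, which is the aging behavior the slide describes.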

Page 51: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

60

DULO-Caching Principles

Move long sequences to the bottom of the stack: replace them early, and get them back fast from disks. Replacement priority is set by sequence length.

Move LRU sequences to the bottom of the stack: exploiting the temporal locality of data accesses.

Keep random blocks in the upper levels of the stack: hold them, since they are expensive to get back from disks.

Page 52: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

65

Disk-Seen Task 5: DULO-Prefetching

Prefetch size: the maximum number of blocks to be prefetched.

[Diagram: blocks plotted by LBN and timestamp; around the block initiating prefetching, a temporal window and a spatial window select resident and non-resident blocks for prefetch.]
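The windowed decision can be sketched as follows. The function name, window sizes, and the block-table representation (a dict from LBN to last access time) are assumptions for illustration:

```python
# Windowed prefetch decision: starting from the block that initiated
# prefetching, fetch non-resident blocks whose LBN lies in the spatial
# window and whose recorded access time lies in the temporal window,
# up to the prefetch size.

def choose_prefetch(trigger_lbn, trigger_time, block_table, resident,
                    spatial_window=8, temporal_window=10.0, prefetch_size=4):
    candidates = []
    for lbn in range(trigger_lbn + 1, trigger_lbn + 1 + spatial_window):
        t = block_table.get(lbn)            # last recorded access time
        if lbn in resident or t is None:
            continue                         # already cached / never seen
        if abs(trigger_time - t) <= temporal_window:
            candidates.append(lbn)
    return candidates[:prefetch_size]

table = {101: 5.0, 102: 5.1, 103: 90.0, 104: 5.3}
print(choose_prefetch(100, 6.0, table, resident={102}))  # [101, 104]
```

Block 103 is spatially close but falls outside the temporal window, so it is skipped, which is how disk-layout knowledge keeps prefetching from dragging in blocks that merely happen to be nearby.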

Page 53: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

66

What can DULO-Caching/-Prefetching do and not do?

Effective for:
– mixed sequential/random accesses (cache them differently)
– many small files (package them in a prefetch)
– many one-time sequential accesses (replace them quickly)
– repeatable complex patterns that cannot be detected without disk info (remember them).

Not effective for:
– dominantly random or sequential accesses (performs equivalently to LRU)
– a large file sequentially located on disk (file-level prefetch can handle it)
– non-repeatable accesses (performs equivalently to file-level prefetch).

Page 54: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

67

DiskSeen: a System Infrastructure to Support DULO-Caching and DULO-Prefetching

[Diagram: the buffer cache is divided into a prefetching area, a caching area, and a destaging area, with block transfers between the areas and the disk. DULO-Prefetching adjusts windows/streams in the prefetching area; on-demand reads are placed at the stack top; DULO-Caching manages LRU blocks and long sequences in the caching area.]

Page 55: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

72

DULO Caching does not affect Execution Times of Pure Sequential or Random Workloads

TPC-H Query #6 (sequential accesses)

Diff (random accesses)

Page 56: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

74

DULO Caching Reduces Execution Times for Workloads with Mixed Patterns

PostMark (mixed patterns of both sequential and random accesses)

Page 57: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

76

DULO Prefetching Reduces Execution Times for Workloads with Many Small Files

[Figure: execution time (sec) of grep, cvs, and diff under Linux 2.6.11 vs. DULO.]

Page 58: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

77

DULO Prefetching Reduces Execution Times for Workloads With Complex Access Patterns

[Figure: execution time (sec) of strided, reverse, and TPC-H (Q4) workloads under Linux 2.6.11 vs. DULO.]

Page 59: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

78

Conclusions (1): Issues of Multicores

Resource conflicts arise in shared caches and on the memory bus, and the OS is not shared-cache-aware.

Multicore- and cache-aware scheduling is essential:
– schedule jobs based on resource demand initially
– reschedule jobs to optimize resource utilization.

Build a hybrid OS resource management system:
– dynamically allocate cache space to each process
– put cache information on the OS map
– minimize OS and hardware overheads.

Page 60: Research Issues/Challenges to Systems Software  for  Multicore  and Data-Intensive Applications

79

Conclusions (2): Disks and Storage

Disk performance is limited because the OS is unable to effectively exploit sequential locality.

The buffer cache is a critical component for storage, but existing OSes mainly exploit temporal locality.

Build a Disk-Seen system infrastructure for DULO-Caching and DULO-Prefetching.

Flash memory in the storage hierarchy:
– hold the random accesses
– serve as an L2 cache for the DRAM buffer cache
– cache for low power.