
Redundant Memory Mappings for Fast Access to Large Memories

Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. Mckinley,

Mario Nemirovsky, Michael M. Swift, Osman S. Ünsal

Executive Summary

• Problem: Virtual memory overheads are high (up to 41%)
• Proposal: Redundant Memory Mappings
– Propose a compact representation called range translation
– Range translation: an arbitrarily large contiguous mapping
– Effectively cache, manage, and facilitate range translations
– Retain the flexibility of 4KB paging

• Result:
– Reduces overheads of virtual memory to less than 1%

Outline

Motivation
– Virtual Memory Refresher + Key Technology Trends
– Previous Approaches
– Goals + Key Observation
Design: Redundant Memory Mappings
Results
Conclusion

Virtual Memory Refresher

[Figure: Processes 1 and 2 each map a virtual address space through a per-process page table into physical memory; the TLB (Translation Lookaside Buffer) caches recent translations.]

Challenge: How to reduce costly page walks?

Two Technology Trends

[Figure: memory capacity available for $10,000 (inflation-adjusted 2011 USD, from jcmit.com), 1980–2015: growth from megabytes to terabytes.]

TLB reach is limited

Year   Processor     L1 DTLB entries
1999   Pentium III   72
2001   Pentium 4     64
2008   Nehalem       96
2012   IvyBridge     100
2015   Broadwell     100
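The reach numbers follow directly from entry count times page size; a quick sketch (illustrative, assuming 4KB base pages):

```python
PAGE_SIZE = 4 * 1024  # 4KB base pages

def tlb_reach(entries, page_size=PAGE_SIZE):
    """Memory covered by the TLB at once: entries x page size."""
    return entries * page_size

# Broadwell's 100-entry L1 DTLB covers just 400KB of a TB-scale memory.
print(tlb_reach(100) // 1024, "KB")  # 400 KB
```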

0. Page-based Translation

[Figure: page-based translation: a TLB entry maps one virtual page number (VPN0) to one physical frame number (PFN0) between virtual and physical memory.]

1. Multipage Mapping

[Figure: sub-blocked TLB/CoLT: a clustered TLB entry maps a block of virtual pages VPN(0-3) to physical frames PFN(0-3) with a bitmap.]

[ASPLOS’94, MICRO’12 and HPCA’14]

2. Large Pages

[Figure: a large-page TLB entry maps one 2MB virtual page (VPN0) to one physical frame (PFN0).]

[Transparent Huge Pages and libhugetlbfs]

3. Direct Segments

[Figure: a direct segment maps the virtual range (BASE, LIMIT) into physical memory at a fixed OFFSET; everything outside the segment uses paging.]

[ISCA’13 and MICRO’14]

Can we get the best of many worlds?

[Table: Multipage Mapping, Large Pages, Direct Segments, and Our Proposal, compared on five properties: flexible alignment, arbitrary reach, multiple entries, transparency to applications, and applicability to all workloads. Each prior approach lacks at least one; our proposal targets all five.]

Key Observation

[Figure: a process's virtual address space (Code, Heap, Stack, Shared Lib.) mapped into physical memory.]

1. Large contiguous regions of virtual memory
2. Limited in number: only a handful

Compact Representation: Range Translation

[Figure: Range Translation 1 maps the contiguous virtual range (BASE1, LIMIT1) into physical memory at offset OFFSET1.]

Range translation: a mapping from a contiguous range of virtual pages to contiguous physical pages with uniform protection.
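As a sketch, a range translation can be modeled as a record holding the BASE/LIMIT/OFFSET fields named on the slide (the class and its methods are illustrative, not the hardware interface; the addresses below are made up):

```python
from dataclasses import dataclass

@dataclass
class RangeTranslation:
    base: int        # first virtual address in the range (BASE)
    limit: int       # one past the last virtual address (LIMIT)
    offset: int      # physical address = virtual address + offset (OFFSET)
    protection: str  # uniform protection bits for the whole range

    def covers(self, vaddr: int) -> bool:
        return self.base <= vaddr < self.limit

    def translate(self, vaddr: int) -> int:
        assert self.covers(vaddr)
        return vaddr + self.offset

# a 1GB heap-like region at virtual 0x10000000, placed at physical 0x40000000
rt = RangeTranslation(base=0x10000000, limit=0x10000000 + (1 << 30),
                      offset=0x40000000 - 0x10000000, protection="rw")
print(hex(rt.translate(0x10000123)))  # 0x40000123
```

A single entry like this covers what would otherwise take thousands of 4KB TLB entries.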

Redundant Memory Mappings

[Figure: a process's virtual address space covered redundantly by Range Translations 1–5, each mapping a contiguous region into physical memory.]

Map most of a process's virtual address space redundantly, with a modest number of range translations in addition to the ordinary page mappings.

Outline

Motivation
Design: Redundant Memory Mappings
– A. Caching Range Translations
– B. Managing Range Translations
– C. Facilitating Range Translations
Results
Conclusion

A. Caching Range Translations

[Figure: the translation hierarchy for virtual address bits V47–V12: an L1 DTLB, backed by an L2 DTLB and a Range TLB, in turn backed by an enhanced page table walker that produces physical address bits P47–P12.]

[Figure (animation): an access that hits in the L1 DTLB completes immediately. On an L1 miss, the L2 DTLB and the Range TLB are probed in parallel; a hit in either one refills the L1 DTLB.]

[Figure: Range TLB detail. Each entry i holds BASE i, LIMIT i, OFFSET i, and Protection i; a lookup checks BASE i ≤ virtual address < LIMIT i across all entries in parallel. On a hit, the L1 TLB entry generator logic produces a page entry from (virtual address + OFFSET i) with Protection i.]
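A minimal software model of the Range TLB lookup and entry-generator logic above (field names follow the figure; the dict layout and addresses are mine):

```python
def range_tlb_lookup(entries, vaddr):
    """Probe every range entry (hardware compares all of them in parallel);
    on a hit, generate the 4KB L1 TLB entry for the page containing vaddr."""
    for e in entries:
        if e["base"] <= vaddr < e["limit"]:  # BASE <= VA < LIMIT comparators
            page = vaddr & ~0xFFF            # 4KB page holding vaddr
            return {"vpn": page >> 12,                  # virtual page number
                    "pfn": (page + e["offset"]) >> 12,  # (VA + OFFSET) frame
                    "prot": e["prot"]}                  # uniform protection
    return None  # Range TLB miss: fall back to the page table walker

# one large range mapped at a fixed physical offset (addresses are made up)
entries = [{"base": 0x10000000, "limit": 0x50000000,
            "offset": 0x30000000, "prot": "rw"}]
print(range_tlb_lookup(entries, 0x10001234))
```

Note the hit produces an ordinary per-page L1 entry, which is what lets the L1 DTLB stay unchanged.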

[Figure: when the L1 DTLB misses and both the L2 DTLB and the Range TLB also miss, the enhanced page table walker handles the translation.]

B. Managing Range Translations

• Stores all the range translations in an OS-managed structure
• Per-process, like the page table

Range Table

[Figure: range table entries RTA–RTG organized as a tree, with the root pointed to by a CR-RT register analogous to CR3.]

On an L2 + Range TLB miss, which structure should be walked?
A) Page Table  B) Range Table  C) Both A) and B)  D) Either?

Is the virtual page part of a range? That is not known at miss time.

Redundancy to the rescue: one bit in each page table entry denotes whether the page is part of a range.

[Figure: on a miss, (1) the page table walk proceeds as usual via CR3, (2) the translation is inserted into the L1 TLB, and (3) the application resumes its memory access. If the entry's range bit is set, a range table walk via CR-RT runs in the background and inserts the range into the Range TLB.]
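The miss path can be sketched as follows (plain dicts stand in for the hardware and OS structures; the "in_range" field plays the role of the extra page-table bit, and the addresses are made up):

```python
def handle_tlb_miss(vaddr, page_table, range_table, l1_tlb, range_tlb):
    """Sketch of the miss path: walk the page table first, then the
    range table in the background if the range bit is set."""
    vpn = vaddr >> 12
    pte = page_table[vpn]            # (1) ordinary page table walk (CR3)
    l1_tlb[vpn] = pte["pfn"]         # (2) insert into L1 TLB
    # (3) the application resumes here; the rest happens in the background
    if pte["in_range"]:              # the extra PTE bit: page is in a range
        rng = next(r for r in range_table
                   if r["base"] <= vaddr < r["limit"])  # range table walk
        range_tlb.append(rng)        # insert into Range TLB
    return pte["pfn"]

page_table = {0x10001: {"pfn": 0x40001, "in_range": True}}
range_table = [{"base": 0x10000000, "limit": 0x10400000, "offset": 0x30000000}]
l1_tlb, range_tlb = {}, []
pfn = handle_tlb_miss(0x10001234, page_table, range_table, l1_tlb, range_tlb)
print(hex(pfn), len(range_tlb))  # 0x40001 1
```

Because the page table walk never waits on the range table, the latency-critical path is unchanged from ordinary paging.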

C. Facilitating Range Translations

Demand Paging

[Figure: demand paging allocates physical pages one at a time, on first touch.] Demand paging does not provide the physical page contiguity needed for range creation.

Eager Paging

[Figure: eager paging allocates contiguous physical pages up front.] Allocate physical pages when virtual memory is allocated: this increases range sizes and reduces the number of ranges.
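A toy frame allocator illustrates why eager allocation yields fewer, larger ranges (entirely illustrative; real eager paging lives in the kernel's allocation paths):

```python
def demand_alloc(free_frames, n):
    """Demand paging: frames are taken one at a time as pages are first
    touched, so a region's frames need not be physically contiguous."""
    return [free_frames.pop(0) for _ in range(n)]

def eager_alloc(free_frames, n):
    """Eager paging: reserve a contiguous run of frames up front, when the
    virtual memory is allocated, so the region forms a single range."""
    for i in range(len(free_frames) - n + 1):
        run = free_frames[i:i + n]
        if run == list(range(run[0], run[0] + n)):  # physically contiguous?
            del free_frames[i:i + n]
            return run
    raise MemoryError("no contiguous run of %d frames" % n)

def count_ranges(frames):
    """Each maximal contiguous run needs its own range translation."""
    return 1 + sum(1 for a, b in zip(frames, frames[1:]) if b != a + 1)

free = [3, 7, 8, 9, 10, 15]
demand = demand_alloc(list(free), 4)  # takes 3, 7, 8, 9 in arrival order
eager = eager_alloc(list(free), 4)    # finds the contiguous run 7, 8, 9, 10
print(count_ranges(demand), count_ranges(eager))  # 2 1
```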

Outline

Motivation
Design: Redundant Memory Mappings
Results
– Methodology
– Performance Results
– Virtual Contiguity
Conclusion

Methodology

• Measure cost of page walks on real hardware
– Intel 12-core Sandy Bridge with 96GB memory
– 64-entry L1 TLB + 512-entry, 4-way associative L2 TLB for 4KB pages
– 32-entry, 4-way associative L1 TLB for 2MB pages

• Prototyped eager paging and an emulator in Linux v3.15.5
– BadgerTrap for online analysis of TLB misses and Range TLB emulation

• Linear model to predict performance
• Workloads
– Big-memory workloads, SPEC 2006, BioBench, PARSEC
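The slides do not spell the linear model out; a common form, assumed here, attributes overhead to the fraction of cycles spent in page walks and scales runtime linearly with cycles (the function names and numbers are mine):

```python
def overhead(walk_cycles, total_cycles):
    """Fraction of execution time attributed to page walks."""
    return walk_cycles / total_cycles

def predicted_cycles(measured_cycles, old_walk_cycles, new_walk_cycles):
    """Linear model: swap the measured walk cycles for the walk cycles the
    emulated configuration would incur; everything else is unchanged."""
    return measured_cycles - old_walk_cycles + new_walk_cycles

print(overhead(41, 100))             # 0.41, i.e. a 41% overhead
print(predicted_cycles(141, 41, 1))  # 101
```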

Comparisons

• 4KB: Baseline using 4KB paging
• THP: Transparent Huge Pages using 2MB paging [Transparent Huge Pages]
• CTLB: Clustered TLB with clusters of 8 4KB entries [HPCA’14]
• DS: Direct Segments [ISCA’13 and MICRO’14]
• RMM: Our proposal, Redundant Memory Mappings [ISCA’15]

Performance Results

[Figure: execution time overhead (0–45%) of 4KB, CTLB, THP, DS, and RMM on cactusADM, canneal, graph500, mcf, and tigr. 4KB and THP are measured using performance counters; CTLB, DS, and RMM are modeled based on the emulator.]

5 of 14 workloads shown; the rest are in the paper.
Assumptions: CTLB is 512-entry fully associative; RMM's Range TLB is 32-entry fully associative; both are probed in parallel with the L2 TLB.

Overheads of using 4KB pages are very high.

Clustered TLB works well, but is limited by its 8x reach.

2MB pages help with 512x reach, but overheads are still not very low.

Direct Segments are perfect for some workloads (0.00–0.06% overhead) but not all.

RMM achieves low overheads (0.00–1.06%) robustly across all workloads.

Why low overheads? Virtual Contiguity

Benchmark    Paging: # of pages (4KB + 2MB, THP)    Ideal RMM: # of ranges    # of ranges to cover >99% of memory
cactusADM    1365 + 333                             112                       49
canneal      10016 + 359                            77                        4
graph500     8983 + 35725                           86                        3
mcf          1737 + 839                             55                        1
tigr         28299 + 235                            16                        3

Paging requires 1000s of TLB entries; applications need only 10s–100s of ranges, and only a few ranges cover 99% of memory.

Summary

• Problem: Virtual memory overheads are high
• Proposal: Redundant Memory Mappings
– Propose a compact representation called range translation
– Range translation: an arbitrarily large contiguous mapping
– Effectively cache, manage, and facilitate range translations
– Retain the flexibility of 4KB paging

• Result:
– Reduces overheads of virtual memory to less than 1%

Questions?
