2013/06/10 yun-chung yang kandemir, m., yemliha, t. ; kultursay, e. pennsylvania state univ.,...

Paper Presentation

2013/06/10 Yun-Chung Yang

Kandemir, M., Yemliha, T., Kultursay, E. Pennsylvania State Univ., University Park, PA, USA. Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 954–959

A Helper Thread Based Dynamic Cache Partitioning Scheme for Multithreaded Applications

Abstract
Related Work
Motivation
Difference between inter and intra application
Proposed Method
Experiment Result
Conclusion

Outline

This paper focuses on how to partition the cache space given to a multithreaded application across its threads. It shows that different threads of a multithreaded application can have different cache space requirements; proposes a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies; presents a comprehensive experimental analysis of the scheme; and reports average improvements of 17.1% and 18.6% on the SPECOMP and PARSEC suites, respectively.

Abstract

Related Work

Off-chip bandwidth [3, 10, 13]

Processor cores[6]

Resource Management

Shared cache[5, 4, 8, 11, 12, 17, 18, 20]

Application granularity

Intra-application shared cache[16]

This paper: improves on the cache layer problem

Run the applications facesim (PARSEC) and art (SPECOMP).

Apply six schemes and record the Average Memory Access Time (AMAT):
No-partition
Uniform
Nonuniform
Nonuniform-L2
Nonuniform-L3
Dynamic

Dynamic, which divides the application's execution into fixed epochs, outperforms the rest.

Motivation

Inter- and intra-application cache partitioning differ in both objective and implementation.

Intra-application cache partitioning tries to minimize the latency of the slowest thread. It is a problem for the runtime system or a dynamic compiler.

Inter-application cache partitioning tries to optimize workload throughput. It is an OS problem.

Difference between Inter & Intra App.

Dynamic Partition System

Helper Thread: its main responsibility is to partition the cache space allocated to the application so as to maximize the application's performance.

The Proposed Method

System Interfacing

Performance Monitoring

Performance Modeling

Each OS epoch is composed of many application epochs, and each application epoch is divided into 5 parts:
Performance Monitoring
Performance Modeling
Resource Partitioning
System Interfacing
Application Execution
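A minimal sketch of the epoch structure described above, assuming each of the five steps is a callable invoked once per application epoch. The names `PHASES`, `run_os_epoch`, and `handlers` are hypothetical and only illustrate the ordering; they are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# The five steps of one application epoch, in the order listed on the slide.
PHASES = [
    "performance monitoring",
    "performance modeling",
    "resource partitioning",
    "system interfacing",
    "application execution",
]


def run_os_epoch(app_epochs: int,
                 handlers: Dict[str, Callable[[int], None]]) -> List[Tuple[int, str]]:
    """Run `app_epochs` application epochs inside one OS epoch (illustrative only)."""
    trace = []
    for epoch in range(app_epochs):
        for phase in PHASES:
            handlers[phase](epoch)        # hypothetical hook for this step
            trace.append((epoch, phase))  # record the order for inspection
    return trace


# Placeholder handlers that do nothing; a real system would plug in its own logic.
handlers = {phase: (lambda epoch: None) for phase in PHASES}
trace = run_os_epoch(2, handlers)
```

The returned `trace` lists every (epoch, step) pair in execution order, which makes the nesting of application epochs inside an OS epoch easy to check.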

Proposed Method(cont.)

Use Average Memory Access Time (AMAT) as the measure of a thread's cache performance.

AMAT
The ratio of the total cycles spent on memory instructions to the total number of instructions
Depends on the cache partition size
Takes the different levels of cache into account
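The AMAT ratio above can be sketched as follows. This follows the slide's wording for the denominator (total number of instructions); the counter values and the names `amat`, `samples`, `per_thread`, and `slowest` are hypothetical.

```python
def amat(memory_cycles: int, instructions: int) -> float:
    """AMAT as worded on the slide: memory-instruction cycles / instruction count."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return memory_cycles / instructions


# Hypothetical counter samples per thread: (memory cycles, instructions).
samples = {0: (960, 120), 1: (2400, 120)}

per_thread = {tid: amat(cycles, count) for tid, (cycles, count) in samples.items()}

# The thread with the highest AMAT is the one an intra-application scheme targets.
slowest = max(per_thread, key=per_thread.get)
```

With these made-up counters, thread 1 has the higher AMAT and would be the candidate for more cache space.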

Performance Monitoring

Need to predict the impact of increasing or decreasing the cache space given to a thread.

Each thread is expressed as a 3D plot, with the X and Y axes giving the cache space allocated from L2 and L3, respectively.

For thread i, the value at each point d(sL2, sL3) is used to build the dynamic model for thread i.

Purpose – predict the performance of a thread
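A minimal sketch of a per-thread model over (sL2, sL3) points. The slides do not say how the surface is fitted, so inverse-distance weighting is used here purely as a stand-in assumption; `ThreadModel`, `observe`, and `predict` are hypothetical names.

```python
class ThreadModel:
    """Observed AMAT values for one thread, keyed by (L2 ways, L3 ways)."""

    def __init__(self):
        self.points = {}  # (s_l2, s_l3) -> observed AMAT

    def observe(self, s_l2: int, s_l3: int, amat: float) -> None:
        self.points[(s_l2, s_l3)] = amat

    def predict(self, s_l2: int, s_l3: int) -> float:
        """Estimate AMAT at an allocation, interpolating between observed points."""
        if not self.points:
            raise ValueError("no observations yet")
        if (s_l2, s_l3) in self.points:
            return self.points[(s_l2, s_l3)]
        num = den = 0.0
        for (x, y), value in self.points.items():
            # Assumption: closer observations count more (inverse squared distance).
            weight = 1.0 / ((x - s_l2) ** 2 + (y - s_l3) ** 2)
            num += weight * value
            den += weight
        return num / den


model = ThreadModel()
model.observe(2, 4, 12.0)  # hypothetical measurement: 2 L2 ways, 4 L3 ways
model.observe(4, 4, 8.0)   # hypothetical measurement: 4 L2 ways, 4 L3 ways
estimate = model.predict(3, 4)
```

The model is refreshed with new observations each epoch, so predictions track the thread's current behavior rather than a fixed profile.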

Performance Modeling

For the i-th L2 cache, qL2,i denotes the total number of cache ways allocated to this application.

These qL2,i ways are shared by mL2,i threads (from 0 to mL2,i).

The number of ways allocated to the k-th thread is denoted sL2,i(k).

Cache Space Partitioning

P[t] denotes the cache resources (number of ways in L2 and L3).
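The algorithm itself is not reproduced in this transcript. As an illustration only, the sketch below uses a simple greedy rule for one cache level: every thread gets one way, and each remaining way goes to the thread with the highest predicted AMAT, which matches the stated goal of helping the slowest thread. The function `partition_ways` and the `base / ways` model are assumptions, not the paper's method.

```python
from typing import Callable, Dict, List


def partition_ways(total_ways: int, threads: List[int],
                   predict: Callable[[int, int], float]) -> Dict[int, int]:
    """Greedy split of `total_ways` among threads (illustrative stand-in).

    `predict(thread, ways)` returns the predicted AMAT of `thread` with `ways` ways.
    """
    if total_ways < len(threads):
        raise ValueError("need at least one way per thread")
    alloc = {t: 1 for t in threads}
    for _ in range(total_ways - len(threads)):
        # Give the next way to the thread that is currently predicted slowest.
        slowest = max(threads, key=lambda t: predict(t, alloc[t]))
        alloc[slowest] += 1
    return alloc


# Hypothetical model: predicted AMAT falls as base / ways.
base = {0: 8.0, 1: 24.0}
alloc = partition_ways(8, [0, 1], lambda t, w: base[t] / w)
```

In this toy setup the thread with the larger `base` value ends up with most of the ways, and the two predicted AMAT values converge.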

Cache Space Partitioning Algorithm

New partition information is delivered to the OS using a system call.

Add a new instruction to the ISA

COID = core ID, CLVL = cache level, CAID = cache ID, W = 64-bit-wide way allocation
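A minimal sketch of the four operands as a data structure. It assumes W is a bitmask with one bit per cache way, which is an interpretation of "64-bit-wide way allocation" and not stated on the slide; `PartitionRequest` and `way_mask` are hypothetical names, and no instruction encoding is implied.

```python
from dataclasses import dataclass
from typing import Iterable


def way_mask(ways: Iterable[int]) -> int:
    """Build a 64-bit mask with one bit set per allocated way (assumed layout)."""
    mask = 0
    for way in ways:
        if not 0 <= way < 64:
            raise ValueError(f"way index {way} does not fit in 64 bits")
        mask |= 1 << way
    return mask


@dataclass(frozen=True)
class PartitionRequest:
    coid: int  # core ID
    clvl: int  # cache level, e.g. 2 for L2 or 3 for L3
    caid: int  # cache ID
    w: int     # way allocation, assumed to be a 64-bit mask

    def __post_init__(self):
        if not 0 <= self.w < (1 << 64):
            raise ValueError("w must fit in 64 bits")


# Hypothetical request: core 3 gets ways 0-3 of L2 cache 0.
req = PartitionRequest(coid=3, clvl=2, caid=0, w=way_mask(range(4)))
```

A mask keeps the request fixed-size regardless of how many ways are granted, which is one plausible reason for a 64-bit operand.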

System Interfacing

The experimental environment
Comparison with other schemes

Average Memory Access Time: the main target of the performance monitoring

Execution Cycles

What we want to know

SIMICS and GEMS are used to model the multicore architecture below.

Run SPECOMP and PARSEC applications. Use 120 million instructions as the application epoch.

Experiment Environment

Apply 8 schemes and record the average memory access time:
No-partition
Uniform – as evenly as possible for each core
Static Best – static partition for best result through exhaustive search
Dynamic – the proposed method
Dynamic-L2 – partition only L2
Dynamic-L3 – partition only L3
L2+L3 – a separate performance model for each one
Ideal – optimal strategy

Experiment Environment(cont.)

Improves performance
Shows the balancing of the data access latency of different threads: as execution goes on, they all end up at an AMAT of about 8 cycles.

Intra-application cache partitioning for multithreaded applications
Dynamic model, able to partition the cache across multiple layers
Average improvement of 17.1% in SPECOMP and 18.6% in PARSEC

My Comment
Reminds me of the importance of software and hardware cooperation.
Threads are a main issue in CMPs.

Conclusion
