
Page 1

Shared Memory Optimizations for Multi-core CPU systems

Abdulla,Shakeel

Nguyen,Quocdat

Page 2

Agenda

• Motivation
• Fundamental Problem
• Our proposal
• Simulation Results and Discussion
• Other Data-Parallel Algorithms/Applications and CUDA Technology
• CUDA discussion

Page 3

Why Multi-cores?

• Pushing transistors to higher clock frequencies is increasingly challenging, both physically and economically.
  – Higher thermal density has to be managed through better cooling systems.
  – Process and fabrication technology have to mature before a new technology can be mass-produced.
• The application space doesn't follow Moore's law.
  – It demands an enhanced digital lifestyle and an improved user experience.
  – It demands better system security.
  – It demands better multi-tasking capability.
  – Simply put, users need better-performing systems.

Page 4

Why do embedded multi-core CPUs use far more SRAM than DRAM?

• SRAMs have relatively lower power.
  – No refresh cycles, hence no power wasted on refresh.
  – No power wasted on pre-charge cycles.
• Even to access a few columns, the entire row has to be pre-charged in a DRAM.
• DRAM reads are destructive, so power is wasted writing the data back to its original location.
• SRAMs have better reliability (think health care and NASA).
  – Better immunity against single-event upsets.
  – ECC can be incorporated easily.
• No need for a complex memory controller to track the state of the memory.
• Lower access latency.
  – Due to row conflicts, even a single memory access can force a DRAM to wait several clocks for a pre-charge (usually ~30 to 40 ns). SRAMs need no pre-charge, and there are no restrictions on which memory locations can be accessed. (A back-of-the-envelope latency model follows below.)
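To make the latency argument concrete, here is a minimal C sketch of expected DRAM access latency as a hit/conflict-weighted average. The timing numbers and row-buffer hit rate are assumptions for illustration, not figures from the slides or any datasheet.

#include <stdio.h>

int main(void) {
    double t_row_hit  = 10.0;   /* ns: CAS-only access when the row is already open (assumed) */
    double t_conflict = 45.0;   /* ns: precharge + activate + CAS on a row conflict (assumed) */
    double p_hit      = 0.6;    /* assumed row-buffer hit rate */

    /* Expected DRAM latency is the weighted average of the two cases;
       an SRAM access is roughly flat by comparison. */
    double t_avg = p_hit * t_row_hit + (1.0 - p_hit) * t_conflict;
    printf("expected DRAM access latency: %.1f ns\n", t_avg);
    return 0;
}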

Page 5

Threads, Temporal and Spatial Locality

• Thread
  – An independent piece of code that performs some function. A CPU can run more than one thread at a time.
  – Chip-level multi-processing (CMP): Intel dual core (not dual processor).
  – Simultaneous multi-threading (SMT): hyper-threading.
• Block/Chunk
  – A fixed-length range of memory addresses that can be accessed consecutively.
• Temporal locality
  – Tendency of a program to access the same memory block as its previous access.
• Spatial locality
  – Tendency of a program to access memory blocks near the one it accessed previously. (Both kinds are illustrated in the loop sketch below.)
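As an illustration of the two definitions, this C sketch walks the same hypothetical array two ways; the array name and sizes are illustrative.

#define N 1024
static int a[N][N];   /* hypothetical data set */

long sum_row_major(void) {
    long sum = 0;                  /* temporal locality: sum is reused every iteration */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];        /* spatial locality: consecutive addresses */
    return sum;
}

long sum_col_major(void) {
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];        /* strided: jumps N ints per access, poor spatial locality */
    return sum;
}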

Page 6

Memory Subsystem Clocking

• Standard approach used in the industry

Page 7

What is STFMA?

• Stall-time fair memory access.
  – In a system of threads, we cannot allow any particular thread to dwell on a page for an unfairly long time, even if that thread has higher priority than the others.
  – Unfair thread stalling hurts the real-time behavior of the system: lower-priority threads can be starved indefinitely, which makes predicting latencies for those threads much more difficult. (A minimal sketch of the fairness idea follows below.)
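A minimal sketch of the underlying idea from the STFM work [6], assuming each thread's slowdown is estimated as its stall time in the shared system over its estimated stall time running alone; the threshold, the input arrays, and the fallback are simplifying assumptions, not the paper's exact mechanism.

#define NUM_THREADS 4

int pick_next_thread(const double t_shared[NUM_THREADS],
                     const double t_alone[NUM_THREADS],
                     double fairness_threshold) {
    double max_s = 0.0, min_s = 1e30;
    int victim = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        double s = t_shared[i] / t_alone[i];     /* slowdown of thread i */
        if (s > max_s) { max_s = s; victim = i; }
        if (s < min_s) { min_s = s; }
    }
    /* Unfairness = max slowdown / min slowdown. Past the threshold,
       override normal priorities and serve the most-slowed-down thread;
       otherwise keep the baseline policy (placeholder: thread 0). */
    return (max_s / min_s > fairness_threshold) ? victim : 0;
}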

Page 8

What is POI?

• Each CPU can run multiple threads, and each thread exhibits a certain probability of accessing each page.
• Page conflicts occurring among threads give rise to POI.
• What happens during POI?
  – Reduced MIPS for the entire system.
  – Increased latency for one or more threads, depending on the arbitration priority. (A Monte Carlo sketch of page conflicts follows below.)
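To make page conflicts concrete, here is a hedged Monte Carlo sketch that estimates how often two or more CPUs pick the same page in a cycle, assuming each CPU chooses one of N pages uniformly at random; M, N, and the trial count are illustrative.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int M = 4, N = 8;        /* CPUs and pages, matching our simulated system */
    const int TRIALS = 1000000;
    int conflicts = 0;
    srand(1);
    for (int t = 0; t < TRIALS; t++) {
        int used[8] = {0};         /* one slot per page (N = 8) */
        int clash = 0;
        for (int c = 0; c < M; c++) {
            int page = rand() % N;         /* uniform page choice */
            if (used[page]++) clash = 1;   /* a second CPU hit the same page */
        }
        conflicts += clash;
    }
    printf("estimated conflict probability: %.3f\n", (double)conflicts / TRIALS);
    return 0;
}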

Page 9

Fundamental Problem

• If we don't know the PMF, can we still optimize the memory accesses of the entire system to minimize the POI?
• Can we make the system stall-time fair, thereby making it more predictable? This is essential for a real-time embedded system.
• Can we reduce the clock frequency of the memory subsystem?

Page 10

Improving the memory subsystem

Each CPU can now run at its maximum speed, and the memory does not need to run at the maximum speed.

Page 11

Statistical Assumptions on Threads

• Each thread is an independent random variable (RV) in the observation interval.
• The spatial and temporal locality of a thread causes its memory accesses to follow a certain probability mass function (PMF).
• We don't know in advance what PMF a thread will follow; it depends on the nature of the program. (A sketch of sampling accesses from an assumed PMF follows below.)
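A small sketch of this modeling assumption: page accesses drawn from an assumed, locality-skewed PMF via inverse-transform sampling. The PMF values and names are illustrative.

#include <stdlib.h>

#define N_PAGES 8

/* Draw one page index from a PMF by inverse-transform sampling. */
int sample_page(const double pmf[N_PAGES]) {
    double u = (double)rand() / RAND_MAX;
    double cum = 0.0;
    for (int p = 0; p < N_PAGES; p++) {
        cum += pmf[p];
        if (u <= cum) return p;
    }
    return N_PAGES - 1;   /* guard against floating-point rounding */
}

/* Illustrative PMF with strong locality: page 0 is "hot". */
const double hot_pmf[N_PAGES] =
    {0.65, 0.15, 0.10, 0.04, 0.03, 0.01, 0.01, 0.01};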

Page 12

Solutions

• If we can break down the temporal locality of the program, we can achieve better thread-level parallelism (TLP).
  – Increase the number of pages.
• If we can convert the given distribution into a uniform distribution, we can achieve STFMA.
  – Apply polynomial transformations (PTs) on the RV.
  – Apply permutation-based transformations on the RV. (A sketch of one such transformation follows below.)
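A minimal sketch in the spirit of the permutation-based page interleaving of Zhang, Zhu, and Zhang [4]: XOR the page-index bits of an address with the bits just above them. The field widths and function name are assumptions.

#include <stdint.h>

#define PAGE_BITS 3   /* 2^3 = 8 pages, matching the simulated system */

/* For any fixed upper address, XOR with a constant key is a bijection
   on the page index, so the mapping is one-to-one and wastes no memory. */
uint32_t permute_page(uint32_t addr) {
    uint32_t mask = (1u << PAGE_BITS) - 1;
    uint32_t page = addr & mask;                   /* original page index */
    uint32_t key  = (addr >> PAGE_BITS) & mask;    /* bits above the index */
    return (addr & ~mask) | (page ^ key);          /* permuted page index */
}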

Page 13

N-page, M-CPU parallel system under a uniform distribution.

N > M is assumed at all times.

Page 14

What do PTs do?

Page 15

Address Transformation Properties

• Should be one-to-one.
  – No wastage of memory.
• Should be easy to implement (low complexity).
  – Can't let the software handle this.
• Should work on any type of PMF.
  – Should be a very high-entropy transformation. (A quick one-to-one check appears below.)
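A quick check of the first property, assuming the permute_page() sketch from the Solutions slide is in scope: for any fixed upper address bits, the transformation must permute the page indices, so every transformed index appears exactly once per group.

#include <stdint.h>
#include <assert.h>

uint32_t permute_page(uint32_t addr);   /* from the Solutions-slide sketch */

void check_one_to_one(uint32_t base) {
    int seen[8] = {0};                   /* one slot per page index */
    for (uint32_t off = 0; off < 8; off++) {
        uint32_t p = permute_page((base & ~7u) | off) & 7u;
        assert(!seen[p]);                /* a repeat would mean a collision */
        seen[p] = 1;
    }
}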

Page 16

Results on PTs

Page 17

Page 18

Page 19

Page 20

Multi-core simulation results

Page 21

Our system: 4 CPUs, 8 pages.

The theoretical best is 0.29.

Page 22

Motivation for Massively Parallel Processors

• A quiet revolution and potential build-up (GPU vs. CPU):
  – Computation: 367 GFLOPS vs. 32 GFLOPS.
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s.
  – Prior to CUDA, GPUs were programmed through graphics APIs.

Page 23

What are GPU and GPGPU?

• GPU: Graphics Processing Unit (graphics card).
• GPGPU: general-purpose computing that uses the GPU for applications other than normal graphics-processing tasks.
• Data-parallel algorithms exploit GPU attributes:
  – Large data arrays and streaming throughput.
  – Fine-grain SIMD concept (SPMD).
  – Low-latency floating-point computation.

Page 24

What is CUDA?

• Nvidia's CUDA: "Compute Unified Device Architecture".
  – A software stack and a driver for loading computational programs onto the GPU.
  – The GPU is viewed as a co-processor to a host.
  – Users kick off batches of threads on the GPU.

Page 25

CUDA Programming Model

• A kernel is executed as a grid of thread blocks.
  – All threads in a grid share the same data memory space.
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared-memory accesses.
  – Sharing data through a low-latency shared memory.
• Threads from different blocks cannot cooperate. (A minimal kernel illustrating block-level cooperation follows the figure below.)

[Figure: the host launches Kernel 1 and Kernel 2 on the device. Each kernel runs as a grid of thread blocks, e.g., Grid 1 with Blocks (0,0) through (2,1); each block, e.g., Block (1,1), contains Threads (0,0) through (4,2).]
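A minimal CUDA kernel sketching this cooperation: threads of one block stage data in shared memory, synchronize, then each reads a value written by a different thread (here: reversing the tile). The kernel name and block size are illustrative.

// Per-block cooperation through __shared__ memory and __syncthreads().
__global__ void reverse_block(int *d_data) {
    __shared__ int tile[256];                       // low-latency, per-block memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    tile[threadIdx.x] = d_data[i];
    __syncthreads();                                // hazard-free: all writes visible
    d_data[i] = tile[blockDim.x - 1 - threadIdx.x];
}
// Launch with 256-thread blocks: reverse_block<<<numBlocks, 256>>>(d_data);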

Page 26

CUDA Device Memory Space

• Local memory: per-thread.
  – Private to each thread.
  – Holds automatic variables and register spills.
• Shared memory: per-block.
  – Shared by the threads of the same block.
  – Used for inter-thread communication.
• Global memory: per-application.
• The host can read/write global, constant, and texture memory. (A host-side sketch follows the figure below.)

[Figure: a device grid containing Blocks (0,0) and (1,0); each block has its own shared memory, and each thread, e.g., Threads (0,0) and (1,0), has its own registers and local memory. Constant, texture, and global memory are shared across the grid and accessible from the host.]
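A hedged host-side sketch of that last point, using the standard cudaMemcpy and cudaMemcpyToSymbol calls; the names and sizes are illustrative.

__constant__ float coeffs[16];   // device constant memory (hypothetical table)

void upload(const float *h_in, const float *h_coeffs, int n) {
    float *d_buf;
    cudaMalloc((void**)&d_buf, n * sizeof(float));            // global memory
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));     // constant memory
    // ... launch kernels that read d_buf (global) and coeffs (constant) ...
    cudaFree(d_buf);
}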

Page 27

Extended C

• CUDA compilation trajectory: the integrated source (foo.cu) is split by cudacc (EDG C/C++ frontend, Open64 Global Optimizer) into CPU host code (foo.cpp), compiled by gcc/cl, and GPU assembly (foo.s), compiled by OCG into G80 SASS (foo.sass).

Page 28

CUDA Programming: Function Call Examples

• Allocate and free memory for matrix Md:

cudaMalloc((void**)&Md.elements, size);
cudaFree(Md.elements);

• Call a kernel function (thread creation):

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);      // 5000 thread blocks
dim3 DimBlock(4, 8, 8);     // 256 threads per block
size_t SharedMemBytes = 64; // 64 bytes of shared memory per block
KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);
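Putting the calls above together, a minimal end-to-end sketch; error checking is omitted, a 1-D block is used for simplicity, and the kernel body is illustrative rather than from the slides.

__global__ void KernelFunc(float *d_elems) {
    // Flatten the 2-D grid into a global element index.
    int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    d_elems[i] *= 2.0f;                    // trivial per-element work
}

int main(void) {
    int n = 100 * 50 * 256;                // one element per thread
    float *d_elems;
    cudaMalloc((void**)&d_elems, n * sizeof(float));
    dim3 DimGrid(100, 50);                 // 5000 thread blocks
    dim3 DimBlock(256);                    // 256 threads per block (1-D for simplicity)
    KernelFunc<<<DimGrid, DimBlock>>>(d_elems);
    cudaDeviceSynchronize();               // wait for the kernel to finish
    cudaFree(d_elems);
    return 0;
}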

Page 29

References

[1] A Performance Comparison of DRAM Memory System Optimizations for SMT Processors, Zhichun Zhu and Zhao Zhang, Proceedings of the 11th Int'l Symposium on High Performance Computer Architecture (HPCA-11), 2005.

[2] Cached DRAM for ILP Processor Memory Access Latency Reduction, Zhao Zhang, Zhichun Zhu, Xiaodong Zhang , IEEE MICRO 2001.

[3] Quantifying Locality in the Memory Access Patterns of HPC Applications, Jonathan Weinberg, Michael O. McCracken, Allan Snavely, Erich Strohmaier, ACM Journal 2005.

[4] A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality, Zhao Zhang, Zhichun Zhu, Xiaodong Zhang, Proceedings of 33rd Annual International Symposium on Microarchitecture, (Micro-33).

[5] Memory Access Pattern Analysis, Mary Brown, Roy M. Jenevein, Nasr Ullah, System Performance and Modeling and Simulation, IEEE Conference.

[6] Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors, Onur Mutlu, Thomas Moscibroda, Microsoft Research 2008.

[7] Pseudo-Randomly Interleaved Memory, B. Ramakrishna Rau, ACM journal, 1991.

[8] A Performance Comparison of Contemporary DRAM Architectures, Vinodh Cuppu, Bruce Jacob, Brian Davis, Trevor Mudge, IEEE Proceedings of the 26th International Symposium on Computer Architecture -1999.

[9] Computers as Components, Principles of Embedded Computing Design, Wayne Wolf, Second Edition, Morgan Kaufmann Publishers.

[10] A Model For Memory Interference In Multiprocessors, Stanley Rabinowitz, Missouri Journal of Mathematics , 1991.

[11] A Measure of Program Locality and Its Application, Richard B. Bunt, Jennifer M. Murphy, Shikharesh Majumdar, ACM journal 1984.

[12] A Study of Storage Partitioning Using a Mathematical Model of Locality, E.G.Coffman Jr, Thomas A Ryan Jr, 3rd ACM Symposium on Operating Systems, October 1971.

[13] Parallel Processing with CUDA, Nvidia’s High Performance Computing Platform Uses Massive Multithreading, Tom R Halfhill , Microprocessor Report, JAN-2008.

[14] Embedded DRAM: Technology platform for the Blue Gene/L chip, S. S. Iyer, J. E. Barth Jr., P. C. Parries, J. P. Norum, J. P. Rice, L. R. Logan, D. Hoyniak, IBM Journal of Research & Development, Vol. 49, No. 2/3, March/May 2005.

[15] Near-memory Caching for Improved Energy Consumption, Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé, Rami Melhem, Department of Computer Science, University of Pittsburgh, ICCD 2005.