Modern Hardware for DBMS
Mrutyunjay (Mjay), University of Colorado, Denver

Upload: roland-mcdowell

Post on 29-Dec-2015


Page 1: Mrutyunjay (Mjay) University of Colorado, Denver

Modern Hardware for DBMS

Mrutyunjay (Mjay), University of Colorado, Denver

Page 2:

Motivation: Hardware Trends

• Multi-Core CPUs
• Many-core co-processors: GPUs (NVIDIA, AMD Radeon)
• Huge main-memory capacity with complex access characteristics (caches, NUMA)
• Non-volatile storage: Flash SSD (Solid State Drive)

Page 3:

Multi-Core CPU: Motivation

Around 2005, CPU clock frequencies hit a scaling wall. Further improvements came from adding multiple processing cores to the same CPU chip, forming chip multiprocessors, and from servers with multiple CPU sockets of multicore processors (an SMP of CMPs).

Page 4:

The Multi-Core Alternative

Use Moore's law to place more cores per chip:
• 2x cores/chip with each CMOS generation
• Roughly the same clock frequency
• Known as multi-core chips or chip multiprocessors (CMPs)

The good news:
• Exponentially scaling peak performance
• No power problems due to clock frequency
• Easier design and verification

The bad news:
• We need a parallel program if we want to run a single app faster
• Power density is still an issue as transistors shrink

Page 5:

Multi-Core CPU: Challenges

This is how we think it works.

This is how it ACTUALLY works.

Page 6:

Multi-Core CPU: Challenges

Type of cores:
• E.g., a few out-of-order (OOO) cores vs. many simple cores

Memory hierarchy:
• Which cache levels are shared and which are private
• Cache coherence
• Synchronization

On-chip interconnect:
• Bus vs. ring vs. scalable interconnect (e.g., mesh)
• Flat vs. hierarchical

Page 7:

Multi-Core CPU

All processors have access to a unified physical memory; they can communicate using loads and stores.

Advantages:
• Looks like a better multithreaded processor (multitasking)
• Requires only evolutionary changes to the OS
• Threads within an app communicate implicitly without involving the OS
• Simpler to code for, with low overhead
• App development: first focus on correctness, then on performance

Disadvantages:
• Implicit communication is hard to optimize
• Synchronization can get tricky
• Higher hardware complexity for cache management

Page 8:

NUMA Architecture

NUMA: Non-Uniform Memory Access

Page 9:

Many-Core: GPU / GPGPU

A GPU (Graphics Processing Unit) is a specialized microprocessor for accelerating graphics rendering.

GPUs were traditionally used for graphics computing; they now easily allow general-purpose computing as well.

GPGPU: using the GPU for general-purpose computing in physics, finance, biology, geosciences, medicine, etc.

Main vendors: NVIDIA and AMD (Radeon).

Page 10:

GPU vs CPU

GPU designs with up to thousands of cores enable massively parallel computing. The GPU architecture, built around streaming multiprocessors, takes the form of SIMD processors.


Page 11:

SIMD Processor

SIMD: Single Instruction Multiple Data

Two organizations: distributed-memory SIMD computers and shared-memory SIMD computers.

Page 12:

NVIDIA GPUs with SIMD Processors

• Each GPU has ≥ 1 Streaming Multiprocessors (SMs)
• Each SM has the design of a simple SIMD processor
• 8-192 Streaming Processors (SPs) per SM
• NVIDIA GeForce 8-Series GPUs and later

Page 13:

Questions from Previous Session

• SMP of CMP: SMP = sockets of multicore processors (multiple CPUs in a single system); CMP = chip multiprocessor (a single chip with multiple/many cores)
• SP: Streaming Processor
• SFU: Special Function Unit
• Double-Precision Unit
• Multithreaded Instruction Unit
• Hardware thread scheduling

Page 14:

GPU Cores

• 14 Streaming Multiprocessors per GPU
• 32 cores per Streaming Multiprocessor

Page 15:

Development tools for GPU

Two main approaches:

Another tool: OpenACC

Page 16:

What is CUDA?

• CUDA = Compute Unified Device Architecture
• A development framework for NVIDIA GPUs
• Extensions of the C language
• Supports NVIDIA GeForce 8-Series GPUs and later

Definitions:
• Host = CPU; host memory = RAM
• Device = GPU; device memory = RAM on the GPU
• Host and device are connected by the PCI Express bus

Page 17:

CUDA Compute Model

1. CPU sends data to the GPU
2. CPU instructs the processing on the GPU
3. GPU processes the data
4. CPU collects the results from the GPU

Page 18:

CUDA Example

1. CPU sends data to the GPU

2. CPU instructs the processing on GPU

3. GPU processes data

4. CPU collects the results from GPU

Host code:

    int N = 1000;
    int size = N * sizeof(float);
    float A[1000], *dA;

    cudaMalloc((void **)&dA, size);
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);

    /* 50 blocks x 20 threads = 1000 threads, one per element of A */
    ComputeArray <<< 50, 20 >>> (dA, N);

    cudaMemcpy(A, dA, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);

Device code:

    __global__ void ComputeArray(float *A, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            A[i] = A[i] * A[i];
    }

Page 19:

CUDA Example

• A kernel is executed as a grid of blocks
• A block is a batch of threads that can cooperate with each other by:
  – Sharing data through shared memory
  – Synchronizing their execution
• Threads from different blocks cannot cooperate

Page 20:

GPU Computation Challenges

• Limiting kernel launches
• Limiting data transfers (solution: overlapped transfers)

GPUs in Databases & Data Mining

GPU strengths are useful here:
• Memory bandwidth
• Parallel processing

Accelerating SQL queries has shown around 10x improvement. GPUs are also well suited for stream mining: continuous queries on streaming data instead of one-time queries on a static database.

Page 21:

Memory/Storage

Page 22:

Memory Hierarchy

The slowest parts are main memory and the fixed disk. Can we decrease the latency gap between main memory and the fixed disk?

Solution: the SSD.

Page 23:

SSD: A New Generation of Non-Volatile Memory

A Solid-State Drive (SSD) is a data storage device that emulates a hard disk drive (HDD) but has no moving parts. NAND flash SSDs are essentially arrays of flash memory devices plus a controller that electrically and mechanically emulates, and is software-compatible with, magnetic HDDs.

Page 24:

SSD: Architecture

• Host interface logic
• SSD controller
• RAM buffer
• Flash memory packages

Page 25:

Flash Memory

NAND-flash cells have a limited lifespan due to their limited number of P/E (Program/Erase) cycles.

What will be the initial state of SSD? Ans: Still looking for it.

Page 26:

SSD: Architecture

Page 27:

Read, Write and Erase

• Reads are aligned on page size: it is not possible to read less than one page at once. One can of course request just one byte from the operating system, but a full page will be retrieved inside the SSD, forcing far more data to be read than necessary.
• Writes are aligned on page size: writes happen in increments of the page size, so even if a write operation affects only one byte, a whole page is written anyway. Writing more data than necessary is known as write amplification.
• Pages cannot be overwritten: a NAND-flash page can be written only if it is in the "free" state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a "free" page, an operation called "read-modify-write".
• Erases are aligned on block size: once pages become stale, the only way to make them free again is to erase them, and it is only possible to erase whole blocks at once, not individual pages.

Page 28:

Example of Write:

Buffer small writes: to maximize throughput, whenever possible keep small writes in a RAM buffer, and when the buffer is full, perform a single large write that batches all the small writes.

Align writes: align writes on the page size, and write chunks of data that are multiples of the page size.

Page 29:

SSD: How it stores data?

Page 30:

SSD: How it stores data?

There is a latency difference for each cell type: more levels per cell increase the latency, delaying both reads and writes. Solution: a hybrid SSD consisting of mixed levels.

Page 31:

Garbage Collection

• The garbage collection process in the SSD controller ensures that "stale" pages are erased and restored to the "free" state so that incoming write commands can be processed.
• Split cold and hot data: hot data is data that changes frequently, cold data is data that changes infrequently. If some hot data is stored in the same page as some cold data, the cold data is copied along every time the hot data is updated in a read-modify-write operation, and is moved again during garbage collection for wear leveling. Splitting cold and hot data into separate pages as much as possible makes the garbage collector's job easier.
• Buffer hot data: extremely hot data should be buffered as much as possible and written to the drive as infrequently as possible.

Page 32:

Flash Translation Layer

• The main factor that made adoption of SSDs so easy is that they use the same host interfaces as HDDs. Presenting an array of Logical Block Addresses (LBAs) makes sense for HDDs, whose sectors can be overwritten, but it is not fully suited to the way flash memory works.
• For this reason, an additional component is required to hide the inner characteristics of NAND flash memory and expose only an array of LBAs to the host. This component, called the Flash Translation Layer (FTL), resides in the SSD controller.
• The FTL is critical and has two main purposes: logical block mapping and garbage collection.
• The mapping takes the form of a table that gives, for any LBA, the corresponding PBA. This mapping table is stored in the RAM of the SSD for speed of access, and is persisted in flash memory in case of power failure. When the SSD powers up, the table is read from the persisted version and reconstructed in the RAM of the SSD.

Page 33:

Internal Parallelism in SSDs

Internally, several levels of parallelism allow writing to several blocks at once in different NAND-flash chips, forming what is called a "clustered block". The levels of parallelism:

• Channel-level parallelism
• Package-level parallelism
• Chip-level parallelism
• Plane-level parallelism

Page 34:

Characteristics and latencies of NAND-flash memory

Page 35:

Advantages & Disadvantages

SSD advantages:
• Reads and writes are much faster than on a traditional HDD
• Allows PCs to boot up and launch programs far more quickly
• More physically robust
• Uses less power and generates less heat

SSD disadvantages:
• Lower capacity than HDDs
• Higher storage cost per GB
• Limited number of data write cycles
• Performance degradation over time

Page 37:

Questions???