Mars: A MapReduce Framework on Graphics Processors
Bingsheng He¹, Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.)
Naga K. Govindaraju (Microsoft Corp.), Tuyong Wang (Sina Corp.)
¹ Currently at Microsoft Research Asia
Graphics Processing Units
• Massively multi-threaded co-processors
  – 240 streaming processors on NV GTX 280
  – ~1 TFLOPS of peak performance
• High-bandwidth memory
  – 10+× the peak bandwidth of the main memory
  – 142 GB/s, 1 GB GDDR3 memory on GTX 280
Graphics Processing Units (Cont.)
• High-latency GDDR memory
  – ~200 clock cycles of latency
  – Latency hiding using a large number of concurrent threads (>8K)
  – Low context-switch overhead
• Better architectural support for memory
  – Inter-processor communication using a local memory
  – Coalesced access
• High-speed bus to the main memory
  – Current: PCI Express (4 GB/sec)
GPGPU
• Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05]
• FFT [Moreland 03, Horn 06]
• Matrix operations [Jiang 05]
• Folding@home, SETI@home
• Database applications
  – Basic operators [Naga 04]
  – Sorting [Govindaraju 06]
  – Join [He 08]
GPGPU Programming
• “Assembly languages”
  – DirectX, OpenGL
  – Graphics rendering pipelines
• “C/C++”
  – NVIDIA CUDA, ATI CAL, or Brook+
  – Different programming models
  – Low portability among hardware vendors: NV GPU code cannot run on AMD GPUs
• “Functional language”?
MapReduce
Without worrying about hardware details:
• Make GPGPU programming much easier.
• Fully harness the high parallelism and computational capability of GPUs.
MapReduce Functions
• Process lots of data to produce other data
• Input & output: a set of records in the form of key/value pairs
• The programmer specifies two functions:
  – map(in_key, in_value) -> emit list(intermediate_key, intermediate_value)
  – reduce(out_key, list(intermediate_value)) -> emit list(out_key, out_value)
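The two functions above can be illustrated with the classic word-count example. This is a minimal CPU-side sketch of the MapReduce semantics, not Mars code; the function names and the sequential driver are purely illustrative.

```python
# Word count expressed as the two user-specified MapReduce functions.

def map_fn(in_key, in_value):
    # in_key: e.g. a document name; in_value: its text.
    # Emits a list of (intermediate_key, intermediate_value) pairs.
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, values):
    # Collapses all intermediate values that share a key.
    return (out_key, sum(values))

def run(records):
    # Sequential stand-in for the runtime: map, group by key, reduce.
    intermediate = {}
    for k, v in records:
        for ik, iv in map_fn(k, v):
            intermediate.setdefault(ik, []).append(iv)
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())
```

For example, `run([("doc1", "a b a")])` yields `{"a": 2, "b": 1}`.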
MapReduce outside Google
• Hadoop [Apache project]
• MapReduce on multicore CPUs: Phoenix [HPCA '07, Ranger et al.]
• MapReduce on Cell ['07, Kruijf et al.]
• Merge [ASPLOS '08, Linderman et al.]
• MapReduce on GPUs [STMCS '08, Catanzaro et al.]
MapReduce on GPU

Software stack (top to bottom):
• Applications: Web Analysis, Data Mining
• MapReduce (Mars)
• GPGPU languages (CUDA, Brook+)
• Rendering APIs (DirectX)
• GPU Drivers
Limitations on GPUs
• Rely on the CPU to allocate memory– How to support variant length data?– How to allocate output buffer on GPUs?
• Lack of lock support– How to synchronize to avoid write conflict?
15/41
Data Structure for Mars

A record = <key, value, index entry>

Three arrays:
  Keys:          Key1 | Key2 | Key3 | …
  Values:        Value1 | Value2 | Value3 | …
  Index entries: Entry1 | Entry2 | Entry3 | …

An index entry = <key size, key offset, value size, value offset>

Supports variable-length records!
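The layout above can be sketched on the CPU as two flat byte buffers plus a directory of index entries. This is an illustrative simulation of the record layout, not Mars's actual GPU code; the helper names are assumptions.

```python
# Pack variable-length records into flat key/value buffers plus an index
# of <key size, key offset, value size, value offset> entries.

def build(records):
    keys, vals, index = bytearray(), bytearray(), []
    for k, v in records:
        # Record the sizes and current offsets before appending.
        index.append((len(k), len(keys), len(v), len(vals)))
        keys += k
        vals += v
    return bytes(keys), bytes(vals), index

def fetch(keys, vals, index, i):
    # Random access to record i via its index entry alone.
    ksz, koff, vsz, voff = index[i]
    return keys[koff:koff + ksz], vals[voff:voff + vsz]
```

Because each record is located purely through its index entry, threads can read any record without per-record headers or pointers.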
Lock-free scheme for result output

Basic idea: calculate the write offset for each thread on the output buffer.
Lock-free scheme example

Pick the odd numbers out of the array [1, 3, 2, 3, 4, 7, 9, 8] using four threads (T1–T4); the map function acts as a filter for odd numbers.

Step 1: Histogram. Each thread counts the odd numbers among its own elements: T1–T4 produce the counts 1, 2, 1, 1.

Step 2: Prefix sum. An exclusive prefix sum over the counts yields the write offsets 0, 1, 3, 4 and the total output size 5.

Step 3: Allocate. A single output buffer of 5 slots is allocated.

Step 4: Computation. Each thread writes its odd numbers starting at its own offset, producing the five odd values 1, 3, 3, 7, 9 with no write conflicts.
Lock-free scheme

1. Histogram on key size, value size, and record count.
2. Prefix sum on key size, value size, and record count.
3. Allocate the output buffer in GPU memory.
4. Perform computing.

Avoids write conflicts. Allocates the output buffer exactly once.
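The four-step scheme can be simulated sequentially on the CPU. This sketch reproduces the odd-number filter example, assuming a round-robin assignment of elements to threads (which matches the per-thread counts 1, 2, 1, 1 shown above); Mars's actual GPU kernels run these steps in parallel.

```python
# Simulate the lock-free output scheme: histogram, exclusive prefix sum,
# one-shot allocation, then conflict-free writes at per-thread offsets.

def lock_free_filter(data, nthreads):
    # Round-robin split of the input across "threads".
    chunks = [data[t::nthreads] for t in range(nthreads)]
    # Step 1: histogram -- each thread counts its own outputs.
    hist = [sum(1 for x in chunk if x % 2) for chunk in chunks]
    # Step 2: exclusive prefix sum gives each thread its write offset.
    offsets, total = [], 0
    for h in hist:
        offsets.append(total)
        total += h
    # Step 3: allocate the output buffer exactly once.
    out = [None] * total
    # Step 4: each thread writes at its own offset -- no locks needed.
    for t, chunk in enumerate(chunks):
        pos = offsets[t]
        for x in chunk:
            if x % 2:
                out[pos] = x
                pos += 1
    return hist, offsets, out
```

On the slide's input this yields the histogram [1, 2, 1, 1], the offsets [0, 1, 3, 4], and an output buffer holding all five odd values.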
Mars Workflow

Input
→ MapCount
→ Prefix sum
→ Allocate intermediate buffer on GPU
→ Map
→ Sort and Group
→ ReduceCount
→ Prefix sum
→ Allocate output buffer on GPU
→ Reduce
→ Output
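The whole pipeline can be sketched as a sequential driver: each counting pass (MapCount, ReduceCount) feeds a prefix sum that sizes a buffer before the real Map and Reduce passes run. This is an illustrative CPU simulation; the function parameters and names are assumptions, not the Mars API.

```python
# Sequential sketch of the Mars workflow: count, prefix-sum, allocate,
# compute -- once for the map stage and once for the reduce stage.

def exclusive_prefix_sum(counts):
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total

def mars(inputs, map_count, map_fn, reduce_count, reduce_fn):
    # MapCount + prefix sum: size the intermediate buffer up front.
    offsets, total = exclusive_prefix_sum([map_count(r) for r in inputs])
    inter = [None] * total
    for r, off in zip(inputs, offsets):
        for i, kv in enumerate(map_fn(r)):
            inter[off + i] = kv
    # Sort and group intermediate pairs by key.
    groups = {}
    for k, v in sorted(inter):
        groups.setdefault(k, []).append(v)
    items = list(groups.items())
    # ReduceCount + prefix sum: size the final output buffer.
    offsets, total = exclusive_prefix_sum([reduce_count(k, vs) for k, vs in items])
    out = [None] * total
    for (k, vs), off in zip(items, offsets):
        for i, kv in enumerate(reduce_fn(k, vs)):
            out[off + i] = kv
    return out
```

The key property is that every buffer is allocated exactly once, with its size known before any map or reduce instance writes a byte.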
Mars Workflow – Map Only

Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Output

Map only, without grouping and reduce.
Mars Workflow – Without Reduce

Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → Output

Map and grouping, without reduce.
APIs of Mars

User-defined:
• MapCount
• Map
• Compare (optional)
• ReduceCount (optional)
• Reduce (optional)

Runtime-provided:
• AddMapInput
• MapReduce
• EmitInterCount
• EmitIntermediate
• EmitCount (optional)
• Emit (optional)
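The division of labor between the two API groups can be sketched as follows. This is a hypothetical model, not the real Mars signatures: the counting pass reports output sizes through an EmitInterCount-style call, and the map pass emits the actual pairs through an EmitIntermediate-style call into the pre-sized buffer. All names and parameters here are assumptions for illustration.

```python
# Hypothetical sketch of the user-defined / runtime-provided contract.

class Runtime:
    # Stand-in for the runtime-provided emit functions.
    def __init__(self):
        self.sizes, self.pairs = [], []
    def emit_inter_count(self, key_size, val_size):
        # Counting pass: record only the sizes of future outputs.
        self.sizes.append((key_size, val_size))
    def emit_intermediate(self, key, val):
        # Map pass: write the actual key/value pair.
        self.pairs.append((key, val))

def map_count(rt, record):
    # User-defined counting pass: one (word, 4-byte count) per word.
    for word in record.split():
        rt.emit_inter_count(len(word), 4)

def map_fn(rt, record):
    # User-defined map pass: must emit exactly what map_count declared.
    for word in record.split():
        rt.emit_intermediate(word, 1)
```

The invariant the runtime relies on is that MapCount and Map agree exactly, so the prefix sum over the declared sizes yields correct, conflict-free write offsets.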
Mars-CPU

• Operating system's thread APIs
• Each map instance or reduce instance is a CPU thread.

Mars-GPU

• NVIDIA CUDA
• Each map instance or reduce instance is a GPU thread.
Optimizations According to CUDA Features

• Coalesced access
  – Multiple accesses to consecutive memory addresses are combined into one transfer.
• Built-in vector types (int4, char4, etc.)
  – Multiple small data items are fetched in one memory request.
Experimental Setup

• Comparison
  – CPU: Phoenix, Mars-CPU
  – GPU: Mars-GPU

                       CPU (P4 Quad)   GPU (NV GTX 8800)
  Processors (GHz)     2.66 × 4        1.35 × 128
  Cache size           8 MB            256 KB
  Bandwidth (GB/sec)   10.4            86.4
  OS                   Fedora Core 7.0
Applications
• String Match (SM): find the position of a string in a file. [S: 32 MB, M: 64 MB, L: 128 MB]
• Inverted Index (II): build an inverted index for links in HTML files. [S: 16 MB, M: 32 MB, L: 64 MB]
• Similarity Score (SS): compute the pair-wise similarity score for a set of documents. [S: 512×128, M: 1024×128, L: 2048×128]
Applications (Cont.)
• Matrix Multiplication (MM): multiply two matrices. [S: 512×512, M: 1024×1024, L: 2048×2048]
• Page View Rank (PVR): count the number of distinct page views from web logs. [S: 32 MB, M: 64 MB, L: 96 MB]
• Page View Count (PVC): find the top-10 hot pages in the web logs. [S: 32 MB, M: 64 MB, L: 96 MB]
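As one concrete workload, the Similarity Score kernel can be sketched as a map phase with one instance per document pair. This assumes the common feature-vector/cosine formulation of pairwise similarity; Mars's exact similarity kernel may differ.

```python
# One map instance computes the cosine similarity of one document pair.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def similarity_scores(docs):
    # Map phase: emit one score per unordered pair (i, j), i < j.
    return {(i, j): cosine(docs[i], docs[j])
            for i in range(len(docs)) for j in range(i + 1, len(docs))}
```

Since every pair is independent, the workload is embarrassingly parallel, which is why SS benefits strongly from the GPU's many threads.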
Conclusion

• MapReduce framework on GPUs
  – Ease of GPU application development
  – Performance acceleration
• Want a copy of Mars? http://www.cse.ust.hk/gpuqp/Mars.html