5.1 GPU Architecture (CS 6823-003, Spring 2012 @ ASU)



    GPU Architecture


    CPU & APU Architecture


    CPU Computing

    CPU performance is the product of many related advances:

    Increased transistor density

    Increased transistor performance

    Wider data paths

    Pipelining

    Superscalar execution

    Speculative execution

    Caching

    Chip- and system-level integration


    Bandwidth Gravity of Modern Computer Systems

    The bandwidth between key components ultimately dictates system performance

    This is especially true for massively parallel systems processing massive amounts of data

    Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases

    Ultimately, performance falls back to what the speeds and feeds dictate

    Key components must both perform well themselves and cooperate well with each other
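
    A rough way to see this "gravity" in practice is to time a streaming copy that is far larger than any cache; the sketch below (plain C++, with an illustrative buffer size) reports effective bandwidth, which on most machines tracks DRAM speed rather than core clock rate.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // Buffers much larger than any cache, so DRAM bandwidth dominates.
        const std::size_t n = std::size_t(1) << 26;   // 64 Mi floats = 256 MiB per buffer
        std::vector<float> src(n, 1.0f), dst(n);

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i];                          // streaming read + write
        auto t1 = std::chrono::steady_clock::now();

        double sec   = std::chrono::duration<double>(t1 - t0).count();
        double bytes = 2.0 * n * sizeof(float);       // bytes read plus bytes written
        std::printf("check %.0f, effective bandwidth %.1f GB/s\n",
                    double(dst[n / 2]), bytes / sec / 1e9);
    }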


    Superscalar Execution

    Superscalar and, by extension, out-of-order execution is one solution that has been included in CPUs for a long time

    Out-of-order scheduling logic requires a substantial area of the CPU die to maintain dependence information and queues of instructions to deal with dynamic schedules throughout the hardware

    Speculative instruction execution, necessary to expand the window of out-of-order instructions that can execute in parallel, results in inefficient execution of throwaway work
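
    A minimal sketch of the instruction-level parallelism such hardware hunts for (function names are illustrative): the first loop is one serial dependence chain, while the second keeps four independent accumulators that an out-of-order core can have in flight at the same time.

    #include <cstddef>

    // One serial chain: every add must wait for the previous result.
    double sum_chained(const double* a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }

    // Four independent chains: the out-of-order scheduler can overlap them,
    // so this version usually runs closer to the FP units' throughput limit.
    double sum_split(const double* a, std::size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];     s1 += a[i + 1];
            s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; ++i) s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }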


    Out-of-order Execution


    VLIW

    VLIW is a heavily compiler-dependent method for increasing instruction-level parallelism in a processor

    VLIW moves the dependence analysis work into the compiler
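
    A small illustration of the static analysis a VLIW compiler performs (hypothetical machine, illustrative code): the first three statements below are mutually independent, so a wide-enough VLIW target could pack them into one instruction word, while the final statement depends on their results and must be scheduled into a later bundle.

    void vliw_candidate(const float* x, const float* y, float* z, float a, float b) {
        // Independent operations: eligible for the same VLIW bundle.
        float t0 = a * x[0];
        float t1 = b + y[0];
        float t2 = x[1] - y[1];
        // Depends on t0, t1, and t2: must go into a later instruction word.
        z[0] = t0 + t1 + t2;
    }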


    SIMD and Vector Processing

    SIMD, and its generalization in vector parallelism, approaches improved efficiency by having the same operation performed on multiple data elements
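
    A short SIMD sketch using x86 SSE intrinsics (assuming an SSE-capable CPU and compiler; names and the multiple-of-4 length are illustrative): a single instruction applies the same add to four packed floats.

    #include <xmmintrin.h>   // SSE intrinsics

    // Element-wise add of two float arrays; n is assumed to be a multiple of 4.
    void add_arrays(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);              // load 4 floats
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));   // one add, four lanes
        }
    }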


    Multithreading


    Two Threads Scheduled in Time Slice Fashion


    Multi-Core Architectures

    A large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum
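
    A minimal sketch of that latency-hiding idea with standard C++ threads (thread count and delay are illustrative; the sleep stands in for a long memory or I/O stall): each worker is slow on its own, but because many run concurrently the total time stays near one stall, not the sum of all of them.

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int num_workers = 16;              // more threads than typical core counts
        std::vector<std::thread> workers;
        auto start = std::chrono::steady_clock::now();

        for (int i = 0; i < num_workers; ++i)
            workers.emplace_back([] {
                // Stand-in for a long-latency operation (cache miss, I/O, ...).
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            });
        for (auto& t : workers) t.join();

        double sec = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("elapsed: %.2f s for %d overlapped stalls\n", sec, num_workers);
    }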


    AMD Bobcat and Bulldozer CPUs

    Bobcat (left) follows a traditional approach, mapping functional units to cores in a low-power design

    Bulldozer (right) combines two cores within a module, offering sharing of some functional units


    The AMD Radeon HD6970 GPU

    The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half

    The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 caches and local data shares (LDS)


    CPU and GPU Architectures


    Niagara 2 CPU from Sun/Oracle

    Relatively similar to the GPU design


    E350 "Zacate" AMD APU

    Two "Bobcat" low-power x86 cores

    Two 8-wide SIMD cores with five-way VLIW units

    Connected via a shared bus and a single interface to DRAM


    Intel Sandy Bridge

    Intel combines four Sandy Bridge x86 cores with an improved version of its embedded graphics processor

    The concept is the same as the devices in that category from AMD


    Intel's Core i7 Processor

    Four CPU cores with simultaneous multithreading

    Made with 45nm process technology

    Each chip has 731 million transistors and consumes up to 130 W of thermal design power


    Intel's Nehalem

    Four-wide superscalar

    Out-of-order, speculative execution

    Simultaneous multithreading

    Multiple branch predictors

    On-die power gating

    On-die memory controllers

    Large caches

    Multiple interprocessor interconnects


    Today's Intel PC Architecture: Single Core System

    FSB connection between processor and Northbridge (82925X)

    Memory Control Hub

    Northbridge handles primary PCIe to video/GPU and DRAM.

    PCIe x16 bandwidth at 8 GB/s (4 GB/s each direction; see the arithmetic note after this list)

    Southbridge (ICH6RW) handles other peripherals
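
    The 8 GB/s figure is consistent with first-generation PCIe lane rates (an assumption matching this 82925X-era platform): 2.5 GT/s per lane with 8b/10b encoding gives 250 MB/s of payload per lane per direction, so 16 lanes x 250 MB/s = 4 GB/s each way, or 8 GB/s aggregate.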


    Today's Intel PC Architecture: Dual Core System

    Bensley platform

    Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates NB/SB functions

    FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/s per DIMM

    PCIe links form the backbone; PCIe device upstream bandwidth is now equal to downstream

    Workstation version has an x16 GPU link via the Greencreek MCH

    Two CPU sockets; Dual Independent Buses to the CPUs, each basically a FSB (Front-Side Bus)

    CPU feeds at 8.5-10.5 GB/s per socket, compared to 6.4 GB/s for the current Front-Side Bus CPU feed

    PCIe bridges to legacy I/O devices

    Source: http://www.2cpu.com/review.php?id=109


    Today's AMD PC Architecture

    AMD HyperTransport Technology bus replaces the Front-Side Bus architecture

    HyperTransport similarities to PCIe: packet-based switching network, dedicated links for both directions

    Shown in 4-socket configuration, 8 GB/s per link

    Northbridge/HyperTransport is on die

    Glueless logic to DDR, DDR2 memory

    PCI-X/PCIe bridges (usually implemented in Southbridge)


    Today's AMD PC Architecture

    Torrenza technology

    Allows licensing of coherent HyperTransport to 3rd-party manufacturers to make socket-compatible accelerators/co-processors

    Allows 3rd-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently

    Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory


    HyperTransport Feeds and Speeds

    Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe

    HyperTransport 1.0 Specification

    800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)

    HyperTransport 2.0 Specification

    Added PCIe mapping

    1.0 - 1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)

    HyperTransport 3.0 Specification

    1.8 - 2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)

    Added AC coupling to extend HyperTransport to long-distance system-to-system interconnect
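
    The aggregate figures above are consistent with a maximum-width 32-bit link transferring on both clock edges: per-direction bandwidth = clock rate x 2 transfers/clock x 4 bytes, giving 6.4 GB/s at 800 MHz, 11.2 GB/s at 1.4 GHz, and 20.8 GB/s at 2.6 GHz; doubling for the two directions yields 12.8, 22.4, and 41.6 GB/s.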

    Courtesy HyperTransport Consortium. Source: White Paper, AMD HyperTransport Technology-Based System Architecture