Page 1: CS 758: Advanced Topics in Computer Architecturepages.cs.wisc.edu/.../lecture/cs758-fall19-intro.pdf · •Overview on the programming model/software •Programming guides as references

CS 758: Advanced Topics in Computer Architecture

Lecture #1: Introduction, Administration, & Motivation

Professor Matthew D. Sinclair

Some of these slides were developed by Tim Rogers at Purdue University, Tor Aamodt at the University of British Columbia, and Wen-mei Hwu & David Kirk at the University of Illinois at Urbana-Champaign.

Slides enhanced by Matt Sinclair

About me

• Assistant Professor

  • Started in 2018

• Worked for a few hardware and software companies in the past

  • National Instruments, Qualcomm, NVIDIA Research, AMD Research

• Research on accelerator architectures and systems

• Lots of work on GPGPUs

My goals for this class

• (Some) Practical programming experience with GPUs

• CS/ECE/EMA/ME 759 focuses on this in detail

• Our focus is on hardware and systems

• Brief overview on programming first to inform hardware discussion

• In-depth understanding of accelerator architecture

  • Primary focus: GPUs and ML Accelerators

• Paper readings on accelerator architectures

• Practical experience with a simulator

• How to run simulator experiments in CHTC (Condor)

Course Structure

• Overview on the programming model/software

• Programming guides as references

• In-depth exploration of GPU architecture and tradeoffs

  • Reading research papers

• T. M. Aamodt, W. L. Fung, & T. G. Rogers. 2018. General-Purpose Graphics Processor Architectures. Synthesis Lectures on Computer Architecture.

• Alternative: H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, and W.-M. Hwu. Performance Analysis & Tuning for General-Purpose Graphics Processing Units (GPGPU). Synthesis Lectures on Computer Architecture.

• Look at machine learning accelerators and limits of acceleration

  • Reading research papers

Course Prerequisites

• CS/ECE 552 Basic Architecture

  • Logic: gates, Boolean functions, latches, memories

  • Datapath: ALU, register file, memory interface, muxes

  • Control: single-cycle control, micro-code

  • (Simple) Caches & pipelining

  • Some familiarity with assembly language

• CS/ECE 752 & 757: Advanced computer architecture concepts

  • Advanced memory systems, out-of-order processing, branch prediction

  • Memory consistency, cache coherence, interconnect networks

  • Optional but encouraged

• CS/ECE/EMA/ME 759: GPU programming

  • Helpful but hopefully not necessary (hopefully complementary)

Grading Scheme

• Class Participation: 5%

• Midterm Exams: 20% each

• Homework: 10%

• Paper Reviews: 10%

• Final Project: 35%

Paper Readings & Reviews

• Each class will have at least 1 assigned paper (or book chapter)

• Roughly 1 paper review per week to go with readings

  • See course website for more details

• Upload your reviews to Canvas

• First paper review due next Tuesday 9/10

Homework Assignments

• Homework 0 (potentially)

• Learn how to write a basic program in CUDA

  • Poll: Needed?

• 1st one: running a simple CUDA program in a GPU simulator

  • Force you to learn and use CUDA + open-source GPU simulators

  • Access to GPUs in euler cluster (hopefully not needed)

• 2nd one:

  • Use open-source GPGPU-Sim, implement a portion of an academic research paper

  • Run the code you write for assignment #1 on the simulator easily

  • Use CHTC to run experiments in parallel

• 3rd one:

  • ML algorithms

Midterms

• Two midterms: 10/8/19 and 11/26/19, 7:15-9:15 PM

Final Project

• Groups of 2

• Opportunity to be creative

  • Good projects completely evaluate an idea and compare it to prior work

• Very good projects could be submitted to a workshop

• Excellent projects could be submitted as a conference paper (with a bunch of extra work)

• Additional details on course webpage

Computing Resources

• All students will have access to GPUs in euler cluster

  • 100s of GPUs

• All students should have had CHTC (Condor) accounts created

  • If you joined late, you may not have one; please let me know

• Special guest lecture by CHTC folks – some details TBA

Goal: Learn skills that are transferrable to your subsequent research

In-Class Activity: Hameed 2010

• With a partner, answer the following questions:

  • Why not create an accelerator for every application?

  • How do we avoid becoming the steel industry?

    • “It’s all about the cupholders” – Mark Horowitz

  • x86 has CISC instructions + SIMD already, why not just use that?

  • Can mere mortals program this?

• In 5 minutes we’ll come back together and discuss as a class

Moore’s Law

Silver Bullet for Moore’s Law?

• Parallelism

• Specialization

Fundamental tradeoff: Programmability vs. Efficiency

[Figure: axes are ease of programming and hardware efficiency. Labeled points: single-core OoO superscalar CPU; (OoO) multicore, 4 to 20 threads; GPU, 32K threads; ASIC. Annotations: "Better" and "(how to get here?)".]

Why accelerators?

• Hardware acceleration is everywhere

  • Specialized chips for video/image encoding/decoding

• Machine learning specific accelerators (including autonomous agents)

• Network accelerators

• Cryptography

• Bitcoin mining

• Genomics

• Database accelerators

• …

You can tape out efficient hardware for lots of specific problems

What is a programmable accelerator?

• However, these are not generally programmable!

• Many custom accelerators have knobs, configuration registers, etc. that allow you to “program” them.

• But you cannot run arbitrary code on them

  • They are not Turing complete

This class focuses on accelerators that can execute (mostly) arbitrary code

The rise of accelerators

• On the general-purpose front

• CMOS compute frequency has reached its limits

• ILP is mostly mined out

  • Branch predictors, caches, memory dependency prediction have done great things

• However, these are energy-hungry operations

• For the time being, we are still getting more transistors

  • NVIDIA Volta V100: 21B transistors, 120 TFLOPS, 900 GB/s Memory BW

• Many important workloads are highly parallel

  • Machine learning is one very popular example

The future is in acceleration.
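As a rough illustration (not taken from the slides), the V100 figures quoted on this slide imply how many floating-point operations a kernel must perform per byte of memory traffic before compute, rather than memory bandwidth, becomes the limit. The sketch assumes both numbers are peak rates and that they can be compared directly; the name `flops_per_byte` is hypothetical.

```python
def flops_per_byte(peak_flops: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Operations a kernel must do per byte moved to balance compute and memory."""
    if mem_bandwidth_bytes_per_s <= 0:
        raise ValueError("memory bandwidth must be positive")
    return peak_flops / mem_bandwidth_bytes_per_s


# Figures quoted on the slide for the NVIDIA Volta V100.
V100_PEAK_FLOPS = 120e12   # 120 TFLOPS
V100_MEM_BW = 900e9        # 900 GB/s

balance = flops_per_byte(V100_PEAK_FLOPS, V100_MEM_BW)
print(f"{balance:.0f} FLOPs per byte")
```

Under these assumptions, a kernel that performs far fewer operations per byte than this ratio will tend to be limited by memory bandwidth, which is one reason data movement gets so much attention in accelerator design.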

Hameed 2010

• Authors’ goal: reach ASIC-type performance without an ASIC

  • Profile the different pieces of the app: 5 functions account for > 99% of total exec time

• Analyze each piece: figure out how to optimize

H.264 Encoder Pipeline

Stages, in order: IME → FME → IP → DCT/Quant → CABAC

• IME: find closest match

• FME: refine initial match

• IP: predict current block

• DCT/Quant: transform and quantize

• CABAC: encode coefficients

Annotations on the figure: dominate compute time; mostly SIMD friendly; compute heavy, SIMD friendly; fuse with IP; control-flow heavy; not SIMD friendly, irregular

Use Amdahl’s Law to optimize
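A minimal sketch of the Amdahl's Law calculation this slide points to: if a fraction f of the runtime is accelerated by a factor s, overall speedup is 1 / ((1 - f) + f / s). The function name and the example numbers are hypothetical, not measurements from the paper.

```python
def amdahl_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Overall speedup when only part of the runtime is accelerated."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    serial_part = 1.0 - fraction_accelerated
    return 1.0 / (serial_part + fraction_accelerated / local_speedup)


# Hypothetical example: a stage taking 90% of the runtime is made 10x faster.
overall = amdahl_speedup(0.9, 10.0)
print(f"overall speedup: {overall:.2f}x")

# Even an enormous local speedup is capped by the untouched 10%.
capped = amdahl_speedup(0.9, 1e9)
print(f"upper bound: {capped:.2f}x")
```

This is why the pipeline is profiled stage by stage first: the stages that dominate compute time set the ceiling on what any single optimization can deliver.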

Hameed 2010 (Cont.)

• Authors’ goal: reach ASIC-type performance without an ASIC

  • Profile the different pieces of the app: 5 functions account for > 99% of total exec time

• Analyze each piece: figure out how to optimize

• Used Tensilica processor

  • Key underlying idea: extensible ISA

• Allowed them to implement many optimizations for a specific app

Hameed 2010 Key Takeaways

• Start with H264 encoder application (advanced video coding)

• Add 16-wide SIMD extensions → 10X performance, 7X energy

• Custom fused instructions → further 1.4X (1.1 – 1.9X)

• Why?

  • Specific, CISC-like instructions needed → each app needs specific instructions

• Custom datapath widths → different for each app

• Data movement is king

Still 50X less efficient than an ASIC

Modern accelerators and apps exploit similar insights!
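To see how the performance factors on this slide combine, the sketch below multiplies them. It assumes the SIMD and fused-instruction gains are independent and compose multiplicatively, which the slide does not state explicitly; `combined_speedup` is a hypothetical helper.

```python
import math


def combined_speedup(*factors: float) -> float:
    """Multiply independent speedup factors into one overall factor."""
    return math.prod(factors)


SIMD_SPEEDUP = 10.0            # 16-wide SIMD extensions, from the slide
FUSED_TYPICAL = 1.4            # custom fused instructions, from the slide
FUSED_RANGE = (1.1, 1.9)       # reported range for fused instructions

typical = combined_speedup(SIMD_SPEEDUP, FUSED_TYPICAL)
low = combined_speedup(SIMD_SPEEDUP, FUSED_RANGE[0])
high = combined_speedup(SIMD_SPEEDUP, FUSED_RANGE[1])

print(f"typical: {typical:.1f}x, range: {low:.1f}x to {high:.1f}x")
```

Under that assumption the two steps together give roughly 14X in performance, which helps put the remaining 50X efficiency gap to an ASIC in perspective.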

What was a GPU?

• GPU = Graphics Processing Unit

  • Accelerator for raster-based graphics (OpenGL, DirectX, Vulkan)

• Highly programmable

• Commodity hardware

• 100’s of ALUs; 10000’s of concurrent threads

Today the name GPU is not really meaningful. In reality they are highly parallel, highly programmable vector supercomputers.

Modern GPUs: Good at drawing triangles

Highly Parallel Operation

Requires Significant Memory Bandwidth

[Figure: pixel color is the result of running a “shader” program]

GPU: The Life of a Triangle

Pipeline stages, in order:

• Host / Front End / Vertex Fetch: process commands

• Vertex Processing: transform vertices to screen-space

• Primitive Assembly, Setup: generate per-triangle equations

• Rasterize & Zcull: generate pixels, delete pixels that cannot be seen

• Pixel Shader: determine the colors, transparencies and depth of the pixel

• Pixel Engines (ROP): do final hidden surface test, blend and write out color and new depth

Supporting blocks: Texture, Frame Buffer Controller

[David Kirk / Wen-mei Hwu]

Today: GPUs are Ubiquitous

[APU13 keynote]

Why use a GPU for computing?

• A GPU uses a larger fraction of its silicon for computation than a CPU.

• At peak performance, a GPU uses an order of magnitude less energy per operation than a CPU.

CPU: 2 nJ/op

GPU: 200 pJ/op

Rewrite application → order of magnitude more energy efficient

However, the application must perform well
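The order-of-magnitude claim follows directly from the two per-operation energy figures on this slide. The sketch below works through the arithmetic; the workload size of 10^12 operations is a hypothetical number chosen only for illustration.

```python
CPU_ENERGY_PER_OP_J = 2e-9      # 2 nJ/op, from the slide
GPU_ENERGY_PER_OP_J = 200e-12   # 200 pJ/op, from the slide


def total_energy_joules(num_ops: float, energy_per_op_j: float) -> float:
    """Energy to execute num_ops operations at a fixed cost per operation."""
    return num_ops * energy_per_op_j


energy_ratio = CPU_ENERGY_PER_OP_J / GPU_ENERGY_PER_OP_J

NUM_OPS = 1e12  # hypothetical workload size
cpu_j = total_energy_joules(NUM_OPS, CPU_ENERGY_PER_OP_J)
gpu_j = total_energy_joules(NUM_OPS, GPU_ENERGY_PER_OP_J)

print(f"CPU/GPU energy per op: {energy_ratio:.1f}x")
print(f"CPU: {cpu_j:.0f} J, GPU: {gpu_j:.0f} J")
```

The ratio only holds at peak performance; an application that runs poorly on the GPU spends more time, and therefore more energy, per useful operation.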

GPU uses larger fraction of silicon for computation than CPU?

[Figure: CPU and GPU block diagrams built from Control, ALU, Cache, and DRAM blocks] [NVIDIA]

Growing Interest in GPGPU

• Supercomputing – Green500.org, Nov 2014: “the top three slots of the Green500 were powered by three different accelerators with number one, L-CSC, being powered by AMD FirePro™ S9150 GPUs, [...] powered by NVIDIA K20x GPUs. Beyond these top three, the next 20 supercomputers were also accelerator-based.”

• Deep Belief Networks map very well to GPUs (e.g., Google keynote at 2015 GPU Tech Conf.)

http://blogs.nvidia.com/blog/2015/03/18/google-gpu/

http://www.ustream.tv/recorded/60071572

“Machine learning is the manna sent to GPUs from heaven” - Industry Researcher

GPGPUs vs. Vector Processors

• Similarities at hardware level between GPU and vector processors.

• (Arguably) the SIMT programming model moves the hardest problem, detecting parallelism, from the compiler to the programmer.

GPU Compute Programming Model

[Figure: CPU and GPU]

How is this system programmed (today)?

GPGPU Programming Model

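[Figure: timeline of CPU and GPU activity. The CPU spawns a kernel, the GPU runs it and signals done, the CPU resumes, and the cycle repeats.]

• CPU “off-loads” parallel kernels to GPU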

• Transfer data to GPU memory

• GPU HW spawns threads

• Need to transfer result data back to CPU main memory
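The four steps on this slide can be modeled in plain Python. This is a CPU-only simulation for illustration, not real GPU code: the "device" buffers are ordinary lists, the threads run one after another in a loop, and the names `vector_add_kernel` and `offload_vector_add` are hypothetical. In CUDA, the copies would typically be `cudaMemcpy` calls and the loop would be a kernel launch.

```python
def vector_add_kernel(thread_id: int, a: list, b: list, out: list) -> None:
    """Work done by one logical GPU thread: a single output element."""
    out[thread_id] = a[thread_id] + b[thread_id]


def offload_vector_add(host_a: list, host_b: list) -> list:
    """Simulate off-loading a vector add from the CPU to a GPU."""
    if len(host_a) != len(host_b):
        raise ValueError("inputs must have the same length")
    n = len(host_a)

    # 1. Transfer input data to (simulated) GPU memory.
    dev_a = list(host_a)
    dev_b = list(host_b)
    dev_out = [0] * n

    # 2. Launch the kernel: the hardware would spawn one thread per element.
    #    Here the threads are simulated sequentially.
    for thread_id in range(n):
        vector_add_kernel(thread_id, dev_a, dev_b, dev_out)

    # 3. Transfer the result back to CPU main memory.
    return list(dev_out)


result = offload_vector_add([1, 2, 3], [10, 20, 30])
print(result)  # [11, 22, 33]
```

The explicit copies in steps 1 and 3 are the cost this slide highlights: the off-load pays for itself only when the kernel does enough work to outweigh moving the data.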

SIMT Execution Model

• Programmer sees MIMD threads (scalar)

• GPU bundles threads into warps (wavefronts) and runs them in lockstep on SIMD hardware

• An NVIDIA warp groups 32 consecutive threads together (AMD wavefronts group 64 threads together)

• Aside: Why “Warp”? In the textile industry, the term “warp” refers to “the threads stretched lengthwise in a loom to be crossed by the weft” [Oxford Dictionary].

[https://en.wikipedia.org/wiki/Warp_and_woof]
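The grouping of consecutive threads into warps described on this slide amounts to integer division. The sketch below assumes threads are numbered linearly from 0 within a block; `warp_and_lane` and `num_warps` are hypothetical helper names, and the group sizes of 32 and 64 come from the slide.

```python
NVIDIA_WARP_SIZE = 32      # threads per warp, from the slide
AMD_WAVEFRONT_SIZE = 64    # threads per wavefront, from the slide


def warp_and_lane(thread_id: int, warp_size: int = NVIDIA_WARP_SIZE) -> tuple:
    """Return (warp index, lane within the warp) for a linear thread id."""
    if thread_id < 0:
        raise ValueError("thread_id must be non-negative")
    return divmod(thread_id, warp_size)


def num_warps(num_threads: int, warp_size: int = NVIDIA_WARP_SIZE) -> int:
    """Warps needed to hold num_threads; the last warp may be partly empty."""
    return -(-num_threads // warp_size)  # ceiling division


print(warp_and_lane(70))                       # (2, 6)
print(warp_and_lane(70, AMD_WAVEFRONT_SIZE))   # (1, 6)
print(num_warps(100))                          # 4
```

Threads that share a warp index execute in lockstep on the SIMD hardware, so a block of 100 threads occupies four warps even though the last one has only four active lanes.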

For Next Class

• Review Esmaeilzadeh 2011

• Office Hours Poll Due

Final Thought

• Who has heard of Spectre and Meltdown?

• Are GPUs affected by these?