Cache Performance, Interfacing, Multiprocessors
CPSC 321
Andreas Klappenecker
Today’s Menu
• Cache Performance
• Review of Virtual Memory
• Processor and Peripherals
• Multiprocessors
Cache Performance
Caching Basics
• What are the different cache placement schemes?
  • direct mapped
  • set associative
  • fully associative
• Explain how a 2-way set-associative cache with 4 sets works
  • to read the memory block at address addr, we search the set addr mod 4
  • the memory block could be in either element of the set
  • compare the tags with the upper n-2 bits of addr
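The lookup described above can be sketched in a few lines of Python (an illustration only; class and method names are mine, and FIFO replacement stands in for a real policy):

```python
class TwoWayCache:
    """Sketch of a 2-way set-associative cache with 4 sets."""
    NUM_SETS = 4
    WAYS = 2

    def __init__(self):
        # each set holds up to two (tag, block) entries
        self.sets = [[] for _ in range(self.NUM_SETS)]

    def lookup(self, addr):
        index = addr % self.NUM_SETS   # set = addr mod 4 (low 2 bits)
        tag = addr // self.NUM_SETS    # upper n-2 bits of addr
        for entry_tag, block in self.sets[index]:
            if entry_tag == tag:
                return block           # hit: tag matched in this set
        return None                    # miss

    def fill(self, addr, block):
        index = addr % self.NUM_SETS
        tag = addr // self.NUM_SETS
        ways = self.sets[index]
        if len(ways) == self.WAYS:
            ways.pop(0)                # evict oldest entry (FIFO stand-in)
        ways.append((tag, block))
```

Addresses 13 and 5 both map to set 1 (13 mod 4 = 5 mod 4 = 1) but have different tags, so both can be cached at once; a third address in the same set evicts one of them.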
Implementation of a Cache
Sketch an implementation of a 4-way set-associative cache
Measuring Cache Performance
• CPU cycle time
• CPU execution clock cycles (including cache hits)
• Memory-stall clock cycles (cache misses)
• CPU time = (CPU execution clock cycles + memory-stall clock cycles) x clock cycle time
• Memory-stall clock cycles:
  • read-stall cycles (RSC)
  • write-stall cycles (WSC)
  • memory-stall clock cycles = RSC + WSC
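The CPU-time relation translates directly into a small Python helper (a sketch; the parameter names are mine, not from the slides):

```python
def cpu_time(exec_cycles, read_stall_cycles, write_stall_cycles, cycle_time_s):
    """CPU time = (execution cycles + memory-stall cycles) x clock cycle time.

    Memory-stall cycles are the sum of read stalls and write stalls.
    """
    stall_cycles = read_stall_cycles + write_stall_cycles
    return (exec_cycles + stall_cycles) * cycle_time_s
```

For example, 10^6 execution cycles plus 6 x 10^4 stall cycles at a 2 ns cycle time give 2.12 ms of CPU time.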
Measuring Cache Performance
• Write-stall cycles [write-through scheme]:
• two sources of stalls:
  • write misses (usually require fetching the block)
  • write-buffer stalls (the write buffer is full when a write occurs)
• WSC is the sum of the two:
  WSC = (writes/program x write miss rate x write miss penalty) + write buffer stalls
• Read-stall cycles are computed similarly
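The WSC formula above, as a Python sketch (argument names are illustrative, not from the slides):

```python
def write_stall_cycles(writes, write_miss_rate, write_miss_penalty,
                       write_buffer_stalls):
    """WSC = writes x write miss rate x write miss penalty + buffer stalls."""
    return writes * write_miss_rate * write_miss_penalty + write_buffer_stalls
```

With 1000 writes, a 5% write miss rate, a 40-cycle miss penalty, and 20 buffer-stall cycles, this yields 2020 write-stall cycles.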
Cache Performance Example
• Instruction miss rate: 2%
• Data miss rate: 4%
• Assume a CPI of 2 without any memory stalls
• Miss penalty: 40 cycles for all misses
• Instruction count: I
• Instruction miss cycles = I x 2% x 40 = 0.80 x I
• gcc has 36% loads and stores
• Data miss cycles = I x 36% x 4% x 40 ≈ 0.58 x I
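Checking the example numerically (a sketch; the instruction count is normalized to 1, so all quantities are per instruction):

```python
I = 1.0                                     # normalized instruction count
instr_miss_cycles = I * 0.02 * 40           # 2% miss rate, 40-cycle penalty -> 0.80 I
data_miss_cycles = I * 0.36 * 0.04 * 40     # 36% loads/stores, 4% miss rate -> 0.576 I
cpi_with_stalls = 2 + instr_miss_cycles + data_miss_cycles  # base CPI 2 plus stalls
```

The effective CPI rises from 2 to about 3.38, so memory stalls add roughly 69% to the execution time.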
Review of Virtual Memory
Virtual Memory
• The processor generates virtual addresses
• Memory is accessed using physical addresses
• Virtual and physical memory are broken into blocks of memory, called pages
• A virtual page may be absent from main memory, residing on disk, or may be mapped to a physical page
Page Tables
[Figure: a page table indexed by virtual page number; each entry has a valid bit and either a physical page address (valid = 1) or a disk address (valid = 0)]

The page table maps each virtual page either to a page in main memory or to a page stored on disk
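This mapping can be sketched as a Python dictionary (an illustration only; the entries and the disk-address string are made up):

```python
# Each entry: virtual page number -> (valid bit, location).
# valid = 1: location is a physical page number in main memory.
# valid = 0: the page lives on disk; accessing it is a page fault.
PAGE_TABLE = {
    0: (1, 0x1A),
    1: (0, "disk-block-7"),
}

def translate(vpn):
    valid, location = PAGE_TABLE[vpn]
    if valid:
        return location                       # physical page number
    raise LookupError(f"page fault: fetch {location} from disk")
```

A real page table is an array indexed by virtual page number rather than a hash map, but the valid-bit logic is the same.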
Pages: virtual memory blocks
• Page faults: if data is not in memory, retrieve it from disk
  • huge miss penalty, so pages should be fairly large (e.g., 4KB)
  • reducing page faults is important (LRU is worth the price)
  • using write-through takes too long, so we use write-back
• Example: page size 2^12 = 4KB; 2^18 physical pages; main memory <= 1GB; virtual memory <= 4GB
Virtual address (32 bits): bits 31–12 hold the virtual page number, bits 11–0 the page offset
Physical address (30 bits): bits 29–12 hold the physical page number, bits 11–0 the page offset
Translation replaces the virtual page number with a physical page number; the page offset is copied unchanged
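The bit widths in the example follow from the page size (a Python sketch; constant names are mine):

```python
PAGE_OFFSET_BITS = 12                        # 2^12 = 4 KB pages
VIRT_BITS, PHYS_BITS = 32, 30                # 4 GB virtual, 1 GB physical

VPN_BITS = VIRT_BITS - PAGE_OFFSET_BITS      # 20 -> 2^20 virtual pages
PPN_BITS = PHYS_BITS - PAGE_OFFSET_BITS      # 18 -> 2^18 physical pages

def split_vaddr(vaddr):
    """Split a virtual address into (virtual page number, page offset)."""
    return vaddr >> PAGE_OFFSET_BITS, vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
```

For example, virtual address 0x12345678 splits into page number 0x12345 and offset 0x678.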
Page Faults
• Incredibly high penalty for a page fault
• Reduce the number of page faults by optimizing page placement
• Use fully associative placement
  • a full search of pages is impractical
  • pages are located by a full table that indexes the memory, called the page table
  • the page table resides in memory
Making Memory Access Fast
• Page tables slow us down
• Memory access will take at least twice as long:
  • access the page table in memory
  • access the page
• What can we do?

Memory access is local => use a cache that keeps track of recently used address translations, called a translation lookaside buffer (TLB)
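The idea can be sketched with a dictionary standing in for the TLB (illustrative names; a real TLB is a small set-associative hardware structure, not a hash map):

```python
tlb = {}  # small cache of recent translations: vpn -> ppn

def tlb_translate(vpn, page_table):
    """Translate a virtual page number, consulting the TLB first."""
    if vpn in tlb:
        return tlb[vpn]        # TLB hit: no page-table memory access needed
    ppn = page_table[vpn]      # TLB miss: walk the in-memory page table
    tlb[vpn] = ppn             # remember the translation for next time
    return ppn
```

After the first (miss) translation of a page, later accesses to the same page avoid the extra page-table memory access entirely.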
Making Address Translation Fast
A cache for address translations: translation lookaside buffer
[Figure: each TLB entry holds a valid bit, a tag (virtual page number), and a physical page address; on a TLB miss, the full page table maps the virtual page either to physical memory or to disk storage]
Processors and Peripherals
Collection of I/O Devices
Communication between I/O devices, the processor, and memory uses protocols on the bus and interrupts
[Figure: the processor and its cache connect over a memory–I/O bus to main memory and to I/O controllers for disks, graphics output, and a network; devices signal the processor via interrupts]
Impact of I/O on Performance
A benchmark executes in 100 seconds:
• 90 seconds CPU time
• 10 seconds I/O time
If the CPU improves by 50% per year for the next 5 years, how much faster does the benchmark run in 5 years?

CPU time in 5 years: 90/(1.5)^5 = 90/7.59 ≈ 11.85 seconds
Total time: 11.85 + 10 = 21.85 seconds, so the benchmark runs 100/21.85 ≈ 4.6 times faster (the I/O time is unchanged)
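The calculation is easy to check in Python (variable names are mine):

```python
cpu_s, io_s = 90.0, 10.0          # benchmark: 90 s CPU + 10 s I/O = 100 s
cpu_after = cpu_s / 1.5**5        # CPU improves 50%/year for 5 years
total_after = cpu_after + io_s    # I/O time does not improve
speedup = 100.0 / total_after     # overall speedup, limited by I/O
```

The CPU time shrinks to about 11.85 s, but the fixed 10 s of I/O caps the overall speedup near 4.6x, an Amdahl's-law effect.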
I/O Devices
• Very diverse devices
  — behavior (i.e., input vs. output)
  — partner (who is at the other end?)
  — data rate

Device            Behavior         Partner  Data rate (KB/sec)
Keyboard          input            human    0.01
Mouse             input            human    0.02
Voice input       input            human    0.02
Scanner           input            human    400.00
Voice output      output           human    0.60
Line printer      output           human    1.00
Laser printer     output           human    200.00
Graphics display  output           human    60,000.00
Modem             input or output  machine  2.00–8.00
Network/LAN       input or output  machine  500.00–6000.00
Floppy disk       storage          machine  100.00
Optical disk      storage          machine  1000.00
Magnetic tape     storage          machine  2000.00
Magnetic disk     storage          machine  2000.00–10,000.00
Communicating with Processor
• Polling
  • simple
  • the I/O device puts information in a status register
  • the processor retrieves the information
  • the status is checked periodically
• Interrupt-driven I/O
  • the device notifies the processor that it has completed some operation by causing an interrupt
  • similar to an exception, except that it is asynchronous
  • the processor must be notified of the device causing the interrupt
  • interrupts must be prioritized
I/O Example: Disk Drives
• To access data:
  — seek: position the head over the proper track (8 to 20 ms avg.)
  — rotational latency: wait for the desired sector (half a rotation on average, 0.5/RPM)
  — transfer: grab the data (one or more sectors), 2 to 15 MB/sec
[Figure: a disk with multiple platters; each platter surface is divided into tracks, and each track into sectors]
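The three-term access-time model can be sketched in Python (function and parameter names are mine; the example values are typical, not from the slides):

```python
def disk_access_ms(seek_ms, rpm, transfer_mb_per_s, kb):
    """Disk access time = seek + rotational latency + transfer.

    Rotational latency averages half a rotation: 0.5/RPM minutes.
    """
    rotational_ms = 0.5 / rpm * 60 * 1000          # half a rotation, in ms
    transfer_ms = (kb / 1024) / transfer_mb_per_s * 1000
    return seek_ms + rotational_ms + transfer_ms
```

For a 10 ms seek, a 7200 RPM disk (about 4.17 ms average rotational latency), a 5 MB/s transfer rate, and a 4 KB request, the total is roughly 14.9 ms; seek and rotation dominate small transfers.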
I/O Example: Buses
• Shared communication link (one or more wires)
• Difficult design:
  — may be a bottleneck
  — tradeoffs (buffers for higher bandwidth increase latency)
  — support for many different devices
  — cost
• Types of buses:
  — processor–memory (short, high speed, custom design)
  — backplane (high speed, often standardized, e.g., PCI)
  — I/O (lengthy, different devices, standardized, e.g., SCSI)
• Synchronous vs. asynchronous:
  — synchronous buses use a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
  — asynchronous buses don’t use a clock; they use handshaking instead
Asynchronous Handshake Protocol
• ReadReq: indicates a request to read from memory
• DataRdy: indicates that the data word is now ready on the data lines
• Ack: used to acknowledge the ReadReq or DataRdy signal of the other party
[Figure: timing diagram of the ReadReq, Data, Ack, and DataRdy lines, annotated with protocol steps 1–7]
Asynchronous Handshake Protocol
1. Memory sees ReadReq, reads the address from the data bus, raises Ack
2. I/O device sees Ack high, releases ReadReq and the data lines
3. Memory sees ReadReq low, drops Ack to acknowledge ReadReq
4. When memory has the data ready, it places the data from the read request on the data lines and raises DataRdy
5. I/O device sees DataRdy, reads the data from the bus, and signals that it has the data by raising Ack
6. Memory sees the Ack signal, drops DataRdy, and releases the data lines
7. When DataRdy goes low, the I/O device drops Ack to indicate that the transmission is over
Synchronous vs. Asynchronous Buses
• Compare the maximum bandwidth of a synchronous bus and an asynchronous bus
• Synchronous bus
  • clock cycle time of 50 ns
  • each transmission takes 1 clock cycle
• Asynchronous bus
  • requires 40 ns per handshake
• Find the bandwidth of each bus when performing one-word reads from a 200 ns memory
Synchronous Bus
1. Send address to memory: 50 ns
2. Read memory: 200 ns
3. Send data to device: 50 ns
4. Total: 300 ns
5. Max. bandwidth: 4 bytes/300 ns = 13.3 MB/second
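The arithmetic, checked in Python (a sketch; variable names are mine):

```python
cycle_ns, mem_ns = 50, 200
total_ns = cycle_ns + mem_ns + cycle_ns        # address + memory read + data = 300 ns
bandwidth_mb_s = 4 / (total_ns * 1e-9) / 1e6   # 4-byte word per read -> ~13.3 MB/s
```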
Asynchronous Bus
• Apparently much slower, because each step of the protocol takes 40 ns and the memory access takes 200 ns
• Notice that several steps overlap with the memory access time:
  • memory receives the address at step 1
  • it does not need to put the data on the bus until step 5
  • steps 2, 3, 4 can overlap with the memory access
• Step 1: 40 ns
• Steps 2, 3, 4: max(3 x 40 ns = 120 ns, 200 ns) = 200 ns
• Steps 5, 6, 7: 3 x 40 ns = 120 ns
• Total time: 360 ns
• Max. bandwidth: 4 bytes/360 ns = 11.1 MB/second
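The overlap argument, checked in Python (a sketch; variable names are mine):

```python
handshake_ns, mem_ns = 40, 200
step1 = handshake_ns                        # memory receives the address
steps234 = max(3 * handshake_ns, mem_ns)    # handshakes overlap the memory access
steps567 = 3 * handshake_ns                 # data transfer handshakes
total_ns = step1 + steps234 + steps567      # 40 + 200 + 120 = 360 ns
bandwidth_mb_s = 4 / (total_ns * 1e-9) / 1e6
```

Despite seven 40 ns steps, the overlap keeps the asynchronous bus within about 20% of the synchronous bus's bandwidth.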
Other important issues
• Bus arbitration:
  — daisy-chain arbitration (not very fair)
  — centralized arbitration (requires an arbiter), e.g., PCI
  — self-selection, e.g., NuBus used in the Macintosh
  — collision detection, e.g., Ethernet
• Operating system:
  — polling, interrupts, DMA
• Performance analysis techniques:
  — queuing theory
  — simulation
  — analysis, i.e., find the weakest link (see “I/O System Design”)
Overhead of Polling
Ways to Transfer Data between Memory and Device
Multiprocessors
Idea
Build powerful computers by connecting many smaller ones.
Multiprocessors
+ Good for timesharing
+ easy to realize
- difficult to write good concurrent programs
- hard to parallelize tasks
- mapping to architecture can be difficult
Questions
• How do parallel processors share data?
  — single address space
  — message passing
• How do parallel processors coordinate?
  — synchronization (locks, semaphores)
  — built into send/receive primitives
  — operating system protocols
• How are they implemented?
  — connected by a single bus
  — connected by a network
Shared Memory Multiprocessors
Symmetric multiprocessor (SMP)
Problems???
Distributed Memory Multiprocessors
• Distributed shared-memory multiprocessor
• Message passing multiprocessor
Multiprocessors
                           Global memory              Distributed memory
Common address space       Symmetric multiprocessor   Distributed shared-memory multiprocessor
Distributed address space  does not exist             Message-passing multiprocessor
Connection Network
• Static network
  • fixed connections between nodes
• Dynamic network
  • packet switching (packets are routed from sender to recipient)
  • circuit switching (a connection between nodes can be established by a crossbar or a switching network)
Static Connection Networks
Circuit Switching: Delta Networks
• Route from any input x to output y by selecting links determined by successive d-ary digits of y’s label.
• This process is reversible; we can route from output y back to x by following the links determined by successive digits of x’s label.
• This self-routing property allows for simple hardware-based routing of cells.
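Self-routing for the binary case (d = 2) can be sketched in a few lines of Python (an illustration with names of my choosing; at stage i the switch setting is bit y_{k-1-i} of the destination label):

```python
def delta_route(y, k):
    """Output-port choices (0 or 1), one per stage, for a packet
    destined to output y in a 2-ary, k-stage delta network."""
    return [(y >> (k - 1 - i)) & 1 for i in range(k)]
```

For destination 0b101 in a 3-stage network, the switches are set to ports 1, 0, 1 in turn; reversing the list with the source bits gives the route back, which is the reversibility property noted above.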
[Figure: routing example in a binary delta network; at each stage the switch setting is consumed from the next digit of the destination label y = y_{k-1} … y_0, while the digits of the source label x = x_{k-1} … x_0 shift out, so after k stages the packet arrives at output y]
Network versus Bus
Performance / Unit Cost
Programming
• lock variables
• semaphores
• monitors
• …
Cache Coherency
Outlook
• Distributed Algorithms
• Distributed Systems
• Parallel Programming
• Parallelizing Compilers