![Page 1: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/1.jpg)
CS 7810 Lecture 22
Processor Case Studies
The Microarchitecture of the Pentium 4 Processor
G. Hinton et al.
Intel Technology Journal, Q1 2001
![Page 2: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/2.jpg)
Clock Frequencies
• Aggressive clocks => little work per pipeline stage => deep pipelines => low IPC, large buffers, high power, high complexity, low efficiency
• 50% increase in clock speed => 30% increase in performance
Mispredict latency = 10 cycles
Mispredict latency = 20 cycles
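The clock/performance figures above imply a drop in IPC. A minimal sketch of that arithmetic, assuming performance is simply IPC times clock frequency (the function name `relative_ipc` is made up for illustration):

```python
def relative_ipc(clock_gain: float, perf_gain: float) -> float:
    """Return new IPC divided by old IPC.

    Assumes performance = IPC * frequency, so
    (1 + perf_gain) = ipc_ratio * (1 + clock_gain).
    """
    return (1.0 + perf_gain) / (1.0 + clock_gain)


# Figures from the slide: 50% faster clock, 30% more performance.
ratio = relative_ipc(clock_gain=0.50, perf_gain=0.30)
print(f"IPC falls to {ratio:.3f} of its original value")
```

Under this simple model the deeper pipeline gives back roughly 13% of its IPC in exchange for the faster clock.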
![Page 3: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/3.jpg)
Deep Pipelines
![Page 4: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/4.jpg)
Variable Clocks
• The fastest clock is defined as the time for an ALU operation and bypass (twice the main processor clock)
• Different parts of the chip operate at slower clocks to simplify the pipeline design (e.g. RAMs)
![Page 5: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/5.jpg)
Microarchitecture Overview
![Page 6: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/6.jpg)
Front End
• ITLB, RAS, decoder
• Trace Cache: contains 12K µops (~8K-16KB I-cache), saves 3 pipe stages, reduces power
• Front-end BTB accessed on a trace cache miss and smaller Trace-cache BTB to detect next trace line – no details on branch pred algo
• Microcode ROM: implements µop translation for complex instructions
![Page 7: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/7.jpg)
Execution Engine
• Allocator: resource (regs, IQ, LSQ, ROB) manager
• Rename: 8 logical regs are renamed to 128 phys regs; ROB (126 entries) only stores pointers (Pentium 4) and not the actual reg values (unlike P6) – simpler design, less power
• Two queues (memory and non-memory) and multiple schedulers (select logic) – can issue six instrs/cycle
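The rename step above can be sketched with a register alias table and a free list. This is a generic textbook scheme, not the Pentium 4's actual implementation; the class `Renamer` and its methods are hypothetical names, and only the register counts come from the slide.

```python
from collections import deque

NUM_LOGICAL = 8      # logical registers, per the slide
NUM_PHYSICAL = 128   # physical registers, per the slide


class Renamer:
    def __init__(self):
        # Assumption: logical register i starts out mapped to physical register i.
        self.rat = list(range(NUM_LOGICAL))
        self.free = deque(range(NUM_LOGICAL, NUM_PHYSICAL))

    def rename(self, dest, srcs):
        """Rename one instruction; return (new_dest, renamed_srcs, old_dest)."""
        # Sources are read before the destination mapping is overwritten.
        renamed_srcs = [self.rat[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: allocator must stall")
        new_dest = self.free.popleft()
        old_dest = self.rat[dest]  # can be recycled once this instruction retires
        self.rat[dest] = new_dest
        return new_dest, renamed_srcs, old_dest

    def retire(self, old_dest):
        """Return the previous mapping of the destination to the free list."""
        self.free.append(old_dest)


r = Renamer()
first = r.rename(dest=0, srcs=[1, 2])   # writes logical register 0
second = r.rename(dest=0, srcs=[0, 3])  # reads the value produced by `first`
r.retire(first[2])
```

The second instruction reads physical register 8, the one allocated to the first instruction, so the dependence is preserved while the two writes to logical register 0 no longer conflict. Only register numbers move around here, which matches the slide's point that the ROB holds pointers rather than values.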
![Page 8: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/8.jpg)
Schedulers
• 3GHz clock speed = time for a 16-bit add and bypass
![Page 9: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/9.jpg)
NetBurst
• 3GHz ALU clock = time for a 16-bit add and bypass to itself (area is kept to a minimum)
• Used by 60-70% of all µops in integer programs
• Staggered addition – speeds up execution of dependent instrs – an add takes three cycles
• Early computation of lower 16 bits => early initiation of cache access
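Staggered addition can be illustrated by splitting a 32-bit add into two 16-bit steps. This is a behavioral sketch only: `staggered_add` is a made-up name, and flag generation (presumably where the remaining cycle of the three goes) is left out.

```python
MASK16 = 0xFFFF


def staggered_add(a: int, b: int):
    """Add two 32-bit values in two 16-bit steps; return (low_half, full_sum)."""
    # Fast cycle 1: add the low halves. The low 16 result bits are ready here,
    # so a dependent add or a cache index computation could already start.
    low_sum = (a & MASK16) + (b & MASK16)
    low = low_sum & MASK16
    carry = low_sum >> 16
    # Fast cycle 2: add the high halves plus the carry out of the low half.
    high = (((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry) & MASK16
    return low, (high << 16) | low


low, full = staggered_add(0x0001FFFF, 0x00000001)
```

A dependent instruction only needs `low` to begin its own first step, which is why back-to-back dependent adds can issue on consecutive fast clocks.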
![Page 10: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/10.jpg)
Detailed Microarchitecture
![Page 11: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/11.jpg)
Data Cache
• 4-way 8KB cache; 2-cycle load-use latency for integer instrs and 6-cycle latency for fp instrs
• Distance between load scheduler and execution is longer than load latency
• Speculative issue of load-dependent instrs and selective replay
• Store buffer (24 entries) to forward results to loads (48 entries) – no details on load issue algo
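Store-to-load forwarding can be sketched as a search of pending stores from youngest to oldest. The slides give no details of the real mechanism, so this assumes exact full-address matches and same-size accesses; `StoreBuffer` and its methods are hypothetical names, and only the 24-entry capacity comes from the slide.

```python
from collections import deque

STORE_BUFFER_ENTRIES = 24  # per the slide


class StoreBuffer:
    def __init__(self, capacity=STORE_BUFFER_ENTRIES):
        self.capacity = capacity
        self.entries = deque()  # (address, value) pairs, oldest first

    def add_store(self, address, value):
        if len(self.entries) >= self.capacity:
            raise RuntimeError("store buffer full: allocation must stall")
        self.entries.append((address, value))

    def load(self, address, memory):
        # Search youngest to oldest so the most recent matching store wins.
        for addr, value in reversed(self.entries):
            if addr == address:
                return value
        # No pending store matches: read from the cache/memory model.
        return memory.get(address, 0)


memory = {0x100: 1}
sb = StoreBuffer()
sb.add_store(0x100, 7)
sb.add_store(0x100, 9)
forwarded = sb.load(0x100, memory)   # served by the youngest matching store
from_memory = sb.load(0x200, memory)  # no match in the buffer
```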
![Page 12: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/12.jpg)
Cache Hierarchy
• 256KB 8-way L2; 7-cycle latency; new operation every two cycles
• Stream prefetcher from memory to L2 – stays 256 bytes ahead
• 3.2GB/s system bus: 64-bit wide bus at 400MHz
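The bus bandwidth figure follows directly from the width and transfer rate. A small check, assuming one 64-bit transfer per 400MHz cycle (the function name is illustrative):

```python
def bus_bandwidth_bytes_per_s(width_bits: int, transfers_per_s: float) -> float:
    """Peak bandwidth = bytes per transfer * transfers per second."""
    return (width_bits / 8) * transfers_per_s


bw = bus_bandwidth_bytes_per_s(width_bits=64, transfers_per_s=400e6)
print(bw / 1e9, "GB/s")
```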
![Page 13: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/13.jpg)
Performance Results
![Page 14: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/14.jpg)
Quick Facts
• November 2000: Willamette, 0.18 µm, Al interconnect, 42M transistors, 217 mm², 55W, 1.5GHz
• February 2004: Prescott, 0.09 µm, Cu interconnect, 125M transistors, 112 mm², 103W, 3.4GHz
![Page 15: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/15.jpg)
Improvements
| | Willamette (2000) | Prescott (2004) |
|---|---|---|
| L1 data cache | 8KB | 16KB |
| L2 cache | 256KB | 1MB |
| Pipeline stages | 20 | 31 |
| Frequency | 1.5GHz | 3.4GHz |
| Technology | 0.18 µm | 0.09 µm |
![Page 16: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/16.jpg)
Pentium M
• Based on the P6 microarchitecture
• Lower design complexity (some inefficiencies persist, such as copying register values from ROB to architected register file)
• Improves on P4 branch predictor
![Page 17: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/17.jpg)
PM Changes to P6, cont.
• Intel has not released the exact length of the pipeline.
• Known to be somewhere between the P4 (20 stages) and the P3 (10 stages). Rumored to be 12 stages.
• Trades off slightly lower clock frequencies (than P4) for better performance per clock, lower branch misprediction penalties, …
![Page 18: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/18.jpg)
Banias
• 1st version
• 77 million transistors, 23 million more than P4
• 1 MB on-die Level 2 cache
• 400 MHz FSB (quad-pumped 100 MHz)
• 130 nm process
• Frequencies between 1.3 and 1.7 GHz
• Thermal Design Point of 24.5 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm
![Page 19: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/19.jpg)
Dothan
• Launched May 10, 2004
• 140 million transistors
• 2 MB Level 2 cache
• 400 or 533 MHz FSB
• Frequencies between 1.0 and 2.26 GHz
• Thermal Design Point of 21 (400 MHz FSB) to 27 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm
![Page 20: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/20.jpg)
Branch Prediction
• Longer pipelines mean higher penalties for mispredicted branches
• Improvements result in added performance and hence less energy spent per instruction retired
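The cost of deeper pipelines can be made concrete with a first-order CPI model. The branch fraction, misprediction rate, and base CPI below are made-up illustrative values; only the 10- and 20-cycle mispredict latencies appear earlier in these slides.

```python
def cpi_with_branches(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order model: every mispredicted branch adds a fixed penalty."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 5% of them mispredicted, base CPI of 1.0.
short_pipe = cpi_with_branches(1.0, 0.20, 0.05, penalty_cycles=10)
deep_pipe = cpi_with_branches(1.0, 0.20, 0.05, penalty_cycles=20)
print(f"CPI with 10-cycle penalty: {short_pipe:.2f}")
print(f"CPI with 20-cycle penalty: {deep_pipe:.2f}")
```

In this model, doubling the penalty doubles the cycles lost to mispredictions, so the same predictor improvement is worth twice as much on the deeper pipeline.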
![Page 21: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/21.jpg)
Branch Prediction in Pentium M
• Enhanced version of Pentium 4 predictor
• Two branch predictors added that run in tandem with P4 predictor:
  – Loop detector
  – Indirect branch detector
• 20% lower misprediction rate than PIII resulting in up to 7% gain in real performance
![Page 22: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/22.jpg)
Branch Prediction
Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php
![Page 23: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/23.jpg)
Loop Detector
• A predictor that always predicts a loop branch as taken will always mispredict the last iteration
• Detector analyzes branches for loop behavior
• Benefits a wide variety of program types
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
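One way to picture a loop detector is a per-branch counter that learns the trip count and predicts the exit. This is a simplified sketch under that assumption, not the Pentium M's actual hardware; `LoopPredictor` and `count_mispredicts` are hypothetical names.

```python
class LoopPredictor:
    """Predict a loop-closing branch: taken until the learned trip count, then not taken."""

    def __init__(self):
        self.learned_count = None  # taken iterations seen before the last exit
        self.current = 0           # taken iterations seen in the current run

    def predict(self) -> bool:
        if self.learned_count is None:
            return True  # nothing learned yet: fall back to "always taken"
        return self.current < self.learned_count

    def update(self, taken: bool) -> None:
        if taken:
            self.current += 1
        else:
            self.learned_count = self.current  # remember where the loop exited
            self.current = 0


def count_mispredicts(predictor, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop whose branch is taken 4 times and then falls through, executed 3 times.
outcomes = ([True] * 4 + [False]) * 3
mispredicts = count_mispredicts(LoopPredictor(), outcomes)
always_taken_mispredicts = outcomes.count(False)
```

With a fixed trip count the sketch mispredicts only the first exit, while an always-taken predictor mispredicts every exit.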
![Page 24: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/24.jpg)
Indirect Branch Predictor
• Picks targets based on global control-flow history
• Benefits programs compiled to branch to calculated addresses
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
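History-based target selection can be sketched as a target table indexed by the branch address XORed with a global history register. The hash, table size, and history length here are assumptions for illustration, and `IndirectPredictor` is a hypothetical name, not Intel's design.

```python
class IndirectPredictor:
    def __init__(self, history_bits=4, table_bits=10):
        self.history = 0
        self.history_mask = (1 << history_bits) - 1
        self.table_mask = (1 << table_bits) - 1
        self.targets = {}  # table index -> last target seen at that index

    def _index(self, pc):
        # Assumed hash: XOR of the branch address and the global history.
        return (pc ^ self.history) & self.table_mask

    def record_conditional(self, taken):
        """Shift the outcome of a conditional branch into the global history."""
        self.history = ((self.history << 1) | int(taken)) & self.history_mask

    def predict(self, pc):
        return self.targets.get(self._index(pc))

    def update(self, pc, target):
        self.targets[self._index(pc)] = target


predictor = IndirectPredictor()
PC = 0x40  # address of one indirect branch (made-up value)

# Training: the indirect branch goes to 0x1000 after a taken conditional
# branch and to 0x2000 after a not-taken one.
for _ in range(8):
    for taken, target in ((True, 0x1000), (False, 0x2000)):
        predictor.record_conditional(taken)
        predictor.update(PC, target)

predictor.record_conditional(True)
after_taken = predictor.predict(PC)
predictor.update(PC, 0x1000)
predictor.record_conditional(False)
after_not_taken = predictor.predict(PC)
```

A table indexed by the branch address alone would hold a single target for this branch and get one of the two cases wrong each time; mixing in the history lets the same branch have one entry per path.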
![Page 25: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/25.jpg)
Benchmark
![Page 26: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/26.jpg)
Battery Life
![Page 27: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/27.jpg)
UltraSPARC IV
• CMP with 2 UltraSPARC IIIs – speedups of 1.6 and 1.14 for swim and lucas (static parallelization)
• UltraSPARC III : 4-wide, 16 queue entries, 14 pipeline stages
• 4KB branch predictor – 95% accuracy, 7-cycle penalty
• 2KB prefetch buffer between L1 and L2
![Page 28: CS 7810 Lecture 22](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681678d550346895ddcb0f7/html5/thumbnails/28.jpg)
Alpha 21364
• Tournament predictor – local and global; 36Kb
• Issue queue (20-Int, 15-FP), 4-wide Int, 2-wide FP
• Two clusters, each with 2 FUs and a copy of the 80-entry register file