Highlights of the 36th Annual International Symposium on Microarchitecture
December 2003
Theo Theocharides
Embedded and Mobile Computing Center Department of Computer Science and Engineering
The Pennsylvania State University
Acknowledgements: K. Bernstein, T. Austin, D. Blaauw, L. Peh, D. Jimenez
Introduction
The International Symposium on Microarchitecture is the premier forum for discussing new microarchitecture and software techniques
Covers processor architecture, compilers, and systems; a venue for technical interaction on traditional MICRO topics
Special emphasis on optimizations that take advantage of application-specific opportunities, bringing together the microarchitecture and embedded-architecture communities
http://www.microarch.org
http://www.microarch.org/micro36/
Symposium Outline
Session 1: Voltage Scaling & Transient Faults
Session 2: Cache
Session 3: Power and Energy Efficient Architectures
Session 4: Application-Specific Optimization and Analysis
Session 5: Dynamic Optimization Systems
Session 6: Dynamic Program Analysis and Optimization
Session 7: Branch, Value, and Scheduling Optimization
Session 8: Dataflow, Data Parallel, and Clustered Architectures
Session 9: Secure and Network Processors
Session 10: Scaling Design
Highlights
Keynote Speech
Caution Flag Out: Microarchitecture's Race for Power Performance, by Kerry Bernstein, IBM T. J. Watson Research Center
Interesting Papers
Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation, D. Ernst et al.
Power-Driven Design of Router Microarchitectures in On-Chip Networks, H. Wang, Li-Shiuan Peh, S. Malik
Fast Path-Based Neural Branch Prediction, D. Jimenez
Workshops and Tutorials
5th Workshop on Media and Streaming Processors (MSP)
3rd Workshop on Power-Aware Computer Systems (PACS)
2nd Workshop on Application Specific Processors (WASP)
Tutorial: Challenges in Embedded Computing
Tutorial: Open Research Compiler (ORC): Proliferation of Technologies and Tools
Tutorial: Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design
Tutorial: Network Processors
Tutorial: Architectural Exploration with Liberty
Keynote Speech
Given by Kerry Bernstein, IBM T.J. Watson Research Center
Microarchitecture and technology relationship
We cannot continue to scale down to achieve higher frequencies without a catch
Increasing pipeline depth does not necessarily help
Power consumption, process variation, soft errors, and die-area erosion are becoming more and more important
The keynote explored how past technologies have influenced high-speed microarchitectures
It showed how characteristics of proposed new devices and interconnects for lithographies beyond 90nm may shape future machine design
Given the present issues and incoming trends, the role of microarchitecture in extending CMOS performance will be more important than ever
Where Scaling Fails…
Cost of Performance in Terms of Power
Issues in summary:
Feature size
Device count (transistors per chip)
Pipeline depth
Power consumption increases non-linearly with scaling
Power grows as we reduce the FO4 delay
Delay and power are affected by process variation
Cooling creates more problems
Cost of power diverges from performance gain
How Does Microarchitecture Help?
Repairs
Monitor-based full-chip voltage and clock throttling
Voltage islands: technology aid required here; latency required; low-activity FET count increase
Clock gating: so far has been a nice solution…
Pipeline depth optimization
Performance accelerators for ASICs (DSPs, GPUs, etc.): they need power anyway, so at least make them efficient; software solutions should be developed here
Compute-informed power management: instruction stream, dynamic resource assertion, power-aware OS, thermal modeling
New Ideas
“Evolutionary”: strained silicon, high-K gate dielectrics, hybrid-crystal silicon
These increase current drive per micron of device, allow transistor density improvement, and introduce features that enable active static power management
“Revolutionary”: double-gated MOSFETs, 3D integration, molecular computing
These reduce power density without architectural management, eliminate the power dependence on frequency, and return the industry to threshold and supply voltage scaling
Molecular Computing
Keynote Conclusions
New technologies will likely help, but not necessarily
Power is by far the predominant factor in scaling; we need to see what new technologies can give us
Staying ahead requires power-aware systems
Razor Project (T. Austin, D. Blaauw, T. Mudge)
Designers and architects have traditionally scaled the supply voltage down only to the point where, under all possible worst cases, there are provably no errors
Very conservative voltage scaling
IDEA! Instead of trying to avoid ALL errors, ALLOW some errors to happen and correct them!
Major argument: scaling the supply voltage down by almost 0.25V gives an average error rate of less than 5%
Instead of spending energy, logic, effort, time, and so many other useful resources on avoiding errors, allow a very small error percentage and gain huge power savings
The cost of fixing errors is minimal when the error percentage is kept under control
Razor Project
Razor Pipeline Flip-Flop
Error Rate vs. Power Savings
IPC vs. Error Rate
DVS
Razor Advantages
Eliminate safety margins: process variation, IR drop, temperature fluctuation, data-dependent latencies, model uncertainty
Operate at sub-critical voltage for the optimal trade-off between the energy gain from voltage scaling and the energy overhead from dynamic error correction (see the control-loop sketch below)
Tune voltage for average instruction data; exploit delay dependence in data
Tolerate delay degradation due to infrequent noise events: SER, capacitive and inductive noise, charge sharing, floating-body effect… the most severe noise is also the least frequent
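Because errors are detected and corrected rather than avoided, the supply can be servoed to a small target error rate instead of a worst-case margin. A minimal sketch of such a feedback loop, assuming illustrative constants (target rate, step size, and voltage range are ours, not the paper's):

```python
# A minimal sketch of the kind of voltage controller Razor enables:
# tune Vdd so the measured pipeline error rate hovers at a small
# target instead of at zero. All constants are illustrative.
TARGET_ERROR_RATE = 0.015   # ~1.5%; the real optimum is workload-dependent
V_STEP = 0.025              # volts per adjustment (assumed granularity)
V_MIN, V_MAX = 0.6, 1.8     # assumed legal supply range

def adjust_voltage(v_dd: float, errors: int, samples: int) -> float:
    """Lower Vdd while the observed error rate stays below target;
    raise it when error recovery starts eating the energy savings."""
    rate = errors / samples
    if rate > TARGET_ERROR_RATE:
        v_dd += V_STEP          # too many recoveries: back off
    else:
        v_dd -= V_STEP          # margin to spare: scale down further
    return min(max(v_dd, V_MIN), V_MAX)
```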
Power-Driven Design of Router Microarchitectures in On-Chip Networks (Hangsheng Wang, Li-Shiuan Peh, Sharad Malik)
Investigates on-chip network router microarchitectures from a power-driven perspective
Proposes power-efficient network microarchitectures: segmented crossbar, cut-through crossbar, and write-through buffer
Studies and uncovers the power-saving potential of an existing network architecture: the express cube
Reduction in network power of up to 44.9%,
with NO degradation in network performance
Improved latency and throughput in some cases
Power in NoC
E_wrt is the average energy dissipated when writing a flit into the input buffer
E_rd is the average energy dissipated when reading a flit from the input buffer
E_buf = E_wrt + E_rd is the average buffer energy
E_arb is the average arbitration energy
E_xb is the average crossbar traversal energy
E_lnk is the average link traversal energy
H is the number of hops traversed by the flit (these terms combine into the per-flit model sketched below)
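Putting the slide's terms together gives a per-flit network energy model. The (H - 1) link term assumes H router hops connected by H - 1 links; that combination is our reading, not spelled out on the slide:

```python
# Per-flit network energy from the slide's components (combination assumed).
def flit_energy(h: int, e_wrt: float, e_rd: float,
                e_arb: float, e_xb: float, e_lnk: float) -> float:
    e_buf = e_wrt + e_rd                       # buffer write + read energy
    return h * (e_buf + e_arb + e_xb) + (h - 1) * e_lnk
```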
Architectural Methods
Segmented crossbar
Cut-through crossbar
Write-through input buffer
Express cube
Segmented Crossbar
Schematic of a matrix crossbar and a segmented crossbar. F is flit size in bits, dw is track width; E, W, N, S are ports.
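To see why segmentation helps, here is a toy energy model of ours (not the paper's): dynamic energy scales with the capacitance actually switched, and a segmented crossbar only charges the line segments between the input and the selected output column:

```python
# Toy model: energy per traversal ~ switched capacitance x Vdd^2.
# A matrix crossbar drives the full input line every time; a segmented
# crossbar charges only the segments up to the selected output column.
def matrix_xb_energy(num_ports: int, c_segment: float, v_dd: float) -> float:
    return num_ports * c_segment * v_dd ** 2       # whole line switches

def segmented_xb_energy(out_col: int, c_segment: float, v_dd: float) -> float:
    return (out_col + 1) * c_segment * v_dd ** 2   # segments 0..out_col only

# Example: a 5-port router, output at column 1 switches 2/5 of the line
print(matrix_xb_energy(5, 1.0, 1.8), segmented_xb_energy(1, 1.0, 1.8))
```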
Cut-Through Crossbar
Schematic of cut-through crossbars. F is flit size, dw is track width; E, W, N, S are ports.
Write-Through Buffer
(a) Bypassing without overlapping
(b) Bypassing with overlapping
(c) Schematic of a write-through input buffer.
Express Cube Topology and Microarchitecture
Power Savings and Conclusions
Importance of a power-driven approach to on-chip network design
Need to investigate the interactions between traffic patterns and on-chip network architectures
Need to reach a systematic design methodology for on-chip networks
Fast Path-Based Neural Branch Prediction (D. Jimenez)
The paper presented a new neural branch predictor that is both more accurate and much faster than previous neural predictors
Accuracy far superior to conventional predictors
Latency comparable to predictors from industrial designs
Improves the instructions-per-cycle (IPC) rate of an aggressively clocked microarchitecture by 16%
Latency - Accuracy Gain
Rather than being done all at once (above), computation is staggered (below)
Train a neural network with path history, and update it dynamically
Choose the weight vectors according to the path leading up to the branch rather than the branch address alone
Directly reduces latency (computation can begin prior to the prediction; see the figure on the left)
Improves accuracy because the predictor incorporates path information (see the predictor sketch below)
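A minimal sketch of the path-based perceptron idea in Python. The table size, hashing, and training threshold are simplifications of ours; the real predictor also staggers the running sum across the path as branches arrive, which is the source of the latency win:

```python
# Path-based neural (perceptron) branch prediction, simplified.
HISTORY_LEN = 16            # path/history length h
NUM_ROWS = 1024             # rows in the weight table
THRESHOLD = int(1.93 * HISTORY_LEN + 14)   # training threshold commonly
                                           # used for perceptron predictors

# weights[row][0] is a bias column; weights[row][j+1] is selected by the
# address of the j-th most recent branch on the path, not the branch PC
weights = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_ROWS)]
path = [0] * HISTORY_LEN     # addresses of the h most recent branches
history = [1] * HISTORY_LEN  # outcomes as +1 (taken) / -1 (not taken)

def predict(pc):
    """Sum weights chosen by *path* addresses; sign gives the prediction."""
    y = weights[pc % NUM_ROWS][0]
    for j in range(HISTORY_LEN):
        y += weights[path[j] % NUM_ROWS][j + 1] * history[j]
    return y >= 0, y

def update(pc, taken, y):
    """Perceptron rule: train on a mispredict or a low-confidence output."""
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= THRESHOLD:
        weights[pc % NUM_ROWS][0] += t
        for j in range(HISTORY_LEN):
            weights[path[j] % NUM_ROWS][j + 1] += t * history[j]
    path.insert(0, pc); path.pop()          # shift the path...
    history.insert(0, t); history.pop()     # ...and the outcome history

# usage: pred, y = predict(pc); later, update(pc, actual_outcome, y)
```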
Comparative Results - Misprediction Rate
IPC per Hardware Cost
Faster and more accurate than existing neural branch predictors
Conclusion
Overview of MICRO36
The conference lasted 5 days; impossible to review in half an hour!
If you are interested, you should read the proceedings on-line at
http://www.microarch.org/micro36
The Call For Papers for MICRO37 is available at
http://www.microarch.org/micro37
DEADLINE FOR PAPER SUBMISSION: May 28th, 2004
Links to the Papers Reviewed
Razor http://www.microarch.org/micro36/html/pdf/ernst-Razor.pdf
NoC Router Power-Driven Design http://www.microarch.org/micro36/html/pdf/wang-PowerDrivenDesign.pdf
Fast-Path Neural Branch Predictor http://www.microarch.org/micro36/html/pdf/jimenez-FastPath.pdf
Questions?
THANK YOU !
36th Annual International Symposium on Microarchitecture
- A Review
Rajaraman Ramanarayanan
Talk Overview
Session covered in this presentation
Review papers:
Architectural vulnerability factors: introduction, proposed technique, soft-error terminology, computing AVFs, results, conclusion
L2-miss-driven variable supply-voltage scaling: introduction, proposed solution, transitions, results, achievements
Session Covered
Voltage Scaling & Transient Faults Methodology to compute Artificial vulnerability factors VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for
Low Power
Architectural Vulnerability Factors (S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, T. Austin)
Single-event upsets from particle strikes have become a key challenge in microprocessor design
Soft errors due to cosmic rays are making an impact in industry: in 2000, Sun Microsystems acknowledged cosmic-ray strikes on unprotected cache memories as the cause of random crashes at major customer sites in its flagship Enterprise server line
The fear of cosmic-ray strikes prompted Fujitsu to protect 80% of the 200,000 latches in its recent SPARC processor with some form of error detection
Designers require accurate estimates of processor error rates to make appropriate cost/reliability trade-offs
Introduction
All existing approaches introduce a significant penalty in performance, power, die size, and design time
Tools and techniques to estimate processor transient error rates are not readily available or fully understood.
Estimates are needed early in the design cycle.
In this paper: define the architectural vulnerability factor (AVF); identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution
Proposed Technique
Not all faults in a micro-architectural structure affect the final outcome of a program
Architectural vulnerability factor (AVF): the probability that a fault in a particular structure will result in an error in the final output of the program
The overall error rate is the product of the raw fault rate and the AVF
Designers can examine the relative contributions of various structures and identify cost-effective areas to employ fault-protection techniques
The technique tracks the subset of processor state bits required for architecturally correct execution (ACE); a fault in a storage cell containing one of these bits affects output
For example, a branch predictor's AVF is 0%: predictor bits are always un-ACE bits
Bits in the committed PC are always ACE bits, so the committed PC has an AVF of 100%
Soft Error Terminology
Error budget expressed in terms of: Mean Time Between Failures (MTBF); Failures In Time (FIT), inversely related to MTBF
Errors are often classified as: undetected, causing silent data corruption (SDC); detected, giving detected unrecoverable errors (DUE)
The effective FIT rate for a structure is the product of its raw circuit FIT rate and the structure's vulnerability factor
The effective FIT rate per bit is influenced by several vulnerability factors, also known as de-rating factors or soft-error sensitivity factors
Examples include the timing vulnerability factor for latches, and the AVF
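For concreteness (standard definitions, not specific to the slide): FIT counts failures per 10^9 device-hours, so MTBF in hours = 10^9 / FIT, and FIT_effective = FIT_raw × vulnerability factor. For example, a raw rate of 1000 FIT with a vulnerability factor of 0.3 gives an effective 300 FIT, i.e. an MTBF of roughly 3.3 × 10^6 hours.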
Silent Data Corruption in the Future
Identifying Un-ACE Bits
Bits that do not affect final program output
Analyzed a uniprocessor system
Micro-architectural un-ACE bits: idle or invalid state; mis-speculated state; predictor structures; ex-ACE state
Architectural un-ACE bits: NOP instructions; performance-enhancing instructions; predicated-false instructions; dynamically dead instructions; logical masking
Computing AVF
AVF for a storage cell: the fraction of time an upset in that cell will cause a visible error in the final output of a program
AVF for a hardware structure: the average AVF over all bits in that structure:
AVF = (Σ residency in cycles of all ACE bits in the structure) / (total number of bits in the structure × total execution cycles)
Little's Law: N = B × L, where N = average number of bits in a box, B = average bandwidth per cycle into the box, and L = average latency of an individual bit through the box
Applying it to ACE bits gives AVF = (B_ACE × L_ACE) / (total number of bits in the hardware structure)
Computing AVFs Using a Performance Model
Two structures, the instruction queue and the execution units, analyzed using the Asim performance-model framework
Need the following information: the sum of all residence cycles of all ACE bits of the objects resident in the structure during the execution of the program; the total execution cycles for which we observe the ACE bits' residence time; the total number of bits in the hardware structure
AVF algorithm (see the sketch below): record the residence time of the instruction in each structure as it flows through the pipeline; update the structures the instruction flowed through; put the instruction in a post-commit analysis window to determine whether it is dynamically dead, or whether any of its bits are logically masked
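A simplified sketch of this accounting in Python. The data layout and the numbers in the example are illustrative of ours, not the paper's Asim implementation:

```python
# Track how many cycles each instruction's bits sit in a structure, then
# discard the residency of instructions found to be un-ACE post-commit.
from dataclasses import dataclass

@dataclass
class Residency:
    cycles: int          # cycles the instruction occupied the structure
    bits: int            # bits the instruction occupies in the structure
    is_ace: bool = True  # cleared if post-commit analysis finds it un-ACE

def avf(residencies: list[Residency],
        structure_bits: int, total_cycles: int) -> float:
    """AVF = ACE-bit residency / (structure size x execution cycles)."""
    ace_bit_cycles = sum(r.cycles * r.bits for r in residencies if r.is_ace)
    return ace_bit_cycles / (structure_bits * total_cycles)

# Example: three instructions in a hypothetical 64-entry x 41-bit queue
iq = [Residency(cycles=10, bits=41),
      Residency(cycles=25, bits=41),
      Residency(cycles=8,  bits=41, is_ace=False)]  # e.g. dynamically dead
print(avf(iq, structure_bits=64 * 41, total_cycles=100))
```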
Methodology for Evaluation
Use an Itanium2®-like IA64 processor [14] scaled to current technology
Modeled in detail in Asim performance model framework.
Results - Program-Level Decomposition
Results
Program-level decomposition: about 45% of instructions are ACE; the rest, 55% of the instructions, are un-ACE
Some of these un-ACE instructions still contain ACE bits, such as the opcode bits of prefetch instructions
UNKNOWN and NOT_PROCESSED instructions account for about 1% of the total instructions
NOPs, predicated-false instructions, and prefetch instructions account for 26%, 6.7%, and 1.5%, respectively
FDD_reg and FDD_mem denote first-level dynamically dead results written back to registers and memory, respectively; they account for about 9.4% and 2% of the dynamic instructions (IA64 has a large number of registers)
TDD_reg and TDD_mem (transitively dynamically dead) account for 6.6% and 1.6% of the dynamic instructions
AVF for the Instruction Queue
Shows what percentage of cycles a storage cell in the instruction queue contains ACE and un-ACE bits
The instruction queue contains an ACE bit about 28% of the time; thus the AVF of the instruction queue is 28%
Floating-point programs in general have higher AVFs than integer programs (31% vs. 25%, respectively): long-latency instructions and few branch mispredictions let them use the instruction queue more effectively, leading to a higher AVF
Apply Little's Law: the number of ACE instructions in the queue = the bandwidth (ACE IPC) × the average number of cycles an instruction is in the ACE state (ACE latency)
The ACE IPC and ACE latency come from the performance model
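As a hypothetical worked example (numbers are ours, not the paper's): with an ACE IPC of 1.8 and an ACE latency of 10 cycles, Little's Law gives 1.8 × 10 = 18 ACE instructions resident on average; in a 64-entry queue that is 18/64 ≈ 28%, consistent with the AVF reported above.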
AVFs for the Execution Units
Four integer pipes and two floating-point pipes; 50% control latches and 50% datapath latches
The execution units process ACE instructions in only 11% of cycles, significantly lower than the instruction queue
Reasons: instructions must wait in the instruction queue; speculatively issued instructions succeeding cache-miss loads must replay through the instruction queue; the floating-point pipes are mostly idle while executing integer code
Logical-masking functions were implemented for a small but important subset of instructions
Conclusion
Estimated AVFs using a novel approach that tracks bits required for architecturally correct execution (ACE) and un-ACE bits
Computed the AVF for the instruction queue and execution units of an Itanium2®-like IA64 processor.
Further refinement could lower the AVF estimates further, but the contribution from such refinement is expected to be small
Can estimate the FIT rate of an entire processor early in the design cycle
Can help designers choose the appropriate error detection or correction schemes
Can lower the FIT rate of the chip iteratively by adding more and more error protection, using AVF estimates as a guide.
L2-Miss-Driven VSV for Low Power (H. Li, C. Cher, T. N. Vijaykumar, K. Roy)
Idea: upon an L2 miss, the pipeline performs independent computations but almost always ends up stalled, waiting for data, despite out-of-order issue and other latency-hiding techniques
During an L2 miss, scale down the supply voltage and carry out the independent computations at lower speed instead
Performance degrades, however, if there are sufficient independent computations, since those would have overlapped with the cache-miss delay
Returning to full speed too eagerly, however, will likely reduce power savings if there are multiple misses and insufficient independent computations to overlap with them
Proposed Solution
Two state machines track parallelism on the fly
Scale down the voltage depending on the parallelism of the two events (a control sketch follows below)
Factors considered: circuit-level complexities that restrict VSV to two voltages; stability; signal-propagation-speed issues; energy-overhead issues in RAMs and register files
Average reduction in processor power is 7%, while performance degradation is 0.9%
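A minimal sketch of the dual-supply control this implies, under simplified trigger conditions. The real design uses two FSMs that track parallelism; the IPC test and voltage values below are placeholders of ours:

```python
# Drop to the low supply when an L2 miss leaves the pipeline starved;
# return to the high supply when the miss data comes back.
V_HIGH, V_LOW = 1.8, 1.2           # the two supply levels (assumed values)
LOW_PARALLELISM_IPC = 0.5          # assumed threshold for "mostly stalled"

class VsvController:
    def __init__(self):
        self.v_dd = V_HIGH
        self.outstanding_l2_misses = 0

    def on_l2_miss(self, recent_ipc: float) -> None:
        self.outstanding_l2_misses += 1
        # Scale down only if there is little independent work to slow down
        if recent_ipc < LOW_PARALLELISM_IPC:
            self.v_dd = V_LOW

    def on_l2_fill(self) -> None:
        self.outstanding_l2_misses -= 1
        if self.outstanding_l2_misses == 0:
            self.v_dd = V_HIGH     # miss data returned: full speed again
```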
VSV Structure
Transitions
High-To-Low transition
Low-To-High transition
VSV - Results
VSV - Achievements
Power savings with minimal performance degradation
Complexity of circuits taken into consideration
FSMs track the level of parallelism among independent operations against the delay caused by an L2 cache miss
VSV achieves a 4% reduction in power across all SPEC2K benchmarks
VSV achieves 12% for the benchmarks with high L2 miss rates
Questions
Any questions or feedback?