Winter 2004

Class Presentation for Advanced VLSI Course

Instructor: Dr. S.M. Fakhraie
Presented by: Naser Sedaghati

Major Reference:
Design and Implementation of the POWER5™ Microprocessor

J. Clabes(1), J. Friedrich(1), M. Sweet(1), J. DiLullo(1), S. Chu(1), D. Plass(2), J. Dawson(2), P. Muench(2), L. Powell(1), M. Floyd(1), B. Sinharoy(2), M. Lee(1), M. Goulet(1), J. Wagoner(1), N. Schwarz(1), S. Runyon(1), G. Gorman(1), P. Restle(3), R. Kalla(1), J. McGill(1), S. Dodson(1)

(1) IBM System Group, Austin, TX
(2) IBM System Group, Poughkeepsie, NY
(3) IBM Research, Yorktown Heights, NY

IEEE International Solid-State Circuits Conference, 2004



Outline

• Motivation
• Background
• Threading Fundamentals
• Enhanced SMT Implementation in POWER5
• Memory Subsystem Enhancements
• Power Efficiency
• Additional SMT Considerations
• Summary

Microprocessor Design Optimization Focus Areas

• Memory latency
  - Increased processor speeds make memory appear further away
  - Longer stalls possible
• Branch processing
  - Mispredict more costly as pipeline depth increases, resulting in stalls and wasted power
  - Predication drives increased power and larger chip area
• Execution unit utilization
  - Currently 20-25% execution unit utilization common
• Simultaneous multi-threading (SMT) and POWER architecture address these areas
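The memory-latency point can be made concrete with a little arithmetic. The sketch below is only an illustration: the 100 ns access time and the clock rates are hypothetical values chosen for the example, not figures from the POWER5 paper, and `latency_in_cycles` is a helper name introduced here.

```python
def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Core cycles spent waiting for one memory access.

    A clock of f GHz completes f cycles per nanosecond, so a fixed
    wall-clock latency costs latency_ns * f cycles.
    """
    return latency_ns * clock_ghz


# Hypothetical example: the same 100 ns memory access at three clock rates.
for ghz in (1.0, 1.5, 2.0):
    cycles = latency_in_cycles(100.0, ghz)
    print(f"{ghz:.1f} GHz core: {cycles:.0f} cycles per access")
```

The memory does not get slower in absolute terms; the core simply has more cycles to fill per access, which is the gap a second thread can help cover.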

POWER4 --- Shipped in Systems December 2001

• Technology: 180 nm lithography, Cu, SOI
  - POWER4+ shipping in 130 nm today
  - 267 mm², 185M transistors
• Dual processor core
• 8-way superscalar
  - Out-of-order execution
  - 2 Load/Store units
  - 2 Fixed Point units
  - 2 Floating Point units
  - Logical operations on Condition Register
  - Branch Execution unit
• > 200 instructions in flight
• Hardware instruction and data prefetch

POWER5 --- The Next Step

• Technology: 130 nm lithography, Cu, SOI
  - 389 mm², 276M transistors
• Dual processor core
• 8-way superscalar
• Simultaneous multithreaded (SMT) core
  - Up to 2 virtual processors per real processor
  - Natural extension to POWER4 design

System-level view of POWER5

Multi-threading Evolution

Changes Going From ST to SMT Core

• SMT easily added to superscalar micro-architecture
  - Second Program Counter (PC) added to share I-fetch bandwidth
  - GPR/FPR rename mapper expanded to map second set of registers (high-order address bit indicates thread)
  - Completion logic replicated to track two threads
  - Thread bit added to most address/tag buses
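The rename-mapper change can be sketched as follows. This is a minimal model under stated assumptions, not IBM's implementation: it assumes 32 architected registers per thread (as PowerPC defines for GPRs and FPRs), treats the 120 registers quoted on the resources slide as one shared physical pool, and the names `mapper_index` and `RenameMapper` are hypothetical.

```python
ARCH_REGS = 32     # architected GPRs (or FPRs) per thread in PowerPC
PHYS_REGS = 120    # assumed shared physical pool size
THREAD_SHIFT = 5   # log2(ARCH_REGS): the thread bit sits above the register bits


def mapper_index(thread: int, arch_reg: int) -> int:
    """Mapper entry for (thread, register); the high-order bit selects the thread."""
    if thread not in (0, 1):
        raise ValueError("this sketch models two threads: 0 or 1")
    if not 0 <= arch_reg < ARCH_REGS:
        raise ValueError("architected register out of range")
    return (thread << THREAD_SHIFT) | arch_reg


class RenameMapper:
    """Toy mapper: one table covers both threads' architected registers."""

    def __init__(self) -> None:
        # Initially mapper entry i is held in physical register i.
        self.table = list(range(2 * ARCH_REGS))
        self.free = list(range(2 * ARCH_REGS, PHYS_REGS))

    def lookup(self, thread: int, arch_reg: int) -> int:
        return self.table[mapper_index(thread, arch_reg)]

    def rename(self, thread: int, arch_reg: int) -> int:
        """Give a new destination register to an instruction writing arch_reg.

        The old physical register is not returned to the free list here; a
        real design frees it only when the instruction completes.
        """
        if not self.free:
            raise RuntimeError("no free physical register; dispatch would stall")
        phys = self.free.pop(0)
        self.table[mapper_index(thread, arch_reg)] = phys
        return phys


mapper = RenameMapper()
print(mapper_index(0, 3), mapper_index(1, 3))  # same register, different threads
```

Because the thread bit is part of the index, both threads can name the same architected register without colliding, while the physical pool stays shared.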

POWER5 Resources Size Enhancements

• Enhanced caches and translation resources
  - I-cache: 64 KB, 2-way set associative, LRU
  - D-cache: 32 KB, 4-way set associative, LRU
  - First-level data translation: 128 entries, fully associative, LRU
  - L2 cache: 1.92 MB, 10-way set associative, LRU
• Larger resource pools
  - Rename registers: GPRs, FPRs increased to 120 each
  - L2 cache coherency engines: increased by 100%
• Enhanced data stream prefetching
• Memory controller moved on chip

Thread Priority

• Instances when unbalanced execution desirable
  - No work for opposite thread
  - Thread waiting on lock
  - Software-determined non-uniform balance
  - Power management
  - …
• Solution
  - Control instruction decode rate
  - Software/hardware controls 8 priority levels for each thread
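One way to picture decode-rate control is the rough sketch below. The slide does not give the ratio; the rule used here (a window of R = 2^(|X - Y| + 1) decode cycles, of which the lower-priority thread gets one) is recalled from the Kalla, Sinharoy and Tendler IEEE Micro article on POWER5 and should be treated as an assumption. The special handling of the lowest priority levels (thread off, power-save) is left out, and `decode_shares` is a hypothetical helper name.

```python
from fractions import Fraction


def decode_shares(prio_a: int, prio_b: int) -> tuple[Fraction, Fraction]:
    """Fraction of decode cycles given to thread A and thread B.

    Assumed rule: over a window of 2 ** (|a - b| + 1) cycles the
    lower-priority thread decodes in one cycle and the other thread in
    the rest. Only priorities 2..7 are modelled in this sketch.
    """
    for prio in (prio_a, prio_b):
        if not 2 <= prio <= 7:
            raise ValueError("sketch covers priority levels 2 through 7 only")
    window = 2 ** (abs(prio_a - prio_b) + 1)
    low_share = Fraction(1, window)
    high_share = 1 - low_share
    if prio_a >= prio_b:
        return high_share, low_share
    return low_share, high_share


for pair in ((4, 4), (6, 4), (2, 7)):
    a, b = decode_shares(*pair)
    print(f"priorities {pair}: thread A {a}, thread B {b}")
```

With equal priorities the window is two cycles and the threads alternate; each extra level of difference halves the lower-priority thread's share.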

Modifications to POWER4 System Structure

Power Efficient Design Implementation

• DC power mitigation
  - Leverage triple Vt technology
      Decrease low Vt usage by 90%
      Increase high Vt usage by 30%
  - Leverage triple Tox technology
      Thick Tox usage for decoupling capacitors
• AC power mitigation
  - Minimal usage of dynamic circuits
  - Reduce loading on clock mesh
  - Incorporation of dynamic clock gating
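The AC (switching) power items can be related to the standard CMOS estimate P = a * C * Vdd^2 * f, where a is the switching activity and C the switched capacitance. The sketch below only illustrates that relation: every number in it is hypothetical rather than taken from the POWER5 design, and both function names are introduced here.

```python
def dynamic_power_watts(activity: float, capacitance_farads: float,
                        vdd_volts: float, freq_hz: float) -> float:
    """Standard CMOS switching-power estimate: a * C * Vdd^2 * f."""
    return activity * capacitance_farads * vdd_volts ** 2 * freq_hz


def clock_power_with_gating(clock_cap_farads: float, vdd_volts: float,
                            freq_hz: float, enabled_fraction: float) -> float:
    """Clock power when gating leaves the clock running only part of the time.

    An ungated clock net toggles every cycle (activity 1.0); gating scales
    the effective activity by the fraction of cycles the clock is enabled.
    """
    if not 0.0 <= enabled_fraction <= 1.0:
        raise ValueError("enabled_fraction must be between 0 and 1")
    return dynamic_power_watts(enabled_fraction, clock_cap_farads,
                               vdd_volts, freq_hz)


# Hypothetical values: 10 nF of clock load, 1.3 V supply, 1.5 GHz clock.
ungated = clock_power_with_gating(10e-9, 1.3, 1.5e9, 1.0)
gated = clock_power_with_gating(10e-9, 1.3, 1.5e9, 0.4)
print(f"ungated: {ungated:.2f} W, gated to 40% of cycles: {gated:.2f} W")
```

In this model, reducing the load on the clock mesh lowers C and dynamic clock gating lowers a, so both act on the same product.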

Thermal control logic and sample thermal response

16-way Building Block

POWER5 Multi-Chip Module

• 95 mm × 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal

64-way SMP Interconnection

Interconnection exploits enhanced distributed switch

• All chip interconnections operate at half processor frequency and scale with processor frequency

POWER4 and POWER5 Storage Hierarchy

                                POWER4                 POWER5
L2 cache
  Capacity, line size           1.44 MB, 128 B line    1.92 MB, 128 B line
  Associativity, replacement    8-way, LRU             10-way, LRU
Off-chip L3 cache
  Capacity, line size           32 MB, 512 B line      36 MB, 256 B line
  Associativity, replacement    8-way, LRU             12-way, LRU
Chip interconnect
  Type                          Distributed switch     Enhanced distributed switch
  Intra-MCM data buses          ½ processor speed      Processor speed
  Inter-MCM data buses          ½ processor speed      Processor speed
Memory                          512 GB maximum         1,024 GB (1 TB) maximum
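The cache parameters in the storage hierarchy can be turned into set counts with sets = capacity / (ways × line size). The sketch below assumes the L2 capacities are 1.44 MB and 1.92 MB and reads them as 1,440 KiB and 1,920 KiB; that reading is an assumption made here so the division comes out whole, and `num_sets` is a hypothetical helper.

```python
KIB = 1024
MIB = 1024 * KIB


def num_sets(capacity_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# (capacity, associativity, line size) for each cache in the hierarchy;
# the L2 capacities use the assumed KiB reading described above.
CACHES = {
    "POWER4 L2": (1440 * KIB, 8, 128),
    "POWER5 L2": (1920 * KIB, 10, 128),
    "POWER4 L3": (32 * MIB, 8, 512),
    "POWER5 L3": (36 * MIB, 12, 256),
}

for name, (capacity, ways, line) in CACHES.items():
    print(f"{name}: {num_sets(capacity, ways, line)} sets")
```

Under these assumptions both L2 and L3 gain sets as well as ways going from POWER4 to POWER5.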

POWER Server Roadmap

Summary

• First dual-core SMT microprocessor
• Extended SMP to 64-way
• Operating in laboratory
• Power dynamically managed with no performance penalty
• Implementation permits future technology scalability from circuit and power perspective
• Innovative approach leveraging technology with system focus for high performance in a power-efficient design

Other References

[1] R. Kalla, B. Sinharoy, J. Tendler, "IBM POWER5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, IEEE Computer Society, March-April 2004.
[2] R. Kalla, IBM System Group, "IBM's POWER5 Design and Methodology," IBM Corporation, 2003.

Page 6: Winter 2004 Class Representation For Advanced VLSI Course Instructor : Dr S.M.Fakhraie Presented by : Naser Sedaghati Major Reference : Design and Implementation

System-level view of POWER5

bull Background hellip

Multi-threading Evolution

bull Threading hellip

Changes Going From ST to SMT Core

SMT easily added to Superscalar Micro-architecture Second Program Counter (PC) added to share I-fetch bandwidth GPRFPR rename mapper expanded to map second set of registers (High order

address bit indicates thread) Completion logic replicated to track two threads Thread bit added to most addresstag buses

bull Enhancement hellip

POWER5 Resources Size Enhancements

Enhanced caches and translation resources I-cache 64 KB 2-way set associative LRU

D-cache 32 KB 4-way set associative LRU

First level Data Translation 128 entries fully associative LRU

L2 Cache 192 MB 10-way set associative LRU

Larger resource pools Rename registers GPRs FPRs increased to 120 each

L2 cache coherency engines increased by 100

Enhanced data stream prefetching

Memory controller moved on chip

bull Enhancement hellip

Thread Priority

Instances when unbalanced execution desirable

No work for opposite thread Thread waiting on lock Software determined non uniform

balance Power management hellip

Solution Control instruction decode rate Softwarehardware controls 8

priority levels for each thread

bull Enhancement hellip

Modifications to POWER4 System Structurebull Memory hellip

Power Efficient Design Implementation

DC power mitigation 1048599Leverage triple Vt technology

Decrease low Vt usage by 90

Increase high Vt usage by 30

1048599Leverage triple Tox technology

Thick Tox usage for decoupling capacitors

1048599 AC power mitigation 1048599Minimal usage of dynamic circuits 1048599Reduce loading on clock mesh 1048599Incorporation of dynamic clock gating

bull Power hellip

Thermal control logic and sample thermal responsebull Power hellip

16-way Building Blockbull Additional hellip

POWER5 Multi-Chip Module

95mm 95mm

Four POWER5 chips

Four cache chips

4491 signal IOs

89 layers of metal

bull Additional hellip

64-way SMP Interconnection

bull Additional hellip

Interconnection exploits enhanced distributed switch

bull All chip interconnections operate at half processor frequency and scale with processor frequency

POWER4 and POWER5 Storage Hierarchy

                                POWER4                 POWER5
L2 cache
  Capacity, line size           1.44 MB, 128 B line    1.92 MB, 128 B line
  Associativity, replacement    8-way, LRU             10-way, LRU
Off-chip L3 cache
  Capacity, line size           32 MB, 512 B line      36 MB, 256 B line
  Associativity, replacement    8-way, LRU             12-way, LRU
Chip interconnect
  Type                          Distributed switch     Enhanced distributed switch
  Intra-MCM data buses          ½ processor speed      Processor speed
  Inter-MCM data buses          ½ processor speed      Processor speed
Memory                          512 GB maximum         1024 GB (1 TB) maximum
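As a worked example of how the capacity, line size and associativity columns relate, the sketch below computes the number of sets for the two L3 configurations. It assumes a plain set-associative organization with 1 MB = 2^20 bytes and ignores any sectoring or slicing of the real arrays; the function name `num_sets` is hypothetical.

```python
def num_sets(capacity_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a simple set-associative cache.

    sets = capacity / (line size * associativity)
    """
    set_bytes = line_bytes * ways
    if capacity_bytes % set_bytes != 0:
        raise ValueError("capacity is not a whole number of sets")
    return capacity_bytes // set_bytes


MB = 2 ** 20  # assumption: the table's MB means 2**20 bytes

if __name__ == "__main__":
    print("POWER4 L3 sets:", num_sets(32 * MB, 512, 8))
    print("POWER5 L3 sets:", num_sets(36 * MB, 256, 12))
```

Under these assumptions the POWER5 L3 has more sets and more ways than the POWER4 L3, with each line half as long.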

POWER Server Roadmap


Summary

• First dual-core SMT microprocessor
• Extended SMP to 64-way
• Operating in laboratory
• Power dynamically managed with no performance penalty
• Implementation permits future technology scalability from circuit and power perspective
• Innovative approach leveraging technology with system focus for high performance in a power-efficient design

