processor designjarnau.site.ac.upc.edu/pd/01_moores_law_and_dennard_scaling.pdf76 / 98 end of...
TRANSCRIPT
1 / 98
Processor Design
José María Arnau
2 / 98
Objective
● Understand the intricacies of advanced IC design and development
3 / 98
Objective
● Understand the intricacies of advanced IC design and development
module dff(input clk, input d, output q);
always @(posedge clk)q <= d;
endmodule
...
RTL code
4 / 98
Objective
● Understand the intricacies of advanced IC design and development
module dff(input clk, input d, output q);
always @(posedge clk)q <= d;
endmodule
...
RTL code
IC layout
5 / 98
PD Contents
1. Moore’s Law and Dennard Scaling
2. Hardware Design Cycle
3. Functional Verification
4. Circuit Design
5. Physical Design
6 / 98
Evaluation
● Presentation on related research work– PechaKucha format: 20 slides, 20 seconds/slide
● List of papers posted soon in PD web site– http://jarnau.site.ac.upc.edu/PD/
● 20% of final mark
7 / 98
Practical approach
● Theory sessions are useful for lab assigments– How to write effective test benches
– How to verify your design
– How to synthesize, place and route your design
– How to estimate max freq and area of your design
● EDA tools for digital synthesis– Qflow: http://opencircuitdesign.com/qflow/
8 / 98
PD Contents
1. Moore’s Law and Dennard Scaling
(a) Transistor basic physics
(b) Power wall
(c) Dark silicon
2. Hardware Design Cycle
3. Functional Verification
4. Circuit Design
5. Physical Design
9 / 98
1. Moore’s Law and Dennard Scaling
10 / 98
1. Moore’s Law and Dennard Scaling
The number of transistors in a dense integrated circuit doubles about every two years.
Moore’s Law
11 / 98
1. Moore’s Law and Dennard Scaling
The number of transistors in a dense integrated circuit doubles about every two years.
As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.
Moore’s Law
Dennard Scaling
12 / 98
MOS Transistors
● Metal Oxide Semiconductor
nMOS Transistor pMOS Transistor
13 / 98
MOS Transistors
nMOS gate
drain
source
pMOS gate
drain
source
14 / 98
MOS Transistors
nMOS gate
drain
source
pMOS gate
drain
source
gate = 0
drain
source
OFF
15 / 98
MOS Transistors
nMOS gate
drain
source
pMOS gate
drain
source
gate = 0
drain
source
OFF gate = 1
drain
source
ON
16 / 98
MOS Transistors
nMOS gate
drain
source
pMOS gate
drain
source
gate = 0
drain
source
OFF gate = 1
drain
source
ON
gate = 0
drain
source
ON
17 / 98
MOS Transistors
nMOS gate
drain
source
pMOS gate
drain
source
gate = 0
drain
source
OFF gate = 1
drain
source
ON
gate = 0
drain
source
ON gate = 1
drain
source
OFF
18 / 98
CMOS Logic
● Inverter
19 / 98
CMOS Logic
● Inverter
0
20 / 98
CMOS Logic
● Inverter
0
ON
21 / 98
CMOS Logic
● Inverter
0
ON
OFF
22 / 98
CMOS Logic
● Inverter
0
ON
OFF1
23 / 98
CMOS Logic
● Inverter
0
ON
OFF1
24 / 98
CMOS Logic
● Inverter
0
ON
OFF1
1
25 / 98
CMOS Logic
● Inverter
0
ON
OFF1
1
OFF
26 / 98
CMOS Logic
● Inverter
0
ON
OFF1
1 ON
OFF
27 / 98
CMOS Logic
● Inverter
0
ON
OFF1
1 ON
OFF
0
28 / 98
CMOS Logic
● NAND gate
29 / 98
CMOS Logic
● NAND gate
0
1
30 / 98
CMOS Logic
● NAND gate
0
1
ON
31 / 98
CMOS Logic
● NAND gate
0
1
OFF
ON
32 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
33 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
ON
34 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
ON
1
35 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
ON
1
36 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
ON
11
1
37 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
ON
11
1
OFF OFF
38 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
ON
11
1
OFF OFF
ON
ON
39 / 98
CMOS Logic
● NAND gate
0
1
OFF
OFF
ON
ON
11
1
OFF OFF
ON
ON
0
40 / 98
CMOS Logic
● Complementary Metal-Oxide Semicondutor– pMOS pull-up network
– nMOS pull-down network
41 / 98
CMOS Logic
● Multiplexers
42 / 98
CMOS Logic
● Multiplexers
0
0
43 / 98
CMOS Logic
● Multiplexers
0
0
1
44 / 98
CMOS Logic
● Multiplexers
0
0
1
ON
ON
45 / 98
CMOS Logic
● Multiplexers
0
0
1
ON
ON
OFFOFF
46 / 98
CMOS Logic
● Multiplexers
0
0
1
ON
ON
OFFOFF
D0
47 / 98
CMOS Logic
● Latches and flip-flops
D latch D flip-flop
48 / 98
Moore’s Law
● The number of transistors in a dense integrated circuit doubles about every two years
49 / 98
Moore’s Law
● The number of transistors in a dense integrated circuit doubles about every two years
50 / 98
Moore’s Law
● The number of transistors in a dense integrated circuit doubles about every two years
~2 years
51 / 98
Moore’s Law
● The number of transistors in a dense integrated circuit doubles about every two years
161514131211109876543210
1959
1960
1961
1962
1963
196
419
65
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
LO
G 2 O
F T
HE
NU
MB
ER
OF
CO
MP
ON
EN
TS
PE
R I
NT
EG
RA
TE
D F
UN
CT
ION
Electronics, April 19, 1965.
52 / 98
Moore’s Law
53 / 98
Moore’s LawTransistors per Square Millimeter by Year, 1971–2018. Logarithmic scale. Data from Wikipedia.
54 / 98
Technology
● What do “7nm” or “10nm” technology mean?
55 / 98
Layout Design Rules● Define how small features can be and how closely
they can be reliably packed
● Lambda based rules
– C. Mead and L. Conway, Introduction to VLSI Systems, Reading, MA: Addison-Wesley, 1980.
56 / 98
Dennard Scaling
● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors
57 / 98
Dennard Scaling
● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors
S
S
Area = S2
58 / 98
Dennard Scaling
● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors
S
S
Area = S2
0.7*S
0.7*S
Area = 0.5*S2
59 / 98
Dennard Scaling
● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors
S
S
Area = S2
0.7*S
0.7*S
Area = 0.5*S2
● Transistor dimensions scaled by 30%
60 / 98
Dennard Scaling
● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors
S
S
Area = S2
0.7*S
0.7*S
Area = 0.5*S2
● Transistor dimensions scaled by 30%
● Delay reduced by 30%
61 / 98
Dennard Scaling
● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors
S
S
Area = S2
0.7*S
0.7*S
Area = 0.5*S2
● Transistor dimensions scaled by 30%
● Delay reduced by 30%
● Frequency increased by ~40%
62 / 98
Performance Scaling
● Smaller transistors are faster– ~1.4x higher performance per generation
● More transistors allow more functionality– Better branch predictors
– Larger caches
– ILP: superscalar, out-of-order execution
– DLP: vector unit
– Better memory controller
63 / 98
CPU Performance
Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.
64 / 98
CPU Frequency
65 / 98
CPU FrequencyPower wall!
66 / 98
CMOS Power
● Dynamic Power– Switching power
● Static Power (Leakage)– Power dissipated even
when not switching
– Around one third of the total power in current technology
Pdyn=αCV 2 f
Psta=I leakV
Ptotal=Pdyn+Psta
67 / 98
CMOS Power
● Static Power
– Gate leakage (Igate)
– Subthreshold leakage (Isub)
– Junction leakage (Ijunct)
– Depends on the temperature
68 / 98
CMOS Power
Im, Y., Cho, E. S., Choi, K., & Kang, S. (2007, March). Development of Junction Temperature Decision (JTD) Map for Thermal Design of Nano-scale Devices Considering Leakage Power. In Twenty-Third Annual IEEE Semiconductor Thermal Measurement and Management Symposium (pp. 63-67). IEEE.
69 / 98
Frequency vs Power
70 / 98
Power Wall
71 / 98
Power Wall
72 / 98
Power Wall and ILP Limits
● Cannot increase performance by raising the frequency due to thermal constraints
● Diminishing returns in ILP– Computer architects optimized superscalar out-of-
order CPUs for decades
● But Moore’s law still provides more and more transistors
● What can we do with the extra transistors?
73 / 98
Multi-core Processors
https://www.guru3d.com/articles-summary/amd-ryzen-5-2400g-review,2.html
74 / 98
Multi-core Processors
● Use extra transistors to include multiple cores in the same chip
● Exploit TLP (Thread-Level Parallelism)● Programmer has to manually extract parallelism
– The Free Lunch Is Over. A Fundamental Turn Toward Concurrency in Software. Herb Sutter. http://www.gotw.ca/publications/concurrency-ddj.htm
75 / 98
CPU Trends
76 / 98
End of Dennard Scaling● According to Dennard scaling, power density remains
constant
– Smaller transistors consume less power
– If frequency is not increased, total power remains the same
● Dennard scaling did not take subthreshold leakage into account
● In current technology, more transistors result in more power
– Power density increases even at the same frequency!
P=αCV 2 f + I leakV
77 / 98
Power Density
John Hennessy and David Patterson. “A New Golden Age for Computer Architecture: Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development
78 / 98
Dark Silicon
● Dark Silicon and the End of Multicore Scaling. Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. ISCA 2011.
● Part of the chip must be powered off due to thermal constraints– 22 nm: 21% of the chip powered off
– 8 nm: ~50% of the chip powered off
79 / 98
Multicore Limitations
● Increasing the number of cores is no longer an effective solution to improve performance– Due to the dark silicon problem, not all the cores
can be powered at the same time
– Diminishing returns in TLP
● But Moore’s law still provides more and more transistors
● What can we do with the extra transistors?
80 / 98
Hardware Accelerators● Welcome to the golden age of hardware
accelerators!
https://www.hotchips.org/hc30/1conf/1.05_AMD_APU_AMD_Raven_HotChips30_Final.pdf
81 / 98
Hardware Accelerators
Qualcomm Snapdragon 835 [1] NVIDIA “Parker” SoC [2]
1. https://www.notebookcheck.net/Qualcomm-Snapdragon-835-SoC-Benchmarks-and-Specs.207842.0.html2. https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.30-Low-Power-Epub/HC28.22.322-Tegra-Parker-AndiSkende-NVIDIA-v01.pdf
82 / 98
ASIC
● Application Specific Integrated Circuit
https://ngcodec.com/markets-cloud-transcoding/
83 / 98
Software Overheads
for (i = 0; i < 4; i++) v[i] *= 2.0f;
84 / 98
Software Overheads
for (i = 0; i < 4; i++) v[i] *= 2.0f;
la $t0, v la $t1, v+16 li $t2, 0x40000000 mtc1 $t2, $f2for: lwc1 $f0, 0($t0) mul.s $f0, $f0, $f2 swc1 $f0, 0($t0) addiu $t0, $t0, 4 bne $t0, $t1, for
85 / 98
Software Overheads
for (i = 0; i < 4; i++) v[i] *= 2.0f;
la $t0, v la $t1, v+16 li $t2, 0x40000000 mtc1 $t2, $f2for: lwc1 $f0, 0($t0) mul.s $f0, $f0, $f2 swc1 $f0, 0($t0) addiu $t0, $t0, 4 bne $t0, $t1, for
https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors
86 / 98
Software Overheads
for (i = 0; i < 4; i++) v[i] *= 2.0f;
la $t0, v la $t1, v+16 li $t2, 0x40000000 mtc1 $t2, $f2for: lwc1 $f0, 0($t0) mul.s $f0, $f0, $f2 swc1 $f0, 0($t0) addiu $t0, $t0, 4 bne $t0, $t1, for
ReadI$
R/W D$
ReadRF
WriteRF
ADD/SUB
MUL BytesR MM
Bytes W MM
4 4 4 4 4 16
4 8 4 4
4 4 8 4 16
4 4 4 4
4 8 4
20 8 32 12 16 4 16 16TOTAL:
https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors
87 / 98
Software Overheads
for (i = 0; i < 4; i++) v[i] *= 2.0f;
la $t0, v la $t1, v+16 li $t2, 0x40000000 mtc1 $t2, $f2for: lwc1 $f0, 0($t0) mul.s $f0, $f0, $f2 swc1 $f0, 0($t0) addiu $t0, $t0, 4 bne $t0, $t1, for
ReadI$
R/W D$
ReadRF
WriteRF
ADD/SUB
MUL BytesR MM
Bytes W MM
4 4 4 4 4 16
4 8 4 4
4 4 8 4 16
4 4 4 4
4 8 4
20 8 32 12 16 4 16 16TOTAL:
Accesses to TLBTLB missesROB accessesIssue queue...
https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors
88 / 98
ASIC vs CPU
Main MemoryMain Memory
X
X
X
X
ReadMEM
WriteMEM
2.0f
MUL BytesR MM
Bytes W MM
4 16 16
89 / 98
ASIC vs CPU
Main MemoryMain Memory
X
X
X
X
ReadMEM
WriteMEM
2.0f
MUL BytesR MM
Bytes W MM
4 16 16
ReadI$
R/W D$
ReadRF
WriteRF
ADD/SUB
MUL BytesR MM
Bytes W MM
20 8 32 12 16 4 16 16
90 / 98
ASIC vs CPU
Main MemoryMain Memory
X
X
X
X
ReadMEM
WriteMEM
2.0f
MUL BytesR MM
Bytes W MM
4 16 16
ReadI$
R/W D$
ReadRF
WriteRF
ADD/SUB
MUL BytesR MM
Bytes W MM
20 8 32 12 16 4 16 16
ASIC does much less work, so it is faster and consumes less energy
91 / 98
Hardware Accelerators
Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling." IEEE Micro 35.3 (2015): 58-70.
92 / 98
Hardware Accelerators
Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling." IEEE Micro 35.3 (2015): 58-70.
93 / 98
Sea of Accelerators
Y. S. Shao, B. Reagen, G. Wei and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 97-108.
94 / 98
Hardware Accelerators
● ASICs provide higher performance and energy-efficiency
● But they are less programmable and have large development costs
● Only accelerate in hardware applications that are widely popular and compute intensive
● Example: machine learning accelerators
95 / 98
Google TPUJouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 2017.
● 92 TOPS
● 28 MB on-chip
● 15x – 30x speedup
● 30x – 80x higher Perf/W
96 / 98
Turing Lecture at ISCA 2018
https://www.acm.org/hennessy-patterson-turing-lecture
97 / 98
What about the future?
● Hardware accelerators will start providing diminishing returns in the future
● Any solution?– Quantum computing
– DNA computing
– Photonic computing
– Superconducting computing
98 / 98
Bibliography
● Weste, N. H., & Harris, D. “CMOS VLSI Design : A Circuits and Systems Perspective”. 4th Edition, 2010.
● Kahng AB, Lienig J, Markov IL, Hu J. “VLSI Physical Design: from Graph Partitioning to Timing Closure”. Springer Science & Business Media; 2011 Jan 27.
● Mead, C., & Conway, L. (1980). Introduction to VLSI systems (Vol. 1080). Reading, MA: Addison-Wesley.