processor designjarnau.site.ac.upc.edu/pd/01_moores_law_and_dennard_scaling.pdf76 / 98 end of...

1 / 98

Processor Design

José María Arnau

2 / 98

Objective

● Understand the intricacies of advanced IC design and development

3 / 98

Objective


module dff(input clk, input d, output q);

always @(posedge clk)q <= d;

endmodule

...

RTL code

4 / 98

Objective


module dff(input clk, input d, output q);

always @(posedge clk)q <= d;

endmodule

...

RTL code

IC layout

5 / 98

PD Contents

1. Moore’s Law and Dennard Scaling

2. Hardware Design Cycle

3. Functional Verification

4. Circuit Design

5. Physical Design

6 / 98

Evaluation

● Presentation on related research work– PechaKucha format: 20 slides, 20 seconds/slide

● List of papers posted soon in PD web site– http://jarnau.site.ac.upc.edu/PD/

● 20% of final mark

http://jarnau.site.ac.upc.edu/PD/

7 / 98

Practical approach

● Theory sessions are useful for lab assigments– How to write effective test benches

– How to verify your design

– How to synthesize, place and route your design

– How to estimate max freq and area of your design

● EDA tools for digital synthesis– Qflow: http://opencircuitdesign.com/qflow/

http://opencircuitdesign.com/qflow/

8 / 98

PD Contents


(a) Transistor basic physics

(b) Power wall

(c) Dark silicon

2. Hardware Design Cycle

3. Functional Verification

4. Circuit Design

5. Physical Design

9 / 98


10 / 98


The number of transistors in a dense integrated circuit doubles about every two years.

Moore’s Law

11 / 98


The number of transistors in a dense integrated circuit doubles about every two years.

As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.

Moore’s Law

Dennard Scaling

12 / 98

MOS Transistors

● Metal Oxide Semiconductor

nMOS Transistor pMOS Transistor

13 / 98

MOS Transistors

nMOS gate

drain

source

pMOS gate

drain

source

14 / 98

MOS Transistors

nMOS gate

drain

source

pMOS gate

drain

source

gate = 0

drain

source

OFF

15 / 98

MOS Transistors

nMOS gate

drain

source

pMOS gate

drain

source

gate = 0

drain

source

OFF gate = 1

drain

source

ON

16 / 98

MOS Transistors

nMOS gate

drain

source

pMOS gate

drain

source

gate = 0

drain

source

OFF gate = 1

drain

source

ON

gate = 0

drain

source

ON

17 / 98

MOS Transistors

nMOS gate

drain

source

pMOS gate

drain

source

gate = 0

drain

source

OFF gate = 1

drain

source

ON

gate = 0

drain

source

ON gate = 1

drain

source

OFF

18 / 98

CMOS Logic

● Inverter

19 / 98

CMOS Logic

● Inverter

0

20 / 98

CMOS Logic

● Inverter

0

ON

21 / 98

CMOS Logic

● Inverter

0

ON

OFF

22 / 98

CMOS Logic

● Inverter

0

ON

OFF1

23 / 98

CMOS Logic

● Inverter

0

ON

OFF1

24 / 98

CMOS Logic

● Inverter

0

ON

OFF1

1

25 / 98

CMOS Logic

● Inverter

0

ON

OFF1

1

OFF

26 / 98

CMOS Logic

● Inverter

0

ON

OFF1

1 ON

OFF

27 / 98

CMOS Logic

● Inverter

0

ON

OFF1

1 ON

OFF

0

28 / 98

CMOS Logic

● NAND gate

29 / 98

CMOS Logic

● NAND gate

0

1

30 / 98

CMOS Logic

● NAND gate

0

1

ON

31 / 98

CMOS Logic

● NAND gate

0

1

OFF

ON

32 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

33 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

ON

34 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

ON

1

35 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

ON

1

36 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

ON

11

1

37 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

ON

11

1

OFF OFF

38 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

ON

11

1

OFF OFF

ON

ON

39 / 98

CMOS Logic

● NAND gate

0

1

OFF

OFF

ON

ON

11

1

OFF OFF

ON

ON

0

40 / 98

CMOS Logic

● Complementary Metal-Oxide Semicondutor– pMOS pull-up network

– nMOS pull-down network

41 / 98

CMOS Logic

● Multiplexers

42 / 98

CMOS Logic

● Multiplexers

0

0

43 / 98

CMOS Logic

● Multiplexers

0

0

1

44 / 98

CMOS Logic

● Multiplexers

0

0

1

ON

ON

45 / 98

CMOS Logic

● Multiplexers

0

0

1

ON

ON

OFFOFF

46 / 98

CMOS Logic

● Multiplexers

0

0

1

ON

ON

OFFOFF

D0

47 / 98

CMOS Logic

● Latches and flip-flops

D latch D flip-flop

48 / 98

Moore’s Law

● The number of transistors in a dense integrated circuit doubles about every two years

49 / 98

Moore’s Law


50 / 98

Moore’s Law


~2 years

51 / 98

Moore’s Law


161514131211109876543210

1959

1960

1961

1962

1963

196

419

65

1966

1967

1968

1969

1970

1971

1972

1973

1974

1975

LO

G 2 O

F T

HE

NU

MB

ER

OF

CO

MP

ON

EN

TS

PE

R I

NT

EG

RA

TE

D F

UN

CT

ION

Electronics, April 19, 1965.

52 / 98

Moore’s Law

53 / 98

Moore’s LawTransistors per Square Millimeter by Year, 1971–2018. Logarithmic scale. Data from Wikipedia.

https://en.wikipedia.org/wiki/Transistor_count#Microprocessors

54 / 98

Technology

● What do “7nm” or “10nm” technology mean?

55 / 98

Layout Design Rules● Define how small features can be and how closely

they can be reliably packed

● Lambda based rules

– C. Mead and L. Conway, Introduction to VLSI Systems, Reading, MA: Addison-Wesley, 1980.

56 / 98

Dennard Scaling

● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.

– Miniaturization provides smaller and faster transistors

57 / 98

Dennard Scaling



S

S

Area = S2

58 / 98

Dennard Scaling



S

S

Area = S2

0.7*S

0.7*S

Area = 0.5*S2

59 / 98

Dennard Scaling



S

S

Area = S2

0.7*S

0.7*S

Area = 0.5*S2

● Transistor dimensions scaled by 30%

60 / 98

Dennard Scaling



S

S

Area = S2

0.7*S

0.7*S

Area = 0.5*S2


● Delay reduced by 30%

61 / 98

Dennard Scaling



S

S

Area = S2

0.7*S

0.7*S

Area = 0.5*S2


● Delay reduced by 30%

● Frequency increased by ~40%

62 / 98

Performance Scaling

● Smaller transistors are faster– ~1.4x higher performance per generation

● More transistors allow more functionality– Better branch predictors

– Larger caches

– ILP: superscalar, out-of-order execution

– DLP: vector unit

– Better memory controller

63 / 98

CPU Performance

Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.

64 / 98

CPU Frequency

65 / 98

CPU FrequencyPower wall!

66 / 98

CMOS Power

● Dynamic Power– Switching power

● Static Power (Leakage)– Power dissipated even

when not switching

– Around one third of the total power in current technology

Pdyn=αCV 2 f

Psta=I leakV

Ptotal=Pdyn+Psta

67 / 98

CMOS Power

● Static Power

– Gate leakage (Igate)

– Subthreshold leakage (Isub)

– Junction leakage (Ijunct)

– Depends on the temperature

68 / 98

CMOS Power

Im, Y., Cho, E. S., Choi, K., & Kang, S. (2007, March). Development of Junction Temperature Decision (JTD) Map for Thermal Design of Nano-scale Devices Considering Leakage Power. In Twenty-Third Annual IEEE Semiconductor Thermal Measurement and Management Symposium (pp. 63-67). IEEE.

69 / 98

Frequency vs Power

70 / 98

Power Wall

71 / 98

Power Wall

72 / 98

Power Wall and ILP Limits

● Cannot increase performance by raising the frequency due to thermal constraints

● Diminishing returns in ILP– Computer architects optimized superscalar out-of-

order CPUs for decades

● But Moore’s law still provides more and more transistors

● What can we do with the extra transistors?

73 / 98

Multi-core Processors

https://www.guru3d.com/articles-summary/amd-ryzen-5-2400g-review,2.html

74 / 98

Multi-core Processors

● Use extra transistors to include multiple cores in the same chip

● Exploit TLP (Thread-Level Parallelism)● Programmer has to manually extract parallelism

– The Free Lunch Is Over. A Fundamental Turn Toward Concurrency in Software. Herb Sutter. http://www.gotw.ca/publications/concurrency-ddj.htm

http://www.gotw.ca/publications/concurrency-ddj.htm

75 / 98

CPU Trends

76 / 98

End of Dennard Scaling● According to Dennard scaling, power density remains

constant

– Smaller transistors consume less power

– If frequency is not increased, total power remains the same

● Dennard scaling did not take subthreshold leakage into account

● In current technology, more transistors result in more power

– Power density increases even at the same frequency!

P=αCV 2 f + I leakV

77 / 98

Power Density

John Hennessy and David Patterson. “A New Golden Age for Computer Architecture: Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development

78 / 98

Dark Silicon

● Dark Silicon and the End of Multicore Scaling. Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. ISCA 2011.

● Part of the chip must be powered off due to thermal constraints– 22 nm: 21% of the chip powered off

– 8 nm: ~50% of the chip powered off

79 / 98

Multicore Limitations

● Increasing the number of cores is no longer an effective solution to improve performance– Due to the dark silicon problem, not all the cores

can be powered at the same time

– Diminishing returns in TLP

● But Moore’s law still provides more and more transistors

● What can we do with the extra transistors?

80 / 98

Hardware Accelerators● Welcome to the golden age of hardware

accelerators!

https://www.hotchips.org/hc30/1conf/1.05_AMD_APU_AMD_Raven_HotChips30_Final.pdf

81 / 98

Hardware Accelerators

Qualcomm Snapdragon 835 [1] NVIDIA “Parker” SoC [2]

1. https://www.notebookcheck.net/Qualcomm-Snapdragon-835-SoC-Benchmarks-and-Specs.207842.0.html2. https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.30-Low-Power-Epub/HC28.22.322-Tegra-Parker-AndiSkende-NVIDIA-v01.pdf

82 / 98

ASIC

● Application Specific Integrated Circuit

https://ngcodec.com/markets-cloud-transcoding/

83 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;

84 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;

la $t0, v la $t1, v+16 li $t2, 0x40000000 mtc1 $t2, $f2for: lwc1 $f0, 0($t0) mul.s $f0, $f0, $f2 swc1 $f0, 0($t0) addiu $t0, $t0, 4 bne $t0, $t1, for

85 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;


https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors

86 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;


ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

4 4 4 4 4 16

4 8 4 4

4 4 8 4 16

4 4 4 4

4 8 4

20 8 32 12 16 4 16 16TOTAL:


87 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;


ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

4 4 4 4 4 16

4 8 4 4

4 4 8 4 16

4 4 4 4

4 8 4

20 8 32 12 16 4 16 16TOTAL:

Accesses to TLBTLB missesROB accessesIssue queue...


88 / 98

ASIC vs CPU

Main MemoryMain Memory

X

X

X

X

ReadMEM

WriteMEM

2.0f

MUL BytesR MM

Bytes W MM

4 16 16

89 / 98

ASIC vs CPU


X

X

X

X

ReadMEM

WriteMEM

2.0f

MUL BytesR MM

Bytes W MM

4 16 16

ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

20 8 32 12 16 4 16 16

90 / 98

ASIC vs CPU


X

X

X

X

ReadMEM

WriteMEM

2.0f

MUL BytesR MM

Bytes W MM

4 16 16

ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

20 8 32 12 16 4 16 16

ASIC does much less work, so it is faster and consumes less energy

91 / 98


Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling." IEEE Micro 35.3 (2015): 58-70.

92 / 98


Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling." IEEE Micro 35.3 (2015): 58-70.

93 / 98

Sea of Accelerators

Y. S. Shao, B. Reagen, G. Wei and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 97-108.

94 / 98


● ASICs provide higher performance and energy-efficiency

● But they are less programmable and have large development costs

● Only accelerate in hardware applications that are widely popular and compute intensive

● Example: machine learning accelerators

95 / 98

Google TPUJouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 2017.

● 92 TOPS

● 28 MB on-chip

● 15x – 30x speedup

● 30x – 80x higher Perf/W

96 / 98

Turing Lecture at ISCA 2018

https://www.acm.org/hennessy-patterson-turing-lecture

https://www.acm.org/hennessy-patterson-turing-lecture

97 / 98

What about the future?

● Hardware accelerators will start providing diminishing returns in the future

● Any solution?– Quantum computing

– DNA computing

– Photonic computing

– Superconducting computing

98 / 98

Bibliography

● Weste, N. H., & Harris, D. “CMOS VLSI Design : A Circuits and Systems Perspective”. 4th Edition, 2010.

● Kahng AB, Lienig J, Markov IL, Hu J. “VLSI Physical Design: from Graph Partitioning to Timing Closure”. Springer Science & Business Media; 2011 Jan 27.

● Mead, C., & Conway, L. (1980). Introduction to VLSI systems (Vol. 1080). Reading, MA: Addison-Wesley.

processor designjarnau.site.ac.upc.edu/pd/01_moores_law_and_dennard_scaling.pdf76 / 98 end of...

Documents