Thread level parallelism: It’s time now !
André Seznec
IRISA/INRIA
CAPS team
Disclaimer
No nice mathematical formulas
No theorems
BTW: I did pass the agrégation in maths, but that was a long time ago
Focus of high performance computer architecture
Up to 1980: mainframes
Up to 1990: supercomputers
Until now: general-purpose microprocessors
Coming: mobile computing, embedded computing
Uniprocessor architecture has driven performance progress so far
The famous “Moore’s law”
“Moore’s Law”: predicts exponential growth
Every 18 months:
• The number of transistors on a chip doubles
• The performance of the processor doubles
• The size of the main memory doubles
Over IRISA’s 30 years: a factor of 1,000,000
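The ×1,000,000 figure follows directly from the 18-month doubling time; a quick back-of-the-envelope check (the 30-year span and 18-month period are taken from the slide):

```python
# Doubling every 18 months over 30 years gives 30*12/18 = 20 doublings.
doublings = 30 * 12 / 18        # 20.0
growth = 2 ** doublings         # 2**20 = 1,048,576, i.e. roughly 1,000,000x
print(growth)
```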
Moore’s Law: transistors on a processor chip
[Chart: transistor counts per processor chip versus year (1970-2010), from the Intel 4004 and 80486 to the HP 8500 and Montecito, annotated with IRISA headcount milestones: 80, 200, 350 and 550 people.]
Moore’s Law: bits per DRAM chip
[Chart: DRAM capacity per chip versus year (1970-2010), from 1 Kbit through 64 Kbit, 1 Mbit and 64 Mbit to 1 Gbit, annotated with the successive IRISA directors: M. Métivier, G. Ruget, L. Kott, J.P. Verjus, J.P. Banatre, C. Labit.]
Moore’s Law: performance
[Chart: processor performance in MIPS versus year (1970-2010), from 0.06 MIPS to 3,000-20,000 MIPS, annotated with CAPS members: J. Lenfant, A. Seznec, P. Michaud, F. Bodin, I. Puaut, and the CAPS group.]
And parallel machines, so far…
Parallel machines have been built from every processor generation:
Hardware-coherent shared-memory multiprocessors:
• Dual-processor boards
• Up to 8-processor servers
• Large (very expensive) hardware-coherent NUMA architectures
Distributed-memory systems:
• Intel iPSC, Paragon
• Clusters, clusters of clusters…
Parallel machines have not been mainstream so far
But it might change
But it will change
What has prevented parallel machines from prevailing?
Economics: hardware cost grew superlinearly with the number of processors
Performance:
• Never able to use the latest-generation microprocessor
Scalability:
• Bus snooping does not scale well above 4-8 processors
Parallel applications are missing:
Writing parallel applications requires thinking “parallel”
Automatic parallelization works only on small code segments
What has prevented parallel machines from prevailing? (2)
We (the computer architects) were also guilty:
We kept finding ways to use those transistors in a uniprocessor
IC technology brought the transistors and the frequency
We brought the performance:
The compiler guys also helped a little bit
Up to now, what was microarchitecture about?
Memory access time is 100 ns
Program semantics are sequential
An instruction’s life (fetch, decode, …, execute, …, memory access, …) is 10-20 ns
How can we use the transistors to achieve the highest possible performance?
So far: up to 4 instructions every 0.3 ns
The architect tool box for uniprocessor performance
Pipelining
Instruction Level Parallelism
Speculative execution
Memory hierarchy
Pipelining
Just slice the instruction’s life into equal stages and launch executions concurrently:

I0: IF … DC … EX  M … CT
I1:     IF … DC … EX  M … CT
I2:         IF … DC … EX  M … CT
I3:             IF … DC … EX  M … CT
                                    (time →)
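The throughput gain can be sketched with a cycle count (stage and instruction counts here are illustrative, and the pipeline is assumed ideal, with no hazards):

```python
def cycles_unpipelined(n_inst, n_stages):
    # Each instruction occupies the whole datapath for n_stages cycles.
    return n_inst * n_stages

def cycles_pipelined(n_inst, n_stages):
    # Ideal pipeline: fill for n_stages cycles, then one
    # instruction completes every cycle.
    return n_stages + (n_inst - 1)

n, k = 1000, 5
print(cycles_unpipelined(n, k), cycles_pipelined(n, k))  # 5000 1004
```

For long instruction streams the speedup approaches the number of stages.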
+ Instruction Level Parallelism
IF … DC … EX  M … CT
IF … DC … EX  M … CT
    IF … DC … EX  M … CT
    IF … DC … EX  M … CT
        IF … DC … EX  M … CT
        IF … DC … EX  M … CT
            IF … DC … EX  M … CT
            IF … DC … EX  M … CT
(two instructions launched per cycle)
+ out-of-order execution
To optimize resource usage: execute instructions as soon as their operands are valid

IF  DC  EX    M    CT
IF  DC  wait  wait  EX  M  CT   (waits for its operands)
IF  DC  EX    M    CT
IF  DC  EX    M    CT           (younger instructions complete earlier)
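A toy dataflow scheduler makes the idea concrete: each instruction issues in the first cycle when all its source operands are ready, regardless of program order (the unit latency and the instruction encoding are assumptions for illustration):

```python
# Toy dataflow scheduler: each instruction is (dest_register, [source_registers]).
# An instruction issues as soon as all its sources are ready; its result
# becomes available one cycle later (unit latency, unlimited issue width).
def schedule(program):
    ready_at = {}       # register -> cycle its value becomes available
    issue_cycle = {}
    for i, (dest, srcs) in enumerate(program):
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        issue_cycle[i] = start
        ready_at[dest] = start + 1
    return issue_cycle

prog = [("r1", []),           # I0
        ("r2", ["r1"]),       # I1: depends on I0
        ("r3", []),           # I2: independent, issues with I0
        ("r4", ["r2", "r3"])] # I3: waits for I1 and I2
print(schedule(prog))  # {0: 0, 1: 1, 2: 0, 3: 2}
```

Note that I2 issues at cycle 0, ahead of its program-order predecessor I1: that is out-of-order execution.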
+ speculative execution
10-15% of instructions are branches:
On the Pentium 4, direction and target are known only at cycle 31!!
Predict and execute speculatively:
Validate at execution time
State-of-the-art predictors:
• ≈ 2 mispredictions per 1000 instructions
Also predict:
Memory (in)dependencies
(limited) data values
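The classic building block of such predictors is the 2-bit saturating counter; a minimal sketch (real predictors, including the Pentium 4’s, are table-based and far more elaborate):

```python
# 2-bit saturating counter: states 0-1 predict "not taken", 2-3 "taken".
# Two wrong outcomes in a row are needed to flip the prediction, so a
# loop branch is only mispredicted once per loop exit.
def count_mispredicts(outcomes, state=2):
    mispredicts = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction != taken:
            mispredicts += 1
        # Saturating update toward the actual outcome.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredicts

# A loop branch: taken 9 times, then falls through, repeated 3 times.
trace = ([True] * 9 + [False]) * 3
print(count_mispredicts(trace))  # 3: one mispredict per loop exit
```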
+ memory hierarchy
Processor
  ↕ I$ / D$: small, 1-2 cycles
  ↕ L2 cache: 10 cycles
  ↕ Main memory: 300 cycles ≈ 1000 instructions
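The cost of such a hierarchy is commonly summarized as the average memory access time (AMAT); the hit rates below are illustrative assumptions, while the latencies follow the slide:

```python
# AMAT = L1 latency + L1 miss rate * (L2 latency + L2 miss rate * memory latency)
def amat(l1_hit, l1_lat, l2_hit, l2_lat, mem_lat):
    return l1_lat + (1 - l1_hit) * (l2_lat + (1 - l2_hit) * mem_lat)

# Assumed: 95% L1 hits at 2 cycles, 90% L2 hits at 10 cycles, 300-cycle memory.
print(amat(0.95, 2, 0.90, 10, 300))  # 4.0 cycles on average
```

A few percent more misses at either level quickly dominates the average, which is why caches get so much silicon.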
+ prefetch
On a miss in the L2 cache, the processor stalls for 300 cycles!?!
Try to predict which memory blocks will miss in the near future and bring them into the L2 cache in advance
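One common heuristic is a stride prefetcher: watch the miss address stream and, once the same stride repeats, fetch the next block ahead of time. A minimal sketch (a real prefetcher tracks many streams, typically one per load instruction):

```python
# Minimal stride prefetcher: after two consecutive misses with the same
# stride, confidently predict (prefetch) the next address in the pattern.
def prefetch_predictions(miss_addresses):
    predictions = []
    prev, stride = None, None
    for addr in miss_addresses:
        if prev is not None:
            new_stride = addr - prev
            if new_stride == stride:
                predictions.append(addr + stride)  # stride confirmed: prefetch
            stride = new_stride
        prev = addr
    return predictions

# Misses while walking an array, with 64-byte cache blocks.
print(prefetch_predictions([0, 64, 128, 192]))  # [192, 256]
```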
Can we continue to just throw transistors in uniprocessors ?
• Increasing instructions per cycle?
• Larger caches?
• New prefetch mechanisms?
One billion transistors now!!
The uniprocessor road seems over
A 16-32-way uniprocessor seems out of reach:
Just not enough ILP
Quadratic complexity in a few key components:
• register file, bypass network, issue logic, …
Very long intra-CPU communications would be needed to avoid temperature hot spots
5-7 years to design a 4-way superscalar core:
• How long to design a 16-way?
One billion transistors:
Process/thread level parallelism… it’s time now!
Simultaneous multithreading: TLP on a uniprocessor!
Chip multiprocessor
Combine both!!
Simultaneous Multithreading (SMT): parallel processing on a uniprocessor
Functional units are underused on superscalar processors
SMT: share the functional units of a superscalar processor between several processes
Advantages:
A single process can use all the resources
Dynamic sharing of all structures on parallel/multiprocess workloads
Superscalar vs SMT
[Figure: issue-slot occupancy over time. On the superscalar, many slots go unused; with SMT, threads 1-4 fill the otherwise unused slots.]
The Chip Multiprocessor
Put a shared-memory multiprocessor on a single die:
Duplicate the processor, its L1 caches, maybe its L2
Keep the caches coherent
Share the last level of the memory hierarchy (maybe)
Share the external interface (to memory and system)
General purpose Chip MultiProcessor (CMP): why it did not (really) appear before 2003
Until 2003, there was a better (economic) use for the transistors:
Single-process performance is what matters most
More complex superscalar implementations
More cache space
Diminishing returns!!
General Purpose CMP: why it still should not be mainstream
No further (significant) benefit in making single processors more complex:
Logically, we should use smaller and cheaper chips, or integrate more functionality on the same chip:
• e.g., the graphics pipeline
Poor catalog of parallel applications:
The single processor is still mainstream
Parallel programming is the privilege (knowledge) of a few
General Purpose CMP: why it appears as mainstream now!
The economic factor:
- The consumer pays 1000-2000 euros for a PC
- The professional pays 2000-3000 euros for a PC
A constant: the processor represents 15-30% of the PC price
Intel and AMD will not cut their share
General Purpose Multicore SMT in 2005: an industry reality
PCs:
Dual-core 2-thread SMT Pentium 4
Dual-core AMD64
Servers:
Itanium Montecito: dual-core, 2-thread SMT
IBM Power 5: dual-core, 2-thread SMT
Sun Niagara: 8-core, 4-thread
a 4-core 4-thread SMT: think about tuning for performance !
[Figure: four cores P0-P3, each running 4 threads, in front of a shared L3 cache and a huge main memory.]
a 4-core 4-thread SMT: think about tuning for performance ! (2)
Write the application with more than 16 threads
Take into account first-level memory hierarchy sharing!
Take into account second-level memory hierarchy sharing!
Optimize for instruction-level parallelism
And the threads are competing for memory bandwidth!!
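One concrete way to respect the shared caches when tuning is to give each thread a large contiguous chunk of the data (good locality within a core) rather than interleaving elements across threads. A sketch (the thread count and data size are illustrative; on a real CMP you would also pin threads to cores):

```python
from threading import Thread

def partition(n_items, n_threads):
    # Contiguous [start, end) ranges, remainder spread over the first chunks.
    base, extra = divmod(n_items, n_threads)
    ranges, start = [], 0
    for t in range(n_threads):
        end = start + base + (1 if t < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

data = list(range(10_000))
results = [0] * 16

def work(tid, lo, hi):
    # Each thread touches only its own contiguous slice of the data.
    results[tid] = sum(data[lo:hi])

threads = [Thread(target=work, args=(t, lo, hi))
           for t, (lo, hi) in enumerate(partition(len(data), 16))]
for th in threads: th.start()
for th in threads: th.join()
print(sum(results) == sum(data))  # True: the partitioning loses nothing
```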
Hardware TLP is there !!
But where are the threads/processes ?
A unique opportunity for the software industry: hardware parallelism comes for free
Waiting for the threads (1)
Generate threads to increase the performance of single threads:
Speculative threads:
• Predict threads at medium granularity
– Either in software or in hardware
Helper threads:
• Run ahead with a speculative skeleton of the application to:
– Avoid branch mispredictions
– Prefetch data
Waiting for the threads (2)
Hardware transient faults are becoming a concern:
Run the same thread twice on two cores and check integrity
Security:
Array bound checking is nearly free on an out-of-order core
Encryption/decryption
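The run-twice-and-check idea can be sketched in a few lines: execute the same computation on two threads and flag a transient fault if the results disagree (placement of the two threads onto distinct cores is left to the OS here):

```python
from threading import Thread

def run_redundant(fn, arg):
    # Run fn(arg) on two threads and compare the results.
    results = [None, None]
    def wrapper(slot):
        results[slot] = fn(arg)
    threads = [Thread(target=wrapper, args=(i,)) for i in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    if results[0] != results[1]:
        raise RuntimeError("transient fault detected: results disagree")
    return results[0]

print(run_redundant(lambda x: sum(range(x)), 100))  # 4950
```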
Waiting for the threads (3)
Hardware clock frequency is limited by:
The power budget, with every core running
Temperature hot-spots
On single-thread workloads:
Increase the clock frequency and migrate the process across cores
Conclusion
Hardware TLP is becoming mainstream for general-purpose computing
Moderate degrees of hardware TLP will be available in the mid-term
This is the first real opportunity for the whole software industry to go parallel!
But it may demand a new generation of application developers!!