Download - Multi core-architecture
![Page 1: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/1.jpg)
1
Multi-core architectures
![Page 2: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/2.jpg)
2
Single-core computer
![Page 3: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/3.jpg)
3
Single-core CPU chipthe single core
![Page 4: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/4.jpg)
4
Multi-core architectures
• This lecture is about a new trend in computer architecture:Replicate multiple processor cores on a single die.
Core 1 Core 2 Core 3 Core 4
Multi-core CPU chip
![Page 5: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/5.jpg)
5
Multi-core CPU chip
• The cores fit on a single processor socket• Also called CMP (Chip Multi-Processor)
core
1
core
2
core
3
core
4
![Page 6: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/6.jpg)
6
The cores run in parallel
core
1
core
2
core
3
core
4
thread 1 thread 2 thread 3 thread 4
![Page 7: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/7.jpg)
7
Within each core, threads are time-sliced (just like on a uniprocessor)
core
1
core
2
core
3
core
4
several threads
several threads
several threads
several threads
![Page 8: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/8.jpg)
8
Interaction with theOperating System
• OS perceives each core as a separate processor
• OS scheduler maps threads/processes to different cores
• Most major OS support multi-core today:Windows, Linux, Mac OS X, …
![Page 9: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/9.jpg)
9
Why multi-core ?• Difficult to make single-core
clock frequencies even higher • Deeply pipelined circuits:
– heat problems– speed of light problems– difficult design and verification– large design teams necessary– server farms need expensive
air-conditioning• Many new applications are multithreaded • General trend in computer architecture (shift
towards more parallelism)
![Page 10: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/10.jpg)
10
Instruction-level parallelism
• Parallelism at the machine-instruction level• The processor can re-order, pipeline
instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
![Page 11: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/11.jpg)
11
Thread-level parallelism (TLP)• This is parallelism on a more coarser scale• Server can serve each client in a separate
thread (Web server, database server)• A computer game can do AI, graphics, and
physics in three separate threads• Single-core superscalar processors cannot
fully exploit TLP• Multi-core architectures are the next step in
processor evolution: explicitly exploiting TLP
![Page 12: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/12.jpg)
12
General context: Multiprocessors
• Multiprocessor is any computer with several processors
• SIMD– Single instruction, multiple data– Modern graphics cards
• MIMD– Multiple instructions, multiple data
Lemieux cluster,Pittsburgh
supercomputing center
![Page 13: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/13.jpg)
13
Multiprocessor memory types
• Shared memory:In this model, there is one (large) common shared memory for all processors
• Distributed memory:In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
![Page 14: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/14.jpg)
14
Multi-core processor is a special kind of a multiprocessor:
All processors are on the same chip
• Multi-core processors are MIMD:Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
• Multi-core is a shared memory multiprocessor:All cores share the same memory
![Page 15: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/15.jpg)
15
What applications benefit from multi-core?
• Database servers• Web servers (Web commerce)• Compilers• Multimedia applications• Scientific applications,
CAD/CAM• In general, applications with
Thread-level parallelism(as opposed to instruction-level parallelism)
Each can run on itsown core
![Page 16: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/16.jpg)
16
More examples
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• “Anything that can be threaded today will map efficiently to multi-core”
• BUT: some applications difficult toparallelize
![Page 17: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/17.jpg)
17
A technique complementary to multi-core:Simultaneous multithreading
• Problem addressed:The processor pipeline can get stalled:– Waiting for the result
of a long floating point (or integer) operation
– Waiting for data to arrive from memory
Other execution unitswait unused BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROM
BTB
L2 Cache and
Control
Bu
s
Source: Intel
![Page 18: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/18.jpg)
18
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads” on the same core
• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
![Page 19: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/19.jpg)
19
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROMBTB
L2 Cache and
Control
Bu
s
Thread 1: floating point
Without SMT, only a single thread can run at any given time
![Page 20: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/20.jpg)
20
Without SMT, only a single thread can run at any given time
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROMBTB
L2 Cache and
Control
Bu
s
Thread 2:integer operation
![Page 21: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/21.jpg)
21
SMT processor: both threads can run concurrently
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROMBTB
L2 Cache and
Control
Bu
s
Thread 1: floating pointThread 2:integer operation
![Page 22: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/22.jpg)
22
But: Can’t simultaneously use the same functional unit
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROMBTB
L2 Cache and
Control
Bu
s
Thread 1 Thread 2
This scenario isimpossible with SMTon a single core(assuming a single integer unit)IMPOSSIBLE
![Page 23: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/23.jpg)
23
SMT not a “true” parallel processor
• Enables better threading (e.g. up to 30%)• OS and applications perceive each
simultaneous thread as a separate “virtual processor”
• The chip has only a single copy of each resource
• Compare to multi-core:each core has its own copy of resources
![Page 24: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/24.jpg)
24
Multi-core: threads can run on separate cores
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROM
BTB
L2 Cache and
Control
Bu
s
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROM
BTB
L2 Cache and
Control
Bu
s
Thread 1 Thread 2
![Page 25: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/25.jpg)
25
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROM
BTB
L2 Cache and
Control
Bu
s
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROM
BTB
L2 Cache and
Control
Bu
s
Thread 3 Thread 4
Multi-core: threads can run on separate cores
![Page 26: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/26.jpg)
26
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)• The different combinations:
– Single-core, non-SMT: standard uniprocessor– Single-core, with SMT – Multi-core, non-SMT– Multi-core, with SMT: our fish machines
• The number of SMT threads:2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
![Page 27: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/27.jpg)
27
SMT Dual-core: all four threads can run concurrently
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROM
BTB
L2 Cache and
Control
Bu
s
BTB and I-TLB
Decoder
Trace Cache
Rename/Alloc
Uop queues
Schedulers
Integer Floating Point
L1 D-Cache D-TLB
uCode ROM
BTB
L2 Cache and
Control
Bu
s
Thread 1 Thread 3 Thread 2 Thread 4
![Page 28: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/28.jpg)
28
Comparison: multi-core vs SMT
• Advantages/disadvantages?
![Page 29: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/29.jpg)
29
Comparison: multi-core vs SMT
• Multi-core:– Since there are several cores,
each is smaller and not as powerful(but also easier to design and manufacture)
– However, great with thread-level parallelism• SMT
– Can have one large and fast superscalar core– Great performance on a single thread– Mostly still only exploits instruction-level
parallelism
![Page 30: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/30.jpg)
30
The memory hierarchy
• If simultaneous multithreading only: – all caches shared
• Multi-core chips:– L1 caches private– L2 caches private in some architectures
and shared in others• Memory is always shared
![Page 31: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/31.jpg)
31
“Fish” machines
• Dual-coreIntel Xeon processors
• Each core is hyper-threaded
• Private L1 caches• Shared L2 caches
memory
L2 cache
L1 cache L1 cacheC
O R
E 1 C
O R
E 0
hyper-threads
![Page 32: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/32.jpg)
32
Designs with private L2 caches
memory
L2 cache
L1 cache L1 cache
C O
R E
1 C O
R E
0
L2 cache
memory
L2 cache
L1 cache L1 cache
C O
R E
1 C O
R E
0
L2 cache
Both L1 and L2 are private
Examples: AMD Opteron, AMD Athlon, Intel Pentium D
L3 cache L3 cache
A design with L3 caches
Example: Intel Itanium 2
![Page 33: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/33.jpg)
33
Private vs shared caches?
• Advantages/disadvantages?
![Page 34: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/34.jpg)
34
Private vs shared caches
• Advantages of private:– They are closer to core, so faster access– Reduces contention
• Advantages of shared:– Threads on different cores can share the
same cache data– More cache space available if a single (or a
few) high-performance thread runs on the system
![Page 35: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/35.jpg)
35
The cache coherence problem• Since we have private caches:
How to keep the data consistent across caches?• Each core should perceive the memory as a
monolithic array, shared by all the cores
![Page 36: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/36.jpg)
36
The cache coherence problemSuppose variable x initially contains 15213
Core 1 Core 2 Core 3 Core 4
One or more levels of
cache
One or more levels of
cache
One or more levels of
cache
One or more levels of
cache
Main memoryx=15213
multi-core chip
![Page 37: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/37.jpg)
37
The cache coherence problemCore 1 reads x
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=15213
One or more levels of
cache
One or more levels of
cache
One or more levels of
cache
Main memoryx=15213
multi-core chip
![Page 38: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/38.jpg)
38
The cache coherence problemCore 2 reads x
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=15213
One or more levels of
cachex=15213
One or more levels of
cache
One or more levels of
cache
Main memoryx=15213
multi-core chip
![Page 39: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/39.jpg)
39
The cache coherence problemCore 1 writes to x, setting it to 21660
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=21660
One or more levels of
cachex=15213
One or more levels of
cache
One or more levels of
cache
Main memoryx=21660
multi-core chipassuming write-through caches
![Page 40: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/40.jpg)
40
The cache coherence problemCore 2 attempts to read x… gets a stale copy
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=21660
One or more levels of
cachex=15213
One or more levels of
cache
One or more levels of
cache
Main memoryx=21660
multi-core chip
![Page 41: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/41.jpg)
41
Solutions for cache coherence
• This is a general problem with multiprocessors, not limited just to multi-core
• There exist many solution algorithms, coherence protocols, etc.
• A simple solution:invalidation-based protocol with snooping
![Page 42: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/42.jpg)
42
Inter-core bus
Core 1 Core 2 Core 3 Core 4
One or more levels of
cache
One or more levels of
cache
One or more levels of
cache
One or more levels of
cache
Main memory
multi-core chip
inter-corebus
![Page 43: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/43.jpg)
43
Invalidation protocol with snooping
• Invalidation:If a core writes to a data item, all other copies of this data item in other caches are invalidated
• Snooping: All cores continuously “snoop” (monitor) the bus connecting the cores.
![Page 44: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/44.jpg)
44
The cache coherence problemRevisited: Cores 1 and 2 have both read x
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=15213
One or more levels of
cachex=15213
One or more levels of
cache
One or more levels of
cache
Main memoryx=15213
multi-core chip
![Page 45: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/45.jpg)
45
The cache coherence problemCore 1 writes to x, setting it to 21660
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=21660
One or more levels of
cachex=15213
One or more levels of
cache
One or more levels of
cache
Main memoryx=21660
multi-core chipassuming write-through caches
INVALIDATEDsendsinvalidationrequest
inter-corebus
![Page 46: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/46.jpg)
46
The cache coherence problem After invalidation:
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=21660
One or more levels of
cache
One or more levels of
cache
One or more levels of
cache
Main memoryx=21660
multi-core chip
![Page 47: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/47.jpg)
47
The cache coherence problemCore 2 reads x. Cache misses, and loads the new copy.
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=21660
One or more levels of
cachex=21660
One or more levels of
cache
One or more levels of
cache
Main memoryx=21660
multi-core chip
![Page 48: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/48.jpg)
48
Alternative to invalidate protocol: update protocol
Core 1 writes x=21660:
Core 1 Core 2 Core 3 Core 4
One or more levels of
cachex=21660
One or more levels of
cachex=21660
One or more levels of
cache
One or more levels of
cache
Main memoryx=21660
multi-core chipassuming write-through caches
UPDATED
broadcastsupdatedvalue inter-core
bus
![Page 49: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/49.jpg)
49
Which do you think is better? Invalidation or update?
![Page 50: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/50.jpg)
50
Invalidation vs update
• Multiple writes to the same location– invalidation: only the first time– update: must broadcast each write
(which includes new variable value)
• Invalidation generally performs better:it generates less bus traffic
![Page 51: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/51.jpg)
51
Invalidation protocols
• This was just the basic invalidation protocol
• More sophisticated protocols use extra cache state bits
• MSI, MESI(Modified, Exclusive, Shared, Invalid)
![Page 52: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/52.jpg)
52
Programming for multi-core
• Programmers must use threads or processes
• Spread the workload across multiple cores
• Write parallel algorithms
• OS will map threads/processes to cores
![Page 53: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/53.jpg)
53
Thread safety very important
• Pre-emptive context switching:context switch can happen AT ANY TIME
• True concurrency, not just uniprocessor time-slicing
• Concurrency bugs exposed much faster with multi-core
![Page 54: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/54.jpg)
54
However: Need to use synchronization even if only time-slicing on a uniprocessor
int counter=0;
void thread1() { int temp1=counter; counter = temp1 + 1;}
void thread2() { int temp2=counter; counter = temp2 + 1;}
![Page 55: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/55.jpg)
55
Need to use synchronization even if only time-slicing on a uniprocessor
temp1=counter;counter = temp1 + 1;temp2=counter;counter = temp2 + 1
temp1=counter;temp2=counter;counter = temp1 + 1;counter = temp2 + 1
gives counter=2
gives counter=1
![Page 56: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/56.jpg)
56
Assigning threads to the cores
• Each thread/process has an affinity mask
• Affinity mask specifies what cores the thread is allowed to run on
• Different threads can have different masks
• Affinities are inherited across fork()
![Page 57: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/57.jpg)
57
Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT
1011
core 3 core 2 core 1 core 0
• Process/thread is allowed to run on cores 0,2,3, but not on core 1
![Page 58: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/58.jpg)
58
Affinity masks when multi-core and SMT combined
• Separate bits for each simultaneous thread• Example: 4-way multi-core, 2 threads per core
1
core 3 core 2 core 1 core 0
1 0 0 1 0 1 1
thread1
• Core 2 can’t run the process• Core 1 can only use one simultaneous thread
thread0
thread1
thread0
thread1
thread0
thread1
thread0
![Page 59: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/59.jpg)
59
Default Affinities
• Default affinity mask is all 1s:all threads can run on all processors
• Then, the OS scheduler decides what threads run on what core
• OS scheduler detects skewed workloads, migrating threads to less busy processors
![Page 60: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/60.jpg)
60
Process migration is costly
• Need to restart the execution pipeline• Cached data is invalidated• OS scheduler tries to avoid migration as
much as possible: it tends to keeps a thread on the same core
• This is called soft affinity
![Page 61: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/61.jpg)
61
Hard affinities
• The programmer can prescribe her own affinities (hard affinities)
• Rule of thumb: use the default scheduler unless a good reason not to
![Page 62: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/62.jpg)
62
When to set your own affinities
• Two (or more) threads share data-structures in memory– map to same core so that can share cache
• Real-time threads:Example: a thread running a robot controller:- must not be context switched, or else robot can go unstable- dedicate an entire core just to this thread
Source: Sensable.com
![Page 63: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/63.jpg)
63
Kernel scheduler API#include <sched.h>int sched_getaffinity(pid_t pid, unsigned int len, unsigned long * mask);
Retrieves the current affinity mask of process ‘pid’ and stores it into space pointed to by ‘mask’.
‘len’ is the system word size: sizeof(unsigned int long)
![Page 64: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/64.jpg)
64
Kernel scheduler API#include <sched.h>int sched_setaffinity(pid_t pid,
unsigned int len, unsigned long * mask);
Sets the current affinity mask of process ‘pid’ to *mask ‘len’ is the system word size: sizeof(unsigned int long)
To query affinity of a running process:[barbic@bonito ~]$ taskset -p 3935pid 3935's current affinity mask: f
![Page 65: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/65.jpg)
65
Windows Task Manager
core 2
core 1
![Page 66: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/66.jpg)
66
Legal licensing issues
• Will software vendors charge a separate license per each core or only a single license per chip?
• Microsoft, Red Hat Linux, Suse Linux will license their OS per chip, not per core
![Page 67: Multi core-architecture](https://reader033.vdocuments.us/reader033/viewer/2022061114/5457ec4fb1af9fc15d8b48dc/html5/thumbnails/67.jpg)
67
Conclusion
• Multi-core chips an important new trend in computer architecture
• Several new multi-core chips in design phases
• Parallel programming techniques likely to gain importance