Bandwidth-Aware Scheduling for Clustered Multi-Core Systems
Panayiotis Petrides (2), Frederico Pratas (1), Pedro Trancoso (2) and Leonel Sousa (1)
(1) SiPS Group, IST, Portugal
(2) CASPER Group, UCY, Cyprus
technology from seed
Motivation
Outline
• Target Architectures
• Overall Scheduling/Mapping Method
• Bandwidth and Execution Time Profiling
• Bandwidth-Aware Scheduler (BAS)
• Experimental Results
• Conclusions and Future Work
Target Architectures: Clustered Multi-Core
• Pros:
– Integration: reduces memory latency
– Multiple controllers: larger memory bandwidth
• Cons:
– Die area
– Difficult memory access management
• Bandwidth-Aware Scheduler: balance memory requests among the different controllers
• Existing architectures:
– Intel Single-chip Cloud Computer
– Intel Nehalem (approximately the same problem)
Target Architectures: hardware and tools
• Binary instrumentation
– PIN + cache simulation extension
– Single-core execution
• Bandwidth
– Bandwidth is calculated independently for each application using:
» Type of memory accesses
» Number of memory accesses
» Average execution time
• Architectural model
Characteristics          SCC-like architecture
Core type                Intel [email protected]
L1 cache                 4-way, 32KB data, 32KB instructions
L2 cache                 8-way, 2MB unified
Cache policies           Write-back, write-allocate, no support for coherency
Cluster configurations   32 (8x4), 64 (16x4), 128 (32x4)
# Controllers            4
Main memory              DDR3-800 (6.4 GB/s)
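The bandwidth calculation listed on this slide (type of accesses, number of accesses, average execution time) could be sketched as below. This is only an illustration, not the authors' tool: it assumes that only L2 read misses and write-backs reach main memory and that each moves one 64-byte cache line. The line size is not given in the slides, and the `Profile` fields are hypothetical counter names.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumption: the line size is not stated in the slides


@dataclass
class Profile:
    """Hypothetical per-application counters from an instrumented single-core run."""
    l2_read_misses: int  # cache lines fetched from main memory
    l2_writebacks: int   # dirty cache lines written back to main memory
    exec_time_s: float   # average execution time in seconds


def bandwidth_gbs(p: Profile) -> float:
    """Average off-chip bandwidth demand in GB/s (1 GB = 1e9 bytes)."""
    if p.exec_time_s <= 0:
        raise ValueError("execution time must be positive")
    bytes_moved = (p.l2_read_misses + p.l2_writebacks) * CACHE_LINE_BYTES
    return bytes_moved / p.exec_time_s / 1e9


# Made-up counter values, for illustration only.
demo = Profile(l2_read_misses=150_000_000, l2_writebacks=50_000_000, exec_time_s=20.0)
print(f"{bandwidth_gbs(demo):.2f} GB/s")
```

Separating reads from write-backs keeps the "type of memory accesses" visible, even though this sketch weights both equally.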
Overall Scheduling Method
Static
• Memory bandwidth profiling of representative applications from different areas
• Classification according to bandwidth requirements
Dynamic
• Dynamic bandwidth sensing
• Use the static information to classify applications for run-time scheduling
• Rebalance memory accesses according to the classification
• Distribute (schedule) applications across the multi-core clusters to overcome the “memory wall”
– Main assumption: all cores are busy
Experimental Setup: Set of Applications
Name       Description                                                              Tests
TPC-H      Decision support benchmark                                               16 queries
DEX        Graph-based database query application                                   8 queries
MrBayes    Bioinformatics application performing Bayesian inference of phylogeny    17 DNA data sets
Biobench   Benchmark suite containing different bioinformatics algorithms           phylip_protdist, phylip_protpars, fasta_dna, fasta_protein, and hmmer
NAMD       Computational chemistry application for molecular dynamics simulation    single precision, double precision
PARSEC     Benchmark of representative applications from different areas            Blackscholes, streamcluster, freqmine
Total                                                                               51 different workloads
Bandwidth and Execution Time Profiling: Classification
• Dimensions considered for characterizing an application
– Execution time
– Bandwidth requirements
• Classification used for each application:
Execution time \ Bandwidth   Low (<0.5*AV)   Medium (≥0.5*AV and <1.5*AV)   High (≥1.5*AV)
Short (<10s)                 Short-Low       Short-Medium                   Short-High
Medium (≥10s and <100s)      Medium-Low      Medium-Medium                  Medium-High
Long (≥100s)                 Long-Low        Long-Medium                    Long-High
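The classification table can be expressed as a small function. This is a minimal sketch using the thresholds from the table. `av` is passed in as a parameter because the slides do not spell out how AV is computed, and the function name is an assumption.

```python
def classify(exec_time_s: float, bandwidth: float, av: float) -> str:
    """Return the 'Time-Bandwidth' class of an application, e.g. 'Short-High'.

    bandwidth and av must use the same unit; av is the reference value
    called AV in the table (assumed to be supplied by the caller).
    """
    if exec_time_s < 10:
        time_class = "Short"
    elif exec_time_s < 100:
        time_class = "Medium"
    else:
        time_class = "Long"

    if bandwidth < 0.5 * av:
        bw_class = "Low"
    elif bandwidth < 1.5 * av:
        bw_class = "Medium"
    else:
        bw_class = "High"

    return f"{time_class}-{bw_class}"


# Illustrative values only.
print(classify(exec_time_s=42.0, bandwidth=2.0, av=1.0))
```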
• Bandwidth calculation:
– Divide the execution into several phases (quanta).
– Calculate the bandwidth for each phase.
– Calculate the average over all phases of the application.
• Phase: the smallest period of time considered between two scheduling actions.
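The per-phase bandwidth calculation can be sketched as follows. The sketch assumes equal-length phases and that the number of bytes transferred in each phase is already known; both the function name and the input layout are hypothetical.

```python
def average_phase_bandwidth(bytes_per_phase, phase_length_s):
    """Return (per-phase bandwidths, their average) in bytes per second.

    bytes_per_phase: bytes transferred to or from memory in each phase
    phase_length_s:  duration of one phase (assumed equal for all phases)
    """
    if phase_length_s <= 0:
        raise ValueError("phase length must be positive")
    if not bytes_per_phase:
        raise ValueError("at least one phase is required")
    per_phase = [b / phase_length_s for b in bytes_per_phase]
    return per_phase, sum(per_phase) / len(per_phase)


# Three one-second phases with made-up traffic: a burst, a heavier burst, an idle phase.
per_phase, average = average_phase_bandwidth([2e9, 4e9, 0.0], phase_length_s=1.0)
print(per_phase, average)
```

Keeping the per-phase values alongside the average matters for a run-time scheduler, since an application with a modest average can still have phases that saturate a controller.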
Bandwidth and Execution Time Profiling: Classification (cont.)
• Selection of one representative application per class:
– Calculate the center of each class
– Select the application that is nearest to the center
• Nine representative applications were selected.
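The selection of a representative application can be sketched as below. The slides do not define the center or the distance, so this assumes the center is the arithmetic mean of (execution time, bandwidth) and that nearness is plain Euclidean distance without axis scaling. The application names and values are made up.

```python
import math


def representative(apps):
    """Return the name of the application nearest to the class center.

    apps: dict mapping application name -> (exec_time_s, bandwidth)
    """
    if not apps:
        raise ValueError("class has no applications")
    n = len(apps)
    center = (
        sum(t for t, _ in apps.values()) / n,
        sum(bw for _, bw in apps.values()) / n,
    )
    # Assumption: unscaled Euclidean distance in the (time, bandwidth) plane.
    return min(apps, key=lambda name: math.dist(apps[name], center))


# Hypothetical members of one class.
one_class = {"app_a": (2.0, 1.0), "app_b": (4.0, 2.0), "app_c": (9.0, 3.0)}
print(representative(one_class))
```

In practice the two axes have different units, so normalizing each axis before measuring distance may be preferable; that choice is left open here.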
Short - Low Short - Medium Short - High
tpch Q3, tpch Q6, tpch Q7, tpch Q8
tpch Q12, tpch Q13, tpch Q16
dex 3 Q4, dex 3 Q8, dex 3 All
dex 4 Q4, dex 4 Q8
dex 5 Q8, dex 5 All
namd single
tpch Q10, tpch Q14 tpch Q15 tpch Q11
Medium - Low Medium - Medium Medium - High
dex 4 All, dex 5 Q4
namd double
freqmine
tpch Q1, tpch Q2 tpch Q9, streamcluster,
mrbayes 10x5000, 10x20000,
20x5000, 50x1000, 50x1000,
50x5000, 100x1000
Long - Low Long - Medium Long - High
phylip protdist, fasta dna, fasta protein hmmer, phylip protpars mrbayes 10x50000, 20x20000,
20x50000, 50x20000, 100x5000,
100x20000, 100x50000
Scheduler: Policies Evaluated
• Random Static Scheduler
An agnostic policy representing a common scenario where applications are mapped to cores according to resource availability
• Oracle Scheduler
A policy that uses a priori knowledge of the overall application bandwidth characteristics to define the best static placement of the different applications across the chip
• Proposed Bandwidth-Aware Scheduler (BAS)
The proposed policy takes into account the different demands of applications at run time in order to satisfy those demands and utilize the system's bandwidth throughout their execution
BAS: Distributing Applications to Cores
• Different distribution scenarios
– Variation of the number of cores per cluster
– Variation of the distribution of applications:
» Per cluster
» Overall
• Only considered:
– Distribution of applications within the same time category
– A reduced exploration space limited to some representative distributions
BAS: Considered Inner-cluster Distributions
• Considered distributions inside one cluster:
Bandwidth
Low      100%   50%   50%   50%   33%   25%     0%   25%    0%     0%
Medium     0%   50%   25%    0%   33%   50%   100%   25%   50%     0%
High       0%    0%   25%   50%   33%   25%     0%   50%   50%   100%
Figure: example distributions labeled Low:Medium:High by bandwidth class (100:00:00, 50:50:00, 25:25:50, 00:00:100)
BAS: Inner-cluster Distributions (cont.)
• Increasing the number of cores per cluster linearly increases the number of extra phases
• Scalability of multi-core processors is highly dependent on the off-chip memory bandwidth
Figure panels: Short applications, Medium applications, Long applications
BAS: Overall Distribution
• Overall distributions considered:
Bandwidth
Low      50%   50%   33%    0%
Medium   50%    0%   33%   50%
High      0%   50%   33%   50%
Note: Distributions with 100% were not considered because they offer no scheduling opportunities. Distributions with 25% were omitted to limit complexity.
• For the random policy, all combinations of inner-cluster distributions were considered, for example (one distribution per cluster):
– 100:00:00, 00:100:00, 00:00:100, 33:33:33 (Overall = 33:33:33)
– 50:50:00, 50:50:00, 00:00:100, 33:33:33 (Overall = 33:33:33)
– 25:25:50, 25:50:25, 50:25:25, 33:33:33 (Overall = 33:33:33)
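The relation between the per-cluster distributions and the overall distribution in these examples can be checked with a short script. It assumes all clusters hold the same number of applications, so the overall share of each class is the plain mean across clusters. The function name is hypothetical, and the slides' "33" stands for one third, so the computed values come out slightly above 33.

```python
def overall_distribution(cluster_dists):
    """Average (low, medium, high) percentages over equally sized clusters."""
    n = len(cluster_dists)
    return tuple(sum(d[i] for d in cluster_dists) / n for i in range(3))


# The three example combinations, one (low, medium, high) tuple per cluster.
examples = [
    [(100, 0, 0), (0, 100, 0), (0, 0, 100), (33, 33, 33)],
    [(50, 50, 0), (50, 50, 0), (0, 0, 100), (33, 33, 33)],
    [(25, 25, 50), (25, 50, 25), (50, 25, 25), (33, 33, 33)],
]

for combination in examples:
    low, medium, high = overall_distribution(combination)
    print(f"{low:.0f}:{medium:.0f}:{high:.0f}")
```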
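BAS algorithm
• Flowchart steps:
– Application execution
– Applications' bandwidth distribution per cluster
– Calculate bandwidth utilization (UBW)
– If MAX(UBW) > 1 (T): run the Adaptive Procedure; otherwise (F): continue execution unchanged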
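The MAX(UBW) > 1 test in this flowchart could look like the sketch below. The slides do not define UBW precisely, so this assumes it is the summed bandwidth demand of a cluster's applications divided by the peak bandwidth of that cluster's memory controller, with one controller per cluster and 6.4 GB/s per controller. The demand values are made up.

```python
CONTROLLER_BW_GBS = 6.4  # DDR3-800 peak; assumed to apply per controller


def utilization(cluster_demands_gbs, capacity_gbs=CONTROLLER_BW_GBS):
    """Return the assumed bandwidth utilization (UBW) of each cluster."""
    return [sum(apps) / capacity_gbs for apps in cluster_demands_gbs]


def needs_adaptation(ubw):
    """True when at least one cluster demands more than its controller can serve."""
    return max(ubw) > 1


# Hypothetical per-application demands (GB/s) for four clusters.
clusters = [[3.0, 2.0, 2.0], [1.0, 0.5], [0.2], [4.0]]
ubw = utilization(clusters)
print(ubw, needs_adaptation(ubw))
```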
BAS algorithm (cont.)
Adaptive Procedure
• Flowchart steps:
– UBWa = MAX(UBWi), UBWb = MIN(UBWi)
– BWBal(UBWa, UBWb)
– Get valid new distributions from the Compatibility Distribution Lookup Table
– More valid ai-bi distributions? If so (T): calculate UBWai, UBWbi and the new BWBal for the new distributions
– Otherwise (F): perform application exchanges from cluster a to b
• Low complexity: O(n), where n is the size of the compatibility lookup table
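One possible reading of the Adaptive Procedure is sketched below; it is not the authors' implementation. Several points are assumptions: BWBal is taken to be the absolute difference between the two utilizations, a distribution is a tuple of application counts per class, each class has a fixed hypothetical bandwidth demand, and the compatibility lookup table maps the current pair of distributions to the valid new pairs.

```python
CLASS_BW_GBS = (0.2, 1.0, 3.0)  # hypothetical demand per (low, medium, high) application
CAPACITY_GBS = 6.4              # assumed peak bandwidth of one controller


def ubw(dist):
    """Assumed utilization of a cluster holding dist = (low, medium, high) counts."""
    return sum(n * bw for n, bw in zip(dist, CLASS_BW_GBS)) / CAPACITY_GBS


def bw_bal(ubw_a, ubw_b):
    """Assumed balance metric: smaller means better balanced."""
    return abs(ubw_a - ubw_b)


def adapt(dists, table):
    """Rebalance the most and least loaded clusters in one pass over the table.

    dists: list of (low, medium, high) counts, one tuple per cluster
    table: dict mapping (dist_a, dist_b) -> list of valid (new_a, new_b) pairs
    """
    utils = [ubw(d) for d in dists]
    a = max(range(len(dists)), key=utils.__getitem__)  # cluster with MAX(UBWi)
    b = min(range(len(dists)), key=utils.__getitem__)  # cluster with MIN(UBWi)
    best, best_bal = None, bw_bal(utils[a], utils[b])
    for new_a, new_b in table.get((dists[a], dists[b]), []):  # O(n) scan
        bal = bw_bal(ubw(new_a), ubw(new_b))
        if bal < best_bal:
            best, best_bal = (new_a, new_b), bal
    result = list(dists)
    if best is not None:
        result[a], result[b] = best  # stands in for the application exchanges
    return result


dists = [(0, 0, 4), (4, 0, 0), (2, 2, 0), (2, 2, 0)]
table = {((0, 0, 4), (4, 0, 0)): [((1, 0, 3), (3, 0, 1)), ((2, 0, 2), (2, 0, 2))]}
print(adapt(dists, table))
```

Every candidate pair keeps the number of applications per cluster unchanged, which matches the main assumption that all cores stay busy.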
BAS: Experimental Results
Short applications
Figure panels: overall distributions 50:50:00, 50:00:50, 00:50:50 and 33:33:33
• The proposed scheduler shows better results
• Only one case with worse results
BAS: Experimental Results
Medium applications
Figure panels: overall distributions 50:50:00, 50:00:50, 00:50:50 and 33:33:33
• The proposed scheduler shows better results
• In only two cases the performance is the same as with the Random policy
BAS: Experimental Results
Long applications
Figure panels: overall distributions 50:50:00, 50:00:50, 00:50:50 and 33:33:33
• The proposed scheduler shows better results in all cases
Application Execution Speedup Using the Bandwidth-Aware Scheduler
Average Speedups
• Short applications: 1.36x
• Medium applications: 1.48x
• Long applications: 1.46x
Figure panels: Short applications, Medium applications, Long applications
Results Analysis
• The majority of the application distributions benefit from the proposed bandwidth-aware scheduler
• The performance of the proposed bandwidth-aware scheduler is very close to that of the Oracle policy
• The performance of the proposed scheduler is stable
• Multi-core scalability can benefit from the use of the proposed bandwidth-aware scheduler
Conclusions
• We have shown:
– The importance of having a bandwidth-aware scheduling policy in clustered multi-core architectures
– There are benefits even for short applications
– Multi-core scaling is highly correlated with the available bandwidth
• We have proposed:
– A simple dynamic bandwidth-aware scheduler
– A set of representative applications with different bandwidth and time requirements
– Scaling multi-cores with the use of the proposed bandwidth-aware scheduler
Future work
• We are performing experimental work on the SCC (Intel donated an SCC system to us)
• Investigation of different but still simple scheduling algorithms, also with more accurate cost functions
• Can we integrate this work into automatic tools at the compiler and OS levels?
– Rephrasing the question: can we expect to have automatic scheduling in this type of system to overcome the memory bandwidth limitation?
Thank You!
http://www.sips.inesc-id.pt
http://www.cs.ucy.ac.cy/carch/casper
Questions?