Karlsruher Institut für Technologie
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association, www.kit.edu
Managing Heterogeneity by Light-weight Abstraction and Self-Guidance
Rainer Buchty
KIT, Institute of Computer Science & Engineering (ITEC), Chair for Computer Architecture and Parallel Processing
Eberhard Karls Univ. Tübingen, Wilhelm Schickard Institute for Computer Science (WSI), Dept. of Computer Engineering

Motivation
Heterogeneity on the rise
- In the past: “everything is software”
- Application requirements and technology aspects shift focus
- Revival of heterogeneous architectures
  - System architectures (host + accelerator)
  - Processor architectures (STI Cell BE)
  - Platform FPGAs (Xilinx Virtex)
Status quo
- Multicore architectures for general-purpose use
- Manycore architectures for data-parallel acceleration
- Reconfigurable architectures for dedicated acceleration
Figure source: Intel Corp.
2/35 09.02.2011 Rainer Buchty – Managing Heterogeneity ITEC/WSI

Motivation (cont’d)
Architectures
- Thread- and task-level parallelism
  - Multiplication of general-purpose cores
  - Same ISA and (typically) speed
  - Examples: IA32, Tilera
- Data parallelism
  - ALU replication, e.g. FP accelerators
  - Host/master ↔ accelerator/slave
  - Host enforces control flow
  - Examples: GPU, ClearSpeed
- Heterogeneous architectures
  - Heterogeneous CPUs (Cell BE)
  - Host + accelerator (GPU, FPGAs)
Figure sources: Intel, ClearSpeed

Motivation (cont’d)
Example: HTX-based reconfigurable accelerator
- FPGA-based universal accelerator
- Flexible use of FPGA resources by partitioning
  - Dynamic configuration of individual “slots”
  - Focus on use within multitasking/multithreading environments
[Figure: block diagram of the accelerator. Labels: HT Core, DMA Unit, Command & Status Bus, Data Bus, Reconfiguration Controller, Request, Coder, Static and Dynamic regions; per slot: Accelerator Slot, Accelerator Wrapper, Accelerator, Accel. Interface, Mon., PRB]
David Kramer, Thorsten Vogel, Rainer Buchty, Fabian Nowak, Wolfgang Karl: A general purpose HyperTransport-based Application Accelerator Framework; Proceedings of the Second International Workshop on HyperTransport Research and Applications (WHTRA 2009), Mannheim, Germany, February 12, 2009

HTX-based reconfigurable accelerator
Accelerator system
- PC-based host running Linux
- FPGA fabric partitioned into 6 slots
  - Individual accelerator modules
  - Central control via Command & Status Bus
  - Abstract interface in hardware
  - Monitoring facility
- HyperTransport bus interface
  - Memory-mapped I/O
  - DMA-capable accelerators

“If you build it, they will come ...” ... and curse you.
Heterogeneous architectures: easy to build but a pain to program
- Hardware-aware approach
  - Leaving everything to the programmer
  - Fine-grain control, but tedious work
  - Worst case: several environments, several languages
- Vendor-specific approaches
  - Dedicated platform-specific environments
  - Easing programming, but transition basically means reimplementation
- Problem-specific approaches
  - Focus on parallelism level
In any case: heavy impact on source code

Programmability
Arising problems
1. Collision of principles
  - Parallelization on abstract level
  - Architecture mapping: hardware-aware, specific
2. Complexity aspects
  - Resource sharing in multitasking environments
  - Phase behavior of applications
  - Impact of workload
3. Compatibility aspects
  - Re-programming means re-approval

Providing required abstraction

Step 1: Achieving abstraction
Uniform application description
- Introduce abstraction layer for decoupling programmers from hardware
- Function-level granularity sufficient
  - Provide individual function implementations
  - Invoke desired implementation (and therefore associated hardware) during run-time
Sounds like dynamic linking
- Dynamic linking included in any modern OS
- However: resolution is performed only once per function (at load time or on first call)
But: any-time re-linking required
- Dynamic selection of suitable implementation

Abstraction layer
Light-weight run-time layer extension
- Function call is a proxy
- Proxy dynamically mapped to desired implementation
- Flexible mapping control enabling external guidance
- No measurable impact on run-time
[Figure: Function Switcher. Labels: Control Daemon, Kernel, Interface, Function Pointer long (*fct)(int a, ...), implementations long libfct_a(int a, ...) and long libfct_b(int a, ...)]
Rainer Buchty, Mario Kicherer, David Kramer, Wolfgang Karl: An embrace-and-extend approach to managing the complexity of future heterogeneous systems; Proceedings of SAMOS IX, Springer, Series Lecture Notes in Computer Science (LNCS) Volume 5657, pp. 226-235, Samos, Greece, July 2009
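The proxy idea on this slide can be sketched in a few lines of C. This is an illustration only, not the authors' implementation: `switch_fct` and `last_impl` are hypothetical names, the variadic signature from the slide is simplified to a single `int` argument, and the switch is triggered from inside the program rather than by a control daemon through the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Records which implementation ran last; only here to make the switch visible. */
int last_impl = 0;

/* Two interchangeable implementations of the same logical function.
 * Names follow the slide; the bodies are placeholders. */
long libfct_a(int a) { last_impl = 'a'; return 2L * a; }
long libfct_b(int a) { last_impl = 'b'; return (long)a + a; }

/* The proxy: application code only ever calls through this pointer. */
long (*fct)(int a) = libfct_a;

/* Hypothetical switcher entry point; in the talk this role is played by
 * the control daemon and the kernel, not by the application itself. */
void switch_fct(long (*impl)(int a))
{
    assert(impl != NULL);
    fct = impl;   /* later calls through fct reach the new implementation */
}
```

Calling `fct(21)` before and after `switch_fct(libfct_b)` yields the same result through a different implementation, and the call site itself never changes.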

Abstraction layer (cont’d)
Expansion of Task-State Segment (TSS)
- TSS: OS’s task management structure
  - TSS handled in software, hence changes possible
  - Slight changes to kernel source required
- Keep management list with thread
  - Function mappings individual per thread
  - “Unlimited” implementation alternatives possible
  - Registering of alternatives required

Abstraction layer (cont’d)
[Figure: data structures of the abstraction layer]
Proxy Function: long (*fct)(int a, ...)
Implementations: long libfct_a(int a, ...), long libfct_b(int a, ...)
dls_struct
  dls_fcts_ptr: dls_fct_type*
  num_fcts: int
  next: dls_struct*
dls_struct_ptr
  this: dls_struct*
  next: dls_struct_ptr*
Other labels: Control Daemon, Kernel, ProcFS, dls.h, dls_set_fct()
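A user-space model of the structures named in the figure might look as follows. The field names come from the slide; everything else is an assumption made for illustration: the definition of `dls_fct_type`, the meaning given to `next`, the function `dls_set_fct_sketch` (a stand-in, since the slide gives no signature for `dls_set_fct()`), and the placeholder implementations `impl_cpu` and `impl_acc`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed signature: the slide names dls_fct_type but does not define it. */
typedef long (*dls_fct_type)(int a);

/* Field names as on the slide. */
struct dls_struct {
    dls_fct_type *dls_fcts_ptr;   /* registered implementation alternatives */
    int num_fcts;                 /* number of entries in dls_fcts_ptr */
    struct dls_struct *next;      /* next list element (assumed meaning) */
};

/* Hypothetical stand-in for dls_set_fct(): points the proxy at alternative
 * number idx. Returns 0 on success, -1 if idx is out of range. */
int dls_set_fct_sketch(const struct dls_struct *d, dls_fct_type *proxy, int idx)
{
    assert(d != NULL && proxy != NULL);
    if (idx < 0 || idx >= d->num_fcts)
        return -1;
    *proxy = d->dls_fcts_ptr[idx];
    return 0;
}

/* Placeholder alternatives for the usage example. */
long impl_cpu(int a) { return 2L * a; }
long impl_acc(int a) { return (long)a + a; }
```

Registering alternatives then amounts to filling the array behind `dls_fcts_ptr`. An out-of-range index leaves the proxy untouched, so a faulty request cannot redirect calls to an invalid address.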

Software stack
Flexibility and compatibility
- Embrace OS structure
- Spans 4 dedicated system layers
  - Application and library reside in user address space
  - Control daemon decoupled in own address space
  - Kernel address space (hardware access)
  - Hardware
- Interfacing between layers
  - Inter-process communication (IPC) using procfs between user and daemon address space
  - Hardware device drivers between kernel and hardware
- Basic framework open for later extension

Software stack (cont’d)
[Figure: software stack. Layers: User address space, Daemon address space, Kernel address space, Hardware. Labels: Application (twice), Library, Control Daemon, IPC, Kernel Driver, Device (twice), AMS (three times), Accel. (twice), Main Memory, Mem.]

Dealing with complexity

Complexity issues
Hardware awareness? Application awareness!
- Hardware-aware mapping not enough
  - Tasks competing for resources
  - Applications exhibit phase behavior
  - Different workloads ↔ different “best” implementations
- Programmer unable to oversee all eventualities
- But even if...
  - Most programming time is spent on implementation selection, not implementation itself
  - Detection of workload, congestion, phase ...
How to deal with that?

Complexity issues (cont’d)
Overcoming complexity by Self-X
- Self-awareness: system-state analysis and evaluation
- Self-adaptation and self-optimization
- Self-protection and self-healing
Introduce bio-inspired flexibility
- “Sensors and actuators”
- Communication and control
[Figure: control loop. Labels: Monitoring, Analysis, Control, Reorganization, Configuration, Function, Objective]
Rainer Buchty, Wolfgang Karl: Design Aspects for Self-Organizing Heterogeneous Multi-Core Architectures; it - Information Technology Journal 5/08 “Computer Architecture – New Developments”, Oldenbourg Wissenschaftsverlag, 2008

Cost-aware function migration
Harnessing the power of Self-X
1. More than workload balancing required
  - Selection of most suitable implementation
  - Knowledge about run-time required
2. Run-time alone is an insufficient criterion
  - Run-time differs with workload of task
  - Different workloads might require different implementations
3. Off-line training lacks dynamics
  - Application phases in relation to workload
  - Competition in multitasking systems
  - Dynamic resource availability

Step 2: Achieving cost-awareness
Measuring execution time
- Unobtrusive method required
  - No instrumentation on source-code level
- Could we move it into the abstraction layer?
  - Proxy points to function implementation
  - Why not call timer functions before and after as well?
  - Dynamic instrumentation on run-time level
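The question raised on this slide, calling timer functions around the implementation, can be illustrated with a small wrapper. This is a sketch under assumptions: `timed_call`, `impl_fn` and `sum_to` are hypothetical names, the signature is fixed rather than variadic, and POSIX `clock_gettime` is used as one possible timer.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef long (*impl_fn)(int);

/* Placeholder workload standing in for a registered implementation. */
long sum_to(int n)
{
    long s = 0;
    for (int i = 1; i <= n; ++i)
        s += i;
    return s;
}

/* Calls impl(arg) and stores the elapsed time in nanoseconds in *elapsed_ns.
 * The application's own source needs no timing code. */
long timed_call(impl_fn impl, int arg, long long *elapsed_ns)
{
    struct timespec t0, t1;
    assert(impl != NULL && elapsed_ns != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* "pre" timer call */
    long result = impl(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* "post" timer call */
    *elapsed_ns = (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                + (t1.tv_nsec - t0.tv_nsec);
    return result;
}
```

Because the measurement lives in the wrapper, it can be attached to or detached from any implementation at run time by changing what the proxy points to.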

Cost-aware function migration (cont’d)
Dynamic instrumentation
- Light-weight expansion of the abstraction layer
- Proxy function resolves to function list
- Call any number of functions before and after selected implementation
- However: requires caller stack-frame duplication
- Instrumentation costs hidden by pre/post functions
[Figure: simple proxy resolving versus proxy resolving using a proxy list. Labels: Application, f(), Proxy list, Pre Functions, Post Functions]
Mario Kicherer, Fabian Nowak, Rainer Buchty, Wolfgang Karl: Extending a Light-weight Runtime System by Dynamic Instrumentation for Performance Evaluation; ARCS 2010 Workshop Proceedings (PARMA 2010), pages 279-284, VDE, ISBN 978-3-8007-3222-7, Hannover, Germany, February 2010
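A proxy that resolves to a list of pre functions, the selected implementation, and a list of post functions could be modelled as below. All names (`proxy_entry`, `proxy_dispatch`, `hook_fn`, the counting hooks) are hypothetical. The sketch also sidesteps the stack-frame duplication mentioned on the slide by using a fixed one-argument signature, so it shows only the call ordering, not the actual mechanism.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*hook_fn)(void *ctx);
typedef long (*impl_fn)(int);

/* One proxy entry: hooks to run before and after the implementation. */
struct proxy_entry {
    hook_fn *pre;   size_t n_pre;
    impl_fn  impl;
    hook_fn *post;  size_t n_post;
    void    *ctx;   /* passed to every hook */
};

/* Runs all pre hooks, then the implementation, then all post hooks. */
long proxy_dispatch(const struct proxy_entry *e, int arg)
{
    assert(e != NULL && e->impl != NULL);
    for (size_t i = 0; i < e->n_pre; ++i)
        e->pre[i](e->ctx);
    long result = e->impl(arg);
    for (size_t i = 0; i < e->n_post; ++i)
        e->post[i](e->ctx);
    return result;
}

/* Example hooks that only count how often they were called. */
struct counters { int pre_calls; int post_calls; };
void count_pre(void *ctx)  { ((struct counters *)ctx)->pre_calls++; }
void count_post(void *ctx) { ((struct counters *)ctx)->post_calls++; }

/* Placeholder implementation. */
long twice(int a) { return 2L * a; }
```

With two pre hooks and one post hook registered, a single `proxy_dispatch` call runs all three around `twice` and returns its result unchanged.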

Cost-aware function migration (cont’d)
[Figure: Stack-frame manipulation]

Step 3: Cost-awareness and Evaluation
Evaluation and guided execution
- Two-step process
  1. Online creation of initial classification
  2. Guided execution
- Learning retriggered upon changes / deviations
[Figure: Example: time consumption of square-matrix multiplication related to dimensions and acceleration method]

Cost-aware function migration (cont’d)
Phase 1: Online learning
- Rate only execution, not start-up time
  - First two executions are not measured
  - Neglect influence of library loading and linking, CUDA kernel invocation, etc.
- Create initial classification
  - Alternate use of implementations
  - 5 runs per implementation
  - Determine cost value from workload size and execution time
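The bookkeeping for this learning phase can be sketched as follows. The warm-up count (2) and runs per implementation (5) are taken from the slide. The rest is assumed: the slide does not say how the cost value is computed, so time per workload element is used here as one plausible choice, warm-up runs are counted per implementation, and implementations are alternated round-robin. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { WARMUP_RUNS = 2, RUNS_PER_IMPL = 5 };   /* values from the slide */

struct impl_stats {
    int seen;          /* executions observed so far, including warm-up */
    int measured;      /* executions that contributed to the cost value */
    double cost_sum;   /* sum of time per workload element (assumed metric) */
};

/* Records one execution. Returns 1 if it was counted, 0 if it was skipped
 * as warm-up or because enough runs have already been measured. */
int record_run(struct impl_stats *s, double seconds, size_t workload)
{
    assert(s != NULL && workload > 0);
    s->seen++;
    if (s->seen <= WARMUP_RUNS)
        return 0;
    if (s->measured >= RUNS_PER_IMPL)
        return 0;
    s->cost_sum += seconds / (double)workload;
    s->measured++;
    return 1;
}

/* Mean cost over the measured runs. */
double cost_value(const struct impl_stats *s)
{
    assert(s != NULL && s->measured > 0);
    return s->cost_sum / s->measured;
}

/* Alternates between implementations while learning (round robin). */
size_t next_impl(size_t run_index, size_t n_impls)
{
    assert(n_impls > 0);
    return run_index % n_impls;
}
```

Skipping the first runs keeps one-off start-up effects such as library loading out of the cost value, which is the motivation the slide gives for not measuring them.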

Cost-aware function migration (cont’d)
[Figure: Classification process]

Cost-aware function migration (cont’d)
Phase 2: Guided execution
- Select implementation based on workload size and associated cost value
- Measure execution time
- Redo classification if too much deviation from expectation
Mario Kicherer, Rainer Buchty, Wolfgang Karl: Cost-aware Function Migration in Heterogeneous Systems; HiPEAC 2011, Proceedings of the 2011 International Conference on High Performance Embedded Architectures & Compilers, Heraklion, Crete, Greece, January 2011
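Guided execution as described above might be sketched like this. The representation of a classification as workload-size classes with an upper bound, the relative tolerance, and all identifiers are assumptions for illustration. The slide only states that selection depends on workload size and cost value and that too much deviation triggers re-classification.

```c
#include <assert.h>
#include <stddef.h>

/* One class of the (assumed) classification: workloads up to max_workload
 * are served by implementation best_impl at the given cost per element. */
struct size_class {
    size_t max_workload;
    int    best_impl;
    double cost;
};

/* Picks the first class whose bound covers the workload; larger workloads
 * fall back to the last class. Classes must be sorted by max_workload. */
const struct size_class *select_class(const struct size_class *cls,
                                      size_t n, size_t workload)
{
    assert(cls != NULL && n > 0);
    for (size_t i = 0; i < n; ++i)
        if (workload <= cls[i].max_workload)
            return &cls[i];
    return &cls[n - 1];
}

/* Returns 1 if the measured time differs from the expected time by more
 * than the given fraction of the expectation, 0 otherwise. */
int needs_reclassification(const struct size_class *c, size_t workload,
                           double measured_s, double tolerance)
{
    assert(c != NULL);
    double expected = c->cost * (double)workload;
    double diff = measured_s - expected;
    if (diff < 0)
        diff = -diff;
    return diff > tolerance * expected;
}
```

A run that takes several times longer than expected, for instance because another task occupies the accelerator, makes `needs_reclassification` return 1, which would send the system back to the learning phase.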

Cost-aware function migration (cont’d)
[Figure: Adaptation of classes during application runtime in reaction to resource contention]

Delivering guidance information

Compatibility issues
So far we achieved...
- (Almost) compatibility on source-code level
  - Only registration of functions required
  - No code overloading with implementation selection
  - Approach orthogonal to parallel programming models
- Compatibility on execution level
  - Transparent changes to runtime system
  - Legacy software unharmed
But what about run-time compatibility?
- Classification takes time
- Application might break due to given constraints!

Compatibility issues (cont’d)
Constraint-based guidance
- Annotate application requirements
  - Throughput, execution speed, accuracy, ...
- Deliver pre-classification of implementations
  - Speed up/avoid initial classification
  - Re-classification eventually done later
- Source-code attribution using pragmas
- Binary-level attribution using additional sections or resource files
- Compatibility achieved on both levels
Fabian Nowak, Rainer Buchty: Providing Guidance Information for Application Mapping on Heterogeneous Parallel Systems; 22nd PARS Workshop, Parsberg, June 2009
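One way to attach attributes at binary level is to place them in an additional section of the object file. The sketch below uses the GCC/Clang `section` variable attribute on ELF targets for this. It is not claimed to be the mechanism or format used by the authors: the section name `.guidance`, the `key=value;` record format, the keys, and the helper `guidance_lookup` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Place the record in its own section when the toolchain supports it. */
#if defined(__GNUC__) && defined(__ELF__)
#  define GUIDANCE_ATTR __attribute__((section(".guidance"), used))
#else
#  define GUIDANCE_ATTR /* no custom section on this toolchain */
#endif

/* Hypothetical attribute record for one function. */
const char matmul_guidance[] GUIDANCE_ATTR =
    "fn=matmul;constraint=throughput;prefer=gpu";

/* Copies the value stored for key into out (truncating if necessary).
 * Returns 1 if the key was found, 0 otherwise. */
int guidance_lookup(const char *attrs, const char *key,
                    char *out, size_t outlen)
{
    size_t klen = strlen(key);
    assert(attrs != NULL && out != NULL && outlen > 0);
    for (const char *p = attrs; *p != '\0'; ) {
        const char *end = strchr(p, ';');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = len - klen - 1;
            if (vlen >= outlen)
                vlen = outlen - 1;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p += len;          /* skip this entry */
        if (*p == ';')
            ++p;           /* and its separator */
    }
    return 0;
}
```

Since the record is ordinary data in a separate section, existing compilers, linkers and debuggers continue to work on the binary, which matches the compatibility goal stated on the slide.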

Providing guidance information
1. Extract attributes from source code
2. Generate attribute file
3. Bind attributes into binary format
Compatibility with existing tool chain

Evaluate guidance information
Run-time evaluation
- Requirements for function calls
- Implementation performance
- Available HW resources
Fabian Nowak, Mario Kicherer, Rainer Buchty, Wolfgang Karl: Delivering Guidance Information in Heterogeneous Systems, PARS 2010, Hannover, February 2010

Summary

Summary of Features
Benefits
- Maximum compatibility
  - Source-code level (registering, guidance information)
  - Binary level (guidance information)
  - Run-time (performance)
- Interoperability with existing approaches
  - Programming models (HW-aware, OpenMP, CUDA)
  - Programming tools (gcc, gdb, ...)
- Modest expansion of existing services
→ easy upgrade path from conventional to self-guiding systems

Managing Heterogeneity...
[Figure: layered system overview. Layers: Code Layer, Run-time Layer, Hardware Abstraction Layer, Heterogeneous Hardware Layer. Domains: Compiler Domain, Run-time Domain, Hardware Domain. Labels: Application (Code: LOAD r0,arg; LOAD r1,arg; call fn(); Attributes), Univ. Binary, SW Library, HW Library, Library, Control System, System/HW Monitor(s), Processing, Predef. HW. Implementations, each with Code and Attributes: Impl. #1 (UDI r0,r0,r1), Impl. #2 (PUSH r0,r1; POP r0; CALL asf_sp), Impl. #3 (PUSH r0,r1; POP r0; CALL asf_dp)]

Function resolution
Basic overhead
          min      avg.     max      Ovhd.
native    21.26s   21.60s   21.91s   –
GLS       21.26s   21.60s   21.91s   0
DLS-DL    21.08s   21.54s   21.88s   ~0
DLS-SL    21.06s   21.57s   21.94s   ~0

Function resolution (cont’d)
Worst case (no fct. payload, external trigger)
          min      avg.     max      Ovhd.
native    21.26s   21.60s   21.91s   –
GLS       60.86s   63.22s   65.58s   2.93
DLS-DL    66.88s   69.60s   72.41s   3.22
DLS-SL    35.20s   37.20s   39.40s   1.72

Worst case (no fct. payload, internal trigger)
          min      avg.     max      Ovhd.
native    21.26s   21.60s   21.91s   –
GLS       n/a      n/a      n/a      n/a
DLS-DL    47.33s   48.41s   49.35s   2.24
DLS-SL    21.03s   21.85s   22.66s   1.01
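The slides do not define the Ovhd. column. In the two worst-case tables it appears to be the ratio of a variant's average run time to the native average. That reading is an assumption, but it reproduces the listed values:

```latex
\[
\text{Ovhd.} = \frac{\bar{t}_{\text{variant}}}{\bar{t}_{\text{native}}}
\]
\[
\frac{63.22}{21.60} \approx 2.93, \quad
\frac{69.60}{21.60} \approx 3.22, \quad
\frac{37.20}{21.60} \approx 1.72, \quad
\frac{48.41}{21.60} \approx 2.24, \quad
\frac{21.85}{21.60} \approx 1.01
\]
```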

Function resolution (cont’d)
TSS overhead (OpenMP baseline)
          min      avg.     max      Ovhd.
w/o       24.09s   24.88s   25.84s   –
DLS-SL    25.88s   26.53s   28.26s   1.06

TSS overhead (OpenMP stress test)
          min      avg.     max      Ovhd.
w/o       24.09s   24.88s   25.84s   –
DLS-SL    36.32s   38.36s   41.87s   1.54

Function resolution (cont’d)
Thread-related overhead (varying number of threads)
#Threads     1        5        10       20
DLS-SL       35.93s   39.38s   38.36s   38.29s

Thread-related overhead (varying number of functions)
#Functions   2        4        8        16
DLS-SL       37.28s   38.10s   37.46s   37.76s

Instrumentation
Cost of instrumentation
Measurement                          Time for 10^6 iterations
Simple fct. call                     6 ns
Basic instrumentation (no payload)   57 ns
Dyninst v6.1 fct. start              132 ns
Dyninst v6.1 fct. start/end          243 ns
Instrumentation w/ time measuring    2137 ns

Cost of stack-frame manipulation (32-bit args.)
# of args.   1    4    8    16   24   32   48    64    128
time (ns)    57   57   57   65   73   86   104   120   188

Self-Guidance
Sorting application
             Kernel    Rel. Time   App.     Rel. Time
DLS-RTopt    445µs     1.00        72.4s    1.00
CPU (ser.)   563µs     1.27        82.8s    1.14
CPU (par.)   801µs     1.80        107.6s   1.49
GPU          523µs     1.18        79.2s    1.09

Matrix multiplication
             Kernel    Rel. Time   App.     Rel. Time
DLS-RTopt    207µs     1.00        154s     1.00
CPU (ser.)   1184µs    5.72        247s     1.60
CPU (par.)   259µs     1.25        159s     1.03
GPU          285µs     1.38        158s     1.03

Self-Guidance (cont’d)
Mersenne Twister
             Kernel    Rel. Time   App.     Rel. Time
DLS-RTopt    282µs     1.00        0.291s   1.00
CPU (ser.)   443µs     1.57        0.443s   1.52
GPU          302µs     1.07        0.312s   1.07

Performance of Self-Guidance (cont’d)
Worst-case estimation
             Kernel    Rel. Time   App.     Rel. Time
DLS-RTopt    1340µs    ≈0.99       25.4s    ≈0.99
CPU (ser.)   1330µs    1.00        25.3s    1.00
GPU          2900µs    2.18        40.0s    1.58

Best-case estimation
             Kernel    Rel. Time   App.     Rel. Time
DLS-RTopt    174µs     1.00        416ms    1.00
CPU (ser.)   395µs     2.27        571ms    1.37
GPU          388µs     2.22        628ms    1.51