programming(for(mobile(gadgets( - the 5th …prism.sejong.ac.kr/download/p04_calin_cascaval.pdf ·...
TRANSCRIPT
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Programming(for(Mobile(Gadgets(Calin(Cascaval(
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Mobile(Compu.ng(Landscape(
2
users:(109((
devices:(1011( developers:(106(
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Anatomy(of(a(mobile(SoC((MSM8960)(
3
Source:(hXp://www.qualcomm.com/documents/files/snapdragon:s4:processors:system:on:chip:solu.ons:for:a:new:mobile:age:white:paper.pdf(
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Parallel(Compu.ng(:(Use(Cases(in(Mobile(
• Computa.onal(Photography(– HDR,%Camera%arrays%
• Mul.media(Processing(– Photo%and%video%editors%
• Computer(Vision(Libraries(&(Applica.ons(– OpenCV,%FastCV%%
• Gaming(&(Graphics(– Unity,%Bullet,%Box2D%
• Web(Applica.ons(&(Browser(Technologies(– Qualcomm%Zoomm,%Mozilla%Servo,%Intel%Rivertrail%
• User(Interfaces(– Samsung%Touchwiz,%HTC%Sense,%Windows%Metro%
• Security(
4
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Programming(for(mobile(compu.ng(
5
Developers(• Think(in(terms(of(services(and(“features”(• Portable(apps,(performance(not(the(main(concern(• Use(JavaScript(frameworks(
DSP( GPU(
AXI
T SE / RAS
VBIF
CP
CPU
VSC
HLSQ
VPC
RBBM
MARB
GMEM GMEM GMEM GMEM
OXILI SS
memory
PM4 PacketBuffers
VS Buffer
Index Buffers
Vertex Objects
Texture Objects
Frame Buffer
PC
VFD
SP#3
SP#2
SP#1
SP#0
UCHE
arb
OXILI TOP AHB Slave I/F
SS (shader system)
FF (fix-function)
External - other
arb
arb
arb
RB RB RB RB
MARB
MARB
MARB
2xTPL1
2xTPL1
2xTPL1
2xTPL1
ARM926EJ-S
16k I$ 8k D$
AHB
DDR AXI
VBIF (DDR)
ahb2axi ahb2crif
ARM Peripheral
VBIF (OCMEM)
OCMEM AXI
SP
SG
AP
VS
P-V
PP
I/F
(FIF
O)
Sha
red
RA
M B
anks
4K
B x
8
DM
A
VPE MEC
L1 $ / L2 $ Ctrl
TREPPEOCMEM
Data Mover & Controller
FPB (AHB)
VPP
VSP
Shared RAM pool(FIFO / Cache / Tile Buffer / Internal RAM)
Hardware(Accel.(
CPU(s)(
CPU(( CPU((
(VeNum(
CPU((
L2(
(VeNum((VeNum(
(VeNum(
CPU((
High%Power%Efficiency
Low%Power%Efficiency
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Pain(Points(for(Parallel(Compu.ng(
• Portability(and(performance(portability(across(plagorms(
• What(to(parallelize?((profiling)(
• How(to(parallelize?((programming(models)(
• Debugging(&(Tes.ng((correctness)(
• Achieving(maximum(power(&(performance(efficiency((tuning)(
6
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Issues(with(current(models(
7
(Concurrency(
Model(
(Heterogeneity(
Support(
(Power(
Efficiency(
1( 2( 3(
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Concurrency(model(issues(
• Most(models(designed(with(data(parallelism(in(mind(– CUDA,(RenderScript,(OpenMP,(C++(AMP,(ABB,(…(
(
• Most(task:parallel(models(are(not(expressive(enough(for(complex(sooware(– Apple’s(GCD(uses(queues(to(set(dependences(between(tasks(– Intel’s(TBB(task(model(has(cumbersome(support(for(complex(dependences(
between(tasks(
– OpenMP(focuses(on(fork(and(join(parallelism((
• Irregular(applica.ons,(such(as(browser(are(not(loop(based(parallelism(– Need(to(combine(different(models(and(concurrency(constructs:(task(and(data(
parallelism,(call:backs(from(events,(blocking(and(network(.meouts(
– Unpredictable(amounts(of(computa.on(
– Fast(response(and(unexpected(events(
8
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Heterogeneity(issues(
• Heterogeneity(exploita.on(is(hard(– Requires(explicit(programming(and(detailed(knowledge(of(device,(resul.ng(in(non:portable(code(
• OpenCL(is(a(first(aXempt(to(change(this(for(programmable(GPUs(
– Many(poten.ally(programmable(devices(are(hidden(behind(fixed(func.on(drivers(
• FastCV(is(an(example(that(aXempts(to(break(that(barrier((
• Only(OpenCL(was(designed(from(the(ground(up(to(be(heterogeneous(– HSA(::(a(heterogeneous(system(architecture(defini.on((
– Renderscript?((
• Code(running(in(these(models(targets(lowest(common(denominator(– Restric.ons(for(specialized(cores(percolate(to(general(purpose(cores((
– GPUs(are(great(for(data(parallel(computa.on,(terrible(for(task(parallelism(
– Impose(limited(languages,(ooen(derived(from(ISO(C(99(
– Programmers(end(up(using(two(concurrency(models(in(their(apps.(i.e:(OpenCL(and(TBB ((
– No(common(interface(for(tasks(running(in(main(CPU(and(accelerators(
9
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Power(efficiency(issues(
• Mobile(hardware(provides(aggressive(power(management(support:(– Clock(ga.ng(– Voltage(and(frequency(scaling(– Energy(efficient(instruc.ons(
• Narrow(ISAs,(e.g.,(Thumb(
• Vector(instruc.ons(– Low(power(states(– Specialized,(power(efficient(units:(DSP,(H.264(and(JPEG(decoders(
• OSes(provide(some(degree(of(power(management(– Scheduling(– Managing(power(as(a(resource:(Cinder,(etc.((
• No(current(model(provides(abstrac.ons(to(think(about(power(– Almost(no(programmer(ever(thinks(about(power(
10
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Heterogeneous(System(Architecture((HSA)(
• HSA(is(a(system(specifica.on(that(envisions(to:(– Make(the(GPU(easily(accessible(
• GPU(as(a(first(class(computa.on(engine(
– Make(GPU(compute(offload(more(efficient(
• Direct(offload,(without(going(through(drivers(
• Eliminate(memory(copies(by(suppor.ng(shared(virtual(memory(
– Provide(a(virtual(ISA,(HSAIL,(to(allow(compa.bility(across(devices.(
• HSAIL(relies(on(a(Finalizer(to(generate(code(for(the(GPU(– Defines(a(memory(model(across(CPU(and(GPU,(compa.ble(with(C++11(
11
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
OpenCL(
12
PlaJorm%Model% Memory%Model%
ExecuLon%Model%A(host(program(that(orchestrates(the(execu.on(of(kernels(execu.ng(on(devices(
– Kernels(are(enqueued(and(communicate(through(events(or(GPU(global(memory(
– The(host(is(responsible(for(managing(memory(
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
What(does(the(programmer(need(to(think(about?(
• Abstrac.ons(that(express(concurrency(and(hide(specific(hardware(implementa.ons:(– Computa.on:(Tasks(
– Memory(locality:(Sharing,(ownership(and(placement((
– Synchroniza.on:(Concurrent(access(to(memory,(dependencies(between(tasks,(termina.on(
– Communica.on:(Values(flowing(between(tasks,(copies(to(memory(closer(to(computa.on(units(
• Parallel(programming(paXerns(
• Concurrent,(distributed(data(structures(
13
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Mul.core(vs.(Heterogeneous(Parallelism(
MulLcore% Heterogeneous%
Work(decomposi.on( flexible:(fine(grain(tasks( dependent(on(latency,(currently(coarse(grain(
Data(decomposi.on( flexible:(shared(memory( requires(data(copy(to(device((
Synchroniza.on( locks,(atomics( atomics(within(the(device(events(across(devices(
Communica.on( coherent(caches( through(memory(or(message(passing(
Coding( high(level(languages(+(||extensions(and(libraries(
restricted(languages(or(assembly.(No(||(extensions(beside(GPU((OpenCL)((
Scheduling( uniform(capabili.es,(availability(based(
complex(op.miza.on(space(because(of(different(capabili.es,(func.on(constraints(and(availability(
14
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
What(Is(MARE?(
• MARE(is(a(programming(model(and(a(run.me(system(that(provide(simple(yet(powerful(abstrac.ons(for(parallel,(heterogeneous,(and(power:efficient(sooware((
• MARE’s(abstrac.ons(reduce(the(effort(to(scale(concurrency(to(whatever(hardware(is(available(
• MARE(key(benefits(extend(to(heterogeneous(execu.on((
15
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
How(is(MARE(solving(these(issues?(
• Task(parallel:(parallel(computa.on(is(a(par.ally:ordered(set(of(sequen.al(tasks((task(graph)((
• Par..oned(global(address(space:(tasks(can(access(any(memory(on(SoC(((
• Task(proper.es:(provide(mapping(and(scheduling(guidance(to(the(run.me(system(
• Support(for(heterogeneous(execu.on((
• Explicit(support(for(parallel(programming(paXerns,(design(geared(toward(ease(of(programming(
(16
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
MARE(programming(
17
MARE(App(
Algorithm( Par..on(Work/Data(
Setup(task(dependences(
Parallel(Algorithm(Skeleton(
Implement(tasks(
Map(resources(
Device(
Device(
Task(
Task(
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
MARE(Workflow(
• Programmer:((– Par..ons(applica.on(in(small(units(of(work((tasks)(
– Sets(up(dependencies(between(tasks((– Asks(MARE(executes(them(on(as(many(cores(as(possible((launch)(((
• The(goal(of(the(MARE(API(is(to(simplify(well(established(parallel(programming(paXerns(
(Advantages:(• Simpler(than(managing(threads(manually(• Task(scheduler(can(make(more(informed(decisions(using(
op.onal(annota.ons(supplied(by(the(programmer(or(inferred(by(the(run.me(
18
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
BarnesHut(–(A(Galois(benchmark(
• n:body(simula.on(
• Galois:(a(run.me(system(for(graph:based(applica.ons(
0.000(
0.500(
1.000(
1.500(
2.000(
2.500(
3.000(
3.500(
4.000(
1( 2( 4( 6(
ExecuL
on%Lme%(s)%
Cores%
MARE(
Pthreads(
Galois(
Machine:(6:core(x86(
20
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
MARE(users(
• System(developers(– Use(MARE(APIs(to(develop(low:level(libraries(
– ZOOMM(engine,(FastCV,(gaming(engines(…((
• Advanced(app(developers(– Use(MARE(APIs(and(domain:specific(libraries(
to(develop(high:performance(applica.ons(
– Games,(video(and(photo(editors…((
• Regular(and(Web(app(developers(– Use(MARE(APIs(through(domain:specific(
libraries((
– Parallelism(is(hidden(from(the(programmer(
– Banking(apps,(restaurant(reviews…((
21
Na.ve(Apps(
Device(API(
JS(frameworks(
ZOOMM(Engine(
Web((Apps(
MARE(
Parallel(Libraries(
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Development(tools(
Manually%OpLmized%(Ninja)%
Programming%Models%&%AbstracLons%
(Expert%Developers)%
OpLmized%Domain%Specific%Libraries%
(Developers%needing%performance%but%lacking%parallelizaLon%experLse)%
Auto%VectorizaLon%&%Auto%ParallelizaLon%Tools%
(Minimal%Developer%effort)% Num
ber(of(expert(de
velope
rs(
Performance(Gains(
22
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(
Conclusions(
• Are(we(reinven.ng(the(wheel,(and(just(calling(it(mobile(and(embedded?(
• The(issues(facing(the(mobile(industry(are(similar(to(parallel(and(heterogeneous(programming(for(large(scale(machines(and(desktops(
• However:(– Mobile(devices(are(mainly(“single(task”(
– Mobile(devices(are(power(and(energy(constraint(
– Mobile(SoCs(already(have(a(variety(of(specialized(cores(
• Therefore:(– Heterogeneous(programming(must(contribute(to(improving(power(efficiency(
23
Qualcomm(Confiden.al(and(Proprietary(—(Restricted(Distribu.on(:(DO(NOT(COPY(MAY(CONTAIN(U.S.(AND(INTERNATIONAL(EXPORT(CONTROLLED(INFORMATION(—(24