architecture-aware automatic computation offload for native applications · 2015. 12. 17. ·...


Page 1: Architecture-aware Automatic Computation Offload for Native Applications · 2015. 12. 17. · getAITurn 26 s 3 12 MB getPlayerTurn 1.5 s 3 10 MB Candidate Ideal Gain Comm. runGame

1/23

Architecture-aware Automatic Computation Offload for Native Applications

Gwangmu Lee, Hyunjoon Park, Seonyeong Heo, Kyung-Ah Chang*, Hyogun Lee*, and Hanjun Kim.

POSTECH / Samsung Electronics*


2/23

Mobile devices are slow

[Chart: Chess Movement Computation, Mobile vs. Desktop. 5~6x Performance Gap]

void runGame () {
  while (!gameover) {
    Move mv;

    /* User Inputs */
    mv = getPlayerTurn ();
    pieces[mv.tar] = mv.to;

    /* Heavy Computation */
    mv = getAITurn ();
    pieces[mv.tar] = mv.to;
  }
}


3/23

Offloading can boost your mobile device!

Mobile Device -> High-performance Server: Computation ("Which piece to move?")

High-performance Server -> Mobile Device: Result ("Move the knight to A6!")


4/23

Most offloading systems are based on VMs.

CloneCloud (EuroSys'11), MAUI (MobiSys'10), CMCloud (CCGrid'14)

[Diagram: on the Mobile side (ARM, Android) the Application runs on a VM; on the Server side (x86, Linux) the (Offloaded) application also runs on a VM.]


5/23

Native Code Execution Time of Top 20 Android Applications

[Bar chart, x-axis 0% to 100%, one bar per application:]

AdAway

Orbot

Firefox

VLC Player

Open Camera

osmAnd

Syncthing

AFWall+

2048

K-9 Mail

PDF Reader

ownCloud

DAVdroid

Barcode Scanner

SatStat

Cool Reader

OS Monitor

Orweb

PPSSPP

Adblock Plus

Annotated categories: Web Browser, Video Player, Navigation, E-book Reader, Console Emulator, PDF Reader


6/23

Challenges in offloading native workloads

[Diagram: the Mobile side (Android, ARM) runs the ARM Application; the Server side (Linux, x86) is handed the same ARM Application.]

Different Architecture
Distinct Memory
Different Memory Layouts


7/23

Challenges in offloading native workloads

[Diagram: Mobile (Android, ARM) and Server (Linux, x86) each hold the ARM Application with its own Stack, Heap, and Text.]

Different Architecture
Distinct Memory
Different Memory Layouts


8/23

Challenges in offloading native workloads

[Diagram: Mobile (Android, ARM) and Server (Linux, x86) each hold the ARM Application with ptr = 0x1234 on the Stack; the same int and double fields are laid out differently on the two sides.]

Different Architecture
Distinct Memory
Different Memory Layouts


9/23

Our Strategy 1: Compile Both Binaries!

[Diagram: Mobile (Android, ARM) runs the ARM Application; Server (Linux, x86) runs the x86 Offloaded binary.]

Different Architecture
Distinct Memory
Different Memory Layouts


10/23

Our Strategy 2: Unified Virtual Address

[Diagram: the ARM Application on Mobile (Android, ARM) and the x86 Offloaded binary on Server (Linux, x86) share one arrangement of Stack, Heap, and Text.]

Different Architecture
Distinct Memory
Different Memory Layouts


11/23

Our Strategy 3: Unified Memory Layout

[Diagram: the ARM Application on Mobile (Android, ARM) and the x86 Offloaded binary on Server (Linux, x86) both use ptr = 0x1234, and the int and double fields it refers to have the same layout on both sides.]

Different Architecture
Distinct Memory
Different Memory Layouts


12/23

Native Offloader: Structure Overview

Sources -> Profile (runGame: 70%, getAITurn: 68%) -> Compiler -> Mobile Binary, Server Binary -> Offload

Compiler stages: Target Selection, Virtual Address Unification, Partition, Server Specific Opt.


13/23

Selecting Profitable Targets

Candidate      | Exec. Time | Call Count | Mem. Size
runGame        | 27 s       | 1          | 20 MB
getAITurn      | 26 s       | 3          | 12 MB
getPlayerTurn  | 1.5 s      | 3          | 10 MB

Estimation

Candidate      | Ideal Gain | Comm.  | Estimated Gain
runGame        | 21.6 s     | 4 s    | 17.6 s
getAITurn      | 20.8 s     | 7.2 s  | 13.6 s
getPlayerTurn  | 1.2 s      | 6 s    | -4.8 s

Target Selection | VA Unification | Partition | Server Specific Opt.


14/23

Selecting Profitable Targets

Estimation

Candidate      | Ideal Gain | Comm.  | Estimated Gain
runGame        | 21.6 s     | 4 s    | 17.6 s
getAITurn      | 20.8 s     | 7.2 s  | 13.6 s
getPlayerTurn  | 1.2 s      | 6 s    | -4.8 s

Call graph: runGame calls getAITurn and getPlayerTurn.

runGame, getPlayerTurn: This calls input operations.

Selected: getAITurn | 20.8 s | 7.2 s | 13.6 s
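The estimation table can be reproduced with a few lines of arithmetic. The sketch below is only a reading of the slide's numbers, not the paper's cost model: the 5x server speed-up and the 0.2 s per MB per call transfer cost are assumptions chosen because they reproduce every row of the table, and the names `Candidate`, `does_input`, `chess_candidates` and `select_target` are hypothetical.

```c
#include <stddef.h>

/* Profile data for one offloading candidate (values copied from the slide). */
typedef struct {
    const char *name;
    double exec_time_s;  /* execution time on the mobile device */
    int call_count;      /* number of invocations */
    double mem_size_mb;  /* memory size per invocation */
    int does_input;      /* nonzero if the function calls input operations */
} Candidate;

/* ASSUMPTIONS: inferred from the table, not stated in the slides. */
#define ASSUMED_SPEEDUP 5.0     /* server runs the code 5x faster */
#define ASSUMED_SEC_PER_MB 0.2  /* transfer cost per MB, per call */

/* Time saved if communication were free. */
double ideal_gain(const Candidate *c) {
    return c->exec_time_s * (1.0 - 1.0 / ASSUMED_SPEEDUP);
}

/* Time spent moving data for all calls. */
double comm_cost(const Candidate *c) {
    return c->call_count * c->mem_size_mb * ASSUMED_SEC_PER_MB;
}

double estimated_gain(const Candidate *c) {
    return ideal_gain(c) - comm_cost(c);
}

/* Index of the best candidate with positive gain and no input
   operations, or -1 if there is none. */
int select_target(const Candidate *cands, size_t n) {
    int best = -1;
    double best_gain = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (cands[i].does_input) continue;
        double g = estimated_gain(&cands[i]);
        if (g > best_gain) { best_gain = g; best = (int)i; }
    }
    return best;
}

const Candidate chess_candidates[3] = {
    { "runGame",       27.0, 1, 20.0, 1 },
    { "getAITurn",     26.0, 3, 12.0, 0 },
    { "getPlayerTurn",  1.5, 3, 10.0, 1 },
};
```

With these inputs, `select_target(chess_candidates, 3)` picks index 1 (getAITurn): runGame has the larger estimated gain but is skipped because it calls input operations.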


15/23

Unifying Structure Layouts

typedef struct {
  char tar, to;
  double score;
} Move;

[Diagram: field layouts of Move (tar, to, padding, score) for x86, ARM, and Unified; the unified layout adds an explicit <pad> member to the struct.]
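One way to pin a struct to a single layout is to write the padding out explicitly and check it at compile time. This is a minimal sketch of that idea, not the compiler pass from the slides: the member name `pad`, its size of 6 bytes, and the type name `UnifiedMove` are assumptions. The motivation is that some ABIs align `double` to 4 bytes inside structs (for example the 32-bit x86 System V ABI) while others align it to 8, so the implicit padding after `to` can differ.

```c
#include <stddef.h>

/* Move with the padding spelled out, so that score lands at offset 8
   whether the target ABI aligns double to 4 or to 8 bytes. */
typedef struct {
    char tar, to;
    char pad[6];   /* hypothetical explicit padding member */
    double score;
} UnifiedMove;

/* Compile-time checks: the build fails if a target disagrees. */
_Static_assert(offsetof(UnifiedMove, to) == 1, "to must sit at offset 1");
_Static_assert(offsetof(UnifiedMove, score) == 8, "score must sit at offset 8");
_Static_assert(sizeof(UnifiedMove) == 16, "UnifiedMove must be 16 bytes");
```

Compiling the same header for both targets then acts as the layout check: if either side computes a different offset, the static assertion stops the build.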


16/23

Unifying Heap Areas

board = malloc (sizeof(char) * 32);
free (board);

malloc and free are replaced with u_malloc and u_free, so that Mobile and Server allocate from the same Heap area.
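A unified allocator can be pictured as a deterministic allocator over one reserved region. The sketch below is a toy stand-in, not the runtime from the paper: it uses a static array where a real system would presumably reserve the same virtual address range on both machines, it only reclaims the most recent allocation, and `U_HEAP_SIZE`, `U_ALIGN` and `u_offset` are hypothetical names. Only `u_malloc` and `u_free` come from the slide.

```c
#include <stddef.h>
#include <string.h>

#define U_HEAP_SIZE 4096u
#define U_ALIGN 16u

/* Stand-in for a region mapped at the same address on both sides. */
static _Alignas(16) unsigned char u_heap[U_HEAP_SIZE];
static size_t u_top = 0;  /* offset of the first free byte */

/* Offset of a unified pointer from the heap base. */
size_t u_offset(const void *p) {
    return (size_t)((const unsigned char *)p - u_heap);
}

void *u_malloc(size_t size) {
    if (size == 0) size = 1;
    if (size > U_HEAP_SIZE) return NULL;
    size_t payload = (size + U_ALIGN - 1) / U_ALIGN * U_ALIGN;
    size_t total = U_ALIGN + payload;          /* header + payload */
    if (total > U_HEAP_SIZE - u_top) return NULL;
    /* The header stores the payload size for u_free. */
    memcpy(u_heap + u_top, &payload, sizeof payload);
    void *p = u_heap + u_top + U_ALIGN;
    u_top += total;
    return p;
}

void u_free(void *p) {
    if (p == NULL) return;
    unsigned char *block = (unsigned char *)p - U_ALIGN;
    size_t payload;
    memcpy(&payload, block, sizeof payload);
    /* This sketch only reclaims the most recent allocation. */
    if (block + U_ALIGN + payload == u_heap + u_top)
        u_top = (size_t)(block - u_heap);
}
```

Because the allocator is deterministic, the same sequence of `u_malloc` and `u_free` calls yields the same offsets on either side, which is the property a shared heap needs.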


17/23

Aligning Two Stack Areas

int bar (int *i) {
  int j = 60;
  if (*i != 30) crash;
}

int foo () {
  int i = 30;
  bar (&i);
}

[Diagram: the Mobile Stack holds i = 30 and j = 60; the Server Stack holds only j = 60, at the address pointed to by int *i. On the Server *i == 60, so "if (*i != 30) crash;" fires: Crash!]


18/23

Partitioning for The Separate Binaries

Mobile Source:

void runGame () {
  while (!gameover) {
    Move mv;
    mv = getPlayerTurn ();
    pieces[mv.tar] = mv.to;
    mv = getAITurn ();
    pieces[mv.tar] = mv.to;
  }
}

On the mobile side, the call mv = getAITurn (); is replaced with:

    requestOffload (getAITurn);
    sendData ();
    receiveData ();

Server Source:

void listenClient () {
  FunctionID offID;
  while (true) {
    offID = acceptOffload ();
    receiveData ();
    executeFunction (offID);
    sendData ();
  }
}

Flow: Offload (Mobile to Server), Execute (Server), Return (Server to Mobile)
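The server loop needs some way to turn a received FunctionID into a call. A dispatch table is one plausible way to do that; the sketch below simulates both sides in a single process and is not the generated code from the paper. `SharedState`, `offload_table`, `getAITurn_server` and `getAITurn_stub` are hypothetical, the network calls are reduced to comments, and `executeFunction` takes an extra state argument that the slide's version does not have.

```c
#include <stddef.h>

typedef struct { char tar, to; double score; } Move;
typedef enum { FN_GET_AI_TURN = 0, FN_COUNT } FunctionID;

/* Stand-in for the memory both sides agree on after data transfer. */
typedef struct { char pieces[32]; Move result; } SharedState;

typedef void (*OffloadFn)(SharedState *);

/* Placeholder for the heavy computation: always moves piece 1 to 6. */
static void getAITurn_server(SharedState *s) {
    s->result.tar = 1;
    s->result.to = 6;
    s->result.score = 0.5;
}

/* One entry per offloadable function, indexed by FunctionID. */
static const OffloadFn offload_table[FN_COUNT] = { getAITurn_server };

/* Server side: run the requested function. Returns 0 on success. */
int executeFunction(FunctionID id, SharedState *s) {
    if ((int)id < 0 || (int)id >= (int)FN_COUNT) return -1;
    offload_table[id](s);
    return 0;
}

/* Mobile side: stub that stands in for the direct getAITurn() call. */
Move getAITurn_stub(SharedState *s) {
    /* requestOffload (getAITurn); sendData (); would happen here */
    executeFunction(FN_GET_AI_TURN, s);
    /* receiveData (); would happen here */
    return s->result;
}
```

In `runGame`, the line `mv = getAITurn ();` would then become `mv = getAITurn_stub(&state);`, leaving the rest of the loop unchanged.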


19/23

Server Specific Optimizations

Function Pointer Mapper: Maps the mobile's function address to the server's one.

Remote I/O: The server requests I/O operations remotely.

Please refer to our paper for more details!


20/23

Remaining Challenges for Mobile Apps

Multi-language Support: Mobile apps are written in multiple languages.
ex) Android Apps w/ NDK (Java + C/C++)

Multi-threaded Applications: Emerging mobile applications are multi-threaded.

So, this work uses SPEC benchmark suites.


21/23

Evaluation

17 SPEC 2000/2006 C Benchmarks
  LLVM Compile Error: 400.perlbench, 403.gcc
  Non-profitable target: 197.parser, 254.gap, 255.vortex

Galaxy S5 as the Mobile Device, Dell XPS8700 as the Server
  Mobile (2.5GHz Quad-core Krait 400)
  Server (Intel 3.60GHz Quad-core i7-4790)

2 Different Network Bandwidths
  802.11n (Maximum 144Mbps)
  802.11ac (Maximum 844Mbps)

LLVM Compiler Framework


22/23

Normalized Execution Time

[Bar chart, y-axis 0.0 to 1.0. Series: Native Offloader (144 Mbps), Native Offloader (844 Mbps), Ideal Offloading.]

6.42x Speed-up!


23/23

Normalized Battery Consumption

[Bar chart, y-axis 0.0 to 1.0. Series: Native Offloader (144 Mbps), Native Offloader (844 Mbps), Ideal Offloading.]

82% Saving!


24/23

Conclusion

Native Offloader: Compiler/runtime cooperative offloading system for general-purpose native applications
  To minimize offloading overheads, this work unified virtual address spaces.

Fast & Battery-friendly: 6.42x Speed-up / 82% Battery saving for 17 SPEC 2000/2006 C Benchmarks

Image: http://www.worth1000.com/contests/27097/interspecies-teamwork-2

Woo-hoo!

The paper also includes remote I/O, function pointer mapping, pointer size / endianness translation, runtime system and comm. optimizations.


25/23

Native Offloader
Architecture-aware Automatic Computation Offload for Native Applications

Backup Slides


26/23

How It Works at Runtime

Mobile Server

Page Table

Physical Pages

Stage: Local Execution

Stage: Initialization

Synchronize

Prefetch


27/23

How It Works at Runtime

Mobile Server

Page Table

Physical Pages

On-demand Copy

Stage: Offloading Execution

Page Fault

Dirty Pages


28/23

How It Works at Runtime

Mobile Server

Page Table

Physical Pages

Stage: Finalization

Synchronize

Write-back

Dirty Pages
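The three runtime stages amount to a copy-on-demand protocol with dirty-page write-back. The sketch below simulates that protocol with plain arrays in one process, as an illustration only: a real runtime would rely on hardware page faults and a network transport, the initial prefetch is not modeled, and `Side`, `server_read`, `server_write`, `write_back` and the tiny page size are all assumptions.

```c
#include <string.h>

#define NUM_PAGES 4
#define PAGE_SIZE 16  /* tiny pages keep the sketch readable */

typedef struct {
    unsigned char mem[NUM_PAGES][PAGE_SIZE];
    int present[NUM_PAGES];  /* page has been copied to this side */
    int dirty[NUM_PAGES];    /* page was modified on this side */
} Side;

static int pages_copied = 0;  /* counts simulated network transfers */

/* Simulated page fault handler: copy the page on first access. */
static void fetch_if_absent(Side *server, const Side *mobile, int page) {
    if (!server->present[page]) {
        memcpy(server->mem[page], mobile->mem[page], PAGE_SIZE);
        server->present[page] = 1;
        pages_copied++;
    }
}

unsigned char server_read(Side *server, const Side *mobile, int addr) {
    int page = addr / PAGE_SIZE;
    fetch_if_absent(server, mobile, page);
    return server->mem[page][addr % PAGE_SIZE];
}

void server_write(Side *server, const Side *mobile, int addr,
                  unsigned char value) {
    int page = addr / PAGE_SIZE;
    fetch_if_absent(server, mobile, page);
    server->mem[page][addr % PAGE_SIZE] = value;
    server->dirty[page] = 1;
}

/* Finalization: write dirty pages back; returns how many were sent. */
int write_back(Side *server, Side *mobile) {
    int sent = 0;
    for (int p = 0; p < NUM_PAGES; p++) {
        if (server->dirty[p]) {
            memcpy(mobile->mem[p], server->mem[p], PAGE_SIZE);
            server->dirty[p] = 0;
            sent++;
            pages_copied++;
        }
    }
    return sent;
}

int total_pages_copied(void) { return pages_copied; }
```

Only pages the server actually touches are transferred, and only modified pages travel back, which is why the address spaces have to agree before any of this can work.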


29/23

Several offloading systems have already been proposed.

Pointer Analysis-based System

[Diagram: instructions (Instr.) linked to the objects (Obj.) they access.]

Li et al. (CASES'01), Wang and Li (PLDI'04)


30/23

Function Pointer Mapping

void (*fptr) ();
fptr = foo;

Mobile:
fptr("ARGS");

Server:
fptr = toServer(fptr);
fptr("ARGS");

[Diagram: Mobile and Server memory (Stack, Heap, Text); foo appears in the Text area on both sides.]
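A mapper like toServer can be thought of as a lookup from the function's address in the mobile binary to the matching function in the server binary. The following is a hedged sketch of that lookup, not the paper's mechanism: the table layout is assumed, the mobile address 0x00010a30 is made up, and this `toServer` takes an integer address rather than a function pointer as on the slide.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*Fn)(void);

/* One row per function: where it lives in the mobile binary, and the
   server-side function that implements it. */
typedef struct {
    uint32_t mobile_addr;
    Fn server_fn;
} FnMapEntry;

static int foo_calls = 0;
static void foo(void) { foo_calls++; }

/* ASSUMPTION: 0x00010a30 is a made-up mobile text address for foo. */
static const FnMapEntry fn_map[] = {
    { 0x00010a30u, foo },
};

/* Translate a mobile function address to the server's function.
   Returns NULL when the address is unknown. */
Fn toServer(uint32_t mobile_addr) {
    for (size_t i = 0; i < sizeof fn_map / sizeof fn_map[0]; i++) {
        if (fn_map[i].mobile_addr == mobile_addr)
            return fn_map[i].server_fn;
    }
    return NULL;
}

int foo_call_count(void) { return foo_calls; }
```

A linear scan is enough for a sketch; a table sorted by address with binary search would be the obvious next step if calls through function pointers are frequent.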


31/23

Unifying the Virtual Address Space

Installing Pointer Size & Endianness Translators

Pointer Size Translation: 01 23 45 67 -> 01 23 45 67 00 00 00 00

Endianness Translation: 0123456700 00 00 00

Before:
int value = *ptr32;

After:
ptr64 = zext (ptr32);
int value = toLittle(*ptr64);
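Both translators are small, pure functions. The sketch below shows one plausible form of each, under assumed names (`zext_ptr32`, `swap_endian32`, `load_be32`); which direction the real system converts in, and when, is not stated on the slide.

```c
#include <stdint.h>

/* Pointer size translation: widen a 32-bit address to 64 bits by
   zero extension, so the numeric address is unchanged. */
uint64_t zext_ptr32(uint32_t p) {
    return (uint64_t)p;
}

/* Endianness translation: reverse the byte order of a 32-bit value. */
uint32_t swap_endian32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Read a 32-bit value stored big-endian in memory, independent of the
   byte order of the machine running this code. */
uint32_t load_be32(const unsigned char *bytes) {
    return ((uint32_t)bytes[0] << 24) |
           ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |
           (uint32_t)bytes[3];
}
```

Assembling the value byte by byte, as `load_be32` does, avoids any dependence on the host's own byte order, which makes it the safer pattern when the two sides may differ.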


32/23

Reallocating Global Variables

Before:

int gvar;
int *gptr = &gvar;

int foo () {
  *gptr = 400;
}

After:

int *gvar_re = u_malloc(4);
int *gptr = gvar_re;

[Diagram: Mobile and Server memory with gvar and gvar_re on both sides; labeled areas: Heap, Text.]


33/23

Server Specific Optimizations
Installing Remote I/O APIs

Mobile:
printf ("%s", "Hello,\n");

Server:
printf ("%s", "World!"); is replaced with r_printf ("%s", "World!");

[Diagram: the Server sends "World!" to the mobile (Send to the mobile); the Mobile output reads "Hello World!".]
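A remote printf can be sketched as "format locally, ship the bytes, print on the other side". The code below illustrates that idea in one process and is not the paper's API beyond the name `r_printf`: the `outbox` buffer and `flush_to_mobile` are hypothetical, and the network send is replaced by a copy into a caller-supplied buffer.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define OUTBOX_SIZE 256

/* Output produced on the server, waiting to be sent to the mobile. */
static char outbox[OUTBOX_SIZE];
static size_t outbox_len = 0;

/* Server side: format like printf, but queue the text instead of
   writing it to the server's own stdout. */
int r_printf(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(outbox + outbox_len, OUTBOX_SIZE - outbox_len,
                      fmt, ap);
    va_end(ap);
    if (n < 0) return n;
    size_t room = OUTBOX_SIZE - outbox_len - 1;
    outbox_len += ((size_t)n < room) ? (size_t)n : room;  /* truncate */
    return n;
}

/* Stand-in for "Send to the mobile": copy the queued text into dst
   and clear the queue. Returns the number of characters delivered. */
size_t flush_to_mobile(char *dst, size_t cap) {
    if (cap == 0) return 0;
    size_t n = (outbox_len < cap - 1) ? outbox_len : cap - 1;
    memcpy(dst, outbox, n);
    dst[n] = '\0';
    outbox_len = 0;
    outbox[0] = '\0';
    return n;
}
```

Batching output in a buffer and flushing it occasionally is a design choice of this sketch; sending every call immediately would be simpler but would pay a network round trip per `r_printf`.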


34/23

Runtime Battery Consumption Pattern

[Charts: 445.gobmk (Fast, 802.11ac) and 445.gobmk (Slow, 802.11n)]

Initial Prefetching Overhead: 802.11ac < 802.11n

Total Remote I/O Overhead: 802.11ac >> 802.11n

Total Battery Overhead : 802.11ac > 802.11n

+


35/23

Runtime Estimator

void runGame () {
  while (!gameover) {
    Move mv;
    mv = getPlayerTurn ();
    pieces[mv.tar] = mv.to;

    if (profitable (getAITurn)) {
      requestOffload (getAITurn);
      sendData ();
      receiveData ();
    }

    pieces[mv.tar] = mv.to;
  }
}

profitable () uses real-time bandwidth information.
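A runtime check of this kind can be as simple as comparing local time against remote time plus transfer time at the currently measured bandwidth. The sketch below is an assumed form of such a check, not the estimator from the paper: the four-parameter signature and the unit choices (seconds, MB, Mbps) are hypothetical, whereas the slide's `profitable` takes the function itself.

```c
/* Returns nonzero when offloading one call is expected to finish
   sooner than running it on the mobile device.
   All parameters are assumptions of this sketch:
     mobile_time_s   time of one call on the mobile device
     speedup         how many times faster the server runs the call
     transfer_mb     data moved for one call, in megabytes
     bandwidth_mbps  currently measured bandwidth, in megabits/s */
int profitable(double mobile_time_s, double speedup,
               double transfer_mb, double bandwidth_mbps) {
    if (speedup <= 0.0 || bandwidth_mbps <= 0.0) return 0;
    double server_time_s = mobile_time_s / speedup;
    double comm_time_s = transfer_mb * 8.0 / bandwidth_mbps;
    return server_time_s + comm_time_s < mobile_time_s;
}
```

For example, a call that takes about 8.7 s locally and moves 12 MB is worth offloading at 144 Mbps under a 5x speed-up, while a 0.5 s call that moves 10 MB over a 1 Mbps link is not.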


36/23

Overhead Analysis

[Stacked bar chart, y-axis 0.0 to 1.0, two bars labeled s and f per benchmark. Legend: Useful Work, Function Pointer Translation, Remote I/O, Communication.]

Annotations:
Large data size & Ineffective data compression
Frequent remote input
The algorithm uses function pointers frequently.


37/23

QUESTIONS & ANSWERS

Thank you for your attention!

Compiler Research Lab (CORELAB)
Dept. of Computer Science & Engineering

POSTECH, Korea

Gwangmu [email protected]