Workload Offloading for Native Codes from ARM to x86 Hyunjoon Park, Gwangmu Lee, Taehwa Kim, Hanjun Kim CoreLab POSTECH

Post on 24-Dec-2015

Page 1: Workload Offloading for Native Codes from ARM to x86 Hyunjoon Park, Gwangmu Lee, Taehwa Kim, Hanjun Kim CoreLab POSTECH

Workload Offloading for Native Codes from ARM to x86

Hyunjoon Park, Gwangmu Lee, Taehwa Kim, Hanjun Kim
CoreLab POSTECH

Page 2:

ARM is widely used in various smart devices

Source: http://www.rudebaguette.com/assets/smart_devices.jpg

Page 3:

ARM is much slower than x86

[Figure: execution time normalized to x86 execution time, ARM vs. x86, for 2mm, 3mm, cholesky, correlation, covariance, doitgen, dynprog, fdtd-2d, gemm, reg_detect, symm, syr2k, syrk, and GEOMEAN; y-axis from 0 to 35.]

Client: ARMv7 Processor 1.70GHz 4-core (Lubuntu 14.04)
Server: Intel(R) Xeon(R) E5-2407 2.20GHz 8-core (Ubuntu 14.04)

Page 4:

Offloading has been proposed

• Existing offloading techniques rely on virtual machines

[Figure: VM-based offloading architecture. Labels on the ARM side: OS, Application, App. VM, Migration, Profiler, Runtime, Manager. Labels on the x86 side: VMM, Virtual HW, OS, Application, App. VM, Migration, Profiler, Manager.]

Source: Byung-Gon Chun et al. CloneCloud: elastic execution between mobile device and cloud. EuroSys '11

Page 5:

VMs are SLOW!!!

[Figure: execution time of an image edge detection program, runtime normalized to C, for C, C++ using STL containers, Java, JIT JavaScript, and Interpreted JavaScript; y-axis from 0 to 9, with a "50X" annotation.]

Source: Mojtaba Mehrara et al. Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism. HPCA '11

Huge Performance Overhead of Virtual Machine

Page 6:

Offloading for Native Code is necessary

[Figure: an application binary running on OS A on ARM, and an application binary running on OS B on x86.]

• Different ISAs
• Different Memory Layouts
• Different ABIs (Application Binary Interface)
  • Sizes, layout, and alignment of data types
  • Calling convention
  • System Libraries

Page 7:

Overall System

[Figure: overall system. Inputs: Source Code, Target Info. Stages: Target Selector (Hot Function Detector, Function Filter) and Native Offloader (Partitioner, Unified Virtual Address Mngr., ABI Convertor, Communication Optimizer). Outputs: ARM Binary (Whole Prgm), x86 Binary (Offloaded Fcn).]

Page 8:

164.gzip in SPEC 2000

    main() {
        init();
        compress();
        uncompress();
        verification();
    }

Page 9:

    main() {
        init();
        compress();
        uncompress();
        verification();
    }

    init() {
        ..
        file_read_to_memory();
        ..
    }

• Constraint cases
  • File I/O
  • System call
  • Machine-specific code

Page 10:

    main() {
        init();
        compress();
        uncompress();
        verification();
    }

Function        Coverage
compress        37%
uncompress      42%
verification    1.5%
Total           100%
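The selection step suggested by the constraint cases and the coverage table can be sketched as a simple filter. This is only an illustration under stated assumptions: the `func_profile` record, the `has_constraint` flag, and the `min_coverage` threshold are hypothetical names, and the slides do not say what threshold the Target Selector actually applies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical profile record; field names are illustrative only. */
struct func_profile {
    const char *name;
    double coverage;      /* share of total execution time, in percent */
    bool has_constraint;  /* file I/O, system call, or machine-specific code */
};

/* Stores pointers to the offloading candidates in out[] and returns how many
 * were selected. A function qualifies when it has no constraint and its
 * coverage reaches min_coverage (an assumed tuning knob). */
size_t select_targets(const struct func_profile *funcs, size_t n,
                      double min_coverage,
                      const struct func_profile **out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!funcs[i].has_constraint && funcs[i].coverage >= min_coverage)
            out[count++] = &funcs[i];
    }
    return count;
}
```

With the gzip figures and an assumed threshold of 10%, compress (37%) and uncompress (42%) would pass, verification (1.5%) would fall below the threshold, and init would be rejected regardless of coverage because it reads a file.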

Page 11:

    main() {
        init();
        compress();
        uncompress();
        verification();
    }

Page 12:

Client: ARM

    main() {
        init();
        compress();
        uncompress();
        verification();
    }

Server: x86

    main() {
        while ((id = recv())) {
            switch (id) {
            }
            send(ret);
        }
    }

Page 13:

Client: ARM

    main() {
        init();
        send(compress_id);
        ret = recv();
        uncompress();
        verification();
    }

Server: x86

    main() {
        while ((id = recv())) {
            switch (id) {
            case compress_id:
                ret = compress();
                break;
            }
            send(ret);
        }
    }

Page 14:

Client: ARM

    main() {
        init();
        send(compress_id);
        ret = recv();
        send(uncompress_id);
        ret = recv();
        verification();
    }

Server: x86

    main() {
        while ((id = recv())) {
            switch (id) {
            case compress_id:
                ret = compress();
                break;
            case uncompress_id:
                ret = uncompress();
                break;
            }
            send(ret);
        }
    }
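The server loop above can be made concrete with a small dispatcher. The following is a minimal sketch, not the generated code from the talk: the `func_id` values, the stand-in functions `compress_fn` and `uncompress_fn`, and the use of arrays in place of `recv()` and `send()` are all assumptions made so that the example compiles on its own.

```c
#include <stddef.h>

/* Hypothetical function IDs shared by client and server. ID 0 ends the
 * loop, mirroring while ((id = recv())) on the slides. */
enum func_id { EXIT_ID = 0, COMPRESS_ID = 1, UNCOMPRESS_ID = 2 };

/* Stand-ins for the offloaded functions; they only return fixed values. */
static int compress_fn(void)   { return 10; }
static int uncompress_fn(void) { return 20; }

/* Runs the function named by id and stores its result in *ret.
 * Returns 0 on success and -1 for an unknown ID. */
int dispatch(enum func_id id, int *ret)
{
    switch (id) {
    case COMPRESS_ID:
        *ret = compress_fn();
        return 0;
    case UNCOMPRESS_ID:
        *ret = uncompress_fn();
        return 0;
    default:
        return -1;
    }
}

/* Server loop: requests[] stands in for recv() and results[] for send().
 * Stops at EXIT_ID and returns the number of requests served. */
size_t serve(const enum func_id *requests, size_t n, int *results)
{
    size_t served = 0;
    for (size_t i = 0; i < n && requests[i] != EXIT_ID; i++) {
        if (dispatch(requests[i], &results[served]) == 0)
            served++;
    }
    return served;
}
```

Each case returns immediately, so one request never falls through into the next function.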

Page 15:

[Figure: client's memory layout and server's memory layout, each showing text, global variables, heap (up to brk), and stack (from sp). sp and brk sit at different positions in the two layouts, and regions are marked "Overwritten".]

Page 16:

[Figure: client's memory layout and server's memory layout, each showing text, global variables, heap, and stack, with sp and brk marked on both sides; no region is marked "Overwritten".]
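The slides do not spell out how the Unified Virtual Address Manager picks addresses, so the following is only one possible illustration of the problem the two layout figures show. It assumes that conflicts arise when the heap of one machine overlaps the text or global variables of the other, and that a shared heap start can be chosen above both. The `range` type and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end); purely illustrative. */
struct range {
    uintptr_t start;
    uintptr_t end;
};

/* True when the two ranges share at least one address. */
bool ranges_overlap(struct range a, struct range b)
{
    return a.start < b.end && b.start < a.end;
}

/* Picks a heap start usable on both machines: the higher of the two ends of
 * the static data (text plus globals), so that a heap placed there cannot
 * overwrite either side's text or global variables. */
uintptr_t unified_heap_start(uintptr_t client_data_end,
                             uintptr_t server_data_end)
{
    return client_data_end > server_data_end ? client_data_end
                                             : server_data_end;
}
```

A check such as `ranges_overlap(client_heap, server_globals)` would flag the "Overwritten" situation before any data is copied.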

Page 17:

    struct Foo {
        char a;
        long long b;
        int c;
    };

[Figure: structure layout of Foo (members a, b, c) on ARM and on x86, and the layout on x86 after conversion.]

    struct Foo_cvrt {
        char a;
        char dummy0[7];
        long long b;
        int c;
        int dummy1;
    };
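The effect of the explicit padding can be checked with `offsetof`. The sketch below assumes a 1-byte `char`, a 4-byte `int`, and an 8-byte `long long`. The slides do not name the ABIs involved; the layouts differ, for example, between the ARM procedure call standard, which aligns `long long` to 8 bytes, and the 32-bit x86 System V ABI, which aligns it to 4 bytes inside a structure. The padding members are called `dummy0` and `dummy1` here because C does not allow two members with the same name.

```c
#include <stddef.h>

/* Original structure; its layout depends on the target ABI. */
struct Foo {
    char a;
    long long b;
    int c;
};

/* Converted structure with the padding written out as real members. */
struct Foo_cvrt {
    char a;
    char dummy0[7];
    long long b;
    int c;
    int dummy1;
};

/* With explicit padding the offsets no longer depend on how strictly the
 * target aligns long long, under the size assumptions stated above. */
_Static_assert(offsetof(struct Foo_cvrt, b) == 8, "b must start at byte 8");
_Static_assert(offsetof(struct Foo_cvrt, c) == 16, "c must start at byte 16");
_Static_assert(sizeof(struct Foo_cvrt) == 24, "converted size must be 24 bytes");
```

Under the same assumptions, plain `struct Foo` places `b` at byte 8 with 8-byte alignment but at byte 4 with 4-byte alignment, which is the mismatch the conversion removes.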

Page 18:

    Foo fn1(Foo a);    /* internal function */

    void offloaded() {
        …
        Foo a = *pa;
        Foo ret = fn1(a);
        …
    }

Page 19:

    Foo fn1(Foo a);                    /* internal function, before conversion */
    Foo_cvrt fn1_cvrt(Foo_cvrt a);     /* internal function, after conversion */

    void offloaded() {
        …
        Foo_cvrt a = *pa;
        Foo_cvrt ret = fn1_cvrt(a);
        …
    }

Page 20:

    Foo fn2(Foo a);    /* external function */

    void offloaded() {
        …
        Foo a = *pa;
        Foo tret = fn2(a);
        …
    }

Page 21:

    Foo fn2(Foo a);    /* external function */

    void offloaded() {
        …
        Foo_cvrt a = *pa;
        Foo ta = convert_to_x86(a);
        Foo tret = fn2(ta);
        Foo_cvrt ret = convert_to_arm(tret);
        …
    }
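The slides name `convert_to_x86` and `convert_to_arm` but do not show their bodies. A minimal sketch, assuming they are plain field-by-field copies between the two layouts, might look as follows. The body of `fn2` is a made-up stand-in for an external library function, `call_fn2` is a hypothetical wrapper, and the padding member names `dummy0` and `dummy1` are chosen here to keep the structure valid C.

```c
/* Native layout expected by external (library) code on the server. */
struct Foo {
    char a;
    long long b;
    int c;
};

/* Client-compatible layout with explicit padding members. */
struct Foo_cvrt {
    char a;
    char dummy0[7];
    long long b;
    int c;
    int dummy1;
};

/* Copies the three real members into the native layout. */
struct Foo convert_to_x86(struct Foo_cvrt in)
{
    struct Foo out = {0};
    out.a = in.a;
    out.b = in.b;
    out.c = in.c;
    return out;
}

/* Copies the three real members back; dummy0 and dummy1 stay zero. */
struct Foo_cvrt convert_to_arm(struct Foo in)
{
    struct Foo_cvrt out = {0};
    out.a = in.a;
    out.b = in.b;
    out.c = in.c;
    return out;
}

/* Stand-in for an external function that only knows the native layout. */
struct Foo fn2(struct Foo a)
{
    a.c += 1;
    return a;
}

/* Wraps the external call with a conversion on the way in and out. */
struct Foo_cvrt call_fn2(struct Foo_cvrt a)
{
    struct Foo ta = convert_to_x86(a);
    struct Foo tret = fn2(ta);
    return convert_to_arm(tret);
}
```

Internal functions avoid this cost because they can be recompiled to take `Foo_cvrt` directly; only calls that cross into code built for the native layout need the two copies.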

Page 22:

• Speculative page migration (Before offloading)

Profiling result: Used in Offloaded(): Page #1, #2, #3 …

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1
2
3
4
……

Migration (client to server)
1      0x5052  0
2      0xFF00  0
3      0x2A48  0

Page 23:

• Lazy Loading (During offloading)

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      (Page Fault)
……

Request (server to client)
4      0xF35A  0

Page 24:

• Lazy Loading (During offloading)

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      (Page Fault)
……

Migration (client to server)
4      0xF35A  0

Page 25:

• Write-back (After offloading)

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1      0x5052  0
2      0x00AC  1
3      0x2000  1
4      0xF35A  0
……

Page 26:

• Write-back (After offloading)

Client memory
Page#  Value   Dirty
1      0x5052  0
2      0xFF00  0
3      0x2A48  0
4      0xF35A  0
……

Server memory
Page#  Value   Dirty
1      0x5052  0
2      0x00AC  0
3      0x2000  0
4      0xF35A  0
……

Write-back (server to client)
2      0x00AC  0
3      0x2000  0
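The three phases shown in the page tables (speculative migration before offloading, lazy loading during it, write-back after it) can be simulated in a few lines. This is a toy model under stated assumptions, not the runtime from the talk: each page holds a single 16-bit value, both machines are plain structs in one process, a `present` flag replaces a real page fault, and every name below is hypothetical. Page indices are zero-based, so page #1 on the slides is index 0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 4

/* One simulated page: a 16-bit value stands in for the page contents. */
struct page {
    uint16_t value;
    bool present;  /* server side: the page has been copied over */
    bool dirty;    /* modified on the server since it was copied */
};

struct machine {
    struct page pages[NPAGES];
};

/* Before offloading: copy the pages that profiling predicts will be used. */
void prefetch(const struct machine *client, struct machine *server,
              const size_t *predicted, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t p = predicted[i];
        server->pages[p].value = client->pages[p].value;
        server->pages[p].present = true;
        server->pages[p].dirty = false;
    }
}

/* During offloading: a page that is not present is fetched on first access,
 * which models the request and migration triggered by a page fault. */
static struct page *touch(const struct machine *client,
                          struct machine *server, size_t p)
{
    if (!server->pages[p].present) {
        server->pages[p].value = client->pages[p].value;
        server->pages[p].present = true;
        server->pages[p].dirty = false;
    }
    return &server->pages[p];
}

uint16_t server_read(const struct machine *client, struct machine *server,
                     size_t p)
{
    return touch(client, server, p)->value;
}

void server_write(const struct machine *client, struct machine *server,
                  size_t p, uint16_t v)
{
    struct page *pg = touch(client, server, p);
    pg->value = v;
    pg->dirty = true;
}

/* After offloading: send back only the dirty pages and clear their dirty
 * bits. Returns the number of pages written back. */
size_t write_back(struct machine *client, struct machine *server)
{
    size_t sent = 0;
    for (size_t p = 0; p < NPAGES; p++) {
        if (server->pages[p].present && server->pages[p].dirty) {
            client->pages[p].value = server->pages[p].value;
            server->pages[p].dirty = false;
            sent++;
        }
    }
    return sent;
}
```

Replaying the slide values, prefetching indices 0 to 2 leaves index 3 to be fetched on first read, and writing 0x00AC and 0x2000 to indices 1 and 2 makes exactly those two pages travel back to the client.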

Page 27:

Evaluation

[Figure: speedup for gemm, 2mm, 3mm, cholesky, correlation, covariance, doitgen, dynprog, fdtd-2d, reg_detect, symm, syr2k, syrk, and GeoMean; y-axis from 0 to 10.]

Client: ARMv7 Processor 1.70GHz 4-core (Lubuntu 14.04)
Server: Intel(R) Xeon(R) E5-2407 2.20GHz 8-core (Ubuntu 14.04)

Page 28:

Conclusion

• We developed a compiler framework that provides workload offloading for native code from ARM to x86.

• We solve the problems of different ISAs, memory layouts, and ABIs that arise when offloading native code.