
  • 1/23

    Architecture-aware Automatic Computation Offload for Native Applications

    Gwangmu Lee, Hyunjoon Park, Seonyeong Heo, Kyung-Ah Chang*, Hyogun Lee*, and Hanjun Kim.

    POSTECH / Samsung Electronics*

  • 2/23

    Mobile devices are slow

    Figure: Chess Movement Computation, Mobile vs. Desktop (axis: 1x to 5x), showing a 5~6x Performance Gap.

    void runGame () {
      while (!gameover) {
        Move mv;

        /* User Inputs */
        mv = getPlayerTurn ();
        pieces[mv.tar] = mv.to;

        /* Heavy Computation */
        mv = getAITurn ();
        pieces[mv.tar] = mv.to;
      }
    }

  • 3/23

    Offloading can boost your mobile device!

    Mobile Device to High-performance Server: "Which piece to move?"

    High-performance Server to Mobile Device: Computation Result, "Move the knight to A6!"

  • 4/23

    Most offloading systems are based on VMs.

    CloneCloud (EuroSys '11), MAUI (MobiSys '10), CMCloud (CCGrid '14)

    Mobile: Application on a VM, on Android (ARM)

    Server: (Offloaded) Application on a VM, on Linux (x86)

  • 5/23

    Figure: Native Code Execution Time of Top 20 Android Applications (axis: 0% to 100%)

    Applications: AdAway, Orbot, Firefox, VLC Player, Open Camera, osmAnd, Syncthing, AFWall+, 2048, K-9 Mail, PDF Reader, ownCloud, DAVdroid, Barcode Scanner, SatStat, Cool Reader, OS Monitor, Orweb, PPSSPP, Adblock Plus

    Labeled categories: Web Browser, Video Player, Navigation, E-book Reader, Console Emulator, PDF Reader

  • 6/23

    Challenges in offloading native workloads

    Mobile: ARM Application on Android (ARM)

    Server: Linux (x86)

    Challenges: Different Architecture, Distinct Memory, Different Memory Layouts

    Highlighted: Different Architecture

  • 7/23

    Challenges in offloading native workloads

    Mobile: ARM Application on Android (ARM), with its own Stack, Heap, Text

    Server: Linux (x86), with its own Stack, Heap, Text

    Challenges: Different Architecture, Distinct Memory, Different Memory Layouts

    Highlighted: Distinct Memory

  • 8/23

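    Challenges in offloading native workloads

    Mobile: ARM Application on Android (ARM)

    Server: Linux (x86)

    Challenges: Different Architecture, Distinct Memory, Different Memory Layouts

    Highlighted: Different Memory Layouts

    Stack: ptr = 0x1234 on both sides, but the (int, double) pair it points to is laid out differently: "int|double" on one side, "int| |double" on the other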

  • 9/23

    Our Strategy 1: Compile Both Binaries!

    Mobile: ARM Application on Android (ARM)

    Server: x86 Offloaded on Linux (x86)

    Challenges: Different Architecture, Distinct Memory, Different Memory Layouts

    Highlighted: Different Architecture

  • 10/23

    Our Strategy 2: Unified Virtual Address

    Mobile: ARM Application on Android (ARM)

    Server: x86 Offloaded on Linux (x86)

    Challenges: Distinct Memory, Different Memory Layouts, Different Architecture

    Both sides: Stack, Heap, Text

  • 11/23

    Our Strategy 3: Unified Memory Layout

    Mobile: ARM Application on Android (ARM)

    Server: x86 Offloaded on Linux (x86)

    Challenges: Distinct Memory, Different Memory Layouts, Different Architecture

    Both sides: ptr = 0x1234 points to an (int, double) pair with the same layout

  • 12/23

    Native Offloader: Structure Overview

    Inputs: Sources, Profile (runGame: 70%, getAITurn: 68%)

    Compiler: Target Selection, Virtual Address Unification, Partition, Server Specific Opt.

    Outputs: Mobile Binary, Server Binary (Offload)

  • 13/23

    Selecting Profitable Targets

    Candidate        Exec. Time   Call Count   Mem. Size
    runGame          27 s         1            20 MB
    getAITurn        26 s         3            12 MB
    getPlayerTurn    1.5 s        3            10 MB

    Estimation

    Candidate        Ideal Gain   Comm.    Estimated Gain
    runGame          21.6 s       4 s      17.6 s
    getAITurn        20.8 s       7.2 s    13.6 s
    getPlayerTurn    1.2 s        6 s      -4.8 s

    Target Selection > VA Unification > Partition > Server Specific Opt.

  • 14/23

    Selecting Profitable Targets

    Candidate        Exec. Time   Call Count   Mem. Size
    runGame          27 s         1            20 MB
    getAITurn        26 s         3            12 MB
    getPlayerTurn    1.5 s        3            10 MB

    Estimation

    Candidate        Ideal Gain   Comm.    Estimated Gain
    runGame          21.6 s       4 s      17.6 s
    getAITurn        20.8 s       7.2 s    13.6 s
    getPlayerTurn    1.2 s        6 s      -4.8 s

    Call tree: runGame calls getAITurn and getPlayerTurn.

    getPlayerTurn: This calls input operations.

    Selected: getAITurn (Ideal Gain 20.8 s, Comm. 7.2 s, Estimated Gain 13.6 s)

    Not selected: runGame, getPlayerTurn
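
The numbers in the estimation table above fit a simple model: Estimated Gain = Ideal Gain - Comm. The sketch below reproduces them under two assumptions that are inferred by fitting the slide's figures and are not stated in the talk: the server is 5x faster than the mobile device, and moving data costs 0.2 s per MB per call. The `Candidate` type, the `needs_local_input` flag, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Profile data for one offloading candidate (fields mirror the slide's table). */
typedef struct {
    const char *name;
    double exec_time_s;        /* total execution time on the mobile device */
    int    call_count;         /* number of invocations */
    double mem_size_mb;        /* memory used per invocation */
    int    needs_local_input;  /* ASSUMPTION: 1 if the candidate reads user input */
} Candidate;

/* ASSUMPTION: constants fitted to the slide's numbers, not given by the authors. */
#define ASSUMED_SERVER_SPEEDUP 5.0
#define ASSUMED_SEC_PER_MB     0.2

/* Time saved if communication were free. */
double ideal_gain(const Candidate *c) {
    return c->exec_time_s * (1.0 - 1.0 / ASSUMED_SERVER_SPEEDUP);
}

/* Time spent moving data for all invocations. */
double comm_cost(const Candidate *c) {
    return c->call_count * c->mem_size_mb * ASSUMED_SEC_PER_MB;
}

double estimated_gain(const Candidate *c) {
    return ideal_gain(c) - comm_cost(c);
}

/* Index of the best candidate that needs no local input and has a positive
 * estimated gain, or -1 if there is none. */
int select_target(const Candidate *cs, size_t n) {
    assert(cs != NULL || n == 0);
    int best = -1;
    double best_gain = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (cs[i].needs_local_input) continue;
        double g = estimated_gain(&cs[i]);
        if (g > best_gain) { best_gain = g; best = (int)i; }
    }
    return best;
}
```

With the three candidates from the table, and runGame and getPlayerTurn flagged as needing local input, `select_target` picks getAITurn.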

  • 15/23

    Unifying Structure Layouts

    typedef struct {
      char tar, to;
      double score;
    } Move;

    Layouts shown, in order (x86, ARM, Unified):

    x86:      tar | to | score
    ARM:      tar | to | padding | score
    Unified:  tar | to | score
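
One way to pin a structure to the same layout on every target is an explicit alignment request. The sketch below is an illustration, not the authors' compiler transformation: `UnifiedMove` is a hypothetical name, and the choice of 8-byte alignment assumes the unified layout follows the stricter of the two ABIs (for example, 32-bit x86 Linux aligns a `double` struct member to 4 bytes, while 32-bit ARM aligns it to 8).

```c
#include <assert.h>
#include <stddef.h>

/* Declaration from the slide: the offset of `score` depends on the target ABI. */
typedef struct {
    char tar, to;
    double score;
} Move;

/* Hypothetical unified variant: the explicit alignment fixes `score` at
 * offset 8 regardless of the target's default rule for double. */
typedef struct {
    char tar, to;
    _Alignas(8) double score;
} UnifiedMove;

/* Compile-time checks: these hold on any target with an 8-byte double. */
static_assert(offsetof(UnifiedMove, tar) == 0, "tar must be at offset 0");
static_assert(offsetof(UnifiedMove, to) == 1, "to must be at offset 1");
static_assert(offsetof(UnifiedMove, score) == 8, "score must be at offset 8");
static_assert(sizeof(UnifiedMove) == 16, "UnifiedMove must be 16 bytes");
```

Because the checks are static, a mismatch on either target fails the build instead of corrupting data at runtime.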

  • 16/23

    Unifying Heap Areas

    Mobile and Server share one unified Heap.

    board = malloc (sizeof(char) * 32);
    free (board);

    Replaced with: u_malloc, u_free
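
The slide names `u_malloc` and `u_free` but does not show how they work. A minimal sketch, assuming a bump allocator over a single region: a real system would need both binaries to place this region at the same virtual address, which is stood in for here by a static buffer and an offset helper. The region size, the 16-byte alignment, and `u_offset` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define U_HEAP_SIZE (64 * 1024)

/* Stand-in for the unified heap region. In a real system both sides would
 * place this region at the same virtual address. */
static _Alignas(16) unsigned char u_heap[U_HEAP_SIZE];
static size_t u_heap_top = 0;

/* Hypothetical u_malloc: bump allocation, rounded up to 16 bytes. */
void *u_malloc(size_t size) {
    if (size == 0) size = 1;
    size_t aligned = (size + 15u) & ~(size_t)15u;
    if (aligned < size || aligned > U_HEAP_SIZE - u_heap_top) return NULL;
    void *p = &u_heap[u_heap_top];
    u_heap_top += aligned;
    return p;
}

/* Hypothetical u_free: a bump allocator never reuses memory, so this is a no-op. */
void u_free(void *ptr) {
    (void)ptr;
}

/* Offset of a unified-heap pointer from the region base. Two sides that
 * replay the same allocation sequence obtain the same offsets. */
size_t u_offset(const void *ptr) {
    const unsigned char *p = ptr;
    assert(p >= u_heap && p < u_heap + U_HEAP_SIZE);
    return (size_t)(p - u_heap);
}
```

The point of the sketch is determinism: an address handed out on the mobile device means the same thing on the server as long as both agree on the region base.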

  • 17/23

    Aligning Two Stack Areas

    int foo () {
      int i = 30;
      bar (&i);
    }

    int bar (int *i) {
      int j = 60;
      if (*i!=30) crash;
    }

    Mobile Stack: i = 30 (pointed to by int *i), j = 60

    Server Stack: j = 60 sits at the address pointed to by int *i, so *i==60. Crash!

  • 18/23

    Partitioning for The Separate Binaries

    Mobile Source:

    void runGame () {
      while (!gameover) {
        Move mv;
        mv = getPlayerTurn ();
        pieces[mv.tar] = mv.to;
        mv = getAITurn ();   /* replaced by the offload calls below */
        pieces[mv.tar] = mv.to;
      }
    }

    Offload calls inserted in place of getAITurn ():

    requestOffload (getAITurn);
    sendData ();
    receiveData ();

    Server Source:

    void listenClient () {
      FunctionID offID;
      while (true) {
        offID = acceptOffload ();
        receiveData ();
        executeFunction (offID);
        sendData ();
      }
    }

    Flow: Offload (mobile to server), Execute (server), Return (server to mobile)
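
The server loop above dispatches on a `FunctionID`. A minimal sketch of that dispatch step, assuming a static table of offloadable functions: the signature of `executeFunction` is simplified here (it takes an argument and returns a result directly instead of going through `receiveData`/`sendData`), and `square` and `negate` are made-up stand-ins for targets such as getAITurn.

```c
#include <assert.h>
#include <stddef.h>

typedef int FunctionID;
typedef int (*OffloadFn)(int arg);

/* Hypothetical offload targets standing in for functions like getAITurn. */
static int square(int x) { return x * x; }
static int negate(int x) { return -x; }

/* The FunctionID sent by the mobile side indexes this table. */
static const OffloadFn offload_table[] = { square, negate };
enum { OFFLOAD_TABLE_LEN = sizeof offload_table / sizeof offload_table[0] };

/* Runs the function registered under `id`.
 * Returns 0 on success, -1 if the ID is unknown. */
int executeFunction(FunctionID id, int arg, int *result) {
    assert(result != NULL);
    if (id < 0 || id >= OFFLOAD_TABLE_LEN) return -1;
    *result = offload_table[id](arg);
    return 0;
}
```

Sending a small integer ID instead of a raw address keeps the request valid even though the two binaries place their code at different addresses.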

  • 19/23

    Server Specific Optimizations

    Function Pointer Mapper: Maps the mobile's function address to the server's one.

    Remote I/O: The server requests I/O operations remotely.

    Please refer to our paper for more details!

  • 20/23

    Remaining Challenges for Mobile Apps

    Multi-language Support: Mobile apps are written in multiple languages, e.g., Android apps w/ NDK (Java + C/C++).

    Multi-threaded Applications: Emerging mobile applications are multi-threaded.

    So, this work uses SPEC benchmark suites.

  • 21/23

    Evaluation

    17 SPEC 2000/2006 C Benchmarks
      LLVM Compile Error: 400.perlbench, 403.gcc
      Non-profitable target: 197.parser, 254.gap, 255.vortex

    Galaxy S5 as the Mobile Device, Dell XPS8700 as the Server
      Mobile (2.5GHz Quad-core Krait 400)
      Server (Intel 3.60GHz Quad-core i7-4790)

    2 Different Network Bandwidths
      802.11n (Maximum 144Mbps)
      802.11ac (Maximum 844Mbps)

    LLVM Compiler Framework

  • 22/23

    Figure: Normalized Execution Time (axis: 0.0 to 1.0). Series: Native Offloader (144 Mbps), Native Offloader (844 Mbps), Ideal Offloading.

    6.42x Speed-up!

  • 23/23

    Figure: Normalized Battery Consumption (axis: 0.0 to 1.0). Series: Native Offloader (144 Mbps), Native Offloader (844 Mbps), Ideal Offloading.

    82% Saving!

  • 24/23

    Conclusion

    Native Offloader: Compiler/runtime cooperative offloading system for general-purpose native applications. To minimize offloading overheads, this work unified virtual address spaces.

    Fast & Battery-friendly: 6.42x Speed-up / 82% Battery saving for 17 SPEC 2000/2006 C Benchmarks

    Image: http://www.worth1000.com/contests/27097/interspecies-teamwork-2

    Woo-hoo!

    The paper also includes remote I/O, function pointer mapping, pointer size / endianness translation, runtime system and comm. optimizations.

  • 25/23

    Native Offloader: Architecture-aware Automatic Computation Offload for Native Applications

    Backup Slides

  • 26/23

    How It Works at Runtime

    Mobile Server

    Page Table

    Physical Pages

    Stage: Local Execution

    Stage: Initialization

    Synchronize

    Prefetch

  • 27/23

    How It Works at Runtime

    Mobile Server

    Page Table

    Physical Pages

    On-demand Copy

    Stage: Offloading Execution

    Page Fault

    Dirty Pages

  • 28/23

    How It Works at Runtime

    Mobile Server

    Page Table

    Physical Pages

    Stage: Finalization

    Synchronize

    Write-back

    Dirty Pages
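
The runtime stages on these backup slides (on-demand copy on a page fault, then write-back of dirty pages at finalization) can be mimicked in a small simulation. This is a sketch under loud assumptions: a real system would rely on hardware page faults, whereas here explicit accessor functions play that role, and the page count, the tiny page size, and every name below are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_PAGES 4
#define PAGE_SIZE 16   /* tiny pages keep the sketch readable */

typedef struct {
    unsigned char mem[NUM_PAGES][PAGE_SIZE];
    bool present[NUM_PAGES];  /* server only: page already copied over */
    bool dirty[NUM_PAGES];    /* server only: page written while offloaded */
    int  faults;              /* server only: number of on-demand copies */
} Side;

/* Simulated page fault: copy the page from the mobile on first touch. */
static void touch(Side *server, const Side *mobile, int page) {
    if (!server->present[page]) {
        memcpy(server->mem[page], mobile->mem[page], PAGE_SIZE);
        server->present[page] = true;
        server->faults++;
    }
}

unsigned char server_read(Side *server, const Side *mobile, int addr) {
    assert(addr >= 0 && addr < NUM_PAGES * PAGE_SIZE);
    touch(server, mobile, addr / PAGE_SIZE);
    return server->mem[addr / PAGE_SIZE][addr % PAGE_SIZE];
}

void server_write(Side *server, const Side *mobile, int addr, unsigned char v) {
    assert(addr >= 0 && addr < NUM_PAGES * PAGE_SIZE);
    int page = addr / PAGE_SIZE;
    touch(server, mobile, page);
    server->mem[page][addr % PAGE_SIZE] = v;
    server->dirty[page] = true;
}

/* Finalization: write dirty pages back; returns how many were sent. */
int write_back(Side *mobile, Side *server) {
    int sent = 0;
    for (int p = 0; p < NUM_PAGES; p++) {
        if (server->dirty[p]) {
            memcpy(mobile->mem[p], server->mem[p], PAGE_SIZE);
            server->dirty[p] = false;
            sent++;
        }
    }
    return sent;
}
```

Only pages the offloaded code actually touches cross the network, and only the modified ones travel back.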

  • 29/23

    Several offloading systems have already been proposed.

    Pointer Analysis-based System

    Diagram: instructions (Instr.) and objects (Obj.)

    Li et al. (CASES '01), Wang and Li (PLDI '04)

  • 30/23

    Function Pointer Mapping

    Mobile and Server each have Stack, Heap, and Text; foo is in Text on both sides.

    Mobile:

    void (*fptr) ();
    fptr = foo;
    fptr("ARGS");

    Server:

    fptr = toServer(fptr);
    fptr("ARGS");
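
The slide shows `toServer` translating a function pointer before the call. A minimal sketch of such a mapper, assuming a lookup table keyed by the function's address in the mobile binary: the table contents, the example address `0x00010400`, the `ServerFn` and `FnMapEntry` types, and the linear search are all assumptions for illustration, not the authors' implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*ServerFn)(void);

typedef struct {
    uintptr_t mobile_addr;  /* address of the function in the mobile binary */
    ServerFn  server_fn;    /* the same function in the server binary */
} FnMapEntry;

static int foo_calls = 0;
static void foo(void) { foo_calls++; }

/* ASSUMPTION: the mobile address is made up for this example. */
static const FnMapEntry fn_map[] = {
    { (uintptr_t)0x00010400u, foo },
};

/* Maps a mobile function address to the server's function,
 * or returns NULL if the address is unknown. */
ServerFn toServer(uintptr_t mobile_addr) {
    for (size_t i = 0; i < sizeof fn_map / sizeof fn_map[0]; i++) {
        if (fn_map[i].mobile_addr == mobile_addr) return fn_map[i].server_fn;
    }
    return NULL;
}
```

A caller on the server would check the result for NULL before invoking it, since an unmapped address cannot be executed safely.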

  • 31/23

    Unifying the Virtual Address Space

    Installing Pointer Size & Endianness Translators

    Diagram (byte rows): 01 23 45 67 / 01 23 45 67 00 00 00 00 / 01234567 00 00 00 00

    Steps: Pointer Size Translation, Endianness Translation

    Before:

    int value = *ptr32;

    After:

    ptr64 = zext (ptr32);
    int value = toLittle(*ptr64);
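
The two translators named on this slide can be illustrated with plain integer operations. This is a sketch of the general techniques (zero extension and byte swapping), not the code the compiler actually installs; `zext_ptr`, `swap32`, and `load_be32` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer size translation: zero-extend a 32-bit pointer value to 64 bits. */
uint64_t zext_ptr(uint32_t ptr32) {
    return (uint64_t)ptr32;
}

/* Endianness translation: reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Read four bytes stored in big-endian order as a host integer.
 * Works the same way on little- and big-endian hosts. */
uint32_t load_be32(const unsigned char *p) {
    assert(p != 0);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Reading through `load_be32` avoids depending on the host byte order, which is usually safer than swapping after a native load.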

  • 32/23

    Reallocating Global Variables

    Before (gvar lives in each side's own global area):

    int gvar;
    int *gptr = &gvar;

    int foo () {
      *gptr = 400;
    }

    After (gvar_re lives in the unified Heap on both Mobile and Server):

    int *gvar_re = u_malloc(4);
    int *gptr = gvar_re;

  • 33/23

    Server Specific Optimizations: Installing Remote I/O APIs

    Original:

    printf ("%s", "Hello,\n");
    printf ("%s", "World!");

    Server version of the offloaded call:

    r_printf ("%s", "World!");

    The server sends "World!" to the mobile, which displays: Hello World!
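
The slide replaces `printf` with `r_printf` on the server so the output appears on the mobile device. A minimal sketch of that idea, assuming the server formats the text locally and queues it for sending: the `outbox` buffer stands in for the network, and `r_flush` is a hypothetical helper that plays the role of the transfer to the mobile.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define OUTBOX_SIZE 256

/* Text waiting to be sent to the mobile device (stand-in for a network send). */
static char outbox[OUTBOX_SIZE];
static size_t outbox_len = 0;

/* Hypothetical r_printf: format on the server and queue the text.
 * Returns the number of characters queued, or -1 if it does not fit. */
int r_printf(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(outbox + outbox_len, OUTBOX_SIZE - outbox_len, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= OUTBOX_SIZE - outbox_len) {
        outbox[outbox_len] = '\0';  /* drop the partial write */
        return -1;
    }
    outbox_len += (size_t)n;
    return n;
}

/* Stand-in for the transfer: copy queued text to `dst` and reset the queue.
 * Returns the number of characters delivered. */
size_t r_flush(char *dst, size_t cap) {
    assert(dst != NULL && cap > 0);
    size_t n = outbox_len < cap - 1 ? outbox_len : cap - 1;
    memcpy(dst, outbox, n);
    dst[n] = '\0';
    outbox_len = 0;
    outbox[0] = '\0';
    return n;
}
```

Batching output this way trades latency for fewer round trips, which matters given the remote I/O overhead discussed elsewhere in the talk.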

  • 34/23

    Runtime Battery Consumption Pattern

    Figure: 445.gobmk (Fast, 802.11ac) vs. 445.gobmk (Slow, 802.11n)

    Initial Prefetching Overhead: 802.11ac < 802.11n

    + Total Remote I/O Overhead: 802.11ac >> 802.11n

    = Total Battery Overhead: 802.11ac > 802.11n

  • 35/23

    Runtime Estimator

    void runGame () {
      while (!gameover) {
        Move mv;
        mv = getPlayerTurn ();
        pieces[mv.tar] = mv.to;

        if (profitable (getAITurn)) {   /* uses real-time bandwidth information */
          requestOffload (getAITurn);
          sendData ();
          receiveData ();
        }

        pieces[mv.tar] = mv.to;
      }
    }
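
The slide does not show how `profitable` uses the bandwidth information. One plausible form, given here as a sketch only, compares local execution time against remote execution plus transfer time at the currently measured bandwidth. The parameter list is hypothetical (the slide's version takes the function itself), and the inputs are values a runtime would have to track.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when offloading is expected to finish sooner than running
 * locally. All parameters are hypothetical inputs tracked by a runtime:
 *   local_time_s    expected execution time on the mobile device
 *   server_speedup  how many times faster the server runs the function
 *   transfer_mbit   data to move in both directions, in megabits
 *   bandwidth_mbps  currently measured network bandwidth */
bool profitable(double local_time_s, double server_speedup,
                double transfer_mbit, double bandwidth_mbps) {
    assert(server_speedup > 0.0 && bandwidth_mbps > 0.0);
    double remote_time_s = local_time_s / server_speedup
                         + transfer_mbit / bandwidth_mbps;
    return remote_time_s < local_time_s;
}
```

For example, a 1 s function with a 5x server speedup and 200 Mbit of data is worth offloading at 844 Mbps (about 0.44 s remotely) but not at 144 Mbps (about 1.59 s), so the decision can flip as network conditions change.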

  • 36/23

    Overhead Analysis

    Figure: Overhead breakdown per benchmark, slow (s) and fast (f) network (axis: 0.0 to 1.0). Series: Useful Work, Function Pointer Translation, Remote I/O, Communication.

    Callouts:

    Large data size & ineffective data compression

    Frequent remote input

    The algorithm uses function pointers frequently.

  • 37/23

    QUESTIONS & ANSWERS

    Thank you for your attention!

    Compiler Research Lab (CORELAB), Dept. of Computer Science & Engineering

    POSTECH, Korea

    Gwangmu Lee