jungheejungheelee lee - baylor...

69
Emerging Topics in Emerging Topics in Computer Architecture: Computer Architecture: Programmable Accelerator, Programmable Accelerator, Solid Solid-state Drive, and state Drive, and Security Security Emerging Topics in Emerging Topics in Computer Architecture: Computer Architecture: Programmable Accelerator, Programmable Accelerator, Solid Solid-state Drive, and state Drive, and Security Security Junghee Junghee Lee Lee

Upload: others

Post on 03-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Emerging Topics in Emerging Topics in Computer Architecture: Computer Architecture:

Programmable Accelerator, Programmable Accelerator, SolidSolid--state Drive, and state Drive, and

SecuritySecurity

Emerging Topics in Emerging Topics in Computer Architecture: Computer Architecture:

Programmable Accelerator, Programmable Accelerator, SolidSolid--state Drive, and state Drive, and

SecuritySecurity

JungheeJunghee LeeLee

Page 2: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Your SpeakerYour SpeakerYour SpeakerYour Speaker

•• EducationEducation–– B.S. Seoul National University, 2000B.S. Seoul National University, 2000–– M.S. Seoul National University, 2003M.S. Seoul National University, 2003–– Ph.D. Georgia Institute of Technology, 2013Ph.D. Georgia Institute of Technology, 2013

•• Work ExperienceWork Experience–– Samsung Electronics, 2003Samsung Electronics, 2003--20082008–– SoteriaSoteria Inc., 2013Inc., 2013--presentpresent

2

–– SoteriaSoteria Inc., 2013Inc., 2013--presentpresent–– Research scientist, 2013Research scientist, 2013--presentpresent

•• PublicationsPublications–– 10 Journal papers including 4 papers in ACM and IEEE 10 Journal papers including 4 papers in ACM and IEEE transactionstransactions

–– 13 Conference papers, 2 of which were nominated for the 13 Conference papers, 2 of which were nominated for the best paper awardbest paper award

2/54

Page 3: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Research ExperienceResearch ExperienceResearch ExperienceResearch Experience

•• Electronic systemElectronic system--level design (level design (SoCSoC/embedded system)/embedded system)

–– Electronic systemElectronic system--level model verification methodologylevel model verification methodology

•• HardwareHardware--based load balancing (computer architecture)based load balancing (computer architecture)

•• NetworksNetworks--onon--Chip (computer architecture)Chip (computer architecture)

–– RingRing--based onbased on--chip router architecturechip router architecture

–– Control and data packet segregationControl and data packet segregation

3

–– Control and data packet segregationControl and data packet segregation

•• Programmable hardware accelerator (heterogeneous Programmable hardware accelerator (heterogeneous computer architecture)computer architecture)

•• SolidSolid--state drives (embedded system)state drives (embedded system)

–– Preemptive garbage collectionPreemptive garbage collection

–– Write cache design for an array of solidWrite cache design for an array of solid--state drivesstate drives

•• HardwareHardware--assisted security (security)assisted security (security)

3/54

Page 4: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Programmable AcceleratorProgrammable AcceleratorProgrammable AcceleratorProgrammable Accelerator

•• IntroductionIntroduction

•• Execution Model Execution Model

•• Hardware ArchitectureHardware Architecture

•• EvaluationEvaluation

•• ConclusionConclusion

4

•• ConclusionConclusion

4/54

Page 5: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

IntroductionIntroductionIntroductionIntroduction

Many Core

Fusion

Massively Parallel Processing Array

5

Single Core

Multi Core

Programmable

Hardware AcceleratorEx) GPGPU

Powerful cores +

H/W accelerator in a single die

Ex) AMD Fusion

5/54

Page 6: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

MPPA as Hardware AcceleratorMPPA as Hardware AcceleratorMPPA as Hardware AcceleratorMPPA as Hardware Accelerator

CPU CPU

CPU CPU

I/O

Massively

Parallel

Processing

Array

CoreTile

Ho

st

CP

U I

nte

rface

Devic

e M

em

ory

CoreTile

CoreTile

CoreTile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

6

I/O

I/O

Ho

st

CP

U I

nte

rface

Devic

e M

em

ory

CoreTile

CoreTile

CoreTile

CoreTile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Core

Tile

Challenges

ExpressivenessDebugging

Memory Hierarchy Design

6/54

Page 7: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Related WorksRelated WorksRelated WorksRelated Works

Expressiveness Debugging Memory

GPGPUAMD Fusion

Tilera

SIMDMultiple debuggers

Event graphScratch-pad memory

Cache

Multi-threading Multiple debuggers Coherent cache

7

Tilera

Rigel

Ambric

Multi-threading

Multi-threading

Kahn process network

Multiple debuggers Coherent cache

Not addressedSoftware-managed

cache

Formal model Scratch-pad memory

Proposed

MPPA

Event-driven

model

Inter-module debug

Intra-module debug

Scratch-pad memory

Prefetching

7/54

Page 8: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Programmable AcceleratorProgrammable AcceleratorProgrammable AcceleratorProgrammable Accelerator

•• IntroductionIntroduction

•• Execution Model Execution Model

•• Hardware ArchitectureHardware Architecture

•• EvaluationEvaluation

•• ConclusionConclusion

8

•• ConclusionConclusion

8/54

Page 9: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Execution ModelExecution ModelExecution ModelExecution Model

Von Neuman Model

Assembly

Structural

Object-oriented

Von Neuman Model

Multi-threadMPI

SIMD

9

Von Neuman Model

x86 MIPS ARM

Von Neuman Model

x86 MIPS ARM

9/54

Page 10: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

RequirementsRequirementsRequirementsRequirements

•• DecouplingDecoupling

–– The execution model should decouple the programming The execution model should decouple the programming model and the execution model of the parallel hardwaremodel and the execution model of the parallel hardware

•• Hardware perspectiveHardware perspective

–– Low implementation overheadLow implementation overhead

10

–– HeterogeneityHeterogeneity

–– ScalabilityScalability

•• Software perspectiveSoftware perspective

–– Easy to programEasy to program

–– Easy to debugEasy to debug

–– PerformancePerformance

10/54

Page 11: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

EventEvent--driven Execution Modeldriven Execution ModelEventEvent--driven Execution Modeldriven Execution Model

•• SpecificationSpecification

–– Module = Module = (b, P(b, Pii, P, Poo, C, F), C, F)•• bb = Behavior of module= Behavior of module

•• PPii = Input ports= Input ports

•• PPoo = Output ports= Output ports

•• CC = Sensitivity list= Sensitivity list

•• SemanticsSemantics

–– A module is triggered A module is triggered when any signal when any signal connected to connected to CC changeschanges

–– Function calls and Function calls and memory accesses are memory accesses are

11

•• CC = Sensitivity list= Sensitivity list

–– SignalSignal

–– Net = Net = (d, K)(d, K)•• dd = Driver port= Driver port

•• KK = A set of sink ports= A set of sink ports

memory accesses are memory accesses are limited to within a limited to within a modulemodule

–– NonNon--blocking write and blocking write and block readblock read

–– The specification can be The specification can be modified during runmodified during run--timetime

11/54

Page 12: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Programmable AcceleratorProgrammable AcceleratorProgrammable AcceleratorProgrammable Accelerator

•• IntroductionIntroduction

•• Execution Model Execution Model

•• Hardware ArchitectureHardware Architecture

•• EvaluationEvaluation

•• ConclusionConclusion

12

•• ConclusionConclusion

12/54

Page 13: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

MPPA MPPA MicroarhitectureMicroarhitectureMPPA MPPA MicroarhitectureMicroarhitecture

Core

TileH

ost

CP

U I

nte

rface

Devic

e M

em

ory

CoreTile

Core

Core

Tile

CoreTile

Core

Core

Tile

CoreTile

Core

Core

Tile

CoreTile

Core

Core

Tile

CoreTile

CoreE

• Identical core tiles• Consists of uCPU, scratch-pad memory, and

peripherals that support the execution model• One core tile is designated to an execution engine

13

Ho

st

CP

U I

nte

rface

Devic

e M

em

ory

CoreTile

Core

Tile

Core

Tile

CoreTile

Core

Tile

Core

Tile

CoreTile

Core

Tile

Core

Tile

CoreTile

Core

Tile

Core

Tile

CoreTile

Core

Tile

Core

Tile

E

• Software running on a core tile

• Consists of scheduler, signal storage and interconnect directory

• Supports the execution model

• If necessary, it is split into multiple instances running on different core tiles

13/54

Page 14: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Core Tile ArchitectureCore Tile ArchitectureCore Tile ArchitectureCore Tile Architecture

Scratch Pad Memory

For

Current

Module

For

Next

Module

uCPU Prefetcher

• Software-managed on-chip SRAM• Double buffering where

one is for the current module and the other is

for the next module to be

prefetched

• Prefetches the code and

data of the next module

• Switches the context

when the current module finishes and the next

module is ready

• Stores information

• Stores the input data

• The actual data is stored

14

Context Manager

Input Signal Queue

Output Signal Queue

Message Queue

Network Interface

Message

Handler

• Generic small processor• Treated as a black box

data of the next module

while the current module is running on uCPU

• Counter-part of the prefetcher

• Sends data to the

requester

• Stores information about the modules

• The actual data is stored

in the SPM while its information is managed

by this module

• Stores the output data• Notifies the update event

to the interconnect

directory when the output is updated

• Handles the system

messages

• NoC router

14/54

Page 15: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Execution EngineExecution EngineExecution EngineExecution Engine

•• Most of its functionality is implemented in Most of its functionality is implemented in softwaresoftwarewhile the hardware facilitates communicationwhile the hardware facilitates communication�� Software implementation gives us flexibility in the Software implementation gives us flexibility in the number and location of the execution enginenumber and location of the execution engine

•• One way to visualize our MPPA is to regard the One way to visualize our MPPA is to regard the execution engine as an execution engine as an eventevent--driven simulation kerneldriven simulation kernel

•• The execution engine interacts with modules running on The execution engine interacts with modules running on

15

•• The execution engine interacts with modules running on The execution engine interacts with modules running on other core tiles through other core tiles through messagesmessages

Type From To Payload

REQ_FETCH_MODULE Prefetcher Scheduler Request a new module

RES_FETCH_MODULE Scheduler Prefetcher Module ID and list of input ports

MODULE_INSTANCE Scheduler Prefetcher Code of the module

REQ_SIGNAL Prefetcher Interconnect Port ID

RES_SIGNAL Signal storage

or a node

Prefetcher Data

15/54

Page 16: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Components of Execution EngineComponents of Execution EngineComponents of Execution EngineComponents of Execution Engine

•• SchedulerScheduler

–– Keeps track of the status and location of modulesKeeps track of the status and location of modules

–– Maintains three queues: wait, ready and run queueMaintains three queues: wait, ready and run queue

•• Signal storageSignal storage

–– Stores signal values in the device memoryStores signal values in the device memory

16

–– If a signal is updated but its value is still stored in the If a signal is updated but its value is still stored in the node, the signal storage invalidates its value and node, the signal storage invalidates its value and keeps the location of the latest valuekeeps the location of the latest value

•• Interconnect directoryInterconnect directory

–– Keeps track of connectivity of signals and portsKeeps track of connectivity of signals and ports

–– Maintains the sensitivity listMaintains the sensitivity list

16/54

Page 17: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

ModuleModule--Level Level PrefetchingPrefetchingModuleModule--Level Level PrefetchingPrefetching

•• Hides the overhead of the dynamic schedulingHides the overhead of the dynamic scheduling•• PrefetchesPrefetches the next module while the current module is runningthe next module while the current module is running

uCPU Prefetcher SchedulerInterconn.

Directory

Signal

StorageOther Node

17

Execu

te a

mo

du

le

Mem

ory

access

Mem

ory

access

17/54

Page 18: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Programmable AcceleratorProgrammable AcceleratorProgrammable AcceleratorProgrammable Accelerator

•• IntroductionIntroduction

•• Execution Model Execution Model

•• Hardware ArchitectureHardware Architecture

•• EvaluationEvaluation

•• ConclusionConclusion

18

•• ConclusionConclusion

18/54

Page 19: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

BenchmarkBenchmarkBenchmarkBenchmark

•• Recognition, Synthesis and Mining (RMS) benchmarkRecognition, Synthesis and Mining (RMS) benchmark•• FineFine--grained parallelism: dominated by short tasksgrained parallelism: dominated by short tasks

–– Small memory foot printSmall memory foot print–– High runHigh run--time scheduling overheadtime scheduling overhead

•• TaskTask--level parallelism: exhibits dependencylevel parallelism: exhibits dependency–– Hard to be implemented with GPGPUHard to be implemented with GPGPU

Benchmark Min Max Average

19

Benchmark Min Max Average

Forward Solve (FS) 26 646 336.00

Backward Solve (BS) 42 569 305.50

Cholesky Factorization (CF) 151 11800 789.35

Canny Edge Detection (CED) 330 5011 669.68

Binomial Tree (BT) 117 4506 462.71

Octree Partitioning (OP) 1441 6679 2678.70

Quick Sort (QS) 88 47027 683.70

19/54

Page 20: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SimulatorSimulatorSimulatorSimulator

•• InIn--house cyclehouse cycle--level simulatorlevel simulator

•• ParametersParameters

Parameter Value

Number of core tiles 32

Memory access time 1 cycle for scratch-pad memory

20

100 cycles for device memory

Memory size 8 KB scratch-pad memory

32 MB device memory

Communication Delay 4 cycles per hop

20/54

Page 21: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

UtilizationUtilizationUtilizationUtilization

0.6

0.8

1.0

Co

re u

tiliza

tio

n

21

Benchmarks

FS BS CF CED BT0

0.2

0.4

Co

re u

tiliza

tio

n

w/o prefetching w/ prefetching

OP QS

21/54

Page 22: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

ConclusionConclusionConclusionConclusion

•• A novel MPPA architecture is proposed that employs A novel MPPA architecture is proposed that employs an an eventevent--driven execution modeldriven execution model

–– Handles dependencies by Handles dependencies by dynamic schedulingdynamic scheduling

–– Hides dynamic scheduling overhead by Hides dynamic scheduling overhead by modulemodule--level level prefetchingprefetching

22

prefetchingprefetching

•• Future worksFuture works

–– Supports applications that require larger memory Supports applications that require larger memory footprintfootprint

–– Adjusts the number of execution engines dynamicallyAdjusts the number of execution engines dynamically

–– Supports interSupports inter--module debuggingmodule debugging

22/54

Page 23: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SolidSolid--state Drivestate DriveSolidSolid--state Drivestate Drive

•• IntroductionIntroduction

•• Background and MotivationBackground and Motivation

•• SemiSemi--Preemptive Garbage CollectionPreemptive Garbage Collection

•• EvaluationEvaluation

•• ConclusionConclusion

23

•• ConclusionConclusion

23/54

Page 24: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

High Performance Storage SystemsHigh Performance Storage SystemsHigh Performance Storage SystemsHigh Performance Storage Systems

•• Server centric servicesServer centric services

–– File, web & media servers, transaction processing serversFile, web & media servers, transaction processing servers

•• EnterpriseEnterprise--scale Storage Systemsscale Storage Systems

–– Information technology focusing on storage, protection, Information technology focusing on storage, protection, retrieval of data in largeretrieval of data in large--scale environmentsscale environments

24

Google's massive server farms High Performance

Storage Systems

Storage UnitHard Disk Drive

24/54

Page 25: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Emergence of NAND Flash based SSDEmergence of NAND Flash based SSDEmergence of NAND Flash based SSDEmergence of NAND Flash based SSD

•• NAND Flash vs. Hard Disk DrivesNAND Flash vs. Hard Disk Drives

–– Pros:Pros:

•• SemiSemi--conductor technology, no mechanical partsconductor technology, no mechanical parts

•• Offer Offer lower lower access access latencieslatencies

–– µµss for for SSDs vs. SSDs vs. msms for for HDDsHDDs

•• Lower power consumptionLower power consumption

•• Higher robustness to vibrations and Higher robustness to vibrations and temperaturetemperature

25

•• Higher robustness to vibrations and Higher robustness to vibrations and temperaturetemperature

–– Cons:Cons:

•• Limited lifetimeLimited lifetime

–– 10K 10K -- 1M erases per block1M erases per block

•• High costHigh cost

–– About 8X more expensive than current hard disksAbout 8X more expensive than current hard disks

•• Performance variability Performance variability

25/54

Page 26: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SolidSolid--state Drivestate DriveSolidSolid--state Drivestate Drive

•• IntroductionIntroduction

•• Background and MotivationBackground and Motivation

•• SemiSemi--Preemptive Garbage CollectionPreemptive Garbage Collection

•• EvaluationEvaluation

•• ConclusionConclusion

26

•• ConclusionConclusion

26/54

Page 27: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

NAND Flash based SSDNAND Flash based SSDNAND Flash based SSDNAND Flash based SSD

fwrite

(file, data)

Block write

(LBA, size)

Application

OS

Process Process

File System (FAT, Ext2, NTFS …)

Block Device Driver

27

Page write

(bank, block, page)Device

Block Device Driver

Block Interface (SATA, SCSI, etc)

MemoryCPU(FTL)

Flash Flash Flash Flash

SSD

27/54

Page 28: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

NAND Flash OrganizationNAND Flash OrganizationNAND Flash OrganizationNAND Flash Organization

Package

Plane 0

Block 0

Page 0

Page 63

Die 0

Pla

ne

0

Pla

ne

1

Pla

ne

2

Pla

ne

3

Register Read

0.025 ms

Write

0.200 ms

Die 1

Pla

ne

0

Pla

ne

1

Pla

ne

2

Pla

ne

3

28

Page 63

Block 2047

Page 0

Page 63

…Pla

ne

Pla

ne

Pla

ne

Pla

ne

0.200 ms

Erased

Erase

1.500 ms

Pla

ne

Pla

ne

Pla

ne

Pla

ne

28/54

Page 29: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

OutOut--OfOf--Place WritePlace WriteOutOut--OfOf--Place WritePlace Write

LPN0

LPN1

LPN2

Logical-to-Physical Address Mapping Table

PPN1

PPN4

PPN2

Physical Blocks

P0 IP1 VP2 VP3 E

P4 V

IP3 V

PPN3

29

LPN2

LPN3

PPN2

PPN5

P4 VP5 VP6 EP7 E

Write to

LPN2

PPN3

Invalidate

PPN2

Write to

PPN3

Update

table

29/54

Page 30: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Garbage CollectionGarbage CollectionGarbage CollectionGarbage Collection

Physical Blocks

P0 IP1 IP2 IP3 I

P4 V

Select Victim Block

Move Valid Pages

P1 V

P3 V

P0 EP1 EP2 EP3 E

30

P4 VP5 VP6 EP7 E

Erase Victim BlockP6 VP7 V

2 reads + 2 writes + 1 erase= 2*0.025 + 2*0.200 + 1.5 = 1.950(ms) !!

30/54

Page 31: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SolidSolid--state Drivestate DriveSolidSolid--state Drivestate Drive

•• IntroductionIntroduction

•• Background and MotivationBackground and Motivation

•• SemiSemi--Preemptive Garbage CollectionPreemptive Garbage Collection

•• EvaluationEvaluation

•• ConclusionConclusion

31

•• ConclusionConclusion

31/54

Page 32: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Technique #1: SemiTechnique #1: Semi--PreemptionPreemptionTechnique #1: SemiTechnique #1: Semi--PreemptionPreemption

RxRx WxWx EERyRy WyWyTime

GC

WzWz Request

32

RxRx

WxWx

EE

Read page x

Write page x

Erase a block

Data transfer

Meta data update

Non-Preemptive GC

WzWzWzWz

Preemptive GC

32/54

Page 33: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Technique #2: MergeTechnique #2: MergeTechnique #2: MergeTechnique #2: Merge

RxRxTime

GC

RyRy Request

WxWx RyRy WyWy EE

33

RxRx

WxWx

EE

Read page x

Write page x

Erase a block

Data transfer

Meta data update

33/54

Page 34: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Technique #3: PipelineTechnique #3: PipelineTechnique #3: PipelineTechnique #3: Pipeline

RxRx WxWx EERyRy WyWyTime

GC

RzRz Request

RR

34

RxRx

WxWx

EE

Read page x

Write page x

Erase a block

Data transfer

Meta data update

RzRz

34/54

Page 35: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Level of Allowed PreemptionLevel of Allowed PreemptionLevel of Allowed PreemptionLevel of Allowed Preemption

•• Drawback of PGCDrawback of PGC: The completion time of GC is delayed: The completion time of GC is delayed�� May incur lack of free blocksMay incur lack of free blocks�� Sometimes need to prohibit preemptionSometimes need to prohibit preemption

•• States of PGCStates of PGC

35

State 0

State 1

State 2

State 3

Garbage collection

Read requests

Write requests

X

O

O

O

O

O

X

O

X

X

35/54

Page 36: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SolidSolid--state Drivestate DriveSolidSolid--state Drivestate Drive

•• IntroductionIntroduction

•• Background and MotivationBackground and Motivation

•• SemiSemi--Preemptive Garbage CollectionPreemptive Garbage Collection

•• EvaluationEvaluation

•• ConclusionConclusion

36

•• ConclusionConclusion

36/54

Page 37: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SetupSetupSetupSetup

•• SimulatorSimulator

–– MSR’s SSD simulator based on MSR’s SSD simulator based on DiskSimDiskSim

•• WorkloadsWorkloads

WorkloadsAverage request

size (KB)Read ratio

(%)Arrival rate

(IOP/s)

37

size (KB) (%) (IOP/s)

Financial 7.09 18.92 47.19

Cello 7.06 19.63 74.24

TPC-H 31.62 91.80 172.73

OpenMail 9.49 63.30 846.62

Write dominant

Read dominant

37/54

Page 38: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Performance Improvement for Performance Improvement for Realistic WorkloadsRealistic WorkloadsPerformance Improvement for Performance Improvement for Realistic WorkloadsRealistic Workloads•• Average Response TimeAverage Response Time

0.6

0.8

1.0

ave.

resp

on

se t

ime

•• Variance of Response TimesVariance of Response Times

0.6

0.8

1.0

No

rmali

zed

sta

nd

ard

devia

tio

n

38

Improvement of average response time by 6.5% and 66.6% for Financial and Cello.

Financial Cello TPC-H OpenMail0

0.2

0.4

No

rmali

zed

ave

Improvement of variance of response time by 49.8% and 83.3% for Financial and Cello.

Financial Cello TPC-H OpenMail0

0.2

0.4

No

rmali

zed

sta

nd

ard

devia

tio

n

38/54

Page 39: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

ConclusionsConclusionsConclusionsConclusions

•• Solid state drivesSolid state drives–– Fast access speedFast access speed–– Performance variation Performance variation garbage garbage collectioncollection

•• SemiSemi--preemptive garbage collectionpreemptive garbage collection–– Service incoming requests during GCService incoming requests during GC

•• Average response time and Average response time and

39

•• Average response time and Average response time and performance variation are reduced by performance variation are reduced by up to 66.6% and 83.3%up to 66.6% and 83.3%

39/54

Page 40: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SecuritySecuritySecuritySecurity

•• IntroductionIntroduction

•• AppendAppend--only Storageonly Storage

•• Use CasesUse Cases

•• ConclusionConclusion

40

40/54

Page 41: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Evolutionary Digital Systems AdvanceEvolutionary Digital Systems AdvanceEvolutionary Digital Systems AdvanceEvolutionary Digital Systems Advance

•• IoTIoT (Internet of Things) (Internet of Things) –– By 2015, By 2015, 5 billion 5 billion individuals will be individuals will be

connected to the Internet (source: GKP)connected to the Internet (source: GKP)–– 100 billion uniquely identifiable objects will 100 billion uniquely identifiable objects will

be connected to the Internet by 2020be connected to the Internet by 2020

•• Big Data VisualizationBig Data Visualization–– Digital data is doubling every other yearDigital data is doubling every other year

Information from

Network

41

•• Cloud Computing and Mobile ComputingCloud Computing and Mobile Computing

•• CybersecurityCybersecurity–– New business models based on innovative New business models based on innovative

thinking will be neededthinking will be needed

41/54

New Business Models

BroadbandEverywhere

Page 42: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Financial ImpactFinancial ImpactFinancial ImpactFinancial Impact

•• Computer Computer crimes cost firms who detect and crimes cost firms who detect and verify incidents verify incidents between $145 million and $730 between $145 million and $730 million each year million each year (NCSA (NCSA Annual Worry Report)Annual Worry Report)

•• A A company that experiences a computer company that experiences a computer outage lasting outage lasting more more than 10 days will never fully than 10 days will never fully recover financiallyrecover financially. 50 percent . 50 percent will be out of business will be out of business within five years within five years ("Disaster Recovery ("Disaster Recovery Planning: Planning: Managing Risk Managing Risk & Catastrophe in Information & Catastrophe in Information

42

Planning: Planning: Managing Risk Managing Risk & Catastrophe in Information & Catastrophe in Information Systems" by Systems" by Jon Jon ToigoToigo))

•• 43% of lost or stolen data is valued at $5 million or more43% of lost or stolen data is valued at $5 million or more

•• 43% of companies experiencing data disasters never reopen, 43% of companies experiencing data disasters never reopen, and 29 percent close within two years (and 29 percent close within two years (McGladreyMcGladrey and and Pullen)Pullen)

•• It is estimated that 1 out of 500 data centers will have a It is estimated that 1 out of 500 data centers will have a severe disaster each year (severe disaster each year (McGladreyMcGladrey and Pullen)and Pullen)

42/54

Page 43: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

ScopeScopeScopeScope

•• Network SecurityNetwork Security–– Efficient Efficient

•• It can protect numerous hosts by It can protect numerous hosts by securing only the perimetersecuring only the perimeter

–– But not perfectBut not perfect•• Although data centers are equipped Although data centers are equipped with various network security with various network security techniques, it is estimated 1 out of 500 techniques, it is estimated 1 out of 500 data centers will have a severe disaster data centers will have a severe disaster each year (each year (McGladreyMcGladrey and Pullen)and Pullen)

43

each year (each year (McGladreyMcGladrey and Pullen)and Pullen)

–– The ultimate goal is protecting hostsThe ultimate goal is protecting hosts

•• Host SecurityHost Security–– Protect hosts directlyProtect hosts directly–– Compatibility issueCompatibility issue

•• Heterogeneity of the hosts (different Heterogeneity of the hosts (different version and types of OS and different version and types of OS and different hardware)hardware)

–– Performance overheadPerformance overhead

HostsHosts

Network

Security

43/54

Page 44: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

HardwareHardware--assisted Securityassisted SecurityHardwareHardware--assisted Securityassisted Security

•• Trusted Platform Module (TPM)Trusted Platform Module (TPM)

–– Key burnt in hardwareKey burnt in hardware

•• Intel Intel vProvPro

–– Trusted Execution TechnologyTrusted Execution Technology

•• Virtualization (Virtualization (TrustZoneTrustZone of ARM)of ARM)

44

•• Virtualization (Virtualization (TrustZoneTrustZone of ARM)of ARM)

–– Identity Protection TechnologyIdentity Protection Technology

•• OneOne--time passwordtime password

•• MonitoringMonitoring

–– Copilot, RKRD, KICopilot, RKRD, KI--MonMon

•• CoprocessorCoprocessor--basedbased

44/54

Page 45: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SecuritySecuritySecuritySecurity

•• IntroductionIntroduction

•• AppendAppend--only Storageonly Storage

•• Use CasesUse Cases

•• ConclusionConclusion

45

45/54

Page 46: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Elevator PitchElevator PitchElevator PitchElevator Pitch

Protect Protect reference data reference data from from unauthorized modification unauthorized modification

by using by using AAppendppend--only Storage only Storage

46

•• WriteWrite--only read many (WORM) devices: CD or DVDonly read many (WORM) devices: CD or DVD

by using by using AAppendppend--only Storage only Storage

46/54

Page 47: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SoteriaSoteria Security Card (SSC)Security Card (SSC)SoteriaSoteria Security Card (SSC)Security Card (SSC)

ARM7-based controller

SATA interface

47

NAND flash memories

47/54

Page 48: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SSC FirmwareSSC FirmwareSSC FirmwareSSC Firmware

HostHost

Host Interface Layer (HIL)

SATA Device Driver

SSC Device Driver

48

SSCSSC

Host Interface Layer (HIL)

Flash Translation Layer (FTL)

Flash Interface Layer (FIL)

Log Management Layer (LML)

48/54

Page 49: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

SecuritySecuritySecuritySecurity

•• IntroductionIntroduction

•• AppendAppend--only Storageonly Storage

•• Use CasesUse Cases

•• ConclusionConclusion

49

49/54

Page 50: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

•• Using SSCUsing SSC

–– Logs are stored in both the Logs are stored in both the hard disk and SSChard disk and SSC

–– Log integrity checker checks Log integrity checker checks if the logs are contaminated if the logs are contaminated by comparing those in the by comparing those in the hard disk against those in hard disk against those in

Use Case #1: Log ProtectionUse Case #1: Log ProtectionUse Case #1: Log ProtectionUse Case #1: Log Protection

Server

File System

Block Device

SSC Device Driver

SATA/PCI

50

hard disk against those in hard disk against those in SSCSSC

•• PerformancePerformance

–– Performance degradation of Performance degradation of the response time of the the response time of the Apache web server is 0.88% Apache web server is 0.88% employing a separate employing a separate process to store logsprocess to store logs

Block Device Driver

Hard Disk

SATA/PCI Device Driver

SSC

Log Integrity Checker50/54

Page 51: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Current PracticeCurrent PracticeCurrent PracticeCurrent Practice

•• Log protection techniquesLog protection techniques–– Logging serverLogging server

•• Vulnerabilities involved in collecting and transferring logsVulnerabilities involved in collecting and transferring logs

–– EncryptionEncryption•• Encryption is secure only if the key is not revealedEncryption is secure only if the key is not revealed•• According to According to 2012 Verizon Data Breach2012 Verizon Data Breach report, report, 76%76% of data of data breaches exploited weak or stolen credentialsbreaches exploited weak or stolen credentials

–– HypervisorHypervisor

51

–– HypervisorHypervisor•• Who protects hypervisor itself?Who protects hypervisor itself?

•• Does this really happen?Does this really happen?–– According to a police officer in charge of cyber crime According to a police officer in charge of cyber crime investigation, investigation, -- some attackers some attackers delete their traces from logsdelete their traces from logs, and , and -- some attackers some attackers delete everything from the hard diskdelete everything from the hard disk, which , which includes logsincludes logs

51/54

Page 52: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

•• File integrityFile integrity–– File modification is usually (if not always) a prerequisite File modification is usually (if not always) a prerequisite or a result of malwareor a result of malware

–– Therefore, file integrity checking is a powerful tool to find Therefore, file integrity checking is a powerful tool to find out the cause of attacks and malwareout the cause of attacks and malware

•• Using SSCUsing SSC–– The integrity information of files is stored in the hardwareThe integrity information of files is stored in the hardware

Use Case #2: File Integrity CheckUse Case #2: File Integrity CheckUse Case #2: File Integrity CheckUse Case #2: File Integrity Check

52

–– The integrity information of files is stored in the hardwareThe integrity information of files is stored in the hardware–– By comparing against the stored integrity information, By comparing against the stored integrity information, unauthorized modifications can be detectedunauthorized modifications can be detected

•• PerformancePerformance–– Since the file integrity checker is an offSince the file integrity checker is an off--line utility, the line utility, the performance impact can be minimized by assigning a low performance impact can be minimized by assigning a low prioritypriority

–– Malware Malware detectors and integrity checkers detect malicious detectors and integrity checkers detect malicious activities by comparing against some reference activities by comparing against some reference datadata

52/54

Page 53: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

ConclusionConclusionConclusionConclusion

•• SoteriaSoteria Security Card:Security Card:

–– Prevents reference data from unauthorized Prevents reference data from unauthorized modificationmodification

–– Stored data cannot be modified or erasedStored data cannot be modified or erased

•• Use casesUse cases

–– Log protectionLog protection

53

–– Log protectionLog protection

–– File integrity checkingFile integrity checking

–– File access monitoringFile access monitoring

–– NonNon--repudiationrepudiation

–– Medical recordMedical record

–– Financial transactionFinancial transaction

53/54

Page 54: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Thank you!Thank you!Thank you!Thank you!

54

54/54

Page 55: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

AppendixAppendixAppendixAppendix

Page 56: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

A Programmable Processing Array A Programmable Processing Array Architecture Supporting Dynamic Architecture Supporting Dynamic Task Scheduling and ModuleTask Scheduling and Module--Level Level

PrefetchingPrefetching

A Programmable Processing Array A Programmable Processing Array Architecture Supporting Dynamic Architecture Supporting Dynamic Task Scheduling and ModuleTask Scheduling and Module--Level Level

PrefetchingPrefetching

JungheeJunghee LeeLee**, , HyungHyung GyuGyu LeeLee**, , SoonhoiSoonhoi HaHa††, , JongmanJongman KimKim**, and , and ChrysostomosChrysostomos NicopoulosNicopoulos‡‡JongmanJongman KimKim**, and , and ChrysostomosChrysostomos NicopoulosNicopoulos‡‡

*

† ‡

Page 57: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

ExampleExampleExampleExample

•• Quick sortQuick sort–– Pivot is selectedPivot is selected–– The given array is partitioned so thatThe given array is partitioned so that

•• The left segment should contain The left segment should contain smaller elements than the pivotsmaller elements than the pivot

•• The right segment should contain The right segment should contain larger elements than the pivotlarger elements than the pivot

–– Recursively partition the left and right Recursively partition the left and right segmentssegments

57

–– Recursively partition the left and right Recursively partition the left and right segmentssegments

•• Specifying quick sortSpecifying quick sort–– MultiMulti--threadingthreading

•• OK but hard to debugOK but hard to debug

–– SIMDSIMD•• Inefficient due to input dependencyInefficient due to input dependency

–– Kahn process networkKahn process network•• Impossible due to the dynamic Impossible due to the dynamic naturenature

Page 58: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Specify Quick Sort with EventSpecify Quick Sort with Event--driven driven ModelModelSpecify Quick Sort with EventSpecify Quick Sort with Event--driven driven ModelModel•• Partition modulePartition module

–– bb (behavior): select a pivot, partition the input (behavior): select a pivot, partition the input array, instantiate another partition module if array, instantiate another partition module if necessarynecessary

–– PPii (input port): input array and its position(input port): input array and its position–– PPoo (output port): left and right segments and (output port): left and right segments and

their positiontheir position–– CC (sensitivity list): input array(sensitivity list): input array–– PP ((prefetchprefetch list): input arraylist): input array

Partition

Partition

Input array

58

–– PP ((prefetchprefetch list): input arraylist): input array

•• Collection moduleCollection module–– bb (behavior): collect segments(behavior): collect segments–– PPii (input port): sorted segments and (input port): sorted segments and

intermediate resultintermediate result–– PPoo (output port): final result and intermediate (output port): final result and intermediate

resultresult–– CC (sensitivity list): sorted segments(sensitivity list): sorted segments–– PP ((prefetchprefetch list): sorted segments and list): sorted segments and

intermediate resultintermediate result

Partition

Collection

Final result

Intermediate result

Page 59: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Illustrative ExampleIllustrative ExampleIllustrative ExampleIllustrative Example

uCPU

Prefetcher

Out Sig Q

Msg Handler

uCPU

Prefetcher

Out Sig Q

Msg Handler

uCPU

Prefetcher

Out Sig Q

Msg Handler

Partition 0 Partition 2

Partition 3 Partition 4 Partition 5

Partition 4Partition 1

59

Msg Handler Msg Handler Msg Handler

Interconnect Directory

Signal Storage

Scheduler

Wait Q

Ready Q

Run Q

Collection

Collection

Collection

Page 60: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

ScalabilityScalabilityScalabilityScalability

0.6

0.8

1.0

Co

re u

tiliza

tio

n

14000

16000

20000

Ex

ec

uti

on

tim

e (

cyc

les

)

18000

60

Number of core tiles

24 32 40 48 560

0.2

0.4

Co

re u

tiliza

tio

n

Util (1)

Util (3)

648000

10000

12000

Ex

ec

uti

on

tim

e (

cyc

les

)

Execution time (1)

Execution time (3)

Page 61: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

A SemiA Semi--Preemptive Garbage Preemptive Garbage Collector for Solid State Collector for Solid State

DrivesDrives

A SemiA Semi--Preemptive Garbage Preemptive Garbage Collector for Solid State Collector for Solid State

DrivesDrives

JungheeJunghee Lee, Lee, Youngjae Kim, Galen M. Youngjae Kim, Galen M. JungheeJunghee Lee, Lee, Youngjae Kim, Galen M. Youngjae Kim, Galen M. Shipman, Shipman, SarpSarp Oral, Oral, FeiyiFeiyi WangWang, and , and

JongmanJongman KimKim

Page 62: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Spider: A LargeSpider: A Large--scale Storage System scale Storage System Spider: A LargeSpider: A Large--scale Storage System scale Storage System

•• Jaguar Jaguar –– PetaPeta--scale computing scale computing machinemachine

–– 25,000 nodes with 250,000 25,000 nodes with 250,000 cores and over 300 TB cores and over 300 TB memorymemory

•• Spider storage systemSpider storage system–– The largest centerThe largest center--wide wide

62

–– The largest centerThe largest center--wide wide LustreLustre--based file systembased file system

–– Over Over 10.7 PB of RAID 6 10.7 PB of RAID 6 formatted capacityformatted capacity•• 13,400 x 1 TB HDDs13,400 x 1 TB HDDs

–– 192 192 LustreLustre I/O serversI/O servers•• Over 3TB of memory (on Over 3TB of memory (on LustreLustre I/O servers)I/O servers)

Page 63: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Pathological Behavior of SSDsPathological Behavior of SSDsPathological Behavior of SSDsPathological Behavior of SSDs

•• Does GC have an impact on the foreground operations?Does GC have an impact on the foreground operations?

–– If so, we can observe sudden bandwidth dropIf so, we can observe sudden bandwidth drop

–– More drop with more write requestsMore drop with more write requests

–– More drop with more More drop with more burstybursty workloadsworkloads

63

•• Experimental SetupExperimental Setup

–– SSD devicesSSD devices

•• Intel (SLC) 64GB SSDIntel (SLC) 64GB SSD

•• SuperTalentSuperTalent (MLC) 120GB SSD(MLC) 120GB SSD

–– I/O generatorI/O generator

•• Used Used libaiolibaio asynchronous I/O library for blockasynchronous I/O library for block--level testinglevel testing

Page 64: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Bandwidth Drop for WriteBandwidth Drop for Write--Dominant Dominant WorkloadsWorkloadsBandwidth Drop for WriteBandwidth Drop for Write--Dominant Dominant WorkloadsWorkloads

•• ExperimentsExperiments

–– Measured bandwidth for 1MB by varying readMeasured bandwidth for 1MB by varying read--write write ratioratio

200

220

240

260

280

B/s

1MB Sequential

180

200

220

240

B/s

1MB Sequential

64

120

140

160

180

200

0 10 20 30 40 50 60

MB

Time (Sec)

80% Write 20% Read60% Write 40% Read

40% Write 60% Read20% Write 80% Read

Intel SLC (SSD) 120

140

160

180

0 10 20 30 40 50 60

MB

Time (Sec)

80% Write 20% Read60% Write 40% Read

40% Write 60% Read20% Write 80% Read

SuperTalent MLC (SSD)

Performance variability increases as we increase write-percentage of workloads.

Page 65: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Performance Variability for Performance Variability for BurstyBurstyWorkloadsWorkloadsPerformance Variability for Performance Variability for BurstyBurstyWorkloadsWorkloads

•• ExperimentsExperiments

–– Measured SSD write bandwidth for Measured SSD write bandwidth for queue depth (queue depth (qdqd) ) is 8 is 8 and and 6464

–– Normalized I/O bandwidth with a Z distributionNormalized I/O bandwidth with a Z distributionIntel SLC (SSD) SuperTalent MLC (SSD)

65

Performance variability increases as we increase the arrival-rate of requests (bursty workloads).

Page 66: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Lessons LearnedLessons LearnedLessons LearnedLessons Learned

•• From the empirical study, we learned:From the empirical study, we learned:

–– Performance Performance variability increases as the percentage variability increases as the percentage of writes in workloads increases. of writes in workloads increases.

–– Performance variability increases with respect to the Performance variability increases with respect to the arrival rate of write requestsarrival rate of write requests..

66

•• This is because:This is because:

–– Any Any incoming requests during the GC should wait incoming requests during the GC should wait until the onuntil the on--going GC ends. going GC ends.

–– GC is not preemptiveGC is not preemptive

Page 67: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Performance Improvements for Performance Improvements for Synthetic WorkloadsSynthetic WorkloadsPerformance Improvements for Performance Improvements for Synthetic WorkloadsSynthetic Workloads•• Varied four parameters: request size, interVaried four parameters: request size, inter--arrival time, arrival time,

sequentialitysequentiality and read/write ratioand read/write ratio•• Varied one at a time fixing othersVaried one at a time fixing others

2.0

2.5

Ave

rag

e

resp

on

se t

ime (

ms)

3.5

4.0

4.5

Sta

nd

ard

devia

tio

n

NPGC

PGC

NPGC std

67

Request size (KB)

8 16 32 640

0.5

1.0

1.5

2.0

Ave

rag

e

resp

on

se t

ime (

ms)

0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Sta

nd

ard

devia

tio

n

NPGC std

PGC std

Page 68: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

Performance Improvement for Performance Improvement for Synthetic Workloads (Synthetic Workloads (con’tcon’t))Performance Improvement for Performance Improvement for Synthetic Workloads (Synthetic Workloads (con’tcon’t))

Bursty Random dominant Write dominant

68

Inter-arrival time (ms)

10 5 3 1 0.8 0.6 0.4 0.2

Probability of

sequential access

0.8 0.6 0.4 0.2

Probability of

read access

Page 69: JungheeJungheeLee Lee - Baylor ECSweb.ecs.baylor.edu/faculty/marks/BEARS/Files/LeeJunghee2014.pdf · Intra-module debug Scratch-pad memory Prefetching 7/54. Programmable Accelerator

HardwareHardware--assisted assisted Intrusion Detection by Intrusion Detection by Preserving Reference Preserving Reference Information IntegrityInformation Integrity

HardwareHardware--assisted assisted Intrusion Detection by Intrusion Detection by Preserving Reference Preserving Reference Information IntegrityInformation Integrity

JungheeJunghee Lee, Lee, ChrysostomosChrysostomos NicopoulosNicopoulos, ,

GiGi Hwan Oh, SangHwan Oh, Sang--Won Lee, and Won Lee, and JongmanJongman KimKim