Experiences with moving intelligence closer to storage
Pankaj Mehra, HP Labs

Patents Pending

Outline

• Storage intelligence
• Real examples from …
  – Transaction processing
  – Business intelligence
  – Content management

Storage is perceived as intelligent if it …
• Is application-aware
  – Knows about objects and/or metadata
• Embeds higher-layer functions
  – Packs in an index or run-time env
or
• Is smart about its low-level functionality
  – Can predict access pattern or can guarantee QoS

Selected examples

• Examples from database side
  – Stock exchange
  – Data warehouse
• An intelligent data access manager employing
  – Persistent memory
  – Multidimensional index
  – Embedded query processing
• An example from the content side
  – Document archive
• Smart Cells = storage nodes with embedded
  – Containers
  – Content index
  – Hash index

Why have we spent 3 years re-architecting database I/O?
• Current database I/O technologies can deliver high throughput, but …
  – Techniques that improve throughput hurt response time
  – In real-world systems, response time must be bounded
• Persistent Memory is the way to provide higher throughput with faster response time

Does faster response time matter? Yes
• Response-time-critical apps
  – Stock exchanges
  – Hot stocks: dependent trades
  – Mini-batches or group commit
    • Increases RT
    • High transaction abort cost
• Real-time enterprise information directors
  – Telco, retail, supply chain
  – Publish-subscribe workflow: response time extends throughout
• Mixed-workload apps
  – Lot of small, unrelated transactions
  – High response time: more pressure for system resources, locks, etc.

[Figure: stock exchange processing. Labels: front-end processes, safe-store processes, order log files, matching processes, order books and trade results log, back-end processes, Stock Segment 1 … Stock Segment n.]
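
The group-commit trade-off named on this slide can be made concrete with a little arithmetic. The sketch below assumes transactions arrive at a fixed interval and that a batch commits only when it is full; the function name and the numbers in the usage note are illustrative and do not come from the talk.

```python
def group_commit_wait_us(batch_size: int, interarrival_us: float) -> tuple[float, float]:
    """Return (mean, worst-case) extra wait in microseconds that batching adds.

    Assumes evenly spaced arrivals and a batch that is flushed only when full.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The first arrival waits for the remaining batch_size - 1 arrivals.
    worst = (batch_size - 1) * interarrival_us
    # Waits are spread evenly between 0 and worst, so the mean is half of worst.
    mean = worst / 2
    return mean, worst
```

Under these assumptions a batch of 8 with arrivals 100 µs apart adds 350 µs of wait on average and 700 µs in the worst case, before any I/O cost is counted.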

The ‘long pole’ in the commit path

Traditional I/O vs. RDMA I/O
• Traditional SCSI-like I/O has inherently high latency
  – 100s of microseconds
  – Target-initiated DMA
  – High-overhead software path
• RDMA I/O is much faster
  – 10s of microseconds of latency or less
  – Can be host-initiated
  – Very thin software path; hardware does most of the work

[Figure: SCSI flow between host and target. Host: initiate command, send command. Target: received command, initiate DMA (read or write) to/from host; DMA complete, send ack. Host: completion interrupt.]
[Figure: RDMA flow between host and target. Host: initiate RDMA write (DMA write to target); completion (interrupt or polled). The RDMA target is not actively involved.]
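
The difference between the two flows can be written down as data. The sketch below lists the steps using the diagram labels and counts how many need the target to take an active part; assigning each step to "host" or "target" is one reading of the diagrams, and the `Step` type and `target_steps` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    actor: str   # "host" or "target": which side drives this step
    action: str


# Steps as labeled in the two message-flow diagrams.
SCSI_WRITE = [
    Step("host", "initiate command"),
    Step("host", "send command"),
    Step("target", "received command, initiate DMA"),
    Step("target", "DMA to/from host"),
    Step("target", "DMA complete, send ack"),
    Step("host", "completion interrupt"),
]
RDMA_WRITE = [
    Step("host", "initiate RDMA write"),
    Step("host", "DMA write to target"),
    Step("host", "completion (interrupt or polled)"),
]


def target_steps(flow: list[Step]) -> int:
    """Count the steps that need the target to take an active part."""
    return sum(1 for step in flow if step.actor == "target")
```

In this model the SCSI flow has three target-driven steps and the RDMA flow has none, which is the point of the "not actively involved" label.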

Persistent Memory (PM)

Fast
• Hardware accelerated
• Can be used synchronously
  – Simpler protocol stack
  – Almost entirely in hardware

Reliable
• As durable as disk
  – Non-volatile
  – Single fault tolerant: mirrored
• Independent fault zone
  – Not in a processor’s fault domain
  – Survives faults of other system components

Byte-grained
• No read-modify-write
• Structure friendly
• Byte-grained locking/sharing
  – Better concurrency
  – No false sharing

[Figure: clients and a PMM attached over a SAN to a PM Unit (PMU). Labels: SAN interface, non-volatile memory, region contents, side RAM, regions & permissions, password, read/write PM region.]
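
To illustrate the "no read-modify-write" point, the sketch below contrasts a small update through a block interface with the same update through a byte-addressable one. The 512-byte block size, the function names, and the use of a `bytearray` as a stand-in for both devices are assumptions made for the example.

```python
BLOCK_SIZE = 512  # assumed block size, for illustration only


def block_update(device: bytearray, offset: int, data: bytes) -> int:
    """Update via whole-block I/O; return the number of bytes transferred."""
    first = offset // BLOCK_SIZE
    last = (offset + len(data) - 1) // BLOCK_SIZE
    transferred = 0
    for block in range(first, last + 1):
        start = block * BLOCK_SIZE
        buf = bytearray(device[start:start + BLOCK_SIZE])    # read the block
        lo = max(offset, start)
        hi = min(offset + len(data), start + BLOCK_SIZE)
        buf[lo - start:hi - start] = data[lo - offset:hi - offset]  # modify
        device[start:start + BLOCK_SIZE] = buf               # write it back
        transferred += 2 * BLOCK_SIZE
    return transferred


def byte_update(memory: bytearray, offset: int, data: bytes) -> int:
    """Update only the bytes that change; return the number of bytes transferred."""
    memory[offset:offset + len(data)] = data
    return len(data)
```

A 4-byte update that straddles a block boundary moves 2,048 bytes through `block_update` and 4 bytes through `byte_update`, and both leave the same contents behind.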

A write-aside buffer in PM
• Persistent Memory used to buffer disk writes
  – Critical synchronous writes complete in fast PM
  – Larger asynchronous disk writes for higher throughput
  – Audit log volume is always flushed
• Better scaling with concurrent log writers

[Figure: a database and its log writer send log records to an NPMU and to LogVol; DataVol 1 … DataVol n and LogVol are attached over SCSI/iSCSI/FC, with remote copy.]
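
A minimal sketch of the write-aside idea follows, assuming a single log writer and in-memory stand-ins for both the PM region and the log volume. The class and method names are hypothetical; a real implementation would also need mirroring, recovery, and concurrency control, none of which is shown.

```python
class WriteAsideLog:
    """Toy write-aside buffer: synchronous appends to PM, batched drains to disk."""

    def __init__(self, drain_threshold: int) -> None:
        self.pm = bytearray()        # stands in for a persistent-memory region
        self.disk = bytearray()      # stands in for the log volume
        self.disk_writes = 0         # number of (large) disk writes issued
        self.drain_threshold = drain_threshold

    def append(self, record: bytes) -> None:
        """Commit-path write: the record counts as durable once it is in PM."""
        self.pm += record

    def drain_if_due(self) -> bool:
        """Off the commit path: move buffered records to disk in one large write."""
        if len(self.pm) < self.drain_threshold:
            return False
        self.disk += self.pm
        self.disk_writes += 1
        self.pm.clear()
        return True
```

The commit path only ever calls `append`; `drain_if_due` would run in the background, so several small synchronous writes turn into one larger asynchronous disk write.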

Unified write-aside buffer and stable communication buffer in PM
• PM also used as a communication buffer
  – Provides a shared end-to-end persistence medium for the commit path
  – Log writing is now completely off of the critical path
  – Fast RDMA writes replace slow communication round trips
  – PM is still utilized as a buffer for log writes

PMU Implementations
• Hardware PMU for NonStop
• Software PMU prototypes for HP-UX and NonStop
• Advanced PMUs in design for next-generation servers

[Figure: clients reach the PMU over ServerNet/InfiniBand. Labels: PMP, PMM, allocated memory, management commands, map/unmap memory, read/write metadata, read/write PM region.]

Performance of prototype PMUs

Network                                       Read latency   Read bandwidth   Write latency   Write bandwidth
InfiniBand 4x (rx5670, HP-UX, Software PMU)   14.7 µs        337 MB/s         9.9 µs          337 MB/s
ServerNet 2 (S86000, NonStop, Hardware PMU)   14.5 µs        26.5 MB/s        14.2 µs         32.8 MB/s

The original commit path of an INSERT transaction in NonStop SQL

Critical path:
• cluster data copies
• 2 waited disk I/Os to audit volume
• 11 waited round trips on messages

[Figure: commit message flow. Labeled steps include B. Checkpoint/ack, D. Checkpoint/ack, F. Checkpoint/ack.]

Insert transaction commit flow with an audit write-aside checkpointing buffer

Critical path:
• cluster data copies
• 2 waited writes to persistent memory, not disk
• 6 waited round trips on messages

[Figure: commit message flow. Labeled steps include B. Checkpoint/ack; D, H. Checkpoint Deltas to PM; F. Checkpoint/ack; J. Save Commit Record in PM.]

End-to-end persistence eliminates entire processes and messages from critical path

Critical path:
• Just 2 cluster data copies
• 2 waited writes to persistent memory
• Merely 2 waited round trips on messages

[Figure: commit message flow. Components: client, TMF lib, DP2pri, ADPpri, TMPpri, DataVol, AuditVol, PMU, updated records. Labeled steps: A. Flush Changes; E. All Flushed; F, I. Checkpoint Commit Record to PM; H. Deltas; J. Commit Records; J. Release locks.]
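
The three critical paths can be compared with simple arithmetic. In the sketch below the round-trip and write counts are taken from the slides, while every per-operation cost is a hypothetical placeholder chosen only to show the calculation; data-copy costs are left out.

```python
def commit_latency_us(round_trips: int, durable_writes: int,
                      round_trip_us: float, write_us: float) -> float:
    """Sum the waited round trips and waited durable writes on the commit path."""
    return round_trips * round_trip_us + durable_writes * write_us


# Counts are from the three critical-path slides; the per-operation costs
# below are hypothetical placeholders, not measurements from the talk.
ROUND_TRIP_US = 50.0     # assumed message round trip
DISK_WRITE_US = 5000.0   # assumed waited disk I/O
PM_WRITE_US = 15.0       # assumed PM write, same order as the prototype table

original = commit_latency_us(11, 2, ROUND_TRIP_US, DISK_WRITE_US)
write_aside = commit_latency_us(6, 2, ROUND_TRIP_US, PM_WRITE_US)
end_to_end = commit_latency_us(2, 2, ROUND_TRIP_US, PM_WRITE_US)
```

With these placeholder costs the three paths come to 10,550 µs, 330 µs, and 130 µs. The absolute values mean little; the sketch only shows that once the disk writes are gone, the remaining round trips dominate, which is why the later designs go after them.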

Significantly better throughput on benchmark hot stocks

[Figure: throughput (4k inserts/sec, axis from 0 to 2,500) versus transaction size (32k, 64k, 128k; larger size = more boxcarring). Series: 1, 2, and 3 drivers without PM; 1, 2, and 3 drivers with PM.]

What would persistent-memory-enabled storage controllers do?
• Just 2 data copies total (optimal)
• ADP takes over data-volume update function from DP2
  – Drains data from PMUs to disks and RDF peers off the critical path

Critical path:
• Just 2 data copies
• 2 waited writes to persistent memory
• Merely 2 waited round trips on message system

[Figure: commit message flow. Labeled step: B, C. Checkpoint Changes to PM.]

Completing the picture …
• Shared state in persistent memory eliminates many TMFLib PIO messages
• Faster TMP releases locks based on shared transaction state

Critical path:
• Just 2 data copies
• 2 waited writes to persistent memory
• Zero waited round trips on message system before releasing locks

[Figure: commit message flow. Labeled step: B, C. Checkpoint Changes to PM.]

Other applications

• NAS filers
  – Directory and inode updates
  – Changes to actual file
• Local filesystems (VxFS)
  – Journal logs
  – Metadata changes
• Other database systems (Oracle)
  – Transaction logs
• iSCSI servers
  – Writes of disk blocks

Pushing function closer to data: DP2 and Business Intelligence workloads

Parallel query plan

Parallel query execution

[Figure: parallel groupby of a 4-way partitioned table. Labels: Application Process, ESP (repeated), DP2 (four).]

DP2 is a database-aware volume manager
• Embeds a B-Tree index with awareness of database rows
  – B-tree based multi-dimensional access method
  – See “Multi Dimensional Access Method: An Efficient Search Method for Multidimensional B-Trees,” by Leslie et al.
• Has an embedded database run-time
  – Technically, can run any plan
  – In most practical situations, runs scans (including filters) and partial groupings
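
The "scans (including filters) and partial groupings" split can be sketched as two small functions: one that runs next to each partition and one that merges the results upstream. This is a generic illustration of partial aggregation, not DP2 code; the row shape, the function names, and the choice of a sum as the aggregate are all assumptions.

```python
from collections import Counter
from typing import Callable, Iterable

Row = tuple[str, int]  # (group key, value): a made-up row shape


def partial_groupby(rows: Iterable[Row], keep: Callable[[Row], bool]) -> Counter:
    """Runs next to one partition: filter rows, then sum values per key."""
    partial: Counter = Counter()
    for key, value in rows:
        if keep((key, value)):
            partial[key] += value
    return partial


def merge_partials(partials: Iterable[Counter]) -> dict[str, int]:
    """Runs upstream: combine the small per-partition results."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds counts key by key
    return dict(total)
```

Only the per-key partial sums cross the boundary between the storage-side process and the layer above it, so the volume of data moved depends on the number of groups, not the number of rows scanned.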

Pushing content-awareness closer to data in HP StorageWorks™ Grid

StorageWorks Grid Architecture

Smart Cells
• Scalable distributed system of self-contained, all-inclusive data repositories

Principles
• Scale-out
• Federation
• Intelligence close to data
• Pluggable platforms supporting HP and 3rd-party storage services

Example
• HP RISS™ platform for Information Lifecycle Management services

[Figure: a grid of Smart Cells. Labels: Smart Query Fabric; Storage: Block, File & Object; Content indexing; Attribute indexing; Supported protocols and APIs.]

NASA/IEEE MSST 2005

HP Reference Information Storage Server (RISS): Principles of Storage Service Integration

[Figure: HP RISS Platform™ integration diagram. Realms: Application realm, Integration realm, Appliance realm. Labels:
  Applications: file system, mail server, database server, HTTP server
  Harvester: mailbox crawler, GRAU, file/doc loader
  Fault handler: e-mail/document shortcut, virtual file system, database
  Protocol handler
  Protocol plug-in: SMTP/IMAP, HTTPS (WebDAV, SOAP), DICOM, CIFS/NFS]

HP RISS platform uses “Grid” principles for scalability and performance today

RISS Scope
• Manage the semantics of application data
• Provide unified view of computing and storage resources

HP ProLiant Servers
• Off the shelf server or blade technologies
• Leverages advancements in hardware technology

Lifecycle Managed
• Secure
• Protected
• Retention
• Access Controlled
• Highly Available
• Tamperproof

[Figure: Email, Doc Mgmt, and Third Party Apps above a grid of cells labeled SC. Labels: Stores (SOAP, SMTP), Queries (HTTP), StorageWorks™ RISS APIs.]

3rd-party storage services can integrate with HP StorageWorks Grid RISS 1.5 at even deeper level

[Figure: RISS 1.5 Platform layer diagram. Layers: Clients & Agentware, Shallow Integration Layer, Deep Integration Layer, Core Platform. Labels, in source order:
  Exchange integration, Lotus Notes integration, File discovery & classification, Basic metadata management, File migration, Backup, Dynamic metadata management, DB discovery & archiving
  D2D agent, XAM client, GDS, IMA (RISS), Partner Q, BIBO API, Partner X, Partner A
  Chunking, Container, Versioning, Replication, Duplicate Elimination, Compression
  Query Service, Content Indexing Service, Account Services, Auditing, Install & Config, Monitor & Control, Policy, Security, Notification, XAM API
  Basic Web Services, SMTP, CIFS, NFS, XAM API, Web binding, DICOM, ECM/CRM
  Firewall / Load Balancing]
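
The diagram names chunking, duplicate elimination, and (on the Smart Cell slide) a hash index, but the slides do not say how RISS implements them. The sketch below shows one common way these pieces fit together, assuming fixed-size chunks and SHA-256 digests as keys; `ChunkStore` and its methods are hypothetical names, and the 4-byte chunk size is only there to keep the example readable.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: fixed-size chunks keyed by SHA-256 digest."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}   # hash index: digest -> chunk

    def put(self, data: bytes) -> list[str]:
        """Store data; return the list of chunk digests that reconstructs it."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # duplicate chunks stored once
            recipe.append(digest)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        """Reassemble data from its chunk digests."""
        return b"".join(self.chunks[d] for d in recipe)
```

Storing `b"abcdabcdwxyz"` produces a three-entry recipe but only two stored chunks, because the repeated `abcd` chunk hashes to a digest that is already in the index.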

Final Comments

• The new face of intelligent storage
  – Memory semantic access with durability
  – Transaction aware
  – Index enabled
  – Sophisticated search and query substrate embedded
• SCSI is evil and having that as the only standardized transport for OSD command set is pathetic! (personal opinion)
  – Higher level functions demand higher level protocols and APIs
• SNIA efforts
  – XAM over OSD presents a thin ray of hope
