REDDnet: Enabling Data Intensive Science in the Wide Area
Alan Tackett, Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de Ledesma, and Matt Heller
Vanderbilt University

TRANSCRIPT

Page 1
REDDnet: Enabling Data Intensive Science in the Wide Area
Alan Tackett
Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de Ledesma, and Matt Heller
Vanderbilt University

Page 2
Overview

What is REDDnet, and why is it needed?
Logistical Networking
Distributed Hash Tables
L-Store: putting it all together

This is a work in progress! Most of the metadata operations are in the prototype stage.

Page 3
REDDnet: Distributed Storage Project
http://www.reddnet.org

NSF-funded project
Currently 500TB across 13 sites
Multiple disciplines: satellite imagery (AmericaView), HEP, Terascale Supernova Initiative, bioinformatics, Vanderbilt TV News Archive, Large Synoptic Survey Telescope, CMS-HI

Goal: providing working storage for distributed scientific collaborations.

Page 4
"Data Explosion"

(figure: metadata growing slowly next to rapidly growing raw storage)

Focus on increasing bandwidth and raw storage
Assume metadata growth is minimal

Page 5
"Data Explosion" of both data and metadata

Focusing on bandwidth and raw storage while assuming minimal metadata growth works great for large files. For collections of small files, however, the metadata becomes the bottleneck, so we need the ability to scale metadata.

ACCRE examples:
Proteomics: 89,000,000 files totaling 300GB
Genetics: 12,000,000 files totaling 50GB in a single directory

Page 6
Design Requirements

Availability: should survive a partial network outage; no downtime, hot upgradable
Data and metadata integrity: end-to-end guarantees are a must!
Performance: metadata (transactions/s), data (MB/s)
Security
Fault tolerance: should survive multiple complete device failures

Metadata and data have very different design requirements.

Page 7
Logistical Networking

Page 8
Logistical Networking (LN) provides a "bits are bits" infrastructure

Standardize on what we have an adequate common model for: storage/buffer management and coarse-grained data transfer.
Leave everything else to higher layers: end-to-end services such as checksums, encryption, error encoding, etc.
Enable autonomy in wide-area service creation: security, resource allocation, QoS guarantees.
Gain the benefits of interoperability! One structure serves all.

Page 9
IBP: Internet Backplane Protocol

Middleware for managing and using remote storage
Allows advance space and time reservation
Supports multiple connections per depot
User-configurable block size
Designed to support large-scale, distributed systems
Provides a global "malloc()" and "free()"
End-to-end guarantees
Capabilities: each allocation has separate Read/Write/Manage keys

LoCI Tools (Logistical Computing and Internetworking Lab): http://loci.cs.utk.edu

IBP is the "waist of the hourglass" for storage.

Page 10
What makes LN different? It is based on a highly generic abstract block for storage (IBP).

Page 11
What makes LN different? It is based on a highly generic abstract block for storage (IBP): the storage depot.

Page 12
IBP Server or Depot

(figure: a depot hosting resources RID 1005 (2TB disk), RID CMS-1004 (120GB SSD), and RID 1006 (1TB disk), plus an OS disk that optionally holds the RID DBs)

Runs the ibp_server process
Each resource has a unique ID and separate data and metadata partitions (or directories)
Optionally can import metadata to SSD
Typically a JBOD disk configuration
Heterogeneous disk sizes and types; don't have to use dedicated disks

Page 13
IBP functionality

Allocate: reserve space for a limited time; can create allocation aliases to control access; enable block-level disk checksums to detect silent read errors
Manage allocations: INC/DEC the allocation reference count (when it reaches 0 the allocation is removed); extend duration; change size; query duration, size, and reference counts
Read/Write: can use either append mode or random access; optionally use network checksums for end-to-end guarantees
Depot-depot transfers: data can be either pushed or pulled to allow firewall traversal; supports both append and random offsets; supports network checksums for transfers; Phoebus support (Internet2 overlay routing to improve performance)
Depot status: number of resources and their resource IDs; amount of free space; software version

Page 14
What is a "capability"?

Controls access to an IBP allocation
3 separate caps per allocation: Read, Write, and Manage (delete or modify an allocation)
Alias or proxy allocations are supported: the end user never has to see the true capabilities, and they can be revoked at any time

Page 15
Alias Allocations

A temporary set of capabilities providing limited access to an existing allocation
Byte-range or full access is allowed
Multiple alias allocations can exist for the same primary allocation
Can be revoked at any time
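The capability and alias model above can be sketched in a few lines of C. This is an illustrative sketch only; the struct and function names are hypothetical and not the real IBP API. An allocation carries three independent keys, and an alias grants limited, revocable access to a byte range of the primary allocation:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch, not the real IBP API: an allocation carries
 * three independent keys, and an alias grants limited, revocable
 * access to a byte range of the primary allocation. */

typedef struct {
    char read_key[32];    /* presented to read the allocation */
    char write_key[32];   /* presented to write it */
    char manage_key[32];  /* presented to delete/modify it */
} capset_t;

typedef struct {
    const capset_t *primary;  /* allocation the alias points at */
    long offset, len;         /* byte range the alias may touch */
    bool revoked;             /* the manage key can flip this any time */
} alias_t;

/* A read through an alias is allowed only if the alias is unrevoked
 * and the request stays inside its byte range. */
bool alias_can_read(const alias_t *a, long off, long nbytes) {
    if (a->revoked) return false;
    return off >= a->offset && off + nbytes <= a->offset + a->len;
}
```

A reader holding only the alias never sees the primary's keys, which is what lets the owner revoke access at any time.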

Page 16
Split/Merge Allocations

Split: splits an existing allocation's space into two unique allocations. Data is not shared; the initial primary allocation is truncated.
Merge: merges two allocations together, with only the "primary" allocation's capabilities surviving. Data is not merged; the primary's size is increased and the secondary allocation is removed.

Designed to prevent denial-of-"space" attacks; allows implementation of user and group quotas.
Split and merge are atomic operations, which prevents another party from grabbing the space during the operation.
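The space accounting behind split and merge can be sketched as follows. This is not the real IBP implementation, just a minimal model of the semantics described above: split moves bytes from a primary into a new allocation without copying data, and merge returns the secondary's space to the primary and removes it.

```c
#include <assert.h>

/* Illustrative model of split/merge space accounting (names are
 * assumptions, not the real IBP API). */

typedef struct { long size; int valid; } alloc_t;

/* Split: the primary is truncated by `nbytes`, which seed the new
 * allocation. No data is shared or copied. */
int ibp_split(alloc_t *primary, alloc_t *out, long nbytes) {
    if (nbytes <= 0 || nbytes > primary->size) return -1;
    primary->size -= nbytes;          /* primary shrinks ... */
    out->size = nbytes;               /* ... secondary gets the space */
    out->valid = 1;
    return 0;
}

/* Merge: only the primary's capabilities survive; no data is combined.
 * The secondary's space is folded back into the primary. */
int ibp_merge(alloc_t *primary, alloc_t *secondary) {
    if (!secondary->valid) return -1;
    primary->size += secondary->size; /* primary grows */
    secondary->size = 0;
    secondary->valid = 0;             /* secondary is removed */
    return 0;
}
```

In the real system both operations are atomic on the depot, so the total reserved space is invariant and no third party can claim the bytes mid-operation.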

Page 17
Depot-Depot Transfers: copy data from A to B

(figure: Depot A and Depot B each sit behind a firewall with a private IP; only a public IP is reachable in between)

Page 18
Depot-Depot Transfers: copy data from A to B

A direct transfer gets blocked at the firewall!

Page 19
Depot-Depot Transfers: copy data from A to B

Solution: A issues a PUSH, or B issues a PULL, so the connection is opened from behind the firewall.

Page 20
exNode: XML file containing metadata

(figure: allocations A, B, and C at byte offsets 0-300 mapped across IBP depots on the network)

Analogous to a disk i-node. Contains: the allocations, how to assemble the file, the fault-tolerance encoding scheme, and encryption keys.
Examples: a normal file, a file replicated at different sites, or a file replicated and striped.

Page 21
eXnode3

Exnode: a collection of layouts. Different layouts can be used for versioning, replication, optimized access (row vs. column), etc.
Layout: simple mappings for assembling the file: layout offset, segment offset, length, segment
Segment: a collection of blocks with a predefined structure. Types: linear, striped, shifted stripe, RAID5, generalized Reed-Solomon
Block: the simplest example is a full allocation; it can also be an allocation fragment
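The layout mappings above can be sketched as a lookup from a logical file offset to a position inside a segment. The struct and function names here are illustrative assumptions, not the actual eXnode3 schema:

```c
#include <assert.h>

/* Sketch of an eXnode3-style layout row: it maps the logical file
 * range [layout_offset, layout_offset+length) onto a segment starting
 * at segment_offset. Field names are assumptions. */

typedef struct {
    long layout_offset;   /* where this mapping starts in the file */
    long segment_offset;  /* where it starts inside the segment */
    long length;          /* bytes covered by this mapping */
    int  segment_id;      /* which segment holds the bytes */
} mapping_t;

/* Resolve a logical file offset to (segment_id, offset-in-segment).
 * Returns the segment id, or -1 if no mapping covers the offset. */
int resolve(const mapping_t *maps, int n, long file_off, long *seg_off) {
    for (int i = 0; i < n; i++) {
        long start = maps[i].layout_offset;
        if (file_off >= start && file_off < start + maps[i].length) {
            *seg_off = maps[i].segment_offset + (file_off - start);
            return maps[i].segment_id;
        }
    }
    return -1;
}
```

Because the mapping is a level of indirection, the same blocks can appear in several layouts (e.g. one layout per version or per access pattern) without duplicating data.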

Page 22
Segment types

(figure: block-placement diagrams for linear, striped, and shifted-stripe segments)

Linear: a single logical device; blocks can have unequal size; access is linear
Striped: multiple devices
Shifted stripe: multiple devices using a shifted stride between rows
RAID5 and Reed-Solomon segments are variations of the shifted stripe
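The striped and shifted-stripe placements can be captured in a couple of lines of C. This is a sketch with fixed-size blocks; the formulas are the standard round-robin and rotated placements, not code taken from L-Store:

```c
#include <assert.h>

/* Which of `ndev` devices holds block `b` of a segment. */

/* Striped: row-major round robin; every row uses the devices in the
 * same order. */
int striped_device(int b, int ndev) {
    return b % ndev;
}

/* Shifted stripe: each row's starting device shifts by one, so
 * consecutive rows hit the devices in a rotated order (the same
 * rotation RAID5 uses to spread parity across devices). */
int shifted_device(int b, int ndev) {
    int row = b / ndev;
    return (b + row) % ndev;
}
```

With 4 devices, the striped layout places blocks 0-3 on devices 0,1,2,3 and repeats; the shifted stripe places the second row on devices 1,2,3,0, which is what lets RAID5-style parity rotate instead of hammering one device.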

Page 23
Optimize data layout for WAN performance

(figure: a video file whose file header and frame headers 1-5 are interleaved with frame data a-e, versus an optimized layout that separates them)

Skipping to a specific video frame requires reading the file header and each frame header; WAN latency and disk seeks kill performance.
Optimized layout: separate the header/frame information and the frame data into separate segments. The layout contains mappings to present the traditional file view, all header data is contiguous (minimizing disk seeks), and it is easy to add additional frames.

Page 24
Latency Impact on Performance

(figure: client-server timelines comparing synchronous and asynchronous execution; the synchronous client pays the network latency on every command's round trip, while the asynchronous client overlaps commands in time)
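The timelines in the figure reduce to simple arithmetic. As a back-of-the-envelope model (illustrative numbers, not measurements): a synchronous client pays one round trip per command, while an asynchronous client pipelines commands and pays the latency roughly once.

```c
#include <assert.h>

/* Toy cost model for the sync-vs-async comparison above.
 * rtt_ms is the round-trip latency, op_ms the per-command service time. */

/* Synchronous: each command waits for the previous one's round trip. */
double sync_time(int ncmds, double rtt_ms, double op_ms) {
    return ncmds * (rtt_ms + op_ms);
}

/* Asynchronous: commands are pipelined behind a single round trip. */
double async_time(int ncmds, double rtt_ms, double op_ms) {
    return rtt_ms + ncmds * op_ms;
}
```

For 100 commands over a 50 ms WAN link with 1 ms of work each, the synchronous client needs 5100 ms while the pipelined client needs about 150 ms, which is why the client library below is built around asynchronous operation lists.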

Page 25
Asynchronous internal architecture

(figure: a global host queue feeding per-host in-flight queues, each with independent send and receive queues; completed operations are retired, and a socket error causes a retry)

Each host has a dedicated queue
Multiple in-flight queues per host (depends on load)
Send/receive queues are independent threads, so the socket library must be thread-safe!
Each operation has a weight, and an in-flight queue has a max weight
If socket errors occur, the number of in-flight queues is adjusted for stability
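The weighted admission rule above can be sketched as follows. This is a minimal model, not the actual client library: each operation carries a weight, and an in-flight queue stops accepting work once admitting the operation would push it past its maximum weight.

```c
#include <assert.h>

/* Sketch of weight-bounded admission for an in-flight queue
 * (illustrative; names are assumptions). */

typedef struct {
    int weight;      /* accumulated weight of ops currently in flight */
    int max_weight;  /* admission ceiling for this queue */
} inflight_queue_t;

/* Try to admit an op; returns 1 on success, 0 if the queue is full. */
int admit(inflight_queue_t *q, int op_weight) {
    if (q->weight + op_weight > q->max_weight) return 0;
    q->weight += op_weight;
    return 1;
}

/* Retiring a completed op frees its weight for new work. */
void complete(inflight_queue_t *q, int op_weight) {
    q->weight -= op_weight;
}
```

Bounding each queue by weight rather than by operation count lets a few large transfers and many small metadata ops share a host connection without starving either.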

Page 26
Asynchronous client interface

Similar to MPI_WaitAll() or MPI_WaitAny()
Framework: all commands return handles; execution can start immediately or be delayed; failed ops can be tracked

ibp_capset_t *create_allocs(int nallocs, int asize, ibp_depot_t *depot)
{
  int i, err;
  ibp_attributes_t attr;
  oplist_t *oplist;
  ibp_op_t *op;

  //** Create the caps list which is returned **
  ibp_capset_t *caps = (ibp_capset_t *)malloc(sizeof(ibp_capset_t)*nallocs);

  //** Specify the allocation attributes **
  set_ibp_attributes(&attr, time(NULL) + 900, IBP_HARD, IBP_BYTEARRAY);

  oplist = new_ibp_oplist(NULL);   //** Create a new list of ops
  oplist_start_execution(oplist);  //** Go ahead and start executing tasks. This could be done at any time

  //*** Main loop for creating the allocation ops ***
  for (i=0; i<nallocs; i++) {
     op = new_ibp_alloc_op(&(caps[i]), asize, depot, &attr, ibp_timeout, NULL); //** This is the actual alloc op
     add_ibp_oplist(oplist, op);   //** Now add it to the list and start execution
  }

  err = ibp_waitall(oplist);       //** Now wait for them all to complete
  if (err != IBP_OK) {
     printf("create_allocs: At least 1 error occurred! * ibp_errno=%d * nfailed=%d\n",
            err, ibp_oplist_nfailed(oplist));
  }

  free_oplist(oplist);             //** Free all the ops and oplist info
  return(caps);
}

Page 27
Distributed Hash Tables

Page 28
Distributed Hash Tables

Provide a mechanism to distribute a single hash table across multiple nodes
Used by many peer-to-peer systems
Properties: decentralized (no centralized control) and scalable (most can support 100s to 1000s of nodes)
Lots of implementations: Chord, Pastry, Tapestry, Kademlia, Apache Cassandra, etc. Most differ by their lookup() implementation.
The key space is partitioned across all nodes; each node keeps a local routing table, and routing is typically O(log n).

Page 29
Chord Ring* (Distributed Hash Table)

Maps a key: K## = hash(name)
Nodes (N##) are distributed around the ring and are responsible for the keys "behind" them
O(lg N) lookup

(figure: a ring of nodes N1, N8, N14, N21, N32, N38, N42, N56 holding keys K10, K24, K30, K54; the valid key range for N32 is 22-32)

*F. Dabek, et al., "Building Peer-to-Peer Systems With Chord, a Distributed Lookup Service." In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.

Page 30
Locating keys

Finger table for N8:
N8+1 -> N14
N8+2 -> N14
N8+4 -> N14
N8+8 -> N21
N8+16 -> N32
N8+32 -> N42

(figure: lookup(54) issued at N8 hops along the ring of nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56 via the fingers until it reaches the node responsible for K54)
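The ring and finger-table rules above can be sketched in C. This is a toy model of the slide's 6-bit identifier space with a static node list, not L-Store or Chord code: successor() finds the node responsible for a key, and finger() reproduces N8's finger table (the i-th finger of node n points at the successor of n + 2^i).

```c
#include <assert.h>

/* Toy Chord ring over the slide's 6-bit identifier space (mod 64). */
#define RING 64
static const int nodes[] = {1, 8, 14, 21, 32, 38, 42, 48, 51, 56};
static const int nnodes = 10;

/* Node responsible for `key`: the first node clockwise at or after it.
 * Keys past the last node wrap around to the first. */
int successor(int key) {
    for (int i = 0; i < nnodes; i++)
        if (nodes[i] >= key) return nodes[i];
    return nodes[0];  /* wrap around the ring */
}

/* The i-th finger of node n points at successor(n + 2^i mod RING),
 * which is how lookups cover exponentially growing distances. */
int finger(int n, int i) {
    return successor((n + (1 << i)) % RING);
}
```

Evaluating finger(8, i) for i = 0..5 reproduces the table on the slide (N14, N14, N14, N21, N32, N42), and successor(54) = N56, the target of lookup(54); because each finger roughly halves the remaining distance, lookups take O(log N) hops.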

Page 31
Apache Cassandra (http://cassandra.apache.org/)

Open source, actively developed, used by major companies: created at and still used by Facebook, now an Apache project, and also used by Digg, Twitter, Reddit, Rackspace, Cloudkick, and Cisco
Decentralized: every node is identical; no single points of failure
Scalable: designed to scale across many machines to meet throughput or size needs (100+ node production systems)
Elastic infrastructure: nodes can be added and removed without downtime
Fault tolerance through replication: flexible replication policy, with support for location-aware policies (rack and data center)
Uses Apache Thrift for communication between clients and server nodes. Thrift bindings are available for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml; language-specific client libraries are also available.

Page 32
Apache Cassandra (http://cassandra.apache.org/)

High write speed: uses RAM for write caching and sorting, plus append-only disk writes (sequential writes are fast, seeking is slow, especially for mechanical hard drives)
Durable: a commit log ensures that data will not be lost if a node goes down (similar to filesystem journaling)
Elastic data organization: schema-less; namespaces, tables, and columns can be added and removed on the fly
Operations support various consistency requirements: reads/writes of some data may favor high availability and performance and accept the risk of stale data, while reads/writes of other data may favor stronger consistency guarantees and sacrifice performance and availability when nodes fail
Does not support SQL or the wide range of query types of an RDBMS; limited atomicity and support for transactions (no commit, rollback, etc.)

Page 33
L-Store: Putting it all together

Page 34
L-Store (http://www.lstore.org)

Generic object storage framework using a tree-based structure
Data and metadata scale independently: large objects, typically file data, are stored in IBP depots, while metadata is stored in Apache Cassandra
Fault-tolerant in both data and metadata
No downtime for adding or removing storage or metadata nodes
Heterogeneous disk sizes and types supported
Lifecycle management: can migrate data off old servers
Functionality can be easily extended via policies. The L-Store core is just a routing framework; all commands, both system- and user-provided, are added in the same way.

Page 35
L-Store Policy

A policy is used to extend L-Store functionality and comprises 3 components: commands, stored procedures, and services
Each can be broken into an ID, arguments, and an execution handler
Arguments are stored in Google Protocol Buffer messages
The ID is used for routing

Page 36
Logistical Network Message Protocol

Provides a common communication framework

Message format:
[4 byte command] [4 byte message size] [message] [data]

Command: similar to an IPv4 octet (e.g., 10.0.1.5)
Message size: 32-bit unsigned big-endian value
Message: a Google Protocol Buffer message, designed to handle command options/variables; the entire message is always unpacked using an internal buffer
Data: optional field for large data blocks
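As a sketch, the 8-byte header can be packed and unpacked in portable C. The function names are hypothetical, but the byte layout follows the format above: a 4-byte octet-style command followed by a 32-bit unsigned big-endian message size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack the LNMP-style header: [4 byte command][4 byte big-endian size].
 * The octet-style command (e.g. 10.0.1.5) is written one byte per field. */
void pack_header(uint8_t out[8], const uint8_t cmd[4], uint32_t msg_size) {
    memcpy(out, cmd, 4);               /* [4 byte command] */
    out[4] = (msg_size >> 24) & 0xFF;  /* big-endian: MSB first */
    out[5] = (msg_size >> 16) & 0xFF;
    out[6] = (msg_size >> 8) & 0xFF;
    out[7] = msg_size & 0xFF;
}

/* Recover the message size regardless of the host's endianness. */
uint32_t unpack_size(const uint8_t in[8]) {
    return ((uint32_t)in[4] << 24) | ((uint32_t)in[5] << 16) |
           ((uint32_t)in[6] << 8)  |  (uint32_t)in[7];
}
```

Shifting bytes in and out explicitly, rather than casting the buffer to a uint32_t, keeps the wire format fixed as big-endian no matter what machine the client or server runs on.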

Page 37
Google Protocol Buffers (http://code.google.com/apis/protocolbuffers/)

Language-neutral mechanism for serializing data structures
Binary compressed format
A special "compiler" generates code for packing/unpacking messages
Messages support most primitive data types, arrays, optional fields, and message nesting

package activityLogger;

message alog_file {
  required string name = 1;
  required uint32 startTime = 2;
  required uint32 stopTime = 3;
  required uint32 depotTime = 4;
  required sint64 messageSize = 5;
}

Page 38
L-Store Policies

Commands: process heavyweight operations, including multi-object locking coordinated via the locking service
Stored procedures: lightweight atomic operations on a single object. A simple example is updating multiple key/value pairs on an object; more sophisticated examples would test/operate on multiple key/value pairs and only perform the update on success.
Services: triggered via metadata changes made through stored procedures. Services register with a messaging queue, or create new queues listening for events; they can trigger other L-Store commands but can't access metadata directly.

Page 39
L-Store Architecture: Client

The client contacts L-Servers to issue commands
Data transfers are done directly via the depots, with the L-Server providing the capabilities
Can also create locks if needed

(Arrow direction in the figure signifies which component initiates communication.)

Page 40
L-Store Architecture: L-Server

Essentially a routing layer for matching incoming commands to execution services
All persistent state information is stored in the metadata layer
Clients can contact multiple L-Servers to increase performance
Provides all Auth/AuthZ services
Controls access to metadata

(Arrow direction in the figure signifies which component initiates communication.)

Page 41
L-Store Architecture: StorCore*

From Nevoa Networks
Disk resource manager: allows grouping of resources into Logical Volumes
A resource can be a member of multiple Dynamical Volumes
Tracks depot reliability and space

(Arrow direction in the figure signifies which component initiates communication.)

*http://www.nevoanetworks.com/

Page 42: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

L-Store Architecture: Metadata

L-Store DB
Tree-based structure to organize objects
Arbitrary number of key/value pairs per object
All MD operations are atomic and operate only on a single object
Multiple key/value pairs can be updated simultaneously
Can be combined with testing (stored procedure)

Arrow direction signifies which component initiates communication
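To make the metadata semantics above concrete, here is a minimal sketch of an atomic, single-object, multi-key update with an optional test step. The class and method names are illustrative only (not L-Store's actual API), and an in-process lock stands in for whatever concurrency control the real metadata layer uses:

```python
import threading

class MetadataObject:
    """Toy stand-in for an L-Store metadata object: a bag of key/value
    pairs guarded by a lock so each operation on the object is atomic."""

    def __init__(self):
        self._kv = {}
        self._lock = threading.Lock()

    def update(self, changes, require=None):
        """Atomically apply several key/value changes to this one object.
        `require` is an optional predicate over the current pairs (the
        "testing" step); if it fails, nothing is written."""
        with self._lock:
            if require is not None and not require(dict(self._kv)):
                return False
            self._kv.update(changes)
            return True

    def get(self, key):
        with self._lock:
            return self._kv.get(key)

obj = MetadataObject()
obj.update({"owner": "alice", "replicas": 2})
# Conditional update: bump the replica count only if alice still owns it.
ok = obj.update({"replicas": 3}, require=lambda kv: kv.get("owner") == "alice")
```

Note the test-then-set happens under one lock acquisition, which is what makes "multiple key/value pairs combined with testing" safe without any cross-object transaction.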

Page 43: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

L-Store Architecture: Depot

Used to store large objects, traditionally file contents
IBP storage
Fault tolerance is done at the L-Store level
Clients directly access the depots
Depot-to-depot transfers are supported
Multiple depot implementations; the ACCRE depot* is actively being developed

Arrow direction signifies which component initiates communication

*http://www.lstore.org/pwiki/pmwiki.php?n=Docs.IBPServer
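Because fault tolerance lives above the depots, the client can stripe a block across several depots with parity, RAID5-style (the slides later mention RAID5 striping). The sketch below is illustrative arithmetic only, not L-Store's implementation: it splits a block across four "data depots" plus one XOR-parity chunk and rebuilds the chunk from a failed depot:

```python
def stripe_with_parity(block, n_data):
    """Split `block` into n_data equal chunks plus one XOR parity chunk."""
    pad = (-len(block)) % n_data
    block += b"\0" * pad                      # pad so chunks divide evenly
    size = len(block) // n_data
    chunks = [bytearray(block[i * size:(i + 1) * size]) for i in range(n_data)]
    parity = bytearray(size)
    for c in chunks:
        for i in range(size):
            parity[i] ^= c[i]
    return chunks, parity, pad

def recover(chunks, parity, lost_index):
    """Rebuild the chunk held by a failed depot from the survivors."""
    rebuilt = bytearray(parity)
    for j, c in enumerate(chunks):
        if j == lost_index:
            continue
        for i in range(len(parity)):
            rebuilt[i] ^= c[i]
    return rebuilt

data = b"hello from a wide-area depot"
chunks, parity, pad = stripe_with_parity(data, 4)
lost = 2                                      # pretend depot 2 went down
restored = recover(chunks, parity, lost)
```

Any single depot failure is survivable because XOR of the parity with the remaining chunks reproduces the missing chunk.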

Page 44: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

L-Store Architecture: Messaging Service

Event notification system
Uses the Advanced Message Queuing Protocol (AMQP)
L-Store provides a lightweight layer to abstract all queue interaction

Arrow direction signifies which component initiates communication
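A thin queue-abstraction layer of the kind described above can be sketched as follows. This is a hypothetical interface, not L-Store's actual one, and an in-memory queue stands in for an AMQP broker such as RabbitMQ:

```python
import queue

class EventBus:
    """Minimal sketch of a lightweight layer hiding all queue interaction.
    Publishers and consumers see only topics and events, never the broker."""

    def __init__(self):
        self._queues = {}  # topic -> queue.Queue

    def publish(self, topic, event):
        self._queues.setdefault(topic, queue.Queue()).put(event)

    def consume(self, topic, timeout=0.1):
        """Return the next event on `topic`, or None if none arrives."""
        q = self._queues.setdefault(topic, queue.Queue())
        try:
            return q.get(timeout=timeout)
        except queue.Empty:
            return None

bus = EventBus()
bus.publish("metadata.changed", {"object": "/data/run42", "op": "set_attr"})
evt = bus.consume("metadata.changed")
```

The point of the abstraction is that swapping the in-memory queue for a real AMQP client changes only the `EventBus` internals, not the services built on it.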

Page 45: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

L-Store Architecture: Services

Mainly handles events triggered by metadata changes
L-Server commands and other services can also trigger events
Typically services are designed to handle operations that are not time-sensitive

Arrow direction signifies which component initiates communication

Page 46: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

L-Store Architecture: Lock Manager

A variation of the Cassandra metadata nodes, focused on providing a simple locking mechanism
Each lock is of limited duration, just a few seconds, and must be continually renewed
Guarantees that a dead client doesn't lock up the entire system
To prevent deadlock, a client must present all objects to be locked in a single call

Arrow direction signifies which component initiates communication
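The two rules above (short renewable leases, all-or-nothing multi-object acquisition) can be sketched in a few lines. This is an illustrative model, not L-Store's wire protocol; the short TTL here is only to make expiry observable:

```python
import time

class LockManager:
    """Lease-style lock manager: locks expire after `ttl` seconds unless
    renewed, and a client must request all objects in one call so the
    manager can grant all-or-nothing, preventing deadlock."""

    def __init__(self, ttl=3.0):
        self.ttl = ttl
        self._locks = {}  # object name -> (owner, expiry)

    def _live(self, name, now):
        held = self._locks.get(name)
        return held if held and held[1] > now else None

    def acquire(self, owner, names):
        """Grant every lock in `names`, or none of them."""
        now = time.monotonic()
        for n in names:
            held = self._live(n, now)
            if held and held[0] != owner:
                return False          # one conflict rejects the whole call
        for n in names:
            self._locks[n] = (owner, now + self.ttl)
        return True

    def renew(self, owner, names):
        """Extend leases the owner still holds; fail if any have lapsed."""
        now = time.monotonic()
        for n in names:
            held = self._live(n, now)
            if not held or held[0] != owner:
                return False
        for n in names:
            self._locks[n] = (owner, now + self.ttl)
        return True

mgr = LockManager(ttl=0.05)
ok = mgr.acquire("client-a", ["objA", "objB"])
blocked = mgr.acquire("client-b", ["objB"])  # held by client-a
time.sleep(0.06)                             # client-a "dies": lease expires
freed = mgr.acquire("client-b", ["objB"])    # lapsed lock is free again
```

Because a client can never hold one lock while waiting on another, the circular-wait condition for deadlock cannot arise, and a crashed client's locks simply lapse.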

Page 47: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

Current functionality

Basic functionality: ls, cp, mkdir, rmdir, etc.
Augment – create an additional copy using a different LUN
Trim – remove a LUN copy
Add/remove arbitrary attributes
RAID5 and linear stripe supported
FUSE mount for Linux and OS X; used for CMS-HI
Basic messaging and services
L-Sync, similar to Dropbox
Tape archive or HSM integration

Page 48: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

L-Store Performance circa 2004

[Figure: throughput plot showing ~3 GB/s aggregate sustained over 30 minutes]

Multiple simultaneous writes to 24 depots. Each depot is a 3 TB disk server in a 1U case. 30 clients on separate systems uploaded files. The rate has scaled linearly as depots were added. The latest-generation depot is 72 TB with 18 Gb/s sustained transfer rates.
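A quick back-of-the-envelope check on those 2004 figures (my arithmetic, not from the slide):

```python
# Aggregate throughput split evenly across the depots in the test.
aggregate_gbytes_per_s = 3.0              # ~3 GB/s total
n_depots = 24
per_depot_gbytes = aggregate_gbytes_per_s / n_depots  # GB/s per depot
per_depot_gbits = per_depot_gbytes * 8                # Gb/s per depot
# 0.125 GB/s = 1 Gb/s per depot: each 1U depot is roughly saturating a
# gigabit link, which is consistent with the rate scaling linearly as
# depots are added.
```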

Page 49: REDDnet: Enabling Data Intensive Science in the Wide Area Alan Tackett Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de ledesma, and Matt Heller Vanderbilt

Summary

REDDnet provides short-term working storage
Scalable in both metadata (ops/s) and storage (MB/s)
Block-level abstraction using IBP
Apache Cassandra for metadata scalability
Flexible framework supporting the addition of user operations