pyramid: a large-scale array-oriented active storage system

Pyramid: A large-scale array-oriented active storage system Viet-Trung TRAN, Nicolae Bogdan, Gabriel Antoniu, Luc Bougé KerData Team Inria, Rennes, France 02 09 2011

Upload: viet-trung-tran

Post on 04-Jul-2015


DESCRIPTION

The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.

TRANSCRIPT

Page 1: Pyramid: A large-scale array-oriented active storage system

Pyramid: A large-scale array-oriented active storage system

Viet-Trung TRAN, Nicolae Bogdan, Gabriel Antoniu, Luc Bougé

KerData Team

Inria, Rennes, France 02 09 2011

Page 2: Pyramid: A large-scale array-oriented active storage system

02 09 2011 Viet-Trung Tran - 2

Outline

1. Motivation

2. Architecture

3. Preliminary evaluation

4. Conclusion

Page 3: Pyramid: A large-scale array-oriented active storage system

1. Motivation: Why array-oriented storage?

Page 4: Pyramid: A large-scale array-oriented active storage system

Context: Data-intensive large-scale HPC simulations

• The scalability of data management is becoming a critical issue

• Mismatch between the storage model and the application data model

• Application data model

- Multidimensional typed arrays, images, etc.

• Storage model

- Parallel file systems: simple and flat I/O model (sequence of bytes)

- Relational model: ill-suited for scientific applications

• Additional layers are needed to map the application model to the storage model

Page 5: Pyramid: A large-scale array-oriented active storage system

[M. Stonebraker] The one-storage-fits-all-needs approach has reached its limits

• Parallel I/O stack:

- Performance of non-contiguous I/O vs. data atomicity

• Relational data model:

- Simulating arrays on top of tables performs poorly

- Scalability of join queries

• Need to specialize the I/O stack to match application requirements

- Array-oriented storage for the array data model

• Example: SciDB with ArrayStore.

[Figure: the parallel I/O stack, top to bottom: Application (Visit, Tornado simulation); Data model (HDF5, NetCDF); MPI-IO middleware; Parallel file systems]

Page 6: Pyramid: A large-scale array-oriented active storage system

Our approach

• Multi-dimensional aware chunking

• Lock-free, distributed chunk indexing

• Array versioning

• Active storage support

• Versioning array-oriented access interface


Page 7: Pyramid: A large-scale array-oriented active storage system

Multi-dimensional aware chunking

• Split the array into equal chunks and distribute them over storage elements

- Simplifies load balancing among storage elements

- Keeps neighboring cells in the same chunk

• Shared nothing architecture

- Easier to handle data consistency
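As an illustration of multi-dimensional aware chunking, the sketch below maps cells and sub-array requests to chunk coordinates. The function names and the round-robin placement are assumptions made for this example; the slides only state that equal chunks are distributed over storage elements.

```python
from itertools import product


def chunk_of(cell, chunk_shape):
    """Return the coordinates of the chunk that contains the given cell."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def chunks_covering(offsets, sizes, chunk_shape):
    """List the coordinates of every chunk touched by a sub-array request."""
    ranges = []
    for off, size, cs in zip(offsets, sizes, chunk_shape):
        first = off // cs
        last = (off + size - 1) // cs
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


def server_for(chunk, chunks_per_dim, n_servers):
    """Hypothetical placement: round-robin over the row-major chunk order."""
    linear = 0
    for coord, extent in zip(chunk, chunks_per_dim):
        linear = linear * extent + coord
    return linear % n_servers
```

Because chunks are cut along every dimension, a small square request touches only a few chunks, whereas a row-major flat layout would spread it over many non-contiguous byte ranges.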


Page 8: Pyramid: A large-scale array-oriented active storage system

Lock-free, distributed chunk indexing

• Indexing multi-dimensional information

- R-tree, XD-tree, Quad-tree, etc

- Designed and optimized for centralized management

• A centralized metadata management scheme may not scale

- Bottleneck under high concurrency

• Our approach:

- Port quad-tree-like structures to a distributed environment

- Use shadowing on the quad-tree to enable lock-free concurrent updates


Page 9: Pyramid: A large-scale array-oriented active storage system

Array versioning

• Scientific applications need array versioning (VLDB 2009)

- Checkpointing

- Cloning

- Provenance

• Keep data and metadata immutable

- Updating a chunk is handled at the metadata level using the shadowing technique


Page 10: Pyramid: A large-scale array-oriented active storage system

Active storage support

• Move data computation to storage elements

- Conserve bandwidth

- Better workload parallelization

• Allow users to send user-defined handlers to storage servers


Page 11: Pyramid: A large-scale array-oriented active storage system

Versioning array-oriented access interface

• Basic primitives

- id = CREATE(n, sizes[], defval)

- READ(id, v, offsets[], sizes[], buffer)

- w = WRITE(id, offsets[], sizes[], buffer)

- w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)

• Other primitives, such as cloning and filtering, can mostly be implemented on top of the primitives above
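To make the semantics of the four primitives concrete, here is a minimal single-process sketch. The class `ToyArrayStore` and its internals are hypothetical and are not Pyramid's implementation. It assumes that version 0 is the initial snapshot filled with `defval`, that `READ` returns the values in row-major order, and that `SEND_COMPUTATION` applies `f` cell by cell and stores the result as a new version.

```python
from itertools import product


class ToyArrayStore:
    """In-memory stand-in for the four primitives (illustration only)."""

    def __init__(self):
        # id -> {"sizes": tuple, "defval": value, "versions": [snapshot, ...]}
        self._arrays = {}

    def create(self, n, sizes, defval):
        assert n == len(sizes)
        array_id = len(self._arrays)
        # Version 0 is the empty snapshot: every cell reads as defval.
        self._arrays[array_id] = {
            "sizes": tuple(sizes), "defval": defval, "versions": [{}]}
        return array_id

    @staticmethod
    def _cells(offsets, sizes):
        """Cells of the requested sub-domain, in row-major order."""
        return product(*(range(o, o + s) for o, s in zip(offsets, sizes)))

    def read(self, array_id, v, offsets, sizes):
        arr = self._arrays[array_id]
        snapshot = arr["versions"][v]
        return [snapshot.get(cell, arr["defval"])
                for cell in self._cells(offsets, sizes)]

    def write(self, array_id, offsets, sizes, buffer):
        arr = self._arrays[array_id]
        # Copy the latest snapshot; older snapshots are never modified.
        snapshot = dict(arr["versions"][-1])
        for cell, value in zip(self._cells(offsets, sizes), buffer):
            snapshot[cell] = value
        arr["versions"].append(snapshot)
        return len(arr["versions"]) - 1

    def send_computation(self, array_id, v, offsets, sizes, f):
        # Assumption: f maps one cell value to one cell value.
        result = [f(x) for x in self.read(array_id, v, offsets, sizes)]
        return self.write(array_id, offsets, sizes, result)
```

Since every write yields a new version and old snapshots stay readable, a reader holding version `v` is never disturbed by later writers.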


Page 12: Pyramid: A large-scale array-oriented active storage system

2. Pyramid: Architecture

Page 13: Pyramid: A large-scale array-oriented active storage system


Architecture

• Pyramid is inspired by our previous work: BlobSeer [JPDC 2011]

• Version managers

- Ensure concurrency control

• Metadata managers

- Store index tree nodes

• Storage manager

- Monitor the storage servers

- Enforces a load-balancing strategy for chunks among storage servers

• Active storage servers

- Store chunks and execute handlers on them

• Clients

- Perform I/O accesses

Page 14: Pyramid: A large-scale array-oriented active storage system


Read

• I: optionally ask the version manager for the latest published version

• II: fetch the corresponding metadata from the metadata managers

• III: contact the storage servers in parallel and fetch the chunks into the local buffer

[Figure: read protocol between the client, version managers, metadata managers, and storage servers, with arrows labeled I to III]

Page 15: Pyramid: A large-scale array-oriented active storage system


Write

• I: get a list of storage servers that are able to store the chunks, one for each chunk

• II: contact the storage servers in parallel and write the chunks to the corresponding providers

• III: get a version number for the update

• IV: add new metadata to consolidate the new version

• V: report that the new version is ready for publication

[Figure: write protocol between the client, storage manager, storage servers, version manager, and metadata managers, with arrows labeled I to V]
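The five write steps can be walked through with plain Python containers standing in for the remote services. Everything named here (`toy_write`, `pick_servers`, the dictionary layouts) is hypothetical, the chunk writes are sequential rather than parallel, and the metadata is a flat dictionary instead of a distributed quad-tree.

```python
import uuid


def toy_write(chunks, pick_servers, servers, version_manager, metadata):
    """Write `chunks` ({(x, y): bytes}) and return the new version number."""
    # I: ask for one storage server per chunk.
    placement = pick_servers(list(chunks))

    # II: store each chunk under a fresh id, so existing chunks are
    # never overwritten (data stays immutable).
    locations = {}
    for coord, data in chunks.items():
        chunk_id = uuid.uuid4().hex
        servers[placement[coord]][chunk_id] = data
        locations[coord] = (placement[coord], chunk_id)

    # III: get a version number for the update.
    version = version_manager["next"]
    version_manager["next"] += 1

    # IV: add the metadata that describes the new version.
    metadata[version] = locations

    # V: report that the version is ready for publication.
    version_manager["ready"].append(version)
    return version
```

Note that the version number is requested only after the chunk data is safely stored, which matches the order of steps II and III above.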

Page 16: Pyramid: A large-scale array-oriented active storage system


Lock-free, distributed chunk indexing

• Organized as a quad-tree to index 2D arrays

• Each tree node has at most 4 children, each covering one of the four quadrants

• The tree root covers the whole array

• Each leaf corresponds to a chunk and holds information about its location

• Tree nodes are immutable and uniquely identified by the version number and the sub-domain they cover

• A DHT distributes the tree nodes over the metadata managers
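A node identified by its version number and sub-domain can be modeled as a small immutable key. The sketch below is only an assumption about how such keys might look: `NodeKey`, `quadrants`, and the hash-modulo placement in `metadata_manager_for` are hypothetical names and choices, not the scheme used by Pyramid.

```python
import hashlib
from typing import NamedTuple


class NodeKey(NamedTuple):
    version: int
    x: int     # top-left corner of the covered sub-domain, in chunks
    y: int
    size: int  # side length of the covered square, in chunks


def quadrants(key):
    """Keys of the four quadrants of an inner node, for the same version.

    With shadowing, an actual child may carry an older version number.
    """
    half = key.size // 2
    return [NodeKey(key.version, key.x + dx, key.y + dy, half)
            for dy in (0, half) for dx in (0, half)]


def metadata_manager_for(key, n_managers):
    """Hypothetical DHT placement: hash the key, then take it modulo
    the number of metadata managers."""
    digest = hashlib.sha1(repr(tuple(key)).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_managers
```

Because a key is computable from the version and the sub-domain alone, any client can derive where a node lives without asking a central index.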

Page 17: Pyramid: A large-scale array-oriented active storage system


Tree shadowing for updates

• Write newly created chunks to storage servers

• Build the quad-tree associated with the new snapshot in a bottom-up fashion

- Write the leaves to the DHT

- Inner nodes may point to nodes of previous snapshots (which implies synchronizing quad-tree generation)

- Avoid this synchronization by feeding writers additional information about the other concurrent updaters (thanks to the computable IDs of tree nodes)
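The following sketch shows shadowing on a quad-tree for a single writer at a time; it does not model the handling of concurrent updaters. A plain dictionary stands in for the DHT, and the names `shadow_update` and `locate` as well as the key layout `(version, x, y, size)` are assumptions made for the example.

```python
def shadow_update(dht, prev_root, version, size, updated):
    """Build the tree of `version`, sharing untouched subtrees of prev_root.

    dht maps (version, x, y, size) to a chunk location (leaf) or to a
    tuple of four child keys (inner node). `updated` maps (x, y) chunk
    coordinates to new chunk locations. Returns the key of the new root.
    """
    def build(x, y, size, old_key):
        touched = any(x <= cx < x + size and y <= cy < y + size
                      for cx, cy in updated)
        if not touched:
            return old_key  # reuse the previous snapshot's subtree (or None)
        key = (version, x, y, size)
        if size == 1:
            dht[key] = updated[(x, y)]
            return key
        half = size // 2
        old_children = dht[old_key] if old_key is not None else (None,) * 4
        quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
        # Children are stored before their parent: bottom-up construction.
        dht[key] = tuple(build(qx, qy, half, oc)
                         for (qx, qy), oc in zip(quads, old_children))
        return key

    return build(0, 0, size, prev_root)


def locate(dht, root, cx, cy):
    """Follow the tree from `root` down to the leaf of chunk (cx, cy)."""
    key = root
    while key is not None:
        _, x, y, size = key
        if size == 1:
            return dht[key]
        half = size // 2
        idx = (1 if cx >= x + half else 0) + (2 if cy >= y + half else 0)
        key = dht[key][idx]
    return None  # chunk never written
```

Only the nodes on the paths to the updated chunks are created; every other subtree is shared with the previous snapshot, so no existing node is ever modified.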

Page 18: Pyramid: A large-scale array-oriented active storage system


Efficient parallel updating

• Chunks are written concurrently

• Versions are assigned in the order the clients finish writing

• Clients get additional information about the other concurrent writers

• Tree nodes are written in a lock-free manner

• Versions are published in the order they were assigned

[Figure: two clients (#1 and #2) writing concurrently through the storage servers, metadata managers, and version manager; each update ends with a Publish step]
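The rule that versions are published in assignment order, even when writers finish their metadata out of order, can be sketched as follows. `ToyVersionManager` is a hypothetical single-threaded stand-in; a real version manager would need to synchronize these calls.

```python
class ToyVersionManager:
    """Assigns version numbers and publishes them strictly in order."""

    def __init__(self):
        self._next = 1        # next version number to assign
        self._published = 0   # latest published version (0 = initial array)
        self._ready = set()   # versions whose metadata is complete

    def assign(self):
        version = self._next
        self._next += 1
        return version

    def report_ready(self, version):
        """Record that a version is complete; return the latest published one."""
        self._ready.add(version)
        # Publish in assignment order: stop at the first version that
        # has not been reported ready yet.
        while self._published + 1 in self._ready:
            self._published += 1
            self._ready.remove(self._published)
        return self._published

    def latest_published(self):
        return self._published
```

If client #2 reports before client #1, its version stays hidden until version 1 is ready, at which point both become visible together.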

Page 19: Pyramid: A large-scale array-oriented active storage system


Some more I/O primitives

• Easily implemented thanks to immutable data and metadata blocks

• Cheap I/O operators

• Clone a sub-domain

- Follow the metadata tree of a specific snapshot

- Create a new metadata tree and publish it as a newly created array

• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers

Page 20: Pyramid: A large-scale array-oriented active storage system

3. Preliminary evaluation: experiments on G5K (www.grid5000.fr)

Page 21: Pyramid: A large-scale array-oriented active storage system


Experimental setup

Simulate a common access pattern exhibited by scientific applications: array dicing

• Up to 130 nodes of the Graphene cluster on G5K

- 1 Gbps Ethernet interconnect

- Pyramid and the competitor system PVFS are deployed on 49 nodes

• Array dicing

- Each client accesses a dedicated sub-array

- 1 GB per client, consisting of 32x32 chunks (chunk size: 1024x1024 bytes)

- Concurrent reading/writing

• Measure the performance and scalability
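As a quick check of the per-client data volume, the arithmetic can be written out as below. The helper `total_bytes` is a hypothetical convenience for scaling the figure to a number of clients and does not reproduce any measured result.

```python
chunks_per_client = 32 * 32   # 32x32 chunks in each client's sub-array
chunk_bytes = 1024 * 1024     # each chunk holds 1024x1024 bytes (1 MiB)

bytes_per_client = chunks_per_client * chunk_bytes
assert bytes_per_client == 2**30  # 1 GiB, the "1 GB per client" of the setup


def total_bytes(n_clients):
    """Aggregate volume accessed when n_clients each dice 1 GiB."""
    return n_clients * bytes_per_client
```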

Page 22: Pyramid: A large-scale array-oriented active storage system


Aggregated throughput achieved under concurrency

• PVFS suffers from the non-contiguous access pattern caused by serializing the array to a flat file

• Pyramid

- Throughput increases steadily

- Promising scalability of both data and metadata organization

Page 23: Pyramid: A large-scale array-oriented active storage system

4. Conclusion

Page 24: Pyramid: A large-scale array-oriented active storage system


Conclusion

• Pyramid is an array-oriented active storage system

• It offers support for

- Parallel array processing for both read and write workloads

- Data versioning

- Distributed metadata management, with shadowing to reflect updates

• Preliminary evaluation shows promising scalability

• Future work

- Planned integration with HDF5

- Pyramid as a storage engine for SciDB?

- Investigate keeping data at quad-tree nodes, which could be used to store arrays at different resolutions (map applications)

Page 25: Pyramid: A large-scale array-oriented active storage system

Thank you

INRIA – KerData Research Team

www.irisa.fr/kerdata