Post on 03-Jan-2016


Beolink.org

myS3

Fabrizio Manfredi Furuholmen, Federico Mosca

FOSDEM 2014

Agenda

• Introduction: Goals, Principles

• myS3: Architecture, Internals, Sub project

• Conclusion: Developments

Unsolved problem


Web Interface

“Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web…”

S3

• Every file you upload to Amazon S3 is stored in a container called a bucket.
• Each bucket name must be unique.
• Each bucket can contain an unlimited number of objects (key/value).
• Buckets cannot be nested: you cannot create a bucket within a bucket.
• Object
  – Id
  – Version
  – Metadata
  – Subresources
  – ACL
• HTTP REST calls
• Byte-range transfer
• Parallel transfer

myS3

Translate S3 requests to the local disk

Mapping

An S3 bucket is a directory in the AFS space

An S3 object is a file or a directory, the directory

S3 ACL: fake object

AFS ACL permissions are returned as S3 metadata

Unix permissions are returned as S3 metadata

All other S3 features are not implemented

S3 Request

GET /mybucket/puppy.jpg HTTP/1.1
User-Agent: dotnet
Host: s3.amazonaws.com
Date: Tue, 15 Jan 2008 21:20:27 +0000
x-amz-date: Tue, 15 Jan 2008 21:20:27 +0000
Authorization: AWS AKIAIOSFODNN7EXAMPLE:k3nL7gH3+PadhTEVn5EXAMPLE

Objects in the same bucket have no relation to each other: there is no hierarchy.

GET /mybucket/puppy.jpg
GET /mybucket/yesterday/puppy.jp

“yesterday” doesn’t exist

S3 Request

To retrieve directory content:
- Prefix set to the parent directory
- Delimiter ‘/’ to mark the end of a name

To create a directory:
- Object name with ‘/’ at the end

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>ExampleBucket</Name>
  <Prefix>/mydir/</Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
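The prefix/delimiter listing can be emulated over a flat key space. The following is a minimal sketch, assuming the keys are plain strings held in memory; `list_directory` and its return shape are illustrative names, not part of myS3 or of the S3 API.

```python
def list_directory(keys, prefix="", delimiter="/"):
    """Emulate an S3 listing with Prefix and Delimiter over flat keys.

    Returns (contents, common_prefixes): the keys directly under the prefix,
    and the "sub-directories" rolled up at the first delimiter.
    """
    contents, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything up to the first delimiter acts as a sub-directory.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            contents.append(key)
    return contents, sorted(common_prefixes)


# Hypothetical bucket content; a key ending in "/" marks a directory.
keys = [
    "mydir/",
    "mydir/puppy.jpg",
    "mydir/yesterday/",
    "mydir/yesterday/puppy.jpg",
    "other.txt",
]
contents, subdirs = list_directory(keys, prefix="mydir/")
```

Because the keys are flat, "mydir/yesterday/" only shows up as a directory when a key with that prefix exists, which is why the directory marker object (a name ending in ‘/’) is needed for empty directories.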

AWS Auth

Authorization = "AWS" + " " + AWSAccessKeyId + ":" + Signature;

Signature = Base64( HMAC-SHA1( YourSecretAccessKeyID, UTF-8-Encoding-Of( StringToSign ) ) );

StringToSign = HTTP-Verb + "\n" +
    Content-MD5 + "\n" +
    Content-Type + "\n" +
    Date + "\n" +
    CanonicalizedAmzHeaders +
    CanonicalizedResource;

CanonicalizedResource = [ "/" + Bucket ] +
    <HTTP-Request-URI, from the protocol name up to the query string> +
    [ subresource, if present. For example "?acl", "?location", "?logging", or "?torrent" ];

CanonicalizedAmzHeaders = <described below>
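The signature formula can be sketched with the Python standard library. This is a simplified illustration, not the myS3 code: the function name and parameters are assumptions, the secret key is a placeholder, and details such as multi-value headers and the handling of `Date` when `x-amz-date` is present are left out.

```python
import base64
import hashlib
import hmac


def sign_s3_request(secret_key, verb, resource, date,
                    content_md5="", content_type="", amz_headers=None):
    """Return the Signature part of an 'AWS AccessKeyId:Signature' header."""
    # CanonicalizedAmzHeaders: lower-cased x-amz-* headers, sorted, one per line.
    canonical_amz = "".join(
        f"{name}:{value.strip()}\n"
        for name, value in sorted(
            (k.lower(), v) for k, v in (amz_headers or {}).items()
        )
        if name.startswith("x-amz-")
    )
    string_to_sign = (
        f"{verb}\n{content_md5}\n{content_type}\n{date}\n"
        f"{canonical_amz}{resource}"
    )
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder credentials, for illustration only.
signature = sign_s3_request("example-secret-key", "GET", "/mybucket/puppy.jpg",
                            "Tue, 15 Jan 2008 21:20:27 +0000")
authorization = "AWS " + "AKIAIOSFODNN7EXAMPLE" + ":" + signature
```

A server verifies a request by rebuilding the same string from the received headers, signing it with the stored secret key, and comparing the two signatures.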

Authentication

IP based: computer account; the authentication of the users is handled by the internal DB

Impersonate: forge the ticket for the users on the server side; the authentication is handled by the internal DB

Token generation: web interface authentication (Kerberos auth), one-time AWS token generation

Server Architecture

[Diagram: layers Interface, Managers, Drivers, Plugin; components S3 Interface, Web Interface, Storage Manager, Auth Manager, Bucket Manager, Token Manager, Storage Driver, Cache, /afs]

InternalDB

Bucket DB: contains the map between the bucket name and the AFS path

ex. Myhome -> /afs/beolink/home/manfred

Token DB: contains the access key and secret key for Amazon authentication; with web-based authentication the DB contains the Kerberos token
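The Bucket DB lookup can be pictured as a dictionary from bucket name to AFS path. The sketch below is an assumption about how such a mapping could be applied to a request, not the myS3 implementation; `BUCKET_DB` and `resolve_path` are hypothetical names, and the escape check is an added safeguard.

```python
import posixpath

# Hypothetical in-memory Bucket DB; the entry mirrors the example above.
BUCKET_DB = {"Myhome": "/afs/beolink/home/manfred"}


def resolve_path(bucket, key, bucket_db=BUCKET_DB):
    """Map an S3 bucket/key pair to a path under the bucket's AFS directory."""
    root = bucket_db[bucket]  # KeyError means "no such bucket"
    path = posixpath.normpath(posixpath.join(root, key.lstrip("/")))
    # Refuse keys such as "../other" that would escape the bucket directory.
    if path != root and not path.startswith(root + "/"):
        raise ValueError(f"key escapes bucket: {key!r}")
    return path


local_path = resolve_path("Myhome", "photos/puppy.jpg")
```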

Storage Manager

NFS style: most of the operations are made on a temporary file (.NFSXXX)

Caching: save temporary files in non-AFS space

NoWait: return OK as soon as the file is on the S3 server

Mem: keep transferred files in memory (max 100 MB)

ACL: enable write operations on AFS ACLs

MD5: enable or disable MD5

TODO

• Parallel transfer
• Locking
• Kerberos token based
• Chunk transfer (HTTP 100) / byte-range transfer
• Create an interface for CloudStack
• Automatic volume release


RestFS

GOAL

Create a framework for testing new technologies and paradigms

Principle 1/3

“Moving Computation is Cheaper than Moving Data”

Principle 2/3

“There is always a failure waiting around the corner”

* Werner Vogels

Principle 3/3

“Decompose into small loosely coupled, stateless building blocks”

* ‘Leaving a Legacy System Revisited’, Chad Fowler

Five pylons

Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash

Cache
• Client side
• Callback/Notify
• Persistent

Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference

Distribution
• Resource discovery by DNS
• Data spread on a multi-node cluster
• Decentralized
• Independent clusters
• Data replication

Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin delegation

RestFS Key Words

Cell: collection of servers

Bucket: virtual container, hosted by one or more servers

Object: entity (file, dir, …) contained in a bucket

Object

[Diagram labels: Object, Data, Metadata, Segments, Block 1, Block 2, Block …, Block n, Hash, Serial, Attributes set by user, Properties, ACL, Ext Properties]

Bucket Discovery

[Diagram: Client, DNS lookup, Cell 1 and Cell 2 (N servers each). Labels: Bucket name -> Cell RL IP list; Bucket name -> Server list + load info; server list columns Server, Priority, Type (IP 1, …); Server list priority list]

RestFS Cache client side

[Diagram labels: DNS, RestFS Metadata, RestFS Block, Federated Auth, Callbacks, Metadata cache, Block cache, Persistent Cache, Resource Locator, Server List, Tokens, Pub/Sub List, Temporary, Locks]

Server Architecture

[Diagram: layers Interface, Service, Managers, Drivers, Plugin, Backends. Interfaces: S3, RestFS RPC. Services: Meta, RL, Callback, Auth, Token, Block, Locks. Managers: Storage, Auth, Meta, Resource, Callbacks, Token, Locks. Drivers: Storage, Token, Meta, Auth, Callbacks, Resource, Locks. Other labels: Distributed Cache, Resource Locator, Auth, Token Sub/Pub]

Mounting

[Diagram: two cells, each with Bucket N and its Objects]

Object Versioning

[Diagram: Cell, Bucket N, and a chain of three Objects]

The segment contains the diff to the upstream object.

Each object knows the previous and the next. The current object knows the previous and the last.


Block Storage

Backend: Consistent Hashing

Number of keys to move when adding or removing a node:

Keys / Nodes = keys to relocate

Blocks are collected in shards

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
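The "Keys / Nodes" property can be illustrated with a small hash ring. This is a generic sketch of consistent hashing, not the RestFS backend: the class name, the use of SHA-1 for ring positions, and the 64 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _point(label):
        return int(hashlib.sha1(label.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._point(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around.
        idx = bisect.bisect(self._ring, (self._point(key), "")) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"block-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that now belong to "node-d" change owner; with four nodes that is roughly a quarter of the keys, while a plain `hash(key) % N` scheme would reshuffle most of them.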

Block Storage

AFS
- A volume stores a range of hashes
- Each chunk is written to 3 volumes
- Server

PISA
- Cluster of nodes
- Communication based on zmq
- Consensus based on Raft

CEPH
- Use CEPH nodes directly

Backend: Storage

3 copies. Configurable read and write consistency level and security:
- 2W1R
- 2W2R
- 1W1R
- …

Monitoring of neighbors in small clusters of 3 nodes (gossip)

Mini-cluster election, key space reclaim for replica coordination, leave/join cluster
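The read/write levels can be compared with the usual quorum rule: with N copies, a read is guaranteed to see the latest acknowledged write only when W + R > N. The helper below is a hedged illustration of that general rule, not a description of how RestFS enforces consistency; the function name is made up for the example.

```python
def quorums_overlap(n_copies, writes, reads):
    """True when every read quorum intersects every write quorum (W + R > N)."""
    return writes + reads > n_copies


# The levels listed above, evaluated for 3 copies.
levels = {"2W1R": (2, 1), "2W2R": (2, 2), "1W1R": (1, 1)}
overlap = {name: quorums_overlap(3, w, r) for name, (w, r) in levels.items()}
```

Under this rule only 2W2R gives overlapping quorums with 3 copies; 2W1R and 1W1R trade that guarantee for lower latency.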


Protocols

Europython 2013

RestFS Protocol

WebSocket is a web technology providing bi-directional, full-duplex communication channels over a single TCP connection. It uses the standard HTTP/HTTPS port.

GET /mychat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
Sec-WebSocket-Protocol: chat
Sec-WebSocket-Version: 13
Origin: http://example.com

JSON-RPC is a lightweight remote procedure call protocol similar to XML-RPC. It is designed to be simple, and it is simple to convert to a Python dict.

--> { "method": "readBlock", "params": ["…"], "id": 1}
<-- { "result": [..], "error": null, "id": 1}

BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. BSON can be compared to binary interchange formats.

{"hello": "world"} → "\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"

* Compression is a long story…
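The BSON bytes shown for {"hello": "world"} can be reproduced by hand. The encoder below is a minimal sketch that handles only flat documents with string values (BSON element type 0x02); a real deployment would use a full BSON library, and the function name is an assumption for this example.

```python
import struct


def bson_encode_flat(doc):
    """Encode a flat dict of str -> str as a BSON document."""
    body = b""
    for key, value in doc.items():
        if not isinstance(value, str):
            raise TypeError("this sketch only supports string values")
        data = value.encode("utf-8") + b"\x00"
        # Element: type byte, key as C string, int32 length, string with NUL.
        body += (b"\x02" + key.encode("utf-8") + b"\x00"
                 + struct.pack("<i", len(data)) + data)
    # Document: int32 total size (including itself), elements, trailing NUL.
    total = 4 + len(body) + 1
    return struct.pack("<i", total) + body + b"\x00"


encoded = bson_encode_flat({"hello": "world"})
```

The leading `\x16` is the document length (22 bytes) and `\x06` is the length of "world" plus its terminating NUL.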

Protocols Metadata

Collecting per segment, parallel request per segment:

{ "method": "readBlock", "params": ["bucket_name: test, segment:1, blocks:[1,2,3,4]"], "id": 1}

Check cached data:

{ "method": "getSegmentVer", "params": ["bucket_name: test, segment:1"], "id": 1}

<-- { "result": [ver: 1335519328.091779], "error": null, "id": 1}

Block hash list for a specific segment:

{ "method": "getSegmentHash", "params": ["bucket_name: test, segment:1"], "id": 1}

<-- { "result": [1:16db0420c9cc29a9d89ff89cd191bd2045e47378, 2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c, …], "error": null, "id": 1}
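A client can combine these calls to fetch only the blocks whose hash changed. The sketch below is one possible reading of that flow, not the RestFS client: the helper names are hypothetical, and the parameters are passed as a structured dict rather than the single string shown in the examples.

```python
import json


def make_request(method, params, request_id=1):
    """Serialize a JSON-RPC style request."""
    return json.dumps({"method": method, "params": params, "id": request_id})


def parse_hash_list(entries):
    """Turn ["1:<hash>", "2:<hash>"] into {1: "<hash>", 2: "<hash>"}."""
    parsed = {}
    for entry in entries:
        position, block_hash = entry.split(":", 1)
        parsed[int(position)] = block_hash
    return parsed


def blocks_to_fetch(cached, remote):
    """Positions whose cached hash is missing or differs from the server's."""
    return sorted(pos for pos, h in remote.items() if cached.get(pos) != h)


# Hash list as returned by getSegmentHash in the example.
remote = parse_hash_list([
    "1:16db0420c9cc29a9d89ff89cd191bd2045e47378",
    "2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c",
])
# Hypothetical local cache: block 1 is current, block 2 is stale.
cached = {1: "16db0420c9cc29a9d89ff89cd191bd2045e47378", 2: "stale"}
needed = blocks_to_fetch(cached, remote)
request = make_request(
    "readBlock", [{"bucket_name": "test", "segment": 1, "blocks": needed}]
)
```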


NOSQL DB

Redis performance

$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q

SET: 552028.75 requests per second
GET: 707463.75 requests per second
LPUSH: 767459.75 requests per second
LPOP: 770119.38 requests per second


Code

Pluggable

Protocol
• Connection handler
• Data transcoding

Service
• High-level operations across multiple functions (like locking)
• Integrity operations/transactions

Manager
• Operations handler for a specific area (e.g. metadata)
• Split info into sub-info

Driver
• Read and write operations to the storage system, agnostic operations

[Side label: Interface, dynamic load]
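Dynamic loading of a layer can be done in Python with `importlib`. The sketch below only illustrates the idea: `load_plugin` and the `CONFIG` mapping are assumptions, and a standard-library class stands in for a real driver so the example runs anywhere.

```python
import importlib


def load_plugin(dotted_name):
    """Import 'package.module.ClassName' at run time and return the class."""
    module_name, _, class_name = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Hypothetical configuration: layer name -> class to load.
CONFIG = {"encoding": "json.JSONEncoder"}
encoder = load_plugin(CONFIG["encoding"])()
encoded = encoder.encode({"hello": "world"})
```

Swapping a driver then becomes a configuration change, as long as every class loaded for a layer exposes the same interface.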

Support

Thank you

http://restfs.beolink.org


Bucket

Bucket name: zebra

Property (Python dict, default parameters):
segment_size = 512
block_size = 16k
max_read = 1000
Bucket_size = 0
Bucket_quota = 10000
storage_class = STANDARD
compression = none
logging = enable
bucket_type = fs
…

The bucket has many properties. The property element is a collection of object information; with this element you can retrieve the default values for the bucket (logging level, security level, etc.).

Properties objects:
- Property
- Property Ext
- Property ACL
- Property Stats

Bucket types:
- Filesystem: the bucket is used as a filesystem
- Logging: logging of the operations done on the specific bucket
- Replica RO: bucket shadow replication
- …
- Custom definition


Objects


MetaData Properties

Object: zebra.c1d2197420bd41ef24fc665f228e2c76e98da247

Property:
Object_type = data
segment_size = 512
block_size = 16k
content_type =
md5 = ab86d732d11beb65ed0183d6a87b9b0
max_read = 1000
storage_class = STANDARD
compression = none
Name = "my first object"
Object_size = 10000
Object_prev = zebra.c1d2197420bd41ef24fc665f228e2c76e98dartg…
vers: 1335519328.091779

Annotations:
- Bucket name
- Object id (the special id bucket_name.ROOT is the starting point of the file system)
- Object default
- Object version
- Object hash (replaced by Merkle tree)
- Pointer to the previous object

Object type:
- Data: contains files
- Folder: special object that contains other objects
- Mount point: contains the name of the buckets
- Link: contains the name of the objects
- Immutable: gold image
- Custom: defined by the users

Metadata Segment

Segment: Segment-1

Segment-id (Python dict):
1:16db0420c9cc29a9d89ff89cd191bd2045e47378
2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c
3:158aa47df63f79fd5bc227d32d52a97e1451828c
4:1ee794c0785c7991f986afc199a6eee1fa4
5:c3c662928ac93e206e025a1b08b14ad02e77b29d
…
vers:1335519328.091779

Segment element: block position: integrity hash

Version based on timestamp + incremental, useful for vector-clock conflict resolution

Total segments = Data_size / (block_size * segment_size)

Restfs ID

Id Bucket: plain-text DNS name

Id Object: UUID, random generation

Id segment and id block: based on the position of the content

Chunk data on the storage: SHA-1 hash of the concatenation of Bucket.object.segment.block_id

The object id is unique inside the bucket; together with the bucket name, the id is a UUID
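The id scheme can be sketched in a few lines. This assumes the four parts are joined with dots exactly as written above and encoded as UTF-8 before hashing; the function names are illustrative and the separator and encoding are assumptions.

```python
import hashlib
import uuid


def new_object_id():
    """Random UUID used as the object id (unique inside the bucket)."""
    return uuid.uuid4().hex


def chunk_id(bucket, object_id, segment, block):
    """SHA-1 of 'bucket.object.segment.block', naming the stored chunk."""
    name = f"{bucket}.{object_id}.{segment}.{block}"
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


object_id = new_object_id()
first_chunk = chunk_id("zebra", object_id, 1, 1)
second_chunk = chunk_id("zebra", object_id, 1, 2)
```

Because the chunk name depends only on the position of the content, any node can compute where a block lives without a lookup table.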


Cache


Server Side

Client Side

Distributed Cache

Publish Subscribe

Pattern matching

Persistent cache


Security

Protocol
• SSL protocol

Authentication
• Token for devices (enrollment)
• Session token for user
• External password provider

Data Integrity
• Encryption on block level

Authorization
• Extended ACL based on NFS4 ACL
• Admin delegation on the bucket level


Redis as much as possible

Main characteristics:
- Fast
- Stores hash of hash
- Atomic operations
- Pub/Sub primitives

HASH of HASH

Primary key: zebra.c1d2197420bd41ef24fc665f228e2c76e98da247 (object id; dot format to simplify subscription operations (callback))

Subkey: GLP (name of the properties)

Value: 00101010101010 (serialized Python dict, BSON in the future)

* The version and the hash of the objects have dedicated subkeys, no serialization
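The key/subkey/value layout can be mocked with nested dictionaries. This is an in-memory stand-in written for illustration, not the RestFS metadata driver: the class and function names are hypothetical, JSON replaces the serialized Python dict, and the subkey names "properties" and "vers" are assumptions.

```python
import json


class HashOfHash:
    """In-memory stand-in for a key -> subkey -> value store."""

    def __init__(self):
        self._data = {}

    def hset(self, key, subkey, value):
        self._data.setdefault(key, {})[subkey] = value

    def hget(self, key, subkey):
        return self._data.get(key, {}).get(subkey)


def save_properties(store, object_key, properties, version):
    # Property groups are serialized; the version gets its own plain subkey.
    store.hset(object_key, "properties", json.dumps(properties))
    store.hset(object_key, "vers", version)


store = HashOfHash()
key = "zebra.c1d2197420bd41ef24fc665f228e2c76e98da247"
save_properties(store, key, {"segment_size": 512, "compression": "none"},
                "1335519328.091779")
loaded = json.loads(store.hget(key, "properties"))
```

Keeping the version outside the serialized blob lets a client check freshness by reading one small subkey instead of deserializing the whole property set.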


What we are using

Module            Software
Storage           Filesystem, DHT (Kademlia, Pastry*)
Metadata          SQL (MySQL, SQLite), NoSQL (Redis)
Auth              OAuth (Google, Twitter, Facebook), Kerberos*, internal
Protocol          WebSocket
Message format    JSON-RPC 2.0, Amazon S3
Encoding          Plain, BSON
Callback          Subscribe/Publish WebSocket/Redis, Async I/O TornadoWeb, ZeroMQ*
Hash              Sha-XXX, MD5-XXX, AES
Encryption        SSL, ciphers supported by crypto++
Discovery         DNS, file based*

* are planned

What is it good for?

User
• Home directory
• Remote/Internet disks

Application
• Object storage
• Shared space
• Virtual machine

Distribution
• CDN (multimedia)
• Data replication
• Disaster recovery

Backend: Storage

Transport layer: ZeroMQ

Storage: compressed data