why gridftp? l performance u parallel tcp streams, optimal tcp buffer u non tcp protocol such as udt...

11
Why GridFTP? Performance Parallel TCP streams, optimal TCP buffer Non TCP protocol such as UDT Order of magnitude greater Cluster-to-cluster data movement Another order of magnitude Support for reliable and restartable transfers Multiple security options Anonymous, password, SSH, GSI

Upload: brittney-knight

Post on 12-Jan-2016

213 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Why GridFTP?

Performance Parallel TCP streams, optimal TCP buffer Non TCP protocol such as UDT Order of magnitude greater

Cluster-to-cluster data movement Another order of magnitude

Support for reliable and restartable transfers

Multiple security options Anonymous, password, SSH, GSI

Page 2: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

GridFTP Data Transfers for the Advanced Photon Source

“One Australian user left nearly 1TB of data on our systems that we had been struggling to transfer via standard FTP for several weeks. The typical data rate using standard FTP was ~200 KB/s. Using GridFTP we are now moving data at 6 MB/s—quite a significant boost in performance!” Brian TiemanAdvanced Photon Source

30x speedup9688 miles

Page 3: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Cluster-to-Cluster transfers

Page 4: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Users

HEP community is basing its entire tiered data movement infrastructure for the LHC computing Grid on GridFTP

Southern California Earthquake Center (SCEC), European Space Agency, Disaster Recovery Center in Japan move large volumes of data using GridFTP

An average of more than 2 million data transfers happen with GridFTP every day

Page 5: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

A join activity

This is equivalent to running:

SELECT id, x, y FROM tableOne, tableTwo where table1.id = table2.myID;

Where tableOne and tableTwo are in two different databases

Tuple merge join

SELECT id, x FROM tableOne ORDER by id

Run SQL query

Run SQL query

SELECT myID, y FROM tableTwo ORDER by myID

joinColumn2: myIDjoinColumn1: id

Run SQL query

Run SQL query

Page 6: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

OGSA DAI SQL views Layer above the database to implement views Define views for databases to which you don’t have

write access Parses query Maps view to SQL query over actual database e.g if DrPatient was defined as

SELECT p.id, p.name, p.age, p.sex FROM Patient p, Doctor d WHERE p.DrID = d.ID AND d.dn = $DN$;

Can replace $DN$ by client’s DN from their certificate provided using GT4 security components

Doctors can only view their own patients Factor in the client’s security credentials

Page 7: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Objectives for Data Replication

AA

AA

AA

Improve DurabilitySafeguard against data loss due to disk failure

Improve AvailabilitySafeguard against data inaccessibility due to network partition

Improve PerformanceSafeguard against performance bottlenecks due to resource overload

Page 8: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Data Placement Services: Motivation

Scientific applications often perform complex computational analyses that consume and produce large data sets Computational and storage resources distributed in

the wide area The placement of data onto storage systems can

have a significant impact on performance of applications reliability and availability of data sets

We want to identify data placement policies that distribute data sets so that they can be staged into or out of computations efficiently replicated to improve performance and reliability

Page 9: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Replication occurs when…

Replica Placement I want replica X at sites A, B, and C I want N replicas of each file I want replicas near my compute clusters

Replica Repair Due to replica failure: lost or corrupted But it can be hard to tell the difference

between permanent and temporary failure!

Page 10: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Examples of Placement Policies

Page 11: Why GridFTP? l Performance u Parallel TCP streams, optimal TCP buffer u Non TCP protocol such as UDT u Order of magnitude greater l Cluster-to-cluster

Other Uses

GridFTP can be embedded in applications for high-performance data streaming

GridFTP can be used with SSH-style public keys instead of certificates

RFT can provide a Web services interface to GridFTP

RFT is used by GRAM for file staging OGSA DAI can be used to implement a

metadata service And many more…

OSGCC 2008 Globus Primer: An Introduction to Globus Software 11