![Page 2: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/2.jpg)
Outline
• Motivation
• Requirements for Scientific Data Transfer
• Related Work
• Our Proposal: GridTorrent Framework
• Test Results
• Summary
• Questions
![Page 3: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/3.jpg)
Motivation
• Computational science is becoming data intensive.
• Scientists are faced with mountains of data that stem from four sources [1]:
  1. New scientific instruments double their output every year or so.
  2. Simulations generate floods of data.
  3. The Internet and computational Grid allow the replication, creation, and recreation of more data [2].
![Page 4: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/4.jpg)
Motivation (cont.)
Scientific discovery is increasingly driven by data collection [3]:
• Computationally intensive analyses
• Massive data collections
• Data distributed across networks of varying capability
• Internationally distributed collaborations
Data Intensive Science: 2000-2015
Dominant factor: data growth (1 Petabyte = 1000 TB)
• 2000: ~0.5 Petabyte
• 2005: ~10 Petabytes
• 2010: ~100 Petabytes
• 2015: ~1000 Petabytes?
![Page 5: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/5.jpg)
Motivation (cont.)
The scientific applications that generate petabytes of data are very diverse:
– Fusion power
– Climate modeling
– Earthquake engineering
– Astronomy
– Bioinformatics
– High-energy physics
![Page 6: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/6.jpg)
Motivation (cont.)
Some examples []:
• Climate modeling: the Community Climate System Model and other simulation applications generate 1.5 petabytes/year.
• Bioinformatics: the Pacific Northwest National Laboratory is building new confocal microscopes that will generate 5 petabytes/year.
• High-energy physics: the Large Hadron Collider (LHC) project at CERN will create 15 petabytes/year.
![Page 7: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/7.jpg)
Motivation Conclusion
• The scientific community has large sets of distributed data.
• Scientists who want to analyze or work together on the same data are geographically dispersed.
![Page 8: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/8.jpg)
Requirements for Scientific Data Transfer
Transferring scientific data at large scale requires a solution that is:
• efficient
• high-performance
• reliable
• secure
• under policy-aware management
• a balanced system (CPU farms, storage, network)
![Page 9: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/9.jpg)
Is it a new problem?
The answer is no. There have been attempts to meet the above requirements, such as:
• GridFTP
• GridFTPXIO
• GridHTTP
• TeraGrid Copy (TGCP)
• The Replica Location Service (RLS)
• gLite
![Page 10: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/10.jpg)
GridFTP
• Extension of the standard FTP protocol
• Reliable, secure, high performance
• Efficient
• The de facto standard for transferring data in many Grid projects
• However, GridFTP does not offer a web service interface.
![Page 11: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/11.jpg)
GridFTP (cont.)
Additional features supported by the GridFTP protocol:
• Grid Security Infrastructure (GSI) and Kerberos support
• Reliable and restartable data transfer: transfers restart from the point of failure when a failure occurs
• Partial file transfer: only selected regions of a file are transferred
• Parallel data transfer: multiple TCP streams between two network endpoints to improve aggregate bandwidth
• Third-party control of data transfer: the ability to control transfers between storage servers from a remote (third-party) server
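The partial, parallel, and restartable transfer features can be pictured with a small sketch. It is not GridFTP code and does not use any GridFTP API; the function names `split_ranges` and `remaining_after_failure` are hypothetical, chosen only to show how a file might be divided into regions (one per stream) and how a region could be resumed from the point of failure.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (offset, length) in bytes


def split_ranges(file_size: int, streams: int) -> List[Region]:
    """Split [0, file_size) into `streams` contiguous regions of near-equal size."""
    if file_size < 0 or streams < 1:
        raise ValueError("file_size must be >= 0 and streams must be >= 1")
    base, extra = divmod(file_size, streams)
    regions = []
    offset = 0
    for i in range(streams):
        # The first `extra` regions take one additional byte each.
        length = base + (1 if i < extra else 0)
        regions.append((offset, length))
        offset += length
    return regions


def remaining_after_failure(region: Region, bytes_done: int) -> Region:
    """Return the part of a region still to be sent after an interrupted transfer."""
    offset, length = region
    done = min(max(bytes_done, 0), length)
    return (offset + done, length - done)


# Example: a 300 MB file split across 4 parallel streams.
regions = split_ranges(300 * 1024 * 1024, 4)
```

Each region could then be handed to its own stream, and a stream that fails partway only needs to request the region returned by `remaining_after_failure`.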
![Page 12: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/12.jpg)
GridHTTP
• Allows large (gigabyte) files to be transferred at optimal speeds using HTTP
• Does not deviate from existing HTTP standards, but describes how to use existing headers and methods to produce an encrypted data stream
• Supports bulk data transfers via unencrypted HTTP
• Supports authentication and authorization with the usual grid credentials over HTTP
![Page 13: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/13.jpg)
GridFTPXIO
The Globus eXtensible Input/Output (XIO) System:
• provides an abstraction layer to transport protocols
• enables different I/O problems to be presented uniformly as a simple open/close/read/write (OCRW) interface
• is a support framework for developing communication protocols
• is an interface that enables existing applications written with XIO to access their hardware
Primary usage scenarios:
• Independence from the Transmission Control Protocol (TCP)
• Ease of adding GridFTP support to third-party applications
• Ease of providing GridFTP access to data storage
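The idea of a uniform open/close/read/write (OCRW) interface can be illustrated with a minimal sketch. This is not the Globus XIO API (which is a C library); the `Transport` and `MemoryTransport` classes and the `copy` helper are hypothetical names used only to show how application code can stay unchanged while the underlying driver varies.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical OCRW interface: every driver exposes the same four calls."""

    @abstractmethod
    def open(self, target: str) -> None: ...

    @abstractmethod
    def read(self, size: int) -> bytes: ...

    @abstractmethod
    def write(self, data: bytes) -> int: ...

    @abstractmethod
    def close(self) -> None: ...


class MemoryTransport(Transport):
    """Toy in-memory driver standing in for a real protocol or storage driver."""

    def __init__(self, initial: bytes = b"") -> None:
        self._data = bytearray(initial)
        self._pos = 0
        self._is_open = False

    def open(self, target: str) -> None:
        self._pos = 0
        self._is_open = True

    def read(self, size: int) -> bytes:
        self._check_open()
        chunk = bytes(self._data[self._pos:self._pos + size])
        self._pos += len(chunk)
        return chunk

    def write(self, data: bytes) -> int:
        self._check_open()
        self._data.extend(data)
        return len(data)

    def close(self) -> None:
        self._is_open = False

    def contents(self) -> bytes:
        return bytes(self._data)

    def _check_open(self) -> None:
        if not self._is_open:
            raise IOError("transport is not open")


def copy(src: Transport, dst: Transport, chunk_size: int = 4) -> int:
    """Copy everything from src to dst using only the four OCRW calls."""
    src.open("source")
    dst.open("destination")
    total = 0
    try:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            total += dst.write(chunk)
    finally:
        src.close()
        dst.close()
    return total
```

Because `copy` depends only on the abstract interface, swapping `MemoryTransport` for a driver backed by another protocol or storage system would not require changing it.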
![Page 14: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/14.jpg)
TeraGrid Copy (TGCP)
The TeraGrid Copy (TGCP) solution includes three main components:
• GridFTP Service
• RFT Service
• TGCP shell script
In the striped configuration:
• the GridFTP service runs on several nodes of a cluster
• the data to be transferred is partitioned among the nodes
• each node may use several parallel streams to attain maximum performance
![Page 15: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/15.jpg)
TGCP (cont.)
The tgcp script can use the globus-url-copy tool in either:
(A) third-party transfer mode, or
(B) conventional GridFTP client mode
![Page 16: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/16.jpg)
TGCP (cont.)
The RFT Service is used to manage the transfer:
• it adds additional reliability to the transfer request
• the transfer will be completed even if a failure occurs during the transfer
![Page 17: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/17.jpg)
The Replica Location Service (RLS)
• Provides a framework for tracking the physical locations of data that has been replicated
• Maps logical names to physical names
• Replication of data items can reduce access latency, improve data locality, and increase robustness, scalability, and performance for distributed applications
• Does not operate in isolation; it is used with other components such as the Reliable File Transfer service, GridFTP, and the Metadata Catalog Service
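The mapping from logical names to physical names can be sketched as a simple catalog. This is only an illustration of the idea, not the RLS API; the `ReplicaCatalog` class, the logical name, and the `gsiftp://` URLs on `example.org` hosts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ReplicaCatalog:
    """Toy catalog: one logical name maps to the set of its physical replicas."""

    def __init__(self) -> None:
        self._replicas: Dict[str, Set[str]] = defaultdict(set)

    def register(self, logical_name: str, physical_url: str) -> None:
        """Record that a replica of `logical_name` exists at `physical_url`."""
        self._replicas[logical_name].add(physical_url)

    def lookup(self, logical_name: str) -> List[str]:
        """Return all known physical locations, or an empty list if none."""
        return sorted(self._replicas.get(logical_name, ()))


catalog = ReplicaCatalog()
# Hypothetical logical name and replica locations.
catalog.register("climate/run42.nc", "gsiftp://site-a.example.org/data/run42.nc")
catalog.register("climate/run42.nc", "gsiftp://site-b.example.org/mirror/run42.nc")
```

A client that asks for `climate/run42.nc` receives both locations and can pick the closest or least loaded one, which is where the latency and locality benefits come from.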
![Page 18: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/18.jpg)
RLS (cont.)
The current RLS implementation has the following features:
• Local Replica Catalogs (LRCs)
• Replica Location Indices (RLIs)
• LRCs send information about their state to RLIs using soft state protocols
• Optional "Bloom filter" compression can be used to summarize the contents of the LRC
• The current RLS implementation maintains static information about the LRCs and RLIs participating in the distributed system
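To show why a Bloom filter is a compact summary of a catalog's contents, here is a minimal sketch. The parameters (1024 bits, 3 hash functions), the SHA-256-based hashing, and the logical names are assumptions for illustration; the actual RLS implementation may differ in all of them.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size bit set that answers 'possibly present' or 'definitely absent'."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means the item was never added; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# An LRC could summarize its logical names and send only the filter to an RLI.
lrc_names = ["climate/run42.nc", "lhc/event-0001.root", "bio/scan-17.tif"]
summary = BloomFilter()
for name in lrc_names:
    summary.add(name)
```

The filter never reports a registered name as absent, so an index holding only the summary can safely rule out catalogs; an occasional false positive costs one unnecessary query to that catalog.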
![Page 19: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/19.jpg)
So, if there are solutions...
• There is no pure P2P data transfer mechanism used in this area.
• There are several different protocols.
• Each one has advantages and disadvantages over the others.
![Page 20: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/20.jpg)
Our proposal: GridTorrent Framework
• We propose a new peer-to-peer protocol for distributing scientific data files at an acceptable speed.
• Just as GridFTP extends FTP, GridTorrent redefines the BitTorrent protocol to adapt it to scientific data transfer.
• Many studies show that BitTorrent can be used for scientific applications.
![Page 21: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/21.jpg)
Why do we need the GridTorrent Framework?
Requirements and characteristics of scientific data transfer:
1. Large and voluminous data set
2. Security
3. Reliability
4. Efficiency
5. Scalability
6. User-friendly environment
7. Balanced
8. Collaboration
![Page 22: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/22.jpg)
Why do we need the GridTorrent Framework? (cont.)
• GridTorrent has faster download speed
  – 1. Large and voluminous data set
  – 7. Balanced
• GridTorrent allows peers to share bandwidth
  – 4. Efficiency
• GridTorrent is based on BitTorrent
  – 3. Reliability
  – 5. Scalability
![Page 23: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/23.jpg)
Why do we need the GridTorrent Framework? (cont.)
• GridTorrent has a security manager
  – 2. Security
• GridTorrent has a content management framework
  – 6. User-friendly environment
  – 8. Collaboration
![Page 24: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/24.jpg)
Why BitTorrent?
Alternative peer-to-peer protocols:
• FastTrack
• Gnutella
• eDonkey
• Direct Connect
• Ares
Why BitTorrent?
• Better bandwidth utilization
• Unprecedented speeds
• Limits free riding: tit-for-tat
• Limits leech attacks: coupling upload and download
• Spurious files are not propagated
• Ability to resume a download
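The tit-for-tat idea that limits free riding can be sketched as a peer-selection rule: upload preferentially to the peers that upload the most to us, and keep one slot open for trying someone new. This is a simplified illustration rather than the exact BitTorrent choking algorithm; the function name `choose_unchoked`, the slot count, and the peer rates are assumptions.

```python
import random
from typing import Dict, List, Optional


def choose_unchoked(
    upload_rates: Dict[str, float],
    regular_slots: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick peers to upload to: the best uploaders plus one optimistic choice.

    `upload_rates` maps a peer id to the rate at which that peer has recently
    been sending data to us.
    """
    rng = rng or random.Random()
    # Rank by rate (highest first); break ties by peer id for determinism.
    ranked = sorted(upload_rates, key=lambda peer: (-upload_rates[peer], peer))
    unchoked = ranked[:regular_slots]
    others = ranked[regular_slots:]
    if others:
        # Optimistic unchoke: give one remaining peer a chance to reciprocate.
        unchoked.append(rng.choice(others))
    return unchoked


# Hypothetical rates in kB/s observed from five peers.
rates = {"a": 50.0, "b": 10.0, "c": 30.0, "d": 0.0, "e": 20.0}
selected = choose_unchoked(rates, regular_slots=3, rng=random.Random(0))
```

A peer that uploads nothing, like `d` here, only receives data through the occasional optimistic slot, which is what couples download performance to upload contribution.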
![Page 25: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/25.jpg)
Why BitTorrent? (cont.)
• BitTorrent has proved suitable for distributing very large files.
• Many companies use BitTorrent as a distribution protocol:
  – Amazon S3
  – Microsoft's Avalanche (inspired by BitTorrent)
  – Blizzard (game production company)
  – Movie studios
![Page 26: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/26.jpg)
Advantages of GridTorrent Framework
• Saves resources by taking advantage of the unused upload capacity of downloaders:
  – CPU
  – Network bandwidth
  – Disk
• Reliable
• Jobs can be started and stopped using a web interface
• Can be deployed on any system
• Secure
![Page 27: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/27.jpg)
GridTorrent Framework Components
![Page 28: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/28.jpg)
GridTorrent Framework Components (cont.)
The GridTorrent Framework has three major components:
• GridTorrent Client
• GridTorrent Content Manager
• GridTorrent WS-Tracker
![Page 29: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/29.jpg)
GridTorrent Client
It has the following components:
• Torrent Data Sharing Algorithm
• Task Manager
• WS-Tracker Client
• Data Transfer layer
• Security Manager
![Page 30: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/30.jpg)
GridTorrent Content Manager
Four main components:
• Task Manager
• ACL Manager
• Content Manager
• Collaboration Manager
![Page 31: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/31.jpg)
GridTorrent WS-Tracker
• It functions as a regular BitTorrent tracker:
  – Sends the source and peer list to peers
  – Updates their status
• It sends the task list obtained from the GridTorrent Content Manager
• All communications are secure (SSL)
• It is a web service
![Page 32: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/32.jpg)
GridTorrent Content Manager
• It allows a content owner to publish content at different access levels:
  – Public level
  – User level
  – Group level
• It allows a user to create a group and manage it and its members with upload and download access rights.
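The three access levels can be pictured with a small access-check sketch. The slides do not define the exact semantics, so everything here is an assumption: "user level" is read as an explicit list of allowed users, "group level" as per-member upload/download rights, and the `Group`, `Content`, and `can_download` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Group:
    name: str
    # member name -> rights, each right being "upload" or "download"
    rights: Dict[str, Set[str]] = field(default_factory=dict)


@dataclass
class Content:
    owner: str
    level: str  # assumed to be one of "public", "user", "group"
    allowed_users: Set[str] = field(default_factory=set)
    group: Optional[Group] = None


def can_download(user: str, content: Content) -> bool:
    """Decide whether `user` may download `content` under the assumed rules."""
    if user == content.owner or content.level == "public":
        return True
    if content.level == "user":
        return user in content.allowed_users
    if content.level == "group" and content.group is not None:
        return "download" in content.group.rights.get(user, set())
    return False


# Hypothetical group: alice may upload and download, bob may only upload.
team = Group("climate-team", {"alice": {"upload", "download"}, "bob": {"upload"}})
dataset = Content(owner="carol", level="group", group=team)
```

Under these assumptions `alice` can fetch the dataset while `bob`, who holds only the upload right, cannot.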
![Page 33: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/33.jpg)
Initial Test Results
• File size: 300 MB
• Number of streams/sources: 4
• Source machines: gridfarm (Bloomington, IN)
• LAN test: Iperf bandwidth 857 Mbps; client machine: complexity (Indianapolis, IN)
• WAN test: Iperf bandwidth 30.2 Mbps; client machine: vlab2 (Tallahassee, FL)
![Page 34: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/34.jpg)
Initial Test Results (cont.)
Table 1: Download speed (Mbps) of PTCP vs. GridTorrent with 4 streams/sources

| Test | PTCP | GridTorrent (1 stream) | GridTorrent (4 streams) |
|---|---|---|---|
| LAN Test | 80 | 90 | 95 |
| WAN Test | 42 | 49 | 102 |

Table 2: GridTorrent bandwidth load balancing on downloaded file segment with 4 streams/sources. Bandwidth usage (downloaded MB from each source)

| Test | Source1 | Source2 | Source3 | Source4 |
|---|---|---|---|---|
| LAN Test | 44 | 53 | 47 | 42 |
| WAN Test | 52 | 45 | 43 | 48 |
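The reported figures can be turned into two derived quantities: the speedup of GridTorrent with 4 streams over PTCP, and the share of the download served by each source. The sketch below only re-expresses the numbers from Tables 1 and 2; the helper names `speedup_over_ptcp` and `source_shares` are hypothetical.

```python
from typing import Dict, List

# Table 1: download speed in Mbps.
speeds: Dict[str, Dict[str, float]] = {
    "LAN": {"PTCP": 80, "GridTorrent (1 stream)": 90, "GridTorrent (4 streams)": 95},
    "WAN": {"PTCP": 42, "GridTorrent (1 stream)": 49, "GridTorrent (4 streams)": 102},
}

# Table 2: MB downloaded from each of the four sources.
per_source_mb: Dict[str, List[float]] = {
    "LAN": [44, 53, 47, 42],
    "WAN": [52, 45, 43, 48],
}


def speedup_over_ptcp(test: str) -> float:
    """Ratio of GridTorrent (4 streams) speed to PTCP speed for one test."""
    return speeds[test]["GridTorrent (4 streams)"] / speeds[test]["PTCP"]


def source_shares(test: str) -> List[float]:
    """Fraction of the reported downloaded data that came from each source."""
    total = sum(per_source_mb[test])
    return [mb / total for mb in per_source_mb[test]]


for test in ("LAN", "WAN"):
    shares = ", ".join(f"{share:.1%}" for share in source_shares(test))
    print(f"{test}: speedup {speedup_over_ptcp(test):.2f}x, shares {shares}")
```

With these numbers the speedup is about 1.19x on the LAN and about 2.43x on the WAN, and each source contributes between roughly 23% and 28% of the data, close to the 25% an even four-way split would give.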
![Page 35: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/35.jpg)
Initial Test Results (cont.)
![Page 36: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/36.jpg)
Research Issues
• The current BitTorrent protocol is designed for the existing network environment
• Modifications needed to provide pure scientific data transfer:
  – Modification of message format and frequency
  – UDP
  – GridFTP
• Requirements needed to provide pure scientific data transfer:
  – Security
  – Content access management
  – Searching capability
![Page 37: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/37.jpg)
Questions?
![Page 38: Performance Oriented Data Transferring and Sharing Framework for Scientific Computing](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815b64550346895dc951c9/html5/thumbnails/38.jpg)
References
1. Bell, G.; Gray, J.; Szalay, A. Petascale computational systems. Computer, Volume 39, Issue 1, Jan. 2006, pp. 110-112.
2. Graham, S.L.; Snir, M.; Patterson, C.A. (eds). Getting Up To Speed: The Future of Supercomputing. NAE Press, 2004, ISBN 0-309-09502-6.
3. Foster, I. Overview of Grid Computing. http://www-fp.mcs.anl.gov/~foster/Talks/ResearchLibraryGroupGridsApril2002.ppt, last seen 2007.
4. Science-Driven Network Requirements for ESnet. http://www.es.net/ESnet4/Case-Study-Requirements-Update-With-Exec-Sum-v5.doc, last seen 2007.