June 6, 2007 TeraGrid '07 1
Clustering the Reliable File Transfer Service
Jim Basney and Patrick Duda
NCSA, University of Illinois
This material is based upon work supported by the National Science Foundation under Grant No. 0426972.
Goal
• Provide a highly available Reliable File Transfer (RFT) Service
  – Tolerate server failures
    • Hardware/software faults and resource exhaustion
  – Continue to handle incoming requests
  – Continue to make forward progress on file transfers in the queue
Globus Toolkit Reliable File Transfer Service
[Diagram. Components shown: Client, RFT, GridFTP (×2).]
RFT and GridFTP Clustering
[Diagram. Components shown: RFT (×3), GridFTP control (×2), GridFTP data (×4).]
Clustering Approach
[Diagram. Components shown: Load Balancer, RFT (×3), HA DBMS.]
RFT State Management
[Diagram. Components shown: Client, Web Service Container, RFT, Delegation Service, DBMS.]
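RFT DB Tables
• Request
  – ID
  – Termination Time
  – Started Flag
  – Max Attempts
  – Delegated EPR
  – Container ID
  – Start Time
• Transfer
  – ID
  – Request ID
  – Source URL
  – Destination URL
  – Status
  – Attempts
  – Retry Time
• Restart
  – Transfer ID
  – Restart Marker
  – Last Update Time
[Legend: "Added Fields". The highlighting that marked the added fields on the slide is not preserved in this transcript.]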
New Tables
• Delegation Service
  – Resource ID
  – Caller DN
  – Local Name
  – Termination Time
  – Listener
  – Certificate
  – Container ID
• Persistent Subscription
  – Consumer
  – Producer
  – Policy
  – Precondition
  – Selector
  – Topic
  – Security Descriptor
  – …
RFT Fail-Over
• Based on time-outs
• Periodically query database for pending requests with no recent activity
  – Stalled requests could be caused by RFT service crash, hardware failure, RFT service overload, etc.
  – If found, obtain DB write lock, query again, claim stalled requests, and release lock
• Configuration values:
  – Query interval (default: 30 seconds)
  – Recent interval (default: 60 seconds)
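
The fail-over steps on this slide (scan, lock, query again, claim, release) can be illustrated with a minimal sketch. It is not the RFT implementation: it uses Python's built-in `sqlite3` module as a stand-in for the shared DBMS, `BEGIN IMMEDIATE` as a stand-in for the DB write lock, and a hypothetical, simplified `request` table whose column names (`container_id`, `status`, `last_update`) are assumptions made for illustration. Only the two default intervals come from the slide.

```python
import sqlite3

QUERY_INTERVAL = 30   # seconds between scans (slide default)
RECENT_INTERVAL = 60  # seconds without activity before a request counts as stalled (slide default)


def make_db():
    # In-memory stand-in for the shared DBMS; autocommit mode so that
    # transactions are started and ended explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE request ("
        " id INTEGER PRIMARY KEY,"
        " container_id TEXT,"   # hypothetical: which RFT instance owns the request
        " status TEXT,"         # hypothetical: 'pending' or 'done'
        " last_update REAL)"    # hypothetical: time of last recorded activity
    )
    return conn


def find_stalled(conn, now, recent=RECENT_INTERVAL):
    # Pending requests with no activity in the last `recent` seconds.
    rows = conn.execute(
        "SELECT id FROM request"
        " WHERE status = 'pending' AND last_update < ? ORDER BY id",
        (now - recent,),
    ).fetchall()
    return [row[0] for row in rows]


def claim_stalled(conn, container_id, now, recent=RECENT_INTERVAL):
    # Step 1: unlocked scan, so the common "nothing stalled" case takes no lock.
    if not find_stalled(conn, now, recent):
        return []
    # Step 2: take the write lock (sqlite stand-in for a DB write lock).
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Step 3: query again under the lock; another instance may have
        # claimed the requests between the first scan and the lock.
        ids = find_stalled(conn, now, recent)
        # Step 4: claim by recording the new owner and refreshing the timestamp.
        for request_id in ids:
            conn.execute(
                "UPDATE request SET container_id = ?, last_update = ? WHERE id = ?",
                (container_id, now, request_id),
            )
        # Step 5: release the lock.
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return ids
```

Each instance would call `claim_stalled` once per query interval. Under that assumption and the default values, a request whose last activity was at time t becomes eligible at t + 60 s and is picked up by the next scan, so roughly between t + 60 s and t + 90 s, ignoring the duration of the scan itself. Refreshing `last_update` at claim time is what keeps a second instance from claiming the same request on its own scan.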
Evaluation Environment
• Dedicated 12-node Linux cluster
  – Red Hat Enterprise Linux AS Release 3
  – Switched Gigabit Ethernet
  – 2 GB RAM
  – Dual 2 GHz Intel Xeon CPUs, 512 KB cache
• Globus Toolkit 4.0.3
• MySQL Standard 5.0.27
Evaluation
• Correctness / Effectiveness
  – Submitted multiple RFT requests of different sizes to 12 RFT instances
  – Verified fail-over and notification functionality
• Performance
  – Evaluate overhead of shared DBMS
  – Stress test: transfer many small files
[Figure: files transferred per second (y-axis, 0 to 14) versus elapsed seconds (x-axis, 0 to 165). Annotations: "web services container stopped", "fail-over", "60 second fail-over interval".]
[Figure: total seconds (y-axis, 0 to 6) versus number of nodes (x-axis, 1 to 10). Series: GT4 submit time, cluster submit time.]
[Figure: total seconds (y-axis, 0 to 200) versus number of nodes (x-axis, 1 to 10). Series: GT4 transfer time, cluster transfer time. Data labels: 4%, 6%, 10%, 14%, 22%, 43%, 57%, 82%, 95%.]
Related Work
• HAND: Highly Available Dynamic Deployment Infrastructure for GT4
  – Migrate services between containers to maintain availability during planned outages
  – Does not address management of persistent service state or fail-over for unplanned outages
• myGrid
  – DBMS persistence of WS-ResourceProperties in Apache WSRF
  – Points to a general-purpose approach for DBMS-based persistence of stateful WSRF services
Conclusion
• Clustering RFT provides load-balancing and fail-over with acceptable performance for small clusters
• Clustering is a promising approach for application to other grid services
Future Work
• Correctly handle replay of FTP deletes
• Implement credentialRefreshListener
• Evaluate use of different DBMS solutions
• Investigate GT4 DBMS persistence in general
• Investigate use of WS-Naming
Thanks!
• Questions? Comments?
• This material is based upon work supported by the National Science Foundation under Grant No. 0426972.
• Performance experiments were conducted on computers at the Technology Research, Education, and Commercialization Center (TRECC), a program of the University of Illinois at Urbana-Champaign, funded by the Office of Naval Research and administered by the National Center for Supercomputing Applications. We thank Tom Roney for his assistance with the TRECC cluster.
• We also thank Ravi Madduri from the Globus project for answering our questions about RFT.