
Development of a High-Throughput Computing Cluster at Florida Tech

P. FORD, R. PENA, J. HELSBY, R. HOCH, M. HOHLMANN
Physics and Space Sciences Dept, Florida Institute of Technology, 150 W. University Blvd, Melbourne, FL 32901

Abstract

Since the first concept and implementation of the computing cluster at Florida Tech, we have significantly increased its size and developed its software. We have implemented the Linux-based ROCKS OS as the central controller of all cluster resources. The cluster now uses the Condor high-throughput batch-job system and has been fully integrated into the Open Science Grid test-bed. In addition to contributing to the data-handling capabilities of worldwide scientific grids, the cluster is being used to process and model high-energy particle simulations, such as those for muon radiography.

Open Science Grid

The OSG is a collaboration of many virtual organizations (VOs) ranging from biomedical research to particle physics to software development.

Integration onto the OSG requires the installation and careful configuration of many packages and services. Our group attended the Florida International Grid School before making our second installation attempt; the first installation took three months, the second took three weeks.

Software

ROCKS - The ROCKS operating system we are using is version 4.2.1. It is based on a community enterprise version of Red Hat Linux called CentOS.
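
Figure 6 shows how the condor-compute appliance was added to the ROCKS kickstart graph through an XML node file. Purely as a sketch, assuming the standard Rocks site-profiles layout described in the Rocks user guide, such a node file could look roughly like the following; the package name and post-install command are illustrative placeholders, not our exact configuration:

<?xml version="1.0" standalone="no"?>
<kickstart>

  <description>
  condor-compute: the stock compute appliance with the Condor daemons layered on top.
  </description>

  <!-- Extra package pulled in for this appliance (placeholder name). -->
  <package>condor</package>

  <post>
  # Placeholder post-install step: have Condor start at boot on the node.
  /sbin/chkconfig condor on
  </post>

</kickstart>

An edge added in a graph XML file then ties the new appliance into the existing compute configuration, which is what the bubble in Figure 6 depicts.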

Condor - Condor is the job manager that dispatches jobs to available machines in the cluster. We use it to distribute our own resource-intensive muon tomography simulations so that more work can be accomplished in the same amount of time. Condor now stands as our primary job manager serving the Open Science Grid.
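
In practice, a batch of such jobs is described in a submit file handed to condor_submit. The sketch below is a minimal, hypothetical example; the wrapper script, arguments, and file names are placeholders rather than our actual production setup:

# Hypothetical submit description for a batch of muon tomography simulation runs.
# Submit with: condor_submit muon_sim.sub
universe   = vanilla
# Placeholder wrapper script around the Geant4 simulation
executable = run_muon_sim.sh
# Give each job its own run number via the per-job $(Process) macro
arguments  = $(Process)
output     = sim_$(Process).out
error      = sim_$(Process).err
log        = sim.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# One job per CPU currently in the cluster
queue 40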

Hardware

The FLTECH cluster has 20 nodes (40 CPUs) operational, with functioning simulation software packages (Geant4). We run all essential non-compute elements (frontend, NAS, switches) on a 4-kilowatt uninterruptible power supply that is programmed to perform automatic shutdowns in the case of an extended power outage. In addition, a NAS featuring 10 terabytes of available storage is being installed.

Cluster communication follows a star topology, with a high-end Cisco switch as the central manager and several Linksys switches as node carriers.

All recent additions are installed in a second 50U rack from the same manufacturer as the one loaned to us by UF. Our goal is to make all current and future hardware rackmount.
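
As an illustration only, assuming an apcupsd-style UPS monitoring daemon (one common choice, not necessarily the tool in use here), the automatic-shutdown thresholds would be set with apcupsd.conf directives such as:

# Illustrative apcupsd.conf excerpt (assumed tooling, not a record of our actual setup).
UPSNAME fltech-ups
# Begin an orderly shutdown when battery charge drops to 20 percent...
BATTERYLEVEL 20
# ...or when an estimated 10 minutes of runtime remain, whichever comes first.
MINUTES 10
# Do not additionally force a shutdown after a fixed number of seconds on battery.
TIMEOUT 0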

Conclusions and Future

With all software working as intended, future plans are to expand the cluster resources as much as possible and to move to the OSG Production grid once we can ensure maximum uptime. A new frontend will be arriving soon, along with nodes featuring 2 GB of memory per CPU, the hardware required for data processing in the CMS experiment. For further information, contact [email protected]. Visit http://research.fit.edu/hep/ to follow this project.

References and Acknowledgments

Rocks Clusters User Guide: http://www.rocksclusters.org/rocks-documentation/4.2.1/ (accessed March 2008)
Open Science Grid: http://www.opensciencegrid.org/ (accessed March 2008)
Condor v6.6.11 Manual: http://www.cs.wisc.edu/condor/manual/v6.6/ref.html (accessed March 2008)

Thanks to Dr J. Rodriguez (FIU) and Micha Niskin (FIU) for their guidance.

Figures

Figure 1: New high-end cluster hardware (NAS) (above).
Figure 2: The current topology of the cluster (left), and hardware (below).
Figure 3: Ganglia cluster monitoring.
Figure 4: Simulations running on Condor.
Figure 5: Machines available to Condor.
Figure 6: The kickstart graph is the cornerstone of the ROCKS OS. It ensures that packages are installed on all cluster machines in the correct order. The bubble (nodes: condor-compute, compute-appliance, compute) shows our addition of the condor-compute appliance, created with an XML file, to the graph, effectively interfacing Condor with ROCKS.
Figure 7: A map of Open Science Grid sites, provided by VORS and the grid operations center. Our site is located on the east coast of Florida.
Figure 8: A map of OSG sites provided by the MonALISA grid monitoring system.