WP2: Infrastructure and Service Management

www.eu-etics.org, INFSOM-RI-026753. WP2: Infrastructure and Service Management, Status Report. ETICS All-Hands – 23 October 2006. CERN: Marian Zurek. INFN: Matteo Selmi. UW-Madison: Peter Couvares, Becky Gietzel, Andy Pavlo.


Page 1: WP2: Infrastructure and Service Management

www.eu-etics.org

INFSOM-RI-026753

WP2: Infrastructure and Service Management
Status Report

ETICS All-Hands – 23 October 2006

CERN: Marian Zurek

INFN: Matteo Selmi

UW-Madison: Peter Couvares, Becky Gietzel, Andy Pavlo

Page 2: WP2: Infrastructure and Service Management

Personnel News

• Changes @ UW-Madison
  – Tolya Karp replaced by Andy Pavlo and Becky Gietzel
  – Peter still here :)
• Carlos to join WP2 @ CERN in November
  – Much-needed sysadmin help for Marian!

Page 3: WP2: Infrastructure and Service Management

Deliverables

• D2.2 - Infrastructure installation and usage documentation (PM06)
  – Delivered (a little late: PM07)
• D2.3 - Status of certification, integration and validation testbed setup (prototype) (PM12)
  – Document not yet started, but it will contain positive news: prototype testbeds are up and have been operational for >6 months.

Page 4: WP2: Infrastructure and Service Management

Major Tasks Performed

• Certification, Integration and Validation Infrastructure Expansion: CERN Facility
  – Due entirely to Marian’s ongoing hard work, WP2 has expanded the NMI Build/Test Facility at CERN and improved its operation.
  – etics.cern.ch: official ETICS WS/submission node, production host
    – 19 CPUs: ia32, x86_64, ia64, ppc
    – SLC3, SLC4, RHES3, Deb3, FC3, FC4, FC5, WinXP, MacOS
    – 2500+ jobs (as of 17 October 2006) vs. 1300+ jobs (as of 22 May 2006)
  – etics-test.cern.ch: test submission node
    – a few machines with SLC3, SLC4 on ia32
    – 2200+ jobs (as of 17 October 2006) vs. 450+ jobs (as of 22 May 2006)
  – etics-dev.cern.ch: development node, non-stable
    – a few machines with SLC3, SLC4 on ia32
    – 1650+ jobs (as of 17 October 2006)
  – etics-hd.cern.ch: new host for SLC4 WS/submission node prototype
  – Operational setup
    – WNs status page: http://etics.cern.ch/nmi/?page=pool/index
    – Job status page: http://etics.cern.ch/nmi/?page=results/overview
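The two job-count snapshots above can be read as an average submission rate, which is a crude but useful input when thinking about capacity. The following is only illustrative arithmetic in Python. It treats the "1300+"-style counts as exact, although the slide gives them as lower bounds, and the names SNAPSHOTS and jobs_per_day are invented for this note.

```python
from datetime import date

# Job counts quoted on the slide: (count on 22 May 2006, count on 17 October 2006).
# The slide writes them with a "+" suffix, so they are lower bounds, not exact totals.
SNAPSHOTS = {
    "etics.cern.ch": (1300, 2500),
    "etics-test.cern.ch": (450, 2200),
}
START, END = date(2006, 5, 22), date(2006, 10, 17)


def jobs_per_day(before: int, after: int, start: date = START, end: date = END) -> float:
    """Average number of new jobs per day between two snapshots."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be later than start")
    return (after - before) / days


if __name__ == "__main__":
    for host, (before, after) in SNAPSHOTS.items():
        print(f"{host}: {jobs_per_day(before, after):.1f} jobs/day")
```

Under those assumptions the averages over the 148 days between the snapshots come to roughly 8 jobs/day on etics.cern.ch and 12 jobs/day on etics-test.cern.ch.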

Page 5: WP2: Infrastructure and Service Management

ETICS, 4th EGEE Conference, Pisa, Italy, November 2005

Page 6: WP2: Infrastructure and Service Management

Major Tasks Performed

• Certification, Integration and Validation Infrastructure Expansion (Cont.)
  – INFN Facility
    – Thanks to Matteo, WP2 has also expanded the NMI Build/Test Facility at INFN
    – etics-01.cnaf.infn.it: ETICS WS/submission node
      – 5 CPUs: ia32, x86_64, ppc
      – SLC3/SLC4/CentOS4/MacOSX
      – 330+ jobs
  – UW-Madison Facility
    – 100+ CPUs, 43+ platforms, and still growing…
    – Thanks to Becky, local ETICS WS currently being deployed

Page 7: WP2: Infrastructure and Service Management

Major Tasks Performed

• Parallel Testing Feature Delivered
  – Allows co-scheduling of multiple heterogeneous resources, e.g. to dynamically deploy a custom testbed for testing client/server or p2p s/w.
  – Originally an end-of-Q4 goal, delivered ~5 months early in response to gLite demands
• D2.2 Infrastructure Installation and Usage Document Completed
  – Thanks to all of WP2 for content & reviewers for helpful feedback
• gLite System Testing Prototype
  – To be described in detail by Marian tomorrow…
• Continued Improvements to NMI Infrastructure
  – Many a result of Marian & Matteo’s feedback & experiences setting up facilities at CERN and INFN.
  – Additional NMI documentation
  – New NMI website (http://nmi.cs.wisc.edu)
  – LISA ‘06 NMI paper, etc.
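The co-scheduling idea behind the parallel testing feature can be illustrated with an all-or-nothing reservation: a multi-node test only starts if every node it needs is available at the same time. This is a minimal Python sketch of that idea, not the NMI implementation; the function co_schedule and the platform and machine names in the usage note are hypothetical.

```python
from typing import Dict, List, Optional


def co_schedule(requests: List[str], free: Dict[str, List[str]]) -> Optional[Dict[int, str]]:
    """Pick one idle machine per requested platform, all or nothing.

    requests: the platform needed by each node of the test, in order.
    free: platform name -> list of idle machine names (never modified).
    Returns {request index: machine name}, or None if any request cannot be met.
    """
    # Work on a copy so a failed attempt leaves the pool untouched.
    remaining = {platform: list(hosts) for platform, hosts in free.items()}
    assignment: Dict[int, str] = {}
    for index, platform in enumerate(requests):
        hosts = remaining.get(platform)
        if not hosts:
            # One missing resource means the whole test cannot be co-scheduled.
            return None
        assignment[index] = hosts.pop(0)
    return assignment
```

For example, with a pool of {"slc4_ia32": ["a", "b"], "slc3_ia32": ["c"]}, a client/server test asking for one slc4_ia32 node and one slc3_ia32 node gets both, while a test asking for two slc3_ia32 nodes gets nothing rather than a partial testbed.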

Page 8: WP2: Infrastructure and Service Management

Major Tasks Performed

• Implemented short-term solution for root-level testing @ CERN
  – Initial approach is only loosely integrated with NMI
  – To be replaced by future NMI virtual machine capability?
• Participation in OMII-Europe
  – Continued involvement to ensure infrastructure harmony
  – Cross-site job migration is also a top OMII-Europe goal
• And last but not least: Boring system administration … every day
  – OS updates/upgrades, reboots, backups, disk space mgmt., disappearing WNs, crashes, power outages, filesystem failures, etc.
  – As CERN is the facility with the most usage, most of this falls onto Marian
  – The etics.cern.ch service is highly available. No significant downtime was caused by the WP2 infrastructure

Page 9: WP2: Infrastructure and Service Management

Issues

• Capacity Planning / Scalability
  – Marian: “How many more needed?”
  – Good question! I have no idea.
  – Major new users/projects may need to provide new resources.
  – We need to better understand how easy/quick it is to add resources to an existing facility, and how many can be added in the same manner before new scalability issues arise.
  – NMI has been demonstrated to scale to 100’s of nodes, and Condor to 1000’s… but ETICS + NMI + Condor? It also depends on specific workload…
• Additional ETICS Testbeds for Development
  – Marian: “Does every developer need their own ETICS installation?”
  – Combined deployment of NMI submit node + ETICS WS is not trivial or fully automated (no simple RPM or “plug’n’play”)
  – WP2 needs help from other WPs to better automate their deployment

Page 10: WP2: Infrastructure and Service Management

Issues

• Uneven Facility Utilization
  – Was an issue in May, still an issue today
  – 3/3 sites set up, 1/3 in use
    – CERN facility set up, already in use, production-ready
    – INFN facility set up, but less used
    – UW facility set up, but not yet in regular use by ETICS
  – Why? Two reasons:
    – Minor: CERN facility known to work, other facilities less stress-tested.
    – Major: inconvenience of submitting to multiple ETICS sites with multiple DBs & WS interfaces
  – Upcoming cross-site job migration capabilities should largely address both issues: if jobs automatically migrate, users don’t need to think about it, and all three pools will be exercised
    – To be described in more detail by Andy tomorrow…
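To make the migration argument concrete, here is a minimal Python sketch of one possible site-selection rule: send a job to whichever site supports its platform and has the most idle CPUs. This is an assumption made for illustration, not the ETICS/NMI migration design. The Site type, the pick_site function and the idle-CPU figures in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Site:
    name: str
    platforms: FrozenSet[str]  # platforms this site can build/test on
    idle_cpus: int             # CPUs currently free (hypothetical metric)


def pick_site(platform: str, sites: List[Site]) -> Optional[Site]:
    """Return the site with the most idle CPUs that supports the platform."""
    candidates = [s for s in sites if platform in s.platforms and s.idle_cpus > 0]
    if not candidates:
        # No site can run the job right now; the caller would queue it.
        return None
    # On a tie, max() keeps the first candidate in list order.
    return max(candidates, key=lambda s: s.idle_cpus)
```

With made-up figures such as CERN at 0 idle CPUs, INFN at 2 and UW-Madison at 40, an SLC4 job would land at UW-Madison without the user choosing a site, which is the effect the slide expects from automatic migration.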

Page 11: WP2: Infrastructure and Service Management

Issues

• Communication
  – Evening in Europe == Morning in Madison
  – Bi-weekly calls stopped happening over summer
  – I’ve been slow to address the problem
  – Matteo in May:
    – “I think we need more coordination among the three sites. It is quite difficult for us at INFN to understand what are the urgent operations to be done.”
  – Marian in October: same complaint!
• Sysadmin Work
  – Only one person @ CERN
  – Frequent OS updates/upgrades
  – Reboots
    – because of power cuts (too hot), kernel updates/upgrades, HW failures
  – Marian: “I know it is not interesting for you, but this must work !! !! !!”
  – Heterogeneous clusters inherently harder to manage than homogeneous clusters of the same size
  – Complex s/w stack: ETICS client -> ETICS WS -> NMI -> Condor -> OS

Page 12: WP2: Infrastructure and Service Management

Workplan

• Q4 Top Priorities
  – Develop/deploy/test cross-facility job migration capability.
    – …and increase utilization of INFN and UW-Madison pools as a result.
  – Keep up with increasing sysadmin demands: keep infrastructure running smoothly for ETICS users & developers
    – Responding to Hardware/OS/Service issues
    – Automation of currently manual tasks
    – Deployment of new systems & services
    – Scalability work
  – Prepare D2.3 report on infrastructure status.

Page 13: WP2: Infrastructure and Service Management

Workplan

• Q4/Q5 Unprioritized (next steps and/or resources unclear):
  – Hardware Virtualisation
    – WoD (WindowsOnDemand) service, VMWare and/or Xen
  – Service Monitoring (Service Level Status)
    – see already http://sls.cern.ch/sls/service.php?id=ETICS
    – Your feedback is needed
  – Security issues
    – Passwords present in the CVS
  – Public / private resource allocation
    – A project wants to use ETICS, brings in its own private nodes, and wants their full capacity kept private
    – Steering the project’s jobs to these nodes, preventing other jobs from landing there
    – Already supported by NMI/Condor, needs to be documented/customized for ETICS
  – Steering jobs to/identifying nodes with specific resources
    – Already supported by NMI/Condor, needs to be documented/customized for ETICS
  – Documentation
    – Needs to be updated & improved
    – ETICS-generic WS installation & configuration docs
    – CERN/INFN/UW facility-specific configuration & administration docs
    – Extracting info from Savannah issue DB
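The private-node and resource-steering items both come down to two-sided matching: a node states which jobs it will accept, and a job states which nodes it can use. The slide says NMI/Condor already supports this. The Python sketch below is not Condor configuration, only an illustration of the matching logic, and every name in it (Node, Job, private_to, eligible_nodes, the "os" attribute) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    attributes: Dict[str, str]
    private_to: Optional[str] = None  # owning project, or None for a public node


@dataclass
class Job:
    project: str
    requirements: Dict[str, str] = field(default_factory=dict)


def node_accepts(node: Node, job: Job) -> bool:
    """Node-side policy: private nodes only run their owner's jobs."""
    return node.private_to is None or node.private_to == job.project


def job_accepts(job: Job, node: Node) -> bool:
    """Job-side policy: every required attribute must match exactly."""
    return all(node.attributes.get(k) == v for k, v in job.requirements.items())


def eligible_nodes(job: Job, nodes: List[Node]) -> List[str]:
    """Names of nodes where both sides agree, the project's own nodes first."""
    matches = [n for n in nodes if node_accepts(n, job) and job_accepts(job, n)]
    # sorted() is stable, so nodes keep their input order within each group.
    matches.sort(key=lambda n: n.private_to != job.project)
    return [n.name for n in matches]
```

A job from the owning project is steered to its private node first and can still fall back to public nodes, while jobs from other projects never see the private node at all.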

Page 14: WP2: Infrastructure and Service Management

Metrics

• Bugs, jobs, tasks
  – 15 open NMI/Condor bugs/issues
  – 14 closed/addressed bugs/issues
  – Details available at:
    – bugs: https://savannah.cern.ch/bugs/?group=etics and select category=NMI
  – 5 open tasks, 1 closed
  – Details available at: https://savannah.cern.ch/task/?group=etics and select category=NMI

Page 15: WP2: Infrastructure and Service Management

Conclusion

• Discussion/Questions/Etc.