TeraGrid Operations Overview. Mike Pingleton, NCSA TeraGrid Operations. December 2nd, 2004
TRANSCRIPT
TeraGrid Operations Overview
Mike Pingleton
NCSA TeraGrid Operations, December 2nd, 2004
TeraGrid Operations Center
Provides continuous and coordinated operational support, user assistance, and incident response for the nation-wide TeraGrid
TOC Capabilities
24/7 single source of assistance for TeraGrid users and staff, via email or telephone
Dedicated TeraGrid trouble-ticket system (TTS) ensures timely resolution of problems and event response
Leverages and pools vast experience of existing operations staff and system administrators
Capable of monitoring systems/queues at multiple remote sites
TOC Technical Approach
TG Operations Center staffed by NCSA and SDSC Operations staff, 12 hour shift for each site
TOC provides front-line evaluation, resolution, and routing of problems
TOC coordinates, participates in event response – security issues, down time, etc.
NCSA & SDSC Ops Centers: Expanded Scope, but Business as Usual
Monitoring
Currently 'passively' monitoring most TeraGrid clusters using CluMon
Ramping up efforts to monitor the TeraGrid network
Monitoring capacity untapped at this point (not yet monitoring grid fabric)
Technical Approach – TeraGrid Ticketing System
[email protected] (an @teragrid.org address) or the toll-free number receives all incoming requests
TTS is a browser-based, db-driven system developed from NCSA's in-house ticketing system (use existing infrastructure!)
Users are able to track the progress of their tickets
New TG sites are easily integrated into system (all new ETF sites already integrated)
Technical Approach – TeraGrid Ticketing System (continued)
Problem Resolution – a tiered approach
  Front-line evaluation, routing or resolution by TG Ops staff
  Site-specific issues routed to site-leads for resolution
  TG-wide issues routed to user support team to coordinate resolution by technical leads
Front-line Resolution an important factor
  22% of all trouble tickets resolved by TOC staff
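The tiered approach above can be sketched as a small routing function. This is an illustration only, not the TTS implementation: the `Scope` categories, the `route_ticket` function, and the queue names it returns are all hypothetical, chosen to mirror the three tiers named on the slide.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    """Hypothetical categories mirroring the three tiers on the slide."""
    FRONT_LINE = "front_line"   # TG Ops staff can resolve it directly
    SITE_SPECIFIC = "site"      # tied to a single TeraGrid site
    TG_WIDE = "tg_wide"         # affects the TeraGrid as a whole


def route_ticket(scope: Scope, site: Optional[str] = None) -> str:
    """Return the name of the (hypothetical) queue that takes the ticket."""
    if scope is Scope.FRONT_LINE:
        # Resolved by TG Ops staff without further routing.
        return "toc"
    if scope is Scope.SITE_SPECIFIC:
        if site is None:
            raise ValueError("a site-specific ticket needs a site name")
        # Routed to the lead for that site.
        return f"site-lead:{site}"
    # TG-wide: the user support team coordinates with technical leads.
    return "user-support-team"
```

For example, `route_ticket(Scope.SITE_SPECIFIC, "NCSA")` hands the ticket to the NCSA site lead, while a TG-wide ticket goes to the user support team.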
Trouble Ticket Processing: From Open To Close
When a ticket is created, user receives auto-notification with ticket number
User receives personal reply within 30 minutes
Ticket is assigned to a project & to someone
User is kept updated on progress, resolution
Problem behind ticket is resolved
User is notified
User receives auto-notification of closure, with summary
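One way to picture the open-to-close sequence is as an ordered list of stages plus a check on the 30-minute personal-reply target. This is a sketch under stated assumptions: the stage identifiers and the helper functions are hypothetical, and only the order of the steps and the 30-minute figure come from the slide.

```python
from datetime import datetime, timedelta
from typing import Optional

# Stages in the order the slide lists them; the identifiers are illustrative.
STAGES = [
    "created_auto_notified",   # auto-notification with ticket number
    "personal_reply_sent",     # personal reply within 30 minutes
    "assigned",                # assigned to a project and to a person
    "user_kept_updated",       # progress updates while work continues
    "problem_resolved",
    "user_notified",
    "closed_auto_notified",    # auto-notification of closure, with summary
]

PERSONAL_REPLY_WINDOW = timedelta(minutes=30)  # target stated on the slide


def next_stage(current: str) -> Optional[str]:
    """Return the stage after `current`, or None once the ticket is closed."""
    index = STAGES.index(current)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def reply_deadline(created_at: datetime) -> datetime:
    """Latest time at which the personal reply still meets the target."""
    return created_at + PERSONAL_REPLY_WINDOW


def reply_is_late(created_at: datetime, replied_at: datetime) -> bool:
    """True when the personal reply missed the 30-minute window."""
    return replied_at > reply_deadline(created_at)
```

A ticket opened at 09:00 would need its personal reply by 09:30 to meet the target.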
Problem Resolution Workflow
[Diagram; labels: TeraGrid User, [email protected], TeraGrid Operations, User Support Team, TeraGrid Sites]
TeraGrid Ticket Breakdown
[Pie chart; slices: 729 tickets (22%), 317 tickets (10%), 2249 tickets (68%); legend: Site Specific Tickets, TeraGrid-Wide Tickets, TG Ops Center]
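The percentages on the chart follow directly from the three ticket counts. A minimal check, using only the counts shown above (the variable names are illustrative, and no slice is matched to a legend entry here):

```python
# Ticket counts as they appear on the chart.
counts = [729, 317, 2249]

total = sum(counts)

# Share of each slice, rounded to the nearest whole percent.
shares = [round(100 * count / total) for count in counts]

print(total)   # 3295
print(shares)  # [22, 10, 68]
```

The rounded shares match the chart labels and add up to 100%.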
Pulling Ops Centers Together:
A common set of web-based procedures documentation –
  Routing & Assignment Guides
  '20 Questions' Guides for problem determination
  Basic operational policies and procedures
'Shift Turnover' phone calls
Open communication & assistance
Challenges
TeraGrid is a huge learning curve for Ops Staff (must know at least a little bit about everything)
Keeping abreast with a constant state of change
Working with people who are very far away (and sometimes on vacation)
Promoting the concept of Problem Resolution (new to some) and getting everyone to use the Ticketing System
Inexperienced users on the horizon
Lessons Learned
More tickets than anyone expected
Problem Resolution on a global scale is expensive wrt time and talent consumed
TG Ops Center more than just a problem routing switchboard
Communication & coordination between RPs, services and TOC vital to success