Overview and Evaluation of Conceptual Strategies for Accessing
CPU-dependent Execution Resources in Grid Infrastructures
J Walsh, J Dukes, B Coghlan, G Pierantoni
School of Computer Science and Statistics
The University of Dublin, Trinity College
Background
• Original problem (Grid/GPGPU integration)
  – Publication, Discovery, Job Submission & LRMS
  – GPGPU only used for highly parallel compute tasks
  – CPU needed for data I/O to GPGPU
• All GPGPU jobs need CPU, but not vice-versa
• Is this problem similar for other H/W or S/W?
  – If so, can the problem be abstracted?
  – Generalised method to support many H/W types?
• Do not want to create a new GLUE Schema definition for each type
• Must accommodate the differences (i.e. must be extensible)
CPU-Dependent Execution Resource
• CPU-Dependent Execution Resource (CDER)
  • Physically associated with a host (node-bound)
  • CPU required to facilitate resource access
  • Job execution split between CPU and CDER
  • Job access to the CDER must be exclusive (or appear to the job to be exclusive)
  • Finite number of batch system job-slots (usually one)
  • e.g. GPGPU, FPGA, hardware media encoder
CDER GLUE Integration
• Modify GLUE2 and UI/WMS/CE?
  • Slow adoption of GLUE changes (GLUE2 > 5 yrs)
  • Not practical for CDERs that have yet to be envisaged
  • Need a flexible, dynamic approach
• CDER support layer using existing GLUE2 schema and middleware?
  • Not proposing changes to GLUE2
  • Use wrappers/plugins for existing middleware
  • Adapt quickly to describe new CDERs
Conceptual Strategies
• A Priori
• Named-Queue
• Tagged-Environment
• Attribute-Extension
• Class-Extension
Criteria for Evaluation
• Discoverability
• Semantic Resource Detail (intra)
  – Level of Detail
  – Structured Information
• Semantic Structure (inter)
  – Associations between CDER and other Entities
• Dynamic Information
• Time Efficiency (how efficiently the CDER information can be (i) queried and (ii) updated)
• Space Efficiency (size of the additional published CDER information)
• Discovery / Matchmaking / Submission support
A Priori
• “Non”-strategy
  • Used by some Sites/VOs for GPGPU handling
  • Requires knowledge of how to access the resource (e.g. Queue and Software)
• Discoverability: None
• Semantic Resource Detail: None
• Semantic Structure: None
• Dynamic Info: None
Named-Queue
• Discoverability: Queue Name suffix
  • https://ce1.example.com/cream-pbs-nvidia_gpgpu
  • Can match against Queue suffix in JDL
• Semantic Resource Detail: Minimal
• Semantic Structure: Minimal
  • Limited to queue name suffix detail
• Dynamic Info: Limited
  • Do #CPUs = #CDERs?
  • Potential job requirements may never be satisfied
  • Batch system limitations
  • Under-utilisation of CPU resources
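The queue-suffix matching described above can be sketched as follows. This is only an illustration, assuming that the last path segment of the CE endpoint is the queue name; the helper names and the second endpoint are hypothetical and are not part of any middleware API.

```python
from urllib.parse import urlparse


def queue_name(endpoint):
    """Return the last path segment of a CE endpoint (assumed to be the queue name)."""
    return urlparse(endpoint).path.rsplit("/", 1)[-1]


def endpoints_with_suffix(endpoints, suffix):
    """Keep only the endpoints whose queue name ends with the given suffix."""
    return [e for e in endpoints if queue_name(e).endswith(suffix)]


endpoints = [
    "https://ce1.example.com/cream-pbs-nvidia_gpgpu",
    "https://ce2.example.com/cream-pbs-long",  # hypothetical CPU-only queue
]
print(endpoints_with_suffix(endpoints, "nvidia_gpgpu"))
```

The suffix is the only information available to match on, which is why this strategy scores "Minimal" for semantic detail.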
Tagged-Environment
• Discoverability: Published Tag
  • e.g. GLUE 1.3 SoftwareEnvironment
• Semantic Resource Detail: coarse
  • Naming convention (e.g. MPI_FEATURE_X)
• Semantic Structure: minimal
  • Must know relationship between published tags
• Dynamic Information: limited/difficult
  • Difficult to encode CDER capacity and utilisation
  • Very limited use with current M/W
Attribute-Extension (I)
• GLUE2 entities can define multiple OtherInfo string attributes containing arbitrary string values
  • Use to publish CDER specific K/V pairs
  • Extended attributes internal to entity representing CDER
• Discoverability: yes
  • via LDAP query, not WMS
• Semantic Resource Detail: fine
  • Can encode arbitrary attributes
  • Manufacturer, Model, Capacity, Utilisation, Memory, …
• Semantic Structure: yes
Attribute-Extension (II)
• Dynamic Information: yes
• Time Efficiency: medium
  – to query key/values stored in OtherInfo, must retrieve and parse ALL OtherInfo strings
• Space Efficiency: good, compact structure
• 2-phase discovery and submission required
Example (App Environment)
objectClass: GLUE2ApplicationEnvironment
GLUE2ApplicationEnvironmentMaxJobs: 32
GLUE2ApplicationEnvironmentAppName: CUDA
GLUE2ApplicationEnvironmentFreeJobs: 30
GLUE2EntityOtherInfo: GPUCUDAComputeCapability=2.1
GLUE2EntityOtherInfo: GPUMainMemorySize=1024
GLUE2EntityOtherInfo: GPUCoresPerMP=48
GLUE2EntityOtherInfo: GPUCores=192
GLUE2EntityOtherInfo: GPUClockSpeed=1660
GLUE2EntityOtherInfo: GPUECCSupport=false
GLUE2EntityOtherInfo: GPUVendor=Nvidia
GLUE2EntityOtherInfo: GPUPerNode=2
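Because every key/value pair is stored as an opaque OtherInfo string, a client has to retrieve all of them and split each one itself. A minimal sketch of that client-side step, assuming the strings have already been fetched from the information system; `parse_other_info` is a hypothetical helper, not a middleware function.

```python
def parse_other_info(values):
    """Split each 'Key=Value' string once on '='; ignore strings without '='."""
    pairs = {}
    for item in values:
        key, sep, value = item.partition("=")
        if sep:
            pairs[key.strip()] = value.strip()
    return pairs


# The GLUE2EntityOtherInfo values from the example entry above.
other_info = [
    "GPUCUDAComputeCapability=2.1",
    "GPUMainMemorySize=1024",
    "GPUCoresPerMP=48",
    "GPUCores=192",
    "GPUClockSpeed=1660",
    "GPUECCSupport=false",
    "GPUVendor=Nvidia",
    "GPUPerNode=2",
]
attrs = parse_other_info(other_info)
print(attrs["GPUVendor"], int(attrs["GPUMainMemorySize"]))
```

All values arrive as strings, so numeric comparisons (e.g. on GPUMainMemorySize) need an explicit conversion on the client.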
Class-Extension (I)
• GLUE2 entities can be associated with multiple Extension class instances
  • Each Extension object contains a single key/value pair
  • Use to publish CDER specific K/V pairs
• Discoverability: yes
  • via LDAP query, not WMS
• Semantic Resource Detail: fine
  – Entity can reference multiple Extension objects
• Semantic Structure: fine
  – Inherent key/value pairs rather than strings in Attribute-Extension
Class-Extension (II)
• Dynamic Information: yes
• Time Efficiency: high
  – LDAP query using desired key name
  – No need to extract key/value pairs from string
• Space Efficiency: low
  – Each key/value pair requires a complete Extension object
  – Less efficient than Attribute-Extension
  – Greater overhead in resolving all K/V pairs
• More complex to realise (e.g. in LDAP) than Attribute-Extension
• 2-phase discovery and submission required
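The space cost can be made concrete with a small sketch that expands key/value pairs into one Extension entry each. This is an illustration only: the `_ext_N` local ID scheme, the entity ID, and the parent DN are assumptions for the example, and the six attributes per entry mirror the Extension example in this deck rather than a complete schema.

```python
def extension_entries(pairs, entity_id, parent_dn):
    """Build one LDIF-style Extension entry (a list of lines) per key/value pair."""
    entries = []
    for index, (key, value) in enumerate(pairs.items(), start=1):
        local_id = f"{entity_id}_ext_{index}"  # hypothetical naming scheme
        entries.append([
            f"dn: GLUE2ExtensionLocalID={local_id},{parent_dn}",
            "objectClass: GLUE2Extension",
            f"GLUE2ExtensionLocalID: {local_id}",
            f"GLUE2ExtensionKey: {key}",
            f"GLUE2ExtensionValue: {value}",
            f"GLUE2ExtensionEntityForeignKey: {entity_id}",
        ])
    return entries


pairs = {"GPUVendor": "Nvidia", "GPUPerNode": "2"}
entries = extension_entries(pairs, "example_share", "GLUE2GroupID=resource,o=glue")
extension_lines = sum(len(entry) for entry in entries)
other_info_lines = len(pairs)  # Attribute-Extension: one OtherInfo line per pair
print(extension_lines, other_info_lines)
```

Under these assumptions each pair costs six published lines as an Extension object, against a single OtherInfo line with Attribute-Extension.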
Example (Extension Class)
dn: GLUE2ExtensionLocalID=GPU_NVIDIA_P_1,GLUE2ShareID=gpgpu_gputestvo_wn136.grid.cs.tcd.ie_ComputingElement,GLUE2ServiceID=wn136.grid.cs.tcd.ie_ComputingElement,GLUE2GroupID=resource,o=glue
GLUE2ExtensionLocalID: GPU_NVIDIA_P_1
GLUE2ExtensionKey: GPUPerNode
objectClass: GLUE2Extension
GLUE2ExtensionValue: 2
GLUE2ExtensionEntityForeignKey: gpgpu_gputestvo_wn136.grid.cs.tcd.ie_ComputingElement
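With Extension objects, a single key can be selected directly by an LDAP search filter instead of parsing strings on the client. A minimal sketch of building such a filter, using only the attribute names from the entry above; the helper names are hypothetical, and the escaping follows the reserved characters listed in RFC 4515.

```python
def escape_filter_value(value):
    """Escape the characters RFC 4515 reserves in LDAP filter values."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def extension_filter(key, value=None):
    """Build a filter selecting Extension objects by key and, optionally, value."""
    clauses = [
        "(objectClass=GLUE2Extension)",
        f"(GLUE2ExtensionKey={escape_filter_value(key)})",
    ]
    if value is not None:
        clauses.append(f"(GLUE2ExtensionValue={escape_filter_value(value)})")
    return "(&" + "".join(clauses) + ")"


print(extension_filter("GPUPerNode", "2"))
```

The resulting string could be passed to any LDAP client as the search filter. LDAP equality filters compare strings, so range conditions such as a minimum memory size would still need care on the client side.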
Two-phase GPGPU Submission
Example GPGPU job requirements:
Requirements = GPUVendor=="Nvidia" && (GPUMainMemorySize >= 512);
GPUPerNode=2;
[Figure: LDAP Query Command, Global Resource Information Service, Orchestrate Grid Job]
Phase 1
(1) Convert GPGPU Requirements to LDAP query
(2) Query Global Information Service
(3) Return LDAP matches
(4) Generate List of Matched Resource Centres
Phase 2
(5) Generate Job Description (restricted to matched RCs) and Submit to Grid in normal way
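The two-phase flow can be sketched end to end as follows. This is a simplified illustration, not the authors' implementation: the published CDER attributes are held in an in-memory dictionary rather than queried over LDAP, the CE identifiers and executable name are hypothetical, and restricting the job with `other.GlueCEUniqueID` in the JDL Requirements is an assumption about how the restriction would be expressed.

```python
import operator

OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}


def matches(attrs, requirements):
    """Check published string attributes against (key, op, wanted) requirements."""
    for key, op, wanted in requirements:
        if key not in attrs:
            return False
        have = attrs[key]
        if isinstance(wanted, (int, float)):
            try:
                have = float(have)  # published values are strings
            except ValueError:
                return False
        if not OPS[op](have, wanted):
            return False
    return True


def phase1(published, requirements):
    """Phase 1: pre-filter, returning the CEs whose CDER attributes match."""
    return sorted(ce for ce, attrs in published.items() if matches(attrs, requirements))


def phase2(matched_ces, executable):
    """Phase 2: generate a job description restricted to the matched CEs."""
    if not matched_ces:
        raise ValueError("no resource centre satisfies the GPGPU requirements")
    clause = " || ".join(f'other.GlueCEUniqueID == "{ce}"' for ce in matched_ces)
    return f'Executable = "{executable}";\nRequirements = {clause};'


published = {  # hypothetical CEs and their published GPGPU attributes
    "ce1.example.com:8443/cream-pbs-gpgpu": {"GPUVendor": "Nvidia", "GPUMainMemorySize": "1024"},
    "ce2.example.com:8443/cream-pbs-gpgpu": {"GPUVendor": "Nvidia", "GPUMainMemorySize": "256"},
}
requirements = [("GPUVendor", "==", "Nvidia"), ("GPUMainMemorySize", ">=", 512)]
matched = phase1(published, requirements)
print(phase2(matched, "gpu_job.sh"))
```

Keeping the GPGPU matching outside the job description means the existing submission path is unchanged; only the list of candidate resource centres is narrowed beforehand.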
Conclusion
• Five Conceptual Methods considered
  • Only two methods promising
• Two-phase process required
  • 1st phase is a GLUE 2.0 GPGPU pre-filter on GPGPU requirements
  • 2nd phase restricts jobs to set of matching Resource Centres
• Attribute vs Extension
  • Attribute more space efficient
  • Extension easier to find individual Key/Values (time)
  • Mixed Attribute & Extension?
• Method applicable to many other new resources (Limited Software Licenses, …)