Optimizing XML Processing for Grid Applications Using an Emulation Framework
Rajdeep Bhowmik¹, Chaitali Gupta¹, Madhusudhan Govindaraju¹, Aneesh Aggarwal²
1. Grid Computing Research Laboratory (GCRL), Department of Computer Science
2. Electrical and Computer Engineering
State University of New York at Binghamton
IPDPS 2008, Miami, Florida
Motivation
• Emergence of Chip Multiprocessors (CMPs)
• Need to study XML-based grid middleware and applications for performance limitations, bottlenecks, and optimization opportunities
• How should grid middleware and applications be re-structured and re-tooled for multi-core processors?
• What designs will ensure that middleware and applications scale well with the increase in the number of processing cores?
McGrid
McGrid: Multi-core Grid Emulator
• An emulation framework for grid middleware
• Built on top of SESC: a cycle-accurate full-system multi-core simulator
• Configurable for system and micro-architectural parameters
• Current focus: obtain performance results for XML-based grid middleware documents on multi-core systems
Grid Simulators
• Many grid emulators and simulators exist
  GridSim, GangSim, SimGrid, MicroGrid
  These do not give feedback at the micro-architecture level: memory access patterns, cache coherency overheads, synchronization between the threads of the application
• Some fundamental challenges for code on CMPs
  Fair and efficient allocation of shared resources between concurrent threads
  Automatic detection of independent modules that can be executed in parallel
McGrid Design Goals
• Micro-architectural simulator
  Designed on top of SESC
  Allows pinning of threads to specific processing cores
• Provides micro-architectural feedback
  Cache access patterns of multiple threads
  Cache misses for different cache sizes
  Invalidations due to the cache coherency protocol
  Conflicts in accesses to shared resources
  CPU cycles wasted due to synchronization
McGrid Design Goals (2)
• Configurable design
  Allows analysis of grid-middleware performance for the different processor types used in a heterogeneous grid environment
• Configuration options
  Cache and physical memory size
  Processor and memory speed
  Number of on-chip cores
  Pipeline and pre-fetch depth of each core
  Execution width of each core
Porting to Multi-core Systems
• Initial analysis focus
  XML-based documents for job submission
  Event stream documents
  Workflow specifications
  SOAP messages with complex types
  Serialized data formats
• Decomposition
  Parts that need to be thread-private
  Parts that can be shared among threads
• Scheduling
  Mix of threads executing in parallel on CMPs
  Choice of core for a particular thread
XML-based Grid Middleware Design Considerations
• Role of XML in grid middleware
• Namespaces
• XML docs with repetition of elements
• XML docs without repetition of elements
• Buffering
• Scanning and caching
• Co-referenced objects and graphs
Bio-Medical Document
• The element atom appears repeatedly
• Each atom element shares the namespaces defined at the top of the document
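This repeated-element pattern lends itself to the equal-split threading the talk evaluates: hand each thread a contiguous chunk of the repeated elements. A minimal sketch follows; the document, element names, and per-element processing step are invented for illustration and are not McGrid code.

```python
# Minimal sketch (not McGrid code): splitting a sequence-based XML
# document with repeated elements into equal per-thread chunks, as in
# the direct-threading approach described in the talk.
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

DOC = "<molecule>" + "".join(
    f"<atom id='{i}' mass='{i * 1.5}'/>" for i in range(8)
) + "</molecule>"

def process_chunk(atoms):
    # Stand-in for the deserialization work done on one thread's share.
    return sum(float(a.get("mass")) for a in atoms)

def parallel_process(doc, n_threads):
    atoms = list(ET.fromstring(doc))
    share = len(atoms) // n_threads  # assumes the count divides evenly
    chunks = [atoms[i * share:(i + 1) * share] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(process_chunk, chunks))

print(parallel_process(DOC, 2))  # → 42.0, same total as a serial pass
```

Because every atom shares the namespaces declared at the top, each chunk can be processed without consulting the other threads' state.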
WS-Security Document
• Non sequence-based
• Some elements are more expensive to process than others
Research Questions
• How should namespaces be defined and used in XML processing to avoid triggering expensive synchronization algorithms between the cores?
• What are the ways to cache frequently used namespaces that result in performance gains on a multi-core processor?
• For what class of grid applications will the use of multiple threads on a multi-core processor provide significant speed-up compared to the serial processing model that is widely used for processing XML documents on a single-core processor?
Research Questions (2)
• What optimizations can be enabled when the size of sequence-based XML documents is known in advance?
• What algorithms can detect the cache access pattern of the application and dynamically distribute the processing load evenly among the various cores?
  This aspect of the research is part of future work.
Performance Results
Experimental Setup
• SESC: a cycle-accurate architectural simulator
• Each core has
  a private 32 KB 4-way set-associative Level-1 data cache
  a private 32 KB 2-way set-associative Level-1 instruction cache
  a private 512 KB 8-way set-associative Level-2 cache
• Cache replacement policy: LRU
• Cache coherence protocol: MESI
• Cache line size: 64 bytes
• For our performance tests: a MIPS cross-compiler built from the tool-chain gcc 3.4, glibc-2.3.2, Linux kernel headers 2.4.15
3 Threading Approaches
• Single-threaded
  A single thread is used on a single core
• Scanned-threaded
  The first thread scans the document and determines points of parallelism
  New threads then process the document in parallel
• Direct-threaded
  Same as scanned-threaded, except that the scanning step is skipped
  The parallel processing points are assumed to be known from processing documents of the same size and type in previous runs
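The difference between the scanned- and direct-threaded approaches can be sketched as follows; the document, tag name, and helper functions are assumptions for illustration, not McGrid code. Direct-threading simply reuses offsets cached from an earlier run instead of paying for the sequential scan.

```python
# Sketch: scanned- vs direct-threading. A scan pass locates the start
# of each repeated element; worker threads then process disjoint
# slices. Direct-threading skips the scan by reusing offsets from a
# previous run on a document of the same size and type.
import re
from concurrent.futures import ThreadPoolExecutor

def scan(doc, tag):
    # Sequential scan pass: find the points of parallelism.
    return [m.start() for m in re.finditer(f"<{tag}\\b", doc)]

def process_slice(doc, offsets):
    # Stand-in for the per-thread parsing work over one slice.
    return len(offsets)

def direct_threaded(doc, offsets, n_threads):
    share = len(offsets) // n_threads  # assumes an even split
    slices = [offsets[i * share:(i + 1) * share] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(lambda s: process_slice(doc, s), slices))

def scanned_threaded(doc, tag, n_threads):
    return direct_threaded(doc, scan(doc, tag), n_threads)  # scan every run

doc = "<events>" + "<event/>" * 4000 + "</events>"
cached = scan(doc, "event")  # offsets remembered from a "previous run"
assert scanned_threaded(doc, "event", 4) == direct_threaded(doc, cached, 4)
```

The parallel work is identical in both cases; only the up-front scan cost differs, which is why direct-threading shows the larger speed-up in the measurements that follow.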
Threading Configuration Measurements
• Speed-up of direct-threading over single-threading: 92% for all document sizes
• Speed-up of scanned-threading over single-threading: 20% for the 500-element document and about 12% for the 4000-element document
Direct-threading Performance
Performance almost doubles with a doubling of the number of cores: a speed-up of about 92% for 2000 and 4000 elements.
Performance Impact of Caching
• Performance of direct-threading for varying numbers of elements per core
• Processing is done by two threads running on two different cores
• Elements are evenly divided between the threads
• Results for 3 cases:
  Case 1: document preparation and processing are done on different cores
  Case 2: the document is prepared on the core that processes the bottom half of the elements
  Case 3: the document is prepared on the core that processes the top half of the elements
Performance Impact of Caching (2)
Performance of the two processing cores for the three cases of direct-threading for various document sizes.
Results for Even and Uneven Distribution of Elements with Direct-threading
• With an even distribution of elements:
  Core 1 has the shortest running time among the cores
  Core 3 has the longest running time among the cores
• With an uneven distribution of elements:
  The best performance is obtained for the distribution in which the running times of all cores are equal
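One way to realize such an uneven distribution is to size each core's share in proportion to its expected processing rate, so that all cores finish at the same time. A sketch with hypothetical rates (not measured McGrid data):

```python
# Sketch (hypothetical rates, not McGrid measurements): dividing N
# elements unevenly so each core's expected running time is equal.
# A core that processes elements faster, e.g. because it already holds
# the data in its cache, receives a proportionally larger share.
def uneven_split(n_elements, rates):
    # rates[i] = elements/second core i can sustain; equal finish time
    # means share_i is proportional to rate_i.
    total = sum(rates)
    shares = [n_elements * r // total for r in rates]
    shares[0] += n_elements - sum(shares)  # rounding leftover to core 0
    return shares

# Core 0 prepared the document (warm cache) and runs twice as fast.
print(uneven_split(4000, [2, 1, 1]))  # → [2000, 1000, 1000]
```

Estimating the rates requires knowing the threads' cache access patterns, which is exactly the feedback McGrid is designed to provide.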
Performance Impact of Cache Coherency
Configuration details
• A shared data structure for XML processing: a shared hash table used to process a co-referenced object
• Config 1: each write of an element is followed by one read of the element
• Config 2: each write of an element is followed by three reads of the element
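The two configurations can be sketched as follows (invented workload and table layout, not McGrid code). The point of the comparison is that on a CMP each write invalidates the other core's cached copy of the line, so performing more reads per write amortizes the coherence miss over more useful work.

```python
# Sketch: two threads sharing a hash table, as when processing a
# co-referenced object. Each write of an element is followed by one
# read (Config 1) or three reads (Config 2) of that element.
import threading

table, lock = {}, threading.Lock()

def worker(elements, reads_per_write, hits):
    for key, value in elements:
        with lock:
            table[key] = value           # write: invalidates remote copies
        for _ in range(reads_per_write):
            with lock:
                hits.append(table[key])  # read(s) of the same element

def run(reads_per_write, n_threads=2, n_elems=100):
    hits = []
    threads = [
        threading.Thread(
            target=worker,
            args=([(f"t{t}e{i}", i) for i in range(n_elems)],
                  reads_per_write, hits))
        for t in range(n_threads)
    ]
    for t in threads: t.start()
    for t in threads: t.join()
    return len(hits)

print(run(1), run(3))  # → 200 600
```

An emulator like McGrid can report how many of those reads hit an invalidated line, which ordinary grid simulators cannot.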
Performance Impact of Cache Coherency (2)
Performance for the two configurations of the shared hash table for various application document sizes and numbers of cores.
Table-lookup and Shared-Stack Based Namespace Implementations
Performance of the two configurations of the shared namespace stack for various document sizes and cores.
Conclusions
• XML docs should avoid redefinition of namespaces in inner elements
  This prevents expensive synchronization algorithms between the various cores.
• The number of elements in an XML doc may have to be unevenly divided among the multiple cores, taking into account the cache access patterns of the threads.
• When the size of a sequence-based document is known or can be guessed accurately, a simple threading approach with equal distribution of the elements between the threads performs the best, because the processing of the document is equally divided between the threads.
• Threads should be scheduled on cores that have already cached all or part of the data.
• Non-sequence-based documents should be scanned first; the processing load should then be balanced among the different cores.
Future Work
Future work includes:
• Running the emulator for a larger number of representative XML documents and grid middleware services.
• Running the emulator for representative grid applications.
• Studying the effect of different thread scheduling schemes on the cache access patterns of each core.
• Quantifying the benefits of parallel XML parsing techniques for different document types and sizes.
• Using the network simulator from the MicroGrid project to simulate the inter-node communication between grid nodes.
Thank You!