Optimizing XML Processing for Grid Applications Using an Emulation Framework
Rajdeep Bhowmik¹, Chaitali Gupta¹, Madhusudhan Govindaraju¹, Aneesh Aggarwal²
1. Grid Computing Research Laboratory (GCRL), Department of Computer Science
2. Electrical and Computer Engineering
State University of New York at Binghamton
IPDPS 2008, Miami, Florida
Motivation
• Emergence of Chip Multiprocessors (CMPs)
• Need to study XML-based grid middleware and applications for performance limitations, bottlenecks, and optimization opportunities
• How should grid middleware and applications be re-structured and re-tooled for multi-core processors?
• What designs will ensure that middleware and applications scale well with the increase in the number of processing cores?
McGrid
McGrid: Multi-core Grid Emulator
• An emulation framework for grid middleware
• Built on top of SESC: a cycle-accurate full-system multi-core simulator
• Configurable for system and micro-architectural parameters
• Current focus: obtain performance results for XML-based grid middleware documents on multi-core systems
Grid Simulators
• Many grid emulators and simulators exist
  GridSim, GangSim, SimGrid, MicroGrid
  These do not give feedback at the micro-architecture level: memory access patterns, cache coherency overheads, synchronization between the threads of the application
• Some fundamental challenges for code on CMPs
  Fair and efficient allocation of shared resources between concurrent threads
  Automatic detection of independent modules that can be executed in parallel
McGrid Design Goals
• Micro-architectural simulator
  Designed on top of SESC
  Allows pinning of threads to specific processing cores
• Provides micro-architectural feedback
  Cache access patterns of multiple threads
  Cache misses for different cache sizes
  Invalidations due to the cache coherency protocol
  Conflicts in accesses to shared resources
  CPU cycles wasted due to synchronization
McGrid Design Goals (2)
• Configurable design
  Allows analysis of grid-middleware performance for the different processor types used in a heterogeneous grid environment
• Configuration options
  Cache and physical memory size
  Processor and memory speed
  Number of on-chip cores
  Pipeline and pre-fetch depth of each core
  Execution width of each core
Porting to Multi-core Systems
• Initial analysis focus
  XML-based documents for job submission
  Event stream documents
  Workflow specifications
  SOAP messages with complex types
  Serialized data formats
• Decomposition
  Parts that need to be thread-private
  Parts that can be shared among threads
• Scheduling
  Mix of threads executing in parallel on CMPs
  Choice of core for a particular thread
XML-based Grid Middleware Design Considerations
• Role of XML in grid middleware
• Namespaces
• XML docs with repetition of elements
• XML docs without repetition of elements
• Buffering
• Scanning and caching
• Co-referenced objects and graphs
Bio-Medical Document
• The element atom appears repeatedly
• Each atom element shares the namespaces defined at the top of the document
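This repeated-element pattern lends itself to the equal-split threading the talk evaluates: hand each thread a contiguous chunk of the repeated elements. A minimal sketch follows; the document, element names, and per-element processing step are invented for illustration and are not McGrid code.

```python
# Minimal sketch (not McGrid code): splitting a sequence-based XML
# document with repeated elements into equal per-thread chunks, as in
# the direct-threading approach described in the talk.
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

DOC = "<molecule>" + "".join(
    f"<atom id='{i}' mass='{i * 1.5}'/>" for i in range(8)
) + "</molecule>"

def process_chunk(atoms):
    # Stand-in for the deserialization work done on one thread's share.
    return sum(float(a.get("mass")) for a in atoms)

def parallel_process(doc, n_threads):
    atoms = list(ET.fromstring(doc))
    share = len(atoms) // n_threads  # assumes the count divides evenly
    chunks = [atoms[i * share:(i + 1) * share] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(process_chunk, chunks))

print(parallel_process(DOC, 2))  # → 42.0, same total as a serial pass
```

Because every atom shares the namespaces declared at the top, each chunk can be processed without consulting the other threads' state.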
WS-Security Document
• Non sequence-based
• Some elements are more expensive to process than others
Research Questions
• How should namespaces be defined and used in XML processing to avoid triggering expensive synchronization algorithms between the cores?
• What are the ways to cache frequently used namespaces that result in performance gains on a multi-core processor?
• For what class of grid applications will the use of multiple threads on a multi-core processor provide significant speed-up compared to the serial processing model that is widely used for processing XML documents on a single-core processor?
Research Questions (2)
• What optimizations can be enabled when the size of sequence-based XML documents is known in advance?
• What algorithms can detect the cache access pattern of the application and dynamically distribute the processing load evenly among the various cores?
  This aspect of the research is part of future work.
Performance Results
Experimental Setup
• SESC: a cycle-accurate architectural simulator
• Each core has
  a private 32 KB 4-way set-associative Level-1 data cache
  a private 32 KB 2-way set-associative Level-1 instruction cache
  a private 512 KB 8-way set-associative Level-2 cache
• Cache replacement policy: LRU
• Cache coherence protocol: MESI
• Cache line size: 64 bytes
• For our performance tests: a MIPS cross-compiler built from the tool-chain gcc 3.4, glibc-2.3.2, Linux kernel headers 2.4.15
3 Threading Approaches
• Single-threaded
  A single thread is used on a single core
• Scanned-threaded
  The first thread scans the document and determines points of parallelism
  New threads then process the document in parallel
• Direct-threaded
  Same as scanned-threaded, except that the scanning step is skipped
  The parallel processing points are assumed to be known from processing documents of the same size and type in previous runs
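The difference between the scanned- and direct-threaded approaches can be sketched as follows; the document, tag name, and helper functions are assumptions for illustration, not McGrid code. Direct-threading simply reuses offsets cached from an earlier run instead of paying for the sequential scan.

```python
# Sketch: scanned- vs direct-threading. A scan pass locates the start
# of each repeated element; worker threads then process disjoint
# slices. Direct-threading skips the scan by reusing offsets from a
# previous run on a document of the same size and type.
import re
from concurrent.futures import ThreadPoolExecutor

def scan(doc, tag):
    # Sequential scan pass: find the points of parallelism.
    return [m.start() for m in re.finditer(f"<{tag}\\b", doc)]

def process_slice(doc, offsets):
    # Stand-in for the per-thread parsing work over one slice.
    return len(offsets)

def direct_threaded(doc, offsets, n_threads):
    share = len(offsets) // n_threads  # assumes an even split
    slices = [offsets[i * share:(i + 1) * share] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(lambda s: process_slice(doc, s), slices))

def scanned_threaded(doc, tag, n_threads):
    return direct_threaded(doc, scan(doc, tag), n_threads)  # scan every run

doc = "<events>" + "<event/>" * 4000 + "</events>"
cached = scan(doc, "event")  # offsets remembered from a "previous run"
assert scanned_threaded(doc, "event", 4) == direct_threaded(doc, cached, 4)
```

The parallel work is identical in both cases; only the up-front scan cost differs, which is why direct-threading shows the larger speed-up in the measurements that follow.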
Threading Configuration Measurements
• Speed-up of direct-threading over single-threading: 92% for all document sizes
• Speed-up of scanned-threading over single-threading: 20% for the 500-element document and about 12% for the 4000-element document
Direct-threading Performance
Performance almost doubles with a doubling of the number of cores: a speed-up of about 92% for 2000 and 4000 elements.
Performance Impact of Caching
• Performance of direct-threading for varying numbers of elements per core
• Processing is done by two threads running on two different cores
• Elements are evenly divided between the threads
• Results for 3 cases:
  Case 1: document preparation and processing are done on different cores
  Case 2: the document is prepared on the core that processes the bottom half of the elements
  Case 3: the document is prepared on the core that processes the top half of the elements
Performance Impact of Caching (2)
Performance of the two processing cores for the three cases of direct-threading for various document sizes.
Results for Even and Uneven Distribution of Elements with Direct-threading
• With an even distribution of elements:
  Core 1 has the shortest running time among the cores
  Core 3 has the longest running time among the cores
• With an uneven distribution of elements:
  The best performance is obtained for the distribution in which the running times of all cores are equal
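One way to realize such an uneven distribution is to size each core's share in proportion to its expected processing rate, so that all cores finish at the same time. A sketch with hypothetical rates (not measured McGrid data):

```python
# Sketch (hypothetical rates, not McGrid measurements): dividing N
# elements unevenly so each core's expected running time is equal.
# A core that processes elements faster, e.g. because it already holds
# the data in its cache, receives a proportionally larger share.
def uneven_split(n_elements, rates):
    # rates[i] = elements/second core i can sustain; equal finish time
    # means share_i is proportional to rate_i.
    total = sum(rates)
    shares = [n_elements * r // total for r in rates]
    shares[0] += n_elements - sum(shares)  # rounding leftover to core 0
    return shares

# Core 0 prepared the document (warm cache) and runs twice as fast.
print(uneven_split(4000, [2, 1, 1]))  # → [2000, 1000, 1000]
```

Estimating the rates requires knowing the threads' cache access patterns, which is exactly the feedback McGrid is designed to provide.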
Performance Impact of Cache Coherency
Configuration details
• A shared data structure for XML processing: a shared hash table used to process a co-referenced object
• Config 1: each write of an element is followed by one read of the element
• Config 2: each write of an element is followed by three reads of the element
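The two configurations can be sketched as follows (invented workload and table layout, not McGrid code). The point of the comparison is that on a CMP each write invalidates the other core's cached copy of the line, so performing more reads per write amortizes the coherence miss over more useful work.

```python
# Sketch: two threads sharing a hash table, as when processing a
# co-referenced object. Each write of an element is followed by one
# read (Config 1) or three reads (Config 2) of that element.
import threading

table, lock = {}, threading.Lock()

def worker(elements, reads_per_write, hits):
    for key, value in elements:
        with lock:
            table[key] = value           # write: invalidates remote copies
        for _ in range(reads_per_write):
            with lock:
                hits.append(table[key])  # read(s) of the same element

def run(reads_per_write, n_threads=2, n_elems=100):
    hits = []
    threads = [
        threading.Thread(
            target=worker,
            args=([(f"t{t}e{i}", i) for i in range(n_elems)],
                  reads_per_write, hits))
        for t in range(n_threads)
    ]
    for t in threads: t.start()
    for t in threads: t.join()
    return len(hits)

print(run(1), run(3))  # → 200 600
```

An emulator like McGrid can report how many of those reads hit an invalidated line, which ordinary grid simulators cannot.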
Performance Impact of Cache Coherency (2)
Performance for the two configurations of the shared hash table for various application document sizes and numbers of cores.
Table-lookup and Shared-Stack Based Namespace Implementations
Performance of the two configurations of the shared namespace stack for various document sizes and cores.
Conclusions
• XML docs should avoid redefinition of namespaces in inner elements
  This prevents expensive synchronization algorithms between the various cores.
• The number of elements in an XML doc may have to be unevenly divided among the multiple cores, taking into account the cache access patterns of the threads.
• When the size of a sequence-based document is known or can be guessed accurately, a simple threading approach with equal distribution of the elements between the threads performs the best, because the processing of the document is equally divided between the threads.
• Threads should be scheduled on cores that have already cached all or part of the data.
• Non-sequence-based documents should be scanned first; the processing load should then be balanced among the different cores.
Future Work
Future work includes:
• Running the emulator for a larger number of representative XML documents and grid middleware services.
• Running the emulator for representative grid applications.
• Studying the effect of different thread scheduling schemes on the cache access patterns of each core.
• Quantifying the benefits of parallel XML parsing techniques for different document types and sizes.
• Using the network simulator from the MicroGrid project to simulate the inter-node communication between grid nodes.
Thank You!