
[IEEE 2011 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS) – Seoul, Korea (South), 2011.06.30–2011.07.02]

Semantic XML Views Based On Geographical Context

David J. Rogan and Wenny Rahayu
Department of Computer Science & Computer Engineering, La Trobe University

Email: [email protected], [email protected]

Abstract—XML has become a standard for the storage and exchange of data. The widespread use of XML has made queries executed over related, but distinct, XML data sources increasingly relevant. With the growing popularity of XML, a wide variety of schemas may be applied to each document. In particular, XML data that describes geographical/spatial information often needs to deal with a large number of complex elements and yet, during access and retrieval, only a particular set of relevant information – such as a local area – is required. For this reason, we must search for ways to increase the performance of data processing and access.

Parallelism is an attractive way in which to achieve this aim. With multi-processor or multi-core systems becoming the standard, the idea of parallelism is emerging as a significantly important concept. Methods for XML data processing have been designed and implemented with varying degrees of success. However, these approaches deliver datasets that can contain irrelevant information for the purposes of the user, as they do not take the context into account while parsing. This can result in a decrease in the efficiency of the traversal of the data sets when querying.

As such, this paper presents a framework for the parallel processing of XML documents with a geographical context to filter and group spatial information. We propose an algorithm and a possible system implementation, and identify a potential scale-up methodology.

Index Terms—XML, context, parallelism, multi-core.

I. INTRODUCTION

With the typical size of a database or similar collection of data now in the terabytes [1], [2], sorting through extraneous data is costly and often inefficient. With large datasets such as those containing geographical data, the most important subset of the data can be lost within the whole. Querying this data for the desired subset becomes more complex when one must take into account a dynamic query definition, such as a changing environment or a geographical context. Along with the possibly dynamic nature of the query, one must also consider the applicability of the data in a temporal sense, that is, whether the data has been updated.

With the relevant geographical context being constructed from potentially many separate sources of differing spatial information, and consisting of changing requirements, the query must by definition be dynamic. This dynamic nature raises a number of issues, namely: how to process the vast amount of information, how to identify data relevant to the context, and how to integrate multiple sources of different types of spatial data into a single coherent context.

As the motivating problem concerns aviation, information such as the flight route and weather is stored. As a result, the notion of the context is – for this paper – restricted to defining relevant data by its associated spatial/geographical and temporal information. As such, the context can be broken down into three separate, but complementary, parts: firstly, the changing geographic location of the user; secondly, the continuously changing environment; and lastly, the user's planned locations.

To address these issues, this paper provides a novel approach that potentially optimises query performance by implementing a dynamic process for querying data that is related to the currently defined context. This paper will also attempt to show a feasible method to increase the scalability of the experimental framework.

This paper is organised into the following sections. Section 2 gives an overview of the related work, focusing on XML [3] processing. Section 3 introduces a method that attempts to address the identified problem. Section 4 proposes an algorithm to implement the contextual data grouping. Section 5 provides an overview of an experimental system, and Section 6 gives an evaluation aimed at determining the performance of the proposed framework. Section 7 summarises and concludes the paper.

II. LITERATURE REVIEW

Parallelism has become an attractive option for increasing performance because of its processing potential, and it is therefore an important area of research when attempting to improve the performance of systems that rely upon XML data sources. Although previous attempts have been made to reduce the time taken for queries to be executed, there has been less focus upon the idea of reducing the data to be queried so that the user only sees the data that is relevant to them.

In the paper “Parallel XML processing by work stealing” [4], W. Lu and D. Gannon present data partitioning methods that can be used with a work-stealing processing scheme. Such an approach can be used to traverse an XML tree in a depth-first, left-to-right order by visiting a parent node and then pushing the children into the work queue as new tasks.

In the first example, traversing an XML tree structure, the data is partitioned by the nodes: each node in the document is a single task in the work queue. The second example involves increasing the granularity. By partitioning the XML document into ranges, each range becomes a task for a processing element to add to its queue. These tasks are distributed across the processes and the results are reconstructed into the final solution tree.
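The node-per-task decomposition described above can be sketched as follows. This is a sequential illustration of the task structure only – a real work-stealing runtime would let idle workers steal tasks from the shared queue – and the sample document and its tags are assumptions made for the example.

```python
# Each visited element becomes one task; its children are pushed onto the
# work queue as new tasks, giving a depth-first, left-to-right traversal.
import xml.etree.ElementTree as ET
from collections import deque

doc = ET.fromstring(
    "<route><point id='a'/><point id='b'><alt/></point><point id='c'/></route>"
)

stack = deque([doc])   # the work queue; one task per node
visited = []
while stack:
    node = stack.pop()                    # a worker takes one task (one node)
    visited.append(node.get("id", node.tag))
    stack.extend(reversed(list(node)))    # children become new tasks,
                                          # leftmost child popped first

print(visited)  # ['route', 'a', 'b', 'alt', 'c']
```

Because each node is its own task, the queue can grow to the size of the document – which is exactly the resource-consumption drawback noted below for large files.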

The first partitioning approach has the drawback of consuming a large amount of system resources, which makes it an impractical partitioning scheme for large files. This kind of partitioning will also affect the processing method used, as the number of tasks can be considerable. The second approach widens the range of the partitions, thereby making the parallel processing more effective. This approach is more realistic to the real-life demands of XML processing.

978-0-7695-4373-4/11 $26.00 © 2011 IEEE
DOI 10.1109/CISIS.2011.34

Z. Fadika et al. in “Parallel and Distributed Approach for Processing Large-Scale XML Datasets” [5] present their toolkit for large-scale XML document processing, called Piximal. Previous toolkits for processing XML do not scale well for large-scale data.

This paper focuses on both micro-parallelism and macro-parallelism. For the micro-parallelism, a fine-grained implementation of parallelism and symmetric multiprocessing programming techniques are used. For the macro-parallelism — implemented using MapReduce [6] — a distributed approach is taken to the processing of large-scale data, which is then stored in a cluster.

Piximal uses a non-deterministic finite automaton (NFA) as its main parsing mechanism. The different states of the NFA, excepting the error state, are also used as starting states; in this way the NFA can be used for parallel processing. As the entry state is unknown, the data is fixed up when the concurrent streams are merged.

This fix-up of the data when merging the different processing streams requires extra computation and communication time to process the elements that may have been interrupted in the partitioning. This toolkit seems to be a promising addition to the body of work and addresses several issues that arise when attempting to scale up a system.

Rajesh Bordawekar et al. in “Parallelization of XPath Queries using Multi-core Processors: Challenges and Experiences” [7] take a pre-parsed XML document and, on a shared address space multi-core system, process the XPath queries in parallel.

This is implemented by using a multi-threaded pthread-based driver. This driver creates concurrent threads which each invoke the Apache Xalan [8] XPath processor. Each thread contains a separate XPath query; these are executed over a pre-parsed XML document.

In this paper the parallelization was achieved in three ways: data partitioning, query partitioning, and a hybrid of the two. With data partitioning, the query is executed on the different data sets, i.e., the same query is run on the varying partitions. In query partitioning, the sub-queries that are the components of the overall XPath query are executed over the entire data set.
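Query partitioning, in particular, can be sketched in a few lines. This is not the authors' Xalan/pthread implementation; it is a minimal illustration using Python's standard `xml.etree` path support and a thread pool, with the document and the path expressions assumed for the example.

```python
# Query partitioning: each thread runs one sub-query of the overall query
# over the whole pre-parsed document; partial results are merged at the end.
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

doc = ET.fromstring(
    "<airspace>"
    "<waypoint name='ALPHA'/><waypoint name='BRAVO'/>"
    "<storm cell='7'/>"
    "</airspace>"
)

# Sub-queries that together make up the overall query.
sub_queries = ["./waypoint", "./storm"]

def run_sub_query(path):
    # Each thread evaluates its sub-query over the entire data set.
    return [el.tag for el in doc.findall(path)]

with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
    partial_results = list(pool.map(run_sub_query, sub_queries))

merged = [tag for part in partial_results for tag in part]
print(merged)  # ['waypoint', 'waypoint', 'storm']
```

Note that all threads read the same shared tree, which mirrors the memory-sharing concern raised below: concurrent read access is safe, but it rules out updatable structures.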

Fig. 1: Hybrid partitioning.

With the hybrid partitioning scheme, the query is segmented into fragments. Once this is achieved, the dataset relating to each query fragment is partitioned as per the data partitioning scheme. This process is illustrated above in Fig. 1. The intermediate results from the concurrent query threads are joined using a hash-based merge-join method to create the final result.

Although the ability to run threads simultaneously generally shows an increase in parallelism, and the higher the number of concurrent threads the greater the speed-up, there are a few issues pertaining to this approach. By running the queries on the same dataset, the memory usage of the XML processor is increased. Similarly, because of the widespread access to the dataset, no updatable structures can be implemented. Also, as the XML queries are expressed with XPath, there is no automatic query optimisation. The partitioning scheme at first glance seems to be an effective method of partitioning XML data; however, it has yet to be seen how this proposal deals with many inputs concurrently.

III. PROPOSED METHOD

The idea of retrieving the context based on a user's location is not a new one; many mobile applications can perform tasks such as identifying the closest restaurant. However, collecting data based around, one, a changing geographical context and, two, the data that is expected to be relevant to the shifting context in the future, while integrating data from multiple XML sources, can be a challenging hurdle to overcome.

As a result of collecting this data we can define a contextual data grouping; the grouping contains only the collected data that is relevant to the user's situation. The collected data can be broken into three groups: firstly, the user's geographic location; secondly, the surrounding environmental data; and lastly, the projected future locations.

With each possible XML document being a different document type, that is, containing data of different data types, we must not only identify a relationship between the multiple data sources but also ensure that this relationship can be defined in a contextual way.

This integration is made harder by the hierarchical nature that is built into the structure of the XML language. With all the data that may be relevant residing on different levels of the tree, the increased processing power that parallelism provides makes the inclusion of parallel techniques, over a more traditional serial approach, an ideal way of implementing an on-demand and timely delivery of the data.

A. Context Construction

The first and most fundamental challenge presented in a problem such as contextual data grouping is how to define what data is relevant to the context. To do this we need some way to define the central relevant fact, that is, what links all the data together. We are therefore required to store the current location of the user.

Storing the user's location has a few implications. First, we can surmise that the user will move to different locations and therefore the value of the user's location will be changing in some way. Second, as the user's location changes, we require some way to collect newly relevant data for the grouping. From this we can deduce that we also need some metric that can define whether or not a given data element is relevant.

Fig. 2: The route that the user takes.


When building the view or grouping of data for the context, we are also interested in the planned route that the user shall be taking. This gives us a series of projected locations of where the user will be at some time in the future. As such, an undesired circumstance at a projected location – such as a storm – which would prevent travelling to that location can be planned for, and the route altered. In Fig. 2, each node from the origin to the destination is one of a series of points that make up the route that the user plans to take.

The metric that defines the degree of relevancy must be some measure of distance, i.e., the proximity of the area the data affects to the user's location. There are a number of frequently used metrics that can define the distance between two points, an example being Euclidean distance. However, we may want to take into consideration where one point is in relation to another, or whether they can affect one another.

Hence, we must be able to define a metric that will express this need before we can begin to build a view based on the context. In Fig. 3, points a and b are close enough to affect each other; b is therefore relevant to the context. The distance between points a and c, however, is too great for c to be relevant to a.

Fig. 3: Areas that are relevant to the context.

B. Update Mechanism

With the constant shifting of the user's location we can safely assume that the context will change. With this change comes the requirement for some form of update mechanism that will not only meet the requisite condition of re-constructing the context grouping when new data becomes relevant, but will also update the currently grouped data if that data changes.

When one updates the data in the view – the data deemed relevant – there are some concerns that must first be addressed. We must first ask ourselves whether the data is retrieved from multiple sources or documents; the answer to this question will have a significant impact on the processing method invoked for the data.

We then need to investigate conditions such as how long the update interval should be, whether this fits every possible circumstance, and whether it takes into account the velocity of the user. One must also take into account the nature of the data being queried: whether this data updates, and whether each change in the data requires an update of the context.

Along with this there is the consideration of the temporal aspect of the data and frequent data updates. If the data is time sensitive, then up-to-date data must be made available to the user in a timely manner. This necessity implies that one possible way for the update process to be triggered is when the data is altered, that is, the new data is pushed to the user, which then triggers the mechanism. Another possibility is that the data is updated on a set schedule, for example every half an hour, in which case the context update can be matched to occur after every data update.
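The two trigger styles just described can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the half-hour interval, and the sample payload are assumptions.

```python
# Two ways to trigger the update mechanism: a push trigger fired when new
# data arrives, and a scheduled trigger matched to a known update cadence.
import threading

calls = []

def rebuild_context_view(reason):
    # Stand-in for running the update mechanism (Algorithm 4 in this paper).
    calls.append(reason)

# Push style: the data source invokes this when it publishes new data.
def on_data_pushed(new_data):
    rebuild_context_view("pushed")

# Scheduled style: a timer matched to the source's update schedule (e.g.
# every half hour); returned unstarted so the caller decides when to run it.
def make_scheduled_trigger(interval_s=1800.0):
    return threading.Timer(interval_s, rebuild_context_view, args=("scheduled",))

on_data_pushed({"storm": "cell 7"})
print(calls)  # ['pushed']
```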

IV. CONTEXTUAL GROUPING ALGORITHM

In order to build the contextual view, our algorithm must have certain stages. Initially the algorithm must read in the XML from the data source(s); there may be one or many sources from which the data is gathered.

Using a pre-defined function, we can use the way that the data is connected to return a representation of how relevant that data is to the initial central fact. This function is crucial to the functioning of the entire algorithm and relies upon a number of things. The first thing we must know is how the context shall be defined, i.e., the central context. Next, we must have a metric that will allow the relevancy of the data to be gauged. The last part of the function is to define some kind of cut-off point: once we have some value that represents the measurement of the relationship, there must be some way to indicate that the data's relationship to the current context is too abstract or distant to include in the contextual grouping.

In Fig. 4 we can see a block diagram that depicts an abstracted view of the contextual algorithm and its stages, or sub-algorithms. Here, we can easily see that any number of data sources are used as inputs into the first stage of the algorithm and that the rest of the algorithm relies on this input.

Fig. 4: Algorithm overview.

A. Data Gathering

To represent the XML, not only the data should be preserved but also the hierarchy of elements with their associated attributes. This amounts to a requirement for an object that is hierarchical in nature to store the data that has been gathered, with each data source being described by a unique object. The data may originally come in some other heterogeneous format, but before applying the proposed algorithm the data must be transformed into XML format following a defined XML schema. The most important thing to realise is that the input data is partitioned and each partition is assigned to a single thread. The granularity of the data partitioning can be defined by the user based upon their requirements, such as keeping the number of partitions no smaller than the number of available worker threads. For the purposes of this paper we will assume that the partitioning scheme is based on predefined, non-overlapping geographical areas.

Once each thread has been created and has its partition assigned, each thread reads only the data from its own partition – the local partition – and then builds the hierarchical document for each input appropriately. Algorithm 1 outlines the process undertaken to create the hierarchical documents. We can see in this algorithm that D is partitioned into distinct, non-overlapping segments, and these segments are used as the input to the threads. Also, we see that before the threads are executed, the planned route is processed. This is because the planned route is part of what defines a user's context.

Algorithm 1 Parallel Data Gathering

Require: Input data D = D1, D2, ..., Dn partitioned into x distinct areas where x ≥ a
Require: Planned route r
1: Read in input source r
2: Build hierarchical document from r
3: Divide D amongst a threads
4: In each thread do
5: for all p ∈ local partition of D do
6: Read in from the input source p
7: Build hierarchical document from p
8: end for
9: end
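A minimal sketch of Algorithm 1 follows, assuming each partition is an XML string covering one non-overlapping geographical area; the sample documents, tag names, and thread count are illustrative assumptions.

```python
# Parallel data gathering: process the planned route first (it defines
# part of the context), then let each worker thread read only its own
# partition and build the hierarchical document for it.
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

planned_route = "<route><point lat='-37.7' lon='144.8'/></route>"

# One XML input per geographical partition (x partitions, x >= a threads).
partitions = [
    "<area id='0'><weather wind='15'/></area>",
    "<area id='1'><weather wind='40'/></area>",
    "<area id='2'><notam code='C'/></area>",
]

# Step 1-2: read and build the route document before the threads run.
route_doc = ET.fromstring(planned_route)

def gather(partition_xml):
    # Steps 5-8: each thread reads only its local partition and builds
    # the hierarchical document from it.
    return ET.fromstring(partition_xml)

with ThreadPoolExecutor(max_workers=2) as pool:   # a = 2 threads
    documents = list(pool.map(gather, partitions))

print([d.get("id") for d in documents])  # ['0', '1', '2']
```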

B. Pre-processing

We must now ensure that the new data that has been gathered is different from the data that was previously stored in the view. We must first check whether the context has changed, which means that there may be new data that is relevant. If the context has not changed, we must then test whether the data has been in any way altered or updated. If the dataset has been updated with new information, the next step will be to identify which data elements within the contextual group to update.

The symbol φ represents the current contextual view of the user. This view may contain data that is no longer relevant, or it may even be empty. There are three checks in this phase, which occur for each distinct document.

The first check involves testing whether the document type of the currently selected element in the document set is of the same type of data as that which is contained within C, that is, whether the data can in some way be related back to the context.

Each XML document should contain only one type of data; for example, drawing upon the area of aviation, there should be one file containing the weather data, one file for the route, etc. When comparing the data that has been read in to data stored within the system, we must ensure that each object that we are testing is of the same type as that document. One way to ensure this is to have a way of pre-defining which file or stream contains a certain data type.

Next we check the context of the data, represented by e, within the document d, the current document being processed. In this check we make sure that there is data that is relevant to the user's defined context. We can use the pre-defined function for measuring the context, which will determine whether the data is relevant.

The last check within this phase is to test whether the data that has been deemed relevant to the context is, one, a new data element that was previously unrelated to the context or, two, a data element that was previously related to the context but whose data has been changed, i.e., updated. If the context or the data has changed then we have to invoke the next stage of the algorithm, the context grouping stage.

Algorithm 2 Pre-processing

Require: The local set of documents D of each thread
Require: The current data grouping C
Require: The context representation φ
1: In each thread
2: for all d ∈ D do
3: obtain lock on C do
4: if the document type of d is the same as the data from C then
5: Continue current iteration
6: else
7: Next iteration
8: end if
9: release lock
10: for all e ∈ d do
11: obtain lock on φ do
12: if e has the same context as φ then
13: Continue current iteration
14: else
15: Next iteration
16: end if
17: release lock
18: if e is new data or e is updated data then
19: Begin Algorithm 3: Context Grouping
20: else
21: Next iteration
22: end if
23: end for
24: end for
25: end
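The three pre-processing checks can be sketched as below. The record layout, the document types, and the box-shaped relevance test are all illustrative assumptions; locks guard the shared grouping C and the context φ as in the pseudocode.

```python
# Pre-processing checks (Algorithm 2): document type vs C, element context
# vs φ, then new-or-updated detection before handing off to Algorithm 3.
import threading

grouping_lock = threading.Lock()
context_lock = threading.Lock()

C_type = "weather"                                    # data type held in C
phi = {"lat": -37.7, "lon": 144.8, "radius": 1.0}     # context representation

known = {("weather", "cell-7"): {"wind": 15}}         # elements already in C

def is_relevant(element, ctx):
    # Pre-defined relevance function: within the context's bounding box.
    return (abs(element["lat"] - ctx["lat"]) <= ctx["radius"]
            and abs(element["lon"] - ctx["lon"]) <= ctx["radius"])

def pre_process(document):
    to_group = []
    with grouping_lock:                       # check 1: document type vs C
        if document["type"] != C_type:
            return to_group
    for e in document["elements"]:
        with context_lock:                    # check 2: element context vs φ
            if not is_relevant(e, phi):
                continue
        key = (document["type"], e["id"])     # check 3: new or updated?
        if key not in known or known[key] != e["data"]:
            to_group.append(e)                # hand off to Algorithm 3
    return to_group

doc = {"type": "weather", "elements": [
    {"id": "cell-7", "lat": -37.5, "lon": 144.9, "data": {"wind": 40}},  # updated
    {"id": "cell-9", "lat": -30.0, "lon": 150.0, "data": {"wind": 5}},   # irrelevant
]}
print([e["id"] for e in pre_process(doc)])  # ['cell-7']
```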

C. Context Grouping

In Algorithm 3 we bring together all the elements that have been previously described for constructing the contextual view, and in this way we are able to collect distinct data into a single grouping, all of which is closely related to the user's context at the time of the build request.

Before we actually add the data into the view there are a few pre-conditions that must be met. The data that we are adding must have already been deemed relevant to the currently defined context. We must have defined the collection of data that is to represent the context that we are building and, if it exists, we may also need to know the view generated by the previous iteration of the entire algorithm. One last condition is that we need to know what type of data is represented by d; this type will be the same for all the entries in the source, as was touched upon previously.

Once these conditions have been met, we can begin the process of adding d into the new view, which is represented by V. We create an object that can contain data of the type T, which is the data type of d. Before we arbitrarily copy the values from d into the new object, we must check whether the data in d is just an update – i.e., the data is incomplete – of a previous view. If this is the case then we must use the previous data element as the base on which to update the values of the new object; otherwise we simply set the values of the newly created object to the values contained within d. Finally, now that the object is complete, we insert it into the growing view.
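This context-grouping step (Algorithm 3) can be sketched as follows, where the WeatherReport class stands in for the type T and its field names are illustrative assumptions: the new object is seeded from the previous view C when d is only a partial update, then added to the new view V under a lock.

```python
# Context grouping: build an object of type T for relevant datum d, seed it
# from the previous view when d is an update, and append it to the new view.
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class WeatherReport:          # the type T for this data source (assumed)
    id: str
    wind: Optional[int] = None
    visibility: Optional[int] = None

view_lock = threading.Lock()

def group(d, C, V):
    O = WeatherReport(id=d["id"])             # 1: create object O of type T
    previous = C.get(d["id"])
    if previous is not None:                  # 3-4: d updates existing data,
        O.wind = previous.wind                #      start from C's values
        O.visibility = previous.visibility
    for field, value in d.items():            # 7: overwrite with d's values
        if field != "id":
            setattr(O, field, value)
    with view_lock:                           # 8-10: add O into the list V
        V.append(O)
    return V

C = {"cell-7": WeatherReport("cell-7", wind=15, visibility=9)}
V = []
group({"id": "cell-7", "wind": 40}, C, V)     # partial update: keeps visibility
print(V[0].wind, V[0].visibility)             # 40 9
```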

Algorithm 3 Context Grouping

Require: The data d that has been defined as relevant
Require: The new contextual view to be constructed V
Require: The data type T
Require: The previously defined contextual view C
1: Create an object O of the type T to represent d
2: obtain lock on C do
3: if d is an update of data contained in C then
4: Initialise the values of O to the values within C
5: end if
6: release lock
7: Alter values of O to new values obtained from d
8: obtain lock on V do
9: Add O into the list V
10: release lock
11: return V

D. Update Mechanism

The update function or mechanism, though it is a crucial part of this algorithm, is not completely contained within the flow of the algorithm. Rather, this mechanism is what initiates the initial stage of the algorithm. When the pre-defined condition evaluates to true, the definition in Algorithm 4 initiates.

As in the algorithm definition, we first back up the current contextual view and place a timestamp upon it to show when it was created. This should occur whenever the update mechanism is triggered, whether or not the context turns out to need updating. This allows us to keep a true historical record and will show the view that was available to the user throughout an incremental time period.

When we say to retrieve the new context, we can take this to mean that we retrieve the representation of the user's current context at the time that the mechanism is triggered. In the case of the aviation case study this is the geographical location of the user, which is represented by the plane's longitude and latitude co-ordinates. We then take this data and pass it into the first stage of the algorithm along with the XML inputs.

When the mechanism triggers the first stage, the algorithm cycles through the input data and generates the context as outlined in the previous algorithms in this paper. When the building of the view is completed, the new contextual view is returned to the update function, which in turn compares this view to the backup. Should the new contextual view differ from the latest backup – indicating that either the context or the environmental data has changed – we then set it to be the current contextually relevant dataset.

Algorithm 4 Update Mechanism

Require: The newly relevant context φ
1: Backup current contextual view
2: Place a timestamp in the backup
3: Retrieve new context φ
4: Trigger Algorithm 1: Data Gathering
5: if the newly generated view ≠ the backup then
6: Update context
7: end if
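Algorithm 4 can be sketched as below. The view representation, the `rebuild_view` stand-in for Algorithms 1–3, and the sample data are all assumptions for illustration: back up and timestamp the current view, rebuild against the new context, and swap only when the newly generated view differs from the backup.

```python
# Update mechanism: timestamped backup, rebuild, compare, conditional swap.
import time

current_view = [("weather", "cell-7", 15)]
backups = []                      # (timestamp, view) history

def rebuild_view(context):
    # Stand-in for Algorithms 1-3 run against the newly retrieved context.
    if context["storm_nearby"]:
        return [("weather", "cell-7", 40)]
    return list(current_view)

def update_mechanism(context):
    global current_view
    backups.append((time.time(), list(current_view)))  # 1-2: timestamped backup
    new_view = rebuild_view(context)                   # 3-4: rebuild the view
    if new_view != backups[-1][1]:                     # 5: view changed?
        current_view = new_view                        # 6: update context
    return current_view

update_mechanism({"storm_nearby": True})
print(current_view)  # [('weather', 'cell-7', 40)]
```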

V. IMPLEMENTATION

When implementing a system there are other components that the system should contain. Among these is the metric that allows us to define, in a definitive way, whether some piece of the data read from the inputs is related to the current context of the user. Another consideration is what is required of the XML input sources: they must be able to be connected in some way to the context representation.

The data used as data sources for this solution is formatted in the Extensible Markup Language [3]. For this reason, when the data is processed we need to take into account the way an XML document is structured.

A. System Components

When traversing through the document tree, the nodes within the tree are on different levels of the tree. By utilising this structure we can easily discern the data for each separate entry. The downside, however, is that to find a specific piece of data within each entry we must traverse through the tree until we find the specific node. The problem with this is that each node does not know its parent, siblings or children; they are linked in no real way.

The data within the XML documents must in some way be able to be related to the central fact. Without this relationship, or some pre-defined way to connect the data, we could not create the view we aim for. The reason for this is self-evident: the central fact, for example the user's current geographical location, is what defines the very meaning of the context. It is this that tells us exactly what information the view currently pertains to.

When creating the metric we want to define as relevant those nearby areas that are likely to exert an effect upon the local area of the user. In this case, each point as defined above is a geographical location represented by a pair of Cartesian co-ordinates.

The creation of a mechanism that describes whether one geographical area is likely to affect another can be achieved by using a position vector. As we use the user's location as the central building point for the context, we take the origin of the vector to be the location of the plane at the point in time that the mechanism is executed. We can measure the distance between two points by calculating the magnitude of the vector, which is represented by the expression |v|.

Now, we can formally define that two areas affect each other, and are therefore related to each other, if the areas intersect. We can determine whether this intersection occurs by comparing the sum of the radii with the calculated magnitude of the vector drawn between the location of the plane and the terminal. If the magnitude of the vector is greater, then the two areas do not intersect and are not relevant to each other.

Given two points Pi and Pj the vector v may be derived. The magnitude of v is shown in Equation 1. The equation that determines the magnitude of the vector is the same as finding the Euclidean distance between the points Pi and Pj.

Let Pi = (x1, y1) and Pj = (x2, y2)

Thus, |v| = √((x2 − x1)² + (y2 − y1)²)   (1)
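The intersection test built on Equation 1 can be sketched as below. The class and method names echo Listing 2, but this is a reconstruction under stated assumptions, not the authors' actual implementation:

```java
public class DirectionVector {
    private final double dx, dy;

    // The vector from origin (the plane) to target (the terminal),
    // each given as a pair of Cartesian co-ordinates {x, y}.
    public DirectionVector(double[] origin, double[] target) {
        dx = target[0] - origin[0];
        dy = target[1] - origin[1];
    }

    // |v| = sqrt((x2 - x1)^2 + (y2 - y1)^2), as in Equation 1.
    public double magnitude() {
        return Math.sqrt(dx * dx + dy * dy);
    }

    // The two areas intersect when |v| <= r1 + r2, i.e. when the distance
    // between the centres does not exceed the sum of the radii.
    public boolean isIntersection(double r1, double r2) {
        return magnitude() <= r1 + r2;
    }

    public static void main(String[] args) {
        double[] plane = { 0.0, 0.0 };
        double[] terminal = { 3.0, 4.0 };
        DirectionVector v = new DirectionVector(plane, terminal);
        System.out.println(v.magnitude());          // prints 5.0
        System.out.println(v.isIntersection(3, 4)); // prints true
        System.out.println(v.isIntersection(1, 2)); // prints false
    }
}
```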

B. System Implementation

The data for testing the system, the terminals and the terminal aerodrome forecast (TAF), is read into the system as an object that represents the XML file, including the hierarchical structure of the document. After the XML file has been parsed, we can use the Document [9] object to extract the values contained within and create a series of objects that represent the data for the context.
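Parsing an XML input into a Document [9] object can be sketched with the standard Java DOM API. The route file here is a shortened, illustrative sample, not the real test data:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class RouteParser {
    public static void main(String[] args) throws Exception {
        // Abbreviated route sample in the style of Listing 1.
        String xml = "<route plane_id=\"plane01\">"
                   + "<point origin=\"true\">141.07 -17.68</point>"
                   + "<point>144.53 -22.58</point>"
                   + "<point destination=\"true\">148.59 -28.05</point>"
                   + "</route>";

        // Parse the XML into a hierarchical Document object.
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes("UTF-8")));

        // Extract the values contained within each point element.
        NodeList points = doc.getElementsByTagName("point");
        for (int i = 0; i < points.getLength(); i++) {
            System.out.println(points.item(i).getTextContent());
        }
    }
}
```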

With the hierarchical documents built, we can now parse through these documents and, as we build the objects, test to


see if they are relevant to the current context. If they are, then we add them to the contextual view we are attempting to build. To add the retrieved data we need to implement a comparison for both the route that is planned and the local environment of the user.

We have separate lists for the local user context and the route context data, as this allows us to group the data so that the user can display only the data relevant to their surroundings, the data related to their future travel plans, or both. It would also allow for the possibility of updating the local context independently of the route if so desired.

Listing 1: Route data file sample

<route plane_id="plane01">
  <point origin="true">141.070247222222 -17.6835027777778</point>
  <point>144.533333333333 -22.5833333333333</point>
  <point destination="true">148.595297222222 -28.0497555555556</point>
</route>

The route is represented by an object that contains a list of points, which are obtained from one of the XML inputs, and the user's current location is represented by a set of Cartesian co-ordinates. Listing 1 shows an extract of a sample route file used in this implementation.

Once the object has been built, we test to see whether it should be added to the lists. As shown in Listings 2 and 3, the terminal is compared with either the user's current location (Listing 2) or the locations within the route (Listing 3). Should a match be found, we then insert the terminal into the list.

Listing 2: User context checking

DirectionVector v = new DirectionVector(c_loc,
    terminal.getCoords());
if (v.isIntersection(3, 4))
    context_view.add(terminal);

Listing 3: Route context checking

LinkedList routeList = route.getPoints();
int length = routeList.size();
for (int i = 0; i < length; ++i) {
    if (routeList.get(i).equals(terminal.getCoords()))
        route_view.add(terminal);
}

Once all the terminals are found and added into the lists which represent our context view, it is time to add the TAF data to each terminal from the final input source. To do this, after we have created the TAF object we check whether the location code for the TAF matches that of the terminal we are testing.

The multi-threading implementation uses lists local to each thread, but instead of the traditional implementation of Java threads we use a thread pool, in this case an instance of the ExecutorService [10]. With this class one can pass in a collection or series of tasks, which are then processed and executed in parallel by the ExecutorService. The ExecutorService gives us an interface for controlling the thread operations.
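Submitting partitioned work to an ExecutorService [10] can be sketched as follows. The task bodies here are illustrative placeholders for the context-parsing work, and the counts are only for demonstration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadPoolDemo {
    public static void main(String[] args) throws Exception {
        // Partition a 1000-entry workload across 4 worker tasks.
        final int entries = 1000, workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
        for (int w = 0; w < workers; w++) {
            final int start = w * (entries / workers);
            final int end = start + entries / workers;
            // Each task would parse its own partition; here it just
            // reports the partition size.
            tasks.add(new Callable<Integer>() {
                public Integer call() { return end - start; }
            });
        }

        // invokeAll blocks the calling (main) thread until all workers
        // have finished, matching the behaviour described above.
        int total = 0;
        for (Future<Integer> f : pool.invokeAll(tasks)) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println(total); // prints 1000
    }
}
```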

VI. EVALUATION

When one is attempting to increase the performance of a system, there are two different options to choose from. The first is speed-up, an increase in performance demonstrated by quicker execution of the program's tasks. The second form is known as scale-up and is the focus of the performance increase in this paper. Scale-up is increasing the size of the tasks that are performed by the system.

More specifically, scale-up refers to a given system's ability to process a larger amount of data, a bigger task, while maintaining reasonable overall performance [1]. Because a large amount of data may be used by the system, we need a method of increasing the size of the data that can be processed while keeping the performance of the system, the execution time, as low as possible. As the case study could involve extremely large data, we will attempt to show the scale-up potential of the proposed algorithm.

A. Testing Environment

A virtual machine was chosen as the environment for evaluating this system, as such an environment can offer a number of benefits. The most useful of these is the ability to easily alter the hardware that is virtualised for the system. This allows us to compare the performance of the system under different circumstances, such as a different number of processors. Another advantage is that the system is entirely self-contained, which helps us ensure that the results we obtain are reliable and uninfluenced by outside factors.

The system was virtualised with the Oracle VM VirtualBox 3.2.8 [11] software. VirtualBox is an open-source program licensed under the GNU General Public License. The base setup for the virtual machine is listed in Fig. 5. The number of processors the system has depends upon the experiment, as mentioned previously. Ubuntu Linux was chosen as the operating system as it provides a stable environment with easy access to the required resources, and by using the 64-bit architecture we can have a multi-processor system.

CPU(s)  1-4, depending on experiment
RAM     2048MB ≡ 2GB
O.S.    Linux Ubuntu 10.04.1 LTS x64bit

Fig. 5: Virtual Machine Specifications

An implementation of the Java Development Kit was required to build and execute the designed system. In these tests the OpenJDK 6 [12] package was used to compile and build the required files. The version of the Java virtual machine, as returned by the command 'java -version', is 1.6.0_18. The Java compiler version, as returned by the command 'javac -version', is also 1.6.0_18.

B. Case Study

The plane initially has its route, which is represented by a series of co-ordinates. Each co-ordinate in the route has the possibility of being a terminal. If a co-ordinate has a related terminal, it will have associated weather information. Both the origin and the destination of the route, the start and end, will be terminals.

Each plane also has its current location, represented by a tuple consisting of x and y co-ordinates. The current location of the plane continuously changes as the plane travels along its route to its destination.

The plane contacts the central database, which contains the most recent data, and executes a predefined query that generates the view relevant for that plane's context. The output of this


query is then sent to the plane for use on the client side, that is, locally on the plane. For the following experiments, the data is partitioned into a number of parts equal to the number of worker threads.

C. System Performance

1) Experiment 1: In this experiment we compare the performance of a parallel implementation of the system in a multi-processor environment, with a differing number of cores and worker threads, using the thread pool implementation. In order to achieve reliable results, the times used as measurements are the average time, in milliseconds, from 100 executions of the program. These measurements are the result of taking the difference between the time before and after the parsing of the context, in milliseconds.

The data used in this experiment consists of 1000 entries for both the Terminals and the TAF data. When this data is processed in multiple threads, the data is partitioned as evenly as possible between the threads with no data overlap; e.g., for 1000 entries over two threads, each thread has 500 entries.
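The "as evenly as possible" partitioning can be sketched as below. This is a minimal illustration under assumed names; when the entry count does not divide evenly, the first (entries mod threads) partitions receive one extra entry:

```java
public class Partitioner {
    // Returns the number of entries assigned to each of the given threads,
    // with no data overlap and sizes differing by at most one.
    static int[] partitionSizes(int entries, int threads) {
        int[] sizes = new int[threads];
        for (int t = 0; t < threads; t++) {
            sizes[t] = entries / threads + (t < entries % threads ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        for (int s : partitionSizes(1000, 2)) System.out.print(s + " ");
        System.out.println(); // 1000 over 2 threads: 500 500
        for (int s : partitionSizes(1000, 3)) System.out.print(s + " ");
        System.out.println(); // 1000 over 3 threads: 334 333 333
    }
}
```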

In Fig. 6 we have a table of the threads and the number of cores with the average elapsed processing time. In Fig. 7 we have a line graph of these results, with the y axis representing the elapsed time in milliseconds and the x axis representing the number of threads. Each line in this graph shows the trend as we increase the number of cores.

           2 cores   3 cores   4 cores
Serial     1786.34   1501.34   1467.57
1 Thread   1742.7    1508.21   1462.48
2 Threads  2093.78   1438.51   1299.17
3 Threads  2316.41   1701.89   1416.53
4 Threads  2701.19   1882.95   1582.62

Fig. 6: Average elapsed time for data set of 1000 entries

Fig. 7: Elapsed time line graph

As we can see, on two cores the elapsed time trends upwards as we increase the number of threads. With three and four cores, two threads perform better than a single worker thread, but beyond that the time increases, though it is still better than on the two-core implementation.

Although at first this seems counter-intuitive, there are some additional facts that should be considered. The number of threads represented in the data is the number of worker threads, that is, the threads that process the data for the context. But there is another thread to keep in mind: the main thread. When the worker threads are invoked, the main thread pauses until they are finished.

Also, we should take into account the environment in which this program is run. As the program has been created in Java, the Java Virtual Machine must execute the program. This runs in the background and handles things like memory management and signal handling for the running program.

From this we can conclude that for a smaller dataset such as this, to achieve optimal performance we need enough cores to run not only the multiple threads of the program but also the system itself, without the two competing for resources. This principle is most evident in the four-core system running two worker threads.

2) Experiment 2: In this experiment we test how much impact the locking of the data, as described in the proposed parallel algorithms, has upon system performance. We do this by comparing the average elapsed time of an implementation that creates locks around the global data it wishes to access with the elapsed time of an identical implementation that accesses local data and, as such, has no locking.
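The contrast between the two implementations can be sketched as below. The structure is hypothetical: the real system builds terminal objects rather than integers, and the counts are illustrative only:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LockingComparison {
    public static void main(String[] args) throws InterruptedException {
        final int workers = 4, perWorker = 500;

        // Locking variant: every worker adds to one shared, synchronised
        // list, so each add acquires the list's lock.
        final List<Integer> shared =
            Collections.synchronizedList(new ArrayList<Integer>());
        Thread[] locking = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            locking[w] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perWorker; i++) shared.add(i);
                }
            });
            locking[w].start();
        }
        for (Thread t : locking) t.join();

        // Non-locking variant: each worker fills its own local list; no
        // locks are needed because the lists are merged by the main thread
        // only after all workers have finished.
        final List<List<Integer>> locals = new ArrayList<List<Integer>>();
        Thread[] lockFree = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final List<Integer> local = new ArrayList<Integer>();
            locals.add(local);
            lockFree[w] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perWorker; i++) local.add(i);
                }
            });
            lockFree[w].start();
        }
        for (Thread t : lockFree) t.join();
        List<Integer> merged = new ArrayList<Integer>();
        for (List<Integer> l : locals) merged.addAll(l);

        System.out.println(shared.size());  // prints 2000
        System.out.println(merged.size());  // prints 2000
    }
}
```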

The data set used in this experiment has 2000 entries for both the terminal and TAF data. This was tested with two, three and four threads in a four-core environment, as this has been shown to have the greatest performance. The average value was the average time from 100 iterations or executions of the system.

                             2 Threads  3 Threads  4 Threads
Locking Implementation       2225.17    2238.02    2440.7
Non-locking Implementation   2266.94    2194.79    2297.45

Fig. 8: Locking vs. Non-locking

Although, as shown in Fig. 8, the four-thread non-locking implementation seems faster, when one takes into account the results of the two- and three-thread executions, we can say that overall the locking and non-locking implementations achieve fairly similar results.

3) Experiment 3: In order to show the performance impact of larger data, a data set of 3000 was chosen to illustrate the scale-up of the system when compared to a data set of 1000. In Fig. 10 a bar chart shows the average elapsed processing time for data sets of size 1000 and 3000 on two to four cores. We can easily see that across the board a less-than-linear increase in the average time has occurred when increasing the size of the data set. With the linear increase being equal to three, we can see in Fig. 9 that as the number of threads increases so does the scalability, with a lower number being more desirable. Based on this fact we can conclude that while multi-threading may not have a significant impact in terms of speeding up the performance for small data sets, it does allow for greater scalability with an increased number of threads, as per Fig. 9. We can also see that the differences

           2 Cores   3 Cores   4 Cores
Serial     2.021788  2.385356  2.463896
1 Thread   2.194715  2.492398  2.513402
2 Threads  1.920421  2.243634  2.389672
3 Threads  1.877871  2.026541  2.155655
4 Threads  1.66099   1.943201  2.082831

Fig. 9: System Scalability

between all of the tested cores are consistent for both data sets. As in Fig. 9, as the number of cores increases there is a noticeable improvement in performance, most notably when transitioning from two to three cores. We can see with both data sets that when we increase the number of cores, the executions with more concurrent threads perform much better than otherwise expected. From this we can see how the number of concurrent execution threads can be hindered if the processor


Fig. 10: Elapsed time for system scale-up

does not have the correct number of cores to run the threads in a truly parallel manner.
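The scalability figures above can be read as a scale-up factor, an assumption about how Fig. 9 was derived: the elapsed time on the larger data set divided by the elapsed time on the smaller one, with a linear increase equal to 3 (3000 vs. 1000 entries) and lower values indicating better scale-up. The timings in the example are illustrative, not measured results:

```java
public class ScaleUp {
    // Scale-up factor: ratio of elapsed time on the larger task to the
    // elapsed time on the smaller one. Values below 3 indicate a
    // better-than-linear increase for a 3x larger data set.
    static double scaleUpFactor(double timeSmall, double timeLarge) {
        return timeLarge / timeSmall;
    }

    public static void main(String[] args) {
        // Illustrative figures only (milliseconds).
        double t1000 = 1299.17;
        double t3000 = 2495.0;
        System.out.println(scaleUpFactor(t1000, t3000) < 3.0); // prints true
    }
}
```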

4) Experiment 4: In this experiment we attempt to show the feasibility of implementing a parallel system in order to increase the scale of the tasks while keeping the speed of the system within acceptable bounds. This experiment is run in a four-core environment using both the serial implementation and concurrent worker threads, with a data set of 5000 entries for both the terminals and the TAF data.

When run initially, neither the serial nor the multi-threaded implementation successfully completed the parsing of the data, running out of memory on each attempt. There were some indications, such as a single thread completing its assigned parsing, that a parallel implementation would increase the scalability of the system. To test this, the heap size of the Java virtual machine was increased.

To arrive at the appropriate heap size for testing, the two-threaded, four-core implementation was used as the base, as on smaller datasets it was shown to have the greatest processing capability. By incrementally increasing the amount of memory assigned to the heap, we can arrive at a value where the system begins to work successfully in some cases. From here, we can observe any increase or decrease in system performance and reliability.

Fig. 11: Scale-up reliability increase

In Fig. 11 the graph shows the rate of successes and failures for 15 executions of each measured program with the heap increased to 768MB. As we can see, with the serial implementation the heap size is still never sufficient to successfully complete the parsing of the XML data, and therefore with 5000 entries the serial success rate is zero. We can also see in Fig. 11 that as the number of threads increases, so does the number of successful completions.

One can increase the amount of memory allocated to the system for better results; indeed, assigning 1024 megabytes of memory makes even the serial implementation work effectively on a data set of 5000, but memory is a finite resource. As such, we cannot depend on having access to increasing amounts of memory when larger and larger data sets must be processed.

This behaviour implies that increasing the number of threads, and consequently partitioning the data across those threads, dramatically increases the potential performance of the system with regard to the scale of the data it can effectively parse. As a greater number of threads has a lower rate of failure, we can also say that the memory of the system is more effectively utilised with more concurrent threads. However, there is a limit to this improvement, as one of the aims of high-performance scale-up is to keep the processing time consistent with smaller datasets. Moreover, the overhead of too many concurrent threads waiting for execution can also have a negative impact upon performance.

VII. CONCLUSION

In this paper a solution was proposed for collecting geographical data from updateable sources relating to the context, the spatial location, at the time of execution, with the aim of constructing a view containing data relevant to the user. Once this view has been constructed, any future queries from the user should execute at a faster speed, as the total amount of data is reduced for the purposes of the user. This idea has been expanded upon and discussed, and an attempt has been made to contribute an algorithm and a method of addressing this issue.

The parallel methodology has shown some indications of being an effective method of achieving scale-up for the system, and as such it can be stated that parallel processing aids in increasing the potential performance of the implemented system in regards to completing larger tasks while keeping the speed of execution within acceptable bounds.

REFERENCES

[1] D. Taniar, C. H. C. Leung, W. Rahayu, and S. Goel, High-Performance Parallel Database Processing and Grid Databases, ser. Wiley Series on Parallel and Distributed Computing. Wiley, 2008.

[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2006.

[3] “Extensible Markup Language (XML) 1.0,” Electronic resource, Nov. 2008, http://www.w3.org/TR/xml/.

[4] W. Lu and D. Gannon, “Parallel XML processing by work stealing,” in SOCP ’07: Proceedings of the 2007 Workshop on Service-Oriented Computing Performance: Aspects, Issues, and Approaches. New York, NY, USA: ACM, 2007, pp. 31–38.

[5] Z. Fadika, M. Head, and M. Govindaraju, “Parallel and distributed approach for processing large-scale XML datasets,” in GRID ’09: Proceedings of the 10th IEEE/ACM International Conference on Grid Computing, Oct. 2009, pp. 105–112.

[6] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[7] R. Bordawekar, L. Lim, and O. Shmueli, “Parallelization of XPath queries using multi-core processors: challenges and experiences,” in EDBT ’09: Proceedings of the 12th International Conference on Extending Database Technology. New York, NY, USA: ACM, 2009, pp. 180–191.

[8] “Apache Xalan Project,” Electronic resource, retrieved on Apr. 2010, http://xalan.apache.org/.

[9] “Java SE 6 API: Class Document,” Electronic resource, retrieved on Oct. 2010, http://download.oracle.com/javase/6/docs/api/org/w3c/dom/Document.html.

[10] “Java SE 6 API: Interface ExecutorService,” Electronic resource, retrieved on Oct. 2010, http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html.

[11] “VirtualBox Homepage,” Electronic resource, retrieved on Oct. 2010, http://www.virtualbox.org/wiki/VirtualBox.

[12] “OpenJDK Homepage,” Electronic resource, retrieved on Oct. 2010, http://openjdk.java.net/.
