
Research Article
A Group Mining Method for Big Data on Distributed Vehicle Trajectories in WAN

Jie Yang,1,2 Xiaoping Li,1 Dandan Wang,1 and Jia Wang1

1 School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
2 Public Security Bureau of Jiangsu Province, Nanjing 210024, China

Correspondence should be addressed to Xiaoping Li; xpli@seu.edu.cn

Received 11 August 2014; Revised 2 December 2014; Accepted 10 December 2014

Academic Editor: Xiaohong Jiang

Copyright © 2015 Jie Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A distributed parallel clustering method, MCR-ACA, is proposed by integrating the ant colony algorithm with the computing framework Map-Combine-Reduce for mining groups with the same or similar features from big data on vehicle trajectories stored in a Wide Area Network. The heaviest computing burden of clustering is carried out in parallel at local nodes, whose results are merged into small-size intermediate results. The intermediates are sent to the central node, and clusters are generated adaptively. The great overhead of transferring large volumes of data is avoided by MCR-ACA, which improves computing efficiency and guarantees the correctness of clustering. MCR-ACA is compared with an existing parallel clustering algorithm on practical big data collected by the traffic monitoring system of Jiangsu province in China. Experimental results demonstrate that the proposed method is effective for group mining by clustering.

1. Introduction

Recently, big data on vehicle trajectories collected by traffic monitoring systems, which are based on license plate identification and RFID techniques, has become more and more important in practice. For example, there are more than 100 subsystems in the traffic monitoring system of Jiangsu province in China, which includes more than 2 × 10^4 data sensors (collecting devices) distributed in 13 cities, covering 3 million acres. Around 5 × 10^4 million data records have been collected so far, with 70 million added every day. Usually, the traveling features of human beings imply some patterns; for example, because of people's behavior habits or their fixed working locations, they often go out at the same time along the same trajectory. Therefore, the data collected by traffic monitoring systems imply features of vehicle trajectories, which illustrate characteristics of human behaviours. It is reported that 93% of human behaviors can be foreseen [1] and that four spatiotemporal points are enough to uniquely identify 95% of individuals [2]. Likewise, it is possible to identify a driver with high probability from several trajectory records, and a driver group with the same or similar behaviour features can be found by mining big data on vehicle trajectories. This is quite important and applicable in practice; for example, a band of criminal suspects could be found if they use cars for transportation. Big data on vehicle trajectories is stored distributively in WAN, which is huge in amount and hard to centralize physically by extraction.

The biggest challenge in big data clustering is designing effective algorithms for clustering and distributed parallel computation [3]. For these issues, some distributed parallel computing frameworks based on Cloud Computing [4] and MapReduce [5] have been proposed in recent years, such as the batch computing framework [6, 7], the stream parallel computing framework [8], the customized parallel computing framework [9], and the mixed parallel computing framework [10]. Based on such computing frameworks, some distributed parallel clustering algorithms have been proposed. A new density-based clustering algorithm, DBCURE-MR [11], was introduced, which is robust in finding clusters with various densities and suitable for parallelization with MapReduce. A nonparametric accuracy estimation method and system [12] were proposed for speeding up big data analysis; sampling with replacement was adopted to obtain the sampling points according to the sampling distribution, so that the amount of data input to MapReduce can be decreased considerably. Taking into account the distributed nature of partitioned data and models, three clustering algorithms [13], k-means, canopy, and fuzzy k-means, were implemented in parallel on MapReduce, and the efficiency of distributed clustering was improved significantly. For detection in large community networks, a parallel structural clustering algorithm was introduced [14], which is based on the similarity of edge structures and MapReduce. The interfaces and implementations for user-defined aggregation in several states of distributed computing systems were evaluated in [15], where the communication overhead of data-intensive applications could be decreased largely by local clustering, which clusters the intermediate results generated by Map tasks and then transmits the clustered results to Reduce tasks.

Hindawi Publishing Corporation, International Journal of Distributed Sensor Networks, Volume 2015, Article ID 756107, 9 pages, http://dx.doi.org/10.1155/2015/756107

The existing methods improve clustering efficiency either by parallel computing on physically centralized big data or by reducing the data scale through sampling. However, the communication overhead of data centralization and the impact of sparse data on clustering accuracy have not yet been considered. In this paper, by integrating the ACA (ant colony algorithm) with the computing framework Map-Combine-Reduce (MCR), a MapReduce-based distributed parallel clustering method, MCR-ACA, is proposed for group mining on big data of vehicle trajectories in WAN. Some parallel ACA methods based on MapReduce have been proposed [16, 17] (denoted as MR-ACA); however, since these methods work on physically centralized big data in LAN, the communication overhead of data centralization is ignored. The MCR-ACA method contains three stages: the Map operation, the Combine operation, and the Reduce operation. The computation tasks with the heaviest burden are conducted, and their results combined, in parallel on the data source nodes. The combined result is transmitted to the central node, and new cluster centers are generated adaptively. The presented method avoids the communication overhead of big data migration, improves the clustering efficiency, and guarantees the accuracy of the global cluster among distributed nodes.

The rest of this paper is organized as follows. The problem of group mining for vehicle trajectories is described in Section 2. The distributed parallel clustering method MCR-ACA is proposed in Section 3. Section 4 shows the computational experiments, followed by the conclusion and future work in Section 5.

2. Group Mining for Vehicle Trajectories

Distributed frameworks in WAN are always adopted by traffic monitoring systems, and hierarchical ones are even utilized by some complex systems. For the traffic monitoring system of Jiangsu examined in this paper, a 3-layer framework is applied. There are 13 independent branch centers in the 13 cities, respectively, which are responsible for the integration of the independent data branches within each city. A head center is in charge of all the branch ones. Therefore, the main characteristics of big data on vehicle trajectories in WAN are that it comes from multiple data sources and is hard to centralize physically.

In this paper, the data branches are called source nodes, the city branch centers are city nodes, and the head center is the central node. The topological network is shown in Figure 1.

Traffic monitoring systems, which contain multiple independent subsystems, are being developed in the cities. They form distributed data sources which grow rapidly. There are more than one hundred subsystems in the system of the Jiangsu province. The amount of data in the subsystems is huge and increasing quickly (the data increment in one city of Jiangsu is over 12 million records every day). Multimedia data in various formats (such as photos and videos) grows by several TBs each day, which is very difficult to centralize physically in the central node. The trajectory data of the cars collected by a subsystem is listed in Table 1.

Group mining for vehicle trajectories (GMVT for short) is critical for data clustering on big data of distributed traffic monitoring systems in WAN. The main idea of mining groups on big data of vehicle trajectories is to implement the automatic partition of vehicle trajectories with the same or similar features by clustering on attributes which consist of the metadata (e.g., time and location) of the vehicle trajectories. The information (the license plate numbers) of the vehicle trajectories in the same clusters can be drawn from the partition result. A complete vehicle trajectory record includes the following metadata: the license plate number, passing time, location, direction, speed, and car color. These attributes are set as metadata. Records collected by different sensors in various subsystems are normalized as a 6-tuple (license plate number, passing time, location, direction, speed, car color). The first record in Table 1 is normalized as (S032V0, 20130521073907, checkpoint at the crossroad of Suyuan road and west Qingshuiting road, 1, 41, A). Every element is assigned a weight. Features of vehicle trajectories with the same or similar elements are clustered on the data records.
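As an illustration of this normalization step, the sketch below maps one raw sensor record onto the 6-tuple and attaches per-element weights. The raw field names (`plate_no`, `passing_time`, etc.) and the helper itself are assumptions for illustration, not the system's actual schema; the weight values are the ones reported later in the experiments of Section 4.

```python
# Illustrative sketch (not the authors' code): normalizing a heterogeneous
# sensor record into the 6-tuple (HPHM, JGSJ, JGDD, XSFX, XSSD, CSYS).
from collections import namedtuple

Record = namedtuple("Record", ["hphm", "jgsj", "jgdd", "xsfx", "xssd", "csys"])

# Per-attribute weights P_h, as used in the experiments of Section 4.
WEIGHTS = {"hphm": 0.05, "jgsj": 0.3, "jgdd": 0.3,
           "xsfx": 0.15, "xssd": 0.15, "csys": 0.05}

def normalize(raw: dict) -> Record:
    """Map one raw sensor record (a dict with assumed keys) to the 6-tuple."""
    return Record(
        hphm=raw["plate_no"],        # license plate number
        jgsj=raw["passing_time"],    # passing time, e.g. "20130521073907"
        jgdd=raw["location"],        # collection location
        xsfx=int(raw["direction"]),  # driving direction code
        xssd=float(raw["speed"]),    # speed
        csys=raw["color"],           # car color code
    )

r = normalize({"plate_no": "S032V0", "passing_time": "20130521073907",
               "location": "Checkpoint at the crossroad of Suyuan road "
                           "and west Qingshuiting road",
               "direction": "1", "speed": "41", "color": "A"})
print(r.hphm, r.xssd)
```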

Suppose there are Z distributed data sources (subsystems) S_1, S_2, …, S_Z. Each data source S_i, with 6 attributes, has T_i tables L_i1, L_i2, …, L_iT_i. A table L_ij has Q_ij records. In total there are ∑_{i=1}^{Z} ∑_{j=1}^{T_i} Q_ij records. The objective of group mining is to partition the set A with ∑_{i=1}^{Z} ∑_{j=1}^{T_i} Q_ij records into N subsets A_1, A_2, …, A_N. In each subset, records have the same or similar attributes, ⋃_{i=1}^{N} A_i = A, and A_i ∩ A_j = ∅ for all i ≠ j. In an arbitrary subset A_i, all records are grouped according to the license plate number. The clustering accuracy rate is the ratio of the clustered records of one car to the total records in which this car is involved (e.g., if the total number of records of one car is Q_c and Q records of the car are clustered into one class, the corresponding accuracy rate is (Q/Q_c) × 100%). If the accuracy rate is greater than a given threshold η, the car is merged into this class. All cars in this class have the same or similar trajectory features.
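The accuracy-rate test above translates directly into code. A minimal sketch, with hypothetical helper names:

```python
# Sketch of the accuracy-rate test deciding whether a car is merged into a
# class A_i; function names are illustrative, not the authors' code.
def accuracy_rate(q_in_class: int, q_total: int) -> float:
    """Percentage of a car's records that were clustered into one class."""
    return 100.0 * q_in_class / q_total

def merge_car(q_in_class: int, q_total: int, eta: float) -> bool:
    """Merge the car into the class if the accuracy rate exceeds threshold eta."""
    return accuracy_rate(q_in_class, q_total) > eta

print(accuracy_rate(9, 10))    # 90.0
print(merge_car(9, 10, 85.0))  # True: 90% > 85% threshold
```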

3. Group Mining Methods for Vehicle Trajectories in WAN

Figure 1: Topological network of the traffic monitoring system (the central node connects through WAN to city nodes, each of which aggregates source nodes with RFID and camera sensors).

Table 1: Data collected by a subsystem.

HPHM   | JGSJ           | JGDD                                                                       | XSFX | XSSD | CSYS
S032V0 | 20130521073907 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road      | 1    | 41   | A
S032V0 | 20130522191001 | Southwest corner of the crossroad of Jinxianghe road and east Beijing road | 0    | 25   | J
S032V0 | 20130524213427 | Northwest corner of the crossroad of Xuanwu road and Huayuan road          | 2    | 85   | B
S470A5 | 20130523071732 | Northeast corner of the crossroad of south Fengtai road and Hexi street    | 6    | 31   | J
S470A5 | 20130524101258 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road      | 2    | 42   | J

In WAN, GMVT is critical for clustering data of vehicle trajectories by attribute features such as time or location to construct various classes with the same or similar attributes, according to which drivers with the same or similar features are identified as a group. Because of the two characteristics mentioned in Section 2, traditional parallel clustering methods are no longer efficient, which motivates us to present the following method.

3.1. Clustering Framework for Distributed Big Data in WAN

Data clustering is very difficult for big data that is stored distributively in WAN. The reason lies in two aspects: (i) the huge amount of data makes the clustering computation more time-consuming, which renders existing methods infeasible; (ii) the communication overhead of data migration is generally higher than the computing cost, so it is better to migrate computation rather than data. Therefore, a distributed parallel computing framework (MCR), based on MapReduce, is proposed in this paper. The framework of MCR is depicted in Figure 2.

Based on MCR, the traditional ACA is adapted into MCR-ACA for group mining on big data. The procedure of MCR is described as follows.

(i) Divide the data source S_i (i = 1, 2, …, Z) into H_i data chunks B_i1, B_i2, …, B_iH_i.

(ii) Map operations are carried out on each data chunk B_ij by a clustering strategy. All records in B_ij are clustered by the given strategy.

(iii) The clustered results are merged into intermediate ones by Combine. For example, B_ij is clustered and combined into a set of intermediate results with m_ij elements (m_ij is usually small).

(iv) The intermediate results are sent to the central node, where Reduce is conducted for global clustering.

(v) The method terminates if the global clustering converges or the maximal number of iterations g_max is reached. Otherwise, the comparison parameter is sent to each data chunk by Reduce, and the next iteration starts from step (ii).

Computing operations with the heaviest burden are conducted in parallel at the source nodes. Data in each source node is divided into data chunks, and all chunks are clustered in parallel, which leads to good efficiency. Communication overhead is significantly reduced by transmitting intermediate results combined at the local source nodes rather than the source data. The global clustering is conducted on the intermediate results at the central node.
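The data flow of steps (i)-(v) can be sketched as a runnable toy driver. The "clustering strategy" here is a deliberately trivial stand-in (bucketing one-dimensional values by their integer part), not the ACA of Section 3.2; only the Map → Combine → Reduce shape mirrors the framework, and all function names are illustrative assumptions.

```python
# Toy single-pass skeleton of the MCR data flow; the paper iterates this
# loop up to g_max times with a convergence test (omitted here).
from collections import defaultdict

def local_map(chunk):
    # (ii) cluster records inside one chunk (toy strategy: bucket by int()).
    return [(int(x), x) for x in chunk]

def local_combine(mapped):
    # (iii) merge local results into small intermediates: class -> (count, sum).
    acc = defaultdict(lambda: [0, 0.0])
    for key, x in mapped:
        acc[key][0] += 1
        acc[key][1] += x
    return {k: tuple(v) for k, v in acc.items()}

def global_reduce(intermediates):
    # (iv) merge the intermediates from all chunks into global class centers.
    total = defaultdict(lambda: [0, 0.0])
    for inter in intermediates:
        for k, (n, s) in inter.items():
            total[k][0] += n
            total[k][1] += s
    return {k: s / n for k, (n, s) in total.items()}

def mcr_run(sources, chunk_size=2):
    intermediates = []
    for source in sources:                   # each source node (parallel in practice)
        chunks = [source[i:i + chunk_size]   # (i) divide S_i into chunks B_ij
                  for i in range(0, len(source), chunk_size)]
        intermediates += [local_combine(local_map(c)) for c in chunks]
    return global_reduce(intermediates)      # only small intermediates cross the WAN

centers = mcr_run([[1.0, 1.2, 2.1], [2.3, 1.1]])
print(centers)
```

Note that only the per-class `(count, sum)` pairs travel to the central node, which is the point of combining locally before transmission.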

Figure 2: MCR distributed parallel computing framework (at each source node, clustering tasks run on local data chunks and their results are combined; the intermediate results from all source nodes are transmitted over WAN to the central node, which produces the global clustering result).

3.2. MCR-ACA Method for Group Mining

ACA was inspired by the phenomenon that ant individuals gather at a location with food through pheromone interaction among them [18, 19]. By integrating ACA with the computing framework MCR, the group mining method MCR-ACA is proposed. The number of classes and the trajectory records in a class are determined adaptively, and clustering centers are generated over the iterations without predefinition, which is desirable for the considered problem.

3.2.1. Map Function of MCR-ACA

A vector of m elements (attributes) is developed for each record of vehicle trajectories; that is, a data record R_k to be clustered is denoted as R_k = (r_k^1, r_k^2, …, r_k^m). a_ij ants are assigned to data chunk B_ij, each of which randomly serves a record R_k in the initial step. A neighborhood is established with the center R_k and radius R (set from experience), denoted as N(R_k, R). The comprehensive similarity between the center R_k and all records within its neighborhood N(R_k, R) is defined by

    f(R_k) = ∑_{R_l ∈ N(R_k, R)} [1 − d_kl / λ],    (1)

where λ is the similarity factor representing the range of the dimensions (the difference between the maximum and minimum of a dimension). Let d_kl be the space distance between two records R_k and R_l, which is calculated as the weighted distance

    d_kl = ‖P(R_k − R_l)‖ = √( ∑_{h=1}^{m} P_h (r_k^h − r_l^h)^2 ),    (2)

where P_h is the weight based on experience and data collecting accuracy. Therefore, the probability of clustering record R_k into class N(R_k, R) is computed by

    p_kl(t) = τ_kl^α(t) f_t^β(R_k) / ∑_{R_z ∈ N(R_l, R)} τ_zl^α(t) f_t^β(R_z),    (3)

where α and β are control parameters and τ_kl(t) is the pheromone amount on the path from R_k to R_l at time t (τ_kl(0) = 1).
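Equations (1)-(3) translate into code as follows. The sketch below operates on numeric attribute vectors and is an illustration under assumed data layouts (weights `P` and pheromones `tau` as sequences/dicts), not the authors' implementation; in particular, the normalization set in (3) is taken over the records of the neighborhood.

```python
# Sketch of eqs. (1)-(3); names mirror the text: P_h weights, lambda (lam)
# similarity factor, tau pheromone, alpha/beta control parameters.
import math

def weighted_distance(rk, rl, P):
    # eq. (2): weighted Euclidean distance between two attribute vectors
    return math.sqrt(sum(p * (a - b) ** 2 for p, a, b in zip(P, rk, rl)))

def similarity(rk, neighborhood, P, lam):
    # eq. (1): comprehensive similarity of center rk to its neighborhood
    return sum(1.0 - weighted_distance(rk, rl, P) / lam for rl in neighborhood)

def cluster_probability(k, records, neighborhood_idx, P, lam, tau, alpha, beta):
    # eq. (3): probability of clustering record k into its neighborhood class,
    # normalized over the records z in the neighborhood (assumed reading).
    def term(z):
        neigh = [records[i] for i in neighborhood_idx]
        return (tau[z] ** alpha) * (similarity(records[z], neigh, P, lam) ** beta)
    denom = sum(term(z) for z in neighborhood_idx)
    return term(k) / denom if denom > 0 else 0.0
```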

The decision of putting down or moving R_k is made in terms of the clustering probability p_kl(t). (i) If p_kl(t) is greater than or equal to the given threshold p_0, the ant puts down R_k and clusters it into class N(R_k, R). The traveled path length of the ant is saved, the location where R_k was put down is set as the start of a new traversal path, and another record is randomly assigned to the ant. (ii) If p_kl(t) is less than p_0, the ant carrying R_k keeps moving to the next point R_l with the largest p_kl(t). R_k is dropped when the path length reaches the maximum, or when the ant has not found a proper cluster by the end of its travel (R_k is then regarded as abandoned), and the ant gets a new record. After all records in B_ij have been traveled by the ants, that is, |B_ij| records have been clustered or abandoned, local clustering stops.

The Map function takes (⟨key, R_k⟩, ⟨p, d, s⟩) as the input key/value pair, where key is the key value of R_k, p is the clustering probability, and d is the path length along which the ant carries R_k; p and d are initialized to 0. s is the coordinate of the nodes along the path of length d, which is initialized to 0. g is the index of the current iteration, and τ_kl(g) is the pheromone value on the path from R_k to R_l after the gth iteration, with initial value 1. p_k is the clustering probability for R_k, d_k is the path length when R_k is clustered or abandoned, and s_k is the node where R_k is clustered or abandoned. d_g is the minimum d_k after the gth iteration, which is the comparison parameter for the next iteration. The Map function on B_ij is described in Algorithm 1.

Algorithm 1: Map function of MCR-ACA.
  Input: Data chunk B_ij
  (1)  p ← 0; d ← 0; Num ← 0; s ← 0; τ_kl(0) ← 1; p_0 is given
  (2)  while (Num ⩽ |B_ij|) do
  (3)    Calculate the weighted distance d_kl between R_k and all records in N(R_k, R) by (2)
  (4)    Calculate the comprehensive similarity f(R_k) between R_k and all records in N(R_k, R) by (1)
  (5)    Read the pheromone value τ_kl(g − 1); calculate the clustering probability p_kl(t) by (3)
  (6)    if p_kl(t) ≥ p_0 then
  (7)      Cluster R_k into N(R_k, R); save p_k, d_k, s_k; Num ← Num + 1
  (8)      Go to Step 17
  (9)    if d ≥ d_{g−1} then
  (10)     Abandon R_k; save p_k, d_k, s_k; Num ← Num + 1
  (11)     Go to Step 17
  (12)   Select the node with the largest p_kl(t) into s
  (13)   if all records have been examined by the ant then
  (14)     Abandon R_k; save p_k, d_k, s_k
  (15)     Go to Step 17
  (16)   d ← d + d_kl; go to Step 2
  (17)   Randomly assign the ant a new record which has not been clustered or abandoned
  (18) Output the clustering result (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩)
  (19) return

Algorithm 2: Combine function of MCR-ACA.
  Input: Probability threshold p_0
  (1) if p_k < p_0 then
  (2)   d_ij ← min d_k
  (3)   Output d_ij and go to Step 6
  (4) Combine records with the same s_k and generate N_sk
  (5) Calculate C_sk; output (s_k, ⟨N_sk, C_sk⟩)
  (6) Update the pheromone τ_kl(g)
  (7) return

3.2.2. Combine Function of MCR-ACA

The results obtained by the Map function, (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩), are combined into intermediate results (s_k, ⟨N_sk, C_sk⟩) by the Combine function at the local nodes. The minimum d_ij along the paths of this iteration is found. The pheromone is also updated by the Combine function: it increases when ants pass by and decreases over time. The pheromone along the path from R_k to R_l after the gth Map function is updated by

    τ_kl(g) = (1 − ρ) · τ_kl(g − 1) + Δe,    (4)

where ρ ∈ (0, 1] denotes the evaporation rate of the pheromone and Δe is the pheromone left by a passing ant; Δe is set to 1 if an ant passes by and to 0 otherwise.
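Update rule (4) is a one-line helper; the surrounding bookkeeping (one pheromone value per path, a flag marking whether an ant traversed it) is an assumption for illustration.

```python
# Direct transcription of pheromone update (4).
def update_pheromone(tau_prev: float, rho: float, passed: bool) -> float:
    delta_e = 1.0 if passed else 0.0  # pheromone left by a passing ant
    return (1.0 - rho) * tau_prev + delta_e

tau = update_pheromone(1.0, 0.1, passed=True)   # evaporate, then deposit
tau_idle = update_pheromone(1.0, 0.1, passed=False)  # evaporation only
```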

For the records with clustering probability less than the given threshold p_0, the minimal d_k is set as the minimum path length d_ij in the data chunk B_ij. All records with clustering probability not less than p_0 are combined according to the clustered nodes s_k: records with the same s_k are merged into the same class s_k, whose size is denoted as N_sk. For the records R_{sk,k} in class s_k, the sum of the attribute vectors is C_sk = ∑_{k=1}^{N_sk} R_{sk,k} = (∑_{k=1}^{N_sk} r_{sk,k}^1, ∑_{k=1}^{N_sk} r_{sk,k}^2, …, ∑_{k=1}^{N_sk} r_{sk,k}^m).

Multiple Combine functions can be conducted in parallel for one data source S_i, each of which works on one or several data chunks. The Combine function for B_ij is described in Algorithm 2.

There are only two possible kinds of output from the Combine function: either an intermediate result (s_k, ⟨N_sk, C_sk⟩) or the minimal path length d_ij. Therefore, the data sent to the central node over WAN can be greatly reduced: if data chunk B_ij is combined into m_ij classes, the communication overhead is m_ij intermediate results (s_k, ⟨N_sk, C_sk⟩) and one d_ij. In tests on practical data, a data chunk of 1.8 GB only needs to transmit 30 KB after combination, which is only (1/6) × 10^−4 of the original data.
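The Combine step described above can be sketched as follows. The tuple layout of the Map outputs and the helper names are assumptions for illustration; the essential behavior, per-class counts N_sk and attribute-vector sums C_sk for accepted records, plus a single minimal path length for rejected ones, follows the text.

```python
# Sketch of Combine: Map outputs (p_k, d_k, s_k, vector) are merged per
# clustered node s_k into (s_k, (N_sk, C_sk)); records below the probability
# threshold contribute only to the minimal path length d_ij.
from collections import defaultdict

def combine(map_outputs, p0):
    classes = defaultdict(lambda: [0, None])  # s_k -> [N_sk, C_sk]
    d_ij = float("inf")
    for p_k, d_k, s_k, vec in map_outputs:
        if p_k < p0:                          # abandoned: track minimal path length
            d_ij = min(d_ij, d_k)
            continue
        entry = classes[s_k]
        entry[0] += 1                         # N_sk
        entry[1] = vec if entry[1] is None else tuple(
            a + b for a, b in zip(entry[1], vec))  # C_sk: elementwise sum
    return {k: (n, c) for k, (n, c) in classes.items()}, d_ij

inter, d = combine([(0.9, 3.0, "s1", (1.0, 2.0)),
                    (0.8, 5.0, "s1", (3.0, 4.0)),
                    (0.2, 1.5, "s2", (9.0, 9.0))], p0=0.5)
print(inter)  # {'s1': (2, (4.0, 6.0))}
print(d)      # 1.5
```

Only `inter` and `d` cross the WAN, which is where the 1.8 GB → 30 KB reduction quoted above comes from.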

3.2.3. Reduce Function of MCR-ACA

At the gth iteration, the two kinds of output obtained from the Combine phase on the data chunks, that is, the intermediate results (s_{k,g}, ⟨N_{sk,g}, C_{sk,g}⟩) and the d_{ij,g}, are recombined by the Reduce function, and new clustering centers are generated. The weighted distances among the clustering centers of different data chunks are calculated by (2). If such a distance is less than or equal to R, the parts are merged into one class N(s_{k,g}, R). The global cluster center C_{sk,g} at the gth iteration is computed as ∑_{N(s_{k,g},R)} C_{sk,g} / ∑_{N(s_{k,g},R)} N_{sk,g}. The method converges and outputs the global clustering result if |C_{sk,g} − C_{sk,g−1}| ⩽ |C_{sk,g−1} − C_{sk,g−2}|. Otherwise, the minimal d_{ij,g} is output as d_g and sent to each source node for the next comparison. The Reduce function is described in Algorithm 3.
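The merging step of Reduce can be sketched on the (N_sk, C_sk) pairs produced by Combine: centers within radius R of an existing group are folded into it, and the global center of each merged class is the weighted mean sum(C)/sum(N) from the text. The greedy first-fit merge order and all names are illustrative assumptions; the convergence test on successive centers is omitted.

```python
# Sketch of the Reduce merge on intermediate results (N_sk, C_sk).
import math

def wdist(u, v, P):
    # weighted distance of eq. (2) between two centers
    return math.sqrt(sum(p * (a - b) ** 2 for p, a, b in zip(P, u, v)))

def reduce_merge(intermediates, R, P):
    groups = []                                # each entry: [N_total, C_total]
    for n, c in intermediates:
        ctr = tuple(x / n for x in c)          # center of this intermediate
        for g in groups:
            g_ctr = tuple(x / g[0] for x in g[1])
            if wdist(ctr, g_ctr, P) <= R:      # within radius R: same class
                g[0] += n
                g[1] = tuple(a + b for a, b in zip(g[1], c))
                break
        else:
            groups.append([n, tuple(c)])
    # global centers: sum(C) / sum(N) per merged class
    return [tuple(x / n for x in c) for n, c in groups]

centers = reduce_merge([(2, (2.0, 2.0)), (2, (2.2, 2.2)), (1, (10.0, 10.0))],
                       R=0.5, P=(1.0, 1.0))
```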

Algorithm 3: Reduce function of MCR-ACA.
  (1) Calculate the weighted distance d_{s_{k,g}, s_{k′,g}} between the cluster centers s_{k,g} and s_{k′,g} located in different data chunks by (2)
  (2) if d_{s_{k,g}, s_{k′,g}} ⩽ R then
  (3)   Combine s_{k,g} and s_{k′,g} into the same class N(s_{k,g}, R)
  (4) Calculate the global cluster center C_{sk,g} = (∑_{N(s_{k,g},R)} C_{sk,g} / ∑_{N(s_{k,g},R)} N_{sk,g}) of the gth iteration
  (5) if C_{sk,g} converges then
  (6)   Output the global clustering result
  (7)   Go to Step 9
  (8) Output the minimal d_{ij,g} as d_g
  (9) return

Algorithm 4: MCR-ACA.
  (1) p ← 0; d ← 0; d_0 ← ∞; Num ← 0; s ← 0; τ_kl(0) ← 1; p_0 is given
  (2) while (g ⩽ g_max) do
  (3)   Conduct the Map functions in parallel; output the clustering results (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩)
  (4)   Perform the Combine functions in parallel; output the intermediate results (s_k, ⟨N_sk, C_sk⟩) and d_ij for each data chunk B_ij
  (5)   Carry out the Reduce functions in parallel to develop the global cluster C_{sk,g}
  (6)   g ← g + 1
  (7)   if (C_{sk,g} does not converge) then
  (8)     Output d_g as the minimal d_{ij,g} to each source node
  (9) Output the global classes
  (10) return

3.2.4. MCR-ACA Method Description

MCR-ACA is constructed by integrating MCR with the Map function conducted on the data chunks in the different source nodes, the Combine function on the local clustering results, and the Reduce function on the global cluster centers. Assume the maximum number of iterations is g_max. The MCR-ACA method is described in Algorithm 4.

4. Experimental Results on Practical Big Data

In the experiment, the MCR-ACA method is compared with the existing MR-ACA method on the traffic monitoring system of Jiangsu province in China. Two cities, Nantong and Changzhou, are selected, and two subsystems are chosen from each of them; Nanjing is the central city. Within each city, the subsystems are linked by fiber at 1000 Mbps. The distance between Nantong and the central city is 270 kilometers, and there are 140 kilometers between Changzhou and the center, Nanjing. The cities are connected by the Internet with a network bandwidth of 200 Mbps. We adopt Hadoop, Mahout, and IK as software tools. Two PCs are used in each of the two cities, while four PCs work in the central node. All of them are configured with an Intel 5620 CPU (2.4 GHz, 6-core), 4 G memory, and a 300 G disk. In the MCR-ACA experiment, the Map and Combine operations are conducted in parallel on the four PCs in the two cities, and the Reduce function is conducted in the central node. In the MR-ACA experiment, all data is transmitted to the central node and processed by the four PCs there, where the Map, Combine, and Reduce functions are performed.

The records on vehicle trajectories are represented by a set of 6 elements (HPHM, JGSJ, JGDD, XSFX, XSSD, and CSYS) with the weights P_h being 0.05, 0.3, 0.3, 0.15, 0.15, and 0.05, respectively. Since both MCR-ACA and MR-ACA are based on MapReduce, the experiments focus on the scale growth of vehicle trajectories. According to the experiments, the neighborhood radius R and the threshold probability p_0 of the two methods are found to be key parameters of the Map functions. The parameters are mainly determined by the combination of requirements like accuracy and efficiency. Meanwhile, the two parameters affect each other and thus are usually given in pairs. As a result, the comparison experiments on the parameters are conducted in advance based on the MCR-ACA method, since the Map functions in both methods are roughly the same. In the experiments there are 56 Map functions and 3 ants for each Map function, with 4 Combine functions and 4 Reduce functions, whose maximal iteration is 5. The experiment data consists of 4 × 10^7 vehicle trajectory records. The experiment results are shown in Tables 2 and 3.

It can be seen from Table 2 that as R increases, the total time of clustering computation increases while the accuracy of clustering is robust. Table 3 illustrates that the total time of clustering computation tends to decrease and then increase, while the accuracy of clustering fluctuates, as p_0 increases. For further comparison of the effect of different parameter pairs

International Journal of Distributed Sensor Networks 7

Table 2: The clustering time (H) / the accuracy (%) of MCR-ACA with different neighborhood radius R.

R       p_0     Total time   Accuracy   Accuracy/total time
0.004   0.441   52.60        86.45      1.64
0.005   0.441   53.58        87.56      1.63
0.006   0.441   53.88        88.72      1.65
0.007   0.441   54.56        88.63      1.62
0.008   0.441   54.79        87.25      1.59

Table 3: The clustering time (H) / the accuracy (%) of MCR-ACA with different threshold probability p_0.

R       p_0     Total time   Accuracy   Accuracy/total time
0.006   0.431   54.53        87.49      1.60
0.006   0.436   54.01        87.92      1.63
0.006   0.441   53.88        88.72      1.65
0.006   0.446   54.21        88.73      1.64
0.006   0.451   55.23        88.52      1.60

of R and p_0 on clustering time and accuracy, the comparison is based on the accuracy in unit time (i.e., accuracy/total time). R = 0.006 and p_0 = 0.441 are the ideal parameter pair according to Tables 2 and 3.

Therefore, the experimental parameters are set as below: 56 Map functions, 4 Combine functions, 4 Reduce functions, the maximal iteration being 5, 3 ants for each Map function, the neighbourhood radius R taking 0.006, p_0 = 0.441, and the similarity factor set by the range of dimensions. In the experiments, the records on vehicle trajectories are divided into 56 chunks on average, with one Map function working on one data chunk. The communication overhead of data extraction is the time cost of extracting the records from Nantong and Changzhou to the central node. The results are given in Table 4. The metrics for Map, Combine, Reduce, and total time are hours, and the unit for accuracy is percentage.
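The parameter selection above can be reproduced directly from the values in Tables 2 and 3. The snippet below is only an illustration of the accuracy-per-unit-time criterion; the dictionary holds the table values verbatim.

```python
# Accuracy per unit time (accuracy / total time) for the candidate
# (R, p0) pairs; the values are copied from Tables 2 and 3.
candidates = {
    # (R, p0): (total_time_h, accuracy_pct)
    (0.004, 0.441): (52.60, 86.45),
    (0.005, 0.441): (53.58, 87.56),
    (0.006, 0.441): (53.88, 88.72),
    (0.007, 0.441): (54.56, 88.63),
    (0.008, 0.441): (54.79, 87.25),
    (0.006, 0.431): (54.53, 87.49),
    (0.006, 0.436): (54.01, 87.92),
    (0.006, 0.446): (54.21, 88.73),
    (0.006, 0.451): (55.23, 88.52),
}

def accuracy_per_hour(entry):
    """Selection criterion: accuracy divided by total clustering time."""
    total_time, accuracy = entry
    return accuracy / total_time

# The pair maximizing the criterion is the one chosen in the paper.
best = max(candidates, key=lambda k: accuracy_per_hour(candidates[k]))
```

Running this confirms that (R, p_0) = (0.006, 0.441) attains the highest ratio, about 1.65, among the tested pairs.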

Table 4 implies that the accuracy of the two methods is similar and rises as the data amount becomes larger. The data amount is key to the clustering accuracy; reducing the data scale by sampling also reduces the accuracy. The increasing rate of clustering accuracy for MCR-ACA is greater than that of MR-ACA, as depicted in Figure 3.

Furthermore, the computing time for the Map function of MR-ACA is longer than that of MCR-ACA, and the difference becomes bigger as the data amount increases. The reason lies in that all records are mixed up on hard disk after extraction, which makes the Map function more complicated in the central node than in the data source nodes. The Map function of MR-ACA works on data chunks that are divided from the mixed data stored on the central node, and these data chunks become more complex in their elements as the data amount grows. Therefore, the Map function of MR-ACA costs more computing time. The comparison is indicated in Figure 4.

For the Reduce function, the time consumed in MR-ACA is about twice as much as that in MCR-ACA, but less than the sum of the computing time of both the Reduce function and

Figure 3: Comparison on group mining accuracy (accuracy (%) versus data amount (×10^7) for MCR-ACA and MR-ACA).

Figure 4: Comparison on Map function time consumed (time consumed (h) versus data amount (×10^7) for MCR-ACA and MR-ACA).

the Combine function in MCR-ACA. Actually, the processing time of the Combine function in MCR-ACA is included in the Reduce function in MR-ACA.

As shown in Table 4, the total time in MR-ACA is 50% longer than that in MCR-ACA on average, due to the data extracting time, which indicates that data extraction is the most essential influential factor for big data clustering.

Table 4 also demonstrates that as the data amount increases, the computation time of both methods increases rapidly while the accuracy improvement is quite limited. The reason is that, as the numbers of Map functions, Combine functions, and Reduce functions stay the same, the amount of data in the data blocks among the Map, Combine, and Reduce functions grows proportionally, which finally causes the computation time to grow rapidly. The clustering accuracy is a relative value mainly determined by the data amount (the ratio of the clustered records of one car to the total records which this car is involved in). As the data amount grows, the clustering accuracy increases gradually. However, there is no relationship between the clustering accuracy and the computation time, which leads


Table 4: Comparison experiments of MCR-ACA and MR-ACA.

Amount    MCR-ACA                                        MR-ACA
(×10^7)   Map     Combine  Reduce  Total   Accuracy      Extract  Map     Combine  Reduce  Total   Accuracy
2         20.36   1.51     1.36    23.23   87.89         11.74    21.92   0        2.71    36.37   87.92
4         45.92   5.44     2.52    53.88   88.72         25.31    47.35   0        6.13    78.79   89.04
6         67.89   7.80     5.10    80.78   89.43         39.76    83.17   0        11.48   134.41  89.86
8         96.57   11.83    7.40    115.81  90.55         55.68    112.22  0        16.28   184.18  90.43
10        129.77  15.21    10.42   155.41  91.61         68.41    145.52  0        23.57   237.51  91.56

Table 5: The number of classes.

Amount (×10^7)   The number of classes
2                6
4                9
6                11
8                18
10               23

Table 6: Trajectories of the cars in a group.

HPHM     JGSJ             JGDD                                                                                XSFX  XSSD  CSYS
S032W0   20131227200308   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3     62    C
S470A5   20131227200636   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3     43    J
S560V8   20131227200638   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3     47    A
S032V0   20131227202633   Northwest corner of the crossroad of Zhujiang road and north Taiping road           2     41    C
S560V8   20131227202957   Southeast corner of the crossroad of Hanzhongmen street and middle Jiangdong road   6     29    A
SFM979   20131227203021   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3     39    A
S470A5   20131227203103   Southeast corner of the crossroad of Yangzijiang road and Qingliangmen street       4     40    A

Figure 5: Trajectories of the cars in a group (cars S032V0, S470A5, S560V8, and SFM979; trajectory points 1–4).

to the inconsistency between the computation time increase and the accuracy increase. According to the experiment, the number of obtained classes is listed in Table 5.

The results show that 4 cars in a research group exhibit obviously similar trajectories, which are listed in Table 6.

Table 6 indicates that the 4 cars were caught by the same camera within half an hour. The car number "S032V0" was misidentified as "S032W0" by the camera capture. Through clustering, the trajectories of these 4 cars have plenty of traces with the same or similar features, which construct a cluster. The time feature is every Friday evening; the location feature overlaps along the way to the university, as shown in Figure 5.

5. Conclusion and Future Work

Critical issues for group mining on big data of vehicle trajectories are centralization and source distribution. In this paper, a distributed parallel clustering method MCR-ACA is proposed for group mining on distributed vehicle trajectories. Parallel clustering is realized while the communication overhead of big data is avoided. The method is tested on the traffic monitoring systems of three cities (including the center city Nanjing) of Jiangsu province in China. Experimental results demonstrate that the proposed method achieves better performance on group mining.

Group mining can be used in many scenarios. According to the experiment results in this paper, two aspects are


promising for further work: (i) the forecast of group behavior based on specific features; for example, if the time feature of a group is midnight and the location feature is somewhere with high crime incidence, the group can be regarded as a possible crime group with high probability; (ii) outlier analysis for vehicle trajectories. Some vehicle trajectory outliers are formed in the clustering process (e.g., the abandoned vehicle trajectories defined in this paper); the reason that these vehicle trajectories are abandoned as outliers is useful for behavior forecast.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant 61272377) and the Key Technology R&D Program of Jiangsu Province (BE2014733).

References

[1] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018–1021, 2010.

[2] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the crowd: the privacy bounds of human mobility," Scientific Reports, vol. 3, article 1376, 2013.

[3] S. Wang, H. Wang, X. Qin, et al., "Architecting big data: challenges, studies and forecasts," Chinese Journal of Computers, vol. 34, no. 10, pp. 1741–1752, 2011.

[4] M. Armbrust, A. Fox, R. Griffith, et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.

[5] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.

[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[7] C. Jayalath, J. Stephen, and P. Eugster, "From the cloud to the atmosphere: running MapReduce across data centers," IEEE Transactions on Computers, vol. 63, no. 1, pp. 74–87, 2014.

[8] B. Chandramouli, J. Goldstein, and S. Duan, "Temporal analytics on big data for web advertising," in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 90–101, April 2012.

[9] A. Mukherjee, J. Datta, R. Jorapur, et al., "Shared disk big data analytics with Apache Hadoop," in Proceedings of the 19th International Conference on High Performance Computing (HiPC '12), pp. 1–6, IEEE, 2012.

[10] S. Fiore, A. D'Anca, C. Palazzo, et al., "Ophidia: toward big data analytics for escience," Procedia Computer Science, vol. 1, pp. 2376–2385, 2013.

[11] Y. Kim, K. Shim, M.-S. Kim, and J. Sup Lee, "DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 15–35, 2014.

[12] N. Laptev, K. Zeng, and C. Zaniolo, "Very fast estimation for result and accuracy of big data analytics: the EARL system," in Proceedings of the 29th International Conference on Data Engineering (ICDE '13), pp. 1296–1299, Brisbane, Australia, April 2013.

[13] D. Garg, K. Trivedi, and B. Panchal, "A comparative study of clustering algorithms using MapReduce in Hadoop," International Journal of Engineering, vol. 2, no. 10, 2013.

[14] W. Zhao, V. Martha, and X. Xu, "PSCAN: a parallel structural clustering algorithm for big networks in MapReduce," in Proceedings of the 27th IEEE International Conference on Advanced Information Networking and Applications (AINA '13), pp. 862–869, Barcelona, Spain, March 2013.

[15] Y. Yu, P. K. Gunda, and M. Isard, "Distributed aggregation for data-parallel computing: interfaces and implementations," in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP '09), pp. 247–260, October 2009.

[16] X. Cheng and N. Xiao, "Parallel implementation of dynamic positive and negative feedback ACO with iterative MapReduce model," Journal of Information and Computational Science, vol. 10, no. 8, pp. 2359–2370, 2013.

[17] Y. Yang, X. Ni, H. Wang, et al., "Parallel implementation of ant-based clustering algorithm based on Hadoop," in Proceedings of the 3rd International Conference on Swarm Intelligence (ICSI '12), pp. 190–197, 2012.

[18] E. Bonabeau, M. Dorigo, and G. Theraulaz, "Inspiration for optimization from social insect behaviour," Nature, vol. 406, no. 6791, pp. 39–42, 2000.

[19] M. Dorigo, E. Bonabeau, and G. Theraulaz, "Ant algorithms and stigmergy," Future Generation Computer Systems, vol. 16, no. 8, pp. 851–871, 2000.



the sampling points according to the sampling distribution. The amount of data input to MapReduce can be decreased considerably. Taking into account the distributed nature of the partitioned data and model, three clustering algorithms [13], k-means, canopy, and fuzzy k-means, were implemented in parallel on MapReduce; the efficiency of distributed clustering was improved significantly. For community detection in large networks, a parallel structural clustering algorithm was introduced [14], which is based on the similarity of edge structures and MapReduce. The interfaces and implementations for user-defined aggregation in several states of distributed computing systems were evaluated in [15], where the communication overhead of data-intensive applications could be decreased largely by local clustering, which clusters the intermediate results generated by Map tasks and then transmits the clustered results to Reduce tasks.

The existing methods improve clustering efficiency either by parallel computing on physically centralized big data or by reducing the data scale using sampling. However, the communication overhead of data centralization and the impact of sparse data on clustering accuracy have not been considered yet. In this paper, by integrating the ACA (ant colony algorithm) with the computing framework Map-Combine-Reduce (MCR), a MapReduce-based distributed parallel clustering method MCR-ACA is proposed for group mining on big data of vehicle trajectories in WAN. Some parallel ACA methods based on MapReduce have been proposed [16, 17] (denoted as MR-ACA); however, since these methods work on physically centralized big data in LAN, the communication overhead of data centralization is ignored. The MCR-ACA method contains three stages: the Map operation, the Combine operation, and the Reduce operation. The computation tasks with the heaviest burden are conducted and their results are combined in parallel on the data source nodes. The combined result is transmitted to the central node, and new cluster centers are generated adaptively. The presented method avoids the communication overhead of big data migration, improves the clustering efficiency, and guarantees the accuracy of the global cluster among distributed nodes.

The rest of this paper is organized as follows. The problem of group mining for vehicle trajectories is described in Section 2. A distributed parallel clustering method MCR-ACA is proposed in Section 3. Section 4 shows the computational experiments, followed by the conclusion and future work in Section 5.

2. Group Mining for Vehicle Trajectories

Distributed frameworks in WAN are always adopted by traffic monitoring systems, and hierarchical ones are even utilized by some complex systems. For the traffic monitoring system of Jiangsu examined in this paper, a 3-layer framework is applied. There are 13 independent branch centers in the 13 cities, respectively, which are responsible for the integration of the independent data branches within each city; a head center is in charge of all the branch ones. Therefore, the main characteristics of big data of vehicle trajectories in WAN are multiple data sources and the difficulty of physical centralization.

In this paper, the data branches are called source nodes, the city branch centers are city nodes, and the head center is the central node. The topological network is shown in Figure 1.

Traffic monitoring systems which contain multiple independent subsystems are being developed in the cities. They form distributed data sources which increase rapidly. There are more than one hundred subsystems in the system of Jiangsu province. The amount of data in the subsystems is quite huge and growing rapidly (the data increment in one city of Jiangsu is over 12 million records every day). Multimedia data in various formats (such as photos and videos) increases by several TBs each day, which is very difficult to physically centralize in the central node. The trajectory data of the cars collected by a subsystem is listed in Table 1.

Group mining for vehicle trajectories (GMVT for short) is critical for data clustering on big data of distributed traffic monitoring systems in WAN. The main idea of mining groups on big data of vehicle trajectories is to implement the automatic partition of vehicle trajectories with the same or similar features by clustering on attributes which consist of the metadata (e.g., time and location) of the vehicle trajectories. The information (the number of license plates) of the vehicle trajectories in the same clusters can be drawn from the partition result. A complete vehicle trajectory record includes these metadata: the number of the license plate, passing time, location, direction, speed, and car color. These attributes are set as metadata. Records collected by different sensors in various subsystems are normalized as a 6-tuple (the number of the license plate, passing time, location, direction, speed, and car color). The first record in Table 1 is normalized as (S032V0, 20130521073907, checkpoint at the crossroad of Suyuan road and west Qingshuiting road, 1, 41, A). Every element is assigned a weight. Features of vehicle trajectories with the same or similar elements are clustered on the data records.
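The normalization step can be sketched as follows. The function name and dictionary representation are hypothetical illustrations; only the field names (from Table 1) and the 6-tuple layout come from the paper.

```python
# Hypothetical sketch: normalizing a raw record from one subsystem
# into the 6-tuple (license plate, passing time, location, direction,
# speed, car color). The field names follow Table 1.
def normalize(raw):
    return (raw["HPHM"],   # number of the license plate
            raw["JGSJ"],   # passing time
            raw["JGDD"],   # location
            raw["XSFX"],   # direction
            raw["XSSD"],   # speed
            raw["CSYS"])   # car color

# The first record of Table 1 as a normalized 6-tuple.
record = normalize({
    "HPHM": "S032V0",
    "JGSJ": "20130521073907",
    "JGDD": "Checkpoint at the crossroad of Suyuan road "
            "and west Qingshuiting road",
    "XSFX": 1,
    "XSSD": 41,
    "CSYS": "A",
})
```

Each position of the tuple later receives a weight P_h when distances between records are computed.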

Suppose there are Z distributed data sources (subsystems) S_1, S_2, ..., S_Z. Each data source S_i with 6 attributes has T_i tables L_i1, L_i2, ..., L_iT_i. A table L_ij has Q_ij records. In total there are Σ_{i=1}^{Z} Σ_{j=1}^{T_i} Q_ij records. The objective of group mining is to partition the set A with Σ_{i=1}^{Z} Σ_{j=1}^{T_i} Q_ij records into N subsets A_1, A_2, ..., A_N. In each subset, records have the same or similar attributes, ⋃_{i=1}^{N} A_i = A, and A_i ∩ A_j = ∅ for all i ≠ j. In an arbitrary subset A_i, all records are grouped according to the number of license plates. The clustering accuracy rate is the ratio of the clustered records of one car to the total records which this car is involved in (e.g., if the total records of one car are Q_c and Q records of the car are clustered into one class, the corresponding accuracy rate is (Q/Q_c) × 100%). If the accuracy rate is greater than a given threshold η, the car is merged into this class. All cars in this class have the same or similar trajectory features.

3. Group Mining Methods for Vehicle Trajectory in WAN

In WAN, GMVT is critical for clustering the data of vehicle trajectories by attribute features, such as time or location, to construct various classes with the same or similar attributes,


Figure 1: Topological network of the traffic monitoring system (source nodes with RFID and camera sensors, city nodes, and the central node, connected via WAN).

Table 1: Data collected by a subsystem.

HPHM     JGSJ             JGDD                                                                         XSFX  XSSD  CSYS
S032V0   20130521073907   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road        1     41    A
S032V0   20130522191001   Southwest corner of the crossroad of Jinxianghe road and east Beijing road   0     25    J
S032V0   20130524213427   Northwest corner of the crossroad of Xuanwu road and Huayuan road            2     85    B
S470A5   20130523071732   Northeast corner of the crossroad of south Fengtai road and Hexi street      6     31    J
S470A5   20130524101258   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road        2     42    J

according to which drivers with the same or similar features are identified as a group. Because of the two characteristics mentioned in Section 2, traditional parallel clustering methods are no longer efficient, which motivates us to present the following method.

3.1. Clustering Framework for Distributed Big Data in WAN. Data clustering is very difficult for big data that is stored distributively in WAN. The reason lies in two aspects: (i) the huge amount of data makes clustering computation more time-consuming, which makes existing methods infeasible; (ii) the communication overhead of data migration is generally greater than the computing cost, so it is better to migrate computation rather than data. Therefore, a distributed parallel computing framework (MCR), which is based on MapReduce, is proposed in this paper. The framework of MCR is depicted in Figure 2.

Based on MCR, the traditional ACA is adapted to MCR-ACA for group mining on big data. The procedure of MCR is described as follows:

(i) Divide the data source S_i (i = 1, 2, ..., Z) into H_i data chunks B_i1, B_i2, ..., B_iH_i.

(ii) Map operations are carried out on each data chunk B_ij by a clustering strategy. All records in B_ij are clustered by the given strategy.

(iii) The clustered results are merged into intermediate ones by Combine. For example, B_ij is clustered and combined into a set of intermediate results with m_ij elements (m_ij is usually small).

(iv) The intermediate results are sent to the central node, where Reduce is conducted for the global clustering.

(v) The method terminates if the global clustering converges or the maximal iteration g_max is reached. Otherwise, the comparison parameter is sent to each data chunk by Reduce, and the next iteration starts from step (ii).

The computing operations with the heaviest burden are conducted in parallel at the source nodes. The data in each source node is divided into data chunks, and all chunks are clustered in parallel, which leads to good efficiency. The communication overhead is significantly reduced by transmitting the intermediate results combined at the local source nodes rather than the source data. The global clustering is conducted on the intermediate results at the central node.
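The communication saving can be made concrete with a toy comparison. The chunk sizes and intermediate counts below are invented for illustration; only the principle (each chunk B_ij of |B_ij| raw records is reduced to m_ij small intermediates before crossing the WAN) comes from the paper.

```python
# Illustrative sketch of why MCR cuts WAN traffic: each data chunk is
# combined locally into a handful of intermediate results, and only
# those intermediates are transmitted to the central node.
def wan_payload_records(chunk_sizes):
    """Entries sent under naive centralization: all raw records."""
    return sum(chunk_sizes)

def wan_payload_intermediates(intermediate_counts):
    """Entries sent under MCR: one per combined class (m_ij each)."""
    return sum(intermediate_counts)

# Hypothetical numbers: three chunks at one source node.
raw = wan_payload_records([10_000, 12_000, 8_000])
combined = wan_payload_intermediates([12, 9, 15])
```

In this invented example, 30,000 raw records shrink to 36 transmitted intermediates, which is the effect the framework relies on.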


Figure 2: MCR distributed parallel computing framework (clustering tasks and result combination in each source node; the intermediate results are sent over WAN to the central node, which produces the global clustering result).

3.2. MCR-ACA Method for Group Mining. ACA was inspired by the phenomenon that ant individuals gather at a location with food through pheromone interaction among them [18, 19]. By integrating ACA with the computing framework MCR, the group mining method MCR-ACA is proposed. The number of classes and the trajectory records in a class are determined adaptively, and clustering centers are generated in iterations without predefinition, which is desirable for the considered problem.

3.2.1. Map Function of MCR-ACA. A vector of m elements (attributes) is developed for each record of vehicle trajectories; that is, a data record R_k to be clustered is denoted as R_k = (r_k^1, r_k^2, ..., r_k^m). a_ij ants are assigned to data chunk B_ij, each of which randomly serves a record R_k in the initial step. A neighborhood is established with the center R_k and radius R (from experience), denoted as N(R_k, R). The comprehensive similarity between the center R_k and all records within its neighborhood N(R_k, R) is defined by

f(R_k) = Σ_{R_l ∈ N(R_k, R)} [1 − d_kl/λ],    (1)

where λ is the similarity factor representing the range of dimensions (the difference between the maximum and minimum of a dimension). Let d_kl be the space distance between two records R_k and R_l, which is calculated by the weighted distance according to

d_kl = ‖P(R_k − R_l)‖ = sqrt(Σ_{h=1}^{m} P_h (r_k^h − r_l^h)^2),    (2)

where P_h is the weight based on experience and data collecting accuracy. Therefore, the probability of clustering record R_k into class N(R_k, R) is computed by

p_kl(t) = τ_kl^α(t) f_t^β(R_k) / Σ_{R_z ∈ N(R_l, R)} τ_zl^α(t) f_t^β(R_z),    (3)

where α, β are control parameters and τ_kl(t) is the pheromone amount on the path from R_k to R_l at time t (τ_kl(0) = 1).

The decision of putting down or moving R_k is made in terms of the clustering probability p_kl(t). (i) If p_kl(t) is greater than or equal to the given threshold p_0, the ant puts down R_k and clusters it into class N(R_k, R). The traveled path length of the ant is saved, and the location where R_k was put down is set as the start of a new traversal path; then another record is randomly assigned to the ant. (ii) If p_kl(t) is less than p_0, the ant carrying R_k keeps moving to the next point R_l with the largest p_kl(t). R_k is dropped when the path length reaches the maximum or the ant has not found a proper cluster by the end of the travel (R_k is then regarded as abandoned), and the ant gets a new record. After all records in B_ij are traveled by ants, that is, |B_ij| records are clustered or abandoned, local clustering stops.

The Map function takes (⟨key, R_k⟩, ⟨p, d, s⟩) as the input key/value pair, where key is the key value of R_k, p is the clustering probability, and d is the path length along which the ant carries R_k; p and d are initialized to 0. s is the coordinate of the nodes along the path with length d, which is initialized to 0. g is the index of the current iteration, and τ_kl(g) is the pheromone value on the path R_k to R_l after the g-th iteration, with initial value 1. p_k is the clustering probability for R_k, d_k is the path length when R_k is clustered or abandoned, and s_k is the node where R_k is clustered or abandoned. d_g is the minimum d_k after the


Input: Data chunk B_ij
(1) p ← 0, d ← 0, Num ← 0, s ← 0, τ_kl(0) ← 1, p_0 is given
(2) while (Num ⩽ |B_ij|) do
(3)   Calculate the weighted distance d_kl between R_k and all records in N(R_k, R) by (2)
(4)   Calculate the comprehensive similarity f(R_k) between R_k and all records in N(R_k, R) by (1)
(5)   Read the pheromone value τ_kl(g − 1); calculate the clustering probability p_kl(t) by (3)
(6)   if p_kl(t) ≥ p_0 then
(7)     Cluster R_k into N(R_k, R); save p_k, d_k, s_k; Num ← Num + 1
(8)     Go to Step 17
(9)   if d ≥ d_{g−1} then
(10)    Abandon R_k; save p_k, d_k, s_k; Num ← Num + 1
(11)    Go to Step 17
(12)  Select the node with the largest p_kl(t) into s
(13)  if all records are examined by the ant then
(14)    Abandon R_k; save p_k, d_k, s_k
(15)    Go to Step 17
(16)  d ← d + d_kl; go to Step 2
(17)  Randomly assign the ant a new record which has not been clustered or abandoned
(18) Output the clustering results (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩)
(19) return

Algorithm 1: Map function of MCR-ACA.

Input: Probability threshold p_0
(1) if p_k < p_0 then
(2)   d_ij ← min{d_k}
(3)   Output d_ij and go to Step 6
(4) Combine records with the same s_k and generate N_sk
(5) Calculate C_sk; output (s_k, ⟨N_sk, C_sk⟩)
(6) Update the pheromone τ_kl(g)
(7) return

Algorithm 2: Combine function of MCR-ACA.

g-th iteration, which is the comparison parameter for the next iteration. The Map function on B_ij is described in Algorithm 1.
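The core quantities of the Map function, the weighted distance of (2) and the comprehensive similarity of (1), can be sketched in code. This is a simplified illustration, not the paper's implementation: the pheromone factor of (3) is omitted here, so the raw similarity f(R_k) is compared directly against the threshold p_0, and all function names and test values are invented.

```python
import math

def weighted_distance(r_k, r_l, weights):
    """Eq. (2): d_kl = sqrt(sum_h P_h * (r_k^h - r_l^h)^2)."""
    return math.sqrt(sum(p * (a - b) ** 2
                         for p, a, b in zip(weights, r_k, r_l)))

def comprehensive_similarity(r_k, neighborhood, weights, lam):
    """Eq. (1): f(R_k) = sum over R_l in N(R_k, R) of [1 - d_kl / lambda]."""
    return sum(1 - weighted_distance(r_k, r_l, weights) / lam
               for r_l in neighborhood)

def put_down(r_k, records, weights, radius, lam, p0):
    """Simplified ant decision: drop R_k into N(R_k, R) when the
    similarity clears the threshold (pheromone term of (3) omitted)."""
    neighborhood = [r for r in records
                    if weighted_distance(r_k, r, weights) <= radius]
    return comprehensive_similarity(r_k, neighborhood, weights, lam) >= p0
```

With unit weights, the distance reduces to the ordinary Euclidean distance; the paper's experiments instead weight the six attributes unevenly (0.05 to 0.3).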

3.2.2. Combine Function of MCR-ACA. The results obtained by the Map function, (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩), are combined into intermediate results (s_k, ⟨N_{s_k}, C_{s_k}⟩) by the Combine function at the local nodes. The minimum d_ij along the path of this iteration is found. Pheromone is updated by the Combine function: it is increased when ants pass by and decreases over time. The pheromone along the path from R_k to R_l after the g-th Map function is updated by

$$ \tau_{kl}(g) = (1 - \rho) \cdot \tau_{kl}(g - 1) + \Delta e, \qquad (4) $$

where ρ ∈ (0, 1] denotes the evaporation rate of the pheromone and Δe is the pheromone left by a passing ant: Δe is set to 1 if an ant passes by; otherwise it is set to 0.
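Since Eq. (4) is a simple recurrence, it can be written directly; the evaporation rate ρ = 0.1 below is an assumed example value, not one reported in the paper:

```python
# Eq. (4): tau_kl(g) = (1 - rho) * tau_kl(g - 1) + delta_e, where delta_e is 1
# when an ant passed the edge R_k -> R_l in this iteration and 0 otherwise.

def update_pheromone(tau_prev, passed, rho=0.1):
    delta_e = 1.0 if passed else 0.0
    return (1.0 - rho) * tau_prev + delta_e
```

Starting from τ_kl(0) = 1, an untraveled edge decays geometrically toward 0, while a traveled edge is reinforced.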

For the records with clustering probability less than the given threshold p_0, the minimal d_k is set as the minimum path length d_ij in the data chunk B_ij. All the records with clustering probability not less than p_0 are combined according to their clustered nodes s_k: records with the same s_k are merged into the same class s_k, whose size is denoted as N_{s_k}. For the records R_{s_k,k} in class s_k, the sum of the attribute vectors is

$$ C_{s_k} = \sum_{k=1}^{N_{s_k}} R_{s_k,k} = \Big( \sum_{k=1}^{N_{s_k}} r^{1}_{s_k,k},\ \sum_{k=1}^{N_{s_k}} r^{2}_{s_k,k},\ \ldots,\ \sum_{k=1}^{N_{s_k}} r^{m}_{s_k,k} \Big). $$

Multiple Combine functions can be conducted in parallel for one data source S_i, each of which works on one or several data chunks. The Combine function for B_ij is described in Algorithm 2.

There are only two possible outputs from the Combine function: either (s_k, ⟨N_{s_k}, C_{s_k}⟩) or the minimal path length d_ij. Therefore, the data sent to the central node in the WAN can be greatly reduced. Data chunk B_ij is combined into m_ij classes, so the communication overhead is m_ij intermediate results (s_k, ⟨N_{s_k}, C_{s_k}⟩) and one d_ij. In tests on practical data, a data chunk of 1.8 GB only needs to transmit 30 KB after combination, which is only (1/6) × 10⁻⁴ of the original data.
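A minimal sketch of this Combine logic, assuming each mapped record arrives as a tuple (p_k, d_k, s_k, attribute vector), might read:

```python
# Sketch of Algorithm 2: records below the probability threshold contribute
# only the minimal path length d_ij; the rest are grouped by stop node s_k
# into (count N_sk, element-wise attribute sum C_sk).

def combine(mapped, p0):
    groups, d_ij = {}, None
    for p_k, d_k, s_k, attrs in mapped:
        if p_k < p0:
            d_ij = d_k if d_ij is None else min(d_ij, d_k)   # track min d_k
            continue
        count, sums = groups.get(s_k, (0, [0.0] * len(attrs)))
        groups[s_k] = (count + 1, [a + b for a, b in zip(sums, attrs)])
    return groups, d_ij
```

Only `groups` (the m_ij intermediates) and `d_ij` would cross the WAN, which is what keeps the transmitted volume at the reported 10⁻⁴-level fraction of the chunk.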

3.2.3. Reduce Function of MCR-ACA. At the g-th iteration, the two parts obtained from the Combine phase on the data chunks, that is, the intermediate results (s_{k,g}, ⟨N_{s_k,g}, C_{s_k,g}⟩) and d_{ij,g}, are recombined by the Reduce function and new clustering centers are generated. The weighted distances among the clustering centers of different data chunks are calculated by (2). If such a distance is less than or equal to ℛ, the parts are merged into one class N(s_{k,g}, ℛ). The global cluster center at the g-th iteration is computed as

$$ C_{s_k,g} = \frac{ \sum_{N(s_{k,g},\, \mathcal{R})} C_{s_k,g} }{ \sum_{N(s_{k,g},\, \mathcal{R})} N_{s_k,g} }. $$

The method converges and outputs the global clustering result if |C_{s_k,g} − C_{s_k,g−1}| ≤ |C_{s_k,g−1} − C_{s_k,g−2}|. Otherwise, the minimal d_{ij,g} is output as d_g and sent to each source node for the next comparison. The Reduce function is described in Algorithm 3.

6 International Journal of Distributed Sensor Networks

(1) Calculate the weighted distance d_{s_{k,g}, s_{k′,g}} between different cluster centers s_{k,g} and s_{k′,g}, respectively located in different data chunks, by (2)
(2) if d_{s_{k,g}, s_{k′,g}} ≤ ℛ then
(3)   Combine s_{k,g} and s_{k′,g} into the same class N(s_{k,g}, ℛ)
(4) Calculate the global cluster center C_{s_k,g} = (Σ_{N(s_{k,g},ℛ)} C_{s_k,g} / Σ_{N(s_{k,g},ℛ)} N_{s_k,g}) of the g-th iteration
(5) if C_{s_k,g} converges then
(6)   Output the global clustering result
(7)   Go to Step 9
(8) Output the minimal d_{ij,g} as d_g
(9) return

Algorithm 3: Reduce function of MCR-ACA.
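The global-center computation and the stopping test in Algorithm 3 can be sketched as below; using the Euclidean norm for the center movement is our reading of the paper's |·|, not something it states explicitly:

```python
# Sketch of the Reduce step: the global center of a merged class is the ratio
# of the summed attribute vectors C_sk to the summed counts N_sk, and the
# iteration stops once the center's movement no longer shrinks.

def global_center(parts):
    """parts: (N_sk, C_sk) intermediates already merged into one class."""
    n = sum(count for count, _ in parts)
    dims = len(parts[0][1])
    return [sum(c[i] for _, c in parts) / n for i in range(dims)]

def converged(c_g, c_g1, c_g2):
    """True when |C_g - C_{g-1}| <= |C_{g-1} - C_{g-2}|."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return dist(c_g, c_g1) <= dist(c_g1, c_g2)
```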

(1) p ← 0; d ← 0; d_0 ← ∞; Num ← 0; s ← 0; τ_kl(0) ← 1; p_0 is given
(2) while (g ≤ g_max) do
(3)   Conduct Map functions in parallel; output the clustering results (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩)
(4)   Perform Combine functions in parallel; output the intermediate results (s_k, ⟨N_{s_k}, C_{s_k}⟩) and d_ij for each data chunk B_ij
(5)   Carry out Reduce functions in parallel to develop the global cluster C_{s_k,g}
(6)   g ← g + 1
(7)   if C_{s_k,g} does not converge then
(8)     Output d_g as the minimal d_{ij,g} to each source node
(9) Output the global classes
(10) return

Algorithm 4: MCR-ACA.

3.2.4. MCR-ACA Method Description. MCR-ACA is constructed by integrating the ant colony algorithm with MCR: the Map function is conducted on the data chunks at the different source nodes, the Combine function on the local clustering results, and the Reduce function on the global cluster centers. Assume the maximum number of iterations is g_max. The MCR-ACA method is described in Algorithm 4.

4. Experimental Results on Practical Big Data

In the experiment, the MCR-ACA method is compared with the existing MR-ACA method on the traffic monitoring system of Jiangsu province in China. Two cities, Nantong and Changzhou, are selected, and two subsystems are chosen from each of them; Nanjing is the central city. Subsystems are linked by 1000 Mbps fiber within each city. The distance between Nantong and the central city is 270 kilometers, and there are 140 kilometers between Changzhou and the center, Nanjing. The cities are connected by the Internet with a bandwidth of 200 Mbps. We adopt Hadoop, Mahout, and IK as software tools. Two PCs are used in each of the two cities, while four PCs work in the central node. All of them are configured with an Intel 5620 CPU (2.4 GHz, 6 cores), 4 GB memory, and a 300 GB disk. In the MCR-ACA experiment, the Map and Combine operations are conducted in parallel on the four PCs in the two cities, and the Reduce function is conducted in the central node. In the MR-ACA experiment, all data is transmitted to the central node and processed by the four PCs there, where the Map, Combine, and Reduce functions are performed.

The records on vehicle trajectories are represented by a set of 6 elements (HPHM, JGSJ, JGDD, XSFX, XSSD, and CSYS) with the weights P_h being 0.05, 0.3, 0.3, 0.15, 0.15, and 0.05, respectively. Since both MCR-ACA and MR-ACA are based on MapReduce, the experiments focus on the scale growth of the vehicle trajectories. According to the experiments, the neighborhood radius ℛ and the threshold probability p_0 of the two methods are found to be the key parameters of the Map functions. The parameters are mainly determined by the combination of requirements such as accuracy and efficiency. Meanwhile, the two parameters affect each other and thus are usually given in pairs. As a result, comparison experiments for the parameters are conducted in advance based on the MCR-ACA method, since the Map functions of both methods are roughly the same. In the experiments, there are 56 Map functions with 3 ants for each Map function, 4 Combine functions, and 4 Reduce functions, whose maximal iteration is 5. The experiment data consists of 4 × 10⁷ vehicle trajectory records. The experiment results are shown in Tables 2 and 3.

It can be seen from Table 2 that, as ℛ increases, the total time of the clustering computation increases while the accuracy of clustering remains robust. Table 3 illustrates that the total time of the clustering computation tends to decrease and then increase, while the accuracy of clustering fluctuates, as p_0 increases. To further compare the effect of different parameter pairs


Table 2: The clustering time (h) / the accuracy (%) of MCR-ACA with different neighborhood radius ℛ.

ℛ | p_0 | Total time | Accuracy | Accuracy/total time
0.004 | 0.441 | 52.60 | 86.45 | 1.64
0.005 | 0.441 | 53.58 | 87.56 | 1.63
0.006 | 0.441 | 53.88 | 88.72 | 1.65
0.007 | 0.441 | 54.56 | 88.63 | 1.62
0.008 | 0.441 | 54.79 | 87.25 | 1.59

Table 3: The clustering time (h) / the accuracy (%) of MCR-ACA with different threshold probability p_0.

ℛ | p_0 | Total time | Accuracy | Accuracy/total time
0.006 | 0.431 | 54.53 | 87.49 | 1.60
0.006 | 0.436 | 54.01 | 87.92 | 1.63
0.006 | 0.441 | 53.88 | 88.72 | 1.65
0.006 | 0.446 | 54.21 | 88.73 | 1.64
0.006 | 0.451 | 55.23 | 88.52 | 1.60

of ℛ and p_0 on clustering time and accuracy, the comparison is based on the accuracy in unit time (i.e., accuracy/total time). ℛ = 0.006 and p_0 = 0.441 form the ideal parameter pair according to Tables 2 and 3.

Therefore, the experimental parameters are set as follows: 56 Map functions, 4 Combine functions, 4 Reduce functions, a maximal iteration of 5, 3 ants for each Map function, the neighbourhood radius ℛ = 0.006, p_0 = 0.441, and the similarity factor set by the range of the dimensions. In the experiments, the records on vehicle trajectories are divided into 56 chunks on average, with one Map function working on one data chunk. The communication overhead of data extraction is the time cost of extracting the records from Nantong and Changzhou to the central node. The results are given in Table 4. The units for the Map, Combine, Reduce, and total times are hours, and the unit for accuracy is percentage.
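The selection criterion (accuracy per unit time) applied to Tables 2 and 3 can be reproduced mechanically; the tuples below transcribe the (ℛ, p_0, total time, accuracy) rows of Table 3:

```python
# Rank parameter pairs by accuracy per unit time, the criterion used to pick
# R = 0.006 and p_0 = 0.441; rows are (R, p_0, total_time_h, accuracy_pct)
# transcribed from Table 3.
rows = [
    (0.006, 0.431, 54.53, 87.49),
    (0.006, 0.436, 54.01, 87.92),
    (0.006, 0.441, 53.88, 88.72),
    (0.006, 0.446, 54.21, 88.73),
    (0.006, 0.451, 55.23, 88.52),
]
best = max(rows, key=lambda r: r[3] / r[2])   # maximize accuracy / total time
```

The maximizing row is the (0.006, 0.441) pair, matching the paper's choice.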

Table 4 implies that the accuracy of the two methods is similar and rises as the data amount becomes larger. The data amount is key to the clustering accuracy; reducing the data scale by sampling also reduces the accuracy. The growth rate of the clustering accuracy for MCR-ACA is greater than that of MR-ACA, as depicted in Figure 3.

Furthermore, the computing time of the Map function in MR-ACA is longer than in MCR-ACA, and the difference becomes bigger as the data amount increases. The reason lies in that all records are mixed up on the hard disk after extraction, which makes the Map function more complicated in the central node than at the data source nodes. The Map function of MR-ACA works on data chunks that are divided from the mixed data stored on the central node, and these data chunks become more complex in their elements as the data amount grows. Therefore, the Map function of MR-ACA costs more computing time. The comparison is indicated in Figure 4.

For the Reduce function, the time consumed in MR-ACA is about twice as much as that in MCR-ACA, but less than the sum of the computing time of both the Reduce function and

[Figure 3: Comparison on group mining accuracy; accuracy (%) versus data amount (×10⁷) for MCR-ACA and MR-ACA.]

[Figure 4: Comparison on Map function time consumed; time consumed (h) versus data amount (×10⁷) for MCR-ACA and MR-ACA.]

the Combine function in MCR-ACA. Actually, the processing time of the Combine function in MCR-ACA is included in the Reduce function in MR-ACA.

As shown in Table 4, the total time of MR-ACA is on average 50% longer than that of MCR-ACA due to the data extraction time, which indicates that data extraction is the most essential influencing factor for big data clustering.

Table 4 also demonstrates that, as the data amount increases, the computation time of both methods increases rapidly while the accuracy improvement is quite limited. The reason is that, as the numbers of Map, Combine, and Reduce functions stay the same, the amount of data in the blocks handled by the Map, Combine, and Reduce functions grows proportionally, which causes the computation time to grow rapidly. The clustering accuracy is a relative value mainly determined by the data amount (the ratio of the clustered records of one car to the total records in which this car is involved). As the data amount grows, the clustering accuracy increases gradually. However, there is no direct relationship between the clustering accuracy and the computation time, which leads


Table 4: Comparison experiments of MCR-ACA and MR-ACA.

Amount (×10⁷) | MCR-ACA: Map | Combine | Reduce | Total time | Accuracy | MR-ACA: Extract | Map | Combine | Reduce | Total time | Accuracy
2 | 20.36 | 1.51 | 1.36 | 23.23 | 87.89 | 11.74 | 21.92 | 0 | 2.71 | 36.37 | 87.92
4 | 45.92 | 5.44 | 2.52 | 53.88 | 88.72 | 25.31 | 47.35 | 0 | 6.13 | 78.79 | 89.04
6 | 67.89 | 7.80 | 5.10 | 80.78 | 89.43 | 39.76 | 83.17 | 0 | 11.48 | 134.41 | 89.86
8 | 96.57 | 11.83 | 7.40 | 115.81 | 90.55 | 55.68 | 112.22 | 0 | 16.28 | 184.18 | 90.43
10 | 129.77 | 15.21 | 10.42 | 155.41 | 91.61 | 68.41 | 145.52 | 0 | 23.57 | 237.51 | 91.56

Table 5: The number of classes.

Amount (×10⁷) | Number of classes
2 | 6
4 | 9
6 | 11
8 | 18
10 | 23

Table 6: Trajectories of the cars in a group.

HPHM | JGSJ | JGDD | XSFX | XSSD | CSYS
S032W0 | 20131227200308 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 62 | C
S470A5 | 20131227200636 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 43 | J
S560V8 | 20131227200638 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 47 | A
S032V0 | 20131227202633 | Northwest corner of the crossroad of Zhujiang road and north Taiping road | 2 | 41 | C
S560V8 | 20131227202957 | Southeast corner of the crossroad of Hanzhongmen street and middle Jiangdong road | 6 | 29 | A
SFM979 | 20131227203021 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 39 | A
S470A5 | 20131227203103 | Southeast corner of the crossroad of Yangzijiang road and Qingliangmen street | 4 | 40 | A

[Figure 5: Trajectories of the cars in a group (S032V0, S470A5, S560V8, and SFM979).]

to the inconsistency between the growth of the computation time and the growth of the accuracy. The numbers of classes obtained in the experiments are listed in Table 5.

The results show that 4 cars in one mined group exhibit obviously similar trajectories, which are listed in Table 6.

Table 6 indicates that the 4 cars were caught by the same camera within half an hour. The plate number "S032V0" was misidentified as "S032W0" by the camera capture. Through clustering, the trajectories of these 4 cars share plenty of traces with the same or similar features, which construct a cluster. The time feature is every Friday evening, and the location feature overlaps along the way to the university, as shown in Figure 5.

5. Conclusion and Future Work

Critical issues for group mining on big data of vehicle trajectories are centralization and source distribution. In this paper, a distributed parallel clustering method, MCR-ACA, is proposed for group mining on distributed vehicle trajectories. Parallel clustering is realized while the communication overhead of big data is avoided. The method is tested on the traffic monitoring systems of three cities (including the central city, Nanjing) of Jiangsu province in China. Experimental results demonstrate that the proposed method achieves better performance on group mining.

Group mining can be used in many scenarios. According to the experimental results in this paper, two aspects are promising for further work: (i) the forecast of group behavior based on specific features; for example, if the time feature of a group is midnight and the location feature is somewhere with a high crime incidence, the group can be regarded as a possible crime group with high probability; (ii) outlier analysis for vehicle trajectories. Some vehicle trajectory outliers are formed in the clustering process (e.g., the abandoned vehicle trajectories defined in this paper); the reason that these trajectories are abandoned as outliers is useful for behavior forecasting.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant 61272377) and the Key Technology R&D Program of Jiangsu Province (BE2014733).

References

[1] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018–1021, 2010.

[2] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the crowd: the privacy bounds of human mobility," Scientific Reports, vol. 3, article 1376, 2013.

[3] S. Wang, H. Wang, X. Qin, et al., "Architecting big data: challenges, studies and forecasts," Chinese Journal of Computers, vol. 34, no. 10, pp. 1741–1752, 2011.

[4] M. Armbrust, A. Fox, R. Griffith, et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.

[5] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.

[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[7] C. Jayalath, J. Stephen, and P. Eugster, "From the cloud to the atmosphere: running MapReduce across data centers," IEEE Transactions on Computers, vol. 63, no. 1, pp. 74–87, 2014.

[8] B. Chandramouli, J. Goldstein, and S. Duan, "Temporal analytics on big data for web advertising," in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 90–101, April 2012.

[9] A. Mukherjee, J. Datta, R. Jorapur, et al., "Shared disk big data analytics with Apache Hadoop," in Proceedings of the 19th International Conference on High Performance Computing (HiPC '12), pp. 1–6, IEEE, 2012.

[10] S. Fiore, A. D'Anca, C. Palazzo, et al., "Ophidia: toward big data analytics for eScience," Procedia Computer Science, vol. 18, pp. 2376–2385, 2013.

[11] Y. Kim, K. Shim, M.-S. Kim, and J. Sup Lee, "DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 15–35, 2014.

[12] N. Laptev, K. Zeng, and C. Zaniolo, "Very fast estimation for result and accuracy of big data analytics: the EARL system," in Proceedings of the 29th International Conference on Data Engineering (ICDE '13), pp. 1296–1299, Brisbane, Australia, April 2013.

[13] D. Garg, K. Trivedi, and B. Panchal, "A comparative study of clustering algorithms using MapReduce in Hadoop," International Journal of Engineering, vol. 2, no. 10, 2013.

[14] W. Zhao, V. Martha, and X. Xu, "PSCAN: a parallel structural clustering algorithm for big networks in MapReduce," in Proceedings of the 27th IEEE International Conference on Advanced Information Networking and Applications (AINA '13), pp. 862–869, Barcelona, Spain, March 2013.

[15] Y. Yu, P. K. Gunda, and M. Isard, "Distributed aggregation for data-parallel computing: interfaces and implementations," in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP '09), pp. 247–260, October 2009.

[16] X. Cheng and N. Xiao, "Parallel implementation of dynamic positive and negative feedback ACO with iterative MapReduce model," Journal of Information and Computational Science, vol. 10, no. 8, pp. 2359–2370, 2013.

[17] Y. Yang, X. Ni, H. Wang, et al., "Parallel implementation of ant-based clustering algorithm based on Hadoop," in Proceedings of the 3rd International Conference on Swarm Intelligence (ICSI '12), pp. 190–197, 2012.

[18] E. Bonabeau, M. Dorigo, and G. Theraulaz, "Inspiration for optimization from social insect behaviour," Nature, vol. 406, no. 6791, pp. 39–42, 2000.

[19] M. Dorigo, E. Bonabeau, and G. Theraulaz, "Ant algorithms and stigmergy," Future Generation Computer Systems, vol. 16, no. 8, pp. 851–871, 2000.



[Figure 1: Topological network of the traffic monitoring system; source nodes equipped with RFID readers and cameras connect through city nodes to the central node over the WAN.]

Table 1: Data collected by a subsystem.

HPHM | JGSJ | JGDD | XSFX | XSSD | CSYS
S032V0 | 20130521073907 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 1 | 41 | A
S032V0 | 20130522191001 | Southwest corner of the crossroad of Jinxianghe road and east Beijing road | 0 | 25 | J
S032V0 | 20130524213427 | Northwest corner of the crossroad of Xuanwu road and Huayuan road | 2 | 85 | B
S470A5 | 20130523071732 | Northeast corner of the crossroad of south Fengtai road and Hexi street | 6 | 31 | J
S470A5 | 20130524101258 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 2 | 42 | J

according to which drivers with the same or similar features are identified as a group. Because of the two characteristics mentioned in Section 2, traditional parallel clustering methods are no longer efficient, which motivates us to present the following method.

3.1. Clustering Framework for Distributed Big Data in WAN. Data clustering is very difficult for big data that is stored distributively in a WAN. The reason lies in two aspects: (i) the huge amount of data makes the clustering computation time-consuming, which renders existing methods infeasible; (ii) the communication overhead of data migration is generally greater than the computing cost, so it is better to migrate computation rather than data. Therefore, a distributed parallel computing framework (MCR), which is based on MapReduce, is proposed in this paper. The framework of MCR is depicted in Figure 2.

Based on MCR, the traditional ACA is adapted to MCR-ACA for group mining on big data. The procedure of MCR is described as follows.

(i) Divide each data source S_i (i = 1, 2, ..., Z) into H_i data chunks B_i1, B_i2, ..., B_iH_i.

(ii) Map operations are carried out on each data chunk B_ij by a clustering strategy; all records in B_ij are clustered by the given strategy.

(iii) The clustered results are merged into intermediate ones by Combine. For example, B_ij is clustered and combined into a set of intermediate results with m_ij elements (m_ij is usually small).

(iv) The intermediate results are sent to the central node, where Reduce is conducted for global clustering.

(v) The method terminates if the global clustering converges or the maximal number of iterations g_max is reached. Otherwise, the comparison parameter is sent to each data chunk by Reduce, and the next iteration starts from step (ii).

The computing operations with the heaviest burden are conducted in parallel at the source nodes. The data in each source node is divided into data chunks, and all chunks are clustered in parallel, which leads to good efficiency. The communication overhead is significantly reduced by transmitting the intermediate results combined at the local source nodes rather than the source data. The global clustering is conducted on the intermediate results at the central node.
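The five steps can be strung together as a driver loop. Everything below is a toy sketch: `map_chunk`, `combine_chunk`, and `reduce_global` are hypothetical stand-ins for the three phases, with convergence simplified to a fixed point of the global "center":

```python
# Skeleton of the MCR iteration in steps (i)-(v); the three phase functions
# are toy stand-ins, not the paper's actual clustering logic.

def map_chunk(chunk, comparison):
    return chunk                        # toy Map: pass the records through

def combine_chunk(mapped):
    return (len(mapped), sum(mapped))   # toy Combine: (count, attribute sum)

def reduce_global(parts):
    n = sum(count for count, _ in parts)
    total = sum(s for _, s in parts)
    return total / n, 0.0               # toy global center and next comparison

def run_mcr(sources, g_max, chunk_size=2):
    comparison = float("inf")           # parameter Reduce sends back, step (v)
    centers_prev, centers = None, None
    for _ in range(g_max):
        intermediates = []
        for source in sources:          # (i) split each source into chunks
            chunks = [source[i:i + chunk_size]
                      for i in range(0, len(source), chunk_size)]
            for chunk in chunks:        # (ii) Map, then (iii) Combine locally
                intermediates.append(combine_chunk(map_chunk(chunk, comparison)))
        centers, comparison = reduce_global(intermediates)   # (iv) global step
        if centers == centers_prev:     # (v) converged: stop early
            break
        centers_prev = centers
    return centers
```

Only the small `(count, sum)` intermediates cross the node boundary in this sketch, mirroring how MCR keeps the heavy computation local.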


[Figure 2: MCR distributed parallel computing framework; clustering tasks run in parallel on the data of each source node, their results are combined locally, and the intermediate results are sent over the WAN to the central node, which produces the global clustering result.]

3.2. MCR-ACA Method for Group Mining. ACA was inspired by the phenomenon that ant individuals gather at a location with food through the pheromone interaction among them [18, 19]. By integrating ACA with the computing framework MCR, the group mining method MCR-ACA is proposed. The number of classes and the trajectory records in a class are determined adaptively, and the clustering centers are generated over iterations without predefinition, which is desirable for the considered problem.

3.2.1. Map Function of MCR-ACA. A vector of m elements (attributes) is developed for each record of the vehicle trajectories; that is, a data record R_k to be clustered is denoted as R_k = (r_k^1, r_k^2, ..., r_k^m). a_ij ants are assigned to data chunk B_ij, each of which randomly serves a record R_k in the initial step. A neighborhood is established with the center R_k and radius ℛ (from experience), denoted as N(R_k, ℛ). The comprehensive similarity between the center R_k and all records within its neighborhood N(R_k, ℛ) is defined by

$$ f(R_k) = \sum_{R_l \in N(R_k, \mathcal{R})} \left[ 1 - \frac{d_{kl}}{\lambda} \right], \qquad (1) $$

where λ is the similarity factor representing the range of the dimensions (the difference between the maximum and minimum of a dimension). Let d_kl be the space distance between two records R_k and R_l, which is calculated as the weighted distance

$$ d_{kl} = \left\| P \left( R_k - R_l \right) \right\| = \sqrt{ \sum_{h=1}^{m} P_h \left( r_k^h - r_l^h \right)^2 }, \qquad (2) $$

where P_h is the weight based on experience and the data collecting accuracy. Therefore, the probability of clustering record R_k into class N(R_k, ℛ) is computed by

$$ p_{kl}(t) = \frac{ \tau_{kl}^{\alpha}(t)\, f_t^{\beta}(R_k) }{ \sum_{z \in N(R_l, \mathcal{R})} \tau_{zl}^{\alpha}(t)\, f_t^{\beta}(R_z) }, \qquad (3) $$

where α and β are control parameters and τ_kl(t) is the pheromone amount on the path from R_k to R_l at time t (τ_kl(0) = 1).
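For concreteness, Eqs. (1)–(3) can be sketched on made-up 3-attribute records; the weights and λ below are illustrative assumptions, not the paper's 6-attribute configuration:

```python
import math

# Illustrative sketch of Eqs. (1)-(3); the 3-attribute records and weights
# below are made-up examples, not the paper's 6-element trajectory schema.
P = [0.5, 0.3, 0.2]     # attribute weights P_h (hypothetical values)
LAMBDA = 10.0           # similarity factor: the range of the dimensions

def weighted_distance(Rk, Rl):
    # Eq. (2): d_kl = sqrt(sum_h P_h * (r_k^h - r_l^h)^2)
    return math.sqrt(sum(p * (a - b) ** 2 for p, a, b in zip(P, Rk, Rl)))

def comprehensive_similarity(Rk, neighborhood):
    # Eq. (1): f(R_k) = sum over R_l in N(R_k, radius) of [1 - d_kl / lambda]
    return sum(1.0 - weighted_distance(Rk, Rl) / LAMBDA for Rl in neighborhood)

def clustering_probability(tau_k, f_k, taus, fs, alpha=1.0, beta=1.0):
    # Eq. (3): pheromone- and similarity-weighted share for R_k over all
    # candidates z in the neighborhood (taus[z] = tau_zl(t), fs[z] = f_t(R_z)).
    den = sum(t ** alpha * f ** beta for t, f in zip(taus, fs))
    return (tau_k ** alpha) * (f_k ** beta) / den

center = (1.0, 2.0, 3.0)
neigh = [(1.5, 2.0, 3.0), (1.0, 2.5, 4.0)]
f_c = comprehensive_similarity(center, neigh)
p = clustering_probability(1.0, f_c, [1.0, 1.0], [f_c, 0.5 * f_c])  # p == 2/3
```

In this toy run p comes out as 2/3 because the two candidates carry equal pheromone but the second has half the similarity.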

The decision of putting down or moving R_k is made in terms of the clustering probability p_kl(t). (i) If p_kl(t) is greater than or equal to the given threshold p_0, the ant puts down R_k and clusters it into class N(R_k, ℛ). The traveled path length of the ant is saved, the location where R_k was put down is set as the start of a new traversal path, and another record is randomly assigned to the ant. (ii) If p_kl(t) is less than p_0, the ant carrying R_k keeps moving to the next point R_l with the largest p_kl(t). R_k is dropped when the path length reaches the maximum or when the ant has not found a proper cluster by the end of its travel (the record is then regarded as abandoned), and the ant gets a new record. After all records in B_ij have been traveled by ants, that is, |B_ij| records are clustered or abandoned, the local clustering stops.

The Map function takes (⟨key, R_k⟩, ⟨p, d, s⟩) as the input key/value pair, where key is the key value of R_k, p is the clustering probability, and d is the path length along which the ant carries R_k; p and d are initialized as 0. s is the coordinate of the nodes along the path with length d, which is initialized as 0. g is the index of the current iteration, and τ_kl(g) is the pheromone value on the path from R_k to R_l after the g-th iteration, with the initial value 1. p_k is the clustering probability for R_k, d_k is the path length when R_k is clustered or abandoned, and s_k is the node where R_k is clustered or abandoned. d_g is the minimum d_k after the g-th iteration.

International Journal of Distributed Sensor Networks 5

Input Data chunk 119861119894119895

(1) 119901 larr 0 119889 larr 0119873119906119898 larr 0 119904 larr 0 120591119896119897(0) larr 1 119901

0is given

(2) while (119873119906119898 ⩽ |119861119894119895|) do

(3) Calculate the weight distance 119889119896119897between 119877

119896with all records in119873(119877

119896R) by (2)

(4) Calculate the comprehensive similarity 119891(119877119896) between 119877

119896with all records in119873(119877

119896R) by (1)

(5) Read the pheromone value 120591119896119897(119892 minus 1) calculate the clustering probability 119901

119896119897(119905) by (3)

(6) if 119901119896119897(119905) ge 119901

0then

(7) Cluster 119877119896into119873(119877

119896R) save 119901

119896 119889119896 119904119896119873119906119898 larr 119873119906119898 + 1

(8) Go to Step 17(9) if 119889 ge 119889

119892minus1then

(10) Abandon 119877119896 save 119901

119896 119889119896 119904119896119873119906119898 larr 119873119906119898 + 1

(11) Go to Step 17(12) Select the node with largest 119901

119896119897(119905) into 119904

(13) if all records are examined by the ant then(14) Abandon 119877

119896 save 119901

119896 119889119896 119904119896

(15) Go to Step 17(16) 119889 larr 119889 + 119889

119896119897 go to Step 2

(17) Randomly assign the ant a new record which has not been clustered or abandoned(18) Output the clustering result (⟨119896119890119910 119877

119896⟩ ⟨119901119896 119889119896 119904119896⟩)

(19) return

Algorithm 1 Map function of MCR-ACA

Input Probability threshold 1199010

(1) if 119901119896lt 1199010then

(2) 119889119894119895larr min119889

119896

(3) Output 119889119894119895and go to Step 6

(4) Combine records with the same 119904119896and generate N

119904119896

(5) Calculate 119862119904119896 output (119904

119896 ⟨N119904119896 119862119904119896⟩)

(6) Update pheromone 120591119896119897(119892)

(7) return

Algorithm 2 Combine function of MCR-ACA

gth iteration, which is the comparing parameter for the next iteration. The Map function on B_ij is described in Algorithm 1.

3.2.2. Combine Function of MCR-ACA. The results (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩) obtained by the Map function are combined into intermediate results (s_k, ⟨N_{s_k}, C_{s_k}⟩) by the Combine function at the local nodes. The minimum d_ij along the paths of this iteration is found. The pheromone is also updated by the Combine function: it is increased when ants pass by and evaporates over time. The pheromone along the path from R_k to R_l after the gth Map function is updated by

τ_kl(g) = (1 − ρ) · τ_kl(g − 1) + Δe,  (4)

where ρ ∈ (0, 1] denotes the evaporation rate of the pheromone and Δe is the pheromone deposited by a passing ant: Δe is set to 1 if an ant passes by; otherwise it is set to 0.
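Equation (4) is a one-line update; a hedged sketch, where the evaporation rate ρ is an assumed parameter value:

```python
def update_pheromone(tau_prev, ant_passed, rho=0.1):
    """Pheromone update of (4): evaporate the old value by the rate rho and
    add the deposit Delta-e, which is 1 if an ant passed and 0 otherwise."""
    delta_e = 1.0 if ant_passed else 0.0
    return (1.0 - rho) * tau_prev + delta_e
```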

For the records with clustering probability less than the given threshold p_0, the minimal d_k is set as the minimum path length d_ij in the data chunk B_ij. All the records with clustering probability not less than p_0 are combined according to their clustered nodes s_k: records with the same s_k are merged into the same class s_k, whose size is denoted by N_{s_k}. For the records R_{s_k,k} in class s_k, the sum of the attribute vectors is

C_{s_k} = Σ_{k=1}^{N_{s_k}} R_{s_k,k} = (Σ_{k=1}^{N_{s_k}} r^1_{s_k,k}, Σ_{k=1}^{N_{s_k}} r^2_{s_k,k}, ..., Σ_{k=1}^{N_{s_k}} r^m_{s_k,k}).

Multiple Combine functions can be conducted in parallel for one data source S_i, each of which works on one or several data chunks. The Combine function for B_ij is described in Algorithm 2.

There are only two possible outputs from the Combine function: either (s_k, ⟨N_{s_k}, C_{s_k}⟩) or the minimal path length d_ij. Therefore, the data sent to the central node in the WAN can be greatly reduced. If data chunk B_ij is combined into m_ij classes, the communication overhead is m_ij intermediate results (s_k, ⟨N_{s_k}, C_{s_k}⟩) plus one d_ij. In tests on practical data, a data chunk of 18 GB needs to transmit only 30 KB after combination, which is only about 1.6 × 10⁻⁴ of the original data.

3.2.3. Reduce Function of MCR-ACA. At the gth iteration, the two parts obtained from the Combine phase on the data chunks, that is, the intermediate results (s_{k,g}, ⟨N_{s_k,g}, C_{s_k,g}⟩) and d_{ij,g}, are recombined by the Reduce function and new clustering centers are generated. The weighted distances among the clustering centers of different data chunks are calculated by (2). If such a distance is less than or equal to ℛ, the parts are merged into one class N(s_{k,g}, ℛ). The global cluster center C_{s_k,g} at the gth iteration is computed as Σ_{N(s_{k,g},ℛ)} C_{s_k,g} / Σ_{N(s_{k,g},ℛ)} N_{s_k,g}. The method converges to and outputs the global clustering result if |C_{s_k,g} − C_{s_k,g−1}| ≤ |C_{s_k,g−1} − C_{s_k,g−2}|. Otherwise, the minimal d_{ij,g} is output as d_g and sent to each source node for the next comparison. The Reduce function is described in Algorithm 3.

6 International Journal of Distributed Sensor Networks

(1) Calculate the weighted distance d_{s_{k,g}, s_{k′,g}} between different cluster centers s_{k,g} and s_{k′,g}, respectively located in different data chunks, by (2)
(2) if d_{s_{k,g}, s_{k′,g}} ≤ ℛ then
(3) Combine s_{k,g} and s_{k′,g} into the same class N(s_{k,g}, ℛ)
(4) Calculate the global cluster center C_{s_k,g} = (Σ_{N(s_{k,g},ℛ)} C_{s_k,g}) / (Σ_{N(s_{k,g},ℛ)} N_{s_k,g}) of the gth iteration
(5) if C_{s_k,g} converges then
(6) Output the global clustering result
(7) Go to Step 9
(8) Output the minimal d_{ij,g} as d_g
(9) return

Algorithm 3: Reduce function of MCR-ACA.

(1) p ← 0; d ← 0; d_0 ← ∞; Num ← 0; s ← 0; τ_kl(0) ← 1; p_0 is given
(2) while (g ≤ g_max) do
(3) Conduct the Map functions in parallel; output the clustering results (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩)
(4) Perform the Combine functions in parallel; output the intermediate results (s_k, ⟨N_{s_k}, C_{s_k}⟩) and d_ij for each data chunk B_ij
(5) Carry out the Reduce functions in parallel to develop the global cluster center C_{s_k,g}
(6) g ← g + 1
(7) if (C_{s_k,g} does not converge) then
(8) Output the minimal d_{ij,g} as d_g to each source node
(9) Output the global classes
(10) return

Algorithm 4: MCR-ACA.
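The overall loop of Algorithm 4 can be sketched as follows. The three phase callbacks are hypothetical placeholders for the functions of Algorithms 1-3; in this sketch, convergence of the global centers is decided inside `reduce_fn`, which also returns the minimal d_g fed back to the next Map round.

```python
def mcr_aca(chunks, map_fn, combine_fn, reduce_fn, g_max):
    """Sketch of Algorithm 4: run Map and Combine per data chunk, then a
    global Reduce, until convergence or the iteration limit g_max."""
    d_g = float("inf")                 # d_0 <- infinity, as in step (1)
    centers = None
    for g in range(1, g_max + 1):
        # Steps 3-4: local Map and Combine on every chunk (parallel in MCR).
        local = [combine_fn(map_fn(chunk, d_g)) for chunk in chunks]
        # Step 5 and 7-8: global Reduce, convergence test, feedback of d_g.
        centers, converged, d_g = reduce_fn(local, centers)
        if converged:
            break
    return centers
```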

3.2.4. MCR-ACA Method Description. MCR-ACA is constructed by integrating MCR with the Map function conducted on the data chunks at the different source nodes, the Combine function on the local clustering results, and the Reduce function on the global cluster centers. Assume the maximum number of iterations is g_max. The MCR-ACA method is described in Algorithm 4.

4. Experimental Results on Practical Big Data

In the experiments, the MCR-ACA method is compared with the existing MR-ACA method on the traffic monitoring system of Jiangsu province in China. Two cities, Nantong and Changzhou, are selected, and two subsystems are chosen from each of them; Nanjing is the central city. The subsystems are linked by 1000 Mbps fiber within each city. The distance between Nantong and the central city is 270 kilometers, and there are 140 kilometers between Changzhou and the center, Nanjing. The cities are connected by the Internet with a bandwidth of 200 Mbps. We adopt Hadoop, Mahout, and IK as the software tools. Two PCs are used in each of the two cities, while four PCs work in the central node. All of them are configured with an Intel 5620 CPU (2.4 GHz, 6 cores), 4 GB memory, and a 300 GB disk. In the MCR-ACA experiment, the Map and Combine operations are conducted in parallel on the four PCs in the two cities and the Reduce function is conducted in the central node. In the MR-ACA experiment, all the data is transmitted to the central node and processed by the four PCs there, where the Map, Combine, and Reduce functions are performed.

The records on vehicle trajectories are represented by a set of 6 elements (HPHM, JGSJ, JGDD, XSFX, XSSD, and CSYS) with the weights P_h being 0.05, 0.3, 0.3, 0.15, 0.15, and 0.05, respectively. Since both MCR-ACA and MR-ACA are based on MapReduce, the experiments focus on the scale growth of the vehicle trajectories. According to the experiments, the neighborhood radius ℛ and the threshold probability p_0 of the two methods are found to be the key parameters of the Map functions. The parameters are mainly determined by a combination of requirements such as accuracy and efficiency. Meanwhile, the two parameters affect each other and thus are usually given in pairs. As a result, the comparison experiments on the parameters are conducted in advance based on the MCR-ACA method, since the Map functions of both methods are roughly the same. In the experiments, there are 56 Map functions with 3 ants for each Map function, together with 4 Combine functions and 4 Reduce functions, whose maximal number of iterations is 5. The experiment data consists of 4 × 10⁷ vehicle trajectory records. The experiment results are shown in Tables 2 and 3.

It can be seen from Table 2 that, as ℛ increases, the total time of the clustering computation increases while the accuracy of clustering remains robust. Table 3 illustrates that, as p_0 increases, the total time of the clustering computation tends to decrease and then increase, while the accuracy of clustering fluctuates. For a further comparison of the effect of the different parameter pairs


Table 2: The clustering time (H) and the accuracy (%) of MCR-ACA with different neighborhood radius ℛ.

ℛ       p_0     Total time   Accuracy   Accuracy/total time
0.004   0.441   52.60        86.45      1.64
0.005   0.441   53.58        87.56      1.63
0.006   0.441   53.88        88.72      1.65
0.007   0.441   54.56        88.63      1.62
0.008   0.441   54.79        87.25      1.59

Table 3: The clustering time (H) and the accuracy (%) of MCR-ACA with different threshold probability p_0.

ℛ       p_0     Total time   Accuracy   Accuracy/total time
0.006   0.431   54.53        87.49      1.60
0.006   0.436   54.01        87.92      1.63
0.006   0.441   53.88        88.72      1.65
0.006   0.446   54.21        88.73      1.64
0.006   0.451   55.23        88.52      1.60

of ℛ and p_0 on the clustering time and accuracy, the comparison is based on the accuracy per unit time (i.e., accuracy/total time). According to Tables 2 and 3, ℛ = 0.006 and p_0 = 0.441 form the ideal parameter pair.

Therefore, the experimental parameters are set as follows: 56 Map functions, 4 Combine functions, 4 Reduce functions, a maximal number of iterations of 5, 3 ants for each Map function, the neighbourhood radius ℛ set to 0.006, p_0 = 0.441, and the similarity factor set by the range of the dimensions. In the experiments, the records on vehicle trajectories are evenly divided into 56 chunks, with one Map function working on one data chunk. The communication overhead of data extraction is the time cost of extracting the records from Nantong and Changzhou to the central node. The results are given in Table 4. The units for the Map, Combine, Reduce, and total times are hours, and the unit for accuracy is percentage.

Table 4 implies that the accuracy of the two methods is similar and rises as the data amount becomes larger. The data amount is key to the clustering accuracy: reducing the data scale by sampling also reduces the accuracy. The increasing rate of the clustering accuracy of MCR-ACA is greater than that of MR-ACA, as depicted in Figure 3.

Furthermore, the computing time of the Map function of MR-ACA is longer than that of MCR-ACA, and the difference becomes bigger as the data amount increases. The reason lies in the fact that all the records are mixed up on the hard disk after extraction, which makes the Map function more complicated in the central node than in the data source nodes. The Map function of MR-ACA works on data chunks that are divided from the mixed data stored on the central node, and these data chunks become more complex in their elements as the data amount grows. Therefore, the Map function of MR-ACA costs more computing time. The comparison is indicated in Figure 4.

For the Reduce function, the time consumed by MR-ACA is about twice as much as that of MCR-ACA, but less than the sum of the computing time of both the Reduce function and

Figure 3: Comparison on group mining accuracy (accuracy (%) versus data amount (×10⁷) for MCR-ACA and MR-ACA).

Figure 4: Comparison on Map function time consumed (time consumed (h) versus data amount (×10⁷) for MCR-ACA and MR-ACA).

the Combine function in MCR-ACA. Actually, the processing time of the Combine function in MCR-ACA is included in the Reduce function of MR-ACA.

As shown in Table 4, the total time of MR-ACA is, on average, 50% longer than that of MCR-ACA due to the data extraction time, which indicates that data extraction is the most influential factor for big data clustering.

Table 4 also demonstrates that, as the data amount increases, the computation time of both methods increases rapidly while the accuracy improvement is quite limited. The reason is that, since the numbers of Map, Combine, and Reduce functions stay the same, the amount of data in the data blocks handled by the Map, Combine, and Reduce functions grows proportionally, which finally causes the computation time to grow rapidly. The clustering accuracy is a relative value mainly determined by the data amount (the ratio of the clustered records of one car to the total records in which this car is involved). As the data amount grows, the clustering accuracy increases gradually. However, there is no direct relationship between the clustering accuracy and the computation time, which leads


Table 4: Comparison experiments of MCR-ACA and MR-ACA (times in hours, accuracy in %).

Amount (×10⁷) | MCR-ACA: Map, Combine, Reduce, Total, Accuracy | MR-ACA: Extract, Map, Combine, Reduce, Total, Accuracy
2   | 20.36,  1.51,   1.36,  23.23,  87.89 | 11.74, 21.92,  0,  2.71,  36.37, 87.92
4   | 45.92,  5.44,   2.52,  53.88,  88.72 | 25.31, 47.35,  0,  6.13,  78.79, 89.04
6   | 67.89,  7.80,   5.10,  80.78,  89.43 | 39.76, 83.17,  0, 11.48, 134.41, 89.86
8   | 96.57, 11.83,   7.40, 115.81,  90.55 | 55.68, 112.22, 0, 16.28, 184.18, 90.43
10  | 129.77, 15.21, 10.42, 155.41,  91.61 | 68.41, 145.52, 0, 23.57, 237.51, 91.56

Table 5: The number of classes.

Amount (×10⁷)   Number of classes
2               6
4               9
6               11
8               18
10              23

Table 6: Trajectories of the cars in a group.

HPHM     JGSJ             JGDD                                                                                XSFX   XSSD   CSYS
S032W0   20131227200308   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3      62     C
S470A5   20131227200636   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3      43     J
S560V8   20131227200638   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3      47     A
S032V0   20131227202633   Northwest corner of the crossroad of Zhujiang road and north Taiping road           2      41     C
S560V8   20131227202957   Southeast corner of the crossroad of Hanzhongmen street and middle Jiangdong road   6      29     A
SFM979   20131227203021   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road               3      39     A
S470A5   20131227203103   Southeast corner of the crossroad of Yangzijiang road and Qingliangmen street       4      40     A

Figure 5: Trajectories of the cars in a group (the four plates S032V0, S470A5, S560V8, and SFM979 at locations 1-4).

to the inconsistency between the growth of the computation time and the growth of the accuracy. According to the experiment, the number of obtained classes is listed in Table 5.

The results show that 4 cars in a research group exhibit obviously similar trajectories, which are listed in Table 6.

Table 6 indicates that the 4 cars were caught by the same cameras within half an hour. The car number "S032V0" was misidentified as "S032W0" by the camera capture. Through clustering, the trajectories of these 4 cars show plenty of traces with the same or similar features, which construct a cluster: the time feature is every Friday evening, and the location feature overlaps along the way to the university, as shown in Figure 5.

5. Conclusion and Future Work

Critical issues for group mining on big data of vehicle trajectories are centralization and source distribution. In this paper, a distributed parallel clustering method, MCR-ACA, is proposed for group mining on distributed vehicle trajectories. Parallel clustering is realized while the communication overhead of big data is avoided. The method is tested on the traffic monitoring systems of three cities (including the central city, Nanjing) of Jiangsu province in China. Experimental results demonstrate that the proposed method achieves better performance on group mining.

Group mining can be used in many scenarios. According to the experimental results in this paper, two aspects are promising for further work: (i) the forecast of group behavior based on specific features; for example, if the time feature of a group is midnight and the location feature is somewhere with a high crime incidence, the group can be regarded as a possible crime group with high probability; (ii) outlier analysis for vehicle trajectories; some vehicle trajectory outliers are formed in the clustering process (e.g., the abandoned vehicle trajectories defined in this paper), and the reason that these trajectories are abandoned as outliers is useful for behavior forecasting.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant 61272377) and the Key Technology R&D Program of Jiangsu Province (BE2014733).

References

[1] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018-1021, 2010.
[2] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the crowd: the privacy bounds of human mobility," Scientific Reports, vol. 3, article 1376, 2013.
[3] S. Wang, H. Wang, X. Qin et al., "Architecting big data: challenges, studies and forecasts," Chinese Journal of Computers, vol. 34, no. 10, pp. 1741-1752, 2011.
[4] M. Armbrust, A. Fox, R. Griffith et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50-58, 2010.
[5] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.
[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[7] C. Jayalath, J. Stephen, and P. Eugster, "From the cloud to the atmosphere: running MapReduce across data centers," IEEE Transactions on Computers, vol. 63, no. 1, pp. 74-87, 2014.
[8] B. Chandramouli, J. Goldstein, and S. Duan, "Temporal analytics on big data for web advertising," in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 90-101, April 2012.
[9] A. Mukherjee, J. Datta, R. Jorapur et al., "Shared disk big data analytics with Apache Hadoop," in Proceedings of the 19th International Conference on High Performance Computing (HiPC '12), pp. 1-6, IEEE, 2012.
[10] S. Fiore, A. D'Anca, C. Palazzo et al., "Ophidia: toward big data analytics for eScience," Procedia Computer Science, vol. 1, pp. 2376-2385, 2013.
[11] Y. Kim, K. Shim, M.-S. Kim, and J. Sup Lee, "DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 15-35, 2014.
[12] N. Laptev, K. Zeng, and C. Zaniolo, "Very fast estimation for result and accuracy of big data analytics: the EARL system," in Proceedings of the 29th International Conference on Data Engineering (ICDE '13), pp. 1296-1299, Brisbane, Australia, April 2013.
[13] D. Garg, K. Trivedi, and B. Panchal, "A comparative study of clustering algorithms using MapReduce in Hadoop," International Journal of Engineering, vol. 2, no. 10, 2013.
[14] W. Zhao, V. Martha, and X. Xu, "PSCAN: a parallel structural clustering algorithm for big networks in MapReduce," in Proceedings of the 27th IEEE International Conference on Advanced Information Networking and Applications (AINA '13), pp. 862-869, Barcelona, Spain, March 2013.
[15] Y. Yu, P. K. Gunda, and M. Isard, "Distributed aggregation for data-parallel computing: interfaces and implementations," in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP '09), pp. 247-260, October 2009.
[16] X. Cheng and N. Xiao, "Parallel implementation of dynamic positive and negative feedback ACO with iterative MapReduce model," Journal of Information and Computational Science, vol. 10, no. 8, pp. 2359-2370, 2013.
[17] Y. Yang, X. Ni, H. Wang et al., "Parallel implementation of ant-based clustering algorithm based on Hadoop," in Proceedings of the 3rd International Conference on Swarm Intelligence (ICSI '12), pp. 190-197, 2012.
[18] E. Bonabeau, M. Dorigo, and G. Theraulaz, "Inspiration for optimization from social insect behaviour," Nature, vol. 406, no. 6791, pp. 39-42, 2000.
[19] M. Dorigo, E. Bonabeau, and G. Theraulaz, "Ant algorithms and stigmergy," Future Generation Computer Systems, vol. 16, no. 8, pp. 851-871, 2000.


Page 4: Research Article A Group Mining Method for Big …downloads.hindawi.com/journals/ijdsn/2015/756107.pdfResearch Article A Group Mining Method for Big Data on Distributed Vehicle Trajectories

4 International Journal of Distributed Sensor Networks

Clustering resultcombination

Clusteringtask 1

Clusteringtask 2

Clustering

Data 1 Data 2 Data 3

Source node 1

Clustering resultcombination

Clustering Clustering Clustering

Source node S

Global clustering result

Intermediate resultfrom source node 1

Intermediate resultfrom source node S

Central node

WAN WAN

WANWAN

task 3 task 1 task 2 task 3

Data 1 Data 2 Data 3

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 2 MCR distributed parallel computing framework

32 MCR-ACAMethod for Group Mining ACA was inspiredfrom the phenomenon that ant individuals gather at alocation with food by pheromone interaction among them[18 19] By integrating ACA with the computing frameworkMCR the groupminingmethodMCR-ACA is proposedThenumber of classes and the trajectory records in a class aredetermined adaptively and clustering centers are generatedin iterations without predefinition which is desirable for theconsidered problem

321 Map Function of MCR-ACA A vector of 119898 elements(attributes) is developed for each record of vehicle trajecto-ries that is a data record 119877

119896to be clustered is denoted as

R119896

= (1199031

119896 1199032

119896 119903

119898

119896) 119886119894119895

ants are assigned to data chunk119861119894119895 each of which randomly serves for a record 119877

119896in the

initial step A neighborhood is established with the center119877119896and radius R (from experience) denoted as 119873(119877

119896R)

The comprehensive similarity between the center 119877119896and all

records within its neighborhood119873(119877119896R) is defined by

119891 (119877119896) = sum

119877119897isin119873(119877119896R)

[1 minus119889119896119897

120582] (1)

where 120582 is the similarity factor representing the range ofdimensions (the difference between the maximum and min-imum of dimension) Let 119889

119896119897be the space distance between

two records 119877119896and 119877

119897 which is calculated by the weighted

distance according to

119889119896119897=1003817100381710038171003817119875 (119877119896 minus 119877

119897)1003817100381710038171003817 =

radic

119898

sum

ℎ=1

119875ℎ(119903ℎ

119896minus 119903ℎ

119897)2

(2)

where 119875ℎis the weight based on the experience and data

collecting accuracy Therefore the probability of clusteringrecord 119877

119896into class119873(119877

119896R) is computed by

119901119896119897(119905) =

120591120572

119896119897(119905) 119891120573

119905(119877119896)

sumisin119873(119877119897R)

120591120572

119911119897(119905) 119891120573

119905(119877119911)

(3)

where120572120573 are control parameters and 120591119896119897(119905) is the pheromone

amount on the path from 119877119896to 119877119897at time 119905 (120591

119896119897(0) = 1)

The decision of putting down or moving 119877119896is made in

terms of the clustering probability 119901119896119897(119905) (i) If 119901

119896119897(119905) is greater

than or equal to the given threshold 1199010 the ant puts down 119877

119896

and clusters it into class 119873(119877119896R) The traveled path length

of the ant is saved and the location where 119877119896was put down

is set as the start of a new traversal path then another recordis randomly assigned to the ant (ii) If 119901

119896119897(119905) is less than 119901

0

the ant carrying 119877119896keeps moving to the next point 119877

119897with

the largest 119901119896119897(119905) 119877119896is dropped when the path length reaches

the maximum or the ant has not found the proper clustersuntil the travel ends (it can be regarded as abandoned) Theant gets a new record After all records in 119861

119894119895are travelled by

ants that is |119861119894119895| records are clustered or abandoned local

clustering stopsThe Map function takes (⟨key 119877

119896⟩ ⟨119901 119889 119904⟩) as the input

keyvalue pair where key is the key value of 119877119896 119901 is the

clustering probability and 119889 is the path length where the antcarries 119877

119896 119901 and 119889 are initialized as 0 119904 is the coordinate of

nodes along the path with length 119889 which is initialized as 0 119892is the index of the current iteration 120591

119896119897(119892) is the pheromone

value on path 119877119896to 119877119897after 119892th iteration with the initial value

1 119901119896is the clustering probability for 119877

119896 119889119896is the path length

while 119877119896is clustered or abandoned and 119904

119896is the node where

119877119896is clustered or abandoned 119889

119892is the minimum 119889

119896after the

International Journal of Distributed Sensor Networks 5

Input Data chunk 119861119894119895

(1) 119901 larr 0 119889 larr 0119873119906119898 larr 0 119904 larr 0 120591119896119897(0) larr 1 119901

0is given

(2) while (119873119906119898 ⩽ |119861119894119895|) do

(3) Calculate the weight distance 119889119896119897between 119877

119896with all records in119873(119877

119896R) by (2)

(4) Calculate the comprehensive similarity 119891(119877119896) between 119877

119896with all records in119873(119877

119896R) by (1)

(5) Read the pheromone value 120591119896119897(119892 minus 1) calculate the clustering probability 119901

119896119897(119905) by (3)

(6) if 119901119896119897(119905) ge 119901

0then

(7) Cluster 119877119896into119873(119877

119896R) save 119901

119896 119889119896 119904119896119873119906119898 larr 119873119906119898 + 1

(8) Go to Step 17(9) if 119889 ge 119889

119892minus1then

(10) Abandon 119877119896 save 119901

119896 119889119896 119904119896119873119906119898 larr 119873119906119898 + 1

(11) Go to Step 17(12) Select the node with largest 119901

119896119897(119905) into 119904

(13) if all records are examined by the ant then(14) Abandon 119877

119896 save 119901

119896 119889119896 119904119896

(15) Go to Step 17(16) 119889 larr 119889 + 119889

119896119897 go to Step 2

(17) Randomly assign the ant a new record which has not been clustered or abandoned(18) Output the clustering result (⟨119896119890119910 119877

119896⟩ ⟨119901119896 119889119896 119904119896⟩)

(19) return

Algorithm 1 Map function of MCR-ACA

Input Probability threshold 1199010

(1) if 119901119896lt 1199010then

(2) 119889119894119895larr min119889

119896

(3) Output 119889119894119895and go to Step 6

(4) Combine records with the same 119904119896and generate N

119904119896

(5) Calculate 119862119904119896 output (119904

119896 ⟨N119904119896 119862119904119896⟩)

(6) Update pheromone 120591119896119897(119892)

(7) return

Algorithm 2 Combine function of MCR-ACA

119892th iteration which is the comparing parameter for the nextiteration Map function on 119861

119894119895is described in Algorithm 1

322 Combine Function of MCR-ACA The results obtainedby the Map function (⟨key 119877

119896⟩ ⟨119901119896 119889119896 119904119896⟩) are combined

into intermediate results (119904119896 ⟨N119904119896 119862119904119896⟩) by the Combine

function at local nodes The minimum 119889119894119895along the path of

this iteration is found Pheromone is updated by the Combinefunction which is increased when ants pass by and decreasedin timeThe pheromone along the path from119877

119896to119877119897after the

119892th Map function is updated by

120591119896119897(119892) = (1 minus 120588) sdot 120591

119896119897(119892 minus 1) + Δ119890 (4)

where 120588 isin (0 1] denotes the evaporating rate of pheromoneΔ119890 is the pheromone left by passing of ant Δ119890 is set as 1 if anant passes by otherwise it is set to 0

For the records with the clustering probability less thanthe given threshold 119901

0 the minimal 119889

119896is set as the min-

imum path length 119889119894119895

in the data chunk 119861119894119895 For all the

records with clustering probability not less than 1199010 they

are combined according to the clustered nodes 119904119896 Records

with the same 119904119896are merged into the same class 119904

119896 The

number is denoted as N119904119896 For the records 119877

119904119896 119896in class

119904119896 the sum of the attribute vector is 119862

119904119896= sum

N119904119896

119896=1119877119904119896 119896

=

(sumN119904119896

119896=11199031

119904119896 119896 sum

N119904119896

119896=11199032

119904119896 119896 sum

N119904119896

119896=1119903119898

119904119896 119896)

Multiple Combine functions can be conducted in parallelfor one data source 119878

119894 each of which works on one or several

data chunks The Combine function for 119861119894119895

is described inAlgorithm 2

There are only two possible outputs from the Combinefunction either (119904

119896 ⟨N119904119896 119862119904119896⟩) or the minimal path length

119889119894119895 Therefore the data sent to the central node in WAN

can be greatly reduced Data chunk 119861119894119895

is combined into119898119894119895classes the communication overhead is119898

119894119895intermediate

results (119904119896 ⟨N119904119896 119862119904119896⟩) and a 119889

119894119895 By testing on practical data

the data chunk with the amount of 18 GB only needs totransmit 30KB after combination which is only (16) times 10minus4of the original data

323 Reduce Function of MCR-ACA At the 119892th iterationtwo parts obtained from the Combine phase on data chunksthat is intermediate results (119904

119896119892 ⟨N119904119896119892 119862119904119896119892⟩) and 119889

119894119895119892 are

recombined by the Reduce function New clustering centersare generated The weighted distances among clusteringcenters in different data chunks are calculated by (2) If it isless than or equal to R the parts are merged into one class119873(119904119896119892R) The global cluster center 119862

119904119896119892at the 119892th iteration

is computed by sum119873(119904119896119892 R)

119862119904119896119892sum119873(119904119896119892 R)

N119904119896119892

119862119904119896119892

convergesto and outputs the global clustering result if |119862

119904119896119892minus 119862119904119896119892minus1

| le

|119862119904119896119892minus1

minus 119862119904119896119892minus2

| Otherwise the minimal 119889119894119895119892

is output as 119889119892

and sent to each source node for the next comparison TheReduce function is described in Algorithm 3

6 International Journal of Distributed Sensor Networks

(1) Calculate the weighted distance d_{s_k(g),s_k'(g)} between different cluster centers s_k(g) and s_k'(g), respectively located in different data chunks, by (2)
(2) if d_{s_k(g),s_k'(g)} ≤ R then
(3)   Combine s_k(g) and s_k'(g) into the same class N(s_k(g), R)
(4)   Calculate the global cluster center C_sk(g) = (Σ_{N(s_k(g),R)} C_sk(g)) / (Σ_{N(s_k(g),R)} N_sk(g)) of the g-th iteration
(5)   if C_sk(g) converges then
(6)     Output the global clustering result
(7)     Go to Step 9
(8) Output the minimal d_ij(g) as d(g)
(9) return

Algorithm 3: Reduce function of MCR-ACA.
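The merge performed by the Reduce function can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weighted distance of (2) is approximated by a hypothetical weighted Euclidean distance using the attribute weights P_h reported in Section 4 (the actual form of (2) is not shown in this excerpt), and intermediates are merged greedily.

```python
import math

# Attribute weights P_h for the six record elements, taken from Section 4
# (0.05, 0.3, 0.3, 0.15, 0.15, 0.05); their use here is an assumption.
P = [0.05, 0.3, 0.3, 0.15, 0.15, 0.05]

def weighted_distance(x, y):
    """Stand-in for the weighted distance of (2)."""
    return math.sqrt(sum(p * (a - b) ** 2 for p, a, b in zip(P, x, y)))

def reduce_merge(intermediates, radius):
    """Greedily merge Combine intermediates (N_sk, C_sk) whose centers lie
    within `radius`, and return the global cluster centers C_sk(g)."""
    merged = []  # list of [total_count, total_sum_vector]
    for n, c_sum in intermediates:
        center = [v / n for v in c_sum]
        for group in merged:
            g_center = [v / group[0] for v in group[1]]
            if weighted_distance(center, g_center) <= radius:
                group[0] += n
                group[1] = [a + b for a, b in zip(group[1], c_sum)]
                break
        else:
            merged.append([n, list(c_sum)])
    # Global center = sum of the C_sk over the class / sum of the counts N_sk.
    return [[v / n for v in c_sum] for n, c_sum in merged]
```

For example, two intermediates whose centers are 0.1 apart merge under radius 0.5, while a distant one stays its own class.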

(1) p ← 0; d ← 0; d_0 ← ∞; Num ← 0; s ← 0; τ_kl(0) ← 1; p_0 is given
(2) while (g ≤ g_max) do
(3)   Conduct Map functions in parallel; output the clustering results (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩)
(4)   Perform Combine functions in parallel; output the intermediate results (s_k, ⟨N_sk, C_sk⟩) and d_ij for each data chunk B_ij
(5)   Carry out Reduce functions in parallel to develop the global cluster centers C_sk(g)
(6)   g ← g + 1
(7)   if (C_sk(g) does not converge) then
(8)     Output the minimal d_ij(g) as d(g) to each source node
(9) Output the global classes
(10) return

Algorithm 4: MCR-ACA.

3.2.4. MCR-ACA Method Description. MCR-ACA is constructed by integrating MCR with the Map function conducted on the data chunks of the different source nodes, the Combine function on the local clustering results, and the Reduce function on the global cluster centers. Assume the maximum number of iterations is g_max. The MCR-ACA method is described in Algorithm 4.
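The convergence test that drives the loop of Algorithm 4, |C_sk(g) − C_sk(g−1)| ≤ |C_sk(g−1) − C_sk(g−2)|, needs the global centers of the last three iterations. A minimal sketch of that bookkeeping, with a plain Euclidean norm standing in for the paper's weighted distance (2):

```python
import math

def dist(a, b):
    # Plain Euclidean distance as a stand-in for the weighted distance of (2).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def has_converged(history):
    """history: global cluster centers of consecutive iterations, oldest
    first. Convergence criterion of Section 3.2.3: the movement of the
    center stops growing, |C(g) - C(g-1)| <= |C(g-1) - C(g-2)|."""
    if len(history) < 3:
        return False  # need at least three iterations to compare
    c2, c1, c0 = history[-3], history[-2], history[-1]
    return dist(c0, c1) <= dist(c1, c2)
```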

4. Experimental Results on Practical Big Data

In the experiments, the MCR-ACA method is compared with the existing MR-ACA method on the traffic monitoring system of Jiangsu province in China. Two cities, Nantong and Changzhou, are selected, and two subsystems are chosen from each of them; Nanjing is the central city. Within each city the subsystems are linked by 1000 Mbps fiber. The distance between Nantong and the central city is 270 kilometers, and Changzhou is 140 kilometers from Nanjing. The cities are connected by the Internet with a bandwidth of 200 Mbps. We adopt Hadoop, Mahout, and IK as software tools. Two PCs are used in each of the two cities, while four PCs work in the central node. All of them are configured with an Intel 5620 CPU (2.4 GHz, 6 cores), 4 GB memory, and a 300 GB disk. In the MCR-ACA experiment, the Map and Combine operations are conducted in parallel on the four PCs in the two cities, and the Reduce function is conducted in the central node. In the MR-ACA experiment, all data is transmitted to the central node and processed by the four PCs there, where the Map, Combine, and Reduce functions are performed.

The records on vehicle trajectories are represented by a set of 6 elements (HPHM, JGSJ, JGDD, XSFX, XSSD, and CSYS) with the weights P_h being 0.05, 0.3, 0.3, 0.15, 0.15, and 0.05, respectively. Since both MCR-ACA and MR-ACA are based on MapReduce, the experiments focus on the scale growth of vehicle trajectories. According to the experiments, the neighborhood radius R and the threshold probability p_0 of the two methods are the key parameters of the Map functions. The parameters are mainly determined by the combination of requirements such as accuracy and efficiency. Meanwhile, the two parameters affect each other and thus are usually given in pairs. As a result, the comparison experiments on the parameters are conducted in advance based on the MCR-ACA method, since the Map functions of both methods are roughly the same. In the experiments there are 56 Map functions with 3 ants for each Map function, and 4 Combine functions and 4 Reduce functions, whose maximal iteration is 5. The experiment data consists of 4 × 10^7 vehicle trajectory records. The experiment results are shown in Tables 2 and 3.

It can be seen from Table 2 that as R increases, the total time of the clustering computation increases while the clustering accuracy is robust. Table 3 illustrates that as p_0 increases, the total time of the clustering computation tends to decrease and then increase, while the clustering accuracy fluctuates. For further comparison of the effect of the different parameter pairs of R and p_0 on clustering time and accuracy, the comparison is based on the accuracy per unit time (i.e., accuracy/total time). R = 0.006 and p_0 = 0.441 are the ideal parameter pair according to Tables 2 and 3.

Table 2: The clustering time (h) and accuracy (%) of MCR-ACA with different neighborhood radius R.

R       p_0     Total time   Accuracy   Accuracy/total time
0.004   0.441   52.60        86.45      1.64
0.005   0.441   53.58        87.56      1.63
0.006   0.441   53.88        88.72      1.65
0.007   0.441   54.56        88.63      1.62
0.008   0.441   54.79        87.25      1.59

Table 3: The clustering time (h) and accuracy (%) of MCR-ACA with different threshold probability p_0.

R       p_0     Total time   Accuracy   Accuracy/total time
0.006   0.431   54.53        87.49      1.60
0.006   0.436   54.01        87.92      1.63
0.006   0.441   53.88        88.72      1.65
0.006   0.446   54.21        88.73      1.64
0.006   0.451   55.23        88.52      1.60

Therefore, the experimental parameters are set as below:

56 Map functions, 4 Combine functions, 4 Reduce functions, the maximal iteration being 5, 3 ants for each Map function, the neighbourhood radius R taking 0.006, p_0 = 0.441, and the similarity factor set by the range of the dimensions. In the experiments, the records on vehicle trajectories are divided into 56 chunks on average, with one Map function working on one data chunk. The communication overhead of data extraction is the time cost of extracting the records from Nantong and Changzhou to the central node. The results are given in Table 4. The units for the Map, Combine, Reduce, and total times are hours, and the unit for accuracy is percentage.
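The parameter choice described above, maximizing accuracy per unit time over the candidate (R, p_0) pairs, can be reproduced directly from the values tabulated in Tables 2 and 3:

```python
# (R, p0, total_time_h, accuracy_pct) rows from Tables 2 and 3
# (the shared row R = 0.006, p0 = 0.441 is listed once).
rows = [
    (0.004, 0.441, 52.60, 86.45), (0.005, 0.441, 53.58, 87.56),
    (0.006, 0.441, 53.88, 88.72), (0.007, 0.441, 54.56, 88.63),
    (0.008, 0.441, 54.79, 87.25), (0.006, 0.431, 54.53, 87.49),
    (0.006, 0.436, 54.01, 87.92), (0.006, 0.446, 54.21, 88.73),
    (0.006, 0.451, 55.23, 88.52),
]

# Selection criterion of Section 4: accuracy per unit time.
best = max(rows, key=lambda r: r[3] / r[2])
print(best[:2])  # -> (0.006, 0.441), the pair used in the experiments
```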

Table 4 implies that the accuracy of the two methods is similar and rises as the data amount becomes larger. The data amount is key to the clustering accuracy; reducing the data scale by sampling also reduces the accuracy. The increasing rate of the clustering accuracy for MCR-ACA is greater than that for MR-ACA, as depicted in Figure 3.

Furthermore, the computing time of the Map function of MR-ACA is longer than that of MCR-ACA, and the difference becomes bigger as the data amount increases. The reason lies in that all records are mixed up on the hard disk after extraction, which makes the Map function in the central node more complicated than that in the data source nodes. The Map function of MR-ACA works on data chunks that are divided from the mixed data stored on the central node, and these chunks become more heterogeneous in their elements as the data amount grows. Therefore, the Map function of MR-ACA costs more computing time. The comparison is indicated in Figure 4.

For the Reduce function, the time consumed in MR-ACA is about twice as much as that in MCR-ACA, but less than the sum of the computing time of the Reduce and Combine functions in MCR-ACA. Actually, the processing time of the Combine function in MCR-ACA is included in the Reduce function of MR-ACA.

Figure 3: Comparison on group mining accuracy (accuracy (%) against the record amount (×10^7) for MCR-ACA and MR-ACA).

Figure 4: Comparison on Map function time consumed (time consumed (h) against the record amount (×10^7) for MCR-ACA and MR-ACA).

As shown in Table 4, the total time of MR-ACA is on average 50% longer than that of MCR-ACA due to the data extraction time, which indicates that data extraction is the most influential factor for big data clustering.
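The "about 50% longer on average" figure can be checked against the total times of Table 4 (a quick sanity computation, not part of the paper):

```python
# Total times (hours) from Table 4 for the five data amounts (2..10 x 10^7).
mcr_total = [23.23, 53.88, 80.78, 115.81, 155.41]
mr_total = [36.37, 78.79, 134.41, 184.18, 237.51]

excess = [mr / mcr - 1 for mcr, mr in zip(mcr_total, mr_total)]
avg_excess = sum(excess) / len(excess)
print(f"MR-ACA is on average {avg_excess:.0%} longer")  # roughly 56%
```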

Table 4 also demonstrates that as the data amount increases, the computation time of both methods increases rapidly, while the accuracy improvement is quite limited. The reason is that, as the numbers of Map, Combine, and Reduce functions stay the same, the amount of data in the blocks processed by each Map, Combine, and Reduce function grows proportionally, which finally causes the computation time to grow rapidly. The clustering accuracy is a relative value mainly determined by the data amount (the ratio of the clustered records of one car to the total records in which this car is involved). As the data amount grows, the clustering accuracy increases gradually. However, there is no direct relationship between the clustering accuracy and the computation time, which leads to the inconsistency between the increase of the computation time and that of the accuracy. According to the experiment, the number of obtained classes is listed in Table 5.

Table 4: Comparison experiments of MCR-ACA and MR-ACA.

                MCR-ACA                                          MR-ACA
Amount (×10^7)  Map     Combine  Reduce  Total time  Accuracy    Extract  Map     Combine  Reduce  Total time  Accuracy
2               20.36   1.51     1.36    23.23       87.89       11.74    21.92   0        2.71    36.37       87.92
4               45.92   5.44     2.52    53.88       88.72       25.31    47.35   0        6.13    78.79       89.04
6               67.89   7.80     5.10    80.78       89.43       39.76    83.17   0        11.48   134.41      89.86
8               96.57   11.83    7.40    115.81      90.55       55.68    112.22  0        16.28   184.18      90.43
10              129.77  15.21    10.42   155.41      91.61       68.41    145.52  0        23.57   237.51      91.56

Table 5: The number of classes.

Amount (×10^7)   Number of classes
2                6
4                9
6                11
8                18
10               23

Table 6: Trajectories of the cars in a group.

HPHM     JGSJ             JGDD                                                                               XSFX  XSSD  CSYS
S032W0   20131227200308   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road              3     62    C
S470A5   20131227200636   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road              3     43    J
S560V8   20131227200638   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road              3     47    A
S032V0   20131227202633   Northwest corner of the crossroad of Zhujiang road and north Taiping road          2     41    C
S560V8   20131227202957   Southeast corner of the crossroad of Hanzhongmen street and middle Jiangdong road  6     29    A
SFM979   20131227203021   Checkpoint at the crossroad of Suyuan road and west Qingshuiting road              3     39    A
S470A5   20131227203103   Southeast corner of the crossroad of Yangzijiang road and Qingliangmen street      4     40    A

Figure 5: Trajectories of the cars in a group (a sketch of four numbered locations linked by the routes of cars S032V0, S470A5, S560V8, and SFM979).

The results show that 4 cars in a research group exhibit obviously similar trajectories, which are listed in Table 6.

Table 6 indicates that the 4 cars were caught by the same camera within half an hour. The plate number "S032V0" was misidentified as "S032W0" by the camera capture. Through clustering, the trajectories of these 4 cars yield plenty of traces with the same or similar features, which construct one cluster. The time feature is every Friday evening, and the location feature overlaps along the way to the university, as shown in Figure 5.

5. Conclusion and Future Work

Critical issues for group mining on big data of vehicle trajectories are centralization and source distribution. In this paper, a distributed parallel clustering method, MCR-ACA, is proposed for group mining on distributed vehicle trajectories. Parallel clustering is realized while the communication overhead of big data is avoided. The method is tested on the traffic monitoring systems of three cities (including the central city Nanjing) of Jiangsu province in China. Experimental results demonstrate that the proposed method achieves better performance on group mining.

Group mining can be used in many scenarios. According to the experiment results in this paper, two aspects are promising for further work: (i) the forecast of group behavior based on specific features; for example, if the time feature of a group is midnight and the location feature is somewhere with a high crime incidence, the group can be regarded as a possible crime group with high probability; (ii) outlier analysis for vehicle trajectories; some vehicle trajectory outliers are formed in the clustering process (e.g., the abandoned vehicle trajectories defined in this paper), and the reason that these trajectories are abandoned as outliers is useful for behavior forecasting.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant 61272377) and the Key Technology R&D Program of Jiangsu Province (BE2014733).

References

[1] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
[2] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the crowd: the privacy bounds of human mobility," Scientific Reports, vol. 3, article 1376, 2013.
[3] S. Wang, H. Wang, X. Qin, et al., "Architecting big data: challenges, studies and forecasts," Chinese Journal of Computers, vol. 34, no. 10, pp. 1741–1752, 2011.
[4] M. Armbrust, A. Fox, R. Griffith, et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.
[5] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[7] C. Jayalath, J. Stephen, and P. Eugster, "From the cloud to the atmosphere: running MapReduce across data centers," IEEE Transactions on Computers, vol. 63, no. 1, pp. 74–87, 2014.
[8] B. Chandramouli, J. Goldstein, and S. Duan, "Temporal analytics on big data for web advertising," in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 90–101, April 2012.
[9] A. Mukherjee, J. Datta, R. Jorapur, et al., "Shared disk big data analytics with Apache Hadoop," in Proceedings of the 19th International Conference on High Performance Computing (HiPC '12), pp. 1–6, IEEE, 2012.
[10] S. Fiore, A. D'Anca, C. Palazzo, et al., "Ophidia: toward big data analytics for eScience," Procedia Computer Science, vol. 18, pp. 2376–2385, 2013.
[11] Y. Kim, K. Shim, M.-S. Kim, and J. Sup Lee, "DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 15–35, 2014.
[12] N. Laptev, K. Zeng, and C. Zaniolo, "Very fast estimation for result and accuracy of big data analytics: the EARL system," in Proceedings of the 29th International Conference on Data Engineering (ICDE '13), pp. 1296–1299, Brisbane, Australia, April 2013.
[13] D. Garg, K. Trivedi, and B. Panchal, "A comparative study of clustering algorithms using MapReduce in Hadoop," International Journal of Engineering, vol. 2, no. 10, 2013.
[14] W. Zhao, V. Martha, and X. Xu, "PSCAN: a parallel structural clustering algorithm for big networks in MapReduce," in Proceedings of the 27th IEEE International Conference on Advanced Information Networking and Applications (AINA '13), pp. 862–869, Barcelona, Spain, March 2013.
[15] Y. Yu, P. K. Gunda, and M. Isard, "Distributed aggregation for data-parallel computing: interfaces and implementations," in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP '09), pp. 247–260, October 2009.
[16] X. Cheng and N. Xiao, "Parallel implementation of dynamic positive and negative feedback ACO with iterative MapReduce model," Journal of Information and Computational Science, vol. 10, no. 8, pp. 2359–2370, 2013.
[17] Y. Yang, X. Ni, H. Wang, et al., "Parallel implementation of ant-based clustering algorithm based on Hadoop," in Proceedings of the 3rd International Conference on Swarm Intelligence (ICSI '12), pp. 190–197, 2012.
[18] E. Bonabeau, M. Dorigo, and G. Theraulaz, "Inspiration for optimization from social insect behaviour," Nature, vol. 406, no. 6791, pp. 39–42, 2000.
[19] M. Dorigo, E. Bonabeau, and G. Theraulaz, "Ant algorithms and stigmergy," Future Generation Computer Systems, vol. 16, no. 8, pp. 851–871, 2000.




Input: Data chunk B_ij
(1) p ← 0; d ← 0; Num ← 0; s ← 0; τ_kl(0) ← 1; p_0 is given
(2) while (Num ≤ |B_ij|) do
(3)   Calculate the weighted distance d_kl between R_k and all records in N(R_k, R) by (2)
(4)   Calculate the comprehensive similarity f(R_k) between R_k and all records in N(R_k, R) by (1)
(5)   Read the pheromone value τ_kl(g − 1); calculate the clustering probability p_kl(t) by (3)
(6)   if p_kl(t) ≥ p_0 then
(7)     Cluster R_k into N(R_k, R); save p_k, d_k, s_k; Num ← Num + 1
(8)     Go to Step 17
(9)   if d ≥ d(g − 1) then
(10)    Abandon R_k; save p_k, d_k, s_k; Num ← Num + 1
(11)    Go to Step 17
(12)  Select the node with the largest p_kl(t) into s
(13)  if all records are examined by the ant then
(14)    Abandon R_k; save p_k, d_k, s_k
(15)    Go to Step 17
(16)  d ← d + d_kl; go to Step 2
(17)  Randomly assign the ant a new record which has not been clustered or abandoned
(18) Output the clustering result (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩)
(19) return

Algorithm 1: Map function of MCR-ACA.

Input: Probability threshold p_0
(1) if p_k < p_0 then
(2)   d_ij ← min d_k
(3)   Output d_ij and go to Step 6
(4) Combine the records with the same s_k and generate N_sk
(5) Calculate C_sk; output (s_k, ⟨N_sk, C_sk⟩)
(6) Update the pheromone τ_kl(g)
(7) return

Algorithm 2: Combine function of MCR-ACA.

g-th iteration, which is the comparison parameter for the next iteration. The Map function on B_ij is described in Algorithm 1.

3.2.2. Combine Function of MCR-ACA. The results (⟨key, R_k⟩, ⟨p_k, d_k, s_k⟩) obtained by the Map function are combined into intermediate results (s_k, ⟨N_sk, C_sk⟩) by the Combine function at the local nodes. The minimum d_ij along the path of this iteration is found. The pheromone is also updated by the Combine function; it is increased when ants pass by and evaporates over time. The pheromone along the path from R_k to R_l after the g-th Map function is updated by

τ_kl(g) = (1 − ρ) · τ_kl(g − 1) + Δe,   (4)

where ρ ∈ (0, 1] denotes the evaporation rate of the pheromone and Δe is the pheromone left by a passing ant; Δe is set to 1 if an ant passes by and to 0 otherwise.
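Equation (4) is a standard ACO evaporation-plus-deposit update; a one-line sketch (the value of ρ below is illustrative, not from the paper):

```python
def update_pheromone(tau_prev, rho, ant_passed):
    """Pheromone update of (4): tau_kl(g) = (1 - rho) * tau_kl(g - 1) + delta_e,
    where delta_e = 1 if an ant passed along (k, l) in this iteration, else 0."""
    delta_e = 1.0 if ant_passed else 0.0
    return (1.0 - rho) * tau_prev + delta_e

# Starting from tau_kl(0) = 1, the initialization used in Algorithm 1:
print(update_pheromone(1.0, 0.2, True))   # deposit outweighs evaporation: 1.8
print(update_pheromone(1.0, 0.2, False))  # trail evaporates to 0.8
```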

For the records with clustering probability less than the given threshold p_0, the minimal d_k is set as the minimum path length d_ij in the data chunk B_ij. All the records with clustering probability not less than p_0 are combined according to the clustered nodes s_k: records with the same s_k are merged into the same class s_k, whose size is denoted as N_sk. For the records R_(sk,k) in class s_k, the sum of the attribute vectors is

C_sk = Σ_{k=1}^{N_sk} R_(sk,k) = (Σ_{k=1}^{N_sk} r^1_(sk,k), Σ_{k=1}^{N_sk} r^2_(sk,k), …, Σ_{k=1}^{N_sk} r^m_(sk,k)).

Multiple Combine functions can be conducted in parallel for one data source S_i, each of which works on one or several data chunks. The Combine function for B_ij is described in Algorithm 2.
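The per-chunk aggregation just described — grouping Map outputs by their clustered node s_k, counting N_sk, and summing the attribute vectors into C_sk — can be sketched as follows. This is a simplified stand-in for Algorithm 2 that ignores the pheromone update and the low-probability branch:

```python
from collections import defaultdict

def combine(map_outputs):
    """map_outputs: iterable of (s_k, record_vector) pairs emitted by the Map
    function for records whose clustering probability reached p_0.
    Returns {s_k: (N_sk, C_sk)}, where N_sk is the class size and C_sk is the
    element-wise sum of the attribute vectors of the class."""
    classes = defaultdict(lambda: [0, None])
    for s_k, record in map_outputs:
        entry = classes[s_k]
        entry[0] += 1  # N_sk
        entry[1] = list(record) if entry[1] is None else [
            a + b for a, b in zip(entry[1], record)  # accumulate C_sk
        ]
    return {s_k: (n, c) for s_k, (n, c) in classes.items()}
```

For example, `combine([("a", [1, 2]), ("a", [3, 4]), ("b", [5, 5])])` groups the three records into two classes with their counts and vector sums.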


5 Conclusion and Future Work

Critical issues for group mining on big data of vehicletrajectories are centralization and source distribution In thispaper a distributed parallel clustering method MCR-ACAis proposed for group mining on distributed vehicle trajec-tories Parallel clustering is realized while communicationoverhead of big data is avoided The method is tested ontrafficmonitoring systems of three cities (including the centercity Nanjing) of Jiangsu province in China Experimentalresults demonstrate that the proposedmethod achieves betterperformance on group mining

Group mining can be used in many scenarios Accordingto the experiment results in this paper two aspects are

International Journal of Distributed Sensor Networks 9

promising for further work (i) the forecast of group behaviorbased on specific features for example if the time feature ofa group is in midnight and the location feature is somewherewith high crime incidence the group can be regarded as apossible crime groupwith high possibility (ii) outlier analysisfor vehicle trajectory Some vehicle trajectory outliers areformed in the clustering process (eg the abandoned vehicletrajectories defined in the paper) the reason that these vehicletrajectories are abandoned as outliers is useful for behaviorforecast

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Thiswork is supported by theNational Natural Science Foun-dation of China (Grant 61272377) and the Key TechnologyRampD Program of Jiangsu Province (BE2014733)

References

[1] C Song Z Qu N Blumm and A-L Barabasi ldquoLimits ofpredictability in humanmobilityrdquo Science vol 327 no 5968 pp1018ndash1021 2010

[2] Y-A de Montjoye C A Hidalgo M Verleysen and V DBlondel ldquoUnique in the crowd the privacy bounds of humanmobilityrdquo Scientific Reports vol 3 article 1376 2013

[3] S Wang H Wang X Qin et al ldquoArc hitecting big datachallenges studies and forecastsrdquoChinese Journal of Computersvol 34 no 10 pp 1741ndash1752 2011

[4] M Armbrust A Fox R Griffith et al ldquoA view of cloudcomputingrdquo Communications of the ACM vol 53 no 4 pp 50ndash58 2010

[5] J Dean and S Ghemawat ldquoMapReduce a flexible data process-ing toolrdquo Communications of the ACM vol 53 no 1 pp 72ndash772010

[6] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[7] C Jayalath J Stephen and P Eugster ldquoFrom the cloud tothe atmosphere running mapreduce across data centersrdquo IEEETransactions on Computers vol 63 no 1 pp 74ndash87 2014

[8] B Chandramouli J Goldstein and S Duan ldquoTemporal analyt-ics on big data for web advertisingrdquo in Proceedings of the IEEE28th International Conference on Data Engineering (ICDE rsquo12)pp 90ndash101 April 2012

[9] A Mukherjee J Datta R Jorapur et al ldquoShared disk bigdata analytics with apache hadooprdquo in Proceedings of the19th International Conference on High Performance Computing(HiPC rsquo12) pp 1ndash6 IEEE 2012

[10] S Fiore A D Anca C Palazzo et al ldquoOphidia toward big dataanalytics for esciencerdquo Procedia Computer Science vol 1 pp2376ndash2385 2013

[11] Y Kim K Shim M-S Kim and J Sup Lee ldquoDBCURE-MR anefficient density-based clustering algorithm for large data usingMapReducerdquo Information Systems vol 42 pp 15ndash35 2014

[12] N Laptev K Zeng and C Zaniolo ldquoVery fast estimation forresult and accuracy of big data analytics the EARL systemrdquoin Proceedings of the 29th International Conference on DataEngineering (CDE rsquo13) pp 1296ndash1299 Brisbane Australia April2013

[13] D Garg K Trivedi and B Panchal ldquoA comparative study ofclustering algorithms using mapreduce in hadooprdquo Interna-tional Journal of Engineering vol 2 no 10 2013

[14] W Zhao V Martha and X Xu ldquoPSCAN a parallel Structuralclustering algorithm for big networks in MapReducerdquo in Pro-ceedings of the 27th IEEE International Conference on AdvancedInformation Networking and Applications (AINA rsquo13) pp 862ndash869 Barcelona Spain March 2013

[15] Y Yu P K Gunda and M Isard ldquoDistributed aggregation fordata-parallel computing interfaces and implementationsrdquo inProceedings of the 22nd ACM SIGOPS Symposium on OperatingSystems Principles (SOSP rsquo09) pp 247ndash260 October 2009

[16] X Cheng and N Xiao ldquoParallel implementa tion of dynamicpositive and negative feedback aco with iterat ive mapreducemodelrdquo Journal of Information and Computational Science vol10 no 8 pp 2359ndash2370 2013

[17] Y Yang X Ni H Wang et al ldquoParallel implementation ofantbased clustering algorithm based on hadooprdquo in Proceedingsof the 3rd International Conference on Swarm Intelligence (ICSIrsquo12) pp 190ndash197 2012

[18] E Bonabeau M Dorigo and G Theraulaz ldquoInspiration foroptimization from social insect behaviourrdquoNature vol 406 no6791 pp 39ndash42 2000

[19] M Dorigo E Bonabeau and GTheraulaz ldquoAnt algorithms andstigmergyrdquo Future Generation Computer Systems vol 16 no 8pp 851ndash871 2000


International Journal of Distributed Sensor Networks

(1) Calculate the weighted distance d(s_{k,g}, s_{k',g}) between different cluster centers s_{k,g} and s_{k',g}, respectively located in different data chunks, by (2)
(2) if d(s_{k,g}, s_{k',g}) <= R then
(3)   Combine s_{k,g} and s_{k',g} into the same class N(s_{k,g}, R)
(4)   Calculate the global cluster center C_{s_{k,g}} = (sum over N(s_{k,g}, R) of C_{s_k}) / (sum over N(s_{k,g}, R) of N_{s_k}) of the g-th iteration
(5)   if C_{s_{k,g}} converges then
(6)     Output the global clustering result
(7)     Go to Step (9)
(8) Output the minimal d_{ij,g} as d_g
(9) return

Algorithm 3: Reduce function of MCR-ACA.
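The center-merging step of Algorithm 3 can be sketched in Python as below. This is an illustrative reconstruction, not the authors' implementation: cluster centers are assumed to be plain numeric feature vectors, and `reduce_centers` and `weighted_distance` are hypothetical helper names.

```python
import math

def weighted_distance(a, b, weights):
    # Weighted Euclidean distance between two cluster centers, cf. (2) in the paper.
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def reduce_centers(local_centers, R, weights):
    """Merge local cluster centers whose weighted distance to a class seed is <= R.

    local_centers: list of (center_vector, record_count) pairs, i.e. the
    Combine outputs from the source nodes. Returns global centers as
    (merged_center, total_count), the count-weighted mean of each class.
    """
    classes = []  # each class: list of (center, count); classes[i][0] is the seed
    for center, count in local_centers:
        for cls in classes:
            if weighted_distance(center, cls[0][0], weights) <= R:
                cls.append((center, count))
                break
        else:
            classes.append([(center, count)])
    global_centers = []
    for cls in classes:
        total = sum(n for _, n in cls)
        dim = len(cls[0][0])
        merged = [sum(c[i] * n for c, n in cls) / total for i in range(dim)]
        global_centers.append((merged, total))
    return global_centers
```

Greedy seeding (comparing each center only to the first member of a class) is a simplification; the paper's convergence test over iterations is handled by the outer loop of Algorithm 4.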

(1) p <- 0; d <- 0; d_0 <- infinity; Num <- 0; s <- 0; tau_{kl}(0) <- 1; p_0 is given
(2) while (g <= g_max) do
(3)   Conduct the Map functions in parallel; output the clustering results (<key, R_k>, <p_k, d_k, s_k>)
(4)   Perform the Combine functions in parallel; output the intermediate results (s_k, <N_{s_k}, C_{s_k}>) and d_{ij} for each data chunk B_{ij}
(5)   Carry out the Reduce functions in parallel to develop the global cluster centers C_{s_{k,g}}
(6)   g <- g + 1
(7)   if (C_{s_{k,g}} does not converge) then
(8)     Output d_g, the minimal d_{ij,g}, to each source node
(9) Output the global classes
(10) return

Algorithm 4: MCR-ACA.
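The iteration of Algorithm 4 can be sketched as a driver loop. The `map_fn`, `combine_fn`, `reduce_fn`, and `converged` callables below are placeholders standing in for the paper's Map, Combine, Reduce, and convergence steps; the ant-colony pheromone update inside the Map step is omitted, so this is only a structural sketch.

```python
def converged(a, b, eps):
    # Placeholder convergence test: every global center moved by less than eps.
    return len(a) == len(b) and all(
        abs(x - y) <= eps for ca, cb in zip(a, b) for x, y in zip(ca, cb))

def mcr_aca(chunks, map_fn, combine_fn, reduce_fn, g_max, eps=1e-6):
    """Driver-loop sketch of MCR-ACA (Algorithm 4).

    chunks: the data chunks held at the source nodes. map_fn clusters one
    chunk (given the previous global centers), combine_fn merges one node's
    local results into a small summary, and reduce_fn builds the global
    centers from those summaries; only the small summaries cross the WAN.
    """
    centers, prev = None, None
    for g in range(1, g_max + 1):
        local = [map_fn(chunk, prev) for chunk in chunks]  # parallel at source nodes
        summaries = [combine_fn(r) for r in local]         # still local, small outputs
        centers = reduce_fn(summaries)                     # central node
        if prev is not None and converged(prev, centers, eps):
            break
        prev = centers
    return centers
```

The key point the loop illustrates is that only `summaries` (the Combine outputs) are transferred to the central node, never the raw trajectory records.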

3.2.4. MCR-ACA Method Description. MCR-ACA is constructed by integrating MCR with the Map function conducted on the various data chunks in the different source nodes, the Combine function on the local clustering results, and the Reduce function on the global cluster centers. Assume the maximum iteration is g_max. The MCR-ACA method is described in Algorithm 4.

4. Experimental Results on Practical Big Data

In the experiment, the MCR-ACA method is compared with the existing MR-ACA method on the traffic monitoring system of Jiangsu province in China. The two cities Nantong and Changzhou are selected, and two subsystems are chosen from each of them; Nanjing is the central city. Subsystems are linked by fiber at 1000 Mbps within each city. The distance between Nantong and the central city is 270 kilometers, and there are 140 kilometers between Changzhou and the center Nanjing. The cities are connected by the Internet with a network bandwidth of 200 Mbps. We adopt Hadoop, Mahout, and IK as software tools. Two PCs are used in each of the two cities, while four PCs work in the central node. All of them are configured with an Intel 5620 CPU (2.4 GHz, 6-core), 4 GB memory, and a 300 GB disk. In the MCR-ACA experiment, the Map and Combine operations are conducted in parallel on the four PCs in the two cities, and the Reduce function is conducted in the central node. In the MR-ACA experiment, all data is transmitted to the central node and processed by the four PCs there, where the Map, Combine, and Reduce functions are performed.

The records on vehicle trajectories are represented by a set of 6 elements (HPHM, JGSJ, JGDD, XSFX, XSSD, and CSYS) with the weights P_h being 0.05, 0.3, 0.3, 0.15, 0.15, and 0.05, respectively. Since both MCR-ACA and MR-ACA are based on MapReduce, the experiments focus on the scale growth of vehicle trajectories. According to the experiments, the neighborhood radius R and the threshold probability p_0 of the two methods are found to be the key parameters of the Map functions. The parameters are mainly determined by the combination of requirements such as accuracy and efficiency. Meanwhile, the two parameters affect each other and thus are usually given in pairs. As a result, the comparison experiments on the parameters are conducted in advance based on the MCR-ACA method, since the Map functions in both methods are roughly the same. In the experiments there are 56 Map functions and 3 ants for each Map function, with 4 Combine functions and 4 Reduce functions, whose maximal iteration is 5. The experiment data consists of 4 x 10^7 vehicle trajectory records. The experiment results are shown in Tables 2 and 3.
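Only the weights P_h are given in this section; the per-dimension measures below (exact-match indicators for the categorical fields, a capped time gap for JGSJ, a speed difference for XSSD) are assumptions added for illustration, so this sketch shows the shape of a weighted record distance rather than the paper's exact formula.

```python
# Weights P_h for (HPHM, JGSJ, JGDD, XSFX, XSSD, CSYS), as given in the paper.
WEIGHTS = (0.05, 0.30, 0.30, 0.15, 0.15, 0.05)

def record_distance(r1, r2, sim_fns, weights=WEIGHTS):
    """Weighted distance between two 6-element trajectory records.

    sim_fns maps each dimension to a per-dimension distance in [0, 1];
    the concrete measures below are assumptions, not the paper's.
    """
    return sum(w * f(a, b) for w, f, a, b in zip(weights, sim_fns, r1, r2))

def eq(a, b):
    # Categorical fields: 0 if equal, 1 otherwise.
    return 0.0 if a == b else 1.0

def time_gap(t1, t2, cap=3600.0):
    # Capture-time gap in seconds, capped at one hour and normalized to [0, 1].
    return min(abs(t1 - t2), cap) / cap

# Assumed per-dimension measures for (HPHM, JGSJ, JGDD, XSFX, XSSD, CSYS).
SIM_FNS = (eq, time_gap, eq, eq, lambda a, b: min(abs(a - b), 30) / 30.0, eq)
```

For example, two records identical except for a 10-minute time gap differ only in the JGSJ term, giving a distance of 0.3 * (600/3600) = 0.05.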

It can be seen from Table 2 that as R increases, the total time of the clustering computation increases while the accuracy of the clustering is robust. Table 3 illustrates that, as p_0 increases, the total time of the clustering computation tends to decrease and then increase, while the accuracy of the clustering fluctuates.

Table 2: The clustering time (h) / the accuracy (%) of MCR-ACA with different neighborhood radius R.

R     | p_0   | Total time | Accuracy | Accuracy / total time
0.004 | 0.441 | 52.60      | 86.45    | 1.64
0.005 | 0.441 | 53.58      | 87.56    | 1.63
0.006 | 0.441 | 53.88      | 88.72    | 1.65
0.007 | 0.441 | 54.56      | 88.63    | 1.62
0.008 | 0.441 | 54.79      | 87.25    | 1.59

Table 3: The clustering time (h) / the accuracy (%) of MCR-ACA with different threshold probability p_0.

R     | p_0   | Total time | Accuracy | Accuracy / total time
0.006 | 0.431 | 54.53      | 87.49    | 1.60
0.006 | 0.436 | 54.01      | 87.92    | 1.63
0.006 | 0.441 | 53.88      | 88.72    | 1.65
0.006 | 0.446 | 54.21      | 88.73    | 1.64
0.006 | 0.451 | 55.23      | 88.52    | 1.60

For further comparison of the effect of the different parameter pairs of R and p_0 on clustering time and accuracy, the comparison is based on the accuracy in unit time (i.e., accuracy / total time). R = 0.006 and p_0 = 0.441 are the ideal parameter pair according to Tables 2 and 3.

Therefore the experimental parameters are set as below: 56 Map functions, 4 Combine functions, 4 Reduce functions, the maximal iteration being 5, 3 ants for each Map function, the neighbourhood radius R taking 0.006, p_0 = 0.441, and the similarity factor set by the range of each dimension. In the experiments the records on vehicle trajectories are divided into 56 chunks on average, with one Map function working on one data chunk. The communication overhead of data extraction is the time cost of extracting the records from Nantong and Changzhou to the central node. The results are given in Table 4. The units for the Map, Combine, Reduce, and total times are hours, and the unit for accuracy is percentage.
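The parameter selection by accuracy per unit time can be reproduced directly from the rows of Tables 2 and 3 (the shared row R = 0.006, p_0 = 0.441 is listed once):

```python
# (R, p0, total_time_h, accuracy_pct) rows transcribed from Tables 2 and 3.
TRIALS = [
    (0.004, 0.441, 52.60, 86.45),
    (0.005, 0.441, 53.58, 87.56),
    (0.006, 0.441, 53.88, 88.72),
    (0.007, 0.441, 54.56, 88.63),
    (0.008, 0.441, 54.79, 87.25),
    (0.006, 0.431, 54.53, 87.49),
    (0.006, 0.436, 54.01, 87.92),
    (0.006, 0.446, 54.21, 88.73),
    (0.006, 0.451, 55.23, 88.52),
]

def best_pair(trials):
    # Rank parameter pairs by accuracy per unit time (accuracy / total time).
    return max(trials, key=lambda t: t[3] / t[2])
```

With these rows, the maximum accuracy per hour is attained at R = 0.006 and p_0 = 0.441, matching the pair chosen in the paper.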

Table 4 implies that the accuracies of the two methods are similar and rise as the data amount becomes larger. The data amount is key to the clustering accuracy: reducing the data scale by sampling also reduces the accuracy. The increasing rate of the clustering accuracy for MCR-ACA is greater than that of MR-ACA, as depicted in Figure 3.

Furthermore, the computing time of the Map function of MR-ACA is longer than that of MCR-ACA, and the difference becomes bigger as the data amount increases. The reason is that all records are mixed up on the hard disk after extraction, which makes the Map function more complicated in the central node than in the data source nodes. The Map function of MR-ACA works on data chunks that are divided from the mixed data stored on the central node, and these chunks become more heterogeneous as the data amount grows. Therefore the Map function of MR-ACA costs more computing time. The comparison is indicated in Figure 4.

Figure 3: Comparison on group mining accuracy (accuracy in % versus data amount, x10^7 records, for MCR-ACA and MR-ACA).

Figure 4: Comparison on Map function time consumed (time in hours versus data amount, x10^7 records, for MCR-ACA and MR-ACA).

For the Reduce function, the time consumed in MR-ACA is about twice as much as that in MCR-ACA, but less than the sum of the computing times of the Reduce function and the Combine function in MCR-ACA. Actually, the processing time of the Combine function in MCR-ACA is included in the Reduce function in MR-ACA.

As shown in Table 4, the total time of MR-ACA is on average about 50% longer than that of MCR-ACA due to the data extraction time, which indicates that data extraction is the most influential factor in big data clustering.
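The roughly-50% figure can be checked against the total-time columns of Table 4; this back-of-envelope script is not part of the paper:

```python
# (MCR-ACA total, MR-ACA total) in hours for each data amount, from Table 4.
TOTALS = [(23.23, 36.37), (53.88, 78.79), (80.78, 134.41),
          (115.81, 184.18), (155.41, 237.51)]

# Relative overhead of MR-ACA over MCR-ACA at each data amount.
overheads = [(mr - mcr) / mcr for mcr, mr in TOTALS]
mean_overhead = sum(overheads) / len(overheads)  # roughly 0.56, i.e. ~50-60%
```

The per-row overheads range from about 46% to 66%, so "about 50% longer on average" is a fair summary of the table.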

Table 4 also demonstrates that, as the data amount increases, the computation time of both methods increases rapidly while the accuracy improvement is quite limited. The reason is that, since the numbers of Map, Combine, and Reduce functions stay the same, the amount of data in the blocks handled by the Map, Combine, and Reduce functions grows proportionally, which finally causes the computation time to grow rapidly. The clustering accuracy is a relative value mainly determined by the data amount (the ratio of the clustered records of one car to the total records in which this car is involved). As the data amount grows, the clustering accuracy increases gradually. However, there is no direct relationship between the clustering accuracy and the computation time, which leads to the inconsistency between the growth of the computation time and the growth of the accuracy. According to the experiment, the numbers of obtained classes are listed in Table 5.

Table 4: Comparison experiments of MCR-ACA and MR-ACA (Extract, Map, Combine, Reduce, and total times in hours; accuracy in %).

Amount (x10^7) | MCR-ACA: Map | Combine | Reduce | Total time | Accuracy | MR-ACA: Extract | Map | Combine | Reduce | Total time | Accuracy
 2 |  20.36 |  1.51 |  1.36 |  23.23 | 87.89 | 11.74 |  21.92 | 0 |  2.71 |  36.37 | 87.92
 4 |  45.92 |  5.44 |  2.52 |  53.88 | 88.72 | 25.31 |  47.35 | 0 |  6.13 |  78.79 | 89.04
 6 |  67.89 |  7.80 |  5.10 |  80.78 | 89.43 | 39.76 |  83.17 | 0 | 11.48 | 134.41 | 89.86
 8 |  96.57 | 11.83 |  7.40 | 115.81 | 90.55 | 55.68 | 112.22 | 0 | 16.28 | 184.18 | 90.43
10 | 129.77 | 15.21 | 10.42 | 155.41 | 91.61 | 68.41 | 145.52 | 0 | 23.57 | 237.51 | 91.56

Table 5: The number of classes.

Amount (x10^7) | Number of classes
 2 |  6
 4 |  9
 6 | 11
 8 | 18
10 | 23

The results show that 4 cars in a research group exhibit obviously similar trajectories, as listed in Table 6.

Table 6: Trajectories of the cars in a group.

HPHM   | JGSJ                | JGDD                                                                              | XSFX | XSSD | CSYS
S032W0 | 2013-12-27 20:03:08 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road             | 3    | 62   | C
S470A5 | 2013-12-27 20:06:36 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road             | 3    | 43   | J
S560V8 | 2013-12-27 20:06:38 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road             | 3    | 47   | A
S032V0 | 2013-12-27 20:26:33 | Northwest corner of the crossroad of Zhujiang road and north Taiping road         | 2    | 41   | C
S560V8 | 2013-12-27 20:29:57 | Southeast corner of the crossroad of Hanzhongmen street and middle Jiangdong road | 6    | 29   | A
SFM979 | 2013-12-27 20:30:21 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road             | 3    | 39   | A
S470A5 | 2013-12-27 20:31:03 | Southeast corner of the crossroad of Yangzijiang road and Qingliangmen street     | 4    | 40   | A

Figure 5: Trajectories of the cars in a group (S032V0, S470A5, S560V8, and SFM979, with capture points 1-4 along the shared route).

Table 6 indicates that the 4 cars were caught by the same camera within half an hour. The plate number "S032V0" was once misidentified as "S032W0" by the camera capture. Through clustering, the trajectories of these 4 cars have plenty of traces with the same or similar features, which construct a cluster. The time feature is every Friday evening; the location feature overlaps along the way to the university, as shown in Figure 5.
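The misidentification of "S032V0" as "S032W0" shows why the grouping must tolerate single-character camera misreads. As a minimal illustration (not the paper's similarity measure), plates within Hamming distance 1 can be grouped greedily:

```python
def hamming(a, b):
    # Character-level Hamming distance; a length mismatch counts per extra character.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def group_plates(plates, max_dist=1):
    """Greedily group plate numbers differing from a group's seed by at most max_dist."""
    groups = []
    for p in plates:
        for g in groups:
            if hamming(p, g[0]) <= max_dist:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups
```

On the plates of Table 6, this puts "S032W0" and "S032V0" into one group while keeping the other three vehicles separate, mirroring how the clustering absorbs the misread capture.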

5. Conclusion and Future Work

Critical issues for group mining on big data of vehicle trajectories are centralization and source distribution. In this paper a distributed parallel clustering method, MCR-ACA, is proposed for group mining on distributed vehicle trajectories. Parallel clustering is realized while the communication overhead of moving big data is avoided. The method is tested on the traffic monitoring systems of three cities (including the central city Nanjing) of Jiangsu province in China. Experimental results demonstrate that the proposed method achieves better performance on group mining.

Group mining can be used in many scenarios. According to the experimental results in this paper, two aspects are promising for further work: (i) the forecast of group behavior based on specific features; for example, if the time feature of a group is midnight and the location feature is somewhere with a high crime incidence, the group can be regarded as a possible crime group with high probability; (ii) outlier analysis for vehicle trajectories. Some vehicle trajectory outliers are formed in the clustering process (e.g., the abandoned vehicle trajectories defined in the paper); the reason that these trajectories are abandoned as outliers is useful for behavior forecast.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant 61272377) and the Key Technology R&D Program of Jiangsu Province (BE2014733).

References

[1] C. Song, Z. Qu, N. Blumm, and A.-L. Barabasi, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018-1021, 2010.

[2] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the crowd: the privacy bounds of human mobility," Scientific Reports, vol. 3, article 1376, 2013.

[3] S. Wang, H. Wang, X. Qin et al., "Architecting big data: challenges, studies and forecasts," Chinese Journal of Computers, vol. 34, no. 10, pp. 1741-1752, 2011.

[4] M. Armbrust, A. Fox, R. Griffith et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50-58, 2010.

[5] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.

[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[7] C. Jayalath, J. Stephen, and P. Eugster, "From the cloud to the atmosphere: running MapReduce across data centers," IEEE Transactions on Computers, vol. 63, no. 1, pp. 74-87, 2014.

[8] B. Chandramouli, J. Goldstein, and S. Duan, "Temporal analytics on big data for web advertising," in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 90-101, April 2012.

[9] A. Mukherjee, J. Datta, R. Jorapur et al., "Shared disk big data analytics with Apache Hadoop," in Proceedings of the 19th International Conference on High Performance Computing (HiPC '12), pp. 1-6, IEEE, 2012.

[10] S. Fiore, A. D'Anca, C. Palazzo et al., "Ophidia: toward big data analytics for escience," Procedia Computer Science, vol. 1, pp. 2376-2385, 2013.

[11] Y. Kim, K. Shim, M.-S. Kim, and J. Sup Lee, "DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 15-35, 2014.

[12] N. Laptev, K. Zeng, and C. Zaniolo, "Very fast estimation for result and accuracy of big data analytics: the EARL system," in Proceedings of the 29th International Conference on Data Engineering (ICDE '13), pp. 1296-1299, Brisbane, Australia, April 2013.

[13] D. Garg, K. Trivedi, and B. Panchal, "A comparative study of clustering algorithms using MapReduce in Hadoop," International Journal of Engineering, vol. 2, no. 10, 2013.

[14] W. Zhao, V. Martha, and X. Xu, "PSCAN: a parallel structural clustering algorithm for big networks in MapReduce," in Proceedings of the 27th IEEE International Conference on Advanced Information Networking and Applications (AINA '13), pp. 862-869, Barcelona, Spain, March 2013.

[15] Y. Yu, P. K. Gunda, and M. Isard, "Distributed aggregation for data-parallel computing: interfaces and implementations," in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP '09), pp. 247-260, October 2009.

[16] X. Cheng and N. Xiao, "Parallel implementation of dynamic positive and negative feedback ACO with iterative MapReduce model," Journal of Information and Computational Science, vol. 10, no. 8, pp. 2359-2370, 2013.

[17] Y. Yang, X. Ni, H. Wang et al., "Parallel implementation of ant-based clustering algorithm based on Hadoop," in Proceedings of the 3rd International Conference on Swarm Intelligence (ICSI '12), pp. 190-197, 2012.

[18] E. Bonabeau, M. Dorigo, and G. Theraulaz, "Inspiration for optimization from social insect behaviour," Nature, vol. 406, no. 6791, pp. 39-42, 2000.

[19] M. Dorigo, E. Bonabeau, and G. Theraulaz, "Ant algorithms and stigmergy," Future Generation Computer Systems, vol. 16, no. 8, pp. 851-871, 2000.


International Journal of Distributed Sensor Networks 7

Table 2 The clustering time (119867)the accuracy () of MCR-ACAwith different neighborhood radiusR

R 1199010

Total time Accuracy Total timeaccuracy0004 0441 5260 8645 1640005 0441 5358 8756 1630006 0441 5388 8872 1650007 0441 5456 8863 1620008 0441 5479 8725 159

Table 3 The clustering time (119867)the accuracy () of MCR-ACAwith different threshold probability 119901

0

R 1199010

Total time Accuracy Total timeaccuracy0006 0431 5453 8749 1600006 0436 5401 8792 1630006 0441 5388 8872 1650006 0446 5421 8873 1640006 0451 5523 8852 160

ofR and 1199010on clustering time and accuracy the comparison

is based on the accuracy in unit time (ie accuracytotaltime) R = 0006 and 119901

0= 0441 are the ideal parameter

pair according to Tables 2 and 3Therefore the experimental parameters are set as below

56Map functions 4 Combine functions 4 Reduce functionsthe maximal iteration being 5 3 ants for each Map functionthe neighbourhood radius R taking 0006 119901

0= 0441 and

the similarity factor set by the range of dimensions In theexperiments the records on vehicle trajectories are dividedinto 56 chunks on average with one Map function workingon one data chunk The communication overhead of dataextraction is the time cost of extracting the records fromNantong and Changzhou to the central node The results aregiven in Table 4Themetrics for Map Combine Reduce andtotal time are hours and the unit for accuracy is percentage

Table 4 implies that the accuracy of the two methods issimilar and rising as the data amount becomes largerThedataamount is key to the clustering accuracy Reducing data scaleby sampling also reduces the accuracy The increasing rate ofclustering accuracy for MCR-ACA is greater than MR-ACAas depicted in Figure 3

Furthermore the computing time for Map function ofMR-ACA is longer than MCR-ACA and the differencebecomes bigger as data amount increases The reason lies inthat all records are mixed up in hard disk after extractionwhich makes the Map function more complicated in thecentral node than that in the data source nodes The Mapfunction of MR-ACA works on the data chunks that aredivided from the data mixed stored on the central nodeand these data chunks are more complex in elements as thedata amount growsTherefore theMap function ofMR-ACAcosts more computing time The comparison is indicated inFigure 4

For Reduce function the time consumed in MR-ACAis about twice as much as that in MCR-ACA but less thanthe sum of computing time in both the Reduce function and

86

87

88

89

90

91

92

2 4 6 8 10

Acc

urac

y (

)

MCR-ACAMR-ACA

Amount (times107)

Figure 3 Comparison on group mining accuracy

0

20

40

60

80

100

120

140

160

2 4 6 8 10

MCR-ACAMR-ACA

Tim

e con

sum

ed (h

)

Amount (times107)

Figure 4 Comparison on Map function time consumed

the Combine function inMCR-ACA Actually the processingtime in theCombine function inMCR-ACA is included in theReduce function in MR-ACA

As shown in Table 4 the total time in MR-ACA is 50longer than that inMCR-ACA due to the data extracting timeon average which indicates that the data extraction is themost essential influential factor for big data cluster

Table 4 also demonstrates that as the data amountincreases the computation time of both of the two methodsincreases rapidly while the accuracy improvement is quitelimited The reason is that as the number of Map functionsCombine functions and Reduce functions keeps the samethe amount of data in the data blocks among Map functionsCombine functions and Reduce functions grows propor-tionally which finally causes the computation time to growrapidly The clustering accuracy is a relative value mainlydetermined by data amount (the ratio of the clustered recordsof one car to the total records which this car is involved in)As the data amount grows the clustering accuracy increasesgradually However there is no relationship between theclustering accuracy and the computation time which leads

8 International Journal of Distributed Sensor Networks

Table 4 Comparison experiments of MCR-ACA and MR-ACA

Amount (times107) MCR-ACA MR-ACAMap Combine Reduce Total time Accuracy Extract Map Combine Reduce Total time Accuracy

2 2036 151 136 2323 8789 1174 2192 0 271 3637 87924 4592 544 252 5388 8872 2531 4735 0 613 7879 89046 6789 780 510 8078 8943 3976 8317 0 1148 13441 89868 9657 1183 740 11581 9055 5568 11222 0 1628 18418 904310 12977 1521 1042 15541 9161 6841 14552 0 2357 23751 9156

Table 5 The number of classes

Amount (times107) The number of classes2 64 96 118 1810 23

Table 6: Trajectories of the cars in a group.

HPHM (plate) | JGSJ (capture time) | JGDD (capture location) | XSFX | XSSD | CSYS
S032W0 | 20131227200308 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 62 | C
S470A5 | 20131227200636 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 43 | J
S560V8 | 20131227200638 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 47 | A
S032V0 | 20131227202633 | Northwest corner of the crossroad of Zhujiang road and north Taiping road | 2 | 41 | C
S560V8 | 20131227202957 | Southeast corner of the crossroad of Hanzhongmen street and middle Jiangdong road | 6 | 29 | A
SFM979 | 20131227203021 | Checkpoint at the crossroad of Suyuan road and west Qingshuiting road | 3 | 39 | A
S470A5 | 20131227203103 | Southeast corner of the crossroad of Yangzijiang road and Qingliangmen street | 4 | 40 | A

[Figure 5: Trajectories of the cars in a group (plates S032V0, S470A5, S560V8, SFM979; trajectory points 1–4).]

to the inconsistency between the growth of the computation time and the growth of the accuracy. The number of classes obtained in the experiment is listed in Table 5.
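The per-car accuracy measure described above (the ratio of the clustered records of one car to all records involving that car) can be sketched in a few lines; the function name and example numbers below are illustrative, not taken from the paper's implementation:

```python
def clustering_accuracy(clustered_records: int, total_records: int) -> float:
    """Accuracy for one car: clustered records / all records of that car."""
    if total_records == 0:
        return 0.0
    return clustered_records / total_records

# Hypothetical example: 8789 of 10000 records of a car end up in its cluster.
ratio = clustering_accuracy(8789, 10000)
print(f"{ratio:.2%}")  # prints "87.89%"
```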

The results show that 4 cars in one mined group exhibit obviously similar trajectories, which are listed in Table 6.

Table 6 indicates that the 4 cars were caught by the same camera within half an hour. The plate number "S032V0" was misidentified as "S032W0" during camera capture. Through clustering, the trajectories of these 4 cars are found to share plenty of traces with the same or similar features, which construct a cluster. The time feature is every Friday evening, and the location feature overlaps along the way to the university, as shown in Figure 5.
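As a simplified illustration of how such a group emerges, the four Table 6 captures at the Suyuan/Qingshuiting checkpoint can be merged by a half-hour time window; this toy grouping rule stands in for the actual MCR-ACA clustering and is not the paper's algorithm:

```python
from datetime import datetime

# Captures of the four cars at the same checkpoint, taken from Table 6
# ("S032W0" is the camera's misreading of "S032V0").
records = [
    ("S032W0", "20131227200308"),
    ("S470A5", "20131227200636"),
    ("S560V8", "20131227200638"),
    ("SFM979", "20131227203021"),
]

# Merge consecutive captures that fall within 30 minutes of the group's
# first capture; all records here share one location, so it is omitted.
groups = []
for plate, ts in sorted(records, key=lambda r: r[1]):
    t = datetime.strptime(ts, "%Y%m%d%H%M%S")
    if groups and (t - groups[-1]["start"]).total_seconds() <= 1800:
        groups[-1]["plates"].add(plate)
    else:
        groups.append({"start": t, "plates": {plate}})

print(len(groups), sorted(groups[0]["plates"]))
# prints: 1 ['S032W0', 'S470A5', 'S560V8', 'SFM979']
```

All four plates land in a single half-hour bucket, mirroring the co-occurrence that the clustering exploits.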

5. Conclusion and Future Work

Critical issues for group mining on big data of vehicle trajectories are centralization and source distribution. In this paper, a distributed parallel clustering method, MCR-ACA, is proposed for group mining on distributed vehicle trajectories. Parallel clustering is realized while the communication overhead of transferring big data is avoided. The method is tested on the traffic monitoring systems of three cities (including the central city, Nanjing) of Jiangsu province in China. Experimental results demonstrate that the proposed method achieves better performance on group mining.

Group mining can be used in many scenarios. According to the experimental results in this paper, two aspects are promising for further work: (i) forecasting group behavior based on specific features; for example, if the time feature of a group is midnight and the location feature is a place with high crime incidence, the group can be regarded as a possible crime group with high probability; (ii) outlier analysis for vehicle trajectories; some trajectory outliers are formed in the clustering process (e.g., the abandoned vehicle trajectories defined in this paper), and the reason why these trajectories are abandoned as outliers is useful for behavior forecasting.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant 61272377) and the Key Technology R&D Program of Jiangsu Province (BE2014733).

References

[1] C. Song, Z. Qu, N. Blumm, and A.-L. Barabasi, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018–1021, 2010.

[2] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, "Unique in the crowd: the privacy bounds of human mobility," Scientific Reports, vol. 3, article 1376, 2013.

[3] S. Wang, H. Wang, X. Qin et al., "Architecting big data: challenges, studies and forecasts," Chinese Journal of Computers, vol. 34, no. 10, pp. 1741–1752, 2011.

[4] M. Armbrust, A. Fox, R. Griffith et al., "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.

[5] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.

[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[7] C. Jayalath, J. Stephen, and P. Eugster, "From the cloud to the atmosphere: running MapReduce across data centers," IEEE Transactions on Computers, vol. 63, no. 1, pp. 74–87, 2014.

[8] B. Chandramouli, J. Goldstein, and S. Duan, "Temporal analytics on big data for web advertising," in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 90–101, April 2012.

[9] A. Mukherjee, J. Datta, R. Jorapur et al., "Shared disk big data analytics with Apache Hadoop," in Proceedings of the 19th International Conference on High Performance Computing (HiPC '12), pp. 1–6, IEEE, 2012.

[10] S. Fiore, A. D'Anca, C. Palazzo et al., "Ophidia: toward big data analytics for eScience," Procedia Computer Science, vol. 18, pp. 2376–2385, 2013.

[11] Y. Kim, K. Shim, M.-S. Kim, and J. Sup Lee, "DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce," Information Systems, vol. 42, pp. 15–35, 2014.

[12] N. Laptev, K. Zeng, and C. Zaniolo, "Very fast estimation for result and accuracy of big data analytics: the EARL system," in Proceedings of the 29th International Conference on Data Engineering (ICDE '13), pp. 1296–1299, Brisbane, Australia, April 2013.

[13] D. Garg, K. Trivedi, and B. Panchal, "A comparative study of clustering algorithms using MapReduce in Hadoop," International Journal of Engineering, vol. 2, no. 10, 2013.

[14] W. Zhao, V. Martha, and X. Xu, "PSCAN: a parallel structural clustering algorithm for big networks in MapReduce," in Proceedings of the 27th IEEE International Conference on Advanced Information Networking and Applications (AINA '13), pp. 862–869, Barcelona, Spain, March 2013.

[15] Y. Yu, P. K. Gunda, and M. Isard, "Distributed aggregation for data-parallel computing: interfaces and implementations," in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP '09), pp. 247–260, October 2009.

[16] X. Cheng and N. Xiao, "Parallel implementation of dynamic positive and negative feedback ACO with iterative MapReduce model," Journal of Information and Computational Science, vol. 10, no. 8, pp. 2359–2370, 2013.

[17] Y. Yang, X. Ni, H. Wang et al., "Parallel implementation of ant-based clustering algorithm based on Hadoop," in Proceedings of the 3rd International Conference on Swarm Intelligence (ICSI '12), pp. 190–197, 2012.

[18] E. Bonabeau, M. Dorigo, and G. Theraulaz, "Inspiration for optimization from social insect behaviour," Nature, vol. 406, no. 6791, pp. 39–42, 2000.

[19] M. Dorigo, E. Bonabeau, and G. Theraulaz, "Ant algorithms and stigmergy," Future Generation Computer Systems, vol. 16, no. 8, pp. 851–871, 2000.
