query processing of massive trajectory data based on mapreduce qiang ma, bin yang (fudan university)...

Post on 17-Dec-2015

219 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Query Processing of Massive Trajectory Data based on MapReduce

Qiang Ma, Bin Yang (Fudan University)Weining Qian, Aoying Zhou (ECNU)

Presented By: Xin Cao (Aalborg University)

Outline

• Introduction • Preliminary• Trajectory Processing– Execution Overview– Storage– Indexing Methods– Query Processing

• Experimental Study• Future Works

Introduction

• Location-based services are playing important roles.

• Large volumes of diverse formats of trajectory data have been accumulated.

• Traditional centralized technologies may not deal with the large amount of trajectories.

• Cloud computing, such as GFS and MapReduce, provides a promising paradigm to conquer the explosion of trajectory data.

Challenge

• Huge volume, updates frequently, rapidly increasing.

• Trajectory data is “continuous”, i.e. ordered sequentially.

• Highly skewed.• MapReduce is good at offline data

analysis, but not efficient for online query.

Our Contributions

• Extend the MapReduce framework to manage massive sequential data, such as trajectories of moving objects.

• Study what kind of query processing methods are appropriate for large clusters.

• Provide two scalable indexing methods to facilitate query processing efficiently.

Preliminary

• Data Model - line segments model–A polyline in three-dimensional space.

• Query Types– Spatio-temporal Range Query:

– Q(Es, Et) → {Sk}– Trajectory-based Query:

– Q(O, Et) → {Sk}

Trajectory Processing

• Execution Overviews

Storage• Data are grouped with key and organized in data chunks in GFS-style

storage.• The whole data set is divided into several parts, and each part is called a

partition and assigned to one data chunk to store.• Each trajectory data is assigned to at least one partition according to

spatio-temporal information

Storage

• A good spatio-temporal partitioning makes the size of data per chunk is fairly uniform.

• Static partitioning strategies are easy to control and suitable for distributed scheduling, but may lead to load imbalance.

• Dynamic strategies can resolve load imbalance, but re-split data can cause distantly migration of large volume of data in clusters.

• Appropriate strategies should be trained

PMI (Partition based Multilevel Index)

• Aim to speed up spatio-temporal range queries.• Generate all candidate partitions by invoking space

partition strategy.• Store together as key/value.– <PartitionID, Sk>

• Each data chunk only contains trajectory segments that belong to the same partition.

• Multilevel index for each node can be built local. (using traditional centralized methods)

OII (Object Inverted Index)

• Aim to speed up trajectory based queries.• Collect each object's all historical trajectories.• Store together as key/value.–<OID, { PartitionID, T}>–Access according to key(object identifier).

Data Insertion

Query Processing• Query Processing

• Trajectory based Queries– Given any object ID, the system can locate the object's trajectory

according to OII.• Range Queries

Experimental Study

• Settings–Hadoop version 0.19.0–8 PC nodes• Ubuntu Linux version 8.04• Pentium IV 1.7GHz CPU• 512M memory

– Java SDK 1.42– Experiment data: Network-based Generator

Experiments – Load Balance

Standard Deviation of Partitioning Load Balance of PRADASE

Experiments – Data Importing and Index Creating

Data Importing with PMI Data Importing with OII

Experiments – Query Processing

Spatio-temporal Range Query Processing with PMI Trajectory Base Query Processing with OII

Future Works

• More heuristic partitioning methods.• Reducing data migration between nodes.• Efficient real-time query processing on

Cloud infrastructure.

Thanks!

top related