query processing of massive trajectory data based on mapreduce qiang ma, bin yang (fudan university)...
Post on 17-Dec-2015
219 Views
Preview:
TRANSCRIPT
Query Processing of Massive Trajectory Data based on MapReduce
Qiang Ma, Bin Yang (Fudan University)Weining Qian, Aoying Zhou (ECNU)
Presented By: Xin Cao (Aalborg University)
Outline
• Introduction • Preliminary• Trajectory Processing– Execution Overview– Storage– Indexing Methods– Query Processing
• Experimental Study• Future Works
Introduction
• Location-based services are playing important roles.
• Large volumes of diverse formats of trajectory data have been accumulated.
• Traditional centralized technologies may not deal with the large amount of trajectories.
• Cloud computing, such as GFS and MapReduce, provides a promising paradigm to conquer the explosion of trajectory data.
Challenge
• Huge volume, updates frequently, rapidly increasing.
• Trajectory data is “continuous”, i.e. ordered sequentially.
• Highly skewed.• MapReduce is good at offline data
analysis, but not efficient for online query.
Our Contributions
• Extend the MapReduce framework to manage massive sequential data, such as trajectories of moving objects.
• Study what kind of query processing methods are appropriate for large clusters.
• Provide two scalable indexing methods to facilitate query processing efficiently.
Preliminary
• Data Model - line segments model–A polyline in three-dimensional space.
• Query Types– Spatio-temporal Range Query:
– Q(Es, Et) → {Sk}– Trajectory-based Query:
– Q(O, Et) → {Sk}
Trajectory Processing
• Execution Overviews
Storage• Data are grouped with key and organized in data chunks in GFS-style
storage.• The whole data set is divided into several parts, and each part is called a
partition and assigned to one data chunk to store.• Each trajectory data is assigned to at least one partition according to
spatio-temporal information
Storage
• A good spatio-temporal partitioning makes the size of data per chunk is fairly uniform.
• Static partitioning strategies are easy to control and suitable for distributed scheduling, but may lead to load imbalance.
• Dynamic strategies can resolve load imbalance, but re-split data can cause distantly migration of large volume of data in clusters.
• Appropriate strategies should be trained
PMI (Partition based Multilevel Index)
• Aim to speed up spatio-temporal range queries.• Generate all candidate partitions by invoking space
partition strategy.• Store together as key/value.– <PartitionID, Sk>
• Each data chunk only contains trajectory segments that belong to the same partition.
• Multilevel index for each node can be built local. (using traditional centralized methods)
OII (Object Inverted Index)
• Aim to speed up trajectory based queries.• Collect each object's all historical trajectories.• Store together as key/value.–<OID, { PartitionID, T}>–Access according to key(object identifier).
Data Insertion
Query Processing• Query Processing
• Trajectory based Queries– Given any object ID, the system can locate the object's trajectory
according to OII.• Range Queries
Experimental Study
• Settings–Hadoop version 0.19.0–8 PC nodes• Ubuntu Linux version 8.04• Pentium IV 1.7GHz CPU• 512M memory
– Java SDK 1.42– Experiment data: Network-based Generator
Experiments – Load Balance
Standard Deviation of Partitioning Load Balance of PRADASE
Experiments – Data Importing and Index Creating
Data Importing with PMI Data Importing with OII
Experiments – Query Processing
Spatio-temporal Range Query Processing with PMI Trajectory Base Query Processing with OII
Future Works
• More heuristic partitioning methods.• Reducing data migration between nodes.• Efficient real-time query processing on
Cloud infrastructure.
Thanks!
top related