indexing and parallel query processing support for visualizing climate datasets
Post on 24-Feb-2016
ICPP 2012
Indexing and Parallel Query Processing Support for Visualizing Climate Datasets
Yu Su*, Gagan Agrawal*, Jonathan Woodring†
*The Ohio State University
†Los Alamos National Laboratory
Outline
• Motivation and Introduction
• Background
• System Overview and Optimization
• Experiment
• Conclusion
Motivation
• Science is becoming increasingly data driven
• Strong desire for efficient data visualization
• Challenges:
  – Fast data generation speed
  – Slow disk I/O and network speed
  – Worse performance during visualization
  – Different kinds of subsetting requests
• Visualizing all the data is difficult and unnecessary
Data Subsetting in ParaView
• A widely used data analysis and visualization application
• Problems: Load + Filter mode
  – Load the entire data set
  – Data filtering at the visualization level
    • Threshold Filter: based on values
    • Extract Subset Filter: based on dimension info
  – Grid transformation needed during filtering
    • Regular Structured Grid -> Unstructured Grid
A Faster Solution
• Subset at the I/O level
  – User specifies the subset in one query for both dimension and value ranges
  – Reduced I/O time and memory footprint
• SQL queries in ParaView
  – Query over dimensions: API support
  – Query over values: indexing
• Bitmap indices and parallel bitmap indices
  – Efficient subsetting over values
Background: Bitmap Indexing
• FastBit: widely used in scientific data management
• Suitable for floating-point values by binning small ranges
• Run-length compression (WAH, BBC)
  – Compresses a bitvector based on runs of continuous 0s or 1s
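The binning and run-length ideas above can be sketched in a few lines of NumPy. This is an illustration only and is not how FastBit is implemented. The function names, the sample values and the bin edges are all made up, and the run-length encoder is a plain (bit, count) scheme, not the word-aligned WAH or byte-aligned BBC formats.

```python
import numpy as np

def build_bitmap_index(values, bin_edges):
    """Return one boolean bitvector per bin; bit i is set when values[i] falls in that bin."""
    # np.digitize returns k when bin_edges[k-1] <= value < bin_edges[k].
    bin_ids = np.digitize(values, bin_edges)
    return {b: bin_ids == b for b in range(1, len(bin_edges))}

def run_lengths(bitvector):
    """Encode a bitvector as (bit, run length) pairs over consecutive equal bits."""
    runs = []
    for bit in bitvector:
        bit = int(bit)
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(run) for run in runs]

def query_bins(index, first_bin, last_bin):
    """Answer a value-range query aligned to bin boundaries by OR-ing bitvectors."""
    result = np.zeros_like(next(iter(index.values())))
    for b in range(first_bin, last_bin + 1):
        result |= index[b]
    return result

# Hypothetical sample data: eight float values, five bins of width 2.0.
temperature = np.array([1.2, 1.4, 7.9, 8.1, 8.3, 3.3, 3.6, 1.1])
edges = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
index = build_bitmap_index(temperature, edges)

hits = query_bins(index, 1, 2)         # elements with value in [0.0, 4.0)
compressed = run_lengths(index[1])     # runs of the first bin's bitvector
```

A query whose bounds fall on bin edges is answered from the bitvectors alone, without reading the raw values. Bounds that cut through a bin would additionally need a check of the candidate values in the boundary bins.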
Bitmap Index and Dim Subset
• Run-length compression (WAH, BBC)
  – Good: compression rate, fast bitwise operations
  – Bad: ability to locate a dim subset is lost
• Two traditional methods:
  – With bitmap indices: post-filter on dim info
  – Without bitmap indices: post-filter on values
• Two-phase optimization:
  – Index generation: distributed indices over sub-blocks
  – Index retrieval: transform dim subsetting info into bitvectors, and support fast bitwise operations
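The index retrieval step can be illustrated by turning a dimension subset into a bitvector and combining it with a value bitvector through a bitwise AND. The sketch below is a minimal illustration, not the system's implementation. It assumes uncompressed boolean arrays and row-major layout, and the helper name `dim_subset_bitvector`, the array shape and the ranges are all hypothetical.

```python
import numpy as np

def dim_subset_bitvector(shape, ranges):
    """Build a bitvector over the flattened (row-major) array with 1s inside the dim subset.

    ranges holds one (start, stop) pair per dimension, with stop exclusive.
    """
    mask = np.zeros(shape, dtype=bool)
    mask[tuple(slice(start, stop) for start, stop in ranges)] = True
    return mask.ravel()

# Hypothetical 4 x 6 variable whose value at flat position p is simply p.
shape = (4, 6)
data = np.arange(24, dtype=float).reshape(shape)

# Stand-in for the bitvector a value index would return for "value >= 10".
value_bits = (data >= 10.0).ravel()

# Dimension subset: rows 1..2 and columns 2..4.
dim_bits = dim_subset_bitvector(shape, [(1, 3), (2, 5)])

# One bitwise AND replaces the post-filtering pass over candidate elements.
selected = value_bits & dim_bits
positions = np.flatnonzero(selected)
```

Because both conditions are expressed as bitvectors, no per-element check of dimension coordinates is needed after the value lookup.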
System Overview
Parse the SQL expression
Parse the metadata file
Generate Query Request
Generate the index if it has not been generated yet; retrieve from the index after that.
Optimization 1: Distributed Index Generation
Study the relationship between queries and partitions.
Partition the data based on query preference.
Index Partition Strategy
• α rate: participation rate of data elements
  – Number of elements involved in indexing / total data size
  – Worst: all elements have to be involved
  – Ideal: the elements involved are exactly the dim subset
• Partition strategies:
  – Strategy 1: α is proportional to the dim subsetting percentage and inversely proportional to the number of partitions.
  – Strategy 2: In general cases where subsetting over each dimension has a similar probability, the partition should have equal preference over each dim.
  – Strategy 3: If queries only include a subset of dims, the partition should also be based on these dims.
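A small calculation makes the α rate and Strategy 3 concrete. The sketch below assumes the data is split into equal-sized blocks along each dimension, and that every partition overlapping the dim subset contributes all of its elements. The function name `alpha_rate` and the 8 x 8 x 8 example are hypothetical, chosen only to illustrate the definition above.

```python
import math

def alpha_rate(shape, parts, ranges):
    """Fraction of elements that lie in partitions overlapping the dim subset.

    shape:  array extent per dimension
    parts:  number of equal partitions per dimension
    ranges: (start, stop) per dimension, with stop exclusive
    """
    touched = 1
    for extent, n_parts, (start, stop) in zip(shape, parts, ranges):
        block = math.ceil(extent / n_parts)   # block length along this dimension
        first = start // block                # first block the subset overlaps
        last = (stop - 1) // block            # last block the subset overlaps
        # Elements covered by the touched blocks (the final block may be short).
        touched *= min((last + 1) * block, extent) - first * block
    return touched / math.prod(shape)

shape = (8, 8, 8)
# Query that subsets only the first dimension: 25% of the data.
query = [(0, 2), (0, 8), (0, 8)]

no_partition = alpha_rate(shape, (1, 1, 1), query)   # one block: every element participates
even_split = alpha_rate(shape, (2, 2, 2), query)     # 8 partitions, equal over all dims
first_dim_only = alpha_rate(shape, (8, 1, 1), query) # 8 partitions, all along the queried dim
```

With the same number of partitions, splitting only along the queried dimension brings α down to the subset percentage (0.25 here), while the even split still involves half of the data.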
Optimization 2: Index Retrieval
Post-filter?
Parallel Index Architecture
L1: data file
L2: variable
L3: data block
Experiment Setup
• Goals:
  – SQL subsetting vs. Load + Filter in ParaView
  – Scalability of the parallel indexing method
  – Indexing and partition strategy vs. FastQuery
• Dataset:
  – Parallel Ocean Program
  – Data size: 33.6 GB
  – Data format: NetCDF (array based)
• Environment:
  – IBM Xeon cluster, 8 cores, 2.53 GHz
  – 12 GB memory
Efficiency Comparison with Filtering in ParaView
• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• General index method is better than filtering when the data subset is < 60%
• Two-phase optimization achieved a 0.71 to 11.17 speedup compared with the filtering method
Index m1: bitmap indexing, no optimization
Index m2: use bitwise operations instead of post-filtering
Index m3: use both bitwise operations and index partitioning
Filter: load all data + filter
Memory Comparison with Filtering in ParaView
• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• General index method has a much smaller memory cost than the filtering method
• Two-phase optimization adds only a small extra memory cost
Index m1: bitmap indexing, no optimization
Index m2: use bitwise operations instead of post-filtering
Index m3: use both bitwise operations and index partitioning
Filter: load all data + filter
Scalability with Different Proc#
• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: time
• Each process takes care of one sub-block
• Good scalability as the number of processes increases
Alpha Rate with Different Proc#
• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: alpha rate
• More processes means more index partitions
• Good participation rate when selecting a smaller percentage of the data
Alpha Rate and I/O Access Times Comparison with FastQuery
• FastQuery:
  – Builds a relational table view over the scientific dataset
  – Difference: doesn't consider multi-dimensional data features
• Data size: 8.4 GB, 48 processes
• Query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
• Input: 100 queries for each query type
Efficiency Comparison with FastQuery
• Data size: 8.4 GB
• Proc#: 48
• Input: 100 queries for each query type
• Achieved a 1.41 to 2.12 speedup compared with FastQuery
Conclusion
• Big data issue in data analysis and visualization
• Find the exact data subset at the I/O level with an SQL interface and bitmap indexing
• Speedup of up to 11.17 compared with the filtering method
• Data partition strategy and parallel indexing
• Speedup of 1.41 to 2.12 compared with FastQuery
Thanks