supporting correlation analysis on scientific datasets in parallel and distributed settings

HPDC 2014

Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings

Yu Su*, Gagan Agrawal*, Jonathan Woodring#

Ayan Biswas*, Han-Wei Shen*

*The Ohio State University, #Los Alamos National Laboratory


Motivation: Big Data

• Gaps between data generation and storage


Molecular Simulation: Molecular Data

Life Science: DNA Sequencing Data (Microarray)

Earth Science: Ocean and Climate Data

Space Science: Astronomy Data


Big Data (Volume/Velocity) Challenge

• Data Movement is the Bottleneck
  – Memory to CPU
  – Disk to Memory
  – Wide Area
• Memory availability is another challenge
• Can we work with a summary of data?
  – Compression approaches already shown applicable


Context: Correlation Data Analysis

• Scientific Analysis Type:
  – Individual Variable Analysis
    • Data Subsetting, Aggregation, Mining, Visualization
  – Correlation Analysis
    • Study relationship among multiple variables
    • Make interesting scientific discoveries
    • “Big Data” problem becomes more severe:
      – Huge data loading cost (multiple variables)
      – Additional filtering cost for subset-based correlation analysis
      – Huge correlation calculation cost
• Correlation analysis is useful but extremely time consuming and resource costly


Our Solution and Contributions (1)

• Identify bitvectors as a summary structure
  – Space efficient
  – Data movement efficient
  – Assume constructed offline
• Correlation computation using bitmaps
  – Better efficiency
  – Smaller memory cost
  – Parallelization
  – Across data stored in distributed repositories


Our Solution and Contributions (2)

• An interactive framework to support both individual and correlation analysis based on bitmaps
  – Correlations and other operations using high-level operators
  – Individual Analysis: flexible data subsetting
  – Correlation Analysis: interactive correlation queries among multiple variables
  – Correlation over flexible data subsets
  – Combine with index-based sampling


Background: Bitmap Indexing

• Widely used in scientific data management

• Suitable for floating-point values by binning small ranges
• Run-length compression (WAH, BBC)
• Bitmap indices can be treated as a small profile of the data
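To make the binning idea concrete, here is a minimal sketch of a binned bitmap index. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_bitmap_index`, the equal-width binning, and the use of Python integers as uncompressed bitvectors are all hypothetical choices.

```python
def build_bitmap_index(values, lo, hi, num_bins):
    """Return one bitvector (a Python int) per bin.

    Bit i of bitvector b is set when values[i] falls into bin b.
    Assumes equal-width bins over [lo, hi] and lo <= v <= hi for all v.
    """
    width = (hi - lo) / num_bins
    bitvectors = [0] * num_bins
    for i, v in enumerate(values):
        # Clamp so that v == hi lands in the last bin.
        b = min(int((v - lo) / width), num_bins - 1)
        bitvectors[b] |= 1 << i
    return bitvectors


# Hypothetical toy data: six temperature values, four bins of width 1.0.
temp = [0.2, 3.9, 1.1, 2.5, 0.7, 3.0]
index = build_bitmap_index(temp, lo=0.0, hi=4.0, num_bins=4)
```

Each element sets exactly one bit across all bins, so the bitvectors together record where every value range occurs in the array.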


Bitmaps and Summarization

• Preserves spatial distribution of data
• Accurate within the limits of binning
• Storage requirement within 15-25% after compression
• Entropy-preserving sampling (HPDC 13)
• May already be built to support query processing
• How do we calculate correlation metrics?
  – Accurately and efficiently
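The storage figure depends on run-length compression of the bitvectors. The sketch below uses plain run-length encoding, which is a simpler stand-in for the word-aligned schemes (WAH, BBC) named earlier and is not either of them; the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse a list of 0/1 bits into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bit list."""
    out = []
    for bit, length in runs:
        out.extend([bit] * length)
    return out


# A sparse bitvector, as is typical when one bin covers a small value range.
bits = [0] * 12 + [1] * 3 + [0] * 9
runs = rle_encode(bits)
```

Long runs of identical bits, which are common in a per-bin bitvector, shrink to a handful of pairs, and decoding restores the bitvector exactly.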


Metrics of Correlation Analysis

• 2-D Histogram:
  – Indicates the value distribution relationship
  – Value distribution of one variable with respect to changes in another
• Shannon’s Entropy:
  – A metric to show the variability of the dataset
  – Low entropy => more constant, predictable data
  – High entropy => more randomly distributed data
• Mutual Information:
  – A metric for computing the dependence between two variables
  – Low M => two variables are relatively independent
  – High M => one variable provides information about another
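Both entropy and mutual information can be computed from bin counts alone. The following sketch assumes base-2 logarithms (the slides do not state the base) and takes the 2-D histogram as a list of rows; the function names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a 1-D histogram of bin counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def mutual_information(joint):
    """Mutual information (in bits) from a 2-D histogram joint[i][j]."""
    total = sum(sum(row) for row in joint)
    row_sums = [sum(row) for row in joint]
    col_sums = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, c in enumerate(row):
            if c > 0:
                # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with counts.
                mi += c / total * math.log2(c * total / (row_sums[i] * col_sums[j]))
    return mi


independent = [[1, 1], [1, 1]]   # knowing A says nothing about B
dependent = [[2, 0], [0, 2]]     # knowing A determines B
```

The two toy histograms show the extremes described above: the independent table gives zero mutual information, and the fully dependent table gives one bit.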


Bitmap-based Correlations

• No Indexing Support:
  – Load all data of variables A and B
  – Filter A and B and generate subset (for value-based subsetting)
  – Generate joint bins: divide A and B into bins, generate (A1, B1)->count11, …, (Am, Bm)->countmm by scanning each data element
  – Calculate correlation metrics based on joint bins
• Dynamic Indexing (build index for each variable):
  – Query bitvectors for variables A and B (much smaller index loading cost, very small filtering cost)
  – Generate joint bins: generate (A1, B1)->count11, …, (Am, Bm)->countmm based on fast bitwise operations between A and B (the number of bitvectors is much smaller than the number of elements)
  – Calculate correlation metrics based on joint bins
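The dynamic-indexing step can be sketched as one bitwise AND plus a bit count per pair of bins. This is an illustration only: bitvectors are modeled as uncompressed Python integers, and the names `joint_bins` and `subset_mask` are hypothetical.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def joint_bins(bitvectors_a, bitvectors_b, subset_mask=None):
    """Build the joint-bin count table from two per-variable bitmap indices.

    joint[i][j] is the number of elements in bin i of A and bin j of B.
    If subset_mask is given, only elements whose bit is set are counted.
    """
    joint = []
    for a in bitvectors_a:
        if subset_mask is not None:
            a &= subset_mask
        joint.append([popcount(a & b) for b in bitvectors_b])
    return joint


# Six elements, two bins per variable (bit i corresponds to element i).
a_bits = [0b010011, 0b101100]
b_bits = [0b000111, 0b111000]
joint = joint_bins(a_bits, b_bits)
first_four = joint_bins(a_bits, b_bits, subset_mask=0b001111)
```

The work is proportional to the number of bin pairs rather than the number of elements, and a subset query only adds one extra AND per bitvector of A.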


Calculation Steps

[Figure: calculation steps, with a "Memory" region]


Static Indexing

• Dynamic Indexing:
  – Build one index for each variable
  – Still need to perform bitwise operations to generate joint bins
• Static Indexing:
  – Build one index over multiple variables
  – Only need to perform bitvector loading and calculation
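One way to picture a static (joint) index is a bitvector per pair of bins, built offline, so that a query needs only bit counts. The sketch below assumes each element has already been assigned a bin id for A and for B; the function names and the layout are hypothetical, not the layout used by the authors.

```python
def build_joint_index(bins_a, bins_b, num_bins_a, num_bins_b):
    """Offline step: one bitvector per (A bin, B bin) pair.

    bins_a[i] and bins_b[i] are the bin ids of element i for each variable.
    """
    joint_index = [[0] * num_bins_b for _ in range(num_bins_a)]
    for i, (a, b) in enumerate(zip(bins_a, bins_b)):
        joint_index[a][b] |= 1 << i
    return joint_index


def joint_counts(joint_index, subset_mask=None):
    """Query step: no AND between variables is needed, only bit counting."""
    counts = []
    for row in joint_index:
        if subset_mask is None:
            counts.append([bin(v).count("1") for v in row])
        else:
            counts.append([bin(v & subset_mask).count("1") for v in row])
    return counts


bins_a = [0, 0, 1, 1, 0, 1]
bins_b = [0, 0, 0, 1, 1, 1]
joint_index = build_joint_index(bins_a, bins_b, 2, 2)
counts = joint_counts(joint_index)
```

The trade-off is that the number of joint bitvectors grows with the product of the bin counts, in exchange for skipping the pairwise AND at query time.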


Parallel Indexing: Dim-based Partitioning

• Pros: efficient parallel index generation

• Cons: slave nodes cannot directly calculate the results; big reduction overhead


Parallel Indexing: Value-based Partitioning

• Pros: slave nodes can directly calculate partial results; very small reduction overhead

• Cons: partitioning for parallel index generation is more time-consuming
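The small reduction overhead can be illustrated with mutual information, which is a sum of per-bin terms. The sketch below assumes each worker owns complete rows of the joint-bin table (a subset of A's value bins) and already knows the global marginal counts of B and the element total; these assumptions and the name `partial_mi` are hypothetical, not taken from the slides.

```python
import math


def partial_mi(rows, col_sums, total):
    """Partial mutual information (bits) over the joint-bin rows a worker owns.

    Each row is complete, so the row marginal is computed locally.
    col_sums and total are assumed to be known globally.
    """
    part = 0.0
    for row in rows:
        row_sum = sum(row)
        for c, col_sum in zip(row, col_sums):
            if c > 0:
                part += c / total * math.log2(c * total / (row_sum * col_sum))
    return part


# Toy joint-bin table: 3 bins of A (rows) by 3 bins of B (columns).
joint = [[4, 1, 0], [1, 3, 1], [0, 1, 5]]
total = sum(sum(row) for row in joint)
col_sums = [sum(col) for col in zip(*joint)]

# Two hypothetical workers: the first owns rows 0-1, the second owns row 2.
worker_rows = [joint[0:2], joint[2:3]]
partials = [partial_mi(rows, col_sums, total) for rows in worker_rows]
mi = sum(partials)  # the reduction moves one number per worker
```

Under dimension-based partitioning, by contrast, each worker sees only part of every bin, so whole joint-bin tables have to be merged before any metric can be computed.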


Correlation Analysis in Distributed Environment

Without Indexing Support

Read Data Subset

Using Bitmap Indexing

Read Index Subset

Computing Node


Correlation Analysis over Samples

Select bitvectors of variable A

Select bitvectors of variable B

Perform Index-based sampling on Variable A

Logic operations between sample of A and bitvectors of B
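The four steps above can be sketched as follows. The sampling here is plain per-bin random sampling of set bits, used as a simplified stand-in for the entropy-preserving method cited earlier; all function names and the `fraction` parameter are hypothetical.

```python
import random


def set_bit_positions(x):
    """Positions of the set bits of a non-negative integer, lowest first."""
    positions = []
    i = 0
    while x:
        if x & 1:
            positions.append(i)
        x >>= 1
        i += 1
    return positions


def sample_bitvector(bitvector, fraction, rng):
    """Keep roughly `fraction` of the set bits, chosen at random."""
    positions = set_bit_positions(bitvector)
    if not positions:
        return 0
    k = max(1, round(len(positions) * fraction))
    sampled = 0
    for p in rng.sample(positions, k):
        sampled |= 1 << p
    return sampled


def sampled_joint_bins(a_bits, b_bits, fraction, seed=0):
    """Sample each bitvector of A, then AND the sample with each bitvector of B."""
    rng = random.Random(seed)
    joint = []
    for a in a_bits:
        s = sample_bitvector(a, fraction, rng)
        joint.append([bin(s & b).count("1") for b in b_bits])
    return joint


a_bits = [0x00FF, 0xFF00]   # 16 elements, 8 per bin of A
b_bits = [0x0F0F, 0xF0F0]   # bins of B cover all 16 elements
joint = sampled_joint_bins(a_bits, b_bits, fraction=0.5)
```

Because the sample is drawn per bitvector, every bin of A keeps its share of elements, and the joint bins are built from fewer set bits.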


System Architecture

Parse the SQL expression

Parse the metadata file

Generate query request

Decide query types

Perform index-based data query and sampling

Read bitvectors and generate joint bins

Read Joint Bitvectors

Calculate Correlation Metrics based on joint bitvectors

Give up current correlation result or not?

Continue Interactive Query

Read the data value after finding satisfying result


User Interface
• Please enter variable names which you want to perform correlation queries:
• TEMP SALT UVEL
• Please enter your query:
• SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50;
• Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05;
• Mutual Information: TEMP->SALT: 0.15, TEMP->UVEL: 0.036;
• Please enter your query:
• SELECT SALT FROM POP WHERE SALT<0.0346;
• Entropy: TEMP: 2.28, SALT: 2.53, UVEL: 3.06;
• Mutual Information: TEMP->UVEL: 0.039, SALT->UVEL: 0.33;
• Please enter your query:
• UNDO
• Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05;
• Mutual Information: TEMP->SALT: 0.15, TEMP->UVEL: 0.036;
• Please enter your query:
• SELECT SALT FROM POP WHERE SALT<0.0346;
• Entropy: TEMP: 2.22, SALT: 1.58, UVEL: 2.64;
• Mutual Information: TEMP->UVEL: 0.31, SALT->UVEL: 0.21;
• ……
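A session like this one, where each query narrows the current subset and UNDO restores the previous one, could be modeled as a stack of subset masks. The slides do not say how the tool implements it; the class `Session`, its methods, and the toy bitvectors below are hypothetical.

```python
class Session:
    """Keeps a stack of subset masks; each query narrows the top mask."""

    def __init__(self, num_elements):
        # Start with every element selected.
        self.masks = [(1 << num_elements) - 1]

    def select(self, bitvectors, bin_ids):
        """Value-based subsetting: OR the bitvectors of the matching bins,
        then AND the result with the current subset."""
        hit = 0
        for b in bin_ids:
            hit |= bitvectors[b]
        self.masks.append(self.masks[-1] & hit)
        return self.masks[-1]

    def undo(self):
        """Drop the most recent query; the initial full mask is never removed."""
        if len(self.masks) > 1:
            self.masks.pop()
        return self.masks[-1]


# Toy indices over six elements (bit i corresponds to element i).
temp_index = [0b010001, 0b000100, 0b001000, 0b100010]
salt_index = [0b000111, 0b111000]

session = Session(num_elements=6)
after_temp = session.select(temp_index, bin_ids=[0, 1])   # e.g. a TEMP range
after_salt = session.select(salt_index, bin_ids=[1])      # e.g. a SALT range
after_undo = session.undo()                               # back to the TEMP subset
```

The current mask can then be passed to the joint-bin computation so that entropy and mutual information are reported for the active subset only.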


Use Case Results

• Histogram of SALT based on TEMP
  • Cold Water (TEMP<5): High SALT
  • Hot Water (TEMP>=15): High SALT
• Entropy
  • TEMP: similar entropy
  • SALT: Diversity of SALT becomes bigger as TEMP increases
• Mutual Information
  • Correlation between TEMP and SALT is high when TEMP is cold or hot


Experiment Results

• Goals:
  – Speedup of correlation analysis using bitmap indexing
  – Scalability of parallel correlation analysis
  – Efficiency improvement in distributed environment
  – Efficiency and accuracy comparison with sampling
• Datasets:
  – Parallel Ocean Program – Multi-dimensional Arrays
  – 26 Variables: TEMP (depth, lat, lon), SALT, UVEL ……
• Environment:
  – OSC Glenn Cluster: each node has 8 cores, 2.6 GHz AMD Opteron, 64 GB memory, 1.9 TB disk


Correlation Efficiency Comparison based on Different Subsets

• No Indexing (original):
  • Data Loading + Filtering
  • Joint Bins Generation (scan each data element)
  • Correlation Calculation
• Dynamic Indexing:
  • Index Subset Loading
  • Joint Bins Generation (bitwise operations)
  • Correlation Calculation
  • 1.78x to 3.61x speedup
  • Speedup becomes bigger as data subset size decreases
• Static Indexing:
  • Joint Index Subset Loading
  • Correlation Calculation
  • 11.4x to 15.35x speedup

• Variables: TEMP SALT, 5.6 GB each
• Metrics: Entropy, Histogram, Mutual Info
• Input: 1000 queries divided into 5 categories based on subsetting percentage


Parallel Correlation Analysis based on Different Numbers of Nodes

• Variables: TEMP SALT, 28 GB each
• Metrics: Entropy, Histogram, Mutual Info
• Number of nodes: 1 – 32, one core per node
• Calculate correlations based on entire data
• Speedup as more nodes are used

• Dim-based Partition:
  • The speedup is limited
  • 1.73x to 5.96x speedup
  • Every node can only generate joint bins
  • Joint bins from different nodes need to be transferred for a global reduction (big cost)
  • More nodes used means bigger network transfer and calculation cost
• Value-based Partition:
  • Much better speedup
  • 1.87x to 11.79x speedup
  • Every node can directly calculate partial correlation metrics
  • Very small reduction cost


Efficiency Improvement in Distributed Environment

• Data Size: 7 GB – 28 GB
• Indexing Method:
  • Smaller data transfer time (index size is only 12.1% to 26.8% of the dataset)
  • Faster correlation analysis time (smaller data loading, faster joint bin calculation)
• Speedup of using local data server (1 Gb/s): 1.87x – 1.91x
• Speedup of using remote data server (200 Mb/s): 2.78x – 2.96x

[Figure panels: Local Data Server (1 Gb/s); Remote Data Server (200 Mb/s)]


Efficiency and Accuracy Comparison with Sampling

• Select 10 variables (1.4 GB each) and calculate mutual information between each pair (45 pairs)
• Calculate correlation based on samples:
  • Joint bins generation time is greatly reduced
  • Extra cost: sampling time
  • Speedup: 1.34x – 6.84x
• Use CFP to present relative mutual information differences (45 pairs)
• More accuracy is lost as smaller samples are used; average accuracy loss by sample size:
  • 50% sample: 1.53%
  • 25% sample: 3.42%
  • 10% sample: 7.91%
  • 5% sample: 12.57%
  • 1% sample: 18.32%


Conclusion

• ‘Big Data’ issue brings challenges for scientific data management
• Correlation analysis is useful but time-consuming
• Improve the efficiency of correlation analysis using bitmap indexing
• Develop a tool to support interactive correlation analysis over flexible subsets of the data
• Support correlation analysis in parallel and distributed environments
• Combine data sampling with correlation analysis


Thanks


Backup Slides


Correlation Efficiency Comparison based on Different Data Sizes

• Variables: TEMP SALT
• Metrics: Entropy, Histogram, Mutual Info
• Input: Data with different sizes

• No Indexing (original):
  • Data Loading
  • Joint Bins Generation
  • Correlation Calculation
• Dynamic Indexing:
  • Index Loading
  • Joint Bins Generation
  • Correlation Calculation
  • Still achieves a good speedup because of the faster data loading speed and joint bins calculation speed
• Static Indexing:
  • Joint Index Loading
  • Correlation Calculation


Parallel Correlation Analysis based on Different Subsets

• Variables: TEMP SALT, 28 GB each
• Metrics: Entropy, Histogram, Mutual Info
• Number of nodes: 16
• Input: 1000 queries divided into 5 categories based on subset sizes

• Dim-based Partition:
  • The speedup is limited
  • Bigger subsets generate a larger number of joint bins
  • More data transfer and reduction cost as subset percentage increases
• Value-based Partition:
  • Much better scalability
  • 1.17x to 1.58x speedup compared to dim-based partition
  • The speedup is not affected by data subset percentage
