Big Data Analysis Technology
University of Paderborn
L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English)
Summer semester 2013
June 12, 2013
Tobias Hardes (6687549) – [email protected]
Table of contents
- Introduction: Definitions
- Background: Example
- Related Work: Research
- Main Approaches: Association Rule Mining, MapReduce Framework
- Conclusion
4 Big keywords
Big Data vs. Business Intelligence
How can we predict cancer early enough to treat it successfully?
How can I make significant profit on the stock market next month?
Which is the most profitable branch of our supermarket? In a specific country? During a specific period of time?
docs.oracle.com
Background
home.web.cern.ch
Big Science – The LHC
600 million times per second, particles collide within the Large Hadron Collider (LHC)
- Each collision generates new particles
- Particles decay in complex ways
- Each collision is detected
- The CERN Data Center reconstructs these collision events
- 15 petabytes of data are stored every year
- The Worldwide LHC Computing Grid (WLCG) is used to crunch all of the data
home.web.cern.ch
Data Stream Analysis
- Just-in-time analysis of data
- Sensor networks
- Analysis for a certain time window (e.g. the last 30 seconds)
http://venturebeat.com
Complex event processing (CEP)
- Provides queries for streams
- Usage of "Event Processing Languages" (EPL)
- Example: select avg(price) from StockTickEvent.win:time(30 sec)
https://forge.fi-ware.eu
[Figure: tumbling window (slide = window size) vs. sliding window (slide < window size)]
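The two window types can be illustrated with a minimal Python sketch — this is plain Python over a finished list of events, not an EPL engine like the one behind the query above, and the function names are mine:

```python
from collections import deque

def tumbling_avg(events, window_size):
    """Average over non-overlapping windows (slide == window size)."""
    averages = []
    for i in range(0, len(events), window_size):
        window = events[i:i + window_size]
        averages.append(sum(window) / len(window))
    return averages

def sliding_avg(events, window_size):
    """Average over windows that advance one event at a time
    (slide < window size); buffers the last `window_size` elements."""
    window = deque(maxlen=window_size)
    averages = []
    for e in events:
        window.append(e)
        averages.append(sum(window) / len(window))
    return averages
```

A real CEP engine evaluates such windows continuously over an unbounded stream; here the tumbling variant emits one average per full window, while the sliding variant emits one per event.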
Complex Event Processing - Areas of application
- Just-in-time analysis
- Complexity of algorithms
- CEP is used with Twitter:
  - Identify emotional states of users
  - Sarcasm?
Related Work
Big Data in companies
Principles
- Statistics
- Probability theory
- Machine learning
- Data Mining:
  - Association rule learning
  - Cluster analysis
  - Classification
Association Rule Mining – Cluster analysis
Association Rule Mining
- Relationships between items
- Find associations, correlations or causal structures
- Apriori algorithm
- Frequent Pattern (FP)-Growth algorithm
Is soda purchased with bananas?
Cluster analysis – Classification
Cluster Analysis
- Grouping of similar objects into classes
- Classes are defined during the clustering
- k-Means
- k-Means++
Research and future work
- Performance, performance, performance…
  - Passes over the data source
  - Parallelization
  - NP-hard problems
  - …
- Accuracy
- Optimized solutions
Example
- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans
Distributed computing – Motivation
- Complex computational tasks
- Several terabytes of data
- Limited hardware resources
→ Google's MapReduce framework
Prof. Dr. Erich Ehses (FH Köln)
Main approaches
http://ultraoilforpets.com
Structure
- Association rule mining
  - Apriori algorithm
  - FP-Growth algorithm
- Google's MapReduce
Association rule mining
- Identify items that are related to other items
- Example: Analysis of baskets in an online shop or in a supermarket
http://img.deusm.com/
Terminology
- A stream or a database with n elements: S
- Item set: a set of items A
- Frequency of occurrence of an item set: Φ(A)
- Association rule: A ⇒ B
- Support: sup(A ⇒ B) = Φ(A ∪ B) / |S|
- Confidence: conf(A ⇒ B) = Φ(A ∪ B) / Φ(A)
Example
- Rule: "If a basket contains cheese and chocolate, then it also contains bread"
- 6 of 60 transactions contain cheese and chocolate → support = 10%
- 3 of these 6 transactions contain bread → confidence = 50%
Common approach
- Split the problem into two tasks:
1. Generation of frequent item sets
   • Find item sets that satisfy a minimum support value: min_sup ≤ sup(A) = Φ(A) / |S|
2. Generation of rules
   • Find high-confidence rules using the frequent item sets: min_conf ≤ conf(A ⇒ B)
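These two measures are straightforward to compute directly. A minimal Python sketch, assuming transactions are given as Python sets (the function names are mine, not from the slides); the test values come from the Butter ⇒ Milk example on the rule-generation slide:

```python
def support(itemset, transactions):
    """sup(A) = Phi(A) / |S|: fraction of transactions containing the item set."""
    phi = sum(1 for t in transactions if itemset <= t)
    return phi / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A => B) = sup(A u B) / sup(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))
```

With S as the five-transaction table shown later, support({Butter, Milk}, S) is 2/5 = 40% and confidence({Butter}, {Milk}, S) is 2/3 ≈ 66%.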
Apriori algorithm – Frequent item sets
Input:
- Minimum support: min_sup
- Data source: S
Apriori – Frequent item sets (I)
Generation of frequent item sets: min_sup = 2

TID Transaction
1   (B,C)
2   (B,C)
3   (A,C,D)
4   (A,B,C,D)
5   (B,D)

[Figure: itemset lattice, level 1 — candidate 1-itemsets with support counts: A: 2, B: 4, C: 4, D: 3]
https://www.mev.de/
Apriori – Frequent item sets (II)
Generation of frequent item sets: min_sup = 2

TID Transaction
1   (B,C)
2   (B,C)
3   (A,C,D)
4   (A,B,C,D)
5   (B,D)

[Figure: itemset lattice —
L1 candidates: A: 2, B: 4, C: 4, D: 3;
L2 candidates: AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2;
L3 candidates: ACD: 2, BCD: 1]
https://www.mev.de/
Apriori Algorithm – Rule generation
- Uses frequent item sets to extract high-confidence rules
- Based on the same principle as the item set generation
- Done for all frequent item sets Lk
Example: Rule generation
TID Items
T1 {Coffee; Pasta; Milk}
T2 {Pasta; Milk}
T3 {Bread; Butter}
T4 {Coffee; Milk; Butter}
T5 {Milk; Bread; Butter}

Rule: Butter ⇒ Milk

sup(Butter ⇒ Milk) = Φ(Butter ∪ Milk) / |S| = 2/5 = 40%

conf(Butter ⇒ Milk) = sup(Butter ∪ Milk) / sup(Butter) = 40% / 60% ≈ 66%
Summary Apriori algorithm
- n+1 scans of the database
- Expensive generation of the candidate item sets
- Implements level-wise search using the frequent item set property
- Easy to implement
- Some opportunities for specialized optimizations
FP-Growth algorithm
- Used for databases
- Features:
  - Requires 2 scans of the database
  - Uses a special data structure: the FP-tree
1. Build the FP-tree
2. Extract frequent item sets
- Compresses the database
- Divides the database and applies data mining to the parts
Construct FP-Tree
TID Items
1 {a,b}
2 {b,c,d}
3 {a,c,d,e}
4 {a,d,e}
5 {a,b,c}
6 {a,b,c,d}
7 {a}
8 {a,b,c}
9 {a,b,d}
10 {b,c,e}
[Figure: FP-tree built from the ten transactions above]
Extract frequent itemsets (I)
- Bottom-up strategy
- Start with node "e"
- Then look for "de"
- Each path is processed recursively
- Solutions are merged
Extract frequent itemsets (II)
Φ(e) = 3 – assume the minimum support was set to 2
- Is e frequent?
- Is de frequent? …
- Is ce frequent? …
- Is be frequent? …
- Is ae frequent? …
Using subproblems to identify frequent item sets
Extract frequent itemsets (III)
1. Update the support count along the prefix path
2. Remove node e
3. Check the frequency of the paths
Find item sets with de, ce, ae or be
Apriori vs. FP-Growth
- FP-Growth has some advantages:
  - Two scans of the database
  - No expensive computation of candidates
  - Compressed data structure
  - Easier to parallelize
W. Zhang, H. Liao, and N. Zhao, "Research on the FP-Growth algorithm about association rule mining"
MapReduce
- Map and Reduce functions are written by the developer
- map(key, val)
  - Emits new key-value pairs
- reduce(key, values)
  - Emits an arbitrary output
  - Usually a key with one value
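The contract above can be sketched in a few lines of single-process Python — a toy illustration of the map, shuffle and reduce phases on word count, not Google's distributed implementation (the function names `map_fn`, `reduce_fn` and `map_reduce` are mine):

```python
from collections import defaultdict
from itertools import chain

def map_fn(key, value):
    # key: document name, value: document contents.
    # Emits one (word, 1) pair per word.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word.
    return (key, sum(values))

def map_reduce(documents, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in documents.items())
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: apply reduce_fn once per key group.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())
```

In the real framework the map and reduce invocations run in parallel on many workers and the shuffle moves data across the network; the three phases are the same.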
MapReduce – Word count
[Figure: MapReduce execution overview for word count — the user program forks (1) a master and workers; the master assigns (2) map and reduce tasks; map workers read (3) the input file splits and write (4) intermediate files to local disk; reduce workers fetch them via RPC (5) during the shuffle, with one reduce worker per key range, write (6) the output files, and return (7) to the user program]
Conclusion: MapReduce (I)
- MapReduce is designed as a batch processing framework
  - Not suited for ad-hoc analysis
  - Used for very large data sets
  - Used for time-intensive computations
- Open-source implementation: Apache Hadoop
http://hadoop.apache.org/
Conclusion
Conclusion (I)
- Big Data is important for research and in daily business
- Different approaches:
  - Data stream analysis
  - Complex event processing
  - Rule mining: Apriori algorithm, FP-Growth algorithm
Conclusion (II)
- Clustering: k-Means, k-Means++
- Distributed computing: MapReduce
- Performance / runtime
  - Multiple minutes, hours, days…
- Online analytical processing for Big Data?
Thank you for your attention
Appendix
Big Data definitions
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner Inc.)

"Every day, we create 2.5 quintillion bytes of …. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM Corporation)

"Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (McKinsey & Company)
Complex Event Processing – Windows
Tumbling Window
- Moves as much as the window size
Sliding Window
- Slides in time
- Buffers the last x elements
[Figure: tumbling window (slide = window size) vs. sliding window (slide < window size)]
MapReduce vs. BigQuery
Apriori Algorithm (Pseudocode)
- L1 = {frequent 1-itemsets of S}
- for (k = 2; Lk−1 ≠ ∅; k++)
  - Ck = aprioriGen(Lk−1)
  - for each transaction t ∈ S do
    - for each candidate c ∈ Ck with c ⊆ t do
      - increment the support count of c
    - end for
  - end for
  - for each candidate c ∈ Ck do
    - if count(c) ≥ min_sup then add c to Lk
    - end if
  - end for
- end for
- return ⋃k Lk
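The Apriori pseudocode can be turned into a short runnable sketch. This is my own minimal Python version of the standard algorithm, assuming the database S is a list of sets and min_sup is an absolute count; it folds the join and prune steps of aprioriGen into one set comprehension:

```python
from itertools import combinations

def apriori(S, min_sup):
    """Return all itemsets of S with absolute support >= min_sup."""
    # L1: frequent 1-itemsets, found with one scan over S.
    items = sorted(set().union(*S))
    counts = {frozenset([i]): sum(1 for t in S if i in t) for i in items}
    L = {c for c, n in counts.items() if n >= min_sup}
    frequent = set(L)
    k = 2
    while L:
        # aprioriGen: join Lk-1 with itself, prune candidates that
        # contain an infrequent (k-1)-subset.
        candidates = {a | b for a in L for b in L
                      if len(a | b) == k
                      and all(frozenset(s) in L
                              for s in combinations(a | b, k - 1))}
        # Count all candidates with one pass over the database.
        counts = {c: sum(1 for t in S if c <= t) for c in candidates}
        L = {c for c, n in counts.items() if n >= min_sup}
        frequent |= L
        k += 1
    return frequent
```

On the five-transaction example from the earlier slides (min_sup = 2) this yields the four frequent single items, five frequent pairs and the triple ACD — ten frequent itemsets in total.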
Distributed computing of Big Data
CERN's Worldwide LHC Computing Grid (WLCG), launched in 2002
- Stores, distributes and analyses the 15 petabytes of data
- 140 centres across 35 countries
Apriori Algorithm – 𝑎𝑝𝑟𝑖𝑜𝑟𝑖𝐺𝑒𝑛 Join
- Generate as few candidate item sets as possible, while making sure not to lose any that turn out to be frequent
- Assume that the items are ordered (e.g. alphabetically)
- If {a1, a2, …, ak-1} = {b1, b2, …, bk-1} and ak < bk, then {a1, a2, …, ak, bk} is a candidate (k+1)-itemset
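The join rule above translates almost literally into code. A small Python sketch of only the join step (the function name is mine), assuming the frequent k-itemsets are given as sets:

```python
def apriori_gen_join(L_k):
    """Join step: merge two ordered k-itemsets that agree on their
    first k-1 items, yielding candidate (k+1)-itemsets as tuples."""
    candidates = set()
    itemsets = sorted(tuple(sorted(s)) for s in L_k)
    for a in itemsets:
        for b in itemsets:
            # {a1..ak-1} == {b1..bk-1} and ak < bk
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    return candidates
```

Applied to the frequent 2-itemsets from the running example (AC, AD, BC, BD, CD), only AC+AD and BC+BD share a prefix, so the join produces exactly the two candidates ACD and BCD shown in the lattice slide; the prune step would then discard candidates with an infrequent subset.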
Big Data vs. Business Intelligence
Big Data
- Large and complex data sets
- Temporal, historical, …
- Difficult to process and to analyse
- Used for deep analysis and reporting:
  - How can we predict cancer early enough to treat it successfully?
  - How can I make significant profit on the stock market next month?

Business Intelligence
- Transformed data
- Historical view
- Easy to process and to analyse
- Used for reporting:
  - Which is the most profitable branch of our supermarket?
  - Which postcodes suffered the most dropped calls in July?
Improvement approaches
- Selection of startup parameters for algorithms
- Reducing the number of passes over the database
- Sampling the database
- Adding extra constraints for patterns
- Parallelization
Improvement approaches – Examples
Example: FA-DMFI
- Algorithm for discovering frequent item sets
- Reads the database once
- Compresses it into a matrix
- Frequent item sets are generated by cover relations; further costly computations are avoided
K-Means algorithm
1. Select k entities as the initial centroids.
2. (Re)Assign all entities to their closest centroids.
3. Recompute the centroid of each newly assembled cluster.
4. Repeat steps 2 and 3 until the centroids do not change or the maximum number of iterations is reached.
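The four steps above can be sketched as a plain Python implementation for 2-D points. This is my own minimal version: for reproducibility it takes the first k points as initial centroids (the slide leaves the selection strategy open; a real implementation would pick them randomly or with k-Means++):

```python
def k_means(points, k, max_iter=100):
    # 1. Select k entities as the initial centroids (here: the first k).
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iter):
        # 2. (Re)assign every entity to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # 3. Recompute the centroid of each newly assembled cluster.
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # 4. Stop as soon as the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated groups of points this converges in a handful of iterations; the NP-hardness mentioned later concerns finding the globally optimal clustering, not running this heuristic.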
Solving approaches
- K-Means clustering is NP-hard
- Optimization methods are used to handle NP-hard problems (such as K-Means clustering)
Examples
- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans
- K-Means: exponential worst-case runtime
- K-Means++: improves the startup parameters (initial centroids)
Google‘s BigQuery
1. Upload: upload the data set to Google Storage
2. Process: import the data into tables
3. Analyse: run queries
http://glenn-packer.net/
The Apriori algorithm
- Best-known algorithm for rule mining
- Based on a simple principle:
  - "If an item set is frequent, then all subsets of this item set are also frequent"
- Input:
  - Minimum confidence: min_conf
  - Minimum support: min_sup
  - Data source: S
Apriori Algorithm – aprioriGen
- Generates candidate item sets that are one item larger
- Join: generation of the candidate item sets
- Prune: elimination of candidates that contain an infrequent subset
Apriori Algorithm – Rule generation -- Example
- {Butter, milk, bread} ⇒ {cheese}
- {Butter, meat, bread} ⇒ {cola}
- {Butter, bread} ⇒ {cheese, cola}
How to improve the Apriori algorithm
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Sampling: mining on a subset of the given data
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
Construction of FP-Tree
- Compressed representation of the database
- First scan:
  - Get the support of every item and sort items by their support counts
- Second scan:
  - Each transaction is mapped to a path
  - Compression is done if overlapping paths are detected
  - Links are generated between nodes carrying the same item
- Each node has a counter: the number of transactions mapped through it
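The two-scan construction can be sketched compactly. A minimal Python version (class and function names are mine; header links between same-item nodes are omitted to keep the sketch short):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item label, None for the root
        self.count = 1          # number of transactions mapped through this node
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(transactions, min_sup):
    """Two scans: count item supports, then insert each transaction
    as a path of items sorted by decreasing support count."""
    # First scan: support of every item.
    support = {}
    for t in transactions:
        for item in t:
            support[item] = support.get(item, 0) + 1
    # Keep only frequent items.
    order = {item: s for item, s in support.items() if s >= min_sup}
    root = FPNode(None, None)
    # Second scan: map each transaction to a path, compressing shared prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            if item in node.children:      # overlapping prefix: bump the counter
                node.children[item].count += 1
            else:                          # new branch for the remaining items
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root
```

On the ten transactions of the construction slide (supports a: 8, b: 7, c: 6, d: 5, e: 3), the root gets an a-branch with count 8 and a b-branch with count 2, matching the compression the slide describes.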
FP-Growth algorithm
Tree construction flow:
1. Calculate the support count of each item in S.
2. Sort the items in decreasing support counts.
3. Read the next transaction t.
4. If no overlapping prefix is found: create new nodes labeled with the items in t and set their frequency counts to 1.
5. If an overlapping prefix is found: increment the frequency count of each overlapped item, create new nodes for the non-overlapped items, and create an additional path to the common items.
6. While transactions remain (hasNext), continue with step 3; otherwise return the tree.