Big Data Analysis Technology University of Paderborn L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English) Summer semester 2013 June 12, 2013 Tobias Hardes (6687549) – [email protected]


Big Data Analysis Technology

University of Paderborn

L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English)

Summer semester 2013

June 12, 2013

Tobias Hardes (6687549) – [email protected]


Table of contents

- Introduction: Definitions
- Background: Example
- Related Work: Research
- Main Approaches: Association Rule Mining, MapReduce Framework
- Conclusion


4 Big keywords


Big Data vs. Business Intelligence

How can we predict cancer early enough to treat it successfully?

How can I make significant profit on the stock market next month?

Which is the most profitable branch of our supermarket? In a specific country? During a specific period of time?

docs.oracle.com


Background

home.web.cern.ch


Big Science – The LHC

- Particles collide within the Large Hadron Collider (LHC) 600 million times per second
- Each collision generates new particles
- Particles decay in complex ways
- Each collision is detected
- The CERN Data Center reconstructs each collision event
- 15 petabytes of data are stored every year
- The Worldwide LHC Computing Grid (WLCG) is used to crunch all of the data

home.web.cern.ch


Data Stream Analysis

- Just-in-time analysis of data
- Sensor networks
- Analysis over a certain time span (last 30 seconds)

http://venturebeat.com


Complex event processing (CEP)

- Provides queries for streams
- Usage of "Event Processing Languages" (EPL)
  - select avg(price) from StockTickEvent.win:time(30 sec)

https://forge.fi-ware.eu

[Figure: tumbling window (Slide = WindowSize) compared with sliding window (Slide < WindowSize)]
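The two window types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the windows here are count-based (a fixed number of events) rather than time-based like the 30-second EPL query, and the names `tumbling_avg`, `sliding_avg` and the sample `prices` are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def tumbling_avg(values: Iterable[float], size: int) -> Iterator[float]:
    """Average over consecutive, non-overlapping windows (slide == size)."""
    window: List[float] = []
    for v in values:
        window.append(v)
        if len(window) == size:
            yield sum(window) / size
            window = []  # start a fresh window; no element is reused


def sliding_avg(values: Iterable[float], size: int) -> Iterator[float]:
    """Average over overlapping windows that advance by one element (slide < size)."""
    window: Deque[float] = deque(maxlen=size)  # the oldest element drops out automatically
    for v in values:
        window.append(v)
        if len(window) == size:
            yield sum(window) / size


prices = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
tumbling = list(tumbling_avg(prices, 3))  # two windows: [10, 12, 14] and [16, 18, 20]
sliding = list(sliding_avg(prices, 3))    # four overlapping windows
```

The tumbling variant emits one result per full window, while the sliding variant emits a result for every new event once the buffer is full.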


Complex Event Processing - Areas of application

- Just-in-time analysis
- Complexity of algorithms
- CEP is used with Twitter:
  - Identify emotional states of users
  - Sarcasm?


Related Work


Big Data in companies


Principles

- Statistics
- Probability theory
- Machine learning

Data Mining
- Association rule learning
- Cluster analysis
- Classification


Association Rule Mining – Cluster analysis

Association Rule Mining
- Relationships between items
- Find associations, correlations or causal structures
- Apriori algorithm
- Frequent Pattern (FP)-Growth algorithm

Is soda purchased with bananas?


Cluster analysis – Classification

Cluster Analysis
- Classification of similar objects into classes
- Classes are defined during the clustering
- K-Means
- K-Means++


Research and future work

- Performance, performance, performance…
  - Passes of the data source
  - Parallelization
  - NP-hard problems
  - …
- Accuracy
  - Optimized solutions


Example

- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans


Distributed computing – Motivation

- Complex computational tasks
- Several terabytes of data
- Limited hardware resources

Google's MapReduce framework

Prof. Dr. Erich Ehses (FH Köln)


Main approaches

http://ultraoilforpets.com


Structure

- Association rule mining
  - Apriori algorithm
  - FP-Growth algorithm
- Google's MapReduce


Association rule mining

- Identify items that are related to other items
- Example: Analysis of baskets in an online shop or in a supermarket

http://img.deusm.com/


Terminology

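- A stream or a database with $n$ elements: $S$
- Item set: $A$
- Frequency of occurrence of an item set: $\Phi(A)$
- Association rule: $A \Rightarrow B$
- Support: $\mathrm{sup}(A) = \frac{\Phi(A)}{|S|}$
- Confidence: $\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)}$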


Example

- Rule: "If a basket contains cheese and chocolate, then it also contains bread"

- 6 of 60 transactions contain cheese and chocolate

- 3 of the 6 transactions contain bread
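As a worked sketch of these numbers, take $A$ = {cheese, chocolate} and $B$ = {bread}, with support defined as the frequency of an item set divided by the number of transactions $|S|$, and confidence as the support of $A \cup B$ divided by the support of $A$. The symbols $A$ and $B$ are chosen here for illustration only.

```latex
\mathrm{sup}(A) = \frac{\Phi(A)}{|S|} = \frac{6}{60} = 10\%
\qquad
\mathrm{sup}(A \Rightarrow B) = \frac{\Phi(A \cup B)}{|S|} = \frac{3}{60} = 5\%
\qquad
\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)} = \frac{5\%}{10\%} = 50\%
```

So the rule holds for half of the baskets that contain cheese and chocolate, but it covers only 5% of all transactions.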


Common approach

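- Split the problem into two tasks:
  1. Generation of frequent item sets
     - Find item sets that satisfy a minimum support value: $\mathrm{minsup} \le \mathrm{sup}(A) = \frac{\Phi(A)}{|S|}$
  2. Generation of rules
     - Find high-confidence rules using the item sets: $\mathrm{minconf} \le \mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)}$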


Apriori algorithm – Frequent item sets

Input:
- Minimum support: min_sup
- Data source: S


Apriori – Frequent item sets (I)

Generation of frequent item sets: min_sup = 2

TID  Transaction
1    (B,C)
2    (B,C)
3    (A,C,D)
4    (A,B,C,D)
5    (B,D)

[Figure: item set lattice starting at {} with the 1-item sets A, B, C, D and their support counts]

https://www.mev.de/


Apriori – Frequent item sets (II)

Generation of frequent item sets: min_sup = 2

TID  Transaction
1    (B,C)
2    (B,C)
3    (A,C,D)
4    (A,B,C,D)
5    (B,D)

Level  Candidates (support count)
L1     A (2), B (4), C (4), D (3)
L2     AB (1), AC (2), AD (2), BC (3), BD (2), CD (2)
L3     ACD (2), BCD (1)

https://www.mev.de/
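The level-wise search on this slide can be reproduced with a short Python sketch. It is a minimal illustration rather than the presenter's implementation: `min_sup` is treated as an absolute count (as on the slide), and the function name `apriori` and the `frozenset` representation are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def apriori(transactions: List[Set[str]], min_sup: int) -> Dict[Itemset, int]:
    """Return every frequent item set together with its absolute support count."""
    # Level 1: count the single items in one pass.
    counts: Dict[Itemset, int] = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 2
    while level:
        # Join: unite pairs of frequent (k-1)-item sets that form a k-item set.
        candidates = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # One pass over the data per level to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent


transactions = [{"B", "C"}, {"B", "C"}, {"A", "C", "D"}, {"A", "B", "C", "D"}, {"B", "D"}]
frequent = apriori(transactions, min_sup=2)
```

On the five transactions above, AB is dropped at level 2 and BCD at level 3, which leaves ACD as the only frequent 3-item set.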


Apriori Algorithm – Rule generation

- Uses frequent item sets to extract high-confidence rules
- Based on the same principle as the item set generation
- Done for all frequent item sets $L_k$


Example: Rule generation

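TID  Items
T1   {Coffee; Pasta; Milk}
T2   {Pasta; Milk}
T3   {Bread; Butter}
T4   {Coffee; Milk; Butter}
T5   {Milk; Bread; Butter}

Rule: $\mathrm{Butter} \Rightarrow \mathrm{Milk}$

$$\mathrm{sup}(\mathrm{Butter} \Rightarrow \mathrm{Milk}) = \frac{\Phi(\mathrm{Butter} \cup \mathrm{Milk})}{|S|} = \frac{2}{5} = 40\%$$

$$\mathrm{conf}(\mathrm{Butter} \Rightarrow \mathrm{Milk}) = \frac{\mathrm{sup}(\mathrm{Butter} \cup \mathrm{Milk})}{\mathrm{sup}(\mathrm{Butter})} = \frac{40\%}{60\%} \approx 66.7\%$$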
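The same calculation can be checked in Python. This is a small sketch over the five transactions of this slide; the helper names `support` and `confidence` are hypothetical and not taken from the talk.

```python
from typing import List, Set

transactions: List[Set[str]] = [
    {"Coffee", "Pasta", "Milk"},
    {"Pasta", "Milk"},
    {"Bread", "Butter"},
    {"Coffee", "Milk", "Butter"},
    {"Milk", "Bread", "Butter"},
]


def support(itemset: Set[str], db: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)


def confidence(antecedent: Set[str], consequent: Set[str], db: List[Set[str]]) -> float:
    """sup(A ∪ B) / sup(A) for the rule A => B."""
    return support(antecedent | consequent, db) / support(antecedent, db)


sup_rule = support({"Butter", "Milk"}, transactions)        # 2 of 5 transactions
conf_rule = confidence({"Butter"}, {"Milk"}, transactions)  # 2 of the 3 Butter transactions
print(f"sup = {sup_rule:.0%}, conf = {conf_rule:.1%}")
```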


Summary Apriori algorithm

- n+1 scans of the database
- Expensive generation of the candidate item sets
- Implements level-wise search using the frequent item set property
- Easy to implement
- Some opportunities for specialized optimizations


FP-Growth algorithm

- Used for databases
- Features:
  - Requires 2 scans of the database
  - Uses a special data structure: the FP-Tree
    1. Build the FP-Tree
    2. Extract frequent item sets
- Compression of the database
- Divide this database and apply data mining


Construct FP-Tree

TID Items

1 {a,b}

2 {b,c,d}

3 {a,c,d,e}

4 {a,d,e}

5 {a,b,c}

6 {a,b,c,d}

7 {a}

8 {a,b,c}

9 {a,b,d}

10 {b,c,e}

[Figure: FP-Tree built from the transactions above]
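A minimal sketch of the two-scan construction for the ten transactions above follows. It is an illustration under assumptions: the `Node` class and `build_fp_tree` are hypothetical names, ties in support are broken alphabetically, and the node-links between equal items that a full FP-Tree maintains are left out for brevity.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self, item: Optional[str], parent: Optional["Node"] = None) -> None:
        self.item = item
        self.count = 0
        self.parent = parent
        self.children: Dict[str, "Node"] = {}


def build_fp_tree(transactions: List[List[str]], min_sup: int) -> Tuple[Node, Counter]:
    # First scan: support count of every item.
    support = Counter(item for t in transactions for item in set(t))
    root = Node(None)
    # Second scan: insert each transaction as a path, most frequent items first.
    for t in transactions:
        items = [i for i in set(t) if support[i] >= min_sup]
        items.sort(key=lambda i: (-support[i], i))  # ties broken alphabetically
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:      # no shared prefix here: create a new node
                child = Node(item, node)
                node.children[item] = child
            child.count += 1       # shared prefix: only the counter grows
            node = child
    return root, support


transactions = [
    ["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"], ["a", "d", "e"],
    ["a", "b", "c"], ["a", "b", "c", "d"], ["a"], ["a", "b", "c"],
    ["a", "b", "d"], ["b", "c", "e"],
]
root, support = build_fp_tree(transactions, min_sup=2)
```

Eight of the ten transactions share the prefix `a`, so they are compressed into a single child of the root whose counter is 8.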


Extract frequent itemsets (I)

- Bottom-up strategy
- Start with node "e"
- Then look for "de"
- Each path is processed recursively
- Solutions are merged


Extract frequent itemsets (II)

Φ(e) = 3; assume the minimum support was set to 2

- Is e frequent?
  - Is de frequent?
    - …
  - Is ce frequent?
    - …
  - Is be frequent?
    - …
  - Is ae frequent?
    - …

Using subproblems to identify frequent itemsets


Extract frequent itemsets (III)

1. Update the support count along the prefix path
2. Remove node e
3. Check the frequency of the paths

Find item sets with de, ce, ae or be


Apriori vs. FP-Growth

- FP-Growth has some advantages
  - Two scans of the database
  - No expensive computation of candidates
  - Compressed data structure
  - Easier to parallelize

W. Zhang, H. Liao, and N. Zhao, “Research on the FP Growth algorithm about association rule mining”


MapReduce

- Map and Reduce functions are expressed by a developer
- map(key, val)
  - Emits new key-value pairs
- reduce(key, values)
  - Emits an arbitrary output
  - Usually a key with one value


MapReduce – Word count
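Word count can be sketched as a single-process simulation of the map, shuffle and reduce phases. This is only an illustration of the programming model, not Hadoop's or Google's API: `map_fn`, `reduce_fn`, `run_job` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """key: document name, value: document text. Emits one (word, 1) pair per word."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """key: a word, values: all counts emitted for it. Emits the word with its total."""
    return key, sum(values)


def run_job(documents: Dict[str, str]) -> Dict[str, int]:
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for name, text in documents.items() for pair in map_fn(name, text)]
    # Shuffle: group all intermediate values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


docs = {"doc1": "big data is big", "doc2": "data analysis"}
word_counts = run_job(docs)
```

In a real cluster the map and reduce calls run on different workers and the shuffle moves data between them; the functions the developer writes keep this shape.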


[Figure: MapReduce execution overview. The user program (1) forks a master and several workers; the master (2) assigns map and reduce tasks; map workers (3) read the input files and (4) write intermediate files locally (map phase); during the shuffle, reduce workers (5) fetch the intermediate data via RPC, with one worker each for the red, blue and yellow keys; the reduce phase (6) writes the output files and (7) returns to the user program.]


Conclusion: MapReduce (I)

- MapReduce is designed as a batch processing framework
  - Not suited for ad-hoc analysis
  - Used for very large data sets
  - Used for time-intensive computations
- Open-source implementation: Apache Hadoop

http://hadoop.apache.org/


Conclusion


Conclusion (I)

- Big Data is important for research and in daily business

- Different approaches
  - Data stream analysis
    - Complex event processing
  - Rule mining
    - Apriori algorithm
    - FP-Growth algorithm


Conclusion (II)

- Clustering
  - K-Means
  - K-Means++
- Distributed computing
  - MapReduce
- Performance / Runtime
  - Multiple minutes
  - Hours
  - Days…
- Online analytical processing for Big Data?


Thank you for your attention


Appendix


Big Data definitions

"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner Inc.)

"Every day, we create 2.5 quintillion bytes of …. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data." (IBM Corporate)

"'Big data' refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (McKinsey & Company)



Complex Event Processing – Windows

Tumbling Window (Slide = WindowSize)
- Moves as much as the window size

Sliding Window (Slide < WindowSize)
- Slides in time
- Buffers the last x elements


MapReduce vs. BigQuery


Apriori Algorithm (Pseudocode)

for ( … ) do
    for each … do
        …
        for each … do
            …
        end for
    end for
    if … then
        …
    end if
end for
return …


Distributed computing of Big Data

- CERN's Worldwide LHC Computing Grid (WLCG) launched in 2002
- Stores, distributes and analyses the 15 petabytes of data
- 140 centres across 35 countries


Apriori Algorithm – aprioriGen Join

- Do not generate too many candidate item sets, while making sure not to lose any that turn out to be large (frequent).

- Assume that the items are ordered (alphabetically)

- If two frequent $k$-item sets agree on their first $k-1$ items, $\{a_1, a_2, \dots, a_{k-1}\} = \{b_1, b_2, \dots, b_{k-1}\}$, and $a_k < b_k$, then $\{a_1, a_2, \dots, a_k, b_k\}$ is a candidate $(k+1)$-item set.
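The join rule can be sketched as follows. This is a minimal illustration, assuming item sets are stored as alphabetically sorted tuples; the function name `apriori_gen` is hypothetical, and the pruning check (all k-subsets of a candidate must be frequent) is added on the basis of the Apriori principle rather than taken from this slide.

```python
from itertools import combinations
from typing import List, Set, Tuple

Itemset = Tuple[str, ...]  # items are kept in alphabetical order


def apriori_gen(frequent_k: Set[Itemset]) -> List[Itemset]:
    """Build candidate (k+1)-item sets from the frequent k-item sets."""
    candidates: List[Itemset] = []
    for a in sorted(frequent_k):
        for b in sorted(frequent_k):
            # Join: identical first k-1 items, and the last item of a precedes that of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidate = a + (b[-1],)
                # Prune: every k-subset of the candidate must itself be frequent.
                if all(s in frequent_k for s in combinations(candidate, len(a))):
                    candidates.append(candidate)
    return candidates


l2 = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")}
c3 = apriori_gen(l2)
```

Because of the ordering condition, each candidate is produced exactly once, from the two item sets that differ only in their last item.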


Big Data vs. Business Intelligence

Big Data
- Large and complex data sets
- Temporal, historical, …
- Difficult to process and to analyse
- Used for deep analysis and reporting:
  - How can we predict cancer early enough to treat it successfully?
  - How can I make significant profit on the stock market next month?

Business Intelligence
- Transformed data
- Historical view
- Easy to process and to analyse
- Used for reporting:
  - Which is the most profitable branch of our supermarket?
  - Which postcodes suffered the most dropped calls in July?


Improvement approaches

- Selection of startup parameters for algorithms

- Reducing the number of passes over the database

- Sampling the database

- Adding extra constraints for patterns

- Parallelization


Improvement approaches – Examples

 


Example: FA-DMFI

- Algorithm for discovering frequent item sets
- Read the database once
  - Compress into a matrix
- Frequent item sets are generated by cover relations
  - Further costly computations are avoided


K-Means algorithm

1. Select k entities as the initial centroids.

2. (Re)Assign all entities to their closest centroids.

3. Recompute the centroid of each newly assembled cluster.

4. Repeat steps 2 and 3 until the centroids do not change or until the maximum number of iterations is reached.
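The four steps can be sketched in plain Python for two-dimensional points. This is a minimal illustration under assumptions: the initial centroids are drawn at random from the input (the slide does not say how they are selected), squared Euclidean distance is used, and the names `k_means`, `max_iter` and `seed` are hypothetical.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def k_means(points: List[Point], k: int, max_iter: int = 100,
            seed: int = 0) -> Tuple[List[Point], List[int]]:
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k entities as initial centroids
    labels: List[int] = []
    for _ in range(max_iter):
        # Step 2: (re)assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Step 3: recompute the centroid of each cluster.
        new_centroids: List[Point] = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

The result depends on the initial centroids in general, which is the motivation for improved seeding such as K-Means++.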


Solving approaches

- K-Means clustering is NP-hard
- Optimization methods to handle NP-hard problems (K-Means clustering)


Examples

- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans
- K-Means: Exponential runtime
- K-Means++: Improve startup parameters


Google's BigQuery

1. Upload: Upload the data set to Google Storage
2. Process: Import data to tables
3. Analyse: Run queries

http://glenn-packer.net/


The Apriori algorithm

- Best-known algorithm for rule mining
- Based on a simple principle:
  - "If an item set is frequent, then all subsets of this item set are also frequent"
- Input:
  - Minimum confidence: min_conf
  - Minimum support: min_sup
  - Data source: S


Apriori Algorithm – aprioriGen

- Generates candidate item sets that might be large (frequent)
- Join: Generation of the item set
- Prune: Elimination of item sets with …


Apriori Algorithm – Rule generation -- Example

- {Butter, milk, bread} ⇒ {cheese}
- {Butter, meat, bread} ⇒ {cola}

{Butter, bread} ⇒ {cheese, cola}


How to improve the Apriori algorithm

- Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Sampling: mining on a subset of given data
- Dynamic itemset counting:


Construction of FP-Tree

- Compressed representation of the database
- First scan
  - Get the support of every item and sort them by the support count
- Second scan
  - Each transaction is mapped to a path
  - Compression is done if overlapping paths are detected
  - Generate links between same nodes
- Each node has a counter: number of mapped transactions


FP-Growth algorithm

1. Calculate the support count of each item in S
2. Sort items in decreasing support counts
3. Read transaction t
   - No overlapped prefix found: create new nodes labeled with the items in t and set the frequency count to 1
   - Overlapped prefix found: increment the frequency count for each overlapped item and create new nodes for non-overlapped items
4. Create additional paths to common items
5. While hasNext, continue with step 3; otherwise return