big data analysis technology university of paderborn l.079.08013 seminar: cloud computing and big...

Big Data Analysis Technology

University of Paderborn

L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English)

Summer semester 2013

June 12, 2013

Tobias Hardes (6687549) – [email protected]

2June 12, 2013

Table of content

Introduction Definitions

Background Example

Related Work Research

Main Approaches Association Rule Mining MapReduce Framework

Conclusion

3June 12, 2013

4 Big keywords

4June 12, 2013

Big Data vs. Business Intelligence

How can we predict cancer early enough to treat it successfully?

How Can I make significant profit on the stock market next month?

Which is the most profitable branch of our supermarket? In a specific country? During a specific period of time

Docs.oralcle.com

5June 12, 2013

Background

home.web.cern.ch

6June 12, 2013

Big Science – The LHC

600 million times per second, particles collide within the Large Hadron Collider (LHC)

Each collision generate new particles Particles decay in complex way Each collision is detected The CERN Data Center reconstruct this collision event

15 petabytes of data stored every year Worldwide LHC Computing Grid (WLCG) is

used to crunch all of the data

home.web.cern.ch

7June 12, 2013

Data Stream Analysis

- Just in time analysis of data.- Sensor networks

- Analysis for a certain time (last 30 seconds)

http://venturebeat.com

8June 12, 2013

Complex event processing (CEP)

- Provides queries for streams- Usage of „Event Processing Languages“ (EPL)

- select avg(price) from StockTickEvent.win:time(30 sec)

https://forge.fi-ware.eu

Tumbling Window(Slide = WindowSize) Sliding Window

(Slide < WindowSize)

Window Slide

9June 12, 2013

Complex Event Processing - Areas of application

- Just in time analysis Complexity of algorithms- CEP is used with Twitter:

- Identify emotional states of users

- Sarcasm?

10June 12, 2013

Related Work

11June 12, 2013

Big Data in companies

12June 12, 2013

Principles

- Statistics- Probability theory- Machine learning

Data Mining- Association rule learning- Cluster analysis- Classificiation

13June 12, 2013

Association Rule Mining – Cluster analysis

Association Rule Mining-Relationships between items-Find associations, correlations or causal structures

-Apriori algorithm

-Frequent Pattern (FP)-Growth algorithm

Is soda purchased with bananas?

14June 12, 2013

Cluster analysis – Classification

Cluster Analysis-Classification of similar objects into classes-Classes are defined during the clustering

-k-Means-K-Means++

16June 12, 2013

Research and future work

- Performance, performance, performance…- Passes of the data source- Parallelization- NP-hard problems- ….

- Accuracy- Optimized solutions

17June 12, 2013

Example

- Apriori algorithm: n+1 database scans- FP-Growth algorithm: 2 database scans

18June 12, 2013

Distributed computing – Motivation

- Complex computational tasks- Serveral terabytes of data- Limited hardware resources

Google‘s MapReduce framework

Prof. Dr. Erich Ehses (FH Köln)

20June 12, 2013

Main approaches

http://ultraoilforpets.com

21June 12, 2013

Structure

- Association rule mining- Apriori algorithm- FP-Growth algorithm

- Googles MapReduce

22June 12, 2013

Association rule mining

- Identify items that are related to other items- Example: Analysis of baskets in an online shop

or in a supermarket

http://img.deusm.com/

23June 12, 2013

Terminology

- A stream or a database with n elements: S - Item set: - Frequency of occurrence of an item set: Φ(A)

- Association rule B :

- Support: - Confidence:

24June 12, 2013

Example

- Rule: „If a basket contains cheese and chocolate, then it also contains bread“

- 6 of 60 transactions contains cheese and chocolate

- 3 of the 6 transactions contains bread

25June 12, 2013

Common approach

- Disjoin the problem into two tasks:

1. Generation of frequent item sets• Find item sets that satisfy a minimum support value

2. Generation of rules• Find Confidence rules using the item sets

𝐦𝐢𝐧𝐬𝐮𝐩≤𝐬𝐮𝐩 ( 𝑨 )= 𝜱 (𝑨)¿𝑺∨¿ ¿

𝒎𝒊𝒏𝒄𝒐𝒏𝒇 ≤𝒄𝒐𝒏𝒇 ¿

26June 12, 2013

Aprio algorithm – Frequent item set

Input:Minimum support: min_supDatasource: S

27June 12, 2013

Apriori – Frequent item sets (I)

Generation of frequent item sets : min_sup = 2TID Transaction

1 (B,C)

2 (B,C)

3 (A,C,D)

4 (A,B,C,D)

5 (B,D)

{}

A B C D2 341 12 21 3 122 3 4 24

https://www.mev.de/

28June 12, 2013

Apriori – Frequent item sets (II)

Generation of frequent item sets : min_sup = 2TID Transaction

1 (B,C)

2 (B,C)

3 (A,C,D)

4 (A,B,C,D)

5 (B,D)

{}

A B C D

AB AC AD BC BD CD

4 342

1 2 2 3 2 2

ACD BCD

Candidates

Candidates 2 1

https://www.mev.de/

L1

L2

L3

29June 12, 2013

Apriori Algorithm – Rule generation

- Uses frequent item sets to extract high-confidence rules- Based on the same principle as the item set generation- Done for all

frequent item set Lk

30June 12, 2013

Example: Rule generation

TID Items

T1 {Coffee; Pasta; Milk}

T2 {Pasta; Milk}

T3 {Bread; Butter}

T4 {Coffee; Milk; Butter}

T5 {Milk; Bread; Butter}𝐵𝑢𝑡𝑡𝑒𝑟❑

⇒

𝑀𝑖𝑙𝑘

¿ (𝐵𝑢𝑡𝑡𝑒𝑟❑⇒

𝑀𝑖𝑙𝑘)=Φ(𝐵𝑢𝑡𝑡𝑒𝑟∪𝑀𝑖𝑙𝑘)

¿𝑆∨¿=25=40%¿

conf (𝐵𝑢𝑡𝑡𝑒𝑟❑⇒

𝑀𝑖𝑙𝑘)=𝑠𝑢𝑝(𝐵𝑢𝑡𝑡𝑒𝑟∪𝑀𝑖𝑙𝑘)¿ (𝐵𝑢𝑡𝑡𝑒𝑟 )

=40%60%

=66%

31June 12, 2013

Summary Apriori algorithm

- n+1 scans of the database- Expensive generation of the candidate item set- Implements level-wise search using frequent

item property.

- Easy to implement- Some opportunities for specialized optimizations

32June 12, 2013

FP-Growth algorithm

- Used for databases- Features:

- Requires 2 scans of the database- Uses a special data structure – The FP-Tree

1. Build the FP-Tree

2. Extract frequent item sets

- Compression of the database- Devide this database and apply data mining

33June 12, 2013

Construct FP-Tree

TID Items

1 {a,b}

2 {b,c,d}

3 {a,c,d,e}

4 {a,d,e}

5 {a,b,c}

6 {a,b,c,d}

7 {a}

8 {a,b,c}

9 {a,b,d}

10 {b,c,e}

d:1

34June 12, 2013

Extract frequent itemsets (I)

- Bottom-up strategy

- Start with node „e“- Then look for „de“- Each path is processed

recursively- Solutions are merged

35June 12, 2013

Extract frequent itemsets (II)

Φ(e) = 3 – Assume the minimum support was set to 2

- Is e frequent?- Is de frequent?

- …- Is ce frequent?

- ….- Is be frequent?

- ….- Is ae frequent?

- …..Using subproblems to identify frequent itemsets

36June 12, 2013

Extract frequent itemsets (III)

1. Update the support count along the prefix path

2. Remove Node e3. Check the frequency of the paths

Find item sets withde, ce, ae or be

37June 12, 2013

Apriori vs. FP-Growth

- FP-Growth has some advantages- Two scans of the database- No expensive computation of candidates- Compressed datastructure- Easier to parallelize

W. Zhang, H. Liao, and N. Zhao, “Research on the fp growth algorithmabout association rule mining

45June 12, 2013

MapReduce

- Map and Reduce functions are expressed by a developer

- map(key, val)- Emits new key-values p

- reduce(key, values) - Emits an arbitrary output- Usually a key with one value

46June 12, 2013

MapReduce – Word count

User Programm

Master

worker

worker

worker

worker

worker

worker

worker

worker

Input filesMap

phaseIntermediate

filesShuffle

Reducephase

Output files

Worker for red keys

Worker for blue keys

Worker for yellow keys

(1)fork (1)fork(1)fork

(2) assign (2) assign

(3) read (4) local write (5) RPC

(6) write

(7) return

48June 12, 2013

Conclusion: MapReduce (I)

- MapReduce is design as a batch processing framework

- No usage for ad-hoc analysis- Used for very large data sets- Used for time intensive computations

- OpenSource implementation: Apache Hadoop

http://hadoop.apache.org/

49June 12, 2013

Conclusion

50June 12, 2013

Conclusion (I)

- Big Data is important for research and in daily business

- Different approaches- Data Stream analysis

- Complex event processing

- Rule Mining- Apriori algorithm- FP-Growth algorithm

51June 12, 2013

Conclusion (II)

- Clustering- K-Means- K-Means++

- Distributed computing- MapReduce

- Performance / Runtime- Multiple minutes- Hours- Days…- Online analytical processing for Big Data?

Thank you for your attention

Appendix

54June 12, 2013

Big Data definitions

Big data is high-volume, high-velocity

and high-variety information assets

that demand cost-effective, innovative

forms of information processing for

enhanced insight and decision making.(Gartner Inc.)

Every day, we create 2.5 quintillion bytes of …. . This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.(IBM Corporate ) Big data” refers to datasets whose size is

beyond the ability of typical database software

tools to capture, store, manage, and analyze.(McKinsey & Company)

55June 12, 2013

Big Data definitions

Big data is high-volume, high-velocity

and high-variety information assets

that demand cost-effective, innovative

forms of information processing for

enhanced insight and decision making.(Gartner Inc.)

Every day, we create 2.5 quintillion bytes of …. . This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.(IBM Corporate ) Big data” refers to datasets whose size is

beyond the ability of typical database software

tools to capture, store, manage, and analyze.(McKinsey & Company)

56June 12, 2013

Complex Event Processing – Windows

Tumbling Window-Moves as much as the window size

Sliding Window-Slides in time-Buffers the last x elements

Tumbling Window(Slide = WindowSize) Sliding Window

(Slide < WindowSize)

Window Slide

57June 12, 2013

MapReduce vs. BigQuery

58June 12, 2013

Apriori Algorithm (Pseudocode)

- for (

- for each do-

- for each do

- end for

- end for

- if then

- end if

- end for

- return

59June 12, 2013


- for (

- for each do-

- for each do

- end for

- end for

- if then

- end if

- end for

- return

60June 12, 2013


- for (

- for each do-

- for each do

- end for

- end for

- if then

- end if

- end for

- return

61June 12, 2013


- for (

- for each do-

- for each do

- end for

- end for

- if then

- end if

- end for

- return

62June 12, 2013

63June 12, 2013

Distributed computing of Big Data

CERN‘s Worldwide LHC Computing Grid (WLCG) launched in 2002

Stores, distributes and analyse the 15 petabytes of data 140 centres across 35 countries

64June 12, 2013

Apriori Algorithm – 𝑎𝑝𝑟𝑖𝑜𝑟𝑖𝐺𝑒𝑛 Join

- Do not generate not too many candidate item sets, but making sure to not lose any that do turn out to be large.

- Assume that the items are ordered (alphabetical)

- {a1, a2 , … ak-1} = {b1, b2 , … bk-1}, and ak < bk, {a1, a2 , … ak, bk} is a candidate k+1-itemset.

65June 12, 2013

Big Data vs. Business Intelligence

Big Data Large and complex data sets Temporal, historical, … Difficult to process and to

analyse Used for deep analysis and

reporting: How can we predict cancer

early enough to treat it successfully?

How Can I make significant profit on the stock market next month?

Business Intelligence Transformed Data Historical view Easy to process and to

analyse Used for reporting:

Which is the most profitable branch of our supermarket?

Which postcodes suffered the most dropped calls in July?

66June 12, 2013

Improvement approaches

- Selection of startup parameters for algorithms

- Reducing the number of passes over the database

- Sampling the database

- Adding extra constraints for patterns

- Parallelization

67June 12, 2013

Improvement approaches – Examples

68June 12, 2013

Example: FA-DMFI

- Algorithm for Discovering frequent item sets- Read the database once

- Compress into a matrix- Frequent item sets are generated by cover relations Further costly computations are avoided

69June 12, 2013

K-Means algorithm

1. Select k entities as the initial centroids.

2. (Re)Assign all entities to their closest centroids.

3. Recompute the centroid of each newly assembled cluster.

4. Repeat step 2 and 3 until the centroids do not change or until the maximum value for the iterations is reached

70June 12, 2013

Solving approaches

- K-Means cluster is NP-hard- Optimization methods to handle NP-hard

problems (K-Means clustering)

71June 12, 2013

Examples

- Apriori algorithm: n+1 database scans- FP-Growth algorithm: 2 database scans

- K-Means: Exponential runtime- K-Means++: Improve startup parameters

72June 12, 2013

Google‘s BigQuery

Upload

Upload the data set to the Google Storage

http://glenn-packer.net/

Analyse

Import data to tablesProcess

Run queries

73June 12, 2013

The Apriori algorithm

- Most known algorithm for rule mining- Based on a simple principle:

- „If an item set is frequent, then all subsets of this item are also frequent“

- Input:- Minimum confidence: min_conf- Minimum support: min_sup- Data source: S

74June 12, 2013

Apriori Algorithm – aprioriGen

- Generates a candidate item set that might by larger

- Join: Generation of the item set- Prune: Elimination of item sets with

75June 12, 2013

Apriori Algorithm – Rule generation -- Example

- {Butter, milk, bread} {cheese}- {Butter, meat, bread} {cola}

{Butter, bread} {cheese, cola}

76June 12, 2013

How to improve the Apriori algorithm

- Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.

- Sampling: mining on a subset of given data- Dynamic itemset counting:

77June 12, 2013

Construction of FP-Tree

- Compressed representation of the database- First scan

- Get the support of every item and sort them by the support count

- Second scan- Each transaction is mapped to a path- Compression is done if overlapping path are

detected- Generate links between same nodes

- Each node has a counter Number of mapped transactions

78June 12, 2013

FP-Growth algorithm

Calculate the support count of

each item in S

Sort items in decreasing support

counts

Read transaction t

Create new nodes labeled with the

items in t

Set the frequency count to 1

No overlappedprefix found

Increment the frequency count for

each overlapped item

Overlapped prefix found

Create new nodes for none overlapped

items

Create additional path to common

items

hasNext

return

big data analysis technology university of paderborn l.079.08013 seminar: cloud computing and big...

Documents