crystalball - compute relative frequency in hadoop

Post on 15-Aug-2015

Big Data Project on

Crystal Ball

Submitted By:

Sushil Sedai (984474)

Suvash Shah (984461)

Submitted to: Prof. Prem Nair

Pair approach (Mapper) – pseudo code

method map(docid id, doc d)
    for each term w in doc d do
        total = 0;
        for each neighbor u in Neighbor(w) do
            Emit(Pair(w, u), 1);
            total++;
        Emit(Pair(w, *), total);

Pair approach (Mapper) – Java Code
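The Java code for this slide did not survive in the transcript. What follows is a minimal, Hadoop-free sketch of the mapper pseudo code above, not the authors' implementation. The class name `PairMapperSketch` and the string encoding of pairs are hypothetical. The neighbor window (every term after w up to the next occurrence of w) is an assumption; it is the definition that reproduces the sample outputs in this deck.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the pair-approach mapper (no Hadoop classes involved). */
class PairMapperSketch {

    /** Emits "w,u<TAB>1" for every neighbor u of every term w, then "w,*<TAB>total". */
    static List<String> map(String line) {
        String[] terms = line.trim().split("\\s+");
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            int total = 0;
            // Assumed window: terms after position i, stopping at the next occurrence of terms[i].
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                emitted.add(terms[i] + "," + terms[j] + "\t1");
                total++;
            }
            // Marginal record for w; the reducer needs it before any (w, u) record.
            emitted.add(terms[i] + ",*\t" + total);
        }
        return emitted;
    }

    public static void main(String[] args) {
        map("18 29 12 34 79 18 56 12 34 92").forEach(System.out::println);
    }
}
```

In a real job the same loop would sit inside a Hadoop `Mapper`, with a custom pair key type replacing the string encoding used here.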

Pair approach (Reducer) – pseudo code

method reduce(Pair p, Iterable<Int> values)
    if p.secondValue == *
        if p.firstValue is new
            currentValue = p.firstValue;
            marginal = sum(values);
        else
            marginal += sum(values);
    else
        Emit(p, sum(values) / marginal);

Pair approach (Reducer) – Java Code
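The original Java code is missing from the transcript, so here is a hedged plain-Java sketch of the reducer logic described in the pseudo code. It assumes the framework delivers keys grouped and sorted so that the `(w, *)` marginal arrives before any `(w, u)` pair; a `TreeMap` over string keys of the form `"w,u"` stands in for that ordering, because `*` sorts before digits. The class name `PairReducerSketch` and the key encoding are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the pair-approach reducer (no Hadoop classes involved). */
class PairReducerSketch {

    /** grouped maps "w,u" or "w,*" to the counts received from all mappers. */
    static Map<String, Double> reduce(TreeMap<String, List<Integer>> grouped) {
        Map<String, Double> relativeFrequency = new LinkedHashMap<>();
        String current = null;
        double marginal = 0; // double, so the division below is not integer division
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            String[] pair = e.getKey().split(",");
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            if (pair[1].equals("*")) {
                if (!pair[0].equals(current)) {
                    current = pair[0];   // new left term: restart the marginal
                    marginal = sum;
                } else {
                    marginal += sum;
                }
            } else {
                relativeFrequency.put(e.getKey(), sum / marginal);
            }
        }
        return relativeFrequency;
    }
}
```

For the term 18 in the sample input, the two marginals of 4 add up to 8, so a pair count of 2 for `(18,12)` yields 0.25, matching the Reducer1 output listed in this deck.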

Pair approach - input

Mapper1 input

34 56 29 12 34 56 92 10 34 12

Mapper2 input

18 29 12 34 79 18 56 12 34 92

Pair approach – Output (Reducer1)

(10,12) 0.5
(10,34) 0.5
(12,10) 0.09090909090909091
(12,18) 0.09090909090909091
(12,34) 0.36363636363636365
(12,56) 0.18181818181818182
(12,79) 0.09090909090909091
(12,92) 0.18181818181818182
(18,12) 0.25
(18,29) 0.125
(18,34) 0.25
(18,56) 0.125
(18,79) 0.125
(18,92) 0.125
(29,10) 0.06666666666666667
(29,12) 0.26666666666666666
(29,18) 0.06666666666666667
(29,34) 0.26666666666666666
(29,56) 0.13333333333333333
(29,79) 0.06666666666666667
(29,92) 0.13333333333333333
(34,10) 0.08333333333333333
(34,12) 0.25
(34,18) 0.08333333333333333
(34,29) 0.08333333333333333
(34,56) 0.25
(34,79) 0.08333333333333333
(34,92) 0.16666666666666666
(56,10) 0.1
(56,12) 0.3
(56,29) 0.1
(56,34) 0.3
(56,92) 0.2
(92,10) 0.3333333333333333
(92,12) 0.3333333333333333
(92,34) 0.3333333333333333

Pair approach – Output (Reducer2)

(79,12) 0.2

(79,18) 0.2

(79,34) 0.2

(79,56) 0.2

(79,92) 0.2

Stripe approach (Mapper) – pseudo code

method map(docid id, doc d)
    Stripe H;
    for each term w in doc d do
        clear(H);
        for each neighbor u in Neighbor(w) do
            if H.containsKey(u)
                H{u} += 1;
            else
                H.add(u, 1);
        Emit(w, H);

Stripe approach (Mapper) – Java Code
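The Java code for this slide is not present in the transcript. The following is a small plain-Java sketch of the stripe mapper pseudo code, offered as an illustration only. `StripeMapperSketch` is a hypothetical name, and the neighbor window (terms after w up to the next occurrence of w) is an assumption chosen because it reproduces the outputs shown in this deck.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the stripe-approach mapper (no Hadoop classes involved). */
class StripeMapperSketch {

    /** Returns one (term, stripe) record per term occurrence in the line. */
    static List<Map.Entry<String, Map<String, Integer>>> map(String line) {
        String[] terms = line.trim().split("\\s+");
        List<Map.Entry<String, Map<String, Integer>>> emitted = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            // A fresh stripe per occurrence plays the role of clear(H) in the pseudo code.
            Map<String, Integer> stripe = new LinkedHashMap<>();
            // Assumed window: terms after position i, stopping at the next occurrence of terms[i].
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                stripe.merge(terms[j], 1, Integer::sum);
            }
            emitted.add(new SimpleEntry<>(terms[i], stripe));
        }
        return emitted;
    }
}
```

Compared with the pair mapper, this emits one record per term occurrence instead of one per co-occurrence, at the cost of a larger value object.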

Stripe approach (Reducer) – pseudo code

method reduce(Text key, Stripes [H1, H2, …])
    Stripe H;
    for each Stripe Hi in [H1, H2, …] do
        H = elementWiseSum(H, Hi);
    total = sumValues(H);
    for each Item h in H do
        h.secondValue /= total;
    Emit(key, H);

Stripe approach (Reducer) – Java Code
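The reducer's Java code is also missing from the transcript. Below is a hedged plain-Java sketch of the steps in the pseudo code: merge all stripes received for one key element-wise, then divide each count by the stripe total. The name `StripeReducerSketch` is hypothetical and nothing here is taken from the authors' code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the stripe-approach reducer (no Hadoop classes involved). */
class StripeReducerSketch {

    /** stripes holds every stripe emitted for one term; returns neighbor -> relative frequency. */
    static Map<String, Double> reduce(List<Map<String, Integer>> stripes) {
        // Element-wise sum of all stripes for this key.
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> e : stripe.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // The marginal is simply the sum of the merged counts.
        double total = 0;
        for (int count : merged.values()) {
            total += count;
        }
        Map<String, Double> relativeFrequency = new TreeMap<>();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            relativeFrequency.put(e.getKey(), e.getValue() / total);
        }
        return relativeFrequency;
    }
}
```

Because the whole stripe for a term is available in one call, no special `(w, *)` key or sort order is needed, unlike in the pair approach.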


Stripe approach – input

Mapper1 input

34 56 29 12 34 56 92 10 34 12

Mapper2 input

18 29 12 34 79 18 56 12 34 92

Stripe approach – Output(Reducer1)

10 [ (34,0.5000) (12,0.5000) ]

12 [ (56,0.1818) (92,0.1818) (34,0.3636) (18,0.0909) (79,0.0909) (10,0.0909) ]

18 [ (56,0.1250) (92,0.1250) (34,0.2500) (79,0.1250) (29,0.1250) (12,0.2500) ]

29 [ (56,0.1333) (92,0.1333) (34,0.2667) (18,0.0667) (79,0.0667) (10,0.0667) (12,0.2667) ]

34 [ (56,0.2500) (92,0.1667) (18,0.0833) (79,0.0833) (29,0.0833) (10,0.0833) (12,0.2500) ]

56 [ (92,0.2000) (34,0.3000) (29,0.1000) (10,0.1000) (12,0.3000) ]

92 [ (34,0.3333) (10,0.3333) (12,0.3333) ]

Stripe approach – Output(Reducer2)

79 [ (56,0.2000) (92,0.2000) (34,0.2000) (18,0.2000) (12,0.2000) ]

Hybrid approach (Mapper) – pseudo code

method map(docid id, doc d)
    HashMap H;
    for each term w in doc d do
        for each neighbor u in Neighbor(w) do
            if H.contains(Pair(w, u))
                H{Pair(w, u)} += 1;
            else
                H.add(Pair(w, u), 1);
    for each Pair p in H do
        Emit(p, H(p));

Hybrid approach (Mapper) – Java Code
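Since the Java code for this slide is absent from the transcript, here is a short plain-Java sketch of the hybrid mapper pseudo code: pair counts are accumulated in a local map and emitted once at the end, rather than one record per co-occurrence. `HybridMapperSketch` and the `"w,u"` key encoding are hypothetical, and the neighbor window (terms after w up to the next occurrence of w) is an assumption consistent with the outputs in this deck.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java sketch of the hybrid-approach mapper (no Hadoop classes involved). */
class HybridMapperSketch {

    /** Returns pair -> local count; each entry corresponds to one Emit(p, H(p)). */
    static Map<String, Integer> map(String line) {
        String[] terms = line.trim().split("\\s+");
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            // Assumed window: terms after position i, stopping at the next occurrence of terms[i].
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                counts.merge(terms[i] + "," + terms[j], 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the first sample line, the 31 co-occurrences collapse into fewer records, for example a single `34,56` entry with count 2.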

Hybrid approach (Reducer) – pseudo code

prev = null;
HashMap H;

method reduce(Pair p, Iterable<Int> values)
    if prev != null and p.firstValue != prev
        total = sumValues(H);
        for each item h in H
            h.value /= total;
        Emit(prev, H);
        clear(H);
    end if
    prev = p.firstValue;
    H.add(p.secondValue, sum(values));

method close
    // flush the stripe of the last term
    total = sumValues(H);
    for each item h in H
        h.value /= total;
    Emit(prev, H);

Hybrid approach (Reducer) – Java Code
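The Java code for the hybrid reducer is not in the transcript. The sketch below is one plain-Java reading of the pseudo code, under the assumption that pairs arrive sorted by their first element: counts are collected into a stripe while the left term stays the same, and the stripe is normalized and emitted when the term changes or when `close` is called. `HybridReducerSketch` and its `output` field are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the hybrid-approach reducer (no Hadoop classes involved). */
class HybridReducerSketch {

    private String prev = null;
    private final Map<String, Double> stripe = new LinkedHashMap<>();

    /** Collected results: term -> (neighbor -> relative frequency). */
    final Map<String, Map<String, Double>> output = new LinkedHashMap<>();

    /** Called once per pair (w, u), in order of w. */
    void reduce(String w, String u, List<Integer> values) {
        if (prev != null && !prev.equals(w)) {
            flush(); // left term changed: the previous stripe is complete
        }
        prev = w;
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        stripe.put(u, sum);
    }

    /** Called after the last pair, so the final stripe is not lost. */
    void close() {
        if (prev != null) {
            flush();
        }
    }

    private void flush() {
        double total = 0;
        for (double count : stripe.values()) {
            total += count;
        }
        Map<String, Double> relativeFrequency = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : stripe.entrySet()) {
            relativeFrequency.put(e.getKey(), e.getValue() / total);
        }
        output.put(prev, relativeFrequency);
        stripe.clear();
    }
}
```

Note that the stripe is emitted under the previous term, not the term of the pair that triggered the flush.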


Hybrid approach - Input

Mapper1 input

34 56 29 12 34 56 92 10 34 12

Mapper2 input

18 29 12 34 79 18 56 12 34 92

Hybrid approach – Output(Reducer1)

10 (12,0.5) (34,0.5)

12 (10,0.09090909) (18,0.09090909) (34,0.36363637) (56,0.18181819) (79,0.09090909) (92,0.18181819)

18 (12,0.25) (29,0.125) (34,0.25) (56,0.125) (79,0.125) (92,0.125)

29 (10,0.06666667) (12,0.26666668) (18,0.06666667) (34,0.26666668) (56,0.13333334) (79,0.06666667) (92,0.13333334)

34 (10,0.083333336) (12,0.25) (18,0.083333336) (29,0.083333336) (56,0.25) (79,0.083333336) (92,0.16666667)

56 (10,0.1) (12,0.3) (29,0.1) (34,0.3) (92,0.2)

92 (10,0.33333334) (12,0.33333334) (34,0.33333334)

Hybrid approach – Output(Reducer2)

79 (12,0.2) (18,0.2) (34,0.2) (56,0.2) (92,0.2)

Comparison

Apache Spark

Write a Java program on Spark to calculate the total number of students who joined MUM in different entries. The program should display the total number of students by country.

Spark - Java Code
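The Spark program itself is missing from the transcript. As a stand-in, the sketch below reproduces the aggregation in plain Java collections so it can run without a cluster: each line is split into year, month, country and count, and counts are summed per country. In a Spark job the same two steps would typically be a `mapToPair` followed by `reduceByKey`; that mapping is an assumption about how the original was written, and `StudentsByCountrySketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java stand-in for the Spark job: sum student counts per country. */
class StudentsByCountrySketch {

    /** Each line is expected to look like "2014 Feb Nepal 20". */
    static Map<String, Integer> totalByCountry(List<String> lines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+"); // year, month, country, count
            String country = fields[2];
            int students = Integer.parseInt(fields[3]);
            // Same role as reduceByKey with an addition function.
            totals.merge(country, students, Integer::sum);
        }
        return totals;
    }
}
```

Only the country and count fields are used; year and month are ignored, which matches the totals listed on the output slide.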

Spark - input

2014 Feb Nepal 20

2014 Feb India 15

2014 Oct Italy 2

2014 July France 1

2015 Feb Nepal 10

2015 Feb India 25

2015 Oct Italy 7

Spark - Output

(France,1)

(Italy,9)

(Nepal,30)

(India,40)

Tools Used

• VMware Player Pro 7

• cloudera-quickstart-vm-5.4.0-0-vmware

• Eclipse Version: Luna Service Release 2 (4.4.2)

• Windows 8.1

Thank You
