RESTORE IMPLEMENTATION as an extension to pig Vijay S

Upload: clifton-wade

Post on 03-Jan-2016



Outline

Overview of the Pig Query Compiler

Implementation of ReStore

Experiments


Overview of the Pig Query Compiler

(1) A parser syntactically checks the input query and transforms it into a logical plan, which is a directed acyclic graph (DAG) of logical operators.

(2) A logical optimizer applies optimization rules to this logical plan.

(3) A MapReduce compiler transforms the logical plan into a physical plan and then compiles it into a series of MapReduce jobs, which form a workflow.


Overview of the Pig Query Compiler - Continued

(4) A MapReduce optimizer applies rules to reduce the number of MapReduce jobs in the workflow.

(5) The Hadoop job manager submits the jobs in the workflow to Hadoop for execution, taking into account the dependencies between them.
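
Step (5) can be pictured as a topological ordering of the workflow. The following is a minimal sketch, not Pig's actual job manager: the job names and the `submit` callback are hypothetical, and the code only illustrates submitting each job after the jobs it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def submission_order(dependencies):
    """Return job names so that every job comes after its prerequisites.

    `dependencies` maps a job name to the set of jobs whose output it reads.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, submit):
    """Call `submit(job)` once per job, prerequisites first."""
    order = submission_order(dependencies)
    for job in order:
        submit(job)
    return order


# Hypothetical three-job workflow: job3 reads the outputs of job1 and job2.
workflow = {"job1": set(), "job2": set(), "job3": {"job1", "job2"}}
submitted = []
run_workflow(workflow, submitted.append)
```

Here `submitted` ends with `job3`, because it cannot start until both of its inputs exist.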


Overview of the Pig Query Compiler - Continued

JobControlCompiler is a component of Pig's Hadoop job manager.

Its input is a workflow of MapReduce jobs.

After all the MapReduce jobs in the workflow have finished executing, the intermediate outputs produced by the jobs are deleted.


Implementation of ReStore

The input of ReStore is a workflow of MapReduce jobs. Every physical plan of these jobs passes through two stages: (1) matching with plans in the repository, and (2) generating candidate sub-jobs.

The repository is implemented as a table that contains in every record: (1) a physical plan of a MapReduce job, (2) the filename of the output of this job in HDFS, and (3) statistics about this job.
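
The repository record and the two stages can be sketched as follows. This is an illustration under stated assumptions, not the ReStore code: a physical plan is encoded as a plain tuple of operator names, matching is exact equality on that tuple, and candidate sub-jobs are simply the proper prefixes of the plan. All class, function and file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RepositoryEntry:
    physical_plan: tuple   # (1) physical plan of a MapReduce job (hypothetical encoding)
    output_filename: str   # (2) filename of the job's output in HDFS
    statistics: dict = field(default_factory=dict)  # (3) statistics about the job


class Repository:
    """A table of entries, kept here as a simple in-memory list."""

    def __init__(self):
        self._entries = []

    def add(self, physical_plan, output_filename, statistics=None):
        entry = RepositoryEntry(tuple(physical_plan), output_filename,
                                dict(statistics or {}))
        self._entries.append(entry)
        return entry

    def match(self, physical_plan):
        """Stage 1 (simplified): return the entry with an identical plan, or None."""
        plan = tuple(physical_plan)
        for entry in self._entries:
            if entry.physical_plan == plan:
                return entry
        return None


def candidate_sub_jobs(physical_plan):
    """Stage 2 (simplified): treat every proper prefix of the plan as a candidate."""
    plan = tuple(physical_plan)
    return [plan[:i] for i in range(1, len(plan))]


# Hypothetical plan and output path, for illustration only.
repo = Repository()
plan = ("load", "filter", "group", "count")
repo.add(plan, "/restore/out_0001", {"input_bytes": 1000, "output_bytes": 10})

hit = repo.match(["load", "filter", "group", "count"])  # same plan: reuse the stored output
miss = repo.match(("load", "group"))                    # unknown plan: nothing to reuse
candidates = candidate_sub_jobs(plan)
```

A hit means the stored output file can be read instead of re-running the job; the statistics field is where information for deciding what is worth keeping would go.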


Experiments

Reusing the Output of Whole Jobs (7.1)

Reusing the Output of Sub-Jobs (7.2)

Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)

Reusing Sub-Jobs vs. Whole Jobs (7.4)

Effect of Data Reduction (7.5)


Reusing the Output of Whole Jobs (7.1)

Job execution time for queries is much lower when the outputs of whole jobs are reused than with no data reuse. (L3, L11 - PigMix)

Example: L2-L8 and L11 (Join, Group, Co-Group, Filter, Distinct and Union); L3, L11 - PigMix


Reusing the Output of Sub-Jobs (7.2)

Job execution time for queries is further reduced by reusing the output of sub-jobs, compared to no data reuse and to generating sub-jobs.

Example: L2-L8 and L11 (Join, Group, Co-Group, Filter, Distinct and Union); L3, L11 - PigMix


Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)

Job execution time for queries is further reduced by reusing the output of sub-jobs, compared to no data reuse and to generating sub-jobs.

Example: L2-L8 and L11 (Join, Group, Co-Group, Filter, Distinct and Union); L3, L11 - PigMix


Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)

Total size of input data loaded by different queries:

Q      I/P (GB)   HC (GB)   HA (GB)   NH (GB)   O/P
L2     150.6      3.1       3.1       6.7       1.1 MB
L3     150.7      3.2       8.2       22.1      62.9 MB
L4     150.6      2         2.8       10.8      34.2 MB
L5     150.7      1.8       4.6       7.4       2 B
L6     150.6      3.7       10.1      24.3      92.7 MB
L7     150.6      2.2       5.4       5.4       1.5 MB
L8     150.6      3.3       3.3       11.4      27 B
L11    173.6      2.6       2.7       2.8       1.6 GB


Reusing Sub-Jobs vs. Whole Jobs (7.4)

Field name   Cardinality   % Selected Data
field6       200           0.5%
field7       100           1%
field8       20            5%
field9       10            10%
field10      5             20%
field11      2             50%
field12      1.6           60%
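
The percentages in this table are consistent with an equality filter keeping roughly one value out of each field's distinct values. The sketch below checks that arithmetic under the assumption of uniformly distributed values, which the slides do not state; the function name is hypothetical.

```python
def percent_selected(cardinality):
    """Percentage of rows an equality filter keeps, assuming uniform values."""
    return 100.0 / cardinality


# (field name, cardinality, percentage listed in the table)
listed = [
    ("field6", 200, 0.5),
    ("field7", 100, 1.0),
    ("field8", 20, 5.0),
    ("field9", 10, 10.0),
    ("field10", 5, 20.0),
    ("field11", 2, 50.0),
    ("field12", 1.6, 60.0),
]

estimates = {name: percent_selected(card) for name, card, _ in listed}

# Difference between the estimate and the listed percentage, per field.
gaps = {name: estimates[name] - pct for name, _, pct in listed}
```

For field6 through field11 the estimate equals the listed value. For field12 it gives 62.5 against the listed 60%, so the relationship is only approximate there.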


Reusing Sub-Jobs vs. Whole Jobs (7.4)

Figure: overhead and speedup of different jobs. The dark line is the speedup.


Effect of Data Reduction (7.5)

Figure: overhead and speedup of different jobs with filter operators.


Effect of Data Reduction (7.5) - Continued

Query Template QP

A = load '$synth_data' as (field1, ..., field12);
B = foreach A generate field1, ...;
C = group B by (field1, ...);
D = foreach C generate COUNT($1);
store D into '$out';


Effect of Data Reduction (7.5) - Continued

Query Template QF

A = load '$synth_data' as (field1, ..., field12);
B = filter A by $fieldi == $val;
C = group B by field1;
D = foreach C generate COUNT($1);
store D into '$out';


Related Work

The paper addresses challenges posed by MapReduce, such as massive data sizes and the procedural nature of the query language.

Other work: materialized views and MRShare.