pig optimization and execution page 1 alan f. gates @alanfgates © hortonworks inc. 2011

Pig Optimization and Execution Page 1 Alan F. Gates @alanfgates Hortonworks Inc. 2011 Slide 2 Who Am I? Pig committer and PMC Member HCatalog committer and mentor Member of ASF and Incubator PMC Co-founder of Hortonworks Author of Programming Pig from OReilly Photo credit: Steven Guarnaccia, The Three Little Pigs Slide 3 Who Are You? 3 Slide 4 What Should We Optimize? Minimize scans Hadoop is still often I/O bound Minimize total number of MR jobs Minimize shuffle size and number of shuffles Avoid spills to disk Reduce or remove skew For small jobs, minimize start-up time 4 Slide 5 Pig Deployment User machine Hadoop Cluster Pig resides on user machine or gateway Job executes on cluster No server, all optimization and planning done on the launching machine Slide 6 Pig Guts (i.e. Pig Architecture), p. 1 6 A = LOAD myfile AS (x, y, z); B = GROUP A by x; C = FILTER B by group > 0; D = FOREACH C GENERATE group, COUNT(A); STORE D INTO output; Pig Latin Load Group Filter Foreach Store Logical Plan AST Semantic Checks Slide 7 Pig Guts, p. 2 7 Load Group Filter Foreach Store Logical Plan Load Filter Group Foreach Store Rule based optimizations Map Filter Rearrange Reduce Package Foreach MapReduce Plan Slide 8 Pig Guts, p. 3 8 Map Filter Rearrange Reduce Package Foreach MapReduce Plan Map Filter Rearrange Reduce Package Foreach Combine Foreach Physical optimizations Slide 9 It would be really cool if 9 Map Reduce Map Reduce Whats the right join algorithm here? Even with statistics it would be hard to know. Need on the fly execution plan rewrites. Slide 10 Memory Java + memory management = oil + water Java types inefficient memory users (~4x disk size) Very difficult to tell how much memory you are using Originally tried to monitor memory use via MXBeans: FAIL! Now estimate number of records we can hold in memory and spill when we exceed; allow user to tune guess 10 Slide 11 Reducing Spills to Disk Select Map size and io.sort.mb size such that 1 Map produces 1 Combiner Would be nice if Pig did this automatically Recent improvements: hash based aggregation in 0.10 11 Slide 12 Skew You are only as fast as your slowest reducer Data often power law distributed, means one reducer gets 10x+ the data of others Solution 1, use combiner whenever possible Solution 2, break rule that all records for a given key go to one reducer; works for order by and join 12 Slide 13 Reducing your Reducers Whenever possible use algorithms that can be done with no reduce Fragment-replicate join Merge join Collected group 13 Slide 14 (De)serialization Data moves between memory and disk often Need to highly optimize, more work to be done here Need to do lazy deserialization 14 Slide 15 Faster Job Startup Should be using the distributed cache for Pig jar and UDFs For small jobs could use LocalJobRunner Need to try Tenzing approach of having a few tasks spun up and waiting for small jobs 15 Slide 16 Improved Execution Models 16 Map Reduce Map Reduce This is unnecessary. Anything that can be done in this map can be pushed to the previous reduce. Need MR* Slide 17 Code Generation Currently Pig physical operators are packaged in jar and pieced together on the backend to construct the data pipeline Tenzing and others have tried generating code on the fly instead, have seen significant improvements Downside, need javac on client machine 17 Slide 18 Multi-store script A = load users as (name, age, gender, city, state); B = filter A by name is not null; C1 = group B by age, gender; D1 = foreach C1 generate group, COUNT(B); store D into bydemo; C2= group B by state; D2 = foreach C2 generate group, COUNT(B); store D2 into bystate; load users filter nulls group by state group by age, gender apply UDFs store into bystate store into bydemo Slide 19 Multi-Store Map-Reduce Plan map filter rearrange split rearrange reduce multiplex package foreach Slide 20 Hash Join Pages Users Users = load users as (name, age); Pages = load pages as (user, url); Jnd = join Users by name, Pages by user; Map 1 Pages block n Pages block n Map 2 Users block m Users block m Reducer 1 Reducer 2 (1, user) (2, name) (1, fred) (2, fred) (1, jane) (2, jane) Slide 21 Fragment Replicate Join Users = load users as (name, age); Pages = load pages as (user, url); Jnd = join Pages by user, Users by name using replicated; Pages Users Map 1 Map 2 Users Pages block 1 Pages block 1 Pages block 2 Pages block 2 Slide 22 Skew Join Pages Users Users = load users as (name, age); Pages = load pages as (user, url); Jnd = join Pages by user, Users by name using skewed; Map 1 Pages block n Pages block n Map 2 Users block m Users block m Reducer 1 Reducer 2 (1, user) (2, name) (1, fred, p1) (1, fred, p2) (2, fred) (1, fred, p3) (1, fred, p4) (2, fred) SPSP SPSP SPSP SPSP Slide 23 Merge Join Pages Users aaron. zach aaron. zach Users = load users as (name, age); Pages = load pages as (user, url); Jnd = join Pages by user, Users by name using merge; Map 1 Map 2 Users Pages aaron amr aaron amy barb amy Slide 24 Learn More Read the online documentation: http://pig.apache.org/ http://pig.apache.org/ Programming Pig from OReilly Press Join the mailing lists: [email protected] for user [email protected] [email protected] for developer [email protected] Follow me on Twitter, @alanfgates @alanfgates Slide 25 Questions? 25

pig optimization and execution page 1 alan f. gates @alanfgates © hortonworks inc. 2011

Documents