Download - Pivoting Data with SparkSQL by Andrew Ray
![Page 1: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/1.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 1
Pivoting Data with SparkSQL
Andrew RaySenior Data EngineerSilicon Valley Data Science
![Page 2: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/2.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 2
CODE
git.io/vgy34(github.com/silicon-valley-data-science/spark-pivot-examples)
![Page 3: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/3.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 3
OUTLINE
• What’s a Pivot?• Syntax• Real world examples• Tips and Tricks• Implementation• Future work
git.io/vgy34
![Page 4: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/4.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 4
WHAT’S A PIVOT?
![Page 5: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/5.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 5
WHAT’S A PIVOT?
![Page 6: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/6.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 6
WHAT’S A PIVOT?
Group by A, pivot on B, and sum C
A B CG X 1G Y 2G X 3H Y 4H Z 5
A X Y ZG 4 2H 4 5
![Page 7: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/7.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 7
WHAT’S A PIVOT?
Group by A and BPivot on BA B C
G X 1G Y 2G X 3H Y 4H Z 5
A B CG X 4G Y 2H Y 4H Z 5
A X Y ZG 4 2H 4 5
![Page 8: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/8.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 8
WHAT’S A PIVOT?
Pivot on B (w/o agg.)Group by AA B C
G X 1G Y 2G X 3H Y 4H Z 5
A X Y ZG 1G 2G 3H 4H 5
A X Y ZG 4 2H 4 5
![Page 9: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/9.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 9
SYNTAX
![Page 10: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/10.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 10
SYNTAX
• Dataframe/table with columns A, B, C, and D.• How to
– group by A and B– pivot on C (with distinct values “small” and “large”)– sum of D
![Page 11: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/11.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 11
SYNTAX: COMPETITION
• pandas (Python)– pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)
• reshape2 (R)– dcast(df, A + B ~ C, sum)
• Oracle 11g– SELECT * FROM df PIVOT (sum(D) FOR C IN ('small', 'large')) p
![Page 12: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/12.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 12
SYNTAX: SPARKSQL
• Simple– df.groupBy("A", "B").pivot("C").sum("D")
• Explicit pivot values– df.groupBy("A", "B")
.pivot("C", Seq("small", "large"))
.sum("D")
![Page 13: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/13.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 13
PIVOT
• Added to DataFrame API in Spark 1.6– Scala– Java– Python– Not R L
![Page 14: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/14.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 14
REAL WORLD EXAMPLES
![Page 15: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/15.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 15
EXAMPLE 1: REPORTING
• Retail sales• TPC-DS dataset
– scale factor 1• Docker image:
docker run -it svds/spark-pivot-reporting
TPC Benchmark™ DS - Standard Specification, Version 2.1.0 Page 18 of 135
2.2.2.3 The implementation chosen by the test sponsor for a particular datatype definition shall be applied consistently to all the instances of that datatype definition in the schema, except for identifier columns, whose datatype may be selected to satisfy database scaling requirements.
2.2.3 NULLs
If a column definition includes an ‘N’ in the NULLs column this column is populated in every row of the table for all scale factors. If the field is blank this column may contain NULLs.
2.2.4 Foreign Key
If the values in this column join with another column, the foreign columns name is listed in the Foreign Key field of the column definition.
2.3 Fact Table Definitions
2.3.1 Store Sales (SS)
2.3.1.1 Store Sales ER-Diagram
2.3.1.2 Store Sales Column Definitions
Each row in this table represents a single lineitem for a sale made through the store channel and recorded in the store_sales fact table.
Table 2-1 Store_sales Column Definitions Column Datatype NULLs Primary Key Foreign Key ss_sold_date_sk identifier d_date_sk ss_sold_time_sk identifier t_time_sk ss_item_sk (1) identifier N Y i_item_sk ss_customer_sk identifier c_customer_sk ss_cdemo_sk identifier cd_demo_sk ss_hdemo_sk identifier hd_demo_sk ss_addr_sk identifier ca_address_sk ss_store_sk identifier s_store_sk ss_promo_sk identifier p_promo_sk
![Page 16: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/16.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 16
SALES BY CATEGORY AND QUARTER
sql("""select *, concat('Q', d_qoy) as qoyfrom store_salesjoin date_dim on ss_sold_date_sk = d_date_skjoin item on ss_item_sk = i_item_sk""").groupBy("i_category").pivot("qoy").agg(round(sum("ss_sales_price")/1000000,2)).show
![Page 17: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/17.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 17
SALES BY CATEGORY AND QUARTER
+-----------+----+----+----+----+| i_category| Q1| Q2| Q3| Q4|+-----------+----+----+----+----+| Books|1.58|1.50|2.84|4.66|| Women|1.41|1.36|2.54|4.16|| Music|1.50|1.44|2.66|4.36|| Children|1.54|1.46|2.74|4.51|| Sports|1.47|1.40|2.62|4.30|| Shoes|1.51|1.48|2.68|4.46|| Jewelry|1.45|1.39|2.59|4.25|| null|0.04|0.04|0.07|0.13||Electronics|1.56|1.49|2.77|4.57|| Home|1.57|1.51|2.79|4.60|| Men|1.60|1.54|2.86|4.71|+-----------+----+----+----+----+
![Page 18: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/18.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 18
EXAMPLE 2: FEATURE GENERATION
• MovieLens 1M Dataset– ~1M movie ratings– 6040 users– 3952 movies
• Predict gender based on ratings– Using 100 most popular movies
![Page 19: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/19.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 19
LOAD RATINGSval ratings_raw = sc.textFile("Downloads/ml-1m/ratings.dat")case class Rating(user: Int, movie: Int, rating: Int)val ratings = ratings_raw.map(_.split("::").map(_.toInt)).map(r => Rating(r(0),r(1),r(2))).toDF
ratings.show
+----+-----+------+|user|movie|rating|+----+-----+------+| 11| 1753| 4|| 11| 1682| 1|| 11| 216| 4|| 11| 2997| 4|| 11| 1259| 3|...
![Page 20: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/20.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 20
LOAD USERSval users_raw = sc.textFile("Downloads/ml-1m/users.dat")case class User(user: Int, gender: String, age: Int)val users = users_raw.map(_.split("::")).map(u => User(u(0).toInt, u(1), u(2).toInt)).toDF
val sample_users = users.where(expr("gender = 'F' or ( rand() * 5 < 2 )"))
sample_users.groupBy("gender").count().show
+------+-----+|gender|count|+------+-----+| F| 1709|| M| 1744|+------+-----+
![Page 21: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/21.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 21
PREP DATAval popular = ratings.groupBy("movie").count().orderBy($"count".desc).limit(100).map(_.get(0)).collect
val ratings_pivot = ratings.groupBy("user").pivot("movie", popular.toSeq).agg(expr("coalesce(first(rating),3)").cast("double"))
ratings_pivot.where($"user" === 11).show
+----+----+---+----+----+---+----+---+----+----+---+...|user|2858|260|1196|1210|480|2028|589|2571|1270|593|...+----+----+---+----+----+---+----+---+----+----+---+...| 11| 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| 3.0| 3.0|5.0|...+----+----+---+----+----+---+----+---+----+----+---+...
![Page 22: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/22.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 22
BUILD MODEL
val data = ratings_pivot.join(sample_users, "user").withColumn("label", expr("if(gender = 'M', 1, 0)").cast("double"))
val assembler = new VectorAssembler().setInputCols(popular.map(_.toString)).setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)
val model = pipeline.fit(training)
![Page 23: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/23.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 23
TEST
val res = model.transform(test).select("label", "prediction")
res.groupBy("label").pivot("prediction", Seq(1.0, 0.0)).count().show
+-----+---+---+|label|1.0|0.0|+-----+---+---+| 1.0|114| 74|| 0.0| 46|146|+-----+---+---+
Accuracy 68%
![Page 24: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/24.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 24
TIPS AND TRICKS
![Page 25: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/25.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 25
USAGE NOTES
• Specify the distinct values of the pivot column– Otherwise it does this:val values = df.select(pivotColumn).distinct().sort(pivotColumn).map(_.get(0)).take(maxValues + 1).toSeq
![Page 26: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/26.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 26
MULTIPLE AGGREGATIONS
df.groupBy("A", "B").pivot("C").agg(sum("D"), avg("D")).show
+---+---+------------+------------+------------+------------+| A| B|small_sum(D)|small_avg(D)|large_sum(D)|large_avg(D)|+---+---+------------+------------+------------+------------+|foo|two| 6| 3.0| null| null||bar|two| 6| 6.0| 7| 7.0||foo|one| 1| 1.0| 4| 2.0||bar|one| 5| 5.0| 4| 4.0|+---+---+------------+------------+------------+------------+
![Page 27: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/27.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 27
PIVOT MULTIPLE COLUMNS
• Merge columns and pivot as usualdf.withColumn(“p”, concat($”p1”, $”p2”)).groupBy(“a”, “b”).pivot(“p”).agg(…)
![Page 28: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/28.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 28
MAX COLUMNS
• spark.sql.pivotMaxValues– Default: 10,000– When doing a pivot without specifying values for the pivot
column this is the maximum number of (distinct) values that will be collected without error.
![Page 29: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/29.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 29
IMPLEMENTATION
![Page 30: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/30.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 30
PIVOT IMPLEMENTATION
• pivot is a method of GroupedData and returns GroupedData with PivotType.
• New logical operator: o.a.s.sql.catalyst.plans.logical.Pivot
• Analyzer rule: o.a.s.sql.catalyst.analysis.Analyzer.ResolvePivot– Currently translates logical pivot into an aggregation with
lots of if statements.
![Page 31: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/31.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 31
FUTURE WORK
![Page 32: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/32.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 32
FUTURE WORK
• Add to R API• Add to SQL syntax• Add support for unpivot• Faster implementation
![Page 33: Pivoting Data with SparkSQL by Andrew Ray](https://reader031.vdocuments.us/reader031/viewer/2022021922/587155b11a28ab8e5b8b517f/html5/thumbnails/33.jpg)
© 2016 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience | 33
Pivoting Data with SparkSQLAndrew [email protected]
We’re hiring!svds.com/careers
THANK YOU.git.io/vgy34