python for practice nprg067...python for practice nprg067 tomas bures petr hnetynka ladislav pe ška...

18
CHARLES UNIVERSITY IN PRAGUE http://d3s.mff.cuni.cz faculty of mathematics and physics Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Peška {bures,hnetynka}@d3s.mff.cuni.cz [email protected]

Upload: others

Post on 20-May-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

CHARLES UNIVERSITY IN PRAGUE

http://d3s.mff.cuni.cz

faculty of mathematics and physics

Python for PracticeNPRG067

Tomas BuresPetr HnetynkaLadislav Peška

{bures,hnetynka}@[email protected]

Page 2: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

2

Spark

Open-source cluster computing frameworkPart of Apache’s family of big-data tools

Runs as standalone or on top of Hadoop Yarn and Apache MesosCan access data stored in HDFS, Cassandra, Hbase, …

Main API in ScalaBinding to other languages Java, Python, R

Page 3: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

3

Spark Architecture

Image from https://spark.apache.org/docs/latest/cluster-overview.html

Page 4: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

4

Types of API

RDD (Resilient distributed dataset)Collections of serializable black-box objectsMore low-level operations

DataFramesTables of rows with schemaQueries mimic SQLQuery optimization before executionTwo main variants

SQL-like statements (strings)Query builder API

Seamless interoperability between RDDs and DataFramesNote that there is no DataSet API (as in Scala)

Page 5: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

5

First Example

Exercise: e01

Uses the DataFrames API

Page 6: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

6

Different APIs

Exercise: e03

“Show max 10 reviewers which have the salary above average. Sort them by ascending age.”

Examples:DataFrame APISQL APIRDD API

Page 7: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

7

DataFrame APIQuery building functions

Projection: select, selectExpr, drop, withColumn, withColumnRenamed, agg, replaceSelection: filter, where Deduplication of rows: distinct, dropDuplicatesHandling null values: dropna, fillnaJoins: join, crossJoinSet operations: intersect, subtractGrouping: groupByOperations for online analytical processing (OLAP): cube, rollup, crosstabOrdering, limiting: limit, orderBy / sort, sortWithinPartitionsStatistics over the whole dataframe: approxQuantile, cov, corr, …

Actionscoalesce, repartitioncollect, take, first, headtoLocalIteratorcountforeach, foreachPartitionwrite.xxx

Page 8: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

8

DataFrame APIRDD

rdd, toJSONExploratory analysis

describesample, sampleByrandomSplit

Debuggingshow, printSchemaexplain

Miscellaneous functionscache, persist

Page 9: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

9

Aggregations

Exercise: e05

“Compute the sum of review lengths by every reviewer and make summary statistics of the lengths along with breakdown by gender and age”

Page 10: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

10

Cube operator

Used in online analytical processingReturns aggregated data for different combinationsof aggregationsRuns faster thandoing the aggregationsone-by-one

Picture from https://blog.xlcubed.com/wp-content/uploads/2010/06/cube.png

Page 11: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

11

Column Mapping

A number of functions pre-defined for operations over columns:

http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

User defined mappingVia transformation to RDD and then using RDD.mapOr via user-defined functions (UDF)

Exercise: e09

Page 12: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

12

Windows

Per-row functions in the context of a windowAka partition by

Window functions:row_number, rank, dense_rank, lag, lead, ntile, percent_rank, rank, cume_dist

Exercise: e11Exercise: e13

Page 13: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

13

Determining Trends

“Is there a trend in time for the length of reviews per reviewer? E.g. does anyone tend to write longer and longer reviews?”

Exercise: e16

Math on the next two slides…

Page 14: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

14

Simple Linear Regression

Simple Linear regression

Page 15: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

15

Quality of Fit

Pearson’s correlation coefficientdetermines how dependent data are

Page 16: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

16

Statistical Testing

“Check whether there is a statistical difference between the length of reviews written by man and women.”Welch’s t-testwith 𝜈𝜈 degreesof freedom

Page 17: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

17

t-distribution

Page 18: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster

18

Structured columns

Columns may contain structured data or listsThey can be indexed with the usual Pythonic way

df.select(x.y.z)df.select(x[0][1])

Exercise: e19