python for practice nprg067...python for practice nprg067 tomas bures petr hnetynka ladislav pe ška...

CHARLES UNIVERSITY IN PRAGUE

http://d3s.mff.cuni.cz

faculty of mathematics and physics

Python for PracticeNPRG067

Tomas BuresPetr HnetynkaLadislav Peška

{bures,hnetynka}@[email protected]

2

Spark

Open-source cluster computing frameworkPart of Apache’s family of big-data tools

Runs as standalone or on top of Hadoop Yarn and Apache MesosCan access data stored in HDFS, Cassandra, Hbase, …

Main API in ScalaBinding to other languages Java, Python, R

3

Spark Architecture

Image from https://spark.apache.org/docs/latest/cluster-overview.html

4

Types of API

RDD (Resilient distributed dataset)Collections of serializable black-box objectsMore low-level operations

DataFramesTables of rows with schemaQueries mimic SQLQuery optimization before executionTwo main variants

SQL-like statements (strings)Query builder API

Seamless interoperability between RDDs and DataFramesNote that there is no DataSet API (as in Scala)

5

First Example

Exercise: e01

Uses the DataFrames API

6

Different APIs

Exercise: e03

“Show max 10 reviewers which have the salary above average. Sort them by ascending age.”

Examples:DataFrame APISQL APIRDD API

7

DataFrame APIQuery building functions

Projection: select, selectExpr, drop, withColumn, withColumnRenamed, agg, replaceSelection: filter, where Deduplication of rows: distinct, dropDuplicatesHandling null values: dropna, fillnaJoins: join, crossJoinSet operations: intersect, subtractGrouping: groupByOperations for online analytical processing (OLAP): cube, rollup, crosstabOrdering, limiting: limit, orderBy / sort, sortWithinPartitionsStatistics over the whole dataframe: approxQuantile, cov, corr, …

Actionscoalesce, repartitioncollect, take, first, headtoLocalIteratorcountforeach, foreachPartitionwrite.xxx

8

DataFrame APIRDD

rdd, toJSONExploratory analysis

describesample, sampleByrandomSplit

Debuggingshow, printSchemaexplain

Miscellaneous functionscache, persist

9

Aggregations

Exercise: e05

“Compute the sum of review lengths by every reviewer and make summary statistics of the lengths along with breakdown by gender and age”

10

Cube operator

Used in online analytical processingReturns aggregated data for different combinationsof aggregationsRuns faster thandoing the aggregationsone-by-one

Picture from https://blog.xlcubed.com/wp-content/uploads/2010/06/cube.png

11

Column Mapping

A number of functions pre-defined for operations over columns:

http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

User defined mappingVia transformation to RDD and then using RDD.mapOr via user-defined functions (UDF)

Exercise: e09

http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

12

Windows

Per-row functions in the context of a windowAka partition by

Window functions:row_number, rank, dense_rank, lag, lead, ntile, percent_rank, rank, cume_dist

Exercise: e11Exercise: e13

13

Determining Trends

“Is there a trend in time for the length of reviews per reviewer? E.g. does anyone tend to write longer and longer reviews?”

Exercise: e16

Math on the next two slides…

14

Simple Linear Regression

Simple Linear regression

15

Quality of Fit

Pearson’s correlation coefficientdetermines how dependent data are

16

Statistical Testing

“Check whether there is a statistical difference between the length of reviews written by man and women.”Welch’s t-testwith 𝜈𝜈 degreesof freedom

17

t-distribution

18

Structured columns

Columns may contain structured data or listsThey can be indexed with the usual Pythonic way

df.select(x.y.z)df.select(x[0][1])

Exercise: e19

python for practice nprg067...python for practice nprg067 tomas bures petr hnetynka ladislav pe ška...

Documents