crash course python - 3142.nl · crash course python . vrije universiteit amsterdam • not a...


CRASH COURSE PYTHON

Vrije Universiteit Amsterdam

• Not a programming course

• For data analysts, who want to learn Python

• For optimizers, who are fed up with Matlab

This talk

• Scripting language
• Expensive computations typically run in compiled modules, such as matrix multiplication, optimization, classification
• Faster Python code: Numba’s @jit construct (or Cython)

• Support for functions and OOP (classes, abstract classes, polymorphism, inheritance; but no enforced encapsulation: private members exist by convention only)

• Direct competitors: R, Julia, Matlab

Python

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.

Zen of Python

• 1994: Python 1
• 2000: Python 2 (backward compatible)
• 2008: Python 3

• Most pronounced difference:
    • Python 2: print "hello world!"
    • Python 3: print("hello world!")

• Strength of Python: broad availability of modules

• Many modules have been updated for Python 3

• Some people still use Python 2

Python 2 or 3?

• Windows users: use WinPython
    • Has MKL for fast linear algebra, and many preinstalled modules
    • Portable, so extract & go
    • Ships with the Spyder editor for coding and debugging
    • and a compiler for new modules
    • WinPython 3.4 is currently recommended (3.5 does not (yet) ship with a compiler)

• Mac users
    • OS X ships with Python 2.7 (and depends on it, do not “update” to 3)
    • Python 3 can be installed alongside

• Linux
    • Ubuntu ships with both Python 2.7 and Python 3.4
    • Commands: python & python3

Installing Python

• Mac/Linux/POSIX-compatible systems: run pip from the terminal

• e.g.: pip install cylp

• WinPython: run “WinPython Command Prompt.exe” and use pip

• For dependencies that require a shell script (“./configure”):
    • add the folder “winPython/share/mingwpy/bin” to the path
    • install msys from mingw.org
    • start msys (C:\MinGW\msys\1.0\msys.bat)
    • configure & compile the dependency

Installing modules

• The editor probably has a hotkey (F5 in Spyder)

• Shell command: “python filename.py”

• Alternative: “python” (runs commands as they are entered)

Running Python

Crash course

Data type   Initialize empty   Initialize with data
List        x = []             x = [1,2,5]
Tuple       -                  x = (1,2,5)
Set         x = set()          x = {1,2,5}
Dict        x = {}             x = {"one": 1, "two": 2, "five": 5}
String      x = ""             x = "hello world"

• Integers have arbitrary precision (limited only by memory)
• Floats have finite precision
    • use the decimal or mpmath modules for arbitrary precision

>>> print(2**1000)
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376

Precision
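The contrast between exact integers and finite-precision floats can be checked with a short sketch. It uses only the standard library; the precision of 50 digits is an arbitrary value picked for illustration.

```python
from decimal import Decimal, getcontext

# Integers are exact, however large they get.
big = 2**1000
print(len(str(big)))          # 302 decimal digits

# Floats are binary approximations, so small rounding errors appear.
print(0.1 + 0.2 == 0.3)       # False
print(0.1 + 0.2)              # 0.30000000000000004

# decimal works with a user-chosen number of significant digits.
getcontext().prec = 50        # 50 digits, chosen for illustration
third = Decimal(1) / Decimal(3)
print(third)                  # 0. followed by fifty 3s
```

mpmath offers similar functionality but is a third-party module, so it is left out of this sketch.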

• range(n) is “list-like”

• internally it is an object that can be converted to a list

• range(int(1e10)) requires a few bytes instead of 74.5 GB

Creating a list

Code:
    x = [0,1,2,3,4,5,6,7,8,9,10]
    print(x)
    x = range(11)
    print(x)
    print(list(x))

Output:
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    range(0, 11)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
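That a range stores no elements can be verified directly. The sketch below assumes CPython on a 64-bit system; the exact byte count reported by sys.getsizeof may differ between versions, so no fixed number is claimed.

```python
import sys

lazy = range(10**10)          # no elements are stored
print(len(lazy))              # 10000000000
print(sys.getsizeof(lazy))    # a few dozen bytes, independent of the length
print(lazy[10**9])            # indexing works without building a list: 1000000000
print(10**10 - 1 in lazy)     # membership is computed arithmetically: True

small = list(range(11))       # conversion to a real list is explicit
print(small)                  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```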

• No curly braces or “end for”
• Structure is derived from level of indentation
• One statement per line
• No semicolons required

Code:
    for i in [1,2,3]:
        print(i)
    while i < 5:
        i += 1
        print(i)

Output:
    1
    2
    3
    4
    5

• All arguments can be passed by name: fun(name='group')

• Naming is useful for optional arguments

• Return is optional (a function without a return statement returns None)

Functions

Code:
    def fun(name, greeting='Hi', me='evil caterpillar'):
        print(greeting + ' ' + name + ', this is ' + me)
        return 0

    fun('group', me='Python')

Output:
    Hi group, this is Python
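A small variation on the example above illustrates the three bullet points. The function no_return is a hypothetical helper added for illustration, and fun is changed to return its message instead of printing it so the result can be inspected.

```python
def fun(name, greeting='Hi', me='evil caterpillar'):
    # Builds the message instead of printing it, so it can be inspected.
    return greeting + ' ' + name + ', this is ' + me

def no_return(name):
    # No return statement: the call evaluates to None.
    print('Hello ' + name)

# Named arguments may be given in any order.
a = fun(me='Python', name='group')
b = fun('group', 'Hello')            # optional argument filled positionally
c = no_return('group')               # prints: Hello group
print(a)   # Hi group, this is Python
print(b)   # Hello group, this is evil caterpillar
print(c)   # None
```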

• Behavior depends on whether the type is mutable
    • variables are references to objects
    • memory gets overwritten in place for mutable types only

• String, int, float, tuple are immutable
• List, set, dict are mutable
• “y = list(x)” creates a shallow copy (use y = copy.deepcopy(x) when x contains mutable data)

Functions

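Code:
    def trick_me(a,b,c):
        a.append('o')   # mutates the list that a refers to
        b.append('o')   # b refers to the same list, so it grows again
        c += 1          # int is immutable: only the local name c is rebound

    x = ['m','n']
    y = x               # y is a second name for the same list
    z = 1
    trick_me(x,y,z)
    print(x,y,z)

Output:
    ['m', 'n', 'o', 'o'] ['m', 'n', 'o', 'o'] 1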
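The difference between an alias, a shallow copy and a deep copy becomes visible once the list contains mutable items. The following is a minimal sketch with made-up data; the variable names are chosen for illustration only.

```python
import copy

x = [['m', 'n'], ['p']]       # a list that contains mutable lists

alias = x                     # same object, no copy
shallow = list(x)             # new outer list, inner lists are shared
deep = copy.deepcopy(x)       # everything is duplicated

x.append(['q'])               # changes the outer list
x[0].append('o')              # changes an inner list

print(alias)    # [['m', 'n', 'o'], ['p'], ['q']]
print(shallow)  # [['m', 'n', 'o'], ['p']]
print(deep)     # [['m', 'n'], ['p']]
```

The shallow copy is shielded from the append on the outer list, but still sees the change inside the shared inner list.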

List comprehensions

Creating a list with squares: 0, 1, 4, …, 100

Naive code:
    x = []
    for i in range(11):
        x.append(i*i)

Idiomatic Python:
    x = [i*i for i in range(11)]

Creating a list of even numbers: 6, 8, 10, 12, 14

Naive code:
    x = []
    for i in range(6,15):
        if i % 2 == 0:
            x.append(i)

Idiomatic Python:
    x = [i for i in range(6,15) if i%2==0]
    x = [i for i in range(6,15,2)]
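The two patterns above (transforming and filtering) can be combined in one comprehension. As an illustration that is not part of the slides, this sketch builds the squares of the even numbers from 6 to 14 both ways and checks that the results agree.

```python
# Naive version: explicit loop with a condition.
naive = []
for i in range(6, 15):
    if i % 2 == 0:
        naive.append(i * i)

# Idiomatic version: expression first, then the loop, then the filter.
idiomatic = [i * i for i in range(6, 15) if i % 2 == 0]

print(idiomatic)            # [36, 64, 100, 144, 196]
print(naive == idiomatic)   # True
```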

• Find the last ten digits of the series: 1^1 + 2^2 + 3^3 + ... + 1000^1000 (projecteuler.net)

>>> print(str(sum([k**k for k in range(1,1001)]))[-10:])
9110846700

• [k**k for k in range(1,1001)] creates the terms
• sum(.) takes the sum
• str(.) converts the argument to a string
• [-10:] takes a substring

One-liner example
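The one-liner relies on arbitrary-precision integers to form the full sum. A possible alternative, shown here as a sketch and not taken from the slides, keeps every intermediate number below 10^10 by using the three-argument form of the built-in pow (modular exponentiation). Both variants give the same ten digits.

```python
MOD = 10**10   # keeping numbers modulo 10^10 preserves the last ten digits

# Variant from the slide, with a generator expression instead of a list.
full = str(sum(k**k for k in range(1, 1001)))[-10:]

# Modular variant: pow(k, k, MOD) never builds the huge number k**k.
modular = sum(pow(k, k, MOD) for k in range(1, 1001)) % MOD

print(full)                       # 9110846700
print(str(modular).zfill(10))     # 9110846700
```

The zfill call pads with leading zeros, which would matter if the result had fewer than ten digits.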

• Matlab replacements
    • scipy (free, linear algebra)
    • matplotlib (free, graphing)

• Optimization
    • cylp (free, linear and mixed integer optimization)
    • pyipopt (free, convex optimization)
    • gurobi / cplex (academic license)

• Data mining
    • pandas (free, importing and slicing data)
    • scikit-learn (free, machine learning)
    • xgboost (free, gradient boosting)
    • takes less than 20 lines to create a cross-validated ensemble of classifiers

Modules

Example: function, for-loop, range, comment

def take_sum(S):
    total = 0          # named total so the built-in sum() is not shadowed
    for i in S:
        total += i
    return total

print(take_sum(range(7)))
# outputs 21

Example: named arguments

def fun(name, greeting='Hi', me='evil caterpillar'):
    print(greeting + ' ' + name + ', this is ' + me)
    return 0

fun('group', me='Python')

• Reading data with pandas
• Visualization with matplotlib
• Machine learning with scikit-learn

Data mining

• Pandas offers read_csv, read_excel, read_sql, read_json, read_html, read_sas, etc

• read_* returns pandas data structure: DataFrame

• Having data in a DataFrame is useful
    • filtering, combining, grouping, sorting
    • to_csv, to_excel, etc. (for, e.g., converting csv to json)

Reading data
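The bullets above can be tried without any file on disk. This sketch assumes pandas is installed and feeds the small CSV sample from these slides through io.StringIO; the variable names are illustrative.

```python
import io
import pandas

csv_text = """id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1
"""

# io.StringIO lets read_csv treat a string as if it were a file.
data = pandas.read_csv(io.StringIO(csv_text))
print(data.shape)                       # (5, 7)

# Filtering and sorting return new DataFrames.
subset = data[data.feat_1 == 1].sort_values('feat_5', ascending=False)
print(subset.id.tolist())               # [4, 1]

# Converting csv to json: one JSON object per row.
json_text = data.to_json(orient='records')
print(json_text[:40])
```

Without a path argument, to_json returns the JSON as a string; passing a filename writes it to disk instead.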

Example: reading csv file

import pandas

filename = 'train.csv'
X = pandas.read_csv(filename, sep=",")
y = X.target                                     # the labels
X.drop(['target', 'id'], axis=1, inplace=True)   # keep only the features

CSV file

id,feat_1,feat_2,feat_3,feat_4,feat_5,target

1,1,0,0,0,0,1

2,0,0,0,0,0,0

3,0,0,0,0,0,0

4,1,0,0,1,6,0

5,0,0,0,0,0,1

Filtering data

data = pandas.read_csv(filename)
print(data[1:3])    # rows at positions 1 and 2

output:
   id  feat_1  feat_2  feat_3  feat_4  feat_5  target
1   2       0       0       0       0       0       1
2   3       0       0       0       0       0       1


Filtering data

data = pandas.read_csv(filename)
print(data[data.feat_1 == 1])

output:
   id  feat_1  feat_2  feat_3  feat_4  feat_5  target
0   1       1       0       0       0       0       1
3   4       1       0       0       1       6       1


Visualization

data[data.feat_2<=5].feat_2.plot(kind='hist')

# since the data takes few distinct values:

data[data.feat_2<=5].feat_2.value_counts().sort_index().plot(kind='bar')

Grouping

import numpy as np

pandas.set_option('display.precision', 2)

for feat_2_value, group in data.groupby('feat_2'):
    # group is the DataFrame data[data.feat_2 == feat_2_value]
    pass

data.groupby('feat_2').aggregate(pandas.Series.nunique)
# other aggregation functions: np.min, np.max, np.sum, np.std

output:
           id  feat_1  feat_3  feat_4  feat_5  target
feat_2
0       55018      37      39      48      15       9
1        4012      26      39      36      10       9
2        1215      14      31      39       7       9
3         549       9      24      27       7       7
4         310      13      21      27       4       5
5         170       5      10      13       3       6
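The grouping steps can be reproduced on a tiny data set. This sketch assumes pandas is installed and reuses the five-row CSV sample from the earlier slides; it groups by feat_1 instead of feat_2 because feat_2 is constant in that sample.

```python
import io
import pandas

csv_text = """id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1
"""
data = pandas.read_csv(io.StringIO(csv_text))

# Iterating over a groupby yields (key, sub-DataFrame) pairs.
sizes = {}
for feat_1_value, group in data.groupby('feat_1'):
    sizes[int(feat_1_value)] = len(group)
print(sizes)                            # {0: 3, 1: 2}

# Number of distinct values per column within each group.
counts = data.groupby('feat_1').aggregate(pandas.Series.nunique)
print(counts.loc[0, 'target'])          # 2

# Aggregating a single column.
sums = data.groupby('feat_1')['feat_5'].sum()
print(sums[1])                          # 6
```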

Example: time series

import pandas

import numpy as np

ts = pandas.Series(np.random.randn(1000),
                   index=pandas.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()     # random walk: running sum of the random steps
ts.plot()
print(ts.mean())
# output (varies per run, since the data is random): 28.642802230898678

Suppose the csv file is 100 GB and has thousands of columns. A subset of three columns is manageable.

Example: large data set

import pandas

infile = 'train.csv'
outfile = 'output.xlsx'

chunks = []
# chunksize is the number of rows to read per iteration
for data in pandas.read_csv(infile, chunksize=100):
    # keep only the three columns of interest from each chunk
    chunks.append(data[['feat_1', 'feat_2', 'target']])
df = pandas.concat(chunks)

# the with-block saves and closes the workbook on exit
with pandas.ExcelWriter(outfile) as writer:
    df.to_excel(writer, sheet_name='Sheet1')

Logistic regression

import pandas
from sklearn import linear_model
# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

filename = 'train.csv'
X = pandas.read_csv(filename, sep=",")
# binary target: class 1 becomes 0, all higher classes become 1
y = (X.target > 1).astype(int)
X.drop(['target', 'id'], axis=1, inplace=True)

X, X_test, y, y_test = train_test_split(X, y, test_size=0.5)
clf = linear_model.LogisticRegression()
clf.fit(X, y)
prediction = clf.predict_proba(X_test)
print(log_loss(y_test, prediction))
# output: 0.00159227347414; log_loss is in [0, 34.5]
# 0 for "perfect fit", 0.7 for "constant p=0.5", 34.5 for "all wrong"
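The three reference values in the comment (0, 0.7 and 34.5) follow from the definition of log loss. The sketch below is a hand-written version for binary labels, not scikit-learn's implementation. The function name log_loss_manual and the clipping constant eps=1e-15 are assumptions, the latter chosen because -log(1e-15) is about 34.5.

```python
import math

def log_loss_manual(y_true, p_pred, eps=1e-15):
    # Mean negative log-likelihood of binary labels.
    # Probabilities are clipped to [eps, 1 - eps] so log(0) never occurs;
    # eps=1e-15 is an assumption chosen to reproduce the 34.5 bound.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [0, 1, 1, 0]
constant = log_loss_manual(y_true, [0.5, 0.5, 0.5, 0.5])
all_wrong = log_loss_manual(y_true, [1.0, 0.0, 0.0, 1.0])
perfect = log_loss_manual(y_true, [0.0, 1.0, 1.0, 0.0])

print(round(constant, 3))    # 0.693, i.e. ln(2)
print(round(all_wrong, 1))   # 34.5
print(perfect < 1e-12)       # True
```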
