Post on 06-Jul-2020
CRASH COURSE PYTHON
Vrije Universiteit Amsterdam
This talk
• Not a programming course
• For data analysts who want to learn Python
• For optimizers who are fed up with Matlab
Python
• Scripting language
  • Expensive computations (such as matrix multiplication, optimization, classification) are typically done in compiled modules
  • Faster Python code: Numba's @jit construct (or Cython)
• Support for functions and OOP (classes, abstract classes, polymorphism, inheritance; but no enforced encapsulation)
• Direct competitors: R, Julia, Matlab
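The OOP bullet above can be illustrated with a minimal sketch. The class names Shape, Square and Circle are made up for this example and do not come from the talk. It shows an abstract class, inheritance, polymorphism, and the fact that a leading underscore marks an attribute as private by convention only.

```python
import math
from abc import ABC, abstractmethod


class Shape(ABC):
    """Abstract class: cannot be instantiated directly."""

    @abstractmethod
    def area(self):
        ...


class Square(Shape):
    def __init__(self, side):
        self._side = side  # leading underscore: "private" by convention only

    def area(self):
        return self._side ** 2


class Circle(Shape):
    def __init__(self, radius):
        self._radius = radius

    def area(self):
        return math.pi * self._radius ** 2


# Polymorphism: the same call works on every subclass
shapes = [Square(2), Circle(1)]
areas = [s.area() for s in shapes]
print(areas)

# No enforced encapsulation: the "private" attribute can still be changed
sq = shapes[0]
sq._side = 3
print(sq.area())
```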
Zen of Python
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
Python 2 or 3?
• 1994: Python 1
• 2000: Python 2 (backward compatible)
• 2008: Python 3
• Most pronounced difference:
  • Python 2: print "hello world!"
  • Python 3: print("hello world!")
• Strength of Python: broad availability of modules
• Many modules have been updated for Python 3
• Some people still use Python 2
Installing Python
• Windows users: use WinPython
  • Has MKL for fast linear algebra, and many preinstalled modules
  • Portable, so extract & go
  • Ships with the Spyder editor for coding and debugging
  • Also ships with a compiler for new modules
  • WinPython 3.4 is currently recommended (3.5 does not (yet) ship with a compiler)
• Mac users
  • OS X ships with Python 2.7 (and depends on it; do not "update" it to 3)
  • Python 3 can be installed alongside
• Linux
  • Ubuntu ships with both Python 2.7 and Python 3.4
  • Commands: python & python3
Installing modules
• Mac/Linux/POSIX-compatible systems: run pip from the terminal
  • e.g.: pip install cylp
• WinPython: run "WinPython Command Prompt.exe" and use pip
• For dependencies that require a shell script ("./configure"):
  • add the folder "winPython/share/mingwpy/bin" to the path
  • install msys from mingw.org
  • start msys (C:\MinGW\msys\1.0\msys.bat)
  • configure & compile the dependency
Running Python
• The editor probably has a hotkey (F5 in Spyder)
• Shell command: "python filename.py"
• Alternative: "python" (starts an interactive session that runs commands as they are entered)
Crash course
Data type   Initialize empty   Initialize with data
List        x = []             x = [1,2,5]
Tuple       -                  x = (1,2,5)
Set         x = set()          x = {1,2,5}
Dict        x = {}             x = {"one": 1, "two": 2, "five": 5}
String      x = ""             x = "hello world"
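A short sketch of how the data types in the table behave once created. The variable names are chosen for this example only. It also shows why the empty set is written set(): empty braces create a dict.

```python
x_list = [1, 2, 5]
x_tuple = (1, 2, 5)
x_set = {1, 2, 5}
x_dict = {"one": 1, "two": 2, "five": 5}

x_list.append(7)        # lists grow in place
x_set.add(2)            # sets ignore duplicates, so nothing changes
x_dict["seven"] = 7     # dicts map keys to values

empty = {}
print(type(empty))      # empty braces give a dict, not a set

# Tuples cannot be changed after creation
try:
    x_tuple[0] = 9
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(x_list, x_set, x_dict, tuple_changed)
```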
Precision
• Integers have arbitrary precision (limited only by memory)
• Floats have finite precision
  • use the decimal or mpmath modules for arbitrary precision
>>> print(2**1000)
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
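A minimal sketch of the difference between float and the standard-library decimal module mentioned above. The precision of 50 digits is an arbitrary choice for this example.

```python
from decimal import Decimal, getcontext

# Floats are binary with finite precision, so this comparison fails
float_equal = (0.1 + 0.2 == 0.3)
print(float_equal)

# Decimal works in base 10 with a configurable number of digits
getcontext().prec = 50
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
one_third = Decimal(1) / Decimal(3)
print(decimal_equal, one_third)

# Integers never overflow: 2**1000 is stored exactly
digits = len(str(2**1000))
print(digits)
```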
Creating a list
• range(n) is "list-like"
  • internally it is an object that can be converted to a list
  • range(int(1e10)) requires only a few bytes, whereas the equivalent list would need about 74.5 GB (8 bytes per element)
Code:
    x = [0,1,2,3,4,5,6,7,8,9,10]
    print(x)
    x = range(11)
    print(x)
    print(list(x))
Output:
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    range(0, 11)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
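The memory claim can be checked with sys.getsizeof. This is a sketch that assumes a 64-bit CPython build; exact byte counts vary between versions, so none are printed as expected output.

```python
import sys

lazy = range(int(1e10))

# The range object only stores start, stop and step
range_bytes = sys.getsizeof(lazy)
print(range_bytes)

# It still supports length, indexing and membership tests without a list
print(len(lazy))
last = lazy[-1]
contains = 5 * 10**9 in lazy
print(last, contains)

# A real list grows with the number of elements
small_list_bytes = sys.getsizeof(list(range(1000)))
print(small_list_bytes)
```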
Loops
• No curly braces or "end for"
• Structure is derived from level of indentation
• One statement per line
• No semicolons required
Code:
    for i in [1,2,3]:
        print(i)
    while i < 5:
        i += 1
        print(i)
Output:
    1
    2
    3
    4
    5
Functions
• All arguments can be passed by name: fun(name='group')
• Naming is useful for optional arguments
• The return statement is optional
Code:
    def fun(name, greeting='Hi', me='evil caterpillar'):
        print(greeting + ' ' + name + ', this is ' + me)
        return 0
    fun('group', me='Python')
Output:
    Hi group, this is Python
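A small sketch of the last two bullets. The function greet is a made-up variant of fun above. It has no return statement, so the call evaluates to None, and its arguments are passed by name in a different order than they were declared.

```python
def greet(name, greeting='Hi'):
    # No return statement: the function implicitly returns None
    print(greeting + ' ' + name)


# Named arguments may be given in any order
result = greet(greeting='Hello', name='group')
print(result)
```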
Functions
• Behavior depends on whether the type is mutable
  • variables are references to objects
  • objects are modified in place for mutable types only
• String, int, float, tuple are immutable
• List, set, dict are mutable
• "y = list(x)" creates a shallow copy (use y = copy.deepcopy(x) when x contains mutable data)
Code:
    def trick_me(a,b,c):
        a.append('o')   # modifies the caller's list in place
        b.append('o')   # b refers to the same list as a
        c += 1          # rebinds the local name only
    x = ['m','n']
    y = x
    z = 1
    trick_me(x,y,z)
    print(x,y,z)
Output:
    ['m', 'n', 'o', 'o'] ['m', 'n', 'o', 'o'] 1
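The difference between an alias, a shallow copy and a deep copy can be sketched as follows. The nested list is example data, not taken from the talk.

```python
import copy

x = [['m', 'n'], ['p']]

alias = x                  # same object as x
shallow = list(x)          # new outer list, shared inner lists
deep = copy.deepcopy(x)    # fully independent copy

x.append(['q'])            # visible through the alias only
x[0].append('o')           # visible through the alias and the shallow copy

print(alias)    # [['m', 'n', 'o'], ['p'], ['q']]
print(shallow)  # [['m', 'n', 'o'], ['p']]
print(deep)     # [['m', 'n'], ['p']]
```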
List comprehensions
Creating a list with squares: 0, 1, 4, …, 100
Naive code:
    x = []
    for i in range(11):
        x.append(i*i)
Idiomatic Python:
    x = [i*i for i in range(11)]
Creating a list of even numbers: 6, 8, 10, 12, 14
Naive code:
    x = []
    for i in range(6,15):
        if i % 2 == 0:
            x.append(i)
Idiomatic Python:
    x = [i for i in range(6,15) if i%2==0]
    # or
    x = list(range(6,15,2))
One-liner example
• Find the last ten digits of the series: 1^1 + 2^2 + 3^3 + ... + 1000^1000 (projecteuler.net)
• >>> print(str(sum([k**k for k in range(1,1001)]))[-10:])
  9110846700
• [k**k for k in range(1,1001)] creates the terms
• sum(.) takes the sum
• str(.) converts the argument to a string
• [-10:] takes the last ten characters
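The one-liner can be unrolled into the four steps listed above. The second half is an alternative that is not in the talk: it uses the three-argument form of the built-in pow to keep every intermediate number below 10**10.

```python
# Step by step, as in the one-liner
terms = [k**k for k in range(1, 1001)]   # create the terms
total = sum(terms)                       # take the sum
text = str(total)                        # convert to a string
last_ten = text[-10:]                    # keep the last ten characters
print(last_ten)  # 9110846700

# Alternative (an addition for illustration): modular arithmetic
modulus = 10**10
last_ten_mod = sum(pow(k, k, modulus) for k in range(1, 1001)) % modulus
print(str(last_ten_mod).zfill(10))
```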
Modules
• Matlab replacements
  • scipy (free, linear algebra)
  • matplotlib (free, graphing)
• Optimization
  • cylp (free, linear and mixed integer optimization)
  • pyipopt (free, convex optimization)
  • gurobi / cplex (academic license)
• Data mining
  • pandas (free, importing and slicing data)
  • scikit-learn (free, machine learning)
  • xgboost (free, gradient boosting)
  • takes less than 20 lines to create a cross-validated ensemble of classifiers
Recap
Example: function, for-loop, range, comment
    def take_sum(S):
        total = 0   # avoid the name "sum", which would shadow the built-in
        for i in S:
            total += i
        return total
    print(take_sum(range(7)))
    # outputs 21
Example: named arguments
    def fun(name, greeting='Hi', me='evil caterpillar'):
        print(greeting + ' ' + name + ', this is ' + me)
        return 0
    fun('group', me='Python')
Data mining
• Reading data with pandas
• Visualization with matplotlib
• Machine learning with scikit-learn
Reading data
• Pandas offers read_csv, read_excel, read_sql, read_json, read_html, read_sas, etc.
• read_* returns a pandas data structure: the DataFrame
• Having data in a DataFrame is useful
  • filtering, combining, grouping, sorting
  • to_csv, to_excel, etc. (for, e.g., converting csv to json)
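The csv-to-json conversion mentioned in the last bullet could look like the sketch below. It assumes pandas is installed and reads from an in-memory string instead of a file so that it is self-contained. The two data rows are made up.

```python
import io
import json

import pandas

# Stand-in for a csv file on disk (made-up rows)
csv_text = "id,feat_1,target\n1,1,1\n2,0,0\n"

df = pandas.read_csv(io.StringIO(csv_text))

# One JSON object per row
json_text = df.to_json(orient="records")
print(json_text)

records = json.loads(json_text)
```

With a real file, the StringIO object would be replaced by the filename, and to_json accepts a path as its first argument to write the result to disk.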
Example: reading csv file
Code:
    import pandas
    filename = 'train.csv'
    X = pandas.read_csv(filename, sep=",")
    # keep the target column separately, then drop it and the id from the features
    y = X.target
    X.drop(['target', 'id'], axis=1, inplace=True)
CSV file:
    id,feat_1,feat_2,feat_3,feat_4,feat_5,target
    1,1,0,0,0,0,1
    2,0,0,0,0,0,0
    3,0,0,0,0,0,0
    4,1,0,0,1,6,0
    5,0,0,0,0,0,1
Filtering data
Code:
    import pandas
    filename = 'train.csv'
    data = pandas.read_csv(filename)
    # select the rows at positions 1 and 2
    print(data[1:3])
Output:
       id  feat_1  feat_2  feat_3  feat_4  feat_5  target
    1   2       0       0       0       0       0       1
    2   3       0       0       0       0       0       1
Filtering data
Code:
    import pandas
    filename = 'train.csv'
    data = pandas.read_csv(filename)
    # keep only the rows where feat_1 equals 1
    print(data[data.feat_1 == 1])
Output:
       id  feat_1  feat_2  feat_3  feat_4  feat_5  target
    0   1       1       0       0       0       0       1
    3   4       1       0       0       1       6       1
Visualization
Code:
    # data is the DataFrame read from train.csv
    data[data.feat_2<=5].feat_2.plot(kind='hist')
    # since the data takes few distinct values:
    data[data.feat_2<=5].feat_2.value_counts().sort_index().plot(kind='bar')
Grouping
Code:
    import numpy as np
    pandas.set_option('display.precision',2)
    for feat_2_value,group in data.groupby('feat_2'):
        # group is the DataFrame data[data.feat_2 == feat_2_value]
        pass
    data.groupby('feat_2').aggregate(pandas.Series.nunique)
    # other aggregation functions: np.min, np.max, np.sum, np.std
Output:
            id     feat_1  feat_3  feat_4  feat_5  target
    feat_2
    0       55018  37      39      48      15      9
    1       4012   26      39      36      10      9
    2       1215   14      31      39      7       9
    3       549    9       24      27      7       7
    4       310    13      21      27      4       5
    5       170    5       10      13      3       6
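The grouping pattern above can be tried on a tiny made-up DataFrame. This is a sketch that assumes pandas is installed; the values are invented so that the result is easy to verify by hand.

```python
import pandas

data = pandas.DataFrame({
    "feat_1": [1, 0, 0, 0, 0],
    "feat_2": [0, 0, 1, 1, 1],
    "target": [1, 2, 2, 2, 3],
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs
group_sizes = {}
for feat_2_value, group in data.groupby("feat_2"):
    group_sizes[feat_2_value] = len(group)
print(group_sizes)

# Number of distinct values per column within each group
summary = data.groupby("feat_2").aggregate(pandas.Series.nunique)
print(summary)
```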
Example: time series
Code:
    import pandas
    import numpy as np
    # 1000 daily values of a random walk starting on 1 January 2000
    ts = pandas.Series(np.random.randn(1000),
                       index=pandas.date_range('1/1/2000', periods=1000))
    ts = ts.cumsum()
    ts.plot()
    print(ts.mean())
    # example output (the data is random, so the value differs per run): 28.642802230898678
Example: large data set
Suppose the csv file is 100 GB and has thousands of columns
A subset of three columns is manageable
Code:
    import pandas
    infile = 'train.csv'
    outfile = 'output.xlsx'
    chunks = []
    # chunksize is the number of rows to read per iteration
    for data in pandas.read_csv(infile, chunksize=100):
        chunks.append(data[['feat_1', 'feat_2', 'target']])
    # concatenate once at the end instead of once per chunk
    df = pandas.concat(chunks)
    with pandas.ExcelWriter(outfile) as writer:
        df.to_excel(writer, sheet_name='Sheet1')
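A self-contained variant of the chunked read, as a sketch. It assumes pandas is installed, uses an in-memory csv with made-up rows instead of a 100 GB file, and adds the usecols parameter of read_csv (not used in the talk) so that the unwanted columns are never loaded.

```python
import io

import pandas

# Made-up csv with 10 rows and one column we do not need
header = "id,feat_1,feat_2,feat_3,target\n"
rows = "".join(f"{i},{i % 2},{i % 3},{i % 5},{i % 4}\n" for i in range(1, 11))
csv_text = header + rows

wanted = ["feat_1", "feat_2", "target"]
chunks = []
# Read 4 rows at a time and keep only the wanted columns
for chunk in pandas.read_csv(io.StringIO(csv_text), usecols=wanted, chunksize=4):
    chunks.append(chunk)

df = pandas.concat(chunks, ignore_index=True)
print(len(chunks), df.shape)
```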
Logistic regression
Code:
    import pandas
    from sklearn import linear_model
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import log_loss
    filename = 'train.csv'
    X = pandas.read_csv(filename, sep=",")
    # binary target: class 1 becomes 0, all higher classes become 1
    y = (X.target > 1).astype(int)
    X.drop(['target', 'id'], axis=1, inplace=True)
    X,X_test,y,y_test = train_test_split(X, y, test_size=0.5)
    clf = linear_model.LogisticRegression()
    clf.fit(X,y)
    prediction = clf.predict_proba(X_test)
    print(log_loss(y_test,prediction))
    # output: 0.00159227347414; log_loss is in [0, 34.5]
    # 0 for "perfect fit", 0.7 for "constant p=0.5", 34.5 for "all wrong"
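The three reference values in the comment can be reproduced with a few lines of plain Python. This is a sketch of the binary log loss, not scikit-learn's implementation. The clipping constant eps=1e-15 is an assumption chosen because -ln(1e-15) is about 34.5, which matches the upper bound quoted above.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0); eps is an assumption
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [0, 1, 1, 0]

perfect = binary_log_loss(y_true, [0.0, 1.0, 1.0, 0.0])    # close to 0
constant = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])   # ln(2), about 0.69
all_wrong = binary_log_loss(y_true, [1.0, 0.0, 0.0, 1.0])  # about 34.5

print(perfect, constant, all_wrong)
```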