Data-driven code analysis: Learning from other's mistakes Andreas Dewes (@japh44) [email protected] 13.04.2015 PyCon 2015 – Montreal


Page 1: Learning from other's mistakes: Data-driven code analysis

Data-driven code analysis: Learning from other's mistakes

Andreas Dewes (@japh44)

[email protected]

13.04.2015

PyCon 2015 – Montreal

Page 2: Learning from other's mistakes: Data-driven code analysis

About

Physicist and Python enthusiast

CTO of a spin-off of the University of Munich (LMU).

We develop software for data-driven code analysis.

Page 3: Learning from other's mistakes: Data-driven code analysis

Our mission

Page 4: Learning from other's mistakes: Data-driven code analysis

Tools & Techniques for Ensuring Code Quality

Static, automated: static analysis / automated code reviews

Static, manual: manual code reviews

Dynamic, automated: unit testing, system testing, integration testing

Dynamic, manual: debugging, profiling, ...

Page 5: Learning from other's mistakes: Data-driven code analysis

Discovering problems in code

    def encode(obj):
        """Encode a (possibly nested) dictionary containing complex values
        into a form that can be serialized using JSON."""
        e = {}
        for key, value in obj:
            if isinstance(value, dict):
                e[key] = encode(value)
            elif isinstance(value, complex):
                e[key] = {'type': 'complex',
                          'r': value.real,
                          'i': value.imaginary}
        return e

    d = {'a': 1j+4, 's': {'d': 4+5j}}
    print encode(d)

Iterating over obj yields only the keys of the dictionary (obj.items() is needed).

value.imaginary does not exist (value.imag would be correct).
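With the two problems noted on this slide fixed, the function might look as follows. This is a sketch in Python 3 syntax (the slides use the Python 2 print statement). Like the original, it silently drops values that are neither dictionaries nor complex numbers.

```python
def encode(obj):
    """Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON."""
    e = {}
    for key, value in obj.items():  # fix 1: iterate over (key, value) pairs
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}  # fix 2: the attribute is .imag
    return e


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(encode(d))
```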

Page 6: Learning from other's mistakes: Data-driven code analysis

Dynamic Analysis (e.g. unit testing)


    def test_encode():
        d = {'a': 1j+4,
             's': {'d': 4+5j}}
        r = encode(d)  # this will fail...
        assert r['a'] == {'type': 'complex', 'r': 4, 'i': 1}
        assert r['s']['d'] == {'type': 'complex', 'r': 4, 'i': 5}
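A dynamic check reports one problem per run: the call stops at the first bug, and the second stays hidden until the first is fixed. The sketch below illustrates this in Python 3. The names encode_buggy, encode_half_fixed and first_error are introduced here for illustration and do not appear in the talk.

```python
def encode_buggy(obj):
    """The function from the slide, with both bugs left in."""
    e = {}
    for key, value in obj:  # bug 1: iterating a dict yields keys only
        if isinstance(value, dict):
            e[key] = encode_buggy(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex', 'r': value.real, 'i': value.imaginary}
    return e


def encode_half_fixed(obj):
    """Same function with bug 1 fixed, which makes bug 2 reachable."""
    e = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            e[key] = encode_half_fixed(value)
        elif isinstance(value, complex):
            # bug 2: complex numbers have .imag, not .imaginary
            e[key] = {'type': 'complex', 'r': value.real, 'i': value.imaginary}
    return e


def first_error(func, arg):
    """Return the name of the exception raised by func(arg), or None."""
    try:
        func(arg)
    except Exception as exc:
        return type(exc).__name__
    return None


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(first_error(encode_buggy, d))       # error caused by bug 1
print(first_error(encode_half_fixed, d))  # error caused by bug 2
```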

Page 7: Learning from other's mistakes: Data-driven code analysis

Static Analysis (for humans)

encode is a function with 1 parameter which always returns a dict.

(I): obj should be an iterator/list of tuples with two elements.

encode gets called with a dict, which does not satisfy (I).

A value of type complex does not have an .imaginary attribute!

encode is called with a dict, which again does not satisfy (I).


Page 8: Learning from other's mistakes: Data-driven code analysis

How static analysis tools work (short version)

1. Compile the code into a data structure, typically an abstract syntax tree (AST).

2. (Optionally) annotate it with additional information to make analysis easier.

3. Traverse the (annotated) AST to find problems.
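The three steps can be sketched with Python's standard ast module. This is a minimal illustration, not how PyLint or any other tool is implemented. The check is purely name-based (it has no type information), and enclosing_function is a helper invented for this example.

```python
import ast

SOURCE = '''
def encode(obj):
    e = {}
    for key, value in obj:
        if isinstance(value, complex):
            e[key] = {'r': value.real, 'i': value.imaginary}
    return e
'''

# Step 1: compile the source text into an abstract syntax tree.
tree = ast.parse(SOURCE)

# Step 2: annotate every node with a link to its parent node.
for parent in ast.walk(tree):
    for child in ast.iter_child_nodes(parent):
        child.parent = parent


def enclosing_function(node):
    """Follow the parent links up to the nearest function definition."""
    while node is not None and not isinstance(node, ast.FunctionDef):
        node = getattr(node, 'parent', None)
    return node.name if node is not None else '<module>'


# Step 3: traverse the tree and collect suspicious nodes.
findings = [
    (node.lineno, enclosing_function(node), node.attr)
    for node in ast.walk(tree)
    if isinstance(node, ast.Attribute) and node.attr == 'imaginary'
]
print(findings)
```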

Page 9: Learning from other's mistakes: Data-driven code analysis

Python Tools for Static Analysis

PyLint (most comprehensive tool): http://www.pylint.org/

PyFlakes (smaller, less verbose): https://pypi.python.org/pypi/pyflakes

Pep8 (style and some structural checks): https://pypi.python.org/pypi/pep8

(... and many others)

Page 10: Learning from other's mistakes: Data-driven code analysis

Limitations of current tools & technologies

Page 11: Learning from other's mistakes: Data-driven code analysis

Checks are hard to create / modify... (example: PyLint code for analyzing 'try/except' statements)

Page 12: Learning from other's mistakes: Data-driven code analysis

Long feedback cycles

Page 13: Learning from other's mistakes: Data-driven code analysis

Rethinking code analysis for Python

Page 14: Learning from other's mistakes: Data-driven code analysis

Our approach

1. Code is data! Let's not keep it in text files but store it in a useful form that we can work with easily (e.g. a graph).

2. Make it super-easy to specify errors and bad code patterns.

3. Make it possible to learn from user feedback and publicly available code.

Page 15: Learning from other's mistakes: Data-driven code analysis

Building the Code Graph

    def encode(obj):
        """Encode a (possibly nested) dictionary containing complex values
        into a form that can be serialized using JSON."""
        e = {}
        for key, value in obj:
            if isinstance(value, dict):
                e[key] = encode(value)
            elif isinstance(value, complex):
                e[key] = {'type': 'complex',
                          'r': value.real,
                          'i': value.imaginary}
        return e

    d = {'a': 1j+4, 's': {'d': 4+5j}}
    print encode(d)

Page 16: Learning from other's mistakes: Data-driven code analysis

Building the Code Graph

[Figure: the encode example next to its code graph, with functiondef, for, assign, name and dict nodes connected by edges labeled body, targets, value and iterator.]

Page 17: Learning from other's mistakes: Data-driven code analysis

Building the Code Graph

[Figure: the same graph of functiondef, for, assign, name and dict nodes, now with data attached: {name: 'encode', args: [...]}, {id: 'e'}, the indices {i: 0} and {i: 1}, a $type: dict annotation, and a hash per node (e4fa76b..., a76fbc41..., c51fa291..., 74af219...).]
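One way to picture such a graph is a nested dictionary per syntax node, each carrying a hash of its content. The sketch below builds this from Python's ast module. The function to_graph, the lowercase node_type names and the SHA-1-over-JSON hashing scheme are assumptions made for this example; the slides do not say how the hashes are computed.

```python
import ast
import hashlib
import json


def to_graph(node):
    """Convert an ast node into a nested dict that carries a content hash."""
    data = {'node_type': type(node).__name__.lower()}
    for field, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            data[field] = to_graph(value)
        elif isinstance(value, list):
            data[field] = [to_graph(v) if isinstance(v, ast.AST) else v
                           for v in value]
        else:
            data[field] = value
    # Line and column numbers are not part of the fields, so the hash
    # depends only on the structure and data of the node and its children.
    payload = json.dumps(data, sort_keys=True, default=repr)
    data['hash'] = hashlib.sha1(payload.encode('utf-8')).hexdigest()
    return data


a = to_graph(ast.parse("e = {}"))
b = to_graph(ast.parse("\n\ne   =   {}"))  # same code, different layout
c = to_graph(ast.parse("f = {}"))          # different target name

print(a['body'][0]['node_type'])
print(a['hash'] == b['hash'], a['hash'] == c['hash'])
```

Because each hash covers the whole subtree, two subtrees with equal hashes are structurally identical regardless of formatting or position in the file.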

Page 18: Learning from other's mistakes: Data-driven code analysis

Example: Tornado Project

10 modules from the tornado project

[Figure: code graph of the modules; the legend distinguishes modules, classes and functions.]

Page 19: Learning from other's mistakes: Data-driven code analysis

Advantages

- Simple detection of (exact) duplicates

- Semantic diffing of modules, classes, functions, ...

- Semantic code search on the whole tree
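As an illustration of the first point, exact duplicates can be found by hashing a canonical form of each function and grouping equal hashes. This sketch uses ast.dump from the standard library as the canonical form. The sample functions and the helper body_hash are made up for the example and are not taken from the talk.

```python
import ast
import hashlib
from collections import defaultdict

SOURCE = '''
def area(w, h):
    return w * h

def rect_size(w, h):
    return w * h

def volume(w, h, d):
    return w * h * d
'''


def body_hash(func):
    """Hash a function's arguments and body, ignoring its name and position."""
    parts = [ast.dump(func.args)] + [ast.dump(stmt) for stmt in func.body]
    return hashlib.sha1('\n'.join(parts).encode('utf-8')).hexdigest()


# Group function names by the hash of their content.
groups = defaultdict(list)
for node in ast.walk(ast.parse(SOURCE)):
    if isinstance(node, ast.FunctionDef):
        groups[body_hash(node)].append(node.name)

duplicates = sorted(sorted(names) for names in groups.values() if len(names) > 1)
print(duplicates)
```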

Page 20: Learning from other's mistakes: Data-driven code analysis

Describing Code Errors / Anti-Patterns

Page 21: Learning from other's mistakes: Data-driven code analysis

Code issues = patterns on the graph

[Figure: graph pattern for the expression value.imaginary in the encode example: an attribute node with a value edge to a name node {id: value} whose $type is complex, and an attr edge to a name node {id: imaginary}.]

Page 22: Learning from other's mistakes: Data-driven code analysis

Using YAML to describe graph patterns


    node_type: attribute
    value:
      $type: complex
    attr: imaginary

Page 23: Learning from other's mistakes: Data-driven code analysis

Generalizing patterns


    node_type: attribute
    value:
      $type: complex
    attr:
      $not:
        $or: [real, imag]
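How such a pattern could be evaluated against a node is sketched below. The pattern is written as a Python dict instead of YAML to keep the example free of dependencies, and it uses imag, the actual attribute name of complex. The matching semantics of $not and $or, the treatment of $type as an ordinary key on the node, and the function name matches are assumptions made for this sketch, not the tool's actual behavior.

```python
def matches(pattern, node):
    """Return True if `node` (a value or nested dict) satisfies `pattern`."""
    if isinstance(pattern, dict):
        if '$not' in pattern:
            return not matches(pattern['$not'], node)
        if '$or' in pattern:
            return any(matches(alt, node) for alt in pattern['$or'])
        # Plain dict pattern: every key must be present on the node and match.
        return isinstance(node, dict) and all(
            key in node and matches(sub, node[key])
            for key, sub in pattern.items()
        )
    return pattern == node  # leaf: compare literal values


pattern = {
    'node_type': 'attribute',
    'value': {'$type': 'complex'},
    'attr': {'$not': {'$or': ['real', 'imag']}},
}

# Hand-written graph nodes for value.imaginary and value.imag.
complex_name = {'node_type': 'name', 'id': 'value', '$type': 'complex'}
bad = {'node_type': 'attribute', 'value': complex_name, 'attr': 'imaginary'}
good = {'node_type': 'attribute', 'value': complex_name, 'attr': 'imag'}

print(matches(pattern, bad), matches(pattern, good))
```

Note that a pattern this general would also flag value.conjugate, which is a valid attribute of complex, so a real check would need a longer list of allowed names.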

Page 24: Learning from other's mistakes: Data-driven code analysis

Learning from feedback / false positives

Page 25: Learning from other's mistakes: Data-driven code analysis

"else" in for loop without break statement

    node_type: for
    body:
      $not:
        $anywhere:
          node_type: break
    orelse:
      $anything: {}

    values = ["foo", "bar", ... ]

    for i, value in enumerate(values):
        if value == 'baz':
            print "Found it!"
    else:
        print "didn't find 'baz'!"
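The same check can be approximated directly on Python's ast, which helps to see what $anywhere and orelse refer to. This is a sketch using the standard library only. The function names are made up, and the example loop uses Python 3's print function.

```python
import ast

SOURCE = '''
for i, value in enumerate(values):
    if value == 'baz':
        print("Found it!")
else:
    print("didn't find 'baz'!")
'''


def contains(nodes, node_type):
    """$anywhere: is there a node of `node_type` at any depth below `nodes`?"""
    return any(isinstance(sub, node_type)
               for node in nodes for sub in ast.walk(node))


def for_else_without_break(tree):
    """Return the line numbers of for loops with an else branch but no break."""
    return [node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.For)
            and node.orelse                         # orelse: $anything
            and not contains(node.body, ast.Break)]  # body: $not $anywhere break


print(for_else_without_break(ast.parse(SOURCE)))
```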

Page 26: Learning from other's mistakes: Data-driven code analysis

Learning from false positives (I)

    values = ["foo", "bar", ... ]

    for i, value in enumerate(values):
        if value == 'baz':
            print "Found it!"
            return value
    else:
        print "didn't find 'baz'!"

    node_type: for
    body:
      $not:
        $or:
          - $anywhere:
              node_type: break
          - $anywhere:
              node_type: return
    orelse:
      $anything: {}

Page 27: Learning from other's mistakes: Data-driven code analysis

Learning from false positives (II)

    node_type: for
    body:
      $not:
        $or:
          - $anywhere:
              node_type: break
              exclude:
                node_type:
                  $or: [while, for]
          - $anywhere:
              node_type: return
    orelse:
      $anything: {}

    values = ["foo", "bar", ... ]

    for i, value in enumerate(values):
        if value == 'baz':
            print "Found it!"
        for j in ...:
            #...
            break
    else:
        print "didn't find 'baz'!"
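Both refinements can be mirrored in a hand-written check: a break only counts if it is not inside a nested loop, and a return anywhere in the body suppresses the warning. The sketch below uses Python's ast module. All function names and sample snippets are invented for the example, and has_return also counts returns inside nested function definitions, which a stricter check would exclude.

```python
import ast


def has_own_break(nodes):
    """True if a break is found without descending into nested loops."""
    for node in nodes:
        if isinstance(node, ast.Break):
            return True
        if isinstance(node, (ast.For, ast.While)):
            continue  # "exclude": do not look inside nested for/while loops
        if has_own_break(ast.iter_child_nodes(node)):
            return True
    return False


def has_return(nodes):
    """True if a return statement occurs at any depth below `nodes`."""
    return any(isinstance(sub, ast.Return)
               for node in nodes for sub in ast.walk(node))


def suspicious_for_else(tree):
    """Line numbers of for/else loops with neither an own break nor a return."""
    return [node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.For) and node.orelse
            and not has_own_break(node.body)
            and not has_return(node.body)]


NESTED_BREAK = '''
for value in values:
    for j in other:
        break
else:
    pass
'''

WITH_RETURN = '''
def find(values):
    for value in values:
        if value == 'baz':
            return value
    else:
        return None
'''

print(suspicious_for_else(ast.parse(NESTED_BREAK)))  # still flagged
print(suspicious_for_else(ast.parse(WITH_RETURN)))   # no longer flagged
```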

Page 28: Learning from other's mistakes: Data-driven code analysis

patterns vs. code

(no exception type specified)

    handlers:
      node_type: excepthandler
      type: null
    node_type: tryexcept

(empty exception handler)

    handlers:
      - body:
          - node_type: pass
        node_type: excepthandler
    node_type: tryexcept
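For comparison, here are the same two checks written as code against Python's ast module. This is a sketch, not the PyLint implementation mentioned earlier in the talk. In Python 3 the statement node is ast.Try rather than the tryexcept node used in the patterns, so the sketch inspects the ast.ExceptHandler nodes directly. The function name check_handlers and the sample source are made up.

```python
import ast

SOURCE = '''
try:
    risky()
except:
    handle()

try:
    risky()
except ValueError:
    pass
'''


def check_handlers(tree):
    """Return (line, message) pairs for the two handler anti-patterns."""
    issues = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:  # corresponds to "type: null"
            issues.append((node.lineno, 'no exception type specified'))
        if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
            issues.append((node.lineno, 'empty exception handler'))
    return sorted(issues)


print(check_handlers(ast.parse(SOURCE)))
```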

Page 29: Learning from other's mistakes: Data-driven code analysis

Summary & Feedback

1. Storing code as a graph opens up many interesting possibilities. Let's stop thinking of code as text!

2. We can learn from user feedback or even use machine learning to create and adapt code patterns!

3. Everyone can write code checkers! => crowd-source code quality!

Page 30: Learning from other's mistakes: Data-driven code analysis

Thanks!

www.quantifiedcode.com

https://github.com/quantifiedcode

@quantifiedcode

Andreas Dewes (@japh44)

[email protected]

Visit us at booth 629!