commit2015 kharchenko - python generators - ext

Maxym Kharchenko & m@ team Writing efficient Python code with pipelines and generators

Upload: maxym-kharchenko

Post on 13-Apr-2017

Page 1: Commit2015   kharchenko - python generators - ext

Maxym Kharchenko & m@ team

Writing efficient Python code with pipelines and generators

Page 2

Agenda

Style

Efficiency

Simplicity

Pipelines

Page 3

Python is all about streaming (a.k.a. iteration)

Page 4

Streaming in Python

    # Lists
    db_list = ['db1', 'db2', 'db3']
    for db in db_list:
        print(db)

    # Dictionaries
    host_cpu = {'avg': 2.34, 'p99': 98.78, 'min': 0.01}
    for stat in host_cpu:
        print("%s = %s" % (stat, host_cpu[stat]))

    # Files, strings
    file = open("/etc/oratab")
    for line in file:
        for word in line.split(" "):
            print(word)

    # Whatever is coming out of get_things()
    for thing in get_things():
        print(thing)
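Every loop on this page relies on the same iteration protocol: a for loop calls iter() on the object and then next() on the result until StopIteration is raised. A minimal sketch of that protocol, written for Python 3; the helper name stream is made up for this illustration, and the sample values come from the examples above:

```python
db_list = ['db1', 'db2', 'db3']

def stream(iterable):
    """Hypothetical helper: do by hand what a for loop does."""
    iterator = iter(iterable)          # ask the object for an iterator
    items = []
    while True:
        try:
            items.append(next(iterator))   # pull one item at a time
        except StopIteration:              # the iterator is exhausted
            break
    return items

print(stream(db_list))
print(stream({'avg': 2.34, 'p99': 98.78}))  # dictionaries stream their keys
print(stream("ab"))                          # strings stream their characters
```

Anything that supports iter() can sit on the right-hand side of a for loop, which is what makes generators a drop-in replacement for lists later in the deck.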

Page 5

Quick example: Reading records from a file

    def print_databases():
        """ Read /etc/oratab and print database names """

        file = open("/etc/oratab", 'r')

        while True:
            line = file.readline()  # Get next line

            # readline() returns an empty string only at end of file
            if len(line) == 0 and not line.endswith('\n'):
                break

            # Parsing oratab line into components
            db_line = line.strip()
            db_info_array = db_line.split(':')
            db_name = db_info_array[0]
            print(db_name)

        file.close()

Page 6

Reading records from a file: with “streaming”

    def print_databases():
        """ Read /etc/oratab and print database names """
        with open("/etc/oratab") as file:
            for line in file:
                print(line.strip().split(':')[0])

Page 7

Style matters!

Page 8

Ok, let’s do something useful with streaming

We have a bunch of Oracle listener logs

Let’s parse them for “client IPs”

21-AUG-2015 21:29:56 * (CONNECT_DATA=(SID=orcl)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.107.137.91)(PORT=43105)) * establish * orcl * 0

And find where the clients are coming from
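To make the goal concrete, here is a sketch that pulls the client IP out of the sample line above. The pattern is the one used later in this deck (in extract_client_ips); the constant and variable names are illustrative, and the code targets Python 3.

```python
import re

# Sample listener log line, copied from the slide above
LOG_LINE = (
    "21-AUG-2015 21:29:56 * "
    "(CONNECT_DATA=(SID=orcl)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * "
    "(ADDRESS=(PROTOCOL=tcp)(HOST=10.107.137.91)(PORT=43105)) * "
    "establish * orcl * 0"
)

# HOST immediately followed by PORT identifies the client address;
# this skips the HOST=__jdbc__ entry inside CONNECT_DATA
HOST_REGEX = re.compile(r'\(HOST=(\S+)\)\(PORT=')

match = HOST_REGEX.search(LOG_LINE)
client_ip = match.group(1) if match else None
print(client_ip)
```

Requiring the PORT part right after HOST is what keeps the client program's own HOST field out of the results.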

Page 9

First attempt at listener log parser

    def parse_listener_log(log_name):
        """ Parse listener log and return clients """
        client_hosts = []

        with open(log_name) as listener_log:
            for line in listener_log:
                host_match = <regex magic>
                if host_match:
                    host = <regex magic>
                    client_hosts.append(host)

        return client_hosts

Page 10

First attempt at listener log parser (same code as on the previous page)

MEMORY WASTE! Stores all results in client_hosts until return

BLOCKING! Does NOT return until the entire log is processed
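The memory cost can be seen directly. The sketch below (Python 3) compares a fully built list with an equivalent generator using sys.getsizeof. The record count of 100,000 and the function names are arbitrary choices for illustration. Note that getsizeof reports only the container itself, not the strings it refers to, so the real gap is larger.

```python
import sys

N = 100000  # arbitrary record count, for illustration only

def hosts_as_list(n):
    """Build every result before returning (the 'first attempt' style)."""
    result = []
    for i in range(n):
        result.append("host%d" % i)
    return result

def hosts_as_generator(n):
    """Hand results back one at a time."""
    for i in range(n):
        yield "host%d" % i

list_size = sys.getsizeof(hosts_as_list(N))
generator_size = sys.getsizeof(hosts_as_generator(N))

print("list:      %d bytes" % list_size)
print("generator: %d bytes" % generator_size)
```

The generator object stays the same small size no matter how many records it will eventually produce, because it holds only its current position.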

Page 13

Generators for efficiency

    def parse_listener_log(log_name):
        """ Parse listener log and return clients """

        with open(log_name) as listener_log:
            for line in listener_log:
                host_match = <regex magic>
                if host_match:
                    host = <regex magic>
                    yield host  # Add this!

The client_hosts list, the append() call and the final return are gone; each host is handed to the caller as soon as it is found.
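A runnable version of the generator, as a sketch: the regex placeholders are filled with the pattern shown later in this deck, and the temporary log file with made-up lines exists only to make the example self-contained (Python 3).

```python
import os
import re
import tempfile

HOST_REGEX = re.compile(r'\(HOST=(\S+)\)\(PORT=')

def parse_listener_log(log_name):
    """ GENERATOR: yield client hosts found in a listener log """
    with open(log_name) as listener_log:
        for line in listener_log:
            host_match = HOST_REGEX.search(line)
            if host_match:
                yield host_match.group(1)

# Made-up sample data, only for demonstration
sample = (
    "* (ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.1)(PORT=1521)) * establish * orcl * 0\n"
    "a line without any client address\n"
    "* (ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.2)(PORT=1522)) * establish * orcl * 0\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write(sample)

hosts = parse_listener_log(tmp.name)   # nothing is read yet
first_host = next(hosts)               # reads only as far as the first match
remaining_hosts = list(hosts)          # drains the rest; the with block then closes the file
os.remove(tmp.name)

print(first_host, remaining_hosts)
```

The caller decides how much of the log gets parsed: asking for one host reads one matching line, not the whole file.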

Page 14

Generators in a nutshell

    def test_generator():
        """ Test generator """

        print("ENTER()")

        for i in range(5):
            print("yield i=%d" % i)
            yield i

        print("EXIT()")

    # MAIN
    for i in test_generator():
        print("RET=%d" % i)

Output:

    ENTER()
    yield i=0
    RET=0
    yield i=1
    RET=1
    yield i=2
    RET=2
    yield i=3
    RET=3
    yield i=4
    RET=4
    EXIT()
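The same behaviour can be observed by driving a generator by hand with next(): calling the function runs none of its body, each next() runs it up to the following yield, and exhaustion raises StopIteration. A small Python 3 sketch; the events list is an addition that records the order of execution, and the loop is shortened to two items:

```python
events = []  # records what ran, and in which order

def test_generator():
    """ Test generator """
    events.append("ENTER()")
    for i in range(2):
        events.append("yield i=%d" % i)
        yield i
    events.append("EXIT()")

gen = test_generator()
events.append("created")          # the body has not started yet

events.append("RET=%d" % next(gen))
events.append("RET=%d" % next(gen))

try:
    next(gen)                     # runs the code after the loop, then stops
except StopIteration:
    events.append("StopIteration")

print("\n".join(events))
```

A for loop performs exactly these next() calls and swallows the final StopIteration, which is why the generator and the caller appear to take turns.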

Page 15

Non-generators in a nutshell

    def test_nongenerator():
        """ Test no generator """
        result = []

        print("ENTER()")

        for i in range(5):
            print("add i=%d" % i)
            result.append(i)

        print("EXIT()")

        return result

    # MAIN
    for i in test_nongenerator():
        print("RET=%d" % i)

Output:

    ENTER()
    add i=0
    add i=1
    add i=2
    add i=3
    add i=4
    EXIT()
    RET=0
    RET=1
    RET=2
    RET=3
    RET=4

Page 16

Generators to Pipelines

    100,000 records
      -> Generator (extractor): 1 second per record
    100,000 records, 1st: 1 second
      -> Generator (filter: 1/2): 2 seconds per record
    50,000 records, 1st: 5 seconds
      -> Generator (mapper): 5 seconds per record
    50,000 records, 1st: 10 seconds
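The first result arrives long before the whole input is processed because each stage pulls records from the previous one only on demand. The sketch below (Python 3) replaces the per-record delays with counters so that it runs instantly; the stage bodies and the even/odd filter rule are stand-ins, not the stages from the diagram.

```python
from collections import Counter

work = Counter()  # how many records each stage has handled so far

def extractor(n):
    for record in range(n):
        work["extractor"] += 1
        yield record

def keep_half(records):
    for record in records:
        work["filter"] += 1
        if record % 2 == 0:      # stand-in rule that keeps 1 record in 2
            yield record

def mapper(records):
    for record in records:
        work["mapper"] += 1
        yield "record-%d" % record

pipeline = mapper(keep_half(extractor(100000)))

first = next(pipeline)
work_for_first = dict(work)      # snapshot after the first result

second = next(pipeline)
work_for_second = dict(work)     # snapshot after the second result

print(first, work_for_first)
print(second, work_for_second)
```

Each snapshot shows that only a handful of records have moved through the stages, even though the extractor was asked for 100,000.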

Page 17

Generator pipelining in Python

    file_handles = open_files(LISTENER_LOGS)
    log_lines = extract_lines(file_handles)
    client_hosts = extract_client_ips(log_lines)

    for host in client_hosts:
        print(host)

File names -> [Open files] -> File handles -> [Extract lines] -> File lines -> [Extract IPs] -> Client IPs

Page 18

Generators for simplicity

    def open_files(file_names):
        """ GENERATOR: file name -> file handle """

        for file in file_names:
            yield open(file)

Page 19

Generators for simplicity

    def extract_lines(file_handles):
        """ GENERATOR: File handles -> file lines
            Similar to UNIX: cat file1 file2 …
        """

        for file in file_handles:
            for line in file:
                yield line

Page 20

Generators for simplicity

    import re

    def extract_client_ips(lines):
        """ GENERATOR: Extract client host """

        # Raw string, so the backslashes reach the regex engine unchanged
        host_regex = re.compile(r'\(HOST=(\S+)\)\(PORT=')

        for line in lines:
            line_match = host_regex.search(line)
            if line_match:
                yield line_match.group(1)

Page 21

Developer’s bliss: simple input, simple output, trivial function body

Page 22

Then, pipeline the results
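Put together, the three generators form a complete program. The sketch below (Python 3) runs them against two temporary files with made-up log lines; LISTENER_LOGS is built from those temporary files here, since the deck does not show how it is defined. One variation from the slides, flagged as such: open_files closes each file once the consumer moves past it.

```python
import os
import re
import shutil
import tempfile

def open_files(file_names):
    """ GENERATOR: file name -> file handle (closed when the consumer moves on) """
    for name in file_names:
        with open(name) as handle:
            yield handle

def extract_lines(file_handles):
    """ GENERATOR: file handles -> file lines """
    for handle in file_handles:
        for line in handle:
            yield line

def extract_client_ips(lines):
    """ GENERATOR: extract client host """
    host_regex = re.compile(r'\(HOST=(\S+)\)\(PORT=')
    for line in lines:
        line_match = host_regex.search(line)
        if line_match:
            yield line_match.group(1)

# Made-up input: two small "listener logs" in a temporary directory
log_dir = tempfile.mkdtemp()
LISTENER_LOGS = []
for number, ip in enumerate(["10.0.0.1", "10.0.0.2"]):
    path = os.path.join(log_dir, "listener%d.log" % number)
    with open(path, "w") as log:
        log.write("* (ADDRESS=(PROTOCOL=tcp)(HOST=%s)(PORT=1521)) * establish\n" % ip)
        log.write("* service_update * orcl * 0\n")
    LISTENER_LOGS.append(path)

# The pipeline itself, exactly as wired on the earlier slide
file_handles = open_files(LISTENER_LOGS)
log_lines = extract_lines(file_handles)
client_hosts = extract_client_ips(log_lines)

found = list(client_hosts)   # consuming the last stage drives all the others
shutil.rmtree(log_dir)       # clean up the temporary logs
print(found)
```

Building the three stages does no work at all; files are opened and lines are read only when the final loop (here, list()) starts asking for hosts.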

Page 23

But, really …

    Locate files -> File names
    Open files -> File handles
    Extract lines -> File lines
    Filter db=orcl -> db=orcl lines
    Filter proto=TCP -> db=orcl & prot=TCP lines
    Extract clients -> Client IPs
    IP -> host name -> Client hosts
    Client hosts -> Db writer
    Client hosts -> Text writer
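Extra stages slot into the chain the same way. As a sketch (Python 3), the two filter boxes could be small generators like the ones below. The function names and the substring tests are assumptions, since the deck does not show how the filters are implemented, and the sample lines are made up.

```python
def filter_db(lines, db_name):
    """ GENERATOR (hypothetical): keep lines that mention the given SID """
    marker = "(SID=%s)" % db_name
    for line in lines:
        if marker in line:
            yield line

def filter_protocol(lines, protocol):
    """ GENERATOR (hypothetical): keep lines for the given network protocol """
    marker = "(PROTOCOL=%s)" % protocol
    for line in lines:
        if marker in line:
            yield line

# Made-up sample lines
lines = [
    "(CONNECT_DATA=(SID=orcl)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.1)(PORT=1521))",
    "(CONNECT_DATA=(SID=test)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.2)(PORT=1521))",
    "(CONNECT_DATA=(SID=orcl)) * (ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER))",
]

orcl_lines = filter_db(lines, "orcl")
orcl_tcp_lines = filter_protocol(orcl_lines, "tcp")

selected = list(orcl_tcp_lines)
print(selected)
```

Because every stage takes an iterable and yields items, a filter can be added or removed without touching its neighbours.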

Page 24

Why generators?

Simple functions that are easy to write and understand

Non-blocking operations:
    TOTAL execution time: faster
    FIRST RESULTS: much faster

Efficient use of memory

Potential for parallelization and ASYNC processing

Page 25

Special thanks to David Beazley …

For this: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf

Page 26

Thank you!