Python for High Throughput Science

Mark Basham, Scientific Software Group, Diamond Light Source Ltd, UK

TRANSCRIPT

Page 1: Python for High Throughput Science by Mark Basham

Python for High Throughput Science

Mark Basham

Scientific Software Group

Diamond Light Source Ltd UK.

Page 2:

Overview

• What is Diamond Light Source
• Big Data?
• Python for scientists
• Python for developers

Page 3:
Page 4:

Diamond Light Source

Page 5:

What do I do?

• Provide data analysis for use during and after beamtime for users
  – Users may or may not have any prior experience.
  – ~30 beamlines with over 100 techniques used.
• With 12 other full-time developers

Page 6:

Where it all started

[Architecture diagram, www.opengda.org: client-server technology; communication with EPICS and hardware; scan mechanism; Jython and Python; visualisation; communication with external analysis; analysis tools. All core technologies open source.]

Acquisition

• 1.0 release 2002
• 3.0 release 2004
  – Jython introduced as scripting language

Beamline setup and data collection speed increased.

Page 7:

Universal Data Problem

Page 8:

Detector History at DLS

• Early 2007:
  – Diamond's first user.
  – No detector faster than ~10 MB/s.
• Early 2009:
  – First Lustre system (DDN S2A9900).
  – First Pilatus 6M system @ 60 MB/s.
• Early 2011:
  – Second Lustre system (DDN SFA10K).
  – First 25 Hz Pilatus 6M system @ 150 MB/s.
• Early 2013:
  – First GPFS system (DDN SFA12K).
  – First 100 Hz Pilatus 6M system @ 600 MB/s.
  – ~10 beamlines with 10 GbE detectors (mainly Pilatus and PCO Edge).
• Early 2015:
  – Delivery of the Percival detector (6000 MB/s).

[Chart: peak detector performance (MB/s) on a log scale, 2007 to 2015. Doubling time = 7.5 months.]
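The doubling-time figure is a log-linear fit to detector rates like those above. A stdlib-only sketch of that arithmetic (fitting just the five headline numbers listed here gives roughly 11 months; the slide's 7.5 months presumably comes from a fuller set of detector data points):

```python
import math

# Peak detector data rates from the slide (year -> MB/s); the 2015 figure
# is the anticipated Percival rate.
rates = {2007: 10, 2009: 60, 2011: 150, 2013: 600, 2015: 6000}

def doubling_time_months(data):
    """Least-squares fit of log2(rate) against year; returns doubling time in months."""
    xs, ys = list(data), [math.log2(r) for r in data.values()]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    # The slope of the fit is the number of doublings per year.
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    return 12.0 / slope

months = doubling_time_months(rates)  # ~11 months for these five points
```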

Page 9:

Per Beamline Data Rates

[Map of beamlines coloured by data rate: < 100 GB/day, < 1 TB/day, > 1 TB/day]

Page 10:

Data Storage

● ~1PB of Lustre

● ~1PB of GPFS

● ~0.5PB of on-line archive

● ~1PB near-line archive
  – >200M files

High-performance parallel file systems HATE lots of small files.

Page 11:
Page 12:

Small Data Rate Beamlines (Variety)

[Map of beamlines coloured by data rate: < 100 GB/day, < 1 TB/day, > 1 TB/day]

Page 13:

"I have all the data I have ever collected on a floppy disk and process it by hand…"

– Principal beamline scientist, when asked about data volumes in 2005

Page 14:

"I have all the data I have ever collected on a floppy disk and process it by hand…"

~1 TB so far this year

Page 15:

Processing Data (Variety)

• Experimental work requires exploring:
  – Matlab
  – IDL
  – IgorPro
  – Excel
  – Origin
  – Mathematica

Page 16:

Processing Playing with Data (Variety)

• Experimental work requires exploring:
  – Matlab
  – IDL
  – IgorPro
  – Excel
  – Origin
  – Mathematica
• The issue is scalability at all, and at a reasonable price.

Page 17:

Clusters (Velocity)

● 132 Intel-based nodes, 1280 Intel cores in service.
● 80 NVIDIA GPGPUs, 23328 GPU cores in service.
● Split across 6 clusters, with a range of capabilities.
● Mostly used by MX and tomography beamlines.
● All accessed via a Sun Grid Engine interface.

Page 18:

Python is the Obvious Answer

• Users have used it during their beam times.
• Free and easily distributable.
• ...
• BUT: how to give it to them in a way they understand?

Page 19:

Extending the Acquisition tools

[Architecture diagram spanning Acquisition (GDA, www.opengda.org) and Analysis (DAWN, www.dawnsci.org): client-server technology; communication with EPICS and hardware; scan mechanism; Jython and Python; visualisation; communication with external analysis; analysis tools; data read, write, convert; metadata structure; workflows. All core technologies open source.]

DAWN is a collection of generic and bespoke 'views' collated into 'perspectives'. The perspectives and views can be used in part or whole in either the GDA or DAWN.

Page 20:

Main DAWN Elements for Python

[Diagram, www.dawnsci.org: Python/Jython at the centre, connected to data exploring, workflow (Python actor), PyDev scripting, the IPython console, the scisoftpy module and HDF5 visualisation.]

Page 21:

Scisoftpy plotting

Page 22:

Interactive console

• IPython interface
• Run on CMD
• Real-time variable view
• Integrated debugging

Page 23:

Scripting tools

• Breakpoints and step-by-step debugging
• Interact with the interpreter while paused

Page 24:

Python @ Diamond

• Anaconda
  – NumPy
  – SciPy
  – h5py
  – mpi4py
  – web services
• Astra (tomography)
• FFTW (ptychography)
• CCTBX (crystallography)

Page 25:

Processing Playing with Data (Variety)

• Experimental work requires exploring:
  – Python
• Scientific Software team
  – Modules for easy access and common tasks
  – Repositories and training

Page 26:

Aside – Python for Optimization

• We produce a very fast beam of electrons (99.999999% of the speed of light).
• We oscillate this beam between magnet arrays called Insertion Devices (IDs) to make lots of light.

Page 27:

Insertion Devices (IDs, ~600 magnets)

Page 28:

Individual Magnet (~800)

[Diagram: a unique magnet in its magnet holder, with x, y and z axes.]

Example measurement of one magnet:

          X      Y      Z
Perfect   1.0    0.0    0.0
Real      1.08   0.02   -0.04

Page 29:

Simple Optimisation Problem

• From 800 magnets, pick 600 of them in the right order so that they appear to be a perfect array.

• But we already have code in Fortran:
  – A bit hard to use
  – Not that extensible to new systems

Page 30:

Objective Functions

• Slower in Python than Fortran:
  – Original code: ~1,000 times slower
  – NumPy array optimised: ~10 times slower
• Python improvements:
  – Caching: ~matched the Fortran speed
  – Clever updating: ~100 times faster.
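The "clever updating" win comes from evaluating only the change a candidate swap makes, rather than recomputing the whole objective. A toy sketch of that idea (the weighted-error objective below is invented for illustration; the real objective models the ID's magnetic field):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600
errors = rng.normal(0.0, 0.05, size=n)   # per-magnet field error (invented numbers)
weights = np.linspace(1.0, 2.0, n)       # position-dependent weighting (invented)

def objective(perm):
    """Full recomputation of the figure of merit: O(n) per evaluation."""
    return float(np.dot(weights, errors[perm]))

def swap_delta(perm, i, j):
    """Change in the objective from swapping positions i and j: O(1)."""
    return (weights[i] - weights[j]) * (errors[perm[j]] - errors[perm[i]])

perm = rng.permutation(n)
before = objective(perm)
delta = swap_delta(perm, 10, 200)        # evaluate the move without applying it
perm[[10, 200]] = perm[[200, 10]]        # apply the swap
assert np.isclose(before + delta, objective(perm))
```

An optimiser that proposes swap moves can score each candidate in constant time this way, paying the O(n) cost only when a move is accepted.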

Page 31:

OptID

• Artificial immune systems
  – Global optimiser
  – Needs more evaluations
• Parallelisation
  – Threading with NumPy to use all processors
  – mpi4py for data transfer and making use of the cluster
• Running on 25 machines, 200 CPUs
• The first sort with the new code has been built.
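A global optimiser like an artificial immune system mainly needs many independent objective evaluations, which is what the parallelisation provides. A single-machine stand-in for that fan-out using stdlib concurrent.futures (the production system uses NumPy threading plus mpi4py across the cluster; the objective here is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def evaluate(seed):
    """Placeholder objective: score one randomly generated magnet ordering."""
    rng = random.Random(seed)
    candidate = rng.sample(range(100), 100)
    # Lower is better: total displacement from a perfect ordering.
    return sum(abs(pos - magnet) for pos, magnet in enumerate(candidate))

# Fan out 200 independent evaluations, as a global optimiser would each generation.
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = list(pool.map(evaluate, range(200)))
best = min(scores)
```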

Page 32:

High Data Rate Beamlines

[Map of beamlines coloured by data rate: < 100 GB/day, < 1 TB/day, > 1 TB/day]

Page 33:

Archiving (Veracity)

• A simple task of registering files and metadata with a remote service:
  – XML parsing
  – Contacting web services
  – File system interaction
• Nearly 1PB of data and 200 million files archived through this system.
• Extended onto the cluster to deal with the additional load.

Page 34:

MX Data Processing (Volume and Velocity)

[Map of beamlines coloured by data rate: < 100 GB/day, < 1 TB/day, > 1 TB/day]

Page 35:

MX Data Reduction (Volume)

[Flowchart, Fast DP (fast): index → integrate (several parallel integration jobs) → scale and refine in P1 → choose best point group (Pointless) → scale, post-refine and merge in the point group → output MTZ file. xia2 provides a more thorough route; downstream processing follows.]

Page 36:

Experimental Phasing (Velocity): Fast EP

[Flowchart: Fast DP MTZ file → prepare for Shelx (ShelxC) → find substructure (ShelxD, scanning space groups and numbers of sites) → phase (ShelxE, original and inverted hands, solvent fractions 0.25 to 0.75) → experimentally phased map.]

Results location: (visitpath)/processed/(folder)/(prefix)

Page 37:

DIALS

• Full application being built in Python
  – 4 full-time developers
• CCTBX
  – Extending and working with this open source project
• Boost
  – Optimisation when required using Boost

Page 38:

Tomography Data Reconstruction (Volume and Velocity)

[Map of beamlines coloured by data rate: < 100 GB/day, < 1 TB/day, > 1 TB/day]

Page 39:

Tomography Current Implementation

• Existing reconstruction code in C with CUDA
  – Only runs on TIFFs
  – Minimal data correction for experimental artefacts
  – Only uses 1 GPU
• Python
  – Splits data and manages cluster usage (2 GPUs per node)
  – Extracts corrected data from HDF
  – Builds input files from metadata
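"Splits data and manages cluster usage" boils down to slab-decomposing the scan's projection frames across the available GPUs. A sketch of that bookkeeping (the two-GPUs-per-node figure is from the slide; the function name and interface are illustrative):

```python
def split_frames(n_frames, n_nodes, gpus_per_node=2):
    """Divide a scan's projection frames into contiguous slabs, one per GPU
    (two GPUs per node, as on the cluster described above)."""
    n_workers = n_nodes * gpus_per_node
    base, extra = divmod(n_frames, n_workers)
    slabs, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)  # spread the remainder
        slabs.append((start, stop))
        start = stop
    return slabs
```

Each (start, stop) pair then becomes the frame range one GPU reads out of the HDF file.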

Page 40:

Tomography Next Gen

• mpi4py
  – Cluster organisation
  – Parallelism
  – Queues using send buffers
• Transfer of data using ZeroMQ
  – Using blosc for compression
• Processing in Python where possible
  – But calls to external code will be used initially
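The ZeroMQ-plus-blosc transfer amounts to: take the frame's raw bytes, compress them, ship them, and rebuild the array on the far side from shape and dtype metadata. A roundtrip sketch with stdlib zlib standing in for blosc, and the socket itself elided:

```python
import zlib
import numpy as np

# A synthetic 512x512 16-bit detector frame.
frame = np.random.default_rng(1).integers(0, 4096, size=(512, 512), dtype=np.uint16)

# Sending side: compress the raw frame bytes before putting them on the wire.
payload = zlib.compress(frame.tobytes(), level=1)

# Receiving side: decompress and rebuild the array; in a real pipeline the
# shape and dtype would travel alongside the payload as message metadata.
restored = np.frombuffer(zlib.decompress(payload), dtype=np.uint16).reshape(512, 512)
assert np.array_equal(frame, restored)
```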

Page 41:

Multiprocessor + MPI “profiling”

Page 42:

MPI “profiling”

Page 43:

Multiprocessor/MPI “profiling”

• JavaScript

  var dataTable = new google.visualization.DataTable()

• Python

  import logging
  logging.basicConfig(level=0,
      format='L %(asctime)s.%(msecs)03d M' + machine_number_string + ' '
             + rank_names[machine_rank] + ' %(levelname)-6s %(message)s',
      datefmt='%H:%M:%S')

• Jinja2 templating to tie the 2 together
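The tie-up works by parsing the timestamped log lines back into records and rendering them into the JavaScript the Google Charts page consumes. A sketch with stdlib re and string.Template standing in for Jinja2 (the two sample log lines are invented, in the format the basicConfig call above produces):

```python
import re
from string import Template

# Two invented lines in the format the logging configuration above emits.
log = """L 10:02:01.150 M01 rank0 INFO   read chunk
L 10:02:03.720 M01 rank0 INFO   reconstruct chunk"""

# Parse each line back into (time, machine, rank, message).
row = re.compile(r"L (\d\d:\d\d:\d\d\.\d+) M(\d+) (\S+) \S+\s+(.*)")
rows = [row.match(line).groups() for line in log.splitlines()]

# Render the records into the JavaScript the chart page consumes
# (string.Template standing in for Jinja2).
page = Template("var rows = [$rows];").substitute(
    rows=", ".join("['%s', '%s', '%s', '%s']" % r for r in rows))
```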

Page 44:

Where are we going?

• Scientists are having to become developers
  – We try to steer them in the right direction
  – Python is a very good, if not the best, tool for this
• Developers are having to work faster and be more reactive to new detectors, clusters, software, methods, ...
  – Python allows this, and is being adopted almost as standard by new computational projects at Diamond

Page 45:

Acknowledgements

– Alun Ashton
– Graeme Winter
– Greg Matthews
– Tina Friedrich
– Frederik Ferner
– Jonah Graham (Kichwa)
– Matthew Gerring
– Peter Chang
– Baha El Kassaby
– Jacob Filik
– Karl Levik
– Irakli Sikharulidze
– Olof Svensson
– Andy Gotz
– Gábor Náray
– Ed Rial
– Robert Oates

Page 46:

Thanks for Listening...

@basham_mark

www.dawnsci.org

www.diamond.ac.uk