Open–Source Python Tools for Environmental Data Processing, Analysis, and Visualization
9/17/2019
Michael J. Murphy, Environmental Data Analyst / Data Scientist and Senior Staff Geologist
Terraphase Engineering Inc.
OVERVIEW
• The Python programming language
• Why Python?
• Python data analysis basics
• Case studies
• Review
PYTHON* IS A MODERN PROGRAMMING LANGUAGE THAT IS ESPECIALLY USEFUL FOR QUANTITATIVE DATA ANALYSIS AND SCIENTIFIC PROGRAMMING
*Python is named after Monty Python, not Pythonidae.
WHY PYTHON INSTEAD OF EXCEL?
BASIC PYTHON DATA TOOLS
pandas
• DataFrame table structures
• Access, process, analyze, export and visualize data
NumPy
• Vast collection of functions for array operations
• Quantitative analysis
Matplotlib
• High-level plotting functions
• Can produce publication-quality data visualizations
• Works seamlessly with Python data tools such as NumPy and pandas
EXAMPLE: PANDAS DATAFRAMES
Key Values
Index
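The slide image is not preserved in this transcript, so the following is only a minimal sketch of how a DataFrame's keys (column labels), values, and index relate. The well IDs, column names, and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical groundwater results; well IDs and values are made up for illustration.
df = pd.DataFrame(
    {
        "depth_ft": [25.0, 40.0, 55.0],
        "tce_ug_l": [12.0, 3.4, 0.5],
    },
    index=["MW-1", "MW-2", "MW-3"],
)

print(df.columns.tolist())  # keys (column labels)
print(df.index.tolist())    # index (row labels)
print(df.to_numpy())        # values as a NumPy array
```

A single value can then be looked up by index and key, for example `df.loc["MW-2", "tce_ug_l"]`.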
EXAMPLE: PANDAS DATAFRAMES
Original wide-format table:
Melted long-format table:
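The tables shown on the slide are not preserved here. As a rough illustration of the wide-to-long reshaping, the sketch below uses `DataFrame.melt` on a small hypothetical table; the sample IDs, analytes, and values are assumptions.

```python
import pandas as pd

# Hypothetical wide-format table: one row per sample, one column per analyte.
wide = pd.DataFrame(
    {
        "sample_id": ["S-1", "S-2"],
        "arsenic": [5.1, 7.3],
        "lead": [12.0, 9.5],
    }
)

# Melt to long format: one row per sample/analyte pair.
long = wide.melt(id_vars="sample_id", var_name="analyte", value_name="result")
print(long)
```

Long format is usually the more convenient shape for grouping and for plotting libraries that map a column to color or facet.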
EXAMPLE: NUMPY OPERATIONS
Vectorize function:
Draw random values:
Define function to calculate ion balance:
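The code from this slide is not in the transcript. The sketch below shows one way the three steps could look, assuming the ion balance is the usual charge-balance error, 100 × (cations − anions) / (cations + anions), with concentrations in meq/L. The function names, the 5% tolerance, and the random inputs are assumptions.

```python
import numpy as np


def ion_balance(cations, anions):
    """Charge-balance error in percent; inputs are sums in meq/L."""
    return 100.0 * (cations - anions) / (cations + anions)


def within_tolerance(balance, limit=5.0):
    """Scalar check; the 5% limit is an assumed screening threshold."""
    return abs(balance) <= limit


# Draw random values (synthetic cation and anion sums, meq/L).
rng = np.random.default_rng(seed=0)
cations = rng.uniform(5.0, 10.0, size=1000)
anions = rng.uniform(5.0, 10.0, size=1000)

# Plain NumPy arithmetic already operates element-wise on whole arrays.
balance = ion_balance(cations, anions)

# np.vectorize wraps a scalar function so it can be applied to an array.
flags = np.vectorize(within_tolerance, otypes=[bool])(balance)
print(flags.sum(), "of", flags.size, "samples within tolerance")
```

`np.vectorize` is a convenience wrapper rather than a speed-up; where a function can be written with array arithmetic, as `ion_balance` is, that form is generally faster.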
EXAMPLE: PLOTTING WITH MATPLOTLIB
A very basic example of a Matplotlib plot:
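The plot itself is not preserved in the transcript; a minimal sketch of a basic Matplotlib line plot, using a sine curve as stand-in data, might look like this:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: one period of a sine curve.
x = np.linspace(0.0, 2.0 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a Jupyter Notebook the figure renders inline; in a script, `fig.savefig(...)` writes it to a file.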
CASE STUDIES
CASE STUDY: MACHINE LEARNING ANALYSIS OF VOC DISTRIBUTION IN FRACTURED AQUIFER
ML STUDY PT. 1 – PRINCIPAL COMPONENT ANALYSIS
DATA CONSISTS OF DEPTH, DIP AND STRIKE, VOCS, AND DESCRIPTION
DATA IS CONVERTED TO A NUMERICAL ARRAY OR MATRIX
Original data table:
Numerical data in array:
Categorical data as target values:
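The original tables are not reproduced in the transcript. A minimal sketch of the conversion, assuming a table with depth, dip, strike, VOC, and description columns (all column names and values here are hypothetical), could be:

```python
import pandas as pd

# Hypothetical feature table; names and values are illustrative only.
df = pd.DataFrame(
    {
        "depth_ft": [52.1, 60.3, 75.8, 81.2],
        "dip_deg": [12.0, 68.0, 8.0, 74.0],
        "strike_deg": [110.0, 35.0, 120.0, 40.0],
        "vocs_ug_l": [850.0, 15.0, 1200.0, 22.0],
        "description": ["bedding", "fracture", "bedding", "fracture"],
    }
)

# Numerical columns as a 2-D array (rows = features logged, columns = variables).
X = df[["depth_ft", "dip_deg", "strike_deg", "vocs_ug_l"]].to_numpy()

# Categorical descriptions encoded as integer target values plus their names.
target, target_names = pd.factorize(df["description"])
print(X.shape, target, list(target_names))
```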
DATA IS SCALED, THEN DECOMPOSED INTO TWO PRINCIPAL COMPONENTS (EIGENVECTORS)
Scale data:
Transform data with PCA:
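The slide's code is not in the transcript, and the specific scaler used is not stated. The sketch below assumes scikit-learn's `StandardScaler` followed by `PCA(n_components=2)`, applied to synthetic data standing in for the site measurements.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical array: 50 rows, 4 variables on very
# different scales (e.g. depth, dip, strike, VOCs).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([30.0, 20.0, 90.0, 500.0])

# Scale each column to zero mean and unit variance so no variable dominates.
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled data onto its first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

Scaling first matters because PCA is driven by variance; an unscaled VOC column in µg/L would otherwise swamp dip or depth.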
COMPONENT 2 IS PLOTTED AGAINST COMPONENT 1, AND TARGET NAMES ARE ASSIGNED
Target categories show clustering:
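The figure is not preserved. A rough sketch of plotting component 2 against component 1, with one color per target category, is shown below; the random points and category names are stand-ins, not the study's results.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for the PCA output and the encoded target categories.
rng = np.random.default_rng(1)
components = rng.normal(size=(40, 2))
target = rng.integers(0, 2, size=40)
target_names = ["bedding", "fracture"]  # hypothetical category names

fig, ax = plt.subplots()
for code, name in enumerate(target_names):
    mask = target == code
    # One scatter call per category so each gets its own color and legend entry.
    ax.scatter(components[mask, 0], components[mask, 1], label=name)
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend()
```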
ML STUDY PT. 2 – K-MEANS CLUSTERING
DIP AND VOCS ARE SELECTED FROM SAME DATA AS AN ARRAY
Same dataset as PCA example:
Select dip and VOC columns as array:
DATA IS SCALED AND A NEW VECTOR CREATED WITH FOUR K-MEANS CLUSTERS
Create K-means pipeline:
Fit and predict four clusters:
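The pipeline code is not in the transcript. One plausible form, assuming scikit-learn's `make_pipeline` with a `StandardScaler` and `KMeans(n_clusters=4)`, is sketched below on synthetic dip and VOC values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the dip (degrees) and VOC (ug/L) columns.
rng = np.random.default_rng(0)
dip = rng.uniform(0.0, 90.0, size=100)
vocs = rng.lognormal(mean=3.0, sigma=1.0, size=100)
X = np.column_stack([dip, vocs])

# Scale, then cluster; random_state is fixed so the labels are repeatable.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)

# fit_predict returns one cluster label (0-3) per row of X.
labels = pipeline.fit_predict(X)
print(np.bincount(labels))
```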
LABELS ARE ADDED, AND EACH CLUSTER LABEL COUNTED
Original data:
Two new columns with cluster labels:
Counts of cluster labels:
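The slide's tables are not preserved. The sketch below assumes the two new columns are a numeric cluster label and a descriptive name; the data, labels, and name mapping are all hypothetical.

```python
import pandas as pd

# Hypothetical rows and K-means labels, for illustration only.
df = pd.DataFrame(
    {
        "dip_deg": [8.0, 72.0, 15.0, 10.0, 65.0, 80.0],
        "vocs_ug_l": [900.0, 20.0, 35.0, 1100.0, 700.0, 12.0],
    }
)
labels = [0, 1, 2, 0, 3, 1]

# Assumed mapping from numeric label to a readable description.
names = {
    0: "low dip, high VOCs",
    1: "high dip, low VOCs",
    2: "low dip, low VOCs",
    3: "high dip, high VOCs",
}

# Add the two label columns, then count rows per cluster.
df["cluster"] = labels
df["cluster_name"] = df["cluster"].map(names)
counts = df["cluster_name"].value_counts()
print(counts)
```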
CLUSTERS ARE PLOTTED WITH DEPTH
Clusters plotted with depth:
THE CLUSTER COUNTS INDICATE THAT A HIGHER PERCENTAGE OF LOW-ANGLE FEATURES HAVE HIGH VOCS
Count = 71
Count = 30
Count = 29
Count = 48
BOTH UNSUPERVISED METHODS INDICATE CLUSTERING BETWEEN VARIABLES
Count = 71
Count = 30
Count = 29
Count = 48
CASE STUDY: WELL TRANSDUCER DATA PROCESSING AND VISUALIZATION
DATA FROM A SINGLE TRANSDUCER CONTAINS NEARLY 70,000 DATA POINTS
Data collected every 1 minute; ~70,000 records:
SEVERAL LARGE OUTLIERS ARE PRESENT WHERE TRANSDUCER WAS MOVED, ETC.
Large artificial outliers:
OUTLIERS ARE ITERATIVELY REPLACED WITH INTERPOLATED VALUES
Interpolate and impute outliers:
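The author's routine is not shown in the transcript, so the following is only one possible approach: flag points that deviate from a rolling median by more than a threshold, replace them with time-interpolated values, and repeat until nothing is flagged. The function name, window, threshold, and synthetic series are all assumptions.

```python
import numpy as np
import pandas as pd


def replace_outliers(series, window=61, threshold=0.5, max_iter=10):
    """Iteratively replace points far from the rolling median by interpolation."""
    cleaned = series.copy()
    for _ in range(max_iter):
        baseline = cleaned.rolling(window, center=True, min_periods=1).median()
        outliers = (cleaned - baseline).abs() > threshold
        if not outliers.any():
            break
        # Blank the flagged points, then fill them from their neighbors in time.
        cleaned = cleaned.mask(outliers).interpolate(method="time")
    return cleaned


# Synthetic 1-minute water-level record with two artificial disturbances.
index = pd.date_range("2019-01-01", periods=600, freq="min")
level = pd.Series(10.0 + 0.2 * np.sin(np.linspace(0.0, 6.0, 600)), index=index)
level.iloc[100:105] = 25.0  # e.g. transducer pulled from the well
level.iloc[400] = -3.0      # single spurious reading

cleaned = replace_outliers(level)
print(level.max(), cleaned.max())
```

The window and threshold would need tuning to the real record; a threshold that is too tight will start smoothing away genuine water-level changes.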
THE RESULTING HYDROGRAPH IS MUCH MORE READABLE
Hydrograph with outliers replaced:
THE PROCESSED DATASET IS THEN RESAMPLED TO THE MEAN HOURLY VALUE
Resample dataset to the hourly mean:
Data is reduced by ~98% to ~1,200 records:
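The resampling step can be sketched with pandas `resample`, here on a synthetic 1-minute series of 70,000 points standing in for the transducer record (the values and start date are assumptions).

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute record with 70,000 points.
index = pd.date_range("2019-01-01", periods=70_000, freq="min")
level = pd.Series(10.0 + 0.2 * np.sin(np.arange(70_000) / 1440.0), index=index)

# Resample to the mean of each 60-minute (hourly) bin.
hourly = level.resample("60min").mean()

reduction = 1.0 - len(hourly) / len(level)
print(len(level), "->", len(hourly), f"records ({reduction:.1%} reduction)")
```

With 60 one-minute readings per bin, 70,000 records collapse to roughly 1,200 hourly means, which matches the ~98% reduction quoted on the slide.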
THE PROCESSED AND REDUCED DATASET CONTAINS THE ESSENTIAL INFORMATION
Hydrograph of reduced dataset:
BY ITERATIVELY REPLACING ARTIFICIAL OUTLIERS AND REDUCING THE SIZE OF THE DATASET, WHAT WOULD TAKE SEVERAL HOURS IN EXCEL IS ACCOMPLISHED IN LESS THAN 1 MINUTE
REVIEW AND FURTHER TOPICS
• Python presents a comprehensive and free open-source toolkit for data processing, analysis and visualization
– Python libraries, such as NumPy, pandas, Matplotlib, Seaborn, and Jupyter Notebook
– Developed by universities, scientists, independent developers
– Libraries can be used together to process, analyze, and visualize groundwater and environmental data
– More accurate, powerful, and repeatable than Excel, etc.
– Range of applications from simple EDA to complex ML studies
• Python can also be used to interact with other software and codes
– FloPy: MODFLOW library
– PHREEQPY: PHREEQC library
– ArcPy, Python API: Interact and script functions within ESRI’s ArcGIS suite
THANK YOU! QUESTIONS?
Please feel free to contact me at [email protected] if you have
any questions we cannot get to, or find me around the conference.
CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
• Metals data from soil samples
– Potentially contaminated site
– Arsenic is primary COC
– Examine distribution of As with depth to determine possible outliers
• EDA with ‘Seaborn’ Python library, Jupyter Notebooks
– Seaborn: High-level statistical visualization tools built on Matplotlib
– Works seamlessly with pandas DataFrames
– Allows rapid EDA
– Produce publication-quality visualizations
– Jupyter Notebooks: Edit and run code, view inline plots
CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
Wide-format table of metals data:
CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
Plot all metals against depth:
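The plotting code is not in the transcript. A minimal sketch, assuming the wide table is melted to long format and drawn with Seaborn's `scatterplot`, is shown below; the metals, depths, and concentrations are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical wide-format soil results (mg/kg) at three sample depths.
wide = pd.DataFrame(
    {
        "depth_ft": [0.5, 0.5, 2.0, 2.0, 5.0, 5.0],
        "arsenic": [18.0, 42.0, 9.5, 11.0, 4.2, 5.0],
        "lead": [60.0, 75.0, 30.0, 28.0, 12.0, 15.0],
        "zinc": [110.0, 95.0, 80.0, 70.0, 55.0, 60.0],
    }
)

# Long format lets Seaborn map the metal name to color.
long = wide.melt(id_vars="depth_ft", var_name="metal", value_name="conc_mg_kg")

fig, ax = plt.subplots()
sns.scatterplot(data=long, x="conc_mg_kg", y="depth_ft", hue="metal", ax=ax)
ax.invert_yaxis()  # depth increases downward
```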
CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
Highlight arsenic results:
CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
Box and scatter plot of arsenic vs. depth:
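The figure is not preserved. One way to combine a box plot with the individual points, assuming Seaborn's `boxplot` and `stripplot` drawn on the same axes with depth treated as a category, is sketched here on hypothetical arsenic data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical arsenic results (mg/kg) at three sample depths.
df = pd.DataFrame(
    {
        "depth_ft": [0.5, 0.5, 0.5, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0],
        "arsenic_mg_kg": [18.0, 42.0, 21.0, 9.5, 11.0, 10.2, 4.2, 5.0, 4.8],
    }
)

# Treat each depth as a category so one box is drawn per depth interval.
df["depth_interval"] = df["depth_ft"].astype(str) + " ft"

fig, ax = plt.subplots()
sns.boxplot(data=df, x="arsenic_mg_kg", y="depth_interval", color="lightgray", ax=ax)
sns.stripplot(data=df, x="arsenic_mg_kg", y="depth_interval", color="black", ax=ax)
```

Swapping `sns.boxplot` for `sns.violinplot` gives the violin-and-scatter variant with the same data and axes.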
CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
Violin and scatter plot of arsenic vs. depth: