using data mining to investigate interaction between channel characteristics and hydraulic geometry...


Upload: oscar-gilmore

Post on 26-Dec-2015


TRANSCRIPT

  • Slide 1
  • Using Data Mining to Investigate Interaction between Channel Characteristics and Hydraulic Geometry Channel Types. Leong Lee, Ph.D., Associate Professor, Dept. of Computer Science, Austin Peay State University, Tennessee, USA; Gregory S. Ridenour, Ph.D., Professor, Dept. of Geosciences, Austin Peay State University, Tennessee, USA
  • Slide 2
  • Introduction: Hydraulic geometry is the study of variations in channel characteristics with respect to variations in channel discharge. The purpose of this paper is to illustrate the use of data mining in hydraulic geometry to establish a large database for five objectives: (1) empirical estimation of parameters in theoretical equations (power functions); (2) classification (based on a ternary diagram); (3) production of maps with a GIS; (4) assessment of data quality by computing Euclidean distance to multivariate means from previous studies, generating scatterplot matrices for identifying outliers, and comparison utilizing Spearman's rank order correlation coefficient; and (5) pattern discovery via chi-square analysis for goodness of fit with expected distributions and tests of interaction of variables from pivot tables.
  • Slide 3
  • Hydraulic Geometry: Leopold and Maddock coined the term hydraulic geometry to refer to the set of equations that describe the functional relationships between a stream's width (w), mean depth (d), and mean velocity (v) and its discharge (Q): $w = aQ^{b}$ (1), $d = cQ^{f}$ (2), $v = kQ^{m}$ (3). Because discharge is the product of width, mean depth, and mean velocity, multiplying (1), (2), and (3) gives $Q = wdv = ackQ^{b+f+m}$ (4), from which it is evident that $ack = 1$ and $b + f + m = 1$.
  • Slide 4
  • Ternary Diagram (b-f-m diagram) and Channel Types: For graphical comparison, Park and Rhodes simultaneously and independently recognized that the appropriate diagram for the unit-sum constrained exponents is a triangular plot, or ternary diagram [11, 12, 13], that is, a barycentric plot of three variables that sum to 1. Fig. 1. Channel types within the b-f-m diagram [12].
  • Slide 5
  • Ternary Diagram (b-f-m diagram) and Channel Types: The hydraulic geometry exponents (the b, f, m values) obtained at-a-station (at each station along a river) can be plotted on a ternary diagram. Five hydrologically significant lines divide the ternary diagram into 10 fields (channel types).
      TABLE I. Rhodes Classification of Hydraulic Geometry [12]
      Channel Type | Criteria
      I | b + f < m AND b > f
      II | b + f < m AND b < f
      III | b + f > m AND b > f AND m > f
      IV | b + f > m AND b < f AND m > f
      V | f > m AND b > f AND m/f > 2/3
      VI | f > m AND b < f AND m/f > 2/3
      VII | m > f/2 AND b > f AND m/f < 2/3
      VIII | m > f/2 AND b < f AND m/f < 2/3
      IX | m < f/2 AND b > f
      X | m < f/2 AND b < f
  • Slide 6
  • Ternary Diagram (b-f-m diagram) and Channel Types: The first line is f = b. If a point plots to the left of the line (b > f), the width-depth ratio (w/d) increases with increasing discharge; to the right of the line (b < f) the ratio decreases. The second line is m = f. If a point plots above the line (m > f), competence (the largest particle size a stream can transport) should increase with increasing discharge; below the line competence should decrease. The third line is m = f/2. If a point plots above the line (m > f/2), the Froude number (which differentiates supercritical and subcritical flow) increases with increasing discharge; below the line the Froude number decreases. The fourth line is m = b + f. If a point plots above the line (m > b + f), velocity increases more rapidly than cross-sectional area with increasing discharge; below the line velocity increases less rapidly than cross-sectional area. The fifth line is m/f = 2/3, which is related to the Manning equation. If a point plots above the line (m > (2/3) f), the ratio of the square root of slope (s) to the roughness coefficient (n), that is, s^(1/2)/n, increases with increasing discharge; below the line it decreases.
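Four of the five dividing lines follow directly from the power functions w = aQ^b, d = cQ^f, v = kQ^m. The short derivation below is a sketch rather than part of the original slides; it assumes that the hydraulic radius can be approximated by the mean depth d (a wide-channel assumption) when the Manning equation is used, and it writes g for gravitational acceleration and A = wd for cross-sectional area. Each quantity increases with discharge exactly when its exponent on Q is positive, which gives the lines f = b, m = f/2, m = b + f, and m/f = 2/3. The competence line m = f is not derived here.

```latex
\begin{align*}
\frac{w}{d} &= \frac{a}{c}\,Q^{\,b-f}
  && \text{(width-depth ratio; line } f = b\text{)} \\
\mathrm{Fr} = \frac{v}{\sqrt{g\,d}} &= \frac{k}{\sqrt{g\,c}}\,Q^{\,m-f/2}
  && \text{(Froude number; line } m = f/2\text{)} \\
\frac{v}{A} = \frac{v}{w\,d} &= \frac{k}{a\,c}\,Q^{\,m-(b+f)}
  && \text{(velocity vs. area; line } m = b+f\text{)} \\
\frac{s^{1/2}}{n} = \frac{v}{d^{2/3}} &= \frac{k}{c^{2/3}}\,Q^{\,m-2f/3}
  && \text{(Manning, } R \approx d\text{; line } m/f = 2/3\text{)}
\end{align*}
```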
  • Slide 7
  • Hydraulic Geometry & Applications: Hydraulic geometry is applicable to prediction of channel deformation, layout of river training works, design of stable canals and intakes, river flow control works, irrigation schemes, and river improvement works [1]. Hydraulic geometry can discriminate between different types of river sections [3], which could be used by planners for resource and impact assessment [4].
  • Slide 8
  • Hydraulic Geometry & Applications: Additional applications include drought management, climate change vulnerability [5], and assessment of instream aquatic habitat [6]. The U.S. EPA's current definition of a TMDL (Total Maximum Daily Load) incorporates both nonpoint sources and point sources of pollution, including all sources subject to regulation under the National Pollutant Discharge Elimination System (NPDES) program [7]. Point source pollutant models such as QUAL2E utilize some of the parameters a, c, k, and b, f, m [8].
  • Slide 9
  • United States (Rivers Runs Through): Software engineer Nelson Minar used data provided by the Environmental Protection Agency (Business Insider) http://www.businessinsider.com/map-of-americas-rivers-2013-6#ixzz2kNKSuqPM
  • Slide 10
  • Data Mining: Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically [10]. The term data mining is also used for the entire knowledge discovery process, which may include data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation [10].
  • Slide 11
  • Summary of Data Mining Approach: Fig. 2. Flowchart of multi-stage data mining method
  • Slide 12
  • Module 1 (Data Extraction and Cleaning): The USGS started collecting stream information in 1889 by using a stream gage on the Rio Grande River in New Mexico [14]. A stream gage primarily measures a stream's water level, but often also collects information about the water quality and amounts of sediment. Measurements are usually recorded every 15 minutes and uploaded to the USGS on average every four hours.
  • Slide 13
  • Module 1 (Data Extraction and Cleaning): The USGS stores data from stream gages and manual measurements in a large database that currently has information from about 1.5 million sites in the US and territories [16]. Historical measurements and near-real-time data are available to the public through online filtered searches [16]. Module 1 acquires preliminary data in the form of HTML files from the USGS site.
  • Slide 14
  • U.S. Geological Survey Water Resources
  • Slide 15
  • U.S. Geological Survey Water Resources
  • Slide 16
  • Module 1 (Data Extraction and Cleaning): We performed a filtered search constrained by a selected state and a date range of January 1, 2009, to December 31, 2011, for five southeastern states (KY, TN, MS, AL, GA). A web scraping program performs the data extraction, retrieving: the gaging station identification number (id), latitude (la), and longitude (lo); the channel location distance (dist); the field measurements, namely stream width (w), stream cross-sectional area (A), from which mean depth (d = A/w) is computed, mean velocity (v), and discharge (Q); and the channel characteristics stability (sta), material (mat), and evenness (eve).
  • Slide 17
  • Algorithm 1: Data Extraction and Cleaning
      begin
        use USGS website to perform search, acquire input file
        for each station in input file
          retrieve id, la, lo, dist, and add them to output file
          for all records in this station
            retrieve w, A, v, Q, sta, mat, eve
            check for missing data for all fields above
            if no missing data
            begin
              calculate d
              add w, A, v, Q, sta, mat, eve, d to output file
              calculate logarithms for Q, w, d, v
              add logarithms of Q, w, d, v to output file
            end
        end-for
      end-algorithm
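The record-level part of Algorithm 1 can be sketched in Python as follows. This is an illustration only, not the authors' program: the dictionary layout, the function name `clean_records`, the use of base-10 logarithms, and the extra guard against non-positive values are all assumptions made here, and the web scraping step is left out.

```python
import math

REQUIRED = ("w", "A", "v", "Q", "sta", "mat", "eve")

def clean_records(records):
    """Keep complete records; add mean depth d = A / w and log10 values.

    `records` is a hypothetical list of dicts, one per field measurement.
    """
    cleaned = []
    for rec in records:
        # Drop the record if any required field is missing (None or empty).
        if any(rec.get(key) in (None, "") for key in REQUIRED):
            continue
        w, A, v, Q = (float(rec[key]) for key in ("w", "A", "v", "Q"))
        # Logarithms need positive values (a guard added in this sketch).
        if min(w, A, v, Q) <= 0:
            continue
        d = A / w  # mean depth from cross-sectional area and width
        out = dict(rec, w=w, A=A, v=v, Q=Q, d=d)
        for name, value in (("Q", Q), ("w", w), ("d", d), ("v", v)):
            out["log_" + name] = math.log10(value)  # base 10 is an assumption
        cleaned.append(out)
    return cleaned

sample = [
    {"w": 20.0, "A": 50.0, "v": 2.0, "Q": 100.0,
     "sta": "firm", "mat": "gravel", "eve": "even"},
    {"w": 20.0, "A": None, "v": 2.0, "Q": 100.0,
     "sta": "firm", "mat": "gravel", "eve": "even"},  # missing area: dropped
]
cleaned = clean_records(sample)
```

Only the first sample record survives, with a mean depth of 2.5 in whatever length unit the width and area use.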
  • Slide 18
  • Module 2 (Data Selection): It selects only stations with a minimum of twelve valid measurement records.
      Algorithm 2: Data Selection
      begin
        declare cutoff variable = 12
        for each station in input file
          read station records into an array
          if the number of station records ≥ cutoff variable (12)
          begin
            add all records to output file
          end
        end-for
      end-algorithm
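A minimal Python sketch of the selection step, assuming (hypothetically) that each cleaned record is a dict carrying its station identifier under the key `id`; the function name and data layout are illustrative, not taken from the original program.

```python
from collections import defaultdict

CUTOFF = 12  # minimum number of valid measurement records per station

def select_stations(records, cutoff=CUTOFF):
    """Group records by station id and keep stations with >= cutoff records."""
    by_station = defaultdict(list)
    for rec in records:
        by_station[rec["id"]].append(rec)
    return {sid: recs for sid, recs in by_station.items() if len(recs) >= cutoff}

# Station "A" has 12 records and is kept; station "B" has 11 and is dropped.
demo = [{"id": "A"} for _ in range(12)] + [{"id": "B"} for _ in range(11)]
kept = select_stations(demo)
```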
  • Slide 19
  • Module 3 (Data Transformation and Preliminary Data Pattern Extraction): It produces the final data set for data pattern extraction and post-processing analysis. It uses equations 5, 6, 7, 8, 9 (later slides). It transforms latitude (la) and longitude (lo) to values suitable for the equations. It finds the minimum value of channel location distance (dist), uses it to select the suitable stability (sta), material (mat), and evenness (eve) values, and classifies each channel.
  • Slide 20
  • Algorithm 3: Data Transformation and Preliminary Data Pattern Extraction
      begin
        for each station in input file
          retrieve id, la, lo, dist
          add id to output file
          convert la, lo from degree/minute/second to decimal
          add la, lo to output file
        end-for
        for each station in input file
          for each record in this station
            retrieve log(w), log(d), log(v), log(Q)
            retrieve sta, mat, eve, dist
            use linear regression (equations 8, 9) to obtain b, f, m and a, c, k for this station
            normalize b, f, m
            increase values of a, c, k proportionally
            normalize a, c, k
            find minimum dist for this station
            select sta, mat, eve at the minimum dist
            classify channel type (I to X) based on Table I
          end-for
          add b, f, m and a, c, k to output file for this station
          add minimum dist to output file for this station
          add sta, mat, eve to output file for this station
          add channel type to output file for this station
        end-for
      end-algorithm
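The regression step of Algorithm 3 can be sketched as an ordinary least-squares fit in log space. Equations 8 and 9 are not reproduced in this transcript, so the code below is an assumption about their form, not the authors' implementation: it fits log y = log(coefficient) + exponent · log Q for each of w, d, and v, and then rescales b, f, m to sum to 1. The slides do not spell out how a, c, k are normalized, so the coefficients are returned as fitted. Base-10 logs, the `log_*` field names, and all function names are hypothetical. The coordinate helper handles magnitudes only; the sign convention for western longitudes is left to the caller.

```python
def dms_to_decimal(degrees, minutes, seconds):
    """Convert degree/minute/second magnitudes to decimal degrees."""
    return degrees + minutes / 60.0 + seconds / 3600.0

def fit_power_law(log_q, log_y):
    """Least-squares line through (log Q, log y); returns (coefficient, exponent)."""
    n = len(log_q)
    mean_x = sum(log_q) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_q)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_q, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, slope  # assumes base-10 logarithms

def station_hydraulic_geometry(records):
    """Estimate a, c, k and unit-sum-normalized b, f, m for one station."""
    log_q = [r["log_Q"] for r in records]
    a, b = fit_power_law(log_q, [r["log_w"] for r in records])
    c, f = fit_power_law(log_q, [r["log_d"] for r in records])
    k, m = fit_power_law(log_q, [r["log_v"] for r in records])
    total = b + f + m  # equals 1 when Q is exactly w * d * v
    return {"a": a, "c": c, "k": k,
            "b": b / total, "f": f / total, "m": m / total}
```

For measurements in which Q is computed as w · d · v, the three fitted slopes already sum to 1, so the normalization only corrects for rounding and for records where the reported discharge departs from that product.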
  • Slide 21
  • Module 4 (Data Pattern Extraction and Post-Processing): Uses MS-Excel, SigmaPlot, and ArcGIS to achieve pattern evaluation and knowledge presentation: empirical estimation of parameters; graphical displays; production of maps; assessment of data quality; and statistical analysis for goodness of fit and interaction of variables.
  • Slide 22
  • Estimation of Parameters and Classification: There are 3,467 sites in Tennessee with stream flow measurements. Only sites with channel information (stream flow, channel geometry, stability, evenness, and material) were selected. The data set was then cleaned and filtered.
  • Slide 23
  • Estimation of Parameters and Classification
  • Slide 24
  • Calculate b, f, m (one site)
  • Slide 25
  • Calculate b, f, m (one site)
  • Slide 26
  • Calculate b, f, m (one site)
  • Slide 27
  • Ternary Diagram (b-f-m diagram): Five hydrologically significant lines divide the ternary diagram into 10 fields.
  • Slide 28
  • Channel Characteristics and Production of Maps: Three non-metric channel characteristics were also downloaded for each station. Stability is a nominal variable which categorizes the bed as either firm or soft. Material is an ordinal variable which ranks the composition of bed material from finest to coarsest in the sequence silt (silt and/or mud), sand, gravel, cobbles, boulders (cobbles and/or boulders), or ledge (bedrock or an artificial material like concrete or metal). Evenness is a nominal variable which categorizes the bed as either even or uneven (channel has significant variation in its cross-section).
  • Slide 29
  • Assessment of Data Quality
  • Slide 30
  • Assessment of Data Quality
  • Slide 31
  • Goodness of Fit and Interaction
  • Slide 32
  • Goodness of Fit and Interaction
  • Slide 33
  • Goodness of Fit and Interaction
  • Slide 34
  • Results - Empirical Estimation of Parameters, Production of Maps: 225 gaging stations (sites) in Kentucky and Tennessee remained after our preliminary filtering criteria were applied. The hydraulic geometry of the 218 gaging stations (sites) that had positive b, f, and m values was plotted on the b-f-m diagram. These 218 gaging stations were mapped by channel stability (Figure 3), and a scatter plot matrix of b, f, and m was generated in ArcGIS (Figure 4).
  • Slide 35
  • Results - Empirical Estimation of Parameters, Production of Maps: Fig. 3. Stability of stream channel cross sections at or near gaging stations in Kentucky and Tennessee.
  • Slide 36
  • Fig. 4. Scatter plot matrix of hydraulic geometry exponents b, f, and m in Kentucky and Tennessee. The large graph at the upper right is an enlargement that is displayed by clicking one of the figures in the matrix of graphs.
  • Slide 37
  • Results - Empirical Estimation of Parameters, Production of Maps: The histograms are similar to those produced by Rhodes, and no obvious outliers were detected from the scatter plots. The mean hydraulic geometry (b, f, m) = (0.219, 0.377, 0.404) for our data from five states (KY, TN, MS, AL, GA) was most similar to the mean at-a-station hydraulic geometry from a study by Stall and Yang [19], whose averages of b, f, and m (0.23, 0.41, 0.36), at a Euclidean distance of 0.0565 from our data, were from the Big Sandy River basin in Kentucky. Stall and Yang [19] report that there were 19 gages in this basin, described as a mature plateau of fine texture with moderate to strong relief.
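The distance comparison can be reproduced from the two published mean vectors. The sketch below uses only the rounded values quoted on the slide; the variable and function names are illustrative.

```python
import math

our_mean = (0.219, 0.377, 0.404)        # mean (b, f, m), five-state data set
stall_yang_mean = (0.23, 0.41, 0.36)    # Big Sandy River basin means [19]

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

distance = euclidean_distance(our_mean, stall_yang_mean)
print(f"{distance:.4f}")
```

With these three-decimal inputs the result is about 0.056, slightly below the reported 0.0565; the small gap is plausibly due to the means having been rounded for presentation.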
  • Slide 38
  • Results - Classification: The expected frequency count (EFC) within each channel type polygon in the ternary diagram was computed by multiplying its area percentage by the total count of observations in the usable data set (Table II, next slide). Cursory inspection of the table reveals that hydraulic exponents are concentrated in even-numbered channel types, which lie to the right of the vertical line b = f in the b-f-m diagram. Note that all EFCs exceeded 5, meeting the requirements for a chi-square test. The chi-square statistic for a contingency table that included only even-numbered channel types was about 186.3 with four degrees of freedom and a p-value of 3.25 × 10⁻³⁹, indicating that the points are not randomly distributed among even-numbered channel types on the b-f-m diagram.
  • Slide 39
  • Results - Classification - Table II. Chi-square from observed and expected counts in each channel type.
      Channel type | Observed count (O) | Ternary area [%] | Expected count (E) | (O-E)²/E
      I | 8 | 12.50 | 27.3 | 13.6
      II | 54 | 12.50 | 27.3 | 26.3
      III | 23 | 20.83 | 45.4 | 11.1
      IV | 38 | 4.17 | 9.1 | 91.9
      V | 3 | 4.17 | 9.1 | 4.1
      VI | 34 | 5.83 | 12.7 | 35.7
      VII | 3 | 2.50 | 5.5 | 1.1
      VIII | 23 | 4.17 | 9.1 | 21.3
      IX | 5 | 10.00 | 21.8 | 12.9
      X | 27 | 23.33 | 50.9 | 11.2
      Total | 218 | 100.00 | 218.0 | 229.1
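The expected counts and chi-square terms in Table II follow from the observed counts and the ternary area percentages. The sketch below recomputes them from the table values; the names are illustrative, and the original analysis was reportedly done in MS-Excel rather than Python.

```python
observed = {"I": 8, "II": 54, "III": 23, "IV": 38, "V": 3,
            "VI": 34, "VII": 3, "VIII": 23, "IX": 5, "X": 27}
area_percent = {"I": 12.50, "II": 12.50, "III": 20.83, "IV": 4.17, "V": 4.17,
                "VI": 5.83, "VII": 2.50, "VIII": 4.17, "IX": 10.00, "X": 23.33}

def goodness_of_fit(observed, area_percent):
    """Return {type: (expected count, (O - E)^2 / E)} for each channel type."""
    total = sum(observed.values())
    rows = {}
    for channel_type, count in observed.items():
        # Expected count = share of the ternary diagram area times total count.
        expected = total * area_percent[channel_type] / 100.0
        rows[channel_type] = (expected, (count - expected) ** 2 / expected)
    return rows

rows = goodness_of_fit(observed, area_percent)
chi_square_all = sum(term for _, term in rows.values())
chi_square_even = sum(rows[t][1] for t in ("II", "IV", "VI", "VIII", "X"))
smallest_expected = min(expected for expected, _ in rows.values())
```

Summing all ten terms gives approximately the table total of 229.1, and summing only the even-numbered types gives approximately the 186.3 quoted for the even-numbered channel types. The smallest expected count (Type VII) is above 5.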
  • Slide 40
  • Ternary Diagram: b-f-m diagram. Five hydrologically significant lines divide the ternary diagram into 10 fields.
  • Slide 41
  • Ternary Diagram: Channel Type I: b + f < m AND b > f
  • Slide 42
  • Ternary Diagram: Channel Type II: b + f < m AND b < f
  • Slide 43
  • Ternary Diagram: Channel Type III: b + f > m AND b > f AND m > f
  • Slide 44
  • Ternary Diagram: Channel Type IV: b + f > m AND b < f AND m > f
  • Slide 45
  • Ternary Diagram: Channel Type V: f > m AND b > f AND m/f > 2/3
  • Slide 46
  • Ternary Diagram: Channel Type VI: f > m AND b < f AND m/f > 2/3
  • Slide 47
  • Ternary Diagram: Channel Type VII: m > f/2 AND b > f AND m/f < 2/3
  • Slide 48
  • Ternary Diagram: Channel Type VIII: m > f/2 AND b < f AND m/f < 2/3
  • Slide 49
  • Ternary Diagram: Channel Type IX: m < f/2 AND b > f
  • Slide 50
  • Ternary Diagram: Channel Type X: m < f/2 AND b < f
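The ten criteria above can be collapsed into a small classifier. For positive exponents the thresholds f/2 < 2f/3 < f < b + f are ordered, so m falls into one of five bands, and the sign of b - f picks the odd (b > f) or even (b < f) type within that band. This is a sketch with a hypothetical function name; the slides do not say how points lying exactly on a dividing line are handled, so the function returns None for them, and it also returns None for non-positive exponents, which were excluded from the plotted data.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]

def classify_channel(b, f, m):
    """Return the Rhodes channel type (I to X) or None if unclassifiable."""
    if min(b, f, m) <= 0:
        return None  # only positive exponents were plotted
    if b == f or m in (b + f, f, 2 * f / 3, f / 2):
        return None  # exactly on a dividing line (handling is an assumption)
    if m > b + f:
        band = 0      # types I / II
    elif m > f:
        band = 1      # types III / IV
    elif m > 2 * f / 3:
        band = 2      # types V / VI   (m/f > 2/3)
    elif m > f / 2:
        band = 3      # types VII / VIII
    else:
        band = 4      # types IX / X
    return ROMAN[2 * band + (0 if b > f else 1)]

print(classify_channel(0.1, 0.2, 0.7))   # b < f and m > b + f
print(classify_channel(0.5, 0.4, 0.1))   # b > f and m < f/2
```

As a further check, the mean exponents reported in the results, (0.219, 0.377, 0.404), satisfy b < f, m > f, and b + f > m, which are the Type IV criteria.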
  • Slide 51
  • Results - Assessment of Data Quality: The comparison of our frequency distribution of channel types with that of Rhodes is shown in Table III. The value of Spearman's rank order correlation coefficient for the comparison was 0.82. The critical values of Spearman's rank order correlation coefficient for one- and two-tailed tests at the 0.05 significance level for 10 pairs of observations were, respectively, 0.564 and 0.648, suggesting a high degree of similarity in the compared frequency distributions of channel types. In both distributions, the even-numbered channel types, which lie to the right of the vertical line b = f on the diagram, all have ranks from 1 to 5.5.
  • Slide 52
  • Results - Assessment of Data Quality - Table III. Frequency distributions for channel types.
      Channel Type | Rhodes [12] rank | Our rank
      I | 9 | 7
      II | 1 | 1
      III | 6 | 5.5
      IV | 5 | 2
      V | 7 | 9.5
      VI | 2 | 3
      VII | 8 | 9.5
      VIII | 4 | 5.5
      IX | 10 | 8
      X | 3 | 4
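Spearman's coefficient can be recomputed from the two rank columns of Table III. The sketch below uses the classic formula based on squared rank differences, rho = 1 - 6 Σd² / (n(n² - 1)); whether the authors used this form or a tie-corrected variant is not stated, so treat the choice as an assumption (the two agree to two decimals for these ranks).

```python
rhodes_rank = [9, 1, 6, 5, 7, 2, 8, 4, 10, 3]        # types I to X, Rhodes [12]
our_rank = [7, 1, 5.5, 2, 9.5, 3, 9.5, 5.5, 8, 4]    # types I to X, this study

def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation from paired ranks (no tie correction)."""
    n = len(rank_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

rho = spearman_rho(rhodes_rank, our_rank)
print(round(rho, 2))
```

The squared rank differences sum to 30, giving a coefficient of about 0.82, above both critical values quoted for 10 pairs.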
  • Slide 53
  • Results - Stability and Material (significant interaction): The cross tabulation for each combination of stability and material is shown in Table IV. The p-value of the chi-square statistic was 7.28 × 10⁻⁴, indicating a significant interaction between channel stability and material. This corroborates the findings of Ridenour [20], who found an 86% success rate in predicting bank stability from hydraulic geometry exponents using compositional discriminant function analysis.
  • Slide 54
  • Results - Stability and Material - Table IV. Cross tabulation of channel stability and material (KY-TN).
      Material | firm | soft | Total
      gravel | 40 | 2 | 42
      sand | 13 | 7 | 20
      silt | 44 | 3 | 47
      Total | 97 | 12 | 109
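A chi-square test of independence on a cross tabulation such as Table IV can be sketched with the standard library alone. Expected counts are row total × column total / grand total, and the degrees of freedom are (rows - 1)(columns - 1). The helper `chi2_sf` computes the upper-tail probability for integer degrees of freedom from the incomplete-gamma recurrence; both function names are hypothetical, no continuity correction is applied, and the original analysis was reportedly done with pivot tables in MS-Excel.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2:                       # odd df: start from df = 1
        p, a = math.erfc(math.sqrt(half)), 0.5
    else:                            # even df: start from df = 2
        p, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        p += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return p

def chi_square_independence(table):
    """Return (statistic, degrees of freedom, p-value) for a 2-D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, chi2_sf(statistic, df)

# Table IV: rows gravel, sand, silt; columns firm, soft.
stability_by_material = [[40, 2], [13, 7], [44, 3]]
statistic, df, p_value = chi_square_independence(stability_by_material)
```

For Table IV this gives two degrees of freedom and a p-value close to the reported 7.28 × 10⁻⁴. The same function can be applied to any of the other cross tabulations by passing their counts without the Total row and column.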
  • Slide 55
  • Results - Evenness and Material (NO significant interaction): A similar cross tabulation of channel evenness and material (eliminating stations with either or both characteristics unspecified) is shown in Table V. The p-value of the chi-square statistic was 0.56, indicating no significant difference between the observed frequencies and those expected by chance; thus, there is no evidence of interaction between channel evenness and material.
  • Slide 56
  • Results - Evenness and Material - Table V. Cross tabulation of channel evenness and material (KY-TN).
      Material | even | uneven | Total
      ledge | 30 | 8 | 38
      boulders + cobbles | 29 | 8 | 37
      gravel | 37 | 6 | 43
      sand | 15 | 5 | 20
      silt | 35 | 14 | 49
      Total | 146 | 41 | 187
  • Slide 57
  • Results - Stability and Evenness (Interaction, NOT strong): Additional data were mined from the three states that share Tennessee's southern border (Mississippi, Alabama, and Georgia), which more than tripled the size of the database (to 599 stations). The cross tabulation is shown in Table VI. The p-value of the chi-square statistic was 0.011, indicating interaction between channel stability and evenness at the 5% and 10% significance levels but not at the 1% level.
  • Slide 58
  • Results - Stability and Evenness - Table VI. Cross tabulation of channel stability and evenness (KY-TN-MS-AL-GA).
      Evenness | firm | soft | Total
      even | 317 | 103 | 420
      uneven | 117 | 62 | 179
      Total | 434 | 165 | 599
  • Slide 59
  • Results - Channel Characteristics and Graphical Classification: The data for Kentucky and Tennessee were plotted on b-f-m diagrams and differentiated by stability, evenness, and material in Figures 5, 6, 7, and 8. To determine the efficacy of the dividing lines of the b-f-m diagram in separating channels by stability, evenness, or material, two merged channel types, one on either side of each line, were created by consolidating all channel types on the same side. The p-values are summarized in Table VII.
  • Slide 60
  • Results - Table VII. Chi-square p-values for tests of interaction of merged channel types with stability, evenness, and material (KY-TN-MS-AL-GA). (Small number means nonrandom.)
      Line | Bounding polygon 1 | Bounding polygon 2 | stability | evenness | material
      m = b + f | I thru II | III thru X | .024 | .386 |