PUBLIC DELIVERABLE
H2020 Project: Smart Resilience Indicators for Smart Critical Infrastructure
D3.9 ‐ Report on RapidMiner
Coordinator: Aleksandar Jovanovic EU‐VRi
Project Manager: Bastien Caillard EU‐VRi
European Virtual Institute for Integrated Risk Management
Haus der Wirtschaft, Willi‐Bleicher‐Straße 19, 70174 Stuttgart
Contact: smartResilience‐CORE@eu‐vri.eu
SMART RESILIENCE INDICATORS FOR SMART CRITICAL INFRASTRUCTURES
© 2016-2019 This document and its content are the property of the SmartResilience Consortium. All rights relevant to this document are determined by the applicable laws. Access to this document does not grant any right or license on the document or its contents. This document or its contents are not to be used or treated in any manner inconsistent with the rights or interests of the SmartResilience Consortium or to the Partners' detriment, and are not to be disclosed externally without prior written consent from the SmartResilience Partners. Each SmartResilience Partner may use this document in conformity with the SmartResilience Consortium Grant Agreement provisions. The research leading to these results has received funding from the European Union's Horizon 2020 Research and Innovation Programme, under the Grant Agreement No 700621.
The views and opinions in this document are solely those of the authors and contributors, not those of the European Commission.
Report on RapidMiner: Data Science-Driven Resilience Analytics with RapidMiner
Report Title: Report on RapidMiner
Author(s): T. Knape, B. Allen
Responsible Project Partner: AIA
Contributing Project Partners: n/a
Document data:
File name / Release: D3.9_Report on RapidMiner_v13sm25092019.docx Release No.: 4
Pages: 70 No. of annexes: 3
Status: Amended acc. to the EC review
Dissemination level: PU
Project title: SmartResilience: Smart Resilience Indicators for Smart Critical Infrastructures
Grant Agreement No.: 700621
Project No.: 12135
WP title:
The SmartResilience indicator-based methodology for assessing, predicting & monitoring the resilience of SCIs for optimized multi-criteria decision-making
Deliverable No: D3.9
Due date: September 30, 2019
Submission date: September 30, 2019
Keywords: RapidMiner, Resilience Analytics
Reviewed by:
Knut Øien Review date: March 21, 2019
Peter Klimek Review date: March 21, 2019
Frank Fiedrich Review date: March 24, 2019
Approved by Coordinator:
A. Jovanović Approval date: September 30, 2019
Dublin, September 2019
SmartResilience: Indicators for Smart Critical Infrastructures
page i
Release History
Release No.  Date  Description / Change
1 March 20, 2019 Draft version.
2 April 11, 2019 Updated version based on comments from reviewers.
3 April 23, 2019 Final version, based on comments during SC meeting on April 16, 2019.
4 September 24, 2019 Revised version prepared to address European Commission review comments.
Project Contact
EU-VRi – European Virtual Institute for Integrated Risk Management
Haus der Wirtschaft, Willi-Bleicher-Straße 19, 70174 Stuttgart, Germany
Visiting/Mailing address: Lange Str. 54, 70174 Stuttgart, Germany
Tel: +49 711 410041 27, Fax: +49 711 410041 24 – www.eu-vri.eu – [email protected]
Registered in Stuttgart, Germany under HRA 720578
SmartResilience Project
Modern critical infrastructures are becoming increasingly smart (e.g. smart cities). Making infrastructures smarter usually means making them smarter in normal operation and use: more adaptive, more intelligent, etc. But will these smart critical infrastructures (SCIs) behave smartly and be smartly resilient also when exposed to extreme threats, such as extreme weather disasters or terrorist attacks? If making existing infrastructure smarter is achieved by making it more complex, would that also make it more vulnerable? Would this affect the resilience of an SCI, understood as its ability to anticipate, prepare for, adapt to, withstand, respond to, and recover from such threats? And what are the resilience indicators (RIs) one has to look at?
These are the main questions tackled by the SmartResilience project.
The project envisages answering the above questions in several steps: (#1) by identifying existing indicators suitable for assessing the resilience of SCIs; (#2) by identifying new smart resilience indicators, including those derived from Big Data; (#3) by developing a new advanced resilience assessment methodology based on smart RIs and the resilience indicators cube, including the resilience matrix; (#4) by developing the interactive SCI Dashboard tool; and (#5) by applying the methodology and tools in 8 case studies, integrated under one virtual, smart-city-like, European case study. The SCIs considered (in 8 European countries!) deal with energy, transportation, health, and water.
This approach will allow benchmarking best-practice solutions and identifying early warnings, improving the resilience of SCIs against new threats and cascading and ripple effects. The benefits/savings to be achieved by the project will be assessed by the reinsurance company participant. The consortium involves seven leading end-users/industries in the area and seven leading research organizations, supported by academia and led by a dedicated European organization. World-leading external resilience experts are included in the Advisory Board.
Executive Summary
This D3.9 report describes our use of RapidMiner technology for resilience analytics applications; it relates to work in D3.3, D3.4, D3.7 and D4.6. The main objectives addressed are:

- Predictive resilience analytics
- Multi-criteria decision making
- Enterprise integration for resilience assessment applications
- New data-driven indicators (an update for the project database)
We have collaborated with Cork City Council on the application of predictive resilience analytics and multi-criteria decision making for the use case scenario on urban flood resilience.
Our research has been supported by the following agencies in the Irish Government:
- Office of Public Works
- Ordnance Survey Ireland, Ireland's National Geographic Service
- National Transport Authority
We thank them for supporting the development of resilience analytics and data science applications through discussions, meetings and the provision of data.
We have built predictive models for forecasting flood water levels using available datasets in the GOLF case study and evaluated their effectiveness.
We have further implemented multi-criteria decision making by the example of a flood-protection investment use case supported by the Office of Public Works.
Work carried out in D3.9 relates to the GOLF case study. Contributions under T3.3 concern a predictive model that, using location height data and location statistics, can predict future functionality levels along the FL-t curve in real time; however, longer forecasting horizons come at the expense of forecasting accuracy. Recovery is also linked to the severity of the flood water level impact and the likely structural flood damage. Contributions under T3.4 concern the use of RapidMiner and the application of the MCA approach, which is broadly applicable across a range of government decisions and fulfils important criteria for application in government, such as the ability to provide an audit trail, transparency and ease of use. It does not follow the MCDM approach described in D3.4.
Concerning T3.7, we discuss RapidMiner enterprise integration options and re-use of RapidMiner analytics processes which can be integrated with most database systems but likely require alterations to fit the particular business use case.
With regards to T4.6, we report on several new data-driven indicators. These indicators are based on the predictive water level model for the GOLF use case.
Table of Contents
Purpose of Document .......... 10
D3.9 in the SmartResilience project .......... 10
Intended Audience .......... 12
Impact on Stakeholders .......... 12
Flood risk management methods .......... 12
Background Predictive Analytics .......... 14
CRISP-DM Industry-standard methodology for data analytics projects .......... 15
Conclusions .......... 17
Introduction .......... 18
Algorithmic approach .......... 19
Data ingested by the model .......... 20
Data cleaning & pre-processing process .......... 21
Relevance of input data attributes - weighting by information gain .......... 23
Implementation of the forecasting model - RapidMiner Predictive Analytics .......... 25
3.6.1 Option 1 Simple Learning with Naïve Bayes .......... 25
3.6.2 Option 2 ARIMA .......... 27
3.6.3 Option 3 Deep Learning .......... 32
3.6.4 Model performance (unseen data) discussion & conclusions .......... 36
Predictive Functionality Levels & Charting .......... 38
Conclusions .......... 41
Introduction .......... 42
Flood Protection Investments Options GOLF .......... 42
MCA benefit score approach .......... 44
Implementation in RapidMiner .......... 48
Conclusions .......... 52
Conclusions .......... 54
Conclusions .......... 55
Annex 1 Summary of the input data .......... 60
Annex 2 Charts .......... 62
Annex 3 Review process .......... 64
List of Figures
Figure 1: Complete structure for functionality assessment in a smart city .......... 11
Figure 2: Common steps in a predictive analytics project .......... 14
Figure 3: Phases of the CRISP-DM reference model [7] .......... 16
Figure 4: Predictive resilience analytics .......... 18
Figure 5: Geo Locations Sensors - on OSI ITM Digital Globe Aerial Imagery .......... 21
Figure 6: Geo Locations Sensors - on OSI ITM basemap .......... 21
Figure 7: Data cleaning & pre-processing process .......... 22
Figure 8: Executive Process containing the data pre-processing .......... 23
Figure 9: Naïve Bayes Model with integrated Weighting using Information Gain .......... 26
Figure 10: Validation sub-process .......... 27
Figure 11: Naïve Bayes confusion matrix on training & test data .......... 27
Figure 12: Arima Top level Predictive Model View .......... 28
Figure 13: Arima Executive Process Sub-Process View .......... 29
Figure 14: Arima Model Copy-Time-column Sub-Process View .......... 29
Figure 15: Arima Model Handle-missing Sub-Process View .......... 30
Figure 16: Arima Model Find ExtremesInLabel Sub-Process View .......... 30
Figure 17: Arima Model find-name-of-label Sub-Process View .......... 30
Figure 18: Arima Model rename-label-to-a-standard-name Sub-Process View .......... 31
Figure 19: Arima Model Optimize Parameters (Grid) .......... 31
Figure 20: ARIMA: the blue line is the prediction .......... 32
Figure 21: Holt-Winters: the red line is actual, the blue line is the prediction .......... 32
Figure 22: Deep Learning Toplevel Predictive Model View .......... 33
Figure 23: Deep Learning Results .......... 33
Figure 24: Performance Vector Deep Learning - Training Dataset .......... 33
Figure 25: Deep Learning forecasting accuracy chart - actual values are in red, predicted values in blue .......... 34
Figure 26: Shorter-term Deep Learning forecasting accuracy chart (1) - actual values are in red, predicted values in blue .......... 34
Figure 27: Shorter-term Deep Learning forecasting accuracy chart (2) - actual values are in red, predicted values in blue – on unseen data .......... 35
Figure 28: Deep Learning forecasting accuracy chart (window size of 48) - actual values are in red, predicted values in blue .......... 35
Figure 29: Shorter-term Deep Learning forecasting accuracy chart (1) (window size of 48) - actual values are in red, predicted values in blue .......... 36
Figure 30: Shorter-term Deep Learning forecasting accuracy chart (2) (window size of 48) - actual values are in red, predicted values in blue .......... 36
Figure 31: Naïve Bayes confusion matrix on unseen data .......... 37
Figure 32: Performance Vector Deep Learning – Unseen Dataset .......... 37
Figure 32: Predicted number of jobs at business locations either affected and severely affected by flooding .......... 38
Figure 33: Ratio of jobs at business locations either affected and severely affected by flooding .......... 39
Figure 34: Percentage of non-affected jobs & non-affected jobs severely .......... 39
Figure 35: Predictive modelling of FL-t curve .......... 40
Figure 36: RapidMiner process .......... 48
Figure 37: Review of data in the RapidMiner data editor .......... 48
Figure 38: RapidMiner functions - MCDM formula for GOLF .......... 50
Figure 39: Defining the sum over all weighted scores calculated for the criteria for a specific option .......... 50
Figure 40: Defining the grouping by flood protection measure option .......... 51
Figure 41: Result set with values for each option .......... 51
Figure 42: Visual representation of MCDM result set .......... 51
Figure 43: Example operators supporting integration of RapidMiner Resilience Analytics applications .......... 53
Figure 44: Webservices integration for analytics processes running on RapidMiner Server in, e.g., a cloud deployment scenario .......... 54
Figure 45: Insured businesses either affected or severely affected by the predicted water level .......... 62
Figure 46: Ratio of the number of jobs at business locations either affected or severely affected by the predicted flood water level .......... 62
Figure 47: Value of stock levels held at business locations either affected or severely affected by the predicted flood water level .......... 63
List of Tables
Table 1: Functionality assessment levels GOLF test scenario .......... 11
Table 2: Weight by information gain .......... 24
Table 3: Global weighting for flood protection investment options .......... 44
Table 4: Local weighting, importance scoring .......... 45
Table 5: Scoring - General Approach .......... 45
Table 6: Scoring - Technical & Economic Criteria .......... 46
Table 7: Scoring - Social & Environmental .......... 46
Table 8: Scoring - Other Criteria .......... 46
Table 9: MCA benefit calculation for a data-driven social indicator (partial, reviewing one data-driven indicator) .......... 47
Table 10: Example calculation of MCA value per option .......... 47
Table 11: Example MCDM calculation .......... 49
Table 12: Data summary – Roches Point Weather Station – every hour .......... 60
Table 13: Data summary – Tidal Station NMCI Ringaskiddy Data – every 15 mins .......... 61
Table 14: Data summary – Water Level Station Lee Road – every 5 mins .......... 61
List of Acronyms
Acronym Definition
AIA Applied Intelligence Analytics
APSR Area of Potentially Significant Risk
ARIMA AutoRegressive Integrated Moving Average
AU Assessment Unit
CCC Cork City Council
CSV Comma-separated values (file format)
FRM Flood Risk Map
GW Global-weighting
ITM Irish Transverse Mercator
LW Local-weighting
MCA Multi-Criteria Analysis
MCDM Multi-Criteria Decision Making
NTA National Transport Authority
OPW Office of Public Works
OSI Ordnance Survey Ireland, Ireland’s National Geographic Service
Introduction
Purpose of Document
This report relates to several project tasks (T3.3, T3.4, T3.7 and T4.6):
1. Contributing to modelling the impact and recovery phase using the predictive analytics capabilities of RapidMiner with available data in the context of the GOLF case study
2. Contributing to developing multi-criteria decision analysis tools based on, e.g., RapidMiner
3. Supporting the development of an integrated resilience assessment tool based on expertise with RapidMiner technology
4. Contributing to updating the resilience indicators database through expertise on databases and risk/resilience tools
The report addresses the above in the following sections:

Task(s)  Chapter
All      Chapter 2: RapidMiner Analytics
1        Chapter 3: Resilience Analytics with RapidMiner Predictive Analytics
2        Chapter 4: Multi-Criteria Decision Making
3        Chapter 5: Enterprise Integration with RapidMiner
4        Chapter 6: New data-driven indicators
D3.9 in the SmartResilience project
D3.9 focuses on the technical aspects of predictive modelling and MCDM implementation using RapidMiner technology. We describe how we use predictive analytics and MCDM to build data-driven indicators that help assess resilience, demonstrated using available datasets in the context of the GOLF case study.
GOLF – Urban Flooding case study
Cork City, located at the head of a tidal estuary and at the downstream end of a large river catchment, is prone to both tidal and fluvial flooding. Cork City is the second-largest city in the Republic of Ireland, with a population of 125,622 as per the 2016 census. Flooding is the main threat in the GOLF case study. For predictive impact modelling, we selected tidal flooding as the most frequent type of flooding and calculated the predicted impact/recovery for the city using available economic statistics, location height data and environmental data.
D3.9 describes predictive modelling of the impact and recovery phase using RapidMiner technology and thereby relates to Deliverable D3.3: Report on the 'SmartResilience Methodology for Assessing Resilience of SCIs Based on RIs (Resilience Indicators)' [1].
D3.9 further describes the implementation of MCDM using RapidMiner by the example of a practical use case supported by the OPW, and relates to Deliverable D3.4: Report on the 'SmartResilience MCDM Methodology Serving as the Basis for the SCIs Dashboard' [2].
Also, D3.9 describes the reuse of RapidMiner processes and integration options with other systems such as the SmartResilience database, and thereby relates to Deliverable D3.7: The 'SCIs Dashboard Containing the Module on Dynamic Intelligent Checklists' [3].
It further lists new data-driven indicators which relate to Deliverable D4.6: New Release of the RI-Database [4].
The figure below illustrates the complete structure for functionality assessment in a smart city according to the SmartResilience methodology [1]. It relates to chapters 3 and 4 (predictive resilience analytics for the impact and recovery phases).
Figure 1: Complete structure for functionality assessment in a smart city
The table below relates resilience analytics, as described in chapters 3 and 4, to the SmartResilience methodology.
Table 1: Functionality assessment levels GOLF test scenario
SmartResilience methodology structure / Test scenario: Urban Flooding Resilience Predictive Analytics

- Level 1. Functionality level of the city: Cork City. Building a threat impact prediction model, such as the water level prediction model for Cork City discussed in chapter 3, from the Cork City height dataset and environmental sensor data.
- Level 2. Functionality level (FL) of the infrastructure, corresponding to the SCIs in the project: Economy. Employment statistics.
- Level 3. Functionality elements (FEs): e.g. jobs, buildings, insurance; location and height attributes in stakeholder databases.
- Level 4. Functionality indicators (FIs): e.g. the percentage or number of jobs affected, or severely affected, by flooding. Impact and recovery indicator calculation based on prediction models and data assets at levels 1, 2 and 3.
Intended Audience
The intended audience of this report, which mostly focuses on data science and the use of RapidMiner for data-driven indicator development, comprises the actors involved in a data science-driven resilience assessment project:

- End users (needs / requirements / validation / data)
- Business analysts (concept analysis)
- Data scientists (data preparation / modelling / operationalisation)
End users:
Readers who are interested in forecasting applications helping to assess the impact of a threat such as flooding water levels in their work environment, e.g. emergency coordination.
Business analysts:
Readers who are interested in the domain of predictive modelling, MCDM, delivering projects that create decision support tools using predictive analytics or MCDM applications.
Data scientists:
Readers who are interested in the technical aspects of data analytics.
Impact on Stakeholders
As per the DRS-14-2015 call topic's prospective project impact requirements, the funding agency requested the action "to proactively target the needs and requirements of public bodies." The assessment of end-user needs and of various resilience analytics options has been carried out in close collaboration with Cork City Council and Cork City Fire Brigade, supported by the following stakeholders in the Irish Government:

- Office of Public Works
- Ordnance Survey Ireland
- National Transport Authority
We have assessed various datasets from the above agencies. We argue that our focus on prototyping resilience analytics concepts where data assets are readily available in public bodies warrants the highest impact requested by the funding agency.
Flood risk management methods
Different methods to assess the risk and vulnerability of areas to flooding have been developed over the last few decades. Two of the more widely used are deterministic, physically-based hydraulic modelling approaches to risk assessment and parametric approaches to assessing flood vulnerability [5]. Deterministic approaches use physically-based hydraulic models to estimate the flood hazard/probability of particular events and rely on a significant amount of detailed topographic, hydrographic and economic information about the area studied. If this information is available, reasonably accurate estimates of the potential flood risk to an area can be achieved. Parametric approaches were introduced in the 1980s by Little and Rubin [6] and aim to use only a few readily available pieces of information to build a picture of the vulnerability of an area. Parametric approaches focus on vulnerability assessments to minimise the impacts of flooding and to increase the resilience of the affected system. This report presents a hybrid approach: deterministic flood prediction modelling using available environmental data sources, combined with parameters such as the number of jobs registered at businesses affected by flooding as determined by the water level prediction model.
RapidMiner Analytics
Background: Predictive Analytics
As one of the main parts of this report focuses on predictive modelling, this section gives a primer on predictive analytics.

Predictive analytics is about building predictive models that can provide accurate assessments of what will happen in the future. Using data, statistical algorithms and machine learning techniques, predictive analytics identifies the likelihood of future outcomes based on historical data.
Predictive analytics is the process of identifying patterns in historical data to estimate values for future data we do not yet have. An example, discussed further in this report, is the use of past location-based water level data, tidal data and weather data to build a model that predicts future water levels, allowing us to better prepare for a flooding disaster. With an accurate estimate of a future water level, we can evaluate how many locations at certain heights will be affected and, using location-based statistics, calculate the likely impact of a flood: for example, how many jobs at those locations are endangered, how many locations have no insurance cover, and so forth. Comparing the predicted water level to location height data also gives a clearer picture of how many locations will be severely affected, for instance those at least 0.5 m below the predicted water level, which are unlikely to recover quickly because of the required repair work. These locations and their associated statistics will likely not show a recovery to 100% functionality soon after the disaster impact.
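As a concrete (hypothetical) sketch of this comparison, the following Python snippet derives simple impact figures from a predicted water level and location heights. The field names, sample values and the 0.5 m severity margin are illustrative assumptions for the example, not values taken from the project datasets:

```python
# Illustrative sketch: deriving impact indicators from a predicted
# water level and location heights; all names and numbers are
# assumptions for the example, not project data.

def flood_impact(locations, predicted_level_m, severe_margin_m=0.5):
    """Classify locations (dicts with 'height_m' and 'jobs') against a
    predicted flood water level and derive simple impact indicators."""
    affected = [loc for loc in locations if loc["height_m"] < predicted_level_m]
    severe = [loc for loc in locations
              if loc["height_m"] <= predicted_level_m - severe_margin_m]
    total_jobs = sum(loc["jobs"] for loc in locations)
    jobs_at_risk = sum(loc["jobs"] for loc in affected)
    return {
        "locations_affected": len(affected),
        "locations_severely_affected": len(severe),
        "jobs_at_risk": jobs_at_risk,
        "pct_jobs_unaffected": round(100.0 * (total_jobs - jobs_at_risk) / total_jobs, 1),
    }

# Four made-up business locations and a predicted level of 2.0 m:
sample = [
    {"height_m": 1.2, "jobs": 40},
    {"height_m": 1.8, "jobs": 10},
    {"height_m": 2.5, "jobs": 30},
    {"height_m": 1.4, "jobs": 20},
]
impact = flood_impact(sample, predicted_level_m=2.0)
# -> 3 locations affected, 2 severely affected, 70 jobs at risk
```

In the project this comparison is of course done at scale in RapidMiner over the stakeholder location databases; the sketch only illustrates the arithmetic behind the derived indicators.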
Predictive modelling
It is essential to understand that a predictive analytics problem cannot be solved by simply loading data into a predictive modelling tool such as RapidMiner and hoping it will return the required results. Model creation is greatly supported by a visual modelling tool, but for the model to be effective, we need to carry out an in-depth review of the problem to be solved and the available data. Often it turns out that the available data is not enough to support a predictive model, and quite often the strategy for predictive modelling changes shape during the pre-modelling phases.
A predictive modelling project is often very detailed and complex; however, all such projects share some high-level tasks. The following illustrates the main steps for building an effective predictive model. These steps are taken iteratively when aiming to build a highly accurate model that benefits the participating project end user. Data preparation concerns data access, exploration, blending and cleansing. Data modelling concerns model building and validation. Operationalisation concerns deployment and maintenance as well as embedding. These steps are explained in more detail below, as they are referred to in the predictive model building discussion in chapter 3.
Figure 2: Common steps in a predictive analytics project
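To make the three step groups concrete, the sketch below walks a toy dataset through preparation, modelling and validation. The naive persistence model stands in for the real RapidMiner models, and all field names are assumptions, not the project's implementation:

```python
# Minimal sketch of the prepare -> model -> validate loop described
# above, using a naive persistence forecaster as a stand-in for the
# real RapidMiner models; field names are illustrative.

def prepare(raw_readings):
    """Data preparation: drop missing values and sort by timestamp."""
    clean = [r for r in raw_readings if r["level_m"] is not None]
    return sorted(clean, key=lambda r: r["t"])

def fit_persistence(history):
    """Data modelling: predict that the next level equals the last one."""
    last = history[-1]["level_m"]
    return lambda: last

def absolute_error(model, actual_next):
    """Validation: absolute error of the one-step-ahead forecast."""
    return abs(model() - actual_next)

readings = [{"t": 2, "level_m": 1.4}, {"t": 1, "level_m": None}, {"t": 0, "level_m": 1.2}]
model = fit_persistence(prepare(readings))
err = absolute_error(model, actual_next=1.5)  # approximately 0.1
```

Operationalisation would then wrap these steps in a scheduled or triggered job so the model is retrained and rescored as new sensor readings arrive.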
Data preparation
In the first step, we identified relevant data sources that would allow us to build a predictive resilience analytics model in an urban flood resilience context. We discussed and evaluated data sources with:

- Cork City Council
- National Transport Authority
- Ordnance Survey Ireland, Ireland's National Geographic Service
- Office of Public Works

We explored the available datasets for use in predictive resilience analytics applications. For datasets supporting a predictive resilience model, we blended data using transformations, data parsing, type conversions, filtering, sorting, set operations such as joins or unions, aggregations, rotations, feature selection, feature creation, feature extraction, sampling and partitioning.

In the next step, we cleansed the data using anomaly and outlier detection, duplicate detection, binning, dimensionality reduction, missing value handling and normalisation.

We built a number of predictive analytics models and scored them to validate their performance.
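One of the blending and cleansing patterns listed above, aggregation of multi-rate sensor feeds (hourly weather, 15-minute tidal, 5-minute water level readings, see Annex 1) to a common time grid plus missing value handling, can be sketched as follows. The readings and the "keep the last reading per hour" aggregation choice are illustrative, not the project's actual pipeline:

```python
# Hedged sketch of one blending/cleansing pattern: resample readings to
# a common hourly grid, then forward-fill gaps. Values are made up.

def to_hourly(readings):
    """Resample (minute_offset, value) pairs to hours, keeping the last
    reading observed within each hour (a simple aggregation choice)."""
    hourly = {}
    for minute, value in readings:
        hourly[minute // 60] = value  # later readings overwrite earlier ones
    return hourly

def forward_fill(hourly, hours):
    """Missing value handling: carry the last observed value forward."""
    filled, last = {}, None
    for h in hours:
        last = hourly.get(h, last)
        filled[h] = last
    return filled

tidal_15min = [(0, 2.1), (15, 2.2), (30, 2.4), (60, 2.6), (135, 2.9)]
hourly = to_hourly(tidal_15min)              # {0: 2.4, 1: 2.6, 2: 2.9}
filled = forward_fill(hourly, range(0, 4))   # hour 3 is filled with 2.9
```

Once all feeds share a grid, a join on the hour key blends weather, tidal and water level attributes into one example set for modelling.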
Modelling
We used different modelling techniques to build predictive models. These include machine learning algorithms for regression and classification, as well as association mining, frequent item sets and similarity computation. We also used feature weighting, segmentation and clustering, and ensemble and hierarchical models, and further applied algorithms, loops and branches to find optimal actions.

We used cross-validation to validate the performance of the models and several interactive charts to gain visual insight into model performance. We used numerical, nominal and categorical model performance criteria, as well as significance tests, optimal threshold cut-offs for binomial classes, cost-sensitive learning and other performance measures.
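The cross-validation scheme can be sketched as follows, with a trivial mean predictor standing in for the real models; note that for time-ordered data such as water levels, a forward-chaining, windowed split is usually preferable to plain k-fold:

```python
# Hedged sketch of k-fold cross-validation; the mean predictor and the
# toy data are placeholders, not the project's models or datasets.

def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal, contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate_mae(values, k=3):
    """Mean absolute error of a mean-predictor, averaged over k folds."""
    errors = []
    for train, test in kfold_indices(len(values), k):
        mean = sum(values[i] for i in train) / len(train)
        errors.extend(abs(values[i] - mean) for i in test)
    return sum(errors) / len(errors)

levels = [1.0, 1.2, 1.1, 1.4, 1.3, 1.5]
mae = cross_validate_mae(levels, k=3)  # approximately 0.2 on this toy series
```

Each fold is held out once, so every example is scored by a model that never saw it; RapidMiner's Cross Validation operator follows the same principle with the real learners.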
Operationalisation
For operationalisation, we have used the scoring engine, run models on server- and cloud-based infrastructures, and triggered or scheduled their execution.
CRISP-DM: industry-standard methodology for data analytics projects

Utilising a standard methodology can help ensure quality outcomes for predictive analytics. The Cross-Industry Standard Process for Data Mining (CRISP-DM) [7] is a widely followed standard process for analytics projects. It is composed of six steps, which we discuss in the following with comments on where each step sits in relation to SmartResilience data-driven indicators and predictive modelling.
Figure 3: Phases of the CRISP-DM reference model [7]
1. Business understanding
In this step, we spend time understanding the reasons for predictive modelling from a business perspective. In SmartResilience this phase would concern levels 1 to 4 shown in Figure 1.
2. Data understanding
In this step, we review the data and its potential promises and shortcomings, and we begin to generate hypotheses. We then reassess the business understanding (step 1) if needed. In SmartResilience we have been identifying data sources, reviewing their potential for predictive modelling and analysing legal data access modalities. This second CRISP-DM step relates to level 4 in the SmartResilience functionality level hierarchy: it concerns the data of a data-driven indicator that informs about a functionality level.
3. Data preparation
In this step, we carry out data selection, integration, transformation and pre-processing. CRISP-DM does not prescribe the order in which these tasks are done.
4. Modelling
In this step, we apply the algorithms to the data to discover the patterns. We may have to reassess the data preparation step (step 3) if the modelling step requires it.
5. Evaluation
Here we evaluate the model and the discovered patterns for their value in answering the business problem, such as location-based water level prediction. We may have to revisit the business understanding (step 1) if necessary.
6. Deployment
We present the discovered knowledge and models and put them into production to solve the business problem. In SmartResilience this would be a new data-driven indicator.
The strength of CRISP-DM is in its built-in iteration. We are expected to check that the current step is still in agreement with certain previous steps. Another strength is that we are explicitly reminded to keep the business problem in the centre of all steps including the evaluation steps. The SmartResilience case studies give a broad spectrum for predictive modelling use cases. Data understanding and availability of relevant historical datasets is key to predictive modelling and data-driven indicators.
RapidMiner Studio [8]
RapidMiner Studio is a visual design environment for rapidly building complete predictive analytic workflows. It provides an extensive library of machine learning algorithms, data preparation and exploration functions, and model validation tools to support data science projects and use cases.
Data science teams can easily re-use existing R and Python code, and new functionality via a vast marketplace of pre-built extensions.
Key features are:
- Visual Programming Environment
- Guided Analytics
- Reusable Building Blocks & Processes
- 1500+ Machine Learning & Data Prep Functions
- Integration of R & Python Scripts
- Correct Model Validation Methods
- Access All Types of Data
RapidMiner Studio is a Java-based application that facilitates GUI-driven development of predictive and descriptive models. It is possible to run model training directly, as well as model application, which facilitates the modelling and scoring of unseen data. In this project, we have been dealing with reasonably small datasets, even when using larger window sizes, on a reasonably powerful workstation with an i7-7700HQ quad-core CPU and 16 GB RAM. Alternative implementation options include C, C++ or possibly Python, which can yield fast performance but may not offer the advantage of rapid prototyping through a GUI-driven development environment.
RapidMiner Server [9]

RapidMiner Server allows for fast and straightforward collaboration on large-scale enterprise data science projects. Users across an organisation can easily access, reuse and share models and processes in a version-controlled, secure and centrally managed environment. RapidMiner Server easily integrates analytic results into business processes and applications with its rich set of connectors, BI integration and web-service APIs. Key features are:
- Optimised enterprise data science teamwork
- Seamlessly operationalise, leverage enterprise infrastructure
- Highly scalable, distributed architecture
- Cloud deployment
The next section discusses the use of RapidMiner for predictive modelling addressing flood water level prediction and location-based impact and recovery calculations.
Conclusions

This chapter gave an overview of RapidMiner and the typical steps followed in a data analytics project. CRISP-DM is a widely used reference model for such projects. Predictive modelling has been described as the process of identifying patterns in historical data to estimate values for future data we do not yet have. A predictive analytics business problem cannot be solved by simply loading data into a predictive modelling tool such as RapidMiner and hoping it will return the required results. It requires project-specific consultancy in which end users, business analysts, data scientists and developers, as key stakeholders, collaborate to refine the business understanding, following the six phases of the CRISP-DM model, so that the resulting decision-support predictive indicators reliably address the interests of the end-user stakeholders.
In the next chapter, we discuss the development of predictive data-driven indicators we carried out going through the various CRISP-DM steps iteratively.
Resilience Analytics with RapidMiner Predictive Analytics
Introduction

Predictive analytics empowers resilience assessment applications with data-science-driven accuracy for flood impact and recovery, and its results are straightforward enough that anyone can form a meaningful understanding of the predicted flood impact and recovery by using it. Predictive analytics can be used to create accurate forecasting applications, and end users can use them to substantiate decisions for mitigating negative flood impact in an urban environment: more effective disaster preparation and response actions, and analysis of future resilience improvement strategies in lessons-learned post-disaster assessments. Predictive analytics creates data insight from relevant data sources for resilience assessment applications and supports officers in public authorities in understanding the impact of the predicted next disaster. With an understanding of the predicted disaster extent, officers are empowered to plan actions that reduce the disaster's impact on society, for example by checking how many temporary flood protection options, such as flood bags, are available to protect locations that are estimated to be flooded. Predictive resilience analytics is thus a powerful tool that can focus the thoughts and actions of disaster management staff on reducing disaster impact.
The figure below illustrates the approach for predicting the impact of a threat via a threat target parameter, utilising location intelligence from various databases and location height data.
With the predictive modelling, we obtain an indicator for predicting future water levels which are then in turn used as input into the resilience assessment.
Figure 4: Predictive resilience analytics
Figure 4 contains various variables. The following explains them in more detail:
- Time series data from sensors etc.: symbolic data from sensor sources such as those described in Annex 1.
- Predictive Threat Target Parameter: the threat variable, such as the water level for an urban area prone to flooding hazards, which we are aiming to forecast.
- Influencing attributes in time series data: for the threat of a high water level, there can be several influencing factors, such as those described in Table 12. Information gain analysis (see 3.5) then determines which of the attributes have an influence.
- Location Intelligence: the data described here is used to calculate the impact of a predicted threat target parameter. For instance, a predicted water level x could mean that y locations at or below that level are impacted. Any statistical data associated with the impacted locations helps calculate the predicted threat impact, such as the number of jobs at business locations (see section 3.7).
- Resilience Indicators: these are calculated by associating the predicted threat with the impacted locations and location-specific statistics, such as an employment database, for indicators such as "Predicted number of jobs at business locations either affected or severely affected by flooding" (see section 3.7).
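The location-intelligence step above can be sketched as follows; the location table, heights and employment figures are entirely hypothetical and serve only to illustrate the "locations at or below the predicted level" calculation:

```python
import pandas as pd

# Hypothetical location-intelligence table: business locations with
# ground height (metres) and an employment figure per location.
locations = pd.DataFrame({
    "name":   ["Shop A", "Office B", "Depot C", "Cafe D"],
    "height": [1.2, 2.5, 0.9, 1.8],
    "jobs":   [5, 40, 12, 8],
})

predicted_level = 1.5  # metres, output of the predictive model

# Locations at or below the predicted water level are impacted;
# summing their employment figures gives a jobs-at-risk indicator.
impacted = locations[locations["height"] <= predicted_level]
jobs_at_risk = impacted["jobs"].sum()
print(len(impacted), jobs_at_risk)
```

In the project, the same join is performed between the predicted threat target parameter and the location and employment databases.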
Algorithmic approach

In this section, we present the predictive analytics process in pseudocode. A predictive model can be either a regression model with continuous output or a classification model; the model performance evaluation differs for each approach and is discussed in section 3.6.4. A predictive classification model for flooding events outputs either 0 or 1 (no flooding or flooding). A predictive regression model for flooding outputs a continuous value, such as the expected water level, and can be implemented through various algorithmic approaches, for instance ARIMA, Holt-Winters or Deep Learning.
Generic

for a given natural disaster environment do
    identify the multivariate parameters of system inoperability
    collect relevant data
    evaluate candidate analytics approaches
    assess model accuracy and create performance plots and visualisations
    if accuracy >= acceptable threshold set by the stakeholder (e.g., errors < 10%) then
        identify the key predictors of the multivariate inoperability
        assess the (non-linear) influence of the key predictors on the response
    else
        improve data collection, further model tuning
    end if
end for

Test / validation (Naïve Bayes / Deep Learning)

for a given dataset do
    training phase:
        split training data (80/20)
        preprocess data
        cross-validation
        feature weighting
    testing phase:
        test algorithm (20% of unseen data)
        if accuracy >= threshold then
            save the model for application to unseen data
        end if
end for

Model application (classification system, Naïve Bayes)

for every new data point do
    classify the flow according to the saved model
    if the data point is a flooding incident then
        categorise as flooding
    else
        categorise as no flooding
    end if
end for

Model application (deep learning model)

for every new data point do
    predict the next water levels, as defined by the window size, according to the saved model
end for
Data ingested by the model

Annex 1 presents a description of the data we used for predictive modelling. The data is from Lee Road water level station, NMCI Ringaskiddy tidal station and Roches Point weather station. Training and test data covered 17 Sep 2015 to 31 Dec 2018; unseen data covered 1 Jan 2019 to 28 Feb 2019.
The data allowed us to build a model that had an acceptable outcome. The location of Lee Road Station is not ideal, and the modelling would benefit from water level measurements closer to the city centre.
Figure 5: Geo Locations Sensors - on OSI ITM Digital Globe Aerial Imagery
Figure 6: Geo Locations Sensors - on OSI ITM basemap
Data cleaning & pre-processing process

Pre-processing in any data science project is responsible for the bulk of the effort. It is critical for any supervised learning algorithm to obtain the correct data structures, not just in terms of correct data types but also in relation to maximising the 'value' of the data.
For this project, as we are looking to predict water levels, it is vital to incorporate previous water level data points for each record. We implemented this by using a windowing approach that consists of setting the
window size, the horizon and the offset. The pre-processing process then produces datasets, based on the input data, that are optimised for the following modelling processes:
Naïve Bayes classification: this process also uses the windowing approach but, in the absence of a clear flooding threshold, we created a class label that is 'High' for all water levels higher than the average plus two times the standard deviation. The remaining records are classed as 'OK'.
ARIMA and Holt-Winters: this process aims to predict the actual height. We again use the windowing approach, tidy up dates and remove attributes that have no relevance.
Deep learning: this process also aims to predict the actual height, again using the windowing approach, tidying up dates and removing irrelevant attributes.
The pre-processing is mainly in relation to creating the window size and rearranging the data in such a way that it is useful for the algorithms.
Figure 7: Data cleaning & pre-processing process
The Execute Process operator can run other processes and accept input parameters. Here, the input parameters are the window size, horizon and offset values. The pre-processing process creates data for all three model inputs.
Window size: the number of values in one window, i.e. how many consecutive values are used as input attributes.
Offset: the gap between the window and the forecast; with an offset of zero we start with the following value.
Horizon: the number of future values we predict, i.e. we take the next n values.
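The windowing idea can be sketched outside RapidMiner as follows; this is a simplified illustration of the operator's behaviour under the definitions above, not its exact semantics, and the series values are made up:

```python
import pandas as pd

def window_series(s, window_size, horizon, offset=0):
    """Turn a series into windowed training rows: each row holds
    `window_size` consecutive past values as features and the value
    `offset + horizon` steps ahead as the label (a sketch of the
    idea, not RapidMiner's exact Windowing operator)."""
    frame = {f"t-{i}": s.shift(i) for i in range(window_size)}
    frame["label"] = s.shift(-(offset + horizon))
    return pd.DataFrame(frame).dropna()

# Illustrative rising water level series.
levels = pd.Series([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6])
rows = window_series(levels, window_size=3, horizon=1)
print(rows)
```

Each resulting row pairs a window of past values with the future value to be predicted, which is exactly the structure the downstream learners consume.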
Figure 8: Executive Process containing the data pre-processing
Discussion of data cleaning & pre-processing process
This process concerns the data cleaning and pre-processing, ingesting the data from Table 12, Table 13 and Table 14. The following lists the most important parts of the pre-processing:

1. Retrieve operator "Retrieve river": accesses stored information in the repository and loads it into the process. The river2 CSV file with the data outlined in Table 12, Table 13 and Table 14 is accessed.
2. Sub-process "Copy Time column":
   a. Operator "generate attributes": generates the attribute MeasurementTime = [date-measurement].
   b. Operator "reorder attributes": ensures the order of columns matches the table view.
3. Sub-process "Handle missing":
   a. Sub-process "unify column types"
   b. Operator "select attributes": removes the low-quality columns ind, ind 1, ind 2, ind 3, ind 4.
   c. Sub-process "replace missing values"
   d. Operator "reorder attributes": ensures the order of columns matches the table view.
   e. Operator "filter examples"
4. Sub-process "FindExtremesInLabel"
5. Operator "Select attributes"
6. Operator "Set role"
7. Operator "Fill data gaps": finds all the possible dates and times, makes gaps visible and ensures that the window is based on a fixed number of time-based steps. This leads to some blank rows, but these are deleted later.
8. Operator "Join"
9. Operator "Select attributes"
10. Sub-process "Find name of label"
11. Operator "Multiply"
12. Operator "Windowing"
Relevance of input data attributes - weighting by information gain

We have calculated the relevance of the input attributes for predicting the water level based on information gain and assigned weights to them accordingly. The attributes with the largest weights are the values of the previous window. Temperature was also among these predictors; this is most likely linked to warmer weather being advantageous when dealing with flooding, or to temperatures usually dropping with high winds. We assume this occurrence is more prevalent in spring, winter and autumn.
The table below lists the weights. The higher the weight, the more important the attribute. Not surprisingly, the previous water levels are the most important predictors. The tide does not seem to have much of an impact, likely because the only water level data available is from a station located outside the city, away from the sea.
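The weighting idea can be reproduced outside RapidMiner with mutual information, an information-gain analogue available in scikit-learn. The data below is synthetic and only mimics the qualitative pattern in Table 2 (previous water level strongly informative, tide barely informative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
# Synthetic attributes: the previous water level drives the class
# label, while the tide is unrelated noise.
prev_level = rng.normal(1.0, 0.3, n)
tide = rng.normal(0.0, 1.0, n)
y = (prev_level > 1.3).astype(int)  # 'High' vs 'OK'

X = np.column_stack([prev_level, tide])
weights = mutual_info_classif(X, y, random_state=0)
print(dict(zip(["prev_level", "tide"], weights.round(3))))
```

As in Table 2, the informative attribute receives a much larger weight than the uninformative one.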
Table 2: Weight by information gain
Attribute Weight Attribute Weight
Level Station 1 (Target) - 0 0.274635562 rain - 11 0.010423
Level Station 1 (Target) - 1 0.274151314 rain - 10 0.010393
Level Station 1 (Target) - 2 0.273220028 rain - 9 0.010325
Level Station 1 (Target) - 3 0.272367715 rain - 8 0.010271
Level Station 1 (Target) - 4 0.272121079 rain - 7 0.010223
Level Station 1 (Target) - 5 0.271289444 rain - 6 0.010191
Level Station 1 (Target) - 6 0.270810944 rain - 5 0.010158
Level Station 1 (Target) - 7 0.270113386 rain - 4 0.010085
Level Station 1 (Target) - 8 0.269516878 rain - 3 0.01005
Level Station 1 (Target) - 9 0.268648041 rain - 2 0.010027
Level Station 1 (Target) - 10 0.268085498 rain - 1 0.009987
Level Station 1 (Target) - 11 0.267156002 rain - 0 0.009961
date-measurement - 0 0.069757884 dewpt - 1 0.009692
Date tide - 0 0.069757884 dewpt - 6 0.009692
date weather - 0 0.069737766 dewpt - 4 0.009692
temp - 10 0.024932777 dewpt - 3 0.009692
temp - 11 0.024932348 dewpt - 2 0.009692
temp - 9 0.024931063 dewpt - 0 0.009692
temp - 8 0.024929777 dewpt - 5 0.009692
temp - 7 0.024927635 dewpt - 7 0.00969
temp - 6 0.024925064 dewpt - 8 0.009689
temp - 5 0.024922922 dewpt - 10 0.009687
temp - 4 0.024918209 dewpt - 9 0.009687
temp - 3 0.024913496 dewpt - 11 0.009686
temp - 2 0.02490964 vappr - 6 0.009619
temp - 1 0.0249045 vappr - 5 0.009619
temp - 0 0.024897647 vappr - 4 0.009619
msl - 11 0.021713452 vappr - 3 0.009618
msl - 10 0.021658996 vappr - 2 0.009618
msl - 9 0.021595849 vappr - 1 0.009618
msl - 8 0.02153266 vappr - 7 0.009618
msl - 7 0.021483744 vappr - 0 0.009617
msl - 6 0.021453685 vappr - 8 0.009617
msl - 5 0.021413337 vappr - 10 0.009615
msl - 4 0.021362273 vappr - 9 0.009615
msl - 3 0.021310359 vappr - 11 0.009614
msl - 2 0.021258522 wddir - 11 0.007876
msl - 1 0.021221679 wddir - 10 0.007859
msl - 0 0.021200811 wddir - 9 0.007842
wdsp - 11 0.018346989 wddir - 8 0.007823
wdsp - 10 0.018333758 wddir - 7 0.007805
wdsp - 3 0.018321383 wddir - 6 0.007786
wdsp - 4 0.018320681 wddir - 5 0.007769
wdsp - 5 0.018304282 wddir - 4 0.00775
wdsp - 9 0.018299837 wddir - 3 0.007732
wdsp - 7 0.01829721 wddir - 2 0.007713
wdsp - 6 0.018296667 wddir - 1 0.007694
wdsp - 8 0.018292706 wddir - 0 0.007673
wdsp - 2 0.018291939 rhum - 10 0.006186
wdsp - 1 0.018260656 rhum - 11 0.006185
wdsp - 0 0.018219316 rhum - 9 0.006185
wetb - 0 0.016216825 rhum - 8 0.006171
wetb - 1 0.016200562 rhum - 7 0.006155
wetb - 2 0.016186195 rhum - 6 0.006114
wetb - 3 0.016170341 rhum - 5 0.006074
wetb - 4 0.016154881 rhum - 4 0.006022
wetb - 5 0.016140189 rhum - 3 0.005983
wetb - 6 0.016125888 rhum - 2 0.005956
wetb - 7 0.016111604 rhum - 1 0.005927
wetb - 8 0.016099211 rhum - 0 0.005884
wetb - 9 0.016086459 Tide - 11 0.001764
wetb - 11 0.016074096 Tide - 10 0.001762
wetb - 10 0.016072597 Tide - 9 0.00176
Tide - 8 0.001758
Tide - 7 0.001755
Tide - 6 0.001754
Tide - 5 0.001753
Tide - 4 0.001752
Tide - 3 0.001751
Tide - 2 0.00175
Tide - 1 0.001749
Tide - 0 0.001748
Implementation of the forecasting model - RapidMiner Predictive Analytics

In this section, we discuss several predictive modelling approaches. We used 2015-2018 data for modelling/training and cross-validation, and Jan/Feb 2019 data to test the cross-validated models. We developed separate analytics processes for building/testing the models and for applying them to unseen data.
3.6.1 Option 1 Simple Learning with Naïve Bayes
The Naïve Bayes operator in RapidMiner Studio generates a Naive Bayes classification model.
RapidMiner Studio provides the following summary for Naïve Bayes [10]:
Naïve Bayes is a high-bias, low-variance classifier, and it can build a good model even with a small data set. It is simple to use and computationally inexpensive. Typical use cases involve text categorisation, including spam detection, sentiment analysis, and recommender systems. The fundamental assumption of Naïve Bayes is that, given the value of the label (the class), the value of any attribute is independent of the value of any other Attribute. Strictly speaking, this assumption is rarely true (it's "naive"!), but experience shows that the Naive Bayes classifier often works well. The independence assumption vastly simplifies the calculations needed to build the Naive Bayes probability model.
To complete the probability model, it is necessary to make some assumption about the conditional probability distributions for the individual Attributes, given the class. This Operator uses Gaussian probability densities to model the Attribute data.
In our implementation, we fit a model that creates a decision boundary between the binary labels 'OK' and 'High', where 'High' denotes values equal to or greater than the average plus two times the standard deviation. The process works as follows:

- Execute the pre-processing process and output an example set of the windowed data.
- Apply a weight-by-information-gain algorithm that calculates the relevance of the attributes based on information gain and assigns weights to them accordingly.
- Keep the attributes with the 15 highest weights (an optimised parameter value).
- Feed the data stream into 10-fold cross-validation, where the data are split into ten parts using stratified sampling. Each of these ten parts is then used once for testing, with the remainder used for training the algorithm.
  - Inside the cross-validation, we apply a Naïve Bayes algorithm and measure the performance of each iteration.
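An equivalent pipeline can be sketched in scikit-learn as a stand-in for the RapidMiner operators; the data here is synthetic, and the feature selection is done inside the pipeline so that each cross-validation fold selects its own top-15 attributes:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 30))                      # 30 windowed attributes (synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)   # 'High' vs 'OK' label

# Feature weighting (mutual information, keep top 15) followed by
# Gaussian Naive Bayes, evaluated with 10-fold stratified CV.
model = make_pipeline(
    SelectKBest(mutual_info_classif, k=15),
    GaussianNB(),
)
scores = cross_val_score(
    model, X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
print(f"accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean and standard deviation of the fold accuracies correspond to the aggregate performance figures RapidMiner reports for the cross-validation operator.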
Figure 9: Naïve Bayes Model with integrated Weighting using Information Gain
Figure 10: Validation sub-process
The confusion matrix gives close to 100% precision for the 'OK' class (no flood) and 75.42% precision for the 'High' class (flood level reached). There is a total of 4168 misclassifications, of which only 11 are false negatives (a 'High' level predicted as 'OK'), giving the 'OK' class a precision of almost 100%. The remaining misclassifications are false positives, where we predict that a future river level is 'High' when in fact it is 'OK'. This is acceptable, as we would rather have a false positive than a false negative: a false positive (a predicted flood that does not occur) is better than predicting no flood when a flood actually occurs.
The misclassifications are thus almost always in the preferred quadrant, where we predict a high water level but the level in fact remains normal. Frequently these misclassifications concern border values, right at the edge between 'OK' and 'High'. The recall, i.e. the percentage of the relevant results correctly classified for each class, is 98.24% and 99.91% for the two classes, which is more than acceptable. As 10-fold validation was used, a standard deviation of 0.07% could be calculated, with an average overall accuracy of 98.33%. The small standard deviation indicates that the model is likely to be robust and will generalise well when scoring unseen future data.
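The precision and recall quantities discussed above can be recomputed from any confusion matrix; the counts below are illustrative only, not the model's actual figures:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels: 1 = 'High' (flood level reached), 0 = 'OK'.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 88 + [1] * 2 + [1] * 9 + [0] * 1)

# ravel() order for binary labels is (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
print("precision 'High':", precision_score(y_true, y_pred))
print("recall    'High':", recall_score(y_true, y_pred))
```

Precision for 'High' is tp / (tp + fp) and recall is tp / (tp + fn), matching the class-precision and class-recall columns of the RapidMiner confusion matrix.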
Figure 11: Naïve Bayes confusion matrix on training & test data
3.6.2 Option 2 ARIMA
The ARIMA operator in RapidMiner Studio trains an ARIMA model for a selected time series attribute.
RapidMiner Studio provides the following summary for ARIMA [10]:
ARIMA stands for Autoregressive Integrated Moving Average. Typically, an ARIMA model is used for forecasting time series. An ARIMA model is defined by its three order parameters, p, d, q. p specifies the number of Autoregressive terms in the
model. d specifies the number of differentiations applied on the time series values. q specifies the number of Moving Average terms in the model. An ARIMA model is an integrated ARMA model. The ARMA model describes a time series by a weighted sum of lagged time series values (the Autoregressive terms) and a weighted sum of lagged residuals. These residuals originate from a normally distributed noise process. The "integrated" indicates that the values of the ARMA model are integrated, which is equivalent to saying that the original time series values described by the ARMA model are differentiated. The ARIMA operator fits an ARIMA model with given p, d, q to a time series by finding the p+q coefficients (and, if estimate constant is true, the constant) which maximise the conditional loglikelihood of the model describing the time series. For the optimisation, the LBFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) algorithm is used. When choosing values for p, d, q, note that the conditional loglikelihood is only a good estimation of the exact loglikelihood if the number of parameters (sum of p, d, q) is not of the order of the length of the time series; hence the number of parameters should be much smaller than the length of the time series. How well a trained ARIMA model describes a given time series is often measured with Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC) or a corrected Akaike's Information Criterion (AICC). The ArimaTrainer operator calculates these performance measures and outputs a Performance Vector containing the calculated values. An ARIMA model which describes a time series well has small information criteria. This operator is similar to other modelling operators but is specifically designed to work on time series data. One of the implications of this is that the forecast model should be applied to the same data it was trained on.
The Apply Forecast operator receives a trained ARIMA model and creates the forecast for the time series it was trained on.
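The autoregressive core of ARIMA, the weighted sum of lagged values described above, can be sketched with a plain least-squares fit. This is a simplified illustration on a synthetic series, not the RapidMiner operator (the d and q components are omitted):

```python
import numpy as np

# Synthetic AR(2) series: x_t = 0.6 x_{t-1} + 0.3 x_{t-2} + noise.
rng = np.random.default_rng(1)
x = np.zeros(300)
for t in range(2, 300):
    x[t] = 0.6 * x[t - 1] + 0.3 * x[t - 2] + rng.normal(0, 0.1)

# Fit the p = 2 autoregressive coefficients by least squares --
# the AR part of an ARIMA(2, 0, 0) model.
X = np.column_stack([x[1:-1], x[:-2]])   # lag-1 and lag-2 columns
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead forecast from the last two observed values.
one_step_forecast = coef @ np.array([x[-1], x[-2]])
print(coef.round(2), round(float(one_step_forecast), 3))
```

The fitted coefficients recover the generating weights, illustrating how an AR model forecasts the next value as a weighted sum of recent lags.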
Figure 12: Arima Top level Predictive Model View
Figure 13: Arima Executive Process Sub-Process View
Figure 14: Arima Model Copy-Time-column Sub-Process View
Figure 15: Arima Model Handle-missing Sub-Process View
Figure 16: Arima Model Find ExtremesInLabel Sub-Process View
Figure 17: Arima Model find-name-of-label Sub-Process View
Figure 18: Arima Model rename-label-to-a-standard-name Sub-Process View
Figure 19: Arima Model Optimize Parameters (Grid)
3.6.2.1 ARIMA results discussion
We reviewed ARIMA as one of the main methods for predicting time series. Flooding events are linked to wind, temperature and tidal forecasts and are not isolated events. Neither the ARIMA nor the Holt-Winters algorithm performed very well; both were considerably inferior to the deep learning algorithm.
ARIMA worked reasonably well for normal series but could not predict deviations from the norm.
Figure 20: ARIMA: the blue line is the prediction
Holt-Winters
Figure 21: Holt-Winters: the red line is actual, the blue line is the prediction
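The trend component underlying Holt-Winters can be sketched with Holt's linear method, i.e. double exponential smoothing with the seasonal term omitted. The smoothing parameters below are illustrative, not those used in the experiments:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, steps=1):
    """Holt's linear (double) exponential smoothing -- the level and
    trend core of Holt-Winters, with the seasonal component omitted.
    Returns the `steps`-ahead forecast."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + steps * trend

# A steadily rising water level: the forecast continues the trend.
obs = [1.0, 1.1, 1.2, 1.3, 1.4]
print(round(holt_forecast(obs, steps=1), 3))
```

As the results above indicate, such smoothing models track a regular trend well but, like ARIMA, struggle with abrupt deviations from the norm such as flood peaks.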
3.6.3 Option 3 Deep Learning
The Deep Learning operator in RapidMiner Studio executes the Deep Learning algorithm using H2O 3.8.2.6.
RapidMiner Studio provides the following summary for Deep Learning [10]:
Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout and L1 or L2 regularization enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model via model averaging across the network.
As the deep learning algorithm is computationally expensive, we opted for a 60:40 split between training and test data. The algorithm was set up with a rectifier activation function and ten epochs. The results, as outlined later, are promising.
Figure 22: Deep Learning Toplevel Predictive Model View
The detailed Deep Learning results can be seen below. A window size of 12 results in a mean square error (MSE) of 0.0009293002 and a coefficient of determination of R² = 0.9896514, both of which indicate that the model worked rather well.
Figure 23: Deep Learning Results
Figure 24 below shows the root mean squared error (RMSE) of 0.031 and the squared error of 0.001, which we will use in section 3.6.4 when comparing the performance of the Deep Learning model on training and unseen data.
Figure 24: Performance Vector Deep Learning - Training Dataset
Figure 25: Deep Learning forecasting accuracy chart - actual values are in red, predicted values in blue
Figure 26: Shorter-term Deep Learning forecasting accuracy chart (1) - actual values are in red, predicted values in blue
Figure 27: Shorter-term Deep Learning forecasting accuracy chart (2) - actual values are in red, predicted values in blue – on unseen data
We ran the model both in-sample and out-of-sample, and it worked well on unseen data. Figure 27 shows the prediction performance on unseen data.
A window size of 48 is still very good, with a mean square error of 0.0066463803 and an R² of 0.92412037.
Figure 28: Deep Learning forecasting accuracy chart (window size of 48) - actual values are in red, predicted values in blue
Figure 29: Shorter-term Deep Learning forecasting accuracy chart (1) (window size of 48) - actual values are in red, predicted values in blue
Figure 30: Shorter-term Deep Learning forecasting accuracy chart (2) (window size of 48) - actual values are in red, predicted values in blue
3.6.4 Model performance (unseen data) discussion & conclusions
As discussed in section 3.2, a predictive model can be either a regression model with continuous output or a classification model. We have developed separate analytics processes for training and applying a trained model to unseen data. In this section, we discuss the model performance for Naïve Bayes (a predictive classification model) and Deep Learning (a regression model with continuous output).
As new weather data arrive continually, in production we would need to retrain the model frequently to ensure accurate predictions of future values, whether flooding events (Naïve Bayes) or water levels (Deep Learning).
Naïve Bayes
Below Figure 31 shows the confusion matrix which describes the performance of the Naïve Bayes model, how well it predicts whether the water level is high or ok on unseen data.
High was defined as any value greater than the mean + 2 × the standard deviation. No high-water event occurred in the timeframe of the unseen data, and the model did not forecast any high-water level, which is an excellent result: the model was 100% accurate on this data. Figure 11 shows that the precision of the model in predicting a flood, using 10-fold cross-validation, is over 70%; on closer inspection, the misclassified examples mainly consisted of values close to the threshold.
The class recall for the 'ok' (no flood) class is 100%, with a false-negative rate of 0%. Class recall refers to the proportion of actual events of each class (ok or high) that are correctly identified, while class precision refers to the proportion of predictions for a class that are correct.
Figure 31: Naïve Bayes confusion matrix on unseen data
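The class precision and recall figures above can be computed directly from the confusion matrix. A minimal sketch, using hypothetical counts rather than the actual figures behind Figure 31:

```python
# Illustrative sketch: class precision and recall from a 2x2 confusion
# matrix, as RapidMiner reports them. The counts below are hypothetical,
# not the actual model output from the report.
def class_metrics(matrix, labels):
    """matrix[i][j] = number of examples of true class j predicted as class i."""
    metrics = {}
    for i, label in enumerate(labels):
        predicted = sum(matrix[i])                   # row total: predicted as this class
        actual = sum(row[i] for row in matrix)       # column total: truly this class
        tp = matrix[i][i]
        metrics[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return metrics

# Hypothetical counts: rows = predictions, columns = true classes.
confusion = [
    [950, 0],   # predicted 'ok':   950 truly ok, 0 truly high
    [0, 0],     # predicted 'high': none predicted in this window
]
print(class_metrics(confusion, ["ok", "high"]))
```

With no high-water events in the unseen data and none predicted, the 'ok' class scores 100% precision and recall, matching the result described above.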
Deep Learning also showed excellent results. Deep Learning model performance is measured with error loss, the difference between the predicted values and the values actually observed; the model achieved a very low mean squared error of 0.0009293002 when using a window size of 12 future measurements. The root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model and the values actually observed. For a good model, the RMSE for the training and test (unseen data) datasets should be very similar; if the RMSE for unseen data is much higher than that for the training data, this may indicate overfitting. The RMSE for unseen data is 0.040 and for the training data 0.031, which is a good result.
Figure 32: Performance Vector Deep Learning – Unseen Dataset
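The overfitting check described above, comparing training RMSE with unseen-data RMSE, can be sketched as follows. The two short series are made-up illustration data, not the report's datasets (whose actual values were RMSE 0.031 for training and 0.040 for unseen data):

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical water-level readings (metres) for illustration only.
train_actual    = [3.10, 3.20, 3.15, 3.30]
train_predicted = [3.12, 3.18, 3.17, 3.27]
test_actual     = [3.40, 3.55, 3.60]
test_predicted  = [3.36, 3.59, 3.64]

rmse_train = rmse(train_actual, train_predicted)
rmse_test = rmse(test_actual, test_predicted)

# A test RMSE far above the training RMSE would suggest overfitting.
print(f"train={rmse_train:.4f} test={rmse_test:.4f} "
      f"ratio={rmse_test / rmse_train:.2f}")
```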
The learning time for the Deep Learning model is considerably longer than for the Naïve Bayes model, and so is the scoring of unseen data. However, it is still at an acceptable level. We used different window sizes. As expected, the larger the window size, the less accurate the predictions.
Predictive Functionality Levels & Charting

In this section, we discuss the predicted functionality levels. We differentiate the severity of the flooding impact. Business locations affected (impacted) by a predicted flood water level of up to 50 cm are labelled as 'affected'. Business locations predicted to experience more than 50 cm of flood water are labelled as 'affected severely', which suggests significant flood damage.
Business locations labelled 'affected severely' are less likely to recover to 'business as usual' than business locations labelled 'affected'. We set the threshold of 50 cm as an example; it can be freely adapted based on the experience of subject matter experts.
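The labelling rule above can be sketched as a simple depth threshold. The location names, ground levels and job counts below are hypothetical examples, not entries from the actual location database:

```python
# Impact labelling: flood depth up to 50 cm -> 'affected',
# deeper than 50 cm -> 'affected severely'. The threshold is
# adaptable per subject-matter expert advice.
SEVERE_THRESHOLD_M = 0.50

def label_location(predicted_water_level_m, ground_level_m):
    depth = predicted_water_level_m - ground_level_m
    if depth <= 0:
        return "not affected"
    return "affected severely" if depth > SEVERE_THRESHOLD_M else "affected"

# Hypothetical business locations with ground level and job counts.
locations = [
    {"name": "Quay Street office", "ground_level_m": 3.50, "jobs": 120},
    {"name": "Riverside shop",     "ground_level_m": 3.00, "jobs": 15},
    {"name": "Hillside depot",     "ground_level_m": 5.00, "jobs": 40},
]

water_level = 3.77  # example predicted water level used in the report
for loc in locations:
    print(loc["name"], "->", label_location(water_level, loc["ground_level_m"]))
```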
The chart below shows the predicted functionality level for the number of jobs either affected or severely affected by a predicted water level (e.g. 3.77 m, which we chose as an example). It can be interpreted as how many jobs are very likely to be either affected or severely affected by the predicted water level.
Figure 33 shows the absolute number of jobs predicted to be affected or affected severely at their employment locations.
Figure 33: Predicted number of jobs at business locations either affected or severely affected by flooding
Figure 34 shows the ratios of jobs affected or affected severely by a predicted water level from the predictive model.
Figure 34: Ratio of jobs at business locations either affected or severely affected by flooding
Figure 35 shows the predicted functionality level 'employment' with the percentage of jobs either not affected or not severely affected by the forecasted water level at their employment locations.
Figure 35: Percentage of jobs not affected and not severely affected
Figure 36 illustrates the functionality level results in relation to the FL-t curve.
Figure 36: Predictive modelling of FL-t curve
Impact & recovery phase discussion
Figure 34 shows the ratio of jobs at business locations either affected or severely affected by a predicted flooding incident, based on the total number of jobs in the location database. It shows that 5.4% of jobs will be affected, and 32.9% will be affected severely, by the predicted water level of 3.77 m. Figure 36 shows the resulting employment functionality level based on a value from the water-level forecasting window. We define 'severe' flooding as an incident that causes severe structural damage to flooded locations. The percentage from the predictive indicator corresponds to the predicted water level value from the rolling forecasting window of the predictive model.
With the predictive model, we predict all values within the window size (see section 3.4), which is our forecasting horizon. The larger the window size, the more likely the predictions are to be inaccurate. That means we can increase the window size, at the cost of accuracy for values far in the future, to cover a full impact and recovery phase for the predicted water level.
Naïve Bayes forecasting only predicts the occurrence of a flooding incident, whereas deep learning predicts the future values of the water level. The deep learning model has been tested with window sizes between 12 and 48. Given that the model is built on 5-minute intervals, with a window size of 48 it forecasts 4 hours of future water level values, which, combined with the height data and location statistics, gives us the real-time FL-t curve.
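The windowing described above can be sketched as follows. `make_windows` is an illustrative helper, not a RapidMiner operator, and the toy series stands in for the 5-minute water-level recordings:

```python
# Build (input window -> future horizon) training pairs from a
# water-level series sampled every 5 minutes. With a horizon of 48
# steps, the model forecasts 48 * 5 min = 4 hours ahead.
def make_windows(series, input_size, horizon):
    """Return (inputs, targets) pairs for supervised forecasting."""
    pairs = []
    for start in range(len(series) - input_size - horizon + 1):
        inputs = series[start:start + input_size]
        targets = series[start + input_size:start + input_size + horizon]
        pairs.append((inputs, targets))
    return pairs

levels = [3.1, 3.2, 3.3, 3.5, 3.6, 3.4, 3.2, 3.1, 3.0, 2.9]  # toy series
pairs = make_windows(levels, input_size=4, horizon=2)
print(len(pairs), "training pairs; first:", pairs[0])
```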
The indicator illustrated in Figure 33 above estimates (with the input of a predicted water level value from the current model forecasting window) that, even after the water level returns to normal, 32.9% of jobs remain affected due to severe flood damage at the respective employment locations. As the current predicted water level gives insight into the severity of the flooding disaster and allows us to estimate how many locations will suffer severe flood damage, it in turn gives insight into the likelihood of bouncing back to 100%. Locations exposed to severe flooding may not bounce back to 100%, that is, return to normalcy or resume pre-flood behaviours such as business as usual. In that context, there may be a need to think about regional adaptation, rethink resilience, and see an individual flood-affected region as a complex adaptive system. An adaptive system is able to change or adapt to stresses rather than merely striving for a return to normalcy or a resumption of pre-challenge behaviours or outcomes [11]. In a complex adaptive system, resilience is not related to equilibrium, a return to 'normal', or even to resilient outcomes; instead, it is a dynamic attribute associated with a process of continual development [12]. Severe flooding, or a series of severe flooding incidents, can contribute to a discussion of resettlement.
There have been a number of severe flooding incidents in Europe in recent years. For example, regions in Saxony, Germany were flooded up to three times, in 2002, 2006 or 2010, and 2013. According to a study on the impact of flooding on households in Saxony [13], households affected by flooding up to three times in recent years perceived the impacts of flooding incidents as more severe than households affected by flooding for the first time in 2013. Further, the study found that households that suffered flood damage several times thought considerably more often about resettlement. Hence, flood-prone communities that do not get flood protection may face severe consequences and may not bounce back 100% to the normalcy measured before the incident.
Annex II shows the charting for other predicted functionality levels.
Conclusions

We have discussed several approaches for building a predictive model for flood water level forecasting, which helps flood emergency coordinators see the impact of, and recovery from, an imminent flooding event with the associated functionality level impact & recovery metrics. For the target parameter, the predicted water level, we have used a water level gauge that is not close to the city and only has historical recordings starting in 2015.
The deep learning approach yielded the best forecasting accuracy and can be adapted to other forecasting applications for a threat target parameter 'x' to be predicted with high accuracy, provided relevant longer-term time series data are available. The use of more extensive historical data for the threat target parameter, recorded at a location close to the expected threat impact, is essential for building a robust predictive model. By training a model with more extended time series data, it can adapt to a wide range of patterns that occurred over time and become more robust for forecasting applications. Shorter historical time series likely contain fewer patterns that a deep learning approach can adapt to, making the predictive model less able to predict future events.
We have used the data available to us (weather, tidal and river levels), and we have shown that deep learning algorithms are quite accurate in forecasting future water levels. The model itself cannot be applied to other cities due to the variance in infrastructure (flood prevention, dams, walls, river system, etc.) and weather/tidal data. However, it would be reasonably straightforward to fit a model to data obtained from other cities. We have shown that, for Cork, the deep learning approach is the most suitable algorithm. Whether this holds for other cities would need to be investigated for each case.
We have further discussed that with a severe impact from a natural flooding disaster there may not be a bounce-back recovery to pre-incident normalcy.
In addition, we discussed the limitations of our predictive model given the available input data used to build it. We tested the model (also on unseen data) with a window size of up to 4 hours of future water level values. This window is likely too short to carry out flood mitigation actions for the locations identified as suffering severe flood impact, such as local businesses that have a high number of employees registered at their location and are important drivers of the local economy.
Multi-Criteria Decision Making
Introduction

The previous chapters provided an overview of the capabilities of RapidMiner. In this chapter, we discuss the use of RapidMiner for multi-criteria decision making (MCDM). The test scenario is closely related to the appraisal process for investing in flood protection measures that the Office of Public Works (OPW) applies in Ireland. The OPW is the lead agency for the delivery of flood protection infrastructure in Ireland and, as an external stakeholder, kindly supported the SmartResilience project through the GOLF case study on urban flooding. The OPW provided relevant multi-criteria analysis (MCA) data and advice based on OPW MCA work on flood protection measures for Cork City.
RapidMiner is a data science software platform that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. For MCDM we mostly use the data access and blending features as well as the statistics and visualisation features.
There are many MCA methods, each with useful features justifying their application. The OPW MCA approach chosen here is broadly applicable across a range of government decisions and fulfils important criteria for application in government, such as the ability to provide an audit trail, transparency, and ease of use.
The MCA approach discussed relies on objectives that clarify what is intended to be achieved regarding flood risk reduction and related benefits. Accordingly, the defined objectives focus on the adverse consequences of flooding for human health, the environment, cultural heritage and economic activity.
Each flood protection investment option is scored on its performance against each objective in turn, looking at available indicators and the change the investment option brings to each indicator. This score is then multiplied by the global and local weightings. The weighted scores for all objectives are then added up to give the overall MCA benefit score for the option, which represents the overall benefits and impacts of the option across the full range of objectives.
Flood Protection Investment Options (GOLF)

In this section, we describe the OPW MCA problem statement and the options for flood protection investments in Cork City.
The core problem statement has been defined as the identification of the best investment option for improving flood protection in Cork City.
The investment options identified are:
Option 1: Develop a flood forecasting system combined with individual property protection and a targeted public awareness and education campaign
Option 2: Improved channel conveyance combined with permanent flood walls
Option 3: Proactive maintenance of existing informal defences
Option 4: Provision of demountable defences combined with some permanent defences with a flood forecasting system
Option 5: Provision of permanent flood walls/embankments
Discussion of investment options
Option 1: Develop a flood forecasting system combined with individual property protection and a targeted public awareness and education campaign.
Baseline option:
No modelling and no change in flow regime. Consider individual property protection for all properties where flood depth <600mm
Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.
Option 2: Improvement in channel conveyance combined with the provision of flood walls/ embankments.
Baseline option:
This option is the same as the permanent defences option (Option 5), except that modification of footbridges has been considered in order to improve channel conveyance. Modelling showed that some of the footbridges on the south channel do have a small effect on water levels, reducing them by approximately 100 mm in some areas. Modifying the footbridges reduces the height of the defences required.
Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.
Option 3: Proactive maintenance of existing informal defences.
Baseline option:
The 'with defences' and 'without defences' models of Cork demonstrate the impact of the existing informal defences. In general, the defences provide little protection against anything greater than minor floods. The reason is that many were not designed as flood defences and have openings for pedestrian access, etc.
There are more than 40 km of existing assets within Cork. Of these, approximately 3% are in good or very good condition, 54% in fair condition, and 43% in poor or very poor condition.
Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.
Option 4: Provision of demountable defences combined with some permanent defences with a flood forecasting system.
Baseline option:
This option is very similar to the previous option in terms of areas defended but wherever possible demountable defences have been used.
Demountable defences are appropriate wherever there is good access to the defence location for installation and removal. In many areas of the city centre, it would be possible to install demountable defences slightly set back from the river bank. This reduces the need to provide new walls along the bank and would make the demountables easier and safer to install. Where existing bridges are below the flood level, it has been assumed that the demountable defences will continue past the bridge. These sections of defence could be left open to allow traffic movement until floods reach a critical level.
Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.
Option 5: Provision of permanent flood walls/embankments.

Baseline option:

For this option assessment, flood walls and embankments are considered for all the areas where there are significant numbers of properties at risk within the APSR which can be protected
without excessive cost. If this option is taken forward, the location and type of defences would need to be optimised.
The defences option has been modelled, and it was found that the defences raised water levels in the North and South Channel by approximately 300-400mm at the upstream (west) part of the river but had little effect on levels in the downstream (east) end of the river.
Baseline option assumes the continuation of any existing maintenance regime in the assessment unit.
MCA benefit score approach

The core of the approach features the following:
Global weightings (economic, environmental, social, technical) representing the four core objectives as the basic assessment criteria
Local weightings (from 'international importance' to 'not relevant')
Scoring system developed in consultation with the Lee Catchment Flood Risk Assessment and Management project steering group:
o OPW
o Cork City Council
o Cork County Council
o Environmental Protection Agency
Global weightings
The global weightings have been developed by the OPW and are fixed nationally; they are unchanged for each assessment unit. This level of weighting recognises the key drivers behind flood risk management (FRM) options and gives higher weightings to the risk to human health and life and to the economic return on options.
An assessment unit defines the spatial scale at which flood risk management options are assessed. These units are defined on spatial scales ranging in size from largest to smallest as follows: catchment scale; Assessment Unit (AU) scale, a large sub-catchment, e.g. Lower Lee AU; and Areas of Potential Significant Risk (APSR), e.g. Cork City.
Table 3: Global weighting for flood protection investment options
Criterion Objective Global weighting
Technical Operationally Robust 5
Technical Health & Safety Risk 5
Technical Adaptability 5
Economic Economic Return 25
Economic Transport Infrastructure 15
Economic Utility Infrastructure 15
Economic Agriculture 5
Social Risk to Human Health 30
Social Community Risk 10
Social Risk to Social Amenity 5
Environmental Ecological Status 5
Environmental Pollution Sources 15
Environmental Habitats 10
Environmental Fisheries 5
Environmental Landscape Character 5
Environmental Cultural Heritage 5
Local weighting
The local weighting of each objective varies for each assessment unit depending on the level of applicability of that objective to that unit. For some objectives, the local weighting could be 0, since the objective does not apply to that part of the catchment.
Table 4: Local weighting, importance scoring
Importance Local Weight
Major / International importance 5
Significant / National importance 4
Medium / Regional importance 3
Minor / Local importance 2
Negligible importance 1
Not relevant 0
Scoring
The flood protection measures applicable to an assessment unit (large sub-catchment, e.g. Lower Lee AU) or to areas of potential significant risk (APSR), e.g. Cork City, Ballincollig, Crookstown, are scored based on the core criteria. The baseline indicator data relevant to each core criterion (e.g. the presence of a sensitive environmental designation) were used to inform this preliminary evaluation. The scoring system was developed in consultation with the project steering group.
Table 5 to Table 8 show the scorings.
Table 5: Scoring - General Approach
Impact Score
Achieving aspirational target 5.0
Partly achieving the aspirational target 3.0
Exceeding minimum target 1.0
Meeting minimum target 0.0
Just failing minimum target -1.0
Partly failing minimum target -3.0
Fully failing minimum target -999.0
Table 6: Scoring - Technical & Economic Criteria
Core Criteria Basis for scoring Score
Technical Technically impossible or difficult -2, -1
Technically possible 0
Technically straightforward 1
Unacceptable -999
Economic Prohibitive / excessive cost; estimated BC ratio <<1 -2
Reasonable cost; estimated BC ratio 0.5 – 1 -1
Estimated BC ratio 1 0
Estimated BC ratio 1 – 2 1
Low cost or potential for income; estimated BC ratio > 2 2
Unacceptable -999
Table 7: Scoring - Social & Environmental
Core Criteria Basis for scoring Score
Social Significant negative impact on people -1
Neutral impact on people 0
Positive impact on people 1
Unacceptable -999
Environmental Overall negative environmental impact -1
Overall neutral environmental impact 0
Overall positive environmental impact 1
Unacceptable -999
Table 8: Scoring - Other Criteria
Core Criteria Basis for scoring Score
Other Significant negative issue -1
No other significant issues 0
Significant positive issue 1
Unacceptable -999
The elements for the calculation of the MCA benefit score for a flood protection investment option (e.g. build a tidal barrier) include:
Applicable (yes/no) (no if, for instance, the option is about the improvement of existing tidal flood defences, but none exist)
Technical (weighted score)
Economic (weighted score)
Social (weighted score)
Environmental (weighted score)
Other (weighted score)
Score looking at data-driven indicators
Overall score (weighted score – the MCA benefit score)
Decision to carry forward to option development (yes/no), e.g. yes, build a tidal barrier
MCA benefit calculation for a data-driven social indicator (partial, reviewing one data-driven indicator from one objective)
We use the location-based employment statistics database discussed in chapter 3 and the indicator "Predicted number of jobs at business locations either affected or severely affected by flooding" as shown in Figure 33. This indicator relates to the social/political dimension listed in the resilience matrix in the SmartResilience project [1] (table 1, page 14). In the MCA approach discussed, it falls under the social objective, sub-objective 'minimise risk to community (employment)'.
Table 9: MCA benefit calculation for a data-driven social indicator (partial, reviewing one data-driven indicator)
Sub-objective: minimise risk to community (employment)
Weightings: global (10), local (5)
Baseline indicator: total of 31123 jobs at risk in city locations severely affected by flood water levels of 3.77 m
Option: Provision of permanent flood walls
Score: 3
Explanation of option assessment (simplified for this report): Flood walls and embankments are considered for all the areas where there are significant numbers of properties at risk within the APSR which can be protected without excessive cost.
MCA benefit score (partial): 10 * 5 * 3 = 150
Table 10 shows an example of the MCA approach in which the score links to the review of the relevant data-driven indicators. It illustrates the calculation for an MCA option 'x', for which we have selected three criteria, each assigned its respective global and local weights and score as discussed earlier in this section. The significance, or individual benefit, of each criterion for option 'x' is calculated in the last column as a weighted score, and the sum of that column gives the MCA benefit score for option 'x', which can then be compared with other options for decision making. With an MCA benefit score of -50, this option 'x' will not be carried forward to option development.
Table 10: Example calculation of MCA value per option
Criteria Global-weighting (GW) Local-weighting (LW) Score (S) Weighted Score (WS = GW * LW * S)
Technical - Ensure Flood Risk Management options are operationally robust.
5 5 0 0
Technical - Minimise Health and Safety risk of flood risk management options.
5 5 1 25
Technical - Ensure flood risk is managed effectively and sustainably into the future.
5 5 -3 -75
MCA benefit score -50
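The weighted-score arithmetic of Table 10 can be sketched as follows, reproducing the -50 result; the short criterion labels are abbreviations of the table's full objective names:

```python
# Weighted score per criterion = global weighting x local weighting x
# score; the MCA benefit score is the sum over all criteria (Table 10).
criteria = [
    # (criterion, global_weight, local_weight, score)
    ("Technical - operationally robust",      5, 5,  0),
    ("Technical - minimise H&S risk",         5, 5,  1),
    ("Technical - effective and sustainable", 5, 5, -3),
]

weighted = [(name, gw * lw * s) for name, gw, lw, s in criteria]
mca_benefit_score = sum(ws for _, ws in weighted)
print(weighted)
print("MCA benefit score:", mca_benefit_score)  # -50, as in Table 10
```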
Implementation in RapidMiner

We first prepared the data in a CSV file storing the criteria, global weighting, local weighting and score for each option. We then loaded the CSV into a RapidMiner process using the Retrieve operator.
Figure 37: RapidMiner process
We can use the data editor in RapidMiner to make changes in the underlying dataset and review their direct effect on the outcome of the MCDM calculation.
Figure 38: Review of data in the RapidMiner data editor
Following the OPW approach of global and local weights and scores, we calculate the weighted sum for each criterion and define the formula expression using the RapidMiner expression editor.
We first calculate the weighted score as (global weighting x local weighting x score) for each row in the database. We then calculate the sum of the weighted scores for each option. The table below shows the calculation for Option 5.
Table 11: Example MCDM calculation
Criteria Global-weighting Local-weighting Score Weighted Score
Technical - Ensure Flood Risk Management options are operationally robust.
5 5 3 75
Technical - Minimise Health and Safety risk of flood risk management options.
5 5 0 0
Technical - Ensure flood risk is managed effectively and sustainably into the future.
5 5 0 0
Economic - Optimise economic return on flood risk management investment.
25 5 0.164489 20.5611015
Economic - Minimise risk to infrastructure. 15 4 3 180
Economic - Minimise risk to agricultural land. 5 2 0 0
Social - Minimise risk to human health and life. 30 5 3 450
Social - Minimise risk to the community. 10 5 3 150
Social - Minimise risk to, or enhance social amenity. 5 4 3 60
Environmental - Support the achievement of good ecological status/ potential (GES/GEP) under the WFD.
5 5 -1 -25
Environmental - Minimise risk to sites with pollution potential 15 0 0 0
Environmental - Avoid damage to and where possible enhance the flora and fauna of the catchment.
10 5 -1 -50
Environmental - Avoid damage to, and where possible, enhance fisheries within the catchment.
5 4 -1 -20
Environmental - Protect, and where possible enhance, landscape character and visual amenity within the catchment
5 4 -3 -60
Environmental - Avoid damage to or loss of features of cultural heritage importance, their setting and heritage value within the catchment
5 4 0 0
MCA benefit score 780.561102
Figure 39: RapidMiner functions - MCDM formula for GOLF
To get the sum of all weighted scores for each flood protection option, we use the sum aggregate function with the weighted score as the attribute and group by option.
Figure 40: Defining the sum over all weighted-scores calculated for the criteria for a specific option
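Outside RapidMiner, the same aggregate step can be sketched in a few lines of plain Python; the rows below are a small hypothetical extract, not the actual OPW dataset:

```python
from collections import defaultdict

# Sum the weighted scores grouped by option, mirroring RapidMiner's
# Aggregate operator (sum of weighted-score, grouped by option).
rows = [
    {"option": "Option 5", "weighted_score": 75.0},
    {"option": "Option 5", "weighted_score": 450.0},
    {"option": "Option 1", "weighted_score": 120.0},
    {"option": "Option 1", "weighted_score": -25.0},
]

totals = defaultdict(float)
for row in rows:
    totals[row["option"]] += row["weighted_score"]

# Highest MCA benefit score first.
for option, score in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(option, score)
```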
Figure 41: Defining the grouping by flood protection measure option
After running the process, RapidMiner shows the results in the ExampleSet view for each flood protection option.
Figure 42: Result set with values for each option
We use RapidMiner visualisation and configure it to present the MCDM results as a bar chart.
Figure 43: Visual representation of MCDM result set
Based on the weights and scores, the provision of permanent flood walls/embankments is the favourable investment option in this example, i.e. Option 5 (red bar in Figure 43). It received the highest score, 780.561102, according to the calculation described in Table 11.
Conclusions

This chapter has described the use of RapidMiner for an MCDM use case, using the example of the OPW MCA appraisal approach for investment options. This approach is used for analysing infrastructure investment options such as flood protection measures. We have discussed the MCA approach for flood protection investment options for Cork City with an example dataset.
The MCA approach discussed is broadly applicable across a range of government decisions and fulfils important criteria for application in government such as the ability to provide an audit trail, transparency and ease of use. It is an alternative approach to the SmartResilience MCDM approach described in the D3.4 report.
We have discussed the MCA application using RapidMiner, following the OPW MCA appraisal process adapted in the context of the GOLF case study, which consists of global and local weights and a scoring. The global weights are defined over four core criteria: technical, economic, social and environmental, each with several objectives, such as economic return and transport infrastructure for the criterion economic, with global weights of 25 and 15 respectively. Local weighting then puts a value on the importance of the area affected by flooding, from major/international importance through national and local importance down to no importance. The scoring is a measure developed by an expert steering group for indicators for the flood-affected locations; for example, if a flood protection investment option partly achieves the aspirational target for reducing the risk of flooding to a number of business locations, it will receive a score of 3. In other words, if a flood protection investment option partly reduces the employment-related indicator illustrated in Figure 33, it will receive a score of 3.
RapidMiner has a large number of customisable operators that can be combined in a RapidMiner analytics process, whose results can be explored through various charting features. The RapidMiner MCA implementation discussed can be used for other use cases that follow a similar process of global/local weights and scoring. We used the following data structure for the data fed into the MCA process:
Criteria (text)
global-weighting (number)
local-weighting (number)
score (number)
option (text)
In order to utilise the RapidMiner MCA process for other use cases, a CSV data file with the same structure of the five data attributes above can be used.
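A minimal sketch of reading such a CSV file; the rows are invented examples that merely follow the five-attribute structure above:

```python
import csv
import io

# Hypothetical CSV content following the five-attribute structure
# (criteria, global-weighting, local-weighting, score, option).
# The figures are examples only, not values from the OPW dataset.
data = """criteria,global-weighting,local-weighting,score,option
Technical - operationally robust,5,5,3,Option 5
Social - minimise risk to human health and life,30,5,3,Option 5
Technical - operationally robust,5,5,0,Option 1
"""

rows = list(csv.DictReader(io.StringIO(data)))
for row in rows:
    # Weighted score per criterion, as in the MCA process.
    ws = (int(row["global-weighting"])
          * int(row["local-weighting"])
          * int(row["score"]))
    print(row["option"], row["criteria"], "->", ws)
```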
Enterprise Integration with RapidMiner
Reuse of RapidMiner Analytics Processes
Development efforts in RapidMiner centre on RapidMiner analytics processes, which are stored in RMP XML files. These processes can be reused for solving analytics problems in similar environments. An example is the MCDM process, which can be reused with other datasets supplied in the same data format.
The predictive analytics processes can be reused to a certain degree. As discussed in chapter 2, predictive modelling is a problem- and solution-specific development effort. However, in the case of predictive modelling for forecasting flood water levels, the predictive model can be reused, to a certain degree, for other cities with similar environments. It is not possible to directly apply the Cork model to other cities without alterations; at a very minimum, the model would need to be retrained, assuming that the data are identical in terms of available attributes.
Integration options
RapidMiner has several integration options. There is a wide range of data access and management features which can be used to access, load and analyse any type of data, both traditional structured data and unstructured data like text, images, and media. It can also extract information from these types of data and transform unstructured data into structured data.
There are a number of integration options: via files (e.g. in Excel or CSV format), connectivity with databases, or publishing models through web services on RapidMiner Server, for instance in a cloud deployment scenario.
Figure 44: Example operators supporting Integration of RapidMiner Resilience Analytics applications
Figure 45: Web services integration for analytics processes running on RapidMiner Server, e.g. in a cloud deployment scenario
Conclusions

RapidMiner applications can be integrated into an operational environment using mainstream enterprise integration options. As discussed in section 4.5, RapidMiner analytics processes can be reused, provided the underlying business analytics approach and data input structures are the same. For predictive modelling, as explained in section 2.1, a predictive model can be reused to a certain degree if the underlying business problem and data inputs are very similar. To remain accurate, a predictive model needs to be retrained so that it continues to meet the accuracy performance criteria defined by the end-user.
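The retraining requirement can be operationalised as a simple accuracy check: compare recent predictions against observed values and flag the model once an agreed error threshold is exceeded. A minimal sketch, with hypothetical water-level values and threshold:

```python
def needs_retraining(y_true, y_pred, rmse_threshold):
    """Flag a model for retraining when its root-mean-square error on
    recent data exceeds the end-user's accuracy criterion."""
    n = len(y_true)
    rmse = (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5
    return rmse > rmse_threshold

# Recent observed vs predicted water levels in metres (hypothetical).
observed = [1.2, 1.5, 1.9, 2.4]
predicted = [1.1, 1.6, 2.2, 2.9]
flag = needs_retraining(observed, predicted, rmse_threshold=0.2)
```

The threshold itself would be set together with the end-user as part of the accuracy performance criteria.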
We have discussed the CRISP-DM process for data analytics projects in section 2.2, which describes the steps widely used in industry to implement a data analytics project. The first two steps, business understanding and data understanding, are crucial enablers for a successful analytics project. In these two steps, end-users and business analysts need to clarify the interests of the end-user to formalise the business analytics problem. They also need to identify the data sources required to address the business problem. These are followed by the more technical steps of data preparation and modelling, which in turn are reviewed in the fifth step, evaluation, checking whether the technical implementation has met the business problem description articulated in step 1. In SmartResilience the business description was carried out in T5.2, in which each case study leader described their business problem with a scenario description and available datasets. T5.2 also covered the data understanding step of CRISP-DM. In the GOLF case, we have explored business problem descriptions, reviewed a number of datasets, carried out data preparation and data analysis tasks, and evaluated the results in view of robust, accurate predictive indicators.
Many datasets have been provided in ESRI [14] shapefile format. We have used ESRI software to explore:
• Street-level transport data for Cork City
• Lidar height data for Cork City
We have further explored various CCC datasets, such as fire brigade call-outs and a location-based employment statistics database.
The strength of CRISP-DM is its built-in iteration. Accordingly, we went through the GOLF business understanding, data review, data preparation, analysis and evaluation phases in several iterations during the course of the project. We have also reviewed other SmartResilience case studies and their access to historical time-series datasets. The predictive model for GOLF water level forecasting and city location-based predictive impact and recovery assessment was identified as the business use case that was supported with available datasets. Obtaining datasets for resilience analytics in a multi-stakeholder environment such as flood disaster emergency coordination often involves time-consuming data scouting in various stakeholder organisations and negotiating terms for data access, frequently in the form of an NDA.
New data-driven indicators
The predictive resilience analytics prototyping has generated several new data-driven indicators. These indicators are based on the predictive model for the water level in Cork City, which is discussed in chapter 3. Combined with existing databases available to Cork City Council and stakeholders, and using RapidMiner advanced analytics tooling, we derived several new data-driven indicators, including:
• Predicted impact on employment in the City based on the predicted flood water level:
o Jobs affected by the predicted flood water level
o Jobs affected severely by the predicted flood water level (structural flood damage to a business location)
• Predicted impact on businesses without flood damage insurance based on the predicted flood water level:
o Businesses without insurance affected by the predicted flood water level
o Businesses without insurance affected severely by the predicted flood water level
• Predicted damage to stock held in businesses based on the predicted flood water level:
o Euro amount of stock held at all businesses that will be affected by the predicted flood water level
o Euro amount of stock held at all businesses that will be affected severely by the predicted flood water level
These indicators are available as an update to the SmartResilience database.
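The derivation of such indicators can be sketched as follows: join each business location's Lidar-derived height with its statistics, then aggregate over the locations at or below the predicted water level. All values, field names and the severity depth threshold below are hypothetical:

```python
# Illustrative location records (hypothetical values); heights in metres
# above datum, as would come from Lidar height data joined with
# location statistics.
locations = [
    {"height": 1.0, "jobs": 40, "insured": True,  "stock_eur": 50000},
    {"height": 1.8, "jobs": 15, "insured": False, "stock_eur": 20000},
    {"height": 3.2, "jobs": 60, "insured": True,  "stock_eur": 90000},
]

SEVERE_DEPTH = 1.0  # assumed flood depth (m) above which damage is structural

def flood_indicators(predicted_level, locations):
    """Derive data-driven indicators from a predicted flood water level."""
    affected = [l for l in locations if l["height"] <= predicted_level]
    severe = [l for l in affected
              if predicted_level - l["height"] >= SEVERE_DEPTH]
    return {
        "jobs_affected": sum(l["jobs"] for l in affected),
        "jobs_affected_severely": sum(l["jobs"] for l in severe),
        "uninsured_affected": sum(1 for l in affected if not l["insured"]),
        "stock_at_risk_eur": sum(l["stock_eur"] for l in affected),
    }

indicators = flood_indicators(2.1, locations)
```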
Conclusions

We developed several new predictive data-driven indicators. These indicators are the result of many iterations through the data analytics steps described in the CRISP-DM reference model for data analytics projects.
The indicators presented above are based on:
• A predictive flood water level model
• Lidar location height data
• Location statistics databases
To be valuable for the end-user, a predictive indicator needs to be reliable. Reliability comes with predictive model accuracy, and model accuracy depends on identifying the influencing historical data input streams, from which predictive modelling can discover patterns that help forecast future values.
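One practical way to identify influencing input streams is to screen candidate historical series for correlation with the target. A minimal sketch using Pearson correlation on hypothetical aligned rainfall and water-level series:

```python
def pearson(xs, ys):
    """Pearson correlation between a candidate input stream and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical aligned series: hourly rainfall (mm) and water level (m).
rainfall = [0.0, 2.0, 5.0, 9.0, 3.0]
level = [1.1, 1.3, 1.8, 2.6, 1.5]
r = pearson(rainfall, level)
```

A strongly correlated stream is a candidate predictor; correlation is only a first screen, since lagged or non-linear influences also matter for flood dynamics.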
The SmartResilience methodology builds on several macro indicators and depends on these indicators providing a reliable indication of the underlying business understanding, as referred to in CRISP-DM step 1. This D3.9 report has provided a practical description of how to build reliable predictive data-driven indicators.
It is important to note that the extension of any data-driven indicator database, such as the SmartResilience resilience tool database, requires hands-on consultancy work to ensure that end-users benefit from predictive indicators that provide reliable, validated insight in line with their business understanding.
The predictive modelling approach for resilience analytics discussed here has several advantages over existing risk analytics approaches, mainly in its combination of data science-driven predictive analytics with location-based intelligence. It helps stakeholders understand the impact of a predicted threat by linking the threat prediction value with location-based metrics such as employment at business locations. Firstly, this approach allows calculating the impact of a currently predicted threat (following a change process such as climate change) using predictive models that the data scientist can train and test in RapidMiner, with, for example, sensor-sourced weather data such as wind speed or wind direction, and environmental data such as water levels from water level gauges in real time. Secondly, it allows for exploring the threat impact by manually changing the value of the threat parameter and reviewing its impact using the location-specific metrics.
By exploring the impact of a threat with available location-associated metrics, we can assess the impact without historical data in a what-if style. Section 1.5 discussed deterministic vs parametric flood risk estimation approaches, the latter requiring only a few parameters. This links with the idea of the SmartResilience indicator database, which provides a case study such as urban flooding or cyber-attacks with a large number of indicators to choose from, and is similar to the concept of parametric vulnerability assessment described in Little and Rubin 1983 [6]. For example, the availability of flood bags increases flood resilience and reduces the impact of flood events. In the cyber domain, the term cyber resilience has recently been coined to identify specifically “the ability to continuously deliver the intended outcome despite adverse cyber events” [15]. Cyber resilience indicators can help describe the baseline distribution of a cyber-physical system. Such a baseline of a cyber resilience indicator can be learned from process control log data recorded during secure operation. Statistically significant deviations in resilience indexes for a wastewater facility can be produced by, for example, a faulty pump. Cyber-attacks do not exhibit a predictive pattern, in contrast to naturally occurring faults. Therefore, insights about the cause of an anomaly could come from a comparison between several indicators, including those obtained by simulating possible faults [16].
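The what-if style of exploration can be sketched as a parameter sweep: vary the threat parameter manually and read off a location-based metric at each step (all numbers hypothetical):

```python
# Hypothetical (location height in metres, jobs at location) pairs.
heights_and_jobs = [(1.0, 40), (1.8, 15), (2.5, 25), (3.2, 60)]

def jobs_affected(level):
    """Location-based metric: jobs at locations at or below the level."""
    return sum(jobs for height, jobs in heights_and_jobs if height <= level)

# Sweep the threat parameter from 1.0 m to 3.0 m in 0.5 m steps.
sweep = {step / 10: jobs_affected(step / 10)
         for step in range(10, 35, 5)}
```

Each entry in `sweep` answers one what-if question ("how many jobs would a level of x metres affect?") without needing historical flood records.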
RapidMiner is aimed at the data science community, who have undergone extensive training in data science. It is rich in predictive methods that rely on time series data, and it also supports what-if explorations; this functionality is accessible through an interface aimed at a data science audience. Given its focus as a development tool for data science solutions, it is not suitable for direct use by resilience-tasked end users, but rather for technical staff who help build indicators for use by the end-users. Therefore, to benefit from its quantitative capabilities, its results can be fed into other systems via various integration options. One is the storage of results from a RapidMiner analytics process in an Excel file for use by another application, such as the SmartResilience indicator dashboard, as was done through the Excel import feature of the big data uploader described in D4.2. However, for RapidMiner, we did not reach this level of integration with the project integrated tool, for several reasons. RapidMiner was installed in the SmartResilience web portal at an early stage, in order to explore the possibilities of performing (SmartResilience) resilience assessments using a COTS development tool for data science solutions as an alternative to the custom-made tools in SmartResilience. There turned out to be two main challenges with this. One was that the custom-made tools (and most of the methods) were at an early development stage, meaning that it was not exactly clear which assessments or parts of assessments should be performed using RapidMiner. This led, for example, to the development of a RapidMiner MCDM approach that differed from the one later developed and included in the integrated tool. The second challenge was the recognition that RapidMiner is aimed at the data science community, who have undergone extensive training in data science.
Attempts were made to explore the potential use of RapidMiner for the different use cases, mainly by data analytics experts analysing datasets received from the case studies, but it turned out to be difficult to obtain relevant (open and non-sensitive) datasets from the case studies. One main reason was that case studies other than GOLF did not focus on the prediction of events and did not possess this type of (non-sensitive) dataset.
Summary
We have discussed several approaches for building a predictive model for flood water level forecasting, which helps flood emergency coordinators see the impact and recovery of an imminent flooding event with the associated functionality levels (cf. the D3.3 report). Using available data (weather, tidal and river levels), we have shown that a deep learning algorithm is quite accurate in forecasting future water levels.
The model itself cannot be applied to other cities due to the variance in infrastructure (flood prevention, dams, walls, river system, etc.) and weather/tidal data. However, it would be reasonably straightforward to fit a model to data obtained from other cities. We have shown that for Cork, the deep learning approach is the most suited algorithm. Whether this holds for other cities would need to be investigated for each case.
D3.9 has focused mostly on the GOLF case, given the availability of datasets and dialogue with external stakeholders in Ireland. The contribution to D3.3 (modelling impact and recovery) is a predictive model for forecasting water levels in real time, which we combined with location height data and location statistics data to calculate new indicators that show the respective functionality levels based on the water level values from the predictive model. The predictive time window can be configured with the window size. Modelling a whole FL-t curve is different for each incident, as incidents have individual time lengths. The forecasting time length of the predictive model can be increased, however, at the cost of forecasting accuracy.
We have tested a deep learning predictive model with a window size of up to 4 hours into the future, with acceptable results. We have connected the predictive model with location-centric statistics data using location height data. In this way, we were able to identify all location statistics at or below a predicted water level and could calculate data-driven indicators such as jobs affected, or affected severely, by flooding. We have calculated the bounce-back functionality level considering the number of locations severely affected by a predicted flood water level, which will likely suffer structural damage. We argue that there may not be a return to pre-disaster normalcy in the case of a severe flooding disaster. In that context, there is a need to think about regional adaptation, rethink resilience and see an individual flood-affected region as a complex adaptive system. An adaptive system is able to change or adapt to stresses rather than merely striving for a return to normalcy or a resumption of pre-challenge behaviours or outcomes [11].
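The windowing idea behind such a time-series model can be illustrated as follows: each training example pairs a window of past values with a target a fixed horizon after the window's end. This is a simplified sketch, not the RapidMiner windowing operator itself; the series values are hypothetical:

```python
def make_windows(series, window_size, horizon):
    """Turn a univariate series into (window, target) training pairs:
    each window of `window_size` past values predicts the value
    `horizon` steps after the window's end."""
    pairs = []
    for i in range(len(series) - window_size - horizon + 1):
        window = series[i:i + window_size]
        target = series[i + window_size + horizon - 1]
        pairs.append((window, target))
    return pairs

levels = [1.0, 1.1, 1.3, 1.6, 2.0, 2.5]  # illustrative water levels (m)
pairs = make_windows(levels, window_size=3, horizon=2)
```

Increasing the horizon extends the forecasting time length but, as noted above, typically at the cost of forecasting accuracy.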
It is essential to understand that a predictive analytics problem cannot be solved by loading data into a predictive modelling tool such as RapidMiner and hoping it will return the required results. Model creation is greatly supported by a visual modelling tool, but for the model to be effective, we need to carry out an in-depth review of the problem to be solved and of the available data. Often it turns out that the available data are not sufficient to support a predictive model, and frequently the strategy for predictive modelling changes shape during the pre-modelling phases.
The MCA approach used in the GOLF case study is an alternative to the SmartResilience MCDM approach described in the D3.4 report. It is broadly applicable across a range of government decisions and fulfils important criteria for application in government, such as the ability to provide an audit trail, transparency and ease of use. We have discussed the MCA application using RapidMiner following the OPW MCA appraisal process, adapted in the context of the GOLF case study. The OPW MCA is based on global and local weights and a scoring. The global weights are defined by the four core criteria (technical, economic, social and environmental), each with several objectives; for the criterion economic, for example, the objectives economic return and transport infrastructure carry the global weights 25 and 15 respectively. Local weighting then puts a value on the importance of the area concerned by flooding, ranging from major/international importance through national and local importance to no importance. The scoring is a measure developed by an expert steering group for indicators at the flood-affected locations: for example, if a flood protection
investment option partly achieves the aspirational target for reducing the risk of flooding to a number of business locations it will receive a score of 3. In other words, if a flood protection investment option partly reduces the employment-related indicator illustrated in Figure 33, then it will receive a score of 3. The RapidMiner MCA implementation can be directly adapted to other MCDM use cases if they follow the MCA approach we used.
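The benefit-score arithmetic can be illustrated for a single option: each criterion contributes global weight × local weight × score, and the contributions are summed. The global weights 25 and 15 follow the example above; the local weights and scores are hypothetical:

```python
# One option's rows: (objective, global_weight, local_weight, score).
# Local weights and scores below are hypothetical illustration values.
option_rows = [
    ("Economic return",          25, 3, 3),
    ("Transport infrastructure", 15, 2, 4),
]

# Benefit score: sum of global_weight * local_weight * score per objective.
benefit = sum(gw * lw * s for _, gw, lw, s in option_rows)
```

Options are then ranked by this benefit score, which makes the appraisal transparent and auditable: every contribution can be traced back to one weighted row.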
The RapidMiner applications can be integrated in various ways: via file (Excel, CSV, etc.), database connectivity with most database systems, or exposed as a web service in a cloud deployment scenario.
Concerning T3.7, we have discussed RapidMiner enterprise integration options and the reuse of RapidMiner analytics processes, which can be integrated with most database systems, including Microsoft SQL Server, which is used for the SmartResilience tool. Also, RapidMiner Studio was installed and made accessible via Remote Desktop in the hosting environment of the SmartResilience tool. T3.7 task members were engaged in discussions on tool integration.
D3.9 also created several new data-driven indicators as a contribution to T4.6. These indicators are based on the predictive water level model for the GOLF use case, which is discussed in chapter 3. Combined with OSI height data and location statistics data, and using RapidMiner advanced analytics tooling, we derived several new data-driven indicators available for the SmartResilience database.
We have given an overview of CRISP-DM, which is a widely used reference model for data analytics projects. CRISP-DM gives guidance on the typical steps of a data analytics project leading to data-driven indicators and decision support tools. Obtaining datasets for resilience analytics in a multi-stakeholder environment concerned with flood disaster resilience often involves time-consuming data scouting across various stakeholder organisations and negotiating terms for data access, frequently in the form of an NDA. Data understanding, data availability, data preparation, modelling, evaluation and deployment are part of steps 2-6 in CRISP-DM and can lead to a review and change of the business analytics problem description in step 1. Such reviews and changes may be necessary to ensure that the decision support tools built address the interests of the end-user stakeholder.
References
[1] SmartResilience Consortium, “Deliverable D3.3: Report on the ‘SmartResilience Methodology for Assessing Resilience of SCIs based on RIs (resilience indicators)’,” 2018.
[2] SmartResilience Consortium, “Deliverable D3.4: Report on the SmartResilience MCDM Methodology serving as the basis for the ‘SCIs Dashboard’,” 2018.
[3] SmartResilience Consortium, “Deliverable D3.7: The ‘SCIs Dashboard’ containing the module on Dynamic Intelligent Checklists,” 2018.
[4] SmartResilience Consortium, “Deliverable D4.6: New release of the RI-database,” 2018.
[5] S. F. Balica, I. Popescu, L. Beevers, and N. G. Wright, “Parametric and physically based modelling techniques for flood risk and vulnerability assessment: A comparison,” Environ. Model. Softw., vol. 41, pp. 84–92, 2013.
[6] R. Little and D. Rubin, “On Jointly Estimating Parameters and Missing Data by Maximizing the Complete-Data Likelihood,” Am. Stat., vol. 37, p. 218, 1983.
[7] P. Chapman et al., “CRISP-DM 1.0 Step-by-step,” p. 73, 2000.
[8] RapidMiner, “RapidMiner Studio Datasheet with Feature List,” 2017.
[9] RapidMiner, “RapidMiner Server,” 2017.
[10] RapidMiner, “RapidMiner Studio.” RapidMiner, Inc., 2019.
[11] S. Carpenter, F. Westley, and M. Turner, Surrogates for Resilience of Social–Ecological Systems, vol. 8, 2005.
[12] R. Pendall, K. A. Foster, and M. Cowell, Resilience and Regions: Building Understanding of the Metaphor, vol. 3, 2009.
[13] C. Kuhlicke, C. Begg, M. Beyer, I. Callsen, A. Kunath, and N. Löster, “Hochwasservorsorge und Schutzgerechtigkeit – Erste Ergebnisse einer Haushaltsbefragung zur Hochwassersituation in Sachsen,” Helmholtz Centre for Environmental Research (UFZ), Leipzig, 15/2014, May 2014.
[14] Environmental Systems Research Institute Inc, “Esri: GIS Mapping Software, Spatial Data Analytics & Location Intelligence.” [Online]. Available: https://www.esri.com/en-us/home. [Accessed: 02-Feb-2017].
[15] F. Björck, M. Henkel, J. Stirna, and J. Zdravkovic, “Cyber Resilience – Fundamentals for a Definition,” Adv. Intell. Syst. Comput., vol. 353, pp. 311–316, 2015.
[16] G. Murino, A. Armando, and A. Tacchella, “Resilience of Cyber-Physical Systems: an Experimental Appraisal of Quantitative Measures,” in Proceedings of the 2019 11th International Conference on Cyber Conflict: Silent Battle, NATO CCD COE Publications, 2019, pp. 459–477.
Annex 1 Summary of the input data
The data for predictive modelling was obtained from the following data sources:
1. Roches Point weather station, hourly data:
• Rainfall
• Air/dewpoint temperature
• Relative humidity / vapour pressure
• Mean sea level pressure
• Wind speed/direction
https://cli.fusio.net/cli/climate_data/webdata/hly1075.zip
https://www.met.ie/weather-forecast/roches-point-weather-station-cork
2. Water level gauge, Lee Road station, 5-minute data:
• Water level in metres
https://data.corkcity.ie/dataset/river-lee-levels
3. NMCI tidal station, 15-minute data:
• Tide in metres
https://waterlevel.ie/hydro-data/stations/19069/Parameter/S/complete.zip
https://waterlevel.ie/hydro-data/search.html?rbd=SOUTH%20WESTERN%20RBD
Table 12: Data summary – Roches Point Weather Station – every hour
ID Element Unit
date weather Hourly measurements dd/mm/yyyy hh:mm
Ind (for date weather)
0. satisfactory.
1. deposition.
2. trace or sum of precipitation.
3. trace or sum of deposition.
4. estimate precipitation.
5. estimate deposition.
6. estimate trace of precipitation.
rain Precipitation Amount mm
Ind1 (for rain)
0. positive.
1. negative.
2. positive estimated.
3. negative estimated.
4. not available.
temp Air Temperature °C
Ind2
0. positive.
1. negative.
2. positive estimated.
3. negative estimated.
4. not available.
5. frozen negative.
wetb Wet Bulb Air Temperature °C
dewpt Dew Point Air Temperature °C
vappr Vapour Pressure hPa
rhum Relative Humidity %
msl Mean Sea Level Pressure hPa
Ind3
2. Over 60 minutes.
4. Over 60 minutes and defective.
6. Over 60 minutes and partially defective.
7. n/a
wdsp Mean Hourly Wind Speed kt
Ind4
2. Over 60 minutes.
4. Over 60 minutes and defective.
6. Over 60 minutes and partially defective.
7. n/a
wddir Predominant Hourly wind Direction degree
Table 13: Data summary – Tidal Station NMCI Ringaskiddy Data - every 15mins
ID Element Unit
Date tide 15-minute interval measurements yyyy/mm/dd hh:mm:ss
Tide Height of tide metre
Table 14: Data summary – Water Level Station Lee Road - every 5mins
ID Element Unit
date Timestamp of water level measurement (5-minute intervals) yyyy-mm-ddThh:mm:ss
level Level Station 1 (Target) metre
We used Lee Road Station: Lat: 51.89464 / Long: -8.51296.
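Before modelling, the three feeds (hourly weather, 15-minute tide, 5-minute water level) have to be aligned on a common time grid. A minimal sketch that floors timestamps to the hour and averages the finer-grained feeds within each hour; the sample values are hypothetical, and in practice a tool such as pandas (or RapidMiner's own data preparation operators) would do this at scale:

```python
from datetime import datetime

def floor_to_hour(ts):
    """Map a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Hypothetical samples from the three sources at their native intervals.
weather = {datetime(2019, 1, 1, 10): 0.4}                  # hourly rainfall (mm)
tide = {datetime(2019, 1, 1, 10, 15): 2.1}                 # 15-min tide (m)
level = {datetime(2019, 1, 1, 10, 5): 1.3,                 # 5-min level (m)
         datetime(2019, 1, 1, 10, 10): 1.4}

def align_hourly(weather, tide, level):
    """Join the three feeds on an hourly grid, averaging the
    finer-grained feeds within each hour."""
    rows = {floor_to_hour(ts): {"rain": rain} for ts, rain in weather.items()}
    for name, feed in (("tide", tide), ("level", level)):
        buckets = {}
        for ts, value in feed.items():
            buckets.setdefault(floor_to_hour(ts), []).append(value)
        for hour, values in buckets.items():
            if hour in rows:
                rows[hour][name] = sum(values) / len(values)
    return rows

aligned = align_hourly(weather, tide, level)
```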
Annex 2 Charts
Figure 46: Insured businesses either affected or severely affected by the predicted water level
Figure 47: Ratio of the number of jobs at business locations either affected or severely affected by the predicted flood water level
Figure 48: Value of stock levels held at business locations either affected or severely affected by the predicted flood water level
Annex 3 Review process
The content of this Annex has been submitted as part of the periodic review report to the PO/EU/Reviewers.
Review Response
Reviewer 1
General: Change page numbering (Roman numerals are used throughout the entire document).
Addressed
List of Acronyms: Provide a complete list (missing e.g. CSV, FRM, GW, LW, MCA, S, …).
Addressed
Section 1.3: Explain who the end users are. Is this a person that can perform FL assessment and MCDA using RapidMiner applications on his own, or must he use AIA as a consulting firm? How can the applications be made workable for all partners/users without having a RapidMiner expert?
We identified three D3.9 reader groups in section 1.3, one of them being data scientists. Extending the SmartResilience database with data-driven indicators for particular use cases (which each have their own data needs) requires expertise in data science. Anybody with a background in data science can use RapidMiner and develop the data-driven indicators discussed in D3.9, and also MCA processes.

This relates to my review comments as a reviewer for D3.6, which preceded this D3.9 report: “in the context of someone who is tasked with SCI resilience assessment, modelling, monitoring, analysing dependencies and interested in building decision support applications following a guideline my comments are: An organisation tasked to implement a project that results in decision support applications for end users in their operational environment would require the following project stakeholders: 1. application end users 2. business analysts 3. developers, data scientists. Group one would be linked to application requirements and validation. Group two would be concerned with analysing the needs of application end users, the definition of their requirements, and working with data scientists & developers to build applications of value to group one. Group three would be concerned with preparing data from identified data sources, applying analytics methods to that data to calculate the value of smart indicators, updating the indicators registered in a database and creating decision support applications that support SCI resilience assessment, modelling, monitoring, dependencies analysis.”

The above has been addressed in section 1.3 of this D3.9 report. The section has been extended with a stakeholder role description. This D3.9 report addresses all three reader groups.
By only addressing the end-user of indicators, and not addressing business analysts and data scientists/developers, it would be impossible to understand how an indicator database can be extended with new indicators for different domains. The CRISP-DM process we describe in section 2.2, together with chapters 3, 4, 5 & 6, informs business analysts and data scientists on how to add new resilience indicators to the database that meet end-users’ requirements.
Section 1.3 first sentence and Section 3.1 last sentence on page v: Denoting the project as a "data science-driven resilience assessment project" indicates a lack of understanding of the overall project. Big data and data analytics only have a partial role in the project.
Addressed in section 1.3.
Section 4.3: Change order of bullets according to the subsections 4.4.1-4.4.3 or change the order of the sub-sections.
Addressed.
Sub-section 4.4.4, Figure 29: Can refer to Figure 10, since this is the same figure.
Figure 29 (now 31) has been updated with the confusion matrix on unseen data.
Section 4.5, Figure 30: Figure 30 is difficult to read and understand. Explain the results and provide understandable captions on the X- and Y-axis.
Addressed in section 4.7 (now 3.7), figures 32 & 33.
Section 5.1, third sentence: "The OPW supported the SmartResilience project and in particular the GOLF case study …". Did they support anything else apart from GOLF?
Addressed in section 4.1.
Section 5.2: Explain the method first, including complete list of criteria, and the ranges used for weights and scores. Then provide the example in Figure 31 (which should be reduced in size). Further, explain the results. What does the result -50 mean?
MCA is now explained, see section 5.3 (now 4.3) MCA benefit score approach. Explanation of figure 31 (now table 8) is addressed, reduced in size.
Section 5.3, Figure 37: Show the full calculation of at least one of the options in Figure 37, e.g. how do you obtain 780.561 for option 5? This will make the method and analysis more transparent.
Implemented, see table 9 - example MCDM calculation
Section 5.3, Figure 38: Explain the results.

Implemented, added explanation below figure (now Figure 40).
Section 5.4: What does it take for others to perform MCDM using RapidMiner (without data similar to OPW)? Is this readily available as an alternative to the MCDM in the Integrated Tool, or must AIA first create a new tailor-made application in RapidMiner? Could elaborate on this in Section 5.4.
Implemented, the RapidMiner MCA implementation can be adapted to other MCDM use cases with data similar to the one the OPW has been using for decision making relating to the GOLF case study. The MCDM implementation discussed can be used for other use cases that follow a similar process of global/local weights and scoring. We used the following data structure for data to feed into the MCDM process:
• Criteria (text)
• global-weighting (number)
• local-weighting (number)
• score (number)
• option (text)
Chapter 6: How has this supported the development of the integrated resilience assessment tool in D3.7?
Addressed in new section reuse of RapidMiner Analytics Processes in Chapter 6 (now 5)
Chapter 7: List all the new indicators.

Chapter 7 lists new indicators derived from the following datasets:
• OSI height data (NDA)
• Historical tidal data
• Historical weather data
• Location statistics data, such as the CCC employment statistics database, from which we have extracted data attributes and created dummy values given that the database was supplied under an NDA

Stock levels held were discussed during the workshop with participants; we did not succeed in accessing meaningful data, but data could be obtained from wholesalers like Musgraves supplying retail businesses in areas affected by flooding. Data access would need to be negotiated on commercial and legal terms.
Reviewer 2
Review Response
Executive Summary: No results and conclusions are mentioned in the executive summary; only the objectives are stated and some acknowledgments are made. Please, make sure that the summary provides also key info on (i) background of the work, (ii) methods used, (iii) results and (iv) discussion / conclusions / interpretations of these results.
Implemented
Introduction: Section 1.2 lists related deliverables but gives no further explanation on what this relation is. This might also be a good place to position the work in D3.9 to the other efforts in the project more explicitly, maybe by pulling it together with the current chapter 2.
Implemented
Chapter 2: The table mentions that chapter 3&4 provide "resilience analytics" for e.g. % jobs affected by flooding. I don't find any such results in chapter 3&4. Please, harmonize this table with the actual content of the report.
Chapters 2 and 6 address development and use of data-driven predictive indicators such as %jobs affected for all three reader audiences of D3.9 - end users, business analysts and data scientists/developers.
Section 3.1: Instead of a one-to-one copy of RapidMiner promotional material, a critical appraisal of strengths and weaknesses of this software solution would be more in place. For instance, such environments are typically known to be limited in terms of performance for resource-intensive tasks and in terms of flexibility of implementing minor changes in the predictive algorithms. Is this an issue here too?
Implemented, section 2.1 has been extended with a discussion on performance and alternative software implementation solution options.
Section 4.2: I do not find the data to be sufficiently described in order to make the analysis reproducible. How was the data measured? How many sensors where there? Where was the location of these sensors? Over which time were the observations taken? Has this data been described elsewhere? Please, ask yourself if given on the information that you provide here, a person with a similar background as you could reproduce your findings.
Implemented, annex I summarises the data including measurement intervals, sensor locations, observation time frames. It also provides a link to the datasets (not the location statistics data such as employment statistics, location height data and others we obtained under an NDA).
Section 4.2: The most important omission is: What were the (potential) predictor variables, and what the predictees? The only (indirect) mention of this I find in the conclusion section!
Section 4.5 (now 3.5) provides a description of the predictor variables, such as historical water level station and rainfall data, and the predicted variable, which is the future water level.
Section 4.3: Describe more clearly what you mean by "windowing", "horizon", and "offset" here. These are not standard terms.
Implemented, see section 3.4
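To make the response in section 3.4 concrete for the data-scientist reader, the three terms can be illustrated with a minimal sketch. This is an illustration of the general windowing idea only, not RapidMiner's internal implementation, and the function name and water-level values are hypothetical:

```python
def make_windows(series, window=3, horizon=2, offset=0):
    """Turn a time series into (features, target) pairs.

    window  - number of past values used as predictors
    offset  - gap between the end of the window and the forecast range
    horizon - how many steps into that range the target lies
    """
    examples = []
    for t in range(len(series) - window - offset - horizon + 1):
        features = series[t:t + window]                       # past values
        target = series[t + window + offset + horizon - 1]    # future value
        examples.append((features, target))
    return examples

# ten hourly water-level readings (illustrative values, not GOLF data)
levels = [1.2, 1.3, 1.5, 1.4, 1.6, 1.9, 2.1, 2.0, 1.8, 1.7]
pairs = make_windows(levels, window=3, horizon=2, offset=0)
# first pair: the 3-value window [1.2, 1.3, 1.5] predicts the level
# 2 steps ahead, i.e. 1.6
```

Each window thus becomes one training example for a standard learner, which is how a time-series problem is converted into an ordinary prediction problem.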
Section 4.4.1: It is not clearly described whether the model was evaluated (the "Table" in "Figure" 10) on the training or test data. Also, if, as you describe, you use cross-validation, how can you conclude that your model works well for unseen data? I know that Naive Bayes has the advantage of being easily transferable, but in my understanding, that would require the use of a holdout dataset which I don't find described in the text.
Implemented; sections 3.2, 3.3, 3.6.1, 3.6.3 and 3.6.4 highlight the use of unseen data, and the model performance discussion is based on unseen data from Jan/Feb 2019. The training/test data covered 2015-2018.
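The temporal holdout described in the response (cross-validation on 2015-2018, final evaluation on unseen Jan/Feb 2019 data) amounts to a simple cutoff-date split. A minimal sketch, with hypothetical labelled observations:

```python
from datetime import date

# (date, flood flag) pairs; values are illustrative, not GOLF data
observations = [
    (date(2017, 3, 1), 0),
    (date(2018, 11, 5), 1),
    (date(2019, 1, 12), 1),
    (date(2019, 2, 3), 0),
]

cutoff = date(2019, 1, 1)
# 2015-2018: used for model fitting and cross-validation
train = [o for o in observations if o[0] < cutoff]
# Jan/Feb 2019: held out entirely, touched only for the final evaluation
holdout = [o for o in observations if o[0] >= cutoff]
```

Splitting on time rather than at random matters here: a random split would leak future values into training and overstate how well the model transfers to genuinely unseen data.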
SmartResilience: Indicators for Smart Critical Infrastructures
page 68
Review Response
Section 4.4.2.1: I don't understand why one would use a standard ARIMA model to try to predict "deviations from the norm". If you want to predict such things from a time-series model, consider the use of impulse response functions. In any case, it should be discussed why one would try to model extreme events (flood) through an autoregressive process; I don't get it...
We used ARIMA as one of the main methods for predicting time series. We also concluded that it did not work, so it is unclear why the reviewer focuses on this. I also disagree somewhat with the reviewer: floods do not come out of nowhere; they are linked to wind, temperature and tidal forecasts, and these are not isolated either, so it was a reasonable decision to attempt to fit an ARIMA model. It is always easy afterwards to claim that it does not work. The reality is that we are using models that do things humans cannot, the proof is in the pudding, and so trying them out is paramount.
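For readers unfamiliar with the autoregressive process under debate: the AR core of ARIMA models the next value as a linear function of recent values. A minimal sketch of an AR(1) least-squares fit, i.e. the AR part of ARIMA(1,0,0), on illustrative water-level readings (the function and data are hypothetical, and a real analysis would use a statistics library rather than this hand-rolled fit):

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    phi = cov / var          # persistence of the series
    c = my - phi * mx        # intercept
    return c, phi

levels = [1.2, 1.3, 1.5, 1.4, 1.6, 1.9, 2.1, 2.0, 1.8, 1.7]
c, phi = fit_ar1(levels)
next_level = c + phi * levels[-1]  # one-step-ahead forecast
```

The sketch also illustrates why the attempt was defensible but ultimately failed: an autoregressive forecast is pulled towards recent, typical values, so it captures the gradual build-up of water levels yet tends to underpredict rare extremes such as flood peaks.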
Section 4.4.3: Again, it is not clear whether the model quality is measured in or out-of-sample.
Again, this is addressed in sections 3.2, 3.3, 3.6.1, 3.6.3, 3.6.4.
Section 4.4.4: For which reason are false negatives more acceptable than false positives?
We prefer to predict a flood that then does not occur (a false positive) over failing to predict a flood that does occur (a false negative). This point is made in section 3.6.4.
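This preference can be made measurable: because a missed flood (false negative) is the costly outcome, recall on the flood class is the figure to watch alongside overall accuracy. A minimal sketch with hypothetical predictions (1 = flood, 0 = no flood; not actual GOLF results):

```python
def confusion(actual, predicted):
    """Count flood-warning outcomes over paired labels."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false alarm
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # missed flood
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual    = [0, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 1, 0, 0, 0]
tp, fp, fn, tn = confusion(actual, predicted)
recall = tp / (tp + fn)  # fraction of real floods that were warned about
```

Under this cost asymmetry, a model with a few extra false alarms but higher flood-class recall would be preferred over a slightly more accurate model that misses floods.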
Figure 19 - …: Most of the time series figures have no or generic axis labels, here and in the following plots.
Implemented
Reviewer 3
Section 4: There is no description of the time series data other than a table with the excerpt from the database. No link to the SmartResilience indicator set is included. Also, no justification of the relevance of this data.
A description of the historical time series data is given in Annex I. We do not use SmartResilience indicators as input to the RapidMiner predictive modelling; the aim is to create these indicators. The relevance of the input data is sufficiently justified in the report.
Section 4: The following work is described, but only with respect to the data for GOLF (no link to other work performed in the project):
o Preprocessing of data
o The use of different prediction models: Naïve Bayes classification, deep learning, ARIMA
o A lot of not really valuable screenshots are used to describe the performed work.
The work described aligns with the CRISP-DM industry reference model for data analytics projects and can therefore be applied to any SmartResilience case study, which would likely require the same time-consuming iterative approach we went through in the GOLF case. I disagree with the statement that the screenshots are not valuable: they help the data scientist/business analyst readers of this report (ref. section 1.3) follow the approach and enable them to reproduce results for similar business use cases.
Section 4.4.4: In the analysis subsection 4.4.4 the model performance is discussed:
o Too brief, and includes only a comparison of the different modelling approaches, but the objective of this deliverable is not to compare these three prediction models.
I disagree with the review comment that the "objective of this deliverable is not to compare these three prediction models". As clarified in CRISP-DM, modelling (step 3) is followed by evaluation (step 4). It is essential to evaluate the performance of the modelling so that it meets end-user stakeholder requirements (step 1). A model needs to forecast values reliably, and this reliability, in the form of prediction accuracy, is discussed in section 4.4.4 (now section 3.6.4).
Section 4.5: Subsection 4.5 includes only one rough figure about the predicted functionality level. This is not understandable without additional description. This is also the first time recovery is mentioned, but recovery aspects are described nowhere in the document.
Further charting of predicted impact and recovery levels has been added in Annex II, with axis descriptions and self-explanatory figure captions. Recovery is now also mentioned in the executive summary and further in sections 1.1, 1.2, 2.1, 2.2, 3.1, 3.7, 3.8 and 5.1. Section 3.7 contains an extensive description of the predicted impact and recovery levels.
Section 5: There is no link to other parts of the SmartResilience methodology / tool.
Now chapter 4, further explanation has been added on the use of indicators
Section 5: The section describes the implementation of MCDM for GOLF largely via the use of RapidMiner screenshots. The focus is on the technical implementation alone.
A detailed description of the method has been added to chapter 5 (now 4) with a calculation example
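The calculation example referred to follows the general weighted-sum pattern common to MCDM methods: each option is scored against each criterion, scores are multiplied by stakeholder weights and summed, and options are ranked by the total. A minimal sketch; the criteria names, weights and scores below are hypothetical placeholders, not the actual GOLF stakeholder values:

```python
# hypothetical stakeholder weights (must sum to 1) and normalised
# option scores in [0, 1], higher = better
weights = {"flood_risk": 0.5, "cost": 0.3, "recovery_time": 0.2}
options = {
    "option_a": {"flood_risk": 0.8, "cost": 0.4, "recovery_time": 0.6},
    "option_b": {"flood_risk": 0.6, "cost": 0.9, "recovery_time": 0.5},
}

def weighted_score(scores, weights):
    """Aggregate per-criterion scores into one decision value."""
    return sum(weights[c] * scores[c] for c in weights)

ranking = sorted(options,
                 key=lambda o: weighted_score(options[o], weights),
                 reverse=True)
```

The decisive modelling step is not this arithmetic but eliciting the weights and score scales from the local stakeholders, which is what the added documentation in chapter 4 covers.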
Section 5: There is no documentation of how the different decision criteria were developed together with the local stakeholders.
The documentation has been added.
Section 5: Section 5.4 (adaptation to other use cases) could be highly relevant, but it is just 4 lines. No details are provided.
This is now covered in section 4.5 conclusions
Section 5: Overall, this section is also poor. It does not address the reviewer’s interpretation of the task at any time. There is no link to the SmartResilience methodology / tool. Even for the other possible interpretation (independent use of RapidMiner), discussions related to the potential use of RapidMiner for other scenarios are missing. Everything is very brief, and the focus is just on the technical implementation.
Again, I disagree with the 'poor' rating. The chapter is in line with T3.4, in which AIA will contribute to developing multi-criteria decision analysis tools based on, e.g., RapidMiner. Further explanation has been added on the use of indicators. The application of the RapidMiner implementation in other use cases is now addressed in section 4.5 (conclusions). A non-technical, conceptual explanation of the MCA method has been added to chapter 5 (now 4).
Section 6: Section 6 consists of one page (1 figure and 1 paragraph text). No results are provided. This section is not acceptable.
Chapter 6 has been expanded and a conclusions section added.
Section 7: This seems to be somewhat unrelated to the objectives of this deliverable. It is a high level description of indicators from GOLF which are included in the SmartResilience database.
Again, this relates to a previous comment: As per the recent progress report D7.4, AIA is reporting on T4.6 within D3.9: “D3.9 is closely related to finalising AIA contributions in D4.6 and AIA will be in a position to update the Stuttgart database once D3.9 is completed, should D4.6 be submitted before D3.9 AIA plans to report on D4.6 in D3.9”