near real-time probabilistic damage diagnosis using

Near Real-Time Probabilistic Damage Diagnosis Using

Surrogate Modeling and High Performance Computing

James E. Warner∗

NASA Langley Research Center, Hampton, VA, 23666, USA.

Mohammad Zubair† and Desh Ranjan†

Old Dominion University, Norfolk, VA, 23665, USA.

This work investigates novel approaches to probabilistic damage diagnosis that utilizesurrogate modeling and high performance computing (HPC) to achieve substantial com-putational speedup. Motivated by Digital Twin, a structural health management (SHM)paradigm that integrates vehicle-specific characteristics with continual in-situ damage di-agnosis and prognosis, the methods studied herein yield near real-time damage assessmentsthat could enable monitoring of a vehicle’s health while it is operating (i.e. online SHM).High-fidelity modeling and uncertainty quantification (UQ), both critical to Digital Twin,are incorporated using finite element method simulations and Bayesian inference, respec-tively. The crux of the proposed Bayesian diagnosis methods, however, is the reformulationof the numerical sampling algorithms (e.g. Markov chain Monte Carlo) used to generate theresulting probabilistic damage estimates. To this end, three distinct methods are demon-strated for rapid sampling that utilize surrogate modeling and exploit various degrees ofparallelism for leveraging HPC. The accuracy and computational efficiency of the methodsare compared on the problem of strain-based crack identification in thin plates. Whileeach approach has inherent problem-specific strengths and weaknesses, all approaches areshown to provide accurate probabilistic damage diagnoses and several orders of magnitudecomputational speedup relative to a baseline Bayesian diagnosis implementation.

I. Introduction

Digital Twin is a structural health management (SHM) paradigm that integrates vehicle-specific char-acteristics including as-built components and as-experienced loading with continual in-situ damage diagno-sis and prognosis. The framework places a strong emphasis on high-fidelity, physics-based simulation formodeling complex components and requires rigorous, end-to-end uncertainty quantification (UQ) to enableprobabilistic forecasts of a vehicle’s reliability into the future. Since Digital Twin aims to continually monitora vehicle while in service and operating (i.e., online SHM), the foundational software it is built on mustalso be computationally efficient. While normal operating conditions necessitate feedback on the order ofminutes, cases of unexpected, discrete damage events require near real-time damage diagnosis and prognosisin order to adjust the current mission parameters accordingly. Since incorporating UQ in a framework thatrelies on physics-based simulation can be computationally prohibitive, delivering high-fidelity, probabilisticassessments with such a high degree of efficiency is extremely challenging.

In terms of the damage diagnosis capability of SHM, a common approach to integrate UQ is to firstpose the analysis as an inverse problem (model-based diagnosis1) and apply Bayesian statistics to form aprobabilistic solution.2 Here, the inverse problem is to estimate parameters characterizing damage in astructural component given measurement data from sensors describing its mechanical response (e.g. strain,vibrations, etc.). The Bayesian solution to the inverse problem (i.e., the diagnosis) is then a probabilitydistribution of the damage parameters that is based on a computational model’s predicted response in thepresence of damage. The resulting probability distribution can rarely be evaluated analytically, so numerical

∗Research Computer Scientist; Durability, Damage Tolerance, and Reliability Branch.†Professor; Computer Science Department.

1 of 15

American Institute of Aeronautics and Astronautics

sampling algorithms (e.g., Markov chain Monte Carlo (MCMC)3) must be employed to generate sample-based, probabilistic estimates of the damage parameters.

With respect to a deterministic approach to model-based damage diagnosis, the Bayesian approach hasthe significant advantage of recovering a fully probabilistic description of likely damage states rather thansimply a point estimate with no regard for inherent uncertainties. Furthermore, when the model used isa high-fidelity simulation (e.g., a finite element (FE) model), probabilistic diagnosis of arbitrarily complexcomponents with various damage types can be enabled. The drawback of such an approach, however, is thesubstantial computation overhead associated with obtaining solutions using MCMC sampling. This is bothdue to potentially slow convergence behavior (i.e., requiring a huge number of samples to accurately resolvethe underlying probability distribution) and the need to run a new FE simulation for each sample drawn. Forexample, if a FE model takes minutes or hours to evaluate, a Bayesian MCMC approach requiring thousandsof samples that utilizes a FE model would take days or weeks to complete, making its use for rapid, onlinedamage diagnosis infeasible.

To alleviate this computational burden, advanced MCMC methods have been developed to improve theconvergence rate of the sampling process.4–7 Another common approach is to replace the original physics-based model with a computationally efficient surrogate model using probabilistic spectral methods,8 reduced-order modeling,9,10 or machine learning algorithms.11 This technique requires the offline pre-computationand storage of an input-output pair dataset from the original computational model to train a surrogatemodel for use during online sampling. The development of accelerated Bayesian approaches that combineboth advanced MCMC methods and surrogate modeling, however, remains relatively limited for model-basedSHM applications.12,13 Furthermore, due to the inherent dependence between subsequent samples drawnwith MCMC (i.e., the Markov property), there have been few successful attempts14 to parallelize thesealgorithms and take advantage of high performance computing (HPC) capabilities.

Motivated by online SHM with Digital Twin, this study explores new numerical sampling approachesthat utilize surrogate modeling and HPC to enable probabilistic damage diagnoses in near real time. Inparticular, three distinct methods for rapid sampling that exploit various degrees of parallelism are studiedto speed up Bayesian model-based diagnosis, each relying on a precomputed training dataset to avoid theonline evaluation of the original FE model. While the proposed approaches are generally applicable to manymodel-based diagnosis problems, they are demonstrated on the problem of strain-based crack identificationin thin plates. Here, simulated strain data from a limited number of locations on the plate are used for bothprobabilistic damage localization (crack location estimates) as well as full characterization (crack location,size, and orientation estimates). The tradeoffs between each method in terms of accuracy, efficiency, andparallel scalability are discussed and demonstrated. While the strengths and weaknesses of each are illus-trated, all approaches are shown to produce accurate probabilistic damage diagnoses in just a fraction of thecomputation time of a baseline Bayesian implementation using serial MCMC (i.e., on one processor) andFE simulations.

The remainder of the paper is organized as follows. First, a formulation is provided in the followingsection that begins with a brief background on Bayesian model-based diagnosis and is followed by individualsubsections devoted to each of the three sampling approaches studied herein. Next, results from the strain-based crack characterization examples are presented, illustrating the effectiveness of each sampling approachwith respect to one another as well as a baseline Bayesian implementation. Finally, the findings of the studyare summarized in the conclusion section.

II. Formulation

In this section, the proposed rapid sampling algorithms for near real-time probabilistic damage diagnosisare presented. First, a brief overview of model-based diagnosis is provided, including a probabilistic solutionapproach that covers Bayesian inference and Markov chain Monte Carlo (MCMC) sampling. Then, individualsubsections are devoted to each of the proposed sampling algorithms to accelerate/replace traditional MCMCwith surrogate modeling and high performance computing (HPC).

2 of 15


A. Background

1. Model-Based Diagnosis

Damage diagnosis methods operate under the assumption that the mechanical response of a structuralcomponent is altered in the presence of damage. To this end, the goal of diagnosis methods is to usemeasured response data dobs ∈ Rm to detect if damage is present and then ideally estimate some parametersc ∈ Rd that characterize the damage (location, size, etc.). Model-based approaches to diagnosis require amodel of the structural component capable of predicting the mechanical response for a given set of damageparameters

M(c) = y ∈ Rm, (1)

where y ≈ dobs for a damage estimate c that effectively characterizes the true damage. Here, it is assumedthat M encompasses properly calibrated boundary conditions, material properties, etc. and postprocessingto extract predicted responses in y that correspond to the time/location of the measurements dobs.

In the context of model-based diagnosis,M is referred to as the forward model while the diagnosis problemof using dobs to infer c is the associated inverse problem. A typical deterministic approach to solving thisinverse problem is to first pose an error metric between the measured response data and corresponding modelresponse

Q(c,dobs) =m∑i=1

‖dobsi −Mi(c)‖2, (2)

whereMi(c) ≡ yi. Then, gradient-based or global optimization algorithms are employed to find the damageparameters that minimize Equation (2) to produce the so-called least squares estimator

cLS = arg minc

Q(c,dobs). (3)

The primary drawback of such deterministic approaches for model-based diagnosis is that only a pointestimate of the damage is produced with no regard to uncertainty inherent in the measurement data (noise,sparsity, etc.).

2. Bayesian Inference

The Bayesian inference approach to model-based diagnosis reformulates the inverse problem as one of de-ducing a probability distribution of the unknown damage parameters, c, conditional on the observed mea-surement data dobs. This distribution, p(c|dobs), known as the posterior distribution, is given according toBayes’ Theorem:15

p(c|dobs) ∝ p(dobs|c)p(c), (4)

integrating any knowledge about the damage prior to the measurement in the prior distribution, p(c), withthe information from the data, dobs, through the likelihood function, p(dobs|c). While the prior distributioncan be a powerful tool for applying an analyst’s insight about likely damage states for a component, a non-informative prior (e.g., uniform probability) is chosen in this work as to not bias the results. On the otherhand, the likelihood function in this work takes the form

p(dobs|c) =1

(2πσ2)m/2exp

(− 1

2σ2

m∑i=1

‖dobsi −Mi(c)‖2)

∝ exp

(− 1

2σ2Q(c,dobs)

), (5)

where this expression is the direct result of the following assumed relationship between the measured andcomputed responses:

dobsi =Mi(c) + δi, δi ∼ Normal(0, σ). (6)

That is, the measurement data are polluted with errors, δi, that are treated as a sequence of independent,identically distributed (i.i.d.) samples drawn from a zero-mean Gaussian (Normal) distribution with varianceσ2 (interpreted as the noise level). Note that this exposition was intentionally kept brief and only used asa means to facilitate the succeeding formulations, the interested reader can consult the references regardingBayesian inverse problems2,15 and its application to damage diagnosis13 for more details.

3 of 15


3. Markov Chain Monte Carlo (MCMC)

The solution of the model-based diagnosis problem as the posterior probability distribution in Equation (4)is not of practical importance on its own since it can rarely be evaluated analytically. MCMC3 is arguablythe most effective and common approach to explore the posterior distribution to form probabilistic damageestimates with a Bayesian inference approach. The goal of MCMC is to generate a collection of N damageparameter samples from the posterior probability distribution

{c(j)}Nj=1 where c(j) ∼ p(c|dobs), (7)

which can then be used to construct empirical probability distributions, credibility intervals, and momentestimates for C. Algorithm 1 summarizes the most basic form of MCMC, the Metropolis algorithm:3

Algorithm 1 Metropolis MCMC

Initialize c(0)

for j = 1 : N doSample u ∼ Uniform(0, 1)Sample c∗ ∼ q(c∗|c(j−1))if u < A(c∗, c(j−1)) = min{1, p(c∗|dobs)

p(c(j−1)|dobs)} then

c(j) = c∗

elsec(j) = c(j−1)

end ifend for

Here, the method simply draws a trial sample, c∗, at each iteration from a proposal distribution, q(c∗|c(j−1)),and then decides whether to accept or reject this sample based on the acceptance probability, A(c∗, c(j−1)).The Metropolis algorithm assumes that the proposal distribution is symmetric, where a common choice is aGaussian distribution centered at the previous sample

q(c∗|c(j−1)) = Normal(c(j−1),Σq), (8)

where Σq is the user-specified covariance matrix. Algorithm 1 with Equation (8) constructs a Markov chainthat, by design, is guaranteed to have a stationary distribution that reflects the true posterior distributionin Equation (4).

B. Methods for Rapid Sampling

While the Bayesian model-based diagnosis approach discussed in Section A provides an effective means ofgenerating probabilistic damage estimates, it generally incurs a significant computational overhead makingit impractical for online SHM. This computational expense is primarily attributed to the MCMC samplingprocess (Algorithm 1) detailed in the previous section. Here, the posterior probability distribution mustbe evaluated to determine the acceptance probability, A, for every sample drawn, so the computationalmodel, M, must also be executed at each iteration. Since generally N ∼ O(103 − 106) samples are neededfor convergence with MCMC, utilizing even a modestly complex model yields intractable analysis times.Furthermore, since there is an explicit dependence on the previous sample drawn at each iteration throughthe acceptance probability, parallelizing MCMC algorithms to leverage high performance computing (HPC)can be challenging.

In this section, three sampling methods are formulated to provide substantial computational speedupfor probabilistic diagnosis (by rapidly evaluating Equation (7)), each utilizing surrogate modeling and HPC.Surrogate modeling relies on the offline pre-computation and storage of input-output pair datasets from theoriginal model so that the posterior probability distribution can be rapidly evaluated during online sampling.Here, a set of T damage parameter arrays {c(k)}Tk=1 is first selected. Then, the model response correspondingto all m measurements are computed and stored for each damage state,

M(k)i ≡Mi(c

(k)), k = 1, ..., T (9)

4 of 15


for i = 1, ...,m. The result is the following T × (d+m) input-output dataset

S = {c(k);M(k)1 , ...,M(k)

m }Tk=1. (10)

Each sampling method to follow differs in how the input-output data, S, is utilized and in the resultingapproach to leveraging HPC. The first approach replaces each model predicted response, Mi, with indi-vidual, pre-trained surrogate models and employs a parallel MCMC algorithm14 that allows the models tobe partitioned across processors for sampling. The second approach constructs a surrogate model for theerror function, Q(c,dobs), on-the-fly to facilitate subsequent rapid MCMC sampling. The third approachdiscretizes the posterior probability distribution function, p(c|dobs), to enable completely parallel directsampling. The following subsections describe each of the three approaches in more detail.

1. Method 1: MCMC with Model Surrogates

The first sampling method adopts an existing approach11,13 of replacing the original computational modelitself with surrogate models and then explores the use of an emerging parallel MCMC algorithm14 to takeadvantage of HPC. From a machine learning perspective, S (Equation (10)) is the training data and avariety of off-the-shelf regression and interpolation algorithms can be utilized to directly infer the input-output mappings. Specifically, a surrogate model that maps a new damage state, c(∗), to the resultingsensor response is generated offline for each individual measurement

M̃i : c(∗) →M(∗)i for i = 1, ...,m, (11)

resulting in a suite of independent surrogate models, {M̃i}mi=1. Then, during online damage diagnosis, theposterior distribution can be efficiently sampled with MCMC using an approximate form of the likelihood(Equation (5)) that relies on these surrogate models

p(dobs|c; {M̃i}) ∝ exp

(− 1

2σ2

m∑i=1

‖dobsi − M̃i(c)‖2). (12)

Since M̃i can generally be evaluated much faster than the original model M, significant computationalspeedup can be obtained with this technique alone.13

To leverage HPC for additional speedup with this approach, multiple independent chains of MCMC cansimply be run on separate computer processors in parallel. While this straightforward parallelization will beused as a baseline method for comparison in this study, a more sophisticated parallel MCMC algorithm14

will be the primary focus for its potential efficiency and scalability benefits. The algorithm, which enablesasymptotically exact and parallel MCMC, was developed for (and has been traditionally applied to) MCMCfor big data applications where it is difficult to store and analyze all data observations on one processor. Themethod allows the data to be randomly partitioned among machines and independent MCMC chains to berun on only subsets of the data in parallel, resulting in improved scalability when utilizing many processors.The crux of the method is the utilization of special combination algorithms to merge samples from eachmachine in a manner that is provably asymptotically exact from the full-data probability distribution. Moredetails of the approach can be found in the original paper.14

To the authors’ knowledge, this is the first application of this parallel MCMC method to damage diagnosisor surrogate model-based inverse problems in general. While the diagnosis application tested later in thispaper uses a relatively small amount of measurement data (far from necessitating a big data method), thereis potential benefit due to the independent/individualized surrogate modeling strategy used (Equation (11)).

Since there is a one-to-one correspondence between each sensor measurement, dobsi , and surrogate model M̃i,partitioning the measurement data across processors effectively partitions the surrogate models as well. Thisstrategy can provide significant computational savings for applications with large amounts of sensor data orwhen the regression/interpolation algorithm used for surrogate modeling is memory and/or computationallyintensive.

2. Method 2: MCMC with Error Function Surrogate

Rather than construct individual surrogate models for each sensor response offline, the second samplingmethod builds a single surrogate model for the error function, Q(c,dobs), on-the-fly after receiving measure-ment data but prior to sampling. In this case, the dataset S is used to first evaluate Equation (2) to generate

5 of 15


new training data describing the input-output relationship between damage parameters and errors

Q = {c(k);Q(k)}Tk=1, (13)

where Q(k) ≡ Q(c(k),dobs). Using this dataset Q, regression or interpolation can be used to construct amapping directly from new damage parameters c(∗) to the resulting error

Q̃ : c(∗) → Q(∗). (14)

Then, the following approximate likelihood function can be formed using the error function surrogate model

p(dobs|c; Q̃) ∝ exp

(− 1

2σ2Q̃(c,dobs)

), (15)

allowing for rapid MCMC sampling of the posterior distribution without the evaluation of models or surro-gates for sensor values themselves. To leverage HPC with this method, the simple parallel MCMC approachof running several independent chains on separate processors will be used.

The benefit of the error function surrogate method is that there is a single surrogate model to evaluatefor each sample drawn using MCMC rather than a suite of models for each sensor prediction, resulting in aspeedup proportional to m during sampling. The disadvantage of the approach is that the surrogate modelfor error, Q̃, must now be constructed online, after receiving measurement data for diagnosis. Thus, thereis an additional initial computational overhead for generating the new training data, Q (which can be donein parallel), as well as fitting a regression/interpolation algorithm to these data to generate the surrogate.Another more subtle issue is that surrogate verification is more difficult in this case compared with the morestandard model surrogate approach in the previous section. Generating surrogates for the model-predictedsensor values is done offline rather than online and therefore time can be taken beforehand to verify theaccuracy and effectiveness of the approximate surrogate models versus the original model predictions. Thissame verification process cannot be carried out with the error surrogate approach, necessitating the futuredevelopment of a priori error estimates/bounds for the approach.

3. Method 3: Direct Probability Discretization & Sampling

Instead of relying on traditional MCMC sampling with an inherently limited degree of parallelism, the thirdsampling approach uses the stored training data (Equation (10)) to discretize and sample the posteriordistribution directly in parallel. To this end, a discretization of the probability distribution is assumed suchthat the parameter space is divided into grid cells with centroids coinciding with the input training dataset{c(k)}Tk=1 from S (i.e. nearest neighbor cells). Using the training dataset, a normalized posterior probabilityvalue can be computed at each grid point. Then, under the assumption of uniform probability within eachnearest neighbor cell, samples can be generated uniformly from each cell in proportion to its normalizedprobability value.

This procedure for direct discretization and sampling of the posterior distribution is described in Al-gorithm 2 below. Here, the evaluation of the posterior probability values for each grid point utilizes the

Algorithm 2 Direct Sampling

for each grid point c(k) do

p(k) ≡ p(c(k)|dobs) = exp(− 1

2σ2

∑mi=1 ‖dobsi −M(k)

i ‖2)

end forpsum = Σk p

(k)

for each grid point c(k) dop̄(k) = p(k)/psum

N (k) = p̄(k) ×NGenerate N (k) random points uniformly from the nearest neighbor cell of c(k)

end for

precomputed and stored model values M(k)i from Equation (10). Note also that since a uniform training

grid is used in this study, the nearest neighbor cells are simply hypercubes centered on the training data,but the approach could be extended to the case of nonuniform grids as well. It can be seen from Algorithm

6 of 15


2 that the approach is highly amenable to parallel computing with all normalized probabilities being com-puted independent of one another. The normalized probabilities are then used to compute psum using anefficient parallel sum reduction routine before communicating this value to all the processors. Thereafter,if N samples are needed from the posterior distribution and P processors are used, each processor simplygenerates N/P samples in parallel. The procedure can be implemented on a CPU with mulitple cores andalso on a graphics processing unit (GPU) device.

The direct sampling approach has the advantage of being highly parallel with the potential of enablingtremendous computational speedup over a MCMC-based sampling approach. If the training grid used is fineenough, the method is also expected to produce a distribution that closely approximates the actual posteriordistribution. Similar to the error surrogate MCMC approach, verifying the accuracy of direct sampling is notas straightforward as the model surrogates approach where models can be trained/verified before diagnosistakes place. While the approach will be shown empirically to yield highly accurate damage estimates inthe results to follow, development of an a priori approach for verification of the method is worthy of futurestudy.

III. Results

A. Application - Strain-Based Crack Characterization

While the damage diagnosis algorithms presented in Section II are general with respect to the measurementdata, dobs, model, M, and damage description, c, they are demonstrated here on the specific problem ofstrain-based crack characterization in thin plates. A diagram illustrating this application can be seen inFigure 1. Here, the damage is characterized by four parameters defining the location, size, and orientation ofa straight crack in the plate: c = [x, y, a, θ]. The measurement data, dobs, used to infer these crack parametersare m strain observations throughout the domain. The model,M, is a linear elastic FE simulation using theScalable Implementation of Finite Elements by NASA (ScIFEN)16 parallel FE code. The implementationof surrogate models and MCMC sampling for the three damage diagnosis algorithms was carried out inPython.17

!!

!!!!!!!!!!!!!!!!!!!!!!

a

x

y

✓

Sensors

w

h

⇥Mean Estimate

True Location

�u

Figure 1. Diagram of the problem domain and damage parameterization considered for the strain-based crackcharacterization application.

The measurement data, dobs, for the succeeding examples were generated by inserting a target damage,cref, into the finite element model, storing strains at predetermined sensor locations, and adding randomnoise according to Equation (6) to simulate sensor errors. Here, the noise level σ was selected such that therewas approximately 15% sensor error with respect to the measurement in each case. The plate geometry had awidth (w) of 96mm, height (h) of 215mm, and thickness of 5mm. The Poisson ratio was 0.3 and the applieddisplacement (∆u) was 1.0mm (the Young’s Modulus was arbitrary since the quantity of interest was strain).Two separate diagnosis examples are considered: 1) damage localization - estimating the location of the crackonly (c = [x, y]) and 2) damage characterization - estimating the location, size, and orientation of the crack(c = [x, y, a, θ]). The accuracy and computational efficiency of the three sampling approaches formulatedin Section II are compared on these two examples for varying numbers of measurements (m), training datasizes (T - Equation (10)), and number of samples drawn (N). The results for damage localization and

7 of 15


characterization are provided in the remaining subsections after a brief comparison of different surrogatemodeling algorithms.

B. Surrogate Model Comparison

In this section, a brief comparison of the surrogate modeling algorithms considered in this work is provided.Recall that the MCMC with model surrogates approach (Section II.B.1) requires pre-trained/stored surrogatemodels for each sensor value (Equation (11)) using interpolation or regression methods. Therefore, severalalgorithms can be tested offline before damage diagnosis is performed to identify the most accurate andfastest technique for surrogate modeling. Three algorithms were tested here, Gaussian process and K-nearest neighbors regression from the scikit-learn Python module18 and multilinear interpolation fromthe SciPy module.19 For brevity, only the results for the damage localization surrogate models (mappingdamage location to predicted sensor values) are shown while a similar analysis was conducted for damagecharacterization as well.

Training datasets (Equation (10)) of different sizes (T = {200, 800, 1800, 3200, 5000, 7200, 9800}) were firstgenerated using FE simulations. A testing dataset of 1000 randomly drawn crack locations and the resultingpredicted strains at the sensor locations was also generated for evaluating surrogate model accuracy. Thethree algorithms were then used to train surrogate models for each training dataset and generate predictionson the testing dataset, calculating the error versus the original FE solution. The results of this comparisonare shown in Figure 2. Figure 2(a) shows the median relative error while Figures 2(b) and 2(c) compare thetraining and prediction times for the different models tested, respectively.

0 2000 4000 6000 8000 10000Number Training Points

10-1

100

101

Med

ian

Rel

ativ

e E

rror

(%)

Linear Interpolation

Nearest Neighbors

Gaussian Process

(a)


10-5

10-4

10-3

10-2

10-1

100

101

Trai

ning

Tim

e (s

ec)


Nearest Neighbors

Gaussian Process

(b)


10-3

10-2

10-1

100

101

102

Pre

dict

ion

Tim

e (s

ec)


Nearest Neighbors

Gaussian Process

FEM

(c)

Figure 2. Performance comparison of three different surrogate modeling algorithms in terms of a) relativeerror, b) training time, and c) prediction time.

Accuracy and prediction time are most important for the effectiveness of the model surrogates samplingapproach, and it is clear from Figure 2 that the linear interpolation algorithm is most accurate and efficientin this particular application. Note that the average FE solution time was approximately 40 seconds and isdepicted in Figure 2(c) for comparison. Here, the substantial computational speedup that surrogate modelingprovides is evident as all three methods deliver several orders of magnitude improvement in predictiontime. While offline training time has less performance impact for online analysis, it can be seen thattraining the Gaussian process algorithm can be prohibitively expensive for larger datasets and so it was onlyconsidered for training data sizes up to T = 5000. Training time is critical, however, for the error surrogatesampling approach (Section II.B.2) where the surrogate model must be constructed on-the-fly during damagediagnosis. In this case, the nearest neighbor algorithm was chosen based on its significantly faster trainingtimes and comparative accuracy. Note that the nearest neighbor algorithm was also used for the damagecharacterization example, as it shows the best scaling with problem dimension and training dataset size.

C. Damage Localization

The sampling algorithms presented in Section II were first compared on a damage localization example.Here, the location of a crack (c = [x, y]) was estimated using simulated strain data assuming a known cracklength and orientation. The performance of the methods was studied for different numbers of measurements(m = {12, 24, 48, 100, 200}) and training grid sizes (T = {200, 1800, 3200, 5000, 9800}) as well as in serial(on 1 computer core) and in parallel (between 2-4 cores) on a quad-core 2.4GHz AMD Opteron processor.

8 of 15


In order to evaluate the accuracy of the methods, a reference solution was obtained for each number ofmeasurements considered by generating 10,000 samples from the posterior p(c|dobs) using serial MCMCwith the original FE simulation. Treating this collection of samples as the ground truth for each case, theaccuracy of the samples drawn with the three proposed methods was evaluated using the multidimensionalKolmogorov-Smirnov (KS) test.20

First, an example of the probabilistic damage localization solutions generated with each method is shownin Figure 3 for the case of m = 100 measurements and T = 5000 training data executed on P = 4 processors.Here, the probability density of the crack location is shown for a) the reference solution, b) MCMC withmodel surrogates, c) MCMC with an error surrogate, and d) direct probability sampling. The computationtime to generate each solution was approximately a) 4.5 days, b) 8.4 seconds, c) 2.7 seconds, and d) 0.3seconds. While the approximate probability distributions show generally good agreement with the referencedistribution, the model surrogates approach is noticeably less accurate in comparison to the error surrogateand direct sampling approaches.

20 30 40 50 60x (in)

100

105

110

115

120

125

130

135

140

y (

in)

Reference

Low

High

(a)

20 30 40 50 60x (in)

100

105

110

115

120

125

130

135

140

y (

in)

Model Surrogates

Low

High

(b)

20 30 40 50 60x (in)

100

105

110

115

120

125

130

135

140

y (

in)

Error Surrogate

Low

High

(c)

20 30 40 50 60x (in)

100

105

110

115

120

125

130

135

140

y (

in)

Direct Sampling

Low

High

(d)

Figure 3. Crack location probability contours for m = 100 measurements: a) reference solution using serialMCMC and FE simulation and approximate solutions using b) MCMC with model surrogates, c) MCMC withan error surrogate, and d) direct probability sampling.

A more comprehensive assessment of each sampling algorithm’s performance for damage localization canbe seen in Figure 4. Here, the error (as evaluated using the KS test) and computation time are illustrated foreach proposed approach and compared with a naive parallelization method (Model Surrogates (Baseline))that entails independent MCMC chains of duplicated model surrogates across each computer core. Figure4(a) compares the performance of each method for different numbers of measurements (m) for fixed trainingdata size (T = 5000) and number of processors (P = 4). Figure 4(b) provides a comparison for varyingtraining data sizes (T ) for a fixed number of measurements (m = 48) and number of processors (P = 4).Finally, Figure 4(c) demonstrates the scaling of each method for different numbers of processors (P ) whileholding the number of measurements (m = 48) and training data sizes (T = 5000) constant.

From Figure 4(a), it is clear that the direct sampling approach is significantly more efficient than the othertwo proposed sampling methods and the baseline implementation for each different number of measurementsconsidered. Furthermore, the accuracy of direct sampling is comparable or better than each of the other

9 of 15


10-1 100 101 102

Computation Time (sec)

0.0

0.1

0.2

0.3

0.4

0.5

Err

or

m = 12m = 24m = 48m = 100m = 200

Model Surrogates (Baseline)

Model Surrogates (Partitioned)

Error Surrogate

Direct Sampling

(a)

10-1 100 101 102


0.0

0.1

0.2

0.3

0.4

0.5

Err

or

T = 200T = 1800T = 3200T = 5000T = 9800



Error Surrogate

Direct Sampling

(b)

10-1 100 101 102


0.0

0.1

0.2

0.3

0.4

0.5E

rror

P = 1P = 2P = 3P = 4



Error Surrogate

Direct Sampling

(c)

Figure 4. Performance comparison of the proposed sampling methods for the damage localization examplefor (a) different numbers of measurements (with T = 5000 and P = 4), (b) different training dataset sizes (withm = 48 and P = 4) and c) different number of processors (with m = 48 and T = 5000).

methods. Another observation is that the computation time for direct sampling and error surrogate samplingare less sensitive to the number of measurements used in comparison with the model surrogates approach.This is due to the fact that a surrogate is trained and evaluated for each individual measurement in thelatter approach and therefore the complexity of the method scales linearly with m. While it can be seen thatthe proposed partitioned parallel approach for model surrogates sampling provides significant computationalspeedup over the baseline approach, it also incurs a noticeable penalty in terms of accuracy. This source oferror will be discussed further in the next section.

Figure 4(b) demonstrates the effect of training data size on the performance of each sampling method. Theplot illustrates that direct sampling results in the most efficient solutions with accuracy that is comparableor better than the other methods for varying training dataset sizes. As T increases, there is generally anincrease in computation time and decrease in error for each method, which is expected as the resolutionof the training grid is increased. The computation time for the model surrogates (both partitioned andbaseline) approach shows the most significant dependence on training data size. Again, partitioning themodel surrogates across processors results in a significant efficiency increase at the expense of accuracy.

Finally, the scalability of each approach for different numbers of processors is illustrated in Figure 4(c).The partitioned model surrogates and error surrogate approaches appear to enable the best scalability inthe sense that the decrease in computation time is most significant as processors are added in comparisonto the other approaches. The partitioned model surrogates approach achieves better scalability with respectto the baseline approach since the number of surrogates that must be evaluated on each processor duringsampling is decreased by a factor of P . However, it can be seen that the accuracy is negatively impactedfor increasing P since less data will be present on each processor, resulting in less information from whichto infer the damage during sampling. Note that the performance for the baseline and partitioned modelsurrogates approach coincide for P = 1 since the methods only differ in how they are parallelized (P > 1),so these markers coincide in Figure 4(c). With respect to the reference implementation with serial MCMCand FE simulation that took over four days to perform damage localization here, each of the proposedsampling approaches provide tremendous computational speedup with a relatively high degree of accuracy.

10 of 15


In particular, the model surrogates, error surrogate, and direct sampling approaches provide O(104), O(105),and O(106) computational speedup, respectively.

D. Damage Characterization

The performance of the proposed sampling methods is now illustrated for general crack characterization wheresimulated strain data were used to infer the location, size, and orientation of a crack in the domain (c =[x, y, a, θ]). The performance of each sampling method is compared for different numbers of measurements(m = {12, 24, 48, 100}) and different training data sizes (T = {10143, 19712, 39780}), where a substantialincrease in grid size is necessary to accommodate the increased dimension of the unknown parameters(d = 4). Each method was again executed on between P = 1 and P = 4 processors and reference solutionswere obtained using serial MCMC with FE simulation, drawing 10,000 samples for each approach. For thisexample, the evaluation of the KS score for the approximate samples was not feasible since the computationalcost increases exponentially with dimension so qualitative comparisons against the reference solution wereused instead. An efficient and scalable implementation of the KS score computation is an area of continuingresearch to rigorously study the sampling method accuracy for higher dimensions.

Figure 5 provides a qualitative assessment of the accuracy of each sampling method versus the referencesolution for damage characterization. Here, the cumulative distribution functions (CDFs) for each unknowncrack parameter are shown for each sampling method and the reference implementation for the particularcase of m = 48 measurements and T = 19712 training data points executed on P = 2 processors. Thecomputation time required to produce these results was 4.5 days for the reference implementation withserial MCMC and FE simulation, 44.2 seconds for the model surrogates MCMC approach, 4.5 seconds forthe error surrogate MCMC approach, and 0.4 seconds for direct probability sampling. While each methodprovides substantial computational speedup, Figure 5 shows that they also provide accurate and comparableapproximations to the probability distribution of the unknown parameters as well. Note that while only oneset of solutions are provided here for brevity, similar qualitative trends in accuracy were seen for differentparameter combinations as compared with the damage localization results in Figure 4. Here, decreasing thenumber of measurements and training dataset size generally increased errors slightly for all the methods,while varying the processors only affected the partitioned model surrogates approach, where the accuracydecreased with increasing number of processors. Future research will seek to provide more quantitativeassessments of errors for the higher dimensional damage characterization case.

Figure 6 represents a more in depth look at the performance of the sampling methods, comparing theircomputation times against the naive parallelization approach with model surrogates as a baseline. Figure6(a) plots computation time as a function of the number of measurements (m) for a fixed training data size(T = 19712) and number of processors (P = 4). Figure 6(b) shows computation time as a function of trainingdataset size (T ) for a fixed number of measurements (m = 48) and number of processors (P = 4). In Figure6(c), the normalized computation time is shown (normalized by the time for P = 1) as a function of thenumber of processors used to demonstrate the scalability of the methods for a fixed number of measurements(m = 48) and training data size (T = 19712). Overall, it can be seen that direct probability sampling isthe most efficient by a substantial margin, followed by the error surrogate MCMC approach, and then themodel surrogates MCMC approach. The approach consistently yields damage estimates in under one secondirrespective of the parameter {m,T, P} being varied. It is also observed that utilizing the parallel MCMCapproach,14 that partitions model surrogates across processors, provides significant computational speedupover the baseline approach.

Figure 6(a) demonstrates the dependency of computation times on the number of measurements usedfor each approach. It can be seen that the model surrogates approach displays a solution time roughlyproportional to the number of measurements since a surrogate model is trained/evaluated for each individualsensor value. On the other hand, the number of measurements in the range tested here display a negligibleimpact on the error surrogate MCMC approach and direct probability sampling. Similarly, the size of thetraining dataset used shows little effect on the computation time for each sampling method, as seen inFigure 6(b). This lack of sensitivity is primarily due to the efficient scaling of the prediction times withtraining dataset size (Figure 2(c)) for the nearest neighbor regression algorithm, which was used for themodel surrogates and error surrogate approaches in this example. It is also likely that the relatively smallrange in training dataset sizes considered in this example contributes to the apparent lack of sensitivity incomputation times.

Finally, Figure 6(c) demonstrates the scalability of each approach by showing the impact of the number of

11 of 15


25 30 35 40 45 50 55 60 65 70x (in)

0.0

0.2

0.4

0.6

0.8

1.0

CD

F

Reference

Model Surrogates

Error Surrogate

Direct Sampling

80 90 100 110 120 130 140y (in)

0.0

0.2

0.4

0.6

0.8

1.0

CD

F

Reference

Model Surrogates

Error Surrogate

Direct Sampling

5 10 15 20 25 30 35a (in)

0.0

0.2

0.4

0.6

0.8

1.0

CD

F

Reference

Model Surrogates

Error Surrogate

Direct Sampling

1.5 1.0 0.5 0.0 0.5 1.0 1.5θ

0.0

0.2

0.4

0.6

0.8

1.0

CD

F

Reference

Model Surrogates

Error Surrogate

Direct Sampling

Figure 5. Comparison of the estimated cumulative distribution functions using each sampling method withthe reference solution for each crack parameter a) x, b) y, c) a, and d) θ. The results shown are for m = 48measurements and a training grid of size T = 19712 executed on P = 2 processors.

20 40 60 80 100Number of Measurements (m)

100

101

102

103

104

Com

puta

tion

Tim

e (s

ec) Model Surrogates (Baseline)


Error Surrogate

Direct Sampling

10000 20000 30000 40000Training Data Size (T)

100

101

102

103

104

Com

puta

tion

Tim

e (s

ec) Model Surrogates (Baseline)


Error Surrogate

Direct Sampling

1 2 3 4Number of Processors (P)

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

Nor

mal

ized

C

ompu

tatio

n Ti

me



Error Surrogate

Direct Sampling

Figure 6. Performance (computation time) comparison of the proposed sampling methods for the damagecharacterization example for (a) different numbers of measurements (with T = 19712 and P = 4), (b) differenttraining dataset sizes (with m = 48 and P = 4) and c) different number of processors (with m = 48 and T = 19712).

12 of 15


processors used on the resulting normalized computation time. While the direct sampling approach yields thefastest computation times by far, it also shows the smallest dependence on increasing processor count. Here,given the efficiency observed on just one processor, it is likely that the training dataset size and number ofmeasurements needs to be increased significantly before parallel performance gains can outweigh the inherentcommunication overheads with the approach. While the remaining approaches generally show a performanceincrease when using more processors, the partitioned model surrogates MCMC approach demonstrates thebest scalability as the computation time consistently improves with each added processor. This indicatesthat this method has the potential to benefit applications where there is a large amount of measurementdata and access to adequate HPC resources.

E. Discussion

Some general observations and comparisons of the three proposed sampling methods will now be made in ref-erence to the formulations and performance results from the preceding sections. First, there is an assumptionmade with each method that the pre-computed/stored training dataset (Equation (10)) is large/extensiveenough to facilitate accurate approximations during diagnosis without additional evaluations of the originalmodel, M. This may not be feasible in cases where the unknown damage, c, has a high dimension, whenthese parameters take a wide range of values or are unbounded, or when M is too computationally inten-sive. It can be argued, however, that damage diagnosis is particularly well-suited for this type of approachsince typically the dimension of c will necessarily be low since sensor measurements will not be sensitiveto higher order descriptions of damage (e.g. inferring crack curvature versus crack length). Furthermore,the parameters describing damage will generally have well-defined bounds (e.g. the damage must lie withinthe component geometry) that can be readily discretized. Finally, the generation of the training dataset isdone offline and can be done completely in parallel, fully utilizing available HPC resources to alleviate thecomputational expense.

While it may be feasible to generate an adequate training dataset for the proposed diagnosis methods,knowing beforehand how large it must be to be considered adequate can be more challenging. For the caseof the MCMC approach with model surrogates (Section II.B.1), it is straightforward to perform an offlineverification study between the surrogate models and the original model. More input/output pairs fromthe original model can continually be added to the training dataset until a prescribed accuracy has beenobtained, along the lines of Figure 2(a). However, since both the MCMC with an error surrogate approach(Section II.B.2) and direct probability sampling (Section II.B.3) utilize the training dataset in an onlinemanner after measurement data have been obtained, this same type of offline verification is not possible.While the methods demonstrated a high degree of accuracy in the diagnosis examples presented here, asystematic method to estimate or bound the error in the recovered probability distribution a priori will bea worthwhile topic for future work.

Despite the potential difficulty with offline verification of its accuracy, the direct probability samplingmethod is the one approach studied here that may enable real time damage diagnosis in its current state.The approach consistently yielded solution times that were well below one second for all test cases inboth the damage localization and damage characterization examples considered here, showing potentialfor time-critical discrete damage event applications. In comparison to the other methods, direct samplingprovided six orders of magnitude computational speedup over the reference serial MCMC implementationwith FE simulations and was over two orders of magnitude faster than the baseline model surrogates parallelapproach. The straightforward, brute force-nature of the approach also circumvents many of the traditionalcomplications associated with MCMC algorithms, including generating an appropriate initial guess, tuningthe proposal distribution parameters, fully resolving multimodal distributions, and assessing convergence.As this style of algorithm is amenable to parallel computing, the migration of the method to use GPUcomputing will be explored in future work as well.

Finally, the performance of the parallel MCMC algorithm14 for the model surrogates approach is worthyof discussion. It was demonstrated that the method yields more efficient and scalable damage diagnosisestimates relative to the simple baseline parallel approach, but generally at the expense of accuracy. Theerrors observed here can be mainly attributed to two causes. First, the algorithm was originally designedfor big data problems involving so many measurements/observations that they may not all fit on a singlemachine. Damage diagnosis applications will generally have relatively fewer measurements, and the resultspresented herein showed that the accuracy generally degraded as the number of measurements decreasedand number of processors used increased. In these cases, it is likely that some or all processors may have

13 of 15


too few measurement data points to generate accurate estimates of the damage parameters. This inaccuracycould potentially be improved by developing heuristics for partitioning the sensor data appropriately amongprocessors in cases of sparse measurements or by estimating lower limits on the number of measurement datapoints needed on each processor. The second source of errors observed with the parallel model surrogatesMCMC approach was the choice of combination algorithm used in this study. The original work proposedthree combination algorithms for creating a final collection of samples from the full-data distribution basedoff each processor’s samples and the crudest, but most efficient, of these algorithms was used in this work.Future work will examine the efficiency/accuracy tradeoffs of each of the combination algorithms for damagediagnosis applications.

IV. Conclusion

Motivated by the Digital Twin structural health management concept, this study presented new ap-proaches to enable high fidelity, probabilistic damage diagnosis in near real time. While the foundation ofthese methods is based on finite element modeling, Bayesian inference, and surrogate modeling, the focuswas on reformulating traditional numerical sampling algorithms to leverage high performance computing togain substantial computational speedup. To this end, three distinct methods for accelerating sampling wereproposed and compared on the application of strain-based crack characterization. The accuracy, computa-tional efficiency, and scalability of the methods were illustrated for two examples of damage localization anddamage characterization.

While each parallel approach demonstrated several orders of magnitude improvement in computationalefficiency over a sequential Bayesian approach with finite element simulation, the particular strengths andweaknesses of each approach were illustrated and discussed. In particular, the direct probability discretizationand sampling approach was the fastest method on the examples tested, consistently yielding probabilisticdiagnosis estimates in well under one second, and retained a high degree of accuracy with respect to referencesolutions. To this end, the approach shows potential for enabling time-critical diagnosis for discrete damageevents for online SHM frameworks like Digital Twin. The difficulty of the direct sampling method withrespect to a more conventional model surrogates-based MCMC approach, however, is the absence of astraightforward means of verifying/estimating the accuracy of the approximation a priori. Thus, formulatingerror bounds and studying the convergence properties of the approach with respect to the size of the availabletraining dataset is a worthwhile area of future research.

References

1Barthorpe, R. J., On Model- and Data-Based Approaches to Structural Health Monitoring, Ph.D. thesis, University ofSheffield, 2010.

2Idier, J., Bayesian Approach to Inverse Problems, Wiley, New York, 1st ed., 2008.3Gamerman,, D. and Lopes, H. F., Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Chapman

and Hall/CRC, Boca Raton, Florida, Second ed., 2006.4Haario, H., Saksman, E., and Tamminen, J., “Adaptive proposal distribution for random walk Metropolis algorithm,”

Computational Statistics, Vol. 14, No. 3, 1999, pp. 375–395.5Haario, H., Saksman, E., and Tamminen, J., “An adaptive Metropolis algorithm,” Bernoulli , Vol. 7, No. 2, 04 2001,

pp. 223–242.6Haario, H., Laine, M., and Mira, A., “DRAM: Efficient adaptive MCMC,” Statistics and Computing, Vol. 16, No. 4,

2006, pp. 339–354.7Vrugt, J., ter Braak, C. J. F., Diks, C. G. H., Robinson, B. A., Hyman, J. M., and Higdon, D., “Accelerating Markov

chain Monte Carlo simulation by differential evolution with self-adaptive randomized subspace sampling,” Internation Journalof Nonlinear Sciences and Numerical Simulation, Vol. 10, No. 3, 2009, pp. 271–288.

8Marzouk, Y. M., Najm, H. N., and Rahn, L. A., “Stochastic spectral methods for efficient Bayesian solution of inverseproblems,” Journal of Computational Physics, Vol. 224, 2006, pp. 339–354.

9Galbally, D., Fidkowski, K., Willcox, K., and Ghattas, O., “Non-linear model reduction for uncertainty quantification inlarge-scale inverse problems,” Internation Journal for Numerical Methods in Engineering, Vol. 81, 2009, pp. 1581–1608.

10Wang, J., and Zabaras, N., “Using Bayesian statistics in the estimation of heat source in radiation,” International Journalof Heat and Mass Transfer , Vol. 48, 2005, pp. 15–29.

11Meeds, E., and Welling, M., “GPS-ABC: Gaussian process surrogate approximate Bayesian computation,” CoRR,Vol. abs/1401.2838, 2014.

12Warner, J. E., and Hochhalter, J. D., “Probabilistic damage characterization using a computationally-efficient Bayesianapproach,” NASA/TP-2016-219169, 2016.

13Warner, J. E., Hochhalter, J. D., Leser, W. P., Leser, P. E., and Newman, J. A., “A computationally-efficient inverse

14 of 15


approach to probabilistic strain-based damage diagnosis,” Annual Conference of the Prognostics and Health ManagementSociety, Denver, CO, October 2016.

14Neiswanger, W., Wang, C., and Xing, E., “Asymptotically exact, embarrassingly parallel MCMC,” arXiv preprintarXiv:1311.4780 , 2013.

15Kaipio, J. and Somersalo, E., Statistical and Computational Inverse Problems, Springer, 2004.16Warner, J. E., Bomarito, G. B., Heber, G., and Hochhalter, J. D., “Scalable Implementation of Finite Elements by NASA

- Implicit (ScIFEi),” NASA/TM-2016-219180, 2016.17Python Software Foundation, “Python Language Reference, version 2.7,” 2016.18Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort,

A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., and Varoquaux, G., “API design for machine learning software:experiences from the scikit-learn project,” ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013,pp. 108–122.

19Jones, E., Oliphant, T., Peterson, P., et al., “SciPy: Open source scientific tools for Python,” 2001–.20Fasano, G., and Franceschini, A., “A multidimensional version of the Kolmogorov-Smirnov test,” Monthly Notices of the

Royal Astronomical Society, Vol. 225, March 1987, pp. 155–170.

15 of 15


near real-time probabilistic damage diagnosis using

Documents