big data meets evm (submitted).pptx
TRANSCRIPT
Thanks for attending our session today.
We're going to take a quick tour through a touchy subject: unanticipated growth in the EAC.
We all know about this, have read about it, and most likely work on programs that are in that condition, if not in worse conditions, like OTB or OTS.
We'll present two solutions to the forecasting of the EAC that address the core problems with today's approach:
• The EAC is not statistically sound.
• Risk is not included in the EAC.
• Compliance with Technical Performance Measures is not considered in the EAC calculation.
These solutions make use of existing data in the Earned Value Central Repository of the DoD, using tools available for free.
1
We all know of troubled programs. Programs that are OTB. Programs that are OTS.
Programs that have failed to deliver their expected value on time and on budget.
The literature on Nunn-McCurdy has detailed the root causes of many of these issues. But even if a program didn't go Nunn-McCurdy, the same root causes are likely in place.
The Earned Value data from the program can't address the technical aspects of program performance. EV data is a secondary indicator of technical performance shortfalls. But EV data can provide an indicator of future EAC growth.
This presentation will speak to mathematical methods of mining the data in the EV-Central Repository (EV-CR), in an attempt to construct a statistically sound EAC in support of forecasting future growth.
2
We've all seen these pictures of unanticipated growth.
I say unanticipated because, if we know cost and schedule growth is coming, we can do something about it.
In the current approach to performance analysis, much of the growth is unanticipated for a simple reason:
• The data in the EV-CR is used as descriptive data. That is, it is analyzed from the point of view of past performance.
• Of course there are EAC calculations. But these calculations use the EV data in ways that wipe out past variances and use only current-period data to make a forecast of future performance.
This is statistically unsound at best, and naïve use of the data at worst. This is strong language, but it is mathematically true. Time series forecasting has been around a long time. Every high school statistics class has a section on time series forecasting. Every biology, chemistry, and physics class does as well. Social sciences, marketing, sales, ecology, sports coaching: nearly every field has some understanding of time series forecasting.
But the EV-CR and its reports use a formula that is missing the past statistical behaviors, the past risks, the future risks, and the time-evolution of the underlying statistical processes driving the program's behavior.
3
With the data held in the EV-CR we have an opportunity to change how we analyze the program's performance using statistically sound processes.
This is currently called BIG DATA in commercial, scientific, and mathematical domains.
There are three types of data analysis processes in the BIG DATA world:
1. Descriptive – which is what we do when we are looking at the IPMR
2. Predictive – which is the EAC calculations. These are of course naïve predictions, for the reasons mentioned before
3. Prescriptive – which is where we want to get to eventually
The descriptive and current predictive forecasts also fail in one important way. With the data, they don't tell the Program Manager what to do about the upcoming unfavorable outcomes. Where to look, how to fix them. How to conduct what-if assessments of the program, given past performance.
In other words – nice report, what do you expect me to do about it?
4
Let's look at the current descriptive analytics we get from the EV-CR.
We get lots of data. Many would say too much data. But in the BIG DATA paradigm we want more data. The more data we have, the better chance we have of finding what we're looking for.
This is counterintuitive for the non-mathematical among us. But it is in fact true. This is the basis of all BIG DATA initiatives, from Google, to Safeway, to the science and medical industries.
5
The Defense Acquisition University Gold Card lays out the formulas for computing the Estimate At Completion; a sketch of the most common CPI-based form follows the lists below.
These formulas are:
• Linear – addition and multiplication of EV variables
• Non-statistical – the use of cumulative values wipes out the variance information from past performance
• Non-risk adjusted – no forward impacts on performance from risk are used
• Assume stationary behaviors – as the program moves from left to right, the underlying statistical processes are likely to change in their behavior
This means the EAC does not address:
• Future risks to performance
• The non-stationary behavior of the underlying statistical processes that drive variance
• The non-stationary behavior of the risk probability distribution functions
• The coupling between work elements and deliverables that is not visible in the WBS and only visible in the physical system architecture, usually contained in the CAD system
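As a quick illustration of how simple the current calculation is, here is a minimal sketch of the familiar CPI-based Gold Card form in R. The numbers are hypothetical; the point is that the EAC is an arithmetic combination of cumulative point values, with no variance, risk, or time-series behavior attached.

```r
# Minimal sketch of the CPI-based Gold Card EAC using cumulative point values.
# All figures are hypothetical ($K); note there is no variance or risk term anywhere.
bcwp_cum <- 4200    # cumulative earned value (BCWP)
acwp_cum <- 4800    # cumulative actual cost  (ACWP)
bac      <- 10000   # budget at completion

cpi_cum <- bcwp_cum / acwp_cum                    # cumulative Cost Performance Index
eac_cpi <- acwp_cum + (bac - bcwp_cum) / cpi_cum  # EAC = ACWP + remaining work / CPI
eac_cpi
```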
6
We've skipped over the predictive analytics for now and moved to the prescriptive analytics.
Prescriptive is what we want. Improved predictive we'll come back to.
Without prescriptive analytics, we may know what is going to happen but have no way of doing anything about it in an analysis-of-alternatives way.
7
With prescriptive analytics, we can assess our alternatives for taking corrective actions.
The milestone picture is here for effect. When we hear about assessing performance with milestones, we need to think about what the milestone actually represents – both now and in the Roman Empire.
Milestones were rocks on the side of the road marked with the distance back to Rome. You only knew the distance back to Rome when you passed the milestone and looked back.
We can't really manage the program to success using milestones, because we don't know we're late until we've passed the milestone.
We need better forecasting of future performance. We have the data on a per-period basis in the EV-CR. We need to use it to forecast future performance in a statistically sound manner.
8
With the data submitted monthly to the EV-CR, using the WBS as the primary key – Format 1 – we now have the ability to start doing statistical time series forecasting in ways not available in the past.
But the current static reporting processes – essentially looking at the contents of Format 1 with a viewer – offer little insight into the statistical nature of the program's performance.
It is this statistical behavior that we need to know about.
This comes from a fundamental principle of how all projects work. The variables of a project – cost, schedule, technical performance – are random variables. Some can be controlled, some cannot. But all these variables are generated by underlying stochastic processes. Some of these processes are stationary – they don't change with time. Some are non-stationary – they change with time.
This information – for the most part – is already in the EV-CR. We need to get at it and apply our tools to reveal information we currently don't have access to.
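As a notional sketch (not part of the original slide), one way to see whether a reported series behaves as a stationary or non-stationary process is an Augmented Dickey-Fuller test from the R 'tseries' package; the CPI values below are made up for illustration.

```r
# Sketch: is a monthly CPI series stationary or drifting?
# 'cpi' here is a hypothetical twelve-month series, not program data.
library(tseries)

cpi <- ts(c(1.02, 0.99, 0.97, 0.96, 0.94, 0.93, 0.91, 0.90, 0.88, 0.87, 0.86, 0.84),
          frequency = 12, start = c(2013, 1))

adf.test(cpi)        # a large p-value is consistent with a non-stationary (drifting) process
adf.test(diff(cpi))  # differencing is one standard way to work with such a series
```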
9
This picture shows what we all know. The probability of being hit by a hurricane on the east coast of North America depends on where you live. The probability is not uniform.
The statistical processes that drive the creation of storms in the Atlantic are actually well understood. But modeling the motion of the storms requires that they start on a path and have some history before an estimate of where they will strike land can be developed.
In the same way, our program performance forecasting requires that we have some past performance of the work processes, labor utilization, technical performance, and other variables before we can start making forecasts of where the project is going.
While forecasting hurricanes is a complex process, forecasting where an indicator like the Cost Performance Index is going is relatively straightforward, given the past performance of the indicator.
For the moment we'll make some simplifying assumptions to show how this can be done, using the data we already have in the EV-CR.
10
Let's remind ourselves again of what we're working with.
The EV-CR contains data reported at the end of each month. The CPI and SPI are calculated from this data. The prior months are cumulated and the current month is used to calculate our Estimate At Completion.
In fact the system that generates these numbers is a non-stationary stochastic process. This system is generating random numbers from the underlying statistical processes. Yet we're treating them as if they are NOT random numbers drawn from an underlying probability distribution, but are "accounting" numbers – point values with no attached variance.
11
With the EV-CR, we need a simple, inexpensive tool to start our statistical assessment.
The R programming language is that solution. It provides the needed statistical analysis tools, including the ARIMA and PCA functions.
Starting with the time series from the EV-CR, we can forecast future values of CPI and SPI, given the monthly EV numbers.
R is well known in many other domains, but is just getting started in the DoD community. There are lots of training materials, books, working code, and user groups through "Meetup" in nearly every major city around the world.
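As a hedged sketch of the first step, here is one way a monthly export from the EV-CR might be pulled into R as a time series. The file name and column names are hypothetical; the actual EV-CR export format will differ.

```r
# Sketch: turn a hypothetical monthly EV-CR export into R time series objects.
ev <- read.csv("evcr_format1_export.csv")   # assumed columns: period, bcws, bcwp, acwp

cpi <- ts(ev$bcwp / ev$acwp, frequency = 12, start = c(2013, 1))  # Cost Performance Index
spi <- ts(ev$bcwp / ev$bcws, frequency = 12, start = c(2013, 1))  # Schedule Performance Index

plot(cpi, main = "Monthly CPI (hypothetical EV-CR export)")
```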
12
With a sample CPI time series from an actual program, here's an example of the four lines of R needed to produce a forecast.
The heavy lifting of this approach starts with credible time series data from the program.
Then comes formatting that data into a raw structure usable by R.
From that point on, it is literally as simple as the four lines of R in this example.
Of course, knowledge of ARIMA and the details of setting up the parameters is necessary, so we're not glossing over that.
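The four lines shown on the slide are not reproduced in these notes, but a comparable sketch using the widely used 'forecast' package might look like the following, where 'cpi' is a monthly ts object such as the one built earlier.

```r
# Sketch only: an ARIMA fit and six-month forecast of a monthly CPI series.
library(forecast)

fit <- auto.arima(cpi)        # let the data select the ARIMA order
fc  <- forecast(fit, h = 6)   # forecast the next six months with prediction intervals
plot(fc)
```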
In our paper, we speak to some of the details of time series analysis and the related Principal Components Analysis, but for now – in our limited time – we'll assume all of this is understood and applicable to your own database of program performance data.
13
Let’s take a short diversion.
There is a dark secret of Earned Value. The units of measure of Earned Value Management are dollars. Not time, not business value, not technical performance. No other measure than dollars. And these dollars are budget dollars, not funding dollars.
The second secret is that the IPMR reports wipe out the past statistical variances of the variables through the cumulative collection of the past. This actually prevents a credible forecast of the EAC, since there is no past performance at the detailed level on which to base a forecasting algorithm.
In general, the non-statistical nature of the current EAC calculations lays the groundwork for unanticipated EAC growth. This needs to be fixed on the contractor side as well as the government side. Since the data of every project is actually generated by a non-stationary stochastic process – each work activity has aleatory uncertainty driving its actual duration – the stochastic nature of the program is always present.
The next secret is that the EV reports have no representation of the correlative effects of the work activities. We all know one late activity drives others to be late, but the static reports have no way to represent the connectivity of the work activities in a stochastic network of activities.
The last secret is that none of the forecasts consider future risks to performance.
14
ARIMA has been around for a long time and has been applied to forecasting time series for that same long time.
Up to this point the data for EV has not been available in electronic form. With the EV-CR, the data can be used to forecast the future using ARIMA. The XML data stream provides this. Of course this data has to be well formed, and that is still an attribute that needs to be confirmed.
But for the moment, let's assume it is. Then pointing the R tool at the EV-CR can reveal a statistically sound forecast of the EAC at any level of detail needed.
We have to remember Darrell Huff's How to Lie With Statistics and its point about hiding the variance by aggregating those variances to the top.
Analyzing the IPMR at the lowest level of the WBS (Format 1) is a daunting task. So let's hire a computer to do that for us. The current viewers provide a view into the data, but it is still more of the same: no predictive analysis to show the drivers of the variance, and no prescriptive analytics to show what to do about the variance.
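As a notional sketch of "hiring a computer," the same forecast can be repeated for every WBS element rather than read off Format 1 by eye. The data frame 'ev_long' and its columns are hypothetical.

```r
# Sketch: fit an ARIMA forecast for every WBS element in a long-format table.
# 'ev_long' is a hypothetical data frame with columns wbs, period, cpi
# (one row per WBS element per month).
library(forecast)

forecasts <- lapply(split(ev_long, ev_long$wbs), function(d) {
  d <- d[order(d$period), ]
  forecast(auto.arima(ts(d$cpi, frequency = 12)), h = 6)  # six-month CPI forecast per element
})

# forecasts[["1.2.3"]]   # inspect a single element's forecast
```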
15
So what do we need?
We need more power, Scotty.
It's that simple and it's that complex. The data is available, and we need to access it. The data has to be well formed, and we need the tools to make use of that data.
The current approach, as mentioned before, does not address the underlying statistical nature of the program's performance, does not adjust the forecast for risk or for past statistical behavior, and does not consider any future statistical behavior in calculating the EAC.
As a quick sidebar, the underlying statistical processes of the program change as the program moves from left to right in time. This makes the underlying processes non-stationary. This non-stationary stochastic behavior is further complicated by the coupling of these processes with each other in the network of activities.
16
Our next step beyond ARIMA analysis of the program's CPI and SPI time series is the use of Principal Component Analysis. This technique takes in a large data set of attributes from the program and reveals which of them are the source of the largest variance – that is, which are the principal components of this variance.
It is an exploratory technique that specifies a linear factor structure between variables, and it is especially useful when the data under consideration are correlated. If the underlying data are uncorrelated, then PCA has little utility. In our case we make the assumption that all the variables on the program are correlated in some way, since they are physically connected through the topology of the products being built, and logically connected through the network of activities in the Integrated Master Schedule.
PCA analyzes a data table representing observations described by several dependent variables, which are, in general, inter-correlated. Its goal is to extract the important information from the data table and to express this information as a set of new orthogonal variables called principal components. (A short R sketch follows the list of properties below.)
Once the principal components have been discovered, they have the following properties:
• Each factor accounts for as much variation in the underlying data as possible.
• Each factor is uncorrelated with every other factor.
• Principal components elucidate the dominant combinations of variables within the covariance structure of the data.
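As a small sketch (simulated data, not program data), base R's prcomp() does the heavy lifting: it centers and scales the measures, then reports how much variance each component carries and which measures load on it.

```r
# Sketch: PCA on simulated stand-ins for correlated program measures.
set.seed(1)
n <- 36
cpi      <- 1 - cumsum(rnorm(n, 0.002, 0.01))    # drifting CPI
spi      <- cpi + rnorm(n, 0, 0.02)               # correlated with CPI by construction
tpm_var  <- 0.5 * (1 - cpi) + rnorm(n, 0, 0.01)   # TPM variance from plan
risk_var <- 0.3 * (1 - spi) + rnorm(n, 0, 0.01)   # risk burn-down variance from plan

measures <- data.frame(cpi, spi, tpm_var, risk_var)

pca <- prcomp(measures, center = TRUE, scale. = TRUE)
summary(pca)    # proportion of variance explained by each principal component
pca$rotation    # loadings: which measures dominate each component
```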
17
For PCA to have any value, we need to augment the EV-CR data – CPI and SPI by WBS element as a time series – with other program data for the same WBS elements. This starts with the data produced from the Systems Engineering Management Plan (SEMP): Measures of Effectiveness, Measures of Performance, Technical Performance Measures, Key Performance Parameters, risk retirement or buy-down time series, and other attributes of the program that are assessed at the same time as the Earned Value data.
This creates a correlated set of information that is the raw data for the PCA process.
It's beyond this presentation, and even our paper, to delve into PCA, but the PCA process is well developed in the literature. The R programming system has PCA functions built in, and with the EV-CR data sets and the augmented data available from the program – but not in the EV-CR – we can start to ask questions about the principal contributors to the variance in the EAC in ways not possible with the EV-CR data alone, or with the other assessment data alone.
The goals of PCA (again) are:
1. Extract the most important information from the data table;
2. Compress the size of the data set by keeping only this important information;
3. Simplify the description of the data set; and
4. Analyze the structure of the observations and the variables.
18
Here's a notional example of a PCA process on two-dimensional data. Our data sets will have 8 or 9 dimensions. But the number of dimensions is irrelevant; in fact, the more dimensions the better.
What this example shows is the principal component of the data set – the one with the most variance – displayed in a simple graphical form.
When we add more dimensions, the result is no longer a two-dimensional PCA plot, but a bar graph in which the dimensions are laid out linearly.
The result, no matter the number of dimensions, is a reduction of all the data to the few principal components that represent the largest variations in the data set.
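As a continuation of the sketch above (same simulated 'pca' fit), the higher-dimensional view the slide describes is just a bar chart of the variance carried by each component.

```r
# Sketch: bar chart of the proportion of variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
barplot(var_explained,
        names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Proportion of variance",
        main = "Variance by principal component (simulated data)")
```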
19
Here's a sample list of the components in a program. This is not all the components, but these are ones we are familiar with.
• CPI/SPI are in the EV-CR.
• Technical Performance Measures have a time series of values with an upper control limit and a lower control limit, or an outside-the-bounds assessment, or some other comparison of actual to plan.
• The risk retirement buy-down assessment is similar to the TPMs. We have a planned retirement value and an actual assessment of the risk at each point in time to create a variance between planned and actual.
• A similar time series is the cost or schedule margin burn-down: a comparison between the planned margin and the actual margin as a function of time.
20
In the short time we've had, we covered a lot of ground. Likely a two-semester university course on predictive analytics using Big Data.
But here's our call to action for the next steps.
It's hopefully obvious what each of these steps is.
1. The cleaning of the time series data in the EV-CR in preparation for further analysis. Without good data, no algorithm is going to help us. Currently the EV-CR has an opportunity for improvement. This is a normal startup process.
2. With good data – normalized using MIL-STD-881C, for example – the ARIMA tools can be pointed at its contents and we can get outcomes with ease (a small cleaning-and-forecast sketch follows).
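As a hedged sketch of steps 1 and 2 together, the 'forecast' package's tsclean() interpolates missing periods and damps outliers before a model is fit; 'raw_cpi' stands in for a series pulled from the EV-CR.

```r
# Sketch: clean a hypothetical monthly CPI series, then fit and forecast.
library(forecast)

clean_cpi <- tsclean(raw_cpi)        # interpolate missing values, damp outliers
fit       <- auto.arima(clean_cpi)   # fit once the data are well formed
forecast(fit, h = 6)
```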
21