
Supply Chain Digital Twin
A Case Study in a Pharmaceutical Company

João Afonso Ménagé Santos

Thesis to obtain the Master of Science Degree in

Mechanical Engineering

Supervisors: Prof. Susana Margarida da Silva Vieira
Prof. Joaquim Paul Laurens Viegas

Examination Committee

Chairperson: Prof. Carlos Baptista Cardeira
Supervisor: Prof. Susana Margarida da Silva Vieira
Members of the Committee: Prof. Jacinto Carlos Marques Peixoto do Nascimento
Prof. Rui Fuentecilla Maia Ferreira Neves

November 2019


Acknowledgments

First, I would like to thank my supervisors, Prof. Susana Vieira, Prof. Joaquim Viegas and Eng. Miguel Lopes. Their support was paramount to the success of this work and I could not have asked for better guidance.

I also extend my appreciation to Hovione Farmaciencia S.A., for giving me the possibility of carrying out my work in a highly dynamic and challenging environment, to Prof. João Sousa, for offering me this opportunity and for giving advice on the topics of the thesis whenever necessary, and to Eng. Andrea Costigliola, for giving me all the tools needed and for providing counsel and feedback throughout the duration of this work.

Additionally, I am very grateful to all the teams and colleagues with whom I have had the pleasure of working during these past few months, especially the Data Science & Digital Systems, Applications Development and Supply Chain teams.

Just as importantly, a big thank-you to my girlfriend, who has supported me unwaveringly, to my parents and my brother, who have made me into what I am today, supporting me in every decision, and to my friends and colleagues.


Resumo

The integrated supply chain is a network in which all business areas depend on one another. Although these are extremely powerful structures, the logistics behind these interconnected networks of people, products, machines and information are highly complicated, and solutions for their optimization are increasingly necessary. The technological advances of recent years have enabled better optimization of their processes, and data-driven solutions are increasingly being adopted due to their accurate results. The digital twin concept, when applied to internal supply chains, can help in the management of integrated supply chains, increasing the awareness of the company's decision-makers while enabling the deployment of accurate simulation models. This thesis presents a digital twin of the internal supply chain of a pharmaceutical company, together with a simulation-based rough cut capacity planning tool capable of generating long-term estimates of the monthly capacity required by each of the company's production areas. The work was carried out as a case study in a pharmaceutical company. The developed digital twin includes a graphical interface with several views of past and present activities and of the evolution of the key performance indicators. The simulation tool is also included in the graphical interface, allowing decision-makers to create their own scenarios and obtain the corresponding simulation results.

Keywords: Digital Twin, Internal Supply Chain, Operations Research, Simulation-based Rough Cut Capacity Planning, Pharmaceutical Supply Chain


Abstract

The integrated supply chain is a network in which all the business areas are dependent on each other. While they are extremely powerful structures, the logistics behind these interconnected networks of people, products, machines and information are highly complicated, and solutions for their optimization are increasingly required. The technological advancements seen in recent years have allowed for better optimization of their processes, and data-driven solutions are being extensively adopted due to their accurate results. The concept of the digital twin, when applied to the internal supply chain, can aid in the management of the integrated supply chain, increasing the awareness of the company's stakeholders and decision-makers while allowing the deployment of accurate simulation models. This thesis presents a digital twin of a pharmaceutical internal supply chain, along with a simulation-based rough cut capacity planning tool capable of giving estimates of the monthly capacity required by the different areas of the organization over the long term. The developed digital twin offers a graphical user interface, with several views into the past and present tasks performed and the evolution of the key performance indicators. The simulation tool is also included in the user interface, giving decision-makers the possibility of creating their own scenarios and running the simulation.

Keywords: Digital Twin, Internal Supply Chain, Operations Research, Simulation-based Rough Cut Capacity Planning, Pharmaceutical Supply Chain


Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

1 Introduction 1

1.1 Pharmaceutical Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Pharmaceutical SCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Supply Chain 4.0 7

2.1 Digital Twin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Literature review & Commercially Available Solutions . . . . . . . . . . . . . . . . . 9

2.1.3 Examples of Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.4 Advantages and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2 Supply Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.1 Enterprise Resource Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2.2 Production Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2.3 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.4 Integrated Supply Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3 Proposed Solution and Expected Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Knowledge Extraction 25

3.1 Collected Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.1.1 Processes Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2 Distributions Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.1 Selecting the Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2.2 Data Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2.3 Outlier Identification and Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2.4 Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2.5 Fitting the Distributions to the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


3.2.6 Results of the fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4 Simulation-Based Rough Cut Capacity Planning 43

4.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.3.1 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.3.2 Code Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.3.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5 Digital Twin User Interface 69

5.1 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6 Conclusions 75

6.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.2.1 Quality & quantity of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.2.2 Improving the Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Bibliography 79

A Goodness-of-fit Tests Comparison A.1

A.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1

A.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2

A.3 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2

B Digital Twin User Interface Screenshots B.1


List of Tables

2.1 Frequencies of appearances of digital entities against the type of the study [27] . . . . . . 9

2.2 Time horizons for the different S&OP cycle stages [5] . . . . . . . . . . . . . . . . . . . . 18

3.1 Results of the optimization of the PDFs parameters . . . . . . . . . . . . . . . . . . . . . . 41

4.1 Example scenario of orders to be sampled . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2 Efforts Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.3 Monthly and area-wise relative error (with sign) of the median per iteration, compared to

50000 iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.4 Monthly relative and absolute (with sign) errors for manufacturing, QA and Warehouse . . 61

4.5 Monthly relative and absolute (with sign) errors for QC IPC, release and release review . 62

4.6 Occurrences of real monthly capacities being within the 1 or 2 IQR . . . . . . . . . . . . . 62

4.7 Consumed capacity percentage for each type of optimization . . . . . . . . . . . . . . . . 67

A.1 Example 1 goodness-of-fit values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1

A.2 Example 2 goodness-of-fit values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2

A.3 Example 3 goodness-of-fit values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2


List of Figures

1.1 R&D productivity evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Basic diagram of the current pharmaceutical Supply Network . . . . . . . . . . . . . . . . 4

2.1 Basic schematic of how a DT works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 High-level external SC relationships scheme [33] . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Pharmaceutical CDMO SC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4 Internal SC relationships scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Automation Pyramid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.6 Production Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.7 Typical evolution of the capacity through the time horizons . . . . . . . . . . . . . . . . . . 19

3.1 Extracted dates and their chronological relation to the real processes. . . . . . . . . . . . 27

3.2 Binomial, negative binomial and Poisson distributions for different parameters . . . . . . . 30

3.3 Examples of PDFs from the obtained data . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4 Outliers filter results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.5 Mean vs Standard Deviation of the projects . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.6 Statistical properties parallel plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.7 Examples of Kurtosis and Skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.8 Cullen and Frey graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.9 Results of the PDF fitting process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.10 ECDFs and how the theoretical PDFs fit to them . . . . . . . . . . . . . . . . . . . . . . . 42

4.1 Capacity Planning Process [38] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2 Examples of truncated distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.3 Example representation of the assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.4 Example representation of the manufacturing tasks scaling . . . . . . . . . . . . . . . . . 49

4.5 Evolution of a month’s capacity distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.6 Evolution of the median of the monthly capacity utilization (%) by area . . . . . . . . . . . 55

4.7 Evolution of a month’s capacity distributions for QC and warehouse. . . . . . . . . . . . . 56

4.8 Non-normal examples of distributions at 50000 iterations . . . . . . . . . . . . . . . . . . 56

4.9 Evolution of the number of BA interferences versus the number of iterations . . . . . . . . 57

4.10 Code Efficiency of regular versus parallel computation . . . . . . . . . . . . . . . . . . . . 58

4.11 Code Efficiency – 3 loops vs a single loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.12 Code Efficiency – 6 vs 7 vs 8 CPU cores used . . . . . . . . . . . . . . . . . . . . . . . . 59

4.13 Capacities validation graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.14 Forecasted capacity evolution per month and area . . . . . . . . . . . . . . . . . . . . . . 64

4.15 Percentage of maximum capacity utilized per month and area . . . . . . . . . . . . . . . . 64


4.16 Gantt chart of the BA’s utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.17 Gantt chart of the BA’s utilization after optimization . . . . . . . . . . . . . . . . . . . . . . 65

4.18 Monthly capacities per area after optimization . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.1 Examples of maps shown in the Overview tab . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.2 Map before and after being clicked . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.3 Network graph of a project and corresponding map of buildings . . . . . . . . . . . . . . . 71

5.4 Example representation of the schedule of activities . . . . . . . . . . . . . . . . . . . . . 72

5.5 Gantt representation of the recipe of a project . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.6 Example of PDFs of manufacturing, QR and adherence to start date of a project . . . . . 74

A.1 Example 1 goodness-of-fit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1

A.2 Example 2 goodness-of-fit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.2

A.3 Example 3 goodness-of-fit results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.3

B.1 Overview tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.1

B.2 Activities by building tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.1

B.3 Activities by project tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.2

B.4 KPIs tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.2

B.5 Projects schedule Gantt chart tab screenshot . . . . . . . . . . . . . . . . . . . . . . . . . B.3

B.6 Projects database example view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.3

B.7 Example of modal help window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.4

B.8 RCCP: main view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.4

B.9 RCCP: options modal window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.5

B.10 RCCP: start simulation modal window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.5

B.11 RCCP: existing scenarios to be loaded . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.6


Acronyms

BA bottleneck asset.

BOM bill of materials.

BPR batch production record.

CDF cumulative distribution function.

CDMO contract development and manufacturing organization.

CMO contract manufacturing organization.

CS Chi-squared goodness-of-fit test.

DT digital twin.

ECDF empirical cumulative distribution function.

EDD earliest due date.

ERP enterprise resource planner.

FP final product.

GoF goodness-of-fit test.

IN intermediate product.

IoT internet of things.

IPC in-process control.

IQR interquartile range.

KPI key performance indicator.

LSD latest start date.

MC Monte Carlo.

PDF probability distribution function.

PP production planning.

QA quality assurance.


QC quality control.

QC R QC release.

QC RV QC release review.

QR quality release.

R&D research and development.

RCCP rough cut capacity planning.

RM raw material.

S&OP sales and operations planning.

SC supply chain.

SC4.0 supply chain 4.0.

TU time unit.

UI user interface.


Chapter 1

Introduction

The 21st century has been a century of technological innovation and integration, in which modern technologies are being quickly and comprehensively extended to every field. Data has become increasingly important and abundant, and its large quantities allow many deductions to be made, aiding in the generation of accurate predictions.

The supply chain (SC) is an area where digitalization can bring numerous advantages. These advantages can be even more significant for the integrated SC, which envisions the SC as an interconnected collection of areas that seamlessly interact with each other, allowing for better resource management and overall improved performance. The adoption of the most recent technological advances can help successfully manage this complex network of areas, resources and entities. The internal SC of an organization, which concerns the interactions between the company's own areas and agents, is possibly the most promising recipient of these new technologies, allowing for better internal logistics, which are often extremely complex and difficult to maintain.

Furthermore, production planning and scheduling, which are central topics of the SC, can greatly benefit from these advances. The possibility of generating accurate forecasts can shift the production planning level away from a mentality of reacting to unexpected situations and towards one of predicting the unexpected and planning for it beforehand. Planning generally deals with the selection of the most appropriate procedures in order to achieve the objectives of the project, while scheduling is the process of converting the scope, time, cost and quality plans into an operating timetable. Both of these areas can be substantially improved by the adoption of accurate forecasting tools, data-driven and based on demonstrated performance.

The pharmaceutical industry has undergone extensive transformation, driven by its changing circumstances. Shah [50] states that the industry was historically characterized by good research and development (R&D) productivity (number of approved drugs divided by the investment in R&D), long effective patent lives, large technological barriers to entry and a limited number of substitutes, which resulted in a strategy based on exploiting price inelasticity to further invest in R&D and on a dependence on blockbuster products, i.e., extremely popular drugs that generate large annual sales. However, according to the author, these trends have changed dramatically in recent years, with R&D productivity declining (see figure 1.1), patent lives shortening and substitutes, such as generics (once patents have expired), emerging. Furthermore, the liberalization of the global marketplace, exposing products to competition, and the creation of stricter laws controlling drug prices have called for drastic modifications in how pharmaceutical organizations operate. These circumstances have created a necessity for achieving operational excellence within the whole enterprise.


Companies in the 21st century are pressured to deploy these new technologies, since they can optimize their businesses to levels that could not be achieved before. This includes the use of several concepts, such as: internet of things (IoT), interconnecting devices in a production plant and allowing the use of artificial intelligence to autonomously make decisions; automation, allowing machines and intelligent systems to replace workers in tedious and repetitive jobs while increasing throughput; smart manufacturing, using powerful algorithms to optimize processes and to provide predictive capabilities based on historical data. All of these possibilities have arrived due to the technological advancements seen in recent years. The concept of Industry 4.0 has expanded into different areas with the appearance of Pharma 4.0 and supply chain 4.0 (SC4.0), for example. Both can be defined as the application of Industry 4.0 concepts in each respective area. The SC4.0 can benefit greatly from the forecasting tools enabled by the computational power easily accessible today. Furthermore, the large amounts of data that are collected and stored allow these algorithms to achieve very accurate forecasts. Optimizing the complex agents and interconnections within the SC also becomes possible and has the potential to greatly reduce costs and increase efficiency. The adoption of the SC4.0 in the pharmaceutical industry (as a part of Pharma 4.0) can be what is necessary to achieve the operational excellence that the current paradigm of the pharmaceutical industry demands.

This chapter provides an overview of the current state of the pharmaceutical industry and its evolution in recent years. A focus is given to pharmaceutical SC networks, stating how they are being affected by the changes seen in the industry and how the 21st century and its technological advancements can provide the improvements necessary to successfully deal with today's reality. Finally, the contributions made by this work are presented and the structure of the thesis is outlined.

1.1 Pharmaceutical Industry

Contract manufacturing organizations (CMOs) and contract development and manufacturing organizations (CDMOs) are two types of key players in the pharmaceutical industry. In fact, Shah [50] enumerates the key players of the pharmaceutical industry as listed below.

• Large R&D multinationals with a presence in branded drugs.

• Generic manufacturers, producing drugs with expired patents.

• Contract manufacturers (both CMOs and CDMOs), which do not have their own product portfolio, but instead produce intermediate products or active pharmaceutical ingredients. They operate by providing outsourcing services to other companies.

• Drug discovery and biotechnology companies, often small and with limited manufacturing capacity.

Organizations of the first two types are commonly named Big Pharma companies; these are often large organizations, spread over multiple countries. These companies tend to resort frequently to CMOs or CDMOs for the manufacturing of drugs or drug components, for a variety of reasons: (1) the lack of production capability, either constant or seasonal; (2) for the first type of organization, to allow them to focus on R&D and marketing while leaving production to external organizations.


This focus on R&D by big pharma companies is caused mainly by the rise of generic manufacturers, which compete on the non-patented blockbuster drugs. This tendency has forced the big multinationals to develop new drugs that may become blockbusters and which, since they are patent protected, cannot be produced by generic manufacturers. This behavior can be verified in the graphs of figure 1.1, which show that R&D investments have been rising consistently and significantly (with an average annual increase of 10.5% from 1980 to 2018) and that the number of drugs approved yearly by the American Food and Drug Administration (FDA) has also been increasing since 2002. This means that even though R&D productivity has been slightly decreasing (as shown in the bottom graph of figure 1.1), Big Pharma companies are still investing increasingly more in R&D and the number of drugs approved yearly by the FDA is also increasing. Note that the investment values are for the USA and the approved drugs are those approved by the FDA. Even though the data shown is exclusively for the USA, this is a fair approximation, since the USA's investment corresponds to around 80% of the yearly global R&D investment [41]. Furthermore, the USA can without a doubt be considered the biggest pharmaceutical market (and therefore the most representative), with sales of new drugs corresponding to 64.1% and sales of total drugs corresponding to 48.1% of global sales (including Canada) [19]. This new necessity of developing new drugs and focusing on R&D has led to an increase in the importance of CMOs and CDMOs.

Figure 1.1: Graphs of (1) yearly evolution of approved FDA drugs (bar chart) and R&D investment in billions of US dollars (line chart) [41, 34] and (2) yearly evolution of approved drugs by the FDA per billion of US dollars invested in R&D.

An important note must be made regarding the second graph of figure 1.1. The graph shows the R&D productivity, defined as the yearly number of drugs approved by the FDA divided by the yearly investment in R&D in billions of US dollars. This is merely an approximate indicator of the tendency, since the investments made in R&D in one year do not directly affect the number of approved drugs in that year, due to the lengthy process of creating new drugs and performing the clinical trials. However, due to the linear tendency of the R&D investments, it is a fair approximation and shows numerically a reality that is affecting the pharmaceutical industry.
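Restating the definition above in symbols (this is only a compact rewrite of the text, not an expression taken from the cited sources), with N_y the number of drugs approved by the FDA in year y and I_y the R&D investment in year y in billions of US dollars, the indicator plotted in the bottom graph of figure 1.1 is

    \text{productivity}_y = \frac{N_y}{I_y} \quad \left[\frac{\text{approved drugs}}{\text{billion USD of R\&D investment}}\right].

A lagged variant, N_y / I_{y-k} with k the typical development lead time in years, would address the caveat just mentioned, at the cost of having to choose k.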

CDMOs differ from CMOs due to their development component. This R&D component can be mainly one of two types: the development of new methods for synthesizing components, or the development of new industrial processes. The first deals with the creation of new methods for producing a specific component, which are then licensed, while the second is the development of the industrial processes for the manufacturing of a certain pharmaceutical product. The latter is a fundamental step, since when drugs are discovered they are produced in laboratories and their manufacturing methods are not suitable for industrial upscaling.

Due to the new conditions that the pharmaceutical industry is facing, its focus has shifted towards optimizing the whole business: not only the R&D processes, marketing and sales, but also the other SC agents that need operational excellence to allow for a smooth transition into Pharma 4.0 corporations. The optimization of the SC therefore becomes a fundamental requirement.

1.2 Pharmaceutical SCs

Due to the aforementioned scenario currently affecting the pharmaceutical industry and the new organizational structure seen today, with the rise of generics manufacturers, the greater focus on R&D by big pharma companies and the new necessity of outsourcing manufacturing to third-party organizations, a new complex supply network has emerged, connecting these different organizations.

Figure 1.2: Basic diagram of the current pharmaceutical supply network

Figure 1.2 shows a diagram of the basic workings of the modern pharmaceutical supply network. At a high level, this network features the raw material suppliers, manufacturing, R&D and the clients or final consumers.

• Raw material (RM) suppliers: third-party organizations focused on producing the basic components for the manufacturing process. These materials can be simple, such as acetone or even ice, ranging to much more sophisticated components with longer lead times. Additionally, CMOs and CDMOs can produce intermediate product (IN) or active pharmaceutical ingredients which can be used as RM in other manufacturing processes.

• R&D: the R&D area focuses on developing new drugs, procedures or techniques, with the ultimate goal of reaching production, so that they become more easily accessible to the people afflicted by the diseases the drugs try to fight, and of becoming profitable. This means that this area "produces" the intellectual property used in manufacturing.

• Manufacturing: the different organizations produce the drugs and pharmaceutical products to be sold to clients. Note that CMOs and CDMOs do not manufacture products to be sold under their own company's name; their products, regardless of their stage in the product cycle, are sold to big pharma or generics companies. Their clients are therefore not the hospitals or pharmacies, but the other manufacturing organizations, which is not explicitly shown in the diagram.

• Clients: the customers of the pharmaceutical products, often hospitals, clinics, pharmacies and other health-related private or public organizations. The end-customers of the pharmaceutical products are the people who require the medical effects of the products.

This complex SC scheme existing in today's pharmaceutical industry faces several challenges and opportunities for improvement. The digitalization and integration of the SC are among the most attractive solutions for most of the problems that SCs face today. The larger number of products in the market nowadays, which can be seen in the increase of FDA-approved drugs in figure 1.1, has led to bigger project portfolios in most pharmaceutical companies. This creates a higher complexity in the SCs of the companies, derived from the factors listed below.

• The need for a larger amount and a more varied set of RM. This has a direct impact on warehouse management and requires the addition of new RM suppliers.

• New manufacturing processes are needed. These can be independent processes which require a higher initial investment, or processes which can be partially or completely carried out in existing workcenters. A more complex scheduling is a clear consequence of this.

• The addition of a new project generally requires pilot batches and a subsequent process validation campaign, which is done to confirm that the production recipe works as planned. Pilot batches are often more prone to delays, occasionally affecting other campaigns.

For these reasons, the tendency towards larger portfolios has increased the complexity of pharmaceutical SCs, and a new way of organizing them must be adopted. The digitalization of the SC can bring integration to its internal processes. This integrated SC concept features a centralized way of managing and interacting with the SC agents and benefits greatly from implementing a digital approach. Recent technologies, such as IoT, can provide great amounts of varied data, which can be used to automate processes that do not need human interaction (such as automatically contacting RM suppliers to order products according to each individual lead time, effectively reducing time in storage), to aid in the planning and scheduling of the manufacturing and support areas' processes, or to create forecasts based on demonstrated performance.


1.3 Objectives

The main objectives of this thesis are:

• Mapping the internal SC

– Map crucial material, data and information flows

– Establish key dependencies and layers of interaction across and within areas

• Building the internal SC DT

– Bring end-to-end visibility to operational processes

– Monitor key performance indicators based on historical data

– Conduct scenario-based forecasting, through a simulation-based rough cut capacity planning

(RCCP) tool

1.4 Contributions

This thesis was developed in a partnership between the Institute of Mechanical Engineering of the Instituto Superior Técnico (IDMEC-IST) and the pharmaceutical CDMO Hovione Farmaciencia, S.A.

A SC DT with a simulation-based RCCP tool as its simulation component was the solution developed in this work. Furthermore, a graphical user interface was developed, with the objective of (1) intuitively showing current and past information about the manufacturing plant and the internal SC agents and (2) delivering the RCCP tool in a way that allows the users to perform minor modifications and refinements to the simulation's parameters and receive its results in a detailed way. This tool has the objective of supplying information to the users, so that their decisions are better informed and data-driven, and it will be actively used by key stakeholders as part of the new operations planning policy at the CDMO under study, contributing to the key task of RCCP.

1.5 Thesis outline

Chapter 1 describes the industry under study, its problems and opportunities, and the objectives of this work. Chapter 2 presents the SC4.0, clarifying how a DT can provide support to the SC. The concepts of DT and SC are thoroughly explained. Furthermore, the related work, proposed solution and expected results are presented. Chapter 3 deals with knowledge extraction. The chapter starts by identifying the data that was used and how it was extracted. Then, a comprehensive description of the process of fitting a theoretical probability distribution function (PDF) to the measured process durations is presented. Chapter 4 introduces and defines the simulation-based RCCP tool. The implementation of the tool is presented, and thorough convergence and efficiency analyses are described. Lastly, the validation of the simulation's results is performed and the predictions made by the tool are delivered. Chapter 5 presents the user interface (UI) of the DT, introducing the chosen methods for translating meaningful statistical information into graphics and tables, as well as intuitive ways of interacting with the data shown and the simulations performed. Chapter 6 presents the conclusions and achievements, as well as future work.


Chapter 2

Supply Chain 4.0

The improvement of the SC is a requirement that comes from the need to achieve operational excellence in all business areas. The most viable and effective way of improving its performance is by converting it into a digital supply chain, also known as the SC4.0. Since the term digital supply chain can actually have two different meanings (the second being the supply chain of digital goods, such as songs or e-books), the term SC4.0 will be used instead. SC4.0s have the opportunity to reach the next horizon of operational effectiveness and, to do so, they need to become much faster, more granular and more precise [1]. Employing algorithms and frameworks that can aid in optimizing the SC and the whole business can be an important step for companies to keep their competitive edge.

In fact, Hala Zeine (president of the digital supply chain at SAP AG) states that "the future of the supply chain is somewhere where the digital world and the physical world are entangled. It is where you can simulate in one and execute in another". Furthermore, she states that the latest technologies, such as 3D printing, robotics, big data, AI or blockchain, are not useful unless there is visibility into whether the processes are working or not. She concludes by asserting that the way to push SCs into the future, and to prepare companies to do so, is by giving them visibility, by creating a digital twin of their end-to-end supply chain [45].

In this chapter the concepts of DTs and SCs are discussed, and the proposed solution and expected

results are presented.

2.1 Digital Twin

2.1.1 Concept

A DT can be defined as a dynamic virtual representation of a physical object or system, using real-time data to enable understanding, learning and reasoning [8]. Although its definition varies from source to source, the basic idea consists of a digital representation of an asset (be it tangible [entity] or intangible [system]), which uses IoT to receive meaningful real-time data and reaches conclusions based on the developed model, on how the asset has performed in the past and on how it is performing at present.
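As an illustration only (not the architecture implemented later in this thesis), the description above can be condensed into a minimal skeleton: a digital object that stores historical data, is automatically updated with real-time measurements, and produces forecasts for the decision-maker. The names below (DigitalTwin, ingest, forecast) and the naive drift model are assumptions made for this sketch.

    from dataclasses import dataclass, field
    from statistics import mean
    from typing import List

    @dataclass
    class DigitalTwin:
        """Toy DT: historical data, current state and a (very simple) behavioral model."""
        history: List[float] = field(default_factory=list)  # past measurements of the real asset
        current_state: float = 0.0                          # latest real-time measurement

        def ingest(self, measurement: float) -> None:
            """Automatic update from the physical asset (e.g. an IoT sensor reading)."""
            self.history.append(measurement)
            self.current_state = measurement

        def forecast(self, horizon: int) -> List[float]:
            """Project the current state forward using the average drift seen in the history."""
            diffs = [b - a for a, b in zip(self.history, self.history[1:])]
            drift = mean(diffs) if diffs else 0.0
            return [self.current_state + drift * step for step in range(1, horizon + 1)]

    # The decision-maker queries the twin instead of the physical asset.
    twin = DigitalTwin()
    for reading in [10.0, 11.0, 12.5, 13.0]:   # real-time data stream
        twin.ingest(reading)
    print(twin.forecast(horizon=3))            # [14.0, 15.0, 16.0]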

The DT is an extremely attractive concept nowadays, having been featured in Gartner's Top 10 Strategic Technology Trends for both 2018 and 2019 [20], an annual ranking made by Gartner (an S&P 500 global research and advisory firm) that distinguishes the most promising innovative technologies. The basic schematic of a DT is shown in figure 2.1.

The concept of the DT was initially introduced in 2002 by Dr. Michael Grieves from the University of Michigan, only receiving its name in 2010, in a NASA roadmap document by Piascik et al. [42].


Figure 2.1: Basic schematic of how a DT works. Note that the automatic updates transition may be optional.

Some sources claim that the concept was coined by the USA's Defense Advanced Research Projects Agency (DARPA) [21], but no unambiguous evidence of this was found when reviewing the literature.

Being a broad and relatively recent concept, the DT is usually regarded as the digital replica of exclusively physical, tangible assets [29], since that is its most common application. In fact, it is mostly used in situations such as mimicking machines on a production plant or a turbine in an airplane. However, the concept can also clearly be applied to intangible assets, with more recent definitions describing the DT as a digital copy of a physical system rather than of a physical entity or asset, the latter wording being semantically more excluding towards intangible assets. These intangible assets can be transportation networks, economic flows, manufacturing processes, HVAC systems or, as in the case under study, SCs.

Since a DT receives real-time data and stores historical data, it is quite common for it to act as a visualization tool. Being so dense in the information that it contains, it can generally be used to show performance indicators, states and numerical figures regarding the present or a specific point in time. This can provide additional awareness to decision-makers, so that their decisions are more data-driven. Besides this, simulation capabilities are also extremely useful in DTs; in fact, considering that the system contains both large amounts of data and the behavior and inner workings of the model, it is straightforward to understand the utility of using this tool for simulation.

The use of IoT is fundamental in DTs, since it is the technology that provides the data to the models. This can be done by resorting to sensors (temperature, pressure or other quantities), or to other corporate systems that collect information in a less straightforward manner, such as the number of trucks that arrive (by accessing the records of those occurrences) or the throughput of a certain production facility. Artificial intelligence algorithms (machine learning, deep learning) are also frequently used with DTs. This is due to two reasons: (1) the current development and widespread use of powerful hardware, capable of performing complex calculations, and (2) the existence of large quantities of data already in the DT, which is a requirement for most artificial intelligence algorithms to perform accurately. 3D modelling is also frequently associated with DTs that deal with machines or other tangible assets. By modelling the asset's components and the physical interactions between them, the model is able to better forecast the asset's state, which can be used for predictive maintenance, for example.

DTs are frequently confused with both monitoring tools and simulation models. In reality, DTs bring together both concepts, effectively delivering a visualization tool with improved simulation models [32]. DTs differ from simulation models in the sense that they receive real-time data to generate better predictions. Regularly, simulation models have complete descriptions of the object or system under study, but often lack its historical performance and almost always lack its current state. By having both, the generated simulations can be verified and improved by simulating on past data, and the predictions that the model creates are based on current states, which delivers more data-driven and accurate responses. DTs also supersede monitoring tools in the sense that all the data that these tools possess and display is also available in DTs. Additionally, DTs have access to forecast data, created by their accurate simulation models.

2.1.2 Literature review & Commercially Available Solutions

Kritzinger et al. [27] make a clear distinction between three types of digital entities that vary in their level of integration: digital models, digital shadows and DTs. The main difference between the three is the degree of automation of the data flows: digital models have manual data flows from the physical object to the digital one and vice-versa; digital shadows have manual data flows from the digital object to the physical one, but automatic data flows from the physical object to the digital one; and DTs have automatic data flows both from physical to digital and from digital to physical objects. In their literature review, the authors analyzed a series of scientific articles regarding DTs, whether conceptual, review, case-study, definition or study, and determined the actual type of digital entity that the articles were referring to, according to the authors' experience. What was shown was that the majority of the articles regarding DTs were actually describing other digital entities with a lower level of integration. In fact, as can be seen from table 2.1, only 18.6% of the articles featured a digital entity with a level of integration that allowed the authors to consider it a DT. Note that the study made by the authors regarded DTs in manufacturing.

Integration level   Case-study   Concept   Definition   Review   Study    Total
Undefined               4.7%      11.6%       0.0%       2.3%     0.0%    18.6%
DM                     11.6%      14.0%       0.0%       0.0%     2.3%    27.9%
DS                      7.0%      25.6%       0.0%       2.3%     0.0%    34.9%
DT                      2.3%       2.3%       4.7%       9.3%     0.0%    18.6%
Total                  25.6%      53.5%       4.7%      14.0%     2.3%   100.0%

Table 2.1: Frequencies of appearances of digital entities against the type of the study [27]
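The criterion used by Kritzinger et al. can be stated mechanically: the level of integration is determined by which of the two data flows (physical-to-digital and digital-to-physical) is automatic. The short sketch below only encodes that rule; the function name and signature are illustrative, not taken from the cited paper.

    def classify_digital_entity(physical_to_digital_automatic: bool,
                                digital_to_physical_automatic: bool) -> str:
        """Classify a digital entity by its data-flow automation, after Kritzinger et al. [27]."""
        if physical_to_digital_automatic and digital_to_physical_automatic:
            return "digital twin"     # both data flows automatic
        if physical_to_digital_automatic and not digital_to_physical_automatic:
            return "digital shadow"   # automatic sensing, manual feedback to the physical object
        if not physical_to_digital_automatic and not digital_to_physical_automatic:
            return "digital model"    # both data flows manual
        return "not covered by the taxonomy"  # automatic feedback without automatic sensing

    print(classify_digital_entity(True, False))   # -> digital shadow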

Besides scientific articles and academic applications of DTs, there are a few commercially available implementations of the concept. It is important to mention that most commercial applications of DTs concern tangible physical assets, such as machines or production facilities, rather than intangible ones.


• SAP: the company places the DT at the center of the design-operate-manufacture-deliver chain. At the implementation level, SAP offers the Asset Information Workbench, which creates a DT to help manage the enterprise's assets by creating a digital replica of the physical assets, processes and systems.

• Anylogic & Simio: both simulation software packages include DT frameworks that allow enterprises to implement the concept. Being simulation software packages, the implementation of a DT is rather straightforward: in its most basic form, the addition of real-time data to the models is the step needed for the simulated assets to become digital twins. Furthermore, Anylogic features several case studies of DT applications.

• Ansys & Siemens NX: although the level of integration that these finite element modelling software packages have with regard to DTs is not clear, their use is paramount in DTs of tangible assets. These packages simulate physical phenomena on the assets, and the addition of real-time data measurements can further improve the quality of the results.

• Aspentech Aspen Mtell: Aspentech provides supply chain planning solutions and its Aspen Mtell

tool includes optimized production scheduling based on an integrated digital twin maintenance

model.

• PWC's Bodylogical: a DT of the human body that harnesses data from, for example, fitness trackers. PWC claims that it could help people better manage their health and stick to their doctor's orders, and help pharmaceutical companies and governments better understand global health problems and employ counteractive measures to regulate them.

2.1.3 Examples of Applications

Several examples of DTs are presented below, which can help the reader better visualize how they

are usually implemented.

• DT in energy production: GE engineers developed a DT of the Haliade 150-6 wind turbine's yaw motor. The objective of the DT was to simulate the temperature at various parts of the motor, using its physical model and a series of sensor data. This allowed for better temperature monitoring, which ultimately reflects how the motor is being used. By using simulation software and the real data collected, the temperature at any point of the motor can be estimated fairly accurately. Furthermore, this allows for predictive maintenance, effectively reducing downtime. [43]

• DT in healthcare: Bruynseels et al. [9] present a form of therapy as digitally supported engineering. In it, a virtual patient is created as a model of a person, made possible by experimental big data collected using advanced technologies, from the molecular to the macroscopic scale. This creates a virtual model of the human body. Then, utilizing sensor data such as that currently collected by fitness trackers, the DT could track the health condition of the individual and forecast problems with the user's health. The concept of the DT in healthcare has also been applied to individual organs, such as the creation of a model of the human heart [49].


• DT in fleet management: as part of the EU OPTIMISED project, Alstom developed a DT to enable the correct scheduling and maintenance management of their UK fleet of trains. The DT dealt with daily operating requirements, maintenance regimes, capacity and abnormal cases of accidents and failures. The solution created included an interactive UI showing the trains' current locations and estimates of their future locations. [3]

• DT in the automotive industry: DTs can be of great use in the automotive sector, from the vehicles' operation to their manufacturing or sales. Sharma and George [51] explore this possibility: while driving, a DT of the vehicle could combine the real world with the digital world (for example, the navigation system) into an Augmented Reality solution, delivering real-time information to the driver, effectively allowing them to make more data-driven decisions and reducing the distractions caused by currently employed systems. In more advanced approaches to this DT, it could actually control the vehicle or give driving assistance to the driver (becoming an autonomous vehicle). The DT would interact with the real vehicle, which in turn would supply it with real-time odometry data.

• DT in crisis scenarios: [44] presents a conceptual example of a DT in the form of an interactive audio assistant, which deals with the water and sewer system of a city. The example shows the DT alerting the operator that a pipe burst may have happened (through real-time pressure measurements). It then supplies the operator with the required information, for instance the location of the burst or the anticipated impact. Finally, it gives response options and performs the commands given by the operator, such as notifying emergency response teams, assessing risk and alerting authorities. Note that this example is different from a regular audio assistant in that it knows the system internally and receives real-time data to estimate its real-time behavior and performance. The audio interactivity is merely an interaction method, similar to a UI.

• DT in manufacturing: Kritzinger et al. [27] present a compilation of applications of manufacturing DTs, with their corresponding opportunities.

– Production planning and control: order planning according to statistical assumptions; im-

proved decision support using detailed diagnosis; automatic planning and execution.

– Maintenance: ability to identify the effect of state changes on the processes of the production

system; evaluation of machine conditions based on descriptive methods and machine learn-

ing algorithms; achieve better insights into the machine’s health by analyzing process data at

distinct phases of the product’s lifecycle.

– Layout planning: continuous production system evaluation and planning; independent data

acquisition.

• DT of an entire country: the UK National Infrastructure Commission has suggested the creation of a DT of the entire country, mapping power production, water management, communications, meteorological and demographic history and transportation networks. With this complex model and the astronomical amounts of data it would produce, the Commission would have insights into almost everything happening at a given time and would be able to answer strategic questions regarding investments or improvements to infrastructures or organizations, such as: what would the impact of closing a road be on traffic; is it possible to avoid building a new hospital car park by managing appointment times and traffic flows. [8]

2.1.4 Advantages and Limitations

The advantages of DTs can be summarized as improved awareness and predictive capabilities. Having access to real-time sensor data and to past historical data, the user of the DT has information regarding the state of the represented asset and can make deductions about seasonality, for example, based on its history. A more detailed model and a more comprehensive collection of data can lead to more accurate and more diverse information being available. Furthermore, since the DT contains the model of the asset, further data that is not explicitly collected can be calculated and presented to the user, allowing additional information to be displayed. Consider, as an example, a production plant which consistently tracks a production process. Although it does not explicitly measure the stocks of the raw materials used in production, by knowing the quantities ordered and the quantities used in the manufacturing process (from the production model), these stocks can be tracked and the user can receive this information.
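The raw-material example reduces to a simple balance that the twin can maintain from data it already receives; the sketch below is purely illustrative (the function and parameter names are assumptions, not part of the case study).

    from typing import List

    def inferred_rm_stock(initial_stock_kg: float,
                          ordered_quantities_kg: List[float],
                          batches_produced: int,
                          rm_per_batch_kg: float) -> float:
        """Estimate the current raw-material stock without measuring it directly.

        Orders come from purchasing records; the consumption per batch comes from the
        production model, so the estimate is initial + received - consumed.
        """
        received = sum(ordered_quantities_kg)
        consumed = batches_produced * rm_per_batch_kg
        return initial_stock_kg + received - consumed

    # 100 kg on hand, two 50 kg deliveries, 12 batches consuming 14 kg each -> 32 kg estimated
    print(inferred_rm_stock(100.0, [50.0, 50.0], 12, 14.0))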

The opportunity presented by DTs in terms of presenting information also comes as a challenge for data representation. It is extremely important to convey the information in a manner that is clear and precise. Additionally, there is often the need for these representations to be interactive and to allow users to filter the data shown to their needs. Several aspects of the chosen representation form, which are frequently deemed unimportant and irrelevant, such as the type of graphs or the color scheme, are of extreme importance when transmitting information clearly.

Regarding the predictive capabilities of DTs, their advantage is quite straightforward: a precise model

of the physical asset, receiving meaningful and accurate data, creates forecasts, from manufacturing, to

maintenance or planning and scheduling. Manufacturing forecasts can give insights on when to order

raw materials; predictive maintenance can predict when maintenance is necessary for a specific asset,

allowing it to be planned ahead and effectively reducing downtime and reactive maintenance; accurate

insights into future manufacturing processes, maintenance durations and expected delays, for example,

allow for better planning and scheduling, which can account for these situations and increase the overall

performance of a business.

An additional advantage of DTs which is frequently overlooked is the ability to simulate scenarios in a risk-free environment. This means that, since the DT contains a representation of the actual asset, simulations can be run that would not be safe to perform on the physical asset. Consider, for example, removing the scheduled maintenance of a certain asset and simulating how long it would run until failure; the simulation would have no impact, but experimenting on the physical asset could have catastrophic implications.

The disadvantages of DTs are mainly the challenges they present on implementation: the initial

cost of sensor installation and the effort of creating a comprehensive model of the asset. An incredibly

detailed model that receives accurate and varied data is more capable of making forecasts, but the

difficulty of implementation increases with the level of detail. Furthermore, especially on non-corporate


DTs, such as healthcare or automotive, the collection of data to be supplied to global models to improve

the overall model effectiveness may be seen as privacy infringement. For all the models, data encryption

is also a challenge.

2.2 Supply Chain

An SC is a complex network of agents, such as organizations or people which work in a connected

manner, with the objective of delivering a product or service to a customer. These networks feature

the processes of converting raw material into the final product (FP) and all the logistics that it entails.

Mentzer et al. [33] define SC as ”a set of three or more entities (organizations or individuals) directly

involved in the upstream and downstream flows of products, services, finances, and/or information from

a source to a customer”.

Figure 2.2: High-level external SC relationships scheme [33]

SC management is a term that is customarily used when studying SCs. In fact, this is a fundamental

area that regularly needs to be optimized; doing so could improve the company’s efficiency and effectiveness, conferring a smoother operation. Mentzer et al. [33] define SC management as ”the systemic,

strategic coordination of the traditional business functions and the tactics across these business func-

tions within a particular company and across businesses within the supply chain, for the purposes of

improving the long-term performance of the individual companies and the supply chain as a whole”. In a

simpler form, this means that it consists in organizing the SC agents and the interactions between them,

so as to improve the performance of the whole company.

In this work, only the internal SC will be studied, which excludes the importing of raw materials from

external suppliers and the exporting of finished produced goods to external clients. Note that from the

diagram of figure 2.2, the suppliers and customers are excluded, leaving mainly the organization and its

direct influences, such as market research firms. It makes sense then to expand the organization block

into a more detailed view, exposing the internal connections; this is shown in figure 2.4. Examples of

internal SC agents are the production areas (which can be subdivided into specific areas), quality con-

trol (QC), quality assurance (QA), warehouse, management, marketing, while examples of processes

between these agents can be the delivery of raw materials from warehouse to a production area, the

delivery of FPs from one production area to the warehouse or the QC of a raw material that needs to

finish before the production can start. The global view of the SC of a typical CDMO company is as

depicted in figure 2.3. For this work only the processes (referenced as ”Process step i” in the figure) will

be considered, along with all its supporting areas.

It is important to understand the internal processes that happen in a CDMO, since they define how

the business operates and what the specific internal SC of such organizations looks like. The main


Raw MaterialSuppliers

Storage Process Step 1 Process Step nShipment to

client

Raw MaterialTests

In ProcessControl (IPC)

StabilityTests

...

IntermediateProduct Tests

Final ProductTests

Figure 2.3: Pharmaceutical CDMO SC

agents of these organizations are itemized below.

• Manufacturing: productive areas of the business, where RM or IN are transformed into FPs or

other INs. Often, organizations are comprised of several manufacturing areas which are independent of each other in terms of assets, be it machines or workforce, but consume capacity

in the support areas. After the productive tasks are completed in a campaign, the manufacturing

teams are also responsible for reviewing the batch production record (BPR).

• QC: area that receives samples and verifies whether or not these are according to pre-defined

standards. Several areas use QC for a number of reasons:

– Manufacturing requires QC analysis during production (in-process control (IPC); sometimes

production halts for the results of the analysis) and after the production of FPs or INs.

– After the productive tasks, the analytical packages have to be released by the QC. This involves

two stages: the QC release (QC R) and the QC release review (QC RV) performed right

after the QC R. A certain campaign can have several analytical packages to be released

and reviewed, regarding different operations; each of these are paired and the review stage

of a certain operation only starts after the release stage of the same operation. Generally

speaking, all the QC R stages start immediately after the production.

– Raw material received from external suppliers has to be verified by QC.

– R&D departments need frequent QC analysis.

– QC performs regular stability analysis to products that are in the warehouse. This analysis is

scheduled months in advance and therefore is less prone to scheduling complications.

The pharmaceutical CDMOs QC is an area which has been extensively studied. Examples of this

are the work by Costigliola et al. [16], providing a simulation model for optimizing QC’s workflow, or

the work by Lopes et al. [30], a decision support system based on simulation for resource planning

and scheduling.

• QA: in the context of providing support to manufacturing operations, QA is the area that approves

the analytical package (after all the QC reviews are performed) and the BPR (after the manufactur-


ing area reviews the BPR). This process tends to take more time and require greater effort when

dealing with FP production rather than intermediate products.

• Warehouse: area that stores and delivers the resources to and from the other areas. The pro-

cesses done by this area are mainly resource storage and delivery (for raw materials, FPs, in-

termediate products, by-products, co-products and other types of resources, such as packaging

material) and measuring quantities. This area interacts with manufacturing (dispensing of raw

materials and receiving FPs), QC (dispensing samples and receiving information on the quality

status), QA (receiving information if FP has been approved for shipping) and R&D (similar to man-

ufacturing).

• R&D: area that is responsible for the discovery and development of new chemical synthesis routes,

industrial processes and QC strategies. Includes both GMP and non-GMP laboratories. GMP lab-

oratories are laboratories that follow the good manufacturing practices while non-GMP laboratories

just follow good laboratory practices (GLP). The latter offer more freedom to the chemists, but the developed methods always have to be converted to GMP for validation batches and for industrial production. The complete description of the two practices can be found in [23]. The R&D area interacts with the warehouse and QC, in the sense that it requires materials from the warehouse and quality control tests on its samples.

Although the areas mentioned above are the ones responsible for and supporting production, a few more areas present in the internal supply chain are of extreme importance, such as IT, management, sales, marketing, finance, human resources and purchasing. The relationship

scheme between the main areas is shown in figure 2.4, depicting a business process model and notation

(BPMN) graph. BPMN graphs are graphical representations of business processes, with the objective of

supporting business process management, for both technical and business users. Note that the BPMN

shown can be viewed as a more granular view of a process step from figure 2.3.

The manufacturing process at CDMOs is generally as described in the internal SC relationships

scheme from figure 2.4. Several distinct types of products can be manufactured, being considered

either FPs or INs. These last ones are generally stored and used in the production of an FP or another

intermediate. However, these processes tend to behave differently when a new product enters the

company’s product portfolio. After passing the clinical trials (the long process of discovering a drug and

having it approved by the responsible entities), one or several validation batches have to be performed.

2.2.1 Enterprise Resource Planner

An enterprise resource planner (ERP) is a tool which delivers the ability of integrating a suite of

business applications, according to Gartner, Inc. The majority of this work will be focused on operations,

more specifically, material management and production planning (PP).

The importance of the ERP can be clearly observed through the automation pyramid, which is shown

in figure 2.5. The automation pyramid is defined as the pictorial representation of the distinct levels of

automation in an organization. As depicted in the graph, the higher the position in the pyramid, the


Figure 2.4: Internal SC relationships scheme (BPMN diagram with swimlanes for Planning and Scheduling, Warehouse, Manufacturing, QC and QA)

more information-dense the resource is. In fact, comparing the ERP with field-level sensors, the amount of information contained is completely different: from a single data point per second or millisecond (assuming that the sensor does not include memory) to far more processed information, spanning decades of data collection. Similarly, in terms of quantity of devices, sensors exist in large quantities while the ERP is a single solution.

Figure 2.5: Automation pyramid [6]. Levels from bottom to top: sensor/actuator, process control, advanced control & diagnostics, visualization, MES and ERP, spanning the field, SCADA and enterprise levels and time constraints from milliseconds to years. MES ≡ manufacturing execution system, SCADA ≡ supervisory control and data acquisition.

The utility of this central system is clear: it stores, manages and controls data collected at lower

levels of the automation pyramid. Furthermore, it has the ability of performing optimizations based on


demonstrated performance and can have almost limitless automatic behaviors. Bajer [6] states that

company executives often rely on information from the ERP to make critical decisions in near real-time:

it is the enterprise level of the automation pyramid.

2.2.2 Production Planning

Figure 2.6: The production cycle flowchart, featuring sales forecasting, production planning (material & capacity requirements), production scheduling, production control, production, inspection and quality control, quality assurance, dispatch to customers and the customer. Note the production planning stage within the cycle, represented as the orange box.

The production cycle corresponds to the sequence of planning,

scheduling and execution steps involved in the manufacturing pro-

cess. This cycle is shown graphically in the diagram from figure

2.6. It shows that the process often repeats itself provided there

is interest from a customer (this interest also aids in driving the

sales forecasts). The central block in this diagram, in terms of lo-

gistics, planning and scheduling is the PP. PP tends to be one of

the most fundamental stages in the internal SC of an organization.

Carefully organizing the materials, workers and workcenters a few

months ahead can be of extreme importance, especially in highly

dynamic environments with little flexibility in the short-term. This

means that a good PP optimization is paramount in obtaining SC

operational excellence.

Another important concept which is deeply connected with the

production cycle is the sales and operations planning (S&OP) and

the S&OP cycle. The American Production and Inventory Control Society (APICS – a non-profit or-

ganization for supply chain management) defines S&OP as the ”function of setting the overall level of

manufacturing output and other activities to best satisfy the current planned levels of sales, while meeting

general business objectives of profitability, productivity, competitive customer lead times, as expressed

in the overall business plan” [17]. The S&OP cycle is then comprised of the different stages in a corpo-

rate plan that are sequential and always repeating and that feature different objectives at different stages

of the cycle.

A production plan is made systematically, for a given time period, known as the planning horizon.

Generally, four planning horizons can be distinguished. These time horizons are also often called S&OP

time fences, since they bound different stages of the S&OP cycle.

• Strategic Horizon: horizon beyond the long-term, which deals with strategy rather than planning.

It is within this horizon that management evaluates the impacts of, e.g., increasing the available

capacity or workforce. Objectives are also defined during this period.

• Long-term Horizon: horizon that features planned orders (orders that will almost certainly hap-

pen) and opportunities exploration. The latter refers to results from forecasts, with the objective of capacity optimization. Generally, capacity utilization is measured monthly, which means

that it is merely an estimation.

• Medium-term Horizon: horizon when orders’ plans are fixed – no new orders are accepted (un-


less agreed by management or due to major production delays) and there should not be changes

bigger than one week.

• Short-term Horizon: horizon where scheduling is dealt with. Individual tasks for specific projects

are allocated to workcenters and operators. During this period, no changes to the plans are al-

lowed, except when caused by manufacturing delays (changes no larger than one shift).

These time horizons are of extreme importance since they define the different planning stages. Ad-

ditionally, the aim within each time horizon changes greatly, from an operational and task-oriented point

of view in the short to medium-term, to a planning and capacity focus in the long-term, to a strategic and

managerial perspective in the strategic horizon.

Although there is not a clear consensus on the exact values of these time fences, which are highly

dependent on the industry in question or even in the specific organization, according to several sources

the time fences are distributed as follows.

Horizon    Short-Term    Medium-Term    Long-Term      Strategic
Time       1-8 Weeks     1-3 Months     1-24 Months    3-5+ Years

Table 2.2: Time horizons for the different S&OP cycle stages [5]

2.2.3 Capacity

In a manufacturing organization, available capacity is measured for a given production plant, area or

workcenter and for a specific range of time. It corresponds to the total available time (in the considered

period) multiplied by the number of resources related to the selected scope. This can be expressed both

in worker ·hours or machine ·hours. Capacity utilization is then a measure of how intensively a resource

is being used at a given time, with relation to its available capacity [55]. A similar concept is the effort

(or utilized capacity) that a certain activity requires. It can be simply defined as the amount of work that

is required to complete it. Both capacity and effort can be measured in terms of [worker · hours]. For

example, a specific task that requires an effort of 100 [worker · hours] means that it can take 50 hours

with 2 workers at any given time, 10 hours with a team of 10 workers or any combination of time and

workers as long as t ·w = 100. However, the time that the operation takes is not arbitrary, especially in the

production of chemicals, which often have specific reaction times; these require a given duration, which

cannot be shortened with additional workers. Since both capacity and effort are directly proportional

to the number of workers at any given time, the required monthly capacity can be used to estimate

the number of workers required at any time. Consider the example presented in equation 2.1, where

it is considered that a specific area requires 3500 [worker · hours] in a month. Consider the units as

w ≡ worker, sh ≡ shift, h ≡ hour, m ≡ month, d ≡ day

\[
\frac{3500\,\left[\frac{w \cdot h}{m}\right]}{30\,\left[\frac{d}{m}\right] \cdot 8\,\left[\frac{h}{sh}\right] \cdot 3\,\left[\frac{sh}{d}\right]}
= \frac{3500}{30 \cdot 8 \cdot 3}\,\left[\frac{w \cdot h \cdot m \cdot sh \cdot d}{m \cdot d \cdot h \cdot sh}\right]
= 4.86\ [w] \tag{2.1}
\]
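As a minimal illustration of this conversion, the sketch below reproduces the calculation of equation 2.1 in Python; the function name and the default shift pattern (30 days per month, three 8-hour shifts per day) are assumptions taken from the example, not part of the thesis implementation.

```python
# Minimal sketch of the conversion in equation 2.1 (illustrative values and names only).

def required_workers(monthly_effort_wh, days_per_month=30,
                     hours_per_shift=8, shifts_per_day=3):
    """Convert a monthly effort in [worker*hours] into the average number of
    workers needed, assuming the area operates every day on every shift."""
    available_hours_per_worker = days_per_month * hours_per_shift * shifts_per_day
    return monthly_effort_wh / available_hours_per_worker

print(required_workers(3500))  # approximately 4.86 workers, as in equation 2.1
```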

For a specific product, the amount of effort necessary can be defined a priori. Both the effort and

duration of the processes are subject to slight changes from their a priori values, due to the variability


inherent to the processes and production plants. These changes, even if relatively small, can have

a substantial impact on the overall productive activities and support areas. The variability in terms of

duration can affect the schedule of the tasks, while the variability in the efforts can affect the capacity

occupancy. To solve this problem, buffers are generally used; these can be in terms of duration, e.g.

scheduling a machine for a longer period than necessary to account for eventual delays in the process,

or in terms of effort, e.g. setting capacity limits below the maximum capacity to account for unexpected

capacity increases or losses of available capacity due to a sick worker. However, adding these buffers

decreases productivity: adding, for example, 10% more time to a specific task as a buffer means that the next task can only start after the end of the buffer; if the buffer turns out to be unnecessary, that time is wasted. Better predictions of duration and effort can, therefore, reduce the buffer size and

therefore increase productivity.

Measuring the baseline capacity of a production plant, regarding each individual productive area and

other supporting areas, such as QA, QC or warehouse, is a very important step, since it defines the limit

of asset utilization in each area; under regular conditions, there are no additional workers or workcenters to increase such capacity.

Figure 2.7: Typical evolution of the allocated capacity (%) through the short-term, medium-term and long-term time horizons

Figure 2.7 shows a typical evolution of the allocated capacity along the different time horizons, de-

fined in the previous section. Note that the strategic horizon is not included in the graph since it does not

deal with capacity allocation or resources but instead with managerial decisions. The types of orders

that exist and their impact on the allocated capacity are defined below.

• Planned Order: order placed by a client with a specific deadline and that can be considered

as a confirmed order (that will almost certainly happen). This type of order is placed within the

long-term time horizon, having the possibility of being placed within the medium-term horizon on

certain occasions, if so instructed by management. These orders include mainly information about

the product to be produced, the quantity required, and the deadline agreed upon – the client has

to have in consideration the lead-times offered by the manufacturing company.

• Process Order: a planned order is converted into a process order mainly during the medium-

term time horizon. This process associates the master recipe (which regards to tasks, operations

and workcenters) and the bill of materials (BOM) (which deals with materials necessary for the


production) to the order. Two important operations are then performed automatically by the ERP:

(1) the tasks are scheduled to their respective workcenter and their occupied capacity is accounted

for (note that this step may only happen during the short-term horizon when there are more process

orders) and (2) the materials are checked for stock availability and ordered if missing – this means that projects with raw materials with known long lead-times must be converted into process orders sooner than other projects. In this work, these orders are often referred to as Current Orders.

• Opportunities: derived from sales forecasts, the opportunities indicate potential clients and their

required products. Often opportunities can also be more on the product-side, trying to find a client;

imagine an independent manufacturing area with its predicted allocated capacity far from its limit

– it may be desirable to increase production there and find a client for those produced goods.

Opportunities are defined in the long-term horizon and sometimes even later.

Having these definitions in mind, one can observe how these influence the allocated capacity accord-

ing to the different time fences. In figure 2.7, three time horizons can be distinguished. The short-term

fence, which has its capacity at the defined capacity limit, regards mostly process orders. The medium-

term fence contains allocated capacity from a mixture of planned and process orders; some have been

converted while others have not. The long-term fence contains only allocated capacity by planned or-

ders, with available capacity for opportunities exploration or the addition of new orders by clients.

2.2.4 Integrated Supply Chain

The need for an integrated SC derives from the pharmaceutical industry’s current paradigm and the

evolution seen in computational capabilities, which enables the collection, processing and analysis of

enormous quantities of data, in order to extract knowledge and reach conclusions that otherwise would

not be possible (or at least extremely difficult). The integrated SC acts as a centralized entity that

manages all the areas that are integral to the business in order to optimize efficiency. This goes against

the more traditional way of having each SC member focused narrowly on its own objectives and tasks,

instead requiring all members to collaborate towards a common objective.

The digitalization of the integrated SC, converting it into an SC4.0, is a natural and fundamental step towards the overall improvement of its performance. A procedure for this transformation involves the application of IoT and the adoption of the automation pyramid from figure 2.5. This means that field level measurement

systems should be installed, collecting data at high frequency rates, leading to an ERP dense in correct

information and with all its capabilities fully utilized. The processes should be correctly and exhaus-

tively described, with automatic confirmation systems to avoid human-error and to enable demonstrated

performance-based simulation of each operation. Automatic resource measurement systems should also be adopted, to control stock levels and order materials automatically according to the necessities and lead-times of the suppliers, and heuristics for determining the capacity limits of individual workcenters should be implemented. Given sufficient and meaningful data, the opportunities for

improving performance along the SC agents are near limitless.

Robinson [46] points out several advantages of the digitalization of the SC, among them the ability to

better share information with international sites of the company, the decentralization of the inventories


and an overall streamlined route-to-market.

The DT both contributes to and benefits from the capabilities of the SC4.0. It takes advantage of the

SC4.0’s capabilities by using its large quantities of both raw and processed data, for visualization pur-

poses, while aiding in its development by generating predictions which may be used by other SC agents.

The concept of DT is connected with the SC in an equivalent way to a physical asset. Considering figure

2.1 and a turbine as the physical asset, the DT would be a digital model of the physical components and

interactions between them, physical principles, finite element models and historical data, while receiving

real-time measurement data about pressures and temperatures in specific points. The DT could then

perform accurate forecasts and give insights into the behavior of the physical asset. Analogously, a DT

of an internal SC would be comprised of a model of the SC, featuring models of its agents and their in-

teractions, historical data regarding durations, capacities, maintenance, malfunctions and projects, while

receiving data regarding current projects, available capacity, material stocks and other real-time infor-

mation. Using both the intrinsic knowledge of how the SC (contained within its models) works and how

it is behaving at a given point (including information about the planned future), the DT could display the

knowledge it possesses and could perform forecasts, giving the user insights into the future and what

should be done to increase efficiency.

2.2.5 Related Work

DTs, especially when applied to the SC, are still not very researched topics in academia and there are not many applications in industry. From the article by Kritzinger et al. [27], it can be seen that most of the current applications of DTs and other digital integrations are made in a manufacturing context. The article featured a majority of applications of DTs in maintenance, product lifecycle and production

planning & control. Although the latter could be connected to the SC in many ways, further research into

the individual articles reviewed by the authors showed a tendency for using discrete event simulation

in the production planning, autonomous guided vehicles and mechatronic systems, rarely mentioning

capacity planning.

Ivanov et al. [26] describe the DT as a combination of simulation, optimization and data analytics. The

authors proceed to describe the new paradigms in SC management, with the appearance of technolo-

gies such as IoT, cyber-physical systems and connected products. The presented digital technological

applications are described in terms of their effects on risk management and the ripple effect. The SCs are then in-

troduced as cyber-physical systems, which utilize the modern technologies in their cyber counter-part

and big data analytics, artificial intelligence, simulation and optimization for the conversion from physical

to cyber SCs. This approach allows for improved resilience and generates contingency and recovery

plans.

Regarding the simulation tool developed in this work, a simulation-based RCCP, it was seen that no

exact application of the concept could be found either in academia or industry. However, some work

has been done in related topics. There are many applications of RCCP tools not based on simulation,

extremely common in enterprise software. Papavasileiou et al. [39] offer a Monte Carlo simulation-based approach to task scheduling in a pharmaceutical environment. Although several methodologies are shared between that research and this thesis, the main objective differs from a more operational point of view (short-term and task-related) to a more strategic perspective, not focused on

the individual operations. Spicar and Januska [52] present an application of Monte Carlo Markov chains

in capacity planning. The presented method combines the power of Markov chains in estimating a state

given the previous state with the stochastic character offered by the Monte Carlo methods. The results

showed sales, income and defective produced units for a given month, according to the buying behavior

of the studied company’s clients and likely competition by its main competitors.

2.3 Proposed Solution and Expected Results

Two main objectives for the tool were identified as per the thesis objectives: deliver clear visualization

into past and present metrics of the SC and offer a scenario-based forecasting tool. The DT should

receive historical information from the higher levels of the automation pyramid, while using data from

lower levels of the pyramid for real-time data. Considering that the object of study is the SC, the real-

time data tends to be more relaxed into a daily or weekly timeframe, corresponding to data from the

SCADA level or from the manufacturing execution system.

The DT should feature efficient methods of portraying information in a meaningful but comprehensive

way. Information regarding activities happening at a given time, with the possibility of selecting the

shown data by production area, building or project, as well as the activities schedule and evolution of

the key performance indicators should be present, along with freedom for the users to interact with the

information and filter it in the most appropriate way to fit to their needs. Additionally, and as a support

for the simulation tool, there should be a project database, including information by project, the project’s

BOM, recipes, routings, efforts and demonstrated durations of the measured processes, along with the

probability distribution that was fitted to the existing data.

The simulation tool that best delivers the proposed objectives was a simulation-based RCCP, which

deals with capacity planning. This tool has the objective of acting as a decision-support tool and scenario

explorer for the decision-makers. The tool should perform the RCCP but unlike most currently available

tools, it should be done based on demonstrated performance and on the inherent variability that phar-

maceutical operations possess: it should be based on probability distributions that model the activities

durations, obtained by statistical analysis of the past data. Based on this, Monte Carlo simulation will be

used to simulate multiple scenarios and detect convergence in the monthly capacity utilization. Furthermore, the utilization of bottleneck assets (BAs) should be checked to verify whether there are any consistent overlaps in asset utilization.
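Purely as an illustration of the idea, the following sketch shows, under simplifying assumptions, how Monte Carlo sampling of stochastic durations could be aggregated into an expected monthly effort profile. The campaign data, the offset-Poisson sampling choice and the allocation of each QR effort to a single month are all hypothetical and do not reproduce the simulation tool developed in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical campaigns: manufacturing followed by quality release (QR).
# Durations are sampled in TUs from offset Poisson distributions (illustrative parameters).
CAMPAIGNS = [
    {"start": 0,  "mfg": (5, 12.0), "qr_effort": 80.0},
    {"start": 10, "mfg": (4, 9.0),  "qr_effort": 60.0},
]
TU_PER_MONTH = 30  # assumed number of TUs in a month

def sample_duration(offset, mean):
    """Offset Poisson draw: the offset bounds the minimum possible duration."""
    return offset + rng.poisson(mean - offset)

def one_run():
    """Allocate each campaign's QR effort to the month in which its QR starts.
    Spillover of effort into later months is ignored in this simplification."""
    monthly = {}
    for c in CAMPAIGNS:
        mfg_end = c["start"] + sample_duration(*c["mfg"])
        month = mfg_end // TU_PER_MONTH  # QR starts when manufacturing ends
        monthly[month] = monthly.get(month, 0.0) + c["qr_effort"]
    return monthly

def monte_carlo(n_runs=2000):
    """Average monthly QR effort (in worker*hours) over many stochastic runs."""
    totals = {}
    for _ in range(n_runs):
        for month, effort in one_run().items():
            totals[month] = totals.get(month, 0.0) + effort
    return {m: totals[m] / n_runs for m in sorted(totals)}

print(monte_carlo())  # expected QR effort per month index
```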

The objective of this project, as a DT of the internal supply chain of a pharmaceutical CDMO, with

a simulation tool capable of RCCP has not been found in either commercially available solutions or

academic research. Even for less specific components of the project, either simulation-based RCCP or

DTs of the internal SC, available solutions are scarce. This denotes the challenge and significance of

this project.

The definition of DT given by Kritzinger et al. [27] as an object with automatic data flows to and from

digital and physical objects does not agree with the concept here implemented. In fact, the definition


may not be seen as the most appropriate for this study, since the author considers studies mostly made

on the operational level and not on a logistics level. This means that the DTs analyzed by the author

were mostly regarding tangible assets, in contrast to the internal SC in study on this work, which has a

much higher abstraction level. At the level of tangible assets, automatic correction of the plans can be

made with little consequence in case of errors. However, on the SC (logistics level), where campaigns

are scheduled months in advance, deadlines with customers are set, raw materials are dependent on

suppliers’ lead-times and there are overall more stakes involved, automatic update of the real plans and

schedules without human interaction cannot simply be made, for responsibility and liability reasons. This

means that the definition made by Kritzinger et al. may not be applicable to every scenario.

The verification of interferences between projects’ BAs combines the work by Papavasileiou et al.

[39] in terms of demonstrated performance-based simulation and main asset scheduling with capacity

planning, more specifically, RCCP. Unlike the work done by Spicar and Januska [52], this current thesis

does not have the objective of evaluating sales, clients’ behavior and competition, and is instead focused

mainly on utilized capacity along the different areas that directly affect the production of pharmaceutical

products. Furthermore, the use of Monte Carlo Markov chains is less justifiable in scenarios where the

operations to be performed are known and future states are not dependent on current states, meaning

that there are no transition matrices.

The DT should be in line with the basic definition of DT by Ivanov et al. [26] as a combination of

simulation, optimization and data analytics. In fact, data analytics will be used to uncover the statistical

behavior of the processes with sufficient data; simulation will be used to obtain the expected monthly

capacity utilized; optimization will be used to verify that no BA is in conflict, and correct such conflicts

otherwise.

The developed solution is expected to both increase insights into the internal SC through its data

visualization abilities and deliver data-driven forecasts on future area occupancy based on demonstrated

performance. This allows for (1) better awareness about the current and past states of the SC, which can aid in making more data-driven decisions, and (2) better allocation of resources and the identification of opportunities to insert new campaigns.


Chapter 3

Knowledge Extraction

The DIKW pyramid is a model which represents the hierarchy and the functional relationship between

data → information → knowledge → wisdom [47]. The model simply states that wisdom is created

from knowledge, which is created from information, obtained from data. The higher the hierarchy the

more actionable and valuable the asset becomes. However, to obtain the subsequent level in the hierarchy, the current one must be analyzed and processed, often compiling the larger asset into a smaller, more detailed and meaningful one. Obtaining wisdom is, therefore, the ultimate goal, but the entire process starts with

just data. The levels of the pyramid can be defined as: [47]

• Wisdom: often considered an elusive concept, related to human intuition, understanding and

interpretation. Wisdom can also be defined as accumulated knowledge.

• Knowledge: combination of data and information, to which is added expert opinion, skills and

experience.

• Information: data that has been shaped into a form that is meaningful and useful to human beings.

• Data: discrete, objective facts or observations, which are unorganized and unprocessed, and do

not convey any specific meaning.

It can be concluded that although wisdom can be seen as an almost unattainable goal, everything

starts from the bottom, with the collection of data, followed by its processing into information, which can

be understood by humans, and with their skills and experience acquire knowledge.

Data has always been a crucial resource, but the ”democratized” computational power brought by

the 21st century has allowed the collection of varied types of data in tremendous amounts and its easier

processing into information. The practical advantages are near endless, with the identification of hidden

patterns, prediction of future events based on past data and the possibility of developing black-box

models, which with sufficient data can arrive at the desired conclusions, without the need to define the

behavior of the model itself.

An important aspect has to be mentioned regarding the models that require data: these models are

only as good as the data they are supplied with. This may seem like an obvious statement, but frequently

data is corrupted in some form, and the model will undoubtedly fail to portray the system it intends to

represent. Nevertheless, there are ways of counteracting this problem, especially when only part of the data is corrupted. The main method for solving it is outlier identification and removal, which will be addressed later.


The CDMO under study uses an ERP which stores its data in databases. The use of databases for storing large volumes of data is a customary practice, as their name suggests. They tend to be efficient in dealing with large volumes of data and can be queried to upload or retrieve existing data. For this thesis a developer environment was set up, based on the same architecture as the production system. The data contained there was a mirror of the actual production data, retrieved in September 2019 and hosted on an SQL server that mimics the actual system, and thus is prepared to work with live data with no significant modifications.

3.1 Collected Datasets

The data used on this work was contained in the PP and material management modules from the

ERP. From the first module, the tables used regarded planned orders, production orders, workcenters,

reservations and tasks, while from the second module only inventory management tables were needed.

The process of determining the tables and fields to be considered was the result of an extensive study, which provided a dictionary between the data and its meaning and utilization. This study resulted in an exploratory analysis report that was shared with the project stakeholders and is now being used by experts at the company to assist in exploratory analysis.

The datasets extracted can be divided into 4 categories:

• Processes duration: duration of the processes under study, calculated as the difference between the start and end dates of the activities. Explained in more detail in section 3.1.1.

• Planned orders: correspond to the orders that are already confirmed and will happen in the

following months. These can have either or both a start date and a deadline. The extraction of

the data is done assuming that the start date and deadline are contained within the long-term time

horizon, not including any horizon before or after. This is due to the fact that the orders in the short

and medium-term horizons are fixed and only subject to minor changes. It is not in the scope of

this project to deal with orders at those horizons. The data extracted contains the project, the start

date and deadline and the desired quantity.

• Required resources: from the routings and tasks tables, the required resources (regarding worker-

hours and equipment) associated with each project were extracted. The data extracted for each

project regarded mainly the operation description, equipment code, duration and effort. Note that

these values are theoretical values, which means that they are the reference values and are sub-

ject to change in the real operation. However, the ERP uses these durations and efforts when

scheduling activities and allocating capacity, which may lead to errors. Additionally, through the

equipment code, each operation was divided into manufacturing, QA, QC or warehouse.

• Available capacities: to accurately characterize workcenters with respect to their available capac-

ity, their maximum daily effort was extracted. Since the capacity will be evaluated monthly, the daily

values have to be summed by month for the total monthly capacity. These capacities are related

to the required resources since they establish the available daily (or monthly) limit that each area


has of a specific resource. In essence, each area has a pool of available capacity, which is consumed by the manufacturing and support activities (a minimal aggregation sketch is given after this list).
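The aggregation mentioned above can be illustrated with a minimal pandas sketch; the DataFrame layout and column names are assumptions made for the example and not the actual ERP tables.

```python
import pandas as pd

# Hypothetical daily available capacity per workcenter (column names are illustrative).
daily = pd.DataFrame({
    "workcenter": ["WC-A", "WC-A", "WC-B", "WC-B"],
    "date": pd.to_datetime(["2019-09-01", "2019-09-02", "2019-09-01", "2019-10-01"]),
    "capacity_worker_hours": [48.0, 48.0, 24.0, 24.0],
})

# Sum the daily values by month to obtain the total monthly capacity per workcenter.
monthly = (daily
           .assign(month=daily["date"].dt.to_period("M"))
           .groupby(["workcenter", "month"])["capacity_worker_hours"]
           .sum()
           .reset_index())
print(monthly)
```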

3.1.1 Processes Duration

The tables of interest for this extraction regarded the PP, routings and warehouse movements. In

terms of the duration of the processes, the data collection can be summarized in the graph from figure

3.1. This figure shows graphically how the available data is chronologically related to the productive and

support operations.

Figure 3.1: Extracted dates and their chronological relation to the real processes. The bars represent the real starts and ends of the different stages (M, WH, IPC, QC R, QC RV, M (BPR), QA (BPR), QA), while the vertical lines represent the dates that can be extracted: planned start of production, actual start of production, actual finish of production, FP stored in the warehouse, stock transferred into unrestricted use and shipping of the final product. Note that M corresponds to manufacturing, with M (BPR) being the BPR review process by the manufacturing team (not manufacturing itself).

As can be seen from the graph, there is not enough granularity from the extracted data to obtain the

actual starts and ends of all the processes. With this in mind, the quality release (QR) is considered,

which corresponds to the region immediately after production and containing the QC R and RV, manufac-

turing (BPR review) and QA operations. The QR can be measured with its start being the manufacturing

end and its end being the stock transferred to unrestricted use. Note that this QR is a construct and is

not actually a phase of the processes, but was created as the agglomeration of the mentioned stages.

Additionally, the mismatch between the planned start and the actual start results from possible delays

on the operation. These values can be simultaneous – no delays occurred, or the actual start can be

before the planned start.

The process of data extraction encompasses obtaining data from existing data sources, for further

processing. This is a fundamental step in real-world systems because clean and easily accessible data

rarely exists. To obtain the durations of the Manufacturing and QR’s processes, a series of operations

had to be made to the existing datatables, which were validated by the responsible stakeholders. The

process of extracting the mentioned durations comprises a series of steps, as shown and described

below.

1. Define assumptions

A couple of assumptions have to be set before extracting the data. Most of these have already

been mentioned in the description of the timeline graph from figure 3.1. All the assumptions are

presented below.


• Actual start denotes the beginning of the manufacturing processes and must always have a

non-null value.

• Actual finish denotes the end of the manufacturing processes and must always have a non-

null value.

• The start of the QR processes is assumed to coincide with the end of the manufacturing.

• The end of the QR processes corresponds to the entry of the accounting document meaning

that stock in quality inspection can be transferred to unrestricted use.

2. Extract the raw data – SQL query

Custom-made SQL queries were written and validated with subject experts within the organization.

3. Prepare the data

A final preparation has to be done to the extracted data. This includes mainly the calculation

of the duration of the processes, by converting the type of some fields and performing a few

mathematical manipulations. The duration of the processes is assumed to be discrete, since the

data rarely allows for increased precision and even when it does it would lead to greater noise;

it can be therefore assumed that the time unit is the smallest scale and that all the values are

discrete. This time unit will not be further defined for confidentiality reasons and will henceforth be referred to as TU (a minimal preparation sketch is shown after this list).

This extraction resulted in 2038 observations, regarding 194 unique projects.
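As a minimal illustration of the preparation step, the sketch below computes discrete manufacturing and QR durations from the extracted dates and discards obviously invalid rows; the column names and the choice of one day as the TU are purely hypothetical, since the real time unit and table layout are not disclosed.

```python
import pandas as pd

# Hypothetical extract with the dates from figure 3.1 (column names are illustrative).
orders = pd.DataFrame({
    "project": ["P1", "P1", "P2"],
    "actual_start": pd.to_datetime(["2019-01-02", "2019-02-04", "2019-03-01"]),
    "actual_finish": pd.to_datetime(["2019-01-12", "2019-02-15", "2019-03-20"]),
    "unrestricted_use": pd.to_datetime(["2019-01-20", "2019-02-28", "2019-04-02"]),
})

TU = pd.Timedelta(days=1)  # the real TU is confidential; one day is assumed here

# Manufacturing duration: actual start -> actual finish, in whole TUs.
orders["mfg_duration_tu"] = (orders["actual_finish"] - orders["actual_start"]) // TU
# QR duration: manufacturing end -> stock transferred to unrestricted use.
orders["qr_duration_tu"] = (orders["unrestricted_use"] - orders["actual_finish"]) // TU

# Drop rows with negative durations (data-insertion errors, as discussed in section 3.2.2).
clean = orders[(orders["mfg_duration_tu"] >= 0) & (orders["qr_duration_tu"] >= 0)]
print(clean)
```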

3.2 Distributions Fitting

Data, as it is collected, tends to have a variability associated with it, even for the same activities or

processes. There are many unpredictable events and occurrences that lead to distinct outcomes in

terms of duration, produced quantity or cost – in fact, almost all real-world systems contain at least one

source of randomness [28, p. 279]. The manufacturing and QR processes’ duration will be the main

focus of study in this section. Despite the variability inherent to real-world processes, data shows that

there is often a pattern in the duration of the activities. Considering that the durations of the processes are discrete values in TUs, as stated previously, the probability of a given value can be obtained

from historical data and the aggregation of all the probabilities generally follows a common PDF. This

can be clearly seen in a histogram, where the counts (or frequencies) of occurrence of an activity lasting

a certain amount of time are shown.

An important assumption was made, considering the discrete nature of the data and the requirements

given by the stakeholders: the bin width considered is always the TU itself, the smallest scale considered.

This is due to the fact that a certain level of granularity is required, which can be better observed through

the smallest possible bin width. Furthermore, the values themselves are not too dispersed, with datasets

frequently having only 10 to 20 bins even at the smallest bin width possible. This assumption can be

supported by the literature, in the sense that selecting the bin width (or number of bins) is a process

regularly done on a trial-and-error basis, observing the results and adjusting accordingly, until achieving


the bin width that creates the most ”rugged” histogram with the smallest bin width possible [28, p. 323].

Additionally, the problem of selecting the bin width is of greater importance in continuous distributions

rather than in discrete ones.

Replacing the PDF of a model by its mean is something that is often done. However, it can be

dangerous to perform this simplification, e.g. in high-variance PDFs, where the mean is not very repre-

sentative. Especially for simulation purposes, an activity’s duration should be modelled by a PDF that

represents it more reliably.

Even though the data itself can be used as the PDF, known as an Empirical Distribution, Hillier et al.

[25, p. 893] state that for simulation purposes, the assumed form of the distribution should be suffi-

ciently realistic that the model provides reasonable predictions while, at the same time, being sufficiently

simple that the model is mathematically tractable. While an empirical distribution does provide reason-

able predictions (sometimes there may be overfitting, causing wrong predictions), it is certainly not mathematically tractable. Note that storing a few tens or hundreds of observations may not be

too computationally demanding, but when there are millions of observations it becomes much more of

a problem; in contrast, using only a few parameters to define the whole distribution is generally a much

better choice. Hillier et al. [25, p. 891,1079] also consider manufacturing systems as queuing systems,

more specifically as Internal Service Systems (considering the case, as an example, of a machine that

can be viewed as a service, with customers being the processed jobs). For these types of systems,

Exponential Distributions are advised by the writers, due to their fitting capabilities and mathematical

tractability [25, p. 887]. Note that it is also mentioned that this type of PDFs is generally used due to their

advantages but that other distributions may also be chosen. In contrast, Law [28, p. 280-282] suggests

fitting a series of distributions to the data and choosing the one with the best goodness-of-fit test (GoF).

The approach that was chosen lies in between the methods stated in the citations above. First of

all, note that the PDFs of this problem are discrete, which by itself restricts the distributions chosen

to only discrete PDFs. Secondly, a considerable number of projects possesses only a small number

of observations. This means that fitting a series of distributions to each individual project dataset and

choosing the best fit is not applicable to these few-observations projects.

The idealized approach was to select a single distribution for the projects and then consider it as

the distribution that best describes all the processes. This can also be supported by the fact that since

the operations are similar in nature (pharmaceutical manufacturing processes or pharmaceutical quality

release processes) it can be assumed that their PDFs are also similar, hence a single PDF with dif-

ferent parameters is considered sufficient to correctly describe them. This selected distribution could

be obtained by a similar approach to the one mentioned by Law, fitting a series of PDFs to the most

representative projects (the ones with a considerable amount of observations) and generalizing the dis-

tribution with the best GoF to all the projects; this way, all the data would be fitted to distributions that

are proven to correctly fit data from the CDMO’s own manufacturing and QR processes.

The process of selecting the distribution that best fits the more representative projects’ data follows

the steps enumerated below.

1. Choosing the set of contender distributions

29

Page 46: Supply Chain Digital Twin - ULisboa

2. Defining the number of observations threshold for considering the datasets representative

3. Preprocessing the data

4. Looking for data outliers and removing them

5. Obtaining the statistical properties, such as mean, variance or skewness

6. Determining the most representative distribution for the data, based on the statistical properties

7. Fitting the distributions to the data

8. Evaluating the GoF between the data and the distribution

9. Selecting the most representative distribution

Most of the methods for fitting PDFs to real data are empirical methods based on observation and

trial and error. Nevertheless, a more automatic approach can be taken with only a small loss in accuracy

when dealing with irregular cases. This is necessary in the current scope since the distribution fitting

process will have to be done programmatically in the future, to allow new observations to modify the

distributions’ parameters.

3.2.1 Selecting the Distributions

A comprehensive set of theoretical PDFs must be chosen that can effectively fit data of diverse

types. Note that the pool of candidate PDFs must consist of discrete PDFs. According to

Law [28, p. 308-313], the Binomial, Negative Binomial and Poisson distributions are the most common

discrete PDFs and can be fitted to a wide range of data types. This set of distributions is also supported

by the aforementioned claims made by Hillier et al., defending exponential distributions as the most

common and best choice for manufacturing processes; in fact, all three of these distributions belong to the exponential family of distributions. Examples of these distributions’ behaviors can be seen in Figure

3.2.

Figure 3.2: Binomial (t = 10, p = 0.2; t = 10, p = 0.5; t = 5, p = 0.2), negative binomial (s = 10, p = 0.5; s = 2, p = 0.5; s = 3, p = 0.3) and Poisson (λ = 0.9; λ = 5; λ = 10) distributions for different parameters


Another important aspect to mention about these PDFs derives from their mathematical form, shown

in equations 3.1a to 3.1c. In contrast to other PDFs, such as the most common Gaussian distribution,

these distributions cannot take negative values.

\[
\text{Binomial:}\quad p(x) =
\begin{cases}
\dfrac{t!}{x!\,(t-x)!}\, p^{x} (1-p)^{t-x}, & \text{if } x \in \{0, 1, \dots, t\}\\[4pt]
0, & \text{otherwise}
\end{cases}
\tag{3.1a}
\]

\[
\text{Negative Binomial:}\quad p(x) =
\begin{cases}
\dfrac{(s+x-1)!}{x!\,(s-1)!}\, p^{s} (1-p)^{x}, & \text{if } x \in \{0, 1, \dots\}\\[4pt]
0, & \text{otherwise}
\end{cases}
\tag{3.1b}
\]

\[
\text{Poisson:}\quad p(x) =
\begin{cases}
\dfrac{e^{-\lambda}\lambda^{x}}{x!}, & \text{if } x \in \{0, 1, \dots\}\\[4pt]
0, & \text{otherwise}
\end{cases}
\tag{3.1c}
\]

A crucial assumption has to be made that will have a certain impact on the PDFs. This assumption states that the observation with the smallest duration bounds the distribution from below; simply put, this means that there will never be any value below the smallest one. Although in truth such an occurrence can happen, especially in less populated projects, the duration of a project is physically bounded by the

duration of the chemical processes which can never be sped up. This means that in an ideal system,

the distributions would tend to become increasingly exponential in shape. This can in fact be seen in the

projects with a higher number of observations, especially in manufacturing. Even though the projects

have a lower bound, the distributions allow values greater than or equal to zero; truncating the distributions at

the lowest existent value would solve this problem, but the fit would never be quite as good as a different

approach: offsetting the distribution to the smallest value, effectively considering it as the zero-value.

Mathematically, this would imply a minor change in their mass functions, shown in equations 3.2a to

3.2c. Consider x0 as the smallest observed value, in a certain project.

\[
\text{Binomial:}\quad p(x) =
\begin{cases}
\dfrac{t!}{(x-x_0)!\,(t-(x-x_0))!}\, p^{\,x-x_0} (1-p)^{\,t-(x-x_0)}, & \text{if } x \in \{x_0, x_0+1, \dots, t+x_0\}\\[4pt]
0, & \text{otherwise}
\end{cases}
\tag{3.2a}
\]

\[
\text{Negative Binomial:}\quad p(x) =
\begin{cases}
\dfrac{(s+(x-x_0)-1)!}{(x-x_0)!\,(s-1)!}\, p^{s} (1-p)^{\,x-x_0}, & \text{if } x \in \{x_0, x_0+1, \dots\}\\[4pt]
0, & \text{otherwise}
\end{cases}
\tag{3.2b}
\]

\[
\text{Poisson:}\quad p(x) =
\begin{cases}
\dfrac{e^{-\lambda}\lambda^{\,x-x_0}}{(x-x_0)!}, & \text{if } x \in \{x_0, x_0+1, \dots\}\\[4pt]
0, & \text{otherwise}
\end{cases}
\tag{3.2c}
\]
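To make the offset idea concrete, the following sketch fits offset Poisson and negative binomial distributions to a set of durations using scipy's loc shift, estimating the Poisson rate by maximum likelihood and the negative binomial parameters by the method of moments, and comparing the candidates by log-likelihood. These estimator and comparison choices, like the example durations, are assumptions for illustration and not necessarily the procedure adopted in this work.

```python
import numpy as np
from scipy import stats

def fit_offset_models(durations):
    """Fit offset Poisson and negative binomial PDFs to discrete durations in TUs.
    The offset x0 is the smallest observed value, as in equations 3.2a to 3.2c."""
    x = np.asarray(durations)
    x0 = x.min()
    y = x - x0                       # shifted data, now starting at zero
    m, v = y.mean(), y.var(ddof=1)

    # Poisson: the MLE of the rate is the sample mean of the shifted data.
    models = {"poisson": stats.poisson(mu=m, loc=x0)}
    # Negative binomial: method-of-moments estimates, valid only if over-dispersed.
    if v > m:
        p = m / v
        n = m * m / (v - m)
        models["nbinom"] = stats.nbinom(n=n, p=p, loc=x0)

    # Compare candidates by total log-likelihood (higher is better).
    scores = {name: float(dist.logpmf(x).sum()) for name, dist in models.items()}
    return models, scores

# Example with made-up durations in TUs.
models, scores = fit_offset_models([10, 10, 11, 11, 12, 13, 13, 15, 18, 22])
print(scores)
```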

3.2.2 Data Segmentation

Consider the data extracted according to the conditions detailed in the section regarding processes duration. This data needs to be subdivided into a smaller group of only representative projects, i.e., projects with a sufficiently high number of observations, capable of being fitted to a distribution in a


trustworthy manner. First of all, it is important to understand the order of magnitude of the data. There

is a total of 2038 observations, regarding 194 projects. From this dataset, a series of observations were removed: projects with fewer than 3 observations, since they could not be meaningfully fitted to a distribution; and observations with negative duration, since these clearly originated from data insertion errors. This led to a modification in the existing amount of data: regarding Manufacturing, a total of 1905 observations were recorded for 104 projects, while in QR there were 1869 observations for 102 projects.

Note that the mismatch between QR and Manufacturing comes from the fact that corrupted observations

in one area do not necessarily imply that they are corrupted in the other area.

Even for this filtered set, the majority of the projects has a small number of observations; the solution

for finding the most representative distributions is to only select projects with more than 30 observations.

This leaves a total of only 10 projects, with the number of observations per project ranging from 36 to

415. Although the number of projects that verify the above conditions is small, it may be enough to reach

the desired conclusions.
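A minimal pandas sketch of this segmentation (dropping negative durations, keeping projects with at least 3 observations and selecting the representative subset with more than 30) could look as follows; the column names are illustrative.

```python
import pandas as pd

def segment(durations, min_obs=3, representative_obs=30):
    """Split the duration table into the filtered set and the representative subset
    used for distribution selection. Column names ('project', 'duration_tu') are
    illustrative, not the actual extract schema."""
    clean = durations[durations["duration_tu"] >= 0]           # drop insertion errors
    sizes = clean.groupby("project")["duration_tu"].transform("size")
    filtered = clean[sizes >= min_obs]                          # projects with >= 3 obs
    sizes = filtered.groupby("project")["duration_tu"].transform("size")
    representative = filtered[sizes > representative_obs]       # projects with > 30 obs
    return filtered, representative
```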

Figure 3.3: Examples of PDFs from the obtained data (histograms of duration in TU)

As can be seen from figure 3.3, the PDFs of the real data tend to vary quite substantially from project

to project. However, even with the observed variability in the existing data, all the considered theoretical PDFs can successfully fit them, since they can take varied forms (see the graphs in figure 3.2).

Note that the third histogram relates to the data from a QR process; in reality, many of the QR processes

tend to have a much higher variability than the manufacturing ones. The chosen distribution function has to be able to successfully fit distributions with an exponential tendency (such as the first graph), a normal behavior (such as the second graph) and a uniform trend (such as the third graph).

3.2.3 Outlier Identification and Removal

By definition, an outlier is an observation that is far removed from the rest of the observations [31,

p. 89]. Its significance as representative data can be questionable: it may be incorrectly measured data that suffered from irregular deviations and should be discarded, or it may be merely an extreme manifestation of the random variability inherent in the data, in which case the values should be retained and processed in the same manner as the other observations in the sample [22].

Current methods for identifying and removing outliers from data can vary greatly. There are extremely powerful methods using supervised or unsupervised anomaly detection algorithms, such as support vector machines [48], replicator neural networks [24] or fuzzy logic-based approaches [11]. Although these methods


may be of interest when considering massive and diverse datasets, the data used here never exceeds 500 observations per project, which means that employing such a sophisticated method for outlier detection is not necessary for the foreseeable future. A simpler method that performs rather well for the size of the datasets used is based on the interquartile range (IQR). This method is based on Tukey's fences, a concept introduced by John Tukey in 1977 in [57]. According to his definition, an outlier is an observation outside of the range defined by:

$$\left[\,Q_1 - k(Q_3 - Q_1),\; Q_3 + k(Q_3 - Q_1)\,\right] \tag{3.3}$$

In equation 3.3, Q1 refers to the lower quartile (which splits off the lowest 25% of data from the highest 75%) and Q3 refers to the upper quartile (which splits off the highest 25% of data from the lowest 75%). Note that Q2 corresponds to the median (effectively dividing the dataset in half). As proposed by John Tukey, filtering with k = 1.5 yields a dataset without outliers, while filtering with k = 3 removes only the far-out observations.
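A minimal R sketch of this filter, following equation 3.3 (the function name and the data vector are illustrative):

# Tukey's fences (eq. 3.3): discard observations outside [Q1 - k*IQR, Q3 + k*IQR]
remove_outliers <- function(vals, k = 1.5) {
  q1  <- quantile(vals, 0.25)
  q3  <- quantile(vals, 0.75)
  cut <- k * (q3 - q1)                      # k times the interquartile range
  vals[vals > q1 - cut & vals < q3 + cut]
}

remove_outliers(c(8, 9, 10, 10, 11, 12, 40), k = 1.5)  # 40 is filtered out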

The effects of the outlier filter were applied to some of the distributions shown in Figure 3.3 and the

results can be seen in Figure 3.4.

[Figure: the three histograms of figure 3.3 (Count vs. Duration [TU]), highlighting the data accepted by the outlier filter for k = 0, k = 1.5 and k = 3.]

Figure 3.4: Data with outliers filtered out. Different values of k were used to observe their influence on the outlier filtering. Note that on the histogram on the right, outliers were caught only when k = 0 (which is an overly aggressive filter) – this means that, using this method, there can be situations in which no outlier is selected. The vertical lines plotted on the graphs denote the median of the distributions.

As can be seen from the figure, using k = 1.5 it is possible to obtain results that look promising. In fact, the third graph visually does not seem to have any outlier (which is supported by k = 1.5); regarding the first histogram, it can be seen that the filter could be slightly more conservative, but the results are acceptable. Note that using a value of k = 0 (which was merely done to observe the results of a far more aggressive filter – filtering everything outside the [Q1, Q3] range) shows that outliers can appear on either the left or the right of the graph. However, due to the inherent skewness of the histograms, with longer tails on the right, outliers on the left will not be as frequent as on the right.

The fact that the third graph shows that no observations were removed also reinforces the definition of outlier by Grubbs [22], who stated that outliers can be a mere manifestation of the randomness of a distribution and should not be discarded.


3.2.4 Data Statistics

There are interesting statistical characteristics that can be calculated from the data which can be

helpful in determining the PDF that fits to the data in the most appropriate way. These parameters

can provide insights into the shape, tendency, variability and other interesting characteristics. Besides

the more straightforward characteristics, such as the number of observations, mean, standard deviation,

minimum and maximum values, mode and median, a few other significant parameters can be calculated.

• Lexis Ratio: the Lexis Ratio is a parameter that can often provide useful insights into the form of

a discrete distribution. Note that this parameter can only be calculated for discrete distributions. It

is calculated through the expression 3.4.

$$\tau = \frac{\text{Variance}}{\text{Mean}} = \frac{\sigma^2}{\mu} \tag{3.4}$$

One empirical discrimination can be made from the Lexis ratio, as stated by Law [28, p. 322]. According to the author, a Poisson distribution is characterized by having τ = 1, a binomial distribution by having τ < 1 and a negative binomial distribution by having τ > 1.

• Skewness: the skewness of a distribution is a measure of its asymmetry. It can be calculated

through the expression 3.5. Note that while there are several methods of calculating the skewness

of a distribution, the one shown is the Pearson’s moment coefficient of skewness.

$$\text{Skew}[X] = \nu = \gamma_1 = E\!\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right] \tag{3.5}$$

A skewness ν = 0 indicates a symmetric distribution (for example, the normal distribution); a skewness ν = 2 corresponds to an exponential distribution, skewed to the right. Generalizing, a positive

skewness indicates a right-skewed distribution while a negative skewness a left-skewed distribu-

tion.

• Kurtosis: the kurtosis of a distribution is also a property that describes its shape, more specifically,

it is a measure of the ”tail weight” of a distribution [28, p. 322]. The most commonly used method

for calculating the kurtosis is by the excess method, also defined by Karl Pearson.

$$\text{Kurt}[X] = \gamma_2 = E\!\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right] - 3 \tag{3.6}$$

Note that this method of calculating the kurtosis is extremely similar to the calculation of skewness by the moments method. Indeed, the moment kurtosis is also part of the excess calculation: it corresponds to the expected-value term. The excess method simply subtracts 3 from the moment kurtosis, so that the normal distribution's kurtosis is set to zero.

While historically the kurtosis has been claimed to give insights into the flatness of the distribution and its tail weight, this claim has been refuted, as can be seen in [58], where the author states that its only unambiguous interpretation is in terms of tail extremity. A minimal R sketch computing the three statistics above is shown after this list.
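The base-R sketch below computes these three properties for a vector of durations; the data vector is illustrative and the sample standard deviation is used.

# Descriptive statistics used in this section (equations 3.4 to 3.6)
lexis_ratio <- function(x) var(x) / mean(x)                      # eq. 3.4
skew_moment <- function(x) mean(((x - mean(x)) / sd(x))^3)       # eq. 3.5, Pearson's moment coefficient
kurt_excess <- function(x) mean(((x - mean(x)) / sd(x))^4) - 3   # eq. 3.6, excess kurtosis

durations <- c(8, 9, 10, 10, 11, 12, 12, 13, 15, 20)             # illustrative durations [TU]
c(lexis = lexis_ratio(durations),
  skewness = skew_moment(durations),
  kurtosis = kurt_excess(durations))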

The properties for the 20 most relevant distributions (10 for each area: manufacturing and QR) were

calculated and several conclusions could be extracted from the results.

[Figure: scatter plots of Standard Deviation [TU] versus Mean [TU], in two panels (Manufacturing and QR); marker size encodes the number of observations (100 to 400).]

Figure 3.5: Plot of the mean versus the standard deviation of the representative projects, discriminated between manufacturing and QR. Note that a third dimension is included through the size of the markers; as suggested, a bigger marker indicates a project with a larger number of observations.

From the graphs shown in Figure 3.5, it can be seen that the manufacturing values are much more consistent in the projects' durations, and especially in their variance. Their mean is never above 20 TUs (frequently around 10 time units) while their standard deviation is generally below 5 TUs. Dataset size does not appear to have a direct correlation with the results. Regarding QR's values, the most interesting conclusion is the apparent effect that dataset size has on the standard deviation. It can be expected that a project with a greater number of observations is better defined and has a lower variance, and this effect is clearly seen in the QR results. The remaining data shows what can be clearly seen from the observations: QR durations are generally longer and more dispersed.

The remaining properties were calculated and can be seen in the parallel plot in Figure 3.6.

A few conclusions can be taken from the parallel plot. Perhaps the most obvious one is the clearly higher variance of most properties for the QR data. In reality, the manufacturing process tends to be

more constant between different projects than QR, whose processes are more chaotic, less predictable

and more prone to external influence. The dataset sizes show that there are 5 projects between 30

and 60 observations and 5 projects between 60 and 420. Note that the mismatch between the dataset

size for the manufacturing and QR of a single project derives from the fact that the outlier filter does not

necessarily remove the same number of observations for each area. The values for mean and standard

deviation show the same information as in Figure 3.5. Regarding the median and the mode, it can be seen that for the manufacturing duration the values tend to remain the same for a single project, and are actually equal to the mean value. This behavior is typical of theoretical distributions and shows, for example, that the distribution is not multimodal. Regarding these parameters for the QR duration, it

can be seen that they tend to vary more for a single project, which can be explained by the occurrence

of more uniform distributions. Regarding the Lexis ratio, which is a measure of variance, it can be seen

that it is similar to the standard deviation. This parameter can aid in selecting a theoretical distribution

between the binomial, the negative binomial and the Poisson PDFs. The majority of the values can

be seen to be above τ = 1 (4 values for Manufacturing and 8 values for QR), which by the empirical


[Figure: parallel plot with axes Dataset Size, Standard Deviation, Mean, Median, Mode, Lexis Ratio, Skewness and Kurtosis; one line per project, colored by area (Manufacturing or QR).]

Figure 3.6: In this parallel plot, the information regarding the distributions' statistical properties is shown, discriminated between manufacturing and QR duration. This plot aims to clearly picture the patterns that can be found in these statistical properties, and whether they are correlated with the areas they represent. A certain project's distribution is represented by a line, with each property on an axis. This can give insights into a single project's properties, by following a line, or into how a certain property is distributed across the projects, by looking at a single property's axis values.

rule described would mean that the most appropriate distribution to be fitted to the data would be the

negative binomial distribution.

Regarding the skewness of the projects, it can be seen that the values are mostly positive, which indicates that the distributions are generally skewed to the right. However, this skewness is not large, often being negligible (note that an exponential distribution has a skewness of 2, for example). Finally,

regarding the Kurtosis of the distributions it can be seen that it is mostly negative, for both manufacturing

and QR. In practical terms, and knowing that a Kurtosis of zero corresponds to the normal distribution,

it means that the distributions have a smaller tail weight than a normal distribution. While it may not be

completely accurate, a negative value of kurtosis can also mean that the "head" of the distribution is flatter than that of a comparable normal distribution, which can explain the uniform-like distributions, for

example.

[Figure: two histograms of durations (Count vs. Duration [TU]) with distinct skewness and kurtosis values.]

Figure 3.7: Examples of distributions with distinct values for skewness and kurtosis. Note that the outliers have already been removed in this graph. The first graph has Skew[X] = 1.00 and Kurt[X] = 0.14, while the second graph has Skew[X] = 0.31 and Kurt[X] = −1.02.

Considering the graphs from Figure 3.7, it can be seen that the first one shows a high skewness to the right, due to its exponential-like shape, and a tail weight comparable to that of a normal distribution, with

its head being relatively regular. This explains the values of high skewness and near-zero kurtosis. The

second graph is quite different, featuring a more uniform tendency. Its kurtosis is this low because the whole distribution is effectively the head: not only is that head flat, but there is also no tail weight left. The

skewness is less meaningful in such a situation and only differs from zero because there are a few peaks

and a few null entries in between existing ones.

There are a few empirical methods for choosing a PDF using the calculated statistical parameters.

One (using the Lexis ratio) has already been described and its conclusions have been drawn. Another method is the Cullen and Frey graph, presented in 1999 in Alison Cullen and Christopher Frey's book Probabilistic Techniques in Exposure Assessment [18]. There, a method for graphically

evaluating the most appropriate theoretical PDF is presented, through a plot of the Square of Skewness

versus the Kurtosis (in its unbiased form). This plot can be seen in figure 3.8, where the representative

datasets are plotted against the regions that classify the theoretical PDFs.

[Figure: Cullen and Frey graph, plotting Kurtosis versus Square of Skewness; the projects are shown as points against the regions of the theoretical distributions (normal, negative binomial and Poisson PDFs).]

Figure 3.8: Cullen and Frey graph. Note that, as the legend suggests, the different regions for the PDFs are distinguished by the lines or areas and the observations are shown as the blue dots.

As can be seen from the figure, the observations appear to be consistently inside the region that

classifies the distributions as following the negative binomial PDF. This is in accordance with the conclu-

sions taken from the Lexis ratio, which makes it likely that the most appropriate distribution to be used

is the negative binomial. However, further analysis will be performed to evaluate how each theoretical

PDF fits to the data.
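As a side note, this type of graph can be generated with the descdist function of the fitdistrplus R package; the sketch below is offered only for illustration, since the thesis does not state which implementation was used, and the data vector is illustrative.

# Cullen and Frey graph for one project's durations (illustrative data)
library(fitdistrplus)

durations <- c(8, 9, 10, 10, 11, 12, 12, 13, 15, 20)
descdist(durations, discrete = TRUE, boot = 100)  # plots kurtosis vs. square of skewness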

3.2.5 Fitting the Distributions to the Data

The process of fitting a theoretical PDF to the existing data is paramount to obtaining a PDF that is representative of that data. Fitting a theoretical PDF amounts to optimizing the distribution's parameters in order to maximize the GoF between the theoretical distribution and the real data.

The GoF is a test that calculates a parameter with the objective of measuring how well a distribution

is fitted. Law [28, p. 344] defines a GoF test as a statistical hypothesis test used to formally assess

whether the observations of a certain dataset are an independent sample from a particular distribution.


Several GoF tests exist and are widely used. The most significant ones in the literature will be described

and one will be chosen for the optimization operation. Note that while only one GoF test can be chosen

as an optimization criterion, several tests can be used to assess visually how well the data is fitted,

according to each of their own criteria.

• Chi-squared goodness-of-fit test (CS): Pearson’s CS (χ2) tests are a set of statistical tests used

to evaluate three types of comparison: homogeneity, independence and GoF. Here, only the GoF

test will be performed. For discrete distributions, the procedure for calculating the GoF starts by

dividing the dataset into cells. It is a widespread practice, given the relatively small size of the

dataset, to consider each TU as a cell, effectively rendering the cutting process unnecessary.

The formula for calculating the CS value (called Pearson’s cumulative test statistic) is shown in

expression 3.7.

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - N p_i)^2}{N p_i}, \quad \text{with} \quad \begin{aligned} O_i &\equiv \text{number of observations for TU } i\\ N &\equiv \text{total number of observations}\\ p_i &\equiv \text{theoretical probability of TU } i\\ n &\equiv \text{number of cells} \end{aligned} \tag{3.7}$$

• Kolmogorov-Smirnov GoF test: the Kolmogorov-Smirnov GoF test compares the empirical cu-

mulative distribution function (ECDF) with the cumulative distribution function (CDF) of the hypoth-

esized distribution [28, p. 351]. The ECDF is calculated through expression 3.8.

$$F_n(x) = \frac{\text{number of } X_i\text{'s} \leq x}{n} \tag{3.8}$$

The Kolmogorov-Smirnov GoF statistic is then simply defined as the largest vertical distance be-

tween the ECDF and the fitted CDF. The statistic is calculated using expression 3.9.

$$D_n^{+} = \max_{1\leq i\leq n}\left\{\frac{i}{n} - F\!\left(X_{(i)}\right)\right\}, \qquad D_n^{-} = \max_{1\leq i\leq n}\left\{F\!\left(X_{(i)}\right) - \frac{i-1}{n}\right\}$$
$$D_n = \max\left\{D_n^{+},\, D_n^{-}\right\} \tag{3.9}$$

• Cramer-von Mises GoF test: the Cramer-von Mises GoF tests are a set of tests used to evaluate

the GoF of a CDF when compared to an ECDF, similarly to the Kolmogorov-Smirnov GoF tests.

As stated in [4], although these tests are originally designed for continuous distributions, they have

also been adapted to discrete ones. This thesis uses the definitions given by Choulakian et al. [15]

in their article on Cramer-von Mises statistics for discrete distributions.

The Cramer-von Mises GoF tests can be divided into 3 separate tests, each with its own statistic: the Cramer-von Mises GoF test, $W^2$, the Watson GoF test, $U^2$, and the Anderson-Darling GoF test, $A^2$. Their statistics are defined in equations 3.10.

$$W^2 = N^{-1}\sum_{j=1}^{k} Z_j^{2}\, p_j \tag{3.10a}$$

$$U^2 = N^{-1}\sum_{j=1}^{k} \left(Z_j - \bar{Z}\right)^{2} p_j \tag{3.10b}$$

$$A^2 = N^{-1}\sum_{j=1}^{k} \frac{Z_j^{2}\, p_j}{H_j\,(1-H_j)} \tag{3.10c}$$

with
$k \equiv$ number of duration entries
$N \equiv$ number of total observations
$Z_j = \sum_{i=1}^{j} o_i - \sum_{i=1}^{j} e_i \equiv$ difference between observed and expected cumulative frequency
$\bar{Z} = \sum_{j=1}^{k} Z_j p_j \equiv$ average cumulative difference
$H_j = \sum_{i=1}^{j} o_i / N \equiv$ expected CDF

From the GoF tests shown, one had to be chosen, in order to optimize the distribution’s parameters

in relation to the chosen test. The test chosen was the Pearson’s CS test, for several reasons. First of

all, the CS is undeniably the most used, verified and supported test. Although this may not seem an

immediate reason for being a selection criterion, a more reliable test, with results that are meaningful to

a larger number of people is generally preferable. Secondly, in terms of algorithm, the CS is simpler and much less prone to errors, and its results appear to be consistently good, only matched by the Cramer-von Mises GoF test in terms of PDF. Finally, from an industry-specific point of view, it makes more sense to optimize according to the PDF, like the CS test, rather than the CDF, like the remaining tests. This is due to the fact that the optimization made through the CDF tends to optimize the whole distribution, weighing the tail significantly more than the CS test. This is not necessary for this specific problem and can even jeopardize the GoF of the remaining distribution. Nevertheless, the results of how the different GoF tests influence the fits, from both numerical and graphical points of view, are shown in Appendix B. The optimization formulation, using the CS test as the GoF metric, can then be made as shown in expressions 3.11.

$$\min_{t,\,p}\ \chi^2 = \sum_{i=1}^{n} \frac{\left(O_i - N p_{i_{bin}}\right)^2}{N p_{i_{bin}}} = \sum_{i=1}^{n} \frac{\left(O_i - N\cdot\frac{t!}{(x-x_0)!\,(t-(x-x_0))!}\, p^{(x-x_0)}(1-p)^{t-(x-x_0)}\right)^2}{N\cdot\frac{t!}{(x-x_0)!\,(t-(x-x_0))!}\, p^{(x-x_0)}(1-p)^{t-(x-x_0)}}$$
$$\text{s.t.}\quad t > 0, \quad p \in [0,1] \tag{3.11a}$$

$$\min_{s,\,p}\ \chi^2 = \sum_{i=1}^{n} \frac{\left(O_i - N p_{i_{nbin}}\right)^2}{N p_{i_{nbin}}} = \sum_{i=1}^{n} \frac{\left(O_i - N\cdot\frac{(s+(x-x_0)-1)!}{(x-x_0)!\,(s-1)!}\, p^{s}(1-p)^{(x-x_0)}\right)^2}{N\cdot\frac{(s+(x-x_0)-1)!}{(x-x_0)!\,(s-1)!}\, p^{s}(1-p)^{(x-x_0)}}$$
$$\text{s.t.}\quad s > 0, \quad p \in [0,1] \tag{3.11b}$$

$$\min_{\lambda}\ \chi^2 = \sum_{i=1}^{n} \frac{\left(O_i - N p_{i_{pois}}\right)^2}{N p_{i_{pois}}} = \sum_{i=1}^{n} \frac{\left(O_i - N\cdot\frac{e^{-\lambda}\lambda^{(x-x_0)}}{(x-x_0)!}\right)^2}{N\cdot\frac{e^{-\lambda}\lambda^{(x-x_0)}}{(x-x_0)!}}$$
$$\text{s.t.}\quad \lambda > 0 \tag{3.11c}$$

with
$p_{bin},\, p_{nbin},\, p_{pois} \equiv$ PDFs shown in expressions 3.2
$s,\, t,\, p,\, \lambda \equiv$ PDFs' parameters to be optimized

Lastly, an optimization method has to be selected. There are several criteria for choosing an opti-

mization algorithm: the problem’s complexity, efficiency of the algorithm and how well it converges for a

global minimum. In this current problem, the complexity is low, with only one or two parameters to be

optimized; this means that here is no need for a very sophisticated algorithm (such as Metaheuristics)

and that efficiency of the algorithm will never be a problem. The main objective is to choose an algorithm

able to consistently deliver satisfactory results. Due to the relatively simplicity of the optimization, a na-

tive R function was used for the optimization. The function was the constrOptim function, from the stats

package [56]. This function receives the function to be optimized, the initial estimates for the parameters

and the constraints of the optimization (e.g., the constraint that p ∈ [0, 1] for the Binomial and Negative

Binomial PDFs). A series of other parameters can be modified, including the method of the optimization.

The existing methods were tested to obtain the one which performed the best. An implementation of the

Nelder and Mead algorithm and of the BFGS were found to be the ones which delivered the best results most consistently. Their values were practically the same, so the method of choice was Nelder and Mead, which appeared to be less prone to errors. The method presented in the paper by Nelder and Mead [35] in 1965 is a heuristic search method characterized by comparing the function

values at the vertices of a simplex with (n + 1) vertices (n being the number of dimensions). As stated

by the authors, the method is shown to be effective and computationally compact.

Estimating the initial parameters for the optimization should be done whenever possible, not only because it provides an approximation to the parameters that reduces the number of iterations necessary, effectively increasing the algorithm's efficiency, but also because it makes the

algorithm less prone to falling into local minima. The estimation of the parameters is evidently dependent

on the PDF that is being optimized. The Poisson PDF, the only one considered with a single parameter, is

the simplest and most effective one at parameter estimation. As stated by Law [28, p. 313], the maximum likelihood estimate for the Poisson distribution is $\hat{\lambda} = \bar{X}(n)$, i.e., the average value of the distribution's

observations. Regarding the Binomial and Negative Binomial distributions and considering that both

their parameters are unknown, the maximum likelihood estimation is a much more complex process

which, given the relatively simple nature of the optimization, becomes unfruitful. As a consequence, the

initial parameters were set a priori, based on results obtained empirically and that seemed to lead to

optimizations with fewer iterations and without getting stuck in local minima. The initial values were then

t = 100, p = 0.5 for the Binomial PDF and s = 1, p = 0.2 for the Negative Binomial PDF.

3.2.6 Results of the Fitting

The algorithm for the implementation of the PDF fitting is shown in algorithm 1. In the algorithm shown, a few R functions are considered: first of all, the functions dbinom, dnbinom and dpois correspond to the PDFs of the binomial, negative binomial and Poisson distributions, respectively. IQR is the function that calculates the interquartile range, and quantile the function that calculates the quantile at a


certain percentage. Lastly, the function constrOptim is the optimization function that receives the initial

parameters, the function to minimize and the constraints. Assume the variable manufacturing values

(and QR values) as the data structure containing the distribution values for all the different projects and

Freq as the vector containing the real frequencies of a project. Note that all the functions mentioned

belong to the R package stats [56].

Algorithm 1 Distribution Fitting Algorithm
1: procedure CHISQUARE(par)
2:     p = dbinom(x, par[1], par[2])            ▷ Could be dbinom(x, t, p), dnbinom(x, s, p) or dpois(x, l)
3:     expected = n * p
4:     return sum((Freq − expected)^2 / expected)
5: vals = manufacturing values[project]          ▷ Manufacturing or QR
6: cut = IQR(vals) * 1.5                          ▷ (Removing outliers)
7: lower = quantile(vals, 0.25) − cut
8: upper = quantile(vals, 0.75) + cut
9: vals = vals[vals > lower & vals < upper]
10: gf = constrOptim(par, ChiSquare, constr)      ▷ Optimize parameters (initial parameters par)
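A runnable R sketch of this procedure for the shifted negative binomial case is given below; the data vector, the initial parameters and the small guard against zero expected counts are illustrative assumptions, and the constraints s > 0 and 0 < p < 1 are passed to constrOptim in its ui %*% par >= ci form.

# Observed durations of one project (illustrative), already outlier-filtered
vals <- c(8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 13, 15, 20)
x0   <- min(vals)
tab  <- table(factor(vals, levels = min(vals):max(vals)))
x    <- as.integer(names(tab))   # duration values [TU]
Freq <- as.numeric(tab)          # observed frequencies
n    <- sum(Freq)

# Pearson's chi-square statistic (eq. 3.7) for the shifted negative binomial
chi_square <- function(par) {
  expected <- n * dnbinom(x - x0, size = par[1], prob = par[2])
  sum((Freq - expected)^2 / pmax(expected, 1e-12))  # guard against zero expected counts
}

# Linear constraints s > 0, p > 0 and p < 1, written as ui %*% par - ci >= 0
ui <- rbind(c(1, 0), c(0, 1), c(0, -1))
ci <- c(0, 0, -1)

fit <- constrOptim(theta = c(1, 0.2), f = chi_square, grad = NULL,
                   ui = ui, ci = ci, method = "Nelder-Mead")
fit$par    # optimized (s, p)
fit$value  # chi-square value at the optimum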

With the objective of evaluating how well each theoretical distribution fitted the data, the optimization

process was performed, for all the representative projects and for both areas, manufacturing and QR.

The results are presented in table 3.1. Note that, for obtaining the time taken by each optimization, the optimizations were run for 1000 iterations and the values were averaged.

PDF                  Manufacturing         QR                    Average
                     CS     Time   Exc.    CS     Time   Exc.    CS     Time   Exc.
Poisson              336.2  1.1    0       284.4  1.1    7       324.2  1.1    7
Binomial             445.3  14.9   0       579.6  17.6   7       476.3  15.5   7
Negative Binomial    20.6   15.4   0       72.0   5.3    0       46.3   10.3   0

Table 3.1: Results of the optimization of the PDFs' parameters for distribution fitting. The values presented are the CS value, the time that each optimization takes (in milliseconds) and the number of exclusions. An excluded value regards a project which, for a specific distribution, has a CS value above 10000. This limit was selected empirically, with the objective of removing values immensely greater than the values shown, which would therefore create meaningless averages.

Analyzing the results from table 3.1, a series of conclusions can be taken. Regarding code efficiency, it can be seen that the Poisson distribution is clearly the most efficient; this comes as no surprise given the fact that it is the only distribution with a single parameter. The negative binomial seems to be

generally better than the binomial. However, it is important to mention that the optimization efficiency

is not a crucial factor when selecting the distribution, since the magnitudes of the values are relatively small (on the order of milliseconds). It can be seen that, in terms of CS value, the negative binomial is much better than the remaining distributions, both in Manufacturing and in QR. The binomial distribution is the one that performs the worst. Furthermore, it can be seen that, while the CS value of the negative binomial increases from Manufacturing to QR, it does not have any exclusion, contrary to the remaining distributions, where the majority of the projects had CS values above 10000. In

figure 3.9, the different distributions are fitted to the histograms that have already been shown.

Visually, it is clear that the Negative Binomial PDF offers the best fit to the majority of the datasets.

Note the first and third graphs, describing data with an exponential and uniform tendency, respectively.

[Figure: the three example histograms (Count vs. Duration [TU]) overlaid with the fitted binomial, negative binomial and Poisson PDFs.]

Figure 3.9: Results of the fitting process between the projects' distributions and the theoretical PDFs, applied to the examples that have already been shown.

Especially in these cases it can be seen that the negative binomial is the most appropriate PDF. Additionally, by observing the CDFs shown in figure 3.10, it is possible to see the fits of the different distributions. Note that, while the optimization was not based on a CDF, like most of the described GoF methods, the fitted distributions (especially the negative binomial) tend to be robust in following the ECDF.

[Figure: ECDFs of the three example projects (Duration [TU] on the x axis) with the fitted binomial, negative binomial and Poisson CDFs overlaid.]

Figure 3.10: ECDFs and how the theoretical PDFs fit to them

After analyzing all the data from the fitting process, it is clear that the best choice for the PDF that

delivers the most consistent and most accurate prediction in both Manufacturing and QR processes is

the Negative Binomial. The assumption is then made that all the processes at the CDMO in study follow

a Negative Binomial PDF. In fact, using the Negative Binomial distribution when modelling durations can

be found in literature, for example in the article by Carter and Potts [10], where Hospital length-of-stay is

modelled using this PDF. Literature on modelling pharmaceutical processes durations’ PDFs was found

to be scarce and there were no occurrences of the usage of the negative binomial distribution; however,

its utilization can be justified by the study here presented.


Chapter 4

Simulation-Based Rough Cut Capacity Planning

The simulation tool developed with the objective of conducting scenario-based forecasting and op-

timizing the overall SC service level was a simulation-based RCCP tool, data-driven and built on the

concept of demonstrated performance. The key stakeholders at the CDMO in study considered this to

be the tool which could bring the most benefits, given the nature of the project.

4.1 Problem Description

Capacity planning can be defined as the process of determining and evaluating the amount of capac-

ity required for future manufacturing operations. This capacity can often be in terms of labor, machinery,

warehouse space or supplier capabilities. Planning for capacity is a crucial step, since it can both eval-

uate if the future manufacturing processes can take place without problems and enable better resource

allocation, reducing inventory levels and increasing the overall utilized capacity. It can be performed

at several levels: product-line level (resource requirements planning, RRP), master-scheduling level

(RCCP) and material requirements planning level (capacity requirements planning, CRP) [54]. This ca-

pacity planning process is graphically represented in figure 4.1, where it can be seen that the RCCP is

the capacity planning at the master-scheduling level. The master production schedule is the plan made

by the company regarding production, staffing and inventory [7].

The RCCP step comes as the capacity plan at the tactical level, which regards the master production

schedule at the requirements level. In fact, Oracle Applications [36] defines it as a long-term capacity

planning tool that marketing and production use to balance required and available capacity, and to

negotiate changes to the master schedule and/or available capacity. Using the results from an RCCP,

the master schedule can be modified in order to solve capacity inconsistencies by moving scheduled

dates or increasing/decreasing scheduled production quantities. Additionally, the baseline capacity can

be increased when necessary, by adding overtime shifts or subcontracting personnel; to this end, a

rough estimate of the necessary capacity at a given time has to be known ahead, hence the need for

the RCCP.

The RCCP can be distinguished from the RRP and the CRP due to the level at which they operate.

The main definitions and differences between these are described below.

• RRP: RRP has the objective of creating a profile of the work centers’ load that the system uses to

validate a forecast, determining available capacity and long-range requirements for a work center.


[Figure: diagram of the capacity planning process, linking Forecasting, Resources Requirements Planning, Master Scheduling, Rough Cut Capacity Planning, Master Requirements Planning and Capacity Requirements Planning across the strategic, tactical and operational plans (requirements side vs. capacity side).]

Figure 4.1: Capacity Planning Process [38]

This means that it is a planning stage more focused on the strategic level. Usually, the RRP is

generated after generating a long-term forecast, using its data of future sales to estimate time and

resources required for the production operations. Only after the RRP can the master schedule be

produced, which justifies the higher level of the RRP comparing with the RCCP. Due to its strategic

operating level, RRP can aid in several aspects, e.g. expanding existing facilities, staffing loads or

determining capital expenditures for equipment [37].

• CRP: CRP is used to verify if an enterprise has sufficient capacity available to meet the capacity

requirements from the MRP plans. CRP is a more detailed capacity planning tool than RCCP in

the sense that it considers schedules and on-hand inventory quantities when calculating capacity

requirements. The capacity plans that come from the CRP are a direct statement of the capacity

required to meet the company’s net production requirements [36].

The rough character of the RCCP has a series of implications. First of all, the scheduled campaigns

(short-term horizon) and on-hand inventory quantities are not within the scope of the capacity require-

ments calculation. Secondly, the capacity tends to be measured at a large timeframe, frequently monthly

or biweekly.

4.2 Proposed Approach

The main objective behind developing an RCCP tool was to obtain an estimate of the utilized capacity in the long-term horizon, regarding workstations and workforce. Since this tool was to be used

as a component of a DT, which possesses large quantities of information regarding current and past

states of the productive areas and models of how the processes tend to occur, adding the concept of

demonstrated performance was deemed as an opportunity for improving the baseline performance of


the tool. This elevates the results of the RCCP tool from being strictly dependent on recipe information to being based on observed performance, effectively accounting for more scenarios and generating more accurate and dependable results. Furthermore, the DT directly affects this simulation-based

RCCP by automatically updating the probability distributions that model the processes durations and by

adding the information regarding the current orders (and short/medium-term horizon ones), creating an

additional constraint in terms of activity planning. For this work, and considering the data in table 2.2,

the short-term timeframe is considered at 1 month from the current date, medium-term time frame at 3

months and long-term at 2 years.

The approach for implementing the demonstrated performance in the RCCP tool was the Monte Carlo (MC) method. This class of computational algorithms relies on random sampling of values in order to find a pattern or tendency and, theoretically, is able to solve any problem with a probabilistic interpretation. In the case at hand, it was seen that the manufacturing and QR durations

had probabilistic characteristics that could be measured (see section 3.2), which are propagated to the

area’s efforts.

The use of the MC method is justified by both the non-linear character of the problem and the fact

that the system cannot be accurately modeled. While other methods are more accurate and much less

computationally demanding (such as the Kalman filter for linear systems or the extended Kalman filter

for nonlinear systems), these require an accurate system model. The objective of the MC method is then

to generate distributions of the predicted monthly capacity for each area, given the variability inherent to

the systems.

The MC method can be mathematically formulated by initially defining a probability space (Ω,F , P ),

corresponding to the sample space Ω, the set of event outcomes F and the function that assigns prob-

abilities to the events P . The application of the probability space to the problem at hand is made as

shown in equation 4.1.

$$\Omega = \left\{D_{M_1},\, D_{QR_1},\, D_{M_2},\, D_{QR_2},\, \cdots,\, D_{M_N},\, D_{QR_N}\right\}, \quad \Omega \in \mathbb{N}$$
$$\mathcal{F} = 2^{\Omega}$$
$$P(x) = \left[\prod_{i=1}^{N} P\!\left(D_{M_i}, s_{M_i}, p_{M_i}, x_{0_{M_i}}\right) \cdot P\!\left(D_{QR_i}, s_{QR_i}, p_{QR_i}, x_{0_{QR_i}}\right)\right]_j \tag{4.1}$$

with $P(x, s, p, x_0) = \dfrac{(s+(x-x_0)-1)!}{(x-x_0)!\,(s-1)!}\cdot p^{s}(1-p)^{(x-x_0)}$ (Negative Binomial PDF)

A few considerations are in order regarding the probability space of each PDF. Each value of $D_{M_i}$ or $D_{QR_i}$ corresponds to a manufacturing or QR duration of project $i$, taking a value from $\{D_{min_i}, \cdots\}$. While establishing $D_{min}$ as the minimum duration for a process is correct,

doing so for the upper bound Dmax would not be quite as mathematically correct, since in theory there is

no upper bound. However, in practical terms, an upper bound exists and could be observed. Secondly,

note that this formulation tends to vary from more common applications of the MC method. Oftentimes,

MC simulation is used in gambling, to evaluate the risk of successively playing in a certain game; this

assumes that the probabilities are sampled in a succession, with the complexity increasing with each

additional succession considered. This means that for a considerable complexity not all combinations


can be accounted for, but the universes that are being calculated converge to a representative solution.

This would be near-impossible to obtain mathematically for problems with a certain complexity. In this

problem, the probabilities are not sampled in a succession, but they are rather part of the same universe.

This means that for a given iteration of the Monte Carlo simulation, each campaign’s duration is sampled

and translated to a specific resulting capacity. Each iteration will then feature a different universe, char-

acterized by a certain monthly utilized capacity on each area. By analyzing more universes (iterations),

the capacity will converge to a more representative one.

Sugita [53] offers additional formulation of the MC method, and also justifies why the utilization of

pseudorandom numbers in the sampling process is acceptable. This is important because while theoret-

ically MC simulation works for completely random sampling, it has been the subject of some suspicion

when utilizing pseudorandom sampling, which comes as a condition when computationally sampling

random values. However, the author proves mathematically that it is also valid.

4.2.1 Methodology

The methodology followed when constructing the simulation is described in this section. This includes

assumptions made and approaches followed.

For the first approach, the concept of confidence level has to be defined in the scope of the current

work. Given that the PDFs do not have an upper bound and can theoretically assume values up to infinity, such an event is undesirable and can greatly influence the results. To account for this problem, a

confidence level was defined, as the percentage corresponding to the maximum acceptable duration in

the CDF. By doing so, the theoretical PDF is truncated, only accepting values inside the chosen confi-

dence level. This level can be chosen by the user, but setting it to 90% has been seen to successfully

remove a sizable portion of undesirable points, while keeping a varied distribution. To implement this

truncation, the algorithm cannot simply assign the maximum value (corresponding to the chosen confi-

dence level) to any sampled value bigger than that, since this would create an unbalanced frequency on

the maximum value. Instead, this truncation is implemented by sampling values in a loop until they are

in the acceptable region. The result from this process can be seen in the graphs from figure 4.2.
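A minimal R sketch of this truncation, assuming a project whose durations follow a shifted negative binomial with fitted parameters size, prob and offset x0 (all values illustrative):

# Sample durations from the shifted negative binomial, rejecting any value above
# the maximum acceptable duration given by the CDF at the chosen confidence level.
sample_truncated <- function(n, size, prob, x0, conf = 0.90) {
  d_max <- x0 + qnbinom(conf, size = size, prob = prob)  # maximum acceptable duration
  out <- numeric(0)
  while (length(out) < n) {
    s   <- x0 + rnbinom(n, size = size, prob = prob)     # shifted samples
    out <- c(out, s[s <= d_max])                         # keep only accepted values
  }
  out[1:n]
}

durations <- sample_truncated(2000, size = 1.2, prob = 0.15, x0 = 3)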

[Figure: three sampled histograms (Frequency vs. Duration [TU]) plotted against the theoretical PDFs, with the confidence-level cutoff marked.]

Figure 4.2: Examples of truncated distributions. The confidence level was set to 90% and the number of samples for each example was 2000. Note that the grey region represents the theoretical PDF, the bars correspond to the histogram of sampled values and the vertical line corresponds to the confidence level.


By setting a confidence level and performing the truncation of the PDFs, the range of values from which $D_{M_i}$ and $D_{QR_i}$ are taken becomes bounded, as $\{D_{min_i}, \cdots, D_{max_i}\}$. To better

understand the mathematical formulation, consider the example defined by a scenario where the projects

to be sampled are defined as shown in table 4.1.

Project, i    Possible manufacturing durations    Possible QR durations
1             {4, 5, 6}                           {19, 20}
2             {13, 14}                            {4, 5}
3             {3, 4}                              {15, 16, 17}
4             {13, 14}                            {4, 5}

Table 4.1: Example scenario of orders to be sampled. Note that no start and finish dates are provided: the dates are not necessary for the sampling process. This example is simplified and generally the range of possible values is much larger, as well as the total number of projects.

For this specific example, the probability space would be defined as shown in equations 4.2

$$\Omega = \left\{\left\{D_{M_1}, D_{QR_1}, D_{M_2}, D_{QR_2}, D_{M_3}, D_{QR_3}, D_{M_4}, D_{QR_4}\right\}_j\right\} =$$
$$= \left\{\{4, 19, 13, 4, 3, 15, 13, 4\},\; \{5, 19, 13, 4, 3, 15, 13, 4\},\; \{6, 19, 13, 4, 3, 15, 13, 4\},\; \cdots\right\} \tag{4.2a}$$
$$P(x) = \left[\prod_{i=1}^{N} P\!\left(D_{M_i}, s_{M_i}, p_{M_i}, x_{0_{M_i}}\right) \cdot P\!\left(D_{QR_i}, s_{QR_i}, p_{QR_i}, x_{0_{QR_i}}\right)\right]_j \tag{4.2b}$$

Although this example is extremely simple, the sample space Ω would be composed of 576 possible

scenarios, with their own respective probability of occurrence, as defined by equation 4.2b. Furthermore,

the set of possible events, $\mathcal{F} = 2^{\Omega}$, with $|\mathcal{F}| = 2^{576} \approx 2.47 \cdot 10^{173}$, would also be immensely large. In fact, considering the scenario used for the results presented in section 4.3.4, where a total of 547 orders were considered, resulting in a sample space with around $10^{1392}$ scenarios, it can be seen that it would

be impossible to calculate the probability of each scenario and to evaluate which scenario would be more

probable. The use of Monte Carlo is justifiable for such a scenario: by randomly sampling values, not all

scenarios can be obtained, but a convergence can be found, which in theory would tend to the scenario

with the highest probability of occurring.

Note that one could argue that the scenario with the highest probability could simply be obtained by directly extracting each PDF's highest-probability value, generating a scenario comprised of all the planned orders with their manufacturing and QR durations set to the most probable ones. However, the objective of this algorithm is obtaining the most probable monthly capacity utilization scenario, which is not necessarily the scenario with the most probable durations. Instead, the monthly utilized capacities have to be calculated for each scenario and a convergence has to be found.

After a confidence level and the number of iterations of the Monte Carlo algorithm are set, the base loop can be run. This loop simply samples durations for the manufacturing and QR processes of each project and for each iteration. While the capacities could be calculated in this loop, they are instead calculated in a separate loop (this is possible because the results are stored). The start and


end dates of each campaign are also calculated. These can be calculated through 2 different methods,

chosen by the user: latest start date (LSD) or earliest due date (EDD). Note that the planned start date

and deadline are given with the planned orders. The two types of simulation simply establish whether, during the simulation, the planned start date or the deadline is fixed. By fixing the start date, the simulation is run and an expected finishing date is obtained, which corresponds to the EDD; in contrast, by fixing

the deadline (while applying a safety buffer), the durations are calculated and the start date is obtained,

corresponding to the LSD.

After running the main loop, the monthly capacities can be calculated for each iteration. The as-

sumptions used for this calculation are fundamental for the results to make sense. These assumptions

are based on the values of manufacturing and QR duration (sampled from PDFs) and on the values

from the projects’ recipes, the processes durations and efforts, for the different productive and support

areas. The main assumption is that the manufacturing effort is scaled with its duration. This means that if the recipe states that the manufacturing process has a duration of $d$ and an effort of $e$, but in reality the duration is $d_{real}$ (sampled from a PDF), then the effort would be $\frac{d_{real}}{d}\cdot e$. The remaining efforts are

not scaled in quantity, only in ”location”. While this means that the total effort is always the same, the

monthly effort can vary. Note that the QR processes include the FP storage by the warehouse, manu-

facturing BPR review, QC R and RV operations and QA. In terms of efforts, QR is divided into all these

operations and does not have its own effort. The basic assumptions regarding the manufacturing and

support areas are shown in table 4.2.

Area      Sampled Manufacturing    Sampled QR
M         Scaled                   Beginning
QA        –                        After M + After QC RV
QC IPC    Distributed              –
QC R      –                        Beginning
QC RV     –                        After QC R
WH        Beginning                Beginning

Table 4.2: Efforts assumptions. Scaled means that the daily effort is constant and expanded to the actual duration of the process. Distributed means that the total effort is the same, but the daily effort is changed to account for the real duration. Beginning means that the effort is placed in the first day of the process (when there is no more precise information).

These assumptions are also graphically represented in the graph of figure 4.3. Note that while there

is only one set of QC R/RV, in a real campaign there can be multiple occurrences of these stages,

regarding different operations. The way that they operate is simple: all of the QC R stages start after the

end of the manufacturing processes; each QC RV stage starts after the associated QC R stage ends;

finally, the QA operations regarding QC start after the last QC RV operation ends. These assumptions

are specific to the CDMO in study and were verified and approved by the responsible stakeholders of this project.

One final constraint is used by the algorithm, at a planning level. The basic idea is to verify whether

or not, for a specific simulation, there are clashes in the scheduling of the BA of each campaign. A BA

is defined in the scope of this thesis as an asset which is unique and fundamental for a certain task in

a project and may be used to produce different products. This verification has the objective of analyzing


[Figure: Gantt-style chart of one campaign across the areas M, QA, QC IPC, QC R, QC RV and WH over Time [TU]; the first row shows the recipe durations and efforts, the second row shows the values after applying the sampled manufacturing and QR durations.]

Figure 4.3: Example representation of the assumptions. Note that the first line represents the durations and efforts according to the recipe, while the second represents the values scaled according to the rules mentioned. The efforts are expressed inside the bars, while the duration is translated from the x axis. On the second plot, the sampled values are also indicated; these will affect the used values for both durations and effort according to the rules previously stated.

if there are any campaigns that require a given BA at the same time. This can be an opportunity for

improvement since the ERP at the CDMO in study schedules the campaigns based on their recipes and

it has been seen that it is not always strictly followed. Usually, buffers are set after the campaigns to

account for these problems, but ideally these should be kept as small as possible. A better solution is to

use the demonstrated performance to obtain more representative scenarios. Similarly to the graph from figure 4.3, the individual tasks' start and duration have to be scaled. This is a straightforward process,

represented through the graph of figure 4.4.

[Figure: Gantt-style bar of a manufacturing process over Time [TU], split into Task 1, Task 2 and Task 3.]

Figure 4.4: Example representation of the manufacturing tasks scaling. As can be seen, the tasks scale linearly, both in terms of start and duration. This is the approach used to obtain the tasks' beginning and end, to check for asset clashes.

Using BAs as an aid in scheduling has been done in the literature, even applied to the pharmaceu-

tical industry. Papavasileiou et al. [39] consider main equipment when scheduling batches’ activities,

determining the recipe cycle time using such an approach, which is basically a method for establishing

which are the main operations that take the longest time, and by doing so, enabling the start of new

campaigns while others are still running, provided they only start after the recipe cycle time.


4.2.2 Implementation

Simply put, the implementation of the algorithm can be divided into three parts, which have been

theoretically described in the previous section. These are (1) the sampling of the Manufacturing and QR

durations, (2) the calculation of the monthly capacities and (3) the calculation of the BA’s utilization and

verification of clashes. All of these are calculated per iteration, which can be seen as the calculation

per reality or universe of values – different universes will feature different sampled values, which will

eventually lead to different utilized capacities and occupied assets. Note that while all these operations

could be calculated in the same loop, for reasons of code tractability and ease of modifications, the three

algorithms were separated. The effects of this decision in terms of code efficiency are further analyzed

in section 4.3.2.

The sampling algorithm is rather easy to explain and is described in algorithm 2. Its main objective is to sample, for each iteration, all the manufacturing and QR durations of each project.

Algorithm 2 Durations sampling algorithm
1: for i in iterations do
2:     Initialize vector for QR and manufacturing durations and start and end dates, with length equal to the number of projects in study
3:     for campaign in planned orders do
4:         Sample value for manufacturing duration
5:         Sample value for QR duration
6:         if Latest Start Date then
7:             Set the end as the deadline minus a safety buffer
8:             Set the start as the end minus the QR and manufacturing durations
9:         else
10:            if Earliest Due Date then
11:                Set the start as the planned start
12:                Set the end as the start plus manufacturing and QR durations
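A minimal R sketch of the date bookkeeping for a single campaign under the two simulation modes (the argument names and the safety buffer value are illustrative; dates are expressed in TUs):

# Start/end of one campaign given its sampled manufacturing and QR durations
campaign_dates <- function(planned_start, deadline, d_manuf, d_qr,
                           mode = c("EDD", "LSD"), safety_buffer = 0) {
  mode <- match.arg(mode)
  if (mode == "LSD") {
    end   <- deadline - safety_buffer   # the deadline (minus a buffer) is fixed
    start <- end - (d_manuf + d_qr)     # latest start date
  } else {
    start <- planned_start              # the planned start is fixed
    end   <- start + (d_manuf + d_qr)   # earliest due date
  }
  c(start = start, end = end)
}

campaign_dates(planned_start = 10, deadline = 60, d_manuf = 12, d_qr = 20,
               mode = "LSD", safety_buffer = 5)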

The capacity calculation algorithm is slightly more complicated than the sampling one, and assumes

one initial variable that contains necessary information for the correct working of the algorithm. This

variable capacities is a matrix containing as many rows as there are projects and 9 different fields, as

listed below. This information is extracted from the projects’ recipes.

• Daily manufacturing effort

• Manufacturing effort for the BPR review

• QA effort after the manufacturing BPR review (pair of percentage of QR duration and effort)

• QA effort after the last QC RV stage (pair of percentage of QR duration and effort)

• QC IPC effort (pairs of percentage of manufacturing duration and effort)

• QC R effort (pairs of percentage of QR duration and effort)

• QC RV effort (pairs of percentage of QR duration and effort)

• Total warehouse effort during manufacturing


• Total warehouse effort during QR

Using this information and the sampled values obtained with algorithm 2, the monthly capacities are calculated using algorithm 3. The monthly capacities are obtained (independently) for manufacturing,

QA, QC IPC, QC R, QC RV and warehouse.

Algorithm 3 Monthly capacities calculation algorithm
1: for i in iterations do
2:     Collect vectors of processes durations, start and end for i
3:     Initialize vectors of daily capacity for each area
4:     for campaign in planned orders do
5:         Determine range of days for manufacturing and QR
6:         Add the daily manufacturing effort to each manufacturing day
7:         Add the manufacturing BPR review effort to the first day of QR
8:         Obtain the manufacturing days when QC IPC takes place and add the corresponding effort
9:         Obtain the QR days when QC R takes place and add the corresponding effort
10:        Obtain the QR days when QC RV takes place and add the corresponding effort
11:        Obtain the QR day when manufacturing BPR review ends and add the QA BPR effort
12:        Obtain the QR day when the last QC RV ends and add the QA QC effort
13:        Add the manufacturing warehouse effort to the first day of manufacturing and the QR warehouse effort to the first day of QR
14:    Aggregate the daily efforts by month

After calculating the monthly capacities for every universe sampled, it is necessary to aggregate the

results into a single monthly value with a variance measure. Although the mean and the standard deviation tend to be used, it seemed more appropriate to use the median and the IQR, since they tend to be more representative in skewed distributions. The approach is simple: for each month, the capacity is

converted to the median of the capacities of that month across all the iterations, with a variance measure

equal to plus or minus 1 IQR.
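A small R sketch of this aggregation, assuming monthly_caps is an iterations-by-months matrix of simulated capacities (the simulated data below is purely illustrative):

set.seed(1)
monthly_caps <- matrix(rnbinom(1000 * 12, size = 5, mu = 200),  # illustrative simulated capacities
                       nrow = 1000, ncol = 12)                  # iterations x months

cap_median <- apply(monthly_caps, 2, median)   # central tendency per month
cap_iqr    <- apply(monthly_caps, 2, IQR)      # variability per month

capacity_summary <- data.frame(month  = 1:12,
                               median = cap_median,
                               lower  = cap_median - cap_iqr,   # minus 1 IQR
                               upper  = cap_median + cap_iqr)   # plus 1 IQR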

Regarding the calculation of the BAs utilization, the algorithm is described in algorithm 4. Note

that a base structure variable is created before the cycle starts. This variable contains every asset utilized by every project and the percentages of start and end of the asset utilization in relation to the whole

manufacturing process according to the recipe (similar to what is explained in figure 4.4). The usefulness

is clear, since the variables are pre-allocated, and this is possible to do since the processes are always

going to be the same for a simulation. Additionally, a variable containing the BAs for each project is

loaded before the cycle.

Algorithm 4 BAs utilization calculation algorithm
1: Create base structure with project, asset and start and end percentages of manufacturing
2: for i in iterations do
3:     Collect vectors of manufacturing duration and start for i
4:     for campaign in planned orders do
5:         Multiply the sampled duration of the manufacturing process by the percentages of start and end of the tasks and sum the start date

This algorithm simply calculates the data necessary for creating a Gantt chart of the BAs' tasks, for each universe of sampled values. Two steps are then necessary: aggregating the iterations' results and detecting clashes between the tasks of a single asset. The aggregation is done once again employing


the median and IQR as the central tendency and variance measures. However, it is not as straightfor-

ward as when calculating the utilized capacities. The process varies depending on whether the simulation is being run for EDD or LSD.

• EDD: the parameters used are the processes' start and duration. The median and IQR are calcu-

lated for both parameters. The aggregated start becomes the median of the start, and its variation

measure is plus or minus 1 IQR of the start. Regarding the aggregated end of the process, the

value corresponds to the median start plus median duration, while the variation measure is this

value plus or minus the sum of the IQR of the start and duration.

• LSD: the parameters used are the processes' end and duration. The median and IQR are cal-

culated for both parameters. The aggregated end becomes the median of the end with variance

equal to plus or minus the IQR of the end. The aggregated start corresponds to the median end

minus the median duration, while its variance measure is plus or minus the sum of the IQR of the

end and duration. The processes tend to be much more variable using this type of simulation,

since both the manufacturing and QR variability affect the results.

After having a fixed result with a measure of variability, a last step has to be done: detecting whether there are any clashes between tasks of different campaigns on a single asset. To do so, algorithm 5 is followed. Note that the result of this algorithm is to classify each activity into one of three categories: no interference; interference; possible interference. The first two categories are self-explanatory: if there is no clash on one activity then it is categorized as no interference; if there is at least one clash it is categorized as interference. The possible interference category relates to tasks that, while having no interference using their median measures, do clash when the activities are considered with their worst-case IQR. They are said to have the possibility of interference. All activities are initially categorized as no

interference.

Algorithm 5 BAs clash detection algorithm
1: for i in unique BAs do
2:    For every occurrence of asset i obtain the range of dates that the asset is being occupied, both for the median range and for the worst-case IQR
3:    for j in occurrences of asset i do
4:       if Median range of occurrence j overlaps with values in the remaining median ranges then
5:          Occurrence j of asset i categorized as interference
6:       else
7:          if IQR of occurrence j overlaps with values in the remaining IQR then
8:             Occurrence j of asset i categorized as possible interference
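A minimal R sketch of this classification, for the occurrences of a single asset, is given below; the data layout (a median window plus a worst-case IQR window per occurrence) is an assumption for illustration, not the exact structure used in the implementation.

```r
# Minimal sketch of algorithm 5 for one asset: each occurrence starts as
# "no interference"; a clash of the median windows makes it an "interference",
# otherwise a clash of the worst-case IQR windows makes it a "possible interference".
overlaps <- function(s1, e1, s2, e2) s1 < e2 & s2 < e1

classify_occurrences <- function(occ) {
  # occ: data.frame with start_med, end_med (median window) and
  #      start_lo, end_hi (worst-case IQR window) for a single asset
  n <- nrow(occ)
  status <- rep("no interference", n)
  for (j in seq_len(n)) {
    others <- setdiff(seq_len(n), j)
    med_clash <- any(overlaps(occ$start_med[j], occ$end_med[j],
                              occ$start_med[others], occ$end_med[others]))
    iqr_clash <- any(overlaps(occ$start_lo[j], occ$end_hi[j],
                              occ$start_lo[others], occ$end_hi[others]))
    if (med_clash) {
      status[j] <- "interference"
    } else if (iqr_clash) {
      status[j] <- "possible interference"
    }
  }
  status
}
```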

Additionally, an optimization stage is also performed, if so desired by the user. The objective of this optimization is to mitigate the BAs' interferences. The formulation of this optimization is presented in equation 4.3 and assumes a series of variables. Consider $BA = \{BA_1, \cdots, BA_m\} = \{BA_i\},\ i \in [1, m]$ as the set of BAs, chosen by the user, where $i$ corresponds to the index of the BA and $m$ is the total number of BAs. Each BA features a series of activities from different projects; considering the BA with index $i$ ($BA_i$), $P_i = \{P_{1_i}, P_{2_i}, \cdots, P_{o_i}\} = \{P_{j_i}\},\ j_i \in [1, o_i]$. Here, $j_i$ corresponds to the index of each activity (for $BA_i$), with the number of activities being $o_i$.


$$
\begin{aligned}
\min \quad & n, \qquad n = \sum \left[\, P_{j_i} \cap P_{k_i} \neq \emptyset \,\right] \\
\text{s.t.} \quad & i \in [1, m] \\
& j_i, k_i \in [1, o_i] \\
& j_i \neq k_i \\
& n \geq 0
\end{aligned}
\tag{4.3}
$$

The algorithm for clash minimization is described in algorithm 6. Note that this optimization process

is done for a fixed scenario, considered as the aggregation of all the scenarios, which can either be

the median or the median plus a measure of variability. This optimization also considers clashes with

BAs regarding orders in the short and medium-term timeframes (up to 3 months from the current date).

These are used to verify that there are no clashes, but cannot be moved, since they are contained in a

timeframe which does not allow for changes in scheduling.

Algorithm 6 BAs clash optimization algorithm
1: Define if optimization is done by possible or actual interference
2: for ba in BAs do                ▷ Start by dealing with current-planned orders interactions
3:    Obtain maximum finish date of current orders in ba as max_ba
4:    if max_ba bigger than any start date in planned orders of ba then
5:       Add time difference to affected planned orders
6: while There are (possible) interferences do
7:    Identify the BAs with (possible) interferences
8:    for ba in BAs with (possible) interferences do
9:       for activity in activities in ba do
10:         if Interference in activity then
11:            Obtain the amount of interfered time i
12:            Add i to the start of all the activities in the campaign corresponding to activity
13:            Update the campaigns' start dates
14: Calculate capacities of the resulting scenario
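A minimal sketch of the core shifting step of this optimization is shown below; the data layout is hypothetical, and the full algorithm additionally distinguishes between actual and possible interference and repeats the shifting (the while loop above) until no interferences remain.

```r
# Minimal sketch of one pass of the clash mitigation in algorithm 6: whenever
# two tasks on the same asset overlap in the fixed scenario, the later, movable
# campaign is delayed by the interfered time, and the shift is applied to every
# task of that campaign.
shift_campaigns_once <- function(tasks) {
  # tasks: data.frame with campaign, asset, start, end and movable
  #        (movable = FALSE for current orders, which cannot be rescheduled)
  for (a in unique(tasks$asset)) {
    idx <- which(tasks$asset == a)
    idx <- idx[order(tasks$start[idx])]
    for (k in seq_along(idx)[-1]) {
      prev <- idx[k - 1]
      cur  <- idx[k]
      overlap <- tasks$end[prev] - tasks$start[cur]   # amount of interfered time
      if (overlap > 0 && tasks$movable[cur]) {
        camp <- tasks$campaign == tasks$campaign[cur]
        tasks$start[camp] <- tasks$start[camp] + overlap
        tasks$end[camp]   <- tasks$end[camp]   + overlap
      }
    }
  }
  tasks
}
```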

After performing the optimization from algorithm 6, the user receives the new corresponding Gantt

chart with the operations related to each BA and the capacity plots. However, both these graphs are

shown in a deterministic way, without any variance associated. This is due to the fact that the variance

cannot be propagated from the raw data after the optimization stage. Since the start dates of the

operations are updated after the optimization stage, the solution for obtaining variance measures is to

re-run the simulation, with the new start dates, which have a greater probability of removing the BA’s

operations' interferences. Note that it is not guaranteed that no interference will take place: due to the stochastic nature of the process, a new simulation will sample different values, so the results will not be exactly the same, creating the possibility of new interferences that were not observed in the previous simulation.

All of these algorithms were implemented in R, a programming language mainly used for statistical

computing and data science. The main reason for choosing this language over other more commonplace languages, such as Python, was its capabilities for designing intuitive and aesthetic frontend

applications, which is a requirement for a DT (further explained in chapter 5). R is an extremely popular


programming language nowadays, with a vast community, support and updates, which justifies its choice

as the main programming language for this work.

4.3 Results

4.3.1 Convergence Analysis

Evaluating the convergence of the results is a crucial step, since it helps to understand whether the simulation converges to a stable result and how many iterations it takes to do so. In terms of

monthly capacity, it was seen that for most of the areas and during months with considerable activity,

the results did converge to a certain median monthly capacity. Interestingly, increasing the number of

iterations tends to generate a normal probability distribution on most areas, as can be seen in figure 4.5,

where the distributions for 10 to 50000 iterations are shown for the manufacturing and QA areas. Note

that while the shape of the distributions varies significantly from one number of iterations to the next, the mean and median do not fluctuate greatly. While this may justify using a smaller number of iterations, it is riskier to draw conclusions regarding the central value from a simulation with 10 iterations than from one with 50000 iterations, since in the latter case the distributions are much more developed and less influenced by outliers.

[Figure 4.5: distributions of the month's utilized capacity for the M and QA areas at 10, 50, 100, 500, 1000, 5000, 10000 and 50000 iterations, each panel annotated with its mean, median, SD and IQR; x-axis: Duration [TU].]

Figure 4.5: Evolution of a month's capacity distribution. The areas represented are manufacturing and QA, from 10 to 50000 iterations. As can be seen, the monthly capacities start to form a normal distribution with the increase in iterations.

The corresponding evolution of all the areas' monthly capacities with the number of iterations, for the same month as in figure 4.5, is shown in figure 4.6. This graph shows the evolution of the


median value, depicting the central tendency measure, with the median plus or minus one tenth of the

IQR as the shaded area. The IQR value was divided by 10 because otherwise the graph’s limits would

be affected and would not show the information as intended; the IQR is only meant to show the tendency

of the variability along the number of iterations.


Figure 4.6: Evolution of the median of the monthly capacity utilization (%) by area. The shaded regions correspond to the median plus or minus one tenth of the IQR, to give insights into the variability evolution.

Some conclusions are in order, regarding the convergence of the capacities. First of all, note that

results from different months vary greatly, depending on the amount and project of the planned cam-

paigns for the month. Secondly, it can be seen that the median does not vary greatly with the number

of iterations. In fact, table 4.3 shows the percentage variation between the median

at different numbers of iterations (discriminated for each area) and the median for the 50000 iterations

simulation, which is hereby considered as the ground truth, since it is the most representative result. The

table shows that the results are never too disparate from the 50000 iterations simulation, with arguably

only the 10, 50 and 100 iterations simulations offering less-than-optimal results, as made clear by the

absolute total of the variance of the areas per iteration count. Note that even the panels that appear to feature greater variance in figure 4.6 show, on close inspection, variations on a very small scale.

Iter      M        QA       QC IPC   QC R     QC RV    WH       Absolute Total
10        −2.8%    9.8%     0.1%     −1.7%    0.2%     −0.1%    14.6%
50        0.7%     2.3%     −0.5%    −0.4%    2.0%     0.3%     6.0%
100       0.3%     −5.1%    0.1%     2.5%     −0.4%    0.0%     8.5%
500       0.1%     −0.2%    −0.2%    0.0%     0.2%     0.0%     0.7%
1000      −0.04%   −0.1%    −0.1%    0.0%     0.0%     0.0%     0.3%
5000      −0.1%    0.00%    −0.1%    0.0%     −0.3%    0.0%     0.5%
10000     0.1%     −0.5%    0.02%    0.0%     −0.2%    0.0%     0.8%

Table 4.3: Monthly and area-wise relative error (with sign) of the median per iteration, compared to 50000 iterations
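As an illustration of how the figures in table 4.3 can be obtained, consider the following R sketch, where med is a hypothetical matrix of median monthly capacities with one row per iteration count (row names "10", "50", ..., "50000") and one column per area.

```r
# Minimal sketch of the convergence check behind table 4.3: the signed relative
# error (%) of each area's median against the 50000-iteration simulation, taken
# as ground truth, plus the absolute total per iteration count.
signed_rel_error <- function(med, reference = "50000") {
  ref <- med[reference, ]
  sweep(med, 2, ref, FUN = function(x, r) (x - r) / r * 100)
}

abs_total <- function(err) rowSums(abs(err))   # last column of table 4.3
```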

Additionally, the tendency of the distributions to become normal is not always shown and depends

on the month and area in question. Observe, for example, the graphs from figure 4.7, which show the


probability distributions for the 3 QC areas and warehouse, for the month in study in figure 4.5 and for

a simulation of 50000 iterations. As can be seen, the results appear to follow an approximately normal

shape (the distribution for the QC R area, for example, follows a normal tendency, even though it is

clearly not a normal distribution). Nevertheless, there are results which feature distributions without such

a clear pattern. Examples of these cases are shown in figure 4.8. These tend to happen for less represented months or for unusual conditions. Even in those cases, however, convergence is verified, with little error from the ground-truth capacity. In conclusion, and only with the capacity calculation in mind, simulations of 500 iterations appear to offer reliable results; in terms of computation time, the smaller the number of iterations the better, as can be seen in section 4.3.2.


Figure 4.7: Evolution of a month's capacity distributions for QC and warehouse. The results shown are for a simulation at 50000 iterations.


Figure 4.8: Non-normal examples of distributions at 50000 iterations

Regarding the BA’s utilization graphs, the convergence was tested in a slightly different manner.

Instead of testing the convergence of a single activity’s start or end, analogously to what was done

regarding the capacities, for the BA’s utilization the total number of asset interferences, possible inter-

ferences and no interferences were counted, and their evolution with the number of iterations used for the simulation was monitored. The results are shown in figure 4.9.

As can be seen in figure 4.9, the convergence of the number of operations with interferences is clear,

decreasing approximately 50% from simulations with 10 iterations to simulations with 50000 iterations.

The convergence of this parameter can be seen at around 500 iterations per simulation. Additionally,

it can be seen that the BAs with possible interference tend to increase until a certain point and then



Figure 4.9: Evolution of the number of BA interferences versus the number of iterations

gradually decrease. This is actually a predictable behavior: the number of possible interferences increases while the number of actual interferences decreases. This happens because, after a task loses its interference, it typically becomes a possible interference; during this regime, interferences are converted into possible interferences, while the number of tasks with no interference remains approximately stable. After the number of interferences converges and stagnates, it can be seen

that the number of possible interferences starts to decline. This can be explained by the fact that in-

creasing the number of iterations per simulation will likely decrease the variation measured in the start

and end of the tasks. This reduces the range for possible interference, effectively reducing the number

of tasks with possible interferences.

4.3.2 Code Efficiency

The algorithms implemented often take a long time to run due to the sheer amount of data processed and generated. Consider a simulation with 50000 iterations, the largest number of iterations simulated. Additionally, for the results shown here, the number of planned orders was 299. During the first loop, the sampling algorithm samples values for each campaign's manufacturing and QR and stores 4 vectors

with these durations and the start and end of each campaign. This is done for each iteration and then

aggregated – leading to 299[campaigns] · 4[vectors] · 50000[iterations] = 59800000 values on this loop.

All these values proceed to the capacities calculation loop. There, 6 vectors (one for each area) containing 1500 values are created, which are then aggregated by month and stored. Although the final amount of

data from this loop is only 25[months] · 6[areas] · 50000[iterations] = 7500000 values, this loop is much

more computationally demanding than the other 2 loops. Finally, the BA’s utilization loop just returns a

structure with the start and end of each BA utilized by each project. For the tested scenarios this corre-

sponded to 305[BAproject] · 3[fields] · 50000[iterations] = 45750000 values. Note that the clash detection

and optimization algorithms are not considered in terms of efficiency because they are performed in

the aggregated data and are only done once, which means that the time taken for these operations will


always be significantly smaller.

For these reasons, performing simulations with a substantial number of iterations may often take up

to 1 hour, which justifies the application of some strategies to try to reduce the time spent on simulation.

The first strategy was identifying the code bottlenecks and trying to develop alternative approaches for those sections. This was done using Profvis, a profiling tool for the R language [13]. By using this

tool, the bottlenecks were successfully identified and some of those were mitigated. The second and

most significant approach was the application of parallel computing instead of regular computation for

the loops. This was done using the function parLapply from the parallel package of R [56]. This function requires the creation of a cluster before the computation, which receives

the number of CPU cores to be used. Since this thesis was performed on a computer with an 8 core

CPU and it is often recommended to leave one core out of the simulation for other uses, 7 cores were

used for the parallel computations. The results of using parallel computation versus regular computation

are shown in figure 4.10.

[Figure 4.10: panels for the Main Loop, Capacities Loop and BA's Utilization Loop, showing the duration [s] of non-parallel versus parallel computing and the relative advantage of parallel computing, as a function of the number of iterations.]

Figure 4.10: Code Efficiency of regular versus parallel computation

The results shown in figure 4.10 are quite interesting. They show that non-parallel computing is actually more efficient for small numbers of iterations. This happens because of the creation of the cluster for parallel computation: before each loop the cluster has to be created, and after the loop it has to be destroyed, a process that takes around 1.5 s. For simulations with a small number of iterations, 1.5 s is rather substantial, and the results are therefore dominated by this setup process. When the simulations start to take longer, this setup time ends up diluted in the total simulation time and the advantages become noticeable, frequently achieving run times 50% better than the non-parallel ones. Although this may not seem like much, at 50000 iterations a non-parallel simulation takes an overall time of 42.83 minutes, while a parallel simulation takes only 18.07 minutes, effectively reducing the duration by 24.76 minutes, corresponding to a 57.82% reduction.
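A minimal, self-contained sketch of the parallel set-up described above is shown below; run_iteration() is a placeholder workload, not the thesis code.

```r
# Minimal sketch of the parallel loop structure: a cluster of 7 workers is
# created (the ~1.5 s overhead mentioned above), the iterations are dispatched
# with parLapply, and the cluster is destroyed afterwards.
library(parallel)

run_iteration <- function(i) {
  # placeholder workload: sample one manufacturing duration
  rnorm(1, mean = 100, sd = 10)
}

n_iter <- 500
cl <- makeCluster(7)
results <- parLapply(cl, seq_len(n_iter), run_iteration)
stopCluster(cl)

durations <- unlist(results)
summary(durations)
```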

As explained in section 4.2.2, the calculations of capacities and BA’s utilization could be performed

all on the same loop. However, due to tractability and ease of code maintenance they were separated

into three distinct functions. The effects of this decision in terms of efficiency are shown in figure 4.11.

The results are predictable for few iterations. Due to the setup times of the cluster it is clear that



Figure 4.11: Code Efficiency – 3 loops vs a single loop

the separated-loop algorithm takes a longer time to process, since it involves the creation of 3 clusters rather than just one, as in the joined-loop algorithm. However, for simulations with more iterations

the separated functions algorithm starts to become slightly more efficient. Although it is not clear why

this happens, the main conclusion that can be taken from this test is that there are no big advantages or

disadvantages in terms of efficiency between choosing the aggregated functions algorithm versus the

separated one – enabling the choice of the separated algorithm for the reasons presented before.

One last efficiency test was performed, to check the impacts of using 7 CPU cores instead of 8 or 6.

The results of this test are shown in figure 4.12.


Figure 4.12: Code Efficiency – 6 vs 7 vs 8 CPU cores used. Note that the graph on the bottom shows the advantages of using 7 cores instead of using either 6 or 8 cores, e.g., a value of +50% means that using 7 cores takes 50% less time.

The results show that no clear conclusion can be drawn from using 7 versus 6 or 8 CPU cores. One possible explanation for the fluctuation of the results is the system instability created by using all 8 cores, not leaving a single core for the other operations required, which may negatively affect the results; another is the reduced performance that could be expected, in principle, from the 6-core simulation.


It can be concluded that there is no advantage in performing the parallel computation with 6 or 8 CPU

cores instead of 7 and that it may even bring unjustified instability to the system.

In conclusion and considering 500 iterations as the ideal amount for simulation, it can be seen that

using parallel computation takes 12.79 seconds while non-parallel computation takes 21.78 seconds.

The advantage of using parallel computation is clear, even though the times are not too long. This

means that simulations with larger numbers of iterations can be run in acceptable times, with the benefits of parallel computation increasing accordingly.

4.3.3 Validation

To validate the results obtained by the simulation tool, data from past campaigns was used. The

method followed was specifying a date range where the study would be performed (this excluded the first

3 months of the chosen range – short and medium-term timeframes, but since the date was empirically

chosen, this could be neglected). After having a specific range of dates, the orders executed on such

range were extracted, along with their planned start date, actual start date, actual manufacturing duration

and actual QR duration. Orders that did not have all the fields were filtered out.

With the remaining orders, the planned start date was used for the simulation process, following the methods described in section 4.2.2. Note that the type of simulation run was EDD, since at the CDMO in study the orders are processed based on their start date and not on their deadline, and, for validation purposes, no data was available to verify a simulation run by LSD. The simulation was then run and the monthly capacities were calculated for every iteration (500 iterations) and aggregated. For the actual consumed capacities, the approach followed was the conversion of the orders into capacities using the assumptions described in section 4.2.1. This was the only possible approach: since there is no data regarding the actual consumed capacity per order, the values of consumed capacity are taken directly from the recipe, which does not fully translate reality. Following these methodologies for the simulated and real capacities, the results were obtained and are shown in figure 4.13.

Note that the values from the graph shown in figure 4.13 are masked by a multiplicative factor along

all the values, for confidentiality reasons. This hides the true values, but the comparisons and relative

deviations are correct. By visual inspection of the graph, it can be seen that the error between the actual

monthly capacities and the simulated ones does not appear to be large, and that often the correct result

is within the 2 IQRs, occasionally being within the 1 IQR. In fact, tables 4.4 and 4.5 show the numeric

figures behind the graph.
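The measures reported in tables 4.4 to 4.6 can be illustrated by the following R sketch; the inputs and the sign convention of the error are assumptions for illustration.

```r
# Minimal sketch of the per-month validation measures: signed absolute error,
# relative error, and whether the real capacity falls within 1 or 2 IQRs of the
# simulated median.
validate_month <- function(real, sim_median, sim_iqr) {
  err <- sim_median - real                    # sign convention assumed
  data.frame(err         = err,
             rel_err     = abs(err) / real,
             within_1iqr = abs(err) <= sim_iqr,
             within_2iqr = abs(err) <= 2 * sim_iqr)
}

# absolute average used in the tables: mean of the absolute values
abs_average <- function(x) mean(abs(x))
```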

The results from tables 4.4 and 4.5 show a series of interesting behaviors in the data. Note that the average used is the absolute average (the average of the absolute values), since it translates the tendency of the relative and absolute errors better than other common error-aggregation measures, such as the root mean squared error. It can be seen that the absolute average of the relative error is generally around 10%, which is a very good estimation for a rough-cut simulation tool. The worst case occurs in QA, and by visual inspection of the corresponding graph from figure 4.13 it can be seen that the IQR of the QA utilized capacity is often much larger than that of the other areas. This derives directly from the fact that the PDFs that model the QR duration are often very dispersed and QA processes are the last processes



Figure 4.13: Capacities validation graph. The graphs of the four areas are shown, with the QC graph being divided into its 3 sections. The shaded regions over each bar correspond to the real capacities, while the colored bars correspond to the simulated capacities. The error bars are distinguished between 1 IQR and 2 IQRs: the broader error bars correspond to 2 IQRs, while the narrower bars correspond to 1 IQR.

Month    Manufacturing          QA                     Warehouse
         Err (%)    Err         Err (%)    Err         Err (%)    Err
4        2.5%       252.2       38.0%      15.6        3.5%       5.9
5        7.7%       1020.6      19.3%      −97.6       0.4%       1.1
6        7.0%       −1216.9     4.7%       28.0        2.1%       −6.1
7        1.5%       186.5       42.0%      136.7       5.4%       9.0
8        13.6%      −2067.9     16.4%      −77.5       8.4%       18.7
9        4.0%       −590.3      2.5%       11.0        17.0%      −43.6
10       11.0%      −348.2      28.1%      −145.0      36.1%      9.6
11       −          −           82.1%      −19.2       −          −
Absolute Average
         6.7%       811.8       29.2%      66.3        10.4%      13.4

Table 4.4: Monthly relative and absolute (with sign) errors for manufacturing, QA and Warehouse. The absolute average is calculated as the average of the absolute values, which generates a better measure of variability.

during such stage, which will result in a wider range of values for the QA efforts, generating more varied

capacity distributions that will feature greater variability when aggregated. This means that although the

relative error tends to be greater in this area, this behavior is actually expected and accounted for, as made

evident by the larger IQR.

It is important to highlight the fact that all the areas besides QA and QC release review only have

values for 7 months instead of 8. This happens because both QA and QC release review are the final

tasks that are performed, which means that campaigns that started in the end of the sixth month (month

9 – the last allowed month for orders to start) may have had their QA processes happening during month

11.


Month    QC IPC                 QC Release             QC Release Review
         Err (%)    Err         Err (%)    Err         Err (%)    Err
4        3.2%       −33.0       36.3%      38.0        12.6%      13.2
5        5.3%       74.3        0.5%       1.4         14.7%      30.2
6        0.8%       −12.0       7.3%       −24.1       8.7%       −33.1
7        17.1%      199.0       11.2%      22.8        16.9%      34.3
8        14.7%      −210.0      2.7%       −6.6        3.7%       −9.2
9        4.5%       −61.6       5.0%       −13.0       32.1%      72.1
10       25.1%      39.3        20.4%      −23.4       36.0%      −72.6
11       −          −           −          −           37.3%      −3.6
Absolute Average
         10.1%      89.9        11.9%      18.5        20.2%      33.5

Table 4.5: Monthly relative and absolute (with sign) errors for QC IPC, release and release review. The absolute average is calculated as the average of the absolute values, which generates a better measure of variability.

Table 4.6 shows an aggregation of the counts and percentages of occurrences per area of real

monthly capacities being within either the 1 or 2 IQRs.

IQRs              Manufacturing   QA       QC IPC   QC R     QC RV    Warehouse
1       Count     2               3        2        3        3        1
        %         28.6%           37.5%    28.6%    42.9%    37.5%    14.3%
2       Count     6               7        4        6        6        3
        %         85.7%           87.5%    57.1%    85.7%    75.0%    42.9%

Table 4.6: Occurrences of real monthly capacities being within the 1 or 2 IQR

As can be seen from the results presented in table 4.6, considering the range of 2 IQRs, a vast

majority of the real monthly capacities are contained within said range of the simulated capacities. For

the warehouse, which features the lowest percentage of occurrences at only 42.9%, it can be seen from table 4.4 that the relative error is generally small. What explains the lack of adherence to the IQR is its small value, derived from the small variance that the warehouse effort features. This could also be seen in the warehouse graph of figure 4.6, regarding convergence: the monthly warehouse consumed capacity converges while fluctuating only by approximately 1 hour per month, which is extremely small. Although this does not unequivocally explain the low variability of the area, it is a reliable indicator.

The validation of the BA’s utilization Gantt chart was not performed for two main reasons. First,

while the simulation was based on the planned start date, the actual start date was often not the same,

sometimes fluctuating by considerable amounts. While the effects of such events did not greatly impact

the monthly capacities, the BA’s utilization is measured on a continuous scale, with tasks taking hours

or days; changes in the start date would greatly impact the allocation time range of the assets, which

would make the real versus simulated occupation almost non-comparable. Secondly, the BA's utilization is a mere accessory to the capacity calculation, and its results should not be used strictly, but rather as a reference for how the operations tend to happen, in order to avoid asset clashes. In fact, a schedule of asset utilization made one year or more in advance is not reliable at all, and the objective of its inclusion in this work is rather to predict the assets' influence on utilized


capacity and how solving the conflicts would affect the overall monthly capacities.

4.3.4 Prediction

The main objective of the algorithms described is to generate future predictions of capacity uti-

lization in the productive and support areas. The approaches followed were described and justified in

section 4.2.1 and the implementation methods in section 4.2.2. Furthermore, the convergence of the

results and the efficiency of the algorithms were tested in sections 4.3.1 and 4.3.2. Lastly, the results were validated against past scenarios in section 4.3.3. The objective of all these steps was the creation

of trustworthy results that predict how the future capacities would be utilized, in an approximate fashion,

as per the definition of RCCP.

For the prediction, a specific scenario had to be followed. The number of iterations run was 500 and the type of simulation was EDD. A total of 299 planned orders (3 months – 2 years) and 248 current

orders (present – 3 months) were considered. Note that these orders are actual orders at the CDMO in

study. However, the names of projects and assets are hashed, the months are replaced by a sequence

of months (month 1, month 2, ...) and whenever quantities are shown these are scaled by a hidden

factor. These steps are required for confidentiality reasons, but the modifications allow for the patterns

and tendencies to be conveyed.

Given the described scenario, the simulation was run and the results were obtained. The first and

most important result is the monthly capacities by area. These results are shown in figure 4.14. The

capacities shown can also be seen in figure 4.15, where they are plotted in percentage of maximum

capacity utilized.

It can be seen from figure 4.15 that the capacity utilization tends to roughly follow the expected pat-

tern, exemplified in figure 2.7. Although the behavior is not perfect, it is actually a particularly good

approximation, which is the objective of an RCCP tool. Furthermore, although the months have been

anonymized, it is possible to observe the seasonality inherent to productive operations. Another interest-

ing point, especially in the manufacturing area, is that the utilized capacity is generally capped at around

75%. This is a customary practice in manufacturing operations: setting an upper boundary on capacity utilization in order to account for delays, unexpected events or priority orders (when customers pay a premium for expedited manufacturing).

The Gantt chart of the BA's utilization for this simulation is shown in figure 4.16. Note that this is a subset of the BAs, which is sufficient to show the tendencies of the assets. The choice of which assets are considered BAs is left to the user (a set of BAs is predefined, but further customization is possible).

The Gantt chart from figure 4.16 shows the utilization of the specific assets and how different projects

may compete for a BA at a given time. When the ERP creates the base schedule, it considers the

durations of the processes as they are in the recipe; in contrast, these algorithms consider simulation

of the demonstrated performance. This means that there may be some clashes between tasks, which, although often not large, should be considered. Note that the interferences shown in red usually have extremely small regions of interference. This comes as a result of the ERP schedule,



Figure 4.14: Forecasted capacity evolution per month and area. The full color bars correspond to the actual simulated capacity for the month and area in question, with an error bar indicating ± 1 IQR. The grey bars correspond to the capacities from the current orders, which, even though they start in the first 3 months, often have effects on the following months. The shaded background area corresponds to the limit capacity of each area, per month.


Figure 4.15: Percentage of maximum capacity utilized per month and area



Figure 4.16: Gantt chart of the BA's utilization. Tasks color-coded as grey signify that they are current orders, which can no longer be modified; green tasks do not have any kind of interference; red tasks have schedule interference with other task(s), considering the median as the aggregation criterion for the simulations; yellow tasks indicate possible interference, when tasks' schedules collide considering their extended start and end (median start minus IQR and median end plus IQR).

which, even though it may not be the most precise, generally accounts for the majority of the tasks' duration.

After running the initial simulation and obtaining the results in terms of utilized capacity and BA’s

utilization, the optimization of the assets can be performed. The objective of the optimization is to remove

any BA’s interferences. However, the user may choose what to consider as the evaluator of interference,

the median of the beginning and end of the tasks or their extended values, considering either 0.25, 0.5,

1 or 2 times the IQR of the beginning and end. The optimization here shown was performed considering

0.5 IQR. The results from this optimization can be seen in figure 4.17 for the Gantt chart of the BA’s

utilization and in figure 4.18 for the capacity utilization resulting from the optimized scenario.


Figure 4.17: Gantt chart of the BA's utilization after optimization. The grey rectangles correspond to tasks regarding current orders (which were not changed during the optimization and cannot be changed) and the green rectangles correspond to the remaining orders, having been modified or not.

The graphs shown in figures 4.17 and 4.18 show the results in terms of BA’s utilization and utilized

capacities for a specific reality, generated by the optimization of the asset’s utilization. This fixed reality

has an associated capacity consumption. Regarding the first graph, it can be seen that the tasks are much more spread out when compared with the Gantt chart from figure 4.16. This is natural, since the tasks' allocation had to be expanded so as not to clash with each other, with a confidence of 0.5 IQR. Note that one project may have several tasks on different BAs, which means that if one task is affected, all the tasks of the project are affected. If the simulation had been run to mitigate clashes between



Figure 4.18: Monthly capacities per area after optimization. Note that the graphs do not include error bars, since this corresponds to a fixed scenario and not an aggregation of multiple scenarios.

tasks only considering the median, these would be generally closer together. Regarding the capacity

utilization graphs, the new tendency is actually quite predictable, having in mind the optimized scenario

BA’s utilization Gantt chart. Since the tasks have larger spaces between each other, the capacities tend

to be more evenly spread out, instead of mostly being concentrated in the initial months. This can be

observed in the graphs from figure 4.18, where in all the 4 areas, the maximum monthly utilized capacity

is greatly reduced (excluding the capacities derived from the current orders), and the capacity of later

months is generally increased. The effects of the distinct types of optimization are shown in table 4.7,

where for manufacturing, QA and QC IPC the monthly capacity percentage (of total available capacity)

is shown for the 6 initial months of the medium-term timeframe.

By analysis of the results shown in table 4.7, a few conclusions regarding the data tendencies and

patterns can be reached. First of all, it can be seen that, fixing the initial month and one area, the percentage decreases across the optimization types, with an average reduction of −2.57% per optimization for

manufacturing, −0.64% for QA and −1.41% for QC IPC. This pattern of reduction continues consistently

throughout the initial 3 months (4-6), with the third month having the highest reduction per optimiza-

tion for the 3 areas, with −4.27% for manufacturing, −1.38% for QA and −2.80% for QC IPC. Months

7 and 8 do not appear to have a clear pattern, with manufacturing and QC IPC continuing to reduce

per optimization type, but QA already increasing. However, on the last month (month 9), all the areas

increase their average variation of consumed capacity percentage per optimization to positive values,

with manufacturing featuring a value of +0.95%, QA +1.21% and QC IPC +1.99%. This can be seen as

the turning point month, from where the capacities seen in the base optimization will start to increase.

These values tell precisely what can be seen when comparing graphs of utilized capacity before and

after optimization (figures 4.14 and 4.18): the capacities of the initial months decrease, while after a

certain month they start to increase, converging to a more uniform capacity distribution. Furthermore,

this behavior is much more visible the greater the optimization type (where the "smallest" optimization is


Manufacturing
Month    Base     Median   0.25 IQRs   0.5 IQRs   1 IQR    2 IQRs
4        54.1%    53.5%    51.6%       48.0%      45.3%    41.2%
5        33.2%    34.0%    30.6%       25.9%      22.8%    21.1%
6        44.9%    51.4%    42.3%       35.8%      29.3%    23.5%
7        39.2%    47.5%    38.6%       33.0%      32.3%    24.1%
8        42.9%    41.6%    41.8%       39.9%      33.5%    30.8%
9        15.3%    8.8%     23.5%       31.4%      26.7%    20.0%

QA
Month    Base     Median   0.25 IQRs   0.5 IQRs   1 IQR    2 IQRs
4        63.1%    62.9%    60.6%       60.4%      60.3%    59.9%
5        24.9%    23.5%    25.6%       23.7%      24.0%    21.7%
6        29.0%    34.7%    30.8%       25.8%      22.0%    22.1%
7        21.6%    43.7%    40.1%       36.8%      35.6%    22.0%
8        12.4%    29.3%    26.0%       27.6%      24.4%    23.2%
9        25.7%    15.8%    25.9%       37.2%      33.8%    31.8%

QC IPC
Month    Base     Median   0.25 IQRs   0.5 IQRs   1 IQR    2 IQRs
4        36.1%    36.1%    35.5%       33.2%      31.0%    29.0%
5        20.9%    20.8%    19.1%       16.6%      15.3%    17.8%
6        31.1%    31.6%    25.4%       23.0%      18.3%    17.1%
7        29.9%    30.6%    26.1%       22.9%      22.9%    16.1%
8        36.7%    36.7%    35.7%       33.5%      30.7%    30.5%
9        15.1%    14.4%    25.4%       22.2%      20.8%    25.0%

Table 4.7: Consumed capacity percentage of the months 4-9 for the manufacturing, QA and QC IPC and for each type of optimization. Base refers to a non-optimized version of the results, while the subsequent columns refer to the optimization by

median, 0.25 IQRs, 0.5 IQRs, 1 IQR and 2 IQRs.

the median-only optimization and the "greatest" is the 2 IQRs optimization).

Another interesting pattern that can be observed is the average variation of the utilized capacity along

the months, for each optimization type and area. Here, for all the areas and all the optimization types,

the value is always negative, which makes complete sense, since the utilization tends to be greater on

the initial months, due to a larger number of planned orders. With minor exceptions, the values shown

are as could be predicted: the larger the type of optimization, the smaller the monthly capacity reduction.

What this means is that for the base scenario the first month features a large utilized capacity, with the second being much smaller and so on, quickly arriving at no capacity at all; the month-to-month variation is large in this scenario. In an optimized scenario, where the tasks are more evenly spread, the monthly capacities are also more spread out along the months; the first month will have a certain capacity, the second a slightly smaller one, and so on, leading to smaller monthly capacity reductions.

The optimization process is done in order to reduce BA’s utilization interferences and generate the

respective capacity utilization of the optimized scenario. However, after being run, a good practice is

to rerun the simulation with the altered start dates and deadlines, in order to have many scenarios and

their aggregation, for a situation less likely to feature interferences.

The capacities presented, although generally delivered in hours of effort, can be quite easily con-

verted to the average number of workers per shift, which is the usual output of RCCP tools. In fact, both


measures are directly proportional: a larger number of hours of effort directly requires a larger number

of workers per shift. The calculation that has to be performed is shown in equation 4.4 (this equation is similar to the application shown in equation 2.1). Note that this calculation considers an average of 30

days per month and a total of 3 shifts per day, making 22 hours of daily work. The result is the average

number of workers that should be working at any given time to successfully deliver the required hours of

effort.

$$
W = \frac{Cap}{30 \cdot 22} = \frac{Cap}{660} \; [\mathrm{w}]
\tag{4.4}
$$
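As a simple illustration of equation 4.4, a monthly effort of 6600 hours corresponds to an average of 10 workers:

```r
# Minimal sketch of equation 4.4: convert a monthly effort (in hours) into the
# average number of workers, assuming 30 days per month and 22 working hours per day.
capacity_to_workers <- function(cap_hours) cap_hours / (30 * 22)
capacity_to_workers(6600)   # 6600 / 660 = 10 workers
```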

In conclusion, it can be said that the results from this simulation-based RCCP are very trustworthy,

given the rough nature of the tool. Furthermore, the results obtained are supported by the demonstrated

performance of the activities performed in the past, which is a clear advantage over commonly used approaches to the problem, which rely on recipe information that is often imprecise.


Chapter 5

Digital Twin User Interface

By definition, the concept of Digital Twin is deeply intertwined with a virtual representation of the

assets in study. This means that a user-friendly, interactive and uncomplicated way of showing the data and allowing for user input comes as a necessity and can be of terrific value within an organization. For these reasons, a UI for the Digital Twin was implemented, using the front-end capabilities of the R language and, more specifically, the Shiny package [12]. The UI can be divided into two parts: visualization and simulation. Note that the latter regards how the simulation tool is delivered to the users and how they can intuitively interact with it, not how the simulation is performed (see chapter

4). All the tabs from the UI are displayed in appendix B.
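For illustration, a stripped-down skeleton of such a Shiny UI, with the two groups of tabs described in this chapter, could look as follows; this is a sketch and not the application implemented in this work.

```r
# Minimal Shiny skeleton with the visualization and simulation tab groups.
library(shiny)

ui <- navbarPage(
  "Supply Chain Digital Twin",
  navbarMenu("Visualization",
             tabPanel("Overview"),
             tabPanel("By Building"),
             tabPanel("By Project"),
             tabPanel("KPIs"),
             tabPanel("Schedule")),
  navbarMenu("Simulation",
             tabPanel("Project Database"),
             tabPanel("RCCP"))
)

server <- function(input, output, session) {
  # maps, gauge plots, tables and simulation triggers would be defined here
}

# shinyApp(ui, server)   # uncomment to launch locally
```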

The way that data is conveyed is extremely important for the recipient to correctly and quickly detect

patterns and, ultimately, draw conclusions. There are many ways of showing information and these can

greatly modify the perception that the user has regarding the data in question: some types of graphs may

show the same data in different ways, highlighting different patterns and behaviors. It is then important to

select the most adequate way of representing the information in question, which often varies substantially from graph to graph.

Additionally, and since this final tool is supposed to be used by decision-makers initially unfamiliar

with it, each tab of the UI includes a FAQ containing usage instructions and explaining what is on-screen.

An example of such a help modal window is shown in figure B.7 of appendix B.

5.1 Visualization

The developed UI features 5 tabs regarding data visualization. Each tab has its purpose, with some

of them focusing on current and past activities and some on key performance indicators (KPIs). The

tabs are Overview, (activities) By Building, (activities) By Project, KPIs and Schedule.

The first tab, Overview, has the objective of delivering a very high-level view of the factory in study,

featuring a map of the production plants and a set of gauge plots, measuring the shortest timeframe

KPIs of the selected view. This goes along with a selector, for the user to select either the global view or a

specific productive area or building. Maps are extremely useful and effective ways of delivering data that

has geographical meaning. For this tab, the map acts as an aid in identifying each building’s objective (if

viewing it in the global view), or as a geographical locator of buildings of a specific area or even specific

building. The information conveyed in the map is modified by changing the selector. Figure 5.1 shows

examples of displays that the map can output.

On the Overview tab, the most significant KPIs are shown for the current week, and these are the

ones affecting the selected option. Changing the selection changes not only the values but also the KPIs themselves; the most


Figure 5.1: Examples of maps shown in the Overview tab. The first map shows the identification of a single building; the second map the identification of all the buildings in a certain area; the third map shows the global view, indicating all the areas that each building serves.

significant ones for the manufacturing area may not be the ones that best describe the warehouse’s

performance, for example. The KPIs are shown in gauge plots, which are a graphical way of showing a

value that is ideally bilaterally bounded. They are often used to show performance, since this usually comes in percentage form. A set of two gauge plots is then shown, measuring the two most significant KPIs for the selected view.

The second tab deals with (activities) By Building. Its objective is merely showing the current, past

and future activities being performed in a specific building, selected by the user. The UI presents a map

of the plant, a slider for defining the range of dates shown and a table containing information according

to the selected dates range and building. This table presents information regarding the activities taking

place, specifying the building, production line, project, planned start date and expected finishing date,

and, for activities that have already taken place, the actual start date and actual finishing date. Regarding the map, it basically takes two forms: the complete map with all the buildings (with the data shown by the table referring to the whole plant) and the map zoomed in on a single building (after it is clicked by the user). These map states are shown in figure 5.2.

This tab offers a way of showing the activities taking place at a specific building. This can be par-

ticularly useful for decision-makers to understand the current operations of each building and how they

have performed in the past.

The third tab (activities) By Project is similar to the second one, but offers the activities’ information

based on the project that is being produced, rather than on the building where it is being made. The tab

is characterized by having a selector, where the user can choose which project to see, a network graph,

showing the INs necessary for the production of the FP, a map showing on which buildings a certain

project has productive activities and a table with the same information as the table from the second tab,

but filtered by activities regarding the selected project. An example of a combination of network plus

map is shown in figure 5.3.

This way of showing information can be useful for users who need to check the activities of a single

project, view on which buildings they are being produced and understand the usual duration of a project’s

processes.


Figure 5.2: Map before and after being clicked. Note that the crosshair button on the top left of the map is a button for resetting the map view.

Figure 5.3: Network graph of a project and corresponding map of buildings

The fourth tab shifts its scope from activities to KPIs. The objective of the tab is to show the KPIs,

their current and past values, on a weekly, monthly or yearly timeframe, filtered by area or building. The

tab features a selector (similar to the one on the Overview tab) where the user can choose the global

view, or a specific area or building, a set of the 6 most relevant KPIs (their current values) for the selected

scope, shown in gauge plots, and a line chart, depicting the evolution of the KPIs through time. This

tab offers many options to the user, so that the most appropriate view can be shown. First of all, and

similarly to the Overview tab, the 6 most relevant KPIs depend on what area or building is selected.

The values presented on the gauge plots refer to the current week's KPIs. If the user wants to

check the historical evolution of one or more KPIs, a selection button is present on each gauge plot,

which activates the line chart with the historical information regarding the selected KPI(s). The user can

then choose to view the evolution of the weekly, monthly or yearly KPI. Furthermore, there is a selection


button to activate the normalization of the data between 0 and 1. This allows two KPIs with information

of different scales to be plotted simultaneously, and while the information regarding the absolute value

of the KPI is lost, the KPIs' evolution can be compared. Lastly, the line chart allows for zooming in

on sections. Note that the KPIs are calculated externally to this work, which only has the objective of

bringing together all the KPIs from different areas and at different dates. Some KPIs may be added

later, but the framework for their inclusion is already constructed, and it is merely a task of modifying the endpoints of data exchange.
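The 0-1 normalization mentioned above corresponds to a simple min-max scaling, sketched below.

```r
# Min-max scaling of a KPI series to [0, 1]; absolute values are lost, but two
# KPIs of different scales can then be plotted on the same line chart.
normalize01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```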

The fifth and last tab regarding visualization is the Schedule tab, focused on delivering the activities

schedule per productive line, in a Gantt-like fashion. The tab features a slider for defining the range of

dates shown and two selectors, one for choosing whether or not there should be any filter applied, and if

so, by building or by project, and another selector for choosing the desired projects or buildings (multiple

choices can be selected at the same time). A graph is then presented showing the activities. This graph

features the production lines on the y axis and the date on the x axis; the occupation is represented as bars, color-coded according to the project they refer to. Figure 5.4 shows an example of the schedule for a given range of time. Note that the activities, projects and production lines are not real values, for confidentiality reasons; the objective is only to illustrate how the data is shown, not the data itself.


Figure 5.4: Example representation of the schedule of activities

Several packages from the R programming language were used for the creation of the different plots.

The maps were created using the Leaflet package [14]; the gauge plots and line chart were created

using the Billboarder package [40]; the project network was created using the VisNetwork package [2]

and the schedule Gantt-style graph was created using the ggplot2 package [59]. Additionally, the tables

were generated and shown using the DT (DataTables) package [60].
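As an illustration of how such a Gantt-style schedule can be assembled with ggplot2, consider the sketch below; the data is hypothetical and this is not the application code.

```r
# Minimal ggplot2 sketch of a schedule chart in the spirit of figure 5.4:
# production lines on the y axis, dates on the x axis, bars coloured by project.
library(ggplot2)

sched <- data.frame(
  line    = c("B8 - L21", "B8 - L21", "B3 - L48"),
  project = c("P73", "P9", "P23"),
  start   = as.Date(c("2019-04-02", "2019-04-20", "2019-04-05")),
  end     = as.Date(c("2019-04-15", "2019-05-03", "2019-04-28"))
)

ggplot(sched) +
  geom_segment(aes(x = start, xend = end, y = line, yend = line,
                   colour = project), linewidth = 6) +
  labs(x = "Date", y = "Production line", colour = "Project")
```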

5.2 Simulation

Interaction with the simulation can be viewed as one of the most important parts of the UI. It should

be easy and intuitive, but dense with information and complete, in order for the decision-makers that use

the tool to be able to get insights into the projects, modify the simulation according to their needs and


access its results in a straightforward way. To this end, the UI features 2 tabs on the topic of simulation.

The first is a Project Database, while the second is the RCCP itself.

The first tab of the simulation category offers a project database where, simply put, the user has ac-

cess to all the projects that can be produced in the production plants. This tab offers diverse information

about the projects, which are selected through two selectors, with the first allowing the user to choose

the maincode of the project and the second to choose the project itself. A series of information is then

displayed. First of all, one table shows the recipe of the project, indicating the version of the recipe and

the individual steps that are taken, along with information regarding the step itself, the duration and the

associated effort. With this, a Gantt chart is also displayed as a graphical representation of the recipe,

allowing for better inspection of the data there contained. This graph and table allow the user to have

better insights into the processes involved in the production of the chosen project, their duration and

their efforts. An exemplified version of the Gantt chart translating a project’s recipe is shown in figure

5.5.

QA2

QC1

QC8

QC16

QC12

M3

QC11

QC14

QC18

QC4

M9

M19

M7

QC15

QC20

M6

M17

M5

WH10

0 4 8 12Time

Ass

et

Area

M

QA

QC

WH

Figure 5.5: Gantt representation of the recipe of a project

Additionally, the BOM is also displayed, showing the materials used for the production of the project.

A representation of the probability distributions that model the project’s manufacturing and QA durations

is also shown, both the PDFs that were fitted to the available data and the available data itself. This

allows the user not only to check the typical verified durations of the processes, but also to verify whether

or not the durations used in the recipes are accurate, and extract conclusions regarding how accurate

the automatic scheduling made by the ERP are; some projects may be accurately scheduled while

others whose recipes do not translate the reality will certainly be less accurate. A distribution of the

adherence to the planned start date, along with the number of observations used for the fitting process

is also shown. Examples of the PDFs of a project are shown in figure 5.6. Lastly, a table is presented, containing the history of the selected project's orders, including information such as planned start date, actual start date, manufacturing duration, QA duration and quantity produced.

Regarding the final tab, the RCCP, it can be considered the most complex and dense tab, featuring

simulation and optimization processes, a plethora of parameters available for the user to modify, tables

and graphs with results and initial parameters, filters to modify all the displayed information and modal



Figure 5.6: Example of PDFs of manufacturing, QR and adherence to start date of a project. Note that the vertical lines indicate the duration of the processes referenced in the recipe, even though they frequently do not add up to the observed reality.

windows to add and edit information. The basic objective of the tab is to run the simulation but doing

so requires a series of parameters. Most of these parameters do not have to be input by the user, since they have predefined values, but they can be modified if so desired. The initial step is to load the

planned orders. A table then presents these, displaying each project’s name, start date and deadline.

The displayed projects can be modified, by selecting the desired order and pressing the corresponding

button and new projects may be added. For the simulation process, a series of parameters can be

modified, such as the number of iterations or the confidence level. Additionally, several parameters are

compulsory for the simulation to be ran: the name of the simulation (for storing purposes) and the type

of simulation (between EDD and LSD). An estimated duration of the simulation is presented. After the

simulation is ran, a table with the results’ overview is shown, as well as the Gantt chart of BA’s utilization

and monthly capacities, similar to figures 4.16 and 4.14. At this point, selecting an order on the planned

orders table highlights the tasks regarding such order on the Gantt chart and its impact on the monthly

capacities.
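To make the mechanics of the simulation concrete, the sketch below reproduces, under simplifying assumptions, the kind of Monte Carlo loop the RCCP tool relies on: each planned order's duration is sampled from its fitted distribution, the corresponding effort is spread over the calendar, and the monthly utilized capacity is summarized at a chosen confidence level. The orders, fitted parameters, effort values and planning horizon are hypothetical.

set.seed(1)
orders <- data.frame(
  project = c("P1", "P2", "P3"),
  start   = as.Date(c("2019-01-07", "2019-01-21", "2019-02-04")),
  size    = c(8, 6, 10),       # hypothetical fitted Negative Binomial parameters
  mu      = c(20, 12, 30),
  effort  = c(1.5, 2.0, 1.0)   # hypothetical daily effort, in capacity units
)

n_iter <- 1000
months <- format(seq(as.Date("2019-01-01"), as.Date("2019-12-01"), by = "month"), "%Y-%m")
capacity <- matrix(0, nrow = n_iter, ncol = length(months), dimnames = list(NULL, months))

for (i in seq_len(n_iter)) {
  for (j in seq_len(nrow(orders))) {
    # Sample this order's duration (in days) from its fitted distribution
    dur  <- rnbinom(1, size = orders$size[j], mu = orders$mu[j]) + 1
    days <- seq(orders$start[j], by = "day", length.out = dur)
    # Accumulate the order's daily effort into the months it touches
    load <- tapply(rep(orders$effort[j], dur), format(days, "%Y-%m"), sum)
    capacity[i, names(load)] <- capacity[i, names(load)] + load
  }
}

# Monthly capacity required at a 90% confidence level
apply(capacity, 2, quantile, probs = 0.9)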

After the base simulation is run, the user can choose to optimize the results, following the methodology described in section 4.2.2. Only the type of optimization has to be chosen, between median, 0.25 IQRs, 0.5 IQRs, 1 IQR or 2 IQRs. After the optimization is done, the results are updated, becoming similar to the ones shown in figures 4.17 and 4.18. At this point the user has three alternatives: either a new optimization is run, the non-optimized results are shown, or the start dates and deadlines are updated, allowing for a new simulation process that will theoretically have fewer BA interferences. During the entire process, the information is stored under the project name defined by the user, with the creation timestamp added. This way, any simulation can be accessed later, giving the users the power to either continue working on a specific scenario or simply compare their simulation with others. Additionally, a button is included to export the simulation results into a .pdf file, enabling offline access to the results, which can also be printed, if so desired.
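As a rough illustration of the optimization levels listed above, and only under the assumption that they correspond to scheduling each process at the median of its fitted distribution plus a multiple of its interquartile range (the exact rule is given in section 4.2.2), a conservative duration could be derived as follows; the parameters are hypothetical.

# Conservative duration estimate from a fitted Negative Binomial PDF:
# median plus a chosen multiple of the interquartile range (IQR).
conservative_duration <- function(size, mu, iqr_multiplier = 0.5) {
  q <- qnbinom(c(0.25, 0.50, 0.75), size = size, mu = mu)
  ceiling(q[2] + iqr_multiplier * (q[3] - q[1]))
}

# Example: the "0.5 IQRs" level for a hypothetical project
conservative_duration(size = 8, mu = 20, iqr_multiplier = 0.5)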


Chapter 6

Conclusions

The main objective of this thesis was the creation of a DT of the internal supply chain at the CDMO under study. This tool should aid in monitoring the performance indicators across and within areas of the internal SC and be able to conduct scenario-based forecasting, specifically through a simulation-based RCCP tool capable of predicting monthly capacity utilization.

To this end, the internal SC processes were mapped and extensively studied, to fully understand the connections between the areas and the usual workflows found throughout the different projects and campaigns. The processes' durations, starts and ends were collected from the ERP, and the probability distributions that define the manufacturing and the QR processes were created in order to describe their observed variability. The fitting of theoretical PDFs to the historical data was accomplished after an extensive statistical analysis and an optimization process, with the objective of minimizing the CS GoF. After the projects' durations and their inherent variability had been modelled, the simulation-based RCCP tool was constructed, built on the concept of demonstrated performance. The tool used Monte Carlo simulation as its simulation engine, allowing multiple scenarios to be run and a convergence in the monthly utilized capacity to be found. Furthermore, the tool provided a representation of how the BAs were utilized and of whether there were any interferences between projects using a single BA at the same time. An optimization algorithm was also implemented with the objective of removing (or reducing) these interferences. Regarding visualization, the tool offered intuitive ways of delivering visibility to its users, effectively concentrating information from a series of dispersed sources and allowing it to be filtered according to the users' needs.

Overall, the objectives defined for this project were fulfilled and, in some cases, additional functionalities were even added.

6.1 Achievements

The developed tool was successful in the two objectives that were proposed: visualization and simulation.

The visualization component of the tool was able to deliver intuitive views into the operations happening at the production plant at a given time; several methods of delivering these views were implemented so that the users can view the information in the way that best fits their needs. Additionally, there are ways of viewing the key performance indicators, per area, per building or globally, measured by week, month or year, with the possibility of viewing each KPI's historical data. Insights into the task schedule are also delivered, allowing the users to filter the data by date, project or production building. Users can also access each project's information, historical data, BOM, task recipe and probability distributions.


All of these views are supplied with information from the company's ERP system, when the data is available, and endpoints were created for receiving new data that may be generated in the future.

The simulation tool created was able to generate accurate forecasts (within the scope of a rough-cut tool) regarding the necessary monthly capacity, or the monthly percentage of the maximum capacity needed. The simulations performed were validated and showed promising results for a rough-cut tool, with the possibility of improving the base capacity utilization estimates made by the ERP, since demonstrated performance values were considered instead of the values strictly derived from the recipe. Furthermore, the ability to add new orders and generate new scenarios brings a clear advantage to this tool. The optimization algorithms offered enable the tool to be more conservative in the assets' allocation, effectively removing the need for time buffers, or at least allowing them to be greatly reduced.

6.2 Future Work

Future work for this project can be divided into two levels:

• DT framework: how it can be improved in the data it collects and how it can translate the current and past states of the production plants more effectively and comprehensively.

• Simulation-based RCCP tool: how the model that supports the tool can be improved, with the ultimate goal of generating more precise predictions.

6.2.1 Quality & Quantity of Data

Future work largely involves the implementation of more measurement points, both at the shop-floor level (such as power measurements of the assets, temperatures and pressures) and at the corporate level, improving and increasing the information contained in the ERP. The concept of edge computing, performing data processing at the sensor level, would be extremely useful for the tool. It would allow the data collected at the shop-floor level to be processed on-site, effectively delivering less meaningless data to the storage systems and, instead, more processed information. Improving the data collection and processing systems would result in a DT with more and better information, enabling the users to have access to everything happening (or that happened in the past) in a given production plant, on a productive, logistical or even managerial level, without leaving their office.

6.2.2 Improving the Models

Several improvements could be made to the simulation tool in the future. Many of these improvements concern the PDFs that model the processes' durations. First of all, a correlation between the duration of the manufacturing and the QR processes has been observed in some projects. This could be studied to achieve a better combination of manufacturing and QR durations when sampling values. Additionally, there may be some influence of the batch size on the manufacturing duration and, since the batch size is known a priori, this correlation could also be studied. The difference between the planned start date of the projects and their actual start date could also be modelled. Some projects have also been seen to feature multimodal distributions (especially in the QR duration). This multimodal behavior could be explained through the manufacturing-QA correlation, but investigating this phenomenon could be useful in generating better predictions. Weights should be added to the campaigns when fitting the theoretical PDFs to the real observed durations, in the sense that more recent campaigns should be more representative of the reality than past ones.

Possibly, the best way to achieve good estimates for the projects' durations, considering all the conditions already in effect and the ones described here as future work, would be the application of black-box models that would receive a project's code, start date, deadline, batch size and other meaningful metrics and would output the predicted duration of the manufacturing and QA processes. Artificial neural networks are computing systems that could provide excellent results for this problem and are comparatively easy to implement. Given enough data, they are able to achieve comparable, and frequently superior, results to those of heuristic methods, without nearly as much effort.
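A minimal sketch of such a black-box model is given below, using the nnet package as one possible choice; the training data frame, feature names and fitted relationship are entirely hypothetical and only illustrate the suggested approach.

library(nnet)

set.seed(3)
# Hypothetical historical campaigns: features known a priori plus the
# observed manufacturing duration to be learned.
history <- data.frame(
  batch_size  = runif(200, 50, 500),
  recipe_dur  = sample(10:40, 200, replace = TRUE),
  start_month = sample(1:12, 200, replace = TRUE)
)
history$actual_dur <- history$recipe_dur * runif(200, 0.8, 1.6) + 0.01 * history$batch_size

# Single-hidden-layer network with a linear output unit for regression
fit <- nnet(actual_dur ~ batch_size + recipe_dur + start_month,
            data = history, size = 5, linout = TRUE, trace = FALSE)

# Predicted manufacturing duration for a hypothetical new order
predict(fit, data.frame(batch_size = 200, recipe_dur = 25, start_month = 6))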


Bibliography

[1] K. Alicke, J. Rachor, and A. Seyfert. Supply chain 4.0 – the next-generation digital supply chain. Technical report, McKinsey & Company, June 2016. https://www.mckinsey.com/business-functions/operations/our-insights/supply-chain-40--the-next-generation-digital-supply-chain.

[2] Almende B.V., B. Thieurmel, and T. Robert. visNetwork: Network Visualization using ’vis.js’ Library,

2019. URL https://CRAN.R-project.org/package=visNetwork. R package version 2.0.6.

[3] Anylogic. Alstom develops a rail network digital twin for railway yard design and predictive fleet maintenance. Anylogic Case Studies, Apr 2018. URL https://www.anylogic.com/digital-twin-of-rail-network-for-train-fleet-maintenance-decision-support/?utm_source=white-paper&utm_medium=link&utm_campaign=digital-twin.

[4] T. B. Arnold and J. W. Emerson. Nonparametric goodness-of-fit tests for discrete null distributions.

R Journal, 3(2):34–39, 2011. doi:10.32614/RJ-2011-016.

[5] AVATA. S&OP/IBP Express, September 2015. Slide 6. Accessed on 2019/08/28. https://

www.slideshare.net/christinabergman/avata-sop-ibp-express-53194300.

[6] M. Bajer. Dataflow In Modern Industrial Automation Systems. Theory And Practice. ABB Corporate

Research Krakow, Poland, 2014.

[7] J. E. Beasly. OR-Notes: master production schedule. Brunel University London, 1990. URL

http://people.brunel.ac.uk/~mastjjb/jeb/or/masprod.html.

[8] R. N. Bolton, J. R. McColl-Kennedy, L. Cheung, A. Gallan, C. Orsingher, L. Witell, and M. Zaki.

Customer experience challenges: bringing together digital, physical and social realms. Journal of

Service Management, 29(5):776–808, 2018. doi:10.1108/JOSM-04-2018-0113.

[9] K. Bruynseels, F. Santoni de Sio, and J. van den Hoven. Digital twins in health care: eth-

ical implications of an emerging engineering paradigm. Frontiers in genetics, 9:31, 2018.

doi:10.3389/fgene.2018.00031.

[10] E. M. Carter and H. W. W. Potts. Predicting length of stay from an electronic patient record system:

a primary total knee replacement example. BMC medical informatics and decision making, 14(1):

26, 2014. doi:10.1186/1472-6947-14-26.

[11] S. Cateni, V. Colla, and M. Vannucci. A fuzzy logic-based method for outliers detection. In Artificial

Intelligence and Applications, pages 605–610, 2007.

[12] W. Chang, J. Cheng, J. Allaire, Y. Xie, and J. McPherson. shiny: Web Application Framework for R,

2019. URL https://CRAN.R-project.org/package=shiny. R package version 1.3.1.


[13] W. Chang, J. Luraschi, and T. Mastny. profvis: Interactive Visualizations for Profiling R Code, 2019.

URL https://CRAN.R-project.org/package=profvis. R package version 0.3.6.

[14] J. Cheng, B. Karambelkar, and Y. Xie. leaflet: Create Interactive Web Maps with the JavaScript

’Leaflet’ Library, 2018. URL https://CRAN.R-project.org/package=leaflet. R package version

2.0.2.

[15] V. Choulakian, R. A. Lockhart, and M. A. Stephens. Cramer-von mises statistics for discrete distri-

butions. Canadian Journal of Statistics, 22(1):125–137, 1994. doi:10.2307/3315828.

[16] A. Costigliola, F. A. Ataıde, S. M. Vieira, and J. M. Sousa. Simulation model of a quality control

laboratory in pharmaceutical industry. IFAC-PapersOnLine, 50(1):9014–9019, 2017.

[17] J. F. Cox and J. H. Blackstone. APICS dictionary. Amer Production & Inventory, 2002.

[18] A. C. Cullen and H. C. Frey. Probabilistic Techniques in Exposure Assessment: a handbook for

dealing with variability and uncertainty in models and inputs. Springer Science & Business Media,

1999.

[19] European Federation of Pharmaceutical Industries and Associations. The pharmaceutical industry

in figures. EFPIA, 2018. URL https://efpia.eu/publications/downloads/efpia/2018-the-

pharmaceutical-industry-in-figures/.

[20] Gartner Top 10 Strategic Technology Trends for 2019, Oct 2018. https://www.gartner.com/

smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2019/.

[21] E. Glaessgen and D. Stargel. The digital twin paradigm for future NASA and US Air Force vehicles.

In 53rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference,

page 1818, 2012. doi:10.2514/6.2012-1818.

[22] F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):

1–21, 1969. doi:10.1080/00401706.1969.10490657.

[23] S. I. Haider. Pharmaceutical master validation plan: the ultimate guide to FDA, GMP, and GLP

compliance. CRC Press, 2001.

[24] S. Hawkins, H. He, G. Williams, and R. Baxter. Outlier detection using replicator neural networks.

In International Conference on Data Warehousing and Knowledge Discovery, pages 170–180.

Springer, 2002. doi:10.1007/3-540-46145-0-17.

[25] F. S. Hillier, G. J. Lieberman, B. Nag, and P. Basu. Introduction To Operations Research. Mc Graw

Hill Education, sie tenth edition, 2017. ISBN 978-93-392-2185-0.

[26] D. Ivanov, A. Dolgui, A. Das, and B. Sokolov. Digital supply chain twins: Managing the ripple effect,

resilience, and disruption risks by data-driven optimization, simulation, and visibility. In Handbook

of Ripple Effects in the Supply Chain, pages 309–332. Springer, 2019.


[27] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn. Digital twin in manufacturing: A

categorical literature review and classification. IFAC-PapersOnLine, 51(11):1016–1022, 2018.

doi:10.1016/j.ifacol.2018.08.474.

[28] A. M. Law. Simulation Modeling and Analysis. McGraw Hill Education, International Fifth edition,

2015. ISBN 978-1-259-25438-3.

[29] J. Lee, E. Lapira, B. Bagheri, and H.-a. Kao. Recent advances and trends in predictive

manufacturing systems in big data environment. Manufacturing letters, 1(1):38–41, 2013.

doi:10.1016/j.mfglet.2013.09.005.

[30] M. R. Lopes, A. Costigliola, R. M. Pinto, S. M. Vieira, and J. M. Sousa. Novel governance model

for planning in pharmaceutical quality control laboratories. IFAC-PapersOnLine, 51(11):484–489,

2018.

[31] G. S. Maddala and K. Lahiri. Introduction to Econometrics. Macmillan New York, Second edition,

1992. ISBN 978-0-02-374545-4.

[32] A. M. Madni, C. C. Madni, and S. D. Lucero. Leveraging digital twin technology in model-based

systems engineering. Systems, 7(1):7, 2019.

[33] J. T. Mentzer, W. DeWitt, J. S. Keebler, S. Min, N. W. Nix, C. D. Smith, and Z. G. Zacharia. Defining

supply chain management. Journal of Business logistics, 22(2):1–25, 2001. doi:10.1002/j.2158-

1592.2001.tb00001.x.

[34] A. Mullard. 2018 FDA drug approvals. Nature Reviews – Drug Discovery, January 2019. URL

https://www.nature.com/articles/d41573-019-00014-x.

[35] J. A. Nelder and R. Mead. A simplex method for function minimization. The computer journal, 7(4):

308–313, 1965. doi:10.1093/comjnl/7.4.308.

[36] Oracle Applications. Overview of capacity planning, November 1997. Accessed on 2019/10/07. https://docs.oracle.com/cd/A60725_05/html/comnls/us/crp/ovwcp.htm.

[37] Oracle Help Center. Overview to resource requirements planning, February 2013. URL https://docs.oracle.com/cd/E26228_01/doc.93/e21770/ch_over_resrc_req_pln.htm#WEAMP231. Accessed on 2019/10/07.

[38] Oracle Help Center. JD Edwards Enterprise One Applications Requirements Planning Implementation Guide: planning production capacity, 2014. URL https://docs.oracle.com/cd/E64610_01/EOARP/plng_production_capacity.htm#EOARP00393. Accessed on 2019/09/09.

[39] V. Papavasileiou, A. Koulouris, C. Siletti, and D. Petrides. Optimize manufacturing of pharmaceutical

products with process simulation and production scheduling tools. Chemical Engineering Research

and Design, 85(7):1086–1097, 2007. doi:10.1205/cherd06240.


[40] V. Perrier and F. Meyer. billboarder: Create Interactive Chart with the JavaScript ’Billboard’ Library,

2019. URL https://CRAN.R-project.org/package=billboarder. R package version 0.2.5.

[41] Pharmaceutical Research and Manufacturers of America. 2019 PhRMA Annual Membership

Survey. PhRMA, 2019. URL https://www.phrma.org/report/2019-phrma-annual-membership-

survey.

[42] B. Piascik, J. Vickers, D. Lowry, S. Scotti, J. Stewart, and A. Calomino. Materials, structures,

mechanical systems, and manufacturing roadmap. NASA TA, pages 12–2, 2012.

[43] D. Pomerantz. The french connection: Digital twins from paris will protect wind tur-

bines against battering north atlantic gales. GE Reports, April 2018. URL https:

//www.ge.com/reports/french-connection-digital-twins-paris-will-protect-wind-

turbines-battering-north-atlantic-gales/.

[44] Reby Media. Engineering matters ep. 4 – the rise of the digital twin, July 2018. https:

//engineeringmatters.reby.media/2018/07/23/4-the-rise-of-the-digital-twin/.

[45] S. Rehana. Making a digital twin supply chain a reality. ASUG, November 2018. URL https:

//www.asug.com/news/making-a-digital-twin-supply-chain-a-reality.

[46] A. Robinson. The rise of the digital supply chain begets 5 huge benefits. Cerasis, February 2016.

URL https://cerasis.com/digital-supply-chain/.

[47] J. Rowley. The wisdom hierarchy: representations of the DIKW hierarchy. Journal of information

science, 33(2):163–180, 2007.

[48] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating

the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.

doi:10.1162/089976601750264965.

[49] S. Scoles. A digital twin of your body could become a critical part of your health care. Slate,

2016. URL https://slate.com/technology/2016/02/dassaults-living-heart-project-and-

the-future-of-digital-twins-in-health-care.html.

[50] N. Shah. Pharmaceutical supply chains: key issues and strategies for optimisation. Computers &

chemical engineering, 28(6-7):929–941, 2004. doi:10.1016/j.compchemeng.2003.09.022.

[51] M. Sharma and J. P. George. Digital twin in the automotive industry: Driving physical-digital convergence (white paper). Technical report, TATA Consultancy Services, December 2018. https://www.tcs.com/content/dam/tcs/pdf/Industries/manufacturing/abstract/industry-4-0-and-digital-twin.pdf.

[52] R. Spicar and M. Januska. Use of Monte Carlo modified Markov Chains in capacity planning.

Procedia Engineering, 100:953–959, 2015.


[53] H. Sugita. A mathematical formulation of the monte carlo method. In Monte Carlo Method, Random

Number, and Pseudorandom Number, pages 9–21. Mathematical Society of Japan, 2011.

[54] Supply Chain Resource Cooperative SME. Capacity planning. NC State University, January 2011.

URL https://scm.ncsu.edu/scm-articles/article/capacity-planning.

[55] Supply Chain Resource Cooperative SME. Capacity utilization. NC State University, January 2011.

URL https://scm.ncsu.edu/scm-articles/article/capacity-utilization.

[56] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. URL https://www.R-project.org/.

[57] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.

[58] P. H. Westfall. Kurtosis as peakedness, 1905–2014. RIP. The American Statistician, 68(3):191–

195, 2014. doi:10.1080/00031305.2014.917055.

[59] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN

978-3-319-24277-4. URL https://ggplot2.tidyverse.org.

[60] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library ’DataTables’, 2018. URL

https://CRAN.R-project.org/package=DT. R package version 0.5.


Appendix A

Goodness-of-fit Tests Comparison

The results of the optimization of the theoretical PDFs according to the different goodness-of-fit tests are shown here, as described in the section regarding Distribution Fitting. Note that the results are shown for the three example distributions of Figures 3.3, 3.4 and 3.9, considering the Negative Binomial PDF as the theoretical PDF. For each distribution, a table shows the different goodness-of-fit values obtained for each optimization (tables A.1, A.2 and A.3), and the PDFs and CDFs of the fitted distributions are plotted (figures A.1, A.2 and A.3).
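For reference, the sketch below shows one way the non-Chi-squared statistics in these tables could be computed for a fitted Negative Binomial distribution, using the dgof package [4] for discrete null distributions; the observed data and fitted parameters are hypothetical.

library(dgof)

set.seed(7)
x <- rnbinom(300, size = 8, mu = 20)   # hypothetical observed durations
size <- 8; mu <- 20                    # hypothetical fitted parameters

# The hypothesized discrete CDF, supplied as a step function
null_cdf <- stepfun(0:200, c(0, pnbinom(0:200, size = size, mu = mu)))

dgof::ks.test(x, null_cdf)                # Kolmogorov-Smirnov
dgof::cvm.test(x, null_cdf, type = "W2")  # Cramér-von Mises
dgof::cvm.test(x, null_cdf, type = "U2")  # Watson
dgof::cvm.test(x, null_cdf, type = "A2")  # Anderson-Darling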

A.1 Example 1

Goodness-of-fit Test   CS      KS     CVM    W      AD
Chi-Squared            17.8    0.19   0.41   0.16   2.36
Kolmogorov-Smirnov     69.95   0.12   1.76   1.45   10.21
Cramér-von Mises       20.12   0.21   0.15   0.15   0.89
Watson                 20.36   0.21   0.15   0.15   0.89
Anderson-Darling       20.53   0.2    0.15   0.15   0.85

Table A.1: Goodness-of-fit values for the optimizations run for the first example. The first column denotes the goodness-of-fit test chosen for the optimization process and the remaining columns show the values of the other GoF statistics for that optimization, to enable the comparison between the results of different optimizations.


Figure A.1: Results of the fitted distributions for the first example, optimized according to the different goodness-of-fit tests considered in this study.


A.2 Example 2

Goodness-of-fit Test   CS      KS     CVM    W      AD
Chi-Squared            38.39   0.15   0.27   0.26   1.82
Kolmogorov-Smirnov     44.31   0.09   0.77   0.17   5.35
Cramér-von Mises       38.57   0.14   0.27   0.26   1.84
Watson                 52.01   0.11   1.14   0.15   7.74
Anderson-Darling       38.59   0.15   0.29   0.29   1.74

Table A.2: Goodness-of-fit values for the optimizations run for the second example.


Figure A.2: Results of the fitted distributions for the second example.

A.3 Example 3

Goodness-of-fit Test   CS       KS     CVM     W          AD
Chi-Squared            144.68   0.13   0.57    0.13       3.28
Kolmogorov-Smirnov     152.93   0.07   0.13    0.11       0.87
Cramér-von Mises       157.45   0.08   0.11    0.11       0.78
Watson                 NaN      1      85.01   3.43E-19   NaN
Anderson-Darling       158.74   0.09   0.11    0.11       0.74

Table A.3: Goodness-of-fit values for the optimizations run for the third example.



Figure A.3: Results of the fitted distributions for the third example. Note that in this example the Watson GoF was removed, since the result it provided was not a good fit and would jeopardize the visualization of the remaining curves. The results are nonetheless presented in table A.3.


Appendix B

Digital Twin User Interface Screenshots

Figure B.1: Overview tab screenshot

Figure B.2: Activities by building tab screenshot


Figure B.3: Activities by project tab screenshot

Figure B.4: KPIs tab screenshot


Figure B.5: Projects schedule Gantt chart tab screenshot

Figure B.6: Projects database example view


Figure B.7: Example of modal help window

Figure B.8: RCCP: main view, displaying the planned and user defined orders


Figure B.9: RCCP: options modal window

Figure B.10: RCCP: start simulation modal window


Figure B.11: RCCP: existing scenarios to be loaded
