
  • DEGREE PROJECT IN MECHANICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

    STOCKHOLM, SWEDEN 2018

    Method for Event Detection in Mechatronic Systems Using Deep Learning

    EDVIN VON OTTER

    WILLIAM BRUCE

    KTH SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT

  • Method for Event Detection in Mechatronic Systems

    Using Deep Learning

    WILLIAM BRUCE

    EDVIN VON OTTER

    Master's Thesis at ITM

    Supervisor: De-Jiu Chen

    Examiner: Martin Törngren

    TRITA-ITM-EX 2018:195

  • Master Thesis MMK 2018:195

    Method for Event Detection in Mechatronic Systems Using Deep Learning

    William Bruce
    Edvin von Otter

    Approved: 2018-06-11

    Examiner: Martin Törngren

    Supervisor: De-Jiu Chen

    Commissioner: Atlas Copco

    Contact Person: Daniel Lundborg

    Abstract

    Artificial Intelligence and Deep Learning are new drivers of technological change, and find their way into more and more applications. These technologies have the ability to learn complex tasks previously hard to automate. In this thesis, deep learning is applied and evaluated in the context of product assembly where components are joined together. The specific problem studied is the process of clamping by using threaded fasteners.

    The thesis evaluates several deep learning models, such as Recurrent Neural Networks (RNN), Long Short-Term Memory Neural Networks (LSTM) and Convolutional Neural Networks (CNN), and presents a new method for estimating the rotational angle at which the fastener mates with the material, also called the snug-angle, using a combined detection-by-classification and regression approach with stacked LSTM neural networks. The method can be implemented to make precision clamping using angle tightening instead of torque tightening. This tightening method offers an increase in clamp force accuracy, from ±43% to ±17%.

    Various estimation methods and inference frequencies are evaluated to offer insight into the limitations of the model. The top method achieves a precision of −0.05 ± 2.35° when estimating the snug-angle and can classify where the snug-angle occurs with 99.26% accuracy.

    The thesis also takes into account the demanding requirements of an implementation on mechatronic systems and presents advantages and disadvantages of the state-of-the-art model compression methods used to achieve a lightweight and efficient algorithm. Usage of these methods can give compression rates, energy efficiency and speed that are in the order of 10× to 100× compared to the original model, without loss of performance.

  • Examensarbete MMK 2018:195

    Metod för händelsedetektering i mekatroniska system genom djupinlärning

    William Bruce
    Edvin von Otter

    Approved: 2018-06-11

    Examiner: Martin Törngren

    Supervisor: De-Jiu Chen

    Commissioner: Atlas Copco

    Contact Person: Daniel Lundborg

    Sammanfattning

    Artificial Intelligence and deep learning are new drivers of technological change and appear in more and more applications. These technologies have the ability to learn complex tasks that were previously hard to automate. This thesis investigates the possibility of using deep learning in the tightening of threaded fasteners.

    The thesis evaluates several deep learning models, such as Recurrent Neural Networks (RNN), Long Short-Term Memory Neural Networks (LSTM) and Convolutional Neural Networks (CNN), and presents a new method for estimating the rotational angle at which the fastener mates with the material, also called the snug-angle. This is achieved with a combined detection-by-classification and regression approach based on stacked LSTM networks. The method enables precision tightening through angle tightening instead of torque tightening, which improves the precision of the achieved clamp force from ±43% to ±17%.

    Several estimation methods and output frequencies are evaluated to highlight the limitations of the algorithm. The method achieves an accuracy of −0.05 ± 2.35° and can classify an event with 99.26% precision.

    The thesis also considers the demanding requirements that an implementation of machine learning places on mechatronic systems, and presents advantages and disadvantages of the latest compression methods used to obtain a lightweight and efficient algorithm. Using these methods can give improvements in energy efficiency, speed and model size in the order of 10x to 100x compared to the original model, without loss of performance.

  • Acknowledgements

    First of all, we would like to thank our industrial supervisor Daniel Lundborg for believing in us and making the subject of our thesis possible.

    We would like to thank our academic supervisor De-Jiu Chen for the valuable advice and feedback during this project.

    To our thesis coordinator at KTH, Damir Nešić, thank you for the planning and execution of this year's thesis projects.

    To Adam Klotblixt at Atlas Copco, thank you for the advice and guidance in the realm of threaded tighteners.

    To our friends and family, thank you for the love, support and understanding during these five years of studies at KTH, as well as these intense last months.

    Lastly, we would like to thank Ulf Samuelsson for your fascinating stories, introducing noise into our days at Atlas Copco, making us generalize better and achieve a deeper learning.

    William Bruce and Edvin von Otter

    Stockholm, June 2018

  • Contents

    1 Introduction
      1.1 Background
        1.1.1 Conversion of clamping force to torque
        1.1.2 Tightening methods
      1.2 Purpose
      1.3 Method
      1.4 Delimitations
        1.4.1 Tightening strategies
        1.4.2 Implementation
        1.4.3 Choice of input data
      1.5 Ethics and Sustainability

    2 Background Study
      2.1 Introduction to Tightening
      2.2 Introduction to Artificial Intelligence
        2.2.1 Regression and Classification
        2.2.2 Sequence Labeling
      2.3 Event Detection in Sequential Data
      2.4 Introduction to Neural Networks
      2.5 Neural Network Training
        2.5.1 Hyperparameter Search
        2.5.2 Learning Rate Change Methods
        2.5.3 Ensemble and Dropout
        2.5.4 Training, test and validation
        2.5.5 Neural Network evaluation
      2.6 Recurrent Neural Networks
        2.6.1 Long Short-Term Memory Recurrent Neural Networks
      2.7 Convolutional Neural Networks
        2.7.1 Convolutional layer
        2.7.2 Pooling layer
        2.7.3 Fully Connected Layer
      2.8 Compressing Neural Networks
        2.8.1 Pruning
        2.8.2 Weight quantization
        2.8.3 Huffman Coding

    3 Method and Implementation
      3.1 Event Detection Strategy
      3.2 Models
        3.2.1 Multilayered Perceptron (MLP)
        3.2.2 RNN
        3.2.3 LSTM
        3.2.4 LSTM-MLP
        3.2.5 Stacked LSTM
        3.2.6 LSTM Fully Convolutional Network (LSTM-FCN)
      3.3 Method
        3.3.1 Data Acquisition and labeling
        3.3.2 Setup of training algorithm
        3.3.3 Hyperparameter Search Method
        3.3.4 Full training of models
        3.3.5 Full model prediction
      3.4 Hardware used for Implementation
        3.4.1 EVGA NVIDIA GTX 1080
      3.5 Software, Libraries and Frameworks used for Implementation
        3.5.1 Tensorflow and Keras
        3.5.2 DEWESoft® and dwdatareader
        3.5.3 Numpy and matplotlib

    4 Results
      4.1 Model Training for First Stage Models (Classification)
      4.2 Model Training for Second Stage Models (Regression)
      4.3 First Stage Model results on Identifying Snug-Segment
      4.4 Second Stage Model Results on Identifying the Snug-Angle
      4.5 Full model results

    5 Discussion
      5.1 The Training Process
      5.2 Dataset Dependencies
      5.3 Model performance

    6 Conclusion
      6.1 Deep Learning Architecture for Angle Detection
      6.2 Implementation of Deep Learning Based Event Detection on a Mechatronic System

    7 Future work
      7.1 Implementation of the Proposed Strategy
      7.2 Model Architecture and Training

    Bibliography

    Appendices

    A Hyperparameter Search

  • List of Figures

    1.1 Example of a sectioned threaded joint

    2.1 Ideal tightening curve
    2.2 Diameter parameters
    2.3 Examples of real data
    2.4 Flowchart for different AI systems
    2.5 Visualization of representations of data in a Convolutional Neural Network
    2.6 Illustration of the input segment sliding of the tightening curve
    2.7 Biological Neuron
    2.8 Computational Model of Neuron
    2.9 Structure of a 3-layered feed-forward neural network
    2.10 Illustration of how the grid search can conceal the importance of certain hyperparameters
    2.11 Dropout applied to a network
    2.12 Classification space with two classes
    2.13 Recurrent Neural Network
    2.14 Different kinds of unfolded RNN architectures
    2.15 LSTM block with one cell
    2.16 Weights of a 3 × 3 × 3 convolutional filter
    2.17 Convolutional operation
    2.18 Max Pooling operation
    2.19 Compression scheme for deep models

    3.1 Flowchart of the snug-detector
    3.2 Architecture of MLP model
    3.3 Architecture of RNN model
    3.4 Architecture of LSTM model
    3.5 Architecture of LSTM-MLP model
    3.6 Architecture of Stacked LSTM model
    3.7 Architecture of LSTM Fully Convolutional Network model

    4.1 Validation for each epoch during training of the first stage MLP model
    4.2 Validation for each epoch during training of the first stage LSTM-MLP model
    4.3 Validation for each epoch during training of the first stage LSTM model
    4.4 Validation for each epoch during training of the first stage RNN model
    4.5 Validation for each epoch during training of the first stage Stacked-LSTM model
    4.6 Validation for each epoch during training of the first stage LSTM-FCN model
    4.7 Validation for each epoch during training of the second stage MLP model
    4.8 Validation for each epoch during training of the second stage LSTM-MLP model
    4.9 Validation for each epoch during training of the second stage LSTM model
    4.10 Validation for each epoch during training of the second stage RNN model
    4.11 Validation for each epoch during training of the second stage Stacked-LSTM model
    4.12 Validation for each epoch during training of the second stage LSTM-FCN model
    4.13 Error of prediction for prediction frequency 8 kHz
    4.14 Error of prediction for prediction frequency 80 Hz
    4.15 Error of prediction for prediction frequency 16 Hz
    4.16 Error of prediction for prediction frequency 5.33 Hz

    5.1 Illustration of a Turbotight® curve

  • List of Tables

    2.1 A selection of hyperparameters and their influence

    4.1 Lowest validation loss for the first stage models
    4.2 Lowest validation loss for the second stage models
    4.3 Accuracy of first stage models
    4.4 Mean absolute error of second stage models
    4.5 Full model results

    A.1 LSTM-FCN: Fine Search top 5 results
    A.2 RNN
    A.3 LSTM
    A.4 LSTM-MLP
    A.5 MLP
    A.6 Stacked LSTM

  • Glossary

    ANN Artificial Neural Network

    CNN Convolutional Neural Network

    CPU Central Processing Unit

    FC Fully Connected Layer

    GPU Graphics Processing Unit

    LSTM Long Short-Term Memory

    Prediction Common term for the output of a machine learning algorithm

    RNN Recurrent Neural Network

    SGD Stochastic Gradient Descent

  • Chapter 1

    Introduction

    Efficient and reliable assembly of products is a prerequisite for our modern industrial society, and has been a driver for the industrial revolution. Whether it's a smartphone or an airliner being produced, the customer expects a quality associated with its cost. Airplanes need to handle hundreds of flights per year and mobile devices sustain harsher conditions than we give them credit for. As products become more advanced and environmental awareness demands longer life cycles to mitigate ecological impact, the assembly processes need to evolve and meet those demands.

    1.1 Background

    A big part of an assembly process concerns joining components. There are several methods to achieve a joint, e.g. bonding, welding or riveting; however, the most common method is to use threaded fasteners, i.e. screws, to clamp the joint members together. Threaded joints are fast to achieve, versatile and easy to dismantle. An example of a threaded joint is presented for reference in Figure 1.1. Clamping is accomplished either by tightening a screw together with a nut or into a threaded hole in one of the components. The screw is mechanically comparable to a spring, pulling the components together.


    Figure 1.1 – Example of a sectioned threaded joint

    1.1.1 Conversion of clamping force to torque

    When developing a design, manufacturing engineers establish specifications in the blueprints on the amount of clamping force required in various joints. When the design is received for assembly, the clamping force is converted to a tightening torque by using standard tables. This is done because there is no appropriate way to directly measure the clamping force during assembly. The measuring methods available are expensive, require separate equipment or can only be used for testing due to the fastening of the measuring component in the joint [1, 2].

    1.1.2 Tightening methods

    Tightening methods are distinguished by the goal variable of the process: when the goal value is reached, the process is finished.

    Torque control as a tightening method

    Torque control as a tightening method is relatively inaccurate compared to other available methods. The reason for this is that the relationship between the torque and the clamping force is subject to many material parameters. The clamping force can vary from ±17% to ±43% for a given tightening torque measurement [3]. This results in a low bolt utilization, meaning that the bolt needs to be stronger in practice than in theory to get a significant overhead and minimize the risk of breaking. It also makes plastic region tightening unreliable, due to the unknown stress of the bolt [1, 4].

    Angle control as a tightening method

    Another way of measuring the clamping force is to measure the turned angle from the moment the elongation of the bolt starts. This point is called ”snug” and occurs when the bottom surface of the bolt head makes contact with the underlying surface. It is indicated by a distinct increase in torque as the bolt continues to rotate, but it is difficult to detect with numerical methods among noisy data. The noise can, for example, be due to sensor anomalies, obstructing washers or dirt in the joint or thread. This method achieves a clamping force spread of ±9% to ±17% [1, 3].

    1.2 Purpose

    Deep learning has shown great promise in the analysis of time series data such as the evolution of prices, natural language processing and other sequential data. Given the difficulty of writing rules that generalize well for estimating the angle at which snug occurs with numerical analysis, this thesis will investigate whether deep learning can be applied to detect the snug event during a bolt tightening using the sensor data available in the tool. To determine this, the following questions will be researched:

    • How can deep learning be implemented in order to detect events in sensor data from mechatronic systems?

    • Which deep learning architectures are suitable, in terms of accuracy or precision, for classifying sequential data to detect the angle where the snug fit occurs?

    Mechatronics is ”an interdisciplinary design methodology which solves primarily mechanically oriented product functions through the synergistic spatial and functional integration of mechanical, electronic, and information processing subsystems” [5]. In this thesis, the method is applied to sensor data collected on a mechatronic system in the form of the Atlas Copco PowerFocus 6000 [6].

    1.3 Method

    To address the research questions, a review of existing literature on the subject of time series analysis and prediction with deep learning will help narrow down the varieties of deep learning architectures that can be applied to time-series data.

    The research in this thesis will revolve around evaluating different deep learning architectures found in the literature review. Once a few promising candidates have been chosen, tweaking the parameters or architectures of the algorithms may increase their performance until one candidate is concluded to be best suited for final implementation.

    1.4 Delimitations

    Some delimitations have been decided upon in order to be able to answer the research questions with as high a quality as possible.

    1.4.1 Tightening strategies

    Aside from the various tightening methods that use their respective control variables, there are also several kinds of tightening strategies. Continuous drive, Turbotight® and pulse drive are discussed below.

    The continuous drive strategy tightens the joint in one continuous rotation of the bolt until the objective variable is reached. This produces a smooth curve like the one shown in Figure 2.1.

    Turbotight® is a tightening strategy that reduces the reaction torque at the very end of the tightening, making the tool more ergonomic for the operator. This strategy produces curves different from those of continuous drive, but with the same appearance around the snug-angle. We believe this allows the algorithm to classify even these curves accurately, yet this was not tested in this thesis.

    Pulse drive, as the name suggests, applies pulses of torque to tighten the joint. This also produces curves that differ from those of continuous drive, making the snug-angle appear very different. Pulse drive tightenings are therefore excluded from the scope of this thesis.

    1.4.2 Implementation

    Due to the complexity of the target hardware and the limited time, we will not implement the algorithm on the target hardware. We will, however, evaluate recommended steps towards an implementation with regard to the target hardware, and simulate real-time execution.

    1.4.3 Choice of input data

    The data acquired for training of the algorithm contains several sequences of data measured by the tool. We have decided to limit the input parameters of the network to torque and angle sequences, because they are consistently present in all available data files. This is also the most common way of analyzing a tightening, and the sensors responsible for collecting this data are used in all tightening tools. This excludes any data on tool orientation, speed, current draw, voltage, etc. Our belief is that this will make for a well-generalizing model, applicable to various tightening strategies.

    1.5 Ethics and Sustainability

    Deep Learning applications are highly dependent on large amounts of data. During the collection of the data, ethical aspects such as the privacy of the person providing the data need to be considered. In the case of this implementation, the tool operator's physical abilities might be of interest, since these will affect the tightening. The ethical consideration here is to acknowledge that the operator's identity is not of interest and that, if the data is mishandled, it might have implications for the operator.

    In future developments of Deep Learning applications in mechatronic products, there is an overall concern that the intelligence in these products can be of potential harm to human beings. If a machine, for example, is developed to maximize efficiency for a certain application, interference with the process could be devastating, as the machine would eliminate anything that affects the efficiency.

    In terms of sustainability, it should be noted that one of the reasons for this thesis' existence is to develop more effective and sustainable tightening methods. This mainly concerns the fact that previous tightening methods require the joints and bolts to be over-dimensioned in order to handle the uncertainty of the bolt stress that is inherent to, for example, torque tightening. The proposed method could therefore lower material usage in industrial assembly.


  • Chapter 2

    Background Study

    This chapter contains the background study and literature review of previous work essential to the thesis. The chapter presents the theory revolving around tightening, neural networks and methods associated with the purpose of the thesis.

    2.1 Introduction to Tightening

    This section will explain the required theory behind tightening and cover two metrics that can be used to measure the target of the tightening.

    During a tightening, the screw will go through the different phases depicted in Figure 2.1 and explained below.

    Figure 2.1 – Ideal tightening curve (torque vs. angle, with phases 1–4 marked)

    1. Rundown: the prevailing torque zone that occurs before the fastener head or nut contacts the bearing surface.

    2. Alignment and snug: The fastener and joint mating surfaces are drawn into alignment to achieve a snug condition.


    3. Elastic Clamping: the slope of the torque-angle curve is essentially constant as the bolt is elongated.

    4. Yield: The stress is now so high that the bolt is deforming plastically and will break if tightened further.

    The goal of this process is to tighten the bolt until just before, and in some cases until, it plasticizes, i.e. close to or inside phase 4. From phase 2 onwards, the bolt head and the pitch of the thread pull the bolt apart and cause it to lengthen. Like a spring, the bolt's elasticity pulls the joint components together with the so-called clamp force.

    When converting clamp force F to tightening torque T, the relationship between them is approximated by

    T = F\left(0.16\,P + 0.58\,d_2\,\mu_{th} + \frac{D_{Km}\,\mu_h}{2}\right),    (2.1)

    where P is the pitch of the thread, μ_th and μ_h are the friction coefficients related to the specific bolt thread and bolt head respectively, while d_2 = (d + d_3)/2 and D_Km = (d_w + d_h)/2, as defined in Figure 2.2a and Figure 2.2b [1].

    Figure 2.2 – Diameter parameters: (a) Thread parameters (d, d2, d3, P); (b) Bolt head parameters

    When using angle control, the relationship between the angle A and the clamp force F is described by the equation

    A = \frac{F}{k},    (2.2)

    where k is a material parameter of the combined joint and screw called the force rate.
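    As a rough numerical illustration of equations (2.1) and (2.2), the sketch below converts a target clamp force to a tightening torque and to a snug-to-target angle. It is only a sketch: the functions and all parameter values (an M8-style bolt, friction coefficients, force rate) are hypothetical examples and are not taken from the thesis.

```python
# Illustrative use of equations (2.1) and (2.2); all parameter values below are
# hypothetical examples, not values from the thesis.

def clamp_force_to_torque(F, P, d2, D_Km, mu_th, mu_h):
    """Tightening torque T for a clamp force F, equation (2.1)."""
    return F * (0.16 * P + 0.58 * d2 * mu_th + (D_Km * mu_h) / 2.0)

def clamp_force_to_angle(F, k):
    """Turned angle A from snug for a clamp force F and force rate k, equation (2.2)."""
    return F / k

F = 20_000.0                 # target clamp force [N]
P = 1.25e-3                  # thread pitch [m]
d2 = 7.2e-3                  # d2 = (d + d3) / 2 [m]
D_Km = 11.0e-3               # D_Km = (dw + dh) / 2 [m]
mu_th, mu_h = 0.12, 0.12     # thread and head friction coefficients (assumed)

print(clamp_force_to_torque(F, P, d2, D_Km, mu_th, mu_h))   # tightening torque [Nm]
print(clamp_force_to_angle(F, k=150.0))                      # angle for an assumed force rate k
```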

    The length of the rundown zone varies with several environmental parameters, such as the length of the screw, the extent to which it has entered the threading before the tool is introduced, and the thickness of the joint. The snug-angle can therefore not be predetermined for a general joint.

    Because zone 3 is distinguished by a sudden increase in dT/dA, one could assume it would be easy to pinpoint in the curve shown in Figure 2.1 using thresholds on T, dT/dA or d^2T/dA^2. That curve, however, is an ideal representation of the course of a tightening. In reality there is a lot of noise in the measurements, inherited from factors like vibrations, imperfections of the thread and sensor inaccuracy. Two curves taken from real data can be seen in Figure 2.3. Static analysis of this data could prove complicated and non-robust due to variations in the curve caused by environmental characteristics.


    Figure 2.3 – Examples of real data. Left: a sudden increase in torque due to imperfections. Right: a very noisy curve, probably due to a low rotation speed.

    2.2 Introduction to Artificial Intelligence

    The field of artificial intelligence (AI) is bigger than ever and is being integrated into many applications. Andrew Ng, esteemed AI researcher and professor at Stanford University, has even dubbed AI the New Electricity, saying that ”Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don't think AI will transform in the next several years” [7].

    AI comes in many variations and the degree of intelligence varies. Figure 2.4 shows flowcharts of how different AI algorithms work. Early AI algorithms were knowledge-based and used formal, hard-coded logical rules. This approach is difficult, since it is hard to write rules that generalize well. Following this approach came AI that could acquire its own knowledge and devise its own rules. This approach, called machine learning, is able to make more subjective decisions [8].


    Figure 2.4 – Flowchart for different AI systems (rule-based systems, classic machine learning, representation learning and deep learning), with gray boxes indicating components that are able to learn [8].

    In machine learning, the algorithm relies heavily on the representation of the data it is given. That means that the data needs to be structured so that each piece of information is a relevant representation that the algorithm can process. Each piece of information, also called a feature, is then correlated to an output. For many tasks, however, feature extraction is hard. For example, an object in an image such as a wheel is hard to describe in terms of pixel values, and even if that could be done, it would be demanding to find the wheel depending on light conditions, obstructions of view in the image or other factors that make the object less clear. An approach to tackle this is to use machine learning to learn the representation itself, not only the mapping between representation and output. This approach, called Representation Learning, allows greater and faster adaptation to new tasks.

    When designing algorithms for learning features in data, the data needs to be processed to separate the factors of variation. The factors of variation are sources that influence the data and can explain variations in the dataset; e.g. in speech recognition, the speaker's voice is influenced by the speaker's gender and accent. For many tasks, separating these factors can be almost impossible. A factor of variation as sophisticated as the speaker's accent requires deep knowledge about the data and how the accent influences it.

    Deep learning approaches this problem by creating representations out of other, simpler representations, as seen in Figure 2.5. In this way, the input is first separated into simpler concepts which are then combined into new concepts that eventually lead to a prediction [8].

    2.2.1 Regression and Classification

    There are several types of tasks that machine learning can be used for, but among the most common are classification and regression tasks:


    Figure 2.5 – Visualization of how different layers represent the data in a Convolutional Neural Network [9]

    • Classification: is where an algorithm determines to which of a set of categories c an input belongs. The algorithm learns a function f : R^n → {1, ..., c} and outputs a probability for each of the categories. For example, classification is used in image recognition to determine which of the previously learned classes has the highest probability of being present in the image.

    • Regression: is where the algorithm predicts a continuous numerical value given some input. The algorithm learns a function f : R^n → R. Regression can be used, for example, to predict future stock prices or to forecast the weather.

    Deep learning algorithms are able to handle both of these types of tasks, and many more. For image classification, deep learning implementations surpass any other machine learning algorithm [10].

    2.2.2 Sequence Labeling

    In machine learning, sequence labeling involves tasks where sequences of data are transcribed with sequences of discrete labels. The goal for the algorithm is to assign a label to a sequence of input data. There are three general types of sequence labeling tasks: temporal classification, segment classification and sequence classification. Temporal classification is where the only information available is the target sequences, i.e. the input is classified as a whole and alignment is not of interest. Segment classification is a special case of temporal classification where the data is labeled with both targets and input-target alignments. Sequence classification is a special case of segment classification and the most strict, where each input is assigned one, and only one, label [11].

    2.3 Event Detection in Sequential Data

    As a sequence is collected, event detection is the process in which the detector is to find an index i at which an event of interest occurs [12]. The goal of the event detector can therefore be described as twofold:


    • Firstly, to detect whether an event of interest has occurred,

    • Secondly, to characterize the event by, for example, time, type or severity

    It is generally a more demanding task to detect events than to classify them. Essentially, this is because the classification task has access to the boundaries within which a classification should occur, while in the detection task the boundaries are not known in advance [13].

    There are two main categories of event detectors: threshold-based and supervised learning-based [14]. Unsupervised learning-based event detection, i.e. learning algorithms that find patterns in the data by themselves, has also been applied for event detection tasks. That approach is used where it is not known beforehand what kinds of events are of interest [15].

    Threshold-Based Methods

    Threshold-based methods use the assumption that an event will result in some change in the data that differs from the normal. The normal behavior can be described by thresholds, e.g. maximum values, rates of increase or combinations of these, based on historical data. If the data contains more than one variable, e.g. when several sensors are used to produce the data, it is possible to have individual thresholds that together make up the event detector. An example of this could be a fire alarm, where temperature, carbon monoxide and other appropriate sensors are read and compared to the historical normal. For a specific problem, data variables are weighted depending on their importance and combined to detect the event.

    Threshold methods have the advantage of low computational cost and are simple to implement. However, they are highly dependent on the kinds of sensors involved, and it can be hard to specify rules for events. Often, events cannot be fully captured by threshold values [12].
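    A minimal sketch of such a detector is shown below: it flags the first sample where the torque-vs-angle slope exceeds a fixed threshold. The threshold value and the choice of the slope as the monitored quantity are illustrative assumptions, not choices made in the thesis.

```python
import numpy as np

def threshold_event_index(torque, angle, slope_threshold=5.0):
    """Return the first index where dT/dA exceeds a fixed threshold, else None.

    The threshold value is illustrative; real data would need tuning and
    probably smoothing to cope with the noise discussed earlier."""
    slope = np.gradient(torque, angle)              # numerical dT/dA
    above = np.nonzero(slope > slope_threshold)[0]
    return int(above[0]) if above.size else None
```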

    Supervised Learning-based Methods

    In supervised learning-based event detectors, the detector has access to annotated sequences where an event has occurred. Sequences from the data are sampled at a constant sampling rate into so-called frames or windows. For each window, characteristics (i.e. features) are extracted. These features will be the annotation labels to which the supervised learning can be applied [16]. However, even though the event is detected, the timing of the event can occur anywhere in the window. To handle both event detection and timing of the event, there are two common approaches:

    • Detection-by-classification: A window of a fixed length slides over the data series and a classifier determines whether an event occurs. Each window is then split into smaller windows which classify the type of event the larger window contains and align the label to the data series. The window can move by a certain number of time steps so that it overlaps the previous window, or be non-overlapping [17]. An example of the former is seen in Figure 2.6.


    Figure 2.6 – Illustration of the input segment sliding over the tightening curve

    • Detection-by-classification with regression: This approach builds on the detection-by-classification approach. A classifier determines whether an event has occurred. Then a second classifier determines what kind of event occurred. Subsequently, a regression algorithm estimates where in the window the event appeared [18]. A simplified sketch of this two-stage idea is given after this list.
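    The sketch below outlines the windowing logic of the two-stage approach. The classifier and regressor are placeholders for trained models (passed in as plain callables), and the window length and step size are arbitrary examples.

```python
def detect_events(series, classify, regress, window_len=256, step=32):
    """Slide a fixed-length window over `series`; `classify` flags windows that
    contain an event and `regress` estimates the offset inside the window.
    Both callables stand in for trained models."""
    events = []
    for start in range(0, len(series) - window_len + 1, step):
        window = series[start:start + window_len]
        if classify(window):                 # stage 1: does this window contain an event?
            offset = regress(window)         # stage 2: where inside the window?
            events.append(start + int(round(offset)))
    return events
```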

    There are several machine learning algorithms that can be used as classifiers or regressors, some of which are Support Vector Machines (SVM), K-nearest neighbor, naïve Bayes and neural networks [14]. A general disadvantage of this approach is that the detector only focuses on one frame at a time and might, depending on the algorithm, miss information in previous frames [16].

    2.4 Introduction to Neural Networks

    With inspiration from biology, computer scientists have developed the Artificial Neural Network (ANN), mimicking the basics of a neuron in a brain. A neuron in the human brain, see Figure 2.7, receives its inputs through its dendrites and outputs a signal along its axon, which eventually branches out and connects to other neurons through the axon terminals via synapses. The synapses, not shown in the figure, are where the communication between neurons takes place.


    Figure 2.7 – Biological Neuron [9]

    In light of the biological neuron, the computational neuron acts similarly, see Figure 2.8. In the so-called forward pass, a signal, x0, travels along the axon and interacts with the dendrites of the other neuron. The level of interaction is determined by the ”synaptical” strength, w0, which is a learnable variable. w0 essentially determines the influence of the connection and is called a weight.

    Figure 2.8 – Computational Model of Neuron

    The dendrites carry the signal multiplied by the synaptical strength, w0x0, to the cell body, where the signals are summed. In the biological neuron, the neuron would ”fire” a signal along its axon whenever the sum exceeded a certain threshold. In the computational neuron, however, the precise timing of the firing is unimportant. Instead, only the frequency of the firing, that is how often the neuron is activated, communicates information [9]. This concept, called rate coding, stems from the belief that biological neurons partly communicate through the frequency of firings. The computational neuron therefore lets the sum pass through an activation function, g, which models this firing rate.

    The neurons are combined into an acyclic graph, i.e. the outputs of one neuron can be the input of another neuron. Most commonly, the neurons are organized in fully connected layers in which two adjacent layers are pairwise connected, as seen in Figure 2.9. The input is passed through hidden layers to the output layer. The number of layers, by convention, is the number of layers excluding the input layer, i.e. the hidden layers and the output layer.
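    A minimal numpy sketch of the forward pass just described, with a sigmoid assumed for the activation g; the input, weight and bias values are arbitrary examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b, g=sigmoid):
    """One computational neuron: weighted sum of the inputs plus a bias,
    passed through the activation function g."""
    return g(np.dot(w, x) + b)

y = neuron_forward(x=np.array([0.5, -1.2]), w=np.array([0.8, 0.3]), b=0.1)
print(y)
```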


    Figure 2.9 – Structure of a 3-layered feed-forward neural network (input layer, hidden layers, output layer).

    By using this approach, an ANN built from these computational neurons, f(x, W, b), where x is the input and W is the set of weights being learned, is able to approximate a function f*. It has been shown that no matter the function f*, there is a neural network f(x, W, b) that for every possible value x will give the output f*(x). ANNs have therefore earned the attribute Universal Approximators [19].

    2.5 Neural Network Training

    Training of neural networks to learn a certain task can be done with three different approaches: supervised learning (where each input in the dataset has a paired target), reinforcement learning (where scalar reward values are provided for training) or unsupervised learning (where no information is given during training, and the algorithm will try to learn by only observing the data). In this thesis, the data is labeled, i.e. supervised learning can be used.

    Common to each of these approaches is the goal of achieving some minimum error, E, when doing classification or regression. During training, the parameters in the network are incrementally updated so that a loss or cost function, which is closely related to E, is minimized. In general terms, the training is conducted in the following way: a subset of the dataset, called a Mini-Batch, is fed to the network. For the given input, the network outputs values based on the current parameters by doing a forward pass, also called inference. The output is then compared with the target labels, and the loss and the gradients of the loss function are calculated. The gradients are propagated back through the network and the parameters are updated according to a learning rate. This method, called Stochastic Gradient Descent (SGD), can be seen in Algorithm 1. Once one Mini-Batch has been fed, another Mini-Batch is created and the process repeats itself until the whole dataset has been used. Putting through the whole dataset in this way is called one epoch. Often, the training will need several epochs.

    While very popular, SGD is slow to learn with, and several evolutions of the algorithm have been developed, for example SGD with momentum. Other examples are algorithms with adaptive learning rates, such as AdaGrad, RMSProp and Adam. These have been developed since researchers have realized that the most difficult hyperparameter (see Section 2.5.1) to set for the learning algorithm is the learning rate.


    Algorithm 1 Stochastic Gradient Descent
    Require: Learning rate ε
    Require: Initial parameters θ
    while training is not done do
        Forward pass with mini-batch x: y_i = f(x_i, θ, b)
        Compute gradient ∇g of the loss function
        Update θ with learning rate ε: θ ← θ − ε∇g
    end while
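    A numpy sketch of the loop in Algorithm 1, here applied to a linear model with a mean-squared-error loss purely for illustration; the models trained in the thesis are neural networks, not this toy model.

```python
import numpy as np

def sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Plain stochastic gradient descent on a linear model with MSE loss (illustrative)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                           # one epoch = one pass over the dataset
        order = rng.permutation(len(X))
        for i in range(0, len(X), batch_size):        # draw one mini-batch
            idx = order[i:i + batch_size]
            xb, yb = X[idx], y[idx]
            pred = xb @ theta                         # forward pass
            grad = 2.0 * xb.T @ (pred - yb) / len(idx)    # gradient of the MSE loss
            theta -= lr * grad                        # parameter update
    return theta
```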

    Even though it seems that the optimizers with adaptive learning rates perform robustly, no optimizer has been dubbed the best, and the choice can be based on the user's knowledge about the algorithm [8, 20].

    2.5.1 Hyperparameter Search

    The process of training neural networks is very complex and involves choosing and tuning a large number of so-called hyperparameters. Hyperparameters are parameters external to the model that are most often set manually. There are several hyperparameters that can be considered for tuning; some of the most influential are presented in Table 2.1.

    Figure 2.10 – Illustration of how the grid search can conceal the importance of certain hyperparameters [21]


    Table 2.1 – A selection of hyperparameters and their influence. With inspiration from [8].

    Number of Hidden Units: The model's ability to learn different representations of the data varies with the size of the model. A larger model gives more capacity to learn. However, increasing the number of units increases the time needed to train the model.

    Learning Rate: When moving towards a minimum in the cost function, the optimizer takes steps with a step size multiplied by the learning rate. This is a very influential parameter that affects the optimizer's ability to ”get stuck” in or ”break free” from local minima, with the possibility of finding an even lower minimum [9]. An improper learning rate results in a model with low efficiency.

    Regularization: Regularization strategies such as dropout affect the generalization error and the model's ability to make general predictions.

    The process of tuning the hyperparameters can be conducted in two ways: random search or grid search. In random search, the parameters are in general chosen randomly from a predetermined range of numbers, or sampled from a list of numbers. In grid search, there is only a list of numbers for each hyperparameter to sample from. Though the distinction may seem trivial, using the random approach allows for a greater understanding of which parameters influence the model and how. For example, when performing grid search, the unimportant values may be over-analyzed, see Figure 2.10 [21]. A disadvantage of automatic hyperparameter search is that it is computationally very heavy.
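    A minimal sketch of random search over a few of the hyperparameters in Table 2.1. The function `train_and_validate` is a placeholder for training a model with the given configuration and returning its validation loss, and the search space itself is an arbitrary example.

```python
import random

def sample_config():
    """Draw one random hyperparameter configuration (illustrative ranges)."""
    return {
        "hidden_units": random.choice([32, 64, 128, 256]),
        "learning_rate": 10 ** random.uniform(-4, -1),   # log-uniform between 1e-4 and 1e-1
        "dropout_rate": random.uniform(0.0, 0.5),
    }

def random_search(train_and_validate, n_trials=20):
    """Return the configuration with the lowest validation loss over n_trials draws."""
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config()
        loss = train_and_validate(**cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss
```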

    2.5.2 Learning Rate Change Methods

    During longer training, a common phenomenon is that training gets ”stuck” in a local minimum [8]. By reducing the learning rate when stuck, the training can sometimes progress further and get out of that local minimum. This can be done either by applying a learning rate schedule, where the learning rate is changed after a few epochs, or by using a ”reduce learning rate on plateau” approach [22, 23]. The latter decreases the learning rate when the learning has been stationary over a number of epochs.
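    The ”reduce learning rate on plateau” behavior is available as a callback in Keras (the framework used later in the thesis, Section 3.5.1). The sketch below is illustrative: the toy model and random data exist only to make the snippet runnable, and the factor and patience values are arbitrary examples.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Toy model and data, purely so the callback usage below is runnable.
model = Sequential([Dense(16, activation="relu", input_shape=(8,)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
x_train, y_train = np.random.rand(256, 8), np.random.rand(256)
x_val, y_val = np.random.rand(64, 8), np.random.rand(64)

# Halve the learning rate when the validation loss has not improved for 5 epochs.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=20,
          callbacks=[reduce_lr])
```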

    2.5.3 Ensemble and Dropout

    When optimizing and training neural networks, the algorithm should perform well not only on the training data but on other, similar input data as well. There are several methods to achieve this, which go under the term generalization methods, where the goal is to reduce the generalization error, i.e. the error on new input data.


    One way of reducing the generalization error is to use an ensemble method, for example Bagging [24]. Ensemble methods involve training several networks and averaging their results to make a final prediction. It can be shown that the ensemble on average performs at least as well as any of its members, and if the members make uncorrelated errors the ensemble will perform significantly better than its members [8]. When performing bagging, each model is trained on an independent subset of the dataset, thus ensuring that each model is missing information that could be found in examples that other models have access to. Small nuances in knowledge will be obtained, and different models will become better at certain types of examples. By averaging the model outputs, the algorithm becomes robust and generalizes well, even if each of its models individually generalizes poorly [24].

    Another way of reducing the generalization error is to perform what is called dropout. One of the disadvantages of bagging is that it becomes computationally expensive when the models are deep. Dropout solves this by training subnetworks of the underlying base network, created by removing non-output neurons from it, as seen in Figure 2.11. Every time a new minibatch is loaded, a random binary mask is applied to the network. The purpose is to mask the network so that a subnetwork, containing a fraction of the whole network determined by the mask, is created and trained with that minibatch. Unlike bagging, the model represented by the subnetwork shares parameters with the underlying network, thus allowing training of an exponentially growing number of models with a controllable amount of computational power.
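    A minimal Keras sketch of dropout between fully connected layers. The layer sizes, input shape and 50% drop rate are arbitrary examples, not the architecture used in the thesis.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Dropout randomly masks half of the preceding layer's units for each
# mini-batch during training; at inference time all units are kept.
model = Sequential([
    Dense(128, activation="relu", input_shape=(64,)),
    Dropout(0.5),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
```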

    Figure 2.11 – Dropout applied to a network. The base network is split into subnetworks that are used in training. The subnetworks share parameters, which enables training of many different models with a manageable amount of memory [8].


    2.5.4 Training, test and validation

    When training a machine learning model, a common mishap is overfitting, where the algorithm learns the training data so well that it does not generalize and makes incorrect predictions on other data. A method to avoid that kind of behavior is the holdout method, where the dataset is partitioned into three subsets [8]:

    • Training set: Used by the training algorithm to adjust the parameters in the model.

    • Validation set: Used to give an estimate during training of how well the model generalizes to other data. This is monitored during training to determine when learning is no longer increasing.

    • Test set: Used to evaluate how well the model generalizes to other data and to compare it to other models.

    By using this method, the models will be evaluated accordingly and the generalization error can be minimized [8].
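    A numpy sketch of such a partition, assuming a 70/15/15 split chosen purely for illustration.

```python
import numpy as np

def holdout_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the dataset and partition it into training, validation and test sets."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```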

    2.5.5 Neural Network evaluation

    The performance of a machine learning algorithm is commonly evaluated on the basis of some form of metric. There are several to choose from, appropriate for different tasks. A metric function is similar to a loss function, with the difference that its result is not used for training. Metrics are human-interpretable values used to compare different models' accuracies. Since a model that generalizes well is the goal, the metric is calculated on a test set, containing samples randomly drawn from the whole data set and not present in the training set, to make sure that they are new from the model's perspective.

    Metric for Classification

    A classification made by an algorithm can either be a true positive/true negative (correctly estimated as either positive or negative) or a false positive/false negative (incorrectly estimated as either positive or negative), as visualized in Figure 2.12.

    To evaluate the algorithm's ability to make good predictions, there are a number of metrics that can be used. A common one is classification accuracy. It is simply measured as the proportion of examples for which the model produces the correct output:

    Accuracy = \frac{tp + tn}{tp + tn + fp + fn}    (2.3)

    A problem with classification accuracy is that it does not take into account whether the dataset is balanced or not. The algorithm could learn to only predict positives or negatives and still get a high accuracy. Metrics such as precision and recall are better suited to reveal that kind of behavior. Precision is the algorithm's ability to make relevant predictions, i.e. the fraction of correct predictions out of all predictions:

    Precision = \frac{tp}{tp + fp}    (2.4)


    Figure 2.12 – Classification space with two classes (relevant elements vs. selected elements; true/false positives and negatives). The outlined area represents the classifications made by the algorithm [25].

    Recall is the algorithm's ability to retrieve correct predictions, i.e. the fraction of correct predictions out of all correct examples:

    Recall = \frac{tp}{tp + fn}    (2.5)

    Another common metric is F1-score:

    F_1 = 2 \cdot \frac{\mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}    (2.6)

    which is the harmonic average of the precision and recall.
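    A small numpy sketch of equations (2.3)–(2.6) for binary labels; it assumes that both classes occur in the predictions and targets so that none of the denominators is zero.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels, equations (2.3)-(2.6)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f1
```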

    Metric for Regression

    Accuracy, recall or precision is not as telling for regression as it is for classification. It is neither required nor likely that the algorithm produces an output that is equal to the target down to the last decimal. Hence, the proportion of ”correct” predictions is not a relevant metric for regression. The mean absolute error (MAE) or mean squared error (MSE) is used instead, signifying the mean numerical difference between the output values and the target values [8]:

    MAE = \frac{1}{n}\sum_{i=1}^{n} \sqrt{(\hat{x}_i - x_i)^2},    (2.7)

    MSE = \frac{1}{n}\sum_{i=1}^{n} (\hat{x}_i - x_i)^2.    (2.8)


    Depending on the task at hand, either can be used. However, the implications when used as a loss function during training can be severe depending on which algorithm is used. Consider the absolute error. Its derivative with respect to x̂ is

    \frac{d(\mathrm{AE})}{d\hat{x}} = \frac{\hat{x} - x}{\sqrt{(\hat{x} - x)^2}} = \begin{cases} 1, & \hat{x} > x \\ -1, & \hat{x} < x \end{cases},    (2.9)

    preventing gradient descent from updating the parameters granularly according to the magnitude of the error. The squared error, on the other hand, has the derivative

    \frac{d(\mathrm{SE})}{d\hat{x}} = 2(\hat{x} - x),    (2.10)

    which changes with respect to the difference between the prediction and the target. This is further discussed in Section 3.3.2.
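    A small numerical sketch of equations (2.7)–(2.10), showing that the absolute-error gradient has constant magnitude while the squared-error gradient scales with the size of the error.

```python
import numpy as np

def mae(x_hat, x):
    return np.mean(np.abs(x_hat - x))      # equation (2.7)

def mse(x_hat, x):
    return np.mean((x_hat - x) ** 2)       # equation (2.8)

def grad_abs_error(x_hat, x):
    return np.sign(x_hat - x)              # equation (2.9): always +1 or -1

def grad_squared_error(x_hat, x):
    return 2.0 * (x_hat - x)               # equation (2.10): proportional to the error
```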

    2.6 Recurrent Neural Networks

    Recurrent Neural Networks (RNN) are a family of neural networks that specialize in processing sequential data, e.g. a sequence of values x(1), ..., x(τ). An RNN is essentially an ANN where cyclic connections are allowed, which enables the network to have ”memory” from previous time steps that persists as an internal state in the network, which in turn influences the output of the network [8, 11]. Just as ANNs are universal approximators, an RNN with a sufficient number of hidden units can approximate any measurable sequence-to-sequence mapping to arbitrary accuracy [26].

    Figure 2.13 – Recurrent Neural Network. The input x passes through the network to the output y via the hidden unit h. Hidden states from previous time steps are shared through the weights w2.

    One of the strengths of the RNN structure is its flexibility, which allows many different kinds of inputs and outputs and can be used for a variety of tasks, from image captioning to machine translation, see Figure 2.14 [27, 28].

    Figure 2.14 – Different kinds of unfolded RNN architectures. Red boxes are inputs, green are hidden layers and blue are outputs. From left: regular neural network without RNN (e.g. image classification), sequence output (e.g. an image is given as input and the output is words describing the image), sequence input (e.g. sentiment analysis of a sentence where the output is whether the sentence is positive or negative), sequence input and sequence output (e.g. machine translation where a sentence in one language is the input and the output, given after the last word in the input, is a sentence in another language), synced sequence (e.g. video classification where each frame is classified) [29].

    2.6.1 Long Short-Term Memory Recurrent Neural Networks

A problem with regular RNNs is that the influence of a given input on the hidden and output layers either decays or explodes as it is cycled through the network, so the network will have trouble learning long-term dependencies. A popular way to tackle this problem, called the vanishing gradient problem, is to use Long Short-Term Memory (LSTM) networks.

Figure 2.14 – Different kinds of unfolded RNN architectures. Red boxes are inputs, green are hidden layers and blue are outputs. From left: regular neural network without recurrence (e.g. image classification), sequence output (e.g. an image is given as input and the output is words describing the image), sequence input (e.g. sentiment analysis of a sentence where the output is whether the sentence is positive or negative), sequence input and sequence output (e.g. machine translation where a sentence in one language is the input and the output, given after the last word of the input, is a sentence in another language), synced sequence input and output (e.g. video classification where each frame is classified) [29].

    Figure 2.15 – LSTM block with one cell

The LSTM block, shown in Figure 2.15, contains one or more so-called memory cells and three gates: the input, forget and output gates. The gates are summation units that collect activations from inside and outside the block and then control the cell's own activations. The cell has a self-loop, an internal recurrence that is used in addition to the recurrences outside of the block. The self-loop is controlled by the forget gate, which gives it a leaky behavior that allows the network to accumulate information and forget old states. Furthermore, the input and output of the block can be switched off and on by the input and output gates.
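For reference, a common formulation of the LSTM forward pass without peephole connections (the notation here is generic and not taken from the thesis) is:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)        (input gate)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)        (forget gate)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)        (output gate)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)        (cell state, self-loop)
h_t = o_t ⊙ tanh(c_t)        (block output)

where σ is the logistic sigmoid, ⊙ denotes element-wise multiplication, x_t is the block input and h_t the block output at time t.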


    2.7 Convolutional Neural Networks

Convolutional Neural Networks (CNN) share many of the properties of ordinary neural networks mentioned in Section 2.4: they employ learnable weights and biases, perform a dot product of inputs and said weights, express a differentiable score function and use a loss function in the last layer. The full architecture usually consists of multiple Convolutional Layers, Pooling Layers and Fully Connected Layers, described in detail below.

    2.7.1 Convolutional layer

Looking back at ANNs, they have one input neuron for each value of the input. Consider a two-layered fully-connected ANN with 64 neurons in the hidden layer. If the input to this network were an image of dimensions 32 × 32 × 3 (the last dimension contains the RGB color channels), the number of weights from the input layer to the hidden layer would be 32 × 32 × 3 × 64 = 196608. A modern computer could handle this fairly well, but as the input image scales to a more common size of 600 × 400 × 3, the number of parameters in the network grows to just over 46 million. This is very computationally heavy to train, but the parameter count can be reduced using a CNN.

Instead of having a specific weight for each individual input pixel, CNNs apply something called weight sharing. The underlying intuition behind this approach is that if a feature is useful to compute at position (x1, y1), it should also be useful to compute at another position (x2, y2). This is done using reusable filters that recognize the same feature at different positions of the input. A filter is a small matrix, typically of size 3 × 3 × 3, where the last dimension has the same meaning as above, while the first two dimensions are user defined and usually equal. A filter of that size contains 27 weight values, visualized in Figure 2.16. Multiple filters are used so that, during training, each filter tunes to recognize its related feature in the input image [9, 30].

The filter matrices are moved across and dot-multiplied with the input image to produce a weighted output as the sum of the products, illustrated in Figure 2.17. Note that the figure shows an input with one channel, for simplicity. The size of the output depends on the spatial parameters of the convolutional layer.

Most of these parameters have been covered; stride and zero-padding remain. The stride, S, is the number of steps the filter is moved in one dimension between each operation, while F is the spatial extent of the filter. A stride S < F means that a pixel value is reused in multiple convolutions, while a stride S = F uses each input pixel only once. S > F would move the filter so far that some pixels would be skipped. In Figure 2.17, S = 1 and F = 3 are used [9].

Zero-padding is a border around the input image containing pixels with the value 0. This approach can be used for convenience such that the output of a convolutional layer remains the same size as the input. This is preferred because of the simplifications it offers when sizing multiple layers to work with each other [9].

Figure 2.16 – Weights of a 3 × 3 × 3 convolutional filter

Figure 2.17 – Convolutional operation with stride 1 and no zero-padding
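As an illustration (a sketch assuming a single-channel input and no padding, not the thesis implementation), the spatial output size follows (W − F + 2P)/S + 1, and the operation in Figure 2.17 can be written directly in NumPy:

import numpy as np

def conv_output_size(W, F, S, P):
    """Spatial output size of a convolution: (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) // S + 1

def conv2d_single_channel(image, kernel, stride=1):
    """Naive valid convolution (no zero-padding) of a 2-D image with a 2-D filter."""
    H, W = image.shape
    F = kernel.shape[0]
    out = np.zeros((conv_output_size(H, F, stride, 0),
                    conv_output_size(W, F, stride, 0)))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i * stride:i * stride + F, j * stride:j * stride + F]
            out[i, j] = np.sum(patch * kernel)   # sum of element-wise products
    return out

image = np.random.randint(0, 2, size=(10, 10))   # 10 x 10 x 1 input, as in Figure 2.17
kernel = np.array([[1, 0, 0], [1, 1, 0], [0, 0, -1]])
print(conv2d_single_channel(image, kernel).shape)   # (8, 8), matching the 8 x 8 output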

    2.7.2 Pooling layer

After a convolutional layer has extracted valuable features from the input image, its output consists of a volume where higher values correlate with recognized features of that input image. The next step is to downsample that output to get a smaller representation of it, which reduces the amount of values to be processed as the network goes deeper.


There are several ways of performing downsampling in a pooling layer. The most common method is Max Pooling, where the maximum value of a submatrix of the input is chosen as an element of the output. This process is illustrated in Figure 2.18. The pooling, like the convolution, is carried out in patches, with a stride deciding how much to move between each operation. The shape of the submatrix can vary; the most common shape is 2 × 2, thus downsampling and discarding 75% of the input [30].

Figure 2.18 – Max Pooling with a 2 × 2 pooling window and stride 2
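A minimal NumPy sketch of 2 × 2 max pooling with stride 2 (illustrative only; the input values and their layout are assumed from Figure 2.18):

import numpy as np

def max_pool_2x2(x):
    """Max pooling with a 2 x 2 window and stride 2 (input dims assumed even)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[6, 1, 4, -2],
              [4, 3, 3,  2],
              [5, 1, 0,  7],
              [8, 0, 3,  5]])
print(max_pool_2x2(x))
# [[6 4]
#  [8 7]]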

    2.7.3 Fully Connected Layer

The CNN architecture commonly stacks pairs of convolutional and pooling layers until the input has been reduced to a small number of parameters. The problem mentioned at the beginning of this section can now be avoided when using Fully Connected Layers (FC) (described in Section 2.4), which often form the final step of the model. One FC layer, or a series of them, takes the output of the last pooling layer and propagates it to the output of the last FC layer. This output is passed through an activation function and interpreted as the input's association with each class [9].

    2.8 Compressing Neural Networks

While neural networks can solve complex problems with high performance, an out-of-the-box model is only optimized for the problem and not for the constraints of its target hardware. Deep learning models can be very computationally heavy and in the size range of hundreds of megabytes. The deep convolutional image recognition network AlexNet [31], with all of its weights and biases, takes up 240 MB, while the VGG-16 network [32] measures in at 552 MB. This can be more or less convenient to hold in memory, depending on the target hardware.

Networks of this size use considerable memory, CPU/GPU resources and energy to run. Recent research [33, 34, 35] aims to reduce these factors by using various compression techniques while maintaining performance. The result is a smaller, faster and more energy efficient network that can be deployed on mobile devices or embedded systems. Compression rates, energy efficiency and speed are in the order of 10× to 100× compared to the original model. The methods involve pruning, weight quantization, Huffman coding and others. Figure 2.19 shows the compression scheme used in [33]. These methods are explained briefly below.

Figure 2.19 – Compression scheme from [33]

    2.8.1 Pruning

Many models end up with more neurons than required to fit the data. Some of their corresponding weights can therefore take values close to zero during training, i.e. have a low contribution to the output. These weights are still used in computations, but can be removed by pruning to reduce the number of parameters. To ensure the accuracy is not negatively affected, pruning is performed by removing a few weights at a time. After each removal, the model is trained for a few epochs to adjust the remaining weights. Weights can be removed while monitoring the validation loss to maintain performance, or until a certain decrease in size is achieved [36].

By performing pruning, the number of weights is decreased, meaning that fewer operations need to be performed for inference. It also means the model gets smaller, as there are fewer parameters to store.
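A magnitude-based pruning step could be sketched as follows (illustrative NumPy only; in practice the removal is interleaved with re-training as described above):

import numpy as np

def prune_smallest(weights, fraction=0.1):
    """Zero out the given fraction of weights with the smallest magnitude."""
    threshold = np.percentile(np.abs(weights), fraction * 100)
    mask = np.abs(weights) > threshold
    return weights * mask, mask   # the mask can be reused to keep pruned weights at zero

W = np.random.randn(64, 64)
W_pruned, mask = prune_smallest(W, fraction=0.2)
print(1.0 - mask.mean())   # approximately 0.2 of the weights are now zero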

    2.8.2 Weight quantization

Weights are usually stored in a high precision format, such as float32 or higher. This is beneficial during training to allow for high precision adjustments of the weights by the optimizer. When the model is fully trained, a lower precision format can be used for inference by quantizing the weights. This is done by extracting the maximum and minimum values of each weight matrix, input and output and storing them with high precision. The rest of the high precision weights are then converted to a lower precision range, for example uint8, which spans from 0 to 255. 255 would then correspond to the high precision maximum, 0 to the minimum, and everything in between can be derived linearly from the represented range [37]. Converting from float32 to uint8 decreases the model size by almost 4×.

The above can be achieved by dynamically recalculating the ranges for inputs, outputs and weights during runtime, which adds computational overhead. To mitigate this, the ranges can be approximated beforehand by feeding the model with examples from the training set and recording the maximum ranges to be used during runtime [38].
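The linear uint8 quantization described above could be sketched roughly as follows (illustrative NumPy; production toolchains handle these details internally):

import numpy as np

def quantize_uint8(w):
    """Map a float32 tensor linearly onto [0, 255], keeping min and scale in float."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0          # assumes w_max > w_min
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, w_min, scale

def dequantize_uint8(q, w_min, scale):
    """Approximate reconstruction of the original float32 values."""
    return q.astype(np.float32) * scale + w_min

W = np.random.randn(256, 256).astype(np.float32)
q, w_min, scale = quantize_uint8(W)
W_hat = dequantize_uint8(q, w_min, scale)
print(q.nbytes / W.nbytes)               # 0.25, i.e. roughly 4x smaller
print(np.abs(W - W_hat).max() <= scale)  # quantization error bounded by one step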


Quantization can also be part of the training loop, as described in [33], which offers the opportunity to use weight sharing and achieve even smaller bit depths.

2.8.3 Huffman Coding

Data structures are often compressed with Huffman code, an optimal prefix code used for lossless data compression. It encodes source symbols based on probability, using fewer bits for common symbols than for less common ones [39]. According to [33], this saves 20-30% of storage, depending on the weight distribution. The weight distribution can be optimized for compression by rounding the weight values into predetermined step sizes. This results in repeating bit patterns that are highly compressible [40].
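As a toy illustration of the idea (not the exact scheme used in [33]), a Huffman code can be built over quantized weight values using Python's standard library:

import heapq
import itertools
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bit string) from a list of symbols."""
    counts = Counter(symbols)
    tie = itertools.count()   # tie-breaker so the heap never has to compare dicts
    heap = [(freq, next(tie), {sym: ""}) for sym, freq in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Quantized weights cluster around a few values; the most frequent value
# receives the shortest code.
weights = [0] * 70 + [1] * 15 + [255] * 10 + [128] * 5
print(huffman_code(weights))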

The reader is greatly encouraged to read the papers referenced in this section and their related work to get a full understanding of the possibilities of model compression. TensorFlow and Keras tools are also available to compress models created with their respective libraries, which can be useful for a high-level understanding of compression [41, 40].


Chapter 3

    Method and Implementation

    3.1 Event Detection Strategy

The strategy used in this thesis is based on the regression event detection method discussed in Section 2.3. As the torque and angle sensors collect data, a queue is filled. The latest 1500 datapoints are continuously fed to the classification model, which determines whether the current ”window” contains the snug-angle. If so, the window is passed to the regression model, which estimates the angle. A flowchart of the process can be seen in Figure 3.1. The window length of 1500 was established by an early hyperparameter evaluation not described in this report.

Figure 3.1 – Flowchart of the snug-detector: input from the sliding window is fed to the classification model, which decides whether the snug angle is in the window; if yes, the regression model estimates at what angle in the window snug occurs and produces the output.

The reasons for using this approach are as follows: the implementation should be able to predict the snug-angle during runtime and not after, which excludes the possibility of giving the algorithm the whole curve as input. The algorithm was therefore designed to make estimations on smaller segments of the curve. Consider then a regression model that continuously receives a moving segment of 1500 timesteps: it would be forced to estimate some snug-angle in every segment, regardless of whether the snug-angle is present in that segment or not.

Therefore, during training, the classification model sees 1500-timestep segments of the torque/angle curve that either contain the snug-angle or not, while the regression model only sees segments that contain the snug-angle.
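A schematic of the runtime flow in Figure 3.1 could look as follows (a simplified sketch; detect_snug, classifier and regressor are placeholder names, not the thesis implementation):

from collections import deque

WINDOW = 1500   # queue length used throughout the thesis

def detect_snug(sample_stream, classifier, regressor, step=100):
    """Slide a 1500-sample window over the torque/angle stream.

    classifier.predict(window) -> probability that the window contains the snug point
    regressor.predict(window)  -> estimated index of the snug point within the window
    Both models are placeholders for the trained first- and second-stage networks;
    step controls how many new samples arrive between evaluations.
    """
    window = deque(maxlen=WINDOW)
    estimates = []
    for i, sample in enumerate(sample_stream):
        window.append(sample)
        if len(window) < WINDOW or i % step:
            continue
        if classifier.predict(list(window)) > 0.5:
            local_index = regressor.predict(list(window))
            estimates.append(i - WINDOW + 1 + local_index)   # map back to stream index
    return estimates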


    3.2 Models

A variety of models were developed for testing on the tasks at hand. All models were trained both for the classification task and the regression task. The layer sizes and other hyperparameters were established during the hyperparameter search described in Section 2.5.1. These models are presented below.

    3.2.1 Multilayered Perceptron (MLP)

The MLP, seen in Figure 3.2, is 4 layers deep to enable it to learn many representations; however, without recurrent connections it is believed to lack the time dependencies that RNNs or LSTMs have [42]. MLPs have been used with good results on time series and sequence labeling in previous implementations [43, 44].

Figure 3.2 – Architecture of the MLP model: Input → FC → FC → FC → Output

    3.2.2 RNN

The RNN model, seen in Figure 3.3, has an RNN block consisting of a number of units and uses a fully connected layer to produce the output.

Figure 3.3 – Architecture of the RNN model: Input → RNN-block → FC → Output

    3.2.3 LSTM

As described in Section 2.6.1, the RNN can have trouble with long-term dependencies. An LSTM block was used to compare these two architectures on the two stages of the proposed model. The architecture is shown in Figure 3.4.

Figure 3.4 – Architecture of the LSTM model: Input → LSTM-block → FC → Output


    3.2.4 LSTM-MLP

To allow more abstract features to be identified in the input data, combined with long-term dependencies, this architecture feeds the output of the LSTM block into a stack of three FC layers before producing an output.

Figure 3.5 – Architecture of the LSTM-MLP model: Input → LSTM-block → FC → FC → FC → Output

    3.2.5 Stacked LSTM

As proposed in [45, 46], stacked LSTM networks can be used for temporal tasks. This model stacks three LSTM networks and produces an output with an FC layer.

Figure 3.6 – Architecture of the Stacked LSTM model: Input → LSTM-block → LSTM-block → LSTM-block → FC → Output

    3.2.6 LSTM Fully Convolutional Network (LSTM-FCN)

A parallel-type architecture combining an LSTM and a CNN, proposed in [47], which has shown great promise in classification of sequential data. The pooling layer employs global average pooling [48].

Figure 3.7 – Architecture of the LSTM Fully Convolutional Network model: the input is processed in parallel by an LSTM-block and by three convolutional layers followed by a pooling layer, after which an FC layer produces the output.


    3.3 Method

This section describes the method used for evaluation of the deep learning models and the event detector. It can be divided into the following sub-processes:

• Data Acquisition and labeling: The data was preprocessed and labeled so as to enable supervised learning.

• Setup of training environment: Choice of cost function and optimizer so that training could be carried out in an efficient manner.

• Hyperparameter search: Hyperparameters were searched for all architectures, on all tasks, to get good training and model parameters.

• Full training of models: The best performing models were trained to get the best possible performance on both tasks.

• Full model prediction evaluation: The models with the best performance in the full training on the two tasks were implemented in the event detector and evaluated.

The processes are described in more detail below.

    3.3.1 Data Acquisition and labeling

A dataset of tightening runs executed with continuous drive was obtained. Each file of the type .dxd contained several runs and was annotated with run-start and run-end events for each of them. These annotations allowed for effortless extraction of each run into a separate file. .dxd is a proprietary file format used by a data recording device manufactured by DEWESoft®. It can be interpreted in Python using the dwdatareader module described in Section 3.5.2.

To verify the quality of the data, a script was designed to read each run from file and plot the curve and target angle on screen. This allowed the authors to root out the runs that used a tightening strategy outside the scope of this thesis or that were otherwise unfit for training. This reduced the size of the dataset to a final 41 153 runs.

Along with each run, a target value for training, i.e. the ground truth, was determined. This value was the angle at which the clamp force exceeded a threshold of 0.2 Nm and was determined from clamp force measurements in the .dxd file, available since the data was recorded in a test rig capable of measuring clamp force. The threshold of 0.2 Nm was suggested by an advisor with excellent knowledge of the subject. This ground truth could easily be converted depending on the task for which it was being trained.

Each run was then split into 1500-timestep-long segments. For the classification task, the ground truth was converted to either true or false depending on whether the snug-angle was in the segment, and for the regression task, the ground truth was converted into the timestep at which the angle occurred within the segment. The regression task was only trained on segments in which the snug-angle appeared.
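The segmentation and label conversion could be sketched as follows (illustrative NumPy; function and variable names are placeholders, not the thesis code):

import numpy as np

def make_segments(run, snug_index, window=1500):
    """Split one run into windows and produce labels for both tasks.

    run        : array of shape (T, channels) with the recorded signals
    snug_index : timestep at which the snug angle occurred (ground truth)
    """
    cls_x, cls_y, reg_x, reg_y = [], [], [], []
    for start in range(0, len(run) - window + 1, window):
        segment = run[start:start + window]
        contains_snug = start <= snug_index < start + window
        cls_x.append(segment)
        cls_y.append(int(contains_snug))        # classification target: 0 or 1
        if contains_snug:
            reg_x.append(segment)
            reg_y.append(snug_index - start)    # regression target: index within segment
    return [np.array(a) for a in (cls_x, cls_y, reg_x, reg_y)]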


    3.3.2 Setup of training algorithm

As described in Section 2.5, training of deep neural networks is done by feeding the network with an input, predicting an output, computing the gradient of the cost function (a function closely related to some error E) and updating the parameters in the model with the computed gradients. Two choices had to be made: firstly the cost function and secondly the optimization algorithm.

    Choice of Cost Function

Gradient descent minimizes a differentiable function, which in machine learning is related to the error of the prediction. It is also known as the loss function, error function or objective function. It is therefore advantageous for the algorithm to know in which direction the error is decreasing, i.e. the cost function should be differentiable [8].

Cross-entropy loss measures the closeness of the class membership probabilities output by the algorithm. It is defined as

L = −(x ln(x̂) + (1 − x) ln(1 − x̂))    (3.1)

for the output x̂ and the target x. It is the most common cost function used for classification, preferred for its differentiability and robustness [49].

For regression, however, the output is not a probability but a numerical prediction. Another cost function is then used that relates to the difference between the continuous target and output values.

For this thesis, cross-entropy loss and mean squared error have been used for classification and regression, respectively.

    Choice of Optimizer

For this thesis, Adam was chosen as the optimizer. Adam, whose name is derived from adaptive moment estimation, computes individual learning rates for the parameters from estimates of the first and second moments of the gradients [50]. The choice was based on the fact that adaptive learning rate optimizers in general outperform other optimizers, and that Adam outperforms the other adaptive learning rate optimizers [50, 20].

    3.3.3 Hyperparameter Search Method

In this thesis, a combination of grid and random search was used. The search was conducted in several steps, where for each step the range of values used in the search was narrowed according to which hyperparameters provided the model with the best performance in terms of the calculated loss on the validation dataset:

1. The first step is a coarse search with the widest range of hyperparameter values. The models are trained for 5 epochs and the top 10 are selected for evaluation.

2. The evaluation of the coarse search results in a new range of hyperparameter values. Once again, the models are trained, but now for 10 epochs.


3. The top 10 from the finer search are evaluated and a new range is selected. The models are trained for 10 epochs.

The top 3 models from the fine search were then selected to go through a long training. The hyperparameters over which the search was conducted were the learning rate and the size of the model.

    3.3.4 Full training of models

The top 3 models for each task found in the hyperparameter search were then selected for a full training. This was conducted with the ”reduce learning rate on plateau” approach presented in Section 2.5.2, where the learning rate was reduced by a factor of 0.2 if learning did not improve for 6 epochs, and with a dropout rate of 0.3. In general, training was performed for 100-300 epochs, and for each completed epoch the validation loss was computed. If the validation loss improved, all model parameters (i.e. parameters such as weights) were saved. This ensured that the best possible model was available for implementation in the full event detector.
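In Keras, this training setup might look roughly like the sketch below (the tiny model, random data and file name are placeholders, and the interpretation of ”reduced by 0.2” as a multiplicative factor is an assumption):

import numpy as np
from tensorflow.keras import layers, models, callbacks

# Placeholder data and model, standing in for the real dataset and architectures.
x_train, y_train = np.random.randn(256, 1500, 2), np.random.randint(0, 2, 256)
x_val, y_val = np.random.randn(64, 1500, 2), np.random.randint(0, 2, 64)

model = models.Sequential([
    layers.LSTM(32, input_shape=(1500, 2)),
    layers.Dropout(0.3),                     # dropout rate used in the full training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=100,
    callbacks=[
        # Reduce the learning rate by a factor of 0.2 if the validation loss
        # does not improve for 6 epochs.
        callbacks.ReduceLROnPlateau(factor=0.2, patience=6),
        # Save the model whenever the validation loss improves.
        callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),
    ],
)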

    3.3.5 Full model prediction

The full model, i.e. the classifier and regressor in series, was evaluated by simulation. This was carried out by employing the strategy described in Section 3.1 on the test set. As before, the queue size was 1500 timesteps. The number of new values in the queue each time the model evaluated it was varied between 1, 100, 500 and 1500. Note that when this value equals the queue size, 1500, the sliding window is non-overlapping. Considering that the dataset was collected with a sample rate of 8 kHz, this simulates prediction frequencies of 8 kHz, 80 Hz, 16 Hz and 5.33 Hz, respectively. As predictions were made during simulation, several estimations of the snug-angle were produced by the model for each run. The first and last prediction, as well as the mean and median of the predictions, were computed for each run to evaluate how these estimation metrics compared to each other.
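The aggregation of the per-run estimates could be sketched as follows (illustrative only; the example values are made up):

import numpy as np

def summarize_run_predictions(predictions):
    """Aggregate the snug-angle estimates produced during one simulated run."""
    predictions = np.asarray(predictions)
    return {
        "first": predictions[0],
        "last": predictions[-1],
        "mean": predictions.mean(),
        "median": np.median(predictions),
    }

# Example: five estimates produced while the window slides over the snug point.
print(summarize_run_predictions([512.0, 505.0, 498.0, 510.0, 507.0]))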

    3.4 Hardware used for Implementation

The recent years' surge in neural network and deep learning applications is largely due to the availability of the required computational power, which enables larger networks and better performance [8]. With investments in research from the commercial, research and open source communities, there has been an acceleration in both hardware and software development for machine learning applications.

    3.4.1 EVGA NVIDIA GTX 1080

In the history of deep learning, training was traditionally done using a Central Processing Unit (CPU). As the number of model parameters grows, more optimization is needed, and eventually the size of the model, as well as the accuracy, will be limited. With evermore demanding video games came more powerful graphics processors, or Graphics Processing Units (GPU). These specialize in performing many operations in parallel, such as matrix multiplication and division, and have been designed to have a high degree of parallelism and large memory bandwidth [8]. Deep learning training requires largely the same kind of characteristics, with large sets of parameters and variables that need to be updated at each training step.

For this thesis, the EVGA NVIDIA GeForce GTX 1080 [51] was used. This is a general purpose GPU that can run code with other purposes than graphics rendering. The GTX 1080 has 8 GB of memory and a base clock of 1708 MHz. NVIDIA provides a programming platform, CUDA, that can be used to implement deep learning models for training on the GPU. There are several software libraries, such as Torch, Theano and TensorFlow, that implement and run highly optimized CUDA code.

    3.5 Software, Libraries and Frameworks used for Implementation

The software in this thesis was largely written in Python. Python was chosen because it is by far the most used language for developing machine learning applications [52]. It has large community support, both scientific and commercial, with many libraries and frameworks and easy access to help. This section describes some of the libraries and frameworks that were used in the thesis.

    3.5.1 Tensorflow and Keras

As described in Section 3.4.1, there are many software libraries used to write highly efficient code for deep learning training. In this thesis, the TensorFlow library is used. TensorFlow is an open source library for numerical computations using data flow graphs. The nodes in the graphs represent mathematical operations and the edges represent the matrices, or tensors, that flow between them. Since neural networks often are described in terms of graphs, this gives a very flexible tool that enables users to easily implement architectures. The TensorFlow library is written so that it can be used with NVIDIA GPUs [53].

To simplify the process of building and testing deep learning models, the high-level API Keras is used. Keras comes with all the advantages that TensorFlow has, as it uses TensorFlow as the backend for computations. Furthermore, Keras comes with many of the most used neural network layers predefined, which allows fast implementation [54].

    3.5.2 DEWESoft® and dwdatareader

DEWESoft® is a company that provides data acquisition software and test and measurement solutions [55]. All the data used in this thesis was collected with DEWESoft® equipment. In order to read and analyze the data on a large scale using Python, DEWESoft® provides a free library for Linux and Windows users. The open-source Python module dwdatareader interacts with the library and has been used to export the data into a more manageable data format. This process is described in Section 3.3.1.

    3.5.3 Numpy and matplotlib

As discussed earlier, Python has a large scientific community that has contributed a lot of open-source software. Numpy is a package for scientific computing with N-dimensional array objects and linear algebra capabilities [56]. Furthermore, it provides containers for data storage, and large amounts of data can be saved and easily accessed in the .npy format. In this thesis, it is used for handling the large amounts of data in the dataset as well as for preprocessing it. When used together with matplotlib, a plotting library, it provides a great tool for viewing and analyzing data.


Chapter 4

    Results

    This chapter presents the results from the training and evaluation of the implementation.

    4.1 Model Training for First Stage Models (Classification)

Figures 4.1-4.6 show the training progress for the top 3 first stage models found in the hyperparameter search. The models were trained for 100 epochs and with a dropout rate of 0.3. The stars indicate where the lowest validation loss occurred for each model.

Figure 4.1 – Validation loss for each epoch during training of the first stage MLP model.


Figure 4.2 – Validation loss for each epoch during training of the first stage LSTM-MLP model.

Figure 4.3 – Validation loss for each epoch during training of the first stage LSTM model.


Figure 4.4 – Validation loss for each epoch during training of the first stage RNN model.

Figure 4.5 – Validation loss for each epoch during training of the first stage Stacked-LSTM model.


Figure 4.6 – Validation loss for each epoch during training of the first stage LSTM-FCN model.

Table 4.1 presents the lowest validation loss scores and the epochs at which they occurred.

Table 4.1 – Lowest validation loss for the first stage models

Model           Lowest val. loss    Epoch
4-layer MLP     0.058               41
RNN             0.0614              98
LSTM            0.0603              69
LSTM-MLP        0.0573              67
Stacked LSTM    0.0554              32
LSTM-FCN        0.0758              73


    4.2 Model Training for Second Stage Models (Regression)

Figures 4.7-4.12 show the training progress for the second stage models. The top 3 models (except for the LSTM-FCN, for which only the top 1 was trained) from the hyperparameter search were trained for 300 epochs and with a dropout rate of 0.3. The stars indicate where the minimum validation loss occurred.

Figure 4.7 – Validation loss for each epoch during training of the second stage MLP model.


Figure 4.8 – Validation loss for each epoch during training of the second stage LSTM-MLP model.

Figure 4.9 – Validation loss for each epoch during training of the second stage LSTM model.


Figure 4.10 – Validation loss for each epoch during training of the second stage RNN model.

Figure 4.11 – Validation loss for each epoch during training of the second stage Stacked-LSTM model.


Figure 4.12 – Validation loss for each epoch during training of the second stage LSTM-FCN model.

Table 4.2 presents the lowest validation loss scores and the epochs at which they occurred.

Table 4.2 – Lowest validation loss for the second stage models

Model           Lowest val. loss    Epoch
4-layer MLP     10389               266
RNN             21219               213
LSTM            14555               290
LSTM-MLP        15820               179
Stacked LSTM    9365                212
LSTM-FCN        29633               118


4.3 First Stage Model Results on Identifying Snug-Segment

Table 4.3 shows the final results of the best performing classification models used in the thesis, evaluated as the models with the highest F1-scores on the test set. Scores are high for all models, with the Stacked LSTM mod