

Defence R&D Canada Technical Memorandum

DRDC Suffield TM 2013-059 January 2013

Defence Research and Development Canada / Recherche et développement pour la défense Canada

Case Studies on Learning and Control Architectures for Autonomous Systems

David X.P. Cheng DRDC Suffield


Principal Author

Original signed by David Cheng

David Cheng

Approved by

Original signed by Chris Corry

Chris Corry

Head/Autonomous Systems Operations Section

Approved for release by

Original signed by Robin Clewley

Robin Clewley

Chair/Document Review Panel

© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2013

© Sa Majesté la Reine (en droit du Canada), telle que représentée par le ministre de la Défense nationale, 2013


Abstract

This report reviews some of the established learning and control architectures that have been applied or have potential to apply to autonomous systems, with an emphasis on their potential for military applications. In particular, techniques of reinforcement learning, neural network based learning, and genetic algorithms are reviewed with respect to the key progress made and main problems to be addressed in each of the research fields. To illustrate implementation of the learning approaches for autonomous systems, three cases are studied: Autonomous Land Vehicle in Neural Networks (ALVINN), evolutionary approaches for training ALVINN, and pattern recognition with recurrent neural networks for autonomous systems. Strengths, limitations, and potential of the learning techniques are reviewed and discussed for future development from the perspective of autonomous systems application.

Résumé

Dans le présent rapport, on examine quelques-unes des architectures d’apprentissage et de contrôle établies qui ont été appliquées, ou qui pourraient être appliquées, à des systèmes autonomes, l’accent étant mis sur le potentiel en matière d’applications militaires. En particulier, on examine les principaux progrès accomplis et les principaux problèmes subsistants dans chacun des champs de recherche suivants : techniques d’apprentissage par renforcement, apprentissage par réseau neuronal, et algorithmes génétiques. Pour illustrer la mise en œuvre des méthodes d’apprentissage pour les systèmes autonomes, on a étudié trois cas : le véhicule terrestre autonome dans les réseaux neuronaux (ALVINN), les méthodes évolutionnaires pour entraîner l’ALVINN, et la reconnaissance des formes avec les réseaux neuronaux récurrents pour les systèmes autonomes. On considère les forces, les limites et le potentiel de ces techniques d’apprentissage et on en parle dans le contexte de leur futur développement pour l’application aux systèmes autonomes.


Executive summary

Case Studies on Learning and Control Architectures for Autonomous Systems

David Cheng; DRDC Suffield TM 2013-059; Defence R&D Canada – Suffield; January 2013.

Background: Autonomous robotic systems, including unmanned ground vehicles (UGVs), have great potential to act as a force multiplier in military operations. To achieve this potential, adaptive behaviours without continuous involvement of human operators are essential for an autonomous system. Development of modules to enable autonomous operation with UGVs, such as autonomous navigation in an unstructured environment, will require the application of machine learning.

This report reviews some of the established learning and control architectures that are relevant to autonomous systems, with an emphasis on their potential for military applications. The learning techniques discussed in this report include reinforcement learning, neural-networks, and genetic algorithms. Three cases are studied to illustrate implementation of the learning approaches for autonomous systems: Autonomous Land Vehicle in Neural Networks (ALVINN), evolutionary approaches for training ALVINN, and pattern recognition with recurrent neural networks for autonomous systems.

Results: This study addresses the following tasks:

Review important machine learning techniques applicable to autonomous systems. These techniques can form software modules and be incorporated into UGVs to provide solutions for complex problems such as terrain classification, path planning, and obstacle avoidance.

Present case studies that demonstrate the feasibility and performance of the discussed learning approaches in application to autonomous/unmanned systems.

Identify limitations and uncertainties relating to the implementation of the applicable learning techniques.

Explain the developmental trends in machine learning technology with their application in UGVs.

Significance: Unmanned systems capable of learning from environmental features and operating autonomously in structured or unstructured environments will have significant applications in military operations. This article provides a qualitative overview of some of the key methods in machine learning and soft computing for sensor data interpretation and feature extraction with autonomous systems.

Future plans: Research on effective learning architectures should be pursued and applied to an experimental platform to handle certain problems for perception and control of autonomous systems.


Sommaire

Études de cas sur les architectures d’apprentissage et de contrôle pour les systèmes autonomes

David Cheng; RDDC Suffield TM 2013-059; R & D pour la défense Canada – Suffield; janvier 2013.

Contexte : Les systèmes robotiques autonomes, y compris les véhicules terrestres sans pilote (VTSP), pourraient fort bien être des facteurs multiplicateurs de force dans les opérations militaires. Pour y parvenir, il est essentiel qu’un système autonome ait un comportement adaptatif sans intervention humaine continue. Le développement de modules permettant le fonctionnement autonome des VTSP, comme la navigation autonome dans un milieu non structuré, nécessitera l’emploi de l’apprentissage automatique.

Dans le présent rapport, on examine quelques-unes des architectures d’apprentissage et de contrôle établies qui pourraient être appliquées à des systèmes autonomes, l’accent étant mis sur le potentiel en matière d’applications militaires. Les techniques d’apprentissage dont on parle dans le rapport sont l’apprentissage par renforcement, l’apprentissage par réseau neuronal et les algorithmes génétiques. Pour illustrer la mise en œuvre des méthodes d’apprentissage pour les systèmes autonomes, on a étudié trois cas : le véhicule terrestre autonome dans les réseaux neuronaux (ALVINN), les méthodes évolutionnaires pour entraîner l’ALVINN, et la reconnaissance des formes avec les réseaux neuronaux récurrents pour les systèmes autonomes.

Résultats : Cette étude considère les tâches suivantes :

Examiner les techniques d’apprentissage automatique importantes applicables aux systèmes autonomes. Ces techniques peuvent créer des modules logiciels et être incorporées aux VTSP pour trouver des solutions à des problèmes complexes comme la classification des terrains, la planification du cheminement et l’évitement des obstacles.

Présenter des études de cas qui montrent la faisabilité et le rendement des méthodes d’apprentissage étudiées applicables aux systèmes autonomes ou aux systèmes sans pilote.

Indiquer les limites et les incertitudes liées à la mise en œuvre des techniques d’apprentissage applicables.

Expliquer les tendances de la technologie de l’apprentissage automatique en matière de développement et leur application aux VTSP.

Importance : Dans le domaine des opérations militaires, il y aura d’importantes applications pour des systèmes sans pilote capables d’apprendre en se servant des caractéristiques du milieu et de fonctionner de manière autonome dans un milieu structuré ou non structuré. Le présent article offre un aperçu qualitatif de quelques-unes des principales méthodes en matière d’apprentissage automatique et de calcul souple pour l’interprétation des données de capteurs et la reconnaissance des caractéristiques pour les systèmes autonomes.


Plans futurs : Il faudrait poursuivre la recherche sur les architectures d’apprentissage efficaces et l’appliquer à une plateforme expérimentale pour traiter certains problèmes relatifs à la détection réalisée par les systèmes autonomes et au contrôle de ces derniers.


Table of contents

Abstract
Résumé
Executive summary
Sommaire
Table of contents
List of figures
Acknowledgements
1 Introduction
2 Robot Learning
  2.1 Adaptive Control and Learning
  2.2 Robot Learning
3 Reinforcement learning
4 Learning using neural networks
  4.1 ALVINN
  4.2 MANIAC
  4.3 ROBIN
  4.4 Neural networks for road segmentation
5 Genetic algorithms
  5.1 GA approach to learn obstacle avoidance parameters
  5.2 GA approach to optimization of sensor deployment for AUV
6 Case Study 1: Autonomous Land Vehicle in a Neural Network
  6.1 Internal Model
  6.2 Learning algorithms
    6.2.1 Network Steering Error
    6.2.2 Extrapolation for Missing Pixels
    6.2.3 Estimation for Steering Direction
    6.2.4 Solution to the Over-learning Problem
    6.2.5 Performance Evaluation
7 Case Study 2: Learning to Steer an Autonomous Land Vehicle with an Evolutionary Approach
  7.1 Overview of PBIL
  7.2 Applications of PBIL in Steering Control of ALVINN
    7.2.1 PBIL vs. BP with a single output unit
    7.2.2 PBIL vs. BP with 30 output units
    7.2.3 Integrating evolutionary approach and backpropagation
    7.2.4 Changing error metric for specific tasks
8 Case Study 3: Pattern Recognition with Recurrent Neural Networks for Autonomous Systems
  8.1 Network Structure
  8.2 Input Signals
  8.3 Learning Rules
    8.3.1 Hebbian Learning
    8.3.2 Feedback Learning
  8.4 Test Results
9 Concluding Remarks
References
List of symbols/abbreviations/acronyms/initialisms


List of figures

Figure 1: Block Diagram of a Reference Model Adaptive Controller

Figure 2: Block Diagram of a Model Identification Adaptive Controller

Figure 3: Generic Model of Reinforcement Learning

Figure 4: Simplified representation of a neuron

Figure 5: The neural network architecture of ALVINN (adapted from [85])

Figure 6: Input images, target output and network’s output after training. Generated with codes and test data provided by [94]

Figure 7: Steering error of the network. Reprinted with permission from [85]

Figure 8: Shifted and rotated images. Reprinted with permission from [85]

Figure 9: Simulation of shifting and rotating the camera on the vehicle at two different positions. Reprinted with permission from [85]

Figure 10: Two extrapolation techniques to fill in the missing pixels in the transformed image. Reprinted with permission from [85]

Figure 11: The Pure Pursuit steering model (adapted from [85])

Figure 12: Road situations in which ALVINN is trained to drive: left – a dirt road; middle – a single-lane paved path; right – a two-lane highway. Reprinted with permission from [85]

Figure 13: The general scheme of the PBIL algorithm in pseudo-code. Reprinted with permission from [99]

Figure 14: (a) average errors of the best network in the entire run up to the current generation; (b) average errors of the best network in the population of each generation. S stands for sampled training/validation sets; F for full training/validation sets. Reprinted with permission from [99]

Figure 15: A typical target and the actual outputs of the network trained by (a) PBIL and by (b) BP algorithms. The same set of test images is used for the network outputs. Reprinted with permission from [99]

Figure 16: Comparison of the empirical results from separated and integrated models of the PBIL and BP techniques. Reprinted with permission from [99]

Figure 17: Changing the error metric: PBIL + GPPE learning (left column) vs. BP + SSE learning schemes. Reprinted with permission from [99]

Figure 18: Structure of a recurrent neural network with associative memory. This figure is taken and adapted from [77]

Figure 19: Ten image sensor patterns. Reprinted with permission from the copyright holder


Acknowledgements

The author would like to thank Drs. Dean Pomerleau and Shumeet Baluja for their permission to reprint the figures/tables used in their original work, and Dr. David Touretzky for his permission to use his codes and test data for illustration of ALVINN in this memorandum.


1 Introduction

This report reviews the state of the art of the learning and control architectures that are applicable to autonomous systems [1],[2], including reinforcement learning, learning with artificial neural networks, and learning with a class of evolutionary algorithms. The first approach is related to control theory, while the last two are biologically motivated.

There is no common definition of learning for robotic systems. Arkin [3] provides a definition, together with a means of measuring learning through performance metrics evaluated throughout a robot’s operation:

Learning produces changes within an agent that over time enable it to perform more effectively with its environment [3].

One of the common attributes of learning techniques for robotic systems is an improvement in their information processing ability based on heuristic experiences. Therefore, learning approaches are particularly useful for systems control when a rigorous model is not available, or an excessively long time is required to obtain a solution with a rigorous model.

Regarding learning objects for robotic systems, Connell and Mahadevan [1] categorize the knowledge that would be useful for a robot to learn:

1. Hard-to-program knowledge: Consider the problem of trajectory tracking of a robotic manipulator. The rigorous dynamics of robot arms may be very difficult to establish due to their complex structure or a varying payload, and it is hard to program a rigorous dynamics model for accurate arm motion. It will be more effective to learn and program an implicit dynamics model than to identify and program an explicit model for trajectory planning and robot motion in certain compliances.

2. Unknown information: Robots should be able to learn information that is not readily available. For example, an autonomous vehicle might have to travel in an unknown terrain, where it would be useful for the unmanned vehicle to perceive the environment and build a map with a learning approach as it moves.

3. Changing environments: Robots should be able to learn the knowledge of both internal and external changes in a dynamic environment that are not initially anticipated by the system’s programmers. Uncertainties exist in the external environment where a robot operates, such as constantly moving objects, and in the internal representations, such as the changes in the calibration of the sensors installed on the robots. It is desirable that robots become robust over time by learning.

“Robot learning” can be viewed as a subset of “machine learning” in the sense that machine learning deals with the computer algorithms that recognize patterns and make decisions based on empirical data, such as from “perception” or “sensing”. Many approaches from machine learning have been employed in robot learning, including reinforcement learning, neural networks, genetic algorithms, and adaptive control:


Reinforcement learning: No input/output training data sets are presented to the system (learning agent). Instead, reinforcement signals representing “rewards” or “punishment” are used for the agent to choose an action, or to adjust the controller parameters in a robotic system to optimize the system states. The evaluation of the system is often conducted concurrently with learning [4]–[6].

Neural networks: A set of simple parallel-processing units (“neurons”) are interconnected with synapses to form specialized architectures in which learning occurs by adjusting the weights of the synaptic connections [7]–[10].

Genetic Algorithms: A genetic algorithm is an optimization algorithm using the concepts of biological evolution. New candidate solutions are reproduced using the genetic operations of crossover and mutation. Crossover combines the “genes” of the parents, while mutation randomly changes the genotype of an individual candidate, leading to “offspring” with better fitness values, or more efficient controllers in the context of robotics [11],[12].

Adaptive control: Adaptive control is a control technique in which sensory information is used to adjust the parameters of a system model or the internal parameters of a controller using the feedback information of its environment. Many adaptive control algorithms are limited to linear systems, which makes it difficult to apply them to the control of autonomous navigation with its high inherent nonlinearity [13],[14].

This report concentrates on the learning and control methods for autonomous robotic systems. Following an overview of the major issues in robot learning and machine learning in artificial intelligence, techniques of reinforcement learning, neural-network-based learning, and evolutionary learning are surveyed. These three approaches are related to control theory and artificial intelligence and applicable to some modules in an autonomous system, and hence deserve their place in this review. In the rest of this report, the learning techniques under survey are further explored and reviewed with specific cases that have shown successful implementation and been deemed to have high potential for military application. The concluding section discusses some of the restrictions of the learning processes in real world situations and looks into directions for future development of the learning approaches.


2 Robot Learning

2.1 Adaptive Control and Learning

Adaptive control is a control method that uses the feedback of sensory information to adjust a plant model or controller parameters. Surveys on adaptive control can be found in references [14] and [15].

Typically an adaptive controller is configured as a model reference adaptive controller (MRAC) or model identification adaptive controller (MIAC). MRAC uses a reference model that has the same input as the controller and generates the desired output as shown in Figure 1. For example, a reference model that has fast response and small overshoot to a step input may be applied to the motion control of robot arms. The error between the outputs from the plant and from the reference model is used to update the parameters of the controller so that the plant would respond just like the reference model.

On the other hand, MIAC incorporates a system identification module in the control loop and performs system identification on-line using both inputs and outputs of the controlled plant. The identification result is sent to an adjustment module, which in turn is used to modify the parameters of the adaptive controller (Figure 2).

Figure 1: Block Diagram of a Reference Model Adaptive Controller



Figure 2: Block Diagram of a Model Identification Adaptive Controller

From Figure 1 and Figure 2, it can be seen that the adaptive control algorithms are designed to automatically adjust the parameter settings for the desired response through feedback evaluation from the environment. This can be viewed as a learning process in a dynamic environment, as discussed in the previous section. Adaptive control systems are basically a class of learning systems in which controller gains are varied in response to varying environments of operation or varying workloads.
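To make this concrete, the following minimal Python sketch simulates a model reference adaptive controller of the kind shown in Figure 1, using the classical MIT rule to adapt a single feedforward gain for a first-order plant whose gain is unknown. The plant, reference model, and adaptation constants are illustrative assumptions, not values from any system discussed in this report.

    # Minimal MRAC sketch (MIT rule): the plant gain b is unknown; the adjustable
    # feedforward gain theta is adapted so the plant output tracks the reference model.
    a = 2.0                  # assumed shared plant/model pole (gain-only mismatch for a clean example)
    b = 0.5                  # unknown plant gain (never read directly by the controller)
    b_m = 2.0                # reference-model gain -> ideal theta* = b_m / b = 4
    gamma = 2.0              # adaptation gain
    dt, T = 0.01, 30.0

    y, y_m, theta = 0.0, 0.0, 0.0
    for k in range(int(T / dt)):
        r = 1.0 if (k * dt) % 10.0 < 5.0 else -1.0   # square-wave reference keeps adaptation excited
        u = theta * r                                # adjustable controller: u = theta * r
        e = y - y_m                                  # error between plant and reference model
        theta += -gamma * e * y_m * dt               # MIT rule: d(theta)/dt = -gamma * e * y_m
        y   += (-a * y   + b * u)  * dt              # plant:            y'   = -a*y   + b*u
        y_m += (-a * y_m + b_m * r) * dt             # reference model:  y_m' = -a*y_m + b_m*r

    print("adapted gain theta = %.2f (ideal %.2f)" % (theta, b_m / b))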

2.2 Robot Learning

Before close examination of a specific learning technique, there should be a basic understanding of the general issues regarding robot learning, which are discussed in [16]:

Robots perceive the world with their sensors and interact with the world with their effectors or actuators. Robot learning is basically a mapping from sensory perception to robotic behaviours.

In order to learn, a robot must have some internal model or representation of the tasks, circumstances, and/or environments in which it works. This model may be learned through repeated experiments, online or offline.

In order to learn, a robot must have a specific learning algorithm with respect to its tasks.

In order to learn, a robot must have some way of evaluating or measuring how well it performs.

Typically, a robot is trained by the learning algorithm on training data sets, and its learned knowledge is validated on test data sets. The training data has to be separated from the test data to avoid circular reasoning.

Evaluation of the performance is required for the learning agent to adapt its strategy for performing tasks such as motion control and obstacle avoidance.


In supervised learning, instant feedback is directly provided to evaluate how well the robot performs, as shown in Figure 1 and Figure 2. In contrast, in unsupervised learning the robot must extract the feature information encoded in the data without instant feedback.

A robot’s perception is subject to disturbances or noise. It is important to consider the effect of noise on the data that the robot acquires in designing a learning algorithm.

Following an overview of the major issues in robot learning, this report discusses three approaches to learning: reinforcement learning, neural-network learning, and learning with genetic algorithms.


3 Reinforcement learning

Reinforcement learning [3]–[6],[17]–[20] is a problem of optimization in which a control strategy (known as control policy) is learned to optimize a scalar utility function (known as reward or utility) through trial-and-error interactions with a dynamic environment. A mechanism (known as critic) is employed to evaluate the response and provide the reinforcement signal to the control system based on its evaluation.

There are two major approaches to solving reinforcement learning problems. The first is to perform trial-and-error search in the action space in order to discover the action that yields the most reward. The second is to use probabilistic methods and dynamic programming techniques to estimate the utility function for taking actions in the environment. The second approach predominates because it draws advantages from the unique structure of reinforcement learning problems that are generally unavailable in gradient-based solutions to optimization problems. There are two main types of reinforcement learning algorithms for the second set of solutions: adaptive heuristic critic (AHC) learning and Q-learning. In AHC learning the control policy and utility function are learned separately. Q-learning uses a single utility Q-function to evaluate both actions and states [21]. It has been shown that Q-learning is better than AHC learning in some cases of reactive robotics application [22].

Figure 3: Generic Model of Reinforcement Learning.

The model of reinforcement learning for robot control is illustrated in Figure 3. The robotic agent interacts with the environment through perception as input and action as output. The critic evaluates the sensory input and the response and generates a reinforcement signal. The control policy determines which of the actions should be undertaken to optimize the long term measures of the reinforcement signal using various reinforcement learning algorithms.
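As a concrete illustration of the second class of methods, the sketch below implements the standard tabular Q-learning update on a toy one-dimensional corridor task; the environment, rewards, and learning constants are illustrative assumptions rather than details taken from the cited work.

    import random

    # Tabular Q-learning sketch on a 1-D corridor: states 0..4, goal at state 4.
    # Actions: 0 = move left, 1 = move right. Reaching the goal yields reward +1.
    N_STATES, GOAL = 5, 4
    ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
    Q = [[0.0, 0.0] for _ in range(N_STATES)]        # Q[state][action]

    def step(s, a):
        """Environment model: deterministic move, reward only at the goal."""
        s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
        return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

    for episode in range(200):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection (trial-and-error exploration)
            a = random.randrange(2) if random.random() < EPSILON else max((0, 1), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
            s = s2

    print("learned greedy policy:", ["left" if Q[s][0] > Q[s][1] else "right" for s in range(N_STATES)])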


Overviews of reinforcement learning can be found in references [3]–[6].

Reinforcement Learning for Autonomous Navigation

Peng and Bhanu [23] used reinforcement learning to develop an approach to image segmentation and object recognition for autonomous navigation in outdoor environments. Many existing unmanned ground vehicle (UGV) perception techniques that use image segmentation and feature extraction have fixed parameters working for a particular environment and cannot adapt to environmental changes, such as different lighting conditions, which may significantly degrade the performance of the UGV system. Instances of this failure can be found in the U.S. Army’s Demo II and Demo III UGV programs. It has been shown that the performance of a general image segmentation algorithm can be improved by using reinforcement learning to automatically tune the segmentation parameters in changing environments [23].

Reinforcement Learning for Autonomous Flight

Reinforcement learning has been applied to autonomous helicopter flight, which is highly difficult to control. Bagnell and Schneider [24] used reinforcement learning policy search methods to develop algorithms for evaluating and synthesizing an autonomous helicopter controller. The controller was applied to Carnegie Mellon’s autonomous R-50 helicopter. It was shown that the controller based on the reinforcement learning policy search technique provides robust performance comparable to that of a highly trained human pilot.

Abbeel et al. [25] also used reinforcement learning to perform aerobatic flight with an autonomous helicopter. A helicopter dynamics model and a reward function were learned from a human pilot’s demonstrations of maneuvers and then optimized by reinforcement learning. The results presented in this work led to an expanded set of aerobatic maneuvers that the helicopter was able to complete autonomously.

4 Learning using neural networks

A variety of techniques have been developed to adapt the weights given the architecture of neural networks, which is conventionally determined by human experience and intuition. In reference [32], a class of constructive and destructive algorithms was developed for designing architectures of neural networks. The constructive algorithms incrementally built network architectures one module at a time, while the destructive algorithms simplified network architectures by removing one module at a time. Approaches of this type aim to construct neural networks with compact or minimal architectures, allowing a faster training process. A number of constructive and destructive learning algorithms, such as the LEABRA learning architecture [39], have been developed, each offering its own features. In addition to these approaches, genetic algorithms (described in the next section) have also been used to search for optimal (or sub-optimal) neural network architectures.
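The constructive idea can be illustrated with a simple sketch that grows a one-hidden-layer network one unit at a time, refitting the output layer after each addition and stopping when validation error no longer improves. This is only a schematic of the general constructive approach on assumed toy data; it is not an implementation of any specific algorithm cited above.

    import numpy as np

    # Constructive-growth sketch: hidden units (random tanh features) are added one at a
    # time and the linear output layer is refit, until validation error stops improving.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))                 # toy inputs
    y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2          # toy target function
    X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

    hidden_W, hidden_b = [], []
    best_err, patience = np.inf, 0
    while patience < 3 and len(hidden_W) < 50:
        # grow the architecture by one randomly initialised hidden unit
        hidden_W.append(rng.normal(size=2))
        hidden_b.append(rng.normal())
        H_tr = np.tanh(X_tr @ np.array(hidden_W).T + np.array(hidden_b))
        H_va = np.tanh(X_va @ np.array(hidden_W).T + np.array(hidden_b))
        w_out, *_ = np.linalg.lstsq(H_tr, y_tr, rcond=None)   # refit the linear output layer
        err = np.mean((H_va @ w_out - y_va) ** 2)             # validation error
        if err < best_err - 1e-4:
            best_err, patience = err, 0
        else:
            patience += 1                                      # no improvement -> stop soon

    print("grew to %d hidden units, validation MSE %.4f" % (len(hidden_W), best_err))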

Some applications of neural network technology relevant to autonomous UGVs follow.

4.1 ALVINN

ALVINN (Autonomous Land Vehicle In a Neural Network) [40]–[54] is a neural-network-based perception system developed at Carnegie Mellon University (CMU), which learns to control steering of the Navlab vehicles [52]. Its learning architecture consists of a feedforward neural network with one layer of hidden units. The input layer of the network has 30×32 units and serves as a “retina” to receive an image array of 30×32 pixels from the video camera outfitted on the vehicle. Each of the input units is connected to all the hidden units, which in turn are connected to the output units. Each output unit represents a quantized steering direction (e.g., sharp left, straight ahead, or sharp right) that will keep the vehicle on the road it travels along.

ALVINN was trained on the road with the back-propagation algorithm, using the steering of a human driver as the teaching signal. After training, ALVINN was able to compute the mapping from input video images to steering directions, enabling the vehicle to autonomously follow roads of various types (e.g., dirt and paved roads) in certain conditions.

The implementation of ALVINN was tested on Navlab and ported to road-following tasks in the Demo II project. Its performance in Demo II was robust for driving at speeds up to 20 miles per hour on paved or dirt roads where the network had been trained. It was later reported [52] that ALVINN drove a new version of the testbed vehicle for 90 miles at speeds up to 55 miles per hour.

ALVINN was subsequently extended to detect road segments and traverse intersections [42] based on a geometric model of the world. The Navlab vehicle was able to drive at low speed (5 miles per hour), but had problems traversing a road junction at a faster speed. Active camera control methods were proposed to address the problem [44]. In a subsequent experiment, an active sensor controller (called Panacea) [48] was incorporated to steer the camera and enable ALVINN to see the road when a sharp turn or a quick response was required.

4.2 MANIAC

MANIAC (Multiple ALVINN Networks In Autonomous Control) [49] is a modular neural network system based on ALVINN. It consists of several ALVINN networks, each of which is pre-trained for a road type that the testbed vehicle is expected to drive on. This network architecture learns to combine data from the output units or hidden units of each ALVINN network. This system allows robust navigation between different road types with more information encoded in different ALVINN networks.

4.3 ROBIN

ROBIN (Radial Basis Network) [55] is an autonomous road-following system based on a radial-basis-function (RBF) neural network [56]. It was evaluated in road-following tasks in the Demo II and Demo III projects. One of the advantages of RBF networks is that they can be trained rapidly. The input to ROBIN is a preprocessed camera image, with each of the image pixels representing an input unit. The center positions of the receptive fields represent a set of template road images that the vehicle will see on the road. The input image is then compared to the template scenes to measure how closely the template and the input match. The output of the network is then activated by a combination of the receptive field responses. The activation level of an output unit represents the network’s estimate of how far the vehicle would be displaced from the lane center at the look-ahead distance, which is used to generate a correct steering and speed signal for the current situation.
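The core computation of such a radial-basis-function network can be sketched in a few lines: each hidden unit responds according to how closely the input image matches its stored template (the center of its receptive field), and the outputs are a weighted combination of those responses. The template images, receptive-field widths, and output weights below are random placeholders, not ROBIN's trained values.

    import numpy as np

    # RBF-network sketch: hidden responses measure similarity between the input image
    # and stored template images; outputs combine these responses linearly.
    rng = np.random.default_rng(1)
    n_pixels, n_templates, n_outputs = 15 * 16, 8, 5

    templates = rng.uniform(0, 1, size=(n_templates, n_pixels))    # stand-ins for stored road-scene templates
    widths = np.full(n_templates, 4.0)                             # receptive-field widths (sigma)
    W_out = rng.normal(scale=0.1, size=(n_outputs, n_templates))   # output weights (would be set by training)

    def rbf_forward(image):
        """Compute output activations for one preprocessed, flattened input image."""
        d2 = np.sum((templates - image) ** 2, axis=1)              # squared distance to each template
        h = np.exp(-d2 / (2.0 * widths ** 2))                      # Gaussian receptive-field response
        return W_out @ h                                           # e.g. displacement-from-lane-center estimate

    image = rng.uniform(0, 1, size=n_pixels)
    print("output activations:", np.round(rbf_forward(image), 3))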

Another feature of ROBIN is its reasoning module called the “deliberative outer shell”. The outer shell monitors the performance of the inner road-following component through the confidence measure received from the inner module and is able to act to maintain a high level of performance by changing the parameters of the inner module, for example, by slowing down the vehicle to obtain multiple looks in an unclear situation, or changing the sensors used to acquire the imagery of the roads [55].

It has been reported that ROBIN could drive at 25 miles-per-hour on secondary roads and 10 miles-per-hour on ill-defined trails in daytime with a color camera, and drive at 15 miles-per-hour on secondary roads and at 10 miles-per-hour on ill-defined trails in low-light conditions with a FLIR camera [55].

4.4 Neural networks for road segmentation

A road segmentation system using a neural network classifier was presented in reference [57] and was tested on the Experimental Unmanned Vehicle (XUV) in the Demo III project. A laser range-finder (LADAR) is used to acquire dense depth data and a video camera is used to capture scene images. Correlated image pairs were obtained by co-registering the LADAR and color video data in time and space.

Four features (height, smoothness, color, and texture) were used to delineate the road from its surrounding background. 3-D height (vertical distance of a laser point) and smoothness (height variation in the local vicinity of a laser point) were computed from LADAR. A color histogram was computed for color features over an image patch and Gabor filters were calculated to characterize the texture of an image patch.

The segmentation procedure assumed that roads are expected to be locally smooth, to be consistent in gray and brown mixed colors, and to be homogeneous in texture, while the off-road background, such as trees or rocks, would be bumpier, present more blue and green colors, and show more variation in texture. A three-layer feedforward neural network with 20 hidden units was trained to fuse the LADAR and video image data and learn a segmentation boundary in feature space.
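Schematically, the per-patch classification step can be sketched as below: each co-registered patch is summarized by the four features and passed through a small feedforward network to produce a road likelihood. The feature values, weights, and output form are illustrative assumptions, not the trained network from [57].

    import numpy as np

    # Sketch of the per-patch road/background classifier: a 4-feature input
    # (height, smoothness, colour score, texture score) fed to a small feedforward net.
    rng = np.random.default_rng(2)
    n_in, n_hidden = 4, 20
    W1, b1 = rng.normal(scale=0.5, size=(n_hidden, n_in)), np.zeros(n_hidden)   # untrained placeholder weights
    W2, b2 = rng.normal(scale=0.5, size=n_hidden), 0.0

    def classify_patch(height, smoothness, colour, texture):
        """Return a road-likelihood score for one co-registered LADAR/video patch."""
        x = np.array([height, smoothness, colour, texture])
        h = np.tanh(W1 @ x + b1)                       # 20 hidden units, as in the cited system
        return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))    # sigmoid output: likelihood the patch is road

    # Example patch: low height, smooth surface, road-like colour/texture scores
    print("road likelihood: %.2f" % classify_patch(0.05, 0.02, 0.8, 0.7))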

The conjugate gradient back-propagation algorithm was used to update the weights of the network. To improve the performance of a single network model on the image set, the roads were classified into different types and a neural network was trained for each of the road types. The results of road segmentation remained good in the presence of shadows and road composition changes.

Fusing data from both the LADAR and the video camera results in better segmentation performance than using data from a single sensor. The drawback of this approach is that the training of the neural networks has to be performed offline, and is computationally complex for real-time application.


5 Genetic algorithms

Genetic algorithms (GAs) are, in general, a class of gradient-free optimization techniques in which operators based on the concepts of biological evolution are applied to individual points in a search space to find “good” solutions with respect to certain optimization cost functions within a reasonable amount of time. Overviews of GAs can be found in references [11],[12],[58],[59].

A set of points in the search space represents a “population” and each point in the population represents an individual. An individual (solution) is typically encoded as a bit string of 0s and 1s representing the genes in a chromosome. An initial population of candidate solutions is generated randomly and the “fitness” of each individual is evaluated with a fitness function, which measures how well the individual performs in a specific task. Given fitness ratings, genetic operators such as reproduction, crossover, and mutation are then applied to the population members to create successive generations with improved quality.
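The generational cycle just described can be sketched as follows; the bit-string length, the toy fitness function (maximize the number of 1-bits), and the crossover and mutation settings are illustrative assumptions.

    import random

    # Minimal genetic-algorithm sketch: bit-string individuals, fitness-proportional
    # selection, single-point crossover, and bit-flip mutation.
    GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 40
    P_MUT = 0.02

    def fitness(ind):                       # toy fitness: count of 1-bits ("one-max")
        return sum(ind)

    def select(pop):                        # fitness-proportional (roulette-wheel) selection
        total = sum(fitness(i) for i in pop) or 1
        return random.choices(pop, weights=[fitness(i) / total for i in pop], k=1)[0]

    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
    for gen in range(GENERATIONS):
        new_pop = []
        while len(new_pop) < POP_SIZE:
            p1, p2 = select(pop), select(pop)
            cut = random.randrange(1, GENOME_LEN)             # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if random.random() < P_MUT else b  # bit-flip mutation
                     for b in child]
            new_pop.append(child)
        pop = new_pop

    best = max(pop, key=fitness)
    print("best fitness after %d generations: %d / %d" % (GENERATIONS, fitness(best), GENOME_LEN))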

GAs are often applied to solve global optimization problems when the global cost function is discontinuous or where conventional algorithms might get stuck in local optima. The main restriction of GA methods is that repeated fitness function evaluation is computationally demanding for complex problems.

GAs can be used to optimize weights of a neural network, as well as other parameters of the network, such as the activation functions of the units [60]–[64], the mutation rate [60], the learning rate [65], and the learning algorithm [66]. GAs can also be applied to optimize neural networks along multiple dimensions using different multi-objective optimization techniques [67],[68].

Applications of GAs in robotics can be found in [69]–[72]. GAs have been used to design control programs for a variety of robot tasks.

In such applications, GAs search a space in which each point represents a robot behaviour. By evaluating these behaviours on a fitness function defined with respect to the target task and performing genetic operations, the robot behaviours evolve toward solutions (control programs) that lead to effective execution of the robot task. For autonomous systems, GAs have been applied to find the “best path” between two points for autonomous vehicles [73]. GAs have also been used to design sensors, tune sensor characteristics, and optimize deployment of a sensor network [74].

5.1 GA approach to learn obstacle avoidance parameters

Hamner et al. [73] at Carnegie Mellon University implemented a steering control model for collision avoidance that embeds steering dynamics into the generation of commands. The parameters of the model are learned by approximating the path of a human driver. Data is collected on an all-terrain vehicle (ATV), with obstacles detected using fused data from two LADARs, while a human driver drives the ATV along a path and tries to avoid obstacles.


The relationship between the parameters of the control model and the resulting path is nonlinear and has many local minima, where traditional gradient descent methods get stuck and the global optimum is not found.

Hamner et al. [73] applied GA learning to address this problem as the first step of a two-step optimization process. Twenty-five sets of parameters (5-tuples of constant terms in the control law) are chosen randomly as the initial population. A successive population is generated by combining or mutating parameter sets that are randomly selected with probability related to their total distance errors. The ten sets out of the population of 25 with the lowest residual errors after 100 iterations of the GA are kept. A nonlinear least squares procedure is then applied to optimize these ten parameter sets, and the one with the lowest combined errors is chosen as the best parameter set. It is shown that the parameter sets generated by the GA method perform better than the hand-tuned sets in both simulation and tests on the autonomous vehicle, with fewer vehicle stoppages and a higher overall success rate. However, the test results show that the parameter sets generated by the GA method tend to make harder turns while producing smaller overall distance errors [73].

5.2 GA approach to optimization of sensor deployment for AUV

Heaney et al. [74] used the GA approach to determine the sensor deployment strategy for optimal ocean sampling and prediction with AUVs. The search parameter vector for sensor placement consists of initial position, initial direction, range of the sample path, and the number and direction of turns.

The scalar cost function is defined as a weighted combination of five component cost functions. The cost function is then evaluated for each sensor deployment scheme (an individual in the GA population), and the GA generates a successive search population. The major benefit of this GA application is that users can define “optimal” by weighting the component cost functions based on their own needs. However, the GA solution is not guaranteed to be the optimum with respect to the selected cost function and it is hard to prove that the solution is the optimal sensor deployment.


6 Case Study 1: Autonomous Land Vehicle in a Neural Network

Important progress in reinforcement learning, neural-network-based learning, and evolutionary learning for autonomous systems has been reviewed in the previous chapters of this report. Trends for future research and limitations on applying the learning techniques in real autonomous systems were also discussed. In the rest of this report, the learning techniques under survey are further explored and reviewed with specific cases that have shown successful implementation and been deemed to have high potential for military application.

The first case study concerns the ALVINN system [76],[80]–[93] developed at CMU, which has been briefly described in Section 4.1. ALVINN is a successful example of learning and control techniques based on artificial neural networks (ANNs). It has been applied to the DARPA UGV program, and is thought to be the most successful development of the program [76]. This case study attempts to gain insight into the learning techniques developed for ALVINN by exploring how the general challenges in applying learning techniques to autonomous systems operating in dynamic environments are addressed. In particular, the techniques of transforming and buffering patterns for adding diversity to training sets and enhancing the generalization ability of ANNs are reviewed.

6.1 Internal Model

ALVINN learns to map sensory perception to steering direction for robotic vehicles to drive autonomously on roads in various conditions. Its learning architecture utilizes a feedforward neural network trained by the backpropagation (BP) algorithm with training data sets originally generated by a human driver on the road. This section reviews how learning is implemented for ALVINN with respect to its perception, internal model, learning algorithm, evaluation and validation. The neural network architecture used for world perception in the ALVINN system is shown in Figure 5.

An image taken from an onboard video camera or a laser rangefinder is digitized into an image array with 30×32 pixels, which is then projected to the input layer of the neural network. The input layer works as a perception “retina” of 30×32 units to receive the image array from the image preprocessing. Each of the input units is connected to all the units in the hidden layer, which in turn are connected to the 30 units in the output layer. Each output unit represents a quantized steering direction, ranging from “sharp left” at the left end, through “straight ahead” in the middle, to “sharp right” at the right end. The output steering direction is determined by the levels of activation of the output units, and is then transformed into a steering command that is applied to the system actuators to keep the vehicle travelling along the desired direction or to avoid colliding with obstacles on the road.
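The forward pass through this architecture amounts to a small fully connected network. A minimal sketch is given below, with an assumed hidden-layer size and random, untrained weights; the most active output unit is used here as a simple stand-in for ALVINN's actual decoding of the output activation profile.

    import numpy as np

    # Forward-pass sketch of an ALVINN-style network: 30x32 image retina -> hidden
    # layer -> 30 output units, each output encoding a quantized steering direction.
    rng = np.random.default_rng(3)
    N_IN, N_HIDDEN, N_OUT = 30 * 32, 4, 30              # hidden-layer size is an assumption
    W1 = rng.normal(scale=0.05, size=(N_HIDDEN, N_IN))  # untrained placeholder weights
    b1 = np.zeros(N_HIDDEN)
    W2 = rng.normal(scale=0.05, size=(N_OUT, N_HIDDEN))
    b2 = np.zeros(N_OUT)

    def steer(image_30x32):
        """Map one preprocessed camera image to a steering-direction index (0 = sharp left .. 29 = sharp right)."""
        x = image_30x32.reshape(-1)                     # flatten the retina to 960 input activations
        h = np.tanh(W1 @ x + b1)                        # hidden-layer activations
        out = np.tanh(W2 @ h + b2)                      # 30 output activations
        return int(np.argmax(out)), out                 # most active unit stands in for the decoded direction

    image = rng.uniform(0, 1, size=(30, 32))            # stand-in for a digitized road image
    idx, _ = steer(image)
    print("steering unit selected:", idx)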


Figure 5: The neural network architecture of ALVINN (adapted from [85])

The author of this report tested the ALVINN learning scheme using the codes and test data provided by [94]. A test result is shown in Figure 6, which displays some actual road images used by ALVINN (each with 30×32 pixels), the target output, and the network’s actual output after training is accomplished. The bright pixels represent units with strong positive activation levels; the dark pixels represent units with strong negative activation. The target output shows Gaussian activation levels centered on the output unit corresponding to the steering direction the driver chose when the training and test images were gathered.
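The Gaussian target vector described above can be sketched in a few lines; the number of output units matches the 30-unit output layer, while the width of the Gaussian hill is an assumed value for illustration.

    import numpy as np

    # Build the training target for one image: a Gaussian "hill" of activation
    # centered on the output unit corresponding to the human driver's steering direction.
    N_OUT = 30
    SIGMA = 2.0                          # hill width in output units (assumed value)

    def gaussian_target(driver_unit):
        """driver_unit: index 0..29 of the steering direction chosen by the human driver."""
        units = np.arange(N_OUT)
        target = np.exp(-((units - driver_unit) ** 2) / (2.0 * SIGMA ** 2))
        return target                    # peak of 1.0 at the driver's direction, falling off smoothly

    print(np.round(gaussian_target(driver_unit=12), 2))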


Figure 6: Input images, target output and network’s output after training. Generated with codes and test data provided by [94].

6.2 Learning algorithms

The learning algorithm adopted by ALVINN is the backpropagation training algorithm [95]–[97]. The training scheme is called training “on-the-fly”, in which ALVINN is trained in real time while the human driver is steering the vehicle. ALVINN is first fed with image arrays of the road from actual driving situations, and the activation signals are propagated through the network to generate a network response that represents a steering direction.


6.2.1 Network Steering Error

Figure 7: Steering error of the network. Reprinted with permission from [85].

The errors between the steering directions determined by ALVINN and by the human driver are calculated online and then propagated backward through the network (Figure 7). The weights of the connections are adjusted in the process of backpropagating the errors, and the network response is corrected to become better matched with the human driver’s steering direction. This training on-the-fly scheme avoids the burdensome task of generating realistic synthetic training data sets and therefore allows ALVINN to adapt to new driving situations rapidly.
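A single on-the-fly training step can be sketched as follows: the current camera image and the human driver's steering direction form one training pair, and a backpropagation update nudges the weights toward the driver's response. The network sizes, learning rate, and sum-squared error loss are illustrative choices, not the exact settings used in ALVINN.

    import numpy as np

    # One "on-the-fly" training step: the live camera image and the human driver's
    # steering choice form a training pair; backpropagation nudges the weights toward it.
    rng = np.random.default_rng(4)
    N_IN, N_HIDDEN, N_OUT, LR, SIGMA = 30 * 32, 4, 30, 0.01, 2.0
    W1, b1 = rng.normal(scale=0.05, size=(N_HIDDEN, N_IN)), np.zeros(N_HIDDEN)
    W2, b2 = rng.normal(scale=0.05, size=(N_OUT, N_HIDDEN)), np.zeros(N_OUT)

    def train_step(image, driver_unit):
        global W1, b1, W2, b2
        x = image.reshape(-1)
        # target: Gaussian hill of activation centered on the driver's steering direction
        t = np.exp(-((np.arange(N_OUT) - driver_unit) ** 2) / (2 * SIGMA ** 2))
        # forward pass
        h = np.tanh(W1 @ x + b1)
        y = np.tanh(W2 @ h + b2)
        # backward pass for the sum-squared error E = 0.5 * ||y - t||^2
        delta_out = (y - t) * (1 - y ** 2)             # dE/d(pre-activation of output units)
        delta_hid = (W2.T @ delta_out) * (1 - h ** 2)  # error propagated back to the hidden layer
        W2 -= LR * np.outer(delta_out, h); b2 -= LR * delta_out
        W1 -= LR * np.outer(delta_hid, x); b1 -= LR * delta_hid
        return 0.5 * np.sum((y - t) ** 2)              # current steering error, for monitoring

    image = rng.uniform(0, 1, size=(30, 32))
    print("error over a few updates:", [round(train_step(image, 12), 3) for _ in range(5)])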

However, it should be noted that there are two major drawbacks of the on-the-fly training scheme. The first arises when the human driver steers the vehicle consistently near the center of the road during training: ALVINN would then never learn how to correct the steering angle if the vehicle strays from the road center. The other drawback is the “over-learning” problem that arises when the human driver keeps driving the vehicle in similar road conditions (such as a long straight stretch or a long right turn) during training: ALVINN would “forget” what it learned earlier in training because it was fed with a long sequence of similar training data. For example, ALVINN would forget how to drive on a curved road after training on a long straight road, and it would form a tendency toward turning right after training on a long right turn.

Both drawbacks stem from an inherent property of the backpropagation training algorithm: the input exemplars used for training the network must cover the diversity of the input conditions.

The first problem can be solved by increasing the diversity of the training data set [80],[85],[89]. This can be achieved by shifting and rotating a single input image from the video camera during training into a series of images that look as if the vehicle was driven to the left and the right sides of the road as shown in Figure 8 [85].


Figure 8: Shifted and rotated images. Reprinted with permission from [85].

The steps for image transformation are as follows [85],[89],[91]:

1. Determine the two trapezoidal fields of view of the camera with the vehicle situated in the original (center) and transformed positions.

2. Determine the overlapping area of the two trapezoids.

3. Project a pixel in the transformed image onto the ground plane.

4. Project the pixel on the ground plane back to the original image.

5. Determine the pixel value with respect to the transformed image.

It has to be noted that a planar ground is assumed in the image transformation, so that the resulting transformation remains constant for all pixels sampled in the image. Therefore, this transformation can be done during the image preprocessing phase without additional processing time. The image transformation is illustrated in Figure 9.
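
The transformation can be sketched as follows, assuming a planar ground and two camera-specific projection helpers (ground_of_pixel and pixel_of_ground) that map between image pixels and ground-plane coordinates; these helpers, and the sign conventions used for the simulated shift and rotation, are assumptions made for illustration.

```python
import numpy as np

def transform_image(original, ground_of_pixel, pixel_of_ground, shift, rotation):
    """Resample 'original' as if the camera had been shifted ('shift') and
    rotated ('rotation') on a planar ground. 'ground_of_pixel' and
    'pixel_of_ground' are assumed camera projections between image pixels
    and ground-plane coordinates (they depend on the camera calibration)."""
    rows, cols = original.shape
    out = np.zeros_like(original)
    cos_t, sin_t = np.cos(rotation), np.sin(rotation)
    for r in range(rows):
        for c in range(cols):
            gx, gy = ground_of_pixel(r, c)          # transformed view -> ground plane
            # undo the simulated vehicle displacement (sign conventions illustrative)
            ox = cos_t * gx - sin_t * gy - shift
            oy = sin_t * gx + cos_t * gy
            rr, cc = pixel_of_ground(ox, oy)        # ground plane -> original view
            rr, cc = int(rr), int(cc)
            if 0 <= rr < rows and 0 <= cc < cols:
                out[r, c] = original[rr, cc]
            # pixels falling outside the original view are left for extrapolation
    return out
```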


Figure 9: Simulation of shifting and rotating the camera on the vehicle at two different positions. Reprinted with permission from [85].

In order to apply the image transformation technique described above, two problems have to be solved:

1. Fill in the missing pixels in the simulated road images resulting from the transformation;

2. Determine the actual steering direction for the simulated road images.

The author of ALVINN proposed the extrapolation and estimation techniques as solutions to these problems [80],[85],[91].

6.2.2 Extrapolation for Missing Pixels

As shown in Figure 10, there are some areas in the transformed trapezoid that do not overlap with the original one. In such areas, the pixel values cannot be determined by transforming the corresponding pixels in the original image. Two extrapolation techniques have been reported [85],[91] to fill in the missing pixels in these areas, as illustrated in Figure 10:

In the first technique, the unknown pixel A in the transformed image is first projected to the ground plane and then to a pixel B in the original image, where B is the pixel whose ground-plane projection is closest to that of A. Pixel B is then sampled to determine the value for pixel A, as shown in the upper right image of Figure 10. The problem with this extrapolation scheme is that features (such as edge lines) in the image may become unrealistically "smeared" [85],[91] into the missing areas of the image. This may cause ALVINN to learn an incorrect steering direction, since the steering direction is strongly correlated with the position of these features and smearing distorts them.


Figure 10: Two extrapolation techniques to fill in the missing pixels in the transformed image. Reprinted with permission from [85].

To overcome the smearing effect of the first extrapolation scheme, another extrapolation technique is implemented by which the extrapolation is performed along the line that connects the missing pixel A to the vanishing point of the scene. The extrapolation pixel, shown as point C in Figure 10, is determined by finding the closest point along this line in the original image. This extrapolation scheme is illustrated in the bottom right of Figure 10.
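
The second extrapolation scheme can be sketched as follows; the step-sampling along the line toward the vanishing point is a simplified stand-in for the geometric search described in [85],[91], and the function and argument names are illustrative.

```python
import numpy as np

def extrapolate_toward_vanishing_point(image, known_mask, row, col, vp_row, vp_col):
    """Fill a missing pixel (row, col) by stepping along the line joining it to
    the vanishing point (vp_row, vp_col) until a known pixel is reached."""
    dr, dc = vp_row - row, vp_col - col
    steps = max(abs(dr), abs(dc))
    for s in range(1, steps + 1):
        r = int(round(row + dr * s / steps))
        c = int(round(col + dc * s / steps))
        if (0 <= r < image.shape[0] and 0 <= c < image.shape[1]
                and known_mask[r, c]):
            return image[r, c]       # closest known pixel along the line
    return 0                         # no known pixel found along the line
```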

Compared with the first scheme, the second extrapolation technique improves the accuracy of steering direction produced by ALVINN as well as making the transformed image smoother. It was experimentally tested [91] that under the same transformation condition, the improved extrapolation technique reduced the steering errors by 37% over the first extrapolation scheme on a test data set of 100 images.

However, it needs to be pointed out that the second extrapolation scheme is based on the assumption that distinct features (such as edge lines) in the image are parallel to the current steering direction, as observed in usual cases. It may result in a different extrapolation outcome in the situations where this assumption does not hold.

6.2.3 Estimation for Steering Direction

To use the transformed images as training exemplars for the network, it is necessary to determine the steering direction for each transformed placement of the vehicle. A steering model called Pure Pursuit [85],[89],[91] was used to calculate the steering angle for bringing the vehicle back to the target position, as shown in Figure 11. Suppose the vehicle situated at point A is to be driven to the target point T by travelling a certain lookahead distance. With the vehicle's position transformed to point B, the problem is to decide the steering radius r that will drive the vehicle from point B to the target T with the same lookahead distance using the Pure Pursuit model, where the parameters are defined as follows:

s – offset translation distance

θ – offset rotation angle

l – lookahead distance

rp – driver’s steering radius

r – transformed steering radius

dp – driver’s displacement from the target at the lookahead distance l

d – transformed displacement from the target at the lookahead distance l

The transformed steering direction can be calculated with straightforward triangle geometry, given the transformation and the steering angle used when the original image was taken. The only unknown parameter in the Pure Pursuit model is the lookahead distance l, which can be determined empirically as the distance the vehicle travels in 2 to 3 seconds at a speed ranging from 5 to 55 mph [80],[85],[91]. It should be noted that the transformed steering direction calculated with this scheme corresponds to the range of steering directions that a human driver would instinctively produce in the same driving condition [85],[91]. Furthermore, the Pure Pursuit model is a geometric model and thus independent of the type of road (dirt or lane-marked) the vehicle is driving on.
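
The core Pure Pursuit relation can be sketched as follows, assuming the lookahead distance l is measured as the straight-line distance to the target, which gives a steering radius of l²/(2d); the small-offset approximation used for the transformed displacement is an assumption, and the exact trigonometric relation is given in [85].

```python
import math

def pure_pursuit_radius(lookahead, displacement):
    """Steering radius r = l^2 / (2d) that brings the vehicle to a target point
    'displacement' metres to the side at 'lookahead' metres distance."""
    if abs(displacement) < 1e-9:
        return math.inf              # already heading straight at the target
    return lookahead ** 2 / (2.0 * displacement)

def transformed_radius(lookahead, driver_displacement, s, theta):
    """Steering radius for the shifted (s) and rotated (theta) vehicle position,
    using a small-offset approximation for the transformed displacement d."""
    d = driver_displacement + s + lookahead * math.sin(theta)
    return pure_pursuit_radius(lookahead, d)
```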

Figure 11: The Pure Pursuit steering model (adopted from [85]).


6.2.4 Solution to the Over-learning Problem

Over-learning is one of the common problems in neural network based learning. In the ALVINN system, the training image sets are stored in a buffer accommodating 200 patterns for online learning. When the driving direction is unchanged over an extended distance, the network tends to over-learn the patterns in the buffer and forget the driving history it learned previously, which results in incorrect steering commands. For example, after driving along a long stretch of right turn, the network would become biased toward steering right, since the buffer fills up with recent right-turn steering patterns.

To overcome the over-learning problem for ALVINN, four strategies for replacing the training data stored in the buffer were proposed and tested [85]:

1. Replace the oldest images in the buffer;

2. Replace the images randomly chosen in the buffer;

3. Replace the images with the lowest errors in the buffer;

4. Replace images so that the overall (average) steering direction of the buffer remains closest to the straight-ahead direction.

It has been shown [85],[91] that the first two schemes would not work well for the long unchanged driving situations; the third scheme would work reasonably in keeping the diversity of the buffered data; the fourth scheme was more straightforward than the third and was able to actively compensate for the steering bias by incorporating a constraint on the average steering direction of the training buffer. With this replacement strategy, the pattern diversity in the buffer was maintained and the steering commands in all directions were balanced in the long run.
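
As an illustration of the fourth replacement strategy, the sketch below picks the buffered pattern to replace so that the buffer's mean steering direction stays as close as possible to straight ahead. It is a minimal sketch, assuming 30 output units with straight ahead halfway along the output layer and approximating each pattern's steering direction by its peak output unit; the function and parameter names are hypothetical.

```python
import numpy as np

def replace_for_balance(buffer_targets, new_target, straight_unit=14.5):
    """Insert 'new_target' into the training buffer, replacing the pattern whose
    removal keeps the buffer's average steering direction closest to straight
    ahead (taken here as half-way along a 30-unit output layer)."""
    peaks = np.array([float(np.argmax(t)) for t in buffer_targets])
    total = peaks.sum() + float(np.argmax(new_target))
    n = len(buffer_targets)
    means_after = (total - peaks) / n          # buffer mean if pattern i is replaced
    i = int(np.argmin(np.abs(means_after - straight_unit)))
    buffer_targets[i] = new_target
    return i
```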

6.2.5 Performance Evaluation

Analysis of the weight diagrams of the ALVINN system [85],[86],[89],[91] shows that in various driving conditions, the hidden units are excited by the important features detected in the images, and their projections to the output layer suggest the correct steering direction that would bring the vehicle back to the center of the road. As a result, ALVINN exhibits a level of flexibility that a hand-programmed system without learning would find difficult to achieve. ALVINN was specifically trained to drive in various road conditions, including dirt roads, paved roads and lined highways (Figure 12). Combined with the video camera input and a rule-based arbitration system, ALVINN has also been trained to drive at night using laser reflectance sensing and to avoid obstacles in its driving environment using a laser rangefinder. It was reported that with its peak image processing rate of 15 fps, ALVINN allows the Navlab vehicle to drive at a maximum speed of 55 mph, which is "over 4 times faster than any other sensor-based autonomous system using the same processing hardware" [85],[91].

However, the flexibility of ALVINN stems from the fact that it is a weak model for feature detection, that is, it acquires the knowledge of which features in the images are important through training. A single ALVINN architecture may become unstable when transitioning from one type of road to another, or from one direction to the opposite on the same road. The author of ALVINN developed variant architectures for ALVINN to improve its performance in a variety of situations [86],[88],[89],[93].

Figure 12: Road situations in which ALVINN is trained to drive: left – a dirt road; middle – a single-lane paved path; right – a two-lane highway. Reprinted with permission from [85].


7 Case Study 2: Learning to Steer an Autonomous Land Vehicle with an Evolutionary Approach

This chapter presents a case study of an evolutionary optimization method for creating an ANN-based UGV controller for Carnegie Mellon's NAVLAB system. The standard error backpropagation (BP) algorithm for training ANNs is based on local gradient descent through the weight space, and is therefore prone to getting trapped in local minima. In contrast, evolutionary algorithms (EAs) are global search based optimization techniques and are much less likely to be trapped in local minima. Research on the application of EAs to ANNs has concentrated on problems with networks of relatively small size. The problem discussed in this case involves a large number of pixel-based inputs and outputs for controlling the steering direction of an autonomous vehicle, and is addressed by an EA approach. This case serves as a good example of applying EA techniques to a UGV in the real world.

EAs are optimization techniques inspired by concepts of biological evolution such as selection, recombination and mutation. In the selection operation, the search points with a higher value of the merit function are chosen. The recombination process combines individuals in the population of points to reproduce successive generations. The mutation process randomly changes the genotype of an individual and maintains diversity in the population. Overviews of EAs can be found in [98].

EAs can be applied to learning with neural networks in two ways: searching for the connection weights, and searching for a network topology that will optimize the network performance. Reference [99] presents an evolutionary search approach, called Population-Based Incremental Learning (PBIL), for optimizing the neural network used for controlling the steering direction of ALVINN. In this case study, the PBIL algorithm and some important problems concerning learning with PBIL are reviewed with respect to:

reducing computational complexity using EA for training a neural network

integrating EA and BP algorithms for better generalization

specifying network error metric for performance improvement

7.1 Overview of PBIL

The PBIL algorithm combines techniques from genetic algorithms and supervised competitive learning [99]. In PBIL, a solution point in the search space corresponds to a network whose topology and connection weights are encoded as a binary string of 1's and 0's. The number of bits in each solution string is fixed. The goal of the PBIL algorithm is to generate a probability vector that defines the probability of having a "1" in each bit position of the solution string. Each bit of the probability vector is initially set to 0.5. A procedure based on the updating scheme of competitive learning is used to update the probability vector, represented as follows [99]:

Pi,t+1 = Pi,t * (1 – r) + r * vi


where Pi,t is the value of bit i of the probability vector at time step t, vi is bit i of the solution vector, and r is the learning rate, which is set to 0.1 in [99]. With this update rule, the probability vector gradually moves toward the solution vector whose network produces the lowest output errors, and also toward the complement of the solution vector with the highest output errors.

In addition to the probability updating, a mutation operation is performed in the PBIL algorithm by randomly altering the position of a solution vector with small probability in an iteration step.

The PBIL learning algorithm can be programmed with the following steps [99]:

1. Initialize the probability vectors

2. Generate the sample vectors according to the probability

3. Update the probability vectors toward the best network

4. Update the probability vectors away from the worst network

5. Mutate the probability vectors

The general PBIL scheme is shown in Figure 13. Compared with the standard GA method, PBIL is able to find a better solution with less search time because of its simpler operations on the members of the population.
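
The update and sampling loop can be sketched as follows. This is a minimal illustration, assuming a generic evaluation function that maps a candidate bit string to a network error; the generation count, population size and learning rate follow the values quoted above, while the mutation parameters and the treatment of mutation as a perturbation of the probability vector are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def pbil(evaluate, n_bits, generations=500, pop_size=30,
         lr=0.1, mutation_prob=0.02, mutation_shift=0.05):
    """Minimal PBIL loop. 'evaluate' maps a 0/1 string (encoding the network
    topology and connection weights) to an error value; lower is better."""
    p = np.full(n_bits, 0.5)                       # probability vector
    best, best_err = None, np.inf
    for _ in range(generations):
        pop = (rng.random((pop_size, n_bits)) < p).astype(int)
        errs = np.array([evaluate(v) for v in pop])
        good, bad = pop[np.argmin(errs)], pop[np.argmax(errs)]
        p = p * (1.0 - lr) + lr * good             # move toward the best solution
        p = p * (1.0 - lr) + lr * (1 - bad)        # and toward the worst one's complement
        # mutation: perturb probability-vector positions with small probability
        mutate = rng.random(n_bits) < mutation_prob
        p[mutate] = (p[mutate] * (1.0 - mutation_shift)
                     + mutation_shift * rng.integers(0, 2, int(mutate.sum())))
        if errs.min() < best_err:
            best, best_err = pop[np.argmin(errs)].copy(), float(errs.min())
    return best, best_err
```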


Figure 13: The general scheme of PBIL algorithm in pseudo-code. Reprinted with permission from [99].


7.2 Applications of PBIL in Steering Control of ALVINN

In this section, the approach to neural network based control of ALVINN with PBIL advanced by Baluja [99] is explained in detail. The idea for acquisition of training data is basically the same as that discussed in Case Study 1: training data is first collected by driving the controlled vehicle on various types of roads, including a dirt road and a line-divided highway in both directions, and the image transformation technique proposed by Pomerleau [85] is then applied to the collected data. To train neural networks with PBIL, a training image set is used to evaluate the sum squared error (SSE) and the network is evolved by the PBIL update rule described in Section 7.1. A validation image set is used to determine the best network, the one with the smallest SSE in the run. A test image set is used to measure the generalization ability of the best network by evaluating its SSE on the test set.

To improve the speed of PBIL, a small subset randomly sampled from the entire training image set may be used to evaluate the SSE for each network. It was shown that this scheme was still able to maintain good generalization ability of the network while significantly reducing the computational cost of PBIL [99].

7.2.1 PBIL vs. BP with a single output unit

The performance of PBIL may be evaluated by setting BP with a single output as a benchmark algorithm. The single network output represents the steering direction of the controlled vehicle. For the benchmarking comparison, the maximal network may be defined to have the maximum connectivity specified among the network layers, but some of its connection weights may be eliminated as the network is evolved. The maximal network is first trained on the full training set and on a sampled subset of it, and is then evaluated on the full validation set and its sampled subset to determine the best network.

Baluja [99] applied both PBIL and BP with one output unit to ALVINN. In this application, the input retina size for ALVINN was reduced to 15×16 pixels. A maximal network with 15×16 input units, 5 hidden units and a single output unit was constructed. The output unit's activation level ranged from -1 to 1, corresponding to steering angles from sharp left to sharp right of the road center. The PBIL algorithm progressed through 500 generations with 30 potential solution networks assessed in each generation. The average SSE per image was evaluated over the entire run up to the current generation, and over the population of each generation, respectively. The results are shown in Figure 14 for both situations.

Statistically, the reduced training set produces performance comparable with the full training set on the validation sets, which means that a significant reduction in computational cost can be achieved without loss of generalization ability by the training set selection scheme. However, it can be observed in Figure 14 that the training error drops more rapidly with the full training set than with its sampled subset as the generations progress, which may be caused by the noise introduced by evaluating each network on only a small portion of the entire training set.

As a comparison benchmark, the maximal network was also trained with the BP algorithm. It has been reported that, compared to BP, the SSE on the test set was reduced by 36% with the network trained by PBIL using the sampled training set, and by 40% using the full training set [99]. It should be noted that the final size of the maximal network was reduced by nearly half using both training set selection schemes [99].

Figure 14: (a) average errors of the best network in the entire run up to the current generation; (b) average errors of the best network in the population of each generation. S stands for sampled training/validation sets; F for full training/validation sets. Reprinted with permission from [99].

7.2.2 PBIL vs. BP with 30 output units

Baluja [99] evaluated the performance of PBIL further by increasing the number of output units to 30; the network was tested with the same training data as the network with a single output. The image transformation and buffering techniques used for BP training on ALVINN, which have been discussed in Case Study 1, were also applied in this experiment. The PBIL and BP algorithms were used to train the network. A typical network output generated with PBIL and BP is shown in Figure 15.

It can be seen from the figure that PBIL achieves a more accurate output than BP. An error reduction of 13% was reported in [99]. The drawback of the PBIL algorithm is that it takes a long time (maybe over an hour [99]) to evolve the network architecture and update its weights while the BP algorithm only takes several minutes to finish the training. When the maximum network architecture is maintained while only the connectivity weights are updated with PBIL, a slight improvement in error reduction rate (for example, 15% [99]) may be observed.



Figure 15: A typical target and the actual outputs of the network trained by (a) PBIL and by (b) BP algorithms. The same set of test images is used for the network outputs. Reprinted with permission from [99].

7.2.3 Integrating evolutionary approach and backpropagation

As discussed in previous sections, the EA and BP search approaches for training ANNs have their own advantages and disadvantages. An EA technique such as PBIL tends to move the search points toward regions of high evaluation rapidly, but is slow in moving from these high-performance regions to the optimum points. In contrast, BP is able to converge rapidly to a local optimum but may miss the global optimum, since the BP search relies on local gradient-descent information. A natural idea for improving either search approach is to integrate the two models.

There are four integration models for combination of PBIL and BP [99]:

1. PBIL + BP with evolved network structure and evolved weights. PBIL first evolves the network to the best network (the one with minimum error on the validation set) and then initiates BP to train the network using the structure and weights found by PBIL.

2. PBIL + BP with evolved network structure and random weights. PBIL first evolves the network to the best network and then initiates BP to train the network using only the structure found by PBIL. The weights for BP are randomly initialized.

3. PBIL + BP with maximal network structure and evolved weights. PBIL first evolves the network to the best network and then initiates BP to train the network using only the weights found by PBIL. The network structure for BP is initialized to the full maximal architecture.


4. PBIL with maximal network structure + BP with maximal network structure and weights. PBIL first evolves the weights with the maximal network structure and then initiates BP to train the network using the weights found by PBIL.

The four integration models were tested with the 30 output unit network structure, and the sum squared error (SSE) metric was used to measure the performance of the models [99]. It has been shown that the fourth integration model has the best error reduction rate over standalone BP, and that the second integration model, with random initial weights for BP, results in larger output error than the standalone BP model [99]. This indicates that the BP approach may have led the search points to local optima, and that PBIL does not necessarily end up with a network with which BP is guaranteed to work well.

Baluja [99] evaluated the test results with the SSE metric, as shown in Figure 16. It can be observed that combining PBIL with BP using the fourth integration method described above significantly improves over standalone PBIL with both the structure and weights evolved, but this is not the case for PBIL with only the weights evolved. For the integration of PBIL and BP, it seems that evolving both the structure and the weights could not provide a significant performance improvement in terms of the SSE metric. However, more comprehensive integration models and error metrics should be explored to find out the effect of architecture evolution on the actual network performance.

Figure 16: Comparison of the empirical results from separated and integrated models of the PBIL and BP techniques. Reprinted with permission from [99].

7.2.4 Changing error metric for specific tasks

For the applications reviewed in the previous sections, the SSE error metric is used to compare the performance of the PBIL and BP training methods. For a specific task such as controlling the steering direction of ALVINN, the network outputs have to be mapped onto a specific form that can be used for the task, and the error metric may have to be re-defined. As has been discussed, for neural network based steering control of the ALVINN system a Gaussian has been used to fit the output vector, and the error of the outputs has been defined as the distance between the peaks of the Gaussians fitted to the actual output and the target output, which is called the Gaussian peak position error (GPPE) metric [99]. It should be noted that the GPPE metric cannot be used to guide the search for the BP algorithm, since the inverse mapping from GPPE to each of the output units is not explicitly defined; GPPE is mainly used for EA approaches. For comparison of the SSE and GPPE metrics, the GPPE error for each of the EA and BP learning models was evaluated [99], as shown in Figure 16. It can be seen that the magnitude of an output error measured with the GPPE metric is smaller than with the SSE metric.

The training goals with GPPE metric and SSE are different. With SSE, the network is trained to produce exactly the same output activation as the target. With GPPE, the activation is mainly fitted to a region corresponding to the target, and certain noise in the images is filtered out. Baluja [99] used the GPPE metric in PBIL for training the network of 30 output units on the same training set. The average GPPE error was reported to be 2.76, less than the GPPE error (2.90) derived by the best PBIL and BP integration model using the SSE metric [99]. Some typical outputs of the networks trained with GPPE and SSE metrics were presented in reference [99] (Figure 17).

It can be seen from Figure 17 that the outputs generated by the network trained with the SSE metric are smoother than those generated with the GPPE metric, while the average steering errors of the network evolved with the GPPE metric are smaller than those with the SSE metric. Hence the network performance can be improved by using an error metric appropriate for the specific task.
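
A GPPE computation can be sketched as follows, assuming the same Gaussian shape used for the target encoding; the fitting resolution and Gaussian width are assumptions, and the exact fitting procedure in [99] may differ.

```python
import numpy as np

def gaussian_peak(output, sigma=2.0, resolution=0.1):
    """Peak position of the Gaussian that best fits an output vector
    (least-squares fit over finely spaced candidate centres)."""
    units = np.arange(len(output))
    centres = np.arange(0.0, len(output) - 1 + resolution, resolution)
    fits = [np.sum((output - np.exp(-(units - c) ** 2 / (2 * sigma ** 2))) ** 2)
            for c in centres]
    return centres[int(np.argmin(fits))]

def gppe(actual, target, sigma=2.0):
    """Gaussian peak position error: distance between the two fitted peaks."""
    return abs(gaussian_peak(actual, sigma) - gaussian_peak(target, sigma))
```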

(a1) GPPE = 0.53 (a2) SSE = 0.49

(b1) GPPE = 1.16 (b2) SSE = 1.70

(c1) GPPE = 1.71 (c2) SSE = 5.12

(d1) GPPE = 2.41 (d2) SSE = 3.83

Figure 17: Changing the error metric: PBIL + GPPE learning (left column) vs. BP + SSE learning schemes. Reprinted with permission from [99].


8 Case Study 3: Pattern Recognition with Recurrent Neural Networks for Autonomous Systems

This chapter reviews the structure and learning rules of a recurrent neural network (RNN) presented by Barton [77]. The RNN was reported to be capable of memory, recognition and identification of patterns in two image streams coming from video sensors [77]. The work was part of a Technology Investment Fund (TIF) project at DRDC Suffield with a long term goal of demonstrating self-organized, goal-driven adaptive learning for an autonomous vehicle operating in an unstructured environment.

Unlike the layered feedforward neural networks discussed in previous chapters of this report, an RNN can connect a node to any node in the network, including itself. Input signals are presented to the network through connections to an input region of the RNN. The interconnection strengths (connection weights) have to be adjusted dynamically so that the RNN converges to a point in its output space where the appropriate representation of an input pattern is encoded. With this unrestricted interconnection architecture, stability and convergence pose significant challenges in formulating learning rules for an RNN.

The RNN presented by Barton [77] has found application in the control of a simulated mobile machine; the RNN can learn to store information on the sensor patterns and the sensor states resulting from the actions taken. It can also learn to associate current sensor patterns with actions that would lead to improved future sensor states. It has been shown that a simulated mobile machine controlled by the RNN was able to demonstrate some level of goal-driven behaviour [78]; with the RNN controller, the mobile machine was strongly attracted to a moving source and continued to follow the source closely for thousands of network cycles [79]. The features of the RNN described in references [77]-[79] demonstrate interesting potential for military applications. This chapter reviews the details of the structure and learning rules developed for the RNN, and discusses the test results of associative memory with the RNN.

8.1 The Network Structure

This section reviews the structure of the RNN used by Barton [77]. As illustrated in Figure 18, there are two independent sensor arrays: a 4×6 image array (SI) and a 4×1 code array (SC) associated with the image array. The image array SI is connected to 50 recurrently connected nodes (called R nodes) in the input region of the RNN. Each pixel in the image has a fixed number of connections to R nodes in the input region with the constraint that no two image pixels have the same set of output connections. The code array SC provides no direct input to the RNN. Correspondingly, the RNN has two memory regions that store and regenerate patterns from the two sensor arrays. The memory node arrays MI and MC in the memory regions generate outputs of SI and SC respectively. Output arrays are shown on the right of Figure 18, connected to the upper and lower R node arrays.

As shown in Figure 18, SI is the only input that generates responses in the memory regions MI and MC; during training there is a feedback of the difference between the corresponding pixels in the image and code arrays. This feedback adjusts the connection strengths (weights) of R nodes in the memory regions to reduce the difference between the image and code.


The R node connections are initialized randomly with the constraint that each R node has the same number of input connections from any other R node in the RNN except itself, and connects its output to the next R node.

Figure 18: Structure of a recurrent neural network with associative memory. This figure is taken and adapted from [77].

The connections from R nodes to M nodes (nodes in the memory regions MI and MC) are initialized randomly with the constraint that each M node has the same number of inputs from excitatory R nodes and the same number of inputs from inhibitory R nodes.

All the weights in the RNN take a positive sign and all the outputs of R nodes have a fixed sign that is initialized randomly. The R node weights are all initialized randomly within a user-specified limit. The S to R node weights as well as the R to M node weights are all initialized to a user-specified constant (default 1.0).



8.2 Input Signals

In the RNN developed by Barton [77], the input signals to the network are the sensor images and sensor codes that are digitized and stored in the sensor array. The input signals are first linearly combined with the predetermined weights and then fed to the activation function of an R node to generate an activation level. The R node takes this activation value as its response to the input signals. A sigmoid function is used as the activation function in [77]. The R node response can be calculated by the following formulas [77]:

x_n^t = Σ_{i=1..n_c} W_{ni}^t · σ_j · f_j^{t-1}

f_n^t = 1 / (1 + exp(−s · (x_n^t − x_{0n}^t)))

where the parameters and subscripts are defined as:

n – nth R node

t – tth time cycle

i – ith input connection to the R node

j – jth input source to the ith input connection

n_c – number of input connections

f – R node output

W – connection weight

σ – connection sign

s – slope of the sigmoid activation function

x – sum of the weighted inputs

x_0 – offset
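
A minimal sketch of this response computation is given below; the slope parameter and the function and argument names are illustrative, not Barton's implementation.

```python
import numpy as np

def r_node_response(weights, signs, inputs, offset, slope=1.0):
    """Response of one R node: signed, weighted sum of its inputs passed
    through a sigmoid with an adjustable offset (slope treated as a constant)."""
    x = np.sum(weights * signs * inputs)       # sum of the weighted, signed inputs
    return 1.0 / (1.0 + np.exp(-slope * (x - offset)))
```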

8.3 Learning Rules

In a training run, Barton [77] applied the Hebbian learning rules to update the weights and offsets of the R nodes for goal-driven behaviour [78],[79], and applied the correlation feedback learning for explicit memory of sensor images [100]. In the case discussed here, both Hebbian and feedback learning are used.

8.3.1 Hebbian Learning

In this case, modified Hebbian learning was used to update the weights and offsets for all R nodes that are not being connected to an M node for pattern memory [77]. The learning rule is given by [77]:


ΔW_{ni}^t = α · f_j^{t-1} · f_n^t − γ · W_{ni}^t

Δx_{0n}^t = β · (f_n^t − 0.5)

where α is the growth rate and γ is the decay rate for the weight update, and β is the change rate for the offset update. With the Hebbian learning rule given above, a strong response of an R node to a strong input will encourage the connection strength to grow, within the limit set by the decay term.
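
A sketch of one Hebbian update step, following the reconstructed rule above, is given below; the rate constants are illustrative and the sign conventions are assumptions.

```python
def hebbian_update(W, f_in, f_out, x0, alpha=0.01, gamma=0.001, beta=0.01):
    """One modified Hebbian step for an R node: weights grow when input and
    output are both active, limited by a decay term; the offset is nudged
    according to the node's activity level."""
    W_new = W + alpha * f_in * f_out - gamma * W
    x0_new = x0 + beta * (f_out - 0.5)
    return W_new, x0_new
```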

8.3.2 Feedback Learning

For the R nodes that are connected to one or more M nodes in the memory region, Barton [77] used a learning rule based on the feedback of the difference between the outputs of corresponding nodes in the memory and sensor arrays to update the connection weights of the R nodes. The learning rule is given by the following formulas [77]:

D_n = (σ_n / c_m) · Σ_{i=1..c_m} Δ_i

ΔW_{jn}^t = η · D_n^t · f_n^t · f_j^{t-1} − γ · W_{jn}^t

Δx_{0n}^t = η · D_n^t

where the parameters are defined as follows:

Δ_i – difference between the outputs generated by the ith S and M nodes

σ_n – output sign of the nth R node

c_m – number of S and M node pairs connected to the nth R node

D_n – normalized image or code difference for the nth R node

ΔW_{jn}^t – change in the jth input weight for the nth R node at cycle t

η – feedback learning rate

Δx_{0n}^t – change in the offset for the nth R node at cycle t

With the learning rule given above, the strength of the weight for the R node is determined by the strength of the incoming signal and by the strength of the weight itself in the last learning cycle.

8.4 Test Results

The main goals of the work presented in [77] are to use RNNs to address two problems related to the control of autonomous systems: (1) store a pattern and its identification code by simultaneous presentation of two streams of sensor images; and (2) recover identification from the code memory array by presentation of a single pattern in the image sensor stream.


Towards this end, ten 4×6 sensor images were used as a training set for the RNN of 24 S nodes, and ten 4×1 vectors were used as code sensor arrays representing the binary form of the decimal images shown in Figure 19. Two RNNs, containing 250 and 500 R nodes respectively, were used in the tests. Both RNNs had 40 input connections for each R node and 40 output connections for each S node. The RNN with 250 R nodes assigned 100 nodes to the image input region, 100 to the image memory region, and 50 to the code memory region. The RNN with 500 R nodes doubled the size of each of these three regions.

The learning rules described in Section 8.3 were applied to train the RNNs constructed as above. The first two image and code pairs were used as a training set. It was shown that, using the feedback learning rule to train the connection weights of R nodes in the memory regions, both RNNs were able to reproduce the images and codes that had been seen during training, with the RNN of 500 R nodes exhibiting a stronger correlation level for the image and code pairs. This result validates that explicit memory is encoded in the connection weights in the memory regions with the feedback learning, and that, with the input signal processing method described in Section 8.2, the RNNs trained by the feedback learning rule are robust to the noise present in the input images.

The test results also show that for an RNN of fixed size, storage capacity degrades as the number of training pairs presented to the network increases. However, the storage capacity can be improved by adding R nodes to the network.

The work of Barton [77]-[79] reveals that it is possible to address the two problems described at the beginning of this section using the RNN’s structure and learning rules proposed in this work.

Figure 19: Ten image sensor patterns. Reprinted with permission from the copyright holder.


9 Conclusions

This report reviews a number of learning and control architectures that have been successfully implemented and applied to practical autonomous systems, including neural networks, reinforcement learning, and genetic algorithms. Learning techniques using feedforward neural networks, evolutionary algorithms, and recurrent neural networks are studied in the context of military applications.

9.1 Remarks

Regarding UGVs, the relevant applications of robot learning technology would be primarily in autonomous navigation control, image segment analysis and classification, and target recognition. ANNs have been successfully implemented and demonstrated to be an effective learning technique for on-road autonomous navigation. Road-following can be accomplished by training an ANN to establish a mapping between vision inputs and steering outputs. However, it would be difficult to train neural networks to handle all the situations in a highly unstructured environment for off-road navigation.

The technique of training on-the-fly developed for the ALVINN system provides a solution to the learning problems of insufficient diversity and over-learning. This technique has made ALVINN a key technology for some military and civilian applications [76], and results in ANNs capable of autonomously and accurately steering a vehicle in a wide variety of situations. One of the drawbacks of ANNs is that a single ANN is not able to drive from one road type to another and conduct obstacle avoidance by itself. Variants of the ANN have been proposed to extend its capacity and improve its performance for a wider range of applications.

Genetic algorithms are essentially a class of optimization approaches developed using concepts from biological evolution. They are often applied to solve global optimization problems when conventional optimization techniques might be infeasible for a discontinuous global cost function or might get stuck in local optima. Regarding autonomous vehicles, one possible use would be to find an optimal path between two points given that there may be an infinite number of paths with random obstacles in between.

In the application of evolutionary algorithms to training ANNs, the network performance should be extensively compared against a benchmark algorithm, such as the backpropagation approach. It has been shown that on average the evolutionary approach performs better than backpropagation in terms of output error metrics. However, the computational cost of evolutionary training may be significantly higher than that of backpropagation training. Although some solutions have been proposed, including distributed parallel computation for each evolving network and offline training prior to operation, efficient online training remains a key problem to be addressed for the application of evolutionary algorithms to autonomous systems operating in dynamic environments. In the case reviewed in this report, emphasis has been put on the accuracy of the approach. In a practical application of an evolutionary optimization technique, one may have to compromise on the conflicting needs for accuracy and speed.


The RNN learning architecture developed at DRDC has demonstrated the potential to store a pattern and its identification code by simultaneous presentation of multiple image streams to the network. Explicit associative memory capacity is established by training the input connection weights of the R nodes in the memory regions of an RNN with the feedback learning rule. This enables an image that has been “seen” during training to be explicitly reproduced by the output of the memory array through interactions of connections in the whole network. However, an image that has not been presented to the RNN cannot produce strong correlations for image and codes in the sensor array and in the memory region, and therefore cannot be identified.

Learning in general may be extensively applied to the software technologies of UGVs. In perception, for example, neural networks with statistical training algorithms have been used for feature classifiers for region or terrain classification. One of the major restrictions of its application is the large amount of computational resources and time that a learning algorithm may require. Research on more efficient learning architectures should be pursued, and should be applied to more complex problems that are hard to solve with traditional approaches (e.g., those requiring rigorous mathematical models).

Learning has not been essential for autonomous systems working in structured, predictable environments (e.g., in a design laboratory). However, as autonomous systems start to perform tasks in highly unstructured environments, which may present conditions not revealed in a design laboratory, they will have to learn to adapt to these new circumstances. Future autonomous systems will have to be capable of inductive learning, extracting general information from specific examples. The trend is for autonomous systems to be equipped with some ability to learn. Learning will continue to be one of the most challenging areas of research and development for system autonomy.

9.2 Future Direction

The image transformation technique used in ALVINN is attractive because it simulates the vehicle's position without the need for actual driving, and acquires simulated images by shifting and rotating the original one with little computing resource required. Extrapolation techniques are then used to fill in the missing pixels in the simulated road images resulting from the transformation. The ANN learning is applied to the extended data sets to build a mapping from sensory inputs to steering control commands. As a possible research direction, the image transformation technique may be explored for application to the Structure from Motion (SFM) [103] approach, to address the problem of the limited perception range of stereo vision on an autonomous vehicle.

It is well known that limitations on sensor range and sensor resolution lead to limited range perception and thus degrade the performance of autonomous navigation of a UGV. Wide-baseline stereo vision with the SFM technique has been proposed to address this problem. With SFM, 2D scene images are accumulated over long distances to build the 3D structure of the scene. However, the SFM technique requires actually driving the vehicle to various camera viewpoints using small incremental motions. Its applicability is limited by the difficulty of obtaining accurately calibrated sequences of scene images, even in a structured environment. Feature correspondence techniques often fail due to inaccurate images resulting from position errors of viewpoints and ambiguity in the acquired images.


With the application of the image transformation technique reviewed in this report, it is possible to generate a sequence of accurate images at various simulated viewpoints by simple homogeneous transformation, without actually driving the vehicle, as has been discussed in detail in Chapter 6. In future research, it would be necessary to design and experiment with an extrapolation algorithm that not only fills in missing pixels in the transformed images but also facilitates feature correspondence.

For a specific path planned for an autonomous vehicle, an inference mechanism in an ANN framework similar to that used in Case Study 1 (feedforward) or Case Study 2 (EA based) may be designed to determine which locations on the way to the goal destination could be used for the transformation to acquire more range data for navigation of the vehicle. ANN models may also be designed for the data aggregation and inference algorithms of the SFM approach.

A potential application of the RNN learning architectures discussed in this report would be in the areas of multi-sensor imagery fusion, mining, and reasoning [104]–[108]. It is suggested that the RNN architectures developed at DRDC Suffield be explored for application in these research areas.

Readers may have noticed that the focus of this report has been on solutions to the fundamental problems of learning approaches, such as the problems of diversity of data sets, loss of memory storage, and complexity of computational algorithms, which have to be addressed in practical application of any learning technique. Beyond the scope of this review, there are certainly many other biologically-based learning techniques that have been proposed and implemented for various applications in autonomous systems, such as the RatSLAM [101],[102] that builds a map online for autonomous navigation using a computational model of the rodent hippocampus and a vision system.


References

[1] Connell, J. H. and Mahadevan, S. (1993), Introduction to robot learning, In J. H. Connell and S. Mahadevan (eds), Robot learning, Kluwer Academic, Boston, MA, pp. 1–7.

[2] Siegwart, R. and Nourbakhsh, I. (2004), Introduction to autonomous mobile robots, MIT Press, Cambridge, MA.

[3] Arkin, R. C. (1998), Behavior-based Robotics, MIT Press, Cambridge, MA.

[4] Barto, A. G. (1992), Reinforcement learning and adaptive critic methods. In D. A. White and D. A. Sofge (eds), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, Van Nostrand Reinhold, New York, pp. 469–491.

[5] Kaelbling, L. P., Littman, M. L. and Moore, A. W. (1996), Reinforcement learning: a survey, Journal of Artificial Intelligence Research, 4, 237-285.

[6] Barto, A. G. (1995), Reinforcement learning, In M. A. Arbib (ed.), Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, pp. 804–809.

[7] Schalkoff, R.J. (1997), Artificial Neural Networks, McGraw-Hill, New York.

[8] Haykin, S. (1994), Neural Networks, Macmillan, New York.

[9] Arbib, M.A., ed. (1995), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA.

[10] Holland, J. (1975), Adaptation in natural and artificial systems, The University of Michigan Press, Ann Arbor, MI.

[11] Goldberg, D. (1989), Genetic algorithms in search, optimization, and machine learning, Addison Wesley, Reading, MA.

[12] Mitchell, M. (1996), An introduction to genetic algorithms, MIT Press, Cambridge, MA.

[13] Kaufman, H., Bar-Kana, I. and Sobel, K. (1994), Direct Adaptive Control Algorithms: Theory and Applications, Springer-Verlag, New York.

[14] Landau, I.D., Lozano, R. and M'saad, M. (1998), Adaptive Control, Springer, New York.

[15] Astrom, K.J., and Wittenmark, B. (1989), Adaptive Control, Addison-Wesley, Reading, MA.

[16] Bekey, G. A. (2005), Autonomous Robots: From Biological Inspiration to Implementation and Control, MIT Press, Cambridge, MA.


[17] Chiarella, M., Fay, D., Ivey, R., Bomberger, N. and Waxman, A. (2004), Multisensor Image Fusion, Mining, and Reasoning - Rule Sets for Higher-Level AFE in a COTS Environment, In Proceedings of the 7th International Conference on Information Fusion, 983–990.

[18] Fagg, A. H., Lewis, M. A. and Montgomery, J. F. (1993), The USC Autonomous flying vehicle: An experiment in real-time behavior-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 2, 1173–1187, Yokohama, Japan.

[19] Fagg, A. H., D. Lotspeich, and G. A. Bekey. (1994), A reinforcement learning approach to reactive control policy design for autonomous robots. In IEEE International Conference on Robotics and Automation, 39–44, San Diego, CA.

[20] Omidvar, O. and Van der Smagt, P., eds. (1997), Neural Systems for Robotics, Academic Press, San Diego, CA.

[21] Watkins, C., and Dayan, P. (1992), Q-learning, Machine Learning, 8, 279–292.

[22] Lin, L. (1992), Self-improving reactive agents based on reinforcement learning, planning and teaching, Machine Learning, 8, 293-321.

[23] Peng, J., and Bhanu, B. (1999), Learning to perceive objects for autonomous navigation, Autonomous Robots, 6(2), 187-201.

[24] Bagnell, J., Schneider, J. (2001), Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods, In IEEE Proceedings of the International Conference on Robotics and Automation, 2, 1615 – 1620.

[25] Abbeel, P., Ganapathi, V., & Ng, A. (2006). Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, 18, 1.

[26] Fay, D. A., Waxman, A. M., Aguilar, M., Ireland, D. B., Racamato, J. P., Ross, W. D., and Braun, M. I. (2000). Fusion of multi-sensor imagery for night vision: color visualization, target learning and search. In Proceedings of the Third International Conference on Information Fusion, vol. 1, pp. TUD3-3. IEEE.

[27] Fay D.A., Waxman A. M., Aguilar M., Ireland D. B., Racamato J. P., Ross W.D., Streilein W.W., and Braun, M.I. (2000) , Fusion of 2-/3-/4-Sensory Imagery for Visualization, Target Learning and Search, In Proceedings of SPIE, Enhanced and Synthetic Vision, vol. 4023.

[28] Ross, W. D., Waxman, A. M., Streilein W. W., Aguilar, M., Verly, J., Liu, F., Braun, M. I., Harmon, P., and Rak, S. (2000), Multi-Sensor 3D Image Fusion and Interactive Search, In Proceedings of the 3rd International Conference on Information Fusion, vol. 1.

[29] Streilein, W. W., Waxman, A., Ross, W. D., Liu, F., Braun, M., Fay, D., Harmon, P. and Read, C.H. (2000), Fused Multi-Sensor Image Mining for Feature Foundation Data, In Proceedings of the 3rd International Conference on Information Fusion, vol. 1.

[30] Omidvar, O. and P. Van der Smagt, eds. (1997), Neural Systems for Robotics, Academic Press, San Diego, CA.


[31] Cheng, D. and Patel, R.V. (2003), Neural network based tracking control of a flexible macro-micro manipulator system, Neural Networks, 16(2), 272–286.

[32] Dayhoff, J. (1990), Neural Network Architectures: An Introduction, Van Nostrand Reinhold, New York.

[33] Hertz, J. A., Krogh, and R. Palmer (1991), Introduction to the Theory of Neural Computation, Addison Wesley, Redwood City, CA.

[34] Levine, D. (1991), Introduction to Neural and Cognitive Modeling, Lawrence Earlbaum Associates, Hillsdale, NJ.

[35] Gallant, S. (1993), Neural Network Learning and Expert Systems, MIT Press, Cambridge, MA.

[36] Kung, S. Y. (1994), Digital Neural Networks, Prentice Hall, New York.

[37] Haykin, S. (1999), Neural networks: a comprehensive foundation, 2nd ed., Prentice-Hall, Upper Saddle River, NJ.

[38] Ripley, B. (1996), Pattern Recognition and Neural Networks, Cambridge University Press, New York.

[39] O’Reilly, R. and Y. Munakata (2000), Computational Explorations in Cognitive Neuroscience, MIT Press, Cambridge, MA.

[40] Batavia, P., Pomerleau, D. and Thorpe, C. (1996), Applying Advanced Learning Algorithms to ALVINN, (Tech. Report CMU-RI-TR-96-31) Robotics Institute, Carnegie Mellon University.

[41] Baluja, S. (1996), Evolution of an artificial neural network based autonomous land vehicle controller, IEEE Transactions on Systems, Man and Cybernetics, Part B, 26(3), 450–463.

[42] Jochem, T., Pomerleau, D. and Thorpe, C. (1995), Vision Guided Lane Transition, In IEEE Symposium on Intelligent Vehicles, 30–35.

[43] Hancock, J. and Thorpe, C. (1995), ELVIS: Eigenvectors for Land Vehicle Image System, In Proceedings of the International Conference on Intelligent Robots and Systems (IROS '95), 35–40.

[44] Jochem, T., Pomerleau, D. and Thorpe, C. (1995), Vision-Based Neural Network Road and Intersection Detection and Traversal, In IEEE Conference on Intelligent Robots and Systems (IROS '95), 344–349.

[45] Pomerleau, D. (1995), Neural Network Vision for Robot Driving, In M. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA.

[46] Hancock, J. and C. Thorpe (1994), ELVIS: Eigenvectors for Land Vehicle Image System, (Tech. Report CMU-RI-TR-94-43) Robotics Institute, Carnegie Mellon University.


[47] Pomerleau, D. (1994), Defense and Civilian Applications of the ALVINN Robot Driving System, In 1994 Government Microcircuit Applications Conference, 358–362.

[48] Sukthankar, R., Pomerleau, D. and Thorpe, C. (1993), Panacea: An Active Sensor Controller for the ALVINN Autonomous Driving System, (Tech. Report CMU-RI-TR-93-09) Robotics Institute, Carnegie Mellon University.

[49] Jochem, T., Pomerleau, D. and Thorpe, C. (1993), MANIAC: A Next Generation Neurally Based Autonomous Road Follower, In Proceedings of the International Conference on Intelligent Autonomous Systems, 15-18, Pittsburgh, PA.

[50] Pomerleau, D. (1991), Efficient Training of Artificial Neural Networks for Autonomous Navigation, Neural Computation, 3(1), 88–97.

[51] Pomerleau, D. (1992), Progress in Neural Network-based Vision for Autonomous Robot Driving, In Proceedings of the 1992 Intelligent Vehicles Conference, 391–396.

[52] Pomerleau, D. (1993), Neural network perception for mobile robot guidance, Kluwer Academic Publishers, Norwell, MA.

[53] Pomerleau, D. and Touretzky, D.S. (1993), Analysis of Feature Detectors Learned by a Neural Network Autonomous Driving System, In International Conference on Intelligent Autonomous Systems (IAS-3), 572–581.

[54] Pomerleau, D., Thorpe, C., Langer, D., Rosenblatt, J. and Sukthankar, R. (1994), AVCS Research at Carnegie Mellon University, In Proceedings of Intelligent Vehicle Highway Systems, 257–261.

[55] Rosenblum, M. (2000), Neurons that Know How to Drive, In Proc. of the IEEE Intelligent Vehicles Symposium, 556–562, IEEE.

[56] Rosenblum, M. and Davis, L.S. (1996), An Improved Radial Basis Function Network for Visual Autonomous Road Following, IEEE Transactions on Neural Networks, 7(5), 1111–1120.

[57] Rasmussen, C. (2002), Combining laser range, color and texture cues for autonomous road following, In Proc. IEEE Inter. Conf. on Robotics and Automation, 4320–4325, IEEE.

[58] Michalewicz, Z. (1992), Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York.

[59] Man, K.F., Tang, K.S. and Kwong, S. (1999), Genetic Algorithms: Concepts and Designs, Springer, New York.

[60] Juedes, D. and Balakrishnan, K. (1996), Generalized Neural Networks, Computational Differentiation, and Evolution. In Berz, M., Bischof, C., Corliss, G. and Griewank, A. (eds), Computational Differentiation, Applications Techniques and Tools, pp. 273–286, SIAM Press, Philadelphia, PA.

[61] Beer, R. D. and Gallagher, J. C. (1992), Evolving dynamical neural networks for adaptive behavior, Adaptive Behavior, 1(1), 92–122.

[62] Belew, R. (1993), Interposing an ontogenic model between genetic algorithms and neural networks, In Hanson, S. J., Cowan, J. D. and Giles, C. L. (eds), Advances in Neural Information Processing Systems 5, pp. 99–106, Morgan Kaufmann, San Mateo, CA.

[63] Floreano, D. and Mondada, F. (1995), Autonomous and self-sufficient: Emergent homing behaviors in a mobile robot, IEEE Trans. Systems, Man, and Cybernetics - Part B, 26(3), 396–407.

[64] Maniezzo, V. (1994), Genetic evolution of the topology and weight distribution of neural networks, IEEE Trans. Neural Networks, 5(1), 39–53.

[65] Salomon, R. (1991), Improved convergence rate of back propagation with dynamic adaptation of the learning rate, In Proceedings of the First International Conference on Parallel Problem Solving from Nature, 269–273.

[66] Chalmers, D. J. (1990), The Evolution of Learning: An Experiment in Genetic Connectionism, In Proceedings of the 1990 Connectionist Models Summer School, 81–90.

[67] Fonseca, C. and Fleming, P. (1995), An Overview of Evolutionary Algorithms in Multi-Objective Optimization, Evolutionary Computation, 3(1), 1–16.

[68] Horn, J. and Nafpliotis, N. (1993), Multiobjective optimization using the niched Pareto genetic algorithm, (IlliGAL Technical Report 93005) University of Illinois, Urbana-Champaign, IL.

[69] Nolfi, S. and Floreano, D. (2000), Evolutionary robotics: The biology, intelligence and technology of self-organizing machines, MIT Press, Cambridge, MA.

[70] Hopgood, A. A. (2000), Intelligent systems for engineers and scientists, CRC Press, Boca Raton, FL.

[71] Gomi, T., ed. (2000), Evolutionary robotics: From intelligent robots to artificial life, AAI Books, Ottawa, Ontario.

[72] Goldberg, D. E. (2002), The design of innovation: Lessons from and for competent genetic algorithms, Genetic Algorithms and Evolutionary Computation 7, Kluwer Academic, Boston, MA.

[73] Hamner, B., Singh, S. and Scherer, S. (2007), Learning obstacle avoidance parameters from operator behavior, Journal of Field Robotics, 23, 1037–1058.

[74] Heaney, K., Gawarkiewicz, G., Duda, T. and Lermusiaux, P. (2007), Nonlinear optimization of autonomous undersea vehicle sampling strategies for oceanographic data-assimilation, Journal of Field Robotics, 24, 437–448.

[75] Committee on Army Unmanned Ground Vehicle Technology (2003), Technology Development of Army Unmanned Ground Vehicles, The National Academies Press, Washington, DC.

[76] Pomerleau, D. (1994), Defense and Civilian Applications of the ALVINN Robot Driving System, In 1994 Government Microcircuit Applications Conference, 358–362.

[77] Barton, S. A. (2006), Associative Memory in a Recurrent Neural Network, (DRDC Suffield TM 2001-053) Defence R&D Canada – Suffield.

[78] Barton, S.A. (1996), Structure and Convergence Properties of a Recurrent Neural Network, (DRDC Suffield SM-1489) Defence R&D Canada – Suffield.

[79] Barton, S.A. (1998), Techniques for Pattern Classification Using a Convergent Recurrent Neural Network, (DRDC Suffield SR-709) Defence R&D Canada – Suffield.

[80] Batavia, P., Pomerleau, D. and Thorpe C. (1996), Applying Advanced Learning Algorithms to ALVINN, (Tech. Report CMU-RI-TR-96-31) Robotics Institute, Carnegie Mellon University.

[81] Baluja, S. (1996), Evolution of an artificial neural network based autonomous land vehicle controller, IEEE Transactions on Systems, Man and Cybernetics, Part B, 26(3), 450–463.

[82] Jochem, T., Pomerleau, D. and Thorpe, C. (1995), Vision Guided Lane Transition, In IEEE Symposium on Intelligent Vehicles, September, 30–35, IEEE.

[83] Hancock, J. and Thorpe, C. (1995), ELVIS: Eigenvectors for Land Vehicle Image System, In Proceedings of the International Conference on Intelligent Robots and Systems. 'Human Robot Interaction and Cooperative Robots' (IROS '95), vol. 1, 35–40, IEEE.

[84] Jochem, T., Pomerleau, D. and Thorpe, C. (1995), Vision-Based Neural Network Road and Intersection Detection and Traversal, In IEEE Conference on Intelligent Robots and Systems (IROS '95), vol. 3, 344–349, IEEE.

[85] Pomerleau, D. (1995), Neural Network Vision for Robot Driving, In Arbib, M. (ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA.

[86] Hancock, J. and Thorpe, C. (1994), ELVIS: Eigenvectors for Land Vehicle Image System, (Tech. Report CMU-RI-TR-94-43) Robotics Institute, Carnegie Mellon University.

[87] Sukthankar, R., Pomerleau, D. and Thorpe, C. (1993), Panacea: An Active Sensor Controller for the ALVINN Autonomous Driving System, (Tech. Report CMU-RI-TR-93-09) Robotics Institute, Carnegie Mellon University.

[88] Jochem, T., Pomerleau, D. and Thorpe, C. (1993), MANIAC: A Next Generation Neurally Based Autonomous Road Follower, In Proceedings of the International Conference on Intelligent Autonomous Systems, 15-18, Pittsburgh, PA.

[89] Pomerleau, D. (1991), Efficient Training of Artificial Neural Networks for Autonomous Navigation, Neural Computation, 3(1), 88–97.

[90] Pomerleau, D. (1992), Progress in Neural Network-based Vision for Autonomous Robot Driving, In Proceedings of the 1992 Intelligent Vehicles Conference, 391–396.

[91] Pomerleau, D. (1993), Neural network perception for mobile robot guidance, Kluwer Academic Publishers, Norwell, MA.

[92] Pomerleau, D. and Touretzky, D.S. (1993), Analysis of Feature Detectors Learned by a Neural Network Autonomous Driving System, In International Conference on Intelligent Autonomous Systems (IAS-3), 572–581.

[93] Pomerleau, D., Thorpe, C., Langer, D., Rosenblatt, J. and Sukthankar, R. (1994), AVCS Research at Carnegie Mellon University, In Proceedings of Intelligent Vehicle Highway Systems, 257–261.

[94] Touretzky, D., The ALVINN Demos in MATLAB, Department of Computer Science, Carnegie Mellon University.

[95] Haykin, S. (1999), Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice-Hall, Upper Saddle River, NJ.

[96] Ripley, B. (1996), Pattern Recognition and Neural Networks, Cambridge University Press, New York.

[97] O’Reilly, R. and Munakata, Y. (2000), Computational Explorations in Cognitive Neuroscience, MIT Press, Cambridge, MA.

[98] Maniezzo, V. (1994), Genetic evolution of the topology and weight distribution of neural networks, IEEE Trans. Neural Networks, 5(1), 39–53.

[99] Baluja, S. (1996), Evolution of an artificial neural network based autonomous land vehicle controller, IEEE Transactions on Systems, Man and Cybernetics, Part B, 26(3), 450–463.

[100] Barton, S.A. (2000), Recognition and Identification of Objects in IR Feature Images using a Recurrent Neural Network, (CAN contribution to TTCP W7 KTA 7-2 Final Report) Defence R&D Canada – Suffield.

[101] Milford, M. J., Wyeth, G., and Prasser, D. (2004), RatSLAM: A Hippocampal Model for Simultaneous Localization and Mapping, In International Conference on Robotics and Automation, New Orleans, United States.

[102] Milford, M. J. and Wyeth, G. (2008), Mapping a Suburb with a Single Camera using a Biologically Inspired SLAM System, IEEE Transactions on Robotics Special Issue on Visual SLAM, 24(5), 1038–1053.

[103] Dellaert, F., Seitz, S., Thorpe, C., and Thrun, S. (2000), Structure from Motion without Correspondence, In IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[104] Fay, D.A., Waxman, A. M., Aguilar, M., Ireland, D. B., Racamato, J. P., Ross, W.D., Streilein, W.W., and Braun, M.I. (2000), Fusion of Multi-Sensor Imagery for Night Vision: Color Visualization, Target Learning and Search, In Proceedings of the 3rd International Conference on Information Fusion, Vol. 1.

[105] Fay, D.A., Waxman, A.M., Aguilar, M., Ireland, D.B., Racamato, J.P., Ross, W.D., Streilein, W.W., and Braun, M.I. (2000), Fusion of 2-/3-/4-Sensory Imagery for Visualization, Target Learning and Search, In Proceedings of SPIE, Enhanced and Synthetic Vision, vol. 4023.

[106] Ross, W. D., Waxman, A. M., Streilein, W. W., Aguilar, M., Verly, J., Liu, F., Braun, M. I., Harmon, P., and Rak, S. (2000), Multi-Sensor 3D Image Fusion and Interactive Search, In Proceedings of the 3rd International Conference on Information Fusion, vol. 1.

[107] Streilein, W. W., Waxman, A., Ross, W. D., Liu, F., Braun, M., Fay, D., Harmon, P. and Read, C.H. (2000), Fused Multi-Sensor Image Mining for Feature Foundation Data, In Proceedings of the 3rd International Conference on Information Fusion, vol. 1.

[108] Chiarella, M., Fay, D., Ivey, R., Bomberger, N. and Waxman, A. (2004), Multisensor Image Fusion, Mining, and Reasoning - Rule Sets for Higher-Level AFE in a COTS Environment, In Proceedings of the 7th International Conference on Information Fusion, 983–990.

List of symbols/abbreviations/acronyms/initialisms

ALVINN Autonomous Land Vehicle In a Neural Network

ANN Artificial Neural Network

BP Backpropagation

EA Evolutionary Algorithm

GA Genetic Algorithm

GPPE Gaussian Peak Position Error

PBIL Population-Based Incremental Learning

RNN Recurrent Neural Network

SFM Structure from Motion

SSE Sum Squared Error

UGV Unmanned Ground Vehicle

DND Department of National Defence

DRDC Defence Research & Development Canada

DRDKIM Director Research and Development Knowledge and Information Management

R&D Research & Development

DOCUMENT CONTROL DATA (Security classification of title, body of abstract and indexing annotation must be entered when the overall document is classified)

1. ORIGINATOR (The name and address of the organization preparing the document. Organizations for whom the document was prepared, e.g. Centre sponsoring a contractor's report, or tasking agency, are entered in section 8.)

Defence R&D Canada – Suffield, P.O. Box 4000, Station Main, Medicine Hat, Alberta T1A 8K6

2. SECURITY CLASSIFICATION (Overall security classification of the document including special warning terms if applicable.)

UNCLASSIFIED (NON-CONTROLLED GOODS) DMC A REVIEW: GCEC June 2010

3. TITLE (The complete document title as indicated on the title page. Its classification should be indicated by the appropriate abbreviation (S, C or U) in parentheses after the title.)

Case Studies on Learning and Control Architectures for Autonomous Systems

4. AUTHORS (last name, followed by initials – ranks, titles, etc. not to be used)

Cheng, D.X.P.

5. DATE OF PUBLICATION (Month and year of publication of document.)

January 2013

6a. NO. OF PAGES (Total containing information, including Annexes, Appendices, etc.)

70

6b. NO. OF REFS (Total cited in document.)

108

7. DESCRIPTIVE NOTES (The category of the document, e.g. technical report, technical note or memorandum. If appropriate, enter the type of report, e.g. interim, progress, summary, annual or final. Give the inclusive dates when a specific reporting period is covered.)

Technical Memorandum

8. SPONSORING ACTIVITY (The name of the department project office or laboratory sponsoring the research and development – include address.)

Defence R&D Canada – Suffield, P.O. Box 4000, Station Main, Medicine Hat, Alberta T1A 8K6

9a. PROJECT OR GRANT NO. (If appropriate, the applicable research and development project or grant number under which the document was written. Please specify whether project or grant.)

9b. CONTRACT NO. (If appropriate, the applicable number under which the document was written.)

10a. ORIGINATOR'S DOCUMENT NUMBER (The official document number by which the document is identified by the originating activity. This number must be unique to this document.)

DRDC Suffield TM 2013-059

10b. OTHER DOCUMENT NO(s). (Any other numbers which may be assigned this document either by the originator or by the sponsor.)

11. DOCUMENT AVAILABILITY (Any limitations on further dissemination of the document, other than those imposed by security classification.)

Unlimited

12. DOCUMENT ANNOUNCEMENT (Any limitation to the bibliographic announcement of this document. This will normally correspond to the Document Availability (11). However, where further distribution (beyond the audience specified in (11)) is possible, a wider announcement audience may be selected.)

Unlimited

13. ABSTRACT (A brief and factual summary of the document. It may also appear elsewhere in the body of the document itself. It is highly desirable that the abstract of classified documents be unclassified. Each paragraph of the abstract shall begin with an indication of the security classification of the information in the paragraph (unless the document itself is unclassified) represented as (S), (C), (R), or (U). It is not necessary to include here abstracts in both official languages unless the text is bilingual.)

This report reviews some of the established learning and control architectures that have been applied or have potential to apply to autonomous systems, with an emphasis on their potential for military applications. In particular, techniques of reinforcement learning, neural network based learning, and genetic algorithms are reviewed with respect to the key progress made and main problems to be addressed in each of the research fields. To illustrate implementation of the learning approaches for autonomous systems, three cases are studied: Autonomous Land Vehicle in Neural Networks (ALVINN), evolutionary approaches for training ALVINN, and pattern recognition with recurrent neural networks for autonomous systems. Strengths, limitations, and potential of the learning techniques are reviewed and discussed for future development from the perspective of autonomous systems application.

14. KEYWORDS, DESCRIPTORS or IDENTIFIERS (Technically meaningful terms or short phrases that characterize a document and could be helpful in cataloguing the document. They should be selected so that no security classification is required. Identifiers, such as equipment model designation, trade name, military project code name, geographic location may also be included. If possible keywords should be selected from a published thesaurus, e.g. Thesaurus of Engineering and Scientific Terms (TEST) and that thesaurus identified. If it is not possible to select indexing terms which are Unclassified, the classification of each should be indicated as with the title.)

Autonomous Systems, Unmanned Vehicles, Reinforcement Learning, Neural Networks, Genetic Algorithms

Dans le présent rapport, on examine quelques-unes des architectures d’apprentissage et de contrôle établies qui ont été appliquées, ou qui pourraient être appliquées, à des systèmes autonomes, l’accent étant mis sur le potentiel en matière d’applications militaires. En particulier, on examine les principaux progrès accomplis et les principaux problèmes subsistants dans chacun des champs de recherche suivants : techniques d’apprentissage par renforcement, apprentissage par réseau neuronal, et algorithmes génétiques. Pour illustrer la mise en œuvre des méthodes d’apprentissage pour les systèmes autonomes, on a étudié trois cas : le véhicule terrestre autonome dans les réseaux neuronaux (ALVINN), les méthodes évolutionnaires pour entraîner l’ALVINN, et la reconnaissance des formes avec les réseaux neuronaux récurrents pour les systèmes autonomes. On considère les forces, les limites et le potentiel de ces techniques d’apprentissage et on en parle dans le contexte de leur futur développement pour l’application aux systèmes autonomes.