


Autonomous Quadrotor Control with Reinforcement Learning

    Michael C. Koval

    [email protected]

    Christopher R. Mansley

    [email protected]

    Michael L. Littman

    [email protected]

    Abstract

Based on the same principles as a single-rotor helicopter, a quadrotor is a flying vehicle that is propelled by four horizontal blades surrounding a central chassis. Because of this vehicle's symmetry and propulsion mechanism, a quadrotor is capable of simultaneously moving and steering by simple modulation of motor speeds [1]. This stability and relative simplicity make quadrotors ideal for research in the application of control theory and artificial intelligence to aerial robotics [3]. Most prior work using quadrotors has applied low-level, manually-tuned control algorithms to complete specific tasks. This paper proposes an alternate approach for controlling a quadrotor through the application of continuous state-action space reinforcement learning algorithms, making use of the Parrot AR.Drone's rich suite of on-board sensors and the localization accuracy of the Vicon motion tracking system. With such high-quality state information, a reinforcement learning algorithm should be capable of quickly learning a policy that maps the quadrotor's physical state to the low-level velocity parameters that are used to control the quadrotor's four motors. Once learning is complete, this policy will encode the information necessary to repeatably and accurately perform the desired high-level action without ever requiring a programmer to manually split the action into smaller components.

    1 Introduction

Programming a complex robotic system has typically required a large amount of time from an interdisciplinary team of computer scientists, electrical engineers, and control theorists. This broad set of skills is required largely because a single high-level action (such as "catch the ball") requires thousands of decisions at different points in time based on the robot's sensor input and position. Reinforcement learning promises to simplify the otherwise time-consuming process of programming a robot's desired behavior by allowing the programmer to specify what action the robot should perform without ever detailing how it should perform the action [7]. This abstraction is especially beneficial when controlling a robot with multiple actuators, such as an aerial robot moving through a dynamic environment.

Quadrotor helicopters are one form of aerial vehicle that offers an excellent balance between maneuverability and stability. Similar in design to a traditional helicopter, a quadrotor is supported by four horizontally-oriented blades surrounding a central chassis that houses the vehicle's electronics and, potentially, its payload. By using four horizontal blades instead of one horizontal blade and one vertical blade, a quadrotor is capable of moving, strafing, and turning in three-dimensional space by altering the relative speeds of its rotors [1]. The specific method of modulating motor outputs to achieve this type of motion is discussed in Section 4.

Even with its relatively simple design and relative stability, applying reinforcement learning to quadrotor control is a non-trivial problem. Unlike the discrete problems considered in introductory reinforcement learning texts, a quadrotor's state is a function of its position, velocity, and acceleration: continuous variables that do not lend themselves to quantization. Similarly, the robot's actions are formed from a continuum of possible motor outputs. Moving from a discrete state space to a continuous state space greatly increases the difficulty of learning an optimal policy and requires that one either quantize the state space or use a reinforcement learning algorithm designed to learn in continuous state-action spaces [6].

Figure 1: Architecture of the ROS nodes used in this project (joy, ardrone_joy, ardrone_driver, Vicon Driver, Kalman Filter, Learning). Broken lines indicate unimplemented features.

Regardless of the specific reinforcement learning algorithm selected, learning a mapping between states and actions is only as effective as the robot's state information. Because the quadrotor operates at high speeds and some tasks require fast response times, it is important to have a high-fidelity, high-sample-rate estimate of the quadrotor's state. This paper proposes that sufficiently high-quality data can be obtained by fusing data measured from onboard inertial sensors with the high-accuracy localization provided by a Vicon motion capture system.

    2 Reinforcement Learning

Reinforcement learning is a subfield of machine learning in which an agent must learn an optimal behavior by interacting with, and receiving feedback from, a stochastic environment. Unlike supervised learning, where the agent is given knowledge of how it should perform a task, reinforcement learning uses reward and punishment to teach the agent what the task is [2]. This approach is particularly useful when the agent is interacting with a large or highly dynamic environment that cannot be completely explored, such as when a robot is operating in a previously unknown environment [7].

In the standard model of reinforcement learning, the environment is modeled as a Markov Decision Process (MDP) with a discrete set of states, actions, and a transition model. Using this model, the discrete set of states is connected by a collection of actions. In each state the agent performs one action, stochastically transitions to a new state, and receives a reward as dictated by the environment. Defined using this framework, the goal of reinforcement learning can be formally expressed as solving for a policy π : S → A(s) that, when followed, maximizes the agent's expected cumulative reward [8].

Traditionally, reinforcement learning algorithms have been categorized by whether they learn the transition model of the environment as an intermediate step towards learning the optimal policy. Model-based methods use the agent's interaction with the environment to first learn a model of the environment, then use that model to derive an optimal policy. Conversely, model-free algorithms directly learn the value of each state-action pair without ever learning an explicit model of the environment [2].
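To make the model-free idea concrete, the following is a minimal tabular Q-learning sketch (an illustration under the assumption of a small, discrete MDP, not code used in this project): the learner stores one value per state-action pair and updates it directly from observed rewards, never building a transition model.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch (illustrative only): a model-free learner that
# updates the value of each state-action pair directly from observed rewards,
# without ever building an explicit transition model.
class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)   # Q-values indexed by (state, action)
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        # Epsilon-greedy exploration over the discrete action set.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```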

    3 Robot Operating System

Willow Garage's Robot Operating System, ROS, is a framework for finding, building, and seamlessly running robot control code across multiple computers. In essence, ROS is a common API for publishing and subscribing to data in a standard format [5]. Standardizing the format for data transmission and the interaction between nodes has allowed ROS to accumulate a large repository of drivers, libraries, and robotics algorithms. In addition to granting access to already-written code, this framework is a useful way of formally defining the interactions between the various tasks discussed in this paper (see Figure 1).
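As a sketch of the publish/subscribe API that ROS exposes (illustrative only; the node and topic names here are hypothetical, not the project's), a minimal rospy node might subscribe to a velocity estimate and publish a velocity command:

```python
#!/usr/bin/env python
# Minimal sketch of the ROS publish/subscribe pattern (hypothetical topic names).
import rospy
from geometry_msgs.msg import Twist

def on_velocity(msg):
    # Callback invoked whenever a new velocity estimate arrives.
    cmd = Twist()
    # Example policy: command the opposite of the measured drift (illustrative only).
    cmd.linear.x = -msg.linear.x
    cmd.linear.y = -msg.linear.y
    cmd_pub.publish(cmd)

if __name__ == '__main__':
    rospy.init_node('example_controller')
    cmd_pub = rospy.Publisher('cmd_vel', Twist, queue_size=10)
    rospy.Subscriber('velocity_estimate', Twist, on_velocity)
    rospy.spin()  # Hand control to ROS until the node is shut down.
```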

Interfacing directly with the AR.Drone over 802.11 Wi-Fi, a modified version of Brown University's ardrone_driver is responsible for converting ROS messages to and from Parrot's custom packet format. Most importantly, this is the node that allows ROS to control the drone by specifying its desired linear and angular velocities. In addition to control, this node converts the drone's raw sensor readings into standard ROS messages that can easily be used and interpreted by the other ROS nodes involved in the project. See Section 4.2 for a detailed discussion of the integration of Parrot's official SDK into a ROS node for communication with the quadrotor.

Because these sensors return measurements relative to the quadrotor's coordinate frame, they are unable to accurately localize the quadrotor relative to a global coordinate frame. This information is crucial for building an accurate representation of the robot's state and is a prerequisite for effective learning to occur. This limitation demands that the quadrotor's position be measured by a suite of external sensors, each of which has a known position relative to the world. Because of its mature software framework, high accuracy, and unparalleled sample rate of 125 Hz, the Vicon Motion Tracking System was selected for this task.

While the Vicon system provides an extremely accurate measurement of the drone's spatial coordinates, it can be further refined using the inertial measurements collected by the drone's on-board sensors (see Section 4.1). Effective fusion of these two estimates of the drone's position will be achieved through direct application of an extended Kalman filter [9]. This filtered output can then be used directly as state information for a reinforcement learning algorithm, as previously discussed in Section 2.
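The project proposes an extended Kalman filter for this fusion [9]; as a much simpler illustration of the predict/correct structure involved, the sketch below implements a one-axis linear Kalman filter in which the IMU's acceleration drives the prediction step and the Vicon position measurement drives the correction step. The matrices and noise values are assumptions for the example, not tuned parameters from the project.

```python
import numpy as np

# Minimal 1-D constant-acceleration Kalman filter sketch (illustrative, not the
# project's filter): the IMU acceleration drives the prediction step and the
# Vicon position measurement drives the correction step.
dt = 1.0 / 125.0                       # Vicon sample period (125 Hz)
F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition for [position, velocity]
B = np.array([[0.5 * dt**2], [dt]])    # control matrix for measured acceleration
H = np.array([[1.0, 0.0]])             # Vicon observes position only
Q = 1e-4 * np.eye(2)                   # process noise (assumed)
R = np.array([[1e-6]])                 # Vicon measurement noise (assumed)

def kalman_step(x, P, accel, z):
    """One predict/update cycle given IMU acceleration accel and Vicon position z."""
    # Predict using the IMU's acceleration measurement.
    x = F @ x + B * accel
    P = F @ P @ F.T + Q
    # Correct using the Vicon position measurement.
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

x = np.zeros((2, 1))                    # initial state estimate [position; velocity]
P = np.eye(2)                           # initial covariance
x, P = kalman_step(x, P, accel=0.05, z=0.001)
```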

    Figure 2: Parrot AR.Drone quadrotor in flight.

    4 Parrot AR.Drone

The AR.Drone is an inexpensive, mass-produced quadrotor helicopter designed and manufactured for consumer use by Parrot Corporation. As discussed in Section 1, a quadrotor features four horizontally-oriented rotors evenly spaced around a central chassis and is much more stable than traditional flying vehicles. Modulating the relative speeds of a quadrotor's four motors adjusts the net force and torque applied to the robot and allows it to move through space.

To remain in a stationary hover, the net force and torque on the quadrotor must both be zero. This is achieved by driving each pair of opposite motors in the same direction at equal speeds or, equivalently, by driving adjacent motors at equal speeds in opposite directions. Any deviation from this stable configuration will cause translation or rotation to occur in the plane of the quadrotor.

Slowing two adjacent rotors by equal amounts unbalances the thrust and tilts the drone, while the net yaw torque remains zero because the two adjacent rotors spin in opposite directions and their reaction torques cancel (Figure 3a). This configuration allows the drone to make translational motion with no rotation about the drone's center. Conversely, decreasing the speed of two opposing motors keeps the thrust symmetric while perturbing the net yaw torque. This, as expected, causes the drone to rotate about its center with zero translational velocity (Figure 3b). More complex modulations of motor speed allow one to induce simultaneous translation and rotation by combining these two primitive actions [1].


Figure 3: Movement through motor speed modulation: (a) translational motion, (b) in-place rotation. Smaller arrows indicate slower motor speeds.
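The motor-mixing argument above can be made concrete with a small numerical sketch. The model below assumes a simplified "plus" rotor layout with thrust proportional to commanded speed; the constants and the function itself are illustrative, not the AR.Drone's actual mixing firmware.

```python
# Simplified "plus"-configuration mixer sketch (illustrative, not the AR.Drone's
# firmware): motors 1 and 3 spin clockwise, motors 2 and 4 counter-clockwise.
# Per-rotor thrust is taken as proportional to commanded speed for simplicity.

def net_wrench(w1, w2, w3, w4, k_thrust=1.0, k_drag=0.1, arm=0.2):
    """Net thrust, roll/pitch torques, and yaw torque for four rotor speeds."""
    thrust = k_thrust * (w1 + w2 + w3 + w4)   # total upward force
    roll   = arm * k_thrust * (w2 - w4)       # thrust imbalance across one axis
    pitch  = arm * k_thrust * (w3 - w1)       # thrust imbalance across the other axis
    yaw    = k_drag * (w1 - w2 + w3 - w4)     # reaction torques of CW vs CCW rotors
    return thrust, roll, pitch, yaw

# Hover: all four rotors at equal speed -> zero roll, pitch, and yaw torque.
print(net_wrench(1.0, 1.0, 1.0, 1.0))
# Slow two adjacent rotors (1 and 2): a roll/pitch torque appears, yaw stays zero.
print(net_wrench(0.8, 0.8, 1.0, 1.0))
# Slow two opposite rotors (1 and 3): roll and pitch stay zero, yaw torque appears.
print(net_wrench(0.8, 1.0, 0.8, 1.0))
```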

    4.1 Sensing Capabilities

Despite the quadrotor's conceptual simplicity, modulating four motor speeds to control the vehicle's motion is too difficult for direct human teleoperation. To assist the driver in managing this complexity, the AR.Drone features an array of sensors that are used as the input to on-board control algorithms. These algorithms translate commands in the form of rotational and translational velocities into modulated motor speeds. To assist the driver in this manner, the following sensors are present on the AR.Drone [4]:

1. Two Ultrasonic Altitude Sensors

2. Forward-Facing Camera (640×480 pixels, 93° wide-angle lens, 15 Hz)

3. Downward-Facing Camera (176×144 pixels, 64° wide-angle lens, 60 Hz)

4. Six-DoF Inertial Measurement Unit (IMU)

Reading the IMU yields a gyroscope measurement of angular velocity, an accelerometer measurement of linear acceleration, and a composite estimate of the unit's pose. Similarly, directly reading from the drone's downward-facing ultrasonic range sensors yields the drone's altitude, and applying optical flow techniques to the downward-facing camera produces an accurate measurement of translational velocity. Note that, as discussed in Section 3, all of these measurements are relative to the robot's coordinate frame and provide no information about the global reference frame.
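As a small illustration of what "relative to the robot's coordinate frame" means in practice (not code from the paper), the sketch below rotates a body-frame velocity into the world frame given a unit orientation quaternion; even with this rotation, such measurements can only be integrated over time and will drift without an external position reference such as the Vicon system.

```python
import numpy as np

def body_to_world(q, v_body):
    """Rotate a body-frame vector into the world frame.

    q is a unit quaternion (w, x, y, z) describing the drone's orientation and
    v_body is, for example, the optical-flow velocity in the body frame.
    """
    w = q[0]
    u = np.asarray(q[1:])
    v = np.asarray(v_body)
    # Standard expansion of the rotation q * v * q_conjugate.
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

# Example: a pure forward velocity of 0.5 m/s with the drone yawed 90 degrees
# about z becomes a velocity along the world y-axis.
yaw_90 = (np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4))
print(body_to_world(yaw_90, [0.5, 0.0, 0.0]))   # approximately [0, 0.5, 0]
```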

    4.2 ROS Integration

While there is not a complete ROS node available that publishes all of the AR.Drone's internal sensor readings as ROS topics, a research group at Brown University developed a thin wrapper for Parrot's official software development kit that allows for basic teleoperation of the AR.Drone through standard ROS topics and services.

This node was extended to publish the full breadth of sensor data discussed in Section 4.1 as standard ROS messages such as Pose, Twist, and Imu (a message from the standard sensor_msgs package). Using multiple standard messages instead of a single custom message improves the re-usability of the ardrone_driver node at the cost of slightly increased overhead. Using this convention, the modified ROS node publishes the following topics:

battery: battery life as a percentage

imu: orientation, linear acceleration, and angular velocity (the topic used to publish Imu messages)

pose: orientation and altitude

twist: linear and angular velocity

image_raw: forward-facing camera

All of these messages are designed to conform to REP-103, the ROS Enhancement Proposal (REP) that defines the unit and coordinate-frame standards that the built-in ROS nodes follow. Specifically, this REP states that distance is expressed in meters, velocity in m/s, acceleration in m/s², and rotations as quaternions. All of these measurements are expressed relative to a right-handed coordinate frame fixed to the origin of the robot, where the positive x-axis is forward, the positive y-axis is to the left, and the positive z-axis is up when the robot is in its resting orientation.
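As a sketch of what publishing one of these REP-103-conformant topics looks like in rospy (illustrative placeholder values; the actual driver fills these fields from the AR.Drone's navdata), a node might publish sensor_msgs/Imu messages as follows:

```python
import rospy
from sensor_msgs.msg import Imu

# Sketch of publishing a REP-103-compliant Imu message (hypothetical values;
# a real driver would fill these fields from the AR.Drone's navdata packets).
rospy.init_node('imu_publisher_example')
pub = rospy.Publisher('imu', Imu, queue_size=10)
rate = rospy.Rate(50)  # publish at 50 Hz

while not rospy.is_shutdown():
    msg = Imu()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = 'base_link'     # body-fixed frame: x forward, y left, z up
    msg.orientation.w = 1.0               # identity orientation (placeholder)
    msg.angular_velocity.z = 0.0          # rad/s about the body z-axis (placeholder)
    msg.linear_acceleration.z = 9.81      # m/s^2 (placeholder value at rest)
    pub.publish(msg)
    rate.sleep()
```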



4.3 Control Algorithms

Using its full repertoire of on-board sensors, the AR.Drone runs a suite of control algorithms to keep itself in a stable hover. These control algorithms use the optical flow velocities calculated from the downward-facing camera to minimize the drone's translational velocity and, theoretically, cause it to hover over a fixed point relative to the ground. Empirically, these control algorithms are fairly effective at keeping the drone at a constant altitude, but less effective at stopping translational drift.

This drift is quantitatively confirmed by the optical flow velocity data plotted in Figure 4. In this plot the take-off and landing occur at 3 and 9 seconds, respectively, and the time in between represents an unaltered hover. During the time in flight the controls were not touched by a human operator, establishing the drone's baseline drift. The drift observed qualitatively above is visible in the plot as a large negative bias in the drone's forward velocity. Because this drift is apparent both to an outside observer and in the optical flow velocities, a reinforcement learning algorithm or an improved control algorithm should be capable of correcting for it when controlling the quadrotor.
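A simple way to quantify this drift from a velocity log (a hypothetical helper, not the analysis code used for Figure 4) is to average the forward velocity over the in-flight window:

```python
import numpy as np

# Illustrative check of hover drift (hypothetical log format): given timestamped
# optical-flow velocity samples, estimate the bias during the in-flight window.
def hover_bias(times, forward_vel, t_takeoff=3.0, t_land=9.0):
    """Mean and standard deviation of forward velocity between take-off and landing."""
    times = np.asarray(times)
    forward_vel = np.asarray(forward_vel)
    in_flight = (times > t_takeoff) & (times < t_land)
    return forward_vel[in_flight].mean(), forward_vel[in_flight].std()
```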

    5 Conclusion

Quadrotor helicopters, as discussed in Section 1, are a much more stable platform for artificial intelligence research than traditional helicopters and planes. This stability has been exploited by control theorists to obtain impressive results in specific situations [3], but such research does not simplify the process of programming the quadrotor to respond in novel situations. That task closely matches the goal of reinforcement learning described in Section 2: to train an agent by telling it what the desired goal is instead of how to reach it [2].

Thanks to the large library of pre-existing code available for ROS and its common message-passing interface, the experimental setup for reinforcement learning requires a minimal amount of custom software. In particular, Brown University's ardrone_driver node provides a ROS interface to Parrot's official SDK and allows common ROS messages to control the drone. With the addition of custom modifications (discussed in Section 4.2), this node also publishes the data collected from the drone's cameras, distance sensors, and IMU. Finally, the inertial data returned by this node will be combined with the position data returned by the Vicon Motion Tracking System using an extended Kalman filter and used to obtain an extremely accurate estimate of the drone's state [9].

Using this information as the drone's state and its motor outputs as actions, the problem of training the quadrotor to perform a high-level action takes the same form as the MDPs discussed in Section 2. Future work will focus on using a continuous state-action reinforcement learning algorithm to learn complex behavior more quickly than would be possible through manual programming.

    6 Future Work

Because reinforcement learning algorithms are only as good as the accuracy and timeliness of their knowledge of the world, it is first necessary to obtain accurate measurements of the drone's position, velocity, and acceleration in three-dimensional space. While the customized ROS node discussed in Section 4.2 provides information about the drone's velocity and acceleration, it is unable to provide information about the drone's spatial coordinates relative to a fixed reference frame. Using the ROS node for the Vicon Motion Tracking System developed by researchers at the University of California, Berkeley, the next step towards a fully operational experimental setup is to set up and calibrate the Vicon motion tracking system to track the drone's position and orientation.

Figure 4: Translational velocity (m/s) of the AR.Drone over time (s) while in a stable hover, as measured by optical flow; the two panels show forward velocity and strafing velocity.

This measured pose will then be combined with the drone's on-board sensor readings to produce a single estimate of the quadrotor's position, orientation, velocity, and acceleration with an extended Kalman filter [9]. This vector of data (or, perhaps, a subset of it) will then be used as state information for a reinforcement learning algorithm. At each instant in time, the quadrotor's action is a vector of desired linear and angular velocities, which are interpreted by the AR.Drone's built-in control software. Given the appropriate reward function and an ample amount of data, a reinforcement learning algorithm designed for continuous state-action spaces should be capable of relatively quickly learning high-level actions similar to those achieved by pure control theory.
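As one example of what "the appropriate reward function" could look like for a basic hovering task (an assumption for illustration, not a reward design from this project), the reward might penalize distance from a target position, residual drift, and large velocity commands:

```python
import numpy as np

# Hypothetical reward for a "hover at a target point" task: the state is the
# fused [position, velocity] estimate and the action is the commanded velocity.
def hover_reward(position, velocity, action, target=np.zeros(3)):
    position_error = np.linalg.norm(position - target)   # distance from the target (m)
    speed = np.linalg.norm(velocity)                      # residual drift (m/s)
    effort = np.linalg.norm(action)                       # magnitude of the command
    # Negative cost: being close to the target and nearly stationary is rewarded,
    # while large corrective commands are mildly penalized.
    return -(position_error + 0.1 * speed + 0.01 * effort)
```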

One such application of this learning would be to train a quadrotor to perform adaptive, high-performance maneuvers similar to those demonstrated by control theorists at the Swiss Federal Institute of Technology Zurich. Examples of these "tricks" include performing flips, catching a ball in a cup, or juggling a light ball using a modified chassis [3]. With the appropriate equipment and research investment, reinforcement learning could replace or reduce the extensive manual tuning required to learn new maneuvers.

    References

[1] G.M. Hoffmann, H. Huang, S.L. Waslander, and C.J. Tomlin. Quadrotor helicopter flight dynamics and control: Theory and experiment. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, AIAA 2007-6461, 2007.

[2] L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: A survey. arXiv preprint cs/9605103, 1996.

[3] S. Lupashin, A. Schollig, M. Sherback, and R. D'Andrea. A simple learning strategy for high-speed quadrocopter multi-flips. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 1642-1648. IEEE, 2010.

[4] Stephane Piskorski. AR.Drone Developer's Guide, SDK 1.5. Parrot, 2010.

[5] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng. ROS: an open-source Robot Operating System. In International Conference on Robotics and Automation, 2009.

[6] J.C. Santamaría, R.S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163, 1997.

[7] W.D. Smart and L.P. Kaelbling. Reinforcement learning for robot control. Mobile Robots XVI (Proc. SPIE 4573), 2001.

[8] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[9] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents series). The MIT Press, 2005.
