
Motor Learning at Intermediate Reynolds Number: Experiments with Policy Gradient on a Heaving Plate

John W. Roberts, Jun Zhang and Russ Tedrake

Abstract This work describes the development of a model-free reinforcement learning-based control methodology for the heaving plate, a laboratory experimental fluid system that serves as a model of flapping flight. Through an optimized policy gradient algorithm, we were able to demonstrate rapid convergence (requiring less than 10 minutes of experiments) to a stroke form which maximized the propulsive efficiency of this very complicated fluid-dynamical system. This success was due in part to an improved sampling distribution and carefully selected policy parameterization, both motivated by a formal analysis of the signal-to-noise ratio of policy gradient algorithms. The resulting optimal policy provides insight into the behavior of the fluid system, and the effectiveness of the learning strategy suggests a number of exciting opportunities for machine learning control of fluid dynamics.

1 Introduction

The possible applications of robots that swim and fly are myriad, and include agile UAVs, high-maneuverability AUVs and biomimetic craft such as ornithopters and robotic fish. However, controlling robots whose behavior is heavily dependent upon their interaction with a fluid can be particularly challenging. Whereas in many regimes robots are able to make use of either accurate dynamical models or easy-to-stabilize dynamics, in the case of flapping and swimming robots neither of these conditions applies. The models available to swimming and flying robots tend to be very limited in their region of validity (such as quasi-steady models of fixed-wing aircraft), very expensive to evaluate (such as direct numerical simulation of the governing equations) or almost completely unavailable (as is the case with interactions between a complex flow and a compliant body).

Here we investigate the problem of designing a controller for a specific experimental system: the heaving plate (see Figure 1). This setup is a model of forward flapping flight developed by the Applied Math Lab (AML) of the Courant Institute of Mathematical Sciences at New York University (Vandenberghe et al., 2004; Vandenberghe et al., 2006). This system consists of a symmetric rigid horizontal plate that is driven up and down along its vertical axis and is free to rotate in the horizontal plane. Previous work demonstrated that driving the vertical motion with a sinusoidal waveform caused the system to begin rotating in a stable “forward flight” (Vandenberghe et al., 2004). Here we will investigate a richer class of waveforms in an attempt to optimize the efficiency of that forward flight. This system is an excellent candidate for control experiments as it is perhaps the simplest experimental model of flapping flight, is experimentally convenient for learning experiments1, and captures the essential qualities of the rich fluid dynamics in fluid-body interactions at intermediate Reynolds numbers. While accurate Navier-Stokes simulations of this system do exist for the case of a rigid wing (Alben & Shelley, 2005), they require a dramatic amount of computation2. As such, model-based design of an effective controller for this system is a daunting task.

Fig. 1: (A) Schematic of the experimental flapping system. The white arrow shows the driven vertical motion determined by the controller, while the black arrow shows the passive rotational motion resulting from the fluid forces. Figure from (Vandenberghe et al., 2006), with slight modifications. (B) Original experimental flapping system. The wing shown is an earlier design used to study the fluid system, while the wing used for this work was a simple rigid rectangle.

Model-free reinforcement learning algorithms, however, offer an avenue by which controllers can be designed for this system despite the paucity of the available models, by evaluating the effectiveness of a controller directly on the physical system (Peters et al., 2003; Tedrake et al., 2004). In fact, learning through experimentally collected data can be far more efficient than learning via simulation in these systems, as high-fidelity simulations can take orders of magnitude longer than an experiment to obtain the same data. One of the limitations of this experimental approach, however, is the potential difficulty in directly measuring the state of the fluid, which, naively, is infinite-dimensional (a continuum model). Therefore, the problem is best formulated as a partially-observable Markov decision process (POMDP) (Kaelbling et al., 1998). In the experiment described here, we solve the POMDP with a pure policy gradient approach, choosing not to attempt to approximate a value function due to our poor understanding of even the dimensionality of the fluid state.

1 Particularly when compared to a flapping machine that falls out of the sky repeatedly during control design experiments.
2 At the time, this simulation required approximately 36 hours to simulate 30 flaps on specialized hardware (Shelley, 2007).

The difficulty in applying policy gradient techniques to physical systems stems from the fact that model-free algorithms often suffer from high variance and relatively slow convergence rates (Greensmith et al., 2004), resulting in the need for many evaluations. As these same systems tend to have a high cost of policy evaluation, much work has been done on maximizing the policy improvement from any individual evaluation (Meuleau et al., 2000; Williams et al., 2006). Techniques such as Natural Gradient (Amari, 1998; Peters et al., 2003) and GPOMDP (Baxter & Bartlett, 2001) have become popular through their ability to converge on locally optimal policies using fewer policy evaluations.

During our experiments with policy gradient algorithms on this system, we developed a number of optimizations to the vanilla policy gradient algorithm which provided significant benefits to learning performance. The most significant of these is the use of non-Gaussian sampling distributions on the parameters, a technique which is appropriate for systems with parameter-independent additive noise (a common model for sensor-driven observation noise). We make these observations concrete by formulating a signal-to-noise ratio (SNR) analysis of the policy gradient algorithms, and demonstrate that our modified sampling distributions improve the SNR.

With careful experiments and improvements to the policy gradient algorithms, we were able to reliably optimize the stroke form (i.e., the periodic vertical trajectory of the plate through time) of the heaving plate for propulsive efficiency, converging to the same optima from different initial conditions (in policy space) in less than ten minutes of trial-and-error experiments on the system. The remainder of this chapter details these experiments and the theory behind them.

2 Experimental Setup

The heaving plate is an ideal physical system to use as a testbed for developing control methodologies for the broader class of problems involving fluid-robot interactions. The system consists of a symmetric rigid titanium plate pinned in the horizontal plane. It is free to rotate about the point at which it is pinned with very little non-fluid-related friction. The control input is the vertical position of the plate z (note that the input is the kinematic position, not the force on the plate). This vertical driven flapping motion is coupled through the fluid with the passive angular rotation such that once the system begins to flap, fluid forces cause it to spin. To ensure that the rotational dynamics are dominated by fluid forces, and not friction due to the slip ring and bearings used to mount the wing, the decay in angular speed was measured for both a submerged and non-submerged wing. Figure 2 shows the result of this experiment, demonstrating that the fluid forces are far more significant than the bearing friction in the regime of operation.

Fig. 2: Comparison of fluid and bearing friction for the large rigid wing. The foil (not undergoing a heaving motion) was spun by hand in both air and water, with the angular velocity measured as a function of time. Five curves for each case were truncated to begin at the same speed, then averaged and filtered with a zero-phase low-pass filter to generate these plots. The quick deceleration of the wing in water as compared to air indicates that fluid viscous losses are much greater than bearing losses. At the speeds achieved at the convergence of the learning, the frictional forces in water were over six times those seen in air.

The only measurements taken were the vertical position of the plate (z) and the angular position of the plate (x), obtained by optical encoders, and the axial force applied to the plate (Fz), obtained by an analog tension-compression load cell. While these measurements were sufficient for formulating interesting optimization problems, note that the fluid was never measured directly. Indeed, the hidden state of the fluid, including highly transient dynamic events such as the vortex pairs created on every flapping cycle, provides the primary mechanism of thrust generation in this regime (Vandenberghe et al., 2004). Sensing the state of the flow is possible using local flow sensors or even real-time far-field optical flow measurement (Bennis et al., 2008), but is experimentally more complex. However, this work demonstrates that such sensing is not necessary for the purpose of learning to move effectively in the fluid, despite the critical role the fluid state plays in the relevant dynamics.


The setup was originally used to study the fluid dynamics of forward flapping flight, and possesses a Reynolds number of approximately 16,000, putting it in the same regime as dragonflies and other small biological flapping fliers3. The plate is unalloyed titanium (chosen for its corrosion resistance), 72 cm long, 5 cm wide and 0.3175 cm thick. The vertical motion was produced by a scotch yoke, which converted the rotational motion of the motor into the desired linear driven motion of the plate. Due to the high gear ratio between the motor and the scotch yoke, the system was not back-drivable, and thus the plate was controlled by specifying a desired kinematic trajectory which was followed closely using tuned high-gain linear feedback. While the trajectories were not followed perfectly (e.g., there is some lag in tracking the more violent motions), the errors between the specified trajectory and the followed trajectory were small (on the order of 5% of the waveform's amplitude).

3 Optimal Control Formulation

We formulate the goal of control as maximizing the propulsive efficiency of forward flight by altering the plate's stroke form. We attempt to maximize this efficiency within the class of stroke forms that can be described by a given parameterization. In this section we discuss both the reward function used to measure performance, and the parameterization used to represent policies (i.e., stroke forms).

3.1 Reward Function

We desire our reward function to capture the efficiency of the forward motion produced by the stroke form. To this end, we define the (mechanical) cost-of-transport over one period T as:

c_{mt} = \frac{\int_T |F_z(t)\,\dot{z}(t)|\,dt}{mg \int_T \dot{x}(t)\,dt} .   (1)

where x is the angular position, z is the vertical position, Fz is the vertical force, m is the mass of the body and g is gravitational acceleration. The numerator of this quantity is the energy used, while the denominator is the weight times distance traveled. It was computed on our system experimentally by filtering and integrating the force measured by the load cell and dividing by the measured angular displacement, all over one period. This expression is the standard means of measuring transport cost for walking and running creatures (Collins et al., 2005), and thus seems a sensible place to start when measuring the performance of a swimming system.

This cost has the advantage of being dimensionless, and thus invariant to the units used. The non-dimensionalization is achieved by dividing by a simple scalar (in this case mg), and thus does not change as the policy changes. While alternatives to the mass, such as using the difference in mass between the plate and the displaced fluid, were debated (to take into account the importance of the fluid to the motion of the plate), changes such as these would affect the magnitude of the cost, but not the optima and learning behavior, as these are invariant to a scaling. Therefore, implementing this change would not affect the optimal policy found.

3 This Reynolds number is determined using the forward flapping motion of the wing, rather than the vertical heaving motion. The vertical heaving motion possesses a Re of approximately 3,000.

Another possibility would be non-dimensionalizing by dividing by expected energy to travel through the fluid (as opposed to weight times distance), but this would depend upon the speed of travel as drag is velocity dependent, and thus would have a more complicated form. While this could obviously still be implemented, and new behavior and optima may be found, rewarding very fast flapping gaits strongly (as this would tend to do) was undesirable simply because the experimental setup struggled mechanically with the violent motions found when attempting to maximize speed. The cost function selected often produced relatively gentle motions, and as such put less strain on the setup.

Finally, note that our learning algorithm attempts to maximize the cost of transport's inverse (turning it from a cost into a reward), which is equivalent to minimizing the energy cost of traveling a given distance. This was done for purely practical considerations, as occasionally when very poor policies were tried the system would not move significantly despite flapping, which resulted in an infinite or near-infinite cost of transport. The inverse, however, remained well-behaved.
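
For concreteness, the per-period reward computation can be sketched as follows (a minimal Python sketch under our own names and conventions; the numerical integration, the omission of filtering, and the treatment of x as a forward displacement are assumptions rather than a description of the authors' implementation):

    import numpy as np

    def reward_from_period(t, z, x, F_z, m, g=9.81):
        """Inverse mechanical cost of transport (Eq. 1) over one flapping period.

        t   : time samples over one period [s]
        z   : vertical plate position [m]
        x   : forward displacement of the plate [m]
        F_z : measured axial (vertical) force [N]
        m   : plate mass [kg]
        """
        z_dot = np.gradient(z, t)                    # vertical heaving velocity
        x_dot = np.gradient(x, t)                    # forward velocity
        energy = np.trapz(np.abs(F_z * z_dot), t)    # numerator: mechanical work input
        distance = np.trapz(x_dot, t)                # denominator term: distance traveled
        c_mt = energy / (m * g * distance)
        return 1.0 / c_mt                            # reward = inverse cost of transport

When a poor policy barely moves the plate, the distance term approaches zero and the cost of transport blows up, while the returned inverse simply approaches zero, which is exactly the practical motivation given above.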

3.2 Policy Parameterization

The parameterization chosen took the following form: the vertical heaving motion of the wing was represented by a 13-point periodic cubic spline with fixed amplitude and frequency, giving height z as a function of time t (see Figure 3). There were five independent parameters, as the half-strokes up and down were constrained to be symmetric about the t axis (i.e., the first, seventh and last points were fixed at zero, while points 2 and 8, 3 and 9, etc. were constrained to be equal in absolute value but opposite in sign, with their shared magnitudes determined by the control parameters).

This parameterization represented an interesting class of waveforms, had a relatively small number of parameters, and was both smooth and periodic. We will also see that many of the properties are desirable when viewed through the SNR (these advantages are discussed in greater detail in Section 6.1).
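
As an illustration of the parameterization, the following Python sketch builds the 13-knot symmetric spline from five parameters using SciPy's periodic cubic spline (the function name, knot spacing and amplitude scaling are our assumptions, not the authors' code):

    import numpy as np
    from scipy.interpolate import CubicSpline

    def stroke_form(params, T=1.0, A=1.0):
        """Map 5 independent parameters to a 13-knot periodic cubic spline z(t).

        Knots 1, 7 and 13 are pinned to zero; knots 2-6 take the parameter values
        (scaled by the amplitude A) and knots 8-12 mirror them with opposite sign,
        so the up- and down-strokes are symmetric about the t axis.
        """
        p = np.asarray(params, dtype=float)
        heights = np.concatenate(([0.0], A * p, [0.0], -A * p, [0.0]))   # 13 knots
        knots = np.linspace(0.0, T, 13)
        return CubicSpline(knots, heights, bc_type='periodic')

    # Example: a smoothed-square-wave-like stroke form evaluated over one period.
    z_of_t = stroke_form([0.9, 1.0, 1.0, 1.0, 0.9])
    heave = z_of_t(np.linspace(0.0, 1.0, 200))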

4 The Learning Algorithm

In light of the challenges faced by learning on a system with such dynamic complexity and partial observability, we made use of the weight-perturbation (WP) algorithm: a model-free policy gradient method that has been shown empirically to be well-suited to these sorts of problems due to its robustness to noise and insensitivity to the complexity of the system dynamics.

Fig. 3: A schematic of the parameterization of policies used in this work. Note the symmetry of the up and down strokes, and the fact that five independent parameters are used to encode the shape of the waveform.

4.1 The weight perturbation update

Consider minimizing a scalar function J(w) with respect to the parameters w (note that it is possible that J(w) is a long-term cost and results from running a system with the parameters w until conclusion). The weight perturbation algorithm (Jabri & Flower, 1992) performs this minimization with the update:

\Delta w = -\eta \left( J(w+z) - J(w) \right) z,   (2)

where the components of the “perturbation”, z, are drawn independently from a mean-zero distribution, and η is a positive scalar controlling the magnitude of the update (the “learning rate”). Performing a first-order Taylor expansion of J(w + z) yields:

\Delta w = -\eta \left( J(w) + \sum_i \frac{\partial J}{\partial w_i} z_i - J(w) \right) z = -\eta \left( \sum_i \frac{\partial J}{\partial w_i} z_i \right) z.   (3)

In expectation, this becomes the gradient times a (diagonal) covariance matrix, and reduces to

E[\Delta w] = -\eta \sigma^2 \frac{\partial J}{\partial w},   (4)

an unbiased estimate of the gradient, scaled by the learning rate and σ², the variance of the perturbation. However, this unbiasedness comes with a very high variance, as the direction of an update is uniformly distributed. It is only the fact that updates near the direction of the true gradient have a larger magnitude than do those nearly perpendicular to the gradient that allows for the true gradient to be achieved in expectation. Note also that all samples parallel to the gradient are equally useful, whether they be in the same or the opposite direction, as the sign of the change in cost does not affect the resulting update.

The WP algorithm is one of the simplest examples of a policy gradient reinforcement learning algorithm, and in the special case when z is drawn from a Gaussian distribution, weight perturbation can be interpreted as a REINFORCE update (Williams, 1992).
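
In code, the update of Equation (2) is only a few lines. The sketch below (Python) applies it to a toy quadratic cost that stands in for a physical policy evaluation; the learning rate and perturbation scale are arbitrary illustrative choices:

    import numpy as np

    def wp_step(J, w, eta=0.2, sigma=0.1, rng=np.random.default_rng()):
        """One weight-perturbation step: Delta w = -eta * (J(w + z) - J(w)) * z."""
        z = rng.normal(0.0, sigma, size=w.shape)    # mean-zero Gaussian perturbation
        return w - eta * (J(w + z) - J(w)) * z

    # Toy usage: descend a quadratic cost standing in for an experimental evaluation.
    J = lambda w: float(np.sum(w ** 2))
    w = np.array([1.0, -2.0, 0.5])
    for _ in range(2000):
        w = wp_step(J, w)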

4.2 The Shell Distribution

Rather than the Gaussian noise which is most commonly used for sampling, in our work we used a distribution in which z (the perturbation) is uniformly distributed in direction, but always has a fixed magnitude. We call this the shell distribution. This style of sampling was originally motivated by the intuitive realization that when a Gaussian distribution produced a small noise magnitude, the inherent noise in the system effectively swamped out the change in cost due to the policy perturbation, preventing any useful update from taking place. When the SNR was studied in this domain (noisy policy evaluations and possibly poor baselines), it was found to support these conclusions (as discussed in Section 5.4). For these reasons, the shell distribution was used throughout this work, and as Section 5.5 demonstrates, tangible benefits were obtained.
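
Sampling from the shell distribution is straightforward: normalizing an isotropic Gaussian draw gives a direction uniform on the sphere, which is then scaled to the chosen radius. A Python sketch (the function names are ours):

    import numpy as np

    def shell_sample(n, radius, rng=np.random.default_rng()):
        """Perturbation with uniformly distributed direction and fixed magnitude."""
        z = rng.normal(size=n)                    # isotropic Gaussian gives the direction
        return radius * z / np.linalg.norm(z)     # rescale to the fixed shell radius

    def wp_step_shell(J, w, eta=0.2, radius=0.1, rng=np.random.default_rng()):
        """Weight-perturbation step using shell sampling in place of a Gaussian."""
        z = shell_sample(w.size, radius, rng)
        return w - eta * (J(w + z) - J(w)) * z

    # Example call on a toy cost.
    w_new = wp_step_shell(lambda w: float(np.sum(w ** 2)), np.array([1.0, -2.0, 0.5]))

On the hardware, the radius was eventually chosen to be roughly 4% of the parameter values (see Section 6.3).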

5 Signal-to-Noise Ratio Analysis

Our experiments with sampling distributions quickly revealed that significant performance benefits could be realized through a better understanding of the effect of sampling distributions and measurement noise on learning performance. In this section we formulate a signal-to-noise ratio (SNR) analysis of policy gradient algorithms. This analysis formalized a number of our empirical observations about WP's performance, and gave insight into several improvements that offered real benefits to the speed of convergence.

5.1 Definition of the Signal-to-Noise Ratio

The SNR is the expected power of the signal (the update in the direction of the true gradient) divided by the expected power of the noise (the update perpendicular to the true gradient). Taking care to ensure that the magnitude of the true gradient does not affect the SNR, we have:

\mathrm{SNR} = \frac{E\left[\Delta w_{\parallel}^T \Delta w_{\parallel}\right]}{E\left[\Delta w_{\perp}^T \Delta w_{\perp}\right]},   (5)

\Delta w_{\parallel} = \left( \frac{\Delta w^T J_w}{\|J_w\|} \right) \frac{J_w}{\|J_w\|}, \qquad \Delta w_{\perp} = \Delta w - \Delta w_{\parallel},   (6)

and using J_w(w_0) = \left. \frac{\partial J(w)}{\partial w} \right|_{w = w_0} for convenience.

Intuitively, this expression measures how large a proportion of the update is “useful”. If the update were purely in the direction of the gradient the SNR would be infinite, while if the update moved perpendicular to the true gradient, it would be zero. As such, all else being equal, an update with a higher SNR should generally perform as well as or better than one with a lower SNR, and result in less violent swings in cost and policy for the same improvement in performance. For a more in-depth study of the SNR's relationship to learning performance, see (Roberts & Tedrake, 2009).

5.2 Weight perturbation with Gaussian distributions

Evaluating the SNR for the WP update in Equation (2) with a deterministic J(w) and z drawn from a Gaussian distribution yields a surprisingly simple result. If one first considers the numerator:

E\left[\Delta w_{\parallel}^T \Delta w_{\parallel}\right] = E\left[ \frac{\eta^2}{\|J_w\|^4} \left( \sum_{i,j} J_{w_i} J_{w_j} z_i z_j \right) J_w^T \cdot \left( \sum_{k,p} J_{w_k} J_{w_p} z_k z_p \right) J_w \right] = E\left[ \frac{\eta^2}{\|J_w\|^2} \sum_{i,j,k,p} J_{w_i} J_{w_j} J_{w_k} J_{w_p} z_i z_j z_k z_p \right] = Q,   (7)

where we have named this term Q for convenience, as it occurs several times in the expansion of the SNR. We now expand the denominator as follows:

E\left[\Delta w_{\perp}^T \Delta w_{\perp}\right] = E\left[ \Delta w^T \Delta w - 2\,\Delta w_{\parallel}^T (\Delta w_{\parallel} + \Delta w_{\perp}) + \Delta w_{\parallel}^T \Delta w_{\parallel} \right] = E\left[ \Delta w^T \Delta w \right] - 2Q + Q.   (8)

Substituting Equation (2) into Equation (8) and simplifying results in:

E\left[\Delta w_{\perp}^T \Delta w_{\perp}\right] = \frac{\eta^2}{\|J_w\|^2} E\left[ \sum_{i,j,k} J_{w_i} J_{w_j} z_i z_j z_k^2 \right] - Q.   (9)

We now assume that each component z_i is drawn from a Gaussian distribution with variance σ². Taking the expected value, it may be further simplified to:

Q = \frac{\eta^2}{\|J_w\|^4} \left( 3\sigma^4 \sum_i J_{w_i}^4 + 3\sigma^4 \sum_i J_{w_i}^2 \sum_{j \neq i} J_{w_j}^2 \right) = \frac{3\sigma^4}{\|J_w\|^4} \sum_{i,j} J_{w_i}^2 J_{w_j}^2 = 3\sigma^4,   (10)

E\left[\Delta w_{\perp}^T \Delta w_{\perp}\right] = \frac{\eta^2 \sigma^4}{\|J_w\|^2} \left( 2 \sum_i J_{w_i}^2 + \sum_{i,j} J_{w_i}^2 \right) - Q = \sigma^4 (2 + N) - 3\sigma^4 = \sigma^4 (N - 1),   (11)

where N is the number of parameters. Canceling σ results in:

\mathrm{SNR} = \frac{3}{N-1}.   (12)

Thus, for small noises and constant σ, the SNR and the number of parameters have a simple inverse relationship. This is a particularly concise model for performance scaling in policy gradient algorithms.
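
The 3/(N−1) scaling is easy to check numerically. The Python sketch below estimates the SNR of Equation (5) by Monte Carlo for the first-order update of Equation (3) (a linearized cost, so higher-order terms play no role) and compares it with Equation (12); the particular gradient vector is an arbitrary choice:

    import numpy as np

    def estimate_snr(J_w, sigma=0.1, eta=1.0, trials=200_000, seed=0):
        """Monte Carlo estimate of Eq. (5) for first-order WP updates with Gaussian z."""
        rng = np.random.default_rng(seed)
        g_hat = J_w / np.linalg.norm(J_w)              # unit vector along the true gradient
        Z = rng.normal(0.0, sigma, size=(trials, J_w.size))
        dw = -eta * (Z @ J_w)[:, None] * Z             # Delta w = -eta * (J_w . z) * z
        par = (dw @ g_hat)[:, None] * g_hat            # component parallel to the gradient
        perp = dw - par                                # component perpendicular to it
        return np.mean(np.sum(par**2, axis=1)) / np.mean(np.sum(perp**2, axis=1))

    N = 5
    J_w = np.array([2.0, -1.0, 0.5, 1.5, -0.5])
    print(estimate_snr(J_w), 3.0 / (N - 1))            # both close to 0.75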

5.3 SNR with parameter-independent additive noise

In many real-world systems, the evaluation of the cost J(w) is not deterministic, a property which can significantly affect learning performance. In this section we investigate how additive noise in the function evaluation affects the analytical expression for the SNR. We demonstrate that for very high noise WP begins to behave like a random walk, and we find in the SNR the motivation for the shell distribution, an improvement to the WP algorithm that will be examined in Section 5.4.

Consider modifying the update seen in Equation (2) to allow for a parameter-independent additive noise term v and a more general baseline b(w), and again perform the Taylor expansion. Writing the update with these terms gives:

\Delta w = -\eta \left( J(w) + \sum_i J_{w_i} z_i - b(w) + v \right) z = -\eta \left( \sum_i J_{w_i} z_i + \xi(w) \right) z,   (13)

where we have combined the terms J(w), b(w) and v into a single random variable ξ(w). The new variable ξ(w) has two important properties: its mean can be controlled through the value of b(w), and its distribution is independent of the parameters w, thus ξ(w) is independent of all the z_i.

We now essentially repeat the calculation seen in Section 5.2, with the small modification of including the noise term. When we again assume independent z_i, each drawn from identical Gaussian distributions with standard deviation σ, we obtain the expression:

\mathrm{SNR} = \frac{\phi + 3}{(N-1)(\phi + 1)}, \qquad \phi = \frac{(J(w) - b(w))^2 + \sigma_v^2}{\sigma^2 \|J_w\|^2},   (14)


where σ_v is the standard deviation of the noise v and we have termed φ the error component. This expression depends upon the fact that the noise v is mean-zero and independent of the parameters, although the assumption that v is mean-zero is not limiting. It is clear that in the limit of small φ the expression reduces to that seen in Equation (12), while in the limit of very large φ it becomes the expression for the SNR of a random walk (see (Roberts & Tedrake, 2009)). This expression makes it clear that minimizing φ is desirable, a result that suggests two things: (1) the optimal baseline (from the perspective of the SNR) is the value function (i.e., b*(w) = J(w)), and (2) higher values of σ are desirable, as they reduce φ by increasing the size of its denominator. However, there is clearly a limit on the size of σ due to higher-order terms in the Taylor expansion; very large σ will result in samples which do not represent the local gradient. Thus, in the case of noisy measurements, there is some optimal sampling distance that is as large as possible without resulting in poor sampling of the local gradient. This is explored in the next section.

Fig. 4: SNR vs. update magnitude for a 2D quadratic cost function. Mean-zero measurement noise is included with variances from 0 to 0.65 (the value function's Hessian was diagonal with all entries equal to 2). As the noise is increased, the sampling magnitude producing the maximum SNR becomes larger and the SNR achieved is lower. Note that the highest SNR achieved is for the smallest sampling magnitude with no noise, where it approaches the theoretical value (for 2D) of 3. Also note that for small sampling magnitudes and large noises the SNR approaches the random-walk value of 1/(N−1) (see (Roberts & Tedrake, 2009)).
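
The tradeoff captured by φ can be read directly from Equation (14). The Python sketch below evaluates it for an arbitrary gradient, noise level and baseline error while sweeping the sampling scale σ: this first-order SNR rises toward the noiseless value 3/(N−1) as σ grows, and the finite optimum visible in Figure 4 appears only once the higher-order terms of the true cost (not modeled here) penalize large steps:

    import numpy as np

    def snr_additive_noise(sigma, J_w, sigma_v, baseline_error=0.0):
        """First-order SNR of Eq. (14) with additive, parameter-independent noise."""
        N = J_w.size
        phi = (baseline_error**2 + sigma_v**2) / (sigma**2 * float(J_w @ J_w))
        return (phi + 3.0) / ((N - 1) * (phi + 1.0))

    J_w = np.array([2.0, -1.0, 0.5, 1.5, -0.5])        # hypothetical local gradient
    for sigma in (0.01, 0.05, 0.1, 0.5, 1.0):
        print(sigma, snr_additive_noise(sigma, J_w, sigma_v=0.2))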

5.4 Non-Gaussian Distributions

The analysis in Section 5.3 suggests that for a function with noisy measurements there is an optimal sampling distance which depends upon the local noise and gradient as well as the strength of higher-order terms in that region. For a two-dimensional cost function that takes the form of a quadratic bowl in parameter space, Figure 4 shows the SNR's dependence upon the radius of the shell distribution (i.e., the magnitude of the sampling). For various levels of additive mean-zero noise, the SNR was computed for a distribution uniform in angle and fixed in its distance from the mean (this distance is the “sampling magnitude”). The fact that there is a unique maximum for each case suggests the possibility of sampling only at that maximal magnitude, rather than over all magnitudes as is done with a Gaussian, and thus improving SNR and performance. While determining the exact magnitude of maximum SNR may be impractical, by choosing a distribution with uniformly distributed direction and a constant magnitude close to this optimal value, performance can be improved.

5.5 Experimental Evaluation of Shell Distributions

To provide compelling evidence that the shell distribution could improve convergence in problems of interest, the shell distribution was implemented directly on the heaving plate, and the resulting learning curves were compared to those obtained using Gaussian sampling. For the purposes of this comparison, policy evaluations were run for long enough to reach steady state (to eliminate any issues relating to the coupling between consecutive evaluations). As can be seen in Figure 5, the shell distribution provided a real advantage in convergence rate on this system of interest, when dealing with the full dynamical complexity of a laboratory experimental system.


Fig. 5: Five averaged runs on the heaving plate using Gaussian or shell distributions for sampling. The error bars represent one standard deviation in the performance of different runs at that trial.


5.6 Implications for Learning at Intermediate Reynolds Numbers

The SNR demonstrates many of the properties of WP that make it so well suited to learning on partially observable and dynamically complicated systems, such as fluid systems in the intermediate Reynolds number regime. The SNR shows that the system's dynamical complexity does not (locally) affect the difficulty of learning, as the dynamics appear nowhere in the expression. Instead, learning performance is locally affected by the number of parameters in the policy (N), the level of stochasticity in policy evaluations (σ_v), the quality of the baseline and the steepness of local gradients.

The SNR does not take into account the effects of higher-order behavior such as the roughness of the value function in policy space, which is in general a function of the system, the choice of parameterization and the choice of the cost function. These properties can be extremely important to the performance of WP, affecting both the number and depth of local minima and the rate of learning, but are not analytically tractable in general.

6 Learning Results

6.1 Policy Representation Viewed Through SNR

Due to the importance of the policy parameterization to the performance of the learning, it is important to pick the policy class carefully. Now armed with knowledge of the SNR, some basic guidelines for choosing the parameterization, previously justified heuristically, become more precise. Finding a rich representation with a small number of parameters can greatly improve convergence rates. Furthermore, certain parameterizations can be found with fewer local minima and smoother gradients than others, although determining these properties a priori is often impractical. A parameterization in which all parameters are reasonably well-coupled to the cost function is beneficial, as this will result in less anisotropy in the magnitude of the gradients, and thus larger sampling magnitudes and greater robustness to noise can be achieved. Prior to the periodic cubic spline, several parameterizations were tried, with all taking the form of encoding the z height of the wing as a function of time t over one period T, with the ends of the waveform constrained to be periodic (i.e., the beginning and end have identical values and first derivatives).

Parameterizing the policy in the seemingly natural fashion of a finite Fourier series was ruled out due to the difficulty of representing many intuitively useful waveforms (e.g., square waves) with a reasonable number of parameters. A parameterization using the sum of base waveforms (i.e., a smoothed square wave, a sine wave and a smoothed triangle wave) was used and shown to learn well, but was deemed too restrictive a class, one which predetermined many features of the waveform. Learning the base period T and the amplitude A of the waveform was also tried, and shown to perform well without significant difficulty. However, it was discovered that periods as long as possible and amplitudes as small as possible were selected, and thus these extra parameters were determined not to be of interest to the learning (this result was useful in determining how the system's dynamics related to the reward function, and is discussed in detail in Section 7).

6.2 Reward Function Viewed Through SNR

The SNR also gives some greater insight into the formulation of the reward function. The form of the reward was discussed in Section 3.1, and justified by its connection to previous work on the efficiency of locomotion. We may now, however, understand more of what makes it advantageous from the perspective of learning performance and SNR. Its integral form performs smoothing on the collected data, which effectively reduces the noise level for a given trial, and intuitively it should differentiate meaningfully between different policies (e.g., if a reward function has no gradient with respect to the parameters in some region of policy space, learning in that region will be very difficult).

6.3 Implementation of Online Learning

The SNR analysis presented here is concerned with episodic trials (i.e., a series of distinct runs), but this does not preclude online operation, in which trials follow one another immediately and the system runs continuously. While at first we ran longer policy evaluations (on the order of 40 or more flaps, averaged together) to reduce noise, and gave the system plenty of time (20 or more flaps) between trials to reach steady state and avoid inter-trial coupling, as we became more proficient at dealing with the high noise levels of the system we began to learn more aggressively. Through techniques such as sampling from the shell distribution, we were able to reduce trial length to a single flap (requiring just 1 second), and by reducing noise levels (ultimately choosing a shell distribution with a radius of approximately 4% of the parameter value) we were able to eliminate the inter-trial settling time. Inter-trial correlation was explored, and while present, we found it did not significantly hamper learning performance, and that including eligibility traces did not greatly improve the rate of convergence.
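
Putting these pieces together, the online procedure has roughly the following shape (a Python sketch; run_one_flap is a placeholder for commanding one flapping cycle on the hardware and returning its reward, and the learning rate and baseline tracking are our assumptions rather than the authors' exact settings):

    import numpy as np

    def shell_sample(n, radius, rng):
        v = rng.normal(size=n)
        return radius * v / np.linalg.norm(v)

    def online_learning(run_one_flap, w0, eta=0.2, n_trials=400, seed=0):
        """One-flap-per-trial weight perturbation with shell sampling.

        run_one_flap(params) -> reward (inverse cost of transport) for a single
        flapping cycle executed with the given spline parameters (hardware stub).
        """
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        baseline = run_one_flap(w)                    # initial unperturbed evaluation
        for _ in range(n_trials):                     # ~400 flaps at one per second
            radius = 0.04 * np.linalg.norm(w)         # mirrors the ~4% radius quoted above
            z = shell_sample(w.size, radius, rng)
            reward = run_one_flap(w + z)
            w = w + eta * (reward - baseline) * z     # ascend the reward
            baseline = 0.9 * baseline + 0.1 * reward  # slowly tracked baseline (our choice)
        return w

    # Hypothetical stub standing in for the hardware (reward peaks at an arbitrary w*).
    w_star = np.array([0.2, 0.6, 1.0, 0.6, 0.2])
    rng_stub = np.random.default_rng(1)
    stub = lambda w: -float(np.sum((w - w_star) ** 2)) + 0.05 * rng_stub.normal()
    w_final = online_learning(stub, w0=[0.9, 1.0, 1.0, 1.0, 0.9])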

6.4 Performance of Learning

In its final form, the SNR-optimized WP update was able to learn extremely efficiently. Using the policy parameterization described above, along with shell sampling and one-flap trials, the optimal policy was found within 7 minutes (around 400 flaps), even when starting far away in state space (see Figure 6). This quick convergence in the face of inter-trial coupling and high variance in policy evaluations (resulting from running them for such a short period of time) demonstrates the WP algorithm's robustness to these complex but very common difficulties.

Fig. 6: The average of five learning curves using online learning (an update every second, after each full flapping cycle), with markers for +/- one standard deviation. The high variance is the result of large inter-trial variance in the cost, rather than large differences between different learning curves.

7 Interpretation of Optimal Solution

Once the learning had been successfully implemented, and repeatable convergence to the same optimum was achieved, it is interesting to investigate what the form of the solution suggests about the physical system. Figure 7 shows an example of an initial, intermediate and final waveform from a learning trial, starting at a smoothed-out square wave and ending at the triangle wave which was found to be optimal.

The result is actually quite satisfying from a fluid dynamics point of view, as it is consistent with our theoretical understanding of the system, and indeed was the solution predicted by experts in the field of flapping flight given our reward function. If one considers the reward function used in this work (see Section 3.1), the basis of this behavior's optimality becomes clear.

Fig. 7: A series of waveforms (initial, intermediate and final) seen during learning on the rigid plate.

Fig. 8: Linear growth of forward speed with flapping frequency. The axes of this curve have been non-dimensionalized as shown, and the data was taken for a sine wave. Figure from (Vandenberghe et al., 2004).

Consider the cost of transport c_mt. As drag force is approximately quadratic with speed in this regime, the numerator behaves approximately as:

\int_T |F_z(t)\,\dot{z}(t)|\,dt \;\sim\; \tfrac{1}{2}\rho C_d \langle V^2 \rangle T,   (15)

where C_d is the coefficient of drag and ⟨V²⟩ is the mean squared heaving speed. However, forward rotational speed was found to grow linearly with flapping frequency (see (Vandenberghe et al., 2004) and Figure 8), thus the denominator can be written approximately as:

mg \int_T \dot{x}(t)\,dt \;\sim\; C_f \langle V \rangle T,   (16)

where C_f is a constant relating vertical speed to forward speed, and ⟨V⟩ is the mean vertical speed. Therefore, higher speeds result in quadratic growth of the numerator of the cost and linear growth of the cost's denominator. This can be seen as the reward r (the inverse of c_mt) having the approximate form:


r \;\sim\; C\,\frac{\langle V \rangle}{\langle V^2 \rangle},   (17)

with C a constant. This results in lower speeds being more efficient, causing lower frequencies and amplitudes to be preferred. If period and amplitude are fixed, however, the average speed is fixed (assuming no new extrema of the stroke form are produced during learning, a valid assumption in practice). A triangle wave, then, is the means of achieving this average speed with the minimum average squared speed.
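
This last step can be stated slightly more formally. Writing V = ż and restricting to stroke forms with fixed period and amplitude (and hence fixed mean vertical speed ⟨|ż|⟩), the Cauchy-Schwarz (or Jensen) inequality bounds the mean squared speed from below:

\langle V^2 \rangle = \frac{1}{T}\int_T \dot{z}(t)^2\,dt \;\ge\; \left( \frac{1}{T}\int_T |\dot{z}(t)|\,dt \right)^{2} = \langle |\dot{z}| \rangle^2,

with equality exactly when |ż(t)| is constant, i.e., for a triangle wave. Since the approximate reward (17) scales as ⟨V⟩/⟨V²⟩ with ⟨V⟩ fixed by the amplitude and period, the constant-speed stroke maximizes it within this class.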

The utility of this control development method now becomes clearer. For systems that, despite having complicated dynamics, can be reasonably described with lumped-parameter or quasi-steady models, learning the controller directly serves dual purposes: it suggests lines of reasoning that inform the creation of relatively simple models, and it gives confidence that the modeling captures the aspects of the system relevant to the optimization. Furthermore, on systems for which tractable models are unavailable, the learning methodology can be applied just as easily, while model-centric techniques will begin to fail.

8 Conclusion

This work has presented a case study in how to produce efficient, online learning on a complicated fluid system. The techniques used here were shown to be effective, with convergence being achieved on the heaving plate in approximately seven minutes. The algorithmic improvements presented have additional applications to many other systems, and by succeeding on a problem possessing great dynamic complexity, a reasonably large dimensionality, partial observability and noisy evaluations, they have been shown to be robust and useful. We believe the complexity of flow control systems is an excellent match for the capabilities of learning control, and expect to see many more applications for this domain in the future.


Bibliography

Alben, S., & Shelley, M. (2005). Coherent locomotion as an attracting state for a free flapping body. Proceedings of the National Academy of Sciences, 102, 11163–11166.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.

Baxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.

Bennis, A., Leeser, M., Tadmor, G., & Tedrake, R. (2008). Implementation of a highly parameterized digital PIV system on reconfigurable hardware. Proceedings of the Twelfth Annual Workshop on High Performance Embedded Computing (HPEC). Lexington, MA.

Collins, S. H., Ruina, A., Tedrake, R., & Wisse, M. (2005). Efficient bipedal robots based on passive-dynamic walkers. Science, 307, 1082–1085.

Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.

Howard, M., Klanke, S., Gienger, M., Goerick, C., & Vijayakumar, S. (2009). Methods for learning control policies from variable-constraint demonstrations. In From motor to interaction learning in robots. Springer.

Jabri, M., & Flower, B. (1992). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Trans. Neural Netw., 3, 154–157.

Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101.

Kober, J., Mohler, B., & Peters, J. (2009). Imitation and reinforcement learning for motor primitives with perceptual coupling. In From motor to interaction learning in robots. Springer.

Meuleau, N., Peshkin, L., Kaelbling, L. P., & Kim, K.-E. (2000). Off-policy policy search. NIPS.

Peters, J., Vijayakumar, S., & Schaal, S. (2003). Policy gradient methods for robot control (Technical Report CS-03-787). University of Southern California.

Roberts, J. W., & Tedrake, R. (2009). Signal-to-noise ratio analysis of policy gradient algorithms. Advances in Neural Information Processing Systems (NIPS) 21.

Shelley, M. (2007). Personal communication.

Tedrake, R., Zhang, T. W., & Seung, H. S. (2004). Stochastic policy gradient reinforcement learning on a simple 3D biped. Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS) (pp. 2849–2854). Sendai, Japan.

Vandenberghe, N., Childress, S., & Zhang, J. (2006). On unidirectional flight of a free flapping wing. Physics of Fluids, 18.

Vandenberghe, N., Zhang, J., & Childress, S. (2004). Symmetry breaking leads to forward flapping flight. Journal of Fluid Mechanics, 506, 147–155.

Williams, J. L., Fisher III, J. W., & Willsky, A. S. (2006). Importance sampling actor-critic algorithms. Proceedings of the 2006 American Control Conference.

Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.