MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with One Transformer VAE

Shih-Lun Wu, and Yi-Hsuan Yang, Senior Member, IEEE

Abstract—Transformers and variational autoencoders (VAE) have been extensively employed for symbolic (e.g., MIDI) domain music generation. While the former boast an impressive capability in modeling long sequences, the latter allow users to willingly exert control over different parts (e.g., bars) of the music to be generated. In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths. The task is split into two steps. First, we equip Transformer decoders with the ability to accept segment-level, time-varying conditions during sequence generation. Subsequently, we combine the developed and tested in-attention decoder with a Transformer encoder, and train the resulting MuseMorphose model with the VAE objective to achieve style transfer of long musical pieces, in which users can specify musical attributes including rhythmic intensity and polyphony (i.e., harmonic fullness) they desire, down to the bar level. Experiments show that MuseMorphose outperforms recurrent neural network (RNN) based baselines on numerous widely-used metrics for style transfer tasks.

Index Terms—Transformer, variational autoencoder (VAE), deep learning, controllable music generation, music style transfer


1 INTRODUCTION

Automatic music composition, i.e., the generation of musical content in symbolic formats such as MIDI,1 has been an active and captivating research topic with endeavors dating back to more than half a century ago [1]. Due to the renaissance of neural networks, we have seen in recent years a proliferation of deep learning-based methods for music composition [2], [3], [4]. Different network architectures have been employed depending on the target applications, but the state-of-the-art models are usually based on one of the following two architectures [5], [6], [7]—the Transformer [8] and the variational autoencoder (VAE) [9].

Transformers [8] are a family of neural sequence models that are generally thought of as the potent successors of recurrent neural networks (RNN) [10], [11]. Thanks to the design of self-attention to aggregate information from the hidden states of all previous tokens in a sequence, Transformers are able to compose coherent music of a few minutes long, as demonstrated by the Music Transformer [5] and MuseNet [12]. Variants of Transformers have been employed to model multi-track music [6], pop piano performances [13], jazz lead sheets [14], and even guitar tabs [15] that are readable by guitarists. Objective and subjective evaluations in [5], [6] and [16] have all proven Transformers to be the state-of-the-art model for long sequence generation.

VAEs [9] are a type of deep latent variable model comprising an encoder, a decoder, and a Kullback-Leibler (KL) divergence-regularized latent space in between. The main strength of VAEs is that they grant human users access to the generative process through operations on the learned latent space [7], which stores the compact semantics of a musical

• Both SL and YH were with the Taiwan AI Labs, 15F., No. 70, Sec. 1, Chengde Rd., Datong Dist., Taipei City 103622, Taiwan. Besides, SL was also with the Department of CSIE, National Taiwan University; and YH was also with the Research Center for IT Innovation, Academia Sinica. E-mail: [email protected]; [email protected]

Manuscript received ***, 2021.
1. Check https://midi.org/specifications for detailed specifications.

Fig. 1: Visualizations of an original music excerpt (top) and one of its style-transferred generations by MuseMorphose (bottom). Digits in the upper and lower rows represent our target attributes to be controlled, namely, bar-wise rhythmic intensity and polyphony classes, respectively.

excerpt, making them a great choice for tasks like controllable music generation and music style (or, attribute) transfer. For example, MusicVAE [7] and Music FaderNets [17] showed respectively that low-level (e.g., note density) and high-level (e.g., arousal level) musical features of machine compositions can be altered via latent vector arithmetic. Music FaderNets, GLSR-VAE [18], and MIDI-VAE [19] imposed auxiliary losses



on the latent space to force some latent dimensions to be discriminative of the musical style or attributes to be controlled. In contrast, Kawai et al. [20] used adversarial learning [21] to prevent the latent space from encoding attribute information, and delegated attribute control to learned embeddings fed to the decoder.

The aforementioned VAE models are, however, all based on RNNs, whose capabilities in modeling long sequences are known to be limited. On the other hand, conditional generation with Transformers is usually done by providing extra tokens. For instance, MuseNet [12] used composer and instrumentation tokens, prepended to the event sequence, to affect the composition's style and restrict the instruments used. Sometimes, Transformers are given a composed melody (also as a token sequence) and are asked to generate an accompaniment. Such tasks have been addressed with encoder-decoder architectures [22], [23], or by training a decoder with interleaved subsequences of the melody and the full piece [24]. The two scenarios above can be seen as two extremes in terms of the restrictiveness of given conditions. Users may either only be able to decide the broad categories (e.g., genre, composer, etc.) of the generation, or have to be capable of coming up with a melody themselves. The middle ground of constraining the generation with a high-level flow of musical ideas in the form of latent vectors, with which users may engage easily and extensively in the machine creative process as achieved by RNN-based VAEs, has not yet been studied for Transformers to the best of our knowledge.

The goal of this paper is hence to construct such a model that attains all the aforementioned strengths of Transformers and VAEs. In achieving the goal, we split our work into two steps. First, we devise mechanisms to condition Transformer decoders [25] with segment-level (i.e., bar-level in our case), time-varying conditioning vectors during long sequence generation. Three methods, i.e., pre-attention, in-attention, and post-attention, which inject conditions into Transformers at different stages in the architecture, are proposed. We conduct an objective study to show that in-attention most effectively exerts control over the model.

Next, we combine an in-attention-featured Transformer decoder with a Transformer encoder [26], which operates rather on the segment level, to build our ultimate MuseMorphose model for fine-grained music style transfer. This encoder-decoder network is trained with the VAE objective [9], and enables music style transfer through attribute embeddings [27]. Experiments demonstrate that MuseMorphose excels in generating style-transferred versions of pop piano performances (i.e., music with expressive timing and dynamics, instead of plain sheet music) of 32 bars (or measures) long, in which users can freely control two ordinal musical attributes, i.e., rhythmic intensity and polyphony, of each bar.

Parameterizing VAEs with Transformers is a less explored direction. An exemplary work is Optimus [28]. Optimus successfully linked together a pair of pre-trained Transformer encoder (BERT) [26] and decoder (GPT-2) [25] with a VAE latent space, and devised mechanisms for the latent variable to exert control over the Transformer decoder. Linking Transformers and VAE for story completion [29] has also been attempted. However, both works above were limited to generating short texts (only one sentence), and conditioned the Transformer decoder at the global level. Our work extends the approaches above to the case of long sequences, maintains VAEs' fine-grained controllability, and verifies that such control can be achieved without a discriminator and any auxiliary losses.

Our key contributions can be summarized as:
• We devise the in-attention method to firmly harness Transformers' generative process with segment-level conditions.
• We pair a Transformer decoder featuring in-attention with a bar-level Transformer encoder to form MuseMorphose, our Transformer-based VAE, which marks an advancement over Optimus [28] in terms of conditioning mechanism, granularity of control, and accepted sequence length.
• We employ MuseMorphose for the style transfer of 32-bar-long pop piano performances, on which it surpasses state-of-the-art RNN-based VAEs [19], [20] on various metrics without having to use style-directed loss functions.

Fig. 1 displays one of MuseMorphose's compositions,2 in which the model responds precisely to the increasing rhythmic intensity and polyphony settings, while keeping the contour of the original excerpt well. We encourage readers to visit our companion website3 to listen to more generations by MuseMorphose. Moreover, we have also open-sourced our implementation of MuseMorphose.4

The remainder of this paper is structured as follows. Section 2 provides a comprehensive walk-through of the technical background. Sections 3 and 4, which are the main body of our work, focus respectively on the segment-level conditioning for Transformer decoders, and the MuseMorphose Transformer-based VAE model for fine-grained music style transfer. In each of these two sections, we start from problem formulation; then, we elaborate on our method(s), followed by evaluation procedure and metrics; finally, we present the results and offer some discussion. Section 5 concludes the paper and provides potential directions to pursue in the future.

2 TECHNICAL BACKGROUND

2.1 Event-based Representation for Music

To facilitate the modeling of music with neural sequence models, an important pre-processing step is to tokenize a musical piece X into a sequence of events, i.e.,

X = \{x_1, x_2, \dots, x_T\}, \quad (1)

where T is the length of the music in the resulting event-based representation. Such a tokenization process can be done straightforwardly with music in the symbolic format of MIDI. A MIDI file stores a musical piece's tempo (in beats per minute, or bpm), time signature (e.g., 3/4, 4/4, 6/8, etc.), as well as each note's onset and release timestamps and velocity (i.e., loudness). Multiple tracks can also be included for pieces with more than one playing instrument.

There exist several ways to represent symbolic music as token sequences [7], [30]. The representations adopted in our work are based on Revamped MIDI-derived events (REMI) [13]. It incorporates the intrinsic units of time in music, i.e., bars and beats, the former of which naturally define the

2. Each tone in the chromatic scale (C, C#, . . . , B) is colored differently.
3. Companion website: slseanwu.github.io/site-musemorphose
4. Codebase: github.com/YatingMusic/MuseMorphose


segments to impose conditions. In REMI, a musical piece is represented as a sequence with the following types of tokens:
• BAR and SUB-BEAT (1∼16, denoting the time within a bar, in quarter-beat, i.e., 16th-note, increments) events denote the progression of time.
• TEMPO events explicitly set the pace (in beats per minute, or bpm) the music should be played at.
• PITCH, VELOCITY, and DURATION events, which always co-occur at each note's onset, mark the pitch, loudness (in 32 levels), and duration (in 16th-note increments) of a note.
The authors of REMI also suggested employing rule-based chord recognition algorithms to add CHORD events, e.g., CMAJ, G7, etc., to inform the sequence model of local harmonic characteristics of the music. To better suit our use cases, we introduce some modifications to REMI, which will be explained in detail in Sections 3 and 4.
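To make the event-based representation concrete, below is a small, hypothetical example of how one bar might be tokenized in a REMI-style vocabulary; the token names and value granularities here are illustrative only and do not reproduce the exact vocabularies of Tables 2 and 5.

```python
# A hypothetical REMI-style token sequence for one bar with a chord and two notes.
# Token names and value ranges are illustrative; see Tables 2 and 5 for the
# vocabularies actually used in this paper.
bar_tokens = [
    "BAR",                                       # start of a new bar
    "SUB-BEAT_1", "TEMPO_110", "CHORD_C_MAJ",    # position, pace, and local harmony
    "PITCH_60", "VELOCITY_18", "DURATION_4",     # a C4 note: pitch, loudness, length
    "SUB-BEAT_9",                                # move to the middle of the bar
    "PITCH_64", "VELOCITY_20", "DURATION_4",     # an E4 note
]
```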

2.2 Transformers

In general, a Transformer operates on an input sequence X = {x_1, . . . , x_T}.5 A token x_t in the input sequence first becomes a token embedding x_t ∈ R^d, where d is the model's hidden state dimension. Then, x_t goes through the core of the Transformer—a series of L (self-)attention modules sharing the same architecture. An attention module6 can be further divided into a multi-head attention (MHA) [8] sub-module, and a position-wise feedforward net (FFN) sub-module. Suppose the input embedding x_t has passed l − 1 such attention modules to produce the hidden state h_t^{l−1} ∈ R^d; the operations that take place in the next (i.e., l-th) attention module can be summarized as follows:

s_t^l = \mathrm{LayerNorm}\big(h_t^{l-1} + \mathrm{MHA}(h_t^{l-1} \mid H^{l-1})\big), \quad (2)

h_t^l = \mathrm{LayerNorm}\big(s_t^l + \mathrm{FFN}(s_t^l)\big), \quad (3)

where H^{l−1} = [h_1^{l−1}, . . . , h_T^{l−1}]^⊤ ∈ R^{T×d}, and s_t^l, h_t^l ∈ R^d.

The presence of residual connections [31] and layer normalization (LayerNorm) [32] ensures that gradients back-propagate smoothly into the network. The MHA sub-module is the key component for each timestep to gather information from the entire sequence. In MHA, first, the query, key, and value matrices (Q^l, K^l, V^l ∈ R^{T×d}) are produced:

Q^l = H^{l-1} W_Q^l, \quad K^l = H^{l-1} W_K^l, \quad V^l = H^{l-1} W_V^l, \quad (4)

where W_Q^l, W_K^l, W_V^l ∈ R^{d×d} are learnable weights. Q^l is further split into M heads [Q^{l,1}, . . . , Q^{l,M}], where Q^{l,m} ∈ R^{T×(d/M)}, ∀m ∈ {1, . . . , M}, and so are K^l and V^l. Then, the dot-product attention is performed in a per-head fashion:

\mathrm{Att}^{l,m} = \mathrm{softmax}\Big(\frac{Q^{l,m} (K^{l,m})^{\top}}{\sqrt{d/M}}\Big) V^{l,m}, \quad (5)

where Att^{l,m} ∈ R^{T×(d/M)}. Causal (lower-triangular) masking is applied before the softmax in autoregressive sequence modeling tasks. The attention outputs (i.e., Att^{l,1}, . . . , Att^{l,M}) are then concatenated and linearly transformed by learnable weights W_O^l ∈ R^{d×d}, thereby concluding the MHA sub-module.
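As a concrete reference for Eqs. (2)-(5), here is a minimal PyTorch sketch of one attention module (MHA plus FFN, with residual connections and LayerNorm). It is a simplified illustration written for this text, not the authors' implementation; the class and argument names are our own.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    """One Transformer attention module (MHA + FFN), following Eqs. (2)-(5)."""
    def __init__(self, d, n_heads, d_ff):
        super().__init__()
        self.d, self.M = d, n_heads
        self.W_q = nn.Linear(d, d, bias=False)   # W_Q^l
        self.W_k = nn.Linear(d, d, bias=False)   # W_K^l
        self.W_v = nn.Linear(d, d, bias=False)   # W_V^l
        self.W_o = nn.Linear(d, d, bias=False)   # W_O^l
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h, causal=True):
        # h: (T, d) hidden states H^{l-1} coming from the previous module
        T, d = h.shape
        q, k, v = self.W_q(h), self.W_k(h), self.W_v(h)                      # Eq. (4)
        # split into M heads of width d/M: each becomes (M, T, d/M)
        q, k, v = (x.view(T, self.M, d // self.M).transpose(0, 1) for x in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(d / self.M)             # (M, T, T)
        if causal:  # lower-triangular mask for autoregressive modeling
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
            scores = scores.masked_fill(mask, float("-inf"))
        att = F.softmax(scores, dim=-1) @ v                                  # Eq. (5), per head
        att = att.transpose(0, 1).reshape(T, d)                              # concatenate heads
        s = self.ln1(h + self.W_o(att))                                      # Eq. (2)
        return self.ln2(s + self.ffn(s))                                     # Eq. (3)
```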

5. We note that combinations of Transformers operating on multiple sequences are also often seen.

6. It is also often called a (self-)attention layer. However, we use the term module as it actually involves multiple layers of operations.

However, the mechanism above is permutation-invariant, which is clearly not a desirable property for sequence models. This issue can be addressed by adding a positional encoding to the input token embeddings, i.e., x_t = x_t + pos(t), where pos(t) can either be fixed sinusoidal waves [8], or learnable embeddings [26]. More sophisticated methods which inform the model of the relative positions between all pairs of tokens have also been invented [33], [34], [35], [36].

Due to attention's quadratic memory complexity, it is often impossible for Transformers to take sequences with over 2∼3k tokens, leading to the context fragmentation issue. To resolve this, Transformer-XL [16] leveraged segment-level recurrence and a hidden-state cache to allow tokens in the current segment to refer to those in the preceding segment. Algorithms to estimate attention under linear complexity have also been proposed recently [37], [38].

2.3 Variational Autoencoders (VAE)

VAEs [9] are based on the hypothesis that generative processes follow p(x) = \int_z p(x \mid z)\, p(z)\, dz, where p(z) is the prior of some intermediate-level latent variable z. It has been shown that generative modeling can be achieved through maximizing:

\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{z \sim q_\phi(z|x)} \log p_\theta(x|z) - D_{\mathrm{KL}}\big(q_\phi(z|x) \,\|\, p(z)\big), \quad (6)

where D_KL denotes the KL divergence, and q_φ(z|x) is the estimated posterior distribution emitted by the encoder (parameterized by φ), while p_θ(x|z) is the likelihood modeled by the decoder (parameterized by θ). It is often called the evidence lower bound (ELBO) objective since it can be proved that L_ELBO ≤ log p(x). For simplicity, we typically restrict q_φ(z|x) to isotropic gaussians N(µ, diag(σ²)), and have the encoder emit the mean and standard deviation of z, i.e., µ, σ ∈ R^{dim(z)}. The prior distribution p(z) is often set to be the isotropic standard gaussian N(0, I).

For practical considerations, β-VAE [39] proposed that we may adjust the weight on the KL term of the VAE objective with an additional hyperparameter β. Though using β ≠ 1 deviates from optimizing the ELBO, it grants practitioners the freedom to trade off reconstruction accuracy for a smoother latent space and better feature disentanglement (by opting for β > 1). It is worth mentioning that, in musical applications, β is often set to 0.1∼0.2 [7], [19], [20].

In the literature, it has been repeatedly mentioned that VAEs suffer from the posterior collapse problem [40], [41], [42], in which case the estimated posterior q_φ(z|x) fully collapses onto the prior p(z), leading to an information-less latent space. The problem is especially severe when the decoder is powerful, or when the decoding is autoregressive, where previous tokens in the sequence reveal strong information. Numerous methods have been proposed to tackle this problem, either by tweaking the training objective [43], [44] or by slightly modifying the model architecture [42]. Kingma et al. [43] introduced a hyperparameter λ to the VAE objective to ensure that each latent dimension may store λ nats7 of information without being penalized by the KL term. Cyclical KL annealing [44] periodically adjusts β (i.e., the weight on the KL term) during training. Empirical evidence has shown

7. 1 nat ≈ 1.44 bits


that this simple technique benefited downstream tasks such as language modeling and text classification. Skip-VAE [42], on the other hand, addressed posterior collapse with slight architectural changes to the model. In Skip-VAEs, the latent condition z is fed to all layers and timesteps instead of just the initial ones. Theoretically and empirically, it was shown that Skip-VAEs increase the mutual information between the input x and the estimated latent condition z. In our work, we take advantage of all of the techniques above to facilitate the training of our Transformer-based VAE.
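As an illustration of the two remedies used later in this work, the sketch below shows one plausible cyclical schedule for β [44] and the free-bits clamp on the per-dimension KL term [43]. The exact schedule shape and the function names are assumptions for illustration, not the cited papers' reference code.

```python
import torch

def cyclical_beta(step, cycle_steps=5000, beta_max=1.0, ramp_frac=0.5):
    """Cyclical KL annealing [44]: within each cycle, beta ramps linearly from 0 to beta_max
    over the first `ramp_frac` of the cycle, then stays flat (assumed schedule shape)."""
    pos = (step % cycle_steps) / cycle_steps
    return beta_max * min(1.0, pos / ramp_frac)

def kl_with_free_bits(mu, logvar, lam=0.25):
    """KL(q(z|x) || N(0, I)) with a per-dimension free-bits floor of `lam` nats [43].
    mu, logvar: (batch, d_z) gaussian posterior parameters."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)   # analytic KL per dimension
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1).mean()     # clamp, sum dims, average batch
```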

3 CONDITIONING TRANSFORMER DECODERS AT THE SEGMENT LEVEL

This section sets aside latent variable models first, and is concerned with conditioning an autoregressive Transformer decoder with a series of pre-given, time-varying conditions c_1, . . . , c_K, which are continuous vectors each belonging to a pre-defined, non-overlapping segment of the target sequence. To approach this problem, three mechanisms, namely, pre-attention, in-attention, and post-attention conditioning, are proposed. Experiments demonstrate that, in terms of offering tight control, in-attention surpasses the other two, as well as a mechanism slightly tweaked from Optimus [28], which serves as our baseline.

3.1 Problem Formulation

Under typical settings, Transformer decoders [8] are employed for autoregressive generative modeling of sequences:

p(x_t \mid x_{<t}), \quad (7)

where x_t is the element of a sequence to predict at timestep t, and x_{<t} is all the (given) preceding elements of the sequence. Once a model is trained, the model can generate new sequences autoregressively, i.e., one element at a time based on the previously generated elements, in chronological order of the timesteps [25].

The unconditional type of generation associated with Eq. (7), however, does not offer control mechanisms to guide the generation process of the Transformer. One alternative is to consider a conditional scenario where the model is informed of a global condition c for the entire sequence to generate, as used by, for example, the CTRL model for text [45] or the MuseNet model for music [12]:

p(x_t \mid x_{<t}, c). \quad (8)

When the target sequence length is long, it may be beneficial to extend Eq. (8) by using instead a sequence of conditions c_1, c_2, . . . , c_K, each belonging to different parts of the sequence, to offer fine-grained control through:

p(x_t \mid x_{<t}; c_1, c_2, . . . , c_K). \quad (9)

This can be implemented, for example, by treating representations of c_1, c_2, . . . , c_K as additional memory for the Transformer decoder to attend to, which can be achieved with minimal modifications to [28].

In this paper, we particularly address the case where the target sequences can be, by nature, divided into multiple meaningful, non-overlapping segments, such as sentences in a text article, or bars (measures) in a musical piece. That is to say, for sequences with K segments, each timestep index t ∈ [1, T] belongs to one of the K sets of indices I_1, I_2, . . . , I_K, where I_k ∩ I_{k′} = ∅ for k ≠ k′ and \bigcup_{k=1}^{K} I_k = [1, T]. Thus, we can provide the generative model with a condition c_k for each segment during the corresponding time interval I_k, leading to the modeling of:

p(x_t \mid x_{<t}; c_k), \quad \text{for } t ∈ I_k. \quad (10)

This conditional generation task is different from the one outlined in Eq. (9) in that we know specifically which condition (among the K conditions) to attend to for each timestep. The segment-level condition c_k may manifest itself as a sentence embedding in text, or a bar-level embedding in music. In our case, we consider the conditions to be bar-level embeddings of a musical piece, i.e., we represent each c_k in the vector form c_k ∈ R^{d_c}, where d_c is the dimensionality of the bar-level embedding. Collectively, the time-varying conditions c_1, c_2, . . . , c_K offer a blueprint, or high-level planning, of the sequence to model. This is potentially helpful for generating long sequences, especially for generating music, as ancestral sampling from unconditional autoregressive models, which are trained only with long sequences of scattered tokens, may fail to exhibit the high-level flow and twists that are central to a catchy piece of music.

The segment-level conditional generation scenario associated with Eq. (10), though seemingly intuitive for long sequence modeling, has not been much studied in the literature. Hence, it is our goal to design methods that specifically tackle this task, and to perform comprehensive evaluation to find the most effective one.

3.2 Method

Here, we elaborate on the three proposed mechanisms: pre-attention, in-attention, and post-attention (see Fig. 2 for a schematic comparison), to condition Transformer decoders at the segment level. For simplicity, it is assumed here that the bar-level conditions, i.e., the c_k's, can be obtained separately beforehand; their extraction method is also introduced below. Finally, we explain the data representation and dataset adopted for this task.

Pre-attention Conditioning. Under pre-attention, the segment-level conditions enter the Transformer decoder only once before all the self-attention layers. The segment embedding c_k is first transformed to a hidden condition state e_k by a matrix W_pre ∈ R^{d_c×d_e}. Then, it is concatenated with the token embedding of each timestep in the kth bar (i.e., I_k) before the first self-attention layer, i.e.,

h_t^0 = \mathrm{concat}([x_t; e_k]), \quad x_t, e_k ∈ R^{d_e}, \quad (11)

where d_e is the dimensionality for token embeddings. Note that under this mechanism, the hidden dimension, d, for attention modules is implicitly required to be 2d_e.

The pre-attention conditioning is methodology-wise similar to the use of segment embeddings in BERT [26], except that BERT's segment embedding is summed directly with the token embedding, while we use concatenation to strengthen the presence of the conditions.


(a) Pre-attention conditioning (b) In-attention conditioning (c) Post-attention conditioning

Fig. 2: The architecture of segment-level conditional Transformer decoders using different conditioning mechanisms.

In-attention Conditioning. The proposed in-attention mechanism more frequently reminds the Transformer decoder of the segment-level conditions throughout the self-attention layers. It projects the segment embedding c_k to the same space as the self-attention hidden states via:

e_k^{\top} = c_k^{\top} W_{\mathrm{in}}; \quad W_{\mathrm{in}} ∈ R^{d_c×d}, \quad (12)

and then sums the obtained hidden condition state e_k with the hidden states of all the self-attention layers but the last one, to form the input to the subsequent layer, i.e.,

h_t^l = h_t^l + e_k, \quad ∀l ∈ \{0, . . . , L−1\};
h_t^{l+1} = \mathrm{Transformer\_self\text{-}attention\_layer}(h_t^l). \quad (13)

We use summation rather than concatenation here to keep the residual connections intact. We anticipate that by copy-pasting the segment-level condition everywhere, its influence on the Transformer can be further aggrandized.
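A minimal sketch of in-attention conditioning (Eqs. (12)-(13)) in PyTorch is given below. The attention modules passed in are assumed to be causal self-attention modules such as the one sketched in Sec. 2.2; this is an illustration of where the projected condition is injected, not the released MuseMorphose code.

```python
import torch
import torch.nn as nn

class InAttentionDecoder(nn.Module):
    """Transformer decoder with in-attention conditioning (Eqs. (12)-(13)).
    `layers` is a list of causal self-attention modules (e.g., the sketch in Sec. 2.2)."""
    def __init__(self, layers, d, d_c):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.W_in = nn.Linear(d_c, d, bias=False)     # projects c_k into the hidden space

    def forward(self, tok_emb, seg_ids, seg_conds):
        # tok_emb:   (T, d)   token (+ positional) embeddings
        # seg_ids:   (T,)     index k of the bar each timestep belongs to
        # seg_conds: (K, d_c) one condition vector c_k per bar
        e = self.W_in(seg_conds)[seg_ids]             # (T, d): e_k broadcast to its bar's timesteps
        h = tok_emb
        for layer in self.layers:
            h = layer(h + e)                          # add the condition before every layer
        return h                                      # final hidden states, fed to the output head
```

Note that the same projected condition e_k is re-added before every layer, which is precisely what distinguishes in-attention from pre-attention, where the condition is seen only once, via concatenation, before the first layer.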

Post-attention Conditioning. Unlike the last two mechanisms, in post-attention, the segment embeddings do not interact with the self-attention layers at all. Instead, they are imposed afterwards on the final attention outputs (i.e., h_t^L) via two single-hidden-layer, parametric ReLU-activated feed-forward networks: the conditioning net B, and the gating net G. Specifically, the segment embedding c_k is transformed by the conditioning net B to become e_k ∈ R^d. Then, a gating mechanism, dictated by the gating net G, determines "how much" of e_k should be blended with the self-attention output h_t^L at every timestep t, i.e.,

e_k = B(c_k);
α_t = \mathrm{sigmoid}(G(\mathrm{concat}([h_t^L; c_k])));
h_t^L = (1 − α_t) · h_t^L + α_t · e_k. \quad (14)

Finally, h_t^L proceeds to the output projections to form the probability distribution of x_{t+1}.

TABLE 1: Model attributes shared by the implemented Transformer decoders.

Attr.     Description                        Value
T_tgt     target sequence length             1,024
T_mem     Transformer-XL memory length       1,024
L         # self-attention layers            12
n_head    # self-attention heads             10
d_e       token embedding dimension          320
d         hidden state dimension             640
d_ff      feed-forward dimension             2,048
d_c       condition embedding dimension      512
# params  —                                  58.7 to 62.6 mil.

This mechanism is adapted from the contextualized vocabulary bias proposed in the Insertion Transformer [46]. It is designed such that the segment-level conditions can directly bias the output event distributions, and that the model can freely decide how much it refers to the conditions for help at different timesteps within each bar.
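For completeness, here is a sketch of the post-attention gate of Eq. (14); the hidden width of the two feed-forward nets (B and G) is an assumed value, not one quoted in the paper.

```python
import torch
import torch.nn as nn

class PostAttentionGate(nn.Module):
    """Blends the final attention output h_t^L with a conditioning bias e_k (Eq. (14))."""
    def __init__(self, d, d_c, d_hidden=256):   # d_hidden is an assumed size
        super().__init__()
        self.cond_net = nn.Sequential(           # conditioning net B
            nn.Linear(d_c, d_hidden), nn.PReLU(), nn.Linear(d_hidden, d))
        self.gate_net = nn.Sequential(           # gating net G
            nn.Linear(d + d_c, d_hidden), nn.PReLU(), nn.Linear(d_hidden, 1))

    def forward(self, h_last, c):
        # h_last: (T, d) final-layer outputs; c: (T, d_c) the segment condition of each timestep
        e = self.cond_net(c)                                                  # e_k = B(c_k)
        alpha = torch.sigmoid(self.gate_net(torch.cat([h_last, c], dim=-1)))  # (T, 1) gate
        return (1 - alpha) * h_last + alpha * e                               # gated blend
```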

Implementation and Training Details. We adopt Transformer-XL [16] as the backbone sequence model behind all of our conditioning mechanisms. To demonstrate the superiority of using Eq. (10), we involve two baselines in our study, dubbed unconditional and memory hereafter. The former takes no c_k's, thereby modeling Eq. (7) exactly. The latter models Eq. (9) with a conditioning mechanism largely resembling the memory scheme introduced in Optimus [28], differing only in that we have multiple conditions instead of one. In the memory baseline, the conditions {c_k}_{k=1}^{K} are each transformed by a matrix W_mem ∈ R^{d_c×Ld} to become hidden states {e_k^l ∈ R^d}_{l=1}^{L} unique to each layer l. These states are then fed to the decoder before the training sequence, allowing the tokens (i.e., the x_t's) to attend to them. Some common attributes shared among the five implemented models are listed in Table 1.


Fig. 3: Illustration of the method for extracting bar-level embeddings to be used by our conditional Transformer decoders.

All these models are trained with the Adam optimizer [47] and teacher forcing, i.e., always feeding in correct inputs rather than those sampled from previous-timestep outputs from the model itself, to minimize the negative log-likelihood (NLL):

−\sum_{t=1}^{T} \log p(x_t \mid x_{<t}), \quad (15)

of the training sequences. Due to limited computational resources, we truncate each song to the first 32 bars. We warm up the learning rate linearly from 0 to 2e−4 in the first 500 training steps, and use cosine learning rate decay (600,000 steps) afterwards. Trainable parameters are randomly initialized from the gaussian N(0, 0.01²). Each model is trained on one NVIDIA Tesla V100 GPU (with 32GB memory) with a batch size of 4 for around 20 epochs, which requires roughly 2 full days.

Extraction of Bar-level Embeddings. We extract the conditions c_k from the sequences themselves through a separate encoder network E, which is actually a typical Transformer decoder with 12 self-attention layers and 8 attention heads. The encoder network E takes a sequence of event tokens corresponding to a musical bar and outputs a d_c-dimensional embedding vector representing the bar. We use a GPT-2 [25] like setup and train E for autoregressive next-token prediction, and average-pool across all timesteps on the hidden states of a middle layer l of the Transformer to obtain the segment-level condition embedding, namely,

c_k = \mathrm{avgpool}([h_1^l, . . . , h_{T_k}^l]), \quad (16)

where h_t^l ∈ R^{d_E} (= R^{d_c}) is the hidden state at timestep t after l self-attention layers, and T_k is the number of tokens in the bar. According to [48], the hidden states in middle layers are the best contextualized representation of the input, so we set l = 6.

The model E has 39.6 million trainable parameters. We use all the bars associated with our training data (i.e., not limited to the first 32 bars of the songs) for training E, with teacher forcing and causal self-attention masking, to also minimize the NLL, i.e., −\sum_t \log p(x_t \mid x_{<t}), as we do with our segment-level conditional Transformers. Fig. 3 is an illustration of the method introduced above.
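The pooling step of Eq. (16) is simple; the following sketch assumes the per-layer hidden states for one bar have already been computed by the encoder network E.

```python
import torch

def extract_bar_embedding(hidden_states_per_layer, layer_idx=6):
    """Average-pool a middle layer's hidden states to get a bar embedding (Eq. (16)).
    hidden_states_per_layer: list of (T_k, d_E) tensors, one per self-attention layer,
    computed by the encoder network E on the tokens of a single bar."""
    h_l = hidden_states_per_layer[layer_idx]   # hidden states after `layer_idx` layers
    return h_l.mean(dim=0)                     # (d_E,) = (d_c,) bar-level condition c_k
```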

Dataset and Data Representation. In our experiments, we consider modeling long sequences of symbolic music with up to 32 bars per sequence. Specifically, our data come

TABLE 2: The vocabulary used to represent songs in the LPD-17-cleansed dataset.

Event type   Description                              # tokens
BAR          beginning of a new bar                   1
SUB-BEAT     position in a bar, in 32nd-note steps    32
TEMPO        32∼224 bpm, in steps of 3 bpm            64
PITCH*       MIDI note numbers (pitch) 0∼127          1,757
VELOCITY*    MIDI velocities 3∼127                    544
DURATION*    multiples (1∼64 times) of a 32nd note    1,042
All events   —                                        3,440
*: unique for each of the 17 tracks

from the LPD-17-cleansed dataset [49], a pop music MIDI dataset containing >20K songs with at most 17 instrumental tracks (e.g., piano, strings, and drums) per song. We take the subset of 10,626 songs in which the piano is playing at least half of the time, and truncate each song to the beginning 32 bars, resulting in a dataset with 650 hours of music. These songs are all in 4/4 time (i.e., four beats per bar). We reserve 4% of the songs (i.e., 425 songs) as the validation set for objective evaluation.

Following REMI [13], we represent music in the form of event tokens, encompassing note-related tokens such as PITCH-[TRK], DURATION-[TRK], and VELOCITY-[TRK], which represent a note played by a specific track (i.e., '[TRK]'); and metric-related tokens such as BAR, SUB-BEAT, and TEMPO. The SUB-BEAT tokens divide a bar into 32 possible locations for the onset (i.e., starting time) of notes, laying an explicit time grid for the Transformer to process the progression of time. The BAR token marks the beginning of a new bar, making it easy to associate different bar-level conditions c_k to subsequences belonging to different bars (i.e., different I_k's). Brief descriptions of all types of tokens in our representation are provided in Table 2.

3.3 Evaluation Metrics

Besides examining the training and validation losses, for further evaluation, we let our segment-level conditional Transformers generate multi-track MIDI music 32 bars long, with the bar-level conditions (i.e., {c_k}_{k=1}^{32}) extracted from songs in the validation set. Therefore, in essence, the models generate re-creations, or covers, of existing music. In the following paragraphs, we define and elaborate on the objective metrics that assess re-creations of music.

Evaluation Metrics for Re-creation Fidelity. Three bar-level metrics and a sequence-level metric are defined to quantitatively evaluate whether the re-creations of our models are similar to the original songs from which the segment-level conditions are extracted (a computation sketch for the bar-level similarities follows this list).
• Chroma similarity, or sim_chr, measures the closeness of two bars in tone via:

\mathrm{sim}_{\mathrm{chr}}(r^a, r^b) = 100 \, \frac{\langle r^a, r^b \rangle}{\lVert r^a \rVert \, \lVert r^b \rVert}, \quad (17)

where ⟨·, ·⟩ denotes the dot-product, and r ∈ Z^{12} is the chroma vector [50] representing the number of onsets for each of the 12 pitch classes (i.e., C, C#, . . . , B) within a bar, counted across octaves and tracks, with the drums track ignored.


One of the bars in the pair (a, b) is from the original song, while the other is from the re-creation.

• Grooving similarity, or sim_grv, examines the rhythmic resemblance between two bars with the same formulation as the chroma similarity, measured instead on the grooving vectors g ∈ Z^{32} recording the number of note onsets occurring at each of the 32 sub-beats in a bar, counted across tracks [51].

• Instrumentation similarity, or sim_ins, quantifies whether two bars use similar instruments via:

\mathrm{sim}_{\mathrm{ins}}(b^a, b^b) = 100 \Big(1 - \frac{1}{17} \sum_{i=1}^{17} \mathrm{XOR}(b_i^a, b_i^b)\Big), \quad (18)

where b ∈ {0, 1}^{17} is a binary vector indicating the presence of a track in a bar, i.e., at least one PITCH-[TRK] for that [TRK] occurs within the bar.

• Self-similarity matrix distance, or dist_SSM, measures whether two 32-bar sequences have similar overall structure by computing the mean absolute distance between the self-similarity matrices (SSM) [52] of the two sequences, i.e.,

\frac{100}{N^2} \, \lVert S^a - S^b \rVert_1, \quad (19)

where S ∈ R^{N×N} is the SSM. In doing so, we first synthesize each 32-bar sequence into audio with a synthesizer. Then, we use the constant-Q transform to extract acoustic chroma vectors indicating the relative strength of each pitch class for each half beat in an audio piece, and then calculate the cosine similarity among such beat-synchronized features; hence, s_{ij} ∈ [0, 1] for half beats i and j within the same music piece. Since each bar in 4/4 signature has 8 half-beats, we have N = 256 for our data. An SSM depicts the self-repetition within a piece. Unlike the previous metrics, the value of dist_SSM is the closer to zero the better. We note that the values of all metrics above, i.e., sim_chr, sim_grv, sim_ins, and dist_SSM, are all in [0, 100].
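The bar-level fidelity metrics referred to above can be computed as follows; the sketch assumes the chroma, grooving, and track-presence vectors have already been extracted from the MIDI data.

```python
import numpy as np

def cosine_sim_100(u, v):
    """Shared form of the chroma and grooving similarities (Eq. (17)): 100 * cosine similarity."""
    return 100.0 * np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def chroma_similarity(r_a, r_b):
    # r_a, r_b: length-12 onset counts per pitch class for one bar (drums ignored)
    return cosine_sim_100(r_a, r_b)

def grooving_similarity(g_a, g_b):
    # g_a, g_b: length-32 onset counts per sub-beat for one bar, counted across tracks
    return cosine_sim_100(g_a, g_b)

def instrumentation_similarity(b_a, b_b):
    # b_a, b_b: length-17 binary track-presence vectors for one bar (Eq. (18))
    return 100.0 * (1.0 - np.mean(np.logical_xor(b_a, b_b)))
```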

Evaluation Metrics for Re-creation Quality. It is worth mentioning that objective evaluation of the quality of generated music remains an open issue [14]. Nevertheless, we adopt here a recent idea that measures the quality of music by comparing the distributions of the generated content and the training data in feature spaces, using KL divergence [53].

While we employ sim_chr, sim_grv, and sim_ins to measure the similarity between a generated bar and the corresponding reference bar in Sec. 3.3, here we instead use them to calculate either the (self-)similarity among the bars in a generated 32-bar sequence, or that among the bars in a real training sequence. For each sequence, we have 32 · (32 − 1)/2 such pairs, leading to a distribution. Since the values of sim_chr and sim_grv are continuous, we discretize their distributions into 50 bins with evenly-spaced boundaries [0, 2, . . . , 100]. The KL divergence is calculated on the probability mass functions p_real, p_gen ∈ R^{50}, or R^{18} for the distribution in terms of sim_ins, and is defined as KL(p_real || p_gen). We assume that the closer the values are to zero, the better.
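Concretely, each quality metric reduces to binning the pairwise similarities and computing a KL divergence; a sketch with the 50-bin setup described above (the smoothing constant is our own addition to avoid empty bins):

```python
import numpy as np
from scipy.stats import entropy

def similarity_kl(real_sims, gen_sims, bins=np.arange(0, 102, 2), eps=1e-6):
    """KL(p_real || p_gen) over binned pairwise-similarity distributions (Sec. 3.3).
    real_sims, gen_sims: pairwise bar similarities collected from real / generated pieces."""
    p_real, _ = np.histogram(real_sims, bins=bins)
    p_gen, _ = np.histogram(gen_sims, bins=bins)
    p_real = p_real / p_real.sum() + eps   # small smoothing to avoid empty bins
    p_gen = p_gen / p_gen.sum() + eps
    return entropy(p_real, p_gen)          # scipy normalizes and computes sum(p * log(p / q))
```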

3.4 Results and Discussion

We begin by presenting the training dynamics of our models (see Fig. 4). From Fig. 4a, we can see that the model whose loss drops the fastest, in-attention, achieves 0.41 training NLL

(a) # epochs vs. training loss    (b) Training loss vs. val. loss    (curves: unconditional, memory, pre-attention, in-attention, post-attention)

Fig. 4: Training dynamics of segment-level conditional Transformers on the LPD-17-cleansed dataset (view in color).

after 16 epochs, while the unconditional and memory [28] models still have losses of around 0.65 and 0.60, respectively. We also take validation loss into consideration. Fig. 4b shows that our models, especially pre- and in-attention, outperform both unconditional and memory at comparable training NLL.

Table 3 displays the objective evaluation results using the metrics defined in Sec. 3.3. The scores are calculated on 425 re-creations, each using the bar-level conditions from a unique song in the validation set. For comparison, we let the unconditional model randomly generate 400 32-bar pieces from scratch, and further include a random baseline, whose scores are computed over 400 random pairs of pieces drawn from the training set.

Focusing on the fidelity metrics first, all of our conditional models score quite high on bar-level sim_chr, sim_grv, and sim_ins, with in-attention significantly outperforming the rest. In-attention also attains the lowest dist_SSM, demonstrating effective and tight control. Moreover, it is worth noting that the memory model sits right in between ours and the random baseline in all metrics, offering limited conditioning ability.

Next, we shift attention to the quality metrics, where KL_chr, KL_grv, and KL_ins represent the KL divergence calculated on the corresponding distributions. The conditional models clearly outperform the unconditional baseline, suggesting that our conditions do help Transformers in modeling music. Furthermore, once again, in-attention comes out on top, followed by pre-attention, post-attention, and then memory, except for the metric KL_ins, on which memory performs the best.

From the results presented above, we surmise that the generally worse performance of the memory baseline compared to ours could be due to the advantages of using Eq. (10) over Eq. (9). Equation (10) informs the Transformer decoder of when exactly to exploit each of the conditions, and eliminates the bias caused by the positional encoding fed to Transformers, i.e., later tokens in the sequence are made more dissimilar to the conditions provided in the beginning of the sequence, thereby undermining the conditions' effect in the attention process. Therefore, memory works well only for conditions that remain relatively unchanged throughout a sequence, such as the instrumentation of a song. Among the three proposed conditioning methods, we conjecture that


TABLE 3: Objective evaluation results of segment-level conditional Transformers on re-creating multi-track music (↓/↑: the lower/higher the score the better).

Model            sim_chr ↑   sim_grv ↑   sim_ins ↑   dist_SSM ↓   KL_chr ↓   KL_grv ↓   KL_ins ↓
                 (Fidelity)                                        (Quality)
random           34.6        40.5        80.4        13.5         —          —          —
unconditional    —           —           —           —            .047       .228       .219
memory [28]      60.7        62.7        90.7        10.3         .021       .061       .006
pre-attention    94.5        92.9        93.0        7.10         .006       .009       .034
in-attention     96.3***     95.7***     97.0***     5.83***      .002       .005       .021
post-attention   92.4        85.7        95.0        8.27         .016       .054       .056
***: leads all other models with p < .001

post-attention works the worst because the conditions do not participate in the attention process; therefore, the model has less chance to integrate information from them. Moreover, the advantage of in-attention over pre-attention is reasonably due to the fact that, in in-attention, the conditions are fed to all attention layers instead of just once before the first attention layer. With the comprehensive evaluation conducted, we have sufficient evidence indicating that in-attention works the best in exerting control over Transformer decoders with segment-level, time-varying conditions.

4 MUSEMORPHOSE: GENERATING MUSIC WITH TIME-VARYING USER CONTROLS

We have developed in Section 3 the in-attention technique, which can exert firm control over a Transformer decoder's generation with time-varying, predetermined conditioning vectors. However, that technique alone only enables Transformers to compose re-creations of music at random; no freedom has been given to users to interact with such a system to affect the music it generates as one wishes, thereby limiting its practical value. This section aims at alleviating such a limitation. We bridge the in-attention Transformer decoder and a jointly learned bar-wise Transformer encoder, tasked with learning the bar-level latent conditions, with the variational autoencoder (VAE) training objective; and, we introduce attribute embeddings [20], [27], also working at the bar level, to the decoder to discourage the latent conditions from storing attribute-related information, and to realize fine-grained user controls over sequence generation. Experiments show that our resulting model, MuseMorphose, successfully allows users to harness two easily perceptible musical attributes, rhythmic intensity and polyphony, while maintaining fluency and high fidelity to the input reference song. Such a combination of abilities is shown to be unattainable by existing methods [19], [20].

4.1 Problem Formulation

For clarity, we model our user-controllable conditional generation task as a style transfer problem without paired data [27]. This problem, in its most basic form, deals with an input instance X, such as a sentence or a musical excerpt, and an attribute a intrinsic to X. The attribute a ∈ {1, . . . , C} can be either a nominal (e.g., composer style) or an ordinal (e.g., note density) variable with C categories. As users, we would like to have a model that takes a tuple:

(X, ǎ), \quad (20)

where ǎ is the user-specified attribute value that may be different from the original a, and outputs a style-transferred instance X̌. A desirable outcome is that the generated X̌ bears the specified attribute value ǎ (i.e., achieving effective control), and preserves the content other than the attribute (i.e., exhibiting high fidelity). Being able to generate diverse versions of X̌ given the same ǎ, or to ensure the fluency of generations, though not central to this problem, are also preferred characteristics of such models. The aforementioned framework has been widely adopted by past works in both the text [27], [54], [55] and music [19] domains.

A natural extension to the problem above is to augment the model such that it can process multiple attributes at a time, i.e., taking inputs in the form:

(X, ǎ^1, ǎ^2, . . . , ǎ^J), \quad (21)

to produce X̌, where ǎ^j, for j ∈ {1, . . . , J}, are the J categorical attributes being considered. In this case, users have the freedom to alter single or multiple attribute values. Hence, preferably, changing one attribute to ǎ^j should not affect an unaltered attribute a^{j′}. This multi-attribute variant has also been studied in [20], [56].

In some scenarios, the input sequence X is long and can

be partitioned into meaningful continuous segments, e.g., sentences in text or bars in music, through:

X = \{X_1, . . . , X_K\}, \ \text{where} \ X_k = \{x_t \mid t ∈ I_k\}, \ I_k ⊂ \mathbb{N} \ \text{for} \ 1 ≤ k ≤ K; \quad \bigcup_{k=1}^{K} I_k = [1, T] \ \text{and} \ I_k ∩ I_{k′} = ∅ \ \text{for} \ k ≠ k′, \quad (22)

where T is the length of X, and I_1, . . . , I_K constitute a partition of X (cf. the paragraph concerning the formulation of Eq. (10) in Sec. 3.1). For such cases, it could be a privilege if we can control each and every segment individually, since this allows, for example, users to make one musical section a lot more intense, and another much calmer, to achieve dramatic contrast. Therefore, we formulate here the style transfer problem with multiple, time-varying attributes. A model tackling this problem accepts:

(X, \{ǎ_k^1, ǎ_k^2, . . . , ǎ_k^J\}_{k=1}^{K}), \quad (23)

and then generates a long, style-transferred X̌, where ǎ_k^1, ǎ_k^2, . . . , ǎ_k^J are concerned specifically with the segment X_k. Though Transformers [8], [25] are impressive in sequence generation, no prior work, to our knowledge, has addressed the problem above with them, reasonably because there


has not been a well-known segment-level conditioning mechanism for Transformers. Thus, given our developed in-attention method, we are indeed in a privileged position to approach this task.

4.2 Method

In what follows, we describe how we pair an in-attention Transformer decoder with a bar-wise Transformer encoder, as well as attribute embeddings, to achieve fine-grained controllable music generation through optimizing the VAE objective. The musical attributes considered here are rhythmic intensity (a^rhym) and polyphony (a^poly; i.e., harmonic fullness), whose calculation methods will also be detailed. After that, we present the implementation details and the pop piano dataset, REMI-pop-1.7K, used for this task.

Model Architecture. Figure 5 illustrates our MuseMorphose model, which consists of a Transformer encoder operating at the bar level, a Transformer decoder that accepts segment-level conditions through the in-attention mechanism, and a KL divergence-regularized latent space for the representation of musical bars between them. In MuseMorphose, the input music is partitioned as X = {X_1, . . . , X_K}, where X_k houses the kth bar of the music. The encoder works on the musical bars, X_1, . . . , X_K, in parallel, while the decoder sees the entire piece X at once. The bar-level attributes, a_k^rhym and a_k^poly, each consisting of 8 ordinal classes, are first transformed into embedding vectors, \mathbf{a}_k^{rhym}, \mathbf{a}_k^{poly} ∈ R^{d_a}, before entering the decoder's attention layers through in-attention. On the whole, the operations taking place inside MuseMorphose can be summarized as follows:

z_k = \mathrm{enc}(X_k), \quad \text{for } 1 ≤ k ≤ K;
c_k = \mathrm{concat}([z_k; \mathbf{a}_k^{rhym}; \mathbf{a}_k^{poly}]);
y_t = \mathrm{dec}(x_{<t}; c_k), \quad t ∈ I_k, \ \text{for } 1 ≤ k ≤ K, \quad (24)

where z_k ∈ R^{d_z} is the latent condition for the kth bar; \mathbf{a}_k^{rhym}, \mathbf{a}_k^{poly} are the bar's attribute embeddings; I_k stores the timestep indices (see Eq. (22)) of the bar; and y_t is the predicted probability distribution for x_t. Note that though the input is split into bars on the encoder side, the decoder still deals with the entire sequence, thanks to both in-attention conditioning and \bigcup_{k=1}^{K} I_k = [1, T]. This asymmetric architecture is likely advantageous because it enables fine-grained conditions to be given at the bar level, and also promotes the coherence of long generations.

Since we adopt the VAE framework here, we now elaborate more on how we construct the latent space for the z_k's. First, we treat the encoder's attention output at the first timestep (corresponding to the BAR token), i.e., h_{k,1}^{L_enc}, as the contextualized representation of the bar.8 Then, following the conventional VAE setting [9], it is projected by two separate learnable weights, W_µ and W_σ, to the mean and std vectors:

µ_k = (h_{k,1}^{L_enc})^{\top} W_µ, \quad σ_k = (h_{k,1}^{L_enc})^{\top} W_σ, \quad (25)

where W_µ, W_σ ∈ R^{d×d_z}. Afterwards, we may sample the latent condition to be fed to the decoder from the isotropic gaussian defined by µ_k and σ_k:

z_k ∼ N(µ_k, \mathrm{diag}(σ_k^2)). \quad (26)

8. Since after L_enc attention layers, h_{k,1}^{L_enc} should have gathered information from the entire bar.

For simplicity, the prior of the latent conditions, i.e., p(z_k), is set to the standard gaussian N(0, I_{d_z}), which has been typically used for VAEs.

It is worthwhile to mention that the subsequent copy-paste of the z_k's, done by the in-attention decoder, to every attention layer and every timestep in I_k coincides with the design of Skip-VAE [42], which provably mitigates the posterior collapse problem, i.e., the case where the decoder completely ignores the z_k's and degenerates into autoregressive sequence modeling. Also, our work makes a notable advance over Optimus [28] in that we equip Transformer-based VAEs with the capability to model long sequences (with length in the order of ≥10^3), under fine-grained changing conditions from the latent vector (i.e., z_k) and user controls (i.e., a_k^rhym and a_k^poly).
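To summarize Eqs. (24)-(26), the sketch below traces one forward pass of MuseMorphose at a high level. The bar-wise encoder and the in-attention decoder are assumed to be available as sub-modules (e.g., the in-attention sketch of Sec. 3.2), and the second projection head is treated as a log-variance for numerical stability, a common implementation choice rather than the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class MuseMorphoseSketch(nn.Module):
    """High-level flow of Eqs. (24)-(26): bar-wise encoding, latent sampling,
    attribute embeddings, and in-attention decoding. Sub-modules are assumed to exist."""
    def __init__(self, encoder, decoder, d, d_z, d_a, n_classes=8):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.W_mu = nn.Linear(d, d_z, bias=False)        # Eq. (25)
        self.W_sigma = nn.Linear(d, d_z, bias=False)     # emits log-variance here (implementation choice)
        self.emb_rhym = nn.Embedding(n_classes, d_a)     # bar-level attribute embeddings
        self.emb_poly = nn.Embedding(n_classes, d_a)

    def forward(self, bars, tok_emb, seg_ids, a_rhym, a_poly):
        # bars: list of K per-bar token-embedding tensors; a_rhym, a_poly: (K,) class indices
        h_bar = torch.stack([self.encoder(b)[0] for b in bars])   # (K, d): first-timestep (BAR) states
        mu, logvar = self.W_mu(h_bar), self.W_sigma(h_bar)        # posterior parameters, Eq. (25)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterized sample, Eq. (26)
        c = torch.cat([z, self.emb_rhym(a_rhym), self.emb_poly(a_poly)], dim=-1)  # (K, d_z + 2 d_a)
        y = self.decoder(tok_emb, seg_ids, c)                     # in-attention decoding, Eq. (24)
        return y, mu, logvar
```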

Training Objective. MuseMorphose is trained with the β-VAE objective [39] with free bits [43]. It minimizes the loss, L_MuseMorphose, which is written as:

\mathcal{L}_{\mathrm{MuseMorphose}} = \mathcal{L}_{\mathrm{NLL}} + β \mathcal{L}_{\mathrm{KL}}, \ \text{where}
\mathcal{L}_{\mathrm{NLL}} = − \log p_{\mathrm{dec}}(X \mid c_1, . . . , c_K);
\mathcal{L}_{\mathrm{KL}} = \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{d_z} \max\big(λ, D_{\mathrm{KL}}(q_{\mathrm{enc}}(z_{k,i} \mid X_k) \,\|\, p(z_{k,i}))\big). \quad (27)

The first term, L_NLL, is the conditional negative log-likelihood (NLL) for the decoder to generate the input X given the conditions c_1, . . . , c_K,9 hence referred to as the reconstruction NLL hereafter. The second term, L_KL, is the KL divergence between the posterior distributions of the z_k's estimated by the encoder (i.e., q_enc(z_k | X_k)) and the prior p(z_k). β [39] and λ [43] are hyperparameters to be tuned (see Section 2.3 for definitions). We refer readers to Fig. 5 for a big picture of where the two loss terms act on the model.
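As a code-form sketch of Eq. (27), with assumed names and shapes (cf. the free-bits clamp sketched in Sec. 2.3):

```python
import torch
import torch.nn.functional as F

def musemorphose_loss(logits, targets, mu, logvar, beta, lam=0.25):
    """Eq. (27): reconstruction NLL plus a beta-weighted, free-bits-clamped KL term.
    logits: (T, vocab) decoder outputs; targets: (T,) token ids;
    mu, logvar: (K, d_z) bar-wise posterior parameters from the encoder."""
    nll = F.cross_entropy(logits, targets, reduction="sum")        # L_NLL
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)   # D_KL per bar and dimension
    kl = torch.clamp(kl_per_dim, min=lam).sum(dim=-1).mean()       # free bits, then average over K bars
    return nll + beta * kl
```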

Previous works that employ autoencoders for style transfer tasks [20], [55], [57] often suggested adding adversarial losses [21] on the latent space, so as to discourage it from storing style-related, or attribute-related, information. However, a potential downside of this practice is that it introduces additional complications to the training of our Transformer-based network, which is already very complex itself. We instead demonstrate that, by using suitable β and λ to control the size of the latent information bottleneck, both strong style transfer and good content preservation of the input X can be accomplished without auxiliary training losses.

Calculation of Musical Attributes. The attributes chosen for our task, rhythmic intensity (a^rhym) and polyphony (a^poly), can be perceived easily by people without intensive training in music. Meanwhile, they are also important determining factors of musical emotion. To obtain the ordinal classes a^rhym and a^poly of each bar, we first compute the raw scores s^rhym and s^poly.
• Rhythmic intensity score, or s^rhym, simply measures the percentage of sub-beats with at least one note onset, i.e.,

s^{\mathrm{rhym}} = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}(n_{\mathrm{onset},b} ≥ 1), \quad (28)

9. They are concatenations of the latent condition, z_k, and the attribute embeddings, a_k^rhym and a_k^poly.


Fig. 5: Architecture of MuseMorphose, a Transformer-based VAE model for controllable conditional music generation.

TABLE 4: Main characteristics of our MuseMorphose model.

Attr.     Description                              Value
T         target sequence length                   1,280
L         # self-attention layers                  24
L_enc     # encoder self-attention layers          12
L_dec     # decoder self-attention layers          12
n_head    # self-attention heads                   8
d_e       token embedding dimension                512
d         hidden state dimension                   512
d_ff      feed-forward dimension                   2,048
d_z       latent condition dimension               128
d_a       attribute embedding dimension (each)     64
# params  —                                        79.4 M

where B is the number of sub-beats in a bar and \mathbb{1}(·) is the indicator function.

• Polyphony score, or s^poly, is a bit more implicit, and is defined as the average number of notes being hit (onset) or held (not yet released) in a sub-beat, i.e.,

s^{\mathrm{poly}} = \frac{1}{B} \sum_{b=1}^{B} (n_{\mathrm{onset},b} + n_{\mathrm{hold},b}). \quad (29)

This makes sense since, if there are more notes pressed simultaneously, the music would feel harmonically fuller.

After we collect all bar-wise raw scores from the dataset, we divide them into 8 bins with roughly equally many samples, resulting in the 8 classes of a^rhym and a^poly. For example, in our implementation, the cut-off s^rhym values between the classes of a^rhym are [.20, .25, .32, .38, .44, .50, .63], while the cut-offs for a^poly are [2.63, 3.06, 3.50, 4.00, 4.63, 5.44, 6.44].
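A sketch of Eqs. (28)-(29) and the subsequent binning into 8 ordinal classes; the cut-off lists are the values quoted above, while the helper names are our own.

```python
import numpy as np

def rhythmic_intensity(onset_counts):
    """Eq. (28): fraction of the B sub-beats with at least one note onset."""
    return float(np.mean(np.asarray(onset_counts) >= 1))

def polyphony(onset_counts, hold_counts):
    """Eq. (29): average number of notes hit or held per sub-beat."""
    return float(np.mean(np.asarray(onset_counts) + np.asarray(hold_counts)))

# Cut-offs quoted in the text; np.digitize maps a raw score to one of 8 ordinal classes (0-7).
RHYM_CUTOFFS = [.20, .25, .32, .38, .44, .50, .63]
POLY_CUTOFFS = [2.63, 3.06, 3.50, 4.00, 4.63, 5.44, 6.44]

def to_class(score, cutoffs):
    return int(np.digitize(score, cutoffs))
```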

Implementation and Training Details. Table 4 lists the important architectural details of our MuseMorphose model. For better training outcomes, we introduce both cyclical KL annealing [44] and free bits [43] (both have been introduced in Sec. 2.3) to our training objective (see Eq. (27)). After some trial and error, we set β_max = 1, and anneal β, i.e., the weight on the L_KL term, in cycles of 5,000 training steps. The free bit for each dimension in z_k is set to λ = 0.25. This setup is referred to as the preferred settings hereafter, for it strikes a good balance between content preservation (i.e., fidelity) and diversity.10 We also train MuseMorphose with the pure AE objective (i.e., β = 0, constant) and the VAE objective (i.e., β = 1, constant; λ = 0), to examine how the model behaves at the two extremes. However, across all three settings, we do not add L_KL in the first 10,000 steps, to make it easier for the decoder to exploit the latent space in the beginning.

Furthermore, we set K = 16 during training, whichmeans that we feed to the model a random crop of 16-barexcerpt, truncated to maximum length T = 1,280, from eachmusical piece in every epoch. This is done mainly to savememory and speed up training. During inference, the modelcan still generate music of arbitrary length with a slidingwindow mechanism. Data augmentation on pitches is alsoemployed. The key of each sample is transposed randomlyin the range of ±6 half notes in every epoch.

The models are trained with the Adam optimizer [47] and teacher forcing. We use a linear warm-up to increase the learning rate from 0 to 1e−4 over the first 200 steps, followed by a 200,000-step cosine decay down to 5e−6. Model parameters are initialized from the Gaussian N(0, 0.01²). Using a batch size of 4, we can fit MuseMorphose into a single NVIDIA Tesla V100 GPU with 32 GB of memory. For all three loss variants (preferred settings, AE objective, VAE objective) alike, training converges in less than two full days. At inference time, we perform nucleus sampling [58] to sample from the output distribution at each timestep, using a softmax temperature τ = 1.2 and truncating the distribution at cumulative probability p = 0.9.
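For reference, a minimal sketch of temperature-scaled nucleus sampling with τ = 1.2 and p = 0.9 is given below; it illustrates the technique of [58] and is not the authors' exact implementation.

```python
import torch

# Temperature-scaled nucleus (top-p) sampling over one output distribution.
def nucleus_sample(logits, temperature=1.2, p=0.9):
    """logits: 1-D tensor over the vocabulary; returns a sampled token id."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < p          # keep tokens whose preceding mass < p
    sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```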

Due to the affinities in training objective and application, we implement MIDI-VAE [19] and Attr-Aware VAE

10. Note, however, that this judgment is made by ear.


TABLE 5: The vocabulary used to represent songs in the REMI-pop-1.7K dataset.

Event type   Description                                   # tokens
BAR          beginning of a new bar                        1
SUB-BEAT     position in a bar, in 16th-note steps         16
TEMPO        32∼224 bpm, in steps of 3 or 6 bpm            54
PITCH        MIDI note numbers (pitch) 22∼107              86
VELOCITY     MIDI velocities 40∼86                         24
DURATION     multiples (1∼16 times) of a 16th note         16
CHORD        chord markings (root & quality)               133

All events   —                                             330

[20] as baselines.11 Their RNN-based decoders, however, operate only on single bars (i.e., Xk's) of music. Therefore, during inference, each bar is generated independently and then concatenated to form the full piece. We deem this acceptable since the bar-level latent conditions zk's should store sufficient information to link the piece together. We follow their specifications closely, add all auxiliary losses as required, and increase their numbers of trainable parameters by enlarging the RNN hidden states for a fair comparison with MuseMorphose. The resulting numbers of parameters in our implementations of MIDI-VAE and Attr-Aware VAE are 58.2 million and 60.0 million, respectively. Also for fairness, the two baseline models are trained under the preferred settings for MuseMorphose, with teacher forcing as well.

Dataset and Data Representation. For this task, we consider generating expressive pop piano performances. The pop piano MIDI dataset used is REMI-pop-1.7K, which was released in [24]. According to [24], the piano performances in the dataset were originally collected from the Internet in the MP3 (audio) format. They further employed the Onsets and Frames piano transcription model [59], the madmom beat tracking tool [60], and the chorder rule-based chord detection tool12 to transcribe the audio into MIDI format with tempo, beat, and chord information. REMI-pop-1.7K encompasses 1,747 songs, or 108 hours of music, in total. We hold out 5% of the data (i.e., 87 songs) each for the validation and test sets. We utilize the validation set to monitor the training process, and the test set to generate style-transferred music for further evaluation.

The data representation adopted here is largely identical to REMI [13] and the one used in Sec. 3. The differences are: (1) an extended set of CHORD tokens is used to mark the harmonic setting of each beat; (2) only a single piano track is present; and (3) each bar contains 16 SUB-BEAT's rather than 32. Table 5 presents descriptions of the event tokens used to represent the pop piano songs in REMI-pop-1.7K.
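As a rough illustration of Table 5, the sketch below builds a 330-token vocabulary with the same per-type counts; the exact tempo, velocity, and chord value grids are dataset-specific, so placeholder indices are used for those event types.

```python
# Rough sketch of the 330-token vocabulary summarized in Table 5; per-type
# counts match the table, but several value grids are placeholders.
def build_vocab():
    events = ['BAR_None']                                    # 1 token
    events += [f'SUB-BEAT_{i}' for i in range(1, 17)]        # 16 positions per bar
    events += [f'PITCH_{p}' for p in range(22, 108)]         # 86 MIDI pitches (22-107)
    events += [f'DURATION_{d}' for d in range(1, 17)]        # 16 multiples of a 16th note
    events += [f'TEMPO_{i}' for i in range(54)]              # 54 tempo classes (placeholder ids)
    events += [f'VELOCITY_{i}' for i in range(24)]           # 24 velocity bins (placeholder ids)
    events += [f'CHORD_{i}' for i in range(133)]             # 133 chord markings (placeholder ids)
    return {tok: idx for idx, tok in enumerate(events)}

vocab = build_vocab()
print(len(vocab))   # 330
```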

4.3 Evaluation Metrics

To evaluate the trained models, we ask them to generate style-transferred musical pieces based on 32-bar-long excerpts, i.e., X's, drawn from the test set. Following the convention in text style transfer tasks [55], [57], the generations are evaluated according to: (1) their fidelity w.r.t. the input X;

11. The Optimus model is not included here since it was not originally employed for style transfer, and its conditioning mechanism has already been evaluated in Sec. 3.

12. https://github.com/joshuachang2311/chorder

(2) the strength of attribute control given the specified bar-wise attributes arhym and apoly; and (3) their fluency. In addition, we include a (4) diversity criterion, which is measured across samples generated under the same inputs X, arhym, and apoly.

The experiments are, therefore, conducted under the following two settings:

• Setting #1: We randomly draw 20 excerpts from the test set, each being 32 bars long, and randomly assign to each of them 5 different sets of attribute inputs, i.e., {arhym_k, apoly_k} for bars k = 1, . . . , 32, for a model to generate from. This results in 20 · 5 = 100 samples on which we compute the fidelity, control, and fluency metrics.

• Setting #2: We draw 20 excerpts as in Setting #1; here, however, we assign only 1 set of attribute inputs to each excerpt, and have the model compose 5 different versions under exactly the same inputs. By doing so, we obtain 20 · C(5, 2) = 200 pairs of generations on which we compute the diversity metrics.

Concerning the metrics, for fidelity and diversity we directly employ the bar-wise chroma similarity and grooving similarity (i.e., simchr and simgrv) defined in Sec. 3.3. We prefer them to be higher in the case of fidelity, but lower in the case of diversity. The metrics for control and fluency, on the other hand, are defined in the following paragraphs.
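As a hedged sketch (the precise definitions live in Sec. 3.3 and are not repeated here), we assume simchr and simgrv are cosine similarities between bar-wise chroma vectors and bar-wise grooving (onset-position) vectors, respectively; extracting those vectors from the music is omitted.

```python
import numpy as np

# Assumed fidelity/diversity measures: cosine similarity between bar-wise
# chroma (12-dim pitch-class) and grooving (16-dim onset-position) vectors.
def cosine(u, v, eps=1e-8):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def sim_chr(chroma_a, chroma_b):
    """Bar-wise chroma similarity between two corresponding bars."""
    return cosine(chroma_a, chroma_b)

def sim_grv(groove_a, groove_b):
    """Bar-wise grooving similarity between two corresponding bars."""
    return cosine(groove_a, groove_b)
```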

Evaluation Metrics for Attribute Control. Since our attributes are ordinal in nature, following [20], we may directly compute the Spearman's rank correlation, ρ, between a specified input attribute class a and the attribute raw score s (see the definitions in Sec. 4.2, Calculation of Musical Attributes) computed from the resulting generation, to see whether they are strongly and positively correlated. Taking rhythmic intensity as an example, we define:

$$ \rho^{\mathrm{rhym}} = \mathrm{Spearman\_corr}\big(a^{\mathrm{rhym}}, s^{\mathrm{rhym}}\big), \tag{30} $$

where the arhym's are the user-specified inputs, and the srhym's are computed from the model generations given those arhym's; the definition of ρpoly is similar.

Additionally, in the multi-attribute scenario, we would like to avoid side effects when tuning one attribute but not the others. More specifically, the model should be able to transfer an attribute a without affecting another attribute a′. To evaluate this, we define another set of correlations, ρpoly|rhym and ρrhym|poly, as:

$$ \rho^{\mathrm{poly|rhym}} = \mathrm{Spearman\_corr}\big(a^{\mathrm{rhym}}, s^{\mathrm{poly}}\big), \tag{31} $$

and similarly for ρrhym|poly. We prefer these correlations to be close to zero (i.e., |ρa′|a| as small as possible), which suggests more independent control of each attribute.
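A small sketch of how Eqs. (30)-(31) can be computed with an off-the-shelf Spearman correlation follows; array names are ours, and each array holds one value per generated bar.

```python
from scipy.stats import spearmanr

# Correlate user-specified attribute classes with the raw scores recomputed
# from the resulting generations (Eqs. (30)-(31)).
def control_correlations(a_rhym, s_rhym_gen, s_poly_gen):
    rho_rhym, _ = spearmanr(a_rhym, s_rhym_gen)              # Eq. (30): should be high
    rho_poly_given_rhym, _ = spearmanr(a_rhym, s_poly_gen)   # Eq. (31): should be near zero
    return rho_rhym, rho_poly_given_rhym
```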

Evaluation Metric for Fluency. In line with text style transfer research [55], [57], we evaluate the fluency of the generations by examining their perplexity (PPL), i.e., the exponentiation of entropy, which measures the degree of uncertainty involved during their generation. A PPL value represents, on average, from how many possibilities we are choosing uniformly to produce each token in a sequence. Therefore, a low PPL means such a sequence is rather likely to be seen in the language; in our case, the language is pop piano music. However, it is impossible to know the true perplexity of a language, so our best effort is to train a language model (LM) and check instead the perplexity of that


TABLE 6: Training and validation losses of models trained on the REMI-pop-1.7K dataset. The best checkpoint is chosen to be the one with the lowest validation reconstruction NLL.

Model                                      Best ckpt. at step   Recons. NLL (Train / Val.)   KL divergence (Train / Val.)
MIDI-VAE [19]                              65K                  0.676 / 0.894                0.697 / 0.682
Attr-Aware VAE [20]                        190K                 0.470 / 0.688                0.710 / 0.697
MuseMorphose (Ours), AE objective          90K                  0.130 / 0.139                6.927 / 6.928
MuseMorphose (Ours), VAE objective         75K                  0.697 / 0.765                0.485 / 0.484
MuseMorphose (Ours), preferred settings    140K                 0.457 / 0.584                0.636 / 0.636

bold (in the original table): leads the baselines under the same latent space constraints

TABLE 7: Evaluation results of various models on the style transfer of pop piano performances (↓/↑: the lower/higher the score the better).

Model                      Fidelity: simchr↑  simgrv↑   Control: ρrhym↑  ρpoly↑   Fluency: PPL↓   Diversity: simchr↓  simgrv↓
MIDI-VAE [19]              75.6   85.4                  .719   .261              8.84            74.8   86.5
Attr-Aware VAE [20]        85.0   76.8                  .997   .781              10.7            86.5   84.7
Ours, AE objective         98.5   95.7                  .181   .154              6.10            97.9   95.4
Ours, VAE objective        78.6   80.7                  .931   .884              7.89            73.2   84.9
Ours, preferred settings   91.2   84.5                  .950   .885              7.39            87.1   87.6

[Fig. 6a panels (plots omitted): MIDI-VAE (ρrhym = 0.719, |ρpoly|rhym| = 0.134); Attr-Aware VAE (ρrhym = 0.997, |ρpoly|rhym| = 0.239); MuseMorphose, AE objective (ρrhym = 0.181, |ρpoly|rhym| = 0.058); MuseMorphose, VAE objective (ρrhym = 0.931, |ρpoly|rhym| = 0.038); MuseMorphose, preferred settings (ρrhym = 0.95, |ρpoly|rhym| = 0.023).]

(a) On controlling rhythmic intensity (x-axis: user-specified arhym; y-axis: srhym (left), spoly (right) computed from the resulting generations; error bands indicate ±1 std from the mean).

[Fig. 6b panels (plots omitted): MIDI-VAE (ρpoly = 0.261, |ρrhym|poly| = 0.056); Attr-Aware VAE (ρpoly = 0.781, |ρrhym|poly| = 0.040); MuseMorphose, AE objective (ρpoly = 0.154, |ρrhym|poly| = 0.072); MuseMorphose, VAE objective (ρpoly = 0.884, |ρrhym|poly| = 0.003); MuseMorphose, preferred settings (ρpoly = 0.885, |ρrhym|poly| = 0.016).]

(b) On controlling polyphony (x-axis: user-specified apoly; y-axis: srhym (left), spoly (right) computed from the resulting generations; error bands indicate ±1 std from the mean).

Fig. 6: Comparison of the models on attribute controllability. We desire both a high correlation ρa and a low |ρa′|a|, where a is the attribute in question, while a′ is not.

LM when generating the sequence. This estimated perplexity was proven to upper-bound the true perplexity [61], and hence can be considered a valid metric.

To this end, we train a stand-alone 24-layer Transformer decoder LM on the training set of REMI-pop-1.7K, and compute the PPL of each (32-bar-long) generation X with:

$$ \mathrm{PPL}_X = \exp\Big(-\frac{1}{T}\sum_{t=1}^{T} \log p_{\mathrm{LM}}(x_t \mid x_{<t})\Big), \tag{32} $$

where T is the length of X, and pLM(·) is the probability given by the stand-alone LM. Note that PPL is the only piece-level metric used in our evaluation; all other metrics are measured at the bar level.
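A minimal sketch of Eq. (32), assuming the per-token log-probabilities under the stand-alone LM have already been collected:

```python
import math

# Piece-level perplexity under the stand-alone LM; `log_probs` holds
# log p_LM(x_t | x_<t) for every token of the generated piece.
def piece_perplexity(log_probs):
    T = len(log_probs)
    return math.exp(-sum(log_probs) / T)
```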

4.4 Results and Discussion

We commence by examining the models' losses, which are displayed in Table 6. Comparing MuseMorphose (preferred settings) with the RNN-based baselines, which are all trained under the same latent space KL constraints (i.e., free bits and cyclical scheduling on β), MuseMorphose attains lower validation reconstruction NLL and lower KL divergence at the same time. This could be attributed both to the advantage of attending to the entire sequence, a trait that RNNs do not have,


and to the fact that a Transformer decoder can more efficiently exploit the information from the latent space when paired with our in-attention conditioning. Next, turning our attention to the three settings of MuseMorphose, we can clearly observe an inverse relationship between the amount of latent space constraint and reconstruction performance. When the latent space is completely unconstrained (i.e., the AE objective variant), the NLL can easily be reduced to a very low level. Yet, even when the strict VAE objective is used, MuseMorphose can still reach an NLL close to those of the baselines.

Table 7 shows the evaluation results on our style transfer task. Fidelity-wise, our model under the preferred settings mostly surpasses the baselines, except in simgrv, where it scores a little lower than MIDI-VAE. Moreover, our AE objective variant sticks almost perfectly to the inputs, while the VAE objective gets a much lower simchr. Perceptually, we find that simchr < 80 is a level at which a generation often no longer feels like the same piece as the input. Attribute control is the aspect where the AE objective fails without question; meanwhile, MuseMorphose under the preferred settings and the VAE objective exhibits equally strong attribute control, with ρrhym ≈ .95 and ρpoly ≈ .90. This suggests that, in the absence of style control-targeted losses, having a sufficiently narrow latent space bottleneck is key to successful style control. Compared with the baselines, our model is the only one capable of tightly controlling both attributes, despite the closer competitor, Attr-Aware VAE, achieving a very high ρrhym. We surmise that rhythmic intensity is an easier attribute for the models to capture, since it only involves counting the number of SUB-BEAT's appearing in a bar. On fluency, MuseMorphose significantly (p < .001, with 100 samples) beats the baselines under all settings. Diversity-wise, though our model does not seem to have an edge over the baselines, we note that there exists an inherent trade-off between diversity and fidelity. Nevertheless, if we compare our preferred settings with Attr-Aware VAE, the closer competitor, we can notice that the margins by which MuseMorphose wins over Attr-Aware VAE on fidelity (6.2 for simchr; 7.7 for simgrv) are somewhat larger than those by which it loses on diversity (0.6 for simchr; 2.9 for simgrv).

It is worth mentioning that manipulating the fidelity-diversity trade-off is possible with a single model, either by forcibly enlarging/shrinking the std vectors (i.e., the σk's) from the VAE encoder, or by tuning the hyperparameters used when sampling from the decoder outputs, e.g., the temperature or the truncation point of the probability distribution. Their respective effects, however, are beyond the scope of this study and are left for further investigation.
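As a purely hypothetical illustration of the first knob (not a procedure evaluated in this work), one could scale the encoder's std vectors before sampling each zk:

```python
import torch

# Hypothetical fidelity-diversity knob: scale the posterior std before sampling.
# gamma < 1 pushes generations toward fidelity, gamma > 1 toward diversity.
def sample_latent(mu, logvar, gamma=1.0):
    std = torch.exp(0.5 * logvar) * gamma
    return mu + std * torch.randn_like(std)
```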

The attribute controllability of the models can be more thoroughly analyzed using Figure 6. Focusing on rhythmic intensity first (Fig. 6a), although the RNN-based methods have satisfactory control of this attribute, the resulting polyphony scores are more or less undesirably affected (|ρpoly|rhym| ≈ .15 or .25). This is reasonable since the |ρpoly|rhym|, and |ρrhym|poly| alike, computed on the training set of REMI-pop-1.7K are around .3, meaning that the two musical attributes are moderately correlated by nature. MuseMorphose (VAE and preferred settings), on the other hand, is able to maintain |ρpoly|rhym| ≤ .05, revealing yet another strength of our model: independent attribute control. However, we may also notice that MuseMorphose is not that great at differentiating classes 0 and 1. We suspect that this is due to the higher diversity, in terms of rhythmic intensity score, of the training data in class 0. Moving on to polyphony (Fig. 6b), the more challenging attribute, it is clear that only MuseMorphose manages to have a firm grip on it, while once again maintaining good independence.

To summarize, our MuseMorphose model, underpinned by Transformers and the in-attention conditioning mechanism, outperforms both baselines and ticks all the boxes in our controllable music generation task: in particular, it accomplishes high fidelity, strong and independent attribute control, good sequence-level fluency, and acceptable diversity, all at once. Nevertheless, we also emphasize that using a combination of VAE training techniques, i.e., cyclical KL annealing and free bits, and picking suitable values for the hyperparameters β and λ are indispensable to the success of MuseMorphose.

5 CONCLUSION AND FUTURE DIRECTIONS

In this paper, we have developed MuseMorphose, a sequence variational autoencoder with Transformers as its backbone, which delivers a strong performance on the controllable conditional generation of pop piano performances. To achieve this, we set off from defining a novel segment-level conditioning problem for generative Transformers and devised three mechanisms, namely pre-attention, in-attention, and post-attention conditioning, to approach it. Objective and subjective evaluations demonstrated that in-attention came out on top in terms of offering firm control with time-varying conditioning vectors. Subsequently, we leveraged in-attention to bring together a Transformer encoder and a Transformer decoder, forming the foundation of the MuseMorphose model. Experiments have shown that, when trained with a carefully tuned VAE objective, MuseMorphose emerges as a solid all-rounder on the music style transfer task with long inputs, where we considered controlling two musical attributes, i.e., rhythmic intensity and polyphony, at the bar level. Our model outperformed two previous methods [19], [20] on commonly accepted style transfer evaluation metrics, without using any auxiliary objectives tailored for style control.

Our research can be extended in the following directions:

• Applications in Natural Language Processing (NLP): The framework of MuseMorphose is easily generalizable to texts by, for example, treating sentences as segments (i.e., Xk's) and an article as the full sequence (i.e., X). As for attributes, possible options include sentiments [27], [57] and text difficulty [62]. Success in controlling the latter, for example, would be of high practical value, since it would enable us to rewrite articles or literary works to cater to the needs of pupils across all stages of education.

• Music generation with long-range structures: Inspired by two astounding works on high-resolution image generation [63], [64], in which Transformers are used to generate the high-level latent semantics of an image (i.e., the zk's, the sequence of latent conditions), we conjecture that long-range musical structures, e.g., motivic development, recurrence of themes, and contrasting sections, may be better generated in a similar fashion. To this end, it is likely that we need to learn a vector-quantized latent space [65]


instead, so that the latent conditions can be represented in token form for another Transformer to model.

To conclude, our work has not only bridged Transformers and VAEs for controllable sequence modeling, where potential for further applications exists, but has also laid a solid foundation for a pathway to long sequence generation that has not been explored before.

REFERENCES

[1] L. A. Hiller and L. M. Isaacson, Experimental Music; Composition with an Electronic Computer, 1959.
[2] G. Hadjeres, F. Pachet, and F. Nielsen, “DeepBach: a steerable model for Bach chorales generation,” in Proc. ICML, 2017.
[3] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” in Proc. AAAI, 2018.
[4] S. Ji, J. Luo, and X. Yang, “A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions,” arXiv preprint arXiv:2011.06801, 2020.
[5] C. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music Transformer: Generating music with long-term structure,” in Proc. ICLR, 2019.
[6] C. Donahue, H. H. Mao, Y. E. Li, G. W. Cottrell, and J. McAuley, “LakhNES: Improving multi-instrumental music generation with cross-domain pre-training,” in Proc. ISMIR, 2019.
[7] A. Roberts, J. Engel, C. Raffel, C. Hawthorne, and D. Eck, “A hierarchical latent vector model for learning long-term structure in music,” in Proc. ICML, 2018.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NeurIPS, 2017.
[9] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. ICLR, 2014.
[10] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[11] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proc. Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
[12] C. M. Payne, “MuseNet,” OpenAI Blog, 2019.
[13] Y.-S. Huang and Y.-H. Yang, “Pop Music Transformer: Generating music with rhythm and harmony,” in Proc. ACM Multimedia, 2020.
[14] S.-L. Wu and Y.-H. Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. ISMIR, 2020.
[15] Y.-H. Chen, Y.-H. Huang, W.-Y. Hsiao, and Y.-H. Yang, “Automatic composition of guitar tabs by Transformers and groove modeling,” in Proc. ISMIR, 2020.
[16] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proc. ACL, 2019.
[17] H. H. Tan and D. Herremans, “Music FaderNets: Controllable music generation based on high-level features via low-level feature modelling,” in Proc. ISMIR, 2020.
[18] G. Hadjeres, F. Nielsen, and F. Pachet, “GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures,” in Proc. IEEE Symposium Series on Computational Intelligence, 2017.
[19] G. Brunner, A. Konrad, Y. Wang, and R. Wattenhofer, “MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer,” in Proc. ISMIR, 2018.
[20] L. Kawai, P. Esling, and T. Harada, “Attributes-aware deep music transformation,” in Proc. ISMIR, 2020.
[21] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” in Proc. NeurIPS, 2014.
[22] K. Choi, C. Hawthorne, I. Simon, M. Dinculescu, and J. Engel, “Encoding musical style with Transformer autoencoders,” in Proc. ICML, 2020.
[23] Y. Ren, J. He, X. Tan, T. Qin, Z. Zhao, and T.-Y. Liu, “PopMAG: Pop music accompaniment generation,” in Proc. ACM Multimedia, 2020.
[24] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang, “Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs,” in Proc. AAAI, 2021.
[25] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, 2019.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. N. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in Proc. NAACL-HLT, 2018.
[27] Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan, “Style transfer in text: Exploration and evaluation,” in Proc. AAAI, 2018.
[28] C. Li, X. Gao, Y. Li, B. Peng, X. Li, Y. Zhang, and J. Gao, “Optimus: Organizing sentences via pre-trained modeling of a latent space,” in Proc. EMNLP, 2020.
[29] T. Wang and X. Wan, “T-CVAE: Transformer-based conditioned variational autoencoder for story completion,” in Proc. IJCAI, 2019.
[30] S. Oore, I. Simon, S. Dieleman, D. Eck, and K. Simonyan, “This time with feeling: Learning expressive musical performance,” arXiv preprint arXiv:1808.03715, 2018.
[31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
[32] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[33] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proc. NAACL, 2018.
[34] G. Ke, D. He, and T.-Y. Liu, “Rethinking the positional encoding in language pre-training,” in Proc. ICLR, 2021.
[35] B. Wang, L. Shang, C. Lioma, X. Jiang, H. Yang, Q. Liu, and J. G. Simonsen, “On position embeddings in BERT,” in Proc. ICLR, 2021.
[36] A. Liutkus, O. Cífka, S.-L. Wu, U. Simsekli, Y.-H. Yang, and G. Richard, “Relative positional encoding for Transformers with linear complexity,” in Proc. ICML, 2021.
[37] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive Transformers with linear attention,” in Proc. ICML, 2020.
[38] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser et al., “Rethinking attention with performers,” in Proc. ICLR, 2021.
[39] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” in Proc. ICLR, 2017.
[40] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “Ladder variational autoencoders,” in Proc. NeurIPS, 2016.
[41] S. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” in Proc. SIGNLL, 2016.
[42] A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei, “Avoiding latent variable collapse with generative skip models,” in Proc. AISTATS, 2019.
[43] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improving variational inference with inverse autoregressive flow,” arXiv preprint arXiv:1606.04934, 2016.
[44] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating KL vanishing,” in Proc. NAACL-HLT, 2019.
[45] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher, “CTRL: A conditional Transformer language model for controllable generation,” arXiv preprint arXiv:1909.05858, 2019.
[46] M. Stern, W. Chan, J. Kiros, and J. Uszkoreit, “Insertion Transformer: Flexible sequence generation via insertion operations,” in Proc. ICML, 2019.
[47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[48] M. Chen, A. Radford, J. Wu, H. Jun, P. Dhariwal, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in Proc. ICML, 2020.
[49] H.-W. Dong and Y.-H. Yang, “Convolutional generative adversarial networks with binary neurons for polyphonic music generation,” in Proc. ISMIR, 2018.
[50] T. Fujishima, “Realtime chord recognition of musical sound: A system using Common Lisp,” in Proc. International Computer Music Conf. (ICMC), 1999.
[51] S. Dixon, F. Gouyon, and G. Widmer, “Towards characterisation of music via rhythmic patterns,” in Proc. ISMIR, 2004.
[52] J. Foote, “Visualizing music and audio using self-similarity,” in Proc. ACM Multimedia, 1999.
[53] L.-C. Yang and A. Lerch, “On the evaluation of generative models in music,” Neural Computing and Applications, vol. 32, no. 9, pp. 4773–4784, 2020.


[54] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick, “Unsupervised text style transfer using language models as discriminators,” in Proc. NeurIPS, 2018.
[55] V. John, L. Mou, H. Bahuleyan, and O. Vechtomova, “Disentangled representation learning for non-parallel text style transfer,” in Proc. ACL, 2019.
[56] G. Lample, S. Subramanian, E. Smith, L. Denoyer, M. Ranzato, and Y.-L. Boureau, “Multiple-attribute text rewriting,” in Proc. ICLR, 2019.
[57] N. Dai, J. Liang, X. Qiu, and X.-J. Huang, “Style Transformer: Unpaired text style transfer without disentangled latent representation,” in Proc. ACL, 2019.
[58] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in Proc. ICLR, 2019.
[59] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, “Onsets and Frames: Dual-objective piano transcription,” in Proc. ISMIR, 2018.
[60] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer, “Madmom: A new Python audio and music signal processing library,” in Proc. ACM Multimedia, 2016.
[61] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, J. C. Lai, and R. L. Mercer, “An estimate of an upper bound for the entropy of English,” Computational Linguistics, vol. 18, no. 1, pp. 31–40, 1992.
[62] S. Surya, A. Mishra, A. Laha, P. Jain, and K. Sankaranarayanan, “Unsupervised neural text simplification,” in Proc. ACL, 2019.
[63] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” arXiv preprint arXiv:2102.12092, 2021.
[64] P. Esser, R. Rombach, and B. Ommer, “Taming Transformers for high-resolution image synthesis,” in Proc. CVPR, 2021.
[65] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. NeurIPS, 2017.