A write-up of what I have done

I will talk about how I’ve gone about this assignment. If there is anything here that is vague/confusing, please drop a comment so I can fix it up.

Preprocessing

In terms of loading the audio, I used SciPy's io.wavfile.read method, which returns a NumPy array of integers (and a sample rate). In my case the sample rate was 16,000 Hz, so the length of the array was approximately 16,000 * 60 * 60 * 3 (i.e., 3 hours of audio). The only pre-processing I did was scale the array using \frac{x - \text{mean}(x)}{\max(x) - \min(x)}, so that its values lay in a unit range centered near zero.** I then split the data 80-10-10 (80% for training, 10% for validation, and 10% for testing), and reshaped the train/valid/test arrays into (3D) tensors of the form \texttt{(batch\_size, seq\_length, num\_features)}, with \texttt{num\_features = 16000} and \texttt{seq\_length = 50} (this is how the data is expected by the library I am using). That means each training/validation/test example is a second of audio, and a sequence is 50 seconds of audio; I admit the latter choice was a bit arbitrary. The script that achieves this is located here.
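
For concreteness, the preprocessing could look roughly like the following. This is only a sketch: the filename and the exact array handling are illustrative, not copied from my script.

```python
import numpy as np
from scipy.io import wavfile

# Load the audio (filename is a placeholder).
sample_rate, data = wavfile.read('audio.wav')
data = data.astype('float32')

# Scale so the values lie in a unit range centered near zero.
data = (data - data.mean()) / (data.max() - data.min())

# Chop into non-overlapping "seconds" of 16,000 samples each.
seq_length, num_features = 50, sample_rate
num_seconds = len(data) // num_features
data = data[:num_seconds * num_features].reshape(num_seconds, num_features)

# 80/10/10 split along the time axis.
n_train = int(0.8 * num_seconds)
n_valid = int(0.1 * num_seconds)
train, valid, test = np.split(data, [n_train, n_train + n_valid])

# Group consecutive seconds into sequences of length 50, giving 3D tensors
# of shape (batch_size, seq_length, num_features).
def to_tensor(chunks, seq_length=50):
    num_seqs = len(chunks) // seq_length
    return chunks[:num_seqs * seq_length].reshape(num_seqs, seq_length, -1)

X_train, X_valid, X_test = (to_tensor(d) for d in (train, valid, test))
```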

Formulation

Formally, we have the entire audio sequence x = x^{(1)}, x^{(2)}, \dots, x^{(n)} (where n \approx 3 \times 60 \times 60 and x^{(i)} \in R^{16000} represents a second of audio***). The task is: given x^{(i)}, produce the next second in the sequence, x^{(i+1)}. This means we can formulate our loss as the average of the squared distances between \hat x^{(2)} and x^{(2)}, \hat x^{(3)} and x^{(3)}, \dots, and \hat x^{(n)} and x^{(n)}, i.e. \frac{1}{n-1} \sum_{i=2}^{n} \| \hat x^{(i)} - x^{(i)} \|^2. (You can see this loss formulated in the main training script here; a rough sketch of it also appears below the diagram.) I have drawn a diagram to illustrate this more clearly (where y^{(i)} = \hat x^{(i+1)}):

[Diagram: the RNN unrolled over time, with y^{(i)} = \hat x^{(i+1)}]
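
Written symbolically, the objective could be sketched in Theano along these lines. The variable names are mine; the actual training script may differ.

```python
import theano.tensor as T

# predictions and targets are 3D tensors of shape
# (batch_size, seq_length, num_features); targets is simply the input
# sequence shifted one step forward, so targets[:, i, :] = x^{(i+1)}.
predictions = T.tensor3('predictions')
targets = T.tensor3('targets')

# Squared distance between each predicted second and the true next second,
# averaged over the batch and over time.
loss = T.mean(T.sum((predictions - targets) ** 2, axis=-1))
```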

 

The network architecture I used was an LSTM consisting of 3 layers of 600 units (the architecture definition file is here). I have a gut feeling that this architecture might be overkill (it has 55 million parameters!), and it’s probably possible to reproduce my result with a much more lightweight network.
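
For illustration, here is a rough Lasagne sketch of this kind of architecture. The layer names and options are my own and the real definition file may differ, but with num_features = 16000 and three layers of 600 units the parameter count does work out to roughly the 55 million mentioned above.

```python
import lasagne
from lasagne.layers import InputLayer, LSTMLayer, ReshapeLayer, DenseLayer

def build_network(batch_size=None, seq_length=50, num_features=16000, num_units=600):
    # Input: (batch_size, seq_length, num_features) tensors from preprocessing.
    l_in = InputLayer((batch_size, seq_length, num_features))
    # Three stacked LSTM layers of 600 units each.
    l_lstm1 = LSTMLayer(l_in, num_units=num_units)
    l_lstm2 = LSTMLayer(l_lstm1, num_units=num_units)
    l_lstm3 = LSTMLayer(l_lstm2, num_units=num_units)
    # Map each hidden state back to a 16,000-dimensional "second" of audio.
    l_reshape = ReshapeLayer(l_lstm3, (-1, num_units))
    l_dense = DenseLayer(l_reshape, num_units=num_features,
                         nonlinearity=lasagne.nonlinearities.tanh)
    l_out = ReshapeLayer(l_dense, (-1, seq_length, num_features))
    return l_out
```

The output of the final layer then has shape (batch_size, seq_length, num_features), matching the tensors produced in the preprocessing step.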

Software

For this project I used Lasagne, which is a lightweight neural network library for Theano. I can’t compare this with any other framework (such as Keras or Blocks), but Lasagne is really easy to use and it tries not to abstract Theano away from the user, which is comforting for me. The documentation is really nice too! There is even some example RNN code here to get you started (it certainly helped me). An example of an LSTM in Lasagne for text generation can be found here as well, and hopefully opens your mind up to the possibility of doing all sorts of cool things with LSTMs. 🙂

Results

I have plotted some learning curves for this model (the raw output is here):

[Figure: training and validation learning curves]

You can see that the network starts to overfit the data at around 1000 epochs (the training curve keeps diving down while the validation curve stops improving), so that would have been an appropriate point to terminate training. It is good to see that from 0 to 1000 epochs the training and validation curves stay in close proximity; if the training curve were significantly below the validation curve, I would be concerned.

Each epoch takes about 3.7 seconds, and overall the model took 125 minutes to train on a Tesla K40 GPU.

My next step is to train another LSTM with fewer units in each layer; I'd really like to know whether my architecture is overkill.

Notes

** Originally this was \frac{x - \min(x)}{\max(x) - \min(x)}, but Thomas George noticed that my generated audio was not centered at 0 like the original audio, so I changed it. The audio generated with the old scaling did not sound noticeably different, but I decided to stick with the new choice of scaling.

*** Yoshua made an interesting point in class: by letting x^{(i)} \in R^{16000}, I am implicitly assuming that each of those 16,000 samples can be treated as an independent feature. Obviously they aren't independent, and a smaller frame, e.g. x^{(i)} \in R^{200}, would seem more appropriate.

Other thoughts:

* Admittedly, I realised that the whole time I was training these LSTMs, the nonlinearity I was using on the output was tanh; I imagine there was probably a lot of saturation going on during backprop, which would have slowed down learning. In a later experiment I changed these to ReLUs, but then ran into exploding-gradient issues (even when experimenting with gradient clipping), so I had to modify my LSTMs so that they only backpropagate through a limited number of past time steps (truncated backpropagation through time). Nonetheless, the audio I am generating right now sounds really good!

* It is interesting that I am not able to generate audio past 50 seconds. There seems to be some kind of “input drift” (since I am repeatedly feeding the LSTM's generated output back into its input), and once the input has “drifted” far enough, the LSTM doesn't know how to recover and generates nonsense/white noise. This sounds like a teacher-forcing / generalisation issue: the inputs the RNN gets at training time are the actual inputs, so there is a lot of hand-holding going on. At test time (i.e., when we generate new sequences), the outputs are fed back as inputs, and if the network has not generalised well enough then we run into issues. The most obvious way to combat this seems to be to let some of the actual training inputs x^{(1)}, x^{(2)}, \dots be replaced by the generated sequences \hat x^{(1)}, \hat x^{(2)}, \dots during training. Another way, which seems easier from a coding point of view, is to simply apply some input noising and/or dropout (a rough sketch of these knobs, together with the gradient clipping and truncation mentioned in the previous point, follows this list).
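
To make those remedies concrete, here is a hypothetical Lasagne snippet (not my actual configuration) showing input noising plus gradient clipping and truncated backpropagation through time, via LSTMLayer's grad_clipping and gradient_steps arguments:

```python
import lasagne
from lasagne.layers import InputLayer, LSTMLayer, GaussianNoiseLayer

l_in = InputLayer((None, 50, 16000))
# Gaussian noise is only applied during training; it is disabled when the
# output is computed with get_output(..., deterministic=True) at generation time.
l_noise = GaussianNoiseLayer(l_in, sigma=0.05)
l_lstm = LSTMLayer(l_noise, num_units=600,
                   grad_clipping=5.0,   # clip gradients inside the recurrence
                   gradient_steps=20)   # truncated BPTT: only backprop 20 steps
```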


New audio, part deux

**NOTE**: There was a “bug” in the way I was generating the audio: this is not actually audio that my LSTM generated! Sorry for the misleading post; I am trying to rectify this ASAP!

I will try to update this blog post shortly; I just wanted to get another piece of audio out there!


Finally, some generated audio!

I managed to finally generate something!

https://github.com/christopher-beckham/ift6266h16/blob/master/vocal_synthesis/notebooks/23feb_entry.ipynb

In short, I used a 3-layer LSTM with 600 hidden units, trained with an initial learning rate \alpha = 0.01 (with momentum m = 0.9) for 2000 epochs using RMSProp. The learning curves and a plot of the waveform can be found in the notebook.
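
For reference, the update rule could be set up in Lasagne roughly as follows. This is a sketch only: `network` and `loss` are assumed to be the output layer and the scalar loss expression, and the momentum term above may correspond either to RMSProp's decay factor rho or to a separate momentum wrapper.

```python
import lasagne

params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.rmsprop(loss, params, learning_rate=0.01, rho=0.9)
# Alternatively, wrap the RMSProp updates with a momentum term:
# updates = lasagne.updates.apply_momentum(updates, params, momentum=0.9)
```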

You can have a listen to it here:


Update

I hope that soon I will be able to actually generate some music, but for now I'd like to shed some light on something I did which was a bit silly. In the case of the .wav file that's provided, you get 16,000 samples per second, and since the clip is approximately 3 hours, that's 16,000 * 60 * 60 * 3 = 172,800,000 samples in total. Initially, for my RNN, each x^{(i)} corresponded to a single sample, and I was training on sequence lengths of 100. That meant I was trying to get my RNN to generate the next sample based on only \frac{100}{16000} = 0.00625 of a second of audio! That sounds pretty silly. Another way to think about it is this: I am training my RNN on some sequence x^{(1)}, x^{(2)}, \dots, x^{(100)}. Suppose we have an RNN with one hidden layer, with states h^{(1)}, h^{(2)}, \dots, h^{(100)}. h^{(1)} is influenced by x^{(1)}, h^{(2)} is influenced by h^{(1)} and x^{(2)}, and so forth. The last hidden state, h^{(100)}, is influenced by the previous memory h^{(1)}, \dots, h^{(99)}, but this “past memory” only accounts for \frac{99}{16000} = 0.0061875 of a second! As far as I'm concerned, 0.0061875 of a second of audio sounds like nothing!

Anyway, we can think about redefining what x^{(i)} is exactly. For now, I have defined it to be a second of audio, so x^{(i)} \in R^{16,000} (since 16,000 samples correspond to a second). That means that if I’m training on a sequence x^{(1)}, x^{(2)}, \dots, x^{(100)}, I’m training on 100 seconds of audio. Last night I ran some experiments doing this on some pretty big LSTMs, so hopefully I get something good this time!

I’d also like to draw your attention to this paper, “Algorithmic Music Generation using Recurrent Neural Networks” (PDF: NayebiAran.pdf).

Cheers,
Chris


Training generative models

I am fortunate to have had some experience in the past using Theano, and more specifically, Lasagne, which is a lightweight neural network library for Theano. I can’t compare it with anything else as of now (e.g. Keras, Blocks), but what I like about Lasagne is that it doesn’t try to abstract Theano away from you, so you don’t feel like you’re missing out on flexibility.

I am choosing to do the vocal synthesis dataset as I’ve done quite a few convolutional nets and feel reasonably comfortable with them — as for RNNs, not so much! Also, generative modelling is interesting since it isn’t classification.

At this stage, my code isn’t well documented, but if you would like to see some RNNs in action, take a look at my configurations folder here. The main Python script for training an RNN is here.

To keep things computationally simple, I am training an RNN on (approximately) a 1-minute time slice of the 3-hour audio clip. So far I haven't been able to generate any audio that isn't effectively silent; when I generate audio with my model and plot the values, I get something that looks like this:

[Figure: plot of the generated waveform, which quickly decays to a constant value]

Notice how the RNN generates something that looks reasonable at the very start, but afterwards it dies off and keeps generating the same value. I wonder what this means; I keep thinking back to what Yoshua was saying in lecture about attractors. Maybe my RNNs are not big enough yet and I need to give them more capacity, who knows!
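
For context, the generation procedure is essentially the following feedback loop. This is a simplified sketch: `predict_next` stands in for a compiled Theano function that maps a batch of input chunks to the predicted next chunks.

```python
import numpy as np

def generate(seed, predict_next, num_steps=50):
    """seed: array of shape (seq_length, num_features); returns generated audio."""
    sequence = list(seed)
    generated = []
    for _ in range(num_steps):
        # Predict the next chunk from the most recent seq_length chunks.
        context = np.asarray(sequence[-len(seed):])[np.newaxis, :, :]
        next_chunk = predict_next(context)[0, -1, :]
        # Feed the prediction back in as input for the next step.
        sequence.append(next_chunk)
        generated.append(next_chunk)
    return np.concatenate(generated)
```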

Edit: a fellow classmate who is also doing the vocal synthesis project!

 
