@@ -204,7 +204,7 @@ In order to find the best batch size, sizes between 32 samples per batch and 102

There is a connection between the batch size and the learning rate. Increasing the batch size can have a similar effect to reducing the learning rate over time (learning rate decay) @smith_dont_2018. Since we use a comparatively large batch size for our model training, we experimented with smaller learning rate values. During preliminary testing on the validation set, we tested initial values from 0.01 to 0.00001 and fixed the initial learning rate to 0.0001, as this provided the best performance. We had also implemented starting with a higher learning rate and then applying learning rate decay, but preliminary testing on the validation set showed that this approach did not improve the performance in our case. We also found that starting with higher learning rates ($lr > 0.01$) led to numerical instability in the recurrent networks, producing NaN values for the gradients and thus for the parameters. Training therefore became unstable for the networks containing LSTM layers, so the learning rate had to be reduced for these networks anyway.
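The search over initial learning rates and the NaN failure mode described above can be illustrated with a minimal sketch. The decade spacing of the candidate values, the helper names (`candidate_learning_rates`, `all_finite`, `guarded_update`) and the plain gradient-descent update are assumptions made for illustration; they are not taken from our training code.

```python
import math


def candidate_learning_rates(highest_exp=-2, lowest_exp=-5):
    """Return one candidate per decade, from 10**highest_exp down to 10**lowest_exp.

    Assumption: the tested values were spaced by factors of ten.
    """
    return [10.0 ** e for e in range(highest_exp, lowest_exp - 1, -1)]


def all_finite(values):
    """True if no value is NaN or +/- infinity."""
    return all(math.isfinite(v) for v in values)


def guarded_update(params, grads, lr):
    """Plain gradient-descent step that refuses to apply non-finite gradients.

    Without such a guard, a single NaN gradient propagates into the
    parameters and every later forward pass.
    """
    if not all_finite(grads):
        raise FloatingPointError(
            "non-finite gradient; the learning rate is probably too high"
        )
    return [p - lr * g for p, g in zip(params, grads)]
```

A check of this kind makes a diverging run fail at the first non-finite gradient instead of silently continuing with NaN parameters.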

##### Loss function

As the loss function, we use the cross-entropy loss, weighted according to the class frequencies ($\mathcal{L}_{weighted}$). This means that the loss function corrects for imbalanced classes, and we do not have to rely on subsampling or repetition to balance the class frequencies in the data set. The weighted cross-entropy loss is defined in equation \ref{eqn:cross_entropy_loss}. The true labels $\mathbf{y}$ are one-hot encoded vectors, and the models' output $\mathbf{x}$ is a vector with one entry per class. We first apply the "softmax" function to the models' output (see equations \ref{eqn:softmax} and \ref{eqn:apply_softmax}). Then, the loss is calculated by applying the weighted cross-entropy loss function, with the weight of each class being the inverse of its relative frequency in the training set. This way, the predictions for all classes have the same potential influence on the parameter updates, despite the classes not being perfectly balanced.
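The weighting scheme can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not our implementation: the function names are hypothetical, labels are given as integer class indices (equivalent to selecting the non-zero entry of a one-hot vector), every class is assumed to occur at least once in the training set, and the per-sample loss is taken as $-w_c \log p_c$ without any further normalisation, which may differ from the exact form of equation \ref{eqn:cross_entropy_loss}.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Weight of class c = 1 / (relative frequency of c) = N / count_c.

    Assumes every class occurs at least once in `labels`.
    """
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    total = len(labels)
    return [total / count for count in counts]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def weighted_cross_entropy(logits, target, weights):
    """Per-sample loss: -w[target] * log(softmax(logits)[target])."""
    probs = softmax(logits)
    return -weights[target] * math.log(probs[target])


# Toy training set: class 0 is three times as frequent as class 1.
train_labels = [0, 0, 0, 1]
class_weights = inverse_frequency_weights(train_labels, num_classes=2)

# With identical (uninformative) predictions for every sample, the summed
# loss per class is the same, although class 1 has fewer samples.
uniform_logits = [0.0, 0.0]
loss_per_class = [0.0, 0.0]
for label in train_labels:
    loss_per_class[label] += weighted_cross_entropy(
        uniform_logits, label, class_weights
    )
```

In the toy example, each class contributes the same total loss under uniform predictions, which is the sense in which all classes have the same potential influence on the parameter updates.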