
The models running on normalized data also profit from label smoothing; however, they still cannot reach the performance of the non-normalized models.
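As an illustration of the label smoothing used above, the following minimal sketch shows the standard uniform-smoothing rule applied to one-hot targets (the smoothing factor `eps=0.1` is a hypothetical value for illustration, not necessarily the one used in our experiments):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Replace hard 0/1 targets with smoothed ones: the true class
    gets 1 - eps, and eps is shared uniformly over all classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

# Hard targets for a binary problem:
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
print(smooth_labels(targets, eps=0.1))  # entries become 0.95 / 0.05
```

Training against these softened targets penalizes overconfident predictions, which is one common explanation for the performance gain observed here.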

For the special case of the models initially trained on problem 3, which were then binarized and run on problem 1 (without smoothing), we only report some results in this section. The full results can be found in the appendix, in table A\ref{tbl:washing_binarized} and fig. A\ref{fig:washing_binarized}. Surprisingly, the models trained on problem 3 reach similar F1 scores on the test data of problem 1 as the models trained on problem 1: DeepConvLSTM achieves an F1 score of $0.857$, DeepConvLSTM-A achieves $0.847$. The F1 score of DeepConvLSTM even exceeds the highest F1 score of the models trained for problem 1 by $0.004$. However, on the S score metric, the models trained for problem 3 only reach up to $0.704$ (CNN) or $0.671$ (DeepConvLSTM-A), which is $0.052$ lower than the best performing model trained for problem 1.
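The binarization step described above can be sketched as follows. The class indices (0 = Null, 1 = HW, 2 = HW-C) are an assumption for illustration; any prediction of either washing class is collapsed onto the positive class of problem 1:

```python
import numpy as np

def binarize_predictions(three_class_preds, washing_classes=(1, 2)):
    """Map 3-class predictions (assumed: 0=Null, 1=HW, 2=HW-C) onto
    the binary problem: 1 = any hand washing, 0 = everything else."""
    preds = np.asarray(three_class_preds)
    return np.isin(preds, washing_classes).astype(int)

print(binarize_predictions([0, 1, 2, 2, 0]))  # -> [0 1 1 1 0]
```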

\FloatBarrier

...

...


\FloatBarrier

### Classifying hand washing and compulsive hand washing separately and distinguishing them from other activities

The three-class problem of classifying hand washing, compulsive hand washing and other activities is harder than the other two problems, as it contains them both at once. The resulting confusion matrix for each of the neural network classifiers is shown in @fig:confusion. The version trained on the normalized data is shown on the left, while the architecture-identical model trained on the non-normalized data is shown on the right. Each confusion matrix shows what percentage of the samples with a given true label was classified into each of the three available classes. Optimally, the diagonal values would all be $1.0$ and the off-diagonal values all $0.0$. The matrices are color-coded with the same value ranges, so that they are directly comparable.
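The row-normalized matrices described above can be computed as in the following minimal sketch (the toy labels are invented for illustration):

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, n_classes=3):
    """Confusion matrix where row i gives, for true class i, the
    fraction of samples predicted as each class (rows sum to 1)."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True)

# Toy example with classes 0=Null, 1=HW, 2=HW-C; a perfect
# classifier would yield the identity matrix.
cm = row_normalized_confusion([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(cm)
```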

![Confusion matrices for all neural network based classifiers with and without normalization of the sensor data](img/confusion.pdf){#fig:confusion width=98%}

The confusion matrices of the non-normalized models in the right column do not directly allow us to decide on one "best" model, but we can see that the diagonal values of all the LSTM-based models tend to be higher than those of FC and CNN. The "pure" LSTM model performs best on the compulsive hand washing class (HW-C, $0.88$) and close to best on the hand washing class (HW, $0.78$), only narrowly beaten by DeepConvLSTM ($0.79$). However, LSTM only reaches an accuracy of $0.33$ on the Null class. The best performing model on the Null class is the CNN model ($0.64$), which in turn only reaches $0.51$ on HW and $0.72$ on HW-C. While DeepConvLSTM-A never reaches the highest value in any single class, its overall performance in the confusion matrix is good. It reaches higher values on the Null class than the other LSTM-based models ($0.53$ vs $0.47$ (LSTM-A), $0.46$ (DeepConvLSTM), $0.33$ (LSTM)). At the same time, its performance on the HW and HW-C classes is similar to that of DeepConvLSTM, albeit slightly lower ($0.78$ vs $0.79$ on HW and $0.82$ vs $0.85$ on HW-C).

As for problems 1 and 2, we find that normalization seems to decrease the performance of all the neural-network-based classifiers. For this problem, the FC network also shows decreased performance when normalized input data is used.

![Confusion matrices for all baseline classifiers with and without normalization of the sensor data](img/confusion_baselines.pdf){#fig:confusion_baselines width=98%}

The respective confusion matrices for the baseline classifiers, i.e. RFC, SVM, the majority classifier and the random classifier, are displayed in @fig:confusion_baselines. For both SVM and RFC, in both the normalized and the non-normalized versions, the confusion matrices show that the Null class was predicted most often. In the non-normalized version, $94\,\%$ of samples belonging to the Null class are predicted correctly by both of these methods. However, they also predict most of the samples belonging to the other classes as Null. The HW class is classified as Null for $71\,\%$ (SVM) and $67\,\%$ (RFC) of its samples, with only $15\,\%$ (SVM) and $22\,\%$ (RFC) being identified correctly. The accuracy is better for the HW-C class, where a ratio of correct predictions of $0.42$ (SVM) and $0.43$ (RFC) is reached, although a large share of samples is again misclassified into the Null class (SVM: $0.56$, RFC: $0.54$).

The majority classifier assigns all samples to the Null class, which leads to an accuracy of $1.0$ on the samples belonging to the Null class and $0.0$ on all samples belonging to the HW and HW-C classes.
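This behaviour of the majority baseline follows directly from its definition, as the following sketch shows (the tiny label array is invented for illustration):

```python
import numpy as np

def majority_classifier(y_train, n_test):
    """Always predict the most frequent training class."""
    majority = np.bincount(y_train).argmax()
    return np.full(n_test, majority)

# With 0=Null dominating the training labels, every test sample is
# predicted as 0, so per-class accuracy is 1.0 for Null and 0.0 for
# HW and HW-C.
y_train = np.array([0, 0, 0, 0, 1, 2])
print(majority_classifier(y_train, 4))  # -> [0 0 0 0]
```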

...

...


![Confusion matrix of the chained best classifiers for problem 1 (DeepConvLSTM-A) and problem 2 (DeepConvLSTM), applied to problem 3](img/chained_confusion.pdf){#fig:chained_confusion width=60%}

#### Chained model

The best classifier in terms of the S score for problem 1 was DeepConvLSTM-A; likewise, for problem 2 it was DeepConvLSTM. Thus, these two models were selected to run as a chained model as described in section \ref{chained_model}. The results of this chained run of two models on the testing data for problem 3 can be seen in @fig:chained_confusion. The results were achieved on non-normalized data. The main difference between the chained model and the best performing original models for problem 3 is that the ratio of correct predictions for the Null class is significantly higher for the chained model ($0.69$ instead of around $0.5$). In exchange, its ratio of correct predictions for the HW class is only $0.67$ instead of around $0.79$ for the DeepConvLSTM-based models, and $0.8$ instead of around $0.85$ for the HW-C class.
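The two-stage decision rule described above can be sketched as follows. The model interfaces (callables returning per-window binary predictions) and the dummy stand-ins are hypothetical; the actual chaining follows section \ref{chained_model}:

```python
import numpy as np

def chained_predict(model_p1, model_p2, windows):
    """Two-stage classification for problem 3:
    stage 1 (problem 1 model) separates hand washing from Null,
    stage 2 (problem 2 model) splits washing into HW vs HW-C.
    Returns 0=Null, 1=HW, 2=HW-C."""
    out = np.zeros(len(windows), dtype=int)
    is_washing = model_p1(windows)            # 1 = washing, 0 = Null
    washing_idx = np.flatnonzero(is_washing)
    if len(washing_idx) > 0:
        compulsive = model_p2(windows[washing_idx])  # 1 = compulsive
        out[washing_idx] = np.where(compulsive, 2, 1)
    return out

# Dummy stand-ins for the two trained networks:
p1 = lambda w: (w[:, 0] > 0).astype(int)
p2 = lambda w: (w[:, 1] > 0).astype(int)
X = np.array([[-1.0, 0.0], [1.0, -1.0], [1.0, 1.0]])
print(chained_predict(p1, p2, X))  # -> [0 1 2]
```

Note that any Null misclassification in stage 1 is final, which is consistent with the trade-off reported above: a higher Null accuracy at the cost of lower HW and HW-C accuracy.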