Commit 86a5e6be authored by burcharr

Update docs/thesis/md/results.md

parent 9cffaf57
### Distinguishing hand washing from all other activities
For the first task of classifying hand washing in contrast to non-hand-washing activities, we report the results with and without the application of label smoothing. The results without label smoothing are shown in table \ref{tbl:washing}. In @fig:p1_metrics, the resulting scores for problem 1 with and without smoothing are shown.
As we can see, without label smoothing, the neural networks outperformed the conventional machine learning methods by a large margin. The best neural network method outperforms the best traditional method by a difference of nearly $0.2$ in the F1 score and around $0.1$ in the S score. Among the neural network methods themselves, the differences can become small, especially between the top-performing DeepConvLSTM and DeepConvLSTM-A. While DeepConvLSTM reaches a slightly better F1 score of $0.853$, DeepConvLSTM-A reaches $0.847$. However, if we consider the S score, DeepConvLSTM-A ($0.758$) is ahead of DeepConvLSTM ($0.756$). The convolutional neural network (CNN, $0.750$) and the LSTM with attention mechanism (LSTM-A, $0.708$) also reach similar levels of performance on both metrics, with the CNN outperforming the LSTM-A only in the S score. We can see that, as in the preliminary validation, normalization did not lead to the desired performance advantage. For the neural network methods, enabling normalization leads to a decrease of $0.01$ to $0.1$ in the F1 score and of $0.07$ to $0.15$ in the S score.
\input{tables/washing.tex}
With label smoothing, we can reach increased performance with all of the model classes, including the traditional machine learning methods RFC and SVM. The results with a 20-prediction-wide average filter smoothing can be seen in table \ref{tbl:washing_rm} and @fig:p1_metrics. The top-performing neural network architectures do not change with the smoothing. However, the performance measures increase. DeepConvLSTM has the best F1 score ($0.892$), followed by LSTM-A ($0.891$), DeepConvLSTM-A ($0.890$) and CNN ($0.888$). These results are higher by about $0.03$ to $0.05$ compared to utilizing the raw predictions without smoothing. In the S score metric, DeepConvLSTM-A performs best ($0.819$), followed by DeepConvLSTM ($0.814$) and CNN ($0.808$). For the S score, the advantage of the label smoothing is bigger in general, between $0.05$ and $0.06$ for all model classes except the LSTM, which only improves by $0.015$. RFC and SVM do not improve with the label smoothing; their scores decrease by about $0.04$ for both metrics.
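The prediction smoothing described above can be sketched as a running-mean filter over the raw binary predictions followed by re-thresholding. This is an illustrative sketch, not the thesis implementation; the function name and the $0.5$ threshold are assumptions:

```python
import numpy as np

def smooth_predictions(preds, width=20, threshold=0.5):
    """Average-filter smoothing over a sequence of binary predictions.

    A running mean of `width` consecutive predictions is computed and
    re-thresholded, which suppresses short, isolated misclassifications.
    """
    kernel = np.ones(width) / width
    smoothed = np.convolve(np.asarray(preds, dtype=float), kernel, mode="same")
    return (smoothed >= threshold).astype(int)
```

An isolated false positive inside a long Null segment is averaged down to $1/20 = 0.05$ and removed, while a single dropped prediction inside a hand washing segment is averaged up and restored.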
![F1 score and S score for problem 1](img/washing_all.pdf){#fig:p1_metrics width=105%}
\input{tables/washing_rm.tex}
The models running on normalized data also profit from the label smoothing; however, they still cannot reach the performance of the non-normalized models.
For the special case of the models initially trained on problem 3, which were then binarized and run on problem 1 (without smoothing), we only report some results in this section. The full results can be found in the appendix. Surprisingly, the models trained on problem 3 reach similar F1 scores on the test data of problem 1 as the models trained on problem 1. DeepConvLSTM achieves an F1 score of $0.857$, DeepConvLSTM-A achieves $0.847$. The F1 score of DeepConvLSTM even exceeds the highest F1 score of the models trained for problem 1 by $0.004$. However, in the S score metric, the models trained for problem 3 only reach up to $0.704$ (CNN) or $0.671$ (DeepConvLSTM-A), which is $0.052$ lower than the best performing model trained for problem 1.
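The binarization of the problem 3 models for evaluation on problem 1 can be sketched as collapsing both hand washing classes into the positive class. The class indices below (Null = 0, HW = 1, HW-C = 2) and the function name are assumptions for illustration:

```python
import numpy as np

# Hypothetical class indices for problem 3: 0 = Null, 1 = HW, 2 = HW-C.
NULL_CLASS = 0

def binarize_problem3(class_probs):
    """Map 3-class model outputs to the binary labels of problem 1.

    Both hand washing classes (HW and HW-C) are collapsed into the
    positive class; Null remains the negative class.
    """
    labels = np.argmax(class_probs, axis=-1)
    return (labels != NULL_CLASS).astype(int)
```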
\FloatBarrier
### Distinguishing compulsive hand washing from non-compulsive hand washing
The results without smoothing of predictions for the second task, distinguishing compulsive hand washing from non-compulsive hand washing, can be seen in table \ref{tbl:only_conv_hw}. In @fig:p2_metrics, the results with and without smoothing are shown. In terms of the F1 score metric, the LSTM model performs best ($0.926$). It is closely followed by DeepConvLSTM-A ($0.922$) and DeepConvLSTM ($0.918$). However, the RFC also performs surprisingly well, with an F1 score of $0.891$, even beating the CNN ($0.883$) and FC networks ($0.886$). Due to the imbalance of classes in the test set ($70.6\,\%$ of samples correspond to the positive class), the majority classifier reaches an F1 score of $0.828$. The S score is best for DeepConvLSTM ($0.869$) and LSTM ($0.862$), followed by LSTM-A ($0.848$) and DeepConvLSTM-A ($0.846$). The baseline methods RFC ($0.734$) and SVM ($0.701$) fail to reach similar S scores as the neural network based methods.
![F1 score and S score for problem 2](img/only_conv_hw_all.pdf){#fig:p2_metrics width=105%}
\input{tables/only_conv_hw_rm.tex}
Also, as in task 2 without smoothing, normalization brings about small to large performance decreases for all the neural network based models except for the FC network. The FC F1 score rises by $0.01$ when normalization is applied, and its S score rises by $0.014$.
\FloatBarrier
### Classifying hand washing and compulsive hand washing separately and distinguishing it from other activities
The three-class problem of classifying hand washing, compulsive hand washing and other activities is harder than the other two problems, as it contains them both at once. The resulting confusion matrix for each of the neural network classifiers is shown in @fig:confusion. The version trained on the normalized data is shown on the left, while the architecture-identical model trained on the non-normalized data is shown on the right. Each confusion matrix shows what percentage of the true labels of a class was assigned to each of the three available classes. Optimally, the diagonal values would all be $1.0$ and the off-diagonal values all $0.0$. The matrices are color-coded with the same value ranges, so that they are directly comparable.
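The row-normalized reading of the matrices described above can be sketched as follows, assuming every class occurs at least once in the true labels; the function name is illustrative:

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, n_classes=3):
    """Confusion matrix whose rows sum to 1.

    Entry (i, j) is the fraction of samples with true class i that were
    predicted as class j, i.e. the percentage-per-true-label reading
    used for the matrices in the figure.
    """
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```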
![Confusion matrices for all neural network based classifiers with and without normalization of the sensor data](img/confusion.pdf){#fig:confusion width=98%}
The confusion matrices of the non-normalized models in the right column do not directly allow us to decide on one "best" model, but we can see that the diagonal values of all the LSTM-based models seem to be higher than those of FC and CNN. The "pure" LSTM model performs best on the compulsive hand washing class (HW-C, $0.88$) and close to best on the hand washing class (HW, $0.78$), being only closely beaten by DeepConvLSTM ($0.79$). However, LSTM only reaches an accuracy of $0.33$ on the Null class. The best performing model on the Null class is the CNN model ($0.64$), which in turn only reaches $0.51$ on HW and $0.72$ on HW-C. While DeepConvLSTM-A never reaches the highest value in any of the specific classes, its overall performance in the confusion matrix is good. It reaches higher values in the Null class than the other LSTM-based models ($0.53$ vs. $0.47$ (LSTM-A), $0.46$ (DeepConvLSTM), $0.33$ (LSTM)). At the same time, its performance on the HW and HW-C classes is similar to that of DeepConvLSTM, albeit slightly lower ($0.78$ vs. $0.79$ on HW and $0.82$ vs. $0.85$ on HW-C).
As for problems 1 and 2, we find that normalization seems to decrease the performance of all the neural network based classifiers. For this problem, the FC network also shows decreased performance when normalized input data is used.
![Confusion matrices for all baseline classifiers with and without normalization of the sensor data](img/confusion_baselines.pdf){#fig:confusion_baselines width=98%}
The respective confusion matrices for the baseline classifiers, i.e. RFC, SVM, majority classifier and random classifier, are displayed in @fig:confusion_baselines. For both SVM and RFC, and for both the normalized and the non-normalized versions thereof, the confusion matrices show that the Null class was predicted most often. In the non-normalized version, $94\,\%$ of samples belonging to the Null class are predicted correctly by both of these methods. However, they also predict most of the samples belonging to the other classes as Null. The HW class is classified as Null in $71\,\%$ (SVM) and $67\,\%$ (RFC) of its samples, with only $15\,\%$ (SVM) and $22\,\%$ (RFC) being identified correctly. The accuracy is better for the HW-C class, where a ratio of correct predictions of $0.42$ (SVM) and $0.43$ (RFC) is reached, although there is again a high rate of misclassification into the Null class (SVM: $0.56$, RFC: $0.54$).
The majority classifier assigns all samples to the Null class, so its accuracy is $1.0$ on the samples belonging to the Null class and $0.0$ on all samples belonging to the HW and HW-C classes.
The random classifier also does not perform well and reaches values around $0.33$ for each of the fields of the confusion matrix.
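The two trivial baselines can be sketched in a few lines; the function names and the choice of Null as class index 0 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def majority_predict(n_samples, majority_class=0):
    # Always predicts the majority class (here assumed to be Null = 0),
    # yielding accuracy 1.0 on Null and 0.0 on HW and HW-C.
    return np.full(n_samples, majority_class)

def random_predict(n_samples, n_classes=3):
    # Guesses uniformly over the three classes, so every cell of the
    # row-normalized confusion matrix tends toward 1/3.
    return rng.integers(0, n_classes, size=n_samples)
```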
![Confusion matrix of the chained best classifiers for problem 1 (DeepConvLSTM-A) and problem 2 (DeepConvLSTM), applied to problem 3](img/chained_confusion.pdf){#fig:chained_confusion width=60%}
#### Chained model
The best classifier in terms of the S score for problem 1 was DeepConvLSTM-A. Likewise, for problem 2 it was the DeepConvLSTM. Thus, these two models were selected to run in a chained model as described in section \ref{chained_model}. The results of this chained run of two models on the testing data for problem 3 can be seen in @fig:chained_confusion. The results were achieved on non-normalized data. The main difference between the results of the chained model and those of the best performing original models for problem 3 is that the ratio of correct predictions for the Null class by the chained model is significantly higher ($0.69$ instead of around $0.5$). In exchange, its ratio of correct predictions for the HW class is only $0.67$ instead of around $0.79$ for the DeepConvLSTM-based models, and $0.8$ instead of around $0.85$ for the HW-C class.
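The chaining logic can be sketched as a two-stage decision: the problem 1 model first separates hand washing from Null, and the problem 2 model is consulted only for samples flagged as hand washing. The function signature and label encodings are assumptions, not the exact thesis code:

```python
def chained_predict(sample, washing_model, compulsive_model):
    """Combine two binary classifiers into a 3-class decision.

    washing_model: problem 1 classifier, 1 = hand washing, 0 = Null.
    compulsive_model: problem 2 classifier, 1 = compulsive, 0 = not;
    only consulted for samples flagged as hand washing.
    """
    if washing_model(sample) == 0:
        return "Null"
    return "HW-C" if compulsive_model(sample) == 1 else "HW"
```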
\input{tables/separate.tex}
The multiclass S score shows a similar ordering of the model classes. The chained model achieves the highest score ($0.783$), with the second highest value being $0.769$, reached by DeepConvLSTM-A. The third highest score is then $0.751$ for DeepConvLSTM, which is followed by LSTM-A ($0.736$), FC ($0.733$), CNN ($0.706$) and LSTM ($0.703$).
In the multiclass S score, the RFC ($0.509$) and SVM ($0.467$) show a significantly lower performance than the neural network based methods.
The mean diagonal value of the confusion matrix preserves almost the same ordering as the F1 score and S score measures. The chained classifier reaches the highest value of $0.718$. DeepConvLSTM-A achieves a close second place with a value of $0.712$, while DeepConvLSTM reaches $0.701$. The DeepConvLSTM-based models are followed by LSTM-A ($0.679$), LSTM ($0.665$), FC ($0.664$) and CNN ($0.624$).
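The mean diagonal value is simply the average of the per-class accuracies of a row-normalized confusion matrix; a minimal sketch (illustrative function name):

```python
import numpy as np

def mean_diagonal(confusion):
    # Mean of the per-class accuracies, i.e. the diagonal entries of a
    # row-normalized confusion matrix.
    return float(np.mean(np.diag(confusion)))
```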
\FloatBarrier
In the first scenario, the 5 subjects reported an average of $4.75$ ($\pm\,3.3$) hand washing procedures on the day on which they evaluated the system. Out of those, $1.75$ ($\pm\,2.06$) per subject were correctly identified. The accuracy per subject was $28.33\,\%$ ($\pm\,37.9\,\%$). The highest accuracy for a subject was $80\,\%$ out of 5 hand washes, the lowest was $0\,\%$ out of 4 hand washes. Of all hand washing procedures conducted over the day by the subjects, $35.8\,\%$ were detected correctly.
Some subjects wore the smart watch on the right wrist instead of the left wrist and reported worse results for that. Leaving out hand washes conducted with the smart watch worn on the right wrist, the detection sensitivity rises to $50\,\%$.
The duration and intensity of the hand washing process also played a role. The correlation between the duration of the hand washing and the detection rate is $-0.039$. However, the raw data contains only 2 "longer" hand washes of over 30 seconds, with the rest lying in the range of 10 to 25 seconds.
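A correlation of this kind is presumably a Pearson coefficient between the wash durations and the binary detection outcomes; a sketch with made-up illustrative data:

```python
import numpy as np

def duration_detection_correlation(durations, detected):
    """Pearson correlation between wash duration (seconds) and the
    binary detection outcome (1 = detected, 0 = missed)."""
    return float(np.corrcoef(durations, detected)[0, 1])

# Hypothetical field-test data, not the thesis measurements.
example_durations = [10, 12, 15, 22, 25, 35]
example_detected = [0, 1, 0, 1, 0, 1]
```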