Commit 3390a094 authored by burcharr

automatic writing commit ...

parent da1ba091
\section*{Appendix}
\subsection*{User manual and questionnaire for hand washing detection evaluation (see following pages)}
\includepdf[pages=-]{img/HW_EVAL_draft.pdf}
\ No newline at end of file
......@@ -19,7 +19,9 @@ Normalization was shown to be ineffective for our approach, worsening the perfor
For the reasons explained in section \ref{s_score}, we weight the S score more heavily than the F1 score. Thus, the best network for problem 1 is DeepConvLSTM-A, although only by a slight margin. The overall achieved S score of $0.819$ is based on a specificity of $0.751$ and a sensitivity of $0.900$, which means that $90\,\%$ of windows containing hand washing were correctly classified as hand washing. However, only $75.1\,\%$ of the windows that contained no hand washing were classified as Null, which leaves some room for improvement, as the model still has a false positive rate of $24.9\,\%$.
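For reference, the reported values are consistent with the S score being the harmonic mean of sensitivity and specificity (e.g. $0.751$ and $0.900$ yield $\approx 0.819$); a minimal sketch (the function name is ours, the authoritative definition is given in section \ref{s_score}):

```python
def s_score(tp, fp, tn, fn):
    """Harmonic mean of sensitivity and specificity, computed from
    the entries of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 2 * sensitivity * specificity / (sensitivity + specificity)

# e.g. specificity 0.751 and sensitivity 0.900 give an S score of ~0.819
```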
The binarized versions of the models trained on problem 3 achieve notable success in that their F1 scores are similar to those of the models trained for problem 1. However, their performance in terms of the S score is worse, by about $0.052$ for the best model and by more for the others. Therefore, and especially because of the higher importance of the S score, the models trained on problem 3 are not as good at classifying hand washing and separating it from other activities as the models specifically trained for this problem. This lower performance can be explained by the higher difficulty of the 3-class problem learned by the classifiers trained for problem 3. Thus, the loss in performance was to be expected.
Compared to the results obtained by Mondol et al. with HAWAD @sayeed_mondol_hawad_2020, with F1 scores over $90\,\%$, it may look like our approach provides weaker results. Their detection of out-of-distribution samples sounds like a good idea in theory. However, we must argue that their results and ours are not entirely comparable, because we did not train or evaluate on the same data. Added to that, from what they report in their paper, they did not split the data by subjects, but rather by data windows, with random sampling. This means that, during training, their model saw data from all subjects, including the subjects whose data they later tested on. Although this is not technically a leak from train to test set, our approach of splitting by subjects can be expected to deliver a better estimate of the generalization performance, because our models' (over-)adaptation to specific subjects' styles or patterns of hand washing cannot yield a performance boost on unseen subjects. Nevertheless, the detection of out-of-distribution samples could possibly increase the performance of our models. Still, one has to keep in mind that a sample being out of distribution does not always mean that it cannot be hand washing, especially if we test on unseen subjects, who might arguably employ different patterns of motion during hand washing. For these reasons, the comparability of the results seems rather low, with the performance of HAWAD likely being overestimated relative to our scenario.
##### Problem 2
......@@ -58,6 +60,14 @@ The general performance of our models on problem 2 was high. However, one limita
## Future work
The detection of hand washing could be incorporated into many devices, mainly wrist-worn ones like smart watches. To further improve the detection capabilities and accuracy, one would need to invest even more time into carefully designing and training better models. This work's architecture search could be expanded, and more parameter combinations could be tried. For example, different types of layers that have not yet been included in the architectures could be evaluated. Instead of normalizing data at the data set level, batch normalization could be used to try to make the networks faster and more stable.
Added to that, all the other hyperparameters could be optimized more thoroughly. Instead of manual hyperparameter optimization (HPO), automated versions of HPO could be employed, e.g. Bayesian optimization. This could lead to better choices for the batch size, learning rate, and other parameters.
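As an illustration, such an automated search could be sketched with a library like Optuna (hypothetical code; `train_and_validate` is a placeholder for our training loop, and the search ranges are examples, not tuned recommendations):

```python
import optuna

def train_and_validate(lr, batch_size):
    """Stand-in for the real training loop; it would train a model with
    the given hyperparameters and return its validation S score."""
    return 0.0  # placeholder

def objective(trial):
    # illustrative search spaces for learning rate and batch size
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256, 512])
    return train_and_validate(lr=lr, batch_size=batch_size)

study = optuna.create_study(direction="maximize")  # maximize the S score
study.optimize(objective, n_trials=50)
```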
The current state of the system, especially for the classification of hand washing versus compulsive hand washing, looks promising for future work in this area. Collecting real obsessive-compulsive hand washing data would likely make it possible to train models capable of reliably classifying compulsive hand washing. Such models could then be tested and evaluated with real-world subjects. If they perform well enough, they could aid psychologists and their patients in the treatment of compulsive hand washing. As explained in the introduction, exposure and response prevention (ERP) is a viable treatment method, and interventions from a smart watch could possibly be used for response prevention. The exact design of the interventions and their actual usability form another exciting problem field and are yet to be researched.
......
......@@ -104,7 +104,11 @@ The usages for the classes are shown in Table \ref{table:classes}. As mentioned
\label{table:classes}
\end{table}
The data used for training and testing the models differs between the problems due to the tasks' requirements. Namely, the data for problem 2 contains only hand washing and compulsive hand washing data; Null class data is not included in the training and testing data for this problem. However, we still made sure that, even across the different sets, each subject's recordings were only ever assigned to the same side of the train/test split. This means that we can execute the classifiers trained for one of the problems on the test sets of the other problems without the possibility of accidentally testing on data previously seen by the classifier. Testing on the train set or parts thereof would invalidate the results, so this property of our splits was desirable.
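Such a subject-wise (grouped) split could be sketched with scikit-learn's `GroupShuffleSplit`; this is an illustration with placeholder data, not our actual pipeline:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# placeholders: 1000 sensor windows, labels, and a subject id per window
X = np.zeros((1000, 150, 6))
y = np.zeros(1000, dtype=int)
subjects = np.random.randint(0, 20, size=1000)

# all windows of a subject end up on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))
assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```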
![Sample distribution for the 3 problems by classes. The number in round brackets is the amount of windows in the data set used for each problem.](img/dataset_dist.pdf){width=98% #fig:sample_dist}
The distribution of the samples over the classes can be seen in @fig:sample_dist.
## Baselines
Baselines serve to show that our approach outperforms classic, simple approaches to the problem.
......@@ -301,7 +305,7 @@ Furthermore, we report an adapted S score for multiclass problem, defined in a s
S\ score\ multi = \frac{1}{3}\cdot \sum_{i=0}^2 S\ score(\mathbf{C}_i)
\end{align}
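Illustratively, and assuming the binary S score is the harmonic mean of sensitivity and specificity, this averaging could be computed as follows (a sketch; each $\mathbf{C}_i$ is reduced to one-vs-rest counts):

```python
def s_score_multi(confusions):
    """Average the binary S score over the per-class one-vs-rest
    confusion matrices C_0, C_1, C_2, each given as (tp, fp, tn, fn)."""
    def s_score(tp, fp, tn, fn):
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        return 2 * sens * spec / (sens + spec)
    return sum(s_score(*c) for c in confusions) / len(confusions)
```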
We also report the metrics used for problem 1 on a binarized version of the third problem. To binarize the problem, we define "hand washing" as the positive class and the remainder as the negative class. Note that "hand washing" includes "compulsive hand washing". With this binarization, we can compare the models trained on the multiclass problem to the models trained on the initial binary problem. However, as problem 1 is a special case of problem 3, we expect the performance of the models trained for problem 3 to be lower than that of the models trained for problem 1.
\label{chained_model}
Added to that, we also report the performance of the best two models for problems 1 and 2 chained together and then tested on problem 3. This means we first execute the best model for hand washing detection and then, for all sample windows detected as hand washing, run the best model for classifying compulsive vs. non-compulsive hand washing. From this chain, we derive three-class predictions: all windows not detected as hand washing by the first model count as negatives (Null); those predicted to be hand washing, but not predicted to be compulsive by the second model, count as hand washing (HW); the remaining windows are classified as compulsive hand washing (HW-C) by the chained model. This chained model could possibly perform better, as it consists of two different models which, in combination, have had more training time. However, chaining the models also requires more memory and computation time on a device, making it less efficient.
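A sketch of this chaining logic (hypothetical interface; `hw_model` and `compulsive_model` stand for the best problem-1 and problem-2 models, and the `predict` calls are placeholders):

```python
import numpy as np

def chained_predict(windows, hw_model, compulsive_model):
    """Derive 3-class labels (0 = Null, 1 = HW, 2 = HW-C) by chaining
    the binary hand washing detector and the compulsive classifier."""
    preds = np.zeros(len(windows), dtype=int)           # default: Null
    washing = hw_model.predict(windows) == 1            # stage 1: detect HW
    idx = np.flatnonzero(washing)
    if idx.size > 0:
        # stage 2: only windows flagged as hand washing are re-classified
        compulsive = compulsive_model.predict(windows[idx]) == 1
        preds[idx[compulsive]] = 2                      # HW-C
        preds[idx[~compulsive]] = 1                     # HW
    return preds
```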
......
......@@ -16,21 +16,19 @@ Gesture recognition, in general, uses similar methods as the more difficult huma
\label{section:har}
Recognizing multiple gestures or body movements in combination, in a temporal context, and deriving the current activity of the user is called human activity recognition (HAR). In this task, we want to detect a more general activity, compared to a shorter and simpler gesture. An activity can include many distinguishable gestures. However, the same activity will not always include all of the same gestures, and the included gestures can occur in a different order for every repetition. Activities are less repetitive than gestures and generally harder to detect @zhu_wearable_2011. However, Zhu et al. have shown that the combined detection of multiple different gestures can also be used in HAR tasks @zhu_wearable_2011, which makes sense, because a human activity can consist of many gestures. Nevertheless, most methods used for HAR apply machine learning to the data more directly, without the detour of detecting the specific gestures contained in the execution of an activity.
Methods used in HAR include classical machine learning methods as well as deep learning @liu_overview_2021 @bulling_tutorial_2014. The classical machine learning methods rely on features of the data obtained by feature engineering. These methods include, but are not limited to, Random Forests, Hidden Markov Models (HMM), Support Vector Machines (SVM), the $k$-nearest neighbors algorithm, and more. The features can be frequency-domain or time-domain based, but usually both are used at the same time to train these conventional models @liu_overview_2021.
#### Deep neural networks
Recently, deep neural networks have taken over the role of the state-of-the-art machine learning method in the area of human activity recognition @bock_improving_2021, @liu_overview_2021. Deep neural networks are universal function approximators @bishop_pattern_2006 and are known for being easy to use on "raw" data. They are "artificial neural networks" consisting of multiple layers, where each layer contains a set number of nodes that are connected to the nodes of the following layer. Simple neural networks in which all nodes of a layer are connected to all nodes of the following layer are often called "fully connected neural networks" (FC-NN or FC).
The connections' parameters are optimized using forward passes followed by the backpropagation algorithm and an optimization step. We can accumulate the gradients of a loss function with respect to each parameter over a small subset of the data and perform "stochastic gradient descent" (SGD). SGD, or similar optimization methods like the commonly used Adam optimizer @kingma_adam_2017, performs a parameter update step. After many such updates, and if the training works well, the network parameters will have been updated to values that lead to a lower value of the loss function on the training data. However, there is no guarantee of convergence whatsoever. As mentioned above, deep neural networks can, in theory, approximate arbitrary functions. However, empirical testing has revealed that neural networks need a lot of training data in order to perform well, compared to classical machine learning methods.
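As a minimal sketch of such an update step (illustrative names; real training code would also handle the forward and backward passes that produce the gradients):

```python
def sgd_step(params, grads, lr=0.01):
    """Vanilla SGD: move each parameter a small step against the
    gradient of the loss accumulated over a mini-batch."""
    for p, g in zip(params, grads):
        p -= lr * g  # in-place update of e.g. NumPy parameter arrays
    return params
```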
Recurrent neural networks (RNNs) are similar to feed forward neural networks, with the difference that they have access to information from a previous time step. The simplest version of an RNN is a single node that takes the input $\mathbf{x}_t$ and its own output $\mathbf{h}_{t-1}$ from the last time step as inputs. RNNs can be trained on time series data and are able to interpret temporal connections and dependencies in the data to some extent. Recurrent neural networks are trained using "backpropagation through time" @mozer_focused_1995. This means that we have to run a forward pass of multiple time steps through the network first, followed by a backpropagation that sums over all the different time steps and their gradients. For "long" runs, i.e. if the network is supposed to take into account many time steps, there is the "vanishing gradient problem" @hochreiter_vanishing_1998. With an increasing number of time steps, the gradients become smaller and smaller, making it harder or impossible to properly train the recurrent neural network.
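In its simplest form, this recurrence can be written as follows (a common textbook formulation with a $\tanh$ activation; the weight names are ours):
\begin{align}
\mathbf{h}_t = \tanh\left(\mathbf{W}_{x}\,\mathbf{x}_t + \mathbf{W}_{h}\,\mathbf{h}_{t-1} + \mathbf{b}\right)
\end{align}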
Long short-term memory (LSTM) can be used to combat the vanishing gradient problem in recurrent neural networks @hochreiter_long_1997, @hochreiter_vanishing_1998. It can be used in various applications, such as time series prediction, speech recognition and translation tasks (including generative tasks) @smagulova_survey_2019, but also for human activity recognition.
\label{sec:LSTM}
LSTMs consist of a "cell" of which one or more can be contained in a neural network.
The LSTM cell is shown in @fig:lstm_cell and consists of two inputs, four gates and two outputs. The values gathered from the outputs are also part of the input in the next time step of the network's execution, introducing a special case of recurrence.
The inputs to the cell are the external inputs $\mathbf{x}_t$ (from the previous network layer and the current time step), as well as the "memory cell" $\mathbf{c}_{t-1}$ from the previous time step and the "hidden state" $\mathbf{h}_{t-1}$ which is the LSTM's output vector from the previous time step. The calculations describing one time step of the LSTM forward pass are listed in equations \ref{eqn:lstm1} to \ref{eqn:lstm2}.
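To make these computations concrete, a minimal NumPy sketch of a single LSTM forward step is shown below (a standard formulation; the parameter layout and names are ours and may differ from the referenced equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,) hold the
    stacked parameters of the input (i), forget (f), candidate (g)
    and output (o) gates."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b      # stacked gate pre-activations
    i = sigmoid(z[0:H])               # input gate
    f = sigmoid(z[H:2 * H])           # forget gate
    g = np.tanh(z[2 * H:3 * H])       # candidate memory
    o = sigmoid(z[3 * H:4 * H])       # output gate
    c_t = f * c_prev + i * g          # updated memory cell
    h_t = o * np.tanh(c_t)            # new hidden state (output)
    return h_t, c_t
```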
......
......@@ -20,7 +20,7 @@ For the first task of classifying hand washing in contrast to non hand washing a
![F1 score and S score for problem 1](img/washing.pdf){#fig:p1_metrics width=98%}
As we can see, without label smoothing, the neural networks outperformed the conventional machine learning methods by a large margin. The best neural network method outperforms the best traditional method by nearly $0.2$ in the F1 score and by around $0.1$ in the S score. Among the neural network methods themselves, the differences are very small, especially between the top-performing DeepConvLSTM and DeepConvLSTM-A. While DeepConvLSTM reaches a slightly better F1 score of $0.853$, DeepConvLSTM-A reaches $0.847$. However, if we take the S score into consideration, DeepConvLSTM-A ($0.758$) is ahead of DeepConvLSTM ($0.756$). The convolutional neural network (CNN, $0.750$) and the LSTM with attention mechanism (LSTM-A, $0.708$) also reach similar levels of performance on both metrics, with the CNN outperforming the LSTM-A only in the S score. We can see that, as in the preliminary validation, normalization did not lead to the desired performance advantage. For the neural network methods, activating normalization leads to a decrease of $0.01$ to $0.1$ in the F1 score and of $0.07$ to $0.15$ in the S score.
......@@ -32,7 +32,7 @@ With label smoothing, we can reach an increased performance with all of the mode
The models running on normalized data also profit from label smoothing; however, they still cannot reach the performance of the non-normalized models.
For the special case of the models initially trained on problem 3 which were then binarized and run on problem 1, we only report some results in this section. The full results can be found in the appendix, in table \ref{tbl:washing_binarized} and fig. \ref{fig:washing_binarized}. Surprisingly, the models trained on problem 3 reach similar F1 scores on the test data of problem 1 as the models trained on problem 1. DeepConvLSTM achieves an F1 score of $0.857$, DeepConvLSTM-A achieves $0.847$. The F1 score of DeepConvLSTM is even higher than the highest F1 score of the models trained for problem 1 by $0.004$. However, for the S score metric, the models trained for problem 3 can only reach up to $0.704$ (CNN) or $0.671$ (DeepConvLSTM-A), which is lower by $0.052$ than the best performing model trained for problem 1.
\FloatBarrier
......
\begin{table}
\centering
\caption{Problem 2: metrics of the different classes with smoothing}
\label{tbl:only_conv_hw_rm}
\begin{tabular}{|l|l|c|c|c|c|}
\toprule
......
\begin{table}
\centering
\caption{Problem 1: metrics of the different classes binarized from problem 3}
\label{tbl:washing_binarized}
\begin{tabular}{|l|l|c|c|c|c|}
\toprule
& & \textbf{specificity} & \textbf{sensitivity} & \textbf{F1 score} & \textbf{S score} \\
\textbf{normalize} & \textbf{model class} & & & & \\
\midrule
\multirow{10}{*}{\textbf{False}} & \textbf{CNN} & 0.641 & 0.780 & 0.798 & \textbf{0.704} \\
& \textbf{DeepConvLSTM} & 0.463 & \textbf{0.945} & \textbf{0.857} & 0.622 \\
& \textbf{DeepConvLSTM-A} & 0.534 & 0.902 & 0.847 & 0.671 \\
& \textbf{FC} & 0.515 & 0.890 & 0.837 & 0.652 \\
& \textbf{LSTM} & 0.334 & 0.943 & 0.831 & 0.493 \\
& \textbf{LSTM-A} & 0.470 & 0.919 & 0.844 & 0.622 \\
& \textbf{Majority Classifier} & \textbf{1.000} & 0.000 & 0.000 & 0.000 \\
& \textbf{RFC} & 0.938 & 0.422 & 0.581 & 0.582 \\
& \textbf{Random Classifier} & 0.344 & 0.660 & 0.667 & 0.452 \\
& \textbf{SVM} & 0.939 & 0.396 & 0.555 & 0.557 \\
\cline{1-6}
\multirow{10}{*}{\textbf{True }} & \textbf{CNN} & 0.510 & 0.730 & 0.742 & 0.600 \\
& \textbf{DeepConvLSTM} & 0.596 & 0.688 & 0.730 & 0.639 \\
& \textbf{DeepConvLSTM-A} & 0.493 & 0.752 & 0.752 & 0.596 \\
& \textbf{FC} & 0.411 & 0.900 & 0.823 & 0.565 \\
& \textbf{LSTM} & 0.384 & 0.675 & 0.683 & 0.489 \\
& \textbf{LSTM-A} & 0.355 & 0.762 & 0.734 & 0.484 \\
& \textbf{Majority Classifier} & \textbf{1.000} & 0.000 & 0.000 & 0.000 \\
& \textbf{RFC} & 0.946 & 0.205 & 0.332 & 0.336 \\
& \textbf{Random Classifier} & 0.344 & 0.660 & 0.667 & 0.452 \\
& \textbf{SVM} & 0.951 & 0.266 & 0.412 & 0.415 \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}
\centering
\caption{Problem 1: metrics of the different classes with smoothing}
\label{tbl:washing_rm}
\begin{tabular}{|l|l|c|c|c|c|}
\toprule
......