Commit 0213765f authored by burcharr

automatic writing commit ...

parent f8657062
......@@ -50,7 +50,7 @@ To conclude the results of problem 3, the overall performance of this more diffi
### Practical applicability
The data from the real world evaluation with our test subjects shows that not all real world hand washing procedures are detected by our smart watch system. Overall, the system's sensitivity was only $28.33\,\%$ in the evaluation of a "normal day", which is much lower than the theoretical results. However, this was to be expected to some degree, since real hand washing takes many forms and patterns that are unlikely to all be captured during the explicit recording of training data. Added to that, the hand washing detection depended on the side of the body on which the watch was worn, at least for some subjects. The performance was significantly worse if the watch was worn on the right wrist. This is likely because the hand washing data used for training was collected almost exclusively with smart watches worn on the left wrist. If the data from subjects wearing the watch on the right wrist is left out, the overall detection sensitivity rises to $50\,\%$.
For some subjects, the smart watch application did not work properly, i.e. it did not keep running in the background as desired, which is why their results could not be included in the reported results. However, it is possible that other users' smart watch applications were also inactive for some of the time, possibly missing some hand washing procedures during this period.
......@@ -60,11 +60,11 @@ Our theoretical results could therefore not be reached in the real life scenario
We also expected that a higher intensity or a longer duration of hand washing would have a positive influence on the detection probability by the model on the smart watch. This seems plausible for the longer duration due to the smoothing, but also for the intensity. It can be assumed that the system reaches higher certainties for high intensity washing than for low intensity washing, as it is likely more separable from less intense activities. However, the results showed a significantly positive correlation only between intensity and detection rate, whereas the detection rate and hand washing duration seemed to be mostly uncorrelated. This may again be due to the relatively small sample size. Especially for the longer washing tasks of 30 s and 35 s, there were only 2 examples, one of which was not detected. This may have had a large influence on the absence of a positive correlation in the evaluation results.
Added to that, the system detected an average of 4 false positives per subject per day. These false positives could lead to annoyance and ultimately to the users losing trust in the detection capabilities of the system. However, the number found in the everyday task also varied a lot from subject to subject. Mainly, washing activities led to false positives, which was to be expected, because movements similar to those in hand washing are executed. Other activities also led to false positives, which confirmed that the theoretical results' high, but not very high, specificity does not translate into the total avoidance of false positives.
The test of scenario 2, the task of washing intensively for at least 30 seconds, yielded a much higher accuracy. Per subject, the washing was detected in $76\,\%$ of washing repetitions on average. Compared to the sensitivity of $90\,\%$ reached for problem 1 with smoothing, this is lower by only $14$ percentage points. The discrepancy here is much smaller than in the everyday scenario. This could be because the training data for hand washing procedures was also collected in a more controlled environment, so more similar patterns were produced. The results of the evaluation for scenario 2 are thus better than the results for scenario 1.
In total, the practical evaluation showed some weaknesses and some strengths of the system. As the sample size is small and system instabilities occurred, the results have to be interpreted carefully. The evaluation is valid especially for the false positives and the activities provoking them. However, the low sensitivity found in the everyday task does not match the much higher sensitivity found in the intensive hand washing task, and the differences between subjects were huge for scenario 1. Part of the reason for this is the difference in performance between the left and right wrists.
## Comparison of goals to results
#### Detection of hand washing in real time from inertial motion sensors
......@@ -110,4 +110,4 @@ In the second test of the practical evaluation, subjects performed intensive and
Hence, the evaluation results suggest that the developed system is able to properly detect hand washing in many cases. The specificity and sensitivity of the system are high, but leave some room for improvement.
In conclusion, the application of wrist worn sensor data to the detection of hand washing and compulsive hand washing remains an interesting and open field of research, with many possible areas of application. Especially the detection of compulsive hand washing would be a world first and seems promising for future use in the treatment of OCD patients. Due to the possibility of running neural network models directly on wrist worn smart watches, interventions could be generated in real time and with low latency.
......@@ -23,7 +23,7 @@ The separation of compulsive hand washing from ordinary hand washing is an even
One method of treatment for clinical cases of OCD is exposure and response prevention (ERP) therapy @meyer_modification_1966 @whittal_treatment_2005. Using this method, patients who suffer from OCD are exposed to situations in which their obsessions are stimulated, and they are helped to prevent compulsive reactions to the stimulation. The patients can then "get used" to the situation in a sense, and thus the reaction to the stimulation weakens over time. This means that their quality of life improves, as the severity of their OCD declines.
A successful, i.e. reliable and accurate, system for compulsive hand washing detection could be used to intervene whenever compulsive hand washing is detected. It could therefore help psychologists and their patients in the treatment of the symptoms. It could help the user stop the compulsive behavior by issuing a warning. Such a warning could be a vibration of the device, or a sound that is played upon the detection of compulsive behavior. However, the hypothesis of usefulness is yet to be tested, as no such system exists as of now. Therefore, we want to develop a system that can not only detect hand washing with low latency and in real time, but also discriminate between ordinary hand washing and obsessive-compulsive hand washing at the same time. The system could then, as described, be used in ERP therapy sessions, but also in everyday life, to prevent compulsive hand washing.
### Wrist worn sensors
Different types of sensors can be used to detect activities such as hand washing. It is possible to detect hand washing from RGB camera data to some extent. However, for this to work, we would need to place a camera in every place and room where a subject might want to wash their hands. This is infeasible for most applications of hand washing detection and could be very expensive. Added to that, it might be problematic to place cameras inside washrooms or bathrooms for privacy reasons. Thus, a better alternative could be body worn, camera-less devices.
# Methods
In this chapter, we describe our approaches towards the development and application of a hand washing detection system, as well as a system to separate ordinary hand washing from compulsive hand washing.
First, we explain the collection and development of a combined data set from multiple sources, which can be used to train arbitrary machine learning models to detect hand washing in inertial wrist motion data.
......@@ -9,7 +9,7 @@ Added to that, we further explain the development and testing of different neura
Then we explain meaningful methods of evaluating the developed models and methods, both on unseen pre-recorded data and with real world subjects.
## Data set
In order to be able to train any machine learning algorithm, we need enough data to correctly train the chosen model. In our case of wrist motion data, we used acceleration and gyroscope time series data from multiple sources, which will be explained below. The inertial data of each sensor is given as $\mathbf{s}_i \in \mathbb{R}^{d_i \times t}$, where $d_i$ is the dimensionality of the sensor (e.g. $d_{accelerometer} = 3$) and $t$ is the number of samples in a time series. We use accelerometer and gyroscope data, which both have 3 dimensions. We combine these two sensors into one data series of dimensionality $\mathbb{R}^{6 \times t}$. An example of the sensor data used in our experiments is shown in fig. \ref{fig:sensor_data}.
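As a minimal sketch of this representation (hypothetical, already synchronized streams with the same sampling rate), the two 3-axis streams are simply stacked along the channel dimension:

```python
import numpy as np

# Hypothetical, already synchronized 3-axis streams with t samples each.
t = 500
acc = np.random.randn(3, t)   # accelerometer: (d_acc, t)  = (3, t)
gyro = np.random.randn(3, t)  # gyroscope:     (d_gyro, t) = (3, t)

# Combined sensor series s in R^{6 x t}, as used throughout this work.
s = np.concatenate([acc, gyro], axis=0)
assert s.shape == (6, t)
```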
\begin{figure}[hp]
\centering
......@@ -23,16 +23,16 @@ In order to be able to train any machine learning algorithm, we need enough data
We also need to pay attention to the sampling rate of the given data, i.e. how many data points per second were recorded. The sampling rate should not differ between the distinct data streams contained in the final data set. Testing a model trained on one timescale on data with a different timescale will lead to a significant decrease in performance. In order to successfully run the machine learning training algorithms, the data must therefore have a jointly fixed sampling rate.
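As a sketch, one simple way to bring a stream recorded at a lower rate (e.g. 20 Hz, as in the WISDM data) to a joint 50 Hz rate is linear interpolation; the resampling procedure actually used may differ:

```python
import numpy as np

def resample(signal: np.ndarray, rate_in: float, rate_out: float) -> np.ndarray:
    """Linearly interpolate a (d x t) sensor stream to a new sampling rate."""
    t_in = signal.shape[1]
    duration = t_in / rate_in
    old_times = np.arange(t_in) / rate_in
    new_times = np.arange(int(duration * rate_out)) / rate_out
    return np.stack([np.interp(new_times, old_times, channel) for channel in signal])

stream_20hz = np.random.randn(6, 200)            # 10 s of data at 20 Hz
stream_50hz = resample(stream_20hz, 20.0, 50.0)  # -> shape (6, 500)
```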
### Data set requirements for (compulsive) hand washing detection
Separating hand washing from non-hand washing activities is a difficult task for which pre-recorded training data is necessary. We split the required data into the following activity classes:
1. hand washing
2. compulsive hand washing
3. other activities
In order to correctly detect hand washing in real time in a real world scenario, the model must have access to real hand washing sensor data at training time. Thus, such data was recorded, labeled and included in the data set. In order to add a "negative" class to the training set, we used data from everyday scenarios ("long term", i.e. all-day recordings), as well as labeled data from previously existing studies on gesture or activity recognition. Activities or gestures which are unknown to the system are harder to classify and are more likely to be wrongly detected as hand washing. It is therefore desirable that as many non-hand washing activities as possible are included in the training set in order to better separate hand washing from all other activities. We wish to avoid false positive hand washing detections, which could annoy the users of the system and possibly decrease their trust in its reliability, lowering their response to all detections.
In order to also separate non-compulsive hand washing from compulsive hand washing, data of compulsive hand washing must be included. To record this data, real patients can be asked to wear a sensor during their daily life, but especially during hand washing.
### Data used in our data set
We used hand washing data and "compulsive" hand washing data recorded at the University of Basel and University of Freiburg as our "positive" class data. This data was recorded on several occasions and using different paradigms. We mainly used data recorded at $50\,$Hz, using a smart watch application. Data was recorded in 2019 and in 2020. The data from 2019 includes hand washing data and, in addition, simulated "compulsive" hand washing. For the simulated compulsive hand washing, subjects were asked to "dirty" their hands with different substances, like finger paint or Nivea\textregistered\ cream, to serve as motivation for intensive hand washing. Afterwards, they had to follow certain scripts of intensive hand washing steps. Each script contained several steps of washing, like interlacing the fingers, washing the fingers individually, washing the palms and more.
......@@ -51,8 +51,8 @@ Added to that, multiple data sets from other studies were used. In our selection
\begin{tabular}{|l|l|l|l|}
\hline
No & Dataset name & Contained activities (excerpt) & Recording frequency \\ \hline
1 & 2019 & Hand washing and compulsive hand washing & 50 Hz \\ \hline
2 & 2020 & Different hand washing activities & 50 Hz \\ \hline
3 & 2020 long-term & All day recordings of activities of daily living & 50 Hz \\ \hline
4 & WISDM & Movement (walking, jogging, stairs, sitting, ...) & 20 Hz \\ \hline
5 & RealWorld & Movement (walking, jogging, stairs, sitting, ...) & 50 Hz \\ \hline
......@@ -87,9 +87,9 @@ In total, we want to solve three slightly different classification problems:
3. Classifying hand washing and compulsive hand washing separately and distinguishing both from other activities at the same time
From this point on, we will refer to these three problems as "problem" or "task" 1, 2 or 3.
The first and the second task are both binary classification problems, while the third task is a three-class multiclass classification problem. In the first task, both hand washing and compulsive hand washing count as hand washing activities. For the second task, we want to classify only data from hand washing activities into the two classes hand washing and compulsive hand washing. In case we obtain two good models for 1. and 2., we can then combine the two models to also compete with the multiclass models trained for 3. We also want to look into the direct detection of compulsive hand washing, which is a sub-problem of 3. Thus, we will also report results for this special case, as it could later be used for the treatment of patients.
In total, 3 classes of data that contain samples from different activities can be distinguished:
- Hand washing (HW): Hand washing activity that does not contain obsessive-compulsive washing.
- Obsessive-compulsive hand washing (HW-C): Hand washing activity containing obsessive-compulsive washing.
......@@ -132,20 +132,20 @@ For each of the problems, we train the baselines with the same windows. For SVM
The implementations of SVM and RFC in scikit-learn @pedregosa_scikit-learn_nodate are used. SVM and RFC are trained with scikit-learn's default parameters.
To incorporate the "chance level", we use majority prediction and uniform random prediction as baselines.
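As a sketch (hypothetical data shapes, flattened windows), the baselines described above could be set up in scikit-learn as follows:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

# Hypothetical flattened windows: 1000 windows of 150 samples x 6 axes.
X_train = np.random.randn(1000, 150 * 6)
y_train = np.random.randint(0, 2, size=1000)
X_test = np.random.randn(200, 150 * 6)

baselines = {
    "SVM": SVC(),                                   # scikit-learn defaults
    "RFC": RandomForestClassifier(),                # scikit-learn defaults
    "majority": DummyClassifier(strategy="most_frequent"),
    "random": DummyClassifier(strategy="uniform"),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
```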
## Neural network based detection of hand washing
As explained in Section \ref{section:har}, neural networks are the state of the art in human activity recognition. This also applies to hand washing detection, and thus our classification algorithms are entirely based on neural networks.
### Preprocessing and train-test-split
The data sets were normalized separately so that each is mean free and has a standard deviation of $1$. We also tried training all the models without normalization and noticed that the performance on the validation set was better without normalization, which is why we included both a normalized and a non-normalized version of the data in our experiments, in order to compare the performance on the test set.
The sensor values were partitioned into windows using a sliding window approach. After some testing, a window length of $3\,s$ ($150$ samples) was fixed. We used an overlap of $50\,\%$ ($75$ samples). The external data sets were only used for training. The prerecorded data sets containing hand washing and simulated compulsive hand washing were used for training and testing.
We used a train-test split of $85\,\%$ to $15\,\%$, but split the data on the recording level. This means that training and testing are executed on distinct subsets of subjects, which makes sure that the performance on the test set gives a good estimate of the generalization performance. As every person washes their hands in a slightly different way, this generalization is needed in the real world in order to also detect unseen but similar patterns of hand washing or compulsive hand washing. The sliding windows were only calculated after the train-test split, to avoid leakage from the test set into the training set.
For the training, a validation set is split off from the training data, its size also being $15\,\%$ of the training data. The validation set is drawn from the training data and can be used to evaluate a model's performance during development and to monitor the success of the training process.
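A sketch of this windowing and recording-level split with hypothetical recordings (the helper names are illustrative, not our actual implementation):

```python
import numpy as np

WINDOW = 150  # 3 s at 50 Hz
STEP = 75     # 50 % overlap

def make_windows(recording: np.ndarray) -> np.ndarray:
    """Cut a (6 x t) recording into overlapping windows of shape (n, 6, WINDOW)."""
    t = recording.shape[1]
    starts = range(0, t - WINDOW + 1, STEP)
    return np.stack([recording[:, s:s + WINDOW] for s in starts])

# Hypothetical per-recording data; the split is done on the recording level,
# so windows of one subject never end up in both the train and the test set.
recordings = {f"subject_{i}": np.random.randn(6, 3000) for i in range(10)}
test_ids = {"subject_8", "subject_9"}  # roughly 15 % held out

train_windows = np.concatenate(
    [make_windows(r) for k, r in recordings.items() if k not in test_ids])
test_windows = np.concatenate(
    [make_windows(r) for k, r in recordings.items() if k in test_ids])
```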
### Architectures
The architecture of a neural network determines its capacity as well as how well it can be trained and how it performs at test time. We tried multiple promising architectures, which are listed below. We implemented all these neural networks in Python using PyTorch @paszke_pytorch_2019. The architectures are explained in this section. An overview of each architecture's graphical representation is shown in fig. \ref{fig:network_architectures}.
\begin{figure}[hp]
\centering
......@@ -188,7 +188,7 @@ As explained in Section \ref{sec:LSTM}, the recurrent networks, especially those
Added to the simple LSTM model, we also implemented the LSTM with attention mechanism (LSTM-A) described in Section \ref{sec:LSTMA}, which was proposed by Zeng et al. @zeng_understanding_2018. The attention mechanism allows the network to dynamically focus on certain parts of the input by weighting the sum over a time series' LSTM hidden states. In the LSTM-A model, we directly apply one LSTM layer to the inputs, and then use a linear layer to calculate the weight of each time step's hidden state for the weighted sum. Afterwards, two linear layers are used to classify the resulting representation, the first one having $64$ units and the second one being the output layer with size $2$ or $3$. We obtain a network with around 94,000 learnable parameters.
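The following PyTorch sketch illustrates this weighted-sum attention; the hidden size, the softmax normalization of the weights and other details are assumptions for illustration, not necessarily the exact configuration of our LSTM-A:

```python
import torch
import torch.nn as nn

class LSTMAttention(nn.Module):
    """LSTM whose hidden states are combined by a learned weighted sum."""

    def __init__(self, n_channels: int = 6, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.attention = nn.Linear(hidden, 1)   # one scalar weight per time step
        self.fc1 = nn.Linear(hidden, 64)
        self.out = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels), e.g. (batch, 150, 6)
        h, _ = self.lstm(x)                                  # (batch, time, hidden)
        weights = torch.softmax(self.attention(h), dim=1)    # normalization assumed
        context = (weights * h).sum(dim=1)                   # weighted sum over time
        return self.out(torch.relu(self.fc1(context)))

logits = LSTMAttention()(torch.randn(8, 150, 6))  # -> (8, 2)
```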
#### DeepConvLSTM
The DeepConvLSTM and its modifications are considered state of the art in human activity recognition tasks. We apply our implementation to the hand washing classification problem. DeepConvLSTM combines the advantages of convolutional layers and LSTMs. We implement it using the original design with four convolutional layers followed by two LSTM layers and a fully connected classification layer. As in the convolutional neural network, we use $64$ filters in each of the convolutional layers, a kernel size of $9$ and a stride of $1$. During preliminary testing, leaving out one LSTM layer as proposed by Bock et al. @bock_improving_2021 did not yield a significantly different performance. Thus we use two layers, as was done in the original study. The LSTM layers each have a hidden size of $128$. The classification layer has output size $2$ or $3$. This results in a network with around 346,000 learnable parameters.
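A PyTorch sketch of this architecture, following the configuration given above (6 input channels, 3 s windows of 150 samples); the activation functions and the use of the last hidden state for classification are assumptions:

```python
import torch
import torch.nn as nn

class DeepConvLSTM(nn.Module):
    """Four 1D conv layers (64 filters, kernel 9, stride 1) -> two LSTM layers -> classifier."""

    def __init__(self, n_channels: int = 6, n_classes: int = 2):
        super().__init__()
        layers, in_ch = [], n_channels
        for _ in range(4):
            layers += [nn.Conv1d(in_ch, 64, kernel_size=9, stride=1), nn.ReLU()]
            in_ch = 64
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTM(64, 128, num_layers=2, batch_first=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 6, 150) -> convolutions over the time axis -> (batch, 64, 118)
        feats = self.conv(x).permute(0, 2, 1)    # (batch, time, 64) for the LSTM
        h, _ = self.lstm(feats)
        return self.out(h[:, -1])                # classify from the last hidden state

model = DeepConvLSTM()
logits = model(torch.randn(8, 6, 150))                 # -> (8, 2)
print(sum(p.numel() for p in model.parameters()))      # on the order of 3.5e5 parameters
```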
#### DeepConvLSTM with attention mechanism (DeepConvLSTM-A)
To our knowledge, no previous work exists that couples DeepConvLSTM with the exact attention mechanism used in LSTM-A. Only after starting the work on this thesis did we find out that a quite similar approach had been tried by Singh et al. @singh_deep_2021. We instead combined the two methods DeepConvLSTM and LSTM-A, ending up with DeepConvLSTM-A. The attention mechanism is implemented exactly the same way as in the LSTM with attention mechanism by Zeng et al. @zeng_understanding_2018, and is therefore different from the one used by Singh et al. Because we found out about the work of Singh et al. late, we did not add their version to the list of architectures we tried. The resulting architecture for our version of a DeepConvLSTM with attention mechanism is still different from theirs. The data is first passed through the four convolutional layers with the same configuration as for DeepConvLSTM. Then it is passed through the LSTM, which only has one layer here, and the hidden states generated over the series of time steps are combined with the weighted sum as in LSTM-A. Afterwards, these results are passed through the fully connected classification layer. The DeepConvLSTM-A model has around 230,000 parameters that need to be trained.
......@@ -196,7 +196,7 @@ To our knowledge, no previous work exists, that couples DeepConvLSTM with the ex
### Training routines and hyper parameter search
We trained all the models using PyTorch @paszke_pytorch_2019. The data was loaded from the Matroska container format using a modified version of the PyAV library. It was then processed in NumPy @harris_array_2020 and converted to PyTorch tensors. The training of the neural networks took place on a single GTX 1070 graphics card by NVIDIA.
The training data was split into $150\,s$ long windows, which were then shuffled before being used to train the models on the different paradigms. The shuffling is done so that the mini-batch method receives random batches, rather than groups of windows from the same temporal context, and to avoid overfitting the network to the order of the windows in the training data.
#### Hyper parameter search
##### Batch size
......@@ -206,7 +206,7 @@ In order to find the best batch size, sizes between 32 samples per batch and 102
There is a connection between the batch size and the learning rate (lr). Increasing the batch size can have a similar effect to reducing the learning rate over time (learning rate decay) @smith_dont_2018. Since we use a comparatively large batch size for our model training, we experimented with smaller learning rate values. During preliminary testing on the validation set, different initial values from 0.01 to 0.00001 were tested. We fixed the initial learning rate to 0.0001, as this provided the best performance. We had implemented starting with a higher learning rate and then using learning rate decay, but found during preliminary testing on the validation set that this approach did not improve the performance in our case. We also found that starting with higher learning rates ($lr > 0.01$) led to numerical instability in the recurrent networks, producing NaN values for gradients and thus parameters. This means the training became unstable for the networks containing LSTM layers, hence the learning rate had to be reduced for these networks anyway.
##### Loss function
As the loss function, we use the cross-entropy loss, weighted by the classes' frequencies ($\mathcal{L}_{weighted}$). This means that the loss function corrects for imbalanced classes and we do not have to rely on subsampling or repetition in order to balance the classes. The weighted cross-entropy loss is defined as shown in equation \ref{eqn:cross_entropy_loss}. We first apply the "softmax" function to the model's output $\mathbf{x}$ (see equations \ref{eqn:softmax} and \ref{eqn:apply_softmax}). Then, the loss is calculated by applying the weighted cross-entropy loss function, with the weight of each class being the inverse of its relative frequency in the training set. This way, the predictions for all classes have the same potential influence on the parameter updates, despite the classes not being perfectly balanced.
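A minimal PyTorch sketch of this weighting, with hypothetical class counts (PyTorch's `CrossEntropyLoss` applies the softmax internally):

```python
import torch
import torch.nn as nn

# Hypothetical class frequencies in the training set: 90 % Null, 10 % hand washing.
class_counts = torch.tensor([9000.0, 1000.0])
weights = class_counts.sum() / class_counts       # inverse of the relative frequency
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(32, 2)                       # raw model outputs x
targets = torch.randint(0, 2, (32,))
loss = criterion(logits, targets)
```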
\begin{figure}
\begin{align}
......@@ -231,7 +231,7 @@ Dropout is a method that helps to prevent neural networks from overfitting to th
We applied dropout to all model classes in preliminary testing. On the validation data, dropout with $p=0.25$ was tested for all models. Out of all the models, only the fully connected network showed an increased validation performance. For this reason, dropout was only applied in the fully connected model.
#### Early stopping
We used early stopping based on the split-off validation set. Early stopping is a regularization technique @prechelt_early_1998 which is frequently employed during the training of neural networks. It helps to prevent overfitting to the training set by stopping the training process early. In order to decide at which point in the process, i.e. after which epoch, the training should be stopped, we monitor the loss function on the validation set. The model is trained using the train set, but as soon as the validation loss starts to rise, we can stop the training. This makes sense because we can assume that the increase of the validation loss reflects the unknown trend of the loss on the test set, which we cannot look at during training. An example of this process can be seen in fig. \ref{fig:learning_curves}, which shows the comparison between training and validation losses over the progress of the training.
\begin{figure}[!h]
\centering
......@@ -240,13 +240,13 @@ We used early stopping, based on the split off validation set. Early stopping is
\label{fig:learning_curves}
\end{figure}
We can see that the train loss still decreases while the validation loss is already rising again. Therefore, we employ early stopping to be able to select the model parameters which lead to the empirically minimal validation loss. The losses achieved by parameter updates using mini-batches do not decrease monotonically. Due to the visible "zig-zagging" of the losses, it makes sense to continue running the training for a fixed number of epochs, even if the validation loss is already rising. This is because the validation loss could potentially decrease again, below the current minimal validation loss, in a future epoch. As training a model can take a lot of time, we need to select a value in a trade-off between continuing the training in order to catch a potential further decrease of the validation loss and stopping the training in order to not waste training time. We fixed the number of epochs to keep running at 50, and the maximum number of epochs a training could run at 300. As can be partially seen in fig. \ref{fig:learning_curves}, the training was usually stopped much earlier than at the 300 epoch mark. The stopping positions heavily depend on the model classes and how fast each class can be trained, and ranged from after around 20 epochs to after more than 100 epochs.
In fig. \ref{fig:learning_curves}, the stop position of the early stopping is marked with a bold \textbf{x}, marking the epoch with the lowest validation loss. After the training is stopped due to early stopping, we reset the model's parameters to the parameters with the lowest validation loss. As can be seen in the figure, the training loss still decreases after this point, but the validation loss does not.
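A sketch of this early stopping logic with the values given above (50 epochs of patience, at most 300 epochs); the training and validation steps are placeholder callables:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              patience: int = 50, max_epochs: int = 300):
    """Keep training while the validation loss might still improve; restore the best weights."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one pass over the training set
        val_loss = validation_loss(model)      # loss on the split-off validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                              # no improvement for `patience` epochs
    model.load_state_dict(best_state)          # reset to the empirically best parameters
    return model
```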
## Online detection of hand washing and compulsive hand washing
As for recording our data set, we use Android based smart watches running a custom application for the online classification of the user's current activity.
The application programming was done by Alexander Henkel and is not part of this work. We only designed the outline for the deep learning model deployment part of the app, by providing the needed pre-trained models in the appropriate formats, so that they can be executed on mobile devices.
......@@ -254,22 +254,22 @@ In order to run a pre-trained neural network based model smart watches, we used
![Flow diagram of the smart watch classification loop](img/wear_data_flow.pdf){#fig:watch_flow width=98%}
The course of action on the smart watch is shown in fig. \ref{fig:watch_flow}. The watch continuously records the data from the integrated IMU to fill a buffer. To filter out the most basic idle case of "no movement", the neural network is only run to classify the current activity if at least one sensor value is higher than a certain threshold $v_{idle}$ that is fixed inside the application. If there is enough movement to reach the threshold, a forward pass of the neural network model is done with the data from the last few seconds. It is possible to set the interval classified in each network pass to a value from $1$ second to $10$ seconds, but the model has to be trained for the specific interval length, as mentioned above. Our windows had a length of 3 seconds (150 samples of the 6 sensor axes). The forward pass then outputs class probabilities for each of the windows considered.
In order to avoid false positives and outliers, smoothing can be applied to the network outputs. The smoothing acts as a low-pass filter on the predictions, filtering out rapid changes in the output. To do this, we apply a threshold to the running mean of the last $n$ predictions, again over a fixed interval, e.g. 15 seconds. If the running mean reaches the threshold, the final prediction of the window will be "hand wash". We tried different interval sizes and thresholds on the validation sets for each of the models for each of the problems and report the projected performance results for the best thresholds found on the validation set in Section \ref{sec:results}.
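A sketch of this running-mean smoothing, with an illustrative window count $n$ and threshold (the actual values were tuned on the validation sets):

```python
import numpy as np

def smoothed_prediction(hand_wash_probs: np.ndarray, n: int = 5, threshold: float = 0.6) -> bool:
    """Final 'hand wash' decision based on the running mean of the last n window predictions."""
    recent = hand_wash_probs[-n:]
    return recent.mean() >= threshold

# e.g. the "hand wash" probabilities of the last few 3 s windows
probs = np.array([0.2, 0.7, 0.9, 0.8, 0.85])
print(smoothed_prediction(probs))  # True: the running mean exceeds the threshold
```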
In the end, if the final prediction of the prediction pipeline is "hand wash", a notification can be sent to the user. The running mean filter is especially important to avoid sending out too many notifications for false positives. Depending on the state of the study, the notification can be used for further data collection. The application can ask the user if the detection was correct, so that we can validate and annotate new data points.
In the future, with a correctly trained model, the notification could also be used as an intervention in the treatment of compulsive hand washing, if compulsive hand washing was detected. As explained in the introduction, our application could help by presenting the user with an intervention, in order to prevent the user's response to the urge of washing their hands in a compulsive way. It could also inform the medical personnel who are treating the patient, who could then use this information for the therapy.
## Evaluation
In order to evaluate the developed systems, different methods of evaluation are taken into account. We are able to run theoretical performance measures on the available pre-recorded hand washing data. We have to define fitting metrics and scores in order to get a meaningful estimate of the expected real world performance.
On the other hand, a practical evaluation can be run with actual test users on the smart watch system using the application described above. We test the detection ability of our model in a controlled environment and in "real" everyday use. The results of both evaluations are reported in Section \ref{sec:results}.
### Theoretical evaluation
We use the previously unseen sliding window sensor data $\mathbf{X}_{test}$ and the respective labels $\mathbf{y}_{test}$ as input to each of the models that we previously trained. From their respective outputs, we obtain the predictions $\mathbf{p}$. In order to evaluate how well each model performs on the unseen data, we have to compare the predictions to the ground truth labels in a meaningful way, utilizing metrics to score each model's performance.
Different metrics can be used for a binary classification system. As the classes are usually highly unbalanced in our scenario of hand washing, the simple accuracy score is not sufficient. As most activities in the daily life of a user will be "no hand washing", an accuracy of far over $90\,\%$ could be reached by always predicting "no hand washing" without actually solving the problem of hand washing detection. The accuracy score does not take this disparity in class sizes into account, and is thus not a meaningful metric for our problem formulation.
......@@ -280,7 +280,18 @@ We therefore use and report the following, more sophisticated, metrics:
- F1 score
- S score
Let $P$ be the number of positive examples in the data and $N$ be the number of negative examples. Further, let $TP$ and $TN$ be the number of true positives and true negatives in the prediction, and let $n = P + N$ be the total number of samples in the test set. Added to that, let $FP$ and $FN$ be the number of false positives and false negatives in the set. The definitions are shown in equations \ref{eqn:true_neg_pos} to \ref{eqn:true_neg_pos_end}.
\begin{figure}
\begin{align}
\label{eqn:true_neg_pos}
TP = \#\{\mathbf{pred}_i | 1 &= \mathbf{pred}_i = \mathbf{y}_{test, i}, i\in 1,...,n\} \\
TN = \#\{\mathbf{pred}_i | 0 &= \mathbf{pred}_i = \mathbf{y}_{test, i}, i\in 1,...,n\} \\
FP = \#\{\mathbf{pred}_i | 1 &= \mathbf{pred}_i \neq \mathbf{y}_{test, i}, i\in 1,...,n\}\\
FN = \#\{\mathbf{pred}_i | 0 &= \mathbf{pred}_i \neq \mathbf{y}_{test, i}, i\in 1,...,n\}
\label{eqn:true_neg_pos_end}
\end{align}
\end{figure}
We can then define the metrics as shown in equations \ref{eqn:metrics_first} to \ref{eqn:metrics_last}:
......@@ -297,9 +308,9 @@ S\ score &= 2 \cdot \frac{Sensitivity \cdot Specificity}{Sensitivity + Specifici
\end{figure}
\label{s_score}
The sensitivity is the rate of positive samples that are correctly recognized. The specificity is the rate of negatives that are correctly recognized. If both these measures are close to 1, the model performs well. The precision is the ratio of true positives among all positive predictions. It is similar to the sensitivity but also punishes false positives to some extent. The recall is the same as the sensitivity. The harmonic mean of recall and precision is called the F1 score and is also commonly used to evaluate binary prediction tasks. Since we especially need to balance specificity and sensitivity for our task, we also report the S score, which we define as the harmonic mean of specificity and sensitivity. One of the reasons for reporting the S score is the lack of false positive punishment in the F1 score formula. The F1 score does not punish false positives as much as needed in the task of compulsive hand washing detection. While false positives are partly captured by the precision measure, if there are many positives in the ground truth, the precision will not weigh false positives enough. Including the specificity in the measure therefore makes sure we do not lose track of the false positives, which would be annoying to the user, especially if we send out smart watch notifications with vibration or sound alerts.
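As a sketch, these metrics can be computed directly from the counts defined above:

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, precision, F1 and S score from the confusion counts."""
    sensitivity = tp / (tp + fn)          # = recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    s_score = 2 * specificity * sensitivity / (specificity + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "F1": f1, "S": s_score}

# Example: many negatives, as is typical for all-day recordings.
print(binary_metrics(tp=80, tn=900, fp=100, fn=20))
```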
For the multiclass problem of distinguishing compulsive hand washing from normal hand washing and from other activities, the binary metrics are not applicable. Here, we report normalized confusion matrices and their mean diagonal values as one performance measure. The confusion matrix shows how many samples belonging to a certain class (true labels, rows of the matrix) are predicted to belong to which class (predicted labels, columns of the matrix). The normalized version of the confusion matrix replaces the total values by ratios in proportion to the number of true labels for each class. This means that the values in each true label row of the matrix sum to 1.
The mean diagonal value of this matrix can be seen as a mean class accuracy score, as the diagonal values of the normalized confusion matrix are the accuracy values for each possible class.
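A sketch of the normalized confusion matrix and its mean diagonal value, using hypothetical labels for the three classes (0 = Null, 1 = HW, 2 = HW-C):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 2, 2, 1])

# Each row is normalized by the number of true samples of that class, so rows sum to 1.
cm = confusion_matrix(y_true, y_pred, normalize="true")
mean_class_accuracy = cm.diagonal().mean()
```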
Added to that, we report an adapted F1 score, identical to the one used by Zeng et al. @zeng_understanding_2018. The adapted multiclass F1 score is calculated by taking the mean, over all classes $\mathbf{C}$, of the F1 scores obtained if we treat class $\mathbf{C}_i, i \in [0,1,2]$ as the positive class and the remaining classes as the negative class:
......@@ -316,22 +327,21 @@ S\ score\ multi = \frac{1}{3}\cdot \sum_{i=0}^2 S\ score(\mathbf{C}_i)
We also report the metrics used for problem 1 on a binarized version of the third problem. To binarize the problem, we define "hand washing" as the positive class and the remainder as the negative class. Note that "hand washing" includes "compulsive hand washing". With this binarization, we can compare the models trained on the multiclass problem to the models trained on the initial binary problem. However, as problem 1 is a special case of problem 3, we expect the performance of the models trained for problem 3 to be lower than that of the ones trained for problem 1.
\label{chained_model}
Added to that, we also report the performance of the best two models for problem 1 and problem 2 chained together and then tested on problem 3. This means we execute the best model for hand washing detection first, and then, for all sample windows that were detected as hand washing, we run the best model for the classification of compulsive vs. non-compulsive hand washing. From this chain, we can derive three-class predictions by counting all samples that were not detected by the first model as negatives (Null), and the ones predicted to be hand washing but not predicted to be compulsive by the second model as hand washing (HW). The remaining samples are then classified as compulsive hand washing (HW-C) by the chained model. This chained model could possibly perform better, as it consists of two different models, which, in combination, have had more training time and possibly a higher capacity. However, chaining two models would also take up more memory and more computation time on the device, and thus be less efficient.
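A sketch of how such a chain could produce three-class predictions; both models are placeholders with a scikit-learn style `predict` interface, and the class indices are assumptions:

```python
import numpy as np

def chained_predict(model_hw, model_compulsive, windows: np.ndarray) -> np.ndarray:
    """Combine a hand washing detector and a compulsive-vs-ordinary classifier into 3-class output."""
    preds = np.zeros(len(windows), dtype=int)          # 0 = Null by default
    is_washing = model_hw.predict(windows) == 1        # first model: hand washing yes/no
    if is_washing.any():
        washing_windows = windows[is_washing]
        is_compulsive = model_compulsive.predict(washing_windows) == 1
        preds[is_washing] = np.where(is_compulsive, 2, 1)  # 2 = HW-C, 1 = HW
    return preds
```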
### Practical evaluation
For the practical evaluation, we asked 5 subjects to test the system in practice. We defined two different paradigms, one for real world performance evaluation and one for explicit evaluation of the model running on the smart watch. To this end, the model with the best performance on the test set of task 1, i.e. the general detection of hand washing, was exported to be executed on the watch inside the described smart watch application. We limited the testing to these scenarios because we did not have access to subjects who actually wash their hands compulsively. The scenarios were:
1. The subjects wear a smart watch for one day. During this time, whenever they wash their hands, the watch will or will not detect the hand washing procedure. The subjects note down whether or not the hand washing was recognized correctly, how long the washing procedure was, and how intense it was on a scale of 1 to 5. Added to that, whenever there is a false prediction of hand washing, they note down their current activity.
2. The subjects specifically go to the bathroom to wash their hands 3 times to test the recognition. They note down whether or not the hand washing was recognized correctly. The hand washing is supposed to be done thoroughly and intensively (at least 30 seconds of washing per repetition).
Scenario 1 can be used to evaluate the real world performance of the classifier in day to day living. It is supposed to gather information about the use cases in which the system works well, but also about the cases in which it fails. There are many activities of daily living that are not included in the data set, i.e. activities unseen by the classifier. Such activities might be problematic for the classifier, as they are unlikely to perfectly resemble any of the Null class activities it was trained on. The test in scenario 1 is supposed to uncover some of these activities, which might be detected as false positives. The activities producing false positives could be added to the set of negative training examples in the future. Added to that, by having the subjects note down for every hand wash whether the detection worked, we also get an estimate of the sensitivity of the system beyond what the theoretical evaluation yielded.
Scenario 2 is used to check whether the system works correctly most of the time when we know for certain that intensive washing is involved. It also ensures the subjects' active compliance by making the hand washing activity their main focus. In scenario 1, it is possible that a subject sometimes forgets to take notes, which is less likely in the controlled hand washing scenario.
Together, the two scenarios provide a basis for estimating the real world performance of the system.
The invitation to the hand washing evaluation (German language) with the exact description of the two scenarios can be found in the appendix.
Recently, deep neural networks have taken over the role of the state of the art in many machine learning domains.
The connections' parameters are optimized using forward passes through the network of nodes, followed by the execution of the backpropagation algorithm and an optimization step. By accumulating the gradients of a loss function with respect to each parameter over a small subset of the data, we can perform "stochastic gradient descent" (SGD). SGD or similar optimization methods, like the commonly used Adam optimizer @kingma_adam_2017, perform a parameter update step. After many such updates, and if the training works well, the network parameters will have been updated to values that lead to a lower value of the loss function for the training data. However, there is no guarantee of convergence whatsoever. As mentioned above, deep neural networks can, in theory, be used to approximate arbitrary functions. Nevertheless, the parameters for a perfect approximation cannot easily be found, and empirical testing has shown that neural networks need a lot of training data in order to perform well compared to classical machine learning methods. In return, with enough data, deep neural networks often outperform the classical machine learning methods.
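A minimal training loop illustrating these steps, here sketched with PyTorch, a small fully connected network and random placeholder data (all sizes and the learning rate are arbitrary illustration values):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimizer
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 6)            # one mini-batch of inputs (random placeholder data)
y = torch.randint(0, 2, (64,))    # corresponding class labels

for step in range(100):
    optimizer.zero_grad()         # clear the gradients accumulated in the previous step
    logits = model(x)             # forward pass through the network
    loss = loss_fn(logits, y)     # loss value to be minimized
    loss.backward()               # backpropagation: gradients w.r.t. all parameters
    optimizer.step()              # parameter update step (Adam / SGD-style)
```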
###### Convolutional neural networks (CNNs)
are neural networks that are not fully connected, but work by using convolutions with a kernel that is slid over the input. CNNs were first introduced for handwritten character recognition @lecun_backpropagation_1989 @le_cun_handwritten_1990 (1989, 1990), but were later revived for computer vision tasks @krizhevsky_imagenet_2012 (2012), after more computational power had become available on modern devices to train them. Since the rise of CNNs in computer vision, most computer vision problems are solved with them. The convolutions work by moving filter windows with learnable parameters (also called kernels) over the input @albawi_understanding_2017. As opposed to a fully connected network, the weights are shared over many of the nodes, because the same filters are applied over the full size of the input. CNNs have fewer parameters to train than a fully connected network with the same number of nodes, which makes them easier to train. They are generally expected to perform better than FC networks, especially on image related tasks. The filters can be 2-dimensional (2d), as for images (e.g. a 5x5 filter moved across the two axes of an image), or 1-dimensional (1d), which can e.g. be used to slide a kernel along the time dimension of a sensor recording. Even in the 1-dimensional case, fewer parameters are needed compared to the application of a fully connected network. Thus, the 1-dimensional CNN is expected to be easier to train and to achieve a better performance.
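A small sketch of such a 1-dimensional CNN for windows of inertial sensor data is shown below; the 6 input channels (3-axis accelerometer and 3-axis gyroscope), the window length of 150 samples and all layer sizes are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class SensorCNN(nn.Module):
    """1d CNN: the kernels are slid along the time axis of a sensor window."""
    def __init__(self, n_channels=6, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5),   # 1d convolution over time
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                    # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                               # x: (batch, channels, time)
        return self.classifier(self.features(x).squeeze(-1))

windows = torch.randn(8, 6, 150)                        # batch of 8 sensor windows
print(SensorCNN()(windows).shape)                       # -> torch.Size([8, 2])
```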
###### Recurrent neural networks (RNNs)
Another study by Singh et al. combines DeepConvLSTM with a self-attention mechanism.
For HAR, DeepConvLSTM and the models derived from it are the state of the art machine learning methods, as they consistently outperform other model architectures on the available benchmarks and data sets.
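The general structure of such models can be sketched as a convolutional feature extractor followed by LSTM layers; the layer sizes below are illustrative assumptions and do not reproduce the exact configuration of the cited architectures:

```python
import torch
import torch.nn as nn

class DeepConvLSTMSketch(nn.Module):
    """Rough sketch of the DeepConvLSTM idea: convolutions extract local features
    along the time axis, LSTM layers model their temporal dependencies."""
    def __init__(self, n_channels=6, n_classes=3, n_filters=64, hidden_size=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, n_filters, kernel_size=5), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5), nn.ReLU(),
        )
        self.lstm = nn.LSTM(n_filters, hidden_size, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):                        # x: (batch, channels, time)
        feats = self.conv(x).permute(0, 2, 1)    # -> (batch, time, features) for the LSTM
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])             # classify from the last time step

print(DeepConvLSTMSketch()(torch.randn(4, 6, 150)).shape)  # -> torch.Size([4, 3])
```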
## Hand washing
To our knowledge, no study has ever tried to separately predict compulsive hand washing as opposed to non-compulsive hand washing.
Most studies that try to automatically detect hand washing aim at compliance improvements, i.e. at increasing or measuring the frequency of hand washes, or at assessing or improving the quality of hand washes.
Hand washing compliance can be measured using different tools. Jain et al. @jain_low-cost_2009 use an RFID-based system to check whether health care workers comply with hand washing frequency requirements. However, the system is merely used to make sure all workers entering an emergency care unit have washed their hands. Bakshi et al. @bakshi_feature_2021 developed a hand washing detection data set with RGB video data, and showed a valid way to extract SIFT descriptors from it for further research. Llorca et al. presented a vision-based system for automatic hand washing quality assessment @llorca_vision-based_2011, based on the detection of skin in RGB images using optical techniques such as optical flow estimation.
\FloatBarrier
### Classifying hand washing and compulsive hand washing separately and distinguishing from other activities
The three class problem of classifying hand washing, compulsive hand washing and other activities is harder than the other two problems, as it contains both of them at once. The resulting confusion matrix for each of the neural network classifiers is shown in @fig:confusion. The version trained on the normalized data is shown on the left, while the version of the same network class trained on the non-normalized data is shown on the right. Each confusion matrix shows what percentage of the true labels of a class was classified into which of the three available classes. Optimally, the diagonal values would all be $1.0$ and the off-diagonal values all $0.0$. The matrices are color-coded with the same value ranges, so that they are directly comparable.
![Confusion matrices for all neural network based classifiers with and without normalization of the sensor data](img/confusion.pdf){#fig:confusion width=98%}
The mean diagonal value of the confusion matrix upholds almost the same ordering of the models.
### Scenario 1: One day of evaluation
In the first scenario, the 5 subjects reported an average of $4.75$ ($\pm 3.3$) hand washing procedures on the day on which they evaluated the system. Out of those, $1.75$ ($\pm 2.06$) per subject were correctly identified. The accuracy per subject was $28.33\,\%$ ($\pm 37.9\,\%$). The highest accuracy for a subject was $80\,\%$ out of 5 hand washes, the lowest was $0\,\%$ out of 4 hand washes. Of all hand washing procedures conducted over the day by the subjects, $35.8\,\%$ were detected correctly.
Some subjects wore the smart watch on the right wrist instead of the left wrist, and reported worse results for that. Leaving out hand washes conducted with the smart watch worn on the right wrist, the detection sensitivity rises to $50\,\%$.
The duration and intensity of the hand washing process also played a role.
The correlation of the duration of the hand washing with the detection rate is $-0.039$. However, the raw data only contains 2 "longer" hand washes of over 30 seconds, the rest being in the range of 10 to 25 seconds.
Some subjects also reported difficulties with the smart watch application (not properly running in the background).
### Scenario 2: Controlled intensive hand washing
In scenario 2, the subjects each washed their hands at least 3 times. Some subjects voluntarily agreed to perform more repetitions, which leads to more than 3 washing detection results for those subjects. The detection accuracy per subject was $76\,\%$ ($\pm 25\,\%$), with the highest being $100\,\%$ and the lowest being $50\,\%$.
The mean accuracy over all repetitions, not split by subjects, was $73.7\,\%$. For scenario 2, one user moved the smart watch from the right wrist to the left wrist after two repetitions. The first two repetitions were not detected, while the two repetitions with the smart watch worn on the left wrist were detected correctly. Leaving out hand washes conducted with the smart watch worn on the right wrist, the detection sensitivity rises to $78.6\,\%$, and the detection accuracy per subject is $82.5\,\%$ ($\pm 23.6\,\%$).