Commit 7f299467 authored by burcharr

automatic writing commit ...

parent 3390a094
......@@ -7,9 +7,11 @@ git add tables/*.tex
git add backmatter.tex
git add frontmatter.tex
git add titlepage.tex
git add references.md
git add md/metadata.yaml
git add Makefile
git add bib/zotero_cur.bib
git add git_all.sh
git commit -m "automatic writing commit ..."
git push
\ No newline at end of file
git push
......@@ -17,7 +17,7 @@ The application of smoothing improved the performance of the models even further
Normalization proved ineffective for our approach and worsened the performance of almost all models. This could be due to a difference between the distributions of the train and test sets. The normalization parameters were estimated on the train set and applied to the test set, which is only accurate if both sets follow the same distribution. This was probably not the case here, which would explain why the normalized data was harder to learn from and test on than the non-normalized data.
For the reasons explained in section \ref{s_score}, we weigh the results of the S score higher than the ones of the F1 score. Thus, the best network for problem 1 is DeepConvLSTM-A, although only by a slight margin. The overall achieved S score of $0.819$ is based on reaching a specificity of $0.751$ and a sensitivity of $0.90$, which means that $90\,\%$ of windows containing hand washing were classified as hand washing correctly. However, $75.1\,\%$ of windows classified as Null really contained no hand washing, which leaves some room for improvement, because this means that the model still has a false positive rate of $24.9\,\%$.
For the reasons explained in section \ref{s_score}, we weigh the results of the S score higher than those of the F1 score. Thus, the best network for problem 1 is DeepConvLSTM-A, although only by a slight margin. The overall S score of $0.819$ results from a specificity of $0.751$ and a sensitivity of $0.90$, which means that $90\,\%$ of windows containing hand washing were correctly classified as hand washing. However, only $75.1\,\%$ of windows containing no hand washing were classified as Null, which leaves room for improvement: the model still has a false positive rate of $24.9\,\%$, which is considerably more than desired.
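As a quick sanity check of these numbers, the following minimal sketch recomputes the S score as the harmonic mean of sensitivity and specificity, which is how this work describes the metric. The function name `s_score` is chosen here for illustration and is not taken from the project's code.

```python
def s_score(sensitivity: float, specificity: float) -> float:
    """Harmonic mean of sensitivity and specificity (illustrative helper)."""
    if sensitivity + specificity == 0:
        return 0.0  # avoid division by zero when both rates are zero
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# Values reported for DeepConvLSTM-A on problem 1
sensitivity = 0.90
specificity = 0.751

print(round(s_score(sensitivity, specificity), 3))  # S score
print(round(1 - specificity, 3))                    # false positive rate
```

Because the harmonic mean is pulled towards the smaller of its two inputs, the comparatively low specificity limits the S score more than the high sensitivity raises it.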
The binarized versions of the models trained on problem 3 achieve F1 scores similar to those of the models trained for problem 1, which is a notable success. However, their performance in terms of the S score is worse by about $0.052$ for the best model and by more for the others. Therefore, and especially because of the higher importance of the S score, the models trained on problem 3 are not as good at classifying hand washing and separating it from other activities as the models specifically trained for this problem. This lower performance can be explained by the higher difficulty of the three-class problem learned by the classifiers trained for problem 3. Thus, the loss in performance was to be expected.
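One way to picture the binarization is as a mapping of three-class predictions onto the two classes of problem 1, where both hand washing and compulsive hand washing count as hand washing. The sketch below assumes an integer label encoding (`NULL`, `HW`, `CHW`) that is hypothetical and only serves the illustration; the encoding used in the actual experiments may differ.

```python
# Hypothetical label encoding for the three-class problem (assumption)
NULL, HW, CHW = 0, 1, 2


def binarize(predictions):
    """Map three-class predictions to problem 1 labels: 1 = hand washing, 0 = Null."""
    return [1 if p in (HW, CHW) else 0 for p in predictions]


three_class_predictions = [NULL, HW, CHW, CHW, NULL]
print(binarize(three_class_predictions))
```

After this mapping, the problem 3 models can be scored with the same F1 and S score computations as the problem 1 models.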
......@@ -33,6 +33,7 @@ The results of problem 2 with the application of smoothing look even more promis
As there is no published previous work on automatically detecting compulsive hand washing, our results cannot be compared to earlier ones. The strong performance levels indicate a high probability that the approach is applicable in real world testing. Sadly, as our work's real world evaluation was limited to the best model for problem 1, we cannot report real world results to support this hypothesis.
##### Problem 3
The problem of classifying hand washing and compulsive hand washing separately and distinguishing both from other activities at the same time is arguably harder than the other two problems. Problem 3 can be seen as the unification of problem 1 and problem 2, namely classifying whether an activity is hand washing (problem 1) and, if yes, whether said activity is compulsive hand washing (problem 2). As a three-class classification problem, problem 3 is more difficult and leaves more room for errors than the other two problems. Thus, a lower level of performance must be expected.
......@@ -49,31 +50,44 @@ To conclude the results of problem 3, the overall performance of this more diffi
### Practical applicability
real world evaluation
## Limitations of our approach
The data from the real world evaluation with our test subjects shows that most real world hand washing procedures are detected by our smart watch system. Overall, the system's sensitivity was ... in the evaluation of a "normal day", which is ... compared to the theoretical results. However, this was to be expected, since real hand washing takes many forms and patterns that are unlikely to all be captured during the explicit recording of training data. Our theoretical results could therefore not be reached in the real life scenario. Because of the smoothing that was applied to the data, at least some consecutive windows must be classified into the positive class, which means that a real hand washing procedure needs to last around $10\,s$ or longer to be detected. In practice, washing one's hands can take less time than that, in which case the system will not detect it properly. All in all, the system was able to correctly detect most hand washing procedures and is therefore somewhat effective at this task.
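The effect of smoothing on short hand washing episodes can be illustrated with a generic running mean over per-window predictions. This is only a sketch under assumptions: the filter type, the width of 3 windows and the threshold of 0.5 are chosen for illustration and are not the parameters used in this work.

```python
def smooth(probabilities, width=3):
    """Centered running mean over per-window scores, truncated at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(probabilities)):
        lo = max(0, i - half)
        hi = min(len(probabilities), i + half + 1)
        segment = probabilities[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


def detect(probabilities, width=3, threshold=0.5):
    """Flag a window as hand washing if its smoothed score exceeds the threshold."""
    return [p > threshold for p in smooth(probabilities, width)]


short_wash = [0, 0, 1, 0, 0]  # a single positive window
long_wash = [0, 1, 1, 1, 0]   # three consecutive positive windows

print(detect(short_wash))
print(detect(long_wash))
```

In this toy setting the isolated positive window is averaged away, while the run of consecutive positives survives, which mirrors why very short washing procedures are missed.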
##### Problem 2
The general performance of our models on problem 2 was high. However, one limitation of the results is that we cannot measure their performance in distinguishing compulsive hand washing from activities other than non-compulsive hand washing. Nevertheless, our results could be employed together with other tools that provide the knowledge that the user is currently washing their hands.
Added to that, the system detected an average of xy TODO false positives per subject per hour. These false positives could lead to annoyance and ultimately to the users losing trust in the detection capabilities of the system.
## Comparison of goals to results
TODO
## Future work
The detection of hand washing could be incorporated into many devices, mainly wrist worn ones, like smart watches. In order to further improve the detection capabilities and accuracy, one would need to invest even more time into carefully designing and training better models. This works architecture search could be expanded, and more parameter combinations could be tried out. For example, different types of layers, that have not been included in the architecture yet could be tried. Instead of normalizing data on the data set level, batch normalization could be used try to make the networks faster and more stable.
The general performance of our models on problem 2, distinguishing compulsive hand washing from non-compulsive hand washing, was high. The downside is that this model is only applicable if we know when the hand washing takes place. However, our results could be employed together with other tools that tell us that the user is currently washing their hands. Examples of such tools are in development in our group, one of them being a soap dispenser with an integrated proximity sensor. Added to that, Bluetooth beacons stationed near sinks can be used to let the smart watch know that the user is near a specific sink. Conductivity sensors on the user's skin could be employed to detect a change of conductivity caused by contact with tap water. One or more of these methods combined with our model trained for problem 2 could possibly be used to achieve a higher performance for the task of compulsive hand washing detection in the future.
Added to that, all the other hyperparameters could be optimized better. Instead of manual hyperparameter optimization (HPO), automated versions of HPO could be employed, e.g. bayesian optimization. This could lead to better choices for the batch size, learning rate and other parameters
The detection of hand washing could be incorporated into many devices, mainly wrist worn ones such as smart watches. In order to further improve the detection capabilities and accuracy, one would need to invest even more time into carefully designing and training better models. This work's architecture search could be expanded, and more parameter combinations could be tried out. For example, different types of layers that have not yet been included in the architecture could be tried. Instead of normalizing data on the data set level, batch normalization could be used to try to make the networks faster and more stable.
Different attention mechanisms could be tried out on the hand washing data.
Added to that, all the other hyperparameters could be optimized better. Instead of manual hyperparameter optimization (HPO), more sophisticated versions of HPO could be employed, e.g. Bayesian optimization. This could lead to better choices for the batch size, learning rate and other parameters. However, this may take a lot of time to run, as it is computationally expensive.
The current state of the system, especially for the classification of hand washing versus compulsive hand washing, looks promising for future work in this area. The collection of real obsessive-compulsive hand washing data would likely make it possible to train models capable of reliably classifying compulsive hand washing. Such models could then be tested on real world subjects, and also evaluated with them. If they perform well enough, they could aid psychologists and their patients in the treatment of compulsive hand washing. As explained in the introduction, exposure and response prevention (ERP) is a viable treatment method, and interventions from a smart watch could possibly be used for response prevention. The exact design of the interventions and their actual usability form another exciting problem field that is yet to be researched.
More data could also be incorporated for the negative class, because a wider range of activities should be included in the data. While the standard movement activities of walking, jogging, sitting, walking up and down stairs and some fitness activities were already included for this work, more specialized activities have not yet been included, possibly leading to the increased false positive rate in the real world scenario.
To avoid false positives, one could also try to detect out of distribution movements, similar to the HAWAD approach that we discussed. This method must be applied carefully, as we do not know for certain that all out of distribution samples are not hand washing. The applicability of this method needs to be tested thoroughly.
todo:hpo, architecture (batch norm), data (more+ real ocd),
hawad
The most important part of the future work in this area, especially for the detection of compulsive hand washing, will be the application to the real world with actual patients suffering from OCD with compulsive hand washing. Only with their data will we be able to properly train models, and only with them will we be able to properly evaluate the developed models in order to gain a reliable estimate of our performance. With real patients, it could also be a good idea to try to fit the model to each patient dynamically. The idea would be to start with a pre-trained model, which was trained on available data of many subjects. Afterwards, data could be collected for each patient and used to re-train the model. This approach of re-training pre-trained neural networks is often applied in computer vision and has shown promising results there.
todo:real world <- depending on results
The real world evaluation results show that more data of other activities has to be included, e.g. of people doing the dishes, as there were many false positives in this area.
TODO
# Conclusion
todo: It's great!
\ No newline at end of file
In this work, we described the development, training and evaluation of a powerful and accurate detection system for compulsive and non-compulsive hand washing. The relevance of such a system was explained with its applications in the field of hygiene compliance enforcement (general hand washing), as well as its potential to help in the treatment of obsessive-compulsive disorder with compulsive hand washing.
We theoretically evaluated different designs of neural networks on three related problems of hand washing detection: the separation of hand washing from other activities, the separation of hand washing from compulsive hand washing, and the separation of hand washing from compulsive hand washing and from other activities at the same time. For this task, we used hand washing data, data of simulated compulsive hand washing, and data of other activities, which was collected from publicly available data sets. After training and evaluation, we selected the best functioning system based on several metrics, including the F1 score and the harmonic mean of sensitivity and specificity, which we called S score. The dominating models, DeepConvLSTM and DeepConvLSTM-A, were both based on a deep convolutional neural network joined with an LSTM layer. For DeepConvLSTM-A, which performed slightly better than DeepConvLSTM, we added an attention mechanism in order to allow the model to flexibly focus on more relevant sections of its input. The designed models were able to beat baselines such as a random forest classifier and a support vector machine, as well as chance level baselines, by a large margin.
In a practical evaluation using x subjects (TODO), we tested DeepConvLSTM-A on the hand washing detection task in a real world, everyday environment, as well as in a fixed schedule hand washing test. The system ran on a smart watch, which was used to monitor the user's wrist movements in real time and was able to correctly detect hand washing ... . Some false positives appeared for different activities, many of which were washing related.
In the second test of the practical evaluation, subjects performed intensive and long hand washing repetitions, which were easier to detect. The system's performance here was ... (near theory?? TODO).
Hence, the evaluation results suggest that the developed system is able to properly detect hand washing in many cases. The specificity and sensitivity of the system are high, but leave some room for improvement.
In conclusion, the application of wrist worn sensor data to the detection of hand washing and compulsive hand washing remains an interesting and open field of research, with many possible areas of application. The detection of obsessive hand washing in particular would be a world first and seems promising for future usage in the treatment of OCD patients.
......@@ -78,7 +78,7 @@ In total, we want to solve three slightly different classification problems:
2. Classifying compulsive hand washing and distinguishing it from non-compulsive hand washing
3. Classifying hand washing and compulsive hand washing separately and distinguishing both from other activities at the same time
From this point on, we will refer to these three problems as "problem" or "task" 1, 2 or 3.
The first and the second task are both binary classification problems, while the third task is a three-class classification problem. In the first task, both hand washing and obsessive hand washing count as hand washing activities. For the second task, we want to classify only data from hand washing activities into the two classes hand washing and obsessive hand washing. In case we obtain two good models for 1. and 2., we can then combine the two models to also compete with the multiclass models trained for 3. We also want to look into the direct detection of compulsive hand washing, which is a sub-problem of 3. Thus, we will also report results for this special case, as it could possibly later be used for the treatment of patients.
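The combination of the models for problems 1 and 2 can be pictured as a two-stage cascade. The following sketch assumes each model is available as a function returning a boolean decision for one window; the function names, the class names `"HW"` and `"cHW"`, and the stand-in classifiers are hypothetical and only illustrate the control flow.

```python
from typing import Callable, Sequence

Window = Sequence[float]  # one window of sensor values (simplified)


def cascade(
    window: Window,
    is_washing: Callable[[Window], bool],     # stand-in for a problem 1 model
    is_compulsive: Callable[[Window], bool],  # stand-in for a problem 2 model
) -> str:
    """Combine two binary decisions into one of three classes."""
    if not is_washing(window):
        return "Null"
    return "cHW" if is_compulsive(window) else "HW"


# Toy stand-in classifiers based on simple thresholds (not real models)
toy_is_washing = lambda w: max(w) > 0.5
toy_is_compulsive = lambda w: sum(w) / len(w) > 0.8

print(cascade([0.1, 0.2], toy_is_washing, toy_is_compulsive))
print(cascade([0.6, 0.7], toy_is_washing, toy_is_compulsive))
print(cascade([0.9, 0.95], toy_is_washing, toy_is_compulsive))
```

In such a cascade, the second model is only consulted for windows that the first model already considers hand washing, so errors of the first stage propagate to the final three-class decision.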
In total, 3 classes of data that contain data from different activities can be distinguished:
......@@ -314,14 +314,14 @@ Added to that, we also report the performance of the best two models for problem
For the practical evaluation, we asked TODO XY subjects to test the system in practice. We defined two different paradigms, one for real world performance evaluation and one for explicit evaluation of the model running on the smart watch. To this end, the model with the best performance on the test set of task 1, i.e. the general detection of hand washing, was exported to be executed on the watch inside the described smart watch application. We limited the testing to these scenarios because we did not have access to subjects who would actually wash their hands compulsively. The scenarios were:
1. The subjects are wearing a smart watch for one day. During this time, whenever they wash their hands, the watch will or will not detect the hand washing procedure. The subjects note down, whether or not the hand washing was recognized correctly.
1. The subjects wear a smart watch for one day. During this time, whenever they wash their hands, the watch will or will not detect the hand washing procedure. The subjects note down whether or not the hand washing was recognized correctly, how long the washing procedure was and how intense it was on a scale of 1 to 5. Added to that, whenever there is a false prediction of hand washing, they note down their current activity.
2. The subjects specifically go to the bathroom to wash their hands 3 times to test the recognition. They note down whether or not the hand washing was recognized correctly. The hand washing is supposed to be done thoroughly and intensively (at least 30 seconds of washing per repetition).
Scenario 1 can be used to evaluate the real world performance of the classifier in day to day living. It is supposed to gather information about the use cases in which the system works well, but also in which cases it fails. There are many activities of daily living that one could think of, which are not included in the data set, i.e. unseen activities for the classifier. Such activities might be problematic for the classifier, as they are unlikely to perfectly resemble any Null class activities it was trained on. The test in scenario 1 is supposed to uncover some of these activities, which might be detected as false positives. Added to that, by having the subjects note down whether the detection worked for every time the subject washed their hands, we also get an estimate of the sensitivity of the system, apart from what the theoretical evaluation yielded.
Scenario 1 can be used to evaluate the real world performance of the classifier in day to day living. It is supposed to gather information about the use cases in which the system works well, but also about those in which it fails. There are many activities of daily living that one could think of which are not included in the data set, i.e. unseen activities for the classifier. Such activities might be problematic for the classifier, as they are unlikely to perfectly resemble any Null class activities it was trained on. The test in scenario 1 is supposed to uncover some of these activities, which might be detected as false positives. The activities producing false positives in the system could be added to the set of negative training examples in the future. Added to that, by having the subjects note down whether the detection worked every time they washed their hands, we also get an estimate of the sensitivity of the system, apart from what the theoretical evaluation yielded.
Scenario 2 is used to check whether the system works correctly for most of the time, when we certainly know, that intensive washing is involved. It also ensures the subjects active compliance, by making the hand washing activity their main focus. In scenario 1, it would be possible that the subject forgets to take notes sometime, which is not as likely in the controlled hand washing scenario.
Scenario 2 is used to check whether the system works correctly most of the time when we know for certain that intensive washing is involved. It also ensures the subjects' active compliance by making the hand washing activity their main focus. In scenario 1, it would be possible that the subject sometimes forgets to take notes, which is not as likely in the controlled hand washing scenario.
Together, the two scenarios provide a basis for estimating the real world performance of the system.
......
......@@ -96,4 +96,29 @@ The mean diagonal value of the confusion matrix upholds almost the same ordering
\FloatBarrier
\newpage
## Practical Evaluation
\ No newline at end of file
## Practical Evaluation
### Scenario 1: One day of evaluation
In the first scenario, the X (TODO) subjects reported an average of .. hand washing procedures on the day on which they evaluated the system.
Per subject, there were (+- ) hand washing procedures. Out of those, .. (+-) were correctly identified ($xy\,\%$).
The duration and intensity of the hand washing process also played a role, as can be seen in todo fig.
The highest accuracy was achieved for ....
The lowest accuracy was achieved for ....
For the reported false positives, the subjects' experiences varied. The subjects reported xy (+-) false hand washing detections on this day. Assuming a 12h recording period, this corresponds to xy / 12 false detections per hour.
The activities leading to false positives include:
- Changing clothes (or helping others to do so)
- Washing pans / doing the dishes
- Scratching oneself
-
- TODO
The full list of reported activities can be found in the appendix. TODO
### Scenario 2: Controlled intensive hand washing
In scenario 2, the subjects each washed their hands at least 3 times. Some subjects voluntarily agreed to perform more repetitions, which leads to more than 3 washing detection results per subject. The detection accuracy per subject was xy (+-) $\,\%$, with the highest being xy and the lowest being zy. TODO.
The total mean accuracy over all repetitions was xy$\,\%$.
......@@ -8,10 +8,29 @@ References are automatically generated from the BibTex file (references.bib)
# References {.unnumbered}
\markboth{Literatur}{Literatur}
\markboth{Literature}{Literature}
::: {#refs}
:::
\appendix
# Appendix {.unnumbered}
\markboth{Appendix}{Appendix}
### User manual and questionnaire for hand washing detection evaluation (see next pages) {.unnumbered}
\includepdf[pages=-]{img/HW_EVAL_draft.pdf}
### Results for the binarized version of problem 3 classifiers applied to problem 1 {.unnumbered}
\begin{figure}
\includegraphics[width=0.98\textwidth]{img/washing_binarized.pdf}
\caption{F1 score and S score for the binarized version of problem 3 classifiers applied to problem 1}
\label{fig:washing_binarized}
\end{figure}
\input{tables/washing_binarized.tex}
......@@ -8,6 +8,11 @@
\usepackage{multirow}
\usepackage[section]{placeins}
% Prevent widows and orphans
\clubpenalty10000
\widowpenalty10000
\displaywidowpenalty=10000
%------------------------------------- Custom Title Page ---------------------------
\renewcommand{\maketitle}{
\thispagestyle{empty}
......