Commit 286543c3 authored by Alexander Henkel

added appendix

parent 8e0219c6
......@@ -65,4 +65,27 @@
month = jul,
eprint = {1907.12003},
doi = {10.48550/arXiv.1907.12003}
}
@article{WAHL2022105280,
title = {On the automatic detection of enacted compulsive hand washing using commercially available wearable devices},
journal = {Computers in Biology and Medicine},
volume = {143},
pages = {105280},
year = {2022},
issn = {0010-4825},
  doi = {10.1016/j.compbiomed.2022.105280},
url = {https://www.sciencedirect.com/science/article/pii/S0010482522000725},
author = {Karina Wahl and Philipp Marcel Scholl and Silvan Wirth and Marcel Miché and Jeannine Häni and Pia Schülin and Roselind Lieb},
keywords = {Compulsive hand washing, Machine learning, Obsessive-compulsive disorder, Wearables},
abstract = {Background
Compulsive hand washing is one of the most frequent compulsions and includes highly ritualized, repetitive hand motions. Developing an algorithm that can automatically detect compulsive washing with off-the-shelf wearable devices is a first step toward developing more sophisticated sensor-based assessments and micro-interventions that might ultimately supplement cognitive behavioral therapy for obsessive-compulsive disorder (OCD).
Objective
The main objective was to establish whether enacted compulsive hand washing can be distinguished from routine hand washing. This distinction will inform future research on the development of an algorithm that can automatically detect compulsive hand washing.
Method
Twenty-one participants were trained individually to wash their hands according to 1 of 5 scripted hand-washing procedures that were based on descriptions of pathological compulsive washes and additionally to wash their hands as they usually would, while wearing a smartwatch. Washes were video recorded to obtain validation data. To generate a baseline model, we opted to extract well-known features only (mean and variance of each sensor axis). We tested four classification models: linear support vector machine (SVM), SVM with radial basis functions, random forest (RF), and naive Bayes (NB). Leave-one-subject-out cross-validation was applied to gather F1, specificity, and sensitivity scores.
Results
The best-performing parameters were a classification window duration of 10 s, with a mean-variance feature set calculated from quaternions, rate of turn, and magnetic flux measurements. The detection performance varied with the particular enacted compulsive hand wash (F1 range: 0.65–0.87). Overall, enacted compulsive and routine hand washing could be distinguished with an F1 score of 79% (user independent), a sensitivity of 84%, and a specificity of 30%.
Conclusions
Our analysis of the sensor data demonstrates that enacted compulsive hand washing could be distinguished from routine hand washing with acceptable sensitivity. However, specificity was low. This study is a starting point for a series of follow-ups, including the application in individuals diagnosed with OCD.}
}
\ No newline at end of file
......@@ -32,7 +32,7 @@ Since adjacent windows tend to have the same activity, one indicator can cover s
\subsubsection{Utilized data sets}
For this work, I used data sets from the University of Basel and the University of Freiburg [REF]. These include hand washing data which was recorded using a smartwatch application. Additionally, they contain long-term recordings of everyday activities. The data is structured by individual participants and multiple activities per recording. During the generation of a synthetic data set, the data of a single participant is selected randomly. To cover enough data, I had to combine and merge single participants over each data set. Therefore a resulting data set for a user contains multiple participants, which I treat as one. This just affects data for \textit{null} activities. All hand wash activity data is from the same user. Since the same data sets have already been used to train the base model, I had to retrain individual base models for each participant where none of its data is contained.
For this work, I used data sets from the University of Basel and the University of Freiburg~\cite{WAHL2022105280}. These include hand washing data that was recorded with a smartwatch application as well as long-term recordings of everyday activities. The data is structured by individual participants with multiple activities per recording. During the generation of a synthetic data set, the data of a single participant is selected randomly. To cover enough data, I had to combine and merge participants across the data sets. A resulting data set for a user therefore contains multiple participants, which I treat as one. This only affects data for \textit{null} activities; all hand wash activity data is from the same participant. Since the same data sets have already been used to train the base model, I had to retrain an individual base model for each participant that contains none of that participant's data.
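A minimal sketch of this per-user data set assembly is given below; all names (Recording, build_user_dataset, ...) are hypothetical placeholders and only illustrate the merging described above.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Recording:            # hypothetical container for one recorded activity
    participant: str
    activity: str           # "hw" (hand wash) or "null"
    sensor_data: list

def build_user_dataset(recordings: List[Recording], hw_participant: str) -> List[Recording]:
    """Hand wash data stems from a single, randomly chosen participant; null
    data is merged from the remaining participants and treated as one user."""
    hw = [r for r in recordings if r.participant == hw_participant and r.activity == "hw"]
    null = [r for r in recordings if r.participant != hw_participant and r.activity == "null"]
    random.shuffle(null)
    return hw + null

# The base model used for this synthetic user would be retrained on data that
# excludes hw_participant, so that none of the participant's data leaks in.
```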
......@@ -110,7 +110,7 @@ The score only benefits from adding a label to the set if the predicted value fo
\input{figures/approach/example_pseudo_filter_score}
\subsubsection{Deep Convolutional network}
\subsubsection{Deep convolutional network}
Convolutional networks have become a popular method for image and signal classification. I use a convolutional neural network (CNN) to predict the value of a pseudo label given the surrounding pseudo labels. It consists of two 1D-convolutional layers and a linear layer for the classification output. Both convolutional layers have a stride of 1, and padding is applied. The kernel size of the first layer is 10 and that of the second layer is 5. They convolve along the time axis over the \textit{null} and \textit{hw} values. As activation function, I use the Rectified Linear Unit (ReLU) after each convolutional layer. For the input, I apply a sliding window of length 20 with a shift of 1 over the pseudo labels inside a \textit{hw} interval. This results in a 20x2-dimensional network input, generating a 1x2 output. After applying a softmax function, the output is the new pseudo-soft-label at the window's middle position.
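A minimal PyTorch sketch of this filter network follows; only the kernel sizes, stride, activation, and input/output shapes are taken from the description above, while the number of convolutional channels (16 here) and the use of "same" padding are assumptions.

```python
import torch
import torch.nn as nn

class PseudoLabelCNN(nn.Module):
    """Predicts a refined pseudo-soft-label for the centre of a 20-step window."""
    def __init__(self, window_len: int = 20, channels: int = 16):  # channel count assumed
        super().__init__()
        # input shape: (batch, 2 channels [null, hw], window_len time steps)
        self.conv1 = nn.Conv1d(2, channels, kernel_size=10, stride=1, padding="same")
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=5, stride=1, padding="same")
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(channels * window_len, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        logits = self.classifier(x.flatten(start_dim=1))
        return torch.softmax(logits, dim=1)  # new pseudo-soft-label, shape (batch, 2)

window = torch.rand(1, 2, 20)        # one sliding window of pseudo labels
print(PseudoLabelCNN()(window))      # -> tensor of shape (1, 2)
```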
I used the approach from \secref{sec:synDataset} to train the network and created multiple synthetic datasets. On these datasets, I predicted the pseudo labels with the base model. Additionally, I augmented the labels by adding noise and random label flips. After that, I extracted the \textit{correct} intervals. This results in roughly 400 intervals with $\sim1300$ windows, which were shuffled before training. As loss function, I used cross-entropy.
......@@ -119,7 +119,7 @@ In \figref{fig:examplePseudoFilterCNN} you can see a plot of the same example in
\input{figures/approach/example_pseudo_filter_cnn}
\subsubsection{Auto encoder}
\subsubsection{Autoencoder}
I tried different implementations of autoencoders for denoising. They all take a 1x128-dimensional noisy input and output a 1x128-dimensional clean signal. The size of 128 has been chosen because it is a power of two and covers all \textit{positive} intervals. Therefore, I enlarge the \textit{hw} and \textit{manual} intervals and use only the soft label values $\hat{y}^{hw}_i$, so that a whole interval is processed in one step. Since the output also contains only the soft label values of the hand wash probability, I have to recompute the \textit{null} value of each label by $\hat{y}^{null}_i=1-\hat{y}^{hw}_i$.
The first approach is a fully convolutional denoising autoencoder (FCN-dAE). The encoding part consists of three 1D-convolutional layers with kernel sizes of 8, 4, and 4 and strides of 2, 2, and 1. It encodes the input to a 32x64 dimensional feature map after the first layer, a 12x34 feature map after the second layer, and a 1x33 feature map after the third layer. The decoder part is inversely symmetric and uses 1D-deconvolutional layers to reconstruct the input back to 1x128 values. As activation function, I apply the Exponential Linear Unit (ELU) after each layer, and a sigmoid function is used for the output. For training, I created, as for the CNN, multiple synthetic datasets and their predictions with additional noise and label flips. After the extraction of the \textit{correct} intervals, I extended them to 128 values. During training, a mean squared error loss function measures the deviation between the denoised output and the clean ground truth values. \figref{fig:examplePseudoFilterFCNdAE} shows the example intervals where this filter was applied.
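A sketch of the FCN-dAE in PyTorch is shown below; the channel counts follow the feature-map sizes stated above, while the padding values (3, 3, 1) and the three-layer transposed-convolution decoder are assumptions chosen so that the stated shapes (32x64, 12x34, 1x33) and the 1x128 reconstruction are reproduced.

```python
import torch
import torch.nn as nn

class FCNdAE(nn.Module):
    """Denoises a 1x128 interval of pseudo-soft-labels."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # input: 1x128 noisy soft labels
            nn.Conv1d(1, 32, kernel_size=8, stride=2, padding=3), nn.ELU(),   # -> 32x64
            nn.Conv1d(32, 12, kernel_size=4, stride=2, padding=3), nn.ELU(),  # -> 12x34
            nn.Conv1d(12, 1, kernel_size=4, stride=1, padding=1), nn.ELU(),   # -> 1x33
        )
        self.decoder = nn.Sequential(  # assumed inversely symmetric decoder
            nn.ConvTranspose1d(1, 12, kernel_size=4, stride=1, padding=1), nn.ELU(),   # -> 12x34
            nn.ConvTranspose1d(12, 32, kernel_size=4, stride=2, padding=3), nn.ELU(),  # -> 32x64
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=2, padding=3),             # -> 1x128
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.decoder(self.encoder(x)))

model = FCNdAE()
noisy = torch.rand(4, 1, 128)                              # batch of extended hw intervals
loss = nn.MSELoss()(model(noisy), torch.rand(4, 1, 128))   # MSE against clean ground truth
```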
......
......@@ -69,7 +69,9 @@ The impact of missing user feedback on the training data and resulting model per
As you can see, missing \textit{false} indicators do not lead to significant performance changes. The \texttt{all\_null\_*} filter configurations include all samples as \textit{null} labels without depending on the indicator. Similarly, the \texttt{all\_cnn\_*} configurations already contain a larger share of high-confidence \textit{null}-labeled samples than just the sections covered by the \textit{false} indicators.
In contrast, missing \textit{correct} indicators lead to performance loss. However, a negative trend in S score can be seen just for scenarios where less than $40\%$ of hand washing activities have been confirmed. Even with just $20\%$ of answered detections, the resulting personalized model outperforms the general model. So it is enough if only a few hand washing samples are in a dataset to impact the training positively. If we focus on \texttt{all\_null\_convlstm2} and \texttt{all\_cnn\_convlstm2\_hard} as well as on \texttt{all\_null\_convlstm3} and \texttt{all\_cnn\_convlstm3\_hard} we can see that in both cases the \texttt{all\_null\_*} filter perform better than the \texttt{all\_cnn\_*} with full feedback, but in the absence of feedback the \texttt{all\_cnn\_*} configurations dominate. Therefore, the \texttt{all\_cnn\_*} filters should be preferred when it cannot be assumed that a user responds to all hands wash actions.
In contrast, missing \textit{correct} indicators lead to performance loss. However, a significant negative trend in S score can be seen only for scenarios in which less than $40\%$ of hand washing activities have been confirmed. Even with just $20\%$ of answered detections, the resulting personalized model, except for \texttt{all\_cnn\_convlstm3}, outperforms the general model. Hence, a few hand washing samples in a dataset are already enough to affect the training positively.\\
The \texttt{all\_null\_deepconv} filter configuration shows the highest performance loss between $100\%$ and $20\%$ feedback, followed by the \texttt{all\_null\_score} configuration. Therefore, these two are considered the most vulnerable to missing feedback.\\
If we focus on \texttt{all\_null\_convlstm2} and \texttt{all\_cnn\_convlstm2\_hard} as well as on \texttt{all\_null\_convlstm3} and \texttt{all\_cnn\_convlstm3\_hard}, we can see that in both cases the \texttt{all\_null\_*} filters perform better than the \texttt{all\_cnn\_*} filters with full feedback, but in the absence of feedback the \texttt{all\_cnn\_*} configurations dominate. Therefore, the \texttt{all\_cnn\_*} filters should be preferred when it cannot be assumed that a user responds to all hand wash actions.
\input{figures/experiments/supervised_pseudo_missing_feedback}
......@@ -103,7 +105,7 @@ I compare the estimated F1 score with the ground truth evaluation in this sectio
\input{figures/experiments/table_quality_estimation_evaluation}
\section{Real world analysis}
In a corporation with the University of Basel, I evaluated my personalization approach on data collected by a study over multiple participants. They wore a smartwatch running the base application almost every day for a month. Most participants showed indications of obsessive hand washing. The collected data covers 14 participants with overall 2682 hours of sensor data and 1390 user feedback indicators. \tabref{tab:realWorldDataset} shows the data over each participant in detail. Since no exact labeling for the sensor values exists, I used the quality estimation approach for evaluation. The recordings for testing have been selected in advance by hand. \tabref{tab:realWorldGeneralEvaluation} shows the evaluation of the base model to the test set as it is applied in the application. These values build the baseline, which has to be beaten by personalization.
In cooperation with the University of Basel, I evaluated my personalization approach on data collected in a study with multiple participants. They wore a smartwatch running the base application almost every day for a month. Most participants showed indications of obsessive hand washing. The collected data covers 14 participants with a total of 2682 hours of sensor data and 1390 user feedback indicators. \tabref{tab:realWorldDataset} shows the data for each participant in detail. Since no exact labeling of the sensor values exists, I used the quality estimation approach for evaluation. The recordings for testing have been selected in advance by hand. \tabref{tab:realWorldGeneralEvaluation} shows the evaluation of the base model on the test set as it is applied in the application. These values form the baseline, which has to be exceeded by personalization.
\input{figures/experiments/table_real_world_datasets}
\input{figures/experiments/table_real_world_general_evaluation}
......@@ -113,7 +115,7 @@ For each training recording, the base model is used to generate predictions for
Entries with zero iterations, as for participants OCDetect\_12 and OCDetect\_13, indicate that no better personalization could be found.
For all other participants, the personalization process generated a model that performs better than the general model with adjusted kernel settings. All of them are based on the best model of each iteration step. From this, it can be concluded that fewer but good recordings lead to a better personalization than iterating over all available ones. They rely on at most three iterations, and l2-sp regularization was used in the majority of cases. The highest increase in F1 score was achieved by OCDetect\_21 with a difference of $0.25$. Compared to the general model without adjusted kernel settings, the F1 score increases by $0.355$. In practice, this would lead to the same amount of incorrect hand wash notifications and $80\%$ more correct detections. The highest decrease in false predictions, $74\%$, is achieved by participant OCDetect\_10. All best personalizations except one use the \texttt{all\_cnn\_convlstm3\_hard} filter configuration. However, for participants OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_cnn\_convlstm2\_hard} filter configuration achieves the same score. Moreover, for participants OCDetect\_05, OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_null\_convlstm3} filter configuration also reaches the same F1 scores.
So the \texttt{all\_cnn\_*\_hard} outperforms the \texttt{all\_null\_convlstm3} configuration. This could indicate that the participants did not report all hand washing actions, and too many false negatives are generated by the \texttt{all\_null\_convlstm3} filter configuration.
Thus, the \texttt{all\_cnn\_*\_hard} configurations outperform the \texttt{all\_null\_convlstm3} configuration. This could indicate that the participants did not report all hand washing actions, so that too many false negatives are generated by the \texttt{all\_null\_convlstm3} filter configuration. For all results, see \tabref{tab:fullRealWorldResults} in the appendix.
The mean F1 score over the best-personalized models increases by $0.044$ compared to the general model with adjusted kernel settings and by $0.11$ compared to the plain general model without adjusted kernel settings. That is $9.6\%$ and $28.2\%$, respectively. Thus, personalization leads to a reduction of false detections by $31\%$ and an increase of correct detections by $16\%$.
......
\chapter*{Appendix}
\section*{Full real-world experiment results}
\input{figures/appendix/table_full_real_world_results}
\ No newline at end of file
\begin{figure}[t]
\begin{centering}
\subfloat[Increasing $h$]
\subfloat[Increasing $n$]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_random_noise_null_part_spec_sen.png}}
\subfloat[Increasing $n$]
......
......@@ -87,6 +87,7 @@
\usepackage{pdfpages}
\usepackage{makecell}
\usepackage{adjustbox}
\usepackage{longtable}
%------------------------------------------------------------------------------
% (re)new commands / settings
......
......@@ -76,6 +76,7 @@
% \renewcommand{\bibname}{Literaturverzeichnis}
\addcontentsline{toc}{chapter}{Bibliography}
\newpage
\input{chapters/9-appendix}
\thispagestyle{empty}
\mbox{}
......