Commit a12bfaa9 authored by Alexander Henkel

final experiment

parent 32297bd2
@@ -25,7 +25,7 @@
\begin{tabular}[hc]{>{\huge}l >{\huge}l}
Examiner: & \firstexaminer \\[0.3cm]
Advisor: & \advisers \\[1.2cm]
\end{tabular}
\vfill % move the following text to the bottom
@@ -61,5 +61,5 @@
{
% No second examiner, ignore
}
\textbf{Advisor} \smallskip{} \\
\advisers
\chapter*{Abstract}
Wearable sensors like smartwatches offer an excellent opportunity for human activity recognition (HAR). They are available to a broad user base and can be used in everyday life. Due to the variety of users, the detection model must be able to recognize different movement patterns. Recent research has demonstrated that personalized recognition performs better than a general one. However, additional labeled data from the user is required, which can be time-consuming and labor-intensive to annotate. While common personalization approaches reduce the required amount of labeled training data, the labeling process remains dependent on some user interaction.
In this work, I present a personalization approach in which training data labels are derived from inexplicit user feedback obtained during the use of a HAR application. The general model predicts labels, which are then refined by various denoising filters based on Convolutional Neural Networks and Autoencoders. The previously obtained user feedback assists this process. High-confidence data is then used to fine-tune the recognition model via transfer learning. No changes to the model architecture are required, so personalization can easily be added to an existing application.
\chapter{Introduction}\label{chap:introduction}
Detecting and monitoring people's activities can be the basis for observing user behavior and well-being. Human Activity Recognition (HAR) is a growing research area in many fields like healthcare~\cite{Zhou2020Apr, Wang2019Dec}, elder care~\cite{Jalal2014Jul, Hong2008Dec}, fitness tracking~\cite{Nadeem2020Oct} or entertainment~\cite{Lara2012Nov}. In particular, technical improvements in wearable sensors like smartwatches enable the integration of activity recognition into everyday life for a wider user base~\cite{Weiss2016Feb, Jobanputra2019Jan, Bulling2014Jan}.
One application scenario in healthcare is observing various diseases such as Obsessive-Compulsive Disorder (OCD). For example, detecting hand washing activities can be used to derive the frequency or excessiveness with which people affected by OCD perform this action. Moreover, with automatic detection it is possible to diagnose and even treat such diseases outside a clinical setting~\cite{Ferreri2019Dec, Briffault2018May}. If excessive hand washing is detected, just-in-time interventions can be presented to the user, offering enormous potential for promoting health behavior change~\cite{10.1007/s12160-016-9830-8}.
State-of-the-art Human Activity Recognition methods are supervised deep neural networks built on concepts like convolutional layers or long short-term memory (LSTM). These require large amounts of training data to achieve good performance. Since the movement patterns of each human are unique, the performance of activity detection can differ, so training data from a wide variety of humans is necessary to generalize to new users. Accordingly, it has been shown that personalized models can achieve better accuracy than user-independent models~\cite{Hossain2019Jul, Lin2020Mar}.
To personalize a model, retraining on new unseen sensor data is necessary. Obtaining the ground truth labels is crucial for most deep learning techniques. However, the annotation process is time- and cost-intensive. Typically, training data is labeled in controlled environments by hand. In a real-world scenario, the user would have to take over the major part of this effort.
This requires considerable user interaction and a certain expertise, which would impair usability.
@@ -12,18 +12,18 @@ There has been various research on preprocessing data to make it usable for trai
My work aims to personalize a detection model without increasing user interaction. Information for labeling is drawn from indicators that arise during the use of the application. These can be derived from user feedback on triggered actions resulting from the predictions of the underlying recognition model. Moreover, personalization should be separated so that no change in the model architecture is required.
At first, all new unseen sensor data is labeled by the same general model, which is used for activity recognition. These model predictions are corrected to a certain extent by using pre-trained filters. High-confidence labels are considered for personalization. In addition, the previously obtained indicators are used to refine the data to generate a good training set. Therefore the process of manual labeling can be skipped and replaced by an automatic combination of available indications. With the newly collected and labeled training data, the previous model can be fine-tuned in an incremental learning approach~\cite{Amrani2021Jan, Siirtola2019May, Sztyler2017Mar}. For neural networks, it has been shown that transfer learning offers high performance with decent computation time~\cite{Chen2020Apr}. These steps lead to a personalized model with improved performance in detecting specific gestures of an individual user.
I applied the described personalization process to a hand washing detection application used to observe the behavior of OCD patients. During the observation, if the application detects hand washing, it asks the user for an evaluation. For mispredictions, the user has the opportunity to reject evaluations. Depending on how the user reacts to the evaluations, conclusions are drawn about the accuracy of the predictions, resulting in the desired indicators.
The contributions of my work are as follows:
\begin{itemize}
\item [1.] A personalization approach is implemented, which can be added to an existing HAR application and does not require additional user interaction or changes in the model architecture.
\item [2.] Different indicator-assisted refinement methods, based on Convolutional networks and Fully Connected Autoencoders, are applied to generated labels.
\item [3.] It is demonstrated that a personalized model which results from this approach outperforms the general model and can achieve similar performance as a supervised personalization.
\item [4.] My approach is compared to a common active learning method.
\item [5.] A real-world experiment is presented, which confirms applicability to a broad user base.
\end{itemize}
\chapter{Experiments}\label{chap:experiments}
In this chapter, I present my experiments. First, I define the experimental setup with the datasets and metrics used. Then I introduce a supervised personalization as a baseline and use it to investigate performance in the presence of label noise. Afterward, I evaluate different methods of pseudo label generation and their noise reduction. Next, the development of the personalization is analyzed by determining the performance over each individual iteration step. I compare my approach to an active learning implementation to evaluate the overall performance. Finally, I present multiple real-world experiments.
\section{Experiment setup}
To evaluate my personalization approach, I use different metrics which rely on ground truth data. Therefore fully labeled datasets are required. I created synthetic recordings for 3 participants as described in~\secref{sec:synDataset}. Additionally, I recorded multiple daily usages of the base application and split these recordings into two participants. For the second participant, I focused on doing more intense hand washing than usual. The participant recordings are split into training and test sets by the leave-one-out method. \tabref{tab:supervisedDatasets} shows the resulting datasets in detail. I set the test split ratio higher than usual, since a recording could contain long periods of low motion; this guarantees a wide variety of covered activities. The measurement metrics are Specificity, Sensitivity, F1 score, and S score. For each evaluation, the personalized models of the participants are applied to the respective test sets, and the mean over all is computed. The models are implemented using PyTorch~\cite{Paszke2019}.
\input{figures/experiments/table_supervised_datasets}
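For illustration, the following is a minimal sketch of the leave-one-out splitting, assuming each participant contributes a list of recordings of which one is held out as the test set per split (the actual datasets and splits are listed in \tabref{tab:supervisedDatasets}):
\begin{verbatim}
def leave_one_out_splits(recordings):
    # Yield (train, test) pairs where each recording is held out once.
    for i, test_rec in enumerate(recordings):
        yield recordings[:i] + recordings[i + 1:], test_rec

for train, test in leave_one_out_splits(["day1", "day2", "day3"]):
    print(train, "->", test)
\end{verbatim}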
@@ -26,7 +26,7 @@ Now we observe the scenarios where some of the labels are noisy. Therefore we lo
I attribute the high performance loss to the imbalance of the labels. A typical daily recording of around 12 hours contains $28,800$ labeled windows. A single hand wash action of $20$ seconds covers ${\sim}13$ windows. If a user washed their hands $10$ times a day, this would lead to $130$ \textit{hw} labels and $28,670$ \textit{null} labels. Even $50\%$ of noise on the \textit{hw} labels results in ${\sim}0.2\%$ of false data. However, $1\%$ of flipped \textit{null} labels already leads to ${\sim}68\%$ of wrong hand wash labels. So they would have a higher impact on the training than the original hand wash data. As the S score of \figref{fig:supervisedNoisyPart} shows, it is possible that the personalized model benefits from additional data if the ratio of noise in \textit{null} labels is smaller than ${\sim}0.2\%$. The training data of the experiments contains $270,591$ \textit{null} labels and $2,058$ hand wash labels. So ${\sim}0.2\%$ noise would lead to ${\sim}541$ false \textit{hw} labels, which is ${\sim}20\%$ of the resulting hand wash labels. As a rule of thumb, I claim that the training data should contain less than ${\sim}20\%$ of wrong hand wash labels, whereas the amount of incorrect \textit{null} labels does not require particular focus.
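These percentages follow directly from the class counts; as a quick check, for the example day and for the experiment training data, respectively:
\[
\frac{0.01 \cdot 28,670}{0.01 \cdot 28,670 + 130} \approx 0.69\,, \qquad
\frac{0.002 \cdot 270,591}{0.002 \cdot 270,591 + 2,058} \approx 0.21\,.
\]
So $1\%$ of flipped \textit{null} labels outweighs the true \textit{hw} labels by roughly two to one, while $0.2\%$ noise already accounts for about a fifth of all hand wash labels.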
\subsection{Hard vs. soft labels}\label{sec:expHardVsSoft}
In these experiments, I would like to show the effect of noise on soft labels compared to crisp labels. Similar to before, different amounts of label flips are applied to the training data. Then the labels are smoothed to a degree $s\in [0, 0.49]$. As seen before, noise on \textit{hw} labels does not significantly impact the performance. Therefore not many changes in performance due to different smoothing are expected. This is confirmed by \figref{fig:supervisedSoftNoiseHW}. At larger noise levels, a tendency towards a slight increase in the S score can be seen. I focus on noise in \textit{null} labels. \figref{fig:supervisedSoftNoiseNull} gives detailed insights into the performance impact. The specificity increases with higher smoothing for all noise values, which becomes clearer for more noise. However, the sensitivity decreases slightly, especially for higher noise rates. Overall, the F1 score and S score benefit from smoothing. In the case of $0.2\%$ noise, the personalized models trained on smoothed false labels can reach a higher S score than the base model.
\input{figures/experiments/supervised_soft_noise_hw}
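As a minimal sketch of this smoothing step, assuming (as the range of $s$ suggests) that a hard label $y \in \{0,1\}$ is moved towards the opposite class by the degree $s$, so the decision boundary at $0.5$ is preserved:
\begin{verbatim}
import numpy as np

def smooth_labels(y_hard, s):
    # Map hard labels {0, 1} to soft labels {s, 1 - s}.
    # Wrongly flipped labels then contribute a smaller error signal.
    assert 0.0 <= s < 0.5
    return y_hard * (1.0 - s) + (1.0 - y_hard) * s

# Example: labels containing one flipped null label, smoothed with s = 0.3
y = np.array([1.0, 0.0, 1.0, 1.0])
print(smooth_labels(y, 0.3))  # -> [0.7 0.3 0.7 0.7]
\end{verbatim}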
@@ -49,7 +49,7 @@ I combined both experiments to ensure that the drawbacks of incorrectly smoothed
\section{Evaluation of different pseudo label generations}\label{sec:expPseudoModels}
In this section, I describe the evaluation of different pseudo-labeling approaches using the filters introduced in \secref{sec:approachFilterConfigurations}. For each filter configuration, the base model is used to predict the labels of the training sets and create pseudo labels. After that, the filter is applied to the pseudo labels. To determine the quality of the pseudo labels, they are evaluated against the ground truth values using soft versions of the metrics $Sensitivity^{soft}$, $Specificity^{soft}$, $F_1^{soft}$, $S^{soft}$. The refined pseudo labels are then used to train the general model. All resulting models are evaluated on their test sets, and the mean over all is computed. \figref{fig:pseudoModelsEvaluation} shows a bar plot over the metrics for all filter configurations. I concentrate on the values of the S score in terms of performance.
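The following is a minimal sketch of such soft metrics, under the assumption that the soft confusion counts accumulate fractional contributions of the soft labels (the exact definitions are given in the approach chapter):
\begin{verbatim}
import numpy as np

def soft_metrics(y_soft, y_true):
    # Each window contributes fractionally according to its soft
    # label instead of a hard 0/1 decision.
    tp = np.sum(y_soft * y_true)
    fn = np.sum((1.0 - y_soft) * y_true)
    fp = np.sum(y_soft * (1.0 - y_true))
    tn = np.sum((1.0 - y_soft) * (1.0 - y_true))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2.0 * tp / (2.0 * tp + fp + fn),
    }

print(soft_metrics(np.array([0.9, 0.2, 0.7]), np.array([1.0, 0.0, 1.0])))
\end{verbatim}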
\subsubsection{Baseline configurations}
@@ -69,7 +69,7 @@ The following experiment shows the impact of missing user feedback on the traini
As can be seen, missing \textit{false} indicators do not lead to significant performance changes. The \texttt{all\_null\_*} filter configurations include all samples as \textit{null} labels without depending on the indicator. Similarly, the \texttt{all\_cnn\_*} configurations contain a greater part of high-confidence samples with \textit{null} labels than the sections which are covered by the \textit{false} indicators.
In contrast, missing \textit{correct} indicators lead to performance loss. However, a negative trend in the S score can be seen only for scenarios where less than $40\%$ of hand washing activities have been confirmed. Even with just $20\%$ of answered detections, the resulting personalized model outperforms the general model. So a few hand washing samples in a dataset are enough to impact the training positively. If we focus on \texttt{all\_null\_convlstm2} and \texttt{all\_cnn\_convlstm2\_hard} as well as on \texttt{all\_null\_convlstm3} and \texttt{all\_cnn\_convlstm3\_hard}, we can see that in both cases the \texttt{all\_null\_*} filters perform better than the \texttt{all\_cnn\_*} filters with full feedback, but in the absence of feedback the \texttt{all\_cnn\_*} configurations dominate. Therefore, the \texttt{all\_cnn\_*} filters should be preferred when it cannot be assumed that a user responds to all hand wash actions.
\input{figures/experiments/supervised_pseudo_missing_feedback}
@@ -88,7 +88,7 @@ In this step, I compare the evaluation of the personalized model over the differ
\section{Evaluation of personalization}
\subsection{Comparison of active learning with my approach}
To confirm the robustness of my personalization approach, I compare it with a common active learning implementation as introduced in \secref{sec:approachActiveLearning}. To find an appropriate selection of the hyperparameters $B$, $s$, $h$, use of weighting, and number of epochs, I use a grid search approach. \tabref{tab:activeLearningGridSearch} shows the covered values for the hyperparameters. The 10 parameter constellations which yield the best S scores are shown in \tabref{tab:activeLearningEvaluation}. They achieve S scores from ${\sim}0.8444$ to ${\sim}0.8451$. From the experiment of \secref{sec:expPseudoModels}, we know that the models based on the configurations \texttt{all\_cnn\_*\_hard} reach scores around ${\sim}0.86$ and \texttt{all\_null\_deepconv}, \texttt{all\_null\_fcndae} and \texttt{all\_null\_convlstm1} even ${\sim}0.867$. So my personalization approach outperforms active learning in terms of performance increase.
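A minimal sketch of this grid search, with hypothetical value ranges and a placeholder training routine (the actual grid is listed in \tabref{tab:activeLearningGridSearch}):
\begin{verbatim}
from itertools import product

grid = {  # hypothetical ranges, not the actual grid
    "B": [10, 20, 40],           # query budget
    "s": [0.0, 0.1, 0.3],        # label smoothing degree
    "h": [0.5, 0.7, 0.9],        # selection threshold
    "weighting": [True, False],  # class weighting on/off
    "epochs": [5, 10, 20],
}

def train_and_evaluate(params):
    # Placeholder: train the active learner with these parameters
    # and return the mean S score over all participants.
    ...

best_score, best_params = float("-inf"), None
for values in product(*grid.values()):
    params = dict(zip(grid, values))
    score = train_and_evaluate(params)
    if score is not None and score > best_score:
        best_score, best_params = score, params
\end{verbatim}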
In the next step, I analyze the required user interaction. The best-performing hyperparameter setting relies on a budget of 20, so the user has to answer 20 queries. The training data contains $3.4$ manual, $10.6$ correct, and $53.4$ false indicators on average per participant. If we equated the answer to a query with the user feedback from which the indicators have been drawn, the active learning approach would lead to less user interaction. However, as shown in the experiment of \secref{sec:expMissingFeedback}, false indicators do not significantly impact the training process. Therefore a user could ignore false hand washing detections and only answer for correct and manual activities. This would result in $14$ user interactions, which is less than the budget of the active learning implementation. Furthermore, positive feedback is only gathered when the application has to react to the target activity, so in this case the user interaction is intended anyway.
@@ -103,18 +103,19 @@ I compare the estimated F1 score with the ground truth evaluation in this sectio
\input{figures/experiments/table_quality_estimation_evaluation}
\section{Real-world analysis}
In cooperation with the University of Basel, I evaluated my personalization approach on data collected by a study over multiple participants. They wore a smartwatch running the base application almost every day for a month. Most participants showed indications of obsessive hand washing. The collected data covers 14 participants with overall 2682 hours of sensor data and 1390 user feedback indicators. \tabref{tab:realWorldDataset} shows the data for each participant in detail. Since no exact labeling for the sensor values exists, I used the quality estimation approach for evaluation. The recordings for testing have been selected in advance by hand. \tabref{tab:realWorldGeneralEvaluation} shows the evaluation of the base model on the test set as it is applied in the application. These values build the baseline, which has to be beaten by personalization.
\input{figures/experiments/table_real_world_datasets}
\input{figures/experiments/table_real_world_general_evaluation}
For each training recording, the base model is used to generate predictions for the pseudo labels. After that, one of the filter configurations \texttt{all\_null\_convlstm3}, \texttt{all\_cnn\_convlstm2\_hard} and \texttt{all\_cnn\_convlstm3\_hard} is applied. The resulting data set is used for training based on the previous model or the model with the best F1 score. As regularization, freezing layers or the l2-sp penalty is used. Over all personalizations of a participant, the model with the highest F1 score is determined. \tabref{tab:realWorldEvaluation} shows the resulting best personalization of each participant. Additionally, the last three columns contain the evaluation of the base model after adjusting the kernel settings. The difference between the personalization and adjusted base model values gives the true performance increase of the retraining.
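For illustration, a minimal PyTorch sketch of the l2-sp penalty, assuming the usual formulation in which the fine-tuned weights are pulled towards the pre-trained starting point rather than towards zero (a stand-in linear model replaces the actual recognition network):
\begin{verbatim}
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the recognition network
start = {n: p.detach().clone() for n, p in model.named_parameters()}

def l2_sp_penalty(model, start, alpha=1e-3):
    # Penalize the squared distance to the pre-trained weights,
    # so fine-tuning stays close to the general model.
    return alpha * sum(((p - start[n]) ** 2).sum()
                       for n, p in model.named_parameters()
                       if p.requires_grad)

x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y) \
       + l2_sp_penalty(model, start)
loss.backward()
\end{verbatim}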
Entries with zero iterations, as for participants OCDetect\_12 and OCDetect\_13, indicate that no better personalization could be found.
For all other participants, the personalization process generated a model that performs better than the general model with adjusted kernel settings. All of them are based on the best model of each iteration step. From this, it can be concluded that fewer but good recordings lead to a better personalization than iterating over all available ones. They rely on at most three iterations, and l2-sp regularization was used in the majority of cases. The highest increase in F1 score was achieved by OCDetect\_21 with a difference of $0.25$. Compared to the general model without adjusted kernel settings, the F1 score increases by $0.355$. In practice, this would lead to the same amount of incorrect hand wash notifications and $80\%$ more correct detections. The highest decrease of false predictions is achieved by participant OCDetect\_10 with $74\%$. All best personalizations except one use the \texttt{all\_cnn\_convlstm3\_hard} filter configuration. But for participants OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_cnn\_convlstm2\_hard} filter configuration achieves the same score. Moreover, for participants OCDetect\_05, OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_null\_convlstm3} filter configuration also reaches the same F1 scores.
So the \texttt{all\_cnn\_*\_hard} configurations outperform the \texttt{all\_null\_convlstm3} configuration. This could indicate that the participants did not report all hand washing actions, so that the \texttt{all\_null\_convlstm3} filter configuration generates too many false negatives.
The mean F1 score over all best-personalized models increases by $0.044$ compared to the general model with adjusted kernel settings and by $0.11$ compared to the plain general model without adjusted kernel settings. That is $9.6\%$ and $28.2\%$, respectively. So personalization leads to a reduction of the false detections by $31\%$ and an increase of correct detections by $16\%$.
\input{figures/experiments/table_real_world_evaluation}
@@ -19,13 +19,14 @@ The previous section's experiments gave several insights into personalization in
\end{itemize}
The real-world experiment summarizes these findings and combines different aspects to achieve the best possible personalization. The pillar of this approach is the opportunity to evaluate various personalized models and compare them. The quality estimation makes it possible to find the best-personalized model for each new recording. Therefore erroneous data which would lead to a heavily noisy training set can be detected and filtered out. Since most best-performing personalizations depend on just a few additional training recordings, it is sufficient if, among several days of records, only a few well usable ones exist.\\
The experiment also confirms that the \texttt{all\_cnn\_*} filter configurations are better suited for a broader user base than the \texttt{all\_null\_*} configurations, since they are more robust against missing feedback. For all participants, the \texttt{all\_cnn\_*} filter configurations achieved at least the same F1 scores as the \texttt{all\_null\_*} configurations, and in most cases they outperformed them.
\section{Future work}
The performance of the personalization heavily depends on the quality of the pseudo-labels. Therefore the filter configurations used for denoising them have a significant impact. More work on hyper-parameter tuning can lead to further improvements. Also, new denoising concepts can be tried. Since the refinement of pseudo-labels is separated from the rest of the pipeline, it is easy to implement different approaches.
Additionally, other sources of indicators can be considered. For example, Bluetooth beacons can be placed by the sinks. The distance between the watch and the sink can be estimated if the watch is within range. A short distance suggests that the user is probably washing their hands. This indicator can be handled similarly to a \textit{manual} feedback.
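A minimal sketch of such a distance estimate, assuming the common log-distance path loss model, where \texttt{tx\_power} is the calibrated RSSI at one meter and the exponent \texttt{n} depends on the environment:
\begin{verbatim}
def estimate_distance(rssi, tx_power=-59.0, n=2.0):
    # Log-distance path loss model: distance in meters from beacon RSSI.
    # tx_power: measured RSSI at 1 m (beacon calibration constant).
    # n: path loss exponent, ~2.0 in free space, higher indoors.
    return 10.0 ** ((tx_power - rssi) / (10.0 * n))

# A strong signal of -55 dBm maps to well under one meter, which
# could be treated similarly to a manual hand wash indicator.
if estimate_distance(-55.0) < 1.0:
    print("user is probably at the sink")
\end{verbatim}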
Furthermore, this approach offers the opportunity to learn new classes. For example, during the real-world experiment, the participant was asked for each hand washing activity whether this action was compulsive. So additional information exists for each hand washing. The target task can be adapted to learn a new classification into $A=\{null, hw, compulsive\}$. The resulting model would be able to distinguish between hand washing and not hand washing, and moreover between regular hand washing and compulsive hand washing.
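As a minimal sketch of this adaptation (assuming the recognition model ends in a linear classification layer; a small stand-in network is used here), only the head needs to be replaced while the feature extractor can be kept or frozen:
\begin{verbatim}
import torch.nn as nn

# Stand-in: feature extractor plus a two-class head for {null, hw}.
model = nn.Sequential(
    nn.Sequential(nn.Linear(64, 32), nn.ReLU()),  # feature extractor
    nn.Linear(32, 2),                             # classification head
)

# Replace the head for the new target A = {null, hw, compulsive} ...
model[1] = nn.Linear(32, 3)

# ... and optionally freeze the feature extractor, so only the new
# head is fine-tuned on the relabeled data.
for param in model[0].parameters():
    param.requires_grad = False
\end{verbatim}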
@@ -3,4 +3,4 @@ In this work, I have elaborated a personalization process for human activity rec
I evaluated personalization in general on a theoretical basis with supervised data. These experiments revealed the impact of noise in the highly imbalanced data and how soft labels can counter training errors. Based on these insights, several constellations and filter approaches for training data have been implemented to analyze the behavior of the resulting models under the different aspects. I found out that just using the predictions of the base model leads to performance decreases, since they contain too much label noise. However, even relying only on data covered by user feedback does not outperform the general model, although this training data hardly contains false labels. Therefore more sophisticated denoising approaches are implemented that generate training data consisting of various samples with as few incorrect labels as possible. This data leads to personalized models that achieve higher F1 and S scores than the general model. Some of the configurations even result in similar performance as with supervised training.
Furthermore, I compared my personalization approach with an active learning implementation as a common personalization method. The sophisticated filter configurations achieve higher S scores, confirming my approach's robustness. The real-world experiment in cooperation with the University of Basel offered a great opportunity to evaluate my personalization approach on a large variety of users and their feedback behaviors. It confirms that in most cases, personalized models outperform the general model. Overall, personalization would reduce the false detections by $31\%$ and increase correct detections by $16\%$.
@@ -9,7 +9,7 @@
\subfloat[evaluation]
{\includegraphics[width=0.28\textwidth]{figures/approach/base_application_screen_eval.png}}
\caption[Base application screen shots]{\textbf{Base application screen shots.} (a) shows the application in its default state, where the user has the opportunity to trigger a hand wash event manually. (b) shows the notification which appears when the application has detected a hand wash activity. Here the user can confirm or decline. (c) shows one of the evaluation queries which the user has to answer for the OCD observation. These are shown if the user triggered a manual hand wash event or confirmed a detected hand washing activity.}
\label{fig:baseApplicationScreen}
\end{centering}
\end{figure}
@@ -18,8 +18,10 @@
OCDetect\_11 & 19 & 46397570 & 257 & 53 & 11 & 39 & 35 \\
OCDetect\_12 & 13 & 8299920 & 46 & 72 & 1 & 0 & 5 \\
OCDetect\_13 & 15 & 33018908 & 183 & 21 & 21 & 14 & 39 \\
OCDetect\_18 & 11 & 17733047 & 98 & 63 & 33 & 14 & 26 \\
OCDetect\_19 & 3 & 9463742 & 52 & 4 & 25 & 18 & 173 \\
OCDetect\_20 & 17 & 34115873 & 189 & 11 & 207 & 74 & 78 \\
OCDetect\_21 & 5 & 8234224 & 45 & 4 & 7 & 12 & 19 \\
\bottomrule
\end{tabular}}}
@@ -40,7 +42,9 @@
OCDetect\_12 & 5 & 6502526 & 36 & 76 & 0 & 0 & 1 \\
OCDetect\_13 & 6 & 16679159 & 92 & 11 & 19 & 15 & 37 \\
OCDetect\_18 & 4 & 8249562 & 45 & 40 & 30 & 12 & 22 \\
OCDetect\_19 & 3 & 6975685 & 38 & 9 & 30 & 21 & 68 \\
OCDetect\_20 & 4 & 7162813 & 39 & 13 & 47 & 20 & 12 \\
OCDetect\_21 & 2 & 6049766 & 33 & 8 & 5 & 2 & 2 \\
\bottomrule
\end{tabular}}}
@@ -8,16 +8,20 @@
\toprule
\thead{participant} & \thead{filter\\ configuration} & \thead{base\\ on\\ best} & \thead{l2-sp} & \thead{\rotatebox[origin=c]{-90}{iterations}} & \thead{false\\ diff\\ relative} & \thead{correct\\ diff\\ relative} & \thead{F1} & \thead{base\\ false\\ diff\\ relative} & \thead{base\\ correct\\ diff\\ relative} & \thead{base\\F1} & \thead{F1\\diff}\\
\midrule
OCDetect\_02 & all\_cnn\_convlstm2\_hard & True & True & 3 & -0.2457 & 0.1455 & 0.4791 & -0.2057 & 0.1455 & 0.4667 & 0.0124 \\
OCDetect\_03 & all\_cnn\_convlstm3\_hard & True & True & 1 & -0.4754 & 0.0000 & 0.5097 & -0.2240 & 0.1212 & 0.4728 & 0.0368 \\
OCDetect\_04 & all\_cnn\_convlstm3\_hard & True & True & 3 & -0.0556 & 0.2941 & 0.5641 & -0.1111 & 0.1176 & 0.5135 & 0.0506 \\
OCDetect\_05 & all\_cnn\_convlstm3\_hard & True & False & 1 & -0.0847 & 0.2000 & 0.3547 & -0.1270 & 0.1556 & 0.3514 & 0.0033 \\
OCDetect\_07 & all\_cnn\_convlstm3\_hard & True & False & 2 & -0.6429 & 0.0769 & 0.8000 & -0.5714 & 0.0000 & 0.7429 & 0.0571 \\
OCDetect\_09 & all\_cnn\_convlstm3\_hard & True & False & 2 & -0.6364 & 0.0000 & 0.3385 & -0.3117 & 0.1818 & 0.2826 & 0.0559 \\
OCDetect\_10 & all\_cnn\_convlstm3\_hard & True & True & 3 & -0.7388 & 0.1429 & 0.2667 & -0.6418 & 0.1429 & 0.2192 & 0.0475 \\
OCDetect\_11 & all\_cnn\_convlstm3\_hard & True & True & 1 & -0.2500 & 0.2222 & 0.3284 & -0.4167 & 0.1111 & 0.3226 & 0.0058 \\
OCDetect\_12 & all\_cnn\_convlstm3\_hard & True & False & 0 & -0.3077 & 0.1562 & 0.6016 & -0.3077 & 0.1562 & 0.6016 & 0.0000 \\
OCDetect\_13 & all\_cnn\_convlstm3\_hard & False & False & 0 & -0.2131 & 0.0000 & 0.3509 & -0.2131 & 0.0000 & 0.3509 & 0.0000 \\
OCDetect\_18 & all\_cnn\_convlstm3\_hard & True & True & 2 & -0.1818 & 0.1905 & 0.6623 & 0.0000 & 0.1190 & 0.6184 & 0.0438 \\
OCDetect\_19 & all\_cnn\_convlstm3\_hard & True & False & 1 & 0.0000 & 0.0476 & 0.2993 & -0.1220 & -0.0476 & 0.2963 & 0.0030 \\
OCDetect\_20 & all\_cnn\_convlstm3\_hard & True & True & 3 & -0.5000 & 0.0000 & 0.7937 & -0.1250 & 0.0000 & 0.7407 & 0.0529 \\
OCDetect\_21 & all\_cnn\_convlstm3\_hard & True & True & 1 & 0.0000 & 0.8000 & 0.7500 & 0.0000 & 0.0000 & 0.5000 & 0.2500 \\
\bottomrule
\end{tabular}}
@@ -19,7 +19,9 @@
OCDetect\_12 & 77 & 32 & 13 & 0.5246 \\
OCDetect\_13 & 46 & 20 & 61 & 0.3150 \\
OCDetect\_18 & 83 & 42 & 22 & 0.5714 \\
OCDetect\_19 & 43 & 21 & 82 & 0.2877 \\
OCDetect\_20 & 64 & 50 & 25 & 0.7194 \\
OCDetect\_21 & 14 & 5 & 2 & 0.4762 \\
\bottomrule
\end{tabular}}