@@ -7,4 +7,8 @@ Analysis in the context of hand wash detection demonstrates that a significant p

\chapter{Summary}

\todo{Write once the abstract is finished}

\ No newline at end of file

Wearable sensors such as smartwatches offer an excellent opportunity for human activity recognition (HAR). They are available to a broad user base and can be used in everyday life. Due to the variety of users, the recognition model must be able to recognize different movement patterns. Recent research has shown that personalized recognition performs better than general recognition. However, this requires additional labeled data from the user, and labeling can be time-consuming and labor-intensive. While common personalization approaches reduce the required amount of labeled training data, the labeling process remains dependent on some user interaction.

In this work, I present a personalization approach in which the labels of the training data are derived from non-explicit user feedback obtained during the use of a HAR application. The general model predicts labels, which are then refined by various denoising filters based on convolutional neural networks and autoencoders. The previously obtained user feedback supports this process. High-confidence data is then used to fine-tune the recognition model via transfer learning. No changes to the model architecture are required, so personalization can easily be added to an existing application.

Analysis in the context of hand wash detection demonstrates that a significant performance increase can be achieved. Furthermore, I compare my approach with a traditional personalization method to confirm its robustness. Finally, I evaluate the method in a real-world experiment in which participants wear a smartwatch daily for one month.

@@ -27,6 +27,6 @@ The contributions of my work are as follows:

\end{itemize}

\todo{structure of this work???}

The remainder of this work is structured as follows: In \chapref{chap:relatedwork}, I introduce related work on activity recognition in general. Furthermore, I describe approaches to personalization in more detail and present concepts I used in my work. In \chapref{chap:approach}, I describe my implementation for personalization as well as a common approach to active learning. Additionally, evaluation metrics based on ground truth data and an estimation based on user feedback are introduced. \chapref{chap:experiments} presents my experiments and their results. These are discussed in \chapref{chap:Discussion}, and an outlook on future work is given. Finally, I summarize my work and results in \chapref{chap:conclusion}.

@@ -18,7 +18,7 @@ In a common hybrid approach, a general user-independent model is used first unti

Amrani et al. compared deep transfer learning approaches based on CNNs with a baseline incremental learning approach using Learn++ to assess how deep learning personalization performs relative to traditional techniques~\cite{Amrani2021Jan}. They demonstrated that deep learning outperforms Learn++ and adapts faster to new users.

Rokni et al. used CNNs and personalization by transfer learning where lower layers are reused, and upper ones are retrained~\cite{Rokni2018Apr}. This is argued by the assumption that learned features in the first layers can be reused in other domains, and just classification has to be adapted~\cite{Yosinski2014}. They significantly improved activity recognition accuracy with just a few labeled instances. Hoelzemann and Van Laerhoven analyzed how results differ respectively to different methods and applications if transfer learning is applied to a Deep Convolutional LSTM network~\cite{Hoelzemann2020Sep}. They suggest that convolutional layers should not be fine-tuned, as already mentioned in the work of Rokni. Furthermore, they advise reinitializing the LSTM layers to default which results in slightly better performance in some cases.

In their work, Rokni et al. use CNNs and personalization by transfer learning, where lower layers are reused and upper ones are retrained~\cite{Rokni2018Apr}. This is motivated by the assumption that features learned in the first layers can be reused in other domains, so that only the classification has to be adapted~\cite{Yosinski2014}. They significantly improved activity recognition accuracy with just a few labeled instances. Hoelzemann and Van Laerhoven analyzed how results differ across methods and applications when transfer learning is applied to a Deep Convolutional LSTM network~\cite{Hoelzemann2020Sep}. They suggest that convolutional layers should not be fine-tuned, as already mentioned in the work of Rokni et al. Furthermore, they advise reinitializing the LSTM layers to their defaults, which results in slightly better performance in some cases.

A typical problem during fine-tuning is catastrophic forgetting~\cite{Lee2017}. Important information that has been learned before gets lost by overfitting to the new target. To overcome this problem, Xuhong et al. and Li et al. analyzed different regularization schemes applied to inductive transfer learning~\cite{xuhong2018explicit, Li2020Feb}. They state that the L2-SP penalty should be considered the standard baseline for transfer learning and that it also outperforms freezing the first layers of a network. The idea is that learned parameters should remain close to their initial values during fine-tuning, so the pre-trained model is a reference that defines the effective search space. A regularizer $\Omega(\omega)$ of the network parameters $\omega$, which are adapted, is added to the result of the loss function. For the L2-SP penalty, the regularizer is defined as:

\begin{align}

...

...

@@ -41,7 +41,7 @@ Zeng et al. observed two approaches for semi-supervised learning~\cite{Zeng2017D

Active learning is a supervised learning approach that attempts to achieve the best possible performance with as few labeled instances as possible. It selects samples from a pool of unlabeled instances and queries a so-called oracle, for example, the user, for their exact labels. The instances with the highest informativeness are selected for training to achieve maximum performance with only a few queries. There are several approaches to determining informativeness, such as querying the samples for which the learner is most uncertain about the label or which would lead to the most significant expected model change~\cite{Settles2009, rokni2020transnet}.
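The uncertainty-based query selection described above can be illustrated with a minimal sketch. It assumes that the learner outputs class probabilities for every pool instance and uses prediction entropy as the informativeness measure; the function names \texttt{entropy} and \texttt{select\_queries} are hypothetical and not taken from the cited works.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(pool_probs, n_queries):
    """Return the indices of the n_queries most uncertain pool instances.

    pool_probs holds one probability vector per unlabeled instance.
    Entropy as the uncertainty measure is an assumption of this sketch.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:n_queries]


# Hypothetical predictions for four unlabeled windows (null, hw).
pool = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [0.99, 0.01]]
queries = select_queries(pool, n_queries=2)
```

The selected indices would then be presented to the oracle for labeling. Other informativeness measures, such as expected model change, would only replace the sorting key.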

Saeedi et al. combine an active learning approach with a neural network consisting of convolutional and LSTM layers to predict activities by using wearable sensors~\cite{Saeedi2017Dec}. They apply active learning to fine-tune the model to a new domain. To select the queried instances, they introduce a measurement called information density. It weighs the informativeness of an instance by its similarity to all other instances.

The combination of an active learning approach with a neural network consisting of convolutional and LSTM layers is presented by Saeedi et al. to predict activities using wearable sensors~\cite{Saeedi2017Dec}. They apply active learning to fine-tune the model to a new domain. To select the queried instances, they introduce a measurement called information density. It weighs the informativeness of an instance by its similarity to all other instances.

%~\cite{Fekri2018Sep}.

...

...

@@ -54,7 +54,7 @@ Active learning can be combined with a semi-supervised approach, as shown by Has

In self-supervised learning, a deep neural network is trained to recognize predefined transformation tasks in an unsupervised manner. That is, different transformation functions like noise, rotation, or negation are applied to an input signal, which generates new distinct versions. The network predicts the probability that a given sequence is a transformation of the original signal. Since the transformation functions are known, a self-supervised labeled training set can be constructed. The idea is that the core characteristics of the input signal have to be learned in order to detect the transformation tasks. These high-level semantics can then be used as the feature layers for the classifier. Just a few labeled samples are required to train the classification layer.

Saeed et al. showed a self-supervised learning approach for HAR~\cite{Saeed2019Jun}. They achieve a significantly better performance than traditional unsupervised learning methods and comparable with fully-supervised methods, especially in a semi-supervised scenario where a few labeled instances are available. Tang et al. extend this by combining self-supervised learning and self-training~\cite{Tang2021Feb}. A teacher model is trained first using supervised labeled data. Then, the teacher model is used to relabel the supervised dataset and additional unseen instances. Most confident samples are augmented by transformation functions, as mentioned previously. After that, the self-supervised dataset is used to train a student model. In addition, it is fine-tuned with the initially supervised instances. By combining the unlabeled data with the limited labeled data, performance can be further enhanced.

A self-supervised learning approach for HAR is shown by Saeed et al.~\cite{Saeed2019Jun}. They achieve significantly better performance than traditional unsupervised learning methods and performance comparable to fully-supervised methods, especially in a semi-supervised scenario where a few labeled instances are available. Tang et al. extend this by combining self-supervised learning and self-training~\cite{Tang2021Feb}. A teacher model is trained first using supervised labeled data. Then, the teacher model is used to relabel the supervised dataset and additional unseen instances. The most confident samples are augmented by transformation functions, as mentioned previously. After that, the self-supervised dataset is used to train a student model. In addition, it is fine-tuned with the initially supervised instances. By combining the unlabeled data with the limited labeled data, performance can be further enhanced.

\subsection{Partial labels}

In situations where it is impossible to detect precisely when an activity was performed, partial labels can be used to indicate which activities are included for a larger period. Multiple contiguous instances can be collected in a single bag which is labeled by the covered classes. With these partial labels or weak labels, the actual classes of the contained instances can be predicted more accurately.

...

...

@@ -67,9 +67,9 @@ In the work of Stikic et al., multi-instance learning with weak labels is used f

Pseudo labeling allows unsupervised domain adaptation by using predictions of the base model~\cite{lee2013pseudo}. Based on the prediction of a sample, an artificial pseudo-label is generated, which is treated as ground truth data. However, this requires that the initially trained model predicts pseudo-labels with high accuracy, which is hard to satisfy. Training with false pseudo-labels has a negative impact on personalization. Moreover, pseudo-labeling can overfit to incorrect pseudo-labels over multiple iterations, which is known as confirmation bias. Therefore, many approaches augment the pseudo labels to reduce the amount of false training data. Since a base model is required, which is in most cases trained by supervised data, pseudo labeling is a part of semi-supervised learning. Nevertheless, compared to other semi-supervised approaches, pseudo labeling offers a simple implementation that does not rely on domain-specific augmentations or changes to the model architecture.

Li et al. showed a naive approach for semi-supervised learning using pseudo labels~\cite{Li2019Sep}. First, a pseudo labeling model $M_p$ is trained using a small supervised labeled data set $L$. This model is then used to perform pseudo-labeling for new unlabeled data, which results in dataset $\hat{U}$. After that, a deep learning model $M_{NN}$ is pre-trained with the pseudo labeled data $\hat{U}$ and afterward fine-tuned with the supervised data $L$. This process is repeated, where the resulting model $M_{NN}$ is used as a new pseudo labeling model $M_p$ until the validation accuracy converges. Moreover, they use the fact that predictions of a classifier model are probabilistic and assume that labels with higher probability also have higher accuracy. Therefore, they use only pseudo labels with high certainty. They argue that pseudo-labeling can be seen as a kind of data augmentation. Even with the high label noise of the pseudo labels, a deep neural network should be able to improve with training. In their tests, they significantly improved accuracy by adding pseudo labels to the training. Furthermore, they showed that the model benefits especially from the first iterations. Nevertheless, it is required that the pseudo labeling model $M_P$ has a certain accuracy. Tests show that a better pseudo labeling model leads to higher accuracy of the fine-tuned model. Arazo et al. observed the performance of a naive pseudo labeling applied on images and showed that it would overfit to incorrect pseudo labels~\cite{Arazo2020Jul}. The trained model tends to have higher confidence to previously false predicted labels, which results in new incorrect predictions. They applied simple modifications to prevent confirmation bias without requiring multiple networks or any consistency regularization methods as done in other approaches like in \secref{sec:relWorkSelfSupervisedLearning}. 
This yielded state-of-the-art performance with mixed-up augmentation as regularization and adding a minimum number of labeled samples. Additionally, they use soft-labels instead of hard-labels for training. Here a label consists of the individual class affiliations instead of a single value for the target class. Thereby it is possible to depict uncertainty over the classes. As a mixed-up strategy, they combine random sample pairs and corresponding labels, creating data augmentation with label smoothing. It should reduce the confidence of network predictions. As they point out, this approach is more straightforward than using other regularization methods and moreover more accurate.

Li et al. introduced a naive approach for semi-supervised learning using pseudo labels~\cite{Li2019Sep}. First, a pseudo labeling model $M_p$ is trained using a small supervised labeled data set $L$. This model is then used to perform pseudo-labeling for new unlabeled data, which results in dataset $\hat{U}$. After that, a deep learning model $M_{NN}$ is pre-trained with the pseudo labeled data $\hat{U}$ and afterward fine-tuned with the supervised data $L$. This process is repeated, where the resulting model $M_{NN}$ is used as a new pseudo labeling model $M_p$ until the validation accuracy converges. Moreover, they use the fact that predictions of a classifier model are probabilistic and assume that labels with higher probability also have higher accuracy. Therefore, they use only pseudo labels with high certainty. They argue that pseudo-labeling can be seen as a kind of data augmentation. Even with the high label noise of the pseudo labels, a deep neural network should be able to improve with training. In their tests, they significantly improved accuracy by adding pseudo labels to the training. Furthermore, they showed that the model benefits especially from the first iterations. Nevertheless, it is required that the pseudo labeling model $M_p$ has a certain accuracy. Tests show that a better pseudo labeling model leads to higher accuracy of the fine-tuned model. Arazo et al. observed the performance of naive pseudo labeling applied to images and showed that it overfits to incorrect pseudo labels~\cite{Arazo2020Jul}. The trained model tends to have higher confidence in previously incorrectly predicted labels, which results in new incorrect predictions. They applied simple modifications to prevent confirmation bias without requiring multiple networks or any consistency regularization methods as done in other approaches like in \secref{sec:relWorkSelfSupervisedLearning}. 
This yielded state-of-the-art performance with mixup augmentation as regularization and adding a minimum number of labeled samples. Additionally, they use soft-labels instead of hard-labels for training. Here a label consists of the individual class affiliations instead of a single value for the target class. Thereby it is possible to depict uncertainty over the classes. As a mixup strategy, they combine random sample pairs and corresponding labels, creating data augmentation with label smoothing. This should reduce the confidence of network predictions. As they point out, this approach is more straightforward than using other regularization methods and, moreover, more accurate.
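The combination of random sample pairs and their soft labels can be sketched as follows. This is only an illustration of the general mixup idea, assuming a mixing coefficient drawn from a Beta distribution with a hypothetical parameter \texttt{alpha}; it does not reproduce the exact configuration of Arazo et al.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convexly combine two samples and their soft labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def mixup_batch(samples, soft_labels, alpha=1.0, rng=None):
    """Mix every sample with a randomly chosen partner from the same batch.

    alpha is an assumed hyperparameter: lam is drawn from Beta(alpha, alpha).
    """
    rng = rng or random.Random()
    partners = list(range(len(samples)))
    rng.shuffle(partners)
    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)
        mixed.append(
            mixup_pair(samples[i], soft_labels[i], samples[j], soft_labels[j], lam)
        )
    return mixed


# Toy batch: two features per sample, soft labels over (null, hw).
samples = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
labels = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
batch = mixup_batch(samples, labels, alpha=1.0, rng=random.Random(0))
```

Because each mixed label is a convex combination of two soft labels, it remains a valid distribution over the classes, which is what produces the label smoothing effect.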

Rizve et al. \cite{Rizve2021Jan} tackle the problem of relatively poor performance compared to other semi-supervised methods due to the larger number of incorrectly pseudo-labeled samples. They try to select a set of pseudo-labels that are less noisy. So just high confidence predictions under the aspect of the network's uncertainty are considered for training. Therefore they result in a smaller but more accurate pseudo-label subset which yields a higher performance of the resulting model. In their experiments, they used Monte-Carlo dropout for uncertainty estimation~\cite{Gal2016Jun}, but it is also possible to use other methods which do not require extensive model modifications. In a similar approach, Zheng and Yang use pseudo label learning for segmentation adaption without ground truth data in the target domain~\cite{Zheng2021Apr}. Like Rizve, they estimate the prediction uncertainty to address the label noise. An auxiliary classifier is used to determine the prediction variance, which reflects the uncertainty of the network. This approach also reaches competitive performance. However, changes to the model architecture are required to implement an auxiliary classifier.

Since pseudo labeling achieves relatively poor performance compared to other semi-supervised methods due to the large number of incorrectly pseudo-labeled samples, Rizve et al. tackle this problem~\cite{Rizve2021Jan}. They try to select a set of pseudo-labels that are less noisy, so only high-confidence predictions are considered for training, taking the network's uncertainty into account. This results in a smaller but more accurate pseudo-label subset, which yields a higher performance of the resulting model. In their experiments, they used Monte-Carlo dropout for uncertainty estimation~\cite{Gal2016Jun}, but it is also possible to use other methods which do not require extensive model modifications. In a similar approach, Zheng and Yang use pseudo label learning for segmentation adaptation without ground truth data in the target domain~\cite{Zheng2021Apr}. Like Rizve et al., they estimate the prediction uncertainty to address the label noise. An auxiliary classifier is used to determine the prediction variance, which reflects the uncertainty of the network. This approach also reaches competitive performance. However, changes to the model architecture are required to implement an auxiliary classifier.

%\cite{Gonzalez2018May} compared different self-labeling methods in a time series context. In self-training, a base learner is firstly trained on a labeled set. Then, unlabeled instances are classified by the base classifier, where it is assumed, that more accurate predictions tend to be correct. After that the labeled training set is enlarged with these self-labeled instances. They achieved for the best performing methods similar performance to the supervised learning.

@@ -22,7 +22,7 @@ To personalize a human activity recognition model, it must be re-trained with ad

Most of today's wearable devices contain an accelerometer and a gyroscope with three dimensions each. I combine the sets $S_{acceleration}$ and $S_{gyroscope}$ into one set with $S_i\in\mathbb{R}^{d_{acceleration}+d_{gyroscope}}$. In the case of hand wash detection, I use the activity labels $A=\{null, hw\}$ where \textit{null} represents all activities where no hand washing is performed, and \textit{hw} represents all hand washing activities.

\subsection{Synthetic data sets}\label{sec:synDataset}

Several published data sets contain sensor data of wearable devices during various human activities. Since most public data sets are separated into individual parts for each activity, artificial data sets have to be created with a continuous sequence of activities. There should be a reasonable constellation between \textit{null} and \textit{hw} samples, such that they build larger parts of non, hand washing activities with short, hand washing parts in between, like in a real-world scenario.

Several published data sets contain sensor data of wearable devices during various human activities. Since most public data sets are separated into individual parts for each activity, artificial data sets have to be created with a continuous sequence of activities. There should be a reasonable distribution of \textit{null} and \textit{hw} samples, such that they form larger parts of non-hand washing activities with short hand washing parts between them, like in a real-world daily scenario.

Furthermore, additional data for user feedback covering parts of the time series is required. We can use the general prediction model to determine hand washing parts as it would be in the base application. In our case, we apply a running mean over multiple windows to the predictions and trigger an indicator $e_i$ at the window $W_j$ if the mean is higher than a certain threshold. This indicator $e_i$ gets the value \textit{correct} if at least one of the ground truth labels covered by the mean is \textit{hw}; otherwise, it is \textit{false}. It represents the user feedback on confirmed or declined evaluations. A manual user feedback indicator is added for hand wash sequences where no indicator has been triggered. Using ground truth data for adding indicators simulates a perfectly disciplined user who answers each evaluation.
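The simulation of feedback indicators can be sketched as follows. The sketch assumes that the base model yields one \textit{hw} probability per window; the function name \texttt{simulate\_indicators} and the parameters \texttt{mean\_len} and \texttt{threshold} are hypothetical, and their values are placeholders rather than the settings used in this work. For simplicity, it emits an indicator at every window whose running mean exceeds the threshold and omits the additional manual indicators.

```python
def simulate_indicators(hw_probs, ground_truth, mean_len=3, threshold=0.7):
    """Simulate user feedback indicators from base model predictions.

    hw_probs:     predicted hand wash probability per window
    ground_truth: true label per window, either "null" or "hw"
    Returns a list of (window_index, value) pairs, where value is
    "correct" or "false".
    """
    indicators = []
    for j in range(mean_len - 1, len(hw_probs)):
        start = j - mean_len + 1
        running_mean = sum(hw_probs[start:j + 1]) / mean_len
        if running_mean > threshold:
            # A perfectly disciplined user confirms the query if any
            # window covered by the mean really is hand washing.
            covered = ground_truth[start:j + 1]
            value = "correct" if "hw" in covered else "false"
            indicators.append((j, value))
    return indicators


probs = [0.1, 0.2, 0.9, 0.9, 0.8, 0.1, 0.1]
truth = ["null", "null", "hw", "hw", "hw", "null", "null"]
confirmed = simulate_indicators(probs, truth)
declined = simulate_indicators(probs, ["null"] * len(probs))
```

In the first call, the running mean exceeds the threshold only at window index 4, and the covered windows are hand washing, so a single \textit{correct} indicator results. With an all-\textit{null} ground truth, the same predictions produce a \textit{false} indicator instead.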

...

...

@@ -72,7 +72,7 @@ For further improvements, I rely on knowledge about the context of hand washing.

For each section where the running mean of model predictions reaches a certain threshold, an indicator is created which is either \textit{neutral}, \textit{false} or \textit{correct}. Moreover, there can be indicators of type \textit{manual}. These provide the following information about the respective predictions.

\begin{itemize}

\item neutral:\\ The participant has not answered this query. However, if there is another indicator immediately afterward, both will probably cover the same activity. So we can assume the same value as from the following indicator. Suppose this is also of value \textit{neutral}. In that case, we continue this assumption over all following ones until either an indicator with the value \textit{false}/\textit{correct} exists or the distance between two adjacent indicators is so large that we can no longer assume the same activity. In the second case, no precise statement exists about the activity or prediction.

\item neutral:\\ The participant has not answered this query. However, if there is another indicator immediately afterward, both will probably cover the same activity. So we can assume the same value as that of the following indicator. If this one is also of value \textit{neutral}, we continue this assumption over all following ones until either an indicator with the value \textit{false}/\textit{correct} exists or the distance between two adjacent indicators is so large (in my case 20 seconds) that we can no longer assume the same activity. In the second case, no precise statement exists about the activity or prediction.

\item false:\\ The participant has declined the predicted activity. So the prediction is false. Since the indicator is just a single shot time stamp, we need to specify a time interval in which we assume the predicted activity might have occurred. Within this time interval, the predictions which lead to the rejected activity are likely to be wrong.

\item correct:\\ The participant has confirmed the prediction. Again, we have to estimate a time interval in which the activity could have occurred. Within this interval, the predictions that state the same activity as the confirmed one are probably correct. So outliers should have the value of the confirmed activity. Nevertheless, we do not precisely know the boundaries of the confirmed activity.

\item manual:\\ The participant triggered a manual activity. So the running mean has not exceeded the threshold. However, predictions with the matching activity are probably correct. It could be that the execution was too short to get the mean above the threshold or that too many predictions were false. Since a manual indicator could be given some time after the activity, the possible time interval is significantly larger than for the detected activities. Therefore, it is recommended to specify the maximum delay within which a user should trigger manual feedback.

...

...

@@ -111,7 +111,7 @@ The score only benefits from adding a label to the set if the predicted value fo

Convolutional networks have become a popular method for image and signal classification. I use a convolutional neural network (CNN) to predict the value of a pseudo label given the surrounding pseudo labels. It consists of two 1d-convolutional layers and a linear layer for the classification output. Both convolutional layers have a stride of 1, and padding is applied. The kernel size of the first layer is 10 and from the second 5. They convolve along the time axis over the \textit{null} and \textit{hw} values. As activation function, I use the Rectified Linear Unit (ReLU) after each convolutional layer. For input, I apply a sliding window of length 20 and shift 1 over the pseudo labels inside a \textit{hw} interval. It results in a 20x2 dimensional network input, generating a 1x2 output. After applying a softmax function, the output is the new pseudo-soft-label at the windows middle position.

Convolutional networks have become a popular method for image and signal classification. I use a convolutional neural network (CNN) to predict the value of a pseudo label given the surrounding pseudo labels. It consists of two 1d-convolutional layers and a linear layer for the classification output. Both convolutional layers have a stride of 1, and padding is applied. The kernel size is 10 in the first layer and 5 in the second. They convolve along the time axis over the \textit{null} and \textit{hw} values. As activation function, I use the Rectified Linear Unit (ReLU) after each convolutional layer. For input, I apply a sliding window of length 20 and shift 1 over the pseudo labels inside a \textit{hw} interval. This results in a 20x2 dimensional network input, generating a 1x2 output. After applying a softmax function, the output is the new pseudo-soft-label at the window's middle position.
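The construction of the network inputs can be illustrated with a small sketch. It only covers the sliding window step, not the network itself, and assumes that the pseudo labels of an interval are given as a list of (\textit{null}, \textit{hw}) value pairs; the function name \texttt{sliding\_windows} is hypothetical.

```python
def sliding_windows(pseudo_labels, length=20, shift=1):
    """Cut a sequence of soft pseudo labels into overlapping windows.

    pseudo_labels: list of [null_value, hw_value] pairs of one interval
    Returns a list of windows, each of shape length x 2. Intervals
    shorter than length yield no window in this sketch.
    """
    last_start = len(pseudo_labels) - length
    return [
        pseudo_labels[start:start + length]
        for start in range(0, last_start + 1, shift)
    ]


# A hypothetical interval of 25 pseudo labels yields 6 network inputs.
interval = [[0.8, 0.2] for _ in range(25)]
windows = sliding_windows(interval)
```

Each window corresponds to one 20x2 network input. How intervals shorter than the window length and the labels near the interval borders are handled is left open here.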

I used the approach from \secref{sec:synDataset} to train the network and created multiple synthetic datasets. On these datasets, I predicted the pseudo labels with the base model. Additionally, I augmented the labels by adding noise and random label flips. After that, I extracted the \textit{correct} intervals. This results in roughly 400 intervals with $\sim1300$ windows, which were shuffled before training. As loss function, I used cross-entropy.

...

...

@@ -131,13 +131,13 @@ In another approach, I use convLSTM denoising autoencoders in different configur

I implemented three different network architectures, which I call convLSTM1-dAE, convLSTM2-dAE, and convLSTM3-dAE. The convLSTM layers are bidirectional so that they can be used in the encoder and decoder parts. All networks use ELU activation functions and sigmoid for the output, just like the FCN-dAE architecture. Mean squared error is used to compute the loss between the predicted and clean ground truth sequences.

The architecture of convLSTM1-dAE uses two convLSTM layers each for encoding and decoding, both with a stride of 1. The first layer takes the 96x1x32 dimensional input and applies a convolutional kernel of width 5. After that, the 96x128x32 dimensional feature map is convolved by a kernel of width 3 to 96x64x32 dimensional features. The decoding part inversely reconstructs this.\\

In convLSTM2-dAE the spatial and temporal encoder/decoder are separated. It uses two convolutional layers for spatial encoding and two deconvolutional layers for decoding. In between, there are three convLSTM layers for time encoding/decoding. The convolutional layers compress the 96x1x32 dimensional input to a 96x32x8 dimensional feature map using kernels of width three and a stride of 2. In the temporal encoder/decoder, the convLSTMs also apply a kernel of width three and transform the features to 96x128x8 dimension and back to 96x32x8. Again the deconvolutional layers are inverse symmetric to the spatial encoder.\\

In convLSTM2-dAE, the spatial and temporal encoder/decoder are separated. It uses two convolutional layers for spatial encoding and two deconvolutional layers for decoding. In between, there are three convLSTM layers for time encoding/decoding. The convolutional layers compress the 96x1x32 dimensional input to a 96x32x8 dimensional feature map using kernels of width 3 and a stride of 2. In the temporal encoder/decoder, the convLSTMs also apply a kernel of width 3 and transform the features to 96x128x8 dimensions and back to 96x32x8. Again, the deconvolutional layers are inversely symmetric to the spatial encoder.\\

For convLSTM3-dAE, only convLSTM layers with a stride of 1 are used, as in convLSTM1-dAE, but with an additional layer for encoder and decoder. So the network consists of three bidirectional convLSTM layers each for encoding and decoding. Kernels of widths 7, 5, and 3 are applied, which generate a hidden feature map with dimensions 96x32x32. \figref{fig:examplePseudoFilterconvLSTM} shows the predicted output of the different networks applied to the previous example intervals. All networks produce similar pseudo-labels.

During testing, I created multiple filter configurations consisting of different combinations of the introduced denoising approaches. These configurations and their detailed descriptions are listed in \tabref{tab:filterConfigurations}. Some rely on ground truth data and are used for evaluation only. The configurations \texttt{all, high\_conf, scope, all\_corrected\_null, scope\_corrected\_null, all\_corrected\_null\_hwgt} and \texttt{scope\_corrected\_null\_hwgt} serve as baselines to observe the impact that different parts could have on the training. \texttt{all} and \texttt{high\_conf} represent simple approaches in which no user feedback is considered. Configuration \texttt{scope} shows the difference between including potentially incorrect data and using only data for which additional information is available. To show the improvement gained by simply correcting false predictions, \texttt{all\_corrected\_null, scope\_corrected\_null} are used. This is extended by the theoretical evaluations of \texttt{all\_corrected\_null\_hwgt, scope\_corrected\_null\_hwgt}, which give an upper bound for a perfect filter approach on \textit{hw, manual} intervals. The \texttt{all\_null\_*} configurations rely on the context knowledge that hand washing should occur far less often than all other activities. We can therefore assume that all labels have the value \textit{null} and that only inside \textit{hw, manual} intervals some labels have the value \textit{hw}. How well this assumption holds depends on how reliably a user has specified all hand wash actions. These configurations focus in particular on the performance of the introduced denoising approaches. Again, \texttt{all\_null\_hwgt} represents the theoretical upper bound if a perfect filter existed. As a more general approach, the \texttt{all\_cnn\_*} configurations do not make such a hard contextual assumption and instead attempt to combine the cleaning abilities of the CNN with high confidence in the resulting labels to augment the training data with likely correct samples.

The sensitivity gives the rate of correctly recognized hand washing samples. The specificity gives the rate of correctly recognized non-hand-washing samples. Both have to be close to 1 for a well-performing model. Precision is the ratio of correctly predicted hand wash samples to all predicted hand wash samples. The F1 score is the harmonic mean of recall and precision, where recall is the same as sensitivity. It therefore measures the trade-off between detecting all hand wash activities and detecting them only when the user is actually washing their hands. Similarly, the S score is the harmonic mean of sensitivity and specificity. Here, the focus is on the false positives in relation to the true negatives.
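
For reference, these hard-label metrics can be written in terms of confusion-matrix counts. The symbols $TP$, $FP$, $TN$, and $FN$ (true and false positives and negatives, with \textit{hw} as the positive class) are notation assumed here for illustration and are not taken from the cited definitions:
\[
\mathrm{sensitivity} = \mathrm{recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{specificity} = \frac{TN}{TN + FP}, \qquad
\mathrm{precision} = \frac{TP}{TP + FP}
\]
\[
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad
S = \frac{2 \cdot \mathrm{sensitivity} \cdot \mathrm{specificity}}{\mathrm{sensitivity} + \mathrm{specificity}}
\]
In this notation, $TN$ does not appear in the F1 score, whereas the S score depends on it through the specificity.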

However, these metrics only consider hard label values and are therefore unable to reflect the uncertainty of the labels. For this reason, I use a slightly different definition of the previous metrics, which works with class membership values~\cite{Beleites2013Mar, Beleites2015Feb}:

@@ -21,7 +21,7 @@ In this experiment a baseline is build on how a personalized model could perform

First, we concentrate on the dashed lines of the graphs. These are the evaluations of the general model in red and of the model trained in a supervised manner in green. In all graphs, the personalized models perform better than the general model. The base model achieves an F1 score of ${\sim}0.4127$ and an S score of ${\sim}0.7869$, whereas the personalized model reaches an F1 score of ${\sim}0.6205$ and an S score of ${\sim}0.8633$. Personalization can therefore lead to an increase of ${\sim}0.2079$ in F1 score and ${\sim}0.0765$ in S score. This constitutes the theoretical performance gain under perfect labeling.

Now I consider the scenarios in which some of the labels are noisy, starting with (a) of \figref{fig:supervisedNoisyAllSpecSen}. Here $n=0$ and noise is added to the hand wash labels only. Noise values up to around $40\%$--$50\%$ have only a small impact on specificity and sensitivity. If the noise increases further, the sensitivity tends to decrease. For specificity, there is no trend; only the fluctuations become more extreme. But as shown in (a) of \figref{fig:supervisedNoisyAllF1S}, noise on the hand wash labels has only a minor influence on the training, and a personalized model can still benefit from additional data with high noise in comparison to the general model. In contrast, noise on \textit{null} labels leads to much worse performance, as seen in the plots of (b). Values of $n$ below $0.1$ already lead to drastic decreases in specificity and sensitivity. To better illustrate this, \figref{fig:supervisedNoisyPart} shows plots of noise on \textit{null} labels in a range from $0\%$ to $1\%$. The specificity drops and converges to ${\sim}0.5$ for $n>0.01$. All models trained on these labels achieve a lower specificity than the general model. For noise values around $0$--$2\%$, the sensitivity can be higher than that of the general model, but for larger noise it also decreases significantly. The F1 and S scores of (b) in \figref{fig:supervisedNoisyAllF1S} make clear that even a small amount of noise on \textit{null} labels drastically reduces the performance, which leads to personalized models that are worse than the base model. Moreover, it becomes clear that the F1 measure lacks a penalty for false-positive predictions. According to the F1 score, a personalized model would achieve a higher performance than the base model for arbitrary noise values, although it leads to more false hand wash detections.

I attribute the high performance loss to the imbalance of the labels. A typical daily recording of around 12 hours contains $28{,}800$ labeled windows. A single hand wash action of $20$ seconds covers ${\sim}13$ windows. If a user washes their hands $10$ times a day, this leads to $130$ \textit{hw} labels and $28{,}670$ \textit{null} labels. Even $50\%$ of noise on the \textit{hw} labels results in only ${\sim}0.2\%$ of false data. However, $1\%$ of flipped \textit{null} labels already means that ${\sim}68\%$ of all hand wash labels are wrong. These false labels would thus have a higher impact on the training than the original hand wash data. As the S score of \figref{fig:supervisedNoisyPart} shows, it is possible that the personalized model benefits from additional data if the ratio of noise in \textit{null} labels is smaller than ${\sim}0.2\%$. The training data of the experiments contains $270{,}591$ \textit{null} labels and $2{,}058$ hand wash labels. So ${\sim}0.2\%$ noise would lead to ${\sim}541$ false \textit{hw} labels, which is ${\sim}20\%$ of all \textit{hw} labels in the training data. As a rule of thumb, I claim that the training data should contain less than ${\sim}20\%$ of wrong hand wash labels, whereas the amount of incorrect \textit{null} labels does not require particular focus.
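
The proportions above can be retraced with a short calculation. The window length of $1.5$ seconds is not stated explicitly; it is an assumption derived here from $28{,}800$ windows in $12$ hours, and the hand wash count is left unrounded ($10 \cdot 20 / 1.5 \approx 133.3$ windows) for the second line:
\[
\frac{12 \cdot 3600\,\mathrm{s}}{28{,}800} = 1.5\,\mathrm{s}\ \mbox{per window}, \qquad
\frac{20\,\mathrm{s}}{1.5\,\mathrm{s}} \approx 13\ \mbox{windows per hand wash}
\]
\[
\frac{0.5 \cdot 130}{28{,}800} \approx 0.2\%, \qquad
\frac{0.01 \cdot (28{,}800 - 133.3)}{0.01 \cdot (28{,}800 - 133.3) + 133.3} \approx \frac{286.7}{420.0} \approx 68\%
\]
\[
0.002 \cdot 270{,}591 \approx 541, \qquad
\frac{541}{541 + 2{,}058} \approx 20.8\%
\]
The first fraction in the second line is the share of flipped \textit{hw} labels among all labels of a day, whereas the remaining fractions give the share of false \textit{hw} labels among all \textit{hw} labels after flipping \textit{null} labels.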

...

...

@@ -33,7 +33,7 @@ In these experiments, I would like to show the effect of noise on soft labels co

\subsubsection{Negative impact of soft labels}\label{sec:expNegImpSoftLabel}

I want to observe whether smoothing has a negative effect when correct labels are smoothed. Therefore, I repeat the previous experiment but do not flip the randomly selected labels and only apply the smoothing $s$ to them. Again, no significant changes in performance due to noise in \textit{hw} labels are expected, which is confirmed by the left graph of \figref{fig:supervisedFalseSoftNoise}. In the case of wrongly smoothed \textit{null} labels, the S score shows a negative trend for higher smoothing values, as shown in the right graph. The smoothing value strongly influences the model's performance when a larger portion of the labels is smoothed. But for noise values $\leq0.2\%$, all personalized models still achieve higher S scores than the general model.

I combined both experiments to ensure that the drawbacks of incorrectly smoothing correct labels do not outweigh the performance gains from smoothing false labels. This mirrors what happens to the labels when one of the denoising filters is applied to a hand wash section. First, a certain ratio $n$ of \textit{null} labels is flipped. This represents the case in which the filter falsely classifies a \textit{null} label as hand washing. The false labels are smoothed to value $s$. After that, the same ratio $n$ of correct \textit{hw} labels is smoothed to value $s$. This is equivalent to smoothing the label boundaries of a hand wash action. The resulting performance of the personalizations can be seen in \figref{fig:supervisedSoftNoiseBoth}. The performance increase from smoothing false labels and the performance decrease from smoothing correct labels seem to cancel out for smaller values of $n$. For larger values of $n$, the performance slightly increases with larger $s$. When refining the training data, I use soft labels mainly at activity borders within hand washing sections, where the chance of false labeling is higher.

\subsection{Influence of missing feedback}\label{sec:expMissingFeedback}

This experiment analyzes the impact of missing user feedback on the training data and the resulting model performance. As before, the base model is trained on data refined with the different filter configurations. In this case, however, only $f\%$ of the \textit{false} and $c\%$ of the \textit{correct} indicators exist. All others are replaced with neutral indicators. \figref{fig:supervisedPseudoMissingFeedback} shows the S scores of the personalized models, which are trained with the respective filter configuration and increasing values of $f$ in (a) and $c$ in (b). Only filter configurations that have outperformed the general model are considered.

As the plots show, missing \textit{false} indicators do not lead to significant performance changes. The \texttt{all\_null\_*} filter configurations include all samples as \textit{null} labels regardless of the indicator. Similarly, the \texttt{all\_cnn\_*} configurations contain a larger share of high-confidence samples with \textit{null} labels than the sections covered by the \textit{false} indicators.

...

...

@@ -74,10 +74,10 @@ In contrast, missing \textit{correct} indicators lead to performance loss. Howev

\subsection{Evaluation over iteration steps}

I compare the performance of the personalized models between iteration steps. For this purpose, the base model is applied to one of the training data sets of a participant, which is then refined by one of the filter configurations. After that, the resulting personalized model is evaluated. This step is repeated over all training sets, where the new model replaces the previous base model. Additionally, I evaluate the performance of a single iteration step by always training and evaluating the base model on the respective training data. I repeat this experiment with different numbers of training epochs and for the two regularization approaches of \secref{sec:approachRegularization}.

\subsubsection{Evolution}

First, we observe how the model performance evolves over the iteration steps. \figref{fig:evolutionSingle} shows the S scores for each iteration step of the overall personalized and the single-trained models. The training data is generated by the \texttt{all\_noise\_hwgt} filter configuration. In graph (a), the epochs and regularization are the same as in the previous experiments. The first iteration leads to a lower S score than the general model, but for all following iteration steps the performance increases continuously. Although the single-step model has a lower S score in the second iteration, the iterated model still benefits from the training. This becomes clearer with fewer training epochs per step. In graph (b) with 50 epochs, the performance of the overall personalized models increases with each iteration step despite oscillating values of the single models. This illustrates that personalization does not depend only on the last training step but accumulates data across all iterations.

\caption[Example pseudo filter CNN]{\textbf{Example pseudo filter CNN.} Plot of two \textit{positive} intervals to which the convolutional neural network filter approach was applied. Values for \textit{hw} of predictions and pseudo labels are plotted in orange and magenta, respectively.}

\caption[Example pseudo filter score]{\textbf{Example pseudo filter score.} Plot of two \textit{positive} intervals to which the naive filter approach was applied. Values for \textit{hw} of predictions and pseudo labels are plotted in orange and magenta, respectively.}