\chapter*{Abstract}
Wearable sensors like smartwatches offer an excellent opportunity for human activity recognition (HAR). They are available to a broad user base and can be used in everyday life. Due to the variety of users, the detection model must be able to recognize different movement patterns. Recent research has demonstrated that personalized recognition tends to perform better than a general one. However, additional labeled data from the user is required, which is time-consuming and labor-intensive to annotate. While common personalization approaches reduce the necessary labeled training data, the labeling process remains dependent on some user interaction.
In this work, I present a personalization approach in which training data labels are derived from inexplicit user feedback obtained during the usual use of a HAR application. The general model predicts labels, which are then refined by various denoising filters based on Convolutional Neural Networks and Autoencoders. The previously obtained user feedback assists this process. High-confidence data is then used to fine-tune the recognition model via transfer learning. No changes to the model architecture are required, so personalization can easily be added to an existing application.
Analysis in the context of hand wash detection demonstrates that a significant performance increase can be achieved. Moreover, I compare my approach with a traditional personalization method to confirm its robustness. Finally, I evaluate the process in a real-world experiment in which participants wore a smartwatch daily for a month.
\chapter{Zusammenfassung}
\chapter{Introduction}\label{chap:introduction}
Detecting and monitoring people's activities can be the basis for observing user behavior and well-being. Human Activity Recognition (HAR) is a growing research area in many fields like healthcare~\cite{Zhou2020Apr, Wang2019Dec}, elder care~\cite{Jalal2014Jul, Hong2008Dec}, fitness tracking~\cite{Nadeem2020Oct} or entertainment~\cite{Lara2012Nov}. Especially the technical improvements in wearable sensors like smartwatches offer an integration into everyday life over a wide user base~\cite{Weiss2016Feb, Jobanputra2019Jan, Bulling2014Jan}.
One application scenario in healthcare is the observation of various diseases such as Obsessive-Compulsive Disorder (OCD). For example, the detection of hand washing activities can be used to derive the frequency or excessiveness that occurs in some people with OCD. Moreover, it is possible to diagnose and even treat such diseases outside a clinical setting~\cite{Ferreri2019Dec, Briffault2018May}. If excessive hand washing is detected, Just-in-Time Interventions can be presented to the user, offering enormous potential for promoting health behavior change~\cite{10.1007/s12160-016-9830-8}.
State-of-the-art Human Activity Recognition methods are supervised deep neural networks derived from concepts like Convolutional Layers or Long Short-Term Memory (LSTM). These require lots of training data to achieve good performance. Since the movement patterns of each human are unique, the performance of activity detection can differ between users, so training data from a wide variety of humans is necessary to generalize to new users. Accordingly, it has been shown that personalized models can achieve better accuracy than user-independent models~\cite{Hossain2019Jul, Lin2020Mar}.
To personalize a model, retraining on new unseen sensor data is necessary. Obtaining the ground truth labels is crucial for most deep learning techniques. However, the annotation process is time- and cost-intensive. Typically, training data is labeled by hand in controlled environments. In a real-world scenario, the user would have to take over the major part of this work, which requires lots of user interaction and a certain expertise and would therefore contradict usability.
There has been various research on preprocessing data to make it usable for training. It turned out that a good trade-off is semi-supervised learning or active learning, where a general base model is used to label the data and, in uncertain cases, it relies on user interaction~\cite{siirtola2019importance, Siirtola2019Nov}. Here a small part of labeled data is combined with a larger unlabeled part to improve the detection model. However, some form of explicit user interaction is still required for personalization, so there is an overhead in the usage of a HAR application.
My work aims to personalize a detection model without increasing user interaction. Information for labeling is drawn from indicators that arise during the use of the application. These can be derived from user feedback to triggered actions resulting from the predictions of the underlying recognition model. Moreover, the personalization should be an additional, separate component, so no change to the model architecture is required.
At first, all new unseen sensor data is labeled by the same general model which is used for activity recognition. These model predictions are corrected to a certain extent by pre-trained filters. High-confidence labels are considered for personalization. In addition, the previously obtained indicators are used to further refine the data and generate a valid training set. Therefore, the process of manual labeling can be skipped and replaced by an automatic combination of available indications. With the newly collected and labeled training data, the previous model can be fine-tuned in an incremental learning approach~\cite{Amrani2021Jan, Siirtola2019May, Sztyler2017Mar}. For neural networks, it has been shown that transfer learning offers high performance with decent computation time~\cite{Chen2020Apr}. In combination, this leads to a personalized model with improved performance in detecting specific gestures of an individual user.
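The following sketch outlines this pipeline at a high level; all names (\texttt{general\_model}, \texttt{denoise\_filter}) and the confidence threshold are illustrative placeholders, not the actual implementation:
\begin{lstlisting}[language=Python]
# Hypothetical sketch of the personalization pipeline described above;
# all names and the threshold are illustrative, not an actual API.
import numpy as np

def personalize(general_model, denoise_filter, sensor_windows, indicators,
                confidence_threshold=0.9):
    # 1. Pseudo-label all new, unseen sensor data with the general model.
    probs = general_model.predict(sensor_windows)      # (N, n_classes)
    # 2. Refine the predicted labels with a pre-trained denoising filter,
    #    assisted by the indicators obtained from user feedback.
    refined = denoise_filter(probs, indicators)
    # 3. Keep only high-confidence samples for training.
    mask = refined.max(axis=1) >= confidence_threshold
    pseudo_labels = refined.argmax(axis=1)
    # 4. Fine-tune a copy of the general model via transfer learning.
    return general_model.fine_tune(sensor_windows[mask], pseudo_labels[mask])
\end{lstlisting}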
I applied the described personalization process to a hand washing detection application used to observe the behavior of OCD patients. During the observation, the user answers requested evaluations if the application detects hand washing. For mispredictions, the user has the opportunity to reject evaluations. Depending on how the user reacts to the evaluations, conclusions are drawn about the correctness of the predictions, resulting in the desired indicators.
The contributions of my work are as follows:
\begin{itemize}
\item [1.] A personalization approach which can be added to an existing HAR application and does not require additional user interaction or changes in the model architecture.
\item [2.] Different indicator-assisted refinement methods for generated labels, based on Convolutional Networks and Fully Connected Autoencoders.
\item [3.] Demonstration that a personalized model which results from this approach outperforms the general model and can achieve similar performance as a supervised personalization.
\item [4.] Comparison to a common active learning method.
\item [5.] Presentation of real-world experiments which confirm applicability to a broad user base.
\end{itemize}
\todo{structure of this work}
\chapter{Related Work}\label{chap:relatedwork}
Human Activity Recognition (HAR) is a broad research field used in a variety of applications like healthcare, fitness tracking, elder care, or behavior analysis. Data acquired by different types of sensors like video cameras, range sensors, wearable sensors, or other devices is used to automatically analyze and detect everyday activities. Especially the field of wearable sensors is growing, as the technical progress in smartwatches makes it possible for a wide range of users to integrate these sensors into their daily lives.
In the following, I give a brief overview of the literature on state-of-the-art HAR and how personalization can improve performance. Then I focus on work that deals with the generation of training data through different approaches. Finally, I show how faulty labels in the training data can be cleaned.
\section{Activity recognition}\label{sec:relWorkActivityRecognition}
Most commonly used Inertial Measurement Units (IMUs) provide a combination of 3-axis acceleration and orientation data in continuous streams. Sliding windows are applied to the streams and are assigned to an activity by the underlying classification technique~\cite{s16010115}. This classifier is a prediction function $f(x)$ which returns the predicted activity labels for a given input $x$. Recently, deep neural network techniques have replaced traditional ones such as Support Vector Machines or Random Forests since no hand-crafted features are required~\cite{ramasamy2018recent}. They use multiple hidden layers of feature decoders and an output layer that provides predicted class distributions~\cite{MONTAVON20181}. Each layer consists of multiple artificial neurons connected to the neurons of the following layer. These connections are assigned a weight that is learned during the training process. First, in the feed-forward pass, the output values are computed based on a batch of training data. In the second stage, called backpropagation, the error between the expected and predicted values is computed by a loss function $J$ and minimized by optimizing the weights. Feed-forward pass and backpropagation are repeated over multiple iterations, called epochs~\cite{Liu2017Apr}.
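As an illustration, the sliding-window segmentation of a continuous IMU stream could look as follows; the window length and step size are exemplary values assuming a 50\,Hz sampling rate:
\begin{lstlisting}[language=Python]
import numpy as np

def sliding_windows(stream, win_len=150, step=75):
    """Segment a continuous stream of shape (T, channels) into overlapping
    windows; at 50 Hz, win_len=150 gives 3 s windows with 50% overlap."""
    windows = [stream[s:s + win_len]
               for s in range(0, len(stream) - win_len + 1, step)]
    return np.stack(windows)   # (n_windows, win_len, channels)

# Example: 6-channel stream (3-axis accelerometer + 3-axis gyroscope).
stream = np.random.randn(3000, 6)
print(sliding_windows(stream).shape)   # (39, 150, 6)
\end{lstlisting}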
The combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory recurrent neural networks (LSTMs) tends to outperform other approaches and is considered the current state of the art for human activity recognition~\cite{9043535}. For classification problems, cross-entropy is used as the loss function in most works. \extend{???}
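A minimal sketch of such a CNN+LSTM classifier is shown below; the layer sizes are illustrative and not taken from a specific publication:
\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

class ConvLSTM(nn.Module):
    """Minimal CNN+LSTM classifier; layer sizes are illustrative."""
    def __init__(self, channels=6, n_classes=2, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, win_len, channels)
        z = self.conv(x.transpose(1, 2))       # -> (batch, 64, win_len)
        out, _ = self.lstm(z.transpose(1, 2))  # -> (batch, win_len, hidden)
        return self.head(out[:, -1])           # logits from last time step

model = ConvLSTM()
logits = model(torch.randn(8, 150, 6))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))  # cross entropy
\end{lstlisting}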
\section{Personalization}\label{sec:relWorkPersonalization}
However, even well-performing architectures can yield worse results in real-world scenarios. Varying users and environments create many different influences that can affect performance, such as the position of the device, differences between the sensors, or human characteristics~\cite{ferrari2020personalization}.
Each user's movement pattern differs, so a general detection model may suffer. To overcome this problem, a general model should be trained on a wide range of users and cover as many different motion patterns as possible. However, this would require unrealistically large data sets. Besides the storage and processing costs, the availability of public datasets is limited since labeling is a difficult task. The goal is to generalize the model as much as possible with respect to the final user.
It has been shown that a personalized model trained with additional user-specific data (even with just a small amount of additional data) can significantly outperform the general model~\cite{8995531, doi:10.1137/1.9781611973440.71, zhao2011cross}. In my work, I concentrate on data-based approaches, which can be split into \textit{subject-independent}, \textit{subject-dependent} and \textit{hybrid} dataset configurations~\cite{Ferrari2021Sep}. The \textit{subject-independent} model represents the general model where no user-specific data is used for training, whereas the \textit{subject-dependent} model relies solely on the user's data. A subject-dependent model would fit the user best but requires enough specific data from each user. As a combination of both, the \textit{hybrid} configuration uses the data of all other users with additional data from the target user. This should result in better detection of the final user's activities than the subject-independent model but is easier to train than the subject-dependent model since less data from the final user is required. It is even possible that a hybrid approach can achieve similar performance as the subject-dependent one but with less user-specific data~\cite{weiss2012impact, Chen2017Mar}.
In a common hybrid approach, a general user-independent model is used first until new data of an unseen user is gathered. The new data is then used to fine-tune the existing model. For neural networks, deep transfer learning has been shown to provide a suitable approach to adapting an existing model with additional data~\cite{Tan2018Sep}. The idea is to transfer knowledge from a previously trained model to a new model which solves a similar task. In the case of personalization, transfer learning is used for domain adaptation~\cite{AlHafizKhan2018Mar}. Given a source domain $D_S$ with learning task $T_S$ and a target domain $D_T$ with learning task $T_T$, where $D_S \neq D_T$ and $T_S = T_T$, the goal is to improve the target prediction function $f_T(\cdot)$ for $T_T$ using knowledge from $D_S$ and $T_S$~\cite{Ghafoorian2017Sep, Lebichot2019Apr}. Typically, mini-batch optimization is used, where multiple new training instances are collected over time and then used for fine-tuning the model by weight updates.
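Such a fine-tuning step could be sketched as follows, reusing the CNN+LSTM model from above; freezing the convolutional layers and all hyperparameter values are exemplary choices:
\begin{lstlisting}[language=Python]
import copy
import torch
import torch.nn as nn

def fine_tune(general_model, loader, lr=1e-4, epochs=5, freeze_conv=True):
    """Adapt a pre-trained model to the target user; values illustrative."""
    model = copy.deepcopy(general_model)  # keep the general model intact
    if freeze_conv:                       # reuse learned low-level features
        for p in model.conv.parameters():
            p.requires_grad = False
    opt = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:               # mini-batch optimization
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
\end{lstlisting}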
To compare deep learning personalization with traditional techniques, Amrani et al. evaluated deep transfer learning approaches based on CNNs against a baseline of incremental learning using Learn++~\cite{Amrani2021Jan}. They demonstrated that deep learning outperforms Learn++ and adapts faster to new users.
Rokni et al. used CNNs and personalization by transfer learning where lower layers are reused and only upper ones are retrained~\cite{Rokni2018Apr}. This is motivated by the assumption that features learned in the first layers can be reused in other domains and only the classification has to be adapted~\cite{Yosinski2014}. They significantly improved activity recognition accuracy with just a few labeled instances. Hoelzemann and Van Laerhoven analyzed how results differ with respect to different methods and applications if transfer learning is applied to a Deep Convolutional LSTM network~\cite{Hoelzemann2020Sep}. They suggest that convolutional layers should not be fine-tuned, as already mentioned in the work of Rokni. Furthermore, they advise reinitializing the LSTM layers to their defaults, which results in slightly better performance in some cases.
A typical problem during fine-tuning is catastrophic forgetting~\cite{Lee2017}: important information that has been learned before gets lost by overfitting to the new target. To overcome this problem, Xuhong et al. and Li et al. analyzed different regularization schemes applied to inductive transfer learning~\cite{xuhong2018explicit, Li2020Feb}. They state that the L2-SP penalty should be considered the standard baseline for transfer learning, as it also outperforms freezing the first layers of a network. The idea is that learned parameters should remain close to their initial values during fine-tuning, so the pre-trained model serves as a reference that defines the effective search space. To this end, a regularizer $\Omega(\omega)$ of the network parameters $\omega$ to be adapted is added to the result of the loss function. For the L2-SP penalty, the regularizer is defined as:
\begin{align}
\Omega(\omega) &= \frac{\alpha}{2}\left\Vert \omega-\omega^0 \right\Vert^2_2
\end{align}
where $\alpha$ is a constant parameter that adjusts the strength of the penalty, and $\left\Vert \cdot \right\Vert_p$ is the p-norm. The starting point (-SP) is the parameter vector $\omega^0$ of the pre-trained model. So the penalty adds the distance between the current and initial parameters to the loss. Therefore, the loss grows as the distance between the two models grows, which negatively impacts the optimization.
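A possible implementation of this penalty, added to the loss during fine-tuning, is sketched below; the value of $\alpha$ is illustrative:
\begin{lstlisting}[language=Python]
import torch

def l2_sp_penalty(model, ref_params, alpha=0.01):
    """(alpha/2) * ||w - w0||_2^2 with w0 the pre-trained parameters."""
    dist = sum(((p - p0) ** 2).sum()
               for p, p0 in zip(model.parameters(), ref_params))
    return 0.5 * alpha * dist

# Inside the fine-tuning loop:
#   ref_params = [p.detach().clone() for p in general_model.parameters()]
#   loss = loss_fn(model(x), y) + l2_sp_penalty(model, ref_params)
\end{lstlisting}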
However, these works only show the potential performance gain of personalization and still require ground truth data acquired in manual labeling processes.
Deep learning algorithms in particular require a lot of training data to achieve good performance~\cite{Perez2017Dec}. Since wearable sensors became available to the general public, it is easy to obtain new sensor data for learning. Nonetheless, this data must be labeled, which remains a labor-intensive task.
New data can be used in either \textit{supervised}, \textit{semi-supervised} or \textit{unsupervised} learning. In supervised learning, all training instances have to be assigned their correct, mostly handcrafted labels, whereas in unsupervised learning, the learning process can train with unlabeled instances. A combination of these is semi-supervised learning, where a small set of labeled instances is combined with a larger set of unlabeled instances~\cite{Chapelle2009Feb}.
For personalization, a supervised approach would not be acceptable, as the user cannot be expected to have the necessary expertise or expend the effort required to label all training data~\cite{Ashari2019Jul}. Nevertheless, as Siirtola and Röning have shown, it is possible to gain similar performance without a large amount of additional labeled data~\cite{Siirtola2019Nov}. They compared the different learning techniques and used a combination of high-confidence predictions and questioning the user on uncertain labels for their semi-supervised approach. As a result, they achieved a recognition rate that is just $2.5\%$ lower than the supervised version.
Zeng et al. examined two approaches for semi-supervised learning~\cite{Zeng2017Dec}: a CNN-Encoder-Decoder and a CNN-Ladder network. Both are first trained on clean, supervised datasets. After that, hidden features are fine-tuned in an unsupervised training, which leads to significant improvements.
\subsection{Active learning}
Active learning is generally a supervised learning approach that attempts to achieve good performance with as few labeled instances as possible. It selects samples from a pool of unlabeled instances and queries them to a so-called oracle, for example the user, to receive the exact labels. To achieve maximum performance with only a few queries, the instances with the highest informativeness are selected for training. There are several approaches to determining informativeness, such as querying the samples where the learner is most uncertain in labeling or those that would lead to the greatest expected model change~\cite{Settles2009, rokni2020transnet}.
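For illustration, an uncertainty-based selection with the least-confidence measure could be implemented as follows:
\begin{lstlisting}[language=Python]
import numpy as np

def least_confident_query(probs, n_queries=10):
    """Pick the samples the model is least certain about;
    probs: (N, n_classes) predicted class distributions."""
    uncertainty = 1.0 - probs.max(axis=1)
    return np.argsort(-uncertainty)[:n_queries]  # indices for the oracle

probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]])
print(least_confident_query(probs, n_queries=1))  # -> [1]
\end{lstlisting}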
Saeedi et al. combine an active learning approach with a neural network consisting of convolutional and LSTM layers to predict activities using wearable sensors~\cite{Saeedi2017Dec}. They apply active learning to fine-tune the model to a new domain. To select the queried instances, they introduced a measurement called information density, which weights the informativeness of an instance by its similarity to all other instances.
%~\cite{Fekri2018Sep}.
Ashari and Ghasemzadeh observed the limitations of the response capabilities of a user~\cite{Ashari2019Jul}. This applies not only to the number of queries but also to the time discrepancy between when a query is posed and when it is answered. They add a user's ability to remember the correct label of a sample as a criterion to the selection process of queried instances.
Active learning can be combined with a semi-supervised approach, as shown by Hasan and Roy-Chowdhury~\cite{Hasan2015Sep}. They use active learning where samples with high tentative prediction probability are labeled by a weak learner, i.e., a classification algorithm. Only samples with low certainty and high potential model change are queried to the user. This way, they can enlarge the training set without increasing the user interaction. They achieved performance competitive with state-of-the-art active learning methods but with a reduced amount of manually labeled instances.
\subsection{Self-supervised learning}\label{sec:relWorkSelfSupervisedLearning}
Here, a deep neural network is introduced which learns to solve predefined transformation recognition tasks in an unsupervised manner. That is, different transformation functions like noise, rotation, or negation are applied to an input signal, which generates new distinct versions of it. The network predicts the probabilities that a given sequence is a transformation of the original signal. Since the transformation functions are known, a self-supervised labeled training set can be constructed. The idea is that, for detecting the transformation tasks, the core characteristics of the input signal have to be learned. These high-level semantics can then be used as the feature layers for the classifier, and just a few labeled samples are required to train the classification layer.
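A sketch of such a self-supervised task construction is given below; the concrete set of transformation functions varies between works:
\begin{lstlisting}[language=Python]
import numpy as np

TRANSFORMS = {                # exemplary transformation functions
    0: lambda x: x,                                      # original
    1: lambda x: x + np.random.normal(0, 0.1, x.shape),  # noise
    2: lambda x: -x,                                     # negation
    3: lambda x: x[::-1].copy(),                         # time reversal
}

def make_self_supervised_set(windows):
    """Label each transformed window with the id of the applied transform."""
    xs, ys = [], []
    for w in windows:
        for t_id, t in TRANSFORMS.items():
            xs.append(t(w))
            ys.append(t_id)
    return np.stack(xs), np.array(ys)
\end{lstlisting}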
Saeed et al. showed a self-supervised learning approach for HAR~\cite{Saeed2019Jun}. They achieve a significantly better performance than traditional unsupervised learning methods and one comparable to fully supervised methods, especially in a semi-supervised scenario where a few labeled instances are available. Tang et al. extend this by combining self-supervised learning and self-training~\cite{Tang2021Feb}. A teacher model is trained first using supervised labeled data. Then, the teacher model is used to relabel the supervised dataset and additional unseen instances. The most confident samples are augmented by transformation functions, as mentioned previously. After that, the self-supervised dataset is used to train a student model. In addition, it is fine-tuned with the originally supervised instances. By combining the unlabeled data with the limited labeled data, performance can be further enhanced.
\subsection{Partial labels}
In situations where it is impossible to detect precisely when an activity was performed, partial labels can be used to indicate which activities are included in a larger time period. Multiple contiguous instances can be collected in a single bag which is labeled with the covered classes. With these partial labels, also called weak labels, the actual classes of the contained instances can be predicted more precisely.
In the work of Stikic et al., multi-instance learning with weak labels is used for HAR~\cite{Stikic2011Feb}. A set of instances is collected in a bag annotated with a single bag label according to the instances. Therefore, a user could simply specify which activities have happened in a certain time period without explicit allocations. A graph-based label propagation then computes the labels of the single instances. Hussein et al. use an active learning approach with partial labels for personalized autocomplete teleoperations~\cite{Hussein2021Jul}. For partial feedback, the user does not have to give the exact label to a query but only answers 'yes or no' questions covering multiple possible labels. Each feedback excludes a subset of classes, and the partial label consists of the set of remaining possible classes. In this case, the partial feedback is gathered while the user accepts or rejects predicted motions. The adapted framework can reduce false predictions significantly.
%\cite{Pham2017Jan} uses a dynamic programming approach.
\subsection{Pseudo labeling}\label{sec:relWorkPseudoLabeling}
Pseudo labeling allows an unsupervised domain adaptation by using predictions of the base model~\cite{lee2013pseudo}. Based on the prediction of a sample, an artificial pseudo-label is generated, which is treated as ground truth data. However, this requires that the initially trained model predicts pseudo-labels with high confidence, which is hard to satisfy. Training with false pseudo-labels has a negative impact on the personalization. Moreover, it is possible that pseudo-labeling overfits to incorrect pseudo-labels over multiple iterations, which is known as confirmation bias. Therefore, many approaches augment the pseudo labels to reduce the amount of false training data. Since a base model is required, which is in most cases trained with supervised data, pseudo labeling is a part of semi-supervised learning. Nevertheless, compared to other semi-supervised approaches, pseudo labeling offers a simple implementation that does not rely on domain-specific augmentations or any changes to the model architecture.
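A minimal sketch of confidence-based pseudo-labeling is given below; the threshold value is illustrative:
\begin{lstlisting}[language=Python]
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.95):
    """Keep only predictions whose softmax confidence exceeds the threshold."""
    xs, ys = [], []
    for x in unlabeled_loader:
        probs = torch.softmax(model(x), dim=1)
        conf, labels = probs.max(dim=1)
        keep = conf >= threshold           # discard uncertain predictions
        xs.append(x[keep])
        ys.append(labels[keep])
    return torch.cat(xs), torch.cat(ys)
\end{lstlisting}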
Li et al. showed a naive approach for semi-supervised learning using pseudo labels~\cite{Li2019Sep}. First, a pseudo labeling model $M_p$ is trained using a small supervised labeled data set $L$. This model is then used to perform pseudo-labeling for new unlabeled data, which results in dataset $\hat{U}$. After that, a deep learning model $M_{NN}$ is pre-trained with the pseudo labeled data $\hat{U}$ and afterwards fine-tuned with the supervised data $L$. This process is repeated, where the resulting model $M_{NN}$ is used as the new pseudo labeling model $M_p$, until the validation accuracy converges. Moreover, they use the fact that predictions of a classifier model are probabilistic and assume that labels with a higher probability also have a higher accuracy. Therefore, they use only pseudo labels with a high certainty. They argue that pseudo-labeling can be seen as a kind of data augmentation: even with high label noise in the pseudo labels, a deep neural network should be able to improve with training. In their tests, they achieved a significant improvement in accuracy by adding pseudo labels to the training. Furthermore, they showed that the model benefits especially from the first iterations. Nevertheless, it is required that the pseudo labeling model $M_p$ has a certain accuracy; tests show that a better pseudo labeling model leads to a higher accuracy of the fine-tuned model. Arazo et al. observed the performance of naive pseudo labeling applied to images and showed that it overfits to incorrect pseudo labels~\cite{Arazo2020Jul}. The trained model tends to have higher confidence in previously falsely predicted labels, which results in new incorrect predictions. They applied simple modifications to prevent confirmation bias without requiring multiple networks or any consistency regularization methods as done in other approaches like in \secref{sec:relWorkSelfSupervisedLearning}. With mixup augmentation as regularization and a minimum number of labeled samples, they yielded state-of-the-art performance. Additionally, they use soft-labels instead of hard-labels for training. Here, a label consists of the individual class affiliations instead of a single value for the target class, which makes it possible to depict uncertainty over the classes. As a mixup strategy, they combine random sample pairs and their corresponding labels, which creates a data augmentation with label smoothing and should reduce the confidence of network predictions. As they point out, this approach is simpler than using other regularization methods and moreover more accurate.
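For illustration, mixing up random sample pairs and their soft labels can be sketched as follows; the Beta parameter is an exemplary choice:
\begin{lstlisting}[language=Python]
import torch

def mixup(x, y_soft, alpha=0.4):
    """Combine random sample pairs and their soft labels (mixup)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))       # random pairing within the batch
    return (lam * x + (1 - lam) * x[perm],
            lam * y_soft + (1 - lam) * y_soft[perm])
\end{lstlisting}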
Rizve et al.~\cite{Rizve2021Jan} tackle the problem of relatively poor performance compared to other semi-supervised methods due to the larger number of incorrectly pseudo-labeled samples. They try to select a set of pseudo-labels that is less noisy, so only high-confidence predictions under the aspect of the network's uncertainty are considered for training. This results in a smaller but more accurate pseudo-label subset, which yields a higher performance of the resulting model. In their experiments, they used Monte-Carlo dropout for uncertainty estimation~\cite{Gal2016Jun}, but it is also possible to use other methods which do not require extensive model modifications. In a similar approach, Zheng and Yang use pseudo label learning for segmentation adaptation without ground truth data in the target domain~\cite{Zheng2021Apr}. Like Rizve, they estimate the prediction uncertainty to address the label noise. An auxiliary classifier is used to determine the prediction variance, which reflects the uncertainty of the network. This approach also reaches competitive performance. However, changes to the model architecture are required to implement an auxiliary classifier.
%\cite{Gonzalez2018May} compared different self-labeling methods in a time series context. In self-training, a base learner is firstly trained on a labeled set. Then, unlabeled instances are classified by the base classifier, where it is assumed, that more accurate predictions tend to be correct. After that the labeled training set is enlarged with these self-labeled instances. They achieved for the best performing methods similar performance to the supervised learning.
\section{Learning with label noise}\label{sec:relWorkLabelNoise}
When learning with generated labels, it may happen that some of these labels carry a wrong annotation, so-called label noise. This has a negative impact on the training process. Depending on the type of noise, the influence on the model can differ: while \textit{uniform noise} and \textit{class-dependent noise} can still allow good model accuracy up to a certain degree, \textit{feature-dependent noise} results in much worse performance~\cite{Algan2020Mar}. In synthetic annotation processes, the labels are determined by the data, so wrong annotations lead to feature-dependent noise. It is therefore important to handle noisy labels during training.
Label noise does not only occur in generated labels. In supervised learning as well, mislabeled instances may exist~\cite{Frenay2013Dec}. This can happen due to small boundaries between classes that are hard or even impossible to distinguish, or due to errors in human annotations. Therefore, there is numerous research on training neural networks with noisy labels. One approach is to adapt the learning process itself to become robust against noise~\cite{Patrini2017, Ghosh2017Feb}; another is to clean the noise in the training data set. In my work, I focus on the latter, as no changes have to be made to the model architecture or the training implementation.
Furthermore, it is also possible for standard neural networks to learn from arbitrarily noisy data and still perform well. They also benefit from larger data sets that can accommodate a wide range of noise~\cite{Rolnick2017May}. In particular, there is much research in image denoising, where deep neural networks have become very promising~\cite{Xie2012, Dong2018Oct, Gondara2016Dec}. Many of these approaches can also be applied to time-series data. Since, in our case, labels result from time-series samples, the labels themselves can also be considered time-series data. Therefore it is likely that two adjacent labels have the same value and that an outlier is probably a false label. This is similar to noise in a signal, and denoising or smoothing the label sequence corresponds to correcting the wrong labels.
\subsection{Autoencoder}\label{sec:relWorkAutoencoder}
For denoising signals, autoencoders have been shown to achieve good results~\cite{Xiong2016Nov, Chiang2019Apr, Xiong2016Jun}. An autoencoder is a neural network model consisting of two parts: the encoder and the decoder. In encoding, the model tries to map the input into a more compact form with low-dimensional features. During decoding, this compressed data is reconstructed to the original space~\cite{Liu2019Feb}. During training, the model is adjusted to minimize the reconstruction error between the predicted and expected output. Autoencoders have become popular for unsupervised pretraining of a deep neural network. Vincent et al. introduced the use of autoencoders for denoising~\cite{vincent2010stacked}. A denoising autoencoder receives corrupted input samples and is trained with their clean values to learn the prediction of the original data. Thus, the model learns not just a mapping from input to output but also features suitable for denoising.
A traditional autoencoder consists of three fully connected (FC) layers, where the first layer is used for encoding and the last layer for decoding. The middle layer represents the hidden state of the compact data. Chiang et al. use a fully convolutional network as their architecture for denoising electrocardiogram signals~\cite{Chiang2019Apr}. The network consists of multiple convolutional layers for the encoding part and an inversely symmetric decoder using deconvolutional layers. An additional convolutional layer is used for the output. They use a stride of 2 in their convolutional layers to downsample the input signal and upsample it again towards the output. So no pooling layers are required, and the exact signal alignment of input and output is preserved. This results in a compression from a 1024x1-dimensional input signal to a 32x1-dimensional feature map. Since no fully connected layers are used, the number of weight parameters is reduced, and the locally-spatial information is preserved. They use the root mean square error as loss function, which determines the variance between the predicted output and the original signal. Similarly, Garc\'{i}a-P\'{e}rez et al. use a fully-convolutional denoising autoencoder (FCN-dAE) architecture for Non-Intrusive Load Monitoring~\cite{Garcia-Perez2020Dec}. Both have shown that an FCN-dAE outperforms a traditional autoencoder in terms of improving the signal-to-noise ratio.
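To make the structure concrete, the following is a minimal PyTorch sketch of such a stride-2 fully convolutional denoising autoencoder. The channel progression, kernel size, and output layer are illustrative assumptions and not the exact configuration of Chiang et al.; only the stride-2 downsampling to a 32-sample bottleneck and the RMSE loss follow the description above.
\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

class FCNDenoisingAE(nn.Module):
    """Sketch of a stride-2 fully convolutional denoising autoencoder.
    Five conv layers halve the length 1024 -> 32; the decoder mirrors
    them with transposed convolutions (channel counts are assumed)."""
    def __init__(self):
        super().__init__()
        chans = [1, 8, 16, 32, 64, 1]   # assumed channel progression
        enc, dec = [], []
        for cin, cout in zip(chans[:-1], chans[1:]):
            enc += [nn.Conv1d(cin, cout, 16, stride=2, padding=7), nn.ELU()]
        for cin, cout in zip(chans[::-1][:-1], chans[::-1][1:]):
            dec += [nn.ConvTranspose1d(cin, cout, 16, stride=2, padding=7),
                    nn.ELU()]
        self.encoder = nn.Sequential(*enc)   # 1024 -> 32 samples
        self.decoder = nn.Sequential(*dec)   # 32 -> 1024 samples
        self.out = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, x):
        return self.out(self.decoder(self.encoder(x)))

def rmse_loss(pred, target):
    return torch.sqrt(nn.functional.mse_loss(pred, target))

# one training step: corrupted input, clean target
model = FCNDenoisingAE()
noisy = torch.randn(4, 1, 1024)   # corrupted input signals
clean = torch.randn(4, 1, 1024)   # their clean counterparts
loss = rmse_loss(model(noisy), clean)
loss.backward()
\end{lstlisting}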
\subsection{convLSTM-AE}\label{sec:relWorkconvLSTM}
As already mentioned in \secref{sec:relWorkActivityRecognition}, long short-term memory (LSTM) networks are well known in the field of time-series or sequential learning, since they can transfer signal information across time steps. Therefore, there are multiple approaches where LSTMs are also used in autoencoders~\cite{Essien2019Jul, Essien2020Jan, Kim2021Oct}. The networks are similar to FCN-AEs, but convolutional LSTM (convLSTM) layers replace the convolution layers. In contrast to a common LSTM (FC-LSTM), a convLSTM replaces the internal fully connected matrix multiplications with convolution operations~\cite{Shi2015}. So convLSTM layers can also encode spatial information, in addition to knowing which information is important for the current or following time step. Therefore, convLSTM has been applied in many spatial-temporal dependent tasks. Furthermore, fewer weights are needed in comparison to a fully connected LSTM.
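For reference, the gate equations of Shi et al.~\cite{Shi2015} make this replacement explicit; $*$ denotes the convolution operator and $\circ$ the Hadamard product:
\begin{align*}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)\\
H_t &= o_t \circ \tanh(C_t)
\end{align*}
The inputs $X_t$ and the states $H_t$, $C_t$ are tensors with spatial dimensions, which is why a convLSTM preserves the locally-spatial information that an FC-LSTM would flatten away.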
In some other approaches, the spatial encoder/decoder is separated from the temporal encoder/decoder~\cite{Chong2017May, nogas2018fall, Nayak2020Feb, Luo2017Jul}. They use several convolution and deconvolution layers for spatial encoding and decoding, respectively. Between them is the temporal encoder-decoder, which consists of three convLSTM layers. The idea is to make the network computationally efficient, since the convolutional layers reduce the input dimensionality before the more complex convLSTM layers use it.
\subsection{Soft labels}\label{sec:relSoftLabels}
In most cases, the label of a sample $x_i$ is a crisp label $y_i\in K$, which denotes exactly one of the $c$ predefined classes $K\equiv\{1, \dots , c\}$ to which this sample belongs~\cite{Li2012Jul, Sun2017Oct}. Typically, labels are transformed into a one-hot encoded vector $\bm{y}_i=[y^{(1)}_i, \dots, y^{(c)}_i]$ for training, since this is required by the loss function. If the sample is of class $j$, the $j$-th value in the vector is one, whereas all other values are zero. So $\sum_{k=1}^{c}y_i^{(k)}=1$ and $y_i^{(k)}\in\{0,1\}$. For soft labels, $y_i^{(k)}\in[0,1]\subset \mathbb{R}$, which allows assigning the degree to which a sample belongs to a particular class~\cite{ElGayar2006}. Therefore, soft labels can depict uncertainty over multiple classes~\cite{Beleites2013Mar}. Since the output of a human activity recognition model is usually computed by a soft-max layer, it already represents a vector of partial class memberships. Converting it to a crisp label would lead to information loss. So if these predictions are used as soft-labeled training data, the values indicate how certainly the sample belongs to the respective classes.
Soft labels can therefore also be used for more robust training with noisy labels. The noise in a generated label may stem only from a very uncertain prediction. Converting to crisp labels maximizes this noise, whereas it has less impact on the training process when the prediction is kept as a soft label. It has been shown that soft labels can carry valuable information even when noisy. A model trained with multiple sets of noisy labels can thus outperform a model trained on a single set of ground-truth labels~\cite{Ahfock2021Sep, Thiel2008}. Hu et al. used soft-labeled data for Human Activity Recognition to overcome the problem of incomplete or uncertain annotations~\cite{Hu2016Oct}. They outperformed the hard labeling methods in accuracy, precision, and recall.
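As a minimal illustration of this difference, the following sketch trains against soft targets directly. The label values are made up; PyTorch's \texttt{CrossEntropyLoss} accepts class probabilities as targets since version 1.10.
\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

# hypothetical soft pseudo-labels for a two-class problem (null, hw);
# each row is the predicted class membership of one window
soft_targets = torch.tensor([[0.9, 0.1],
                             [0.4, 0.6],
                             [0.2, 0.8]])
logits = torch.randn(3, 2, requires_grad=True)   # model outputs

# soft labels keep the uncertainty of the second sample (0.4/0.6)
loss_soft = nn.CrossEntropyLoss()(logits, soft_targets)

# crisp labels discard it: argmax turns 0.6 into full certainty
crisp_targets = soft_targets.argmax(dim=1)       # -> [0, 1, 1]
loss_crisp = nn.CrossEntropyLoss()(logits, crisp_targets)
\end{lstlisting}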
\chapter{Methods}\label{chap:approach}
This chapter presents my approach to personalizing a Human Activity Recognition model without requiring additional user interaction.
At first, I describe the application that is to be extended. After that, the creation of the synthetic and real-world datasets used for the evaluation tests is shown. Then, I describe the personalization process and how user feedback is used to obtain information about the training data labels. Building on this, I present various denoising techniques to refine the training data. Then I introduce common evaluation metrics and a quality estimation for a personalization that does not rely on ground truth data. The resulting pipeline is then summarized.
Finally, I present an active learning implementation used for performance comparison.
\section{Base Application}\label{sec:approachBaseApplication}
The system I am building on consists of an Android Wear OS application running on a smartwatch and an HTTP web service. It is used to observe the hand washing behavior of a participant to treat OCD. Therefore, all wrist movements are recorded, and surveys of the user's mental condition are collected for each hand washing. If the device detects a hand wash activity, a notification is prompted to the user, which can then be confirmed or declined. A confirmation starts the user's evaluation process. Furthermore, a user can trigger a manual evaluation if the device does not detect a hand washing activity. Psychologists can later use these evaluations to analyze and treat the participant's state during the day. \figref{fig:baseApplicationScreen} shows screenshots of the application.
\input{figures/approach/base_application_screen}
For activity prediction, the application uses a general neural network model based on the work of Robin Burchard \cite{robin2021}. The integrated IMU of the smartwatch is used to record wrist movements, and the sensor data is stored in a buffer. After a cycle of 10 seconds, the stored data is used to predict the current activity. To this end, a sliding window with a length of 3 seconds and a window shift of 1.5 seconds is applied to the buffer. For each window, the largest distance between the sensor values is calculated to filter out sections with little motion. If there is some movement, the general recognition model is applied to the windows of this section to predict the current activity label. To avoid detections based on outlier predictions, a running mean is computed over the last $kw$ predictions. Only if it exceeds a certain threshold $kt$ is the final detection triggered. Additionally, the buffer is saved to an overall recording in the internal storage.
While the smartwatch is charging, all sensor recordings and user evaluations are sent to the web server. There, they are collected and assigned to the respective watches using the Android ID. The server also provides a web interface for accessing and managing all recording sets and participants. In addition, various statistics are generated for the gathered data.
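The detection logic can be summarized in a few lines. The following sketch is a simplified illustration; the concrete values of $kw$ and $kt$ are assumptions, not the application's actual settings.
\begin{lstlisting}[language=Python]
from collections import deque

KW, KT = 10, 0.7               # assumed mean width and threshold

recent = deque(maxlen=KW)      # last KW hand-wash probabilities

def on_window_prediction(p_hw: float) -> bool:
    """Feed the hand-wash probability of one 3 s window; returns True
    when the running mean exceeds KT and a detection is triggered."""
    recent.append(p_hw)
    running_mean = sum(recent) / len(recent)
    return len(recent) == KW and running_mean > KT
\end{lstlisting}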
\section{Datasets}
To personalize a human activity recognition model, it must be re-trained with additional sensor data from a particular user. In our case, these data have to come from the IMUs of wrist-worn devices during various human activities. Typically, they consist of a set $S=\{S_0,\dots,S_{k-1}\}$ of $k$ time series. Each $S_i\in \mathbb{R}^{d_i}$ is a sensor-measured attribute with dimensionality $d_i$ of sensor $i$. Additionally, there is a set of $n$ activity labels $A=\{a_0, \dots, a_{n-1}\}$, and each $S_i$ is assigned to one of them~\cite{Lara2012Nov}. For activity prediction, I use a sliding window approach where I split the data set into $m$ time windows $W=\{W_0, \dots, W_{m-1}\}$ of equal size $l$. The windows are shifted by $v$ time steps, which means that they overlap if $v < l$. Each window $W_i$ contains a sub-set of time series $W_i=\{S_{i\cdot v}, \dots, S_{i\cdot v + l}\}$ and is assigned to an activity label, forming the set of labels $Y=\{y_0, \dots, y_{m-1}\}$.
Most of today's wearable devices contain an accelerometer and a gyroscope with three dimensions each. I combine the sets $S_{acceleration}$ and $S_{gyroscope}$ into one set with $S_i\in \mathbb{R}^{d_{acceleration}+d_{gyroscope}}$. In the case of hand wash detection, I use the activity labels $A=\{null, hw\}$, where \textit{null} represents all activities in which no hand washing is performed, and \textit{hw} represents all hand washing activities.
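A minimal sketch of this windowing, assuming the six-dimensional 50 Hz setup used in this work (3 s windows, 1.5 s shift):
\begin{lstlisting}[language=Python]
import numpy as np

def sliding_windows(signal: np.ndarray, l: int, v: int) -> np.ndarray:
    """Split a (T, d) sensor time series into m windows of length l,
    shifted by v samples (windows overlap when v < l)."""
    m = (len(signal) - l) // v + 1
    return np.stack([signal[i * v:i * v + l] for i in range(m)])

# one minute of fake accelerometer + gyroscope data at 50 Hz
acc_gyro = np.random.randn(3000, 6)
windows = sliding_windows(acc_gyro, l=150, v=75)
print(windows.shape)   # -> (39, 150, 6)
\end{lstlisting}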
\subsection{Synthetic data sets}\label{sec:synDataset}
Several published data sets contain sensor data of wearable devices during various human activities. Since most public data sets are separated into individual parts for each activity, artificial data sets have to be created that consist of a continuous sequence of activities. There should be a realistic ratio between \textit{null} and \textit{hw} samples, such that they form larger parts of non-hand-washing activities with short hand washing parts in between, as in a real-world scenario.
Furthermore, additional data for user feedback covering parts of the time series is required. We can use the general prediction model to determine hand washing parts, as would be done in the base application. In our case, we apply a running mean over multiple windows to the predictions and trigger an indicator $e_i$ at window $W_j$ if it is higher than a certain threshold. This indicator $e_i$ gets the value \textit{correct} if one of the ground truth labels covered by the mean is \textit{hw}; otherwise, it is \textit{false}. This represents the user feedback on confirmed or declined evaluations. A manual user feedback indicator is added for hand wash sequences where no indicator has been triggered. Using ground truth data for adding indicators simulates a perfectly disciplined user who answers every evaluation.
Since adjacent windows tend to have the same activity, one indicator can cover several windows. I assume that a single user feedback is assigned to an entire execution of the related activity, i.e., until the labels of two consecutive windows differ. Therefore, I added a \textit{neutral} value when multiple indicators cover the same activity sequence, so that only one of them represents the actual user feedback. In addition, there is a minimum gap of the same size as the base application's buffer between two indicators. In \figref{fig:exampleSyntheticSataset} you can see the plot of a synthetic data set with five hand wash activities over a time span of about one hour.
\input{figures/approach/example_dataset}
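A condensed sketch of this indicator simulation is shown below; the running-mean width, threshold, and gap are assumed values, and the \textit{neutral} and \textit{manual} cases are omitted for brevity.
\begin{lstlisting}[language=Python]
import numpy as np

def simulate_indicators(p_hw, truth, kw=10, kt=0.7, gap=20):
    """Trigger an indicator where the running mean of hand-wash
    predictions exceeds kt; mark it 'correct' if the mean covers a
    ground-truth hw window, otherwise 'false'."""
    indicators, last = [], -gap
    for j in range(kw, len(p_hw)):
        if p_hw[j - kw:j].mean() > kt and j - last >= gap:
            value = "correct" if truth[j - kw:j].any() else "false"
            indicators.append((j, value))
            last = j
    return indicators

p_hw = np.random.rand(500)          # fake hw probabilities
truth = np.zeros(500, dtype=bool)
truth[100:115] = True               # one hand wash sequence
print(simulate_indicators(p_hw, truth))
\end{lstlisting}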
\subsubsection{Used data sets}
For this work, I used data sets from the University of Basel and the University of Freiburg [REF]. These include hand washing data that was recorded using a smartwatch application. Additionally, they contain long-term recordings of everyday activities. The data is structured by individual participants and multiple activities per recording. During the generation of a synthetic data set, the data of a single participant is selected randomly. To cover enough data, I had to combine the data sets and merge single participants across them. Therefore, a resulting data set for a user contains multiple participants, which I treat as one. This only affects the data for \textit{null} activities; all hand wash activity data is from the same user. Since the same data sets have already been used to train the base model, I had to retrain an individual base model for each participant that contains none of that participant's own data.
\subsection{Recorded data sets}
The base application records and transfers all sensor data and evaluations of daily usage to a web server. Therefore, I could use the application to collect additional training sets for my thesis. However, some analyses require ground truth activity labels, which are not included in the recordings. Therefore, I have added markers that depict the exact beginning and end of a hand wash action, so the ground truth data can be derived. These markers are set by hand and are not part of the final personalization process. The generated datasets have the same format as the synthetic datasets described in \secref{sec:synDataset}; only the indicators are not generated but taken from the real user feedback.
\section{Personalization}\label{sec:approachPersonalization}
In this work, I employ a personalization approach that does not require additional user interaction and can be added to the existing application without changes in the model's architecture. So the base application can still work by itself, and the user does not notice any changes in the usage.\\
Requirements for this process are:
\begin{itemize}
\item[1)]Collection of all sensor data\\ Since an application has to listen to the sensor data anyway for its predictions, it should be possible to additionally save it in the internal storage. Furthermore, the data must be provided to the personalization service in some way. In my case, the sensor recordings are transmitted to a central point anyway, so it is easy to access them.
\item[2)]User responses to predictions\\ These can be simple yes/no feedback. I assume that there is at least some kind of confirmation when an activity is detected, such that the application can perform its task. For example, the application in this work is used to query an evaluation from the user if a hand wash activity is performed. Therefore, I can deduce that the recognition was correct should such an evaluation have taken place. If the user declines, I know the prediction was incorrect. In cases with neither confirmation nor rejection, it is possible to ignore this part or treat it like a false prediction. In \secref{sec:expMissingFeedback} I show that especially the confirmation of predictions impacts the personalization performance. Therefore, the no-feedback case can be neglected if handling it is not possible to implement or would change the usage.
\end{itemize}
I want to personalize the \textbf{DeepConvLSTM-A} neural network model, implemented and trained by Robin Burchard~\cite{robin2021}. It consists of four 1D-convolutional layers followed by an LSTM layer. It is extended by an attention mechanism, which allows concentrating on important parts of the data. This is done by a fully connected layer that computes weights over the time steps of the LSTM layer. These are then used for a weighted sum over the LSTM's hidden states. Finally, a fully connected linear layer is used for classification. As input, the model requires a window of 150 time-series samples with a dimensionality of 6. This corresponds to a three-second sensor reading of acceleration and gyroscope at 50 Hz.
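The following PyTorch sketch illustrates this structure. It is only an approximation of the description above; channel counts, kernel sizes, and the hidden dimension are assumed values, not Burchard's exact configuration.
\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

class DeepConvLSTMA(nn.Module):
    """Rough sketch of the described architecture; all layer sizes
    (channels, kernel widths, hidden units) are assumed values."""
    def __init__(self, n_features=6, n_classes=2, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(                # four 1D conv layers
            nn.Conv1d(n_features, 64, 5), nn.ReLU(),
            nn.Conv1d(64, 64, 5), nn.ReLU(),
            nn.Conv1d(64, 64, 5), nn.ReLU(),
            nn.Conv1d(64, 64, 5), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.attention = nn.Linear(hidden, 1)     # weight per time step
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                         # x: (B, 150, 6)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (B, T', 64)
        h, _ = self.lstm(z)                       # (B, T', hidden)
        w = torch.softmax(self.attention(h), dim=1)       # (B, T', 1)
        context = (w * h).sum(dim=1)              # weighted sum over time
        return self.classifier(context)           # (B, n_classes)

model = DeepConvLSTMA()
logits = model(torch.randn(8, 150, 6))            # 3 s windows at 50 Hz
\end{lstlisting}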
To personalize the model, I use transfer learning in a domain adaptation manner, as described in \secref{sec:relWorkPersonalization}. This requires additional labeled training data. Due to condition $1)$, all sensor data from a user is available, but without any labels. Therefore, I generate pseudo labels based on the predictions of the general activity recognition model. Additionally, these labels are refined as described in the following \secref{sec:approachLabeling}. The use of pseudo labels leads to supervised training. This allows the model architecture to be kept unchanged, and the main parts of the original training implementation and hyperparameter settings elaborated by Robin Burchard~\cite{robin2021} can be reused.
\subsubsection{Regularization}\label{sec:approachRegularization}
To avoid over-fitting to the target during multiple iterations of personalization, I try two different approaches. The first approach is to freeze the feature layers of the model. Yosinski et al. show that feature layers tend to be more generalizable and can be better transferred to a new domain~\cite{Yosinski2014}. Therefore, the personalization is only applied to the classification layer. So fewer parameters have to be fine-tuned, which results in less computation time, and even a small amount of training data can have a significant effect on the model. In the second approach, I apply the L2-SP penalty to the optimization, as proposed by Xuhong et al.~\cite{xuhong2018explicit}. Here, the regularization restricts the search space to the vicinity of the initial model parameters. Information learned during pre-training therefore persists even over multiple fine-tuning iterations. This allows all parameters to be adjusted, which offers more flexibility in fine-tuning. To test which approach fits best, I compare them in \secref{sec:expEvolCompFilter}.
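Both options can be expressed in a few lines of PyTorch. The sketch below uses a hypothetical stand-in model with \texttt{features} and \texttt{classifier} parts, and the penalty weight \texttt{alpha} is an assumed value.
\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for the recognition model, split into feature
    and classification layers (sizes are arbitrary)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(6, 16)
        self.classifier = nn.Linear(16, 2)
    def forward(self, x):
        return self.classifier(torch.relu(self.features(x)))

model = TinyNet()

# option 1: freeze the feature layers, fine-tune only the classifier
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("classifier")

# option 2: L2-SP -- penalize the distance to the pre-trained weights
start = {n: p.detach().clone() for n, p in model.named_parameters()}

def l2_sp_penalty(model, start, alpha=0.01):   # alpha is assumed
    return alpha * sum(((p - start[n]) ** 2).sum()
                       for n, p in model.named_parameters())

x, y = torch.randn(8, 6), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y) + l2_sp_penalty(model, start)
loss.backward()
\end{lstlisting}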
\section{Training data labeling}\label{sec:approachLabeling}
To retrain the base model with new sensor data, I use a semi-supervised approach based on pseudo labels, as described in \secref{sec:relWorkPseudoLabeling}. Since there is already a trained model $M_p$, I can make use of its supervision from pre-training to generate artificial labels $\hat{Y}=\{\hat{y}_0, \dots, \hat{y}_{m-1}\}$ based on predictions from new unlabeled sensor data $W=\{W_0, \dots, W_{m-1}\}$, where $\hat{y}_i=M_p(W_i)$. Therefore, it is easy to get pseudo labels for all new sensor data. This process does not require supervision but allows the model to be trained in the same way as with a supervised method. So the model architecture and training implementation can remain unchanged.
However, it is very likely that there are some wrong predictions that negatively affect the training, which is also called label noise. It is important to keep the amount of noise as low as possible. To observe the impact of wrongly classified data, I ran several tests, which are presented in \secref{sec:expTransferLearningNoise}. As seen in \secref{sec:relWorkLabelNoise}, there are multiple approaches to make the training more robust against label noise. I use the predicted values as soft labels to depict the uncertainty of the model. So the set of pseudo labels consists of vectors instead of crisp labels, $\hat{Y}=\{\hat{\bm{y}}_0, \dots, \hat{\bm{y}}_{m-1}\}$, where each $\hat{\bm{y}}_i = \begin{bmatrix}\hat{y}_i^{null}& \hat{y}_i^{hw}\end{bmatrix}$. The value $\hat{y}_i^{null}$ is the predicted membership for class \textit{null} and $\hat{y}_i^{hw}$ for class \textit{hw}. \figref{fig:examplePseudoSataset} shows an example plot of the predicted pseudo values of the previously seen dataset. In the following, I call a pseudo label $\hat{\bm{y}}_i$ \textit{null} if $\hat{y}_i^{null} > \hat{y}_i^{hw}$, and \textit{hw} or hand wash if $\hat{y}_i^{null} < \hat{y}_i^{hw}$.
\input{figures/approach/example_pseudo_labels}
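Generating these soft pseudo-labels amounts to a single forward pass of the general model. A minimal sketch, with a dummy stand-in for $M_p$:
\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

@torch.no_grad()
def generate_pseudo_labels(model, windows):
    """Predict soft pseudo-labels [y_null, y_hw] for unlabeled windows."""
    model.eval()
    return torch.softmax(model(windows), dim=1)   # (m, 2)

# dummy stand-in for the pre-trained model M_p and m unlabeled windows
M_p = nn.Sequential(nn.Flatten(), nn.Linear(150 * 6, 2))
windows = torch.randn(32, 150, 6)

pseudo = generate_pseudo_labels(M_p, windows)
is_hw = pseudo[:, 1] > pseudo[:, 0]   # crisp view: hw vs. null
\end{lstlisting}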
For further improvements, I rely on knowledge about the context of hand washing. I know that most labels should have the value \textit{null} and that only short runs of adjacent windows, spanning time intervals of around 20 seconds, have the value \textit{hw}. Additionally, there is user feedback on certain model predictions from which information can also be drawn.
\subsection{User Feedback}
For each section where the running mean of model predictions reaches a certain threshold, an indicator is created which is either \textit{neutral}, \textit{false} or \textit{correct}. Moreover, there can be indicators of type \textit{manual}. These provide the following information about the respective predictions.
\begin{itemize}
\item neutral:\\ The participant has not answered this query. However, if there is another indicator immediately afterward, both will probably cover the same activity. So we can assume the same value as that of the following indicator. If this is also \textit{neutral}, we continue this assumption over all following indicators until either an indicator with the value \textit{false}/\textit{correct} exists or the distance between two adjacent indicators is so large that we can no longer assume the same activity. In the second case, no precise statement about the activity or prediction can be made.
\item false:\\ The participant has declined the predicted activity, so the prediction is false. Since the indicator is just a single time stamp, we need to specify a time interval in which we assume the predicted activity might have occurred. Within this time interval, the predictions which led to the rejected activity are likely to be wrong.
\item correct:\\ The participant has confirmed the prediction. Again, we have to estimate a time interval in which the activity could have occurred. Within this interval, the predictions that state the same activity as the confirmed one are probably correct, and outliers should be set to the value of the confirmed activity. Nevertheless, we do not precisely know the boundaries of the confirmed activity.
\item manual:\\ The participant triggered a manual activity, so the running mean has not exceeded the threshold. However, predictions of the matching activity are probably correct. It could be that the execution was too short to push the mean above the threshold or that too many predictions were false. Since a manual indicator could be given some time after the activity, the possible time interval is significantly larger than for the detected activities. Therefore, it is recommended to specify in advance the maximum delay within which a user should trigger manual feedback.
\end{itemize}
In the case of hand washing, for a \textit{false} indicator $i$ at window $w$, I set the activity interval to $I^{false}_i=[w-35, w]$, which represents a time span from 46 seconds before until the time of the indicator. For a \textit{correct} indicator, the interval is set to $I^{correct}_i=[w-70, w+10]$ (93 seconds before to 13 seconds after), for a \textit{neutral} indicator to $I^{neutral}_i=[w-35, w+5]$ (46 seconds before to 6 seconds after), and for a \textit{manual} indicator to $I^{manual}_i=[w-130, w-8]$ (173 seconds before to 10 seconds before). In tests, these values have shown good average coverage. \figref{fig:exampleSyntheticIntervals} shows the example dataset with highlighted activity intervals. All intervals are elements of their corresponding set, where:
\begin{align*}
\mathds{I}^{false}&=\{I^{false}_0, \dots, I^{false}_{nf-1}\}\\
\mathds{I}^{correct}&=\{I^{correct}_0, \dots, I^{correct}_{nc-1}\}\\
\mathds{I}^{neutral}&=\{I^{neutral}_0, \dots, I^{neutral}_{nn-1}\}\\
\mathds{I}^{manual}&=\{I^{manual}_0, \dots, I^{manual}_{nm-1}\}\\
\mathds{I}^{positive} &= \mathds{I}^{correct} \cup \mathds{I}^{manual}
\end{align*}
The set $\mathds{I}^{positive}$ is the union of the correct and manual sets and depicts all \textit{positive} intervals, each of which should contain a hand wash sequence.
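In code, these interval definitions reduce to a lookup table of window offsets around each indicator position. A small sketch, with the offsets taken from above and everything else illustrative:
\begin{lstlisting}[language=Python]
# indicator type -> window offsets (before, after) around position w
INTERVALS = {
    "false":   (-35,   0),
    "correct": (-70,  10),
    "neutral": (-35,   5),
    "manual":  (-130, -8),
}

def activity_interval(kind: str, w: int) -> range:
    lo, hi = INTERVALS[kind]
    return range(w + lo, w + hi + 1)   # covered window indices

# I^positive is the union of the correct and manual intervals
indicators = [("correct", 400), ("manual", 900), ("false", 1300)]
positive = [activity_interval(k, w) for k, w in indicators
            if k in ("correct", "manual")]
\end{lstlisting}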
However, we cannot be sure that a user has always answered the queries or triggered a manual indicator for all unrecognized activities. In \secref{sec:expMissingFeedback}, I observe the performance impact of incomplete user feedback. Moreover, I try to make the learning process robust against missing feedback. Since the indicators are created during the usage of the application, I assume that no incorrect feedback is given, because it would also lead to a worse user experience. For example, if the user rejects correctly detected hand washing activities, no evaluation can be performed, which is contrary to the purpose of the application.
\input{figures/approach/example_dataset_feedback}
\subsection{Denoising}
The pseudo-labels are derived from the predictions of the general model. Therefore, we cannot be sure that they are correct. However, the indicators offer a source of ground truth information about the underlying activity. In this section, I describe different approaches for using the indicators to refine the pseudo-labels.
In the first step, we can use the raw information of the indicators. As stated above, during a \textit{false} interval we know that the hand wash predictions are false. So we can set all predictions within the interval to \textit{null}, i.e., the soft label vectors to $\hat{\bm{y}}_i = \begin{bmatrix}1 & 0\end{bmatrix}$. For neutral intervals, it is difficult to make a statement. For example, it could be that a user tends more to confirm all correctly predicted activities than to decline false predictions, because a confirmation leads to the intended usage of the application. In that case, we can assume that all correctly detected activities are confirmed and that unanswered ones are false predictions, so all labels in \textit{neutral} intervals can be set to \textit{null}. Otherwise, we cannot make a safe statement about the correctness of the prediction and simply exclude these sections from training. Another approach would be to use only data covered by a \textit{false} or \textit{correct} indicator. By correcting the false labels, and with the knowledge that confirmed predicted labels are probably correct, there is high certainty in the training data. But labels within \textit{positive} intervals can still contain false predictions. As shown in \secref{sec:expTransferLearningNoise}, \textit{hw} samples labeled as \textit{null} do not have as much negative impact on the performance as \textit{null} samples labeled as \textit{hw}. Therefore, it is crucial to ensure that no pseudo label of a \textit{null} sample inside a \textit{positive} interval has a high value for hand washing. We know that inside the interval, there should be just one set of adjacent samples with label \textit{hw}; all others should be of type \textit{null}. In the following, I concentrate on approaches to identify exactly this subset and to correct all labels inside a \textit{positive} interval accordingly.
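A sketch of this first, indicator-only refinement step follows; \texttt{pseudo} is the $(m, 2)$ tensor of soft labels from above, the interval lists are hypothetical, and the flag selects how \textit{neutral} intervals are treated.
\begin{lstlisting}[language=Python]
import torch

def apply_indicator_corrections(pseudo, false_ivs, neutral_ivs,
                                neutral_as_null=True):
    """Overwrite soft labels with raw indicator knowledge; returns the
    labels and a mask of windows kept for training."""
    keep = torch.ones(len(pseudo), dtype=torch.bool)
    for iv in false_ivs:                    # declined predictions
        pseudo[iv.start:iv.stop] = torch.tensor([1.0, 0.0])
    for iv in neutral_ivs:                  # unanswered queries
        if neutral_as_null:
            pseudo[iv.start:iv.stop] = torch.tensor([1.0, 0.0])
        else:
            keep[iv.start:iv.stop] = False  # exclude from training
    return pseudo, keep

pseudo = torch.rand(2000, 2)                # fake soft pseudo-labels
pseudo, keep = apply_indicator_corrections(
    pseudo, false_ivs=[range(1265, 1301)], neutral_ivs=[range(565, 606)])
\end{lstlisting}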
\subsubsection{Naive}
In a naive approach, we search for the largest group of neighboring pseudo-labels with a high value for hand washing. This is done by computing a score over all subsets of adjacent labels. The score of a subset $Sub_k=\{\hat{\bm{y}}_p, \dots, \hat{\bm{y}}_q\}$ is computed by:
\begin{align}
Score_k&=\sum_{\hat{\bm{y}}_i\in Sub_k}(-\delta+(\hat{y}^{hw}_i - 0.5))
\end{align}
The score only benefits from adding a label to the subset if the predicted value for hand washing is greater than the \textit{null} value. Additionally, there is a general penalty $\delta$ for adding a label, so the prediction has to be at least somewhat certain that the sample is probably hand washing. The subset with the highest score is assumed to be the hand wash action, and all pseudo labels it contains are set to $\hat{\bm{y}}_i = \begin{bmatrix}0 & 1\end{bmatrix}$. All other labels are set to $\hat{\bm{y}}_j = \begin{bmatrix}1 & 0\end{bmatrix}$. This approach depends heavily on the performance of the base model. Incorrect predictions can completely shift the assumed subset to the wrong samples. It may also happen that \textit{null} predictions between correct \textit{hw} predictions split the possible subset, which results in only a partial coverage of the original activity. In \figref{fig:examplePseudoFilterScore} you can see a plot of two example intervals where this approach refines the pseudo labels.
\input{figures/approach/example_pseudo_filter_score}
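Since the score is a sum of per-label gains, the best subset can be found with a maximum-subarray scan instead of enumerating all subsets. A sketch, with an assumed penalty $\delta = 0.05$:
\begin{lstlisting}[language=Python]
import numpy as np

def best_hw_subset(y_hw: np.ndarray, delta: float = 0.05):
    """Find the contiguous subset [p, q] maximizing
    sum(-delta + (y_hw_i - 0.5)); delta is an assumed penalty value."""
    best, best_range = 0.0, None
    score, start = 0.0, 0
    for i, y in enumerate(y_hw):          # Kadane-style scan
        gain = -delta + (y - 0.5)
        if score <= 0.0:
            score, start = gain, i
        else:
            score += gain
        if score > best:
            best, best_range = score, (start, i)
    return best_range

y_hw = np.array([0.1, 0.2, 0.8, 0.9, 0.4, 0.95, 0.85, 0.3, 0.1])
p, q = best_hw_subset(y_hw)               # -> (2, 6)
labels = np.tile([1.0, 0.0], (len(y_hw), 1))   # all null...
labels[p:q + 1] = [0.0, 1.0]                   # ...except the hw subset
\end{lstlisting}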
\subsubsection{Deep Convolutional network}
Convolutional networks have become a popular method for image and signal classification. I use a convolutional neural network (CNN) to predict the value of a pseudo label given the surrounding pseudo labels. It consists of two 1D-convolutional layers and a linear layer for the classification output. Both convolutional layers have a stride of 1, and padding is applied. The kernel size of the first layer is 10, that of the second is 5. They convolve along the time axis over the \textit{null} and \textit{hw} values. As activation function, I use the Rectified Linear Unit (ReLU) after each convolutional layer. For the input, I apply a sliding window of length 20 and shift 1 over the pseudo labels inside a \textit{hw} interval. This results in a 20x2 dimensional network input, from which the network generates a 1x2 output. After applying a softmax function, the output is the new pseudo soft-label at the window's middle position.
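A sketch of this architecture in PyTorch could look as follows; the hidden channel count and the exact padding values are not specified above and are therefore assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

class LabelDenoisingCNN(nn.Module):
    """Two 1D convolutions over a window of 20 soft labels
    (channels: null, hw), followed by a linear classification head."""
    def __init__(self, hidden=16):         # hidden channel count assumed
        super().__init__()
        self.conv1 = nn.Conv1d(2, hidden, kernel_size=10, padding=5)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=5, padding=2)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(hidden * 21, 2)  # 21 = padded output length

    def forward(self, x):                  # x: (batch, 2, 20)
        h = self.relu(self.conv1(x))       # (batch, hidden, 21)
        h = self.relu(self.conv2(h))       # (batch, hidden, 21)
        return self.fc(h.flatten(1))       # logits for (null, hw)

window = torch.rand(1, 2, 20)              # one window of pseudo labels
new_label = torch.softmax(LabelDenoisingCNN()(window), dim=1)
\end{verbatim}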
To train the network, I used the approach from \secref{sec:synDataset} and created multiple synthetic datasets. On these datasets, I predicted the pseudo labels with the base model. Additionally, I augmented the labels by adding noise and random label flips. After that, I extracted the \textit{correct} intervals. This results in roughly 400 intervals with $\sim1300$ windows, which were shuffled before training. As loss function, I used cross-entropy.
\figref{fig:examplePseudoFilterCNN} shows the same example intervals as before, but with the pseudo labels refined by the CNN approach. This approach can make use of the soft-labels and smooths out the activity boundaries, so imprecise boundaries have less impact on the training process. However, a local group of wrongly predicted values from the base model can still lead to incorrect pseudo labels.
\input{figures/approach/example_pseudo_filter_cnn}
\subsubsection{Autoencoder}
I tried different implementations of autoencoders for denoising. They all take a 1x128 dimensional noisy input and output a 1x128 dimensional clean signal. The size of 128 has been chosen because it is a power of two and covers all \textit{positive} intervals. Accordingly, I enlarge the \textit{hw} and \textit{manual} intervals and use only the soft label values $\hat{y}^{hw}_i$, so a whole interval is processed in one step. Since the output likewise contains only the soft label values of the hand wash probability, I recompute the \textit{null} values of each label by $\hat{y}^{null}_i=1-\hat{y}^{hw}_i$.
The first approach is a fully convolutional denoising autoencoder (FCN-dAE). The encoding part consists of three 1D-convolutional layers with kernel sizes of 8, 4, and 4 and strides of 2, 2, and 1. It encodes the input to a 32x64 dimensional feature map after the first layer, a 12x34 feature map after the second layer, and a 1x33 feature map after the third layer. The decoder part is inversely symmetric and consists of 1D-deconvolutional layers that reconstruct the input back to 1x128 values. As activation function, I apply the Exponential Linear Unit (ELU) after each layer, and a sigmoid function is used for the output. For training, I created, as for the CNN, multiple synthetic datasets and their predictions with additional noise and label flips. After the extraction of the \textit{correct} intervals, I extended them to 128 values. During training, a mean squared error loss function measures the deviation between the denoised input and the clean ground truth values. \figref{fig:examplePseudoFilterFCNdAE} shows the example intervals where this filter was applied.
\input{figures/approach/example_pseudo_filter_fcndae}
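Under the layer sizes stated above, the FCN-dAE could be sketched in PyTorch as follows; the padding values are assumptions chosen so that the stated feature map sizes (64, 34, 33) and the mirrored decoder are reproduced exactly.
\begin{verbatim}
import torch.nn as nn

class FCNdAE(nn.Module):
    """Fully convolutional denoising autoencoder for 1x128 intervals."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, 8, stride=2, padding=3), nn.ELU(),  # -> 32x64
            nn.Conv1d(32, 12, 4, stride=2, padding=3), nn.ELU(), # -> 12x34
            nn.Conv1d(12, 1, 4, stride=1, padding=1), nn.ELU(),  # -> 1x33
        )
        self.decoder = nn.Sequential(  # inversely symmetric
            nn.ConvTranspose1d(1, 12, 4, stride=1, padding=1), nn.ELU(),
            nn.ConvTranspose1d(12, 32, 4, stride=2, padding=3), nn.ELU(),
            nn.ConvTranspose1d(32, 1, 8, stride=2, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):      # x: (batch, 1, 128) noisy hw soft labels
        return self.decoder(self.encoder(x))

# Training step (sketch): loss = nn.MSELoss()(model(noisy), clean)
\end{verbatim}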
In another approach, I use convLSTM denoising autoencoders in different configurations. As with the FCN-dAE, I use only the $\hat{y}^{hw}_i$ values and compute the missing ones afterward. Since LSTMs are designed for sequential data, the single interval must be converted into a time-series sequence. Therefore, I apply a sliding window of width 32 and shift 1. This creates 96 consecutive sections of the interval, each with 32 values. The autoencoder performs a sequence-to-sequence prediction on the 96x1x32 dimensional input, so the output is also a 96x1x32 dimensional sequence. To recreate the 128x1 dimensional interval, I compute the mean of the predictions for each sample over the sequence.
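The conversion between interval and window sequence could be sketched as follows, assuming the interval is a 1D NumPy array; averaging all overlapping window predictions reconstructs the per-sample values.
\begin{verbatim}
import numpy as np

def to_sequence(interval, width=32, shift=1):
    """Slice an interval into overlapping windows for the convLSTM."""
    starts = range(0, len(interval) - width + 1, shift)
    return np.stack([interval[i:i + width] for i in starts])

def from_sequence(windows, length=128, shift=1):
    """Average all window predictions that cover each sample."""
    sums, counts = np.zeros(length), np.zeros(length)
    for i, w in enumerate(windows):
        sums[i * shift:i * shift + len(w)] += w
        counts[i * shift:i * shift + len(w)] += 1
    return sums / counts
\end{verbatim}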
I implemented three different network architectures, which I call convLSTM1-dAE, convLSTM2-dAE, and convLSTM3-dAE. The convLSTM layers are bidirectional, so they can be used in both the encoder and decoder parts. All networks use ELU activation functions and a sigmoid for the output, just like the FCN-dAE architecture. Likewise, the mean squared error is used to compute the loss between the predicted and the clean ground truth sequences.
The architecture of convLSTM1-dAE uses two convLSTM layers each for encoding and decoding, both with a stride of 1. The first layer takes the 96x1x32 dimensional input and applies a convolutional kernel of width 5. After that, the 96x128x32 dimensional feature map is convolved by a kernel of width 3 to 96x64x32 dimensional features. The decoding part inversely reconstructs this.\\
In convLSTM2-dAE, the spatial and temporal encoder/decoder are separated. It uses two convolutional layers for spatial encoding and two deconvolutional layers for decoding. In between, there are three convLSTM layers for temporal encoding/decoding. The convolutional layers compress the 96x1x32 dimensional input to a 96x32x8 dimensional feature map using kernels of width 3 and a stride of 2. In the temporal encoder/decoder, the convLSTMs also apply a kernel of width 3 and transform the features to 96x128x8 dimensions and back to 96x32x8. Again, the deconvolutional layers are inversely symmetric to the spatial encoder.\\
For convLSTM3-dAE, only convLSTM layers with a stride of 1 are used, as in convLSTM1-dAE, but with an additional layer each for encoder and decoder. So the network consists of three bidirectional convLSTM layers each for encoding and decoding. Kernels of widths 7, 5, and 3 are applied, which generate a hidden feature map of dimension 96x32x32. \figref{fig:examplePseudoFilterconvLSTM} shows the predicted output of the different networks applied to the previous example intervals. All of them result in similar pseudo-labels; several tests have shown that convLSTM3-dAE performs best.
\input{figures/approach/example_pseudo_filter_lstmdae}
\subsection{Filter configurations}\label{sec:approachFilterConfigurations}
During testing, I created multiple filter configurations consisting of different constellations of the introduced denoising approaches. These configurations and their detailed descriptions can be seen in \tabref{tab:filterConfigurations}. Some rely on ground truth data and are only used for evaluation. The configurations \texttt{all, high\_conf, scope, all\_corrected\_null, scope\_corrected\_null, all\_corrected\_null\_hwgt} and \texttt{scope\_corrected\_null\_hwgt} depict baselines to observe what impact the different parts could have on the training. \texttt{all} and \texttt{high\_conf} show simple approaches where no user feedback is considered. Configuration \texttt{scope} depicts the difference between including potentially incorrect data and using only data where additional information is available. To show the improvements gained by simply correcting false predictions, \texttt{all\_corrected\_null} and \texttt{scope\_corrected\_null} are used. This is extended by the theoretical evaluations \texttt{all\_corrected\_null\_hwgt} and \texttt{scope\_corrected\_null\_hwgt}, which state an upper bound for a hypothetical perfect filter on \textit{hw, manual} intervals. The \texttt{all\_null\_*} configurations rely on the context knowledge that hand washing makes up far less of the data than all other activities. So we can assume that all labels should be of value \textit{null}, and only inside \textit{hw, manual} intervals are there some of value \textit{hw}. This depends on how reliably a user has specified all hand wash actions. Here, the focus lies especially on the performance of the introduced denoising approaches. Again, \texttt{all\_null\_hwgt} represents the theoretical upper bound if a perfect filter existed. As a more general approach, the \texttt{all\_cnn\_*} configurations do not make such a hard contextual statement and attempt to combine the cleaning abilities of the CNN network with high confidence in the resulting labels to augment the training data with likely correct samples.
\input{figures/approach/table_filter_configurations}
\subsection{Evaluation metrics}\label{sec:approachMetrics}
To determine a model's performance, the predicted output is usually compared with the ground truth data. In a binary classification problem, the predictions can be represented in a 2x2 confusion matrix. It contains the number of labels which are either: true positive (TP) if a sample is correctly labeled as positive, false positive (FP) if a negative sample is incorrectly labeled as positive, true negative (TN) if a sample is correctly labeled as negative, or false negative (FN) if a positive sample is incorrectly labeled as negative. In our case, labels of class \textit{null} represent the negative and labels of class \textit{hw} the positive class. The resulting confusion matrix is shown in \tabref{tab:confusionMatrix}.
\input{figures/approach/table_confusion_matrix}
A common measurement for the performance of a classifier is the accuracy, which is the percentage of correctly predicted labels over all samples. However, this does not take the highly imbalanced class distribution into account. Since only a few labels would be hand washing, a model that predicts only \textit{null} would still achieve high accuracy but fail at our task. Therefore, I consider other metrics which are more robust against class imbalance. These are specificity, sensitivity, F1 score, and S score, which are defined as~\cite{Chicco2020Dec}:
\begin{align}
Sensitivity = Recall &= \frac{TP}{TP + FN}\\
Specificity &= \frac{TN}{TN + FP}\\
Precision &= \frac{TP}{TP+FP}\\
F_1~score &= 2\cdot\frac{Precision\cdot Recall}{Precision + Recall}\\
S~score &= 2\cdot\frac{Specificity\cdot Sensitivity}{Specificity + Sensitivity}
\end{align}
The sensitivity gives the ratio of correctly recognized hand wash samples, the specificity the ratio of correctly recognized non-hand-wash samples. Both have to be close to 1 for a well-performing model. Precision is the ratio of correctly predicted hand wash samples over all predicted hand wash samples. The F1 score is the harmonic mean of recall and precision, where recall is the same as sensitivity. It therefore measures the trade-off between detecting all hand wash activities and detecting them only if the user is actually washing their hands. Similarly, the S score is the harmonic mean of sensitivity and specificity; in contrast to the F1 score, it measures false positives relative to all negative samples rather than to the predicted positives.
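As a small illustration, these crisp metrics follow directly from the confusion matrix counts:
\begin{verbatim}
def crisp_scores(tp, fp, tn, fn):
    """Class-imbalance-robust metrics from confusion matrix counts."""
    sensitivity = tp / (tp + fn)               # = recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    s = 2 * specificity * sensitivity / (specificity + sensitivity)
    return sensitivity, specificity, precision, f1, s
\end{verbatim}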
However, these metrics only consider crisp label values, so they are unable to reflect the uncertainty of the labels. Therefore, I use a slightly different definition of the previous metrics, which works with class membership values~\cite{Beleites2013Mar, Beleites2015Feb}:
\begin{align}
Sensitivity^{soft} = Recall^{soft} &= 1-\sum_n \frac{Y}{\sum_n Y}|\hat{Y}-Y|\\
Specificity^{soft} &= 1-\sum_n \frac{1-Y}{\sum_n (1-Y)}|\hat{Y}-Y|\\
Precision^{soft} &= 1-\sum_n \frac{\hat{Y}}{\sum_n \hat{Y}}|\hat{Y}-Y|\\
F_1^{soft}~score &= 2\cdot\frac{Precision^{soft}\cdot Recall^{soft}}{Precision^{soft} + Recall^{soft}}\\
S^{soft}~score &= 2\cdot\frac{Specificity^{soft} \cdot Sensitivity^{soft}}{Specificity^{soft} + Sensitivity^{soft}}
\end{align}
The terms $Y$ and $\hat{Y}$ are the sets of soft-label vectors of the ground truth labels and the predicted labels, respectively. These metrics compute the absolute regression error $|\Delta| = |\hat{Y}-Y|$ between prediction and ground truth and weight it by the reference memberships. Since $|\Delta|$ is the error, we take the complement $1-|\Delta|$ to obtain a performance measure. Thus, a soft-label vector with both values near $0.5$ has a different impact on the metric than a vector with one value near $1$.
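A minimal NumPy sketch of these soft metrics, assuming \texttt{y\_true} and \texttt{y\_pred} hold the hand wash memberships of the ground truth and the predictions:
\begin{verbatim}
import numpy as np

def soft_scores(y_true, y_pred):
    """Membership-weighted soft metrics from 1D hw-membership arrays."""
    err = np.abs(y_pred - y_true)                       # |delta|
    sens = 1 - np.sum(y_true / y_true.sum() * err)
    spec = 1 - np.sum((1 - y_true) / (1 - y_true).sum() * err)
    prec = 1 - np.sum(y_pred / y_pred.sum() * err)
    f1 = 2 * prec * sens / (prec + sens)
    s = 2 * spec * sens / (spec + sens)
    return sens, spec, prec, f1, s
\end{verbatim}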
\section{Personalization quality estimation}\label{sec:approachQualityEstimation}
New models resulting from the personalization should be evaluated to ensure that the new model performs better than the one currently used. To determine the quality of a model, the predicted values are usually compared to ground truth data using metrics like those introduced in \secref{sec:approachMetrics}. However, in our case, no ground truth data is available. Therefore, I again use the information given by the indicators. I assert that a model's performance is reflected in the application's resulting behavior, in our case, the situations in which a hand washing activity is detected and a notification is prompted to the user. So we can simulate the application's behavior using the new model on an existing recording and compare the potentially detected hand wash sections with the actual user feedback.
To simulate the application, I use the new model to predict the classes of a recording and compute the running mean over the predictions. Additionally, low movement sections are detected, and their predictions are set to \textit{null}. This is equal to the application's filter, where no prediction model is applied to low movement sections. At each sample where the mean of the \textit{hw} predictions is higher than the given threshold, I check whether the label lies inside a \textit{hw} or \textit{manual} interval. If yes, it is counted as a true positive (TP) prediction; otherwise, it is a false positive (FP) prediction. Since the application buffers multiple samples for prediction, I require a minimum distance between two detections.
To observe wrongly predicted activities, I make the same assumption as for the \texttt{all\_null\_*} filter configurations in~\secref{sec:approachFilterConfigurations}: if a hand wash activity is detected in a section not covered by a \textit{hw} or \textit{manual} interval, it is probably a false detection. If the running mean has not exceeded the threshold within a \textit{hw} or \textit{manual} interval, this leads to a false negative (FN) prediction. All other sections where no hand washing activity is detected would be true negative (TN) predictions. However, it is hard to estimate section boundaries due to the minimum distance between predictions and overlapping \textit{hw, manual} intervals, so the true negative count would not be precise. Using these values, it is possible to create the confusion matrix described in \secref{sec:approachMetrics}. I compute the sensitivity, precision, and F1 score since they do not depend on the true negative values. This makes it possible to compare the performance of arbitrary models.
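The simulation could be sketched as follows; the kernel width, threshold, and minimum distance are illustrative values, and \texttt{intervals} denotes the user-confirmed \textit{hw, manual} sections as (start, end) sample indices.
\begin{verbatim}
import numpy as np

def simulate_detections(p_hw, intervals, kernel=15,
                        threshold=0.6, min_dist=75):
    """Count TP/FP detections of the running-mean trigger on a recording."""
    mean = np.convolve(p_hw, np.ones(kernel) / kernel, mode="same")
    tp = fp = 0
    last = -min_dist
    for i, m in enumerate(mean):
        if m > threshold and i - last >= min_dist:
            last = i                        # enforce detection spacing
            if any(a <= i < b for a, b in intervals):
                tp += 1                     # inside a hw/manual interval
            else:
                fp += 1
    return tp, fp
\end{verbatim}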
\subsection{Best kernel settings}
Furthermore, this mechanism can also be used to redetermine the kernel width and the threshold of the running mean. I apply a grid search over kernel sizes of $[10, 15, 20]$ and thresholds of $[0.5, 0.52, \dots, 0.98]$. For each value combination, the resulting true positive and false positive counts are computed. The combination that yields at least the same number of true positives with the fewest false positives is set as the optimal running mean setting.
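Building on the simulation sketch above, the grid search could look like this; \texttt{tp\_baseline} is the true positive count achieved with the current settings.
\begin{verbatim}
import numpy as np
from itertools import product

def best_mean_settings(p_hw, intervals, tp_baseline):
    """Pick kernel/threshold with >= baseline TPs and the fewest FPs."""
    best, best_fp = None, float("inf")
    for kernel, thr in product([10, 15, 20], np.arange(0.50, 0.99, 0.02)):
        tp, fp = simulate_detections(p_hw, intervals, kernel, thr)
        if tp >= tp_baseline and fp < best_fp:
            best, best_fp = (kernel, round(thr, 2)), fp
    return best
\end{verbatim}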
\section{Personalization Pipeline}
In this section, I summarize my implementation that extends the given application. The process of personalization is shown in \figref{fig:personalizationImplementation}.
\input{figures/approach/personalization_implementation}
As described in~\secref{sec:approachBaseApplication}, all user recordings are sent to an HTTP server, so they are accessible at a central point. Moreover, each recording is assigned to the corresponding user. If a personalization run is triggered, new recordings since the last personalization are identified. Then it is determined whether some of the new recordings should be excluded from personalization for later performance tests. Twenty percent of the recordings are used for the test split. These recordings must contain at least one indicator for a hand washing activity and one for a rejection, which ensures that the test split does not contain trivial recordings. The recordings used for personalization must contain at least one indicator for hand washing, so they do not lead to complete class imbalance. If there are new recordings that satisfy these requirements, the process continues.
The base model is used to predict the soft-labels for all recordings. After that, the labels are refined by one of the filter configurations specified in~\secref{sec:approachFilterConfigurations}; by default, the configuration \texttt{all\_cnn\_convlstm3\_hard} is used. The previous model is then trained in a supervised manner with the new samples from the recordings and their corresponding pseudo labels. As the previous model, either the currently best performing model or the last personalized model can be chosen. It can also be specified which regularization technique should be used (see~\secref{sec:approachPersonalization}); the default is freezing the feature layers, since it leads to shorter training times. After training, the resulting new model is saved.
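The default regularization could be sketched as follows in PyTorch; the assumption that the classification head is named \texttt{fc} is illustrative.
\begin{verbatim}
def freeze_feature_layers(model):
    """Train only the classification head during fine-tuning."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc")
\end{verbatim}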
Now all recordings separated for testing are used for the quality estimation of the new model, as described in~\secref{sec:approachQualityEstimation}. The new model is simulated on the test recordings, and its predictions are evaluated against the indicators. The same is done with the base model. Both are compared, and the relative differences in correct and wrong predictions are calculated. The new best running mean parameters are determined, and based on these settings, the new model's performance is calculated using sensitivity, precision, and F1 score. All these values are saved with the personalization run.
Whenever the user starts or stops using the application, a new model is requested from the server. It is then determined whether personalized models exist for the specific user. If so, the model with the highest F1 score and its running mean settings are sent to the user. If not, the user receives the general model.
Additionally, plots of the simulated test recordings and the filtered hand wash intervals are generated, which gives a more intuitive overview of a personalization run. \figref{fig:personalizationPipeline} shows a screenshot of the personalization interface embedded in the HTTP server.
\input{figures/approach/personalization_pipeline}
\section{Active learning}\label{sec:approachActiveLearning}
To evaluate my personalization approach, I compare it with a simple implementation of semi-supervised active learning. Similar to my approach, the base model is used to predict labels for the new, unseen datasets. These predictions can then be used to calculate the informativeness of the corresponding samples. The idea is to determine instances where the model is uncertain about the label. To do that, I calculate the entropy $H$ of the predictions for a sample $x_i$, which is defined as
\begin{align}
H(x_i) &= -\sum_{j=1}^{n}P_{ij} \log{P_{ij}}