Commit a12bfaa9 authored by Alexander Henkel

final experiment

parent 32297bd2
\begin{tabular}[hc]{>{\huge}l >{\huge}l}
Examiner: & \firstexaminer \\[0.3cm]
Advisor: & \advisers \\[1.2cm]
\end{tabular}
\vfill % move the following text to the bottom
{
% No second examiner, ignore
}
\textbf{Advisor} \smallskip{} \\
\advisers
\chapter*{Abstract}
Wearable sensors like smartwatches offer an excellent opportunity for human activity recognition (HAR). They are available to a broad user base and can be used in everyday life. Due to the variety of users, the detection model must be able to recognize different movement patterns. Recent research has demonstrated that personalized recognition performs better than a general one. However, additional labeled data from the user is required, which can be time-consuming and labor-intensive to annotate. While common personalization approaches reduce the required amount of labeled training data, the labeling process remains dependent on some user interaction.
In this work, I present a personalization approach in which training data labels are derived from inexplicit user feedback obtained during the use of a HAR application. The general model predicts labels, which are then refined by various denoising filters based on Convolutional Neural Networks and Autoencoders. The previously obtained user feedback assists this process. High-confidence data is then used to fine-tune the recognition model via transfer learning. No changes to the model architecture are required, so personalization can easily be added to an existing application.
\chapter{Introduction}\label{chap:introduction}
Detecting and monitoring people's activities can be the basis for observing user behavior and well-being. Human Activity Recognition (HAR) is a growing research area in many fields like healthcare~\cite{Zhou2020Apr, Wang2019Dec}, elder care~\cite{Jalal2014Jul, Hong2008Dec}, fitness tracking~\cite{Nadeem2020Oct}, or entertainment~\cite{Lara2012Nov}. In particular, technical improvements in wearable sensors like smartwatches enable the integration of activity recognition into everyday life for a wide user base~\cite{Weiss2016Feb, Jobanputra2019Jan, Bulling2014Jan}.
One application scenario in healthcare is observing various diseases such as Obsessive-Compulsive Disorder (OCD). For example, detecting hand washing activities can be used to derive the frequency or excessiveness with which people affected by OCD perform this action. Moreover, with automatic detection, it is possible to diagnose and even treat such diseases outside a clinical setting~\cite{Ferreri2019Dec, Briffault2018May}. If excessive hand washing is detected, just-in-time interventions can be presented to the user, offering enormous potential for promoting health behavior change~\cite{10.1007/s12160-016-9830-8}.
State-of-the-art Human Activity Recognition methods are supervised deep neural networks built on concepts like convolutional layers or Long Short-Term Memory (LSTM). These require large amounts of training data to achieve good performance. Since the movement patterns of each human are unique, the performance of activity detection can differ between users. Training data from a wide variety of humans is therefore necessary to generalize to new users. Accordingly, it has been shown that personalized models can achieve better accuracy than user-independent models~\cite{Hossain2019Jul, Lin2020Mar}.
To personalize a model, retraining on new unseen sensor data is necessary. Obtaining the ground truth labels is crucial for most deep learning techniques. However, the annotation process is time- and cost-intensive. Typically, training data is labeled by hand in controlled environments. In a real-world scenario, the user would have to take over the major part of this labeling effort.
However, this requires substantial user interaction and a certain level of expertise, which would impair usability.
My work aims to personalize a detection model without increasing user interaction. Information for labeling is drawn from indicators that arise during the use of the application. These can be derived from user feedback to actions triggered by the predictions of the underlying recognition model. Moreover, personalization should be kept separate so that no change to the model architecture is required.
At first, all new unseen sensor data is labeled by the same general model that is used for activity recognition. These model predictions are corrected to a certain extent by using pre-trained filters. High-confidence labels are considered for personalization. In addition, the previously obtained indicators are used to refine the data to generate a good training set. Therefore, the process of manual labeling can be skipped and replaced by an automatic combination of available indications. With the newly collected and labeled training data, the previous model can be fine-tuned in an incremental learning approach~\cite{Amrani2021Jan, Siirtola2019May, Sztyler2017Mar}. For neural networks, it has been shown that transfer learning offers high performance with decent computation time~\cite{Chen2020Apr}. These steps lead to a personalized model with improved performance in detecting specific gestures of an individual user.
I applied the described personalization process to a hand washing detection application used to observe the behavior of OCD patients. During the observation, if the application detects hand washing, it asks the user for an evaluation. For mispredictions, the user has the opportunity to reject evaluations. Depending on how the user reacts to the evaluations, conclusions are drawn about the accuracy of the predictions, resulting in the desired indicators.
The contributions of my work are as follows:
\begin{itemize}
\item [1.] A personalization approach is implemented, which can be added to an existing HAR application and does not require additional user interaction or changes in the model architecture.
\item [2.] Different indicator-assisted refinement methods, based on Convolutional networks and Fully Connected Autoencoders, are applied to the generated labels.
\item [3.] It is demonstrated that a personalized model resulting from this approach outperforms the general model and can achieve performance similar to a supervised personalization.
\item [4.] My approach is compared to a common active learning method.
\item [5.] A real-world experiment is presented, which confirms applicability to a broad user base.
\end{itemize}
\chapter{Related Work}\label{chap:relatedwork}
Human Activity Recognition (HAR) is a broad research field used in various applications like healthcare, fitness tracking, elder care, or behavior analysis. Data acquired by different types of sensors like video cameras, range sensors, wearable sensors, or other devices is used to automatically analyze and detect everyday activities. The field of wearable sensors in particular is growing, as technical progress in smartwatches makes it possible for a wide range of users to integrate these sensors into their daily lives.
In the following, I give a brief overview of the literature on state-of-the-art HAR and how personalization can improve performance. Then, I focus on work that deals with different approaches to generating training data. Finally, I show how faulty labels in the training data can be cleaned.
\section{Activity recognition}\label{sec:relWorkActivityRecognition}
Most Inertial Measurement Units (IMUs) provide a combination of 3-axis acceleration and orientation data in continuous streams. Sliding windows are applied to the streams and are assigned to an activity by the underlying classification technique~\cite{s16010115}. This classifier is a prediction function $f(x)$ which returns the predicted activity labels for a given input $x$. Recently, deep neural network techniques have replaced traditional ones such as Support Vector Machines or Random Forests since no hand-crafted features are required~\cite{ramasamy2018recent}. They use multiple hidden layers of feature decoders and an output layer that provides predicted class distributions~\cite{MONTAVON20181}. Each layer consists of multiple artificial neurons connected to the following layer's neurons. These connections are assigned weights that are learned during the training process. First, in the feed-forward pass, the output values are computed based on a batch of training data. In the second stage, called backpropagation, the error between the expected and predicted values is computed by a loss function $J$ and minimized by optimizing the weights. Feed-forward pass and backpropagation are repeated over multiple iterations, called epochs~\cite{Liu2017Apr}.
The combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory recurrent neural networks (LSTMs) tends to outperform other approaches. It is considered the current state of the art for human activity recognition~\cite{9043535}. For classification problems, most works use cross-entropy as the loss function. \extend{???}
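For illustration, the following is a minimal sketch of one training epoch as described above, written in a PyTorch style; the model, the data loader, and all hyperparameters are placeholders and not taken from any implementation referenced in this work.
\begin{verbatim}
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of feed-forward pass and backpropagation."""
    criterion = nn.CrossEntropyLoss()      # cross-entropy loss J
    model.train()
    for windows, labels in loader:         # batches of sensor windows
        windows, labels = windows.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(windows)           # feed-forward pass
        loss = criterion(outputs, labels)  # error between prediction and target
        loss.backward()                    # backpropagation of the error
        optimizer.step()                   # weight update by the optimizer
\end{verbatim}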
\section{Personalization}\label{sec:relWorkPersonalization}
Even well-performing architectures can produce poor results in real-world scenarios. Varying users and environments create many different influences that can affect performance. These could be the device's position, differences between the sensors, or human characteristics~\cite{ferrari2020personalization}.
Each user's movement pattern differs, so a general detection model may suffer. To overcome this problem, a general model should be trained on a wide range of users and cover as many different motion patterns as possible. However, this would require unrealistically large data sets. Besides the storage and processing costs, the availability of public datasets is limited since labeling is a difficult task. The goal is to generalize the model as much as possible for the final user.
It has been shown that a personalized model trained with additional user-specific data (even with just a small amount of additional data) can significantly outperform the general model~\cite{8995531, doi:10.1137/1.9781611973440.71, zhao2011cross}. In my work, I concentrate on data-based approaches, which can be split into \textit{subject-independent}, \textit{subject-dependent} and \textit{hybrid} dataset configurations~\cite{Ferrari2021Sep}. The \textit{subject-independent} model represents the general model where no user-specific data is used for training, whereas the \textit{subject-dependent} model relies only on the user's data. A subject-dependent model would fit the target user best but requires specific data from each user. As a combination of both, the \textit{hybrid} configuration uses the data of all other users with additional data from the target user. This should result in better detection of the final user's activities than the subject-independent model but is easier to train than the subject-dependent model since less data from the final user is required. It is possible that a hybrid approach can achieve similar performance to the subject-dependent one but with less user-specific data~\cite{weiss2012impact, Chen2017Mar}.
\subsection{Active learning}
Active learning is a supervised learning approach that attempts to achieve the best possible performance with as few labeled instances as possible. It selects samples from a pool of unlabeled instances and queries them to a so-called oracle, for example, the user, to receive the exact labels. The instances with the highest informativeness are selected for training to achieve maximum performance with only a few queries. There are several approaches to determining informativeness, such as querying the samples for which the learner is most uncertain about the label or which would lead to the most significant expected model change~\cite{Settles2009, rokni2020transnet}.
Saeedi et al. combine an active learning approach with a neural network consisting of convolutional and LSTM layers to predict activities by using wearable sensors~\cite{Saeedi2017Dec}. They apply active learning to fine-tune the model to a new domain. To select the queried instances, they introduce a measure called information density. It weighs the informativeness of an instance by its similarity to all other instances.
%~\cite{Fekri2018Sep}.
Ashari and Ghasemzadeh observed the limitations of the response capabilities of a user~\cite{Ashari2019Jul}. This applies not only to the number of queries but also to the time discrepancy between when a query is issued and when it is answered. They add a user's ability to remember the correct label of a sample as a criterion to the selection process of queried instances.
Active learning can be combined with a semi-supervised approach, as shown by Hasan and Roy-Chowdhury~\cite{Hasan2015Sep}. They use active learning where samples with high tentative prediction probability are labeled by a weak learner, i.e., a classification algorithm. Only samples with low certainty and high potential model change are queried to the user. This way, they can enlarge the training set without increasing the user interaction. They achieved performance comparable to state-of-the-art active learning methods but with a reduced amount of manually labeled instances.
\subsection{Self-supervised learning}\label{sec:relWorkSelfSupervisedLearning}
In self-supervised learning, a deep neural network is trained to recognize predefined transformation tasks in an unsupervised manner. That is, different transformation functions like noise, rotation, or negation are applied to an input signal, which generates new distinct versions. The network predicts the probabilities that a given sequence is a transformation of the original signal. Since the transformation functions are known, a self-supervised labeled training set can be constructed. The idea is that to detect the transformation tasks, the core characteristics of the input signal have to be learned. These high-level semantics can then be used as the feature layers for the classifier. Just a few labeled samples are required to train the classification layer.
Saeed et al. showed a self-supervised learning approach for HAR~\cite{Saeed2019Jun}. They achieve significantly better performance than traditional unsupervised learning methods and performance comparable to fully-supervised methods, especially in a semi-supervised scenario where a few labeled instances are available. Tang et al. extend this by combining self-supervised learning and self-training~\cite{Tang2021Feb}. A teacher model is trained first using supervised labeled data. Then, the teacher model is used to relabel the supervised dataset and additional unseen instances. The most confident samples are augmented by transformation functions, as mentioned previously. After that, the self-supervised dataset is used to train a student model. In addition, it is fine-tuned with the initially supervised instances. By combining the unlabeled data with the limited labeled data, performance can be further enhanced.
\subsection{Pseudo labeling}\label{sec:relWorkPseudoLabeling}
Pseudo labeling allows unsupervised domain adaptation by using predictions of the base model~\cite{lee2013pseudo}. Based on the prediction of a sample, an artificial pseudo-label is generated, which is treated as ground truth data. However, this requires that the initially trained model predicts pseudo-labels with high accuracy, which is hard to satisfy. Training with false pseudo-labels has a negative impact on personalization. Moreover, it is possible that pseudo-labeling overfits to incorrect pseudo-labels over multiple iterations, which is known as confirmation bias. Therefore, many approaches augment the pseudo labels to reduce the amount of false training data. Since a base model is required, which is in most cases trained on supervised data, pseudo labeling is a part of semi-supervised learning. Nevertheless, compared to other semi-supervised approaches, pseudo labeling offers a simple implementation that does not rely on domain-specific augmentations or changes to the model architecture.
Li et al. showed a naive approach for semi-supervised learning using pseudo labels~\cite{Li2019Sep}. First, a pseudo labeling model $M_p$ is trained using a small supervised labeled data set $L$. This model is then used to perform pseudo-labeling for new unlabeled data, which results in dataset $\hat{U}$. After that, a deep learning model $M_{NN}$ is pre-trained with the pseudo labeled data $\hat{U}$ and afterward fine-tuned with the supervised data $L$. This process is repeated, where the resulting model $M_{NN}$ is used as a new pseudo labeling model $M_p$, until the validation accuracy converges. Moreover, they use the fact that predictions of a classifier model are probabilistic and assume that labels with higher probability also have higher accuracy. Therefore, they use only pseudo labels with high certainty. They argue that pseudo-labeling can be seen as a kind of data augmentation. Even with the high label noise of the pseudo labels, a deep neural network should be able to improve with training. In their tests, they significantly improved accuracy by adding pseudo labels to the training. Furthermore, they showed that the model benefits especially from the first iterations. Nevertheless, it is required that the pseudo labeling model $M_p$ has a certain accuracy. Tests show that a better pseudo labeling model leads to higher accuracy of the fine-tuned model.

Arazo et al. observed the performance of naive pseudo labeling applied to images and showed that it would overfit to incorrect pseudo labels~\cite{Arazo2020Jul}. The trained model tends to have higher confidence in previously falsely predicted labels, which results in new incorrect predictions. They applied simple modifications to prevent confirmation bias without requiring multiple networks or any consistency regularization methods as done in other approaches like in \secref{sec:relWorkSelfSupervisedLearning}. This yielded state-of-the-art performance with mix-up augmentation as regularization and a minimum number of labeled samples. Additionally, they use soft-labels instead of hard-labels for training. Here, a label consists of the individual class affiliations instead of a single value for the target class. Thereby it is possible to depict uncertainty over the classes. As a mix-up strategy, they combine random sample pairs and corresponding labels, creating data augmentation with label smoothing. It should reduce the confidence of network predictions. As they point out, this approach is more straightforward than using other regularization methods and, moreover, more accurate.
Furthermore, it is also possible for standard neural networks to learn from arbitrarily noisy data and still perform well. They also benefit from larger data sets that can accommodate a wide range of noise~\cite{Rolnick2017May}. In particular, there is much research in image denoising, where deep neural networks have also become very promising~\cite{Xie2012, Dong2018Oct, Gondara2016Dec}. However, many of these approaches can also be applied to time-series data. Since, in our case, labels result from time-series samples, the labels themselves can also be considered time-series data. Therefore, it is likely that two adjacent labels have the same value, and an outlier would probably be a false label. This is similar to noise in a signal. Denoising or smoothing would then be equivalent to correcting the wrong labels.
\subsection{Autoencoder}\label{sec:relWorkAutoencoder}
For denoising signals, autoencoders have been shown to achieve good results~\cite{Xiong2016Nov, Chiang2019Apr, Xiong2016Jun}. An autoencoder is a neural network model consisting of two parts: the encoder and the decoder. During encoding, the model tries to map the input into a more compact form with low-dimensional features. During decoding, this compressed data is reconstructed to the original space~\cite{Liu2019Feb}. During training, the model is adjusted to minimize the reconstruction error between the predicted output and the expected output. Autoencoders have become popular for unsupervised pretraining of deep neural networks. Vincent et al. introduced the use of autoencoders for denoising~\cite{vincent2010stacked}. A denoising autoencoder receives corrupted input samples and is trained to reconstruct the original data. So not just a mapping from input to output is learned but also features for denoising.
A traditional autoencoder consists of three stages, where the first stage is used for input encoding and the last stage for decoding. The middle stage represents the hidden state of the compact data. Chiang et al. use a fully convolutional network as their architecture for denoising electrocardiogram signals~\cite{Chiang2019Apr}. The network consists of multiple convolutional layers for the encoding part and an inversely symmetric decoder using deconvolutional layers. An additional convolutional layer is used for the output. They use a stride of 2 in their convolutional layers to downsample the input signal and upsample it again to the output. So no pooling layers are required, and the exact signal alignment of input and output is obtained. This results in a compression from a 1024x1-dimensional input signal to a 32x1-dimensional feature map. Since no fully connected layers are used, the number of weight parameters is reduced, and the locally-spatial information is preserved. They use the root mean square error as loss function, determining the variance between the predicted output and the original signal. Similarly, Garc\'{i}a-P\'{e}rez et al. use a fully-convolutional denoising autoencoder (FCN-dAE) architecture for Non-Intrusive Load Monitoring~\cite{Garcia-Perez2020Dec}. Both have shown that an FCN-dAE outperforms a traditional autoencoder in improving the signal-to-noise ratio.
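As a rough illustration of this idea, the following PyTorch sketch shows a small 1D fully-convolutional denoising autoencoder trained on corrupted inputs with clean targets. It is not the architecture of Chiang et al.; the layer count, channel sizes, kernel sizes, and the synthetic data are illustrative assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

class FCNDenoisingAE(nn.Module):
    """Illustrative 1D fully-convolutional denoising autoencoder."""
    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions downsample 1024 -> 256 (no pooling)
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=16, stride=2, padding=7), nn.ELU(),
            nn.Conv1d(16, 32, kernel_size=16, stride=2, padding=7), nn.ELU(),
        )
        # Decoder: transposed convolutions upsample back to 1024
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=16, stride=2, padding=7), nn.ELU(),
            nn.ConvTranspose1d(16, 1, kernel_size=16, stride=2, padding=7),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training step: corrupted input, clean target, reconstruction (MSE) loss
model = FCNDenoisingAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.randn(8, 1, 1024)                 # placeholder clean signals
noisy = clean + 0.1 * torch.randn_like(clean)   # synthetic corruption
loss = nn.MSELoss()(model(noisy), clean)
loss.backward()
optimizer.step()
\end{verbatim}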
\subsection{convLSTM-AE}\label{sec:relWorkconvLSTM}
As already mentioned in \secref{sec:relWorkActivityRecognition}, long short-term memory (LSTM) networks are well known in the field of time series or sequential learning since they can transfer signal information across time steps. Therefore, there are multiple approaches where LSTMs are also used in autoencoders~\cite{Essien2019Jul, Essien2020Jan, Kim2021Oct}. The networks are similar to FCN-AEs, but convolutional LSTM (convLSTM) layers replace the convolution layers. In contrast to a common LSTM (FC-LSTM), a convLSTM replaces the internal fully connected matrix multiplications with convolution operations~\cite{Shi2015}. So convLSTM layers can also encode spatial information, in addition to knowing which information is important for the current or following time step. Therefore, convLSTM has been applied to many spatial-temporal dependent tasks. Furthermore, fewer weights are needed in comparison to a fully connected LSTM.
\subsection{Soft labels}\label{sec:relSoftLabels}
In most cases, the label of a sample $x_i$ is a hard label $y_i\in K$, which denotes exactly one of the $c$ predefined classes $K\equiv\{1, \dots , c\}$ to which this sample belongs~\cite{Li2012Jul, Sun2017Oct}. Typically, labels are transformed into a one-hot encoded vector $\bm{y}_i=[y^{(1)}_i, \dots, y^{(c)}_i]$ for training, since this is required by the loss function. If the sample is of class $j$, the $j$-th value in the vector is one, whereas all other values are zero. So $\sum_{k=1}^{c}y_i^{(k)}=1$ and $y_i^{(k)}\in\{0,1\}$. For soft labels, $y_i^{(k)}\in[0,1]\subset \mathbb{R}$, which allows assigning the degree to which a sample belongs to a particular class~\cite{ElGayar2006}. Therefore, soft-labels can depict uncertainty over multiple classes~\cite{Beleites2013Mar}. Most models in human activity recognition already compute the probabilities of partial class memberships. Converting them to a hard label would lead to information loss. So if these predictions are used as soft labeled training data, the values indicate how certainly the sample belongs to the respective classes.
Soft-labels can therefore also be used for more robust training with noisy labels. It could happen that the noise in a generated label stems only from a very uncertain prediction. This noise is maximized by using hard labels but has less impact on the training process when soft labels are used. It has been shown that soft-labels can carry valuable information even when noisy, so a model trained with multiple sets of noisy labels can outperform a model trained on a single set of ground-truth labels~\cite{Ahfock2021Sep, Thiel2008}. Hu et al. used soft labeled data for Human Activity Recognition to overcome the problem of incomplete or uncertain annotations~\cite{Hu2016Oct}. They outperformed the hard labeling methods in accuracy, precision, and recall.
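The following small sketch contrasts the two label forms for the two-class case ($null$, $hw$); it assumes a recent PyTorch version in which the cross-entropy loss also accepts probability (soft) targets, and all values are illustrative.
\begin{verbatim}
import torch
import torch.nn.functional as F

# Hard label for class j as a one-hot vector (c = 2 classes: null, hw)
hard = F.one_hot(torch.tensor([1]), num_classes=2).float()  # [[0., 1.]]

# Soft label: class memberships sum to 1 and encode uncertainty
soft = torch.tensor([[0.3, 0.7]])

# Cross-entropy against model logits works with either target form
logits = torch.tensor([[0.2, 0.9]])
loss_hard = F.cross_entropy(logits, hard)  # one-hot target
loss_soft = F.cross_entropy(logits, soft)  # soft target keeps the uncertainty
\end{verbatim}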
\chapter{Methods}\label{chap:approach}
This chapter presents my approach to personalizing a Human Activity Recognition model without requiring extensive user interaction.
First, I describe the application that is to be extended. After that, the creation of synthetic and real-world datasets used for evaluation is shown. Then, I describe the personalization process and how user feedback is used to get information about the training data labels. Additionally, I present various denoising techniques to refine the training data. Then I introduce common evaluation metrics and a quality estimation for a personalization that does not rely on ground truth data. I summarize the process in the resulting pipeline.
Finally, I present an active learning implementation used for performance comparison.
\section{Base application}\label{sec:approachBaseApplication}
The system I am building on consists of an Android Wear OS application running on a smartwatch and an HTTP web service. It is used to observe the hand washing behavior of a participant in order to treat OCD. Therefore, all wrist movements are recorded, and surveys of the user's mental condition for each hand washing are collected. If the device detects a hand wash activity, a notification is shown to the user, which can then be confirmed or declined. Confirmation leads to the surveys and the evaluation process of the user. Furthermore, a user can trigger manual evaluations if the device does not detect a hand washing. Psychologists can later use these evaluations to analyze and treat the participant's state during the day. \figref{fig:baseApplicationScreen} shows screenshots of the application.
\input{figures/approach/base_application_screen}
While charging the smartwatch, all sensor recordings and user evaluations are sent to the web server. They are collected and assigned to the respective watches using the Android ID. The server also provides a web interface for accessing and managing all recording sets and participants. In addition, various statistics are generated for the gathered data.
\section{Datasets}
To personalize a human activity recognition model, it must be re-trained with additional sensor data from a particular user. In our case, this data has to come from IMUs of wrist-worn devices during various human activities. Typically, it consists of a set $S=\{S_0,\dots,S_{k-1}\}$ of $k$ time series. Each $S_i\in \mathbb{R}^{d_i}$ is a sensor-measured attribute with dimensionality $d_i$ of sensor $i$. Additionally, there is a set of $n$ activity labels $A=\{a_0, \dots, a_{n-1}\}$, and one of these labels is assigned to each $S_i$~\cite{Lara2012Nov}. For activity prediction, I use a sliding window approach where I split the data set into $m$ time windows $W=\{W_0, \dots, W_{m-1}\}$ of equal size $l$. The windows are shifted by $v$ positions, which means that they overlap if $v < l$. Each window $W_i$ contains a sub-set of the time series, $W_i=\{S_{i\cdot v}, \dots, S_{i\cdot v + l}\}$, and is assigned to an activity label, which builds the set of labels $Y=\{y_0, \dots, y_{m-1}\}$.
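As a minimal sketch of this sliding-window segmentation in NumPy: the window size, shift, and the majority-label rule below are illustrative choices, not the exact parameters of the data pipeline used in this work.
\begin{verbatim}
import numpy as np

def sliding_windows(samples, labels, l=150, v=75):
    """Split a (T, d) sensor stream into windows of size l shifted by v."""
    windows, window_labels = [], []
    for start in range(0, len(samples) - l + 1, v):
        windows.append(samples[start:start + l])
        # Assign the majority activity label of the covered samples (illustrative)
        values, counts = np.unique(labels[start:start + l], return_counts=True)
        window_labels.append(values[np.argmax(counts)])
    return np.stack(windows), np.array(window_labels)

# Example: 6-dimensional stream (3-axis acceleration + 3-axis gyroscope) at 50 Hz
stream = np.random.randn(3000, 6)
activity = np.random.randint(0, 2, size=3000)  # 0 = null, 1 = hw
W, Y = sliding_windows(stream, activity)
print(W.shape, Y.shape)                        # (39, 150, 6) (39,)
\end{verbatim}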
Most of today's wearable devices contain an accelerometer and a gyroscope with three dimensions each. I combine the sets $S_{acceleration}$ and $S_{gyroscope}$ into one set with $S_i\in \mathbb{R}^{d_{acceleration}+d_{gyroscope}}$. In the case of hand wash detection, I use the activity labels $A=\{null, hw\}$, where \textit{null} represents all activities where no hand washing is performed, and \textit{hw} represents all hand washing activities.
Furthermore, additional data for user feedback covering parts of the time series is required. We can use the general prediction model to determine hand washing parts, as would be done in the base application. In our case, we apply a running mean over multiple windows to the predictions and trigger an indicator $e_i$ at the window $W_j$ if the mean is higher than a certain threshold. This indicator $e_i$ gets the value \textit{correct} if one of the ground truth labels covered by the mean is \textit{hw}; otherwise, it is \textit{false}. It represents the user feedback on confirmed or declined evaluations. A manual user feedback indicator is added for hand wash sequences where no indicator has been triggered. Using ground truth data for adding indicators simulates a perfectly disciplined user who answers each evaluation.
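The sketch below illustrates this indicator generation on window-level hand-wash probabilities; the mean width, threshold, and minimum gap are placeholder values, and the inputs are assumed to be NumPy arrays.
\begin{verbatim}
import numpy as np

def trigger_indicators(hw_probs, true_labels, mean_width=20,
                       threshold=0.6, gap=35):
    """Trigger an indicator when the running mean of hand-wash probabilities
    exceeds the threshold; derive its value from the ground truth labels."""
    indicators = []  # list of (window_index, "correct" | "false")
    j = mean_width
    while j < len(hw_probs):
        if hw_probs[j - mean_width:j].mean() > threshold:
            covered = true_labels[j - mean_width:j]
            feedback = "correct" if (covered == 1).any() else "false"
            indicators.append((j, feedback))
            j += gap   # minimum gap before the next indicator can fire
        else:
            j += 1
    return indicators

# Placeholder example input
probs = np.random.rand(500)                  # predicted hw probability per window
truth = np.random.randint(0, 2, size=500)    # ground truth: 0 = null, 1 = hw
print(trigger_indicators(probs, truth)[:3])
\end{verbatim}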
Since adjacent windows tend to have the same activity, one indicator can cover several windows. I assume that a single user feedback is assigned to an entire execution of the related activity, i.e., until the labels of two successive windows differ. Therefore, I added a \textit{neutral} value when multiple indicators cover the same activity sequence so that only one of them represents the actual user feedback. In addition, there is a minimum gap of the same size as the base application's buffer between two indicators. \figref{fig:exampleSyntheticSataset} shows the plot of a synthetic data set with five hand wash activities over about one hour.
\input{figures/approach/example_dataset}
\subsubsection{Utilized data sets}
For this work, I used data sets from the University of Basel and the University of Freiburg [REF]. These include hand washing data which was recorded using a smartwatch application. Additionally, they contain long-term recordings of everyday activities. The data is structured by individual participants and multiple activities per recording. During the generation of a synthetic data set, the data of a single participant is selected randomly. To cover enough data, I had to combine and merge single participants across the data sets. Therefore, a resulting data set for a user contains multiple participants, which I treat as one. This only affects data for \textit{null} activities; all hand wash activity data is from the same user. Since the same data sets have already been used to train the base model, I had to retrain an individual base model for each participant such that none of that participant's data is contained in it.
Requirements for this process are:
\begin{itemize}
\item[1)]Collection of all sensor data\\ Since an application has to listen to the sensor data anyway for predictions, it should be possible to save it in the internal storage. Furthermore, it must be made available to the personalization service somehow. In my case, the sensor recordings are transmitted to a central point, so it is easy to access them.
\item[2)]User responses to predictions\\ These can be simple yes/no feedback. I assume that there is at least some kind of confirmation when an activity is detected, such that the application can perform its task. For example, the application in this work is used to query an evaluation from the user if a hand wash activity is performed. Therefore, I can deduce that the recognition was correct if such an evaluation has taken place. If the user declines, I know the prediction was incorrect. In cases without confirmation or rejection, it is possible to ignore this part or treat it like a false prediction. In \secref{sec:expMissingFeedback}, I show that especially the confirmation of predictions impacts personalization performance. Therefore, the no-feedback case can be neglected if it is not possible to implement or if it would change the usability.
\end{itemize}
I want to personalize the \textbf{DeepConvLSTM-A} neural network model, implemented and trained by Robin Burchard~\cite{robin2021}. It consists of four 1D-convolutional layers followed by an LSTM layer. It is extended by an attention mechanism, which allows concentrating on important parts of the data. This is done by a fully connected layer that computes weights over the time steps of the LSTM layer. These are then used for a weighted sum with the LSTM's hidden states. Finally, a fully connected linear layer is used for classification. As input, the model requires a window of 150 time-series samples with a dimensionality of 6. It represents a three-second sensor reading of acceleration and gyroscope at 50Hz.
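To make this architecture description more tangible, the following PyTorch sketch shows a DeepConvLSTM-style classifier with such an attention mechanism. It is not the exact implementation of \cite{robin2021}; filter counts, kernel sizes, and the hidden size are illustrative assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

class DeepConvLSTMAttention(nn.Module):
    """Illustrative DeepConvLSTM with attention for (150, 6) sensor windows."""
    def __init__(self, n_channels=6, n_classes=2, n_filters=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(               # four 1D-convolutional layers
            nn.Conv1d(n_channels, n_filters, 5, padding=2), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 5, padding=2), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 5, padding=2), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True)
        self.attention = nn.Linear(hidden, 1)    # one weight per time step
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                        # x: (batch, 150, 6)
        x = self.conv(x.transpose(1, 2))         # (batch, filters, 150)
        h, _ = self.lstm(x.transpose(1, 2))      # (batch, 150, hidden)
        w = torch.softmax(self.attention(h), dim=1)  # attention over time steps
        context = (w * h).sum(dim=1)             # weighted sum of hidden states
        return self.classifier(context)          # class logits

# Example: a batch of 3-second windows (50 Hz, 6 channels)
logits = DeepConvLSTMAttention()(torch.randn(8, 150, 6))
\end{verbatim}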
To personalize the model, I use transfer learning in a domain adaptation manner as described in \secref{sec:relWorkPersonalization}. However, it requires additional labeled training data. Therefore, I require the recording of all sensor data from condition $1)$, such that all data from a user is available. However, this data does not contain any labels. So I generate pseudo labels based on the predictions of the general activity recognition model as described in \secref{sec:approachLabeling}. Additionally, these labels are refined using different approaches, which are introduced in \secref{sec:approachDenoising}. The use of pseudo labels leads to supervised training. It allows keeping the model architecture unchanged, and the main parts of the original training implementation and hyper-parameter settings can be reused, as elaborated by Robin Burchard~\cite{robin2021}.
\subsubsection{Regularization}\label{sec:approachRegularization}
To avoid over-fitting to the target during multiple iterations of personalization, I try two different approaches. The first approach is to freeze the feature layers of the model. Yosinski et al. show that feature layers tend to be more generalizable and can be better transferred to the new domain~\cite{Yosinski2014}. Therefore, the personalization is applied only to the classification layer. So fewer parameters have to be fine-tuned, which results in less computation time, and a smaller amount of training data can significantly impact the model. In the second approach, I apply the L2-SP penalty to the optimization, as described by Xuhong et al.~\cite{xuhong2018explicit}. Here, the regularization restricts the search space to the vicinity of the initial model parameters. Information learned during pre-training therefore persists even over multiple fine-tuning iterations. It allows adjustment of all parameters, which offers more flexibility in fine-tuning. To test which approach fits best, I compare them in \secref{sec:expEvolCompFilter}.
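Both options can be sketched as follows. The toy model, the single penalty coefficient, and the split into "feature" and "classification" parts are placeholders; the original L2-SP formulation uses separate coefficients for pre-trained and newly added parameters.
\begin{verbatim}
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained HAR model: feature layers + classifier
model = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),   # "feature" part
    nn.Linear(32, 2),              # "classification" part
)

# Option 1: freeze the feature layers, fine-tune only the classification layer
for p in model[0].parameters():
    p.requires_grad = False

# Option 2: L2-SP penalty that pulls all weights towards the pre-trained values
start_point = {n: p.detach().clone() for n, p in model.named_parameters()}

def l2_sp_penalty(model, start_point, alpha=1e-2):
    penalty = sum(((p - start_point[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return alpha * penalty

# During fine-tuning the penalty is simply added to the task loss:
#     loss = cross_entropy(model(x), y) + l2_sp_penalty(model, start_point)
\end{verbatim}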
\section{Training data labeling}\label{sec:approachLabeling}
To retrain the base model with new sensor data, I use a semi-supervised approach with pseudo labels as described in \secref{sec:relWorkPseudoLabeling}. Since there is already a trained model $M_p$, I can make use of its supervision from pre-training to generate artificial labels $\hat{Y}=\{\hat{y}_0, \dots, \hat{y}_{m-1}\}$ based on predictions from new unlabeled sensor data $W=\{W_0, \dots, W_{m-1}\}$, where $\hat{y}_i=M_p(W_i)$. Therefore, it is easy to obtain these pseudo-labels for all new sensor data. This process does not require supervision but allows the model to be trained in the same way as with a supervised method. So the same model architecture and training implementation can be used.
However, it is very likely that there are some wrong predictions that negatively affect the training, which is also called label noise. It is important to keep the amount of noise as low as possible. To observe the impact of wrong classified data I did several tests, which you can see in \secref{sec:expTransferLearningNoise}. As seen in \secref{sec:relWorkLabelNoise} there are multiple approaches to make the training more robust against label noise. I use the predicted values as soft-labels to depict uncertainty of the model. So the set of pseudo labels consists of vectors instead of crisp labels, $\hat{Y}=\{\hat{\bm{y}}_0, \dots, \hat{\bm{y}}_{m-1}\}$ where each $\hat{\bm{y}}_i = \begin{bmatrix}\hat{y}_i^{null}& \hat{y}_i^{hw}\end{bmatrix}$. The value $\hat{y}_i^{null}$ is the predicted membership for class \textit{null} and $\hat{y}_i^{hw}$ for class \textit{hw}. \figref{fig:examplePseudoSataset} shows an example plot of the predicted pseudo values of the previously seen dataset. In the following I call a pseudo label $\hat{\bm{y}}_i$ \textit{null} if $\hat{y}_i^{null} > \hat{y}_i^{hw}$ and \textit{hw} or hand wash if $\hat{y}_i^{null} < \hat{y}_i^{hw}$.
However, it is very likely that some predictions are wrong and negatively affect the training, which is also called label noise. It is important to keep the amount of noise as low as possible. To observe the impact of wrongly classified data, I conducted several tests, which are presented in \secref{sec:expTransferLearningNoise}. As seen in \secref{sec:relWorkLabelNoise}, there are multiple approaches to make the training more robust against label noise. I use the predicted values as soft labels to reflect the model's uncertainty. So the set of pseudo labels consists of vectors instead of hard labels, $\hat{Y}=\{\hat{\bm{y}}_0, \dots, \hat{\bm{y}}_{m-1}\}$ where each $\hat{\bm{y}}_i = \begin{bmatrix}\hat{y}_i^{null}& \hat{y}_i^{hw}\end{bmatrix}$. The value $\hat{y}_i^{null}$ is the predicted membership for class \textit{null} and $\hat{y}_i^{hw}$ for class \textit{hw}. \figref{fig:examplePseudoSataset} shows an example plot of the predicted pseudo values of the previously seen dataset. In the following, I call a pseudo label $\hat{\bm{y}}_i$ \textit{null} if $\hat{y}_i^{null} > \hat{y}_i^{hw}$ and \textit{hw} or hand wash if $\hat{y}_i^{null} < \hat{y}_i^{hw}$.
\input{figures/approach/example_pseudo_labels}
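As a minimal sketch, the pseudo-labeling step could look as follows, assuming the pre-trained model outputs logits for the classes \textit{null} and \textit{hw} and that the softmax outputs are kept directly as soft labels; the tensor layout is an assumption.
\begin{verbatim}
import torch

@torch.no_grad()
def generate_soft_pseudo_labels(model, windows):
    # windows: tensor of shape (m, window_length, channels) -- layout assumed.
    model.eval()
    logits = model(windows)                 # shape (m, 2): [null, hw]
    soft = torch.softmax(logits, dim=1)     # soft pseudo labels y_hat_i
    hard = soft.argmax(dim=1)               # crisp view: 0 = null, 1 = hw
    return soft, hard
\end{verbatim}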
For further improvements, I rely on knowledge about the context of hand washing: most labels should have the value \textit{null}, and only groups of adjacent windows spanning roughly 20 seconds should have the value \textit{hw}. Additionally, there is user feedback on certain model predictions, from which further information can be drawn.
\subsection{User Feedback}
\subsection{User feedback}
For each section where the running mean of model predictions reaches a certain threshold, an indicator is created which is either \textit{neutral}, \textit{false} or \textit{correct}. Moreover, there can be indicators of type \textit{manual}. These provide the following information about the respective predictions.
\begin{itemize}
......@@ -78,7 +78,7 @@ For each section where the running mean of model predictions reaches a certain t
\item manual:\\ The participant triggered a manual activity, so the running mean has not exceeded the threshold. However, predictions of the corresponding activity are probably correct. It could be that the execution was too short to push the mean above the threshold or that too many predictions were false. Since a manual indicator can be given some time after the activity, the possible time interval is significantly larger than for detected activities. Therefore, it is recommended to specify a maximum delay within which a user should trigger manual feedback.
\end{itemize}
In the case of hand washing I set for a \textit{false} indicator $i$ at window $w$ the activity interval to $I^{false}_i=[w-35, w]$ which represents a time span from 46 seconds before until the time of the indicator. For a \textit{correct} indicator, the interval is set to $I^{correct}_i=[w-70, w+10]$ (93 seconds before to 13 seconds after), \textit{neutral} indicator interval to $I^{neutral}_i=[w-35, w+5]$, (46 seconds before to 6 seconds after) and \textit{manual} indicator interval to $I^{manual}_i=[w-130, w-8]$, (173 seconds before to 10 seconds before). During tests this values have shown a good average coverage. \figref{fig:exampleSyntheticIntervals} shows the example dataset with highlighted activity intervals. All intervals are elements of their corresponding set, where:
In the case of hand washing, I set the activity interval for a \textit{false} indicator $i$ at window $w$ to $I^{false}_i=[w-35, w]$, which represents a time span from 46 seconds before until the time of the indicator. For a \textit{correct} indicator, the interval is set to $I^{correct}_i=[w-70, w+10]$ (93 seconds before to 13 seconds after), for a \textit{neutral} indicator to $I^{neutral}_i=[w-35, w+5]$ (46 seconds before to 6 seconds after), and for a \textit{manual} indicator to $I^{manual}_i=[w-130, w-8]$ (173 seconds before to 10 seconds before). During tests, these values have shown good average coverage. \figref{fig:exampleSyntheticIntervals} shows the example dataset with highlighted activity intervals. All intervals are elements of their corresponding set, where:
\begin{align*}
\mathds{I}^{false}&=\{I^{false}_0, \dots, I^{false}_{nf-1}\}\\
......@@ -95,10 +95,10 @@ However, we can not be sure that a user has always answered the queries or trigg
\input{figures/approach/example_dataset_feedback}
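Before turning to the denoising step, the mapping from indicators to window intervals described above can be sketched in a few lines. The offsets are the values given in this section; representing an indicator as a type string plus window index is an assumption.
\begin{verbatim}
# Offsets in windows relative to the indicator window w (values from this section).
INTERVAL_OFFSETS = {
    "false":   (-35,   0),
    "correct": (-70,  10),
    "neutral": (-35,   5),
    "manual":  (-130, -8),
}

def indicator_interval(kind, w, n_windows):
    # Returns the inclusive window interval [start, end] covered by the indicator,
    # clipped to the bounds of the recording.
    lo, hi = INTERVAL_OFFSETS[kind]
    return max(0, w + lo), min(n_windows - 1, w + hi)
\end{verbatim}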
\subsection{Denoising}
\subsection{Denoising}\label{sec:approachDenoising}
The pseudo labels are derived from the predictions of the general model, so we cannot be sure that they are correct. However, the indicators offer a source of ground truth information about the underlying activity. In this section, I describe different approaches to how the indicators can be used to refine the pseudo labels.
Initially, we can use the raw information of the indicators. As said during a \textit{false} interval, we know that the hand wash predictions are false. So we can set all predictions within the interval to \textit{null}, i.e. the soft label vectors to $\hat{\bm{y}}_i = \begin{bmatrix}1 & 0\end{bmatrix}$. For neutral intervals, it is difficult to make a statement. For example, it could be possible that a user tends more to confirm all right predicted activities than decline false predictions because a confirmation leads to the intended usage of the application. So we can assume that all correctly detected activities are confirmed. If not, they are false predictions. In this case all labels in \textit{neutral} intervals can be set to \textit{null}. Otherwise, we cannot make a safe statement about the correctness of the prediction but exclude these sections from training. Another approach would be to use only data covered by a \textit{false} or \textit{correct} indicator. With correcting the false labels and the knowledge that confirmed predicted labels are probably correct, there is a high certainty in the training data. But labels within \textit{positive} intervals can still contain false predictions. As shown in section \secref{sec:expTransferLearningNoise} as \textit{null} labeled \textit{hw} samples does not have as much negative impact to the performance like \textit{hw} labeled \textit{null} samples. Therefore it is crucial to ensure that no pseudo label of a \textit{null} sample inside a \textit{positive} interval has a high value for hand washing. We know that inside the interval, there should be just a set of adjacent samples with label \textit{hw}. All others should be of type \textit{null}. In the following, I concentrate on approaches to identify these subsets and correct all labels accordingly inside a \textit{positive} interval.
Initially, we can use the raw information of the indicators. As stated, during a \textit{false} interval we know that the hand wash predictions are false, so we can set all predictions within the interval to \textit{null}, i.e. the soft label vectors to $\hat{\bm{y}}_i = \begin{bmatrix}1 & 0\end{bmatrix}$. For \textit{neutral} intervals, it is difficult to make a statement. For example, it could be that a user tends to confirm all correctly predicted activities rather than decline false predictions, because a confirmation leads to the intended usage of the application. In that case we can assume that all correctly detected activities are confirmed and that unanswered detections are false predictions, so all labels in \textit{neutral} intervals can be set to \textit{null}. Otherwise, we cannot make a safe statement about the correctness of the prediction and exclude these sections from training. Another approach would be to use only data covered by a \textit{false} or \textit{correct} indicator. By correcting the false labels and using the knowledge that confirmed predicted labels are probably correct, there is high certainty in the training data. But labels within \textit{positive} intervals can still contain false predictions. As shown in \secref{sec:expTransferLearningNoise}, \textit{hw} samples labeled as \textit{null} do not have as much negative impact on the performance as \textit{null} samples labeled as \textit{hw}. Therefore, it is crucial to ensure that no pseudo label of a \textit{null} sample inside a \textit{positive} interval has a high value for hand washing. We know that inside the interval there should be just one set of adjacent samples with label \textit{hw}; all others should be of type \textit{null}. In the following, I concentrate on approaches to identify these subsets and correct all labels inside a \textit{positive} interval accordingly.
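A sketch of this raw correction step, assuming the soft pseudo labels are stored as an $(m, 2)$ array with columns $[null, hw]$ and the intervals as inclusive window index pairs:
\begin{verbatim}
import numpy as np

def apply_raw_indicator_correction(soft_labels, false_intervals,
                                   neutral_intervals,
                                   assume_unconfirmed_is_false=True):
    # soft_labels: array of shape (m, 2), columns = [null, hw].
    # Returns corrected labels and a mask of samples to keep for training.
    labels = soft_labels.copy()
    keep = np.ones(len(labels), dtype=bool)

    # Declined detections: everything in a *false* interval is null.
    for start, end in false_intervals:
        labels[start:end + 1] = [1.0, 0.0]

    for start, end in neutral_intervals:
        if assume_unconfirmed_is_false:
            # Unanswered detections are treated as false predictions.
            labels[start:end + 1] = [1.0, 0.0]
        else:
            # Otherwise exclude these uncertain sections from training.
            keep[start:end + 1] = False

    return labels, keep
\end{verbatim}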
\subsubsection{Naive}
In a naive approach, we search for the largest group of neighboring pseudo labels with a high value for hand washing. This is done by computing a score over all subsets of adjacent labels. The score of a subset $Sub_k=\{\hat{\bm{y}}_p, \dots, \hat{\bm{y}}_q\}$ is computed by:
......@@ -158,7 +158,7 @@ S~score &= 2\cdot\frac{Specificity\cdot Sensitivity}{Specificity + Sensitivity}
The sensitivity gives the ratio of correctly recognized hand washing samples, while the specificity gives the ratio of correctly recognized non-hand-washing samples. Both have to be close to 1 for a well-performing model. Precision is the ratio of correctly predicted hand wash samples over all predicted hand wash samples. The F1 score is the harmonic mean between recall and precision, where recall is the same as sensitivity. So it measures the trade-off between detecting all hand wash activities and only triggering when the user is actually washing their hands. Similarly, the S score is the harmonic mean between sensitivity and specificity. Here, the false positives are weighted relative to all \textit{null} samples rather than relative to the predicted hand wash samples.
However, these metrics only consider the crisp label values. So they are unable to reflect the uncertainty of the labels. Therefore I use a slightly different definition of the previous metrics, which work with class membership values~\cite{Beleites2013Mar, Beleites2015Feb}:
However, these metrics only consider the hard label values, so they are unable to reflect the uncertainty of the labels. Therefore, I use slightly different definitions of the previous metrics, which work with class membership values~\cite{Beleites2013Mar, Beleites2015Feb}:
\begin{align}
Sensitivity^{soft} = Recall^{soft} &= 1-\sum_n \frac{Y}{\sum_n Y}|\hat{Y}-Y|\\
......@@ -182,7 +182,7 @@ To observe wrongly predicted activities I take the same assumption as for the \t
Furthermore, this mechanism can also be used to redefine the values of kernel width and threshold for the running mean. I apply a grid search over kernel sizes of $[10, 15, 20]$ and thresholds of $[0.5, 0.52, \dots, 0.98]$. The resulting true positive and false positive counts are computed for each value combination. The combination that retains at least the same number of true positives while producing the fewest false positives is selected as the optimal running-mean setting.
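A sketch of this grid search, assuming a hypothetical helper \texttt{count\_tp\_fp(predictions, kernel, threshold)} that returns the detected true and false positives for a given running-mean setting:
\begin{verbatim}
import numpy as np

def tune_running_mean(predictions, count_tp_fp, baseline_tp,
                      kernels=(10, 15, 20),
                      thresholds=np.arange(0.50, 0.99, 0.02)):
    # Pick the kernel width and threshold that keep at least `baseline_tp`
    # true positives while minimizing the false positives.
    best = None
    for kernel in kernels:
        for threshold in thresholds:
            tp, fp = count_tp_fp(predictions, kernel, threshold)
            if tp >= baseline_tp and (best is None or fp < best[0]):
                best = (fp, kernel, threshold)
    if best is None:
        return None            # no setting preserves the true positives
    return {"kernel": best[1], "threshold": best[2], "false_positives": best[0]}
\end{verbatim}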
\section{Personalization Pipeline}
\section{Personalization pipeline}
In this section, I summarize my implementation that extends the given application. The personalization process is shown in \figref{fig:personalizationImplementation}.
\input{figures/approach/personalization_implementation}
......
\chapter{Experiments}\label{chap:experiments}
In this chapter, I present my experiments. First, I define the experimental setup with the used datasets and metrics. Then I introduce a supervised personalization as a baseline and use it to investigate the performance in the presence of label noise. Afterward, I evaluate different methods of pseudo label generation and their noise reduction. Next, the development of the personalization is analyzed by determining the performance over each individual iteration step. I compare my approach to an active learning implementation to evaluate the overall performance. Finally, I present multiple real-world experiments.
\section{Experiment Setup}
\section{Experiment setup}
To evaluate my personalization approach, I use different metrics which rely on ground truth data, so fully labeled datasets are required. I created synthetic recordings for 3 participants as described in~\secref{sec:synDataset}. Additionally, I recorded multiple daily usages of the base application and split these recordings into two participants. For the second participant, I focused on more intense hand washing than usual. The participant recordings are split into training and test sets by the leave-one-out method. \tabref{tab:supervisedDatasets} shows the resulting datasets in detail. I set the test split ratio higher than usual, since a recording can contain long periods of low motion; this guarantees a wide variety of covered activities. The measurement metrics are Specificity, Sensitivity, F1 score, and S score. For each evaluation, the personalized models of the participants are applied to the respective test sets, and the mean over all participants is computed. The models are implemented using PyTorch~\cite{Paszke2019}.
\input{figures/experiments/table_supervised_datasets}
......@@ -26,7 +26,7 @@ Now we observe the scenarios where some of the labels are noisy. Therefore we lo
I attribute the high performance loss to the imbalance of the labels. A typical daily recording of around 12 hours contains $28,800$ labeled windows. A single hand wash action of $20$ seconds covers ${\sim}13$ windows. If a user washed their hands $10$ times a day, this would lead to $130$ \textit{hw} labels and $28,670$ \textit{null} labels. Even $50\%$ of noise on the \textit{hw} labels results in only ${\sim}0.2\%$ of false data. However, $1\%$ of flipped \textit{null} labels already leads to ${\sim}68\%$ of wrong hand wash labels, so they would have a higher impact on the training than the original hand wash data. As the S score of \figref{fig:supervisedNoisyPart} shows, it is possible that the personalized model benefits from additional data if the ratio of noise in the \textit{null} labels is smaller than ${\sim}0.2\%$. The training data of the experiments contains $270,591$ \textit{null} labels and $2,058$ hand wash labels, so ${\sim}0.2\%$ noise would lead to ${\sim}541$ false \textit{hw} labels, which is ${\sim}20\%$ of the hand wash labels in the training data. As a rule of thumb, I claim that the training data should contain less than ${\sim}20\%$ wrong hand wash labels, whereas the amount of incorrect \textit{null} labels does not require particular focus.
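The imbalance argument can be verified with the numbers from this paragraph:
\begin{verbatim}
# A typical daily recording: 28,800 labeled windows (as stated above).
windows_per_day = 28_800
hw_labels = 10 * 13                        # 10 hand washes, ~13 windows each -> 130
null_labels = windows_per_day - hw_labels  # -> 28,670

print(0.5 * hw_labels / windows_per_day)   # ~0.002: 50% hw noise is ~0.2% false data

flipped = 0.01 * null_labels               # 1% flipped null labels, ~287 windows
print(flipped / (flipped + hw_labels))     # ~0.69: ~68% of all hw labels are wrong

false_hw = 0.002 * 270_591                 # ~541 false hw labels in the experiments
print(false_hw / (false_hw + 2_058))       # ~0.21: ~20% of the hw training labels
\end{verbatim}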
\subsection{Hard vs. Soft labels}\label{sec:expHardVsSoft}
\subsection{Hard vs. soft labels}\label{sec:expHardVsSoft}
In these experiments, I would like to show the effect of noise on soft labels compared to hard labels. Similar to before, different amounts of label flips are applied to the training data. Then the labels are smoothed to a degree $s\in [0, 0.49]$. As seen before, noise on \textit{hw} labels does not significantly impact the performance, so not many changes in performance due to different smoothing are expected. This is confirmed by \figref{fig:supervisedSoftNoiseHW}. At larger noise levels, a slight increase in the S score can be observed. I therefore focus on noise in \textit{null} labels. \figref{fig:supervisedSoftNoiseNull} gives detailed insights into the performance impact. The specificity increases with higher smoothing for all noise values, which becomes clearer for more noise. However, the sensitivity decreases slightly, especially for higher noise rates. Overall, the F1 score and S score benefit from smoothing. In the case of $0.2\%$ noise, the personalized models trained on smoothed false labels can reach a higher S score than the base model.
\input{figures/experiments/supervised_soft_noise_hw}
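One plausible reading of the smoothing step, assuming hard labels $0$ (\textit{null}) and $1$ (\textit{hw}) are turned into soft vectors $[1-s, s]$ and $[s, 1-s]$, respectively; the exact smoothing scheme is an assumption:
\begin{verbatim}
import numpy as np

def smooth_labels(hard_labels, s):
    # Turn hard labels into soft label vectors [null, hw]:
    # a hw label becomes [s, 1 - s], a null label [1 - s, s].
    # s = 0 reproduces the hard labels, s -> 0.5 expresses maximal uncertainty.
    hard = np.asarray(hard_labels)
    soft = np.empty((len(hard), 2))
    soft[hard == 1] = [s, 1.0 - s]
    soft[hard == 0] = [1.0 - s, s]
    return soft
\end{verbatim}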
......@@ -49,7 +49,7 @@ I combined both experiments to ensure that the drawbacks of incorrectly smoothed
\section{Evaluation of different Pseudo label generations}\label{sec:expPseudoModels}
\section{Evaluation of different pseudo label generations}\label{sec:expPseudoModels}
In this section, I describe the evaluation of different pseudo-labeling approaches using the filters introduced in \secref{sec:approachFilterConfigurations}. For each filter configuration, the base model is used to predict the labels of the training sets and create pseudo labels. After that, the filter is applied to the pseudo labels. To determine the quality of the pseudo labels, they are evaluated against the ground truth values using the soft metrics $Sensitivity^{soft}$, $Specificity^{soft}$, $F_1^{soft}$, $S^{soft}$. The refined pseudo labels then train the general model. All resulting models are evaluated on their test sets, and the mean over all participants is computed. \figref{fig:pseudoModelsEvaluation} shows a bar plot of the metrics for all filter configurations. In terms of performance, I concentrate on the values of the S score.
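As a sketch, the soft sensitivity defined earlier and an S score built from it could be computed as follows; the analogous soft specificity on the \textit{null} memberships is an assumption, since its exact definition is not repeated in this excerpt.
\begin{verbatim}
import numpy as np

def soft_sensitivity(y_true_hw, y_pred_hw):
    # Soft sensitivity as defined earlier:
    # 1 - sum( (Y / sum(Y)) * |Y_hat - Y| ), with Y the ground-truth hw
    # membership and Y_hat the (refined) pseudo-label hw membership.
    y = np.asarray(y_true_hw, dtype=float)
    y_hat = np.asarray(y_pred_hw, dtype=float)
    return 1.0 - np.sum((y / y.sum()) * np.abs(y_hat - y))

def soft_specificity(y_true_hw, y_pred_hw):
    # Assumption: the same weighting applied to the null memberships (1 - hw).
    return soft_sensitivity(1.0 - np.asarray(y_true_hw, dtype=float),
                            1.0 - np.asarray(y_pred_hw, dtype=float))

def s_score(sensitivity, specificity):
    # Harmonic mean of specificity and sensitivity, as defined earlier.
    return 2.0 * specificity * sensitivity / (specificity + sensitivity)
\end{verbatim}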
\subsubsection{Baseline configurations}
......@@ -69,7 +69,7 @@ The following experiment shows the impact of missing user feedback on the traini
As can be seen, missing \textit{false} indicators do not lead to significant performance changes. The \texttt{all\_null\_*} filter configurations include all samples as \textit{null} labels without depending on the indicator. Similarly, the \texttt{all\_cnn\_*} configurations contain a greater share of high-confidence \textit{null}-labeled samples than the sections covered by the \textit{false} indicators.
In contrast, missing \textit{correct} indicators lead to performance loss. However, a negative trend in S score can be seen just for scenarios where less than $40\%$ of hand washing activities have been confirmed. Even with just $20\%$ of answered detections, the resulting personalized model outperforms the general model. So it is enough if only a few hand washing samples are in a dataset to impact the training positively.
In contrast, missing \textit{correct} indicators lead to performance loss. However, a negative trend in the S score can be seen only for scenarios where less than $40\%$ of hand washing activities have been confirmed. Even with just $20\%$ of answered detections, the resulting personalized model outperforms the general model. So even a few hand washing samples in a dataset are enough to impact the training positively. If we focus on \texttt{all\_null\_convlstm2} and \texttt{all\_cnn\_convlstm2\_hard} as well as on \texttt{all\_null\_convlstm3} and \texttt{all\_cnn\_convlstm3\_hard}, we can see that in both cases the \texttt{all\_null\_*} filters perform better than the \texttt{all\_cnn\_*} filters with full feedback, but in the absence of feedback the \texttt{all\_cnn\_*} configurations dominate. Therefore, the \texttt{all\_cnn\_*} filters should be preferred when it cannot be assumed that a user responds to all hand wash actions.
\input{figures/experiments/supervised_pseudo_missing_feedback}
......@@ -88,7 +88,7 @@ In this step, I compare the evaluation of the personalized model over the differ
\section{Evaluation of personalization}
\subsection{Compare Active learning with my approach}
\subsection{Compare active learning with my approach}
To confirm the robustness of my personalization approach, I compare it with a common active learning implementation as introduced in \secref{sec:approachActiveLearning}. To find an appropriate selection of the hyper-parameters $B$, $s$, $h$, the use of weighting, and the number of epochs, I use a grid search approach. \tabref{tab:activeLearningGridSearch} shows the covered values for the hyper-parameters. The 10 parameter constellations which yield the best S scores are shown in \tabref{tab:activeLearningEvaluation}. They achieve S scores from ${\sim}0.8444$ to ${\sim}0.8451$. From the experiment of \secref{sec:expPseudoModels}, we know that the models based on the \texttt{all\_cnn\_*\_hard} configurations reach scores around ${\sim}0.86$, and \texttt{all\_null\_deepconv}, \texttt{all\_null\_fcndae} and \texttt{all\_null\_convlstm1} even reach ${\sim}0.867$. So my personalization approach outperforms active learning in terms of performance increase.
In the next step, I analyze the required user interaction. The best-performing hyper-parameter setting relies on a budget of 20, so the user has to answer 20 queries. The training data contains $3.4$ manual, $10.6$ correct, and $53.4$ false indicators on average per participant. If we equate answering a query with the user feedback from which the indicators are drawn, the active learning approach would lead to less user interaction. However, as shown in the experiment of \secref{sec:expMissingFeedback}, false indicators do not significantly impact the training process. Therefore, a user could ignore false hand washing detections and only answer for correct and manual activities. This would result in $14$ user interactions, which is less than the budget of the active learning implementation. Furthermore, positive feedback is only gathered when the application has to react to the target activity, so in this case the user interaction is intended anyway.
......@@ -103,18 +103,19 @@ I compare the estimated F1 score with the ground truth evaluation in this sectio
\input{figures/experiments/table_quality_estimation_evaluation}
\section{Real-world analysis}
In a corporation with the University of Basel, I evaluated my personalization approach on data collected by a study over multiple participants. They wore a smartwatch running the base application almost every day for a month. Most participants showed indications of obsessive hand washing. The collected data covers XX participants with overall XXX hours of sensor data and XXX user feedback indicators. \tabref{tab:realWorldDataset} shows the data over each participant in detail. Since no exact labeling for the sensor values exists, I used the quality estimation approach for evaluation. The recordings for testing have been selected in advance by hand. \tabref{tab:realWorldGeneralEvaluation} shows the evaluation of the base model to the test set as it is applied in the application. These values build the baseline, which has to be beaten by personalization.
In cooperation with the University of Basel, I evaluated my personalization approach on data collected in a study with multiple participants. They wore a smartwatch running the base application almost every day for a month. Most participants showed indications of obsessive hand washing. The collected data covers 14 participants with overall 2682 hours of sensor data and 1390 user feedback indicators. \tabref{tab:realWorldDataset} shows the data for each participant in detail. Since no exact labeling of the sensor values exists, I used the quality estimation approach for evaluation. The recordings for testing have been selected in advance by hand. \tabref{tab:realWorldGeneralEvaluation} shows the evaluation of the base model on the test set as it is applied in the application. These values form the baseline, which has to be beaten by the personalization.
\input{figures/experiments/table_real_world_datasets}
\input{figures/experiments/table_real_world_general_evaluation}
For each training recording, the base model is used to generate predictions for the pseudo labels. After that, one of the filter configurations \texttt{all\_null\_convlstm3}, \texttt{all\_cnn\_convlstm2\_hard} and \texttt{all\_cnn\_convlstm3\_hard} is applied. The resulting dataset is used for training, based either on the previous model or on the model with the best F1 score. As regularization, freezing layers or the L2-SP penalty is used. Over all personalizations of a participant, the model with the highest F1 score is determined. \tabref{tab:realWorldEvaluation} shows the resulting best personalization of each participant. Additionally, the last three columns contain the evaluation of the base model after adjusting the kernel settings. The difference between the personalization and the adjusted base model values gives the true performance increase of the retraining.
Entries with zero iterations, as for participants OCDetect\_09 and OCDetect\_13, state that no better personalization could be found.
Entries with zero iterations, as for participants OCDetect\_12 and OCDetect\_13, indicate that no better personalization could be found.
For all other participants, the personalization process generated a model that performs better than the general. All of them are based on the best model for each iteration step. From this, it can be concluded that fewer but good recordings lead to a better personalization than iterating over all available. They only rely on one or two iterations, and in most cases, l2-sp regularization was used. The highest increase in F1 score achieved OCDetect\_10 with $0.10$. In comparison to the general model without adjusted kernel settings, the F1 score increases by $0.2379$. In practice, this would lead to $82\%$ less incorrect hand wash notifications and $14\%$ more correct detections. The participants OCDetect\_02, OCDetect\_07 and OCDetect\_12 achieve an increase of around $0.05$. For OCDetect\_12, the personalization would lead to $6\%$ more wrong triggers but also increase the detection of correct hand washing activities by $45\%$. All best personalizations used either the \texttt{all\_cnn\_convlstm2\_hard} or \texttt{all\_cnn\_convlstm3\_hard} filter configurations. The mean F1 score, over all best-personalized models, achieves an increase of $0.0266$ compared to the general model with adjusted kernel settings and $XX$ to the plain general model without adjusting kernel settings. It leads to a reduction of the false detections by $27\%$ and an increase of correct detections by $5\%$.
For all other participants, the personalization process generated a model that performs better than the general model with adjusted kernel settings. All of them are based on the best model of each iteration step. From this, it can be concluded that fewer but good recordings lead to a better personalization than iterating over all available ones. They rely on at most three iterations, and L2-SP regularization was used in more cases than layer freezing. The highest increase in F1 score was achieved by OCDetect\_21 with a difference of $0.25$. Compared to the general model without adjusted kernel settings, the F1 score increases by $0.355$. In practice, this would lead to the same amount of incorrect hand wash notifications and $80\%$ more correct detections. The highest decrease in false predictions is achieved by participant OCDetect\_10 with $74\%$. All best personalizations except one use the \texttt{all\_cnn\_convlstm3\_hard} filter configuration. However, for participants OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_cnn\_convlstm2\_hard} filter configuration achieves the same score. Moreover, for participants OCDetect\_05, OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_null\_convlstm3} filter configuration also reaches the same F1 scores.
So the \texttt{all\_cnn\_*\_hard} configurations outperform the \texttt{all\_null\_convlstm3} configuration. This could indicate that the participants did not report all hand washing actions, so the \texttt{all\_null\_convlstm3} filter configuration generates too many false negatives.
\todo{Fix broken values}
\extend{new participants}
The mean F1 score over the best-personalized models increases by $0.044$ compared to the general model with adjusted kernel settings and by $0.11$ compared to the plain general model without adjusted kernel settings. That is $9.6\%$ and $28.2\%$, respectively. So personalization leads to a reduction of false detections by $31\%$ and an increase of correct detections by $16\%$.
\input{figures/experiments/table_real_world_evaluation}
......@@ -19,13 +19,14 @@ The previous section's experiments gave several insights into personalization in
\end{itemize}
The real-world experiment summarizes these findings and combines different aspects to achieve the best possible personalization. The pillar of this approach builds the opportunity to evaluate various personalized models and compare them. The quality estimation makes it possible to find the best-personalized model for each new recording. Therefore erroneous data which would lead to a heavily noisy training set can be detected and filtered out. Since most best-performing personalizations depend on just a few additional training data, it is sufficient if, among several days of records, only a few well usable exist.
\extend{When full experiments are done}
The real-world experiment summarizes these findings and combines different aspects to achieve the best possible personalization. The pillar of this approach is the opportunity to evaluate various personalized models and compare them. The quality estimation makes it possible to find the best-personalized model for each new recording. Therefore, erroneous data which would lead to a heavily noisy training set can be detected and filtered out. Since most best-performing personalizations depend on just a few additional training recordings, it is sufficient if, among several days of records, only a few well usable ones exist.\\
The experiment also confirms that the \texttt{all\_cnn\_*} filter configurations are better suited for a broader user base than the \texttt{all\_null\_*} configurations since they are more robust against missing feedback. For all participants, the \texttt{all\_cnn\_*} filter configurations achieved at least the same F1 scores as the \texttt{all\_null\_*} configurations, and in most cases, they outperformed them.
\section{Future work}
The performance of the personalization heavily depends on the quality of the pseudo-labels. Therefore the filter configurations used for denoising them have a significant impact. More work on hyper-parameter tuning can lead to further improvements.
The performance of the personalization heavily depends on the quality of the pseudo labels, so the filter configurations used for denoising them have a significant impact. More work on hyper-parameter tuning can lead to further improvements. Also, new denoising concepts can be tried; since the refinement of the pseudo labels is separated from the rest of the pipeline, it is easy to implement different approaches.
Additionally, other sources of indicators can be considered. For example, Bluetooth beacons can be placed by the sinks. The distance between the watch and the sink can be estimated if the watch is within range. A short distance indicates that the user is probably washing their hands. Such an indicator can be handled similarly to \textit{manual} feedback.
Furthermore, this approach offers the opportunity to learn new classes. For example, during the real-world experiment, the participants were asked for each hand washing activity whether the action was compulsive. So additional information exists for each hand washing. The target task can be adapted to learn a new classification into $A=\{null,hw,compulsive\}$. The resulting model would be able to distinguish not only between hand washing and not hand washing but also between regular and compulsive hand washing.
......@@ -3,4 +3,4 @@ In this work, I have elaborated a personalization process for human activity rec
I evaluated personalization in general on a theoretical basis with supervised data. These experiments revealed the impact of noise on the highly imbalanced data and how soft labels can counter training errors. Based on these insights, several constellations and filter approaches for the training data have been implemented to analyze the behavior of the resulting models under different aspects. I found that just using the predictions of the base model leads to performance decreases, since they contain too much label noise. However, relying only on data covered by user feedback does not overcome the general model either, although this training data hardly contains false labels. Therefore, more sophisticated denoising approaches are implemented that generate training data consisting of various samples with as few incorrect labels as possible. This data leads to personalized models that achieve higher F1 and S scores than the general model. Some of the configurations even result in performance similar to supervised training.
Furthermore, I compared my personalization approach with an active learning implementation as a common personalization method. The sophisticated filter configurations achieve higher S scores, confirming my approach's robustness. The real-world experiment in corporation with the University of Basel offered a great opportunity to evaluate my personalization approach to a large variety of users and their feedback behaviors. It confirms that in most cases, personalized models outperform the general model. Overall, personalization would reduce the false detections by $XX\%$, and increase correct detections by $XX\%$.
\ No newline at end of file
Furthermore, I compared my personalization approach with an active learning implementation as a common personalization method. The sophisticated filter configurations achieve higher S scores, confirming the robustness of my approach. The real-world experiment in cooperation with the University of Basel offered a great opportunity to evaluate my personalization approach on a large variety of users and their feedback behaviors. It confirms that in most cases, personalized models outperform the general model. Overall, personalization would reduce the false detections by $31\%$ and increase correct detections by $16\%$.
\ No newline at end of file
......@@ -9,7 +9,7 @@
\subfloat[evaluation]
{\includegraphics[width=0.28\textwidth]{figures/approach/base_application_screen_eval.png}}
\caption[Base application screen shots]{\textbf{Base application screen shots.} (a) shows the apllication in default state, where the user has the opportunity to trigger a hand wash event manually. (b) shows the notification, which appears, when the application has detected a hand wash activity. Here the user can confirm or decline. (b) shows one of the evaluation queries which the user has to answer for the OCD observation. These are shown, if the user triggered a manual hand wash event or confirmed a detected hand washing activity.}
\caption[Base application screen shots]{\textbf{Base application screen shots.} (a) shows the application in its default state, where the user has the opportunity to trigger a hand wash event manually. (b) shows the notification which appears when the application has detected a hand wash activity; here the user can confirm or decline. (c) shows one of the evaluation queries which the user has to answer for the OCD observation. These are shown if the user triggered a manual hand wash event or confirmed a detected hand washing activity.}
\label{fig:baseApplicationScreen}
\end{centering}
\end{figure}
......@@ -18,8 +18,10 @@
OCDetect\_11 & 19 & 46397570 & 257 & 53 & 11 & 39 & 35 \\
OCDetect\_12 & 13 & 8299920 & 46 & 72 & 1 & 0 & 5 \\
OCDetect\_13 & 15 & 33018908 & 183 & 21 & 21 & 14 & 39 \\
OCDetect\_18 & 8 & 12937161 & 71 & 47 & 20 & 9 & 17 \\
OCDetect\_20 & 14 & 29443317 & 163 & 9 & 179 & 64 & 69 \\
OCDetect\_18 & 11 & 17733047 & 98 & 63 & 33 & 14 & 26 \\
OCDetect\_19 & 3 & 9463742 & 52 & 4 & 25 & 18 & 173 \\
OCDetect\_20 & 17 & 34115873 & 189 & 11 & 207 & 74 & 78 \\
OCDetect\_21 & 5 & 8234224 & 45 & 4 & 7 & 12 & 19 \\
\bottomrule
\end{tabular}}}
......@@ -40,7 +42,9 @@
OCDetect\_12 & 5 & 6502526 & 36 & 76 & 0 & 0 & 1 \\
OCDetect\_13 & 6 & 16679159 & 92 & 11 & 19 & 15 & 37 \\
OCDetect\_18 & 4 & 8249562 & 45 & 40 & 30 & 12 & 22 \\
OCDetect\_19 & 3 & 6975685 & 38 & 9 & 30 & 21 & 68 \\
OCDetect\_20 & 4 & 7162813 & 39 & 13 & 47 & 20 & 12 \\
OCDetect\_21 & 2 & 6049766 & 33 & 8 & 5 & 2 & 2 \\
\bottomrule
\end{tabular}}}
......
......@@ -8,16 +8,20 @@
\toprule
\thead{participant} & \thead{filter\\ configuration} & \thead{base\\ on\\ best} & \thead{l2-sp} & \thead{\rotatebox[origin=c]{-90}{iterations}} & \thead{false\\ diff\\ relative} & \thead{correct\\ diff\\ relative} & \thead{F1} & \thead{base\\ false\\ diff\\ relative} & \thead{base\\ correct\\ diff\\ relative} & \thead{base\\F1} & \thead{F1\\diff}\\
\midrule
OCDetect\_02 & all\_cnn\_convlstm3\_hard & True & True & 2 & -0.3600 & 0.1636 & 0.5246 & -0.2057 & 0.1455 & 0.4667 & 0.0579 \\
OCDetect\_03 & all\_cnn\_convlstm2\_hard & True & True & 1 & -0.3880 & 0.0000 & 0.4800 & -0.2240 & 0.1212 & 0.4728 & 0.0072 \\
OCDetect\_04 & all\_cnn\_convlstm3\_hard & True & True & 1 & -0.3889 & 0.1176 & 0.5507 & -0.1111 & 0.1176 & 0.5135 & 0.0372 \\
OCDetect\_05 & all\_cnn\_convlstm3\_hard & True & True & 2 & -0.1336 & 0.2111 & 0.3664 & -0.1270 & 0.1556 & 0.3514 & 0.0150 \\
OCDetect\_02 & all\_cnn\_convlstm2\_hard & True & True & 3 & -0.2457 & 0.1455 & 0.4791 & -0.2057 & 0.1455 & 0.4667 & 0.0124 \\
OCDetect\_03 & all\_cnn\_convlstm3\_hard & True & True & 1 & -0.4754 & 0.0000 & 0.5097 & -0.2240 & 0.1212 & 0.4728 & 0.0368 \\
OCDetect\_04 & all\_cnn\_convlstm3\_hard & True & True & 3 & -0.0556 & 0.2941 & 0.5641 & -0.1111 & 0.1176 & 0.5135 & 0.0506 \\
OCDetect\_05 & all\_cnn\_convlstm3\_hard & True & False & 1 & -0.0847 & 0.2000 & 0.3547 & -0.1270 & 0.1556 & 0.3514 & 0.0033 \\
OCDetect\_07 & all\_cnn\_convlstm3\_hard & True & False & 2 & -0.6429 & 0.0769 & 0.8000 & -0.5714 & 0.0000 & 0.7429 & 0.0571 \\
OCDetect\_09 & all\_cnn\_convlstm3\_hard & True & False & 0 & -0.5273 & 0.1000 & 0.4000 & -0.5273 & 0.1000 & 0.4000 & 0.0000 \\
OCDetect\_10 & all\_cnn\_convlstm2\_hard & True & True & 2 & -0.8209 & 0.1429 & 0.3265 & -0.6418 & 0.1429 & 0.2192 & 0.1074 \\
OCDetect\_11 & all\_cnn\_convlstm2\_hard & True & True & 1 & -0.1333 & 0.2857 & 0.3600 & -0.4000 & 0.1429 & 0.3556 & 0.0044 \\
OCDetect\_12 & all\_cnn\_convlstm3\_hard & True & False & 2 & 0.0625 & 0.4516 & 0.6618 & -0.3125 & 0.1613 & 0.5950 & 0.0667 \\
OCDetect\_13 & all\_null\_convlstm3 & True & False & 0 & -0.4600 & 0.0000 & 0.4471 & -0.4600 & 0.0000 & 0.4471 & 0.0000 \\
OCDetect\_09 & all\_cnn\_convlstm3\_hard & True & False & 2 & -0.6364 & 0.0000 & 0.3385 & -0.3117 & 0.1818 & 0.2826 & 0.0559 \\
OCDetect\_10 & all\_cnn\_convlstm3\_hard & True & True & 3 & -0.7388 & 0.1429 & 0.2667 & -0.6418 & 0.1429 & 0.2192 & 0.0475 \\
OCDetect\_11 & all\_cnn\_convlstm3\_hard & True & True & 1 & -0.2500 & 0.2222 & 0.3284 & -0.4167 & 0.1111 & 0.3226 & 0.0058 \\
OCDetect\_12 & all\_cnn\_convlstm3\_hard & True & False & 0 & -0.3077 & 0.1562 & 0.6016 & -0.3077 & 0.1562 & 0.6016 & 0.0000 \\
OCDetect\_13 & all\_cnn\_convlstm3\_hard & False & False & 0 & -0.2131 & 0.0000 & 0.3509 & -0.2131 & 0.0000 & 0.3509 & 0.0000 \\
OCDetect\_18 & all\_cnn\_convlstm3\_hard & True & True & 2 & -0.1818 & 0.1905 & 0.6623 & 0.0000 & 0.1190 & 0.6184 & 0.0438 \\
OCDetect\_19 & all\_cnn\_convlstm3\_hard & True & False & 1 & 0.0000 & 0.0476 & 0.2993 & -0.1220 & -0.0476 & 0.2963 & 0.0030 \\
OCDetect\_20 & all\_cnn\_convlstm3\_hard & True & True & 3 & -0.5000 & 0.0000 & 0.7937 & -0.1250 & 0.0000 & 0.7407 & 0.0529 \\
OCDetect\_21 & all\_cnn\_convlstm3\_hard & True & True & 1 & 0.0000 & 0.8000 & 0.7500 & 0.0000 & 0.0000 & 0.5000 & 0.2500 \\
\bottomrule
\end{tabular}}
......
......@@ -19,7 +19,9 @@
OCDetect\_12 & 77 & 32 & 13 & 0.5246 \\
OCDetect\_13 & 46 & 20 & 61 & 0.3150 \\
OCDetect\_18 & 83 & 42 & 22 & 0.5714 \\
OCDetect\_20 & 64 & 50 & 24 & 0.7246 \\
OCDetect\_19 & 43 & 21 & 82 & 0.2877 \\
OCDetect\_20 & 64 & 50 & 25 & 0.7194 \\
OCDetect\_21 & 14 & 5 & 2 & 0.4762 \\
\bottomrule
\end{tabular}}
......