Commit f5552682 authored by Alexander Henkel's avatar Alexander Henkel

work on experiments

parent 5d22fdd2
......@@ -94,4 +94,4 @@ In some other approaches, the spatial encoder/decoder is separated from the temp
\subsection{Soft labels}\label{sec:relSoftLabels}
In most cases the label of a sample $x_i$ is a crisp label $y_i\in K$ which denotes exactly one of the $c$ predefined classes $K\equiv\{1, \dots , c\}$ to which this sample belongs~\cite{Li2012Jul, Sun2017Oct}. Typically, labels are transformed into a one-hot encoded vector $\bm{y}_i=[y^{(1)}_i, \dots, y^{(c)}_i]$ for training, since this is required by the loss function. If the sample is of class $j$, the $j$-th value in the vector is one, whereas all other values are zero, so $\sum_{k=1}^{c}y_i^{(k)}=1$ and $y_i^{(k)}\in\{0,1\}$. For soft labels, $y_i^{(k)}\in[0,1]\subset \mathbb{R}$, which allows assigning the degree to which a sample belongs to a particular class~\cite{ElGayar2006}. Therefore soft labels can depict uncertainty over multiple classes~\cite{Beleites2013Mar}. Since the output of a human activity recognition model is usually computed by a softmax layer, it already represents a vector of partial class memberships, and converting it to a crisp label would lead to information loss. So if these predictions are used as soft-labeled training data, the values indicate how certainly the sample belongs to the respective classes.
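For illustration (with arbitrarily chosen values): given $c=3$ classes, a sample of class $2$ carries the crisp one-hot label $\bm{y}_i=[0, 1, 0]$, whereas a softmax output kept as a soft label, such as $\bm{y}_i=[0.15, 0.7, 0.15]$, preserves the model's uncertainty about the other classes; hardening it back to $[0, 1, 0]$ would discard exactly this information.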
Therefore soft labels can also be used for more robust training with noisy labels. It can happen that the noise in a generated label stems from a very uncertain prediction. This noise is amplified by using crisp labels, but has less impact on the training process when kept as a soft label. It has been shown that soft labels can carry valuable information even when they are noisy: a model trained with multiple sets of noisy labels can outperform a model trained on a single set of ground-truth labels~\cite{Ahfock2021Sep, Thiel2008}. Hu et al. used soft-labeled data for human activity recognition to overcome the problem of incomplete or uncertain annotations~\cite{Hu2016Oct}. They outperformed hard labeling methods in accuracy, precision and recall.
......@@ -8,7 +8,7 @@ Finally, I present an active learning implementation, which is used for performa
\section{Base Application}\label{sec:approachBaseApplication}
The application I want to personalize detects hand washing activities and runs on a smart watch. It is used to observe obsessive behavior of a participant in order to treat OCD. If the device detects a hand wash activity, a notification is prompted to the user, which can then be confirmed or declined. A confirmation leads to a survey in which the user can rate their mental condition. Furthermore, a user can trigger a manual evaluation if a hand washing was not detected by the device. These evaluations can later be used by psychologists to analyze and treat the participant's state during the day. For activity prediction the application uses a general neural network model based on the work of Robin Burchard \cite{robin2021}.
The integrated IMU of the smart watch is used to record wrist movements, and the sensor data is stored in a buffer. After a cycle of 10 seconds, the stored data is used to predict the current activity. For this, a sliding window with a length of 3 seconds and a shift of 1.5 seconds is applied to the buffer. For each window, the largest distance between the sensor values is calculated to filter out sections with very little movement. If there is some motion, the general recognition model is applied to the windows of this section to predict the current activity label. To avoid detections based on outliers, a running mean is computed over the last $kw$ predictions; only if it exceeds a certain threshold $kt$ is the final detection triggered. Additionally, the buffer is appended to an overall recording in the internal storage. While the smart watch is charging, all sensor recordings and user evaluations are sent to an HTTP server.
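The following is a minimal sketch of this detection loop, assuming that \texttt{model} returns the probability of the hand wash class for a window given as a NumPy array; the values of \texttt{KW}, \texttt{KT} and the movement cutoff as well as all names are illustrative, not taken from the application code.
\begin{verbatim}
import numpy as np

KW = 20             # kernel width kw: number of past predictions (assumed)
KT = 0.6            # threshold kt for the running mean (assumed)
MIN_MOVEMENT = 0.1  # illustrative low-movement cutoff

def detect(windows, model):
    """Simulate the on-watch loop over 3 s windows with 1.5 s shift."""
    preds, detections = [], []
    for i, window in enumerate(windows):
        # Low-movement filter: skip inference if sensor values barely change.
        if window.max() - window.min() < MIN_MOVEMENT:
            preds.append(0.0)
        else:
            preds.append(model(window))  # probability of the hand wash class
        # Running mean over the last KW predictions triggers the detection.
        if np.mean(preds[-KW:]) > KT:
            detections.append(i)
    return detections
\end{verbatim}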
......@@ -170,6 +170,7 @@ The terms $Y$ and $\hat{Y}$ are the sets of soft-label vectors of the ground tru
New models resulting from the personalization should be evaluated to ensure that the new model performs better than the one currently used. To determine the quality of a model, the predicted values are compared to the actual ground truth data using metrics such as those introduced in \secref{sec:approachMetrics}. However, in our case there is no ground truth data available for common evaluations. Therefore I again use the information given by the indicators. I assert that the performance of a model is reflected by the resulting behavior of the application, in our case the situations in which a hand washing activity is detected and a notification is prompted to the user. So we can simulate the behavior of the application using the new model on an existing recording and compare the potentially detected hand wash sections with the actual user feedback. To simulate the application, I use the new model to predict the classes on a recording and compute the running mean over the predictions. Additionally, low movement sections are detected and their predictions are set to \textit{null}. This is equivalent to the filter of the application, where no prediction model is applied to low movement sections. At each sample where the mean of the \textit{hw} predictions is higher than the given threshold, I check whether the label is inside a \textit{hw} or \textit{manual} interval. If yes, it is counted as a true positive (TP) prediction; otherwise it is a false positive (FP) prediction. Since the application buffers multiple samples for prediction, I require a minimum distance between two detections.
To observe wrongly predicted activities, I make the same assumption as for the \texttt{all\_null\_*} filter configurations in~\secref{sec:approachFilterConfigurations}. If a hand wash activity is detected in a section which is not covered by a \textit{hw} or \textit{manual} interval, it is probably a false detection. If the running mean does not exceed the threshold within a \textit{hw} or \textit{manual} interval, this leads to a false negative (FN) prediction. All other sections where no hand washing activity is detected would be true negative (TN) predictions. But due to the minimum distance between predictions and overlapping \textit{hw, manual} intervals, it is hard to estimate section boundaries, so the true negative value would not be precise. Using these values it is possible to create the confusion matrix as described in \secref{sec:approachMetrics}. I compute the sensitivity, precision and F1 score since they do not depend on the true negative values. This makes it possible to compare the performance of arbitrary models.
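A minimal sketch of this feedback-based evaluation, continuing the assumptions above; the interval handling is simplified and all names are hypothetical.
\begin{verbatim}
def evaluate(detections, feedback_intervals, min_distance):
    """Count TP/FP/FN of simulated detections against hw/manual intervals.

    detections:         sample indices where the running mean exceeded kt
    feedback_intervals: (start, end) tuples of hw and manual intervals
    """
    tp, fp, last = 0, 0, -min_distance
    hit = set()
    for d in detections:
        if d - last < min_distance:  # the application buffers samples, so a
            continue                 # minimum distance between detections
        last = d                     # is enforced
        inside = [i for i, (s, e) in enumerate(feedback_intervals)
                  if s <= d <= e]
        if inside:
            tp += 1
            hit.update(inside)
        else:
            fp += 1
    fn = len(feedback_intervals) - len(hit)  # intervals never detected
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return sensitivity, precision, f1
\end{verbatim}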
\subsection{Best kernel settings}
Furthermore, this mechanism can also be used to redefine the values of kernel width and threshold for the running mean. I apply a grid search over kernel sizes of $[10, 15, 20]$ and thresholds of $[0.5, 0.52, \dots, 0.99]$. For each value combination the resulting true positive and false positive counts are computed. The combination which yields at least the same number of true positives with the fewest false positives is chosen as the optimal mean setting.
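This search could be sketched as follows; the helper \texttt{simulate} is assumed to run the detection with the given kernel width and threshold on a recording and to return the true positive and false positive counts from the evaluation above.
\begin{verbatim}
def best_kernel_settings(recording, model, baseline_tp):
    """Pick the (kw, kt) pair that keeps at least baseline_tp true
    positives and has the fewest false positives."""
    thresholds = [round(0.5 + 0.02 * i, 2) for i in range(25)]  # 0.5..0.98
    best = None
    for kw in [10, 15, 20]:
        for kt in thresholds:
            # simulate() is assumed: detect() with (kw, kt), then evaluate()
            tp, fp = simulate(recording, model, kw, kt)
            if tp >= baseline_tp and (best is None or fp < best[0]):
                best = (fp, kw, kt)
    return best  # (false positives, kernel width, threshold)
\end{verbatim}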
......
\chapter{Discussion}\label{chap:Discussion}
The experiments of the previous chapter have given several insights into personalization in the context of hand washing.
\begin{itemize}
\item[1.] \textbf{Personalization leads to performance increase.}
As the experiments of \secref{sec:expSupervisedPerso} have shown, retraining the general model with personal data has a positive impact on the detection performance. Using ground truth data can increase the F1 score by ${\sim}0.207$ and the S score by ${\sim}0.076$.
\item[2.] \textbf{The influence of label noise derives from the imbalanced data.}
Since in daily usage of the application most activities will be non hand washing, the resulting dataset used for personalization will be quite imbalanced. Therefore the pseudo labels for \textit{null} and \textit{hw} have a different impact on the learning. Even a few \textit{null} samples incorrectly labeled as hand washing lead to significant performance decreases, whereas wrong \textit{null} labels do not stand out among the many other correct labels, so there can be several of them without strong performance changes.
\item[3.] \textbf{The use of soft labels makes the training more resistant to label noise.}
Soft labels, which are able to depict uncertainty, can reduce the fitting to errors. Especially wrong hand washing labels with lower class affiliations yield a better performing model than their hardened values. But smoothing correct labels can also have a negative impact.
\item[4.] \textbf{Pseudo labels have to be filtered and denoised.}
Relying solely on labels predicted by the general model as training data results in a personalized model that is worse than the general model. Even the inclusion of user feedback alone is not enough to achieve higher performance. Only the use of a high variety of samples which contain no false positive labels achieves higher performance than the general model.
\item[5.] \textbf{Pseudo labeled data can reach nearly supervised performance.}
The combination of denoising filters and user feedback generates training data which can result in a personalized model that reaches F1 and S scores similar to those of supervised training.
\item[6.] \textbf{This personalization approach outperforms active learning.}
\end{itemize}
\section{Future Work}
Observe how the personalized models are perceived by the user.
\chapter{Conclusion}\label{chap:conclusion}
In this work, I have elaborated a personalization process for human activity recognition in the context of hand washing observation. My approach utilizes indirect user feedback to automatically refine training data for fine-tuning the general detection model. I described the generation of pseudo labels from predictions of the general model and introduced several approaches to denoise them. For evaluation, I presented common supervised metrics and defined a quality estimation which likewise relies only on the user feedback. An actual implementation extends the existing application and allows real world experiments.
I evaluated personalization in general on a theoretical basis with supervised data. This revealed the impact of noise in the highly imbalanced data and how soft labels can counter training errors. Based on these insights, several constellations and filter approaches for training data have been implemented to analyze the behavior of the resulting models under different aspects. I found that just using the predictions of the base model leads to performance decreases, since they contain too much label noise. But even relying only on data covered by user feedback does not surpass the general model, although this training data hardly contains false labels. Therefore the training data has to consist of a variety of samples which contain as few incorrect labels as possible. The resulting denoising approaches all generate training data which leads to personalized models that achieve higher F1 and S scores than the general model. Some of the configurations even result in performance similar to supervised training.
I compared my personalization approach with an active learning implementation as a common personalization method. The sophisticated filter configurations achieve higher S scores, which confirms their robustness.
The real world experiment in cooperation with the University of Basel offered a great opportunity to evaluate my personalization approach on a large variety of users and their feedback behaviors. It confirms that in most cases the personalized models outperform the general model.
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_dataset.png}}
\caption[Example synthetic data set]{\textbf{Example synthetic data set} Plot over multiple windows on the x axis and their activity labels on the y axis. Indicators mark sections where the running mean of the labels exceeds the threshold and would trigger a user feedback. Correct indicators are sections where the ground truth data has the hand washing activity label, false indicators those with \textit{null} activities. Neutral indicators are already covered by the following indicator. Manual indicators mark sections where the model missed a hand wash detection.}
\label{fig:exampleSyntheticSataset}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_dataset_feedback.png}}
\caption[Example synthetic data set indicator intervals]{\textbf{Example synthetic data set indicator intervals} Highlighted areas of the corresponding indicator intervals. Red areas are \textit{false} intervals, green \textit{correct}/\textit{manual} intervals and gray \textit{neutral} intervals.}
\label{fig:exampleSyntheticIntervals}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_cnn.png}}
\caption[Example pseudo filter CNN]{\textbf{Example pseudo filter CNN} Plot of two \textit{positive} intervals to which the convolutional neural network filter approach was applied. Values for \textit{hw} of predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterCNN}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_fcndae.png}}
\caption[Example pseudo filter FCN-dAE]{\textbf{Example pseudo filter FCN-dAE} Plot of two \textit{positive} intervals to which the fully convolutional network denoising autoencoder (FCN-dAE) filter approach was applied. Values for \textit{hw} of predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterFCNdAE}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\subfloat[convLSTM1-dAE]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_convLSTMdAE1.png}}
\subfloat[convLSTM2-dAE]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_convLSTMdAE2.png}}
\subfloat[convLSTM3-dAE]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_convLSTMdAE3.png}}
\caption[Example pseudo filter convLSTM-dAE]{\textbf{Example pseudo filter convLSTM-dAE} Plot of two \textit{positive} intervals to which the three convolutional LSTM denoising autoencoder (convLSTM-dAE) filter approaches were applied. Values for \textit{hw} of predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterconvLSTM}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_score.png}}
\caption[Example pseudo filter score]{\textbf{Example pseudo filter score} Plot of two \textit{positive} intervals to which the naive filter approach was applied. Values for \textit{hw} of predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterScore}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\subfloat[Predicted \textit{null} values]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_plot_null.png}}
\subfloat[Predicted \textit{hw} values]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_plot_hw.png}}
\caption[Pseudo labels of example synthetic data set]{\textbf{Pseudo labels of example synthetic data set} Plot of predicted pseudo labels in orange. (a) shows predictions of \textit{null} values and (b) shows predictions of \textit{hw} values.}
\label{fig:examplePseudoSataset}
......