Commit f5552682 authored by Alexander Henkel

work on experiments

parent 5d22fdd2
......@@ -94,4 +94,4 @@ In some other approaches, the spatial encoder/decoder is separated from the temp
\subsection{Soft labels}\label{sec:relSoftLabels}
In most cases the label of a sample $x_i$ is a crisp label $y_i\in K$ which denotes exactly one of the $c$ predefined classes $K\equiv\{1, \dots , c\}$ to which this sample belongs~\cite{Li2012Jul, Sun2017Oct}. Typically, labels are transformed into a one-hot encoded vector $\bm{y}_i=[y^{(1)}_i, \dots, y^{(c)}_i]$ for training, since this is required by the loss function. If the sample is of class $j$, the $j$-th value in the vector is one, whereas all other values are zero, so $\sum_{k=1}^{c}y_i^{(k)}=1$ and $y_i^{(k)}\in\{0,1\}$. For soft labels, $y_i^{(k)}\in[0,1]\subset \mathbb{R}$, which allows assigning the degree to which a sample belongs to a particular class~\cite{ElGayar2006}. Therefore soft labels can depict uncertainty over multiple classes~\cite{Beleites2013Mar}. Since the output of a human activity recognition model is usually computed by a soft-max layer, it already represents a vector of partial class memberships. Converting it to a crisp label would lead to information loss. So if these predictions are used as soft-labeled training data, the values indicate how certainly the sample belongs to the respective classes.
Therefore soft labels can also be used for more robust training with noisy labels. It can happen that the noise in a generated label merely stems from a very uncertain prediction. This noise is maximized by using crisp labels, but has less impact on the training process if it is encoded in a soft label. It has been shown that soft labels can carry valuable information even when they are noisy, so a model trained with multiple sets of noisy labels can outperform a model which is trained on a single set of ground-truth labels~\cite{Ahfock2021Sep, Thiel2008}. Hu et al. used soft-labeled data for human activity recognition to overcome the problem of incomplete or uncertain annotations~\cite{Hu2016Oct}. They outperformed the hard labeling methods in accuracy, precision and recall.
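As a minimal illustration (a Python sketch under my own assumptions, not taken from any of the cited works), the difference between a crisp one-hot label and a soft label derived from a soft-max output can be expressed as follows:
\begin{verbatim}
import numpy as np

c = 2                                  # classes: 0 = null, 1 = hand wash (hw)
crisp = np.array([0.0, 1.0])           # one-hot label: entries in {0, 1}, sum = 1

# A soft label keeps the soft-max output of the recognition model as-is,
# so each entry encodes the degree of membership to the respective class.
soft = np.array([0.35, 0.65])          # entries in [0, 1], sum = 1

# Hardening the soft label discards this uncertainty information.
hardened = np.eye(c)[np.argmax(soft)]  # -> array([0., 1.])
\end{verbatim}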
......@@ -8,7 +8,7 @@ Finally, I present an active learning implementation, which is used for performa
\section{Base Application}\label{sec:approachBaseApplication}
The application I wanted to personalize detects hand washing activities and is executed on a smart watch. It is used to observe obsessive behavior of a participant in order to treat OCD. If the device detects a hand wash activity, a notification is shown to the user, which can then be confirmed or declined. A confirmation leads to a survey where the user can rate their mental condition. Furthermore, a user can trigger a manual evaluation if a hand washing was not detected by the device. These evaluations can later be used by psychologists to analyze and treat the participant's state during the day. For activity prediction the application uses a general neural network model based on the work of Robin Burchard \cite{robin2021}.
The integrated IMU of the smart watch is used to record wrist movements and stores the sensor data in a buffer. After a cycle of 10 seconds the stored data is used to predict the current activity. Therefore a sliding window with length of 3 seconds and a window shift of 1.5 seconds is applied to the buffer. For each window the largest distance between the sensor values is calculated to filter out sections where there is just little movement. If there is some motion, the general recognition model is applied to the windows of this section, to predict the current activity label. To avoid detection based on outliers, a running mean is computed over the last $kw$ predictions. Just if it exceeds a certain threshold $kt$ the final detection is triggered. Additionally the buffer is saved to an overall recording in the internal storage. While charging the smart watch, all sensor recordings and user evaluations are sent to an HTTP server.
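The detection logic can be summarized by the following sketch (Python-like pseudocode under my own assumptions; names such as \texttt{motion\_threshold} and \texttt{model.predict\_hw} are placeholders and not the actual smart watch implementation):
\begin{verbatim}
import numpy as np

KW = 20    # running-mean kernel width (number of predictions), assumed value
KT = 0.6   # detection threshold, assumed value

def process_buffer(windows, model, motion_threshold, history):
    """Process one 10 s buffer, already split into 3 s windows with 1.5 s shift."""
    for window in windows:
        # Skip windows with little movement (largest sensor-value distance).
        if np.ptp(window) < motion_threshold:
            history.append(0.0)
            continue
        history.append(model.predict_hw(window))  # probability of hand washing
    # Trigger a detection only if the running mean exceeds the threshold.
    return np.mean(history[-KW:]) > KT
\end{verbatim}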
......@@ -170,6 +170,7 @@ The terms $Y$ and $\hat{Y}$ are the sets of soft-label vectors of the ground tru
New models resulting from the personalization should be evaluated to ensure that the new model performs better than the one currently used. To determine the quality of a model, the predicted values are compared to the actual ground truth data using different metrics, as introduced in \secref{sec:approachMetrics}. However, in our case there is no ground truth data available for common evaluations. Therefore I again use the information given by the indicators. I assert that the performance of a model is reflected by the resulting behavior of the application, in our case the situations in which a hand washing activity is detected and a notification is prompted to the user. So we can simulate the behavior of the application with the new model on an existing recording and compare the potentially detected hand wash sections with the actual user feedback. To simulate the application, I use the new model to predict the classes of a recording and compute the running mean over the predictions. Additionally, low movement sections are detected and their predictions are set to \textit{null}. This is equal to the filter of the application, where no prediction model is applied to low movement sections. At each sample where the mean of the \textit{hw} predictions is higher than the given threshold, I check whether the label lies inside a \textit{hw} or \textit{manual} interval. If so, it is counted as a true positive (TP) prediction, otherwise as a false positive (FP) prediction. Since the application buffers multiple samples for prediction, I require a minimum distance between two detections.
To capture wrongly predicted activities, I make the same assumption as for the \texttt{all\_null\_*} filter configurations in~\secref{sec:approachFilterConfigurations}. If a hand wash activity is detected in a section which is not covered by a \textit{hw} or \textit{manual} interval, it is probably a false detection. If the running mean does not exceed the threshold within a \textit{hw} or \textit{manual} interval, this leads to a false negative (FN) prediction. All other sections where no hand washing activity is detected would be true negative (TN) predictions. However, due to the minimum distance between predictions and overlapping \textit{hw} and \textit{manual} intervals, it is hard to estimate section boundaries, so the true negative value is not precise. Using these values it is possible to create the confusion matrix as described in \secref{sec:approachMetrics}. I compute Sensitivity, Precision and the F1 score, since they do not depend on the true negative values. This makes it possible to compare the performance of arbitrary models.
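A sketch of this simulation (in Python, with hypothetical helper names; the interval handling is simplified compared to the actual implementation) could look as follows:
\begin{verbatim}
def simulated_confusion(running_mean, feedback_intervals, threshold, min_distance):
    """Count TP/FP/FN of a simulated application run.

    running_mean: per-sample running mean of the hw predictions, with low
    movement sections already set to 0. feedback_intervals: list of (start, end)
    sample indices of the hw/manual user feedback intervals.
    """
    tp, fp, last_detection = 0, 0, -min_distance
    hit = [False] * len(feedback_intervals)
    for i, mean in enumerate(running_mean):
        if mean <= threshold or i - last_detection < min_distance:
            continue
        last_detection = i
        covering = [k for k, (s, e) in enumerate(feedback_intervals) if s <= i <= e]
        if covering:
            tp += 1
            for k in covering:
                hit[k] = True
        else:
            fp += 1
    fn = hit.count(False)  # intervals in which the mean never exceeded the threshold
    return tp, fp, fn
\end{verbatim}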
\subsection{Best kernel settings}
Furthermore, this mechanism can also be used to redefine the values of kernel width and threshold for the running mean. I apply a grid search over kernel sizes of $[10, 15, 20]$ and thresholds of $[0.5, 0.52, \dots, 0.99]$. For each value combination the resulting true positive and false positive counts are computed. The combination which yields at least the same number of true positives and the fewest false positives is set as the optimal mean setting.
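A sketch of this grid search (Python; \texttt{evaluate} is a placeholder for the simulation described above and \texttt{baseline\_tp} for the true positive count of the current settings):
\begin{verbatim}
import numpy as np

def best_kernel_settings(evaluate, baseline_tp):
    """Return the (kernel width, threshold) pair with at least baseline_tp
    true positives and the fewest false positives."""
    kernel_widths = [10, 15, 20]
    thresholds = np.round(np.arange(0.50, 1.00, 0.02), 2)  # covers the stated grid
    best = None
    for kw in kernel_widths:
        for kt in thresholds:
            tp, fp = evaluate(kw, kt)
            if tp >= baseline_tp and (best is None or fp < best[2]):
                best = (kw, kt, fp)
    return best
\end{verbatim}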
......
......@@ -7,10 +7,10 @@ To evaluate my personalization approach I use different metrics which rely on gr
\input{figures/experiments/table_supervised_datasets}
\section{Supervised personalization}\label{sec:expSupervisedPerso}
The following experiments show theoretical aspects of the personalization process.
\subsection{Transfer learning based on ground truth data}\label{sec:expTransferLearningGT}
These build a baseline for how a personalized model could perform if perfect labeling were possible. Additionally, the influence of label noise is analyzed. First, the base model is trained with the ground truth data of the training set for each participant. After that, the resulting model is evaluated with the test data and the given metrics. Then $n\%$ of the \textit{null} labels and $h\%$ of the hand wash labels are flipped. Again the base model is trained and evaluated with the new data. This is repeated over different values for $n$ and $h$. The plots of \figref{fig:supervisedNoisyAllSpecSen} and \figref{fig:supervisedNoisyAllF1S} show the resulting mean evaluations of all participants with increasing noise.
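The noise injection can be sketched as follows (Python; the label encoding 0/1 for \textit{null}/\textit{hw} is my own assumption):
\begin{verbatim}
import numpy as np

def flip_labels(labels, n, h, rng=None):
    """Flip a fraction n of the null labels (0) and h of the hw labels (1)."""
    rng = rng or np.random.default_rng(0)
    noisy = labels.copy()
    null_idx = np.flatnonzero(labels == 0)
    hw_idx = np.flatnonzero(labels == 1)
    noisy[rng.choice(null_idx, int(n * len(null_idx)), replace=False)] = 1
    noisy[rng.choice(hw_idx, int(h * len(hw_idx)), replace=False)] = 0
    return noisy
\end{verbatim}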
%\input{figures/experiments/supervised_random_noise_hw_all}
%\input{figures/experiments/supervised_random_noise_null_all}
......@@ -18,7 +18,11 @@ These build a baseline how a personalized model could perform if a perfect label
\input{figures/experiments/supervised_random_noise_all_f1_s}
\input{figures/experiments/supervised_random_noise_part}
First we concentrate on the dashed lines of the graphs. These are the evaluations of the general model, in red, and of the supervised trained model, in green. We can see that in all graphs the personalized model performs better than the general model. The base model achieves an F1 score of ${\sim}0.4127$ and an S score of ${\sim}0.7869$, whereas the personalized model reaches an F1 score of ${\sim}0.6205$ and an S score of ${\sim}0.8633$. So personalization can lead to an increase of ${\sim}0.2079$ in F1 score and ${\sim}0.0765$ in S score. This constitutes the theoretical performance gain if perfect labeling were available.
\subsubsection{Label noise}\label{sec:expTransferLearningNoise}
Now we observe the scenarios where some of the labels are noisy. For this we look at (a) of \figref{fig:supervisedNoisyAllSpecSen}. Here $n=0$ and noise is added to the hand wash labels only. We can see that noise values up to around $40\%-50\%$ have only a small impact on specificity and sensitivity. If the noise increases further, the sensitivity tends to decrease. For specificity there seems to be no trend, and only the deflections become more extreme. As shown in (a) of \figref{fig:supervisedNoisyAllF1S}, noise on the hand wash labels has only minor influence on the training, and a personalized model can still benefit from additional data with high noise in comparison to the general model. In contrast, noise on \textit{null} labels leads to much worse performance, as can be seen in the plots of (b). Values of $n$ below $0.1$ already lead to drastic decreases in specificity and sensitivity. To better illustrate this, \figref{fig:supervisedNoisyPart} shows plots of noise on \textit{null} labels in a range from $0\%$ to $1\%$. The specificity drops and converges to ${\sim}0.5$ for $n>0.01$. All models trained on these labels achieve less specificity than the general model. For noise values around $0-2\%$ the sensitivity can be higher than that of the general model, but for larger noise it also decreases significantly. The F1 and S scores of (b) in \figref{fig:supervisedNoisyAllF1S} clarify that even a small amount of noise on \textit{null} labels drastically reduces the performance, which leads to personalized models that are worse than the base model. Moreover, it becomes clear that the F1 measure lacks a penalty for false positive predictions. According to the F1 score, a personalized model would achieve a higher performance than the base model for arbitrary noise values, although it leads to far more false hand wash detections.
I attribute the high performance loss to the imbalance of the labels. A typical daily recording of around 12 hours contains $28,800$ labeled windows. A single hand wash action of $20$ seconds covers ${\sim}13$ windows. If a user washed their hands $10$ times a day, this would lead to $130$ \textit{hw} labels and $28,670$ \textit{null} labels. Even $50\%$ of noise on the \textit{hw} labels results in only ${\sim}0.2\%$ of false data. But already $1\%$ of flipped \textit{null} labels leads to ${\sim}68\%$ of false hand wash labels, so they would have a higher impact on the training than the original hand wash data. As the S score of \figref{fig:supervisedNoisyPart} shows, it is possible that the personalized model benefits from additional data if the ratio of noise in the \textit{null} labels is smaller than ${\sim}0.2\%$. The training data of the experiments contains $270,591$ \textit{null} labels and $2,058$ hand wash labels. So ${\sim}0.2\%$ noise would lead to ${\sim}541$ false \textit{hw} labels, which is ${\sim}20\%$ of the hand wash labels. As a rule of thumb, I claim that the training data should contain less than ${\sim}20\%$ false hand wash labels, whereas the amount of incorrect \textit{null} labels does not require particular focus.
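These ratios follow directly from the window shift of $1.5$ seconds and the label counts above (worked out explicitly; they correspond to the ${\sim}0.2\%$, ${\sim}68\%$ and ${\sim}20\%$ stated in the text):
\[
\frac{12 \cdot 3600\,\mathrm{s}}{1.5\,\mathrm{s}} = 28\,800 \;\text{windows per day}, \qquad
\frac{20\,\mathrm{s}}{1.5\,\mathrm{s}} \approx 13 \;\text{windows per hand wash},
\]
\[
\frac{0.5 \cdot 130}{28\,800} \approx 0.23\%, \qquad
\frac{0.01 \cdot 28\,670}{0.01 \cdot 28\,670 + 130} \approx 68.8\%, \qquad
\frac{0.002 \cdot 270\,591}{0.002 \cdot 270\,591 + 2\,058} \approx 20.8\%.
\]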
......@@ -30,7 +34,9 @@ In these experiments, I would like to show the effect of noise in soft labels co
\input{figures/experiments/supervised_soft_noise_null}
\subsubsection{Negative impact of soft labels}\label{sec:expNegImpSoftLabel}
In the next step I want to observe whether smoothing could have a negative effect if correct labels are smoothed. Therefore I repeat the previous experiment, but do not flip the randomly selected labels and only apply the smoothing $s$ to them. Again, no major changes in the performance due to noise in the \textit{hw} labels are expected, which can also be seen in the left graph of \figref{fig:supervisedFalseSoftNoise}. In the case of wrongly smoothed \textit{null} labels we can see a negative trend in the S score for higher smoothing values, as shown in the right graph. For a greater portion of smoothed labels, the smoothing value has a higher influence on the model's performance. But for noise values $\leq 0.2\%$ all personalized models still achieve higher S scores than the general model.
To ensure that the drawbacks of incorrectly smoothed correct labels do not outweigh the performance gains from smoothing false labels, I combined both experiments. This is oriented towards what happens to the labels if one of the denoising filters is applied to a hand wash section. First, a certain ratio $n$ of the \textit{null} labels is flipped. This represents the case where the filter falsely classifies a \textit{null} label as hand washing. The false labels are smoothed to the value $s$. After that the same ratio $n$ of correct \textit{hw} labels is smoothed to the value $s$. This is equal to smoothing the label boundaries of a hand wash action. The resulting performance of the personalizations can be seen in \figref{fig:supervisedSoftNoiseBoth}. The performance increase of smoothing false labels and the performance decrease of smoothing correct labels seem to cancel out for smaller values of $n$. For larger values the performance slightly increases for larger $s$. For training data refinement I concentrate on using soft labels mainly within hand washing sections at activity borders, since there is a higher chance of false labeling there.
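A sketch of this combined modification (Python; the interpretation that the ratio $n$ is applied to the respective label counts is my own assumption):
\begin{verbatim}
import numpy as np

def flip_and_smooth(soft_labels, n, s, rng=None):
    """soft_labels: array of shape (N, 2) with columns [p_null, p_hw]."""
    rng = rng or np.random.default_rng(0)
    out = soft_labels.copy()
    null_idx = np.flatnonzero(out[:, 1] < 0.5)
    hw_idx = np.flatnonzero(out[:, 1] >= 0.5)
    # Flip a ratio n of the null labels to hand washing, smoothed to s ...
    flipped = rng.choice(null_idx, int(n * len(null_idx)), replace=False)
    out[flipped] = [1.0 - s, s]
    # ... and smooth the same ratio of correct hw labels (boundary smoothing).
    smoothed = rng.choice(hw_idx, int(n * len(hw_idx)), replace=False)
    out[smoothed] = [1.0 - s, s]
    return out
\end{verbatim}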
%\input{figures/experiments/supervised_false_soft_noise_hw}
%\input{figures/experiments/supervised_false_soft_noise_null}
\input{figures/experiments/supervised_false_soft_noise}
......@@ -44,7 +50,7 @@ In the next step I want to observe if smoothing could have a negative effect if
\section{Evaluation of different Pseudo label generations}\label{sec:expPseudoModels}
In this section, I describe the evaluation of different pseudo labeling approaches using the filters introduced in \secref{sec:approachFilterConfigurations}. For each filter configuration, the base model is used to predict the labels of the training sets and create pseudo labels. After that, the filter is applied to the pseudo labels. To determine the quality of the pseudo labels, they are evaluated against the ground truth values using the soft versions of the metrics, $Sensitivity^{soft}$, $Specificity^{soft}$, $F_1^{soft}$ and $S^{soft}$. The general model is then trained with the refined pseudo labels. All resulting models are evaluated on their test sets and the mean over all of them is computed. \figref{fig:pseudoModelsEvaluation} shows a bar plot of the metrics for all filter configurations. In terms of performance I concentrate on the S score values.
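The per-configuration evaluation can be summarized by the following sketch (Python-like pseudocode; the method names \texttt{clone}, \texttt{fit} and \texttt{evaluate} as well as the data attributes are placeholders, not the actual implementation):
\begin{verbatim}
def evaluate_filter_configuration(base_model, filter_fn, participants, soft_metrics):
    """filter_fn refines the raw pseudo labels using the user feedback
    indicators; soft_metrics computes the soft Sensitivity/Specificity/F1/S."""
    label_scores, model_scores = [], []
    for p in participants:
        pseudo = base_model.predict(p.train.windows)        # raw pseudo labels
        refined = filter_fn(pseudo, p.train.indicators)     # apply the filter
        label_scores.append(soft_metrics(refined, p.train.ground_truth))
        personalized = base_model.clone().fit(p.train.windows, refined)
        model_scores.append(personalized.evaluate(p.test))
    return label_scores, model_scores  # means over these are reported per configuration
\end{verbatim}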
\subsubsection{Baseline configurations}
......@@ -59,7 +65,7 @@ In the case of \texttt{all\_cnn\_*} configurations, the training data obtain sim
\input{figures/experiments/supervised_pseudo_models}
\input{figures/experiments/supervised_pseudo_models_training_data}
\subsection{Influence of missing feedback}\label{sec:expMissingFeedback}
The following experiment shows the impact of missing user feedback on the training data and the resulting model performance. As before, the base model is trained on data which is refined with the different filter configurations. But in this case only $f\%$ of the \textit{false} and $c\%$ of the \textit{correct} indicators exist. All others are replaced with neutral indicators.
\begin{itemize}
......@@ -69,34 +75,35 @@ The following experiment shows the impact of missing user feedback to the traini
\item Advantages and disadvantages
\end{itemize}
\subsection{Evaluation over iteration steps}
In this section I compare the performance of the personalized models between iteration steps. For this, the base model is applied to one of the training data sets of a participant, which is refined by one of the filter configurations. After that, the resulting personalized model is evaluated. This step is repeated over all training sets, where the previous base model is replaced by the new model. Additionally, I evaluate the performance of a single iteration step by always training and evaluating the base model on the respective training data. I repeat this experiment with different numbers of training epochs and for the two regularization approaches of \secref{sec:approachRegularization}; a sketch of the iteration loop follows after the list below.
\begin{itemize}
\item How does the personalized model evolve over multiple training steps
\end{itemize}
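The iteration procedure can be sketched as follows (Python-like pseudocode with placeholder helpers \texttt{refine} and \texttt{evaluate}):
\begin{verbatim}
def personalize_iteratively(base_model, training_sets, refine, evaluate):
    """Iterate personalization over consecutive training sets; the previous
    model always serves as the base for the next refinement and training."""
    model, history = base_model, []
    for data in training_sets:
        refined = refine(model, data)       # pseudo labels + filter configuration
        model = model.clone().fit(refined)  # replaces the previous base model
        history.append(evaluate(model))
    return model, history
\end{verbatim}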
\section{Evaluation of personalization}
\subsection{Compare with supervised / general}
specificity, sensitivity, F1 score, S score
\subsection{Compare Active learning with my approach}
To confirm the robustness of my personalization approach, I compare it with a common active learning implementation as introduced in \secref{sec:approachActiveLearning}. To find an appropriate selection of the hyperparameters $B$, $s$, $h$, the use of weighting, and the number of epochs, I use a grid search approach. \tabref{tab:activeLearningGridSearch} shows the covered values for the hyperparameters. The 10 parameter constellations which yield the best S scores are shown in \tabref{tab:activeLearningEvaluation}. They achieve S scores from ${\sim}0.8444$ to ${\sim}0.8451$. From the experiment of \secref{sec:expPseudoModels}, we know that the models based on the configurations \texttt{all\_cnn\_*\_hard} reach scores around ${\sim}0.86$, and \texttt{all\_null\_deepconv}, \texttt{all\_null\_fcndae} and \texttt{all\_null\_convlstm1} even ${\sim}0.867$. So my personalization approach outperforms active learning in terms of performance increase.
In the next step I analyze the required user interaction. The best performing hyperparameter setting relies on a budget of 20, so the user has to answer 20 queries. The training data contains on average $3.4$ manual, $10.6$ correct and $53.4$ false indicators per participant. If we equated the answer to a query with the user feedback from which the indicators have been drawn, the active learning approach would lead to less user interaction. But as shown in the experiment of \secref{sec:expMissingFeedback}, false indicators do not have a significant impact on the training process. Therefore a user could ignore false hand washing detections and only answer for correct and manual activities. This would result in $14$ user interactions, which is less than the budget of the active learning implementation. Furthermore, positive feedback is only gathered when the application has to react to the target activity, so in this case the user interaction is intended anyway.
\input{figures/experiments/table_active_learning_grid_search}
\input{figures/experiments/table_active_learning_evaluation}
\begin{itemize}
\item Performance
\item User interaction
\end{itemize}
\subsection{Quality estimation}
In this section I compare the estimated F1 score with the ground truth evaluation. For this I compute the estimated F1 score, as described in \secref{sec:approachQualityEstimation}, of the general and the supervised personalized model for each participant. The best kernel settings are determined in advance. Then the real F1 score based on ground truth data is calculated, and the mean of the estimated F1 and the ground truth F1 over all participants is taken. The results can be seen in \tabref{tab:qualityEstimationEvaluation}. For the general model, the estimated F1 score differs by $15\%$ and for the personalized model by $8\%$. This indicates that the estimated quality gives the right intuition of a model's performance.
\begin{itemize}
\item Simulated notification triggers
\item Running mean settings
\item Trade off between increased correct and decreased wrong triggers
\end{itemize}
\input{figures/experiments/table_quality_estimation_evaluation}
\section{Real world analysis}
In cooperation with the University of Basel, I evaluated my personalization approach on data collected by a study over multiple participants. They wore a smart watch running the base application almost every day for a month. Most participants showed indications of obsessive hand washing. The collected data covers XX participants with overall XXX hours of sensor data and XXX user feedback indicators. \tabref{tab:realWorldDataset} shows the data for each participant in detail. Since no exact labeling of the sensor values exists, I used the quality estimation approach for evaluation. The recordings for testing were selected in advance by hand. For each of the training recordings the base model is used to generate the predictions for the pseudo labels. After that, one of the filter configurations \texttt{all\_null\_convlstm3}, \texttt{all\_cnn\_convlstm2\_hard} and \texttt{all\_cnn\_convlstm3\_hard} is applied. The resulting data set is used for training based on either the previous model or the model with the best F1 score. As regularization, freezing layers or the l2-sp penalty is used. Over all personalizations of a participant, the model with the highest F1 score is determined. \tabref{tab:realWorldEvaluation} shows the resulting best personalization of each participant. Additionally, the last three columns contain the evaluation of the base model after adjusting the kernel settings. The difference between the personalization and base model values gives the true performance increase of the retraining.
Entries with zero iterations, as for participants OCDetect\_09 and OCDetect\_13, indicate that no personalization was better than the adjusted base model.
For all other participants the personalization process was able to generate a model which performs better than the general one. All of them are based on the best model for each iteration step. From this it can be concluded that a few good recordings lead to a better personalization than iterating over all available ones. They rely on only one or two iterations, and in most cases l2-sp regularization was used. The highest increase in F1 score was achieved by OCDetect\_10 with $0.10$. In practice this would lead to $82\%$ fewer incorrect hand wash notifications and $14\%$ more correct detections. The participants OCDetect\_02, OCDetect\_07 and OCDetect\_12 achieve an increase of around $0.05$. For OCDetect\_12, the personalization would lead to $6\%$ more wrong triggers but also increase the detection of correct hand washing activities by $45\%$. All best personalizations used either the \texttt{all\_cnn\_convlstm2\_hard} or the \texttt{all\_cnn\_convlstm3\_hard} filter configuration.
\input{figures/experiments/table_real_world_datasets}
\input{figures/experiments/table_real_world_evaluation}
\chapter{Discussion}\label{chap:Discussion}
The experiments of the previous chapter have given several insights into personalization in the context of hand washing detection.
\begin{itemize}
\item[1.] \textbf{Personalization leads to performance increase.}
As the experiments of \secref{sec:expSupervisedPerso} have shown, retraining the general model with personal data has a positive impact on the detection performance. Using ground truth data can increase the F1 score by ${\sim}0.207$ and the S score by ${\sim}0.076$.
\item[2.] \textbf{The influence of label noise stems from the imbalanced data.}
Since in daily usage of the application most activities will be non hand washing, the resulting dataset used for personalization is quite imbalanced. Therefore the pseudo labels for \textit{null} and \textit{hw} have a different impact on the learning. Already a few \textit{null} samples incorrectly labeled as hand washing lead to significant performance decreases, whereas incorrect \textit{null} labels do not stand out among the many other correct labels, so several such labels can exist without strong performance changes.
\item[3.] \textbf{The use of soft labels makes the training more resistant to label noise.}
Soft labels, which are able to depict uncertainty, can reduce the fitting to errors. In particular, wrong hand washing labels with lower class affiliation lead to a better performing model than their hardened values. But smoothing correct labels can also have a negative impact.
\item[4.] \textbf{Pseudo labels have to be filtered and denoised.}
Relying only on the labels predicted by the general model as training data results in a personalized model that is worse than the general model. Even the inclusion of user feedback alone is not enough to achieve higher performance. Only the use of a wide variety of samples which contain no false positive labels achieves higher performance than the general model.
\item[5.] \textbf{Pseudo labeled data can reach nearly supervised performance.}
The combination of denoising filters and user feedback generates training data which can result in a personalized model that reaches similar F1 and S scores as supervised training.
\item[6.] \textbf{This personalization approach outperforms active learning.}
\end{itemize}
\section{Future Work}
Observe how the personalized models are perceived by the user.
\chapter{Conclusion}\label{chap:conclusion}
In this work, I have elaborated a personalization process for human activity recognition in the context of hand washing observation. My approach utilizes indirect user feedback to automatically refine training data for fine-tuning the general detection model. I described the generation of pseudo labels from the predictions of the general model and introduced several approaches to denoise them. For evaluation I used common supervised metrics and defined a quality estimation which also relies only on the user feedback. An actual implementation extends the existing application and allows real world experiments.
I evaluated personalization in general on a theoretical basis with supervised data. This revealed the impact of noise in the highly imbalanced data and how soft labels can counter training errors. Based on these insights, several constellations and filter approaches for the training data have been implemented to analyze the behavior of the resulting models under the different aspects. I found that just using the predictions of the base model leads to performance decreases, since they contain too much label noise. But even relying only on data covered by user feedback does not surpass the general model, although that training data hardly contains false labels. Therefore the training data has to consist of a variety of samples which contain as few incorrect labels as possible. The resulting denoising approaches all generate training data which leads to personalized models that achieve higher F1 and S scores than the general model. Some of the configurations even result in a performance similar to supervised training.
I compared my personalization approach with an active learning implementation as a common personalization method. The sophisticated filter configurations achieve higher S scores, which confirms the robustness of my approach.
The real world experiment in cooperation with the University of Basel offered a great opportunity to evaluate my personalization approach on a large variety of users and their feedback behaviors. It confirms that in most cases the personalized models outperform the general model.
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_dataset.png}}
\caption[Example synthetic data set]{\textbf{Example synthetic data set} Plot over multiple windows on the x axis and their activity label on the y axis. Indicators mark sections where the running mean of the labels exceeds the threshold and would have triggered user feedback. Correct indicators are sections where the ground truth data has the hand washing activity label, false indicators those with \textit{null} activities. Neutral indicators are already covered by the following indicator. Manual indicators mark sections where the model missed a hand wash detection.}
\label{fig:exampleSyntheticSataset}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_dataset_feedback.png}}
\caption[Example synthetic data set indicator intervals]{\textbf{Example synthetic data set indicator intervals} Highlighted areas of the corresponding indicator intervals. Red areas are \textit{false} intervals, green areas \textit{correct}/\textit{manual} intervals and gray areas \textit{neutral} intervals.}
\label{fig:exampleSyntheticIntervals}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_cnn.png}}
\caption[Example pseudo filter CNN]{\textbf{Example pseudo filter CNN} Plot of two \textit{positive} intervals to which the convolutional neural network filter approach was applied. The \textit{hw} values of the predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterCNN}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_fcndae.png}}
\caption[Example pseudo filter FCN-dAE]{\textbf{Example pseudo filter FCN-dAE} Plot of two \textit{positive} intervals to which the fully convolutional network denoising auto encoder (FCN-dAE) filter approach was applied. The \textit{hw} values of the predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterFCNdAE}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\subfloat[convLSTM1-dAE]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_convLSTMdAE1.png}}
\subfloat[convLSTM2-dAE]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_convLSTMdAE2.png}}
\subfloat[convLSTM3-dAE]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_convLSTMdAE3.png}}
\caption[Example pseudo filter convLSTM-dAE]{\textbf{Example pseudo filter convLSTM-dAE} Plot of two \textit{positive} intervals to which the three convolutional LSTM denoising auto encoder (convLSTM-dAE) filter approaches were applied. The \textit{hw} values of the predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterconvLSTM}
......
\begin{figure}[t]
\begin{centering}
\makebox[\textwidth]{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_filter_score.png}}
\caption[Example pseudo filter score]{\textbf{Example pseudo filter score} Plot of two \textit{positive} intervals to which the naive filter approach was applied. The \textit{hw} values of the predictions and pseudo labels are plotted in orange and magenta, respectively.}
\label{fig:examplePseudoFilterScore}
\end{centering}
......
\begin{figure}[t]
\begin{centering}
\subfloat[Predicted \textit{null} values]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_plot_null.png}}
\subfloat[Predicted \textit{hw} values]
{\includegraphics[width=\textwidth]{figures/approach/example_pseudo_plot_hw.png}}
\caption[Pseudo labels of example synthetic data set]{\textbf{Pseudo labels of example synthetic data set} Plot of the predicted pseudo labels in orange. (a) shows the predictions of the \textit{null} values and (b) the predictions of the \textit{hw} values.}
\label{fig:examplePseudoSataset}
......