Commit ca30e076 authored by Alexander Henkel's avatar Alexander Henkel
work on soft evaluation

parent ed983480
@article{Paszke2019,
author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
title = {{PyTorch: An Imperative Style, High-Performance Deep Learning Library}},
journal = {Advances in Neural Information Processing Systems},
volume = {32},
year = {2019},
url = {https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html}
}
\chapter{Experiments}\label{chap:experiments}
In this chapter I present my experiments. First, I define the setup of the detection model and the personalization, as well as the datasets and metrics used. Then I introduce a supervised personalization as a baseline and use it to investigate the performance in the presence of label noise. Afterwards, I evaluate different methods for pseudo label generation and their noise reduction. Next, the development of the personalization is analyzed by determining the performance at each individual iteration step. To evaluate the overall performance, I compare my approach with the supervised baseline and an active learning implementation. Additionally, I simulate the behavior of the resulting personalization to estimate the achieved quality of an HAR application. Moreover, I have conducted multiple real-world analyses and present the experiences of the users. Finally, I summarize the results.
\section{Experiment Setup}
To evaluate my personalization approach I use different metrics which rely on ground truth data, so fully labeled datasets are required. I created synthetic recordings for 3 participants as described in~\secref{sec:synDataset}. Additionally, I recorded multiple days of daily usage of the base application and split these recordings into two participants. For the second participant, the focus was on more intense hand washing than usual. The recordings of a participant are split into training and test sets by the leave-one-out method. \tabref{tab:supervisedDatasets} shows the resulting datasets in detail. The measurement metrics are specificity, sensitivity, F1 score and S score. For each evaluation, the personalized models of the participants are applied to the respective test sets and the mean over all of them is computed. The models are implemented using PyTorch~\cite{Paszke2019}.
\input{figures/experiments/table_supervised_datasets}
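As an illustration of the evaluation, the following sketch shows how the window-level metrics could be computed from binary predictions. It is a simplified stand-in for the actual evaluation code, assuming arrays with $1$ for hand wash and $0$ for \textit{null} windows; the S score is computed as defined earlier in this work and is not reproduced here.
\begin{verbatim}
# Sketch (assumption): window-level metrics from binary predictions,
# where 1 = hand wash and 0 = null. The S score is computed by the
# evaluation routine of this work and is not reproduced here.
import numpy as np

def specificity_sensitivity_f1(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)                   # recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return specificity, sensitivity, f1
\end{verbatim}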
\section{Supervised personalization}
The following experiments show theoretical aspects of the personalization process.
\subsection{Transfer learning based on ground truth data}
These experiments establish a baseline for how a personalized model could perform if perfect labeling were possible. Additionally, the influence of label noise is analyzed. First, the base model is trained on the ground truth data of the training set for each participant. After that, the resulting model is evaluated on the test data with the given metrics. Then $n\%$ of the \textit{null} labels and $h\%$ of the hand wash labels are flipped. Again, the base model is trained and evaluated with the new data. This is repeated for different values of $n$ and $h$. The plots of \figref{fig:supervisedNoisyAllSpecSen} and \figref{fig:supervisedNoisyAllF1S} show the resulting mean evaluations over all participants with increasing noise.
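The noise injection itself can be sketched as follows. This is an illustrative implementation under the assumption of binary window labels; the function and parameter names are chosen here only for readability, with $n$ and $h$ given as fractions.
\begin{verbatim}
# Sketch (assumption): inject label noise by flipping a fraction n of the
# null labels and a fraction h of the hand wash labels of the training set.
import numpy as np

def flip_labels(y, n, h, seed=42):
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    null_idx = np.where(y == 0)[0]
    hw_idx = np.where(y == 1)[0]
    flip_null = rng.choice(null_idx, size=int(n * len(null_idx)), replace=False)
    flip_hw = rng.choice(hw_idx, size=int(h * len(hw_idx)), replace=False)
    y[flip_null] = 1   # null -> hand wash
    y[flip_hw] = 0     # hand wash -> null
    return y
\end{verbatim}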
%\input{figures/experiments/supervised_random_noise_hw_all}
%\input{figures/experiments/supervised_random_noise_null_all}
\input{figures/experiments/supervised_random_noise_all_spec_sen}
\input{figures/experiments/supervised_random_noise_all_f1_s}
\input{figures/experiments/supervised_random_noise_part}
First we concentrate on (a) of \figref{fig:supervisedNoisyAllSpecSen}. Here $n=0$ and noise is added to the hand wash labels only. We can see that noise values up to around $40\%-50\%$ have only a small impact on specificity and sensitivity. If the noise increases further, the sensitivity tends to decrease. For specificity there seems to be no trend and only the deflections become more extreme. But as shown in (a) of \figref{fig:supervisedNoisyAllF1S}, noise on the hand wash labels has only a minor influence on the training, and a personalized model can still benefit from additional data with high noise in comparison to the general model. In contrast, noise on \textit{null} labels leads to much worse performance, as can be seen in the plots of (b). Values of $n$ below $0.1$ already lead to drastic decreases in specificity and sensitivity. To better illustrate this, \figref{fig:supervisedNoisyPart} shows plots of noise on \textit{null} labels in a range from $0\%$ to $1\%$. The specificity drops and converges to ${\sim}0.5$ for $n>0.01$. All models trained on these labels achieve less specificity than the general model. For noise values around $0-2\%$ the sensitivity can be higher than that of the general model, but for larger noise it also decreases significantly. The F1 score and S score of (b) in \figref{fig:supervisedNoisyAllF1S} clarify that even a small amount of noise on \textit{null} labels drastically reduces the performance, which leads to personalized models that are worse than the base model. Moreover, it becomes clear that the F1 measure lacks a penalty for false positive predictions. According to the F1 score, a personalized model would achieve a higher performance than the base model for arbitrary noise values, although it leads to far more false hand wash detections.
I attribute the high performance loss to the imbalance of the labels. A typical daily recording of around 12 hours contains $28,800$ labeled windows. A single hand wash action of $20$ seconds covers ${\sim}13$ windows. If a user washes their hands $10$ times a day, this leads to $130$ \textit{hw} labels and $28,670$ \textit{null} labels. Even $50\%$ of noise on the \textit{hw} labels results in only ${\sim}0.2\%$ of false data. But already $1\%$ of flipped \textit{null} labels leads to ${\sim}68\%$ of false hand wash labels, so they would have a higher impact on the training than the original hand wash data. As the S score in \figref{fig:supervisedNoisyPart} shows, it is possible that the personalized model benefits from additional data if the ratio of noise in the \textit{null} labels is smaller than ${\sim}0.2\%$. The training data of the experiments contains $270,591$ \textit{null} labels and $2,058$ hand wash labels, so ${\sim}0.2\%$ noise would lead to ${\sim}541$ false \textit{hw} labels, which is ${\sim}20\%$ of the resulting hand wash labels. As a rule of thumb, I claim that the training data should contain less than ${\sim}20\%$ false hand wash labels, whereas the amount of incorrect \textit{null} labels does not require particular focus.
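Expressed as a formula, the fraction of false hand wash labels produced by a noise rate $n$ on the \textit{null} class is
\[
r_{\text{false}} = \frac{n \cdot N_{\text{null}}}{n \cdot N_{\text{null}} + N_{\text{hw}}}, \qquad \text{e.g.}\quad \frac{0.01 \cdot 28670}{0.01 \cdot 28670 + 130} \approx 0.69,
\]
which reproduces the ${\sim}68\%$ stated above.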
\begin{itemize}
\item As baseline
\item Observe label flips in noise/hand wash sections -> label noise
\end{itemize}
\subsection{Hard vs. Soft labels}
In these experiments, I would like to show the effect of noise in soft labels compared to crisp labels. Similar to before, different amounts of label flips are applied to the training data. Then the flipped labels are smoothed to a degree $s\in [0, 0.49]$. As seen before, noise on \textit{hw} labels does not have a significant impact on the performance. Therefore, not much change in performance due to different smoothing values is expected. This is confirmed by \figref{fig:supervisedSoftNoiseHW}. Only for larger noise values a trend can be detected, namely a slight increase of the S score. I therefore focus on noise in the \textit{null} labels. \figref{fig:supervisedSoftNoiseNull} gives detailed insights into the performance impact. For all noise values, the specificity increases with higher smoothing, which becomes clearer for more noise. But the sensitivity seems to decrease slightly, especially for higher noise rates. Overall, the F1 score and S score benefit from smoothing. In the case of $0.2\%$ noise, the personalized models trained on smoothed false labels, unlike those without smoothing, can reach a higher S score than the base model.
\input{figures/experiments/supervised_soft_noise_hw}
\input{figures/experiments/supervised_soft_noise_null}
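A minimal sketch of this smoothing step, under the assumption that a selected crisp label $y\in\{0,1\}$ is moved towards $0.5$ by the degree $s$; the function name and index handling are chosen here for illustration only.
\begin{verbatim}
# Sketch (assumption): move selected crisp labels y in {0, 1} towards 0.5
# by the smoothing degree s in [0, 0.49]; s = 0 keeps the crisp labels.
import numpy as np

def smooth_selected(y, idx, s):
    y = np.asarray(y, dtype=float).copy()
    y[idx] = np.where(y[idx] > 0.5, 1.0 - s, s)
    return y
\end{verbatim}
Combined with the flipping sketch above, the selected subset can first be flipped and then smoothed, which corresponds to the noisy soft labels used in this experiment.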
In the next step I want to observe whether smoothing could have a negative effect if correct labels are smoothed. Therefore, I repeat the previous experiment but do not flip the randomly selected labels and just apply the smoothing $s$ to them. Again, no major changes in performance due to noise in the \textit{hw} labels are expected, which can also be seen in the left graph of \figref{fig:supervisedFalseSoftNoise}. In the case of wrongly smoothed \textit{null} labels, we can see a negative trend in the S score for higher smoothing values, as shown in the right graph. For a greater portion of smoothed labels, the smoothing value has a higher influence on the model's performance. But for noise values $\leq 0.2\%$ all personalized models still achieve higher S scores than the general model. Therefore, it seems that the personalization benefits from using soft labels.
%\input{figures/experiments/supervised_false_soft_noise_hw}
%\input{figures/experiments/supervised_false_soft_noise_null}
\input{figures/experiments/supervised_false_soft_noise}
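Under the same assumptions as the previous sketches, this variant simply selects a random subset of correct labels and smooths it without flipping it first; \texttt{y\_train}, the noise rate and the degree below are placeholder values for illustration.
\begin{verbatim}
# Sketch (assumption): smooth a random fraction of correct null labels
# without flipping them, reusing smooth_selected from the sketch above.
import numpy as np

y_train = np.zeros(28800); y_train[:130] = 1      # toy labels, 1 = hand wash
noise, s = 0.002, 0.3                             # example noise rate and degree
rng = np.random.default_rng(0)
null_idx = np.where(y_train == 0)[0]
subset = rng.choice(null_idx, size=int(noise * len(null_idx)), replace=False)
y_soft = smooth_selected(y_train, subset, s)      # labels stay on the correct side
\end{verbatim}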
\begin{itemize}
\item Which impact do hardened labels have compared to soft labels
\item Flip labels and smooth out
......
\begin{figure}[t]
\begin{centering}
\includegraphics[width=\textwidth]{figures/experiments/supervised_false_soft_noise.png}
\caption[Supervised false soft noise]{\textbf{Supervised false soft noise.} Graphs show multiple plots for personalized models trained on data where a part of the labels is smoothed to value $s$. On the left, the smoothing is applied to hand wash labels and on the right to \textit{null} labels.}
\label{fig:supervisedFalseSoftNoise}
\end{centering}
\end{figure}
\begin{figure}[t]
\begin{centering}
\includegraphics[width=\textwidth]{figures/experiments/supervised_false_soft_noise_hw.png}
\caption[Supervised false soft noise on hw]{\textbf{Supervised false soft noise on hw.} Multiple plots of the F1 score (left) and S score (right) for personalized models which are trained on data where a part (noise value) of the hand wash labels is smoothed to value $s$.}
\label{fig:supervisedFalseSoftNoiseHW}
\end{centering}
\end{figure}
\begin{figure}[t]
\begin{centering}
\subfloat
{\includegraphics[width=\textwidth]{figures/experiments/supervised_false_soft_noise_null_spec_sens.png}}
\subfloat
{\includegraphics[width=\textwidth]{figures/experiments/supervised_false_soft_noise_null_f1_s.png}}
\caption[Supervised false soft noise on null]{\textbf{Supervised false soft noise on null.} Graphs show multiple plots for personalized models trained on data where a part (noise value) of the \textit{null} labels is smoothed to value $s$.}
\label{fig:supervisedFalseSoftNoiseNull}
\end{centering}
\end{figure}
\begin{figure}[t]
\begin{centering}
\subfloat[Increasing $h$]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_random_noise_hw_all_f1_s.png}}
\subfloat[Increasing $n$]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_random_noise_null_all_f1_s.png}}
\caption[Supervised noisy training: F1 and S score]{\textbf{Supervised noisy training: F1 and S score.} Graphs show the F1 score (left) and S score (right) of personalized models which are trained on increasing values of noise. The mean is computed over all personalizations with the same noise value. In (a) noise is applied to hand wash labels and in (b) to \textit{null} labels.}
\label{fig:supervisedNoisyAllF1S}
\end{centering}
\end{figure}
\begin{figure}[t]
\begin{centering}
\subfloat[Increasing $h$]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_random_noise_hw_all_spec_sen.png}}
\subfloat[Increasing $n$]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_random_noise_null_all_spec_sen.png}}
\caption[Supervised noisy training: specificity and sensitivity]{\textbf{Supervised noisy training: specificity and sensitivity.} Graphs show the specificity (left) and sensitivity (right) of personalized models which are trained on increasing values of noise. The mean is computed over all personalizations with the same noise value. In (a) noise is applied to hand wash labels and in (b) to \textit{null} labels.}
\label{fig:supervisedNoisyAllSpecSen}
\end{centering}
\end{figure}