\input{figures/experiments/supervised_pseudo_models_training_data}
\subsection{Influence of missing feedback}\label{sec:expMissingFeedback}
The following experiment shows the impact of missing user feedback on the training data and the resulting model performance. As before, the base model is trained on data which is refined with the different filter configurations. But in this case only $f\%$ of the \textit{false} and $c\%$ of the \textit{correct} indicators exist. All others are replaced with neutral indicators. \figref{fig:supervisedPseudoMissingFeedback} shows the S scores of the personalized models which are trained with the respective filter configuration and increasing values of $f$ in (a) and $c$ in (b).
As can be seen, missing \textit{false} indicators do not lead to any performance changes. The \texttt{all\_null\_*} filter configurations include the covered samples anyway with \textit{null} labels, and the \texttt{all\_cnn\_*} configurations contain a greater part of high confidence samples with \textit{null} labels.
In contrast, missing \textit{correct} indicators lead to a performance loss. But even with just $20\%$ of answered detections, the resulting personalized model still outperforms the general model.
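The reduced feedback can be simulated by keeping only a random fraction of each indicator type and replacing the remaining ones with neutral indicators. The following minimal sketch illustrates the idea; the indicator representation and the function name are only illustrative and not part of the actual implementation.
\begin{verbatim}
import random

# Hypothetical representation: a feedback indicator is a tuple
# (timestamp, kind) with kind in {"correct", "false", "neutral"}.
def subsample_feedback(indicators, keep_correct, keep_false, seed=0):
    """Keep c% of correct and f% of false indicators, neutralize the rest."""
    rng = random.Random(seed)
    reduced = []
    for timestamp, kind in indicators:
        if kind == "correct" and rng.random() >= keep_correct:
            kind = "neutral"  # the detection counts as unanswered
        elif kind == "false" and rng.random() >= keep_false:
            kind = "neutral"
        reduced.append((timestamp, kind))
    return reduced

# Example: keep only 20% of the correct and all of the false indicators.
# reduced = subsample_feedback(indicators, keep_correct=0.2, keep_false=1.0)
\end{verbatim}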
\extend{more data, filter configurations. Describe differences between filters}
\input{figures/experiments/supervised_pseudo_missing_feedback}
\begin{itemize}
\item Setup of filters
\item Comparison
\item Performance at unreliable evaluations
\item Advantages and disadvantages
\end{itemize}
\subsection{Evaluation over iteration steps}
In this section I compare the performance of the personalized models between iteration steps. For this, the base model is applied to one of the training data sets of a participant, which is then refined by one of the filter configurations. After that, the resulting personalized model is evaluated. This step is repeated over all training sets, where the previous base model is replaced by the new model. Additionally, I evaluate the performance of a single iteration step by always training and evaluating the base model on the respective training data. I repeat this experiment with different numbers of training epochs and for the two regularization approaches of \secref{sec:approachRegularization}.
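The procedure can be summarized by the following sketch. It is only a schematic outline: \texttt{refine}, \texttt{train} and \texttt{s\_score} stand for the filter configuration, the training routine and the S score evaluation; they are passed in as placeholders and are not taken from the actual implementation.
\begin{verbatim}
def evaluate_iterations(general_model, training_sets, test_set,
                        refine, train, s_score):
    """Sketch of the iterative personalization and the single-step baseline."""
    base_model = general_model
    iterated_scores, single_scores = [], []
    for data in training_sets:
        pseudo = base_model.predict(data)      # pseudo labels from current model
        refined = refine(data, pseudo)         # apply the filter configuration

        # overall personalization: the new model replaces the previous base model
        base_model = train(base_model, refined)
        iterated_scores.append(s_score(base_model, test_set))

        # single iteration step: always start again from the general model
        single_model = train(general_model, refined)
        single_scores.append(s_score(single_model, test_set))
    return iterated_scores, single_scores
\end{verbatim}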
\subsubsection{Evolution}
First, we observe how the model performance evolves over the iteration steps. \figref{fig:evolutionSingle} shows the S scores for each iteration step of the overall personalized model and the single trained model. The training data is generated by the \texttt{all\_noise\_hwgt} filter configuration. In graph (a), epochs and regularization are the same as in the previous experiments. We can see that the first iteration leads to a lower S score than the general model. But for all following iteration steps, the performance increases continuously. Although the single step model has a lower S score in the second iteration, the iterated model still benefits from the training. This becomes clearer with fewer training epochs per step. In graph (b) with 50 epochs, the overall personalized model's performance increases with each iteration step despite oscillating values of the single models. This illustrates that personalization does not depend on the last training step, but accumulates data across all iterations.
\input{figures/experiments/supervised_evolution_single}
\subsubsection{Comparison of filter configurations}
In this step I compare the evaluation of the personalized model over the different filter configurations. Additionally, I apply different numbers of epochs and split the regularization methods. The models are trained with 50, 100, and 150 epochs. \figref{fig:evolutionAll} shows their S scores. In (a) freezing the feature layers and in (b) the l2-sp penalty is used for regularization. The personalizations trained with frozen layers all show a similar increasing trend in performance. With more epochs they seem to achieve higher S values. Especially in the first iteration, all personalized models trained with 150 epochs already outperform the general model. With l2-sp regularization the performance varies heavily. For each selection of epochs the personalization leads to different results. It is possible that a better performing model is trained, but no exact statement can be made.
\input{figures/experiments/supervised_evolution_all}
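Both regularization approaches can be sketched as follows in PyTorch. The l2-sp penalty pulls the weights towards the pre-trained starting point instead of towards zero, while freezing simply disables gradients for the feature layers. The attribute \texttt{model.features} and the weighting factor \texttt{alpha} are assumptions for illustration and are not taken from the actual implementation.
\begin{verbatim}
import torch

def l2_sp_penalty(model, start_params, alpha=1e-3):
    """l2-sp: squared distance of the current weights to the general model."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in start_params:
            penalty = penalty + ((param - start_params[name]) ** 2).sum()
    return alpha * penalty

def freeze_feature_layers(model):
    """Alternative: freeze the feature extractor, train only the classifier."""
    for param in model.features.parameters():  # assumes a `features` submodule
        param.requires_grad = False

# start_params would hold detached copies of the general model's weights:
#   start_params = {n: p.detach().clone()
#                   for n, p in general_model.named_parameters()}
# and the penalty is added to the task loss in every training step:
#   loss = criterion(output, target) + l2_sp_penalty(model, start_params)
\end{verbatim}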
\section{Evaluation of personalization}
\input{figures/experiments/table_quality_estimation_evaluation}
\section{Real world analysis}
In cooperation with the University of Basel, I evaluated my personalization approach on data collected by a study over multiple participants. They wore a smart watch running the base application almost every day for a month. Most participants showed indications of obsessive hand washing. The collected data covers XX participants with overall XXX hours of sensor data and XXX user feedback indicators. \tabref{tab:realWorldDataset} shows the data of each participant in detail. Since no exact labeling for the sensor values exists, I used the quality estimation approach for evaluation. The recordings for testing have been selected in advance by hand. \tabref{tab:realWorldGeneralEvaluation} shows the evaluation of the base model on the test set as it is applied in the application.
\input{figures/experiments/table_real_world_datasets}
\input{figures/experiments/table_real_world_general_evaluation}
For each of the training recordings, the base model is used to generate predictions for the pseudo labels. After that, one of the filter configurations \texttt{all\_null\_convlstm3}, \texttt{all\_cnn\_convlstm2\_hard} and \texttt{all\_cnn\_convlstm3\_hard} is applied. The resulting data set is used for training based on either the previous model or the model with the best F1 score. As regularization, freezing layers or the l2-sp penalty is used. Over all personalizations of a participant, the model with the highest F1 score is determined. \tabref{tab:realWorldEvaluation} shows the resulting best personalization of each participant. Additionally, the last three columns contain the evaluation of the base model after adjusting the kernel settings. The difference between the personalization and base model values gives the true performance increase of the retraining.
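Conceptually, this corresponds to a small search over filter configurations, regularization methods and starting models, where the estimated F1 score decides which candidate is kept. The helper functions \texttt{personalize} and \texttt{estimate\_f1} in the sketch below are hypothetical stand-ins for the training step and the quality estimation, not the actual implementation.
\begin{verbatim}
from itertools import product

def select_best_personalization(general_model, recordings, filter_configs,
                                regularizations, personalize, estimate_f1,
                                continue_from_best=False):
    """Keep the candidate model with the highest estimated F1 score."""
    best_model, best_f1 = general_model, estimate_f1(general_model)
    for filter_config, regularization in product(filter_configs, regularizations):
        model = general_model
        for recording in recordings:
            candidate = personalize(model, recording,
                                    filter_config, regularization)
            f1 = estimate_f1(candidate)   # quality estimation, no ground truth
            if f1 > best_f1:
                best_model, best_f1 = candidate, f1
            # next step starts from the previous model or the best one so far
            model = best_model if continue_from_best else candidate
    return best_model, best_f1
\end{verbatim}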
Entries with zero iterations, as for participants OCDetect\_09 and OCDetect\_13, indicate that no personalization was better than the adjusted base model.
For all other participants, the personalization process was able to generate a model which performs better than the general one. All of them are based on the best model for each iteration step. From this it can be concluded that fewer but good recordings lead to a better personalization than iterating over all available ones. They rely on just one or two iterations, and in most cases l2-sp regularization was used. The highest increase in F1 score was achieved by OCDetect\_10 with $0.10$. In practice this would lead to $82\%$ fewer incorrect hand wash notifications and $14\%$ more correct detections. The participants OCDetect\_02, OCDetect\_07 and OCDetect\_12 achieve an increase of around $0.05$. For OCDetect\_12, the personalization would lead to $6\%$ more wrong triggers but also increase the detection of correct hand washing activities by $45\%$. All best personalizations used either the \texttt{all\_cnn\_convlstm2\_hard} or \texttt{all\_cnn\_convlstm3\_hard} filter configuration.
\todo{Fix broken data}
\input{figures/experiments/table_real_world_evaluation}
\end{itemize}
The real world experiment summarizes these findings and combines the different aspects to achieve the best possible personalization. The pillar of this approach is the ability to evaluate various personalized models and compare them. By using the quality estimation it is possible to find the best personalized model for each new recording. Thus, erroneous data which would lead to a heavily noisy training set can be detected and filtered out. Since most of the best performing personalizations depend on just a small amount of additional training data, it is sufficient if, among several days of recordings, only a few well usable ones exist.
\extend{When full experiments are done}
\section{Future Work}
Observe how noticeable the personalized models are to the user.
\begin{figure}[t]
\begin{centering}
\subfloat[freeze feature layers]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_evolution_all.png}}
\subfloat[l2-sp regularization]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_evolution_all_l2-sp.png}}
\caption[Personalization evolution comparison]{\textbf{Personalization evolution comparison} Plot of the model evaluations for each iteration step, trained with different filter configurations. The three graphs show training results with 50, 100 and 150 epochs. In (a) freezing the feature layers of the model is used as regularization and in (b) l2-sp regularization is used.}
\label{fig:evolutionAll}
\end{centering}
\end{figure}
\begin{figure}[t]
\begin{centering}
\subfloat[100 epochs]{
\includegraphics[width=\textwidth]{figures/experiments/supervised_evolution_single.png}}
\subfloat[50 epochs]{
\includegraphics[width=\textwidth]{figures/experiments/supervised_evolution_single_50.png}}
\caption[Personalization evolution single]{\textbf{Personalization evolution single.} The graph shows the iterative evaluation of the personalized model after each iteration step. Additionally, the evaluation of the base model trained only on the single training data of the respective step is drawn.}
\label{fig:evolutionSingle}
\end{centering}
\end{figure}
\begin{figure}[t]
\begin{centering}
\subfloat[Increasing $f\%$]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_pseudo_missing_null.png}}
\subfloat[Increasing $c\%$]
{\includegraphics[width=\textwidth]{figures/experiments/supervised_pseudo_missing_hw.png}}
\caption[Pseudo models missing feedback]{\textbf{Pseudo models missing feedback} Evaluation of personalized models with incomplete feedback.}
\label{fig:supervisedPseudoMissingFeedback}
\end{centering}
\end{figure}
\begin{table}[ht]
\centering
% spacing in table
\ra{1.3}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lrrrr}
\toprule
\thead{participant} & \thead{sum\\hand washes} & \thead{sum correct \\ hand washes} & \thead{sum false \\ hand washes} & \thead{F1} \\
\midrule
OCDetect\_02 & 68 & 55 & 175 & 0.3691 \\
OCDetect\_03 & 97 & 66 & 183 & 0.3815 \\
OCDetect\_04 & 39 & 17 & 19 & 0.4533 \\
OCDetect\_05 & 220 & 90 & 307 & 0.2917 \\
OCDetect\_07 & 16 & 13 & 14 & 0.6047 \\
OCDetect\_09 & 26 & 11 & 77 & 0.1930 \\
OCDetect\_10\_2 & 17 & 7 & 134 & 0.0886 \\
OCDetect\_11 & 38 & 9 & 26 & 0.2466 \\
OCDetect\_12 & 77 & 32 & 13 & 0.5246 \\
OCDetect\_13 & 46 & 20 & 62 & 0.3125 \\
\bottomrule
\end{tabular}}
\caption[General model evaluation]{\textbf{General model evaluation} Evaluation of the general model on the test sets of the real world experiment.}
\label{tab:realWorldGeneralEvaluation}
\end{table}