Commit d3a5577c authored by Alexander Henkel

finished thesis

parent bc2a6dbd
...@@ -6,7 +6,7 @@ In the following, I give a brief overview of literature about state-of-the-art H
\section{Activity recognition}\label{sec:relWorkActivityRecognition}
Most Inertial Measurement Units (IMUs) provide a combination of 3-axis acceleration and orientation data in continuous streams. Sliding windows are applied to the streams and are assigned to an activity by the underlying classification technique~\cite{s16010115}. This classifier is a prediction function $f(x)$ which returns the predicted activity labels for a given input $x$. Recently, deep neural network techniques have replaced traditional ones such as Support Vector Machines or Random Forests since no hand-crafted features are required~\cite{ramasamy2018recent}. They use multiple hidden layers of feature encoding and an output layer that provides predicted class distributions~\cite{MONTAVON20181}. Each layer consists of multiple artificial neurons connected to the following layer's neurons. These connections are assigned weights that are learned during the training process. First, in the feed-forward pass, the output values are computed based on a batch sampled from the training data set. In the second stage, called backpropagation, the error between the expected and predicted values is computed by a loss function $J$, which is then minimized by optimizing the weights. Feed-forward pass and backpropagation are repeated over multiple iterations, called epochs~\cite{Liu2017Apr}.
The combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory recurrent neural networks (LSTMs) tends to outperform other approaches and is considered the current state of the art for human activity recognition~\cite{9043535}. For classification problems, most works use cross-entropy as the loss function.
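To make this concrete, the following sketch shows a small CNN--LSTM classifier trained with cross-entropy in PyTorch. The window length of 150 samples, the layer sizes, and all hyperparameters are illustrative assumptions and not the architecture used in this work.
\begin{verbatim}
# Minimal CNN + LSTM sketch for windowed 3-axis IMU data (PyTorch).
# Layer sizes, window length and hyperparameters are illustrative only.
import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    def __init__(self, in_channels=3, num_classes=2, hidden=64):
        super().__init__()
        # 1D convolutions extract local motion features from each window
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU())
        # the LSTM models the temporal structure within the window
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, x):                 # x: (batch, channels, time)
        f = self.conv(x).transpose(1, 2)  # (batch, time, features)
        _, (h, _) = self.lstm(f)          # last hidden state summarizes the window
        return self.out(h[-1])            # class logits

# dummy sliding windows: 256 windows of 150 samples, two classes
X = torch.randn(256, 3, 150)
y = torch.randint(0, 2, (256,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32)

model = ConvLSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()           # loss function J

for epoch in range(5):                    # epochs
    for windows, labels in loader:
        logits = model(windows)           # feed-forward pass
        loss = loss_fn(logits, labels)    # error between predicted and expected
        optimizer.zero_grad()
        loss.backward()                   # backpropagation
        optimizer.step()                  # weight update minimizing J
\end{verbatim}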
\section{Personalization}\label{sec:relWorkPersonalization}
Even well-performing architectures can yield low-quality results in real-world scenarios. Varying users and environments create many different influences that can affect performance, such as the device's position, differences between the sensors, or human characteristics~\cite{ferrari2020personalization}.
...
...@@ -21,7 +21,7 @@ In this experiment a baseline is built on how a personalized model could perform
First, we concentrate on the dashed lines of the graphs. These are the evaluations of the general model in red and the supervised, trained model in green. In all graphs, the personalized models perform better than the general model. The base model achieves an F1 score of ${\sim}0.4127$ and ${\sim}0.7869$ in S score, whereas the personalized model reaches an F1 score of ${\sim}0.6205$ and ${\sim}0.8633$ in S score. So personalization can lead to an increase of ${\sim}0.2079$ in F1 score and ${\sim}0.0765$ in S score. This constitutes the theoretical performance gain for perfect labeling.
\subsubsection{Label noise}\label{sec:expTransferLearningNoise}
Now I observe the scenarios where some of the labels are noisy. Therefore, we look at (a) of \figref{fig:supervisedNoisyAllSpecSen}. Here $n=0$ and noise is added to the hand wash labels only. We can see that noise values up to around $40\%-50\%$ have just a small impact on specificity and sensitivity. If the noise increases further, the sensitivity tends to decrease. For specificity, there is no trend, and only the deflections become more extreme. But as shown in (a) of \figref{fig:supervisedNoisyAllF1S}, noise on the hand wash labels has just a minor influence on the training, and a personalized model can still benefit from additional data with high noise in comparison to the general model. In contrast, noise on \textit{null} labels leads to much worse performance, as seen in the plots of (b). Values of $n$ below $0.1$ already lead to drastic decreases in specificity and sensitivity. To better illustrate this, \figref{fig:supervisedNoisyPart} shows plots of noise on \textit{null} labels in a range from $0\%$ to $1\%$. The specificity drops and converges to ${\sim}0.5$ for $n>0.01$. All models trained on these labels achieve less specificity than the general model. For noise values around $0-2\%$ the sensitivity can be higher than that of the general model. But for larger noise, it also decreases significantly. The F1 and S scores of (b) in \figref{fig:supervisedNoisyAllF1S} clarify that even a small amount of noise on \textit{null} labels drastically reduces the performance, which leads to personalized models that are worse than the base model. Moreover, it becomes clear that the F1 measure lacks a penalty for false-positive predictions.
According to the F1 score, a personalized model would achieve a higher performance than the base model for arbitrary noise values, although it leads to more false hand wash detections.
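The noise used in this experiment can be reproduced by randomly flipping a fraction of the labels of each class. The following sketch illustrates this, assuming binary window labels with $0$ for \textit{null} and $1$ for \textit{hw}; names and values are illustrative.
\begin{verbatim}
# Sketch of the label-flipping noise: a fraction hw_noise of hand wash
# labels and a fraction null_noise (n) of null labels are flipped.
import numpy as np

def add_label_noise(labels, hw_noise, null_noise, seed=42):
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    hw_idx = np.flatnonzero(labels == 1)
    null_idx = np.flatnonzero(labels == 0)
    flip_hw = rng.choice(hw_idx, size=int(hw_noise * len(hw_idx)), replace=False)
    flip_null = rng.choice(null_idx, size=int(null_noise * len(null_idx)), replace=False)
    noisy[flip_hw] = 0      # hand wash windows mislabeled as null
    noisy[flip_null] = 1    # null windows mislabeled as hand wash
    return noisy

labels = np.array([0] * 28670 + [1] * 130)   # one imbalanced daily recording
noisy = add_label_noise(labels, hw_noise=0.5, null_noise=0.01)
print((noisy != labels).sum(), "flipped windows")
\end{verbatim}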
I attribute the high performance loss to the imbalance of the labels. A typical daily recording of around 12 hours contains $28,800$ labeled windows. A single hand wash action of $20$ seconds covers ${\sim}13$ windows. If a user washed their hands $10$ times a day, this would lead to $130$ \textit{hw} labels and $28,670$ \textit{null} labels. Even $50\%$ of noise on the \textit{hw} labels results in only ${\sim}0.2\%$ of false data. However, $1\%$ of flipped \textit{null} labels already leads to ${\sim}68\%$ of wrong hand wash labels. So they would have a higher impact on the training than the original hand wash data. As the S score of \figref{fig:supervisedNoisyPart} shows, it is possible that the personalized model benefits from additional data if the ratio of noise in \textit{null} labels is smaller than ${\sim}0.2\%$. The training data of the experiments contains $270,591$ \textit{null} labels and $2,058$ hand wash labels. So ${\sim}0.2\%$ noise would lead to ${\sim}541$ false \textit{hw} labels, which is ${\sim}20\%$ of the resulting hand wash labels. As a rule of thumb, I claim that the training data should contain less than ${\sim}20\%$ wrong hand wash labels, whereas the amount of incorrect \textit{null} labels does not require particular focus.
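These back-of-the-envelope numbers can be verified quickly; the window length of $1.5$ seconds used below is inferred from the $28,800$ windows per $12$ hours and is an assumption of this sketch.
\begin{verbatim}
# Reproduces the imbalance arithmetic above (1.5 s windows assumed).
windows_per_day = 12 * 60 * 60 / 1.5           # 28,800 windows
hw = 10 * 13                                   # 130 hand wash windows
null = windows_per_day - hw                    # 28,670 null windows

print(0.5 * hw / windows_per_day)              # 50% hw noise -> ~0.2% false data
print(0.01 * null / (0.01 * null + hw))        # 1% null noise -> ~68% wrong hw labels

exp_null, exp_hw = 270591, 2058                # labels in the experiments
false_hw = 0.002 * exp_null                    # ~541 false hw labels
print(false_hw / (false_hw + exp_hw))          # ~20% of the resulting hw labels
\end{verbatim}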
...@@ -112,10 +112,10 @@ In a cooperation with the University of Basel, I evaluated my personalization ap
For each training recording, the base model generates predictions for the pseudo labels. After that, one of the filter configurations \texttt{all\_null\_convlstm3}, \texttt{all\_cnn\_convlstm2\_hard} and \texttt{all\_cnn\_convlstm3\_hard} is applied. I have compared the \texttt{all\_null\_*} configurations with this setup before, and they achieved the same results in most cases. Therefore, I only consider one of them in the following. The resulting data set is used for training based on the previous model or the model with the best F1 score. As regularization, freezing layers or the l2-sp penalty is used. Over all personalizations of a participant, the model with the highest F1 score is determined. \tabref{tab:realWorldEvaluation} shows the resulting best personalization of each participant. Additionally, the last three columns contain the evaluation of the base model after adjusting the kernel settings. The difference between the personalization and the adjusted base model values gives the true performance increase of the retraining.
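The l2-sp penalty keeps the personalized weights close to those of the general model (the starting point). The following is a minimal sketch of this regularization in PyTorch; the model, the data, and the weighting factor are placeholders, not the exact setup of this work.
\begin{verbatim}
# l2-sp sketch: penalize deviation from the general model's weights.
# Model, data and alpha are illustrative placeholders.
import copy
import torch
import torch.nn as nn

general = nn.Sequential(nn.Flatten(),
                        nn.Linear(3 * 150, 64), nn.ReLU(),
                        nn.Linear(64, 2))        # stands in for the base model
model = copy.deepcopy(general)                   # model to be personalized
start = [p.detach().clone() for p in general.parameters()]

X = torch.randn(128, 3, 150)                     # pseudo-labeled windows
y = torch.randint(0, 2, (128,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
alpha = 1e-2                                     # strength of the l2-sp penalty

for windows, labels in loader:
    ce = loss_fn(model(windows), labels)
    sp = sum(((p - p0) ** 2).sum()               # squared distance to start point
             for p, p0 in zip(model.parameters(), start))
    loss = ce + alpha * sp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The alternative regularization freezes earlier layers instead, e.g.:
# for p in model[1].parameters(): p.requires_grad_(False)
\end{verbatim}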
Entries with zero iterations, as for participants OCDetect\_12 and OCDetect\_13, indicate that no better personalization could be found. In the case of participant OCDetect\_12, there were technical problems, which meant that the detection of hand washing did not work. Therefore, only manual indicators are available. As mentioned, these intervals cover larger parts of sensor data, which may be misplaced. This could lead to difficulties during training. For participant OCDetect\_13, the problem could be that the data contains the fewest hand wash activities per hour of all subjects. So there is less information about hand washing in comparison to other activities.
For all other participants, the personalization process generated a model that performs better than the general model with adjusted kernel settings. All of them are based on the best model of each iteration step. From this, it can be concluded that fewer but good recordings lead to a better personalization than iterating over all available ones. They rely on at most three iterations, and l2-sp regularization was used in more cases than layer freezing. The highest increase in F1 score was achieved by OCDetect\_21 with a difference of $0.25$. Compared to the general model without adjusted kernel settings, the F1 score increases by $0.355$. In practice, this would lead to the same amount of incorrect hand wash notifications and $80\%$ more correct detections. The highest decrease of false predictions was achieved by participant OCDetect\_10 with $74\%$. All best personalizations except one use the \texttt{all\_cnn\_convlstm3\_hard} filter configuration. But for participants OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_cnn\_convlstm2\_hard} filter configuration achieves the same score. Moreover, for participants OCDetect\_05, OCDetect\_07, OCDetect\_09 and OCDetect\_19, the \texttt{all\_null\_convlstm3} filter configuration also reaches the same F1 scores.
So the \texttt{all\_cnn\_*\_hard} configurations outperform the \texttt{all\_null\_convlstm3} configuration. This could indicate that the participants did not report all hand washing actions, and too many false negatives are generated by the \texttt{all\_null\_convlstm3} filter configuration. For all results, see \tabref{tab:fullRealWorldResults} in the appendix.
The mean F1 score over the best-personalized models increases by $0.044$ compared to the general model with adjusted kernel settings and by $0.11$ compared to the plain general model without adjusted kernel settings. That is $9.6\%$ and $28.2\%$, respectively. So, personalization leads to a reduction of false detections by $31\%$ and an increase of correct detections by $16\%$.
...
...@@ -11,8 +11,8 @@ The previous section's experiments gave several insights into personalization in
\item[4.] \textbf{Pseudo labels must be filtered and denoised.}
Just relying on the labels predicted by the general model as training data results in a worse personalized model than the general model. Even the inclusion of user feedback alone is not enough to achieve higher performance. A wide variety of samples containing no false-positive labels achieves higher performance than the general model.
\item[5.] \textbf{Pseudo-labeled data can reach nearly supervised performance.}
The combination of denoising filters and user feedback generates training data, resulting in a personalized model that reaches similar F1 and S scores as supervised training.
\item[6.] \textbf{Missing feedback has just a minor impact.} Most of the filter configurations are robust against missing \textit{false} indicators. Also, they achieve similar performance with just $40\%$ of \textit{correct} indicators.
\item[7.] \textbf{This personalization approach outperforms active learning.} It achieved a higher S score with similar user interaction.
...@@ -20,7 +20,7 @@ The previous section's experiments gave several insights into personalization in
The real-world experiment summarizes these findings and combines different aspects to achieve the best possible personalization. The pillar of this approach is the ability to evaluate various personalized models and compare them. The quality estimation makes it possible to find the best-personalized model for each new recording. Therefore, erroneous data which would lead to a heavily noisy training set can be detected and filtered out. Since most best-performing personalizations depend on just a small amount of additional training data, it is sufficient if, among several days of recordings, only a few well usable ones exist.\\
The experiment also confirms that the \texttt{all\_cnn\_*} filter configurations are better suited for a broader user base than the \texttt{all\_null\_*} configurations since they are more robust against missing feedback. For all participants, the \texttt{all\_cnn\_*} filter configurations achieved at least the same F1 scores as the \texttt{all\_null\_*} configurations, and in most cases, they outperformed them. For almost every participant, a model could be created that is better than the general model.
...@@ -29,4 +29,4 @@ The performance of the personalization heavily depends on the quality of the pse
Additionally, other sources of indicators can be considered. For example, Bluetooth beacons can be placed at the sinks. The distance between the watch and the sink can be estimated if the watch is within range. A short distance indicates that the user is probably washing their hands. This indicator can be handled similarly to \textit{manual} feedback.
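As a purely hypothetical sketch of such an indicator, the watch could map the beacon's received signal strength to a distance with the common log-distance path-loss model; the calibration values and the threshold below are assumptions.
\begin{verbatim}
# Hypothetical beacon indicator: estimate the watch-to-sink distance from
# the beacon RSSI (log-distance path-loss model). tx_power (RSSI at 1 m),
# the path-loss exponent and the 1.5 m threshold are assumed values.
def estimate_distance(rssi, tx_power=-59.0, path_loss_exponent=2.0):
    return 10 ** ((tx_power - rssi) / (10 * path_loss_exponent))

def beacon_indicator(rssi, threshold_m=1.5):
    # close to the sink -> treat like a manual hand wash indicator
    return estimate_distance(rssi) <= threshold_m

print(beacon_indicator(-55))   # ~0.6 m -> True
print(beacon_indicator(-85))   # ~20 m  -> False
\end{verbatim}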
Furthermore, this approach offers the opportunity to learn new classes. For example, during the real-world experiment, the participant was asked for each hand washing activity whether the action was compulsive. So additional information exists for each hand washing event. The target task can be adapted to learn a new classification into $A=\{null,hw,compulsive\}$. Therefore, the resulting model would be able to distinguish not only between hand washing and not hand washing but moreover between regular hand washing and compulsive hand washing.
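A minimal sketch of this task adaptation is shown below, assuming a placeholder network: only the output layer is replaced by a three-class head, while the learned feature layers are kept for fine-tuning.
\begin{verbatim}
# Sketch: extend the binary model to A = {null, hw, compulsive} by swapping
# the classification head. The network itself is only a placeholder.
import torch.nn as nn

classes = ["null", "hw", "compulsive"]

binary_model = nn.Sequential(nn.Flatten(),
                             nn.Linear(3 * 150, 64), nn.ReLU(),
                             nn.Linear(64, 2))    # old head: null vs. hw

binary_model[-1] = nn.Linear(64, len(classes))    # new three-class head
# fine-tuning then proceeds as before, with labels in {0, 1, 2}
\end{verbatim}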
...@@ -3,4 +3,5 @@ In this work, I have elaborated a personalization process for human activity rec
I evaluated personalization in general on a theoretical basis with supervised data. This revealed the impact of noise in the highly imbalanced data and how soft labels can counter training errors. Based on these insights, several constellations and filter approaches for training data have been implemented to analyze the behavior of the resulting models under different aspects. I found that just using the predictions of the base model leads to performance decreases since they contain too much label noise. However, even relying only on data covered by user feedback does not outperform the general model, although the training data hardly contains false labels. Therefore, more sophisticated denoising approaches were implemented that generate training data consisting of various samples with as few incorrect labels as possible. This data leads to personalized models that achieve higher F1 and S scores than the general model. Some of the configurations even result in similar performance as with supervised training.
Furthermore, I compared my personalization approach with an active learning implementation as a common personalization method. The sophisticated filter configurations achieve higher S scores, confirming my approach's robustness.\\
The real-world experiment in cooperation with the University of Basel offered a great opportunity to evaluate my personalization approach on a large variety of users and their feedback behaviors. It confirms that in most cases, personalized models outperform the general model. Overall, the implemented personalization would reduce false detections by $31\%$ and increase correct detections by $16\%$.
...@@ -287,6 +287,6 @@
\caption[Full real world experiment results]{\textbf{Full real world experiment results.} The table shows the personalized models and evaluation results over each hyperparameter constellation for all participants.}
\label{tab:fullRealWorldResults}
\end{table}