Commit 0b177442 authored by burcharr

Update docs/thesis/md/related_work.md

parent 89c0e9f1
......@@ -6,7 +6,7 @@ Automatically detecting the current activity of a human being is a wide research
In the area of gesture recognition, we try to detect and classify specific, narrowly defined gestures.
The defined gestures can e.g. be used to actively control a system @saini_human_2020. This kind of approach is not directly applicable to our task of detecting hand washing. However, it could be possible to adapt algorithms from this field to the detection of a new gesture or a new set of gestures related to hand washing.
There are camera-based approaches and physical measurement based approaches @saini_human_2020. The camera based approaches were out of scope for this work. As explained in the introduction, in our setting, wrist worn devices have significant advantages over camera-based solutions that would have to be stationary, i.e. in fixed locations.
There are camera-based approaches and physical measurement-based approaches @saini_human_2020. The camera-based approaches were out of scope for this work. As explained in the introduction, in our setting, wrist worn devices have significant advantages over camera-based solutions that would have to be stationary, i.e. in fixed locations.
There also exist approaches based on inertial measurement sensors. These sensors measure movement-related physical quantities, such as force or acceleration, angular velocity, or orientation in space.
Gesture recognition, in general, uses similar methods as the more difficult human activity recognition @saini_human_2020, which will be explained below.
......@@ -16,18 +16,18 @@ Gesture recognition, in general, uses similar methods as the more difficult huma
\label{section:har}
Recognizing more than one gesture or body movement in combination in a temporal context and deriving the current activity of the user is called human activity recognition (HAR). In this task, we want to detect more general activities, compared to the shorter and simpler gestures. An activity can include many distinguishable gestures. However, the same activity will not always include all of the same gestures and the gestures included could be in a different order for every repetition. Activities are less repetitive than gestures, and harder to detect in general @zhu_wearable_2011. However, Zhu et al. have shown that the combined detection of multiple different gestures can be used in HAR tasks too @zhu_wearable_2011, which makes sense, because a human activity can consist of many gestures. Nevertheless, most methods used for HAR consist of more direct applications of machine learning to the data, without the detour of detecting specific gestures contained in the execution of an activity.
Methods used in HAR include classical machine learning methods as well as deep learning @liu_overview_2021 @bulling_tutorial_2014. The classical machine learning methods rely on features of the data obtained by feature engineering. The required feature engineering is the creation of meaningful statistics or calculations based on the time frame for which the activity should be predicted. The features can be frequency-domain based and time-domain based, but usually both are used at the same time to train these conventional models @liu_overview_2021. The classical machine learning methods include but are not limited to Random Forests (RFC), Hidden Markov Models (HMM), Support Vector Machines (SVM), the $k$-nearest neighbors algorithm and more.
Methods used in HAR include classical machine learning methods as well as deep learning @liu_overview_2021 @bulling_tutorial_2014. The classical machine learning methods rely on features of the data obtained by feature engineering. The required feature engineering is the creation of meaningful statistics or calculations based on the time frame for which the activity should be predicted. The features can be frequency-domain-based and time-domain-based, but usually both are used at the same time to train these conventional models @liu_overview_2021. The classical machine learning methods include but are not limited to Random Forests (RFC), Hidden Markov Models (HMM), Support Vector Machines (SVM), the $k$-nearest neighbors algorithm and more.
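As an illustration of the feature engineering described above, the following minimal sketch computes a few time-domain statistics and one frequency-domain feature for a single window of sensor data. The function name `window_features`, the chosen statistics, and the synthetic sine signal are assumptions made for this example and are not taken from the cited works.

```python
import cmath
import math
import statistics


def window_features(window, sample_rate_hz):
    """Return a few illustrative time- and frequency-domain features for one window."""
    n = len(window)
    features = {
        # Time-domain statistics of the raw samples.
        "mean": statistics.fmean(window),
        "std": statistics.pstdev(window),
        "min": min(window),
        "max": max(window),
    }
    # Naive DFT magnitudes for bins 1..n//2 (bin 0 only carries the mean offset).
    magnitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(window))
        magnitudes.append(abs(coeff))
    dominant_bin = 1 + magnitudes.index(max(magnitudes))
    # Frequency-domain feature: frequency of the strongest non-constant component.
    features["dominant_frequency_hz"] = dominant_bin * sample_rate_hz / n
    return features


# Synthetic example: a 2 Hz sine sampled at 50 Hz for one second.
signal = [math.sin(2 * math.pi * 2 * i / 50) for i in range(50)]
features = window_features(signal, sample_rate_hz=50)
```

A classical model such as a random forest or an SVM would then be trained on one such feature vector per window instead of on the raw samples.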
#### Deep neural networks
Recently, deep neural networks have taken over the role of the state of the art machine learning method in the area of human activity recognition @bock_improving_2021, @liu_overview_2021. Deep neural networks are universal function approximators @bishop_pattern_2006, and are known for being easy to use on "raw" data. They are "artificial neural networks" consisting of multiple layers, where each layer contains a certain amount of nodes that are connected to the nodes of the following layer. The connections are each assigned a weight, and the weighted sum over the values of all the previous connected nodes is used to calculate the value of a node in the next layer. Simple neural networks where all nodes of a layer are connected to all nodes in the following layer are often called "fully connected neural networks" (FC-NN or FC).
Recently, deep neural networks have taken over the role of the state-of-the-art machine learning method in the area of human activity recognition @bock_improving_2021, @liu_overview_2021. Deep neural networks are universal function approximators @bishop_pattern_2006 and are known for being easy to use on "raw" data. They are "artificial neural networks" consisting of multiple layers, where each layer contains a certain number of nodes that are connected to the nodes of the following layer. The connections are each assigned a weight, and the weighted sum over the values of all the previous connected nodes is used to calculate the value of a node in the next layer. Simple neural networks where all nodes of a layer are connected to all nodes in the following layer are often called "fully connected neural networks" (FC-NN or FC).
The connections' parameters are optimized using forward passes through the network of nodes, followed by the execution of the backpropagation algorithm, and an optimization step. We can accumulate the gradients of a loss function with regard to each of the parameters for a small subset of the data and perform "stochastic gradient descent" (SGD). SGD or similar optimization methods like the commonly used Adam @kingma_adam_2017 optimizer perform a parameter update step. After many such updates and if the training works well, the network parameters will have been updated to values that lead to a lower value of the loss function for the training data. However, there is no guarantee of convergence. As mentioned above, deep neural networks can, in theory, be used to approximate arbitrary functions. Nevertheless, the parameters for the perfect approximation cannot be easily found, and empirical testing has revealed that neural networks do need a lot of training data in order to perform well, compared to classical machine learning methods. In return, with enough data, deep neural networks often outperform classical machine learning methods.
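The update loop described above can be sketched on a toy problem. The example below fits a single linear unit with mini-batch SGD on the mean squared error. The function `sgd_fit`, the learning rate, the batch size, and the synthetic data are illustrative assumptions. The gradients are written out by hand here, whereas a real network would obtain them through backpropagation. This toy problem is convex and converges reliably, which does not hold for deep networks in general.

```python
import random


def sgd_fit(xs, ys, lr=0.1, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random order of the samples in every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]  # forward pass and residual
                grad_w += 2 * error * xs[i] / len(batch)  # d loss / d w
                grad_b += 2 * error / len(batch)  # d loss / d b
            # Parameter update step using the gradients of this mini-batch only.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free synthetic data generated from y = 3x + 1.
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_fit(xs, ys)
```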
###### Convolutional neural networks (CNNs)
are neural networks that are not fully connected, but work by using convolutions with a kernel, that we slide over the input. CNNs were first introduced for hand written character recognition @lecun_backpropagation_1989 @le_cun_handwritten_1990 (1989, 1990), but were later revived for computer vision tasks @krizhevsky_imagenet_2012 (2012), after more computational power was available on modern devices to train them. Since the rise of CNNs in computer vision, most computer vision problems are solved with them. The convolutions work by moving filter windows with learnable parameters (also called kernels) over the input @albawi_understanding_2017. Opposed to a fully connected network, the weights are shared over many of the nodes, because the same filters are applied over the full size of the input. CNNs have less parameters to train than a fully connected network with the same amount of nodes, which makes them easier to train. They are generally expected to perform better than FC networks, especially on image related tasks. The filters can be 2-dimensional (2d), like for images (e.g. a 5x5 filter moved across the two axes of an image) or 1-dimensional (1d), which can e.g. be used to slide a kernel along the time dimension of a sensor recording. Even in the 1-dimensional case, less parameters are needed compared to the application of a fully connected network. Thus, the 1-dimensional CNN is expected to be easier to train and achieve a better performance.
are neural networks that are not fully connected, but work by using convolutions with a kernel that we slide over the input. CNNs were first introduced for handwritten character recognition @lecun_backpropagation_1989 @le_cun_handwritten_1990 (1989, 1990), but were later revived for computer vision tasks @krizhevsky_imagenet_2012 (2012), after more computational power was available on modern devices to train them. Since the rise of CNNs in computer vision, most computer vision problems are solved with their help. The convolutions work by moving filter windows with learnable parameters (also called kernels) over the input @albawi_understanding_2017. In contrast to a fully connected network, the weights are shared over many of the nodes, because the same filters are applied over the full size of the input. CNNs have fewer parameters to train than a fully connected network with the same number of nodes, which makes them easier to train. They are generally expected to perform better than FC networks, especially on image-related tasks. The filters can be 2-dimensional (2d), like for images (e.g. a 5x5 filter moved across the two axes of an image) or 1-dimensional (1d), which can e.g. be used to slide a kernel along the time dimension of a sensor recording. Even in the 1-dimensional case, fewer parameters are needed compared to the application of a fully connected network. Thus, the 1-dimensional CNN is expected to be easier to train and achieve a better performance.
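To make the weight sharing concrete, the following sketch slides one kernel along the time axis of a short signal and compares the number of trainable parameters with a fully connected layer that produces the same number of outputs. The helper names and the toy signal are assumptions for illustration. As is common in deep learning frameworks, the kernel is applied without flipping, which strictly speaking is a cross-correlation.

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Slide one kernel along the time axis (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[t + j] for j in range(k)) + bias
        for t in range(len(signal) - k + 1)
    ]


def conv1d_param_count(kernel_size):
    """The same kernel weights are reused at every position, plus one bias."""
    return kernel_size + 1


def dense_param_count(n_inputs, n_outputs):
    """One weight per input-output connection, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]  # responds to the local slope of the signal
output = conv1d_valid(signal, kernel)
```

For this toy case the convolution needs 4 parameters, while a fully connected layer mapping the 6 inputs to 4 outputs needs 28.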
###### Recurrent neural networks (RNNs)
are similar to feed forward neural networks, with the difference being that they have access to information from a previous time step. The simplest version of an RNN is a single node that takes the input $\mathbf{x}_t$ and its own output $\mathbf{h}_{t-1}$ from the last time step as inputs. RNNs can be trained on time series data and are able to interpret temporal connections and dependencies in the data to some extent. Recurrent neural networks are trained using "back propagation through time" @mozer_focused_1995. This means that we have to run a forwards pass of multiple time steps through the network first, followed by a back propagation that sums up over all the different time steps and their gradients. For "long" runs, i.e. if the network is supposed to take into account many time steps, there is the "vanishing gradient problem" @hochreiter_vanishing_1998. With an increasing amount of time steps, the gradients become smaller and smaller, making it harder or impossible to properly train the recurrent neural network.
are similar to feedforward neural networks, with the difference being that they have access to information from a previous time step. The simplest version of an RNN is a single node that takes the input $\mathbf{x}_t$ and its own output $\mathbf{h}_{t-1}$ from the last time step as inputs. RNNs can be trained on time series data and are able to interpret temporal connections and dependencies in the data to some extent. Recurrent neural networks are trained using "backpropagation through time" @mozer_focused_1995. This means that we must run a forward pass of multiple time steps through the network first, followed by a backpropagation that sums up over all the different time steps and their gradients. For "long" runs, i.e. if the network is supposed to take into account many time steps, there is the "vanishing gradient problem" @hochreiter_vanishing_1998. With an increasing number of time steps, the gradients become smaller and smaller, making it harder or impossible to properly train the recurrent neural network.
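The single-node RNN and the vanishing gradient effect can be illustrated with a scalar sketch. The $\tanh$ activation, the weight values, and the function name `rnn_forward` are assumptions chosen for this example. The code tracks the derivative of the current hidden state with respect to the initial hidden state, which is the product of the per-step factors $(1 - h_t^2) \cdot w_h$.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a single-unit RNN over a sequence and track d h_t / d h_0."""
    h = h0
    grad = 1.0  # d h_0 / d h_0
    hidden_states, grads = [], []
    for x in inputs:
        # The node sees the current input and its own previous output.
        h = math.tanh(w_x * x + w_h * h + b)
        # Chain rule through one time step: d h_t / d h_{t-1} = (1 - h_t^2) * w_h
        grad *= (1.0 - h * h) * w_h
        hidden_states.append(h)
        grads.append(grad)
    return hidden_states, grads


# 50 time steps with a recurrent weight smaller than 1 in magnitude.
states, grads = rnn_forward([0.5] * 50, w_x=1.0, w_h=0.5, b=0.0)
```

Because every factor has a magnitude of at most 0.5 in this configuration, the gradient with respect to the first time step shrinks exponentially with the sequence length.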
###### Long short-term memory (LSTM)
......@@ -64,16 +64,16 @@ The four LSTMs' gates are:
These gates are fully connected neural network layers (marked in orange and with the corresponding activation functions in @fig:lstm_cell) with respective weights and biases and serve a functionality from which their names are derived. The weights and biases must be learned during the training phase of the neural network. The forget gate allows the LSTM to only apply part of the "remembered" cell memory $\mathbf{c}_{t-1}$ in the current step, i.e. which bits should be used to which extent with regard to the current new input data $\mathbf{x}_t$ and the hidden state from the last time step $\mathbf{h}_{t-1}$. The output of the forget gate, $\mathbf{f}_t$, multiplied element-wise with $\mathbf{c}_{t-1}$ is considered the "remembered" information from the last step. The new memory gate and the input gate are used to decide which new data is added to the cell state. These two layers are also given the previous step's hidden state $\mathbf{h}_{t-1}$ and the current step's input $\mathbf{x}_t$. In combination, the new memory network output $\tilde{\mathbf{c}}_t$ and the input gate's output $\mathbf{i}_t$ decide which components of the current input and hidden state will be taken into the new memory state $\mathbf{c}_{t}$. The memory state is passed on to the next step. The output gate will generate $\mathbf{o}_t$, which will be combined with $\tanh(\mathbf{c}_{t})$ by element-wise multiplication to form the new hidden state $\mathbf{h}_{t}$.
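The interplay of the four gates can be sketched as a single LSTM step in plain Python. This follows the standard LSTM formulation and the description above. The parameter layout (each gate as a fully connected layer over the concatenation of $\mathbf{h}_{t-1}$ and $\mathbf{x}_t$), the dictionary keys, and the all-zero toy weights are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, bias, vector):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps a gate name to (weights, bias) over [h_prev, x_t]."""
    z = h_prev + x_t  # list concatenation of previous hidden state and current input
    f_t = [sigmoid(a) for a in dense(*params["forget"], z)]
    i_t = [sigmoid(a) for a in dense(*params["input"], z)]
    c_tilde = [math.tanh(a) for a in dense(*params["new_memory"], z)]
    o_t = [sigmoid(a) for a in dense(*params["output"], z)]
    # Keep part of the old memory and add the gated new candidate memory.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # The new hidden state is the gated, squashed memory state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical toy parameters: 1 hidden unit, 1 input feature, all weights zero.
zero = ([[0.0, 0.0]], [0.0])
params = {"forget": zero, "input": zero, "new_memory": zero, "output": zero}
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[2.0], params=params)
```

With all weights set to zero, every sigmoid gate outputs 0.5 and the candidate memory is 0, so half of the previous cell memory is retained.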
###### DeepConvLSTM
is a network proposed by Ordonez et al. @ordonez_deep_2016 and consists of a number of convolutional layers as well as two LSTM layers. It reaches state of the art performance and is used for general human activity recognition tasks. The combination of convolutional layers and LSTMs works well with time series data, as it can use the advantages of both convolutional layers and the intelligent "memory" provided by the LSTMs.
is a network proposed by Ordonez et al. @ordonez_deep_2016 and consists of four convolutional layers as well as two LSTM layers. It reaches state-of-the-art performance and is used for general human activity recognition tasks. The combination of convolutional layers and LSTMs works well with time series data, as it can use the advantages of both convolutional layers and the intelligent "memory" provided by the LSTMs.
Bock et al. @bock_improving_2021 employ an altered version of DeepConvLSTM @ordonez_deep_2016. Bock et al. propose reducing the amount of LSTM layers to one, resulting in the architecture shown in @fig:deepConvLSTM. They evaluate their approach on 5 different publicly available data sets and report an increased performance on four out of the five. Leaving out one LSTM layer drastically reduces the amount of parameters to be learned as well as the time needed to train the network.
Bock et al. @bock_improving_2021 employ an altered version of DeepConvLSTM @ordonez_deep_2016. Bock et al. propose reducing the number of LSTM layers to one, resulting in the architecture shown in @fig:deepConvLSTM. They evaluate their approach on 5 different publicly available data sets and report an increased performance on four out of the five. Leaving out one LSTM layer drastically reduces the number of parameters to be learned as well as the time needed to train the network.
![DeepConvLSTM and the altered version, by Marius Bock @bock_improving_2021](img/deepConvBock.png){#fig:deepConvLSTM width=98%}
![Information propagation of LSTM and LSTM with temporal attention mechanism (adjusted from @zeng_understanding_2018)](img/lstm_lstm_temporal_attention.png){#fig:lstm_attention width=98%}
\label{sec:LSTMA}
In their paper "Understanding and improving recurrent networks for human activity recognition by continuous attention" , Zeng et al. apply an attention mechanism to LSTM based neural network models @zeng_understanding_2018. They propose the separate addition of temporal and sensor attention to the LSTM layers used in such networks. The sensor attention approach can be useful when using multiple sensor locations across the body and can be used to let the network focus on measurements from the more relevant sensors for specific tasks. The temporal attention approach works as shown in @fig:lstm_attention. The "normal" unrolled LSTM forward pass is pictured on the left. The temporal attention mechanism, makes the information from the past recurrency steps available after the LSTM output and can be seen on the right. The past outputs of the LSTM are saved in each step and then added together at $\mathbf{H}$ as a weighted sum with weight parameters $\alpha_t$, $t \in [1, ..., T]$. The parameters $\alpha_t$ for the weighted sum are also predicted by the network. The parameters for the "score" layer $\mathbf{W}_{\alpha}$ are learned as part of the neural networks training routine. The resulting formulas are shown in equations \ref{eqn:attent_lstm1} to \ref{eqn:attent_lstm2}.
In their paper "Understanding and improving recurrent networks for human activity recognition by continuous attention", Zeng et al. apply an attention mechanism to LSTM-based neural network models @zeng_understanding_2018. They propose the separate addition of temporal and sensor attention to the LSTM layers used in such networks. The sensor attention approach can be useful when using multiple sensor locations across the body and can be used to let the network focus on measurements from the more relevant sensors for specific tasks. The temporal attention approach works as shown in @fig:lstm_attention. The "normal" unrolled LSTM forward pass is pictured on the left. The temporal attention mechanism makes the information from the past recurrence steps available after the LSTM output and can be seen on the right. The past outputs of the LSTM are saved in each step and then added together at $\mathbf{H}$ as a weighted sum with weight parameters $\alpha_t$, $t \in [1, ..., T]$. The parameters $\alpha_t$ for the weighted sum are also predicted by the network. The parameters for the "score" layer $\mathbf{W}_{\alpha}$ are learned as part of the neural network's training routine. The resulting formulas are shown in equations \ref{eqn:attent_lstm1} to \ref{eqn:attent_lstm2}.
\begin{figure}
\begin{align}
......@@ -87,21 +87,21 @@ score(\mathbf{h}_T,\mathbf{h}_s) &= \mathbf{h}_t^T\mathbf{W}_{\alpha}\mathbf{h}_
Note that the calculation of $\alpha_t$ is done with the softmax function as shown in eqn. \ref{eqn:attent_lstm_sm}, although this is not explicitly mentioned by the authors of the paper. This makes sure that the weights $\alpha$ used for the weighted sum always sum up to 1.
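The weighted sum with softmax-normalized weights can be sketched as follows. The bilinear score $\mathbf{h}_t^T\mathbf{W}_{\alpha}\mathbf{h}_T$ used here is an assumption based on the description above, and the function names, the toy hidden states, and the identity matrix used for $\mathbf{W}_{\alpha}$ are purely illustrative. In the actual model, $\mathbf{W}_{\alpha}$ would be learned.

```python
import math


def softmax(scores):
    """Normalize scores to positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(h_t, w_alpha, h_last):
    """Compute h_t^T W_alpha h_last for plain Python lists."""
    projected = [sum(w * v for w, v in zip(row, h_last)) for row in w_alpha]
    return sum(a * p for a, p in zip(h_t, projected))


def temporal_attention(hidden_states, w_alpha):
    """Return the attention weights and the weighted sum of all hidden states."""
    h_last = hidden_states[-1]
    scores = [bilinear_score(h_t, w_alpha, h_last) for h_t in hidden_states]
    alphas = softmax(scores)
    dim = len(h_last)
    context = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return alphas, context


# Three saved LSTM outputs of dimension 2 and an identity score matrix (toy values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, context = temporal_attention(hidden_states, identity)
```

In this toy case the last hidden state receives the largest weight, because it is most similar to itself under the identity score matrix.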
Zeng et al. evaluate their approach on 3 data sets and report a state of the art performance, beating the initial DeepConvLSTM.
Zeng et al. evaluate their approach on 3 data sets and report a state-of-the-art performance, beating the initial DeepConvLSTM.
\label{deepconvlstm_att}
Another study by Singh et al. combines DeepConvLSTM with a self-attention mechanism @singh_deep_2021. The attention mechanism is very similar to the one used by Zeng et al. @zeng_understanding_2018, where the mechanism consists of a layer that follows the LSTM layers in the DeepConvLSTM network. Instead of utilizing a score layer which uses the relation of each $h_t$ to $h_T$, Singh et al. find the weights $\mathbf{\alpha}$ by applying the softmax function to the output of a fully connected layer through which they pass the concatenated $h_t$ values. Instead of taking into account only the relations of each $h_t$ to $h_T$ separately, they use one layer to jointly calculate all the attention weights. Other than that, the two attention mechanisms are pretty similar. Singh et al. also report a statistically significant increase in performance compared to the initial DeepConvLSTM, although the evaluate their approach on different data sets than Zeng et al..
Another study by Singh et al. combines DeepConvLSTM with a self-attention mechanism @singh_deep_2021. The attention mechanism is very similar to the one used by Zeng et al. @zeng_understanding_2018, where the mechanism consists of a layer that follows the LSTM layers in the DeepConvLSTM network. Instead of utilizing a score layer which uses the relation of each $h_t$ to $h_T$, Singh et al. find the weights $\mathbf{\alpha}$ by applying the softmax function to the output of a fully connected layer through which they pass the concatenated $h_t$ values. Instead of considering only the relations of each $h_t$ to $h_T$ separately, they use one layer to jointly calculate all the attention weights. Other than that, the two attention mechanisms are similar. Singh et al. also report a statistically significant increase in performance compared to the initial DeepConvLSTM, although they evaluate their approach on different data sets than Zeng et al.
For HAR, DeepConvLSTM and the models derived from it are the state of the art machine learning methods, as they consistently outperform other model architectures on the available benchmarks and data sets.
For HAR, DeepConvLSTM and the models derived from it are the state-of-the-art machine learning methods, as they consistently outperform other model architectures on the available benchmarks and data sets.
## Hand washing
To our knowledge, no study has ever tried to separately predict compulsive hand washing as opposed to non-compulsive hand washing.
Most studies that try to automatically detect hand washing are aiming for compliance improvements, i.e. trying to increase or measure the frequency of hand washes or assessing or improving the quality of hand washes.
Hand washing compliance can be measured using different tools. Jain et al. @jain_low-cost_2009 use an RFID-based system to check whether health care workers comply with hand washing frequency requirements. However, the system is merely used to make sure all workers entering an emergency care unit have washed their hands. Bakshi et al. @bakshi_feature_2021 developed a hand washing detection data set with RGB video data, and showed a valid way to extract SIFT-descriptors from it for further research. Llorca et al. showed a vision based system for automatic hand washing quality assessment @llorca_vision-based_2011 based on the detection of skin in RGB images using optical techniques such as optical flow estimation.
Hand washing compliance can be measured using different tools. Jain et al. @jain_low-cost_2009 use an RFID-based system to check whether health care workers comply with hand washing frequency requirements. However, the system is merely used to make sure all workers entering an emergency care unit have washed their hands. Bakshi et al. @bakshi_feature_2021 developed a hand washing detection data set with RGB video data and showed a valid way to extract SIFT-descriptors from it for further research. Llorca et al. showed a vision-based system for automatic hand washing quality assessment @llorca_vision-based_2011 based on the detection of skin in RGB images using optical techniques such as optical flow estimation.
A study by Li et al. @li_wristwash_2018 is able to recognize 13 steps of a hand washing procedure on wrist motion data with an accuracy of $85\,\%$. They employ a sliding window feature based hidden markov model approach and run a continuous recognition. Wang et al. explore using sensor armbands to assess the users compliance with given hand washing hygiene guidelines @wang_accurate_2020. They run a classifier using XGBoost and are mostly able to separate the different steps of the scripted hand washing routine.
Added to that, Cao et al. @cao_awash_2021 developed a system that similarly detects different steps of a scripted hand washing routine and prompts the user, if they confuse the order of the steps or forget one of the steps. The technology is aimed at elderly patients with dementia. Their system is able to detect which step of hand washing is currently conducted based on wrist motion data using an LSTM based neural network. However, none of the three systems mentioned in this paragraph are meant to separate hand washing from other activities. These models are trained to tell apart the different steps of hand washing, as they are defined in their respective studies. The models used in these studies are not tested on a null class, i.e. they are not tested for other activities than hand washing. Thus, they can only be used for the detection of steps of hand washing, but not for the detection of hand washing in real life.
A study by Li et al. @li_wristwash_2018 is able to recognize 13 steps of a hand washing procedure on wrist motion data with an accuracy of $85\,\%$. They employ a sliding window feature-based hidden Markov model approach and run a continuous recognition. Wang et al. explore using sensor armbands to assess the user's compliance with given hand washing hygiene guidelines @wang_accurate_2020. They run a classifier using XGBoost and are mostly able to separate the different steps of the scripted hand washing routine.
In addition, Cao et al. @cao_awash_2021 developed a system that similarly detects different steps of a scripted hand washing routine and prompts the user if they confuse the order of the steps or forget one of the steps. The technology is aimed at elderly patients with dementia. Their system is able to detect which step of hand washing is currently conducted based on wrist motion data using an LSTM-based neural network. However, none of the three systems mentioned in this paragraph are meant to separate hand washing from other activities. These models are trained to tell apart the different steps of hand washing, as they are defined in their respective studies. The models used in these studies are not tested on a null class, i.e. they are not tested for other activities than hand washing. Thus, they can only be used for the detection of steps of hand washing, but not for the detection of hand washing in real life.
In order to separate hand washing from other activities, Mondol et al. employ a simple feed forward neural network. Their network consists of a few linear layers and can be used to detect hand washing @sayeed_mondol_hawad_2020. Their method seeks to specifically eliminate false positives by trying to detect out of distribution (OOD) samples, i.e. samples that are very different from the ones seen by the model during training. They apply a conditional Gaussian distribution of the network's features of the last layer before the output layer (penultimate layer).
......@@ -117,4 +117,4 @@ D_M(\mathbf{x}) = \sqrt{(\mathbf{x}- \boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{
\caption*{Equation \ref*{eqn:mahala}: Mahalanobis distance}
\end{figure}
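As a small worked example of the Mahalanobis distance, the following sketch evaluates it for 2-dimensional vectors with an explicitly inverted 2x2 covariance matrix. The function name `mahalanobis_2d` and the numbers are assumptions for illustration and are unrelated to the features used in the HAWAD paper.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance for 2-dimensional vectors with a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mean[0], x[1] - mean[1]]
    # Row vector (x - mu)^T S^{-1}
    left = [
        diff[0] * inv[0][0] + diff[1] * inv[1][0],
        diff[0] * inv[0][1] + diff[1] * inv[1][1],
    ]
    return math.sqrt(left[0] * diff[0] + left[1] * diff[1])


# The first axis has variance 4 (standard deviation 2), so a point that is
# 2 units away along that axis lies one standard deviation from the mean.
distance = mahalanobis_2d([2.0, 0.0], mean=[0.0, 0.0], cov=[[4.0, 0.0], [0.0, 1.0]])
```

A sample whose distance to the distribution of the training features exceeds a chosen threshold could then be treated as out of distribution.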
To our knowledge, no hand washing detection method using more complicated neural networks than fully connected networks has been published as of 2021. The performance reached for the HAWAD paper could possibly be surpassed by convolutional or recurrent networks or a combination thereof, e.g. CNN, LSTM or DeepConvLSTM. Added to that, the detection and separation of compulsive hand washing from ordinary hand washing has, to our knowledge, never been done before. It seems likely, that methods from hand washing detection and human activity recognition can be applied to this problem as well.
\ No newline at end of file
To our knowledge, no hand washing detection method using more complicated neural networks than fully connected networks has been published as of 2021. The performance reached for the HAWAD paper could possibly be surpassed by convolutional or recurrent networks or a combination thereof, e.g. CNN, LSTM or DeepConvLSTM. In addition, the detection and separation of compulsive hand washing from ordinary hand washing has, to our knowledge, never been done before. It seems likely that methods from hand washing detection and human activity recognition can be applied to this problem as well.