
Related Work

Automatically detecting the current activity of a human being is a wide research field in computer science. There are many possible applications of gesture and activity recognition, e.g. human-robot interaction, quality assessment, worker surveillance, control of user interfaces and more. Hand washing detection, which is a special case of activity detection, has also been the subject of multiple studies over the past years. Apart from hand washing detection itself, the most relevant fields of research for our work are gesture recognition and human activity recognition, whose relevance is explained below.

Gesture recognition

In the area of gesture recognition, the goal is to detect and classify specific, narrowly defined gestures. The defined gestures can, for example, be used to actively control a system @saini_human_2020. This kind of approach is not directly applicable to our task of detecting hand washing. However, it could be possible to adapt algorithms from this field to the detection of a new gesture or a new set of gestures related to hand washing.

There are camera-based approaches and approaches based on physical measurements @saini_human_2020. The camera-based approaches were out of scope for this work. As explained in the introduction, in our setting, wrist-worn devices have significant advantages over camera-based solutions, which would have to be stationary, i.e. in fixed locations. There also exist approaches based on inertial measurement sensors. These sensors measure movement-related physical quantities, such as force or acceleration, angular velocity or orientation in space.

Gesture recognition, in general, uses methods similar to those of the more difficult task of human activity recognition @saini_human_2020, which will be explained below.

Human activity recognition

\label{section:har} Recognizing more than one gesture or body movement in combination in a temporal context and deriving the current activity of the user is called human activity recognition (HAR). In this task, we want to detect more general activities, compared to shorter and simpler gestures. An activity can include many distinguishable gestures. However, the same activity will not always include all of the same gestures, and the gestures included could be in a different order for every repetition. Activities are less repetitive than gestures and harder to detect in general @zhu_wearable_2011. However, Zhu et al. have shown that the combined detection of multiple different gestures can be used in HAR tasks too @zhu_wearable_2011, which makes sense, because a human activity can consist of many gestures. Nevertheless, most methods used for HAR apply machine learning to the data more directly, without the detour of detecting specific gestures contained in the execution of an activity.

Methods used in HAR include classical machine learning methods as well as deep learning @liu_overview_2021 @bulling_tutorial_2014. The classical machine learning methods rely on features of the data obtained by feature engineering. The required feature engineering is the creation of meaningful statistics or calculations based on the time frame for which the activity should be predicted. The features can be frequency-domain based and time-domain based, but usually both are used at the same time to train these conventional models @liu_overview_2021. The classical machine learning methods include but are not limited to Random Forests (RFC), Hidden Markov Models (HMM), Support Vector Machines (SVM), the $k$-nearest neighbors algorithm and more.
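The feature engineering step described above can be illustrated with a minimal sketch. The window size, step size, synthetic signal and the particular choice of features (mean, standard deviation, range and dominant frequency bin) are assumptions made for this example only and are not taken from the cited works.

```python
import math
from statistics import mean, pstdev


def sliding_windows(signal, size, step):
    """Yield consecutive windows of `size` samples, advancing by `step`."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


def dominant_frequency_bin(window):
    """Index of the strongest non-DC bin of a naive discrete Fourier transform."""
    n = len(window)
    best_bin, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(window))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(window))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin


def extract_features(window):
    """Combine time-domain and frequency-domain features for one window."""
    return {
        "mean": mean(window),                             # time domain
        "std": pstdev(window),                            # time domain
        "range": max(window) - min(window),               # time domain
        "dominant_bin": dominant_frequency_bin(window),   # frequency domain
    }


# Synthetic one-axis "sensor" signal: two sine periods per 16-sample window.
signal = [math.sin(2 * math.pi * 2 * i / 16) for i in range(64)]
features = [extract_features(w) for w in sliding_windows(signal, size=16, step=8)]
```

Each entry of `features` would then serve as one input row for a classical classifier such as a Random Forest or an SVM.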

Deep neural networks

Recently, deep neural networks have become the state-of-the-art machine learning method in the area of human activity recognition @bock_improving_2021, @liu_overview_2021. Deep neural networks are universal function approximators @bishop_pattern_2006 and are known for being easy to use on "raw" data. They are "artificial neural networks" consisting of multiple layers, where each layer contains a certain number of nodes that are connected to the nodes of the following layer. Each connection is assigned a weight, and the weighted sum over the values of all connected nodes in the previous layer is used to calculate the value of a node in the next layer. Simple neural networks where all nodes of a layer are connected to all nodes in the following layer are often called "fully connected neural networks" (FC-NN or FC). The parameters of the connections are optimized using forward passes through the network, followed by the execution of the backpropagation algorithm and an optimization step. For a small subset of the data (a mini-batch), we accumulate the gradients of a loss function with respect to each parameter and perform "stochastic gradient descent" (SGD). SGD, or similar optimization methods such as the commonly used Adam optimizer @kingma_adam_2017, then performs a parameter update step. After many such updates, and if the training works well, the network parameters will have been updated to values that lead to a lower value of the loss function on the training data. However, convergence is not guaranteed. As mentioned above, deep neural networks can, in theory, be used to approximate arbitrary functions. Nevertheless, the parameters for the perfect approximation cannot be found easily, and empirical testing has shown that neural networks need a lot of training data in order to perform well, compared to classical machine learning methods. In return, with enough data, deep neural networks often outperform the classical machine learning methods.
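To make the update rule concrete, the following minimal sketch runs mini-batch SGD on a single linear node. This is a deliberate simplification: there is only one layer, so the gradients are written out by hand instead of being obtained by backpropagation. The data, learning rate, batch size and number of steps are assumptions chosen for illustration.

```python
import random


def sgd_step(w, b, batch, lr):
    """One SGD update for y_hat = w * x + b under the mean squared error."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y                 # forward pass and residual
        grad_w += 2 * error * x / len(batch)    # d(loss)/dw, averaged over the batch
        grad_b += 2 * error / len(batch)        # d(loss)/db, averaged over the batch
    return w - lr * grad_w, b - lr * grad_b


def mse(w, b, data):
    """Mean squared error of the model on the full data set."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


random.seed(0)
# Noise-free toy data generated from y = 3x + 1.
data = [(x / 10, 3.0 * (x / 10) + 1.0) for x in range(-10, 11)]

w, b = 0.0, 0.0
initial_loss = mse(w, b, data)
for _ in range(2000):
    batch = random.sample(data, 4)              # small random subset of the data
    w, b = sgd_step(w, b, batch, lr=0.1)
final_loss = mse(w, b, data)
```

In this toy setting the loss decreases and the parameters approach the generating values. For deep networks, the same loop is used, but the gradients come from backpropagation and no such convergence is guaranteed.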

Convolutional neural networks (CNNs)

are neural networks that are not fully connected, but instead apply convolutions with a kernel that slides over the input. CNNs were first introduced for handwritten character recognition @lecun_backpropagation_1989 @le_cun_handwritten_1990 (1989, 1990), and were later revived for computer vision tasks @krizhevsky_imagenet_2012 (2012), once more computational power was available to train them. Since the rise of CNNs in computer vision, most computer vision problems are solved with them. The convolutions work by moving filter windows with learnable parameters (also called kernels) over the input @albawi_understanding_2017. In contrast to a fully connected network, the weights are shared across many of the nodes, because the same filters are applied over the full size of the input. CNNs therefore have fewer parameters to train than a fully connected network with the same number of nodes, which makes them easier to train. They are generally expected to perform better than FC networks, especially on image-related tasks. The filters can be 2-dimensional, as for images (e.g. a $5 \times 5$ filter moved across the two axes of an image), or 1-dimensional, which can e.g. be used to slide a kernel along the time dimension of a sensor recording. Even in the 1-dimensional case, fewer parameters are needed than for a fully connected network. Thus, the 1-dimensional CNN is expected to be easier to train and to achieve better performance.
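The 1-dimensional case can be sketched in a few lines. The example below slides a single kernel along the time axis of a toy signal and compares the number of trainable parameters with a fully connected layer that maps the same input to the same number of outputs. The signal, the kernel values and the helper names are assumptions for illustration.

```python
def conv1d(signal, kernel, bias=0.0):
    """Valid 1D convolution (implemented as cross-correlation, as is common
    in deep learning libraries): the same kernel is applied at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def conv1d_param_count(kernel_size):
    """Shared kernel weights plus one bias."""
    return kernel_size + 1


def dense_param_count(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


signal = [0.0, 1.0, 2.0, 4.0, 7.0, 11.0]   # toy sensor recording over time
kernel = [-1.0, 0.0, 1.0]                  # in a CNN, these values would be learned
output = conv1d(signal, kernel)            # [2.0, 3.0, 5.0, 7.0]

conv_params = conv1d_param_count(len(kernel))                 # 4
dense_params = dense_param_count(len(signal), len(output))    # 28
```

Because the kernel is reused at every time step, the convolutional layer needs 4 parameters here, while a fully connected layer with the same input and output size needs 28.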

Recurrent neural networks (RNNs)

are similar to feed-forward neural networks, with the difference being that they have access to information from a previous time step. The simplest version of an RNN is a single node that takes the input $\mathbf{x}_t$ and its own output $\mathbf{h}_{t-1}$ from the last time step as inputs. RNNs can be trained on time series data and are able to interpret temporal connections and dependencies in the data to some extent. Recurrent neural networks are trained using "backpropagation through time" @mozer_focused_1995. This means that we first have to run a forward pass over multiple time steps through the network, followed by a backpropagation that sums the gradients over all the different time steps. For "long" runs, i.e. if the network is supposed to take into account many time steps, the "vanishing gradient problem" @hochreiter_vanishing_1998 occurs. With an increasing number of time steps, the gradients become smaller and smaller, making it harder or impossible to properly train the recurrent neural network.
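The vanishing gradient effect can be observed even for a single recurrent node with scalar state. The sketch below assumes a $\tanh$ activation and hand-picked weights, which are illustrative choices and not prescribed by the cited works. It computes the derivative of the last hidden state with respect to the initial state as the product of the per-step derivatives.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    states = [h0]
    for x in inputs:
        states.append(math.tanh(w_x * x + w_h * states[-1] + b))
    return states


def gradient_wrt_initial_state(states, w_h):
    """d h_T / d h_0 via the chain rule: product of w_h * (1 - h_t^2) over all steps."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad


inputs = [0.5] * 50
states = rnn_forward(inputs, w_x=1.0, w_h=0.5, b=0.0)

grad_short = gradient_wrt_initial_state(states[:6], w_h=0.5)   # over 5 time steps
grad_long = gradient_wrt_initial_state(states, w_h=0.5)        # over 50 time steps
```

Every factor has a magnitude of at most 0.5 in this setting, so the gradient over 50 steps is many orders of magnitude smaller than the gradient over 5 steps.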

Long short-term memory (LSTM)

can be used to combat the vanishing gradient problem in recurrent neural networks @hochreiter_long_1997, @hochreiter_vanishing_1998. It can be used in various applications, such as time series prediction, speech recognition and translation tasks (including generative tasks) @smagulova_survey_2019, but also for human activity recognition. It can handle temporal connections well and "remember" important parts of its past state.

\label{sec:LSTM} An LSTM consists of "cells", of which one or more can be contained in a neural network. The LSTM cell is shown in @fig:lstm_cell and consists of two inputs, four gates and two outputs. The values gathered from the outputs are also part of the input in the next time step of the network's execution, introducing a special case of recurrence. The inputs to the cell are the external inputs $\mathbf{x}_t$ (from the previous network layer and the current time step), as well as the "memory cell" $\mathbf{c}_{t-1}$ from the previous time step and the "hidden state" $\mathbf{h}_{t-1}$, which is the LSTM's output vector from the previous time step. The calculations describing one time step of the LSTM forward pass are listed in equations \ref{eqn:lstm1} to \ref{eqn:lstm2}.

\begin{figure}
\begin{align}
\label{eqn:lstm1}
\mathbf{f}_t &= \sigma(\mathbf{W}_{\mathbf{fx}}\mathbf{x}_t + \mathbf{W}_{\mathbf{fh}}\mathbf{h}_{t-1} + \mathbf{b}_f) \\
\mathbf{i}_t &= \sigma(\mathbf{W}_{\mathbf{ix}}\mathbf{x}_t + \mathbf{W}_{\mathbf{ih}}\mathbf{h}_{t-1} + \mathbf{b}_i) \\
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_{\mathbf{cx}}\mathbf{x}_t + \mathbf{W}_{\mathbf{ch}}\mathbf{h}_{t-1} + \mathbf{b}_c) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \\
\mathbf{o}_t &= \sigma(\mathbf{W}_{\mathbf{ox}}\mathbf{x}_t + \mathbf{W}_{\mathbf{oh}}\mathbf{h}_{t-1} + \mathbf{b}_o) \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\label{eqn:lstm2}
\end{align}
\caption*{($\odot$ marking element-wise multiplication)}
\end{figure}

LSTM cell (by Guillaume Chevalier, CC BY-SA 4.0, with added labeling for the gates)

The four gates of the LSTM are:

  • forget gate
  • new memory gate
  • input gate
  • output gate

These gates are fully connected neural network layers (marked in orange and with the corresponding activation functions in @fig:lstm_cell) with their own weights and biases, and their names are derived from their function. The weights and biases must be learned during the training phase of the neural network. The forget gate allows the LSTM to apply only part of the "remembered" cell memory $\mathbf{c}_{t-1}$ in the current step, i.e. it decides which components should be used to which extent, given the current input $\mathbf{x}_t$ and the hidden state from the last time step $\mathbf{h}_{t-1}$. The output of the forget gate, $\mathbf{f}_t$, multiplied element-wise with $\mathbf{c}_{t-1}$, is considered the "remembered" information from the last step. The new memory gate and the input gate are used to decide which new data is added to the cell state. These two layers are also given the previous step's hidden state $\mathbf{h}_{t-1}$ and the current step's input $\mathbf{x}_t$. In combination, the new memory gate's output $\tilde{\mathbf{c}}_t$ and the input gate's output $\mathbf{i}_t$ decide which components of the current input and hidden state will be taken into the new memory state $\mathbf{c}_{t}$. The memory state is passed on to the next step. The output gate generates $\mathbf{o}_t$, which is combined with $\tanh(\mathbf{c}_{t})$ by element-wise multiplication to form the new hidden state $\mathbf{h}_{t}$.
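One time step of the LSTM forward pass, as described by the gates above, can be sketched with plain Python lists. The parameter layout (a dictionary with one `(W_x, W_h, b)` tuple per gate), the dimensions and the all-zero parameters are assumptions made for this illustration; in a trained network these values would be learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pre_activation(gate_params, x, h_prev):
    """Compute W_x x + W_h h_prev + b for one gate (matrices as lists of rows)."""
    W_x, W_h, b = gate_params
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h_prev))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and the new cell memory."""
    f = [sigmoid(z) for z in pre_activation(params["f"], x, h_prev)]          # forget gate
    i = [sigmoid(z) for z in pre_activation(params["i"], x, h_prev)]          # input gate
    c_tilde = [math.tanh(z) for z in pre_activation(params["c"], x, h_prev)]  # new memory
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(z) for z in pre_activation(params["o"], x, h_prev)]          # output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


input_dim, hidden_dim = 3, 2
# Hypothetical untrained parameters: all weights and biases are zero.
params = {
    name: (zeros(hidden_dim, input_dim), zeros(hidden_dim, hidden_dim), [0.0] * hidden_dim)
    for name in ("f", "i", "c", "o")
}

h, c = lstm_step([0.1, -0.2, 0.3], [0.0, 0.0], [2.0, -1.0], params)
```

With all-zero parameters, every sigmoid gate outputs 0.5 and the new memory candidate is 0, so the cell keeps exactly half of its previous memory, which makes the role of the forget gate easy to verify by hand.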

DeepConvLSTM

is a network proposed by Ordonez et al. @ordonez_deep_2016 and consists of a number of convolutional layers as well as two LSTM layers. It reaches state of the art performance and is used for general human activity recognition tasks. The combination of convolutional layers and LSTMs works well with time series data, as it can use the advantages of both convolutional layers and the intelligent "memory" provided by the LSTMs.

Bock et al. @bock_improving_2021 employ an altered version of DeepConvLSTM @ordonez_deep_2016. They propose reducing the number of LSTM layers to one, resulting in the architecture shown in @fig:deepConvLSTM. They evaluate their approach on five different publicly available data sets and report an increased performance on four of them. Leaving out one LSTM layer drastically reduces the number of parameters to be learned as well as the time needed to train the network.

DeepConvLSTM and the altered version, by Marius Bock @bock_improving_2021

Information propagation of LSTM and LSTM with temporal attention mechanism (adjusted from @zeng_understanding_2018)

\label{sec:LSTMA} In their paper "Understanding and improving recurrent networks for human activity recognition by continuous attention", Zeng et al. apply an attention mechanism to LSTM-based neural network models @zeng_understanding_2018. They propose the separate addition of temporal and sensor attention to the LSTM layers used in such networks. The sensor attention approach can be useful when using multiple sensor locations across the body and can be used to let the network focus on measurements from the more relevant sensors for specific tasks. The temporal attention approach works as shown in @fig:lstm_attention. The "normal" unrolled LSTM forward pass is pictured on the left. The temporal attention mechanism, which can be seen on the right, makes the information from the past recurrence steps available after the LSTM output. The past outputs of the LSTM are saved in each step and then combined into $\mathbf{H}$ as a weighted sum with weights $\alpha_t$, $t \in \{1, \dots, T\}$. The weights $\alpha_t$ for the weighted sum are also predicted by the network. The parameters $\mathbf{W}_{\alpha}$ of the "score" layer are learned as part of the neural network's training routine. The resulting formulas are shown in equations \ref{eqn:attent_lstm1} to \ref{eqn:attent_lstm2}.

\begin{figure}
\begin{align}
\label{eqn:attent_lstm1}
\mathbf{H} &= \sum_{t=1}^T \alpha_t\mathbf{h}_t \\
\alpha_t &= \frac{\exp\{score(\mathbf{h}_T,\mathbf{h}_t)\}}{\sum_{s=1}^{T}\exp\{score(\mathbf{h}_T,\mathbf{h}_s)\}} \label{eqn:attent_lstm_sm} \\
score(\mathbf{h}_T,\mathbf{h}_s) &= \mathbf{h}_T^\top\mathbf{W}_{\alpha}\mathbf{h}_s \label{eqn:attent_lstm2}
\end{align}
\end{figure}

Note that the calculation of $\alpha_t$ is done with the softmax function as shown in eqn. \ref{eqn:attent_lstm_sm}, although this is not explicitly mentioned by the authors of the paper. This ensures that the weights $\alpha_t$ used for the weighted sum always sum up to 1.
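The temporal attention computation can be sketched as follows, assuming a bilinear score between the last hidden state $\mathbf{h}_T$ and each past hidden state, followed by a softmax. The hidden states and the identity matrix used for $\mathbf{W}_{\alpha}$ are toy values chosen for illustration; in the network, $\mathbf{W}_{\alpha}$ would be learned.

```python
import math


def score(h_last, h_s, W):
    """Bilinear score h_T^T W h_s between the last and a past hidden state."""
    Wh = [sum(w * v for w, v in zip(row, h_s)) for row in W]
    return sum(a * b for a, b in zip(h_last, Wh))


def softmax(values):
    """Numerically stable softmax; the results are positive and sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden_states, W):
    """Return the attention-weighted sum H of all hidden states and the weights."""
    h_last = hidden_states[-1]
    alphas = softmax([score(h_last, h, W) for h in hidden_states])
    dim = len(h_last)
    H = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return H, alphas


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h_1, h_2, h_3 with T = 3
W = [[1.0, 0.0], [0.0, 1.0]]                           # toy stand-in for W_alpha
H, alphas = temporal_attention(hidden_states, W)
```

Here the last hidden state receives the largest weight, because it has the highest score with itself, while the two earlier states share the remaining weight equally.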

Zeng et al. evaluate their approach on 3 data sets and report a state of the art performance, beating the initial DeepConvLSTM.

Another study by Singh et al. combines DeepConvLSTM with a self-attention mechanism @singh_deep_2021. The attention mechanism is very similar to the one used by Zeng et al. @zeng_understanding_2018, in that it consists of a layer that follows the LSTM layers in the DeepConvLSTM network. Instead of utilizing a score layer which uses both $\mathbf{h}_t$ and $\mathbf{h}_T$, Singh et al. find the weights $\boldsymbol{\alpha}$ by applying the softmax function to the output of a fully connected layer for each $\mathbf{h}_t$, without taking $\mathbf{h}_T$ into account. Apart from that, the two attention mechanisms are very similar. Singh et al. also report a statistically significant increase in performance compared to the initial DeepConvLSTM, although they evaluate their approach on different data sets than Zeng et al.

For HAR, DeepConvLSTM and the models derived from it are the state-of-the-art machine learning methods, as they consistently outperform other model architectures on the available benchmarks and data sets.

Hand washing

To our knowledge, no study has tried to predict obsessive hand washing separately from non-obsessive hand washing.

Most studies that try to automatically detect hand washing are aiming for compliance improvements, i.e. trying to increase or measure the frequency of hand washes or assessing or improving the quality of hand washes. Hand washing compliance can be measured using different tools. Jain et al. @jain_low-cost_2009 use an RFID-based system to check whether health care workers comply with hand washing frequency requirements. However, the system is merely used to make sure all workers entering an emergency care unit have washed their hands. Bakshi et al. @bakshi_feature_2021 developed a hand washing detection data set with RGB video data, and showed a valid way to extract SIFT-descriptors from it for further research. Llorca et al. showed a vision based system for automatic hand washing quality assessment @llorca_vision-based_2011 based on the detection of skin in RGB images using optical techniques such as optical flow estimation.

A study by Li et al. @li_wristwash_2018 is able to recognize 13 steps of a hand washing procedure with an accuracy of $85\,\%$. They employ a sliding-window, feature-based hidden Markov model approach. Wang et al. explore using sensor armbands to assess the user's compliance with given hand washing hygiene guidelines @wang_accurate_2020. They run a classifier using XGBoost and are mostly able to separate the different steps of the scripted hand washing routine. In addition, Cao et al. @cao_awash_2021 developed a system that similarly detects different steps of a scripted hand washing routine and prompts the user if they confuse the order of the steps or forget one of them. The technology is aimed at elderly patients with dementia. Their system is able to detect which step of hand washing is currently being conducted based on wrist motion data, using an LSTM-based neural network. However, none of the three systems mentioned in this paragraph are meant to separate hand washing from other activities.

In order to separate hand washing from other activities, Mondol et al. employ a simple feed-forward neural network. Their network consists of a few linear layers and can be used to detect hand washing @sayeed_mondol_hawad_2020. Their method specifically seeks to eliminate false positives by trying to detect out-of-distribution (OOD) samples, i.e. samples that are very different from the ones seen by the model during training. To this end, they model the features of the last layer before the output layer (the penultimate layer) with a conditional Gaussian distribution.

Steps of HAWAD for parameter estimation and inference, taken from @sayeed_mondol_hawad_2020

They use these features of all positive class samples to calculate the mean $\boldsymbol{\mu}$ and covariance matrix $\mathbf{S}$ of the feature distribution. Based on these estimates, one can compute each sample's distance to the distribution using the Mahalanobis distance (as seen in equation \ref{eqn:mahala}). If, at test time, the model predicts a sample to belong to the positive class, the distance is calculated. If the distance is larger than a threshold $d_{th}$, the sample is classified as negative. The threshold $d_{th}$ can be chosen such that it includes almost all positive samples seen during training. The parameter estimation and inference steps performed in the HAWAD paper can be seen in @fig:HAWAD. On their own data set (HAWAD data set), they reach F1 scores of over $90\,\%$ for hand washing detection.

\begin{figure} \begin{align} D_M(\mathbf{x}) = \sqrt{(\mathbf{x}- \boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x}- \boldsymbol{\mu})} \label{eqn:mahala} \end{align} \caption*{Equation \ref*{eqn:mahala}: Mahalanobis distance} \end{figure}
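The distance-based rejection of out-of-distribution samples can be sketched for 2-dimensional features. This is an illustration under stated assumptions, not the implementation from the HAWAD paper: the feature values are made up, the covariance is inverted in closed form for the $2 \times 2$ case, and the threshold is simply set to the largest training distance, whereas the text above only requires it to include almost all positive training samples.

```python
import math


def mean_and_covariance(samples):
    """Sample mean and 2x2 sample covariance matrix of 2D feature vectors."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(x, mu, S):
    """Mahalanobis distance of x to the distribution given by mu and S (2D only)."""
    (a, b), (c, d) = S
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))   # closed-form 2x2 inverse
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)


def final_label(x, predicted_positive, mu, S, d_th):
    """Keep a positive prediction only if x lies within d_th of the positive class."""
    if not predicted_positive:
        return False
    return mahalanobis(x, mu, S) <= d_th


# Hypothetical penultimate-layer features of positive training samples.
positives = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (2.5, 2.5)]
mu, S = mean_and_covariance(positives)
d_th = max(mahalanobis(p, mu, S) for p in positives)

in_distribution = final_label((2.5, 2.5), True, mu, S, d_th)     # kept as positive
out_of_distribution = final_label((10.0, -5.0), True, mu, S, d_th)  # rejected
```

A sample that the classifier labels as positive but that lies far from the positive training features is thus reclassified as negative, which is the mechanism used to reduce false positives.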

To our knowledge, no hand washing detection method using neural networks more complex than fully connected networks has been published as of 2021. The performance reported in the HAWAD paper could possibly be surpassed by convolutional or recurrent networks or a combination thereof, e.g. CNN, LSTM or DeepConvLSTM. In addition, the detection and separation of compulsive hand washing from ordinary hand washing has, to our knowledge, never been done before. It seems likely that methods from hand washing detection and human activity recognition can be applied to this problem as well.