Related Work
Automatically detecting the current activity of a human being is a broad research field in computer science. Gesture and activity recognition have many possible applications, e.g. human-robot interaction, quality assessment, worker surveillance, and the control of user interfaces. Hand washing detection, a special case of activity detection, has also been the subject of multiple studies over the past years. Apart from hand washing detection itself, the fields most relevant to our work are gesture recognition and human activity recognition, whose relevance is explained below.
Gesture recognition
In the area of gesture recognition, we try to detect and classify specific, narrowly defined gestures. The defined gestures can, for example, be used to actively control a system @saini_human_2020. This kind of approach is not directly applicable to our task of detecting hand washing. However, it could be possible to adapt algorithms from this field to the detection of a new gesture, or a new set of gestures, related to hand washing.
There are camera-based approaches and approaches based on physical measurements @saini_human_2020. The camera-based approaches were out of scope for this work: as explained in the introduction, wrist-worn devices have significant advantages in our setting over camera-based solutions, which would have to be stationary, i.e. mounted in fixed locations. There are also approaches based on inertial measurement sensors, which measure movement-related physical quantities such as force or acceleration, angular velocity, or orientation in space.
Gesture recognition generally uses methods similar to those of the more difficult task of human activity recognition @saini_human_2020, which is explained below.
Human activity recognition
\label{section:har} Recognizing a combination of gestures or body movements in a temporal context, and deriving the current activity of the user from them, is called human activity recognition (HAR). In this task, we want to detect more general activities, compared to shorter and simpler gestures. An activity can include many distinguishable gestures; however, the same activity will not always include all of the same gestures, and the included gestures may appear in a different order in every repetition. Activities are less repetitive than gestures, and harder to detect in general @zhu_wearable_2011. However, Zhu et al. have shown that the combined detection of multiple different gestures can be used in HAR tasks too @zhu_wearable_2011, which is plausible, because a human activity can consist of many gestures. Nevertheless, most methods used for HAR apply machine learning to the data more directly, without the detour of detecting the specific gestures contained in the execution of an activity.
Methods used in HAR include classical machine learning methods as well as deep learning @liu_overview_2021 @bulling_tutorial_2014. The classical machine learning methods rely on features obtained by feature engineering, i.e. the computation of meaningful statistics over the time frame for which the activity should be predicted. The features can be frequency-domain or time-domain based, but usually both kinds are used together to train these conventional models @liu_overview_2021. The classical machine learning methods include, but are not limited to, Random Forest classifiers (RFC), Hidden Markov Models (HMM), and Support Vector Machines (SVM).
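To make the feature engineering concrete, the following sketch computes a handful of common time- and frequency-domain features over one sensor window; the function name, feature selection, and sampling rate are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def extract_features(window: np.ndarray, fs: float = 50.0) -> np.ndarray:
    """Compute simple time- and frequency-domain features for one
    sensor window of shape (samples, channels)."""
    feats = []
    for ch in window.T:
        # time-domain statistics
        feats += [ch.mean(), ch.std(), ch.min(), ch.max()]
        # frequency-domain: dominant frequency (skipping DC) and spectral energy
        spectrum = np.abs(np.fft.rfft(ch))
        freqs = np.fft.rfftfreq(len(ch), d=1.0 / fs)
        feats += [freqs[np.argmax(spectrum[1:]) + 1], (spectrum ** 2).sum() / len(ch)]
    return np.array(feats)

# Example: a 2-second window of 3-axis accelerometer data at 50 Hz
window = np.random.randn(100, 3)
x = extract_features(window)  # feature vector for an RFC/HMM/SVM classifier
```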
Deep neural networks
Recently, deep neural networks have taken over the role of the state-of-the-art machine learning method in the area of human activity recognition @bock_improving_2021, @liu_overview_2021. Deep neural networks are universal function approximators @bishop_pattern_2006 and are known for being easy to use on "raw" data. They are artificial neural networks consisting of multiple layers, where each layer contains a certain number of nodes that are connected to the nodes of the following layer. Each connection is assigned a weight, and the weighted sum over the values of all connected nodes in the previous layer is used to calculate the value of a node in the next layer. Simple neural networks in which all nodes of a layer are connected to all nodes of the following layer are often called "fully connected neural networks" (FC-NN or FC). The connections' parameters are optimized using forward passes through the network of nodes, followed by the execution of the backpropagation algorithm and an optimization step: the gradients of a loss function with respect to each parameter are accumulated over a small subset of the data, and "stochastic gradient descent" (SGD), or a similar optimization method such as the commonly used Adam optimizer @kingma_adam_2017, performs a parameter update step. After many such updates, and if the training works well, the network parameters will have been updated to values that lead to a lower value of the loss function on the training data. However, there is no guarantee of convergence whatsoever. As mentioned above, deep neural networks can, in theory, be used to approximate arbitrary functions. Nevertheless, the parameters for a perfect approximation cannot easily be found, and empirical testing has shown that neural networks need a lot of training data in order to perform well, compared to classical machine learning methods. In return, given enough data, deep neural networks often outperform the classical machine learning methods.
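A minimal PyTorch training-loop sketch may clarify the interplay of forward pass, backpropagation, and optimizer step described above; the network size and the dummy data are arbitrary choices for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# dummy data: 256 feature vectors with 18 features, two classes
data = TensorDataset(torch.randn(256, 18), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32, shuffle=True)

model = nn.Sequential(          # a small fully connected network (FC-NN)
    nn.Linear(18, 64), nn.ReLU(),
    nn.Linear(64, 2),           # two output classes
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x_batch), y_batch)  # forward pass and loss
        loss.backward()                          # backpropagation: gradients
        optimizer.step()                         # Adam parameter update step
```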
Convolutional neural networks (CNNs)
are neural networks that are not fully connected, but instead use convolutions with kernels that slide over the input. CNNs were first introduced for handwritten character recognition @lecun_backpropagation_1989 @le_cun_handwritten_1990 (1989, 1990), but were later revived for computer vision tasks @krizhevsky_imagenet_2012 (2012), once enough computational power was available on modern devices to train them. Since the rise of CNNs in computer vision, most computer vision problems are solved with them. The convolutions work by moving filter windows with learnable parameters (also called kernels) over the input @albawi_understanding_2017. As opposed to a fully connected network, the weights are shared across many of the nodes, because the same filters are applied over the full extent of the input. CNNs have fewer parameters to train than a fully connected network with the same number of nodes, which makes them easier to train. They are generally expected to perform better than FC networks, especially on image-related tasks. The filters can be 2-dimensional, as for images (e.g. a 5x5 filter moved across the two axes of an image), or 1-dimensional, which can, for example, be used to slide a kernel along the time dimension of a sensor recording. Even in the 1-dimensional case, fewer parameters are needed compared to a fully connected network, so the 1-dimensional CNN is expected to be easier to train and to achieve better performance.
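The following sketch shows a small 1-dimensional CNN sliding kernels along the time axis of a 3-axis accelerometer window; all layer sizes and the classification head are illustrative assumptions.

```python
import torch
from torch import nn

# 1-dimensional CNN over the time axis of a 3-axis accelerometer window
cnn = nn.Sequential(
    nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5),  # 3 sensor axes
    nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=5),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),   # pool over the remaining time dimension
    nn.Flatten(),
    nn.Linear(32, 2),          # e.g. hand washing vs. other activity
)

window = torch.randn(8, 3, 100)  # batch of 8 windows, 3 channels, 100 samples
logits = cnn(window)             # shape: (8, 2)
```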
Recurrent neural networks (RNNs)
are similar to feed-forward neural networks, with the difference that they have access to information from previous time steps. The simplest version of an RNN is a single node that takes the input of the current time step together with its own output from the previous time step, so that information can be carried forward through time.
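A minimal sketch of such a recurrence, assuming an Elman-style cell (the class name and sizes are our own choices for illustration):

```python
import torch
from torch import nn

class SimpleRNNCell(nn.Module):
    """Elman-style RNN cell: the new hidden state is computed from the
    current input and the hidden state of the previous time step."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.in2hidden = nn.Linear(input_size, hidden_size)
        self.hidden2hidden = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.in2hidden(x_t) + self.hidden2hidden(h_prev))

cell = SimpleRNNCell(input_size=3, hidden_size=8)
h = torch.zeros(1, 8)                 # initial hidden state
for x_t in torch.randn(100, 1, 3):    # unroll over 100 time steps
    h = cell(x_t, h)                  # h carries information across steps
```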
Long short-term memory (LSTM)
can be used to combat the vanishing gradient problem in recurrent neural networks @hochreiter_long_1997, @hochreiter_vanishing_1998. LSTMs are used in various applications, such as time series prediction, speech recognition, and translation tasks (including generative tasks) @smagulova_survey_2019, but also for human activity recognition. They handle temporal dependencies well and can "remember" important parts of their past state.
\label{sec:LSTM} LSTMs consist of a "cell", of which one or more can be contained in a neural network. The LSTM cell is shown in @fig:lstm_cell and consists of two inputs, four gates, and two outputs. The values gathered from the outputs are also part of the input in the next time step of the network's execution, introducing a special case of recurrency. The inputs to the cell are the external input $\mathbf{x}_t$ at time step $t$, together with the cell state $\mathbf{c}_{t-1}$ and hidden state $\mathbf{h}_{t-1}$ from the previous time step; the outputs are the updated cell state $\mathbf{c}_t$ and hidden state $\mathbf{h}_t$, computed as follows.
\begin{figure}
\begin{align}
\mathbf{f}_t &= \sigma(\mathbf{W}_{fx}\mathbf{x}_t + \mathbf{W}_{fh}\mathbf{h}_{t-1} + \mathbf{b}_f) \label{eqn:lstm1} \\
\mathbf{i}_t &= \sigma(\mathbf{W}_{ix}\mathbf{x}_t + \mathbf{W}_{ih}\mathbf{h}_{t-1} + \mathbf{b}_i) \\
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_{cx}\mathbf{x}_t + \mathbf{W}_{ch}\mathbf{h}_{t-1} + \mathbf{b}_c) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \\
\mathbf{o}_t &= \sigma(\mathbf{W}_{ox}\mathbf{x}_t + \mathbf{W}_{oh}\mathbf{h}_{t-1} + \mathbf{b}_o) \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \label{eqn:lstm2}
\end{align}
\caption*{Equations \ref*{eqn:lstm1}--\ref*{eqn:lstm2}: the LSTM cell update equations}
\end{figure}
The LSTM's four gates are:
- forget gate
- new memory gate
- input gate
- output gate
These gates are fully connected neural network layers (marked in orange, with their corresponding activation functions, in @fig:lstm_cell) with respective weights and biases, and they serve the functionality from which their names are derived. The weights and biases must be learned during the training phase of the neural network. The forget gate allows the LSTM to carry over only part of the "remembered" cell memory $\mathbf{c}_{t-1}$ into the current step, so that information that is no longer relevant can be discarded; the new memory and input gates determine which new information is written to the cell state, and the output gate controls which parts of the cell state are exposed in the hidden state $\mathbf{h}_t$.
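For illustration, the cell update from Equations \ref*{eqn:lstm1}--\ref*{eqn:lstm2} can be written out directly; the weight layout and the names `lstm_cell_step`, `W`, and `b` are our own choices for this sketch.

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps gate names to weight matrices, b to biases.
    Weights are stored transposed, so x @ W["fx"] corresponds to W_fx x."""
    f_t = torch.sigmoid(x_t @ W["fx"] + h_prev @ W["fh"] + b["f"])   # forget gate
    i_t = torch.sigmoid(x_t @ W["ix"] + h_prev @ W["ih"] + b["i"])   # input gate
    c_tilde = torch.tanh(x_t @ W["cx"] + h_prev @ W["ch"] + b["c"])  # new memory
    c_t = f_t * c_prev + i_t * c_tilde          # keep part of old memory, add new
    o_t = torch.sigmoid(x_t @ W["ox"] + h_prev @ W["oh"] + b["o"])   # output gate
    h_t = o_t * torch.tanh(c_t)                 # expose part of the cell state
    return h_t, c_t

d_in, d_h = 3, 8
W = {k: torch.randn(d_in if k.endswith("x") else d_h, d_h) * 0.1
     for k in ["fx", "fh", "ix", "ih", "cx", "ch", "ox", "oh"]}
b = {k: torch.zeros(d_h) for k in "fico"}
h, c = torch.zeros(1, d_h), torch.zeros(1, d_h)
h, c = lstm_cell_step(torch.randn(1, d_in), h, c, W, b)
```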
DeepConvLSTM
is a network proposed by Ordonez et al. @ordonez_deep_2016 and consists of a number of convolutional layers as well as two LSTM layers. It reaches state-of-the-art performance and is used for general human activity recognition tasks. The combination of convolutional layers and LSTMs works well with time series data, as it combines the advantages of convolutional feature extraction with the intelligent "memory" provided by the LSTMs.
Bock et al. @bock_improving_2021 employ an altered version of DeepConvLSTM @ordonez_deep_2016: they propose reducing the number of LSTM layers to one, resulting in the architecture shown in @fig:deepConvLSTM. They evaluate their approach on five different publicly available data sets and report increased performance on four of the five. Leaving out one LSTM layer drastically reduces the number of parameters to be learned, as well as the time needed to train the network.
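A rough sketch of such an architecture, with convolutional layers followed by a single LSTM layer, is given below; the filter counts, kernel sizes, and the use of the last time step for classification are illustrative assumptions, not the exact configuration of Bock et al.

```python
import torch
from torch import nn

class DeepConvLSTMSketch(nn.Module):
    """DeepConvLSTM-style model with one LSTM layer; sizes are illustrative."""
    def __init__(self, n_channels=3, n_classes=2, n_filters=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, n_filters, kernel_size=5), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5), nn.ReLU(),
        )
        self.lstm = nn.LSTM(n_filters, hidden, num_layers=1, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                          # x: (batch, channels, time)
        feats = self.conv(x)                       # (batch, filters, time')
        out, _ = self.lstm(feats.transpose(1, 2))  # (batch, time', hidden)
        return self.classifier(out[:, -1])         # predict from last time step

model = DeepConvLSTMSketch()
logits = model(torch.randn(8, 3, 100))  # batch of 8 sensor windows
```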
\label{sec:LSTMA} In their paper "Understanding and improving recurrent networks for human activity recognition by continuous attention", Zeng et al. apply an attention mechanism to LSTM-based neural network models @zeng_understanding_2018. They propose the separate addition of temporal attention and sensor attention to the LSTM layers used in such networks. The sensor attention approach can be useful when using multiple sensor locations across the body, letting the network focus on measurements from the sensors most relevant to a specific task. The temporal attention approach works as shown in @fig:lstm_attention. The "normal" unrolled LSTM forward pass is pictured on the left. The temporal attention mechanism, shown on the right, makes the information from past recurrency steps available after the LSTM output: the outputs of the LSTM are saved at each step and then combined into a weighted sum at the end, as described by the following equations.
\begin{figure}
\begin{align}
\mathbf{H} &= \sum_{t=1}^{T} \alpha_t \mathbf{h}_t \label{eqn:attent_lstm1} \\
\alpha_t &= \frac{\exp(score(\mathbf{h}_T, \mathbf{h}_t))}{\sum_{s=1}^{T} \exp(score(\mathbf{h}_T, \mathbf{h}_s))} \label{eqn:attent_lstm_sm} \\
score(\mathbf{h}_T, \mathbf{h}_s) &= \mathbf{h}_T^{T} \mathbf{W}_{\alpha} \mathbf{h}_s \label{eqn:attent_lstm2}
\end{align}
\end{figure}
Note that the calculation of the weights $\alpha_t$ in Equation \ref*{eqn:attent_lstm_sm} is a softmax over the scores, so the attention weights are positive and sum to one.
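In code, the temporal attention step over the saved LSTM outputs could look as follows; the tensor shapes, initialization, and function name are illustrative.

```python
import torch

def temporal_attention(h, W_alpha):
    """Weighted sum over the unrolled LSTM outputs h of shape (T, d),
    following the temporal attention equations above."""
    h_T = h[-1]                                 # final hidden state
    scores = h @ (h_T @ W_alpha)                # score(h_T, h_s) for all s
    alpha = torch.softmax(scores, dim=0)        # attention weights, sum to 1
    return (alpha.unsqueeze(1) * h).sum(dim=0)  # H = sum_t alpha_t h_t

T, d = 20, 128
h = torch.randn(T, d)                # saved outputs of the unrolled LSTM
H = temporal_attention(h, torch.randn(d, d) * 0.01)
```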
Zeng et al. evaluate their approach on three data sets and report state-of-the-art performance, beating the original DeepConvLSTM.
Another study, by Singh et al., combines DeepConvLSTM with a self-attention mechanism @singh_deep_2021. The attention mechanism is very similar to the one used by Zeng et al. @zeng_understanding_2018: it consists of a layer that follows the LSTM layers in the DeepConvLSTM network. Instead of utilizing a score layer which uses both the final hidden state and the intermediate hidden states, the self-attention mechanism relates the hidden states of all time steps to each other.
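As a rough sketch, a generic dot-product self-attention over the LSTM outputs could look as follows; this is a common formulation and not necessarily the exact mechanism of Singh et al.

```python
import torch

def self_attention(h):
    """Dot-product self-attention over LSTM outputs h of shape (T, d);
    a generic sketch with scaled pairwise scores between time steps."""
    scores = h @ h.T / h.shape[1] ** 0.5   # pairwise relevance of time steps
    alpha = torch.softmax(scores, dim=-1)  # one weight row per time step
    return alpha @ h                       # contextualized outputs, (T, d)

context = self_attention(torch.randn(20, 128))
```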
For HAR, DeepConvLSTM and the models derived from it are the state-of-the-art machine learning methods, as they consistently outperform other model architectures on the available benchmarks and data sets.
Hand washing
To our knowledge, no study has ever tried to separately predict obsessive hand washing as opposed to non-obsessive hand washing.
Most studies that try to automatically detect hand washing aim at compliance improvements, i.e. at measuring or increasing the frequency of hand washes, or at assessing or improving their quality. Hand washing compliance can be measured using different tools. Jain et al. @jain_low-cost_2009 use an RFID-based system to check whether health care workers comply with hand washing frequency requirements; however, the system is merely used to make sure that all workers entering an emergency care unit have washed their hands. Bakshi et al. @bakshi_feature_2021 developed a hand washing detection data set with RGB video data and showed a valid way to extract SIFT descriptors from it for further research. Llorca et al. presented a vision-based system for automatic hand washing quality assessment @llorca_vision-based_2011, based on the detection of skin in RGB images using optical techniques such as optical flow estimation.
A study by Li et al. @li_wristwash_2018 is able to recognize 13 steps of a hand washing procedure with high accuracy using a wrist-worn device, showing that a fine-grained analysis of hand washing with wearable sensors is feasible.
In order to separate hand washing from other activities, Mondol et al. employ a simple feed-forward neural network consisting of a few linear layers @sayeed_mondol_hawad_2020. Their method specifically seeks to eliminate false positives by detecting out-of-distribution (OOD) samples, i.e. samples that are very different from the ones seen by the model during training. To this end, they fit a conditional Gaussian distribution to the network's features in the last layer before the output layer (the penultimate layer).
They use said features of all positive-class samples to calculate the mean $\boldsymbol{\mu}$ and the covariance matrix $\mathbf{S}$ of this distribution. At test time, the Mahalanobis distance (Equation \ref*{eqn:mahala}) between a sample's features and the fitted distribution is computed, and samples whose distance exceeds a threshold are rejected as out of distribution.
\begin{figure} \begin{align} D_M(\mathbf{x}) = \sqrt{(\mathbf{x}- \boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x}- \boldsymbol{\mu})} \label{eqn:mahala} \end{align} \caption*{Equation \ref*{eqn:mahala}: Mahalanobis distance} \end{figure}
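A small sketch of this OOD check, assuming the penultimate-layer features have already been extracted; the feature dimension and the threshold value are arbitrary placeholders.

```python
import numpy as np

# features of the positive-class training samples, shape (N, d);
# in HAWAD these would come from the network's penultimate layer
feats = np.random.randn(500, 32)

mu = feats.mean(axis=0)                             # class mean
S_inv = np.linalg.inv(np.cov(feats, rowvar=False))  # inverse covariance matrix

def mahalanobis(x):
    """Mahalanobis distance of feature vector x to the fitted Gaussian."""
    diff = x - mu
    return np.sqrt(diff @ S_inv @ diff)

# samples whose distance exceeds a chosen threshold are treated as OOD
is_ood = mahalanobis(np.random.randn(32)) > 8.0
```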
To our knowledge, no hand washing detection method using neural networks more complicated than fully connected networks has been published as of 2021. The performance reached in the HAWAD paper could possibly be surpassed by convolutional or recurrent networks, or a combination thereof, e.g. a CNN, an LSTM, or DeepConvLSTM. In addition, the detection and separation of compulsive hand washing from ordinary hand washing has, to our knowledge, never been attempted before. It seems likely that methods from hand washing detection and human activity recognition can be applied to this problem as well.