##### Problem 1
For the problem of classifying hand washing and separating it from all other activities, the raw predictions of the networks without smoothing reached an F1 score of $0.853$ (DeepConvLSTM) and an S score of $0.758$ (DeepConvLSTM-A). DeepConvLSTM and DeepConvLSTM-A surpass all other models we tested, including the baselines RFC, SVM, majority classifier and random classifier, by large margins. This is in line with related work on other human activity recognition tasks, where DeepConvLSTM, with and without small modifications, also achieved the best results. On this specific problem, the CNN model is also worth mentioning: its performance was worse, but not far behind the DeepConvLSTM based models.
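For reference, the F1 score used throughout this evaluation is the harmonic mean of precision and recall; a minimal sketch of its computation from confusion-matrix counts (the S score is defined in section \ref{s_score} and is not reproduced here):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw confusion-matrix counts:
    the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```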
The application of smoothing improved the performance of the models even further, to an F1 score of $0.892$ (DeepConvLSTM) and an S score of $0.819$ (DeepConvLSTM-A). This performance boost can be explained by the temporal context captured in the data: if many windows in rapid succession are classified as hand washing, a small number of interspersed Null class predictions is likely to be wrong. The smoothing helps to filter out both false positives and false negatives.
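The principle can be illustrated with a simple majority vote over the most recent window predictions, which suppresses isolated outlier predictions in either direction. This sketch is an assumption for illustration only, not necessarily the exact smoothing procedure used:

```python
def smooth_predictions(raw, k=5):
    """Majority-vote smoothing over a sliding buffer of the last k
    binary per-window predictions (1 = hand washing, 0 = Null class).

    Illustrative sketch: the actual smoothing in this work may use a
    different buffer length and voting rule.
    """
    smoothed = []
    for i in range(len(raw)):
        votes = raw[max(0, i - k + 1):i + 1]
        # emit 1 only if a strict majority of recent windows is positive
        smoothed.append(1 if sum(votes) * 2 > len(votes) else 0)
    return smoothed
```

An isolated Null prediction inside a run of positive windows is overruled, and an isolated positive window surrounded by Null predictions is discarded.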
Normalization was shown to be ineffective for our approach, worsening the performance of almost all models. This could be due to the difference in distribution between the train and test sets. The normalization parameters were estimated on the train set and applied to the test set, which implicitly assumes that both sets follow the same distribution. This assumption did not hold here, which is probably why the normalized data was harder to learn and test on than the non-normalized data.
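The issue can be illustrated with standard z-score normalization (a sketch; the exact normalization scheme may differ): parameters fitted on the train set only center the test data correctly if both splits share the same distribution.

```python
def zscore_fit(train):
    """Estimate normalization parameters (mean, std) on the train set."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((x - mean) ** 2 for x in train) / n) ** 0.5
    return mean, std

def zscore_apply(data, mean, std):
    """Apply the train-set parameters to any split."""
    return [(x - mean) / std for x in data]
```

If the test distribution is shifted relative to the train distribution, the "normalized" test data is not zero-centered at all, so the network sees inputs unlike anything it was trained on.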
The binarized versions of the models trained on problem 3 achieve F1 scores similar to those of the models trained for problem 1, a notable success. However, their S scores are worse, by about $0.052$ for the best model and by more for the others. Therefore, and especially because of the higher importance of the S score, the models trained on problem 3 are not as good at classifying hand washing and separating it from other activities as the models specifically trained for this problem. This lower performance can be explained by the higher difficulty of the 3-class problem learned by the classifiers trained for problem 3; thus the loss in performance was to be expected.
Compared to the results obtained by Mondol et al. in HAWAD @sayeed_mondol_hawad_2020, with F1 scores over $90\,\%$, it may look like our approach provides weaker results. Their detection of out-of-distribution samples sounds like a good idea in theory. However, we must argue that their results and ours are not entirely comparable, because we did not train or evaluate on the same data as they did. Added to that, from what they report in their paper, they did not split the data by subjects, but rather by data windows, with random sampling. This means that, during training, their model saw data from all subjects, including the subjects whose data they later tested on. Although this is not technically a leak from train to test set, our approach of splitting by subjects can be expected to deliver a better estimate of the generalization performance, because our models' (over-)adaptation to specific subjects' styles or patterns of hand washing cannot yield a performance boost on unseen subjects. Nevertheless, the detection of out-of-distribution samples could possibly increase the performance of our models. Still, one has to keep in mind that a sample being out of distribution does not always mean that it cannot be hand washing, especially if we test on unseen subjects who might employ different patterns of motion during hand washing. For these reasons the comparability of the results seems rather low, with the performance of HAWAD likely being overestimated in comparison to our scenario. According to our findings, a fully connected network cannot reach the same level of performance as the DeepConvLSTM based models.
##### Problem 2
The problem of classifying compulsive hand washing and distinguishing it from non-compulsive hand washing may seem more difficult than problem 1 from an outsider's perspective: distinguishing different, but closely related, types of hand washing should be harder than telling hand washing apart from all other activities. However, the results of problem 2 seem to prove the opposite, as significantly higher F1 scores and S scores are reached. For the raw predictions, F1 scores of around $0.92$ are reached by the LSTM and DeepConvLSTM(-A). An S score of $0.869$ is reached by DeepConvLSTM. The classic machine learning methods SVM and RFC also reach good F1 scores near $0.89$, but significantly lower S scores below $0.735$. Both metrics' scores reached without smoothing are higher than the scores reached with smoothing for problem 1, indicating that classifying compulsive hand washing and distinguishing it from non-compulsive hand washing could be learned better than classifying hand washing and separating it from all other activities. This could stem from the much smaller amount of data used for problem 2, as we only included hand washing data here, all of which was taken from our own data sets rather than from the external ones. The heterogeneity of the data set for problem 1 compared to problem 2 probably makes the network training harder in problem 1. Furthermore, the data for problem 1 includes many activities, which must implicitly be mapped to the Null class by the network, while problem 2 only knows two activities. The class imbalance of the data used for problem 2 ($65\,\%$ positive samples, $35\,\%$ negative samples) is also smaller than that of problem 1. However, it is unclear whether this had an effect on the performance of the neural network based methods, as we used a weighted loss function to combat this problem.
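The weighted loss function mentioned above can be sketched as a class-weighted cross-entropy; the concrete weighting scheme in our training setup may differ, but weights inversely proportional to class frequency are a common choice:

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean class-weighted negative log-likelihood (illustrative sketch).

    probs: per-sample predicted class probabilities,
    labels: true class indices,
    class_weights: one weight per class, e.g. inverse class frequency,
    so that errors on the minority class are penalized more heavily.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -class_weights[y] * math.log(p[y])
    return total / len(labels)
```

Raising the weight of the minority class increases the loss contribution of its misclassifications, counteracting the imbalance in the data.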
As for problem 1, normalizing the data did not lead to better performance, but rather decreased it. The reasons are assumed to be the same as for problem 1.
The results of problem 2 with the application of smoothing look even more promising. A very high F1 score of $0.966$ and an S score of $0.911$ are reached by DeepConvLSTM-A. The sensitivity reaches a value of $0.997$, while the specificity is $0.839$. Added to that, DeepConvLSTM, LSTM and LSTM-A all reach very similar performance levels; DeepConvLSTM-A has the highest values, but the differences are insignificant. The performance is even better than without smoothing and hints that the detection of compulsive hand washing in a hand washing scenario is actually feasible. However, at this point we must stress again that the compulsive hand washing data used in this work is only simulated. Although the simulation was supervised with the help of expert psychologists, there is no guarantee that real compulsive hand washing is distinguishable with the same level of performance. Arguably, even if the real difference between hand washing and compulsive hand washing were smaller than in our data, the performance could still be high enough for a satisfactory separation of the two. It is likely that, once enough data from OCD patients is available, the models mentioned could be trained to detect non-simulated compulsive washing with high accuracy.
As there is no published previous work in the area of automatically detecting compulsive hand washing, our results cannot be compared to existing ones. The strong performance levels indicate a high probability that the approach is applicable in real world testing. Sadly, as our real world evaluation was limited to the best model for problem 1, we cannot report real world results to test this hypothesis.
##### Problem 3
The problem of classifying hand washing and compulsive hand washing separately, while distinguishing both from other activities at the same time, is arguably harder than the other two problems. Problem 3 can be seen as the unification of problem 1 and problem 2, namely classifying whether an activity is hand washing (problem 1) and, if yes, whether said washing activity is compulsive hand washing (problem 2). As a 3-class classification problem, problem 3 is more difficult and leaves more room for errors than the other two problems. Thus, a lower level of performance must be expected.
Of the models trained directly on problem 3, DeepConvLSTM-A performed best, with a multiclass F1 score of $0.692$, a multiclass S score of $0.769$ and a mean diagonal value of the confusion matrix of $0.712$. DeepConvLSTM achieved slightly lower, but nearly as good, performance. For this problem, the baseline classic machine learning methods performed much worse, with their multiclass F1 and S scores, as well as their mean diagonal values of the confusion matrix, all lying around $0.5$.
In addition to the models trained on problem 3, we also report the performance of a chained model that consists of the two best performing models for problem 1 and problem 2. As problem 3 is the combination of problems 1 and 2, the chained model can be used to make the same predictions. The chained model we use consists of DeepConvLSTM-A from problem 1 and DeepConvLSTM from problem 2, as those were the best performing models on these two problems in terms of non-smoothed predictions and the S score. The chained model reached an even higher performance, with a multiclass F1 score of $0.714$, a multiclass S score of $0.783$ and a mean diagonal value of the confusion matrix of $0.718$. This result is valuable, because it shows that the classifiers trained for problem 1 and problem 2 can outperform a classifier specifically trained for problem 3. It indicates that the sub-problems of problem 3 are more easily solved independently than problem 3 is solved directly. The downside of using two networks is that they take twice the time to train, twice the time and energy to run, and twice the memory or storage, and are thus less efficient. Especially on a smart watch or any embedded mobile device the models could be deployed on, this could be a big disadvantage compared to the single model trained for problem 3. The performance difference is significant, but not by a large margin; the difference of $0.03$ in the multiclass S score is so small that it could well be indistinguishable for real world users.
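Conceptually, the chaining works as follows; the two model arguments are placeholders for the trained problem-1 and problem-2 networks, each reduced here to a boolean decision on one sensor window:

```python
def chained_predict(window, detects_washing, detects_compulsive):
    """Combine the binary problem-1 and problem-2 classifiers into a
    3-class prediction for problem 3.

    detects_washing / detects_compulsive stand in for the trained
    DeepConvLSTM-A (problem 1) and DeepConvLSTM (problem 2) models,
    each mapping a sensor window to a boolean decision.
    """
    if not detects_washing(window):
        return "null"            # not hand washing at all
    if detects_compulsive(window):
        return "compulsive"      # hand washing, classified as compulsive
    return "washing"             # ordinary hand washing
```

Note that the problem-2 model is only consulted for windows that the problem-1 model classifies as hand washing, matching the decomposition described above.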
We did not apply smoothing for problem 3, but it could be done in theory, using a slightly different approach, and might also improve the performance of the system.
### Practical applicability
The data from the real world evaluation with our test subjects shows that not all real world hand washing procedures are detected by our smart watch system. Overall, the system's sensitivity was only $28.33\,\%$ in the evaluation of a "normal day", which is much lower than the theoretical results. However, this was to be expected to some degree, since real hand washing comes in many forms and patterns that are unlikely to all be captured during the explicit recording of training data. Added to that, the hand washing detection depended, at least for some subjects, on the side of the body on which the watch was worn. The performance was significantly worse if the watch was worn on the right wrist. This is likely because the hand washing data used for training was collected almost exclusively with smart watches worn on the left wrist. If the data from subjects wearing the watch on the right wrist is excluded, the overall detection sensitivity rises to $50\,\%$.
For two subjects, the smart watch application did not work properly, i.e. it did not start to run in the background as desired, which is why their results could not be included in the reported results. However, it is possible that other users' smart watch applications were also inactive for some of the time, possibly missing some hand washing procedures.
Because of the smoothing applied to the data, several consecutive windows must be classified as positive before a prediction is made, which means that a real hand washing procedure needs to last around $10\,s$ or longer. In practice, washing one's hands can take less time, in which case the system will not detect it properly. It is even enough if, for some period in the middle of a washing procedure, the washing intensity is low enough for the model to misclassify it as noise.
It is not entirely clear why the theoretical results could not be fully reproduced in the real life scenario. It could be due to the assumptions made during the recording of the data sets, i.e. the way the hands were washed during the recordings could be too different from unbiased real world washing. In order to improve the performance in the real world, further research has to be conducted. All in all, the system was able to correctly detect most hand washing procedures, and is therefore somewhat effective at this task.
We also expected that a higher intensity or a longer duration of hand washing would have a positive influence on the detection probability of the model on the smart watch. This seems logical for longer durations due to the smoothing, but also for higher intensity, due to the larger sensor values. It can be assumed that the system reaches higher certainty for high intensity washing than for low intensity washing, as it is likely more separable from less intense activities. However, the results showed a significantly positive correlation only between intensity and detection rate ($0.267$), whereas detection rate and hand washing duration seemed to be mostly uncorrelated ($-0.039$). This may again be due to the relatively small sample size. Especially for the longer washing tasks of $30\,s$ and $35\,s$, there were only two examples, neither of which was detected. This may have had a big influence on the absence of a positive correlation in the evaluation results.
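Such correlation values can be computed as Pearson correlation coefficients between, e.g., washing intensity and the binary detection outcome; this assumes Pearson's r was the measure used, which the text does not state explicitly:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

With a binary detection variable this reduces to the point-biserial correlation, which is numerically identical to Pearson's r.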
Added to that, the system detected an average of 4 false positives per subject per day. These false positives could lead to annoyance and ultimately to users losing trust in the detection capabilities of the system. However, the number found in the everyday task also varied a lot from subject to subject. Mainly, washing activities led to false positives, which was to be expected, because movements similar to those in hand washing are executed. Other activities also led to false positives, confirming that the high, but not near-perfect, specificity of the theoretical results does not translate into the total avoidance of false positives.
The test of scenario 2, the task of washing intensively for at least 30 seconds, yielded a much higher detection rate. Per subject, the washing was detected on average in $76\,\%$ of washing repetitions. Compared to the sensitivity of $90\,\%$ reached for problem 1 with smoothing, this is lower by only $14$ percentage points. The discrepancy here is much smaller than in the everyday scenario. This could be because the training data for hand washing was also collected in a more controlled environment, leading to more similar patterns. The results of the evaluation for scenario 2 are thus better than those for scenario 1.
In total, the practical evaluation showed some weaknesses and some strengths of the system. As the sample size is small and system instabilities occurred, the results have to be interpreted carefully. The evaluation is valid especially for the false positives and the activities provoking them. However, the low sensitivity found in the everyday task does not match the much higher sensitivity found in the intensive hand washing task, and the differences between subjects were large for scenario 1. Part of the reason for this is the difference in performance between the left and the right wrist.
## Comparison of goals to results
#### The detection of hand washing in real time from inertial motion sensors is feasible
The goal of detecting hand washing and separating it from other activities in real time was reached by employing the trained DeepConvLSTM-A network, which achieved good performance in our theoretical evaluation. The detection is not perfect yet; the separation from other activities still has some weaknesses, especially when washing activities other than hand washing are involved. The system also missed too many of the hand washing procedures executed in the real world evaluation. However, it was able to detect and correctly identify hand washing very well in the theoretical evaluation, and in many cases in the practical evaluation, which is why we consider our goal mostly reached.
#### Hand washing and compulsive hand washing can be separated
The separation of hand washing from compulsive hand washing worked extremely well in the theoretical evaluation, the only evaluation in which we were able to test it. A sensitivity of $99.7\,\%$ was reached with smoothing, while maintaining a specificity of $83.9\,\%$. This means that almost all compulsive hand washing in our test data was detected by the system, although the false positive rate is still a bit higher than we would like it to be. Nevertheless, the performance of the model trained for this problem was very strong and fully matched our expectations. We think that performance at this level in the real world could be applied in the treatment of patients with OCD, which is why we consider this goal reached, too.
#### A real world evaluation was conducted
The practical evaluation provided us with valuable feedback, showing us strengths and weaknesses of the hand washing detection model. Especially for the previously mentioned washing activities, the evaluation showed the need for their inclusion as negative training examples. Apart from the false positives, the real world evaluation confirmed some of our hopes that the system is actually able to detect everyday hand washing with high precision. Although the performance in the real world test was lower than in the theoretical evaluation, it worked well for some of the subjects. The real world evaluation still yielded a strong performance, especially in the task of scenario 2. The performance estimate for the intense and long washing task (scenario 2) was much closer to the performance reached on our pre-recorded test set, which showed that the system detects more intense hand washing even better. Overall, the real world evaluation was successful, providing valuable information about the weak points and strengths of the system so far. In total, the number of subjects and the amount of feedback received were somewhat too small to draw fully qualified conclusions, as the variance of the results between subjects was high.
## Future work
The general performance of our models on problem 2, distinguishing compulsive from non-compulsive hand washing, was high. The downside is that this model is only applicable if we know when the hand washing takes place. However, our results could be employed together with other tools that provide this knowledge about the user currently washing their hands. Examples of this are in development in our group, one of them being a soap dispenser with an integrated proximity sensor. Added to that, Bluetooth beacons stationed near sinks can be used to let the smart watch know that the user is near a specific sink. Conductivity sensors on the user's skin could be employed to detect a change of conductivity caused by contact with tap water. One or more of these methods combined with our model trained for problem 2 could possibly be used to achieve a higher performance for the task of compulsive hand washing detection in the future.
The detection of hand washing could be incorporated into many devices, mainly wrist worn ones like smart watches. In order to further improve the detection capabilities and accuracy, one would need to invest even more time into carefully designing and training better models. This work's architecture search could be expanded, and more parameter combinations could be tried out. For example, types of layers that have not been included in the architectures yet could be tried. Instead of normalizing data on the data set level, batch normalization could be used to try to make the networks faster and more stable.
Different attention mechanisms could be tried out on the hand washing data.
On top of that, all the other hyperparameters could be optimized more thoroughly. Instead of manual hyperparameter optimization (HPO), more sophisticated versions of HPO could be employed, e.g. Bayesian optimization. This could lead to better choices for the batch size, learning rate and other parameters. However, such a search may take a lot of time to run, as it is computationally expensive.
The current state of the system, especially for the classification of hand washing versus compulsive hand washing, looks promising for future work in this area. The collection of real obsessive-compulsive hand washing data would likely make it possible to train models capable of reliably classifying compulsive hand washing. Such models could then be tested and evaluated with real world subjects. If they perform well enough, they could aid psychologists and their patients in the treatment of compulsive hand washing. As explained in the introduction, exposure and response prevention (ERP) is a viable treatment method, and interventions from a smart watch could possibly be used for response prevention. The exact design of the interventions and their actual usability form another exciting problem field and are yet to be researched.
The hand washing detection should also work well on both wrists. Multiple solutions for the differences between the two sides could be tried. One could train two separate models, one for each wrist. The downside of this is that the system would also need to figure out which wrist it is worn on, either automatically or by user input, which adds some uncertainty. Another idea would be to train a single model on balanced data from both wrists, leading to a model that can possibly learn implicitly which wrist the watch is worn on. No matter how we solve this problem, it seems that the watch position on the body must be accounted for in some way, possibly requiring more data or specific position labels for the existing data.
More data could also be incorporated for the negative class, because more diverse activities should be included in the data. While the standard movement activities of walking, jogging, sitting, walking up and down stairs and some fitness activities were already included for this work, more special activities have not yet been included, which possibly led to the increased false positive rate in the real world scenario. Although we already include day-long recordings of long term, i.e. every day, activity data, the data set would quickly become huge if we included more of these. As a result, it is likely necessary to manually record and include every day activities like washing plates or pans, cleaning, brushing teeth and more. It would be even more desirable to have access to a whole database of human activities recorded with body worn sensors, to be used as negative examples in the training of a hand washing detection model.
Another way of generating more data would be data augmentation. This works by copying existing data samples and adding small amounts of random noise on top, forming new samples with the same labels. Data augmentation is an inexpensive way of generating more data to train our neural networks. In our specific case, where recording new hand washing data of any kind takes a lot of effort, additional hand washing data could be generated this way, in order to have more of it available to train the models.
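The augmentation step described above can be sketched as follows. This is a minimal sketch assuming the training windows are stored as a NumPy array of shape `(windows, channels, samples)`; the noise level `noise_std` is a hypothetical choice that would need tuning on the actual sensor value range.

```python
import numpy as np

def augment_with_noise(windows, labels, copies=1, noise_std=0.05, seed=0):
    """Copy existing windows and add small Gaussian noise to the copies,
    reusing the original labels (assumed shapes: windows (n, c, t), labels (n,))."""
    rng = np.random.default_rng(seed)
    augmented = [windows]
    for _ in range(copies):
        augmented.append(windows + rng.normal(0.0, noise_std, size=windows.shape))
    new_windows = np.concatenate(augmented, axis=0)
    new_labels = np.tile(labels, copies + 1)  # labels stay identical per copy
    return new_windows, new_labels
```

The original windows are kept unchanged, so the augmented set always contains the real data as a subset.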
To avoid false positives, one could also try to detect out-of-distribution movements, similar to the HAWAD approach that we discussed. This method must be applied carefully, as we cannot be certain that all out-of-distribution samples are not hand washing. Its applicability needs to be tested thoroughly.
The most important part of the future work in this area, especially for the detection of compulsive hand washing, will be the application to the real world with actual patients suffering from OCD with compulsive hand washing. Only on their authentic data will we be able to properly train models, and only with them will we be able to properly evaluate the developed models, in order to obtain a reliable estimate of our performance.
We theoretically evaluated different designs of neural networks on three related problems of hand washing detection: the separation of hand washing from other activities, the separation of hand washing from compulsive hand washing, and the separation of hand washing from both compulsive hand washing and other activities at the same time. For this task, we used hand washing data, data of simulated compulsive hand washing, and data of other activities collected from publicly available data sets. After training and evaluation, we selected the best functioning system based on several metrics, including the F1 score and the harmonic mean of sensitivity and specificity, which we call the S score. The dominating models, DeepConvLSTM and DeepConvLSTM-A, were both based on a deep convolutional neural network joined with an LSTM layer. For DeepConvLSTM-A, which performed slightly better than DeepConvLSTM, we added an attention mechanism, in order to allow the model to flexibly focus on the more relevant sections of its input. The designed models were able to beat baselines such as a random forest classifier and a support vector machine, as well as chance level baselines, by a large margin.
In a practical evaluation with 5 subjects, we tested DeepConvLSTM-A on the hand washing detection task in a real world, every day environment, as well as in a fixed schedule hand washing test. The system ran on a smart watch, which monitored the user's wrist movements in real time and tried to correctly detect hand washing. The sensitivity in this test was lower than expected ($28.33\,\%$, or $50\,\%$ if the correct wrist was used). Furthermore, around 4 false positives per day appeared for different activities, many of which were washing related. They included, but were not limited to, doing the dishes, brushing one's teeth and scratching oneself. These high amounts of false positives could be ruled out in the future by adding more every day activities to the training data.
In the second test of the practical evaluation, subjects performed intensive and long hand washing repetitions, which were closer to our lab recorded washing data (including the simulated compulsive data) and thus easier to detect. The system's performance here was much closer to the sensitivity from the theoretical evaluation of our models ($76\,\%$ vs. $90\,\%$, and $82.5\,\%$ if the correct wrist was used).
Hence, the evaluation results suggest that the developed system is able to properly detect hand washing in many cases. The theoretical specificity ($75.1\,\%$) and sensitivity ($90\,\%$) of the system are high, but the practical application shows some room for improvement.
In conclusion, the application of wrist worn sensor data to the detection of hand washing and compulsive hand washing remains an interesting and open field of research, with many possible areas of application. Especially the detection of compulsive hand washing in real time would be a world first, and seems promising for future use in the treatment of OCD patients. Because neural network models can run directly on wrist worn smart watches, interventions could be generated in real time, with a latency below 15 seconds.
Inertial measurement units (IMUs) can measure different types of time series movement data, e.g. the acceleration or angular velocity of the device they are embedded in. IMUs are embedded in most modern smart phones and smart watches, which makes them easily available. For hand washing detection, especially the movement of the hands and wrists contains information that can help us classify hand washing. Therefore, we can use a smart watch and its embedded IMU to try to predict whether a user is washing their hands or not. On top of that, if the user is washing their hands, we could try to predict whether they are washing them in an obsessive-compulsive way. Another advantage of using a smart watch is that they usually have built-in vibration motors or even speakers. These could be used to intervene whenever compulsive hand washing is detected, as described above. Therefore, wrist worn sensors, especially those embedded in smart watch systems, are used in this work. The wrist worn devices can also execute machine learning models in real time, using publicly available libraries, e.g. on smart watches running Wear OS.
## Goals
In this work, we want to develop several neural network based machine learning methods for the real time detection of hand washing and compulsive hand washing on inertial sensor data of wrist worn devices. We also want to test the methods and report meaningful statistics for their performance. Further, we want to test parts of the developed methods in a real world scenario. We then want to draw conclusions about the applicability of the developed systems in the real world.
### Detection of hand washing in real time utilizing inertial measurement sensors
We want to show that neural network based classification methods can be applied to the recognition of hand washing. We base our method on sensor data from inertial measurement sensors in smart watches or other wrist worn IMU-equipped devices. We want to detect hand washing in real time and directly on the mobile device, i.e. on a wrist wearable device such as a smart watch. Doing so, we would be able to give instant real time feedback to the user of the device.
On top of the detection of hand washing, the detection of obsessive-compulsive hand washing is part of our goals. We want to be able to separate compulsive hand washing from non compulsive hand washing based on inertial motion data. Especially for the scenario of possible interventions used for the treatment of OCD, this separation is crucial, as OCD patients also wash their hands in non compulsive ways, and we do not want to intervene for these kinds of hand washing procedures.
### Real world evaluation
We want to evaluate the most promising of the developed models in a real world evaluation, in order to obtain a realistic estimate of its applicability to the task of hand washing detection. We want to report results of an evaluation with multiple subjects to obtain a meaningful performance estimate. From this estimate we want to draw conclusions about the applicability of the developed system in real world therapy scenarios. On top of that, we want to derive future improvements that could be applied to the system.
date: "08.11.2021"
reviewer1: "Dr. Phillip M. Scholl"
reviewer2: "Prof. Dr. Thomas Brox"
## optional
#abstract
abstract-de: Die automatische Erkennung von Händewaschen und zwanghaftem Händewaschen hat mehrere Anwendungsbereiche in Arbeits- und medizinischen Umgebungen. Die Erkennung kann zur Überprüfung der Einhaltung von Hygieneregeln eingesetzt werden, da das Händewaschen eine der wichtigsten Komponenten der persönlichen Hygiene ist. Allerdings kann das Waschen auch übertrieben werden, was bedeutet, dass es für die Haut und die allgemeine Gesundheit schädlich sein kann. Manche Patienten mit Zwangsstörungen waschen sich zwanghaft und zu häufig die Hände auf diese schädliche Weise. Die automatische Erkennung von zwanghaftem Händewaschen kann bei der Behandlung dieser Patienten helfen. Ziel dieser Arbeit ist es, auf neuronalen Netzen basierende Methoden zu entwickeln, die in der Lage sind, Händewaschen und zwanghaftes Händewaschen in Echtzeit auf einem am Handgelenk getragenen Gerät zu erkennen, wobei die Daten der Bewegungssensoren des am Handgelenk getragenen Geräts verwendet werden. Die entwickelte Methode erreicht eine hohe Genauigkeit für beide Aufgaben und Teile der Arbeit wurden mit Probanden in einem realen Experiment evaluiert, um die starke theoretische Leistung (F1 score von 89,2 % bzw. 96,6 %) zu bestätigen.
abstract-en: The automatic detection of hand washing and compulsive hand washing has multiple areas of application in work and medical environments. The detection can be used in compliance and hygiene scenarios, as hand washing is one of the main components of personal hygiene. However, the washing can also be overdone, which means it can be harmful for the skin and general health. Patients with obsessive-compulsive disorder sometimes compulsively wash their hands in such a harmful way. In order to help with their treatment, the automatic detection of compulsive hand washing can possibly be applied. This thesis aims to develop neural network based methods which are able to detect hand washing as well as compulsive hand washing in real time on a wrist worn device, using inertial motion sensor data of said device. We achieve high accuracy for both tasks and evaluate parts of the work with subjects in a real world experiment, in order to confirm the strong theoretical performance (F1 scores of 89.2 % and 96.6 %) achieved.
---
Then we explain meaningful methods of evaluating the developed models and methods on both unseen pre-recorded data and with real world subjects.
## Data set
In order to train any machine learning algorithm, we need enough data to correctly train the used model. In our case of wrist motion data, we used acceleration and gyroscope time series data from multiple sources, which are explained below. The inertial data of each sensor is given as $\mathbf{s}_i \in \mathbb{R}^{d_i \times t}$, where $d_i$ is the dimensionality of the sensor (e.g. $d_{accelerometer} = 3$) and $t$ is the number of samples in a time series. We use accelerometer and gyroscope data, which both have 3 dimensions. We would have liked to use more of the available sensors, like the magnetometer included in many modern IMUs, but most external data sets only include accelerometer and gyroscope data. We combine the two sensors we use into one data series of dimensionality $\mathbb{R}^{6 \times t}$. An example of the sensor data used in our experiments is shown in fig. \ref{fig:sensor_data}.
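Combining the two sensor streams can be illustrated as follows; the sampling rate and series length used here are hypothetical placeholders, not the values of our recordings.

```python
import numpy as np

t = 150  # assumed: 3 seconds at a hypothetical 50 Hz sampling rate
acc = np.zeros((3, t))  # placeholder 3-axis accelerometer series, shape (3, t)
gyr = np.zeros((3, t))  # placeholder 3-axis gyroscope series, shape (3, t)

# Stack both sensors along the channel axis into one (6, t) data series.
combined = np.concatenate([acc, gyr], axis=0)
```

The resulting $(6 \times t)$ array is the input representation referred to throughout this chapter.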
\begin{figure}[hp]
\centering
The implementations of SVM and RFC in scikit-learn @pedregosa_scikit-learn_nodate are used. Both are trained with the default parameters of scikit-learn.
To incorporate the "chance level", we use uniform random prediction and majority prediction. The majority prediction can achieve high accuracy on heavily imbalanced data sets, as it always predicts the most frequent class. The uniform random prediction represents the performance level of completely uninformed guessing. We hope to outperform all of these baselines with our methods.
## Neural network based detection of hand washing
As explained in Section \ref{section:har}, neural networks are the state of the art when it comes to human activity recognition. This also applies to hand washing detection, and thus our classification algorithms are all entirely based on neural networks.
\label{fig:learning_curves}
\end{figure}
We can see that the training loss still decreases while the validation loss is already rising again. Therefore we employ early stopping, to be able to select the model parameters which lead to the empirically minimal validation loss. The losses achieved by parameter updates using mini batches do not decrease in every single step. Due to the visible "zig-zagging" of the losses, it makes sense to continue running the training for a fixed number of epochs, even if the validation loss is already rising, because the validation loss could potentially drop below the current minimum again in a future epoch. As training a model can take a lot of time, we need to select a value as a trade-off between continuing the training to catch a potential further decrease of the validation loss, and stopping the training in order not to waste training time. We fixed the number of epochs to keep running to 50, and the maximum number of epochs a training could run to 300. As can be seen partially in fig. \ref{fig:learning_curves}, the training was usually stopped much earlier than at the 300 epoch mark. The stopping positions heavily depend on the model classes and how fast each class can be trained, and ranged from after around 20 epochs to after more than 100 epochs.
In fig. \ref{fig:learning_curves}, the stopping position of the early stopping is marked with a bold \textbf{x}; it marks the epoch with the lowest validation loss. After the training is stopped by the early stopping, we reset the model's parameters to the parameters with the lowest validation loss. As can be seen in the figure, the training loss still decreases after this point, but the validation loss does not.
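The early stopping procedure can be sketched as follows, with the patience of 50 epochs and the cap of 300 epochs used above. The `train_epoch` and `validate` callables are assumed placeholders for one training pass and one validation pass; in a real PyTorch training loop one would save and restore the model's state dict instead of deep-copying the model object.

```python
import copy

def train_with_early_stopping(model, train_epoch, validate,
                              patience=50, max_epochs=300):
    """Keep training until `patience` epochs pass without a new minimal
    validation loss (or `max_epochs` is reached), then return the
    parameters that achieved the lowest validation loss."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)  # stand-in for saving a state dict
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch(model)           # one pass over the training set
        val_loss = validate(model)   # loss on the split-off validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                # no improvement for `patience` epochs
    return best_model, best_loss
```

Because the best parameters are stored whenever a new minimum is reached, the "zig-zagging" of the validation loss cannot cost us the best model seen so far.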
The application programming was done by Alexander Henkel and is not part of this work. We only designed the outline for the deep learning model deployment part of the app, by providing the needed pre-trained models in the appropriate formats, so that they can be executed on mobile devices.
In order to run a pre-trained neural network based model on smart watches, we used TensorFlow Lite (tflite). The models were trained with PyTorch as explained above and then converted to tflite using ONNX and the TensorFlow Lite converter included in TensorFlow. However, the conversion from ONNX to TensorFlow is not supported for all operations needed in a neural network. Thus, compatibility for ONNX Runtime (.ort) models was also added on the smart watch, and we converted our models into this format.
![Flow diagram of the smart watch classification loop. A notification is only sent, if the notification cooldown is 0.](img/wear_data_flow.pdf){#fig:watch_flow width=98%}
\label{sec:results}
This chapter will report the evaluation results from both the theoretical evaluation and the practical evaluation.
## Theoretical Evaluation
For the theoretical evaluation, we report the results separately, split by the tasks 1.-3. described in Section \ref{sec:classification_problems}.
The values for the metrics specificity and sensitivity are reported in the tables, but not discussed separately, because they are included in the more meaningful metrics F1 score and S score. The results generally show that achieving a high value in only one of specificity and sensitivity, at the cost of a low value in the other, results in worse F1 and S scores.
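The S score, as the harmonic mean of sensitivity and specificity, makes this trade-off explicit; a small helper illustrates it (the input values below are illustrative, not results from our evaluation):

```python
def s_score(sensitivity, specificity):
    """Harmonic mean of sensitivity and specificity:
    S = 2 * sens * spec / (sens + spec)."""
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)
```

Unlike the arithmetic mean, the harmonic mean punishes imbalance: `s_score(0.95, 0.40)` is well below the arithmetic mean of the two values, so a model cannot compensate a poor specificity with a high sensitivity.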
### Distinguishing hand washing from all other activities
For the first task of classifying hand washing in contrast to non hand washing activities, we report the results with and without the application of prediction smoothing. The results without smoothing are shown in table \ref{tbl:washing}. In @fig:p1_metrics, the resulting scores for problem 1 with and without smoothing are shown.
\input{tables/washing.tex}
![F1 score and S score for problem 1](img/washing_all.pdf){#fig:p1_metrics width=105%}
\input{tables/washing_rm.tex}
With smoothing of the predictions, performance increases for all neural network model classes. The results with a 20 prediction wide average filter smoothing can be seen in table \ref{tbl:washing_rm} and @fig:p1_metrics. The top performing neural network architectures do not change with the smoothing, but their scores increase. DeepConvLSTM has the best F1 score ($0.892$), followed by LSTM-A ($0.891$), DeepConvLSTM-A ($0.890$) and CNN ($0.888$). These results are higher by about $0.03$ to $0.05$ than those obtained from the raw, unsmoothed predictions. In the S score metric, DeepConvLSTM-A performs best ($0.819$), followed by DeepConvLSTM ($0.814$) and CNN ($0.808$). For the S score, the advantage of the smoothing is generally bigger, between $0.05$ and $0.06$ for all model classes except the LSTM, which only improves by $0.015$. The traditional methods RFC and SVM do not improve with the smoothing; their scores decrease by about $0.04$ in both metrics.
The models running on normalized data also profit from the prediction smoothing; however, they still do not reach the performance of the non-normalized models.
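The average filter smoothing used above can be thought of as a running mean over the per-window predictions followed by thresholding. The following is an illustrative sketch, not the exact implementation used in this work; the filter width of 20 matches the setting reported above, and the threshold of $0.5$ is an assumption:

```python
import numpy as np

def smooth_predictions(probs, width=20, threshold=0.5):
    """Running-mean filter over per-window hand washing scores.

    `width` is the number of consecutive predictions averaged (20 above);
    isolated outlier windows are averaged away before thresholding.
    """
    kernel = np.ones(width) / width
    smoothed = np.convolve(probs, kernel, mode="same")
    return (smoothed >= threshold).astype(int)

# a single spurious Null prediction inside a washing bout is filtered out
raw = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)
print(smooth_predictions(raw, width=5))
```

This illustrates why smoothing removes both isolated false positives and isolated false negatives: a lone outlier cannot move the window mean across the threshold.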
For the special case of the models initially trained on problem 3, which were then binarized and evaluated on problem 1 (without smoothing), we only report selected results in this section; the full results can be found in the appendix. Surprisingly, the models trained on problem 3 reach F1 scores on the test data of problem 1 similar to those of the models trained on problem 1 directly. DeepConvLSTM achieves an F1 score of $0.857$ and DeepConvLSTM-A of $0.847$. The F1 score of DeepConvLSTM is even higher than the best F1 score of the models trained for problem 1, by $0.004$. However, in the S score metric, the models trained for problem 3 only reach up to $0.704$ (CNN) or $0.671$ (DeepConvLSTM-A), which is $0.052$ lower than the best performing model trained for problem 1.
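Binarizing the problem 3 models for evaluation on problem 1 amounts to merging the two washing classes into a single positive class. A minimal sketch, assuming hypothetical class indices 0 = Null, 1 = HW, 2 = HW-C (the actual encoding may differ):

```python
import numpy as np

# hypothetical class indices: 0 = Null, 1 = HW, 2 = HW-C
def binarize_problem3(pred_3class):
    """Map three-class predictions onto problem 1: both washing classes
    (HW and HW-C) count as hand washing, everything else as Null."""
    return np.isin(pred_3class, [1, 2]).astype(int)

print(binarize_problem3(np.array([0, 1, 2, 0])))  # [0 1 1 0]
```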
The results without smoothing of predictions for the second task, distinguishing
\input{tables/only_conv_hw.tex}
Like for problem 1, applying normalization to the input data worsens the performance of almost all classifiers. The loss in the F1 score ranges from $0.024$ (LSTM) to $0.11$ (CNN). For the FC network, normalization leads to a slight performance increase of $0.01$. The S score decrease under normalization lies between $0.128$ (DeepConvLSTM-A) and $0.27$ (CNN). As with the F1 scores, the FC network benefits from normalization, here by a difference in \mbox{S score} of $0.035$. SVM and RFC also do not perform better with normalization.
The results for task 2 with the application of smoothing are shown in table \ref{tbl:only_conv_hw_rm} and @fig:p2_metrics. Similarly to problem 1, smoothing helps to further increase the performance of all classifiers. All neural network based methods reach F1 scores of over $0.95$. The best F1 score is achieved with DeepConvLSTM-A ($0.966$), the second best with LSTM ($0.965$). The differences remain small for this problem, as DeepConvLSTM ($0.963$) and LSTM-A ($0.961$) also achieve very similar scores. There is a small gap, after which the RFC ($0.922$) and SVM ($0.914$) follow. The traditional methods do not profit as much from the smoothing as the neural network based methods.
The three class problem of classifying hand washing, compulsive hand washing and
![Confusion matrices for all neural network based classifiers with and without normalization of the sensor data](img/confusion.pdf){#fig:confusion width=98%}
The confusion matrices of the non-normalized models in the right column do not directly allow us to single out one "best" model, but we can see that the diagonal values of all the LSTM-based models tend to be higher than those of FC and CNN. The "pure" LSTM model performs best on the compulsive hand washing class (HW-C, $0.88$) and close to best on the hand washing class (HW, $0.78$), being only narrowly beaten by DeepConvLSTM ($0.79$). However, LSTM only reaches an accuracy of $0.33$ on the Null class. The best performing model on the Null class is the CNN ($0.64$), which in turn only reaches $0.51$ on HW and $0.72$ on HW-C. While DeepConvLSTM-A never reaches the highest value in any single class, its overall performance in the confusion matrix is good. It reaches higher values on the Null class than the other LSTM-based models ($0.53$ vs $0.47$ (LSTM-A), $0.46$ (DeepConvLSTM), $0.33$ (LSTM)). At the same time, its performance on the HW and HW-C classes is similar to that of DeepConvLSTM, albeit slightly lower ($0.78$ vs $0.79$ on HW and $0.82$ vs $0.85$ on HW-C).
As for problems 1 and 2, we find that normalization seems to decrease the performance of all the neural network based classifiers. For this problem, the FC network's performance also decreases when normalized input data is used.
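The per-class accuracies discussed above correspond to the diagonal of a row-normalized confusion matrix, where each row is divided by the number of windows of that true class. A minimal sketch with illustrative data (not the actual results):

```python
import numpy as np

def normalized_confusion(y_true, y_pred, n_classes=3):
    """Row-normalized confusion matrix: entry [i, j] is the fraction of
    windows of true class i predicted as class j, so the diagonal holds
    the per-class accuracies."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True)

# toy labels with classes 0 = Null, 1 = HW, 2 = HW-C
cm = normalized_confusion([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(cm.diagonal())  # per-class accuracies: 0.5, 1.0, 0.5
```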
The mean diagonal value of the confusion matrix upholds almost the same ordering
## Practical Evaluation
### Scenario 1: One day of evaluation
In the first scenario, the 5 subjects reported an average of $4.75$ hand washing procedures on the day on which they evaluated the system.
Per subject, there were $4.75$ ($\pm\,3.3$) hand washing procedures, of which $1.75$ ($\pm\,2.06$) were correctly identified. The accuracy per subject was $28.33\,\%$ ($\pm\,37.9\,\%$). The highest accuracy for a subject was $80\,\%$ out of 5 hand washes, the lowest was $0\,\%$ out of 4 hand washes. Of all hand washing procedures conducted by the subjects over the day, $35.8\,\%$ were detected correctly.
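The per-subject mean and standard deviation reported here can be computed as follows; the counts in the sketch are hypothetical and only illustrate the calculation, they are not the study data:

```python
import statistics

# hypothetical per-subject (washes, detected) counts, for illustration only
counts = [(5, 4), (4, 0), (6, 2), (4, 1)]

rates = [100 * detected / washes for washes, detected in counts]
mean_rate = statistics.mean(rates)
std_rate = statistics.stdev(rates)  # sample standard deviation
print(f"{mean_rate:.2f}% ± {std_rate:.2f}%")
```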
Some subjects wore the smart watch on the right wrist instead of the left wrist and reported worse results in these cases. Leaving out hand washes conducted with the smart watch worn on the right wrist, the detection sensitivity rises to $50\,\%$.
The correlation of duration of the hand washing with the detection rate is $-0.0
In addition, the correlation of the washing intensity with the detection rate is $0.267$.
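The values above are correlation coefficients between a per-wash property and the detection outcome. A sketch of such a Pearson correlation with hypothetical data (the variable names and numbers are illustrative only):

```python
import numpy as np

# hypothetical per-wash durations (seconds) and detection outcomes (0/1)
duration = np.array([20, 35, 15, 40, 25], dtype=float)
detected = np.array([0, 1, 1, 1, 0], dtype=float)

# Pearson correlation coefficient between duration and detection
r = np.corrcoef(duration, detected)[0, 1]
print(round(r, 3))
```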
The subjects' experiences with false positives varied. The subjects reported $4$ ($\pm\,5.19$) false hand washing detections on this day; the minimum was 0 and the maximum 13 false positives.
The activities leading to false positives include:
The full list of reported activities for which false positives occurred can be f
Some subjects also reported difficulties with the smart watch application (not part of this work), which sometimes led to the model not being run at all and might also have influenced the results. It is possible that for some hand washing procedures the smart watch application was not running, leading the user to note down a false negative and further decreasing the reported sensitivity.
### Scenario 2: Controlled intensive hand washing
In scenario 2, the subjects each washed their hands at least 3 times. Some subjects voluntarily agreed to perform more repetitions, which leads to more than 3 washing detection results per subject. The detection accuracy per subject was $76\,\%$ ($\pm\,25\,\%$), with the highest being $100\,\%$ and the lowest being $50\,\%$.
The mean accuracy over all repetitions, not split by subjects, was $73.7\,\%$. In scenario 2, one user moved the smart watch from the right wrist to the left wrist after two repetitions: the first two repetitions (right wrist) were not detected, while the repetitions with the smart watch worn on the left wrist were detected correctly. Leaving out hand washes conducted with the smart watch worn on the right wrist, the detection sensitivity rises to $78.6\,\%$, and the detection accuracy per subject is $82.5\,\%$ ($\pm\,23.6\,\%$).
\begin{table}
\centering
\caption{Problem 2: metrics of the different classes without smoothing}
\label{tbl:only_conv_hw}
\begin{tabular}{|l|l|c|c|c|c|}
\toprule
......
\begin{table}
\centering
\caption{Problem 1: metrics of the different classes without smoothing}
\label{tbl:washing}
\begin{tabular}{|l|l|c|c|c|c|}
\toprule
......
%------------------------------------- Includes / usepackage -------------------------
\usepackage{caption}
\usepackage{subfig}
\usepackage{float}
\usepackage[section]{placeins}
% prevent widows and orphans
\clubpenalties=4 10000 10000 10000 0
\widowpenalties=4 10000 10000 10000 0
\displaywidowpenalty=9999
%------------------------------------- Custom Title Page ---------------------------
\renewcommand{\maketitle}{
......