
# Ultra-wideband data as input of a combined EfficientNet and LSTM architecture for human activity recognition

#### Abstract

The world population has been aging in recent years, and this trend is expected to continue. The number of persons requiring assistance in their everyday life is also expected to rise. Fortunately, smart homes are becoming an increasingly compelling alternative to direct human supervision. Smart homes are equipped with sensors that, coupled with Artificial Intelligence (AI), can support their occupants whenever needed. At the heart of the problem is the recognition of activities. Human activity recognition is a complex problem due to the variety of sensors available, their impact on privacy, the high number of possible activities, and the several ways even a simple activity can be performed. This paper proposes a deep learning model that combines an LSTM with a tuned version of the EfficientNet model using transfer learning, data fusion and minimalist pre-processing, trained for both activity and movement recognition on data from three ultra-wideband (UWB) radars. For activity recognition, experiments were conducted in a real, furnished apartment where 15 different activities were performed by 10 participants. Results showed an improvement of 18.63% over previous work on the same dataset, reaching 65.59% Top-1 accuracy under Leave-One-Subject-Out (LOSO) cross-validation. The movement recognition experiments were conducted under the same conditions: a single participant was asked to perform four distinct arm movements with the three UWB radars positioned at two different heights. With an overall Top-1 accuracy of 73%, a detailed analysis of the results showed that the proposed model accurately recognized both large and fine-grained movements. However, the medium-sized movements significantly degraded recognition performance, owing to an insufficient degree of variation between the four proposed movements.

## 1.Introduction

The world population is getting older year after year: based on a report published by the United Nations Department of Economic and Social Affairs [29], people over the age of 65 could represent up to 25% of the North American population by 2050. Worldwide, the number of people over 80 could triple over the same period.

Many problems occur with the normal course of aging, namely the risk of developing cognitive impairments such as Alzheimer’s and Parkinson’s diseases, or physical conditions such as osteoarthritis and a general loss of muscle strength, endurance and flexibility. These problems make it increasingly difficult to maintain healthy daily habits and behaviors. Activities performed to take care of one’s living environment and for self-care are called Activities of Daily Living (ADLs); examples include Cooking, Cleaning, Brushing teeth and Taking medication. The ability of a person to perform these activities is directly linked to their ability to live unassisted. A person who has difficulty performing some ADLs can receive help at home or move to an adapted facility such as a nursing home. Leaving one’s home, particularly at an old age, can be a source of great distress [22]; it is therefore beneficial for a person to remain in their home as long as possible. Help can take many forms (e.g., having a caregiver at home). While a caregiver can provide essential services, much of this task often falls to relatives. Providing care for an aging relative with an underlying medical condition such as Alzheimer’s disease is quite demanding and can become an emotional and financial burden [2].

Fortunately, home automation can also bring some form of help. Home automation refers to building homes that incorporate sensors in order to improve their occupants’ experience. A minimal example is the inclusion of consumer smart devices in a home, such as a smart assistant (e.g., Alexa), smart thermostats, smart switches and vacuum robots. While it cannot replace the care given by another person, certain specific tasks can be automated. For instance, in the last few years, much research has focused on detecting falls [15,24,25], a critical event to monitor for older people. Although fall detection is important, more complex behaviors performed daily, such as ADLs, must be detected and monitored to support a person with cognitive and/or physical decline. The first step towards building applications that can support a person throughout the day is to detect the activity being carried out by that person.

##### Fig. 2.

A single frame recorded by a XeThru X4M200 when placed in the smart environment [19].

### 3.1.Activity recognition

First, we focused on collecting data related to ADLs only. To achieve this, ten participants, all healthy adults under the age of 40, were asked to perform a total of 15 activities with no specific indication as to how long each activity should take or how it should be performed. A high variation between different completions of the same activity is therefore expected, yielding a more realistic dataset. All activities and their respective durations are shown in Table 1.

##### Table 1

Activities recorded during the collection of the first dataset and their respective duration

| Activity | Minimum duration (s) | Maximum duration (s) | Average duration (s) |
| --- | --- | --- | --- |
| Drinking | 15 | 28 | 26.6 |
| Sleeping | 38 | 58 | 55.2 |
| Putting on Jacket | 21 | 29 | 26.5 |
| Cleaning | 92 | 118 | 115 |
| Cooking | 231 | 299 | 291.3 |
| Making Tea | 129 | 178 | 169 |
| Doing the Dishes | 98 | 118 | 114.9 |
| Brushing teeth | 120 | 179 | 171.8 |
| Washing hands | 21 | 29 | 26.8 |
| Reading | 97 | 118 | 114.2 |
| Eating | 76 | 118 | 113.4 |
| Walking | 22 | 29 | 26.9 |
| Putting on Shoes | 27 | 43 | 40.7 |
| Taking Medication | 13 | 28 | 25.7 |
| Using Computer | 93 | 118 | 114.5 |

### 3.2.Movement recognition

In a second phase, since activities may be seen as sets of actions involving several movements, we considered recording an additional dataset containing only movements. To this end, we found it interesting to evaluate the ability of the model architecture introduced in this work to recognize relatively fine-grained movements. For that purpose, ten strategic locations were identified in the apartment where a single participant, selected from the 10 available, was asked to perform several arm movements in a static position, first with the three UWB radars placed at 36 cm from the floor. The same procedure, involving the same participant, was then repeated with the three UWB radars placed at 96 cm from the floor. The ten strategic locations are represented by blue circles in Fig. 1. At each location, a total of four arm movements were performed two times (i.e., once for each height of the UWB radars) over a period of 30 s. A detailed description of each movement is provided in Table 2. Since the bathroom is the only room in the apartment with a door, the recording of the four movements was completed twice for locations 9 and 10: once with the door open and once with it closed.

##### Table 2

Details of the various movements recorded during the collection of the second dataset

| Movement | Height of the radars | Description |
| --- | --- | --- |
| M0 | 36 cm & 96 cm | No movement: arms and forearms are placed along the body and the feet remain still. |
| M1 | 36 cm & 96 cm | Fine movements: arms are along the body, the forearms are placed in front of the person and the feet remain still. The movements are constrained to the wrists and hands (including fingers). |
| M2 | 36 cm & 96 cm | Medium-sized movements: arms are along the body, the forearms are placed in front of the person and the feet remain still. The movements are constrained to the forearms, wrists and hands. |
| M3 | 36 cm & 96 cm | Large movements: the feet remain still. The movements are constrained to the arms, forearms, wrists and hands. |

## 4.Proposed system

### 4.1.Data pre-processing

In image recognition tasks, the input images used are usually RGB images with a shape varying from 150 by 150 pixels to 300 by 300 pixels. For this work, 15 s of consecutive UWB radar recordings were considered sufficient for the task of activity recognition. Since the sampling rate of the UWB radars was set to 50 frames per second, the unprocessed input shape is 750 samples by 184 bins as shown in Fig. 3.

##### Fig. 3.

Scattering matrix of all frames recorded by a single UWB radar over a period of 15 s [19].

The following method was used to pre-process the raw data into input images that have a shape similar to the one generally used for image recognition:

• 1. 15 s from each UWB radar are combined to form a single array of shape 750×184×3.

• 2. To reduce the size of the array, max pooling is applied along the time axis for each UWB radar’s data, reducing the time axis from 750 to 150 and resulting in an array of shape 150×184×3.

• 3. In order to subtract the background (the environment is fully furnished) from the data, the first frame is subtracted from all other frames in the array.

• 4. The array is normalized, based on the minimum value and the maximum value present in the whole array.
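The four steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors’ exact code: the pooling window of 5 frames is inferred from the 750-to-150 reduction, and the function name is ours.

```python
import numpy as np

def preprocess(radar_frames):
    """Pre-process raw UWB frames following the four steps above.

    radar_frames: list of three arrays, one per radar, each of shape
    (750, 184) -- 15 s at 50 frames/s by 184 range bins.
    Returns an array of shape (150, 184, 3) scaled to [0, 1].
    """
    # Step 1: stack the three radars as channels -> (750, 184, 3).
    x = np.stack(radar_frames, axis=-1).astype(np.float64)

    # Step 2: max-pool along the time axis with a window of 5 frames
    # (750 / 5 = 150 time steps).
    x = x.reshape(150, 5, 184, 3).max(axis=1)

    # Step 3: subtract the first frame to remove the static background.
    x = x - x[0]

    # Step 4: min-max normalize over the whole array.
    return (x - x.min()) / (x.max() - x.min())
```

The result can be saved or displayed directly as an RGB image, as done in Fig. 4.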

The resulting array, displayed as an RGB image where each UWB radar represents a color channel, is shown in Fig. 4. Two activities are shown in this figure: Walking and Cleaning. For the Walking activity, it is interesting to see the person moving closer to a specific UWB radar and then farther away. In the Cleaning activity, the person can be seen going back and forth near one of the UWB radars, and some furniture was probably moved in front of another.

##### Fig. 4.

Pre-processed UWB radar data taken from activity (a) Walking and (b) Cleaning.

For the activity recognition dataset only, the position of the participants was computed using trilateration [1] along with the Range-Domain data provided by the UWB radars. This algorithm provides five data fields: the position of the person in the environment (two fields) and the estimated distance of the person relative to each UWB radar (three fields, one per radar). The position is computed once every second over the same 15 s of data, so the output shape of the position data is 15×5. While it is an interesting feature, the computed position of the participant is not precise due to the noise present in the data. The position is problematic for some activities such as Sleeping, where movement from other humans on the other side of the walls is picked up and causes phantom positions. Certain locations in the apartment also cause problems, such as the bathroom at the extremity of the apartment, where the participant’s position is estimated elsewhere in the apartment. The computed location of the participant can nonetheless provide useful insight into the nature of the activity. For the movement dataset, however, the same pre-processing method was applied without including the positions: since the person remains static, this information provides no insight into the current movement. At best no improvement should be expected, and at worst the noise in the position could interfere with the recognition of movements.
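As an illustration of the trilateration step, the sketch below estimates a 2-D position from three radar ranges by linearizing the circle equations and solving them in a least-squares sense, which tolerates noisy ranges. This is a generic textbook formulation under our own assumptions, not necessarily the exact algorithm of [1]; the function name and any coordinates are illustrative.

```python
import numpy as np

def trilaterate(anchors, distances):
    """Estimate a 2-D position from three radar positions and ranges.

    anchors: (3, 2) array of radar (x, y) positions.
    distances: (3,) array of measured ranges.
    """
    anchors = np.asarray(anchors, dtype=float)
    d = np.asarray(distances, dtype=float)
    # Subtract the first circle equation from the others to linearize:
    # 2 (a_i - a_0) . p = d_0^2 - d_i^2 + |a_i|^2 - |a_0|^2
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (d[0] ** 2 - d[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1)
         - np.sum(anchors[0] ** 2))
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos
```

With noisy ranges the least-squares solution still returns a point, which is consistent with the phantom positions described above: the algorithm cannot tell a spurious reflection from the participant.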

As shown in Table 1, some activities naturally take longer to perform than others. This difference is particularly noticeable between activities such as Drinking and Cooking. Hence, in order to avoid an unbalanced dataset, a fixed number of 100 samples was extracted from each recorded activity. Since the 100 samples are evenly distributed, the entire activity is guaranteed to be covered each time. However, since no two recorded activities have an identical duration, the overlap between two samples may vary from one activity to another.
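The evenly distributed sampling described above can be sketched as follows (an assumption on how the 100 start indices are chosen; the helper name is ours):

```python
import numpy as np

def window_starts(n_frames, window=750, n_samples=100):
    """Start indices of `n_samples` evenly spaced windows of `window`
    frames (15 s at 50 fps) covering a recording of `n_frames` frames."""
    if n_frames < window:
        raise ValueError("recording shorter than one window")
    return np.linspace(0, n_frames - window, n_samples).astype(int)
```

For a 291 s Cooking recording (about 14,550 frames) the stride between consecutive windows is large, whereas for a 26 s Drinking recording (about 1,330 frames) consecutive 750-frame windows overlap heavily, matching the varying overlap noted above.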

### 4.2.Model architecture

The proposed model is composed of two different parts. The first part, which is a CNN, exploits the pre-processed dataset. The other part, which is an LSTM, exploits the computed location of the participant in the dataset.

The first phase of the development was hence to identify which CNN model performed best on the dataset. Keras [18] contains implementations of many widely used deep learning models. For this test, VGG16 [17], VGG19 [17], Xception [7] and EfficientNet B0 through B6 [27] were evaluated on a subset of the dataset. Since autoencoders have shown great promise in other works [24], a CNN autoencoder was built and trained for comparison. An autoencoder is trained to encode and decode its input, the principle being that the main features of the input data end up encoded at the center of the model. To turn it into a classifier, the decoder part is dropped, replaced by one or more dense layers, and the model is trained again using transfer learning. The architecture of the resulting classifier is shown in Table 3. The preliminary results for identifying the best model are shown in Table 4.

##### Table 3

The detailed structure of the layers that compose the classifier built from the trained autoencoder model

| Layer ID | Layer name | Output shape |
| --- | --- | --- |
| 1 | Input | 150, 184, 3 |
| 2 | Zero_Padding | 152, 184, 3 |
| 3 | Convolution_1 | 152, 184, 32 |
| 4 | Max_Pooling_1 | 76, 92, 32 |
| 5 | Convolution_2 | 76, 92, 64 |
| 6 | Max_Pooling_2 | 38, 46, 64 |
| 7 | Convolution_3 | 38, 46, 128 |
| 8 | Flatten | 223744 |
| 9 | Dense_1 | 200 |
| 10 | Dropout_1 | 200 |
| 11 | Dense_2 | 50 |
| 12 | Dropout_2 | 50 |
| 13 | Output | 15 |
##### Table 4

Performances of various built-in models in Keras on a subset of the first dataset containing activities only

| Built-in model | Accuracy |
| --- | --- |
| VGG16 | 0.08 |
| VGG19 | 0.08 |
| Xception | 0.55 |
| EfficientNetB0 | 0.63 |
| EfficientNetB4 | 0.61 |
| EfficientNetB5 | 0.62 |
| EfficientNetB6 | 0.60 |
| CNN autoencoder | 0.55 |

Based on the preliminary results, EfficientNetB0 was identified as the most promising model. EfficientNet [27] proposes a principled way of scaling up a CNN: its compound scaling method jointly scales the depth, width and resolution of the model. EfficientNetB0 refers to the baseline model proposed by Tan and Le [27], which optimizes accuracy and Floating Point Operations per Second (FLOPS). The choice of EfficientNetB0 makes sense, since larger models are better suited to massive datasets and could overfit smaller ones. This means that while EfficientNetB0 is selected for this work, future growth of the dataset could lead larger scaled models to perform better.

In a CNN, kernels typically have a square shape and a depth corresponding to the number of channels. The reason behind this is versatility: the shapes of patterns are unknown and, in image recognition, the object to recognize can have any orientation in the input image. In our dataset, however, we have insight into the size and orientation of the data. Since we are looking for temporal patterns (the evolution of position in the Doppler-Domain data provided by the UWB radars), various kernel shapes were tested at various layers of the EfficientNetB0 model. Overall, changing the shape of the kernels provided a small improvement over the non-tuned EfficientNetB0. For the remainder of this paper, the EfficientNetB0 model with modified kernel shapes is called Tuned EfficientNetB0. It is important to mention that varying the kernel shapes was time-consuming and the approach used was non-exhaustive: while an improvement is present, a better combination of kernel shapes could still be found through a dedicated hyperparameter optimization phase.
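Since the exact kernel shapes of the Tuned EfficientNetB0 are not listed here, the following NumPy sketch only illustrates the underlying idea: a rectangular kernel spanning more time steps (rows) than range bins (columns) biases the convolution toward temporal patterns. Both example kernels and the plain cross-correlation helper are our own illustrative constructs, not the model’s actual layers.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Plain 'valid' 2-D cross-correlation for a single channel."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

# A conventional square 3x3 averaging kernel versus a rectangular 7x1
# kernel that spans more time steps (rows) than range bins (columns).
square = np.ones((3, 3)) / 9.0
temporal = np.ones((7, 1)) / 7.0
```

The rectangular kernel aggregates 7 consecutive time steps of a single range bin, so its response is driven by how the echo at that range evolves over time rather than by spatial neighborhoods.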

##### Table 5

The detailed structure of the layers that compose the architecture of the LSTM model

| Layer ID | Layer name | Output shape |
| --- | --- | --- |
| 1 | Input | 15, 5 |
| 2 | LSTM_1 | 15, 50 |
| 3 | LSTM_2 | 50 |
| 4 | Dense_1 | 100 |
| 5 | Dropout_1 | 100 |
| 6 | Dense_2 | 100 |
| 7 | Dropout_2 | 100 |
| 8 | Dense_3 | 50 |
| 9 | Dropout_3 | 50 |
| 10 | Output | 15 |

For the second half of the model, an LSTM was chosen, since LSTMs have shown good performance in recognizing temporal patterns. On its own, the LSTM model combined with the computed position of the participant performs poorly; the aim here is not state-of-the-art performance, but to provide complementary information to the final model. The architecture of the LSTM is shown in Table 5. A second version of the position dataset was also tested, in which the minimum, maximum, average and standard deviation of each field were appended (increasing the shape of the position input to 19×5). Extracting this supplementary information did not significantly increase the average accuracy of the model (from 44.07% to 46.15% Top-1 average accuracy) and produced less stable models (from 7.8% to 9.1% standard deviation of the Top-1 accuracy). Due to this underwhelming performance increase, it was decided to keep the position dataset with no supplementary computation.
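The augmented 19×5 variant of the position input can be produced as follows, a sketch assuming the four statistics are appended as extra rows along the time axis (the function name is ours):

```python
import numpy as np

def augment_position(pos):
    """Append per-field min, max, mean and std to a (15, 5) position
    window, giving the (19, 5) variant evaluated above."""
    stats = np.stack([pos.min(axis=0), pos.max(axis=0),
                      pos.mean(axis=0), pos.std(axis=0)])
    return np.concatenate([pos, stats], axis=0)
```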

Both models are combined, flattened, and fed to a dense layer for classification. The final architecture of the model is displayed in Fig. 5.

##### Fig. 5.

Graphical representation of the architecture of the complete proposed model.

## 5.Results & discussion

For activity recognition, the model was trained in three separate phases. In the first, the pre-processed dataset was used to train the Tuned EfficientNetB0 model. In the second, the LSTM model was trained using only the location dataset. Finally, transfer learning was applied from the two previous phases to the complete Tuned EfficientNet with LSTM model: the weights of all layers preceding the classification layers (i.e., the dense layers at the end of the model) are copied to the new model. In this last phase, training begins by freezing the non-classification layers for half the training period, which allows the dense layers to be updated without modifying the previously trained models. In the second half of the training, the whole model is unfrozen for fine-tuning. The main reason for such multi-phase training is to avoid the overfitting that may result from such a complex architecture. For movement recognition, the same methodology was applied, except for the training phase of the LSTM model, since the positions were not needed for that purpose.

### 5.1.Human activity recognition

In the context of human activity recognition (HAR), both the LSTM and Tuned EfficientNet models were trained over 25 epochs with a batch size of 16. These trainings were also run empirically over more epochs (up to 100), but we observed that the best results were obtained within the first 20 epochs. This finding was expected, since EfficientNet is known to be prone to overfitting. As a result, the final model was trained over 40 epochs: 20 with the non-classification layers frozen and 20 with them unfrozen. Every model training was completed using an Nvidia GeForce GTX 1080, and took 20 min for the LSTM model alone, 95 min for the Tuned EfficientNetB0 model alone and 65 min for the entire model. Hence, a single fold of LOSO cross-validation took around 3 hours to train, and a total of 30 hours was required to train all LOSO folds.

Since there are few samples, it was determined that the best way to show the ability of the model to generalize was to test its performance with LOSO cross-validation. With a 70-15-15 training, testing and validation split, very similar data would have been present in all three sets: two instances of an activity performed by the same person with slightly different starting times contain roughly the same information. Such a split would only have measured the ability of the model to overfit the data, not how well it generalizes.
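LOSO cross-validation amounts to generating one fold per participant, training on the other nine and testing on the held-out one. A minimal sketch (the helper name is ours):

```python
def loso_folds(participants):
    """Yield (train_ids, test_id) pairs for Leave-One-Subject-Out
    cross-validation over a list of participant identifiers."""
    for test_id in participants:
        train_ids = [p for p in participants if p != test_id]
        yield train_ids, test_id
```

With ten participants this yields the ten folds used below, ensuring that no data from the test subject ever appears in the training set.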

Since ten participants were involved in the creation of the dataset, it was split into ten different subsets to perform LOSO cross-validation, and each model was trained on each fold. The previous work [19] revealed that some activities were hard for the different models to identify, and that confusion existed between similar activities such as Cooking and Making Tea. For this reason, the Top-5 results returned by the models were used to compare their performance, as such an evaluation makes it possible to assess whether independent activities with a high degree of similarity could be combined. The average Top-5 accuracy (over all LOSO folds) of each model is shown in Fig. 6a.

##### Fig. 6.

Overview of the results obtained for activity recognition when compared to the LSTM, CNN-LSTM, EfficientNetB0 and Tuned EfficientNetB0 models, where (a) is the Top-5 average accuracy; (b) is the Top-5 maximum accuracy; (c) is the Top-5 minimum accuracy and (d) is the Top-5 standard deviation of the accuracy.

The proposed model performed better than the baseline model, increasing the average Top-1 accuracy from 46.96% to 65.59%, an 18.63% increase. While the gap between the proposed model and the baseline narrows further along the Top-N, the proposed model still showed an advantageous increase of 5.54% over the baseline model. The proposed model also kept performing better on maximum accuracy, as shown in Fig. 6b, and minimum accuracy, as shown in Fig. 6c. The most notable increase occurs in the Top-1 minimum accuracy, which increased by 21.5%, showing that the proposed model generally performs much better on each of the ten LOSO folds. Finally, the standard deviation of the accuracy, shown in Fig. 6d, is consistently lower than that of the other models, indicating that the proposed model is much more stable.

Despite the general increase in accuracy provided by the proposed model, the results show that certain activities remain hard to detect. The Top-1 confusion matrix of the proposed model on the first LOSO fold (i.e., participant 1 used as the testing set) is shown in Fig. 7a, and that of the baseline model on the same fold in Fig. 7b. It is important to note that while not all LOSO fold results are identical, they all show the same limitations (i.e., some activities, no matter what improvements are made to the system, remain difficult to recognize). The confusion matrices show that tremendous improvement was made in recognizing the activities Putting on Jacket, Walking and Using Computer, which were not recognized by the baseline model. However, some activities are still not recognized, namely Drinking, Washing Hands and Taking Medication. These activities have some points in common: they are shorter than most of the others, and by definition they contain fewer “macro” movements (such as leg movements) and can be performed at more than one location in the apartment.

##### Fig. 7.

Confusion matrices of the Top-1 of both the proposed model (a) and the baseline model (b) on the dataset with the first person left out.

### 5.2.Movement recognition

The movement data were processed following the same protocol as for activity recognition described previously. The dataset was, however, split into 12 Leave-One-Out (LOO) cross-validation sets, one set per location (L9 and L10 are counted twice according to the state of the door, either open or closed). Each of the two learning processes based on the Tuned EfficientNetB0 model was then performed five times. The overall accuracy of each technique for movement recognition is provided in Table 6.

##### Table 6

Overall accuracy of the Tuned EfficientNetB0 model with and without transfer learning when computed over the movement dataset using a LOSO cross-validation strategy

| Tuned EfficientNetB0 model | Top-1 | Top-2 | Top-3 |
| --- | --- | --- | --- |
| Without transfer learning | 0.56 | 0.80 | 0.91 |
| With transfer learning | 0.73 | 0.89 | 0.96 |

These results show that the proposed method is capable of recognizing movements fairly accurately. In addition, based on the confusion matrices given in Table 7, the proposed Tuned EfficientNetB0 model appears able to extract the key features of the movements from the UWB data, with more accurate results obtained when using the transfer learning strategy. Nevertheless, given the small differences between these four movements, we believe such a model may offer better results in a context closer to the movements performed in real activities. Moreover, when this experiment was reproduced with the UWB sensors positioned closer to the arm movements (at a height of 96 cm from the ground), we observed improvements of 2.1% and 4.4% in Top-1 recognition accuracy for the strategies with and without transfer learning respectively, supporting the hypothesis that these four movements are too similar.

##### Table 7

Confusion matrices expressing the percentage of recognition obtained using the Tuned EfficientNetB0 model with (a) and without (b) transfer learning for each movement described in Table 2, using a LOSO cross-validation strategy

Finally, to carry out an in-depth study of the proposed model, a last evaluation was performed to identify the locations in the apartment where the movements are most easily recognized. Table 8 presents detailed results of the movement recognition for the two learning strategies at each location in the apartment, as illustrated in Fig. 1. Once again, the transfer learning strategy demonstrated significantly better accuracy. However, movements performed at locations covered by the field of operation of only a single UWB radar are recognized less precisely (e.g., L1, L2, L4, L8). Lastly, as expected with UWB radar data, the state of the door as an obstacle for the movements performed in the bathroom (i.e., L9 and L10) does not change the overall accuracy, since the recognition rates for these locations remain similar.

##### Table 8

Results obtained using the Tuned EfficientNetB0 model with (a) and without (b) transfer learning for each location identified in Fig. 1 with either an open (O) or closed (C) door for locations inside the bathroom (i.e. L9 and L10) using a LOSO cross-validation strategy

| (a) Without transfer learning: Location | Accuracy | (b) With transfer learning: Location | Accuracy |
| --- | --- | --- | --- |
| L5 | 0.67 | L7 | 0.90 |
| L6 | 0.65 | L9O | 0.86 |
| L7 | 0.62 | L6 | 0.84 |
| L3 | 0.60 | L10C | 0.83 |
| L9C | 0.58 | L9C | 0.79 |
| L9O | 0.58 | L3 | 0.76 |
| L10C | 0.58 | L5 | 0.70 |
| L10O | 0.52 | L1 | 0.68 |
| L1 | 0.51 | L10O | 0.66 |
| L4 | 0.50 | L2 | 0.64 |
| L2 | 0.47 | L4 | 0.64 |
| L8 | 0.45 | L8 | 0.46 |

## 6.Limitations

While improvements were undoubtedly made over the previous work, several more are still required to develop practical applications capable of recognizing ADLs efficiently. First, the recorded datasets proposed in this work remain quite limited. The nature of the activities and the way they were captured introduced a high variability between each participant’s recordings of the same activity, which is generally not optimal for small datasets such as ours, though profitable for larger ones. In addition, the nature of ADLs leads to combinations of movements that are inherent to the location of both strategic appliances and furniture. A true generalization of algorithms designed for ADL recognition, however, requires datasets composed of activities recorded in a wide variety of environments. Therefore, future work will be to create new datasets that include more people, more instances per activity and more diverse environments.

Moreover, the work we proposed was primarily focused on the research and adaptation of ready-to-use deep learning models for ADL recognition. However, we have explored only a fraction of all the deep learning architectures that are potentially suitable for the ADL recognition problem. Thus, based on the findings of this work, future work may also involve building a truly optimized model for the recognition of ADLs using data recorded from UWB radars.

## 7.Conclusion

In this paper, we proposed a deep learning model combining EfficientNetB0 and LSTM neural networks using transfer learning and minimalist data pre-processing in order to recognize ADLs from data generated by UWB radars inside a smart apartment. Our proposed model completed this task with 65.59% Top-1 accuracy, surpassing by 18.63% the performance of our previously developed model [19]. Furthermore, a detailed analysis of the behavior of the proposed architecture for movement recognition was also presented. The results demonstrated that most of the four proposed arm movements were accurately identified. However, due to the low variation of the medium-sized movements compared to the fine and large movements, the overall performance of the proposed model was significantly degraded by these movements.

## Acknowledgement

The authors would like to thank the Ministère de l’Économie et de l’Innovation from the government of the province of Québec, Canada for the grant that made this project possible. The authors would also like to thank Moonshot Health for their contribution to this project.

## Conflict of interest

The authors have no conflict of interest to report.