Predicting physical activity based on smartphone sensor data using CNN + LSTM

David Smolders
Published in Good Audience · 4 min read · Mar 25, 2018

Today we'll look at how smartphones, smartwatches and the like can predict what kind of activity you're doing based on sensor data, and we'll try to reproduce this process. The possibilities range from sports and health applications to games like Pokémon Go, to name a few.

Pokémon Go uses the different sensors of your smartphone to check what you’re doing

Most modern smartphones have an accelerometer and a gyroscope. An accelerometer measures linear acceleration (changes in velocity), whereas a gyroscope measures orientation and rotational (angular) velocity.

For this task we use a dataset from UCI. The underlying experiments were carried out with a group of 30 volunteers. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer (body & total) and gyroscope, the researchers captured 3-axial linear acceleration and 3-axial angular velocity. Each sample is assigned to one of the six classes, and our goal is to predict the correct label for each sample given the nine sensor time series.
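As a sketch of how those nine time series might be loaded and stacked into one array, assuming the standard layout of the UCI HAR archive (the file paths and the `stack_channels` helper are our own, not from the original code):

```python
import numpy as np

# The nine raw signal files in the UCI HAR "Inertial Signals" folder
# (names follow the standard archive layout).
SIGNALS = [
    "body_acc_x", "body_acc_y", "body_acc_z",
    "body_gyro_x", "body_gyro_y", "body_gyro_z",
    "total_acc_x", "total_acc_y", "total_acc_z",
]

def stack_channels(channel_arrays):
    """Stack per-channel (samples, timesteps) arrays into (samples, timesteps, channels)."""
    return np.stack(channel_arrays, axis=-1)

def load_signals(split, root="UCI HAR Dataset"):
    """Load the nine channels of one split ('train' or 'test') as (samples, 128, 9)."""
    arrays = [np.loadtxt(f"{root}/{split}/Inertial Signals/{name}_{split}.txt")
              for name in SIGNALS]
    return stack_channels(arrays)
```

Each row in these files is one fixed-length window of 128 timesteps, so the stacked array has shape (n_samples, 128, 9).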

Below we plotted the x, y and z values from the three sensors for two samples (walking upstairs vs. sitting):

We split the data into a train, test and validation set based on unique users, so that there is no overlap in users between the different sets, just as in a real-life application where your product has to make predictions for new, unseen users.

For our problem we'll use a technique called convolution, which filters our signal by sliding a 'kernel' over the input and multiplying element-wise. To help you visualize what we want to achieve, we'll reproduce this process on a sample of our data, to give you an example of what a 'filtered' signal looks like with a [1/3, 1/3, 1/3] kernel:

Input vs. output (smoother!) after convolution with a [1/3, 1/3, 1/3] kernel
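The smoothing above is just a moving average and can be reproduced in a few lines with NumPy (the toy signal here is our own, not from the dataset):

```python
import numpy as np

signal = np.array([0.0, 1.0, 0.0, 2.0, 0.0, 1.0])  # toy 1-D sensor trace
kernel = np.array([1/3, 1/3, 1/3])                  # the smoothing kernel from the text

# 'same' keeps the output the same length as the input (zero-padded at the edges)
smoothed = np.convolve(signal, kernel, mode="same")
# smoothed → [1/3, 1/3, 1.0, 2/3, 1.0, 1/3]: each point is the mean of
# itself and its two neighbours, so the spikes are flattened out.
```

In the network below the kernels are not fixed like this one; their weights are learned, so each filter ends up extracting whatever feature of the signal helps classification.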

We then feed this signal into an LSTM, a model architecture that can store information over time in order to find temporal correlations. It does this via a cell that can hold on to (and forget) previous values. Our per-stream models (up to nine, one per sensor channel) are then combined via a Merge layer, which lets us combine all their features to extract meaningful information.

For some of our models we split the input data into several streams: one way is to split it per sensor (3 streams), another is to split it per sensor and per channel (9 streams). For our architecture we use several convolutional layers followed by an LSTM. The convolutional layers take the whole sequence as input; we also tried convolutions on different time windows via a TimeDistributed layer, without much success.
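Given a stacked (samples, timesteps, 9) array, the two splitting schemes amount to slicing along the channel axis (the `split_streams` helper is our own illustration):

```python
import numpy as np

def split_streams(X, groups):
    """Slice a (samples, timesteps, channels) array into per-stream inputs.

    `groups` is a list of channel-index lists: three sensors of three
    channels each, or nine single-channel streams.
    """
    return [X[:, :, idx] for idx in groups]

per_sensor = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # 3 streams (one per sensor)
per_channel = [[i] for i in range(9)]            # 9 streams (one per channel)
```

Each resulting stream then gets its own convolutional + LSTM stack, as in the code below.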

Architecture of the DeepConvLSTM framework as proposed by Ordóñez & Roggen

For our hyperparameters and the number of layers we looked at a paper by Ordóñez & Roggen (reference 4), which tackles a similar problem with the following model: C(64)−C(64)−C(64)−C(64)−LSTM(128)−LSTM(128)−Sm. After some fine-tuning we ended up with the following architecture (code in Keras):

# Note: Merge is the legacy Keras 1.x layer used in the original snippet;
# in Keras 2 you would build the same model with the functional API and
# a Concatenate layer instead.
from keras.models import Sequential
from keras.layers import (Conv1D, MaxPooling1D, BatchNormalization, Dropout,
                          LSTM, Dense, Activation, Merge)

epochs = 100
batch_size = 32        # not defined in the original snippet; a typical value
n_classes = 6          # the six activity classes
kernel_size = 3        # a kernel_size of 1 worked surprisingly well
pool_size = 2
dropout_rate = 0.15
f_act = 'relu'

def build_stream(input_shape):
    """One convolutional + LSTM stack per input stream (identical for all streams)."""
    m = Sequential()
    m.add(Conv1D(512, kernel_size, input_shape=input_shape,
                 activation=f_act, padding='same'))
    m.add(BatchNormalization())
    m.add(MaxPooling1D(pool_size=pool_size))
    m.add(Dropout(dropout_rate))
    m.add(Conv1D(64, kernel_size, activation=f_act, padding='same'))
    m.add(BatchNormalization())
    m.add(MaxPooling1D(pool_size=pool_size))
    m.add(Dropout(dropout_rate))
    m.add(Conv1D(32, kernel_size, activation=f_act, padding='same'))
    m.add(BatchNormalization())
    m.add(MaxPooling1D(pool_size=pool_size))
    m.add(LSTM(128, return_sequences=True))
    m.add(LSTM(128, return_sequences=True))
    m.add(LSTM(128))
    m.add(Dropout(dropout_rate))
    return m

first_model = build_stream((X_trainS1.shape[1], X_trainS1.shape[2]))
second_model = build_stream((X_trainS2.shape[1], X_trainS2.shape[2]))
third_model = build_stream((X_trainS3.shape[1], X_trainS3.shape[2]))

model = Sequential()
model.add(Merge([first_model, second_model, third_model], mode='concat'))
model.add(Dropout(0.4))
model.add(Dense(n_classes))
model.add(BatchNormalization())
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
history = model.fit([X_trainS1, X_trainS2, X_trainS3], Y_train,
                    batch_size=batch_size,
                    validation_data=([X_valS1, X_valS2, X_valS3], Y_val),
                    epochs=epochs)

The model that doesn't split the input data but otherwise has a similar structure scores around 94.5% test accuracy. The model above, which splits the input into 3 streams (one per sensor), scores around 95.5%. Last but not least, the model that splits the input into 9 streams (one per sensor channel) reaches a test accuracy of almost 97%. That beats, by quite a margin, pretty much every model we've found built on the same data, even though we're not using all the training data. Running the same model (no extra tuning) on all the training instances achieves a test accuracy of 98%.

For those who want to go even further, combining different models (including other algorithms, such as XGBoost) into an ensemble is one way to boost accuracy even more, provided the models' errors aren't correlated.
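The simplest such ensemble is soft voting, i.e. averaging the class probabilities the models predict (the `soft_vote` helper below is a minimal sketch of our own):

```python
import numpy as np

def soft_vote(prob_list):
    """Average (samples, classes) probability arrays from several models
    and return the predicted class index per sample."""
    return np.mean(prob_list, axis=0).argmax(axis=1)
```

This only helps when the models disagree in different places: if they all make the same mistakes, averaging reproduces those mistakes rather than cancelling them.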

References:

  1. http://aqibsaeed.github.io/2016-11-04-human-activity-recognition-cnn/
  2. https://medium.com/@curiousily/human-activity-recognition-using-lstms-on-android-tensorflow-for-hackers-part-vi-492da5adef64
  3. https://www.quora.com/In-Keras-what-are-merge-layers-used-for
  4. http://www.mdpi.com/1424-8220/16/1/115/htm
  5. http://ieeexplore.ieee.org/abstract/document/7026300/?section=abstract
