Helping Computers to Interpret Human Emotions


Personalised machine-learning models capture subtle variations in facial expressions to better gauge how we feel

Researchers at MIT have developed a machine-learning model that takes computers a step closer to interpreting our emotions as naturally as humans do.

The model outperforms traditional systems in capturing small variations in facial expressions. Moreover, with a little extra training data, it can be adapted to an entirely new group of people with the same efficacy. The aim is to improve existing affective-computing technologies.

Personalised experts

Traditional affective-computing models use a “one-size-fits-all” concept, in which a single model is trained on one set of images depicting various facial expressions. Here, instead, the researchers combined a technique called “Mixture of Experts” (MoE) with model-personalisation techniques.

In an MoE, a number of neural-network models, called experts, are each trained to specialise in a separate processing task and produce one output. The researchers also incorporated a gating network, which calculates the probability that each expert will best detect the moods of unseen subjects.
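The gating idea can be illustrated with a minimal sketch in Python. It assumes soft gating, in which a softmax turns the gate's scores into probabilities and the final (valence, arousal) estimate is the probability-weighted blend of the experts' outputs. The constant experts and linear gate below are hypothetical stand-ins for the ResNet-based networks the researchers trained, and the names `softmax`, `moe_predict`, `toy_experts` and `toy_gate` are illustrative only.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: an expert maps a frame's feature vector to a
# (valence, arousal) pair; the gate maps it to one raw score per expert.
Expert = Callable[[Sequence[float]], Tuple[float, float]]
Gate = Callable[[Sequence[float]], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw gate scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_predict(
    features: Sequence[float], experts: List[Expert], gate: Gate
) -> Tuple[float, float, List[float]]:
    """Blend every expert's (valence, arousal) output by its gate probability."""
    weights = softmax(gate(features))
    outputs = [expert(features) for expert in experts]
    valence = sum(w * v for w, (v, _) in zip(weights, outputs))
    arousal = sum(w * a for w, (_, a) in zip(weights, outputs))
    return valence, arousal, weights


# Toy stand-ins for trained networks (hypothetical, for illustration only).
toy_experts: List[Expert] = [
    lambda x: (0.8, 0.2),   # tends to read frames as pleasant and calm
    lambda x: (-0.4, 0.6),  # tends to read frames as unpleasant and excited
]
toy_gate: Gate = lambda x: [x[0], x[1]]  # one raw score per expert

valence, arousal, weights = moe_predict([2.0, 0.0], toy_experts, toy_gate)
print(f"weights={weights}, valence={valence:.3f}, arousal={arousal:.3f}")
```

With these toy inputs the gate favours the first expert, which receives about 0.88 of the weight, so the blended valence leans towards that expert's pleasant reading. A hard-gating variant would instead use only the highest-probability expert.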

For their model, the researchers personalised the MoE by matching each expert to one of 18 individual video recordings. They trained the model on nine subjects and evaluated it on the other nine, with all videos broken down into individual frames.

Each expert, and the gating network, tracks the facial expressions of each individual with the help of a residual network (ResNet). In doing so, the model scores each frame for its level of valence (pleasant or unpleasant) and arousal (excitement). Separately, six human experts labelled each frame for valence and arousal on a scale from -1 (low) to 1 (high), and the model also used these labels for training.
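The article does not say how the six annotators' labels were combined into a training target. A simple, commonly used option is to average them per frame and train the model to minimise the squared error against that average; the sketch below assumes exactly that, and `frame_target` and `frame_loss` are hypothetical names.

```python
from statistics import fmean
from typing import List, Tuple

Label = Tuple[float, float]  # (valence, arousal), each in [-1, 1]


def frame_target(annotations: List[Label]) -> Label:
    """Combine several annotators' labels for one frame into a training target.

    Assumption: a plain per-dimension mean; the article does not say how
    the six annotators' labels were merged.
    """
    for v, a in annotations:
        if not (-1.0 <= v <= 1.0 and -1.0 <= a <= 1.0):
            raise ValueError("every label must lie in [-1, 1]")
    return fmean(v for v, _ in annotations), fmean(a for _, a in annotations)


def frame_loss(prediction: Label, target: Label) -> float:
    """Squared error summed over valence and arousal (an assumed loss)."""
    return (prediction[0] - target[0]) ** 2 + (prediction[1] - target[1]) ** 2


# Six hypothetical annotations of a single frame.
six_labels = [(0.5, 0.1), (0.4, 0.0), (0.6, 0.2), (0.5, 0.1), (0.3, 0.0), (0.7, 0.2)]
target = frame_target(six_labels)
loss = frame_loss((0.45, 0.15), target)  # a hypothetical model score for the frame
print(target, loss)
```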

They then personalised the model further, feeding the trained model data from some frames of the remaining subjects' videos and then testing it on all unseen frames from those videos.
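The evaluation protocol can be sketched as two splits: subjects first (nine for training, nine held out), then, within each held-out subject's video, a small adaptation set and a test set of frames the model never saw. The subject IDs and the `personalisation_split` helper are hypothetical, and sampling adaptation frames at random is an assumption, since the article only says “some frames” were used.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical IDs for the 18 recordings: nine for training, nine held out.
subjects = [f"subject_{i:02d}" for i in range(1, 19)]
train_subjects, heldout_subjects = subjects[:9], subjects[9:]


def personalisation_split(
    frames: Sequence[T], fraction: float = 0.05, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Split one held-out subject's frames into adaptation and test sets.

    Assumption: adaptation frames are sampled at random.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    n_adapt = max(1, round(len(frames) * fraction))
    order = list(range(len(frames)))
    random.Random(seed).shuffle(order)  # reproducible shuffle
    adapt_idx = sorted(order[:n_adapt])  # keep temporal order within each set
    test_idx = sorted(order[n_adapt:])
    return [frames[i] for i in adapt_idx], [frames[i] for i in test_idx]


# Usage: 1,000 frames from one held-out subject, 10 per cent used for adaptation.
frames = list(range(1000))  # stand-ins for decoded video frames
adapt, test = personalisation_split(frames, fraction=0.10)
print(len(train_subjects), len(heldout_subjects), len(adapt), len(test))
```

Because neighbouring video frames look very similar, taking a contiguous block of frames for adaptation would be a stricter alternative to random sampling.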

Results showed that with just 5 to 10 per cent of data from the new population, the model outperformed traditional models by a large margin, meaning that its valence and arousal scores for unseen images were much closer to the interpretations of the human experts.
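The article does not name the metric used to judge how close the model's scores were to the human labels. The concordance correlation coefficient (CCC) is a common choice for continuous valence and arousal estimation, and the sketch below assumes it; the label values are made up for illustration.

```python
from statistics import fmean
from typing import Sequence


def concordance_ccc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient between two score sequences.

    1.0 means perfect agreement; unlike Pearson correlation, it also
    penalises predictions that are shifted or scaled away from the gold labels.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two equal-length sequences of at least 2 scores")
    mp, mg = fmean(pred), fmean(gold)
    var_p = fmean((p - mp) ** 2 for p in pred)
    var_g = fmean((g - mg) ** 2 for g in gold)
    cov = fmean((p - mp) * (g - mg) for p, g in zip(pred, gold))
    denom = var_p + var_g + (mp - mg) ** 2
    if denom == 0.0:  # both sequences constant and identical
        return 1.0
    return 2.0 * cov / denom


# Hypothetical per-frame valence labels from human annotators.
gold = [-0.6, -0.2, 0.2, 0.6]
perfect = concordance_ccc(gold, gold)
biased = concordance_ccc([g + 0.2 for g in gold], gold)  # same shape, shifted up
print(perfect, biased)
```

A prediction that is simply shifted up by 0.2 keeps a Pearson correlation of 1, but its CCC falls to about 0.91, because CCC also penalises systematic offsets from the human labels.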

Better machine-human interactions

Another goal is to help computers and robots automatically learn from small amounts of changing data, so that they detect how we feel more naturally and better serve human needs.

For example, the model could run in the background of a computer or mobile device, tracking a user’s video-based conversations and learning subtle changes in facial expression across different contexts.

It is expected to be helpful for educational purposes and in monitoring depression or dementia, since people’s facial expressions tend to change subtly as a result of those conditions. Additionally, this version could help robots better interpret the moods of children with autism.

It seems intuitive that the emotional signs one person gives are not the same as those another person gives; it therefore makes sense that emotion recognition works better when it is personalised.

The personalisation method also reflects another intriguing point: it is more effective to train multiple experts and aggregate their judgments than to rely on a single model.