Chuangxin Cai, School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China, cai_chuangxin@163.com
Xianxuan Lin, Nanjing University of Information Science and Technology, Nanjing, China, cike0cop@gmail.com
Jing Zhang, School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China, 334662407@qq.com
Aditi Bhattarai, Nanjing University of Information Science and Technology, Nanjing, China, aditibhattarai02@gmail.com
Chunting Cai, Education International Cooperation Group Shanghai Office, Shanghai, China, celestecai_517@163.com
Xianliang Xia, School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing, China, canyoufly66@gmail.com
Zhigeng Pan, Nanjing University of Information Science and Technology, Nanjing, China, zgpan@nuist.edu.cn
DOI: https://doi.org/10.1145/3703619.3706033
VRCAI '24: The 19th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry, Nanjing, China, December 2024
Emotion recognition in conversation (ERC) is anchored in the burgeoning field of artificial intelligence, aiming to equip machines with the ability to discern and respond to human emotions in nuanced ways. However, recent studies have primarily focused on textual modalities, often neglecting the significant potential of non-verbal cues found in audio and video, which are critical for accurately capturing emotions. Furthermore, when researchers integrate these non-verbal cues into multimodal emotion recognition systems, they encounter challenges related to data heterogeneity across different modalities. This paper introduces the Facial Perception and Knowledge Distillation Network (FP-KDNet) to address these challenges. Specifically, a novel Facial Perceptual Attention (FPA) module was designed to capture non-verbal cues from videos, significantly enhancing the model's ability to process visual information. Additionally, a knowledge distillation (KD) strategy was proposed to improve emotion representation in the non-verbal modality by leveraging data from the text modality, facilitating effective cross-modal information exchange. A multi-head attention mechanism further optimizes the integration of features across modalities, dynamically adjusting attention allocation to enhance conversational emotion recognition. The experimental results demonstrated that FP-KDNet achieves excellent performance on the MELD and IEMOCAP datasets, and ablation studies confirm the effectiveness of the multimodal fusion approach.
Keywords: emotion recognition in conversation, multimodal fusion, knowledge distillation, multi-head attention mechanism
ACM Reference Format:
Chuangxin Cai, Xianxuan Lin, Jing Zhang, Aditi Bhattarai, Chunting Cai, Xianliang Xia, and Zhigeng Pan. 2024. FP-KDNet: Facial Perception and Knowledge Distillation Network for Emotion Recognition in Conversation. In The 19th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry (VRCAI '24), December 01--02, 2024, Nanjing, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3703619.3706033
1 Introduction
In recent years, with the development of AI, the demand for systems capable of understanding and responding to user emotions has continuously increased [Kumar etal. 2023]. Emotion recognition has become a key component in human-computer interaction, enhancing user experience and making interactions more natural and effective [Feinberg etal. 1986; Larradet etal. 2020; Sun etal. 2021]. Multimodal emotion recognition identifies and understands user emotions by analyzing their voice, facial expressions, and text content [Khare etal. 2024]. This technology is widely applied in various fields, including education, healthcare, and commercial services [Ezzameli and Mahersia 2023; Poushneh etal. 2024].

In the field of affective computing, ERC presents a highly challenging task, especially in multi-person dialogues [Pan etal. 2023]. In such settings, participants’ emotional states not only change dynamically throughout the conversation, but each individual's emotional expressions also vary personally [Barrett etal. 2019]. The dynamic nature and individual differences in these emotional states make accurately capturing and recognizing emotions more complex [Zhang etal. 2024]. Many existing methods for conversational emotion recognition primarily rely on textual information [Ghafoor etal. 2023; Singh etal. 2023]. However, these approaches often overlook the nuanced emotional cues embedded in vocal tone and prosody [Kim and Hong 2024]. Although text-based models possess powerful contextual understanding capabilities, they lack critical information inherent in nonverbal cues [Liu etal. 2023a].
Existing research has integrated the video modality in conversational emotion recognition, thus enhancing performance. However, since video content often involves multiple speakers and complex background environments, feature extraction poses challenges [Leong etal. 2023]. To reduce the impact of environmental noise on emotion recognition, deep learning methods have demonstrated strong capabilities in extracting complex features from videos, accurately identifying subtle emotional changes [Abdullah etal. 2021; Hazmoune and Bougamouza 2024]. However, despite significant progress in multimodal methods for emotion recognition, most have not considered individual differences in emotional expression. Ignoring the varied impacts of different modal data on model performance might lead to homogenization among modalities, thus weakening the overall effectiveness of the model [Chowdary etal. 2023; Lei and Cao 2023].
This article aims to address challenges in the multimodal fusion domain by proposing an innovative network that incorporates facial perception and knowledge distillation techniques. During the feature extraction process, we introduced a facial perception attention (FPA) module to enhance the network's focus on facial regions closely related to emotional expression. This enables the model to focus on key facial areas that reflect emotional states, thereby improving the accuracy of emotion recognition. In addition, to further enhance network performance, we implemented a KD strategy. In this strategy, a well-trained text-based teacher network transfers knowledge to a student network, which learns from video and audio modalities. The text model acquires rich emotional semantic information, crucial for understanding emotional states, and then applies this knowledge to guide the learning process of the student network. Additionally, to effectively integrate features from different modalities, we devised a multimodal fusion layer based on multi-head attention. This fusion layer addresses the heterogeneity and complementarity between different modalities.
The method presented in this paper was evaluated on two widely used benchmark datasets, MELD and IEMOCAP, and compared with existing multimodal emotion recognition methods. FP-KDNet achieved a weighted average F1 score of 67.15% on the MELD dataset. The results indicate that the proposed multimodal fusion network, which combines facial perception and knowledge distillation, significantly improves performance in ERC tasks. Ablation studies further confirm the effectiveness of these mechanisms.
Our contributions can be summarized as follows.
- We propose a novel multimodal network that integrates facial perception and enhances the recognition of emotionally significant facial regions through a dedicated facial attention mechanism, allowing the model to prioritize key features that convey emotional states.
- We introduce a knowledge distillation strategy that allows a pre-trained text-based teacher network to transfer nuanced emotional semantics to a student network trained on video and audio content, enhancing the student network's learning efficiency.
- We design a multi-head attention-based fusion layer to optimize feature integration across modalities, effectively addressing heterogeneity and complementarity between them and ensuring comprehensive emotional representation.
2 Related Works
2.1 Multimodal Emotion Recognition
In the domain of multimodal emotion recognition, research is persistently advancing and yielding significant breakthroughs. This interdisciplinary field integrates various modalities such as facial expressions, vocal cues, and visual information to provide a more comprehensive understanding of human emotions. Early studies concentrated predominantly on unimodal emotion recognition, using text or audio alone to identify emotional states. However, the precision of these methods was compromised due to the omission of visual cues such as facial expressions and body language [AlMaruf etal. 2024].
Technological advances have led researchers to understand that integrating visual, auditory, and textual modalities substantially increases the precision and reliability of emotion recognition [Alharbi 2024; Geetha et al. 2024]. Bilotti et al. [Bilotti et al. 2024] examined the effects of facial features, optical flow derived from facial images, and Mel spectrogram metrics extracted from videos on emotion recognition, probing various combinations of these features. Foteinopoulou and Patras [Foteinopoulou and Patras 2024] addressed the limitations of facial expression recognition by introducing an innovative visual language model that utilizes textual descriptions, expressions, and emotional cues from subtitles as natural language supervision, cultivating nuanced latent representations and thereby enhancing zero-shot classification capabilities. Furthermore, Zhang et al. [Zhang et al. 2023] presented a multi-visual information-centric approach to discerning counterfeit emotions, incorporating spatial and spectral facial data from videos, physiological signals from image RGB values, and ocular data. Tiwari et al. [Tiwari et al. 2023] developed the novel shift difference accumulation linear discriminant analysis (SDA-LDA) algorithm to isolate highly distinctive, dynamic, and robust features from audio and video streams, using a Support Vector Machine (SVM) classifier for the categorization of emotions. More recently, attention has shifted to intricate modality fusion methods. Khan et al. [Khan et al. 2024] proposed an attention-based fusion technique that surpasses rudimentary concatenation or weighted fusion by autonomously identifying the modal features with the highest contributing value to emotion recognition.
2.2 Knowledge Distillation
Knowledge distillation was originally proposed for model compression: transferring the knowledge of a large, complex model (teacher model) to a smaller, simpler model (student model) to optimize the student model's performance while reducing model complexity and computational resource requirements [Buciluǎ et al. 2006]. Hinton et al. [Hinton 2015] accomplished knowledge transfer from the teacher model to the student model by reducing the discrepancy between the logits produced by the teacher model and those generated by the student model. In this context, the main characteristic of KD lies in its ability to compress the network without being affected by structural differences between the teacher and student networks [Buciluǎ et al. 2006; Gutstein et al. 2008]. This means that even if the two networks have different structures, the KD method can still effectively transfer the knowledge from the teacher network to the student network [Gou et al. 2021].
Recent studies have shown that this technique can effectively transfer knowledge from complex multimodal models to simpler, more efficient models, thereby enhancing their performance on emotion recognition tasks [Sun et al. 2024]. As research progressed, scholars discovered that transferring knowledge across different modalities significantly improves the performance of multimodal models [Liu et al. 2023b]. Ma et al. [Ma et al. 2023] applied KD to multimodal learning, achieving effective knowledge transfer between modalities through specialized distillation strategies. Subsequently, Fan et al. [Fan et al. 2024] proposed an improved heterogeneous network structure that allows KD across different levels, including the feature, decision, and semantic layers. Beyond structural innovations, there have also been algorithmic breakthroughs. Bano et al. [Bano et al. 2024] introduced an emotion recognition system based on federated learning and KD, which extracts emotional annotations from sensor data and transfers them from the sensor domain to the visual domain through KD. Praveen and Alam [Praveen and Alam 2024] further optimized the selection and fusion of information by integrating KD and attention mechanisms, improving the accuracy of emotion recognition in complex contexts. Currently, KD methods still face several challenges, such as handling modality heterogeneity, minimizing information loss during knowledge transfer, and developing strategies to manage imbalanced modality data.

3 Methodology
3.1 Task Definition
In conversational analysis, the structure of a conversation is crucial for understanding how communication unfolds and how meaning is constructed between participants. This structure concerns not just the sequence of turns or the words used, but also the patterns of interaction, the rules that govern the exchange, and the social context in which the conversation takes place. Let P represent the set of participants, U the set of utterances, and Y the set of emotion labels. Each conversation is represented as [(p1, u1, y1), (p2, u2, y2), …, (pN, uN, yN)], where pi ∈ P denotes a participant and each conversation contains N utterances. Participants pm and pn are identified as the same speaker when m = n. Each utterance ui is assigned an emotion label yi ∈ Y, where Y is the predefined set of emotion categories. Each utterance ui comprises a textual transcript, a speech segment, and a video clip, structured as ui = {ti, ai, vi} to represent the text, audio, and video modalities, respectively. The goal of the ERC task is to determine yi, the emotion associated with each utterance ui.
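To make this notation concrete, the following is a minimal sketch of how a conversation could be stored; the field names are purely illustrative and not taken from any released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    speaker_id: str   # p_i in P
    text: str         # t_i, the transcript
    audio_path: str   # a_i, the speech segment
    video_path: str   # v_i, the video clip
    label: str        # y_i in the emotion label set Y

# A conversation is an ordered list of N utterances; p_m and p_n refer to the
# same speaker exactly when their speaker ids are equal.
Conversation = List[Utterance]
```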
3.2 Framework Overview
We propose a facial perception and knowledge distillation network for the ERC task, as illustrated in Figure 1. FP-KDNet operates under the assumption that the emotional representations of the different modalities vary in strength, and that enhancing the emotional representation of weaker modalities can improve the overall performance of the ERC system.
In particular, the audio feature extractor is responsible for extracting key features from the audio signals in conversations. For the text feature extractor, we propose a prompt-based emotional representation approach using SimCSE [Gao et al. 2021] to extract strong emotion-related features from the text modality and employ it as the teacher model. The visual feature extractor includes an innovative facial perception attention module that focuses on extracting the most relevant emotional expressions from facial regions in videos. Subsequently, we apply knowledge distillation, allowing the student models (the audio and visual models) to learn structured information from the teacher model while minimizing the impact of modal heterogeneity on the overall performance of the ERC model. Following feature extraction, FP-KDNet utilizes a multi-head attention mechanism to effectively integrate text, visual, and audio features, preserving complementary information across modalities and highlighting, through attention weights, the features most useful for the final emotion prediction.
3.2.1 Feature extraction. Figure 1 clearly illustrates how the encoder for each modality processes the respective inputs to extract emotional features. This section will detail the methods used to extract emotional features from the input signals.
Text. In the field of multimodal emotion recognition, using pre-trained language models to capture contextual information and extract the emotional states of speakers has become an important research direction. Although BERT-based models [Vaswani 2017] are widely used for building text encoders, they may lack the ability to fully understand nuanced semantic differences and complex situational backgrounds, especially when identifying subtle distinctions in emotional states and abstract concepts. The SimCSE model enhances the distinctiveness of sentence embeddings through contrastive learning and reduces the reliance on annotated data, offering a superior solution for tasks requiring deep semantic understanding, such as emotion analysis. Therefore, this article adopts SimCSE as the text encoder and proposes a prompt-based emotional representation method, allowing detailed learning of emotional representations from the text. By integrating advanced language models, the proposed method further optimizes the capture and representation of emotional context, aiming to improve the accuracy and robustness of multimodal emotion recognition.
For utterance ui, ti denotes its text modality. The k utterances preceding it are denoted as the context $C_i^k$. To capture the emotional connection between the speaker and the utterance, we use a prompt-based emotional expression, such as "In this talk: $C_i^k$ and $t_i$ make $p_i$ feel <mask>", aimed at explicitly representing the emotional state of speaker pi.
\begin{equation} C_i^k=\left[ t_{i-k+1}, t_{i-k+2}, \dots, t_i \right] \end{equation}
(1)
\begin{equation} R_i^k = \text{In this talk: } C_i^k \text{ and } t_i \text{ make } p_i \text{ feel <mask>} \end{equation}
(2)
\begin{equation} X_i = SimCSE(R_i^k) \end{equation}
(3)
\begin{equation} F_t^i = TextFeatureExtractor(X_i) \end{equation}
(4)
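As a concrete illustration of Equations (1)-(4), the sketch below builds the prompt and encodes it with a publicly available SimCSE checkpoint. The checkpoint name and the choice of pooling the hidden state at the <mask> position are assumptions; the paper specifies only that SimCSE serves as the text encoder.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the paper does not name the exact SimCSE weights used.
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-roberta-base")
encoder = AutoModel.from_pretrained("princeton-nlp/sup-simcse-roberta-base")

def text_features(context: list, utterance: str, speaker: str) -> torch.Tensor:
    # Eq. (1): C_i^k is the k preceding utterances, collected by the caller.
    c_ik = " ".join(context)
    # Eq. (2): prompt with a mask token standing in for the speaker's emotion.
    prompt = f"In this talk: {c_ik} and {utterance} make {speaker} feel {tokenizer.mask_token}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state          # Eq. (3): X_i
    # Eq. (4): pool the hidden state at the <mask> position as the feature F_t^i.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    return hidden[0, mask_pos].squeeze(0)                      # 768-d vector
```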
Audio. The use of transformer-based self-supervised learning methods for extracting audio and video features offers significant advantages and potential. Firstly, the Transformer architecture, especially its self-attention mechanism, effectively captures long-distance dependencies in audio signals, which is crucial for understanding complex patterns in the audio. Secondly, self-supervised learning methods do not require a large amount of annotated data, making them particularly useful in scenarios where data is scarce or annotation costs are high. Finally, implementing an audio feature encoder based on the Transformer facilitates cross-modal learning. This paper utilizes data2vec [Baevski etal. 2020] to extract the audio modality features ai corresponding to the utterance ui, ensuring that the most expressive emotion-rich features are extracted from the diverse audio streams.
\begin{equation} F_a^i = AudioFeatureExtractor(a_i) \end{equation}
(5)
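The audio pipeline of Equation (5) could be sketched as follows; the specific data2vec checkpoint and the mean pooling over time frames are assumptions not stated in the paper.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Assumed checkpoint; any data2vec audio encoder with 768-d hidden states would fit.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base")
audio_encoder = AutoModel.from_pretrained("facebook/data2vec-audio-base")

def audio_features(waveform: torch.Tensor, sample_rate: int = 16_000) -> torch.Tensor:
    # waveform: 1-D float tensor holding the speech segment a_i, resampled to 16 kHz.
    inputs = feature_extractor(waveform.numpy(), sampling_rate=sample_rate,
                               return_tensors="pt")
    with torch.no_grad():
        hidden = audio_encoder(**inputs).last_hidden_state    # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)                       # F_a^i, a 768-d vector
```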
Video. Given an utterance ui and the corresponding video segment, we extract l frames at equal time intervals to form the video modality representation vi = {c1, c2, …, cl}, where l denotes the number of extracted frames. Subsequently, we use BlazeFace [Bazarevsky et al. 2019] to locate the facial regions within the video sequence vi, followed by cropping and alignment to obtain a refined facial sequence. These sequences are then input into TimeSformer [Bertasius et al. 2021] to extract dynamic facial expression features over time, enabling the model to interpret subtle temporal variations in facial expressions for more accurate emotion recognition. Finally, an expression attention mechanism extracts emotion-related features from the facial sequences, dynamically allocating weights to emphasize the features most critical to emotional expression. Formally, this can be described as:
\begin{align} v_i &= \lbrace c_1, c_2, \dots, c_l\rbrace \end{align}
(6)
\begin{align} z_i &= Align(Crop(BlazeFace(v_i))) \end{align}
(7)
\begin{equation} F_v^i = Attn(Timesformer(z_i)) \end{equation}
(8)
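A hedged sketch of Equations (6)-(8): MediaPipe's BlazeFace-based detector stands in for the BlazeFace step, the HuggingFace TimeSformer provides the spatio-temporal encoder, and a simple learned attention pooling plays the role of the expression attention. The checkpoint name, 8-frame sampling, the omission of explicit alignment, and the pooling design are all assumptions.

```python
import cv2
import mediapipe as mp
import numpy as np
import torch
import torch.nn as nn
from transformers import TimesformerModel

detector = mp.solutions.face_detection.FaceDetection(model_selection=0,
                                                     min_detection_confidence=0.5)
# Assumed checkpoint; it expects 8 sampled frames (l = 8) at 224x224 resolution.
timesformer = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

def crop_face(frame_bgr: np.ndarray) -> np.ndarray:
    """Detect a face and return a 224x224 crop (Eq. 7, without explicit alignment)."""
    result = detector.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.detections:
        return cv2.resize(frame_bgr, (224, 224))               # fall back to the full frame
    box = result.detections[0].location_data.relative_bounding_box
    h, w, _ = frame_bgr.shape
    x0, y0 = max(int(box.xmin * w), 0), max(int(box.ymin * h), 0)
    x1 = min(int((box.xmin + box.width) * w), w)
    y1 = min(int((box.ymin + box.height) * h), h)
    return cv2.resize(frame_bgr[y0:y1, x0:x1], (224, 224))

class FacialPerceptionAttention(nn.Module):
    """Toy attention pooling over TimeSformer tokens, standing in for the FPA module."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:    # (1, L, dim)
        weights = torch.softmax(self.score(tokens), dim=1)      # per-token attention weights
        return (weights * tokens).sum(dim=1).squeeze(0)         # F_v^i, a 768-d vector

def video_features(frames_bgr: list, fpa: FacialPerceptionAttention) -> torch.Tensor:
    faces = np.stack([crop_face(f) for f in frames_bgr])        # (l, 224, 224, 3)
    pixel_values = torch.tensor(faces).permute(0, 3, 1, 2).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        tokens = timesformer(pixel_values=pixel_values).last_hidden_state  # Eq. (8) input
    return fpa(tokens)
```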
3.2.2 Knowledge Distillation. Knowledge distillation, as an advanced technique, facilitates cross-modal knowledge transfer under the teacher-student learning framework, enabling student models to generalize across domains and tasks [Sun et al. 2024]. Moreover, it provides innovative approaches to the heterogeneity and compatibility challenges that arise when processing data from different modalities, showing particular promise in promoting knowledge transfer and integration between heterogeneous data. This paper uses the text model as the teacher and the audio and video models as students, enhancing the emotional features extracted from these two weaker-contributing modalities. The loss for FP-KDNet is composed of a classification loss and a feature loss, defined as follows:
\begin{equation} L_{total} = \alpha L_{cls} + \beta L_{feat}, \end{equation}
(9)
The classification loss Lcls employs cross-entropy to measure the difference between the student model's predicted class distribution and the ground-truth labels, which can be expressed as follows:
\begin{equation} L_{cls} = -\sum _{i}y_i \log \hat{y}_i + (1-y_i)\log (1- \hat{y}_i), \end{equation}
(10)
The feature loss Lfeat is used to measure the difference between the feature vectors extracted by the student model from the audio and video modalities and the feature vectors extracted by the teacher model from the text modality, which can be calculated as:
\begin{equation} L_{feat} = \frac{1}{N}\sum _{i=1}^{N}({\bf x}_i^{(s)} - {\bf x}_i^{(t)})^2, \end{equation}
(11)
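The distillation objective in Equations (9)-(11) can be written compactly in PyTorch. In this sketch, L_cls uses standard multi-class cross-entropy and the teacher's text features are detached so that only the student is updated; both are reasonable but unconfirmed readings of the paper, and alpha and beta are unspecified hyperparameters.

```python
import torch
import torch.nn as nn

cls_criterion = nn.CrossEntropyLoss()   # Eq. (10), multi-class form
feat_criterion = nn.MSELoss()           # Eq. (11), mean squared feature distance

def distillation_loss(student_logits: torch.Tensor,   # (B, num_classes)
                      labels: torch.Tensor,           # (B,) ground-truth emotion labels
                      student_feats: torch.Tensor,    # (B, 768) audio/video features
                      teacher_feats: torch.Tensor,    # (B, 768) text (teacher) features
                      alpha: float = 1.0,
                      beta: float = 1.0) -> torch.Tensor:
    l_cls = cls_criterion(student_logits, labels)
    # Detach the teacher so gradients only flow into the student encoders.
    l_feat = feat_criterion(student_feats, teacher_feats.detach())
    return alpha * l_cls + beta * l_feat               # Eq. (9): L_total
```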
3.2.3 Multi-head Attention based Modality Fusion. The multi-head attention mechanism allows the model to focus on features from different subspaces simultaneously, thus encoding important information from each modality more intricately when capturing complex emotional expressions. In addition, the content and contribution of information may differ between modalities; multi-head attention assigns importance to each modality through adaptive weights, effectively combining complementary information from all modalities. For feature fusion, this paper uses skip connections to prevent the model from overlooking key emotional information across modalities. These skip connections establish direct links between features at different levels, aiding the model in capturing and integrating subtle and salient emotional cues from various modalities.
Emotional features corresponding to text, video, and audio modalities are denoted as $F_t^i$, $F_v^i$, and $F_a^i$, respectively. This paper proposes using the multi-head attention mechanism to fuse features from different modalities. Firstly, the strong-emotion modality $F_t^i$ is used as Q and V in the multi-head attention. Then, the paper concatenates the weak modalities $F_v^i$ and $F_a^i$ into the combined feature vector $F_{con}^i$, which is used as K in the multi-head attention. This approach optimizes the dynamic weight allocation of information between modalities, allowing the model to capture and integrate complex patterns rich in expressive characteristics, thereby improving the accuracy of emotion state classification. Specifically, each head headi in the multi-head attention mechanism independently learns representations from different subspaces of the concatenated vector $F_{con}^i$. The outputs of these heads are subsequently merged, resulting in a cohesive and enhanced emotional feature representation $F_{fusion}^i$. The proposed fusion method can be mathematically expressed as follows:
\begin{equation} \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i), \end{equation}
(12)
\begin{equation} F_{fusion}^i = \mathrm{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) W^O \end{equation}
(13)
\begin{equation} F_{fusion}^i = \left(\sum _{k=1}^h \text{softmax}\left(\frac{(QW^Q_k)(KW^K_k)^\top }{\sqrt {d_k}} \right)VW^V_k \right)W^O \end{equation}
(14)
Compared to previous multi-head attention mechanisms, the uniqueness of our proposed fusion strategy lies in integrating cross-modal information by altering the K of modalities while keeping the queries Q and values V unchanged. This approach enables FP-KDNet to better coordinate interactions between different modalities, thereby achieving more effective information fusion.
Finally, this paper concatenates the emotional features $F_{fusion}^i$ obtained from the multi-head attention fusion layer with the text modality features $F_t^i$ and the video modality features $F_v^i$, and then passes them through a fully connected layer (FC) to obtain the network's predicted emotional category.
\begin{equation} F_{final}^i = \text{Concatenate}(F_{fusion}^i, F_t^i, F_v^i) \end{equation}
(15)
\begin{equation} y_p^i = \text{FC}(F_{final}^i) \end{equation}
(16)
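Putting Equations (12)-(16) together, one possible realization of the fusion layer and classifier is sketched below. Treating each utterance in a dialogue as one token of the attention sequence and projecting the concatenated audio-video feature back to the model width are assumptions about the tensor layout, not details given in the paper.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8, num_classes: int = 7):
        super().__init__()
        self.key_proj = nn.Linear(2 * dim, dim)        # maps F_con = [F_v; F_a] back to dim
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, f_t, f_a, f_v):                  # each (B, N, dim), N utterances
        f_con = self.key_proj(torch.cat([f_v, f_a], dim=-1))         # key K
        f_fusion, _ = self.attn(query=f_t, key=f_con, value=f_t)     # Eq. (12)-(14)
        f_final = torch.cat([f_fusion, f_t, f_v], dim=-1)            # Eq. (15), skip connection
        return self.classifier(f_final)                              # Eq. (16), y_p

# Example: a dialogue of 12 utterances with 768-d features per modality (7 MELD classes).
model = FusionClassifier()
f_t, f_a, f_v = (torch.randn(1, 12, 768) for _ in range(3))
logits = model(f_t, f_a, f_v)                          # (1, 12, 7) per-utterance logits
```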
4 Experiments
4.1 Experimental Setting
In this study, a series of meticulously defined experimental settings was adopted to ensure the reproducibility and robustness of our results. The weights of the pre-trained models were sourced from the Huggingface repository, and feature vectors were fixed at a dimensionality of 768. The AdamW optimizer was employed, with experiments beginning at an initial learning rate of 1 × 10−5. The first 10% of the training epochs were designated as a warm-up period, during which the learning rate was gradually increased from a minimal baseline to the predetermined initial rate. All experiments were conducted on an NVIDIA GeForce RTX 3090 Ti GPU, supported by an Intel Core i7-13700KF CPU at 3.4 GHz and 64 GB of RAM.
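A minimal sketch of this optimization setup, assuming a step-level linear warm-up (the paper specifies the warm-up fraction but not its granularity); the model and step count below are placeholders.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 7)                 # placeholder standing in for FP-KDNet
optimizer = AdamW(model.parameters(), lr=1e-5)  # initial learning rate 1e-5

total_steps = 10_000                            # illustrative value
warmup_steps = int(0.1 * total_steps)           # first 10% of training as warm-up

def warmup(step: int) -> float:
    # Scale the base lr linearly from ~0 up to 1.0 during warm-up, then hold it.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup)
# Training loop: call optimizer.step() followed by scheduler.step() once per batch.
```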
Table 1: Statistical information of MELD and IEMOCAP Datasets.
Dataset | Partition | Utterances | Dialogues |
MELD | train + val | 11,098 | 1,153 |
MELD | test | 2,610 | 280 |
IEMOCAP | train + val | 5,810 | 120 |
IEMOCAP | test | 1,623 | 31 |
Table 2: Comparison results of FP-KDNet on the MELD and IEMOCAP datasets (per-class F1 scores on MELD, plus weighted F1 on both datasets).
Models | Anger | Disgust | Fear | Joy | Neutral | Sadness | Surprise | MELD w-f1 | IEMOCAP w-f1 |
DialogueRNN [Majumder etal. 2019] | 41.50 | 1.70 | 1.20 | 50.70 | 73.5 | 23.80 | 49.40 | 57.03 | 62.75 |
ConGCN [Zhang etal. 2019] | 46.80 | 10.60 | 8.70 | 53.10 | 76.70 | 28.50 | 50.30 | 59.40 | 64.18 |
EmoCaps [Li etal. 2022] | 57.54 | 7.69 | 3.03 | 57.50 | 77.12 | 42.52 | 63.19 | 64.00 | 71.77 |
GA2MIF [Li etal. 2023] | 48.52 | - | - | 51.87 | 76.92 | 27.18 | 49.08 | 58.94 | - |
M2FNet [Chudasama etal. 2022] | 55.25 | 15.24 | 3.45 | 55.50 | 67.98 | 47.03 | 58.66 | 62.71 | 69.86 |
FP-KDNet (ours) | 54.50 | 18.30 | 16.34 | 61.28 | 84.78 | 43.56 | 58.10 | 67.15 | 70.22 |
4.2 Datasets
We evaluated the performance of FP-KDNet on two widely used multimodal benchmark datasets, MELD [Poria et al. 2019] and IEMOCAP [Busso et al. 2008]. Table 1 presents the statistical information of the two datasets. Considering the imbalanced distribution of categories within these datasets, we employed the weighted average F1 score as the evaluation metric.
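For reference, the weighted average F1 score can be computed directly with scikit-learn; it weights each class's F1 by its support, which compensates for class imbalance. The labels below are illustrative only.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 2, 2]   # illustrative ground-truth emotion labels
y_pred = [0, 0, 1, 1, 2, 0]   # illustrative predictions
print(f1_score(y_true, y_pred, average="weighted"))
```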
MELD. The MELD dataset, an expansion of the EmotionLines dataset, features a vast array of multimodal dialogues from the scripted television series "Friends". It includes more than 1,400 dialogues and 13,000 utterances, each capturing text, audio, and visual data corresponding to expressed emotions.
IEMOCAP. The IEMOCAP dataset is a multimodal corpus that integrates audio, visual, and textual data. It was created through controlled sessions with professional actors and comprises approximately 12 hours of audiovisual recordings. These recordings are meticulously annotated with emotional labels across five sessions, each featuring one male and one female actor.
4.3 Comparison methods
We compare FP-KDNet against the following five advanced emotion recognition approaches. DialogueRNN [Majumder et al. 2019] captures conversational context through recurrent units while employing an attention mechanism to dynamically prioritize key conversational information. ConGCN [Zhang et al. 2019] leverages a conversation graph to simultaneously model context-sensitive and speaker-sensitive dependencies in multi-speaker conversations. EmoCaps [Li et al. 2022] introduces a new structure called Emoformer, which effectively captures emotional trends within conversations. GA2MIF [Li et al. 2023] adopts a two-stage fusion approach that integrates graph-based and attention-based methods, addressing both context and cross-modal modeling. M2FNet [Chudasama et al. 2022] is a multi-modal fusion network that integrates text, audio, and visual features for emotion recognition in conversation.
4.4 Main Results
The experimental results presented in Table 2 provide an in-depth analysis of FP-KDNet's performance across two benchmark multimodal emotion recognition datasets, MELD and IEMOCAP.
On the MELD dataset, FP-KDNet achieves a weighted average F1 score (w-f1) of 67.15%, surpassing the strongest baseline, EmoCaps (64.00%), by 3.15 percentage points and ConGCN (59.40%) by 7.75 points. This robust performance is attributed to the integration of a novel multimodal network that incorporates facial perception, enhancing the recognition of key facial features linked to emotional expressions through a facial attention mechanism. The strengths of FP-KDNet are further reflected in the 'Neutral', 'Joy', 'Disgust', and 'Fear' categories, where the model achieves the highest scores among the compared methods, highlighting the effectiveness of the knowledge distillation strategy. Significant individual differences in expressing anger and surprise require models to capture more subtle, person-specific features; FP-KDNet may be limited in this respect, which explains its weaker performance in these categories.
On the IEMOCAP dataset, FP-KDNet demonstrates commendable proficiency with a w-f1 score of 70.22%, although it falls slightly short of the leading score of 71.77% held by EmoCaps. Nevertheless, FP-KDNet's overall results indicate consistently high performance, confirming the advantages of the model's handling of modality heterogeneity and complementarity, which is primarily enabled by the multi-head attention-based fusion layer that integrates features from the different modalities and ensures a comprehensive representation of emotions.
In conclusion, FP-KDNet's innovations—facial perception with attention mechanisms, knowledge distillation, and a multi-head attention-based fusion layer—significantly contribute to its enhanced performance, reflecting the potential of these methods to refine and advance the field of multimodal emotion recognition.
Table 3: Ablation study results of FP-KDNet on the MELD dataset based on F1 score.
FP-KDNet | 67.15 |
-w/o Audio | 66.25 |
-w/o Vision | 65.20 |
-w/o Audio,Vision | 64.25 |
-w/o Text,Vision | 42.56 |
-w/o FPA Module | 66.10 |
4.5 Ablation Study
The ablation study presented in Table 3 investigates the impact of different modalities and the facial perception attention (FPA) module on FP-KDNet's performance, as measured by the F1 score on the MELD dataset.
The complete FP-KDNet architecture achieves the highest F1 score of 67.15%, indicating that the synergistic effect of multimodal input is pivotal to the efficacy of the model. Excluding audio results in a marginal decrease to 66.25%, suggesting that while the audio input contributes to performance, the model can still capture a significant amount of emotional information from the other modalities. A more notable decline is observed when the visual modality is ablated, with the score dropping to 65.20%. This highlights the visual modality's considerable role, potentially due to its conveyance of non-verbal cues that are integral to human emotion interpretation. The concurrent removal of audio and visual inputs lowers the score further to 64.25%, reinforcing the hypothesis that these modalities contain complementary information which, when combined, enhances the model's ability to discern complex emotional states. Intriguingly, the most substantial degradation occurs when the text and visual inputs are removed, with the F1 score plummeting to 42.56%. This result accentuates the text modality's preeminence for the task at hand, likely due to the rich contextual and semantic information inherent in the textual data. Finally, excluding the FPA module yields an F1 score of 66.10%, only slightly below that of the full model. This mild decline implies that the FPA module helps refine the model's focus on salient facial features, yet its absence does not precipitously affect overall performance.
5 Conclusion
In this paper, we propose a facial perception and knowledge distillation network for the multimodal emotion recognition task. The implementation of the facial perceptual attention module and the knowledge distillation strategy significantly enhance the model's ability. By capturing essential facial cues and leveraging a well-trained text-based teacher network for knowledge transfer, the model achieves a comprehensive and nuanced understanding of emotions. The multi-head attention-based multimodal fusion layer represents a significant advancement in integrating disparate modal information to produce a cohesive emotional representation. The experimental results demonstrate the superiority of FP-KDNet, achieving a weighted average F1 score of 67.15% on the MELD dataset. This achievement lays a solid foundation for future research in affective computing.
Acknowledgments
This work was supported by the National Key Research and Development Program of China (No. 2020YFC0811004), the China National Foundation for Natural Sciences (No. 62072150), and the Open Project of the Key Laboratory of Collection Resources Revitalizing Technology, Ministry of Culture and Tourism (No. CRRT2022K03).
References
- Sharmeen M SaleemAbdullah Abdullah, Siddeeq YAmeen Ameen, MohammedAM Sadeeq, and Subhi Zeebaree. 2021. Multimodal emotion recognition using deep learning. Journal of Applied Science and Technology Trends 2, 01 (2021), 73–79.
- Abdullah AlMaruf, Fahima Khanam, MdMahmudul Haque, ZakariaMasud Jiyad, Firoj Mridha, and Zeyar Aung. 2024. Challenges and opportunities of text-based emotion detection: A survey. IEEE Access (2024).
- Yasser Alharbi. 2024. Effective ensembling classification strategy for voice and emotion recognition. International Journal of System Assurance Engineering and Management 15, 1 (2024), 334–345.
- Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
- Saira Bano, Nicola Tonellotto, Pietro Cassarà, and Alberto Gotta. 2024. FedCMD: A Federated Cross-modal Knowledge Distillation for Drivers’ Emotion Recognition. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–27.
- LisaFeldman Barrett, Ralph Adolphs, Stacy Marsella, AleixM Martinez, and SethD Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological science in the public interest 20, 1 (2019), 1–68.
- Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann. 2019. Blazeface: Sub-millisecond neural face detection on mobile gpus. arXiv preprint arXiv:1907.05047 (2019).
- Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding?. In ICML, Vol.2. 4.
- Umberto Bilotti, Carmen Bisogni, Maria DeMarsico, and Sara Tramonte. 2024. Multimodal Emotion Recognition via Convolutional Neural Networks: Comparison of different strategies on two multimodal datasets. Engineering Applications of Artificial Intelligence 130 (2024), 107708.
- Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 535–541.
- Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Ebrahim (Abe) Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (2008), 335–359.
- MKalpana Chowdary, TuN Nguyen, and DJude Hemanth. 2023. Deep learning-based facial emotion recognition for human–computer interaction applications. Neural Computing and Applications 35, 32 (2023), 23311–23328.
- Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, and Naoyuki Onoe. 2022. M2fnet: Multi-modal fusion network for emotion recognition in conversation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4652–4661.
- Kaouther Ezzameli and Hela Mahersia. 2023. Emotion recognition from unimodal to multimodal analysis: A review. Information Fusion 99 (2023), 101847.
- Cunhang Fan, Kang Zhu, Jianhua Tao, Guofeng Yi, Jun Xue, and Zhao Lv. 2024. Multi-level contrastive learning: Hierarchical alleviation of heterogeneity in multimodal sentiment analysis. IEEE Transactions on Affective Computing (2024).
- ToddE Feinberg, Arthur Rifkin, Carrie Schaffer, and Elaine Walker. 1986. Facial discrimination and emotional recognition in schizophrenia and affective disorders. Archives of general psychiatry 43, 3 (1986), 276–279.
- NikiMaria Foteinopoulou and Ioannis Patras. 2024. Emoclip: A vision-language method for zero-shot video facial expression recognition. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). IEEE, 1–10.
- T Gao, X Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In EMNLP 2021-2021 Conference on Empirical Methods in Natural Language Processing, Proceedings.
- AV Geetha, T Mala, D Priyanka, and E Uma. 2024. Multimodal Emotion Recognition with deep learning: advancements, challenges, and future directions. Information Fusion 105 (2024), 102218.
- Yusra Ghafoor, Shi Jinping, FernandoH Calderon, Yen-Hao Huang, Kuan-Ta Chen, and Yi-Shin Chen. 2023. TERMS: textual emotion recognition in multidimensional space. Applied Intelligence 53, 3 (2023), 2673–2693.
- Jianping Gou, Baosheng Yu, StephenJ Maybank, and Dacheng Tao. 2021. Knowledge distillation: A survey. International Journal of Computer Vision 129, 6 (2021), 1789–1819.
- Steven Gutstein, Olac Fuentes, and Eric Freudenthal. 2008. Knowledge transfer in deep convolutional neural nets. International Journal on Artificial Intelligence Tools 17, 03 (2008), 555–567.
- Samira Hazmoune and Fateh Bougamouza. 2024. Using transformers for multimodal emotion recognition: Taxonomies and state of the art review. Engineering Applications of Artificial Intelligence 133 (2024), 108339.
- Geoffrey Hinton. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
- Mustaqeem Khan, Wail Gueaieb, Abdulmotaleb ElSaddik, and Soonil Kwon. 2024. MSER: Multimodal speech emotion recognition using cross-attention with deep fusion. Expert Systems with Applications 245 (2024), 122946.
- SmithK Khare, Victoria Blanes-Vidal, EsmaeilS Nadimi, and URajendra Acharya. 2024. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Information Fusion 102 (2024), 102019.
- Hakpyeong Kim and Taehoon Hong. 2024. Enhancing emotion recognition using multimodal fusion of physiological, environmental, personal data. Expert Systems with Applications 249 (2024), 123723.
- Sandeep Kumar, MohdAnul Haq, Arpit Jain, CAndy Jason, NageswaraRao Moparthi, Nitin Mittal, and ZamilS Alzamil. 2023. Multilayer Neural Network Based Speech Emotion Recognition for Smart Assistance.Computers, Materials & Continua 75, 1 (2023).
- Fanny Larradet, Radoslaw Niewiadomski, Giacinto Barresi, DarwinG Caldwell, and LeonardoS Mattos. 2020. Toward emotion recognition from physiological signals in the wild: approaching the methodological issues in real-life data collection. Frontiers in psychology 11 (2020), 1111.
- Yuanyuan Lei and Houwei Cao. 2023. Audio-visual emotion recognition with preference learning based on intended and multi-modal perceived labels. IEEE Transactions on Affective Computing 14, 4 (2023), 2954–2969.
- SzeChit Leong, YukMing Tang, ChungHin Lai, and CKM Lee. 2023. Facial expression and body gesture emotion recognition: A systematic review on the use of visual data in affective computing. Computer Science Review 48 (2023), 100545.
- Jiang Li, Xiaoping Wang, Guoqing Lv, and Zhigang Zeng. 2023. GA2MIF: graph and attention based two-stage multi-source information fusion for conversational emotion detection. IEEE Transactions on affective computing 15, 1 (2023), 130–143.
- Zaijing Li, Fengxiao Tang, Ming Zhao, and Yusen Zhu. 2022. EmoCaps: Emotion Capsule based Model for Conversational Emotion Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1610–1618.
- Shuai Liu, Peng Gao, Yating Li, Weina Fu, and Weiping Ding. 2023a. Multi-modal fusion network with complementarity and importance for emotion recognition. Information Sciences 619 (2023), 679–694.
- Yucheng Liu, Ziyu Jia, and Haichao Wang. 2023b. Emotionkd: a cross-modal knowledge distillation framework for emotion recognition based on physiological signals. In Proceedings of the 31st ACM International Conference on Multimedia. 6122–6131.
- Wentao Ma, Qingchao Chen, Tongqing Zhou, Shan Zhao, and Zhiping Cai. 2023. Using multimodal contrastive knowledge distillation for video-text retrieval. IEEE Transactions on Circuits and Systems for Video Technology 33, 10 (2023), 5486–5497.
- Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. Dialoguernn: An attentive rnn for emotion detection in conversations. In Proceedings of the AAAI conference on artificial intelligence, Vol.33. 6818–6825.
- Bei Pan, Kaoru Hirota, Zhiyang Jia, Linhui Zhao, Xiaoming Jin, and Yaping Dai. 2023. Multimodal emotion recognition based on feature selection and extreme learning machine in video clips. Journal of Ambient Intelligence and Humanized Computing 14, 3 (2023), 1903–1917.
- Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 527–536.
- Atieh Poushneh, Arturo Vasquez-Parraga, and RichardS Gearhart. 2024. The effect of empathetic response and consumers’ narcissism in voice-based artificial intelligence. Journal of Retailing and Consumer Services 79 (2024), 103871.
- RGnana Praveen and Jahangir Alam. 2024. Incongruity-Aware Cross-Modal Attention for Audio-Visual Fusion in Dimensional Emotion Recognition. IEEE Journal of Selected Topics in Signal Processing (2024).
- Gargi Singh, Dhanajit Brahma, Piyush Rai, and Ashutosh Modi. 2023. Text-based fine-grained emotion prediction. IEEE Transactions on Affective Computing (2023).
- Junwei Sun, Juntao Han, Yanfeng Wang, and Peng Liu. 2021. Memristor-based neural network circuit of emotion congruent memory with mental fatigue and emotion inhibition. IEEE Transactions on Biomedical Circuits and Systems 15, 3 (2021), 606–616.
- Teng Sun, Yinwei Wei, Juntong Ni, Zixin Liu, Xuemeng Song, Yaowei Wang, and Liqiang Nie. 2024. Muti-modal Emotion Recognition via Hierarchical Knowledge Distillation. IEEE Transactions on Multimedia (2024).
- Pradeep Tiwari, Harshil Rathod, Sakshee Thakkar, and AnandD Darji. 2023. Multimodal emotion recognition using SDA-LDA algorithm in video clips. Journal of Ambient Intelligence and Humanized Computing 14, 6 (2023), 6585–6602.
- A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017).
- Dong Zhang, Liangqing Wu, Changlong Sun, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019. Modeling both Context-and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations.. In IJCAI. Macao, 5415–5421.
- Junjie Zhang, Kun Zheng, Sarah Mazhar, Xiaohui Fu, and Jiangping Kong. 2023. Trusted emotion recognition based on multiple signals captured from video. Expert Systems with Applications 233 (2023), 120948.
- Shiqing Zhang, Yijiao Yang, Chen Chen, Xingnan Zhang, Qingming Leng, and Xiaoming Zhao. 2024. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects. Expert Systems with Applications 237 (2024), 121692.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
VRCAI '24, Nanjing, China
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-1348-4/24/12.
DOI: https://doi.org/10.1145/3703619.3706033