Manuscript Title:

A COMPARATIVE STUDY OF CLASSICAL MACHINE LEARNING, DEEP LEARNING, AND TRANSFORMER-BASED ARCHITECTURES FOR MULTIMODAL HINDI SPEECH EMOTION RECOGNITION

Authors:

SUJATA KOTIAN, Dr. SANTOSH SINGH

DOI Number:

DOI: 10.5281/zenodo.18786899

Published: 2026-02-23

About the author(s)

1. SUJATA KOTIAN - University Department of Information Technology, University of Mumbai.
2. Dr. SANTOSH SINGH - University Department of Information Technology, University of Mumbai.


Abstract

Speech Emotion Recognition (SER) is critical to building intelligent devices and systems that are useful and aware of the user's perspective. While SER has been researched extensively for English and other European languages, Hindi has received far less attention, particularly in the application of transformer architectures. This paper presents an extensive comparative analysis of classical machine-learning models, deep-learning architectures, and transformer-based networks for Hindi SER under a single evaluation framework. A curated Hindi emotional speech dataset was prepared through acoustic pre-processing and feature extraction, yielding both Mel-spectrogram and raw-waveform representations. The following models were trained and evaluated under uniform training-validation-testing configurations: classical machine-learning models (SVM, Random Forest, Gradient Boosting), deep-learning models (Convolutional Neural Network (CNN), CNN-BiLSTM, and attention-enhanced networks), and transformer models (Wav2Vec2.0, HuBERT, Vision Transformer (ViT), and Swin Transformer (Swin-T)). Our experiments show a consistent progression in performance across the model families: the transformer models outperform all others, achieving the highest accuracy (93.4%) and macro-F1 score, followed by the deep-learning and classical models. Error analyses further reveal that transformer-generated embeddings improve the separation of subtle, easily confused emotions such as sadness and fear. This paper thus provides a solid empirical and methodological foundation for future Hindi SER research and highlights major opportunities for lightweight deployment of Hindi SER systems and for multimodal extensions.
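As context for the Mel-spectrogram features the abstract mentions, the sketch below shows one common way such features are computed from a raw waveform. This is a minimal, self-contained NumPy illustration, not the paper's actual pipeline; the sampling rate, FFT size, hop length, and number of mel bands (16 kHz, 512, 160, 64) are assumed values chosen only for the example.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising slope of the triangle
            if center > left:
                fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope of the triangle
            if right > center:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=512, hop=160, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum per frame.
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Project onto the mel filterbank and apply log compression.
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)

# One second of synthetic audio stands in for a Hindi utterance.
sr = 16000
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
feats = log_mel_spectrogram(y, sr=sr)
print(feats.shape)  # (time frames, mel bands) = (97, 64)
```

The resulting (frames x mel-bands) matrix is the kind of 2-D time-frequency input that CNN and vision-transformer models (ViT, Swin-T) consume, whereas Wav2Vec2.0 and HuBERT operate on the raw waveform directly.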


Keywords

Hindi Speech Emotion Recognition; Deep Learning; Transformers; Wav2Vec2.0; HuBERT; CNN–BiLSTM; Mel-Spectrogram; Affective Computing; Benchmarking; Acoustic Modelling.