
Overcoming Vocal Similarities in Identical Twins: A Hybrid Deep Learning Model for Emotion-Aware Speaker and Gender Recognition

Author(s):
  • Rajani Kumari INAPAGOLLA, Department of Electronics and Communication Engineering, GITAM University, Vizag, India
  • K. Kalyan BABU, Department of Electronics and Communication Engineering, GITAM University, Vizag, India
Abstract:

Speaker identification among identical twins remains a significant challenge in voice-based biometric systems, particularly under emotional variability. Emotions dynamically alter speech characteristics, reducing the effectiveness of conventional identification algorithms. To address this, we propose a hybrid deep learning architecture that integrates gender and emotion classification with speaker identification, tailored specifically to the complexity of identical twin voices. The system combines Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) embeddings for speaker-specific representations, Power-Normalized Cepstral Coefficients (PNCC) for noise-robust spectral features, and the Maximal Overlap Discrete Wavelet Transform (MODWT) for effective time-frequency denoising. A Radial Basis Function Neural Network (RBFNN) refines and fuses the feature vectors, enhancing the discrimination of emotion-related cues. An attention mechanism further emphasizes emotionally salient patterns, followed by a Multi-Layer Perceptron (MLP) for final classification. The model is evaluated on speech datasets from RAVDESS, Google Research, and a proprietary corpus of identical twin voices. Results demonstrate significant improvements in speaker and emotion recognition accuracy, especially under low signal-to-noise ratio (SNR) conditions, outperforming traditional Mel-cepstral-based methods. The proposed system's integration of robust audio fingerprinting, feature refinement, and attention-guided classification makes it well suited to distinguishing identical twin speakers under emotional variability.
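To make the described fusion pipeline concrete, the sketch below (not the authors' released code) shows one way the components could be wired together in PyTorch: a pretrained ECAPA-TDNN speaker embedding and utterance-level PNCC statistics (assumed to be extracted after MODWT denoising) are fused through a Gaussian RBF layer, gated by a simple attention mechanism, and passed to MLP heads for speaker, gender, and emotion classification. All layer sizes, feature dimensions, and class counts are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout): RBF fusion of an ECAPA-TDNN speaker
# embedding with PNCC statistics, attention gating, and MLP heads for speaker,
# gender, and emotion classification. Not the authors' implementation.
import torch
import torch.nn as nn


class RBFLayer(nn.Module):
    """Gaussian radial-basis activations around learned centers."""

    def __init__(self, in_dim: int, num_centers: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, in_dim))
        self.log_sigma = nn.Parameter(torch.zeros(num_centers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, in_dim)
        dist = torch.cdist(x, self.centers)               # (batch, num_centers)
        sigma = torch.exp(self.log_sigma)                 # per-center width
        return torch.exp(-dist.pow(2) / (2 * sigma.pow(2)))


class HybridTwinClassifier(nn.Module):
    """ECAPA embedding + PNCC stats -> RBF fusion -> attention -> MLP heads."""

    def __init__(self, ecapa_dim=192, pncc_dim=40, num_centers=128,
                 num_speakers=10, num_emotions=8, num_genders=2):
        super().__init__()
        self.rbf = RBFLayer(ecapa_dim + pncc_dim, num_centers)
        # Sigmoid gate that emphasizes emotionally salient RBF responses.
        self.attn = nn.Sequential(nn.Linear(num_centers, num_centers), nn.Sigmoid())

        def head(out_dim):
            return nn.Sequential(nn.Linear(num_centers, 128), nn.ReLU(),
                                 nn.Linear(128, out_dim))

        self.speaker_head = head(num_speakers)
        self.emotion_head = head(num_emotions)
        self.gender_head = head(num_genders)

    def forward(self, ecapa_emb, pncc_feats):
        fused = self.rbf(torch.cat([ecapa_emb, pncc_feats], dim=-1))
        gated = fused * self.attn(fused)
        return (self.speaker_head(gated),
                self.gender_head(gated),
                self.emotion_head(gated))


# Usage with random placeholders standing in for a pretrained ECAPA-TDNN
# embedding and utterance-level PNCC features obtained after MODWT denoising.
model = HybridTwinClassifier()
spk, gen, emo = model(torch.randn(4, 192), torch.randn(4, 40))
print(spk.shape, gen.shape, emo.shape)  # (4, 10), (4, 2), (4, 8)
```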


© The Author(s) 2025. Published by RITHA Publishing. This article is distributed under the terms of the CC-BY 4.0 license, which permits further distribution in any medium, provided the original work is properly cited, maintaining attribution to the author(s), the title of the work, the journal citation, and the DOI.


How to cite:

Inapagolla, R. K., & Babu, K. K. (2025). Overcoming vocal similarities in identical twins: A hybrid deep learning model for emotion-aware speaker and gender recognition. Journal of Research, Innovation and Technologies, Volume IV, 1(7), 69–81. https://doi.org/10.57017/jorit.v4.1(7).05

References:

Chien, W., & Wang, W. (2019). Feature extraction and fusion for robust emotion recognition in speech. IEEE Transactions on Audio, Speech, and Language Processing, 27(10), 1698–1707. https://doi.org/10.1109/TASLP.2019.2930311


Dinesh, B., & Agarwal, S. (2021). Hybrid deep learning model for speaker identification and emotion classification. Proceedings of the 2021 IEEE International Conference on Acoustic Signal Processing (ICASP), 45–52. https://doi.org/10.1109/ICASP.2021.9300254


Duan, Z., & Wei, H. (2020). The effect of emotional speech on speaker identification: A review of techniques and challenges. Journal of Audio, Speech, and Music Processing, 7(1), 15–28. https://doi.org/10.1177/2057034X20915712


Gerven, A.V., & He, C. (2017). Deep neural networks for emotion recognition from speech: A survey. IEEE Transactions on Affective Computing, 8(2), 194–210. https://doi.org/10.1109/TAFFC.2016.2615329


Ghosal, A., Gupta, A., & Majumder, P. (2019). Speaker recognition in emotional speech: Challenges and solutions. IEEE Transactions on Audio, Speech, and Language Processing, 27(8), 1207–1219. https://doi.org/10.1109/TASLP.2019.2904508


Inapagolla, R. J., & Babu, K. K. (2023). Designing highly secured speaker identification with audio fingerprinting using MODWT and RBFNN. International Journal of Intelligent Systems and Applications in Engineering, 25–30. https://www.ijisae.org/index.php/IJISAE/article/view/4779


Inapagolla, R. K., & Babu, K. K. (2025). Audio fingerprinting to achieve greater accuracy and maximum speed with multi-model CNN-RNN-LSTM in speaker identification. International Journal of Computational and Experimental Science and Engineering, 1108–1116. https://doi.org/10.22399/ijcesen.1138


Kim, S., & Park, J. (2018). Emotional variance in twin speech recognition and its applications. International Journal of Speech Technology, 21(4), 589–600. https://doi.org/10.1007/s10772-018-9431-5


Kim, K., & Kim, J. (2021). Attention-based deep learning models for emotion-aware speaker identification. IEEE Access, 9, 172134–172145. https://doi.org/10.1109/ACCESS.2021.3068586


Kumar, S., & Aggarwal, N. (2018). Wavelet-based feature extraction for emotion and speaker recognition. Proceedings of the International Conference on Signal Processing and Communication (SPCOM), 1–6. https://doi.org/10.1109/SPCOM.2018.8766801


Lee, J., Park, H., & Jeong, H. (2021). Wavelet transform-based features for noise-robust speaker recognition. Speech Communication, 134, 12–25. https://doi.org/10.1016/j.specom.2021.03.002


Lee, J., & Cho, H. (2020). Speaker identification for identical twins using deep neural networks. IEEE Transactions on Audio, Speech, and Language Processing, 28, 1235–1245. https://doi.org/10.1109/TASLP.2020.2975365


Li, J., Wang, L., & Xu, C. (2022). Wavelet transform for robust emotion recognition in speech signals. Journal of Electrical Engineering & Technology, 17(4), 1853–1862. https://doi.org/10.1007/s42835-021-00952-6


Li, J., Zhang, L., Guo, D., Zhuo, S., & Sim, T. (2015). Audio-visual twins database. Proceedings of the International Conference on Biometrics, 493–500. https://doi.org/10.1109/ICB.2015.7139115


Liu, J., & Wang, X. (2020). Towards better speaker emotion recognition via fine-grained feature fusion. IEEE Access, 8, 178491–178501. https://doi.org/10.1109/ACCESS.2020.3018085


Mavroforakis, M., & Koutroumpis, E. (2020). Robust emotion classification from speech using joint spectral features and temporal attention networks. Speech Communication, 124, 1–10. https://doi.org/10.1016/j.specom.2020.02.003


Moore, S., & Li, P. (2019). Emotion-robust speaker recognition systems: Current approaches and challenges. Journal of Voice, 33(5), 776–789. https://doi.org/10.1016/j.jvoice.2018.09.004


Park, Y., & Kim, T. (2019). Discriminating identical twins in speaker identification under emotional variability. Speech Communication, 108, 23–32. https://doi.org/10.1016/j.specom.2019.02.005


Peng, H., & Zhang, Y. (2019). Cross-corpus speech emotion recognition using deep learning techniques. IEEE Transactions on Multimedia, 21(7), 1804–1815. https://doi.org/10.1109/TMM.2019.2898553


Poria, S., & Cambria, E. (2020). Deep learning for emotion recognition: A review. IEEE Transactions on Affective Computing, 11(2), 101–110. https://doi.org/10.1109/TAFFC.2020.2975365


Singh, M., & Patel, P. (2020). Feature extraction techniques for speech emotion recognition: A comparative review. Journal of Signal Processing, 44(3), 247–261. https://doi.org/10.1109/JSP.2020.2960332


Singh, R., Gupta, A., & Rana, A. (2020). PNCC-based emotion recognition from speech: A comparative study. Journal of Signal Processing Systems, 92(2), 179–191. https://doi.org/10.1007/s11265-020-01431-5


Snyder, D., Garcia, M., & Karbasi, A. (2021). ECAPA-TDNN: A deep learning architecture for speaker recognition under emotional variability. Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, 6063–6067. https://doi.org/10.1109/ICASSP39728.2021.9413595


Yang, B., & Zhao, Z. (2021). Emotional speech recognition using deep convolutional networks: A comparative study. Proceedings of the International Conference on Speech Processing, 112–118. https://doi.org/10.1109/ICSP50755.2021.9473446


Zhang, T., & Li, X. (2018). Speaker recognition under stress and emotion: Challenges and techniques. Speech Communication, 102, 38–50. https://doi.org/10.1016/j.specom.2018.04.001


Zhao, X., Li, F., & Zhang, Y. (2020). Deep learning techniques for speech emotion recognition: A survey. IEEE Access, 8, 106728–106740. https://doi.org/10.1109/ACCESS.2020.2998756


Zeng, Z., & Li, Z. (2018). Multimodal emotion recognition: A review and new directions. IEEE Transactions on Affective Computing, 9(2), 226–240. https://doi.org/10.1109/TAFFC.2017.2709823


Zhang, Z., & Li, J. (2017). Improving speaker identification under emotional variability. Journal of Voice, 31(6), 779–788. https://doi.org/10.1016/j.jvoice.2017.02.006