Abstract
Traditional speaker-centric synthesis methods prioritize linguistic accuracy but overlook the emotional connection and feedback loop with the audience. This paper presents an in-depth exploration of responsive speaker-listener generation, aiming to enhance communication by providing real-time non-verbal feedback such as head movements and facial expressions. Driven by video, we extract 3DMM coefficients to model facial features and head poses. Combining these with a Transformer speech encoder that extracts 45-dimensional acoustic features, we achieve sentence-level speaker generation. For responsive listener generation, we introduce two attention mechanisms into the Transformer decoder: cross-modal multi-head attention that aligns the audio and motion modalities, and biased causal self-attention suited to longer audio sequences. Finally, by aligning audio with a behavioral model and optimizing an enhanced neural renderer for facial images, we achieve precise control over facial movements. Extensive experiments demonstrate the superiority of our approach over existing methods.
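As an illustration of the biased causal self-attention mentioned above, here is a minimal single-head numpy sketch. It applies an ALiBi-style linear distance penalty to the attention scores before the causal mask, which is the property that lets such attention extrapolate to audio sequences longer than those seen in training. The slope value and the single-head, unbatched form are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_causal_attention(q, k, v, slope=0.5):
    """Single-head causal self-attention with an ALiBi-style linear
    distance bias (slope is an illustrative hyperparameter)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (T, T) similarities
    idx = np.arange(T)
    bias = -slope * np.abs(idx[:, None] - idx[None, :])  # penalize distant keys
    scores = scores + bias
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                         # causal: no peeking ahead
    return softmax(scores) @ v

# Toy audio-feature sequence: 6 frames of 45-dimensional acoustic features,
# matching the feature dimensionality described in the abstract.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 45))
out = biased_causal_attention(x, x, x)
print(out.shape)  # (6, 45)
```

Because of the causal mask, the first output frame attends only to itself, so it reproduces the first value vector exactly; later frames blend earlier ones, with the linear bias down-weighting distant frames.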








Data Availability Statement
Our data are primarily from https://github.com/dc3ea9f/vico_challenge_baseline, and we use the dataset for research purposes only.
Funding
This work was supported in part by research projects funded by the Education Department of Hunan Province (23B0862, 20A282, 19B324).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Daowu Yang, Qi Yang, Wen Jiang, Jifeng Chen, Zhengxi Shao and Qiong Liu. The first draft of the manuscript was written by Daowu Yang and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing Interests
The authors declare that there is no conflict of interest regarding the publication of this paper.
Ethical and informed consent for data used
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, D., Yang, Q., Jiang, W. et al. Video-driven speaker-listener generation based on Transformer and neural renderer. Multimed Tools Appl 83, 70501–70522 (2024). https://doi.org/10.1007/s11042-024-18291-z
Received:
Revised:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s11042-024-18291-z


