
GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning

  • Research Article
Frontiers of Information Technology & Electronic Engineering

Abstract

In recent years, multi-label zero-shot learning (ML-ZSL) has attracted increasing attention because of its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL lies in predicting multiple labels for unseen classes without any labeled training data, in contrast to conventional supervised learning. However, existing methods face two significant challenges: the substantial semantic gap between modalities, which impedes effective knowledge transfer, and the intricate relationships among multiple labels, which are difficult to model accurately. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach. The proposed method combines the strengths of multimodal large language models with graph-based structures, significantly enhancing the reasoning process underlying multi-label prediction. First, a novel multimodal chain-of-thought reasoning framework is presented that imitates human-like step-by-step reasoning to produce multi-label predictions. Second, a technique for integrating label graphs into the reasoning process is introduced; it captures complex semantic relationships among labels, thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that the proposed GMCoT approach outperforms state-of-the-art methods in ML-ZSL.
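The graph-augmentation idea described above can be illustrated with a deliberately simplified sketch: build a label graph from co-occurrence statistics, then refine per-label scores by mixing each label's score with its graph neighbours', so that unseen labels connected to confidently predicted labels are promoted. The function names (`build_label_graph`, `propagate_scores`), the mixing rule, and the toy labels are illustrative assumptions only, not the authors' GMCoT implementation, which performs the reasoning steps with a multimodal large language model.

```python
from collections import defaultdict


def build_label_graph(annotations):
    """Build an undirected adjacency map from label sets that
    co-occur in training annotations (illustrative only)."""
    graph = defaultdict(set)
    for labels in annotations:
        for a in labels:
            for b in labels:
                if a != b:
                    graph[a].add(b)
    return graph


def propagate_scores(initial_scores, graph, alpha=0.5):
    """Mix each label's score with the mean score of its neighbours,
    so labels related to high-scoring labels are promoted."""
    refined = {}
    for label, score in initial_scores.items():
        neighbours = [initial_scores[n] for n in graph.get(label, ())
                      if n in initial_scores]
        mean = sum(neighbours) / len(neighbours) if neighbours else 0.0
        refined[label] = (1 - alpha) * score + alpha * mean
    return refined


if __name__ == "__main__":
    # "park" gets a low initial score but co-occurs with confident labels.
    graph = build_label_graph([["dog", "grass", "park"], ["dog", "grass"]])
    scores = propagate_scores({"dog": 0.9, "grass": 0.8, "park": 0.1}, graph)
    print(scores)
```

In this toy run, "park" rises from 0.1 toward the mean of its neighbours' scores, mimicking how a label graph can promote plausible unseen labels that raw per-label scoring would miss.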


Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Author information

Authors and Affiliations

Authors

Contributions

Xiang WEN designed the algorithm. Xiang WEN and Haobo WANG drafted the paper. Ke CHEN, Tianlei HU, and Gang CHEN polished, optimized, revised, and finalized the paper.

Corresponding author

Correspondence to Gang Chen.

Ethics declarations

All the authors declare that they have no conflict of interest.

Additional information

Project supported by the Key R&D Program of Zhejiang Province (No. 2024C01021), the National Regional Innovation and Development Joint Fund of China (No. U24A20254), and the Leading Talents of Technological Innovation Program of Zhejiang Province (No. 2023R5214)

List of supplementary materials

1 Implementation process

2 Evaluation results and analysis

3 Fairness and bias analysis

4 Ablation studies

Fig. S1 Performance comparison across different visual attributes on the NUS-WIDE dataset (ZSL setting)

Fig. S2 Performance comparison across label frequency groups on the NUS-WIDE dataset (ZSL setting)

Table S1 Ablation results on the test set



About this article


Cite this article

Wen, X., Wang, H., Chen, K. et al. GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning. Front Inform Technol Electron Eng 26, 2623–2637 (2025). https://doi.org/10.1631/FITEE.2500429



  • DOI: https://doi.org/10.1631/FITEE.2500429
