Abstract
In recent years, multi-label zero-shot learning (ML-ZSL) has attracted increasing attention because of its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL is to predict multiple labels for unseen classes without any labeled training data, in contrast to conventional supervised learning. Existing methods, however, face two significant obstacles: the substantial semantic gap between modalities, which impedes effective knowledge transfer, and the intricate relationships among multiple labels, which are difficult to model accurately. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach that combines the strengths of multimodal large language models with graph-based structures to enhance the reasoning process underlying multi-label prediction. First, we present a novel multimodal chain-of-thought reasoning framework that imitates human-like step-by-step reasoning to produce multi-label predictions. Second, we introduce a technique for integrating label graphs into the reasoning process, enabling the capture of complex semantic relationships among labels and thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that GMCoT outperforms state-of-the-art methods in ML-ZSL.
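The idea of letting a label graph refine per-label predictions can be illustrated with a minimal sketch. This is not the authors' implementation (the paper integrates label graphs into chain-of-thought reasoning with a multimodal large language model); it only shows, under simplifying assumptions, how scores of semantically related labels can reinforce each other via propagation over a co-occurrence graph. The function and label names are hypothetical.

```python
# Illustrative sketch (not the GMCoT method itself): refine per-label
# confidence scores from a multimodal model by blending each label's score
# with the mean score of its neighbors in a label co-occurrence graph.

def propagate_scores(scores, neighbors, alpha=0.5, steps=2):
    """Iteratively blend each label's score with its graph neighbors' scores.

    scores:    list of per-label confidences (e.g., sigmoid outputs)
    neighbors: neighbors[i] lists the label indices connected to label i
    alpha:     weight given to the graph neighborhood (0 = no propagation)
    """
    refined = list(scores)
    for _ in range(steps):
        updated = []
        for i, s in enumerate(scores):
            if neighbors[i]:
                nbr_mean = sum(refined[j] for j in neighbors[i]) / len(neighbors[i])
            else:
                nbr_mean = refined[i]  # isolated label keeps its own score
            updated.append((1 - alpha) * s + alpha * nbr_mean)
        refined = updated
    return refined

# Toy example: labels 0 ("sky") and 1 ("cloud") co-occur, as do 2 and 3.
raw = [0.9, 0.2, 0.1, 0.05]
graph = [[1], [0], [3], [2]]
refined = propagate_scores(raw, graph)
# The weak "cloud" score rises because its neighbor "sky" is confident.
```

The design choice mirrored here is that label relationships supply evidence a per-label classifier lacks: a confident prediction for one label raises the plausibility of labels that frequently co-occur with it, improving consistency of the final multi-label output.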
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Author information
Contributions
Xiang WEN designed the algorithm. Xiang WEN and Haobo WANG drafted the paper. Ke CHEN, Tianlei HU, and Gang CHEN polished, optimized, revised, and finalized the paper.
Ethics declarations
All the authors declare that they have no conflict of interest.
Additional information
Project supported by the Key R&D Program of Zhejiang Province (No. 2024C01021), the National Regional Innovation and Development Joint Fund of China (No. U24A20254), and the Leading Talents of Technological Innovation Program of Zhejiang Province (No. 2023R5214)
List of supplementary materials
1 Implementation process
2 Evaluation results and analysis
3 Fairness and bias analysis
4 Ablation studies
Fig. S1 Performance comparison across different visual attributes on the NUS-WIDE dataset (ZSL setting)
Fig. S2 Performance comparison across label frequency groups on the NUS-WIDE dataset (ZSL setting)
Table S1 Ablation results on the test set
About this article
Cite this article
Wen, X., Wang, H., Chen, K. et al. GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning. Front Inform Technol Electron Eng 26, 2623–2637 (2025). https://doi.org/10.1631/FITEE.2500429
