Abstract
In recent years, multi-label zero-shot learning (ML-ZSL) has attracted increasing attention because of its wide range of potential applications, such as image annotation, text classification, and bioinformatics. The central challenge in ML-ZSL is to predict multiple labels for unseen classes without any labeled training data, in contrast to conventional supervised learning. Existing methods, however, face two significant obstacles: the substantial semantic gap between modalities, which impedes effective knowledge transfer, and the intricate relationships among multiple labels, which are difficult to model accurately. To overcome these challenges, we propose a graph-augmented multimodal chain-of-thought (GMCoT) reasoning approach that combines the strengths of multimodal large language models with graph-based structures to enhance the reasoning process underlying multi-label prediction. First, we present a novel multimodal chain-of-thought reasoning framework that imitates human-like step-by-step reasoning to produce multi-label predictions. Second, we introduce a technique for integrating label graphs into the reasoning process, enabling the capture of complex semantic relationships among labels and thereby improving the accuracy and consistency of multi-label generation. Comprehensive experiments on benchmark datasets demonstrate that GMCoT outperforms state-of-the-art methods in ML-ZSL.
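The idea of letting a label graph refine per-label predictions can be illustrated with a minimal sketch. This is not the authors' implementation (the paper integrates label graphs into chain-of-thought reasoning with a multimodal large language model); it only shows, under simplifying assumptions, how scores of semantically related labels can reinforce each other via propagation over a co-occurrence graph. The function and label names are hypothetical.

```python
# Illustrative sketch (not the GMCoT method itself): refine per-label
# confidence scores from a multimodal model by blending each label's score
# with the mean score of its neighbors in a label co-occurrence graph.

def propagate_scores(scores, neighbors, alpha=0.5, steps=2):
    """Iteratively blend each label's score with its graph neighbors' scores.

    scores:    list of per-label confidences (e.g., sigmoid outputs)
    neighbors: neighbors[i] lists the label indices connected to label i
    alpha:     weight given to the graph neighborhood (0 = no propagation)
    """
    refined = list(scores)
    for _ in range(steps):
        updated = []
        for i, s in enumerate(scores):
            if neighbors[i]:
                nbr_mean = sum(refined[j] for j in neighbors[i]) / len(neighbors[i])
            else:
                nbr_mean = refined[i]  # isolated label keeps its own score
            updated.append((1 - alpha) * s + alpha * nbr_mean)
        refined = updated
    return refined

# Toy example: labels 0 ("sky") and 1 ("cloud") co-occur, as do 2 and 3.
raw = [0.9, 0.2, 0.1, 0.05]
graph = [[1], [0], [3], [2]]
refined = propagate_scores(raw, graph)
# The weak "cloud" score rises because its neighbor "sky" is confident.
```

The design choice mirrored here is that label relationships supply evidence a per-label classifier lacks: a confident prediction for one label raises the plausibility of labels that frequently co-occur with it, improving consistency of the final multi-label output.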
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Author information
Contributions
Xiang WEN designed the algorithm. Xiang WEN and Haobo WANG drafted the paper. Ke CHEN, Tianlei HU, and Gang CHEN polished, optimized, revised, and finalized the paper.
Ethics declarations
All the authors declare that they have no conflict of interest.
Additional information
Project supported by the Key R&D Program of Zhejiang Province (No. 2024C01021), the National Regional Innovation and Development Joint Fund of China (No. U24A20254), and the Leading Talents of Technological Innovation Program of Zhejiang Province (No. 2023R5214)
List of supplementary materials
1 Implementation process
2 Evaluation results and analysis
3 Fairness and bias analysis
4 Ablation studies
Fig. S1 Performance comparison across different visual attributes on the NUS-WIDE dataset (ZSL setting)
Fig. S2 Performance comparison across label frequency groups on the NUS-WIDE dataset (ZSL setting)
Table S1 Ablation results on the test set
About this article
Cite this article
Wen, X., Wang, H., Chen, K. et al. GMCoT: a graph-augmented multimodal chain-of-thought reasoning framework for multi-label zero-shot learning. Front Inform Technol Electron Eng 26, 2623–2637 (2025). https://doi.org/10.1631/FITEE.2500429
