Published April 4, 2025 | Version v1
Software | Open Access

Transfer Learning for Software Vulnerability Prediction using Transformer Models [Replication Package]

  • 1. Centre for Research and Technology Hellas
  • 2. University of Macedonia

Description

This replication package was developed for our publication, in which we examined Transformer-based pre-trained models and investigated best practices for utilizing them for the downstream task of Vulnerability Prediction in software systems.

Recent research endeavors in the software security literature have paid much attention to text mining-based methods for predicting vulnerable software components through deep learning models. Progress in the field of natural language processing provides a promising direction for constructing vulnerability prediction models. Studies that employ pre-trained models, commonly based on the disruptive Transformer architecture, have started to appear in the field. This study investigates the capacity of two established Transformer-based model architectures, namely (i) the Generative Pre-trained Transformer (GPT) and (ii) Bidirectional Encoder Representations from Transformers (BERT), to enhance the vulnerability prediction process by capturing semantic and syntactic information in the source code. Specifically, we examine different ways of using the CodeGPT and CodeBERT variants of the aforementioned models to build vulnerability prediction models, in an attempt to maximize the benefit of their use for the downstream task of vulnerability prediction. In particular, we explore fine-tuning and feature-based approaches, including word-embedding and sentence-level embedding extraction methods. We also compare vulnerability prediction models based on Transformers trained on code from scratch (i.e., unimodal models) or after natural language pre-training (i.e., bimodal models). Furthermore, we compare these architectures to state-of-the-art text mining-based and graph-based approaches. The results showcase that utilizing pre-trained word embeddings to feed and train a separate deep learning predictor is a more efficient approach to vulnerability prediction than either extracting sentence-level features or fine-tuning the whole Transformer model, regardless of whether it is a bimodal or a unimodal model.
The findings also highlight the importance of context-aware embeddings and demonstrate the great benefit of representing token sequences with them in the models' attempt to identify vulnerable patterns in the source code.
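The best-performing setup above, extracting pre-trained word embeddings to feed a separate predictor, can be sketched roughly as follows via the Hugging Face transformers API. The function name and its default model are illustrative, not the package's actual code, and the downstream BiGRU predictor is omitted:

```python
# Hedged sketch of the embeddings-extraction approach: use a pre-trained
# code model as a frozen feature extractor and feed its contextual token
# embeddings to a separately trained deep learning predictor (e.g., a BiGRU).
# The function name and defaults are illustrative, not the package's code.

def extract_token_embeddings(code, model_name="microsoft/codebert-base-mlm"):
    """Return a (seq_len, hidden_size) tensor of contextual token embeddings."""
    # Imported lazily so the sketch can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference only; the Transformer stays frozen
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch=1, seq_len, hidden_size)
    return outputs.last_hidden_state.squeeze(0)
```

The returned token-level embedding sequence would then be used as a fixed input representation for the separate deep learning predictor, instead of fine-tuning the whole Transformer.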

Technical info

Extracting the vulGpt-main.zip file yields the vulGpt-main folder, which contains the replication package of the “Transfer Learning for Software Vulnerability Prediction using Transformer Models” study. Inside the vulGpt-main folder, there are two YAML files:

  • torchenv.yml file, which defines the conda environment for the experiments that use the PyTorch framework

    • all the fine-tuning scripts in the ml/transformersFineTuning/pt directories

  • tfenv.yml file, which defines the conda environment for the experiments that use the TensorFlow framework

    • all the other scripts
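Assuming conda is installed, the two environments can be created from the provided YAML files roughly as follows; the environment names used with `conda activate` are assumptions here, as the actual names are defined inside the .yml files:

```shell
# Create and activate the PyTorch environment (for the fine-tuning scripts)
conda env create -f torchenv.yml
conda activate torchenv   # actual name is defined by the "name:" field in the .yml

# Create and activate the TensorFlow environment (for all other scripts)
conda env create -f tfenv.yml
conda activate tfenv      # likewise defined inside the .yml file
```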

 

There are also three folders:

  • bigvul for the experiments on the Big-Vul dataset

  • devign for the experiments on the Devign dataset

  • reveal for the experiments on the ReVeal dataset

 

Each of these three folders contains four sub-folders:

  • data_mining: Contains the data_mining.py file that transforms the dataset into the format utilized by the ML pipelines in the ml folder. The produced files are stored in the data folder.

  • data: The folder that holds the raw data and the files produced by data_mining.py

    • In the case of Big-Vul, there is no data folder initially, as the dataset is downloaded during the execution of data_mining.py.

  • ml: The folder with the ML pipelines that actually perform the experiments of the analysis.

    • The bag_of_words/bow.py script performs the Bag-of-Words experiment

    • The transformersFineTuning/pt/ptBert.py script performs the fine-tuning of CodeBERT (choose the CodeBERT variant, e.g., model_variation = "microsoft/codebert-base-mlm"; for bigvul, this is set in line 90 of the script)

    • The transformersFineTuning/pt/ptGpt.py script performs the fine-tuning of CodeGPT (choose the CodeGPT variant, e.g., model_variation = "microsoft/CodeGPT-small-py"; for bigvul, this is set in line 91 of the script)

    • The word_embedding/llm_embs_featExtraction_dl.py script performs the embeddings-extraction approach (choose the embedding_algorithm, e.g., embedding_algorithm = "bert"; the model_variation, e.g., model_variation = "microsoft/codebert-base-mlm"; and the userModel, e.g., userModel = "bigru")

    • The word_embedding/gensim/train_embs.py script trains the static word embeddings on code (choose w2v or ft for the Word2vec or FastText method, set in line 15 for bigvul)

    • The word_embedding/gensim/w2v_dl.py and word_embedding/gensim/ft_dl.py scripts perform the Word2vec and FastText embedding approaches, respectively

  • results: The folder where the results are stored after the successful completion of the experiments. It also contains statistical_test.py, which performs the Wilcoxon signed-rank statistical test to assess the statistical significance of differences between the two result sets specified at the beginning of the file, e.g.

    • relative_path1 = os.path.join("codebert-base-mlm", "embeddingsExtraction", "bigru")

    • relative_path2 = os.path.join("codebert-base", "embeddingsExtraction", "bigru")


Each Python file (.py) can be invoked by running python filename.py from the directory of that file.
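For example, to run the Bag-of-Words experiment on the Big-Vul dataset (with the appropriate conda environment activated), the steps would look roughly like this, following the directory layout described above:

```shell
# First produce the preprocessed data files, then run an experiment.
cd bigvul/data_mining
python data_mining.py        # downloads Big-Vul and writes the files into ../data

cd ../ml/bag_of_words
python bow.py                # Bag-of-Words experiment; results go to ../../results
```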

Files

vulGpt-main.zip (20.2 MB)
md5:03175a1174cbcba63dab0a1ddb6511d6

Additional details

Dates

Available
2025-04

Software

Repository URL
https://github.com/iliaskaloup/vulGpt
Programming language
Python

References

  • I. Kalouptsoglou, M. Siavvas, A. Ampatzoglou, D. Kehagias, A. Chatzigeorgiou, Transfer Learning for Software Vulnerability Prediction using Transformer Models, in: Journal of Systems and Software (JSS), Elsevier, 2025