This article introduces how to use spaCy v3.0 to fine-tune a BERT transformer, walking through the process with a practical example. The steps are simple, fast and practical, and by the end you should be able to apply them to your own data.
BERT architecture
Fine-tuning a transformer requires a powerful GPU with parallel processing capability. For this we use Google Colab, because it provides a free server with a GPU.
In this tutorial, we will use the newly released spaCy v3.0 library to fine-tune our transformer. Below is a step-by-step guide to fine-tuning a BERT model with spaCy v3.0. The code and necessary files are provided in the GitHub repo.
To fine-tune BERT using spaCy v3.0, we need to provide training and development data in spaCy v3.0 JSON format (see here), which is then converted to .spacy binary files. We will start with data in IOB format contained in TSV files and convert it to the spaCy JSON format.
For the training dataset I annotated only 120 job descriptions with entities such as skills, diploma, diploma major and experience, plus about 70 job descriptions for the development dataset.
In this tutorial, I used the UBIAI annotation tool because it has a wide range of features, such as:
Machine learning auto-annotation
Dictionary, regex, and rule-based auto-annotation
Team collaboration to share annotation tasks
Direct annotation export to IOB format
Using the regular expression feature in UBIAI, I pre-annotated all mentions of experience that follow the pattern "\d.*\+.*", such as "5+ years of C++ experience". I then uploaded a CSV dictionary of all software languages and assigned them the entity SKILLS. Pre-annotation saves a lot of time and helps you minimize manual annotation.
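As a rough illustration of what that pattern matches (a hypothetical snippet, not UBIAI's actual implementation), you could test it with Python's re module:

Python:
import re

# Pre-annotation pattern from above: a digit followed somewhere later by a "+"
pattern = re.compile(r"\d.*\+.*")

for phrase in ["5+ years of C++ experience", "Familiar with storage server architectures"]:
    match = pattern.search(phrase)
    print(phrase, "->", "EXPERIENCE candidate" if match else "no match")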
For more information about the UBIAI annotation tool, visit the documentation page.
The exported annotations will look like this:
MS B-DIPLOMA
in O
electrical B-DIPLOMA_MAJOR
engineering I-DIPLOMA_MAJOR
or O
computer B-DIPLOMA_MAJOR
engineering I-DIPLOMA_MAJOR
. O
5+ B-EXPERIENCE
years I-EXPERIENCE
of I-EXPERIENCE
industry I-EXPERIENCE
experience I-EXPERIENCE
. I-EXPERIENCE
Familiar O
with O
storage B-SKILLS
server I-SKILLS
architectures I-SKILLS
with O
HDD B-SKILLS
To convert from IOB to JSON (see the documentation here), we use the spaCy v3.0 command:
Python:
!python -m spacy convert drive/MyDrive/train_set_bert.tsv ./ -t json -n 1 -c iob
!python -m spacy convert drive/MyDrive/dev_set_bert.tsv ./ -t json -n 1 -c iob
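For context, the .spacy files that spaCy v3 ultimately trains from are serialized DocBin objects. As an alternative to the CLI conversion used in this tutorial (a minimal sketch with placeholder file names, not part of the original workflow), you could also build a .spacy file directly from IOB-style token/tag pairs:

Python:
import spacy
from spacy.tokens import Doc, DocBin
from spacy.training import iob_to_biluo, biluo_tags_to_spans

nlp = spacy.blank("en")

# One annotated sentence in (token, IOB tag) form, mirroring the export above
rows = [("MS", "B-DIPLOMA"), ("in", "O"),
        ("electrical", "B-DIPLOMA_MAJOR"), ("engineering", "I-DIPLOMA_MAJOR")]

words = [token for token, _ in rows]
tags = [tag for _, tag in rows]

doc = Doc(nlp.vocab, words=words)
# Convert IOB tags to BILUO, then to entity spans on the Doc
doc.ents = biluo_tags_to_spans(doc, iob_to_biluo(tags))

# Serialize to the binary format expected by `spacy train`
DocBin(docs=[doc]).to_disk("./train.spacy")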
After converting to spaCy v3.0 JSON, we need to convert both the training and development JSON files to .spacy binary files using this command (update the file paths to your own):
Python:
!python -m spacy convert drive/MyDrive/train_set_bert.json ./ -t spacy
!python -m spacy convert drive/MyDrive/dev_set_bert.json ./ -t spacy

Model training
Open a new Google Colab project and make sure that GPU is selected as the hardware accelerator in the notebook settings.
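Before installing anything, it is worth confirming that the runtime actually has a GPU attached; querying the NVIDIA driver from a cell is a quick sanity check (an addition to the original steps):

# Lists the GPU assigned to the Colab runtime, its driver version and memory usage
!nvidia-smi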
To speed up the training process, we need to run parallel processing on the GPU. To do this, we install the NVIDIA CUDA 9.2 library:
Python:
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2
To check that the correct CUDA compiler is installed, run: !nvcc --version
Install the spaCy library and the spaCy transformer pipeline:
Python:
!pip install -U spacy
!python -m spacy download en_core_web_trf
Next, we install the PyTorch machine learning library configured for CUDA 9.2:
Python:
!pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
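As a quick sanity check (not part of the original instructions), you can verify that this PyTorch build can see the GPU before moving on:

Python:
import torch

# Should print a +cu92 build and True on a Colab GPU runtime
print(torch.__version__)
print(torch.cuda.is_available())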
After installing PyTorch, we need to install spaCy with its transformer extras tuned for CUDA 9.2 and set CUDA_PATH and LD_LIBRARY_PATH as follows. Finally, install the CuPy library, which is the GPU equivalent of the NumPy library:
Python:
!pip install -U spacy[cuda92,transformers]
!export CUDA_PATH="/usr/local/cuda-9.2"
!export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
!pip install cupy
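At this point it is also worth confirming that spaCy itself can reach the GPU. A small check (added here for convenience, not in the original steps):

Python:
import spacy

# Raises an error if no GPU is usable; spacy.prefer_gpu() is the non-fatal variant
spacy.require_gpu()
print("spaCy", spacy.__version__, "is using the GPU")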
SpaCy v3.0 uses a config.cfg configuration file that contains all the components needed to train the model. On the spaCy training page, you can choose the model language (English in this tutorial), the component (NER) and the hardware (GPU), and download the config template.
The only thing we need to do is fill in the paths to the train and dev .spacy files. Once done, we upload the file to Google Colab.
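For reference, the relevant section of the filled-in config.cfg might look like this (the file paths below are placeholders for your own):

[paths]
train = "drive/MyDrive/train_set_bert.spacy"
dev = "drive/MyDrive/dev_set_bert.spacy"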
Now we need to automatically populate the configuration file with the rest of the parameters required by the BERT model; all you have to do is run this command:
Python:
!python -m spacy init fill-config drive/MyDrive/config.cfg drive/MyDrive/config_spacy.cfg
If an error occurs, I recommend debugging your configuration file:
Python:
!python -m spacy debug data drive/MyDrive/config.cfg
We are finally ready to train the BERT model! Just run this command to start training:
Python:
!python -m spacy train -g 0 drive/MyDrive/config_spacy.cfg --output ./
Note: if you hit the error cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed, simply uninstall cupy and reinstall it; that should solve the problem.
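One way to do that from a Colab cell (a hedged sketch; the exact package name depends on which CuPy build ended up installed, e.g. cupy or cupy-cuda92) is:

# Reinstall CuPy so it is rebuilt against the CUDA version on the runtime
!pip uninstall -y cupy
!pip install cupy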
If all goes well, you should start to see model scores and losses being updated.
At the end of training, the model is saved in the model-best folder. The model scores are located in the meta.json file inside the model-best folder:
"performance": {
  "ents_per_type": {
    "DIPLOMA": {"p": 0.5584415584, "r": 0.6417910448, "f": 0.5972222222},
    "SKILLS": {"p": 0.6796805679, "r": 0.6742957746, "f": 0.6769774635},
    "DIPLOMA_MAJOR": {"p": 0.8666666667, "r": 0.7844827586, "f": 0.8235294118},
    "EXPERIENCE": {"p": 0.4831460674, "r": 0.3233082707, "f": 0.3873873874}
  },
  "ents_f": 0.661754386,
  "ents_p": 0.6745350501,
  "ents_r": 0.6494490358,
  "transformer_loss": 1408.9692438675,
  "ner_loss": 1269.1254348834
}
Due to the limited training dataset, these scores are certainly well below production level, but it is worth checking the model's performance on a sample job description.
Extract entities using Transformer
To test the model on the sample text, we need to load the model and run it on our text:
Python:
import spacy

nlp = spacy.load("./model-best")
text = ['''Qualifications- A thorough understanding of C# and .NET Core- Knowledge of good database design and usage- An understanding of NoSQL principles- Excellent problem solving and critical thinking skills- Curious about new technologies- Experience building cloud hosted, scalable web services- Azure experience is a plusRequirements- Bachelor's degree in Computer Science or related field (Equivalent experience can substitute for earned educational qualifications)-Minimum 4 years experience with C# and .NET-Minimum 4 years overall experience in developing commercial software''']
for doc in nlp.pipe(text, disable=["tagger", "parser"]):
    print([(ent.text, ent.label_) for ent in doc.ents])
The result is impressive for only 120 training documents: we can correctly extract most of the skills, diplomas, diploma majors and experience.
With more training data, the model would certainly improve further and produce higher scores.
This concludes the introduction to how to use spaCy v3.0 to fine-tune the BERT transformer. Thank you for reading.