This article explains how to use torchtext to import NLP datasets. The method introduced here is simple, fast, and practical; interested readers may wish to follow along.
Brief introduction
torchtext is particularly powerful for text data preprocessing, but we need to know what it can and cannot do, and how to express our requirements with it. Although torchtext is designed for PyTorch, it can also be used together with Keras, TensorFlow, and so on.
# installation
!pip3 install torchtext

The workflow of natural language processing preprocessing:
1. Dataset split: divide the data into Train/Validation/Test sets
2. File Loading: import the data files
3. Tokenization: segment each text string into a list of words
4. Vocab: build a dictionary based on the training dataset
5. Numericalize/Indexify: map each word to a numeric index according to the dictionary, which is convenient for machine learning
6. Word Vector: import pre-trained word vectors
7. Batch: if the dataset is too large to be read by the machine at once, memory will run out; the solution is to split the large dataset into smaller batches and process them batch by batch
8. Embedding Lookup: using the pre-trained word vectors, replace the index obtained for each word in step 5 with its word vector
Of the above eight steps, torchtext implements steps 2-7. Step 1 we have to do ourselves; fortunately, it is not difficult.
"The quick fox jumped over a lazy dog."# participle [" The "," quick "," fox "," jumped "," over "," a "," lazy "," dog ",". "] # build a dictionary {" The "- > 0," quick "- > 1," fox "- > 2 .} # numeric mapping (mapping each word to a corresponding index value according to the dictionary) [0,1,2,.] # Vector mapping (word vector data set trained according to imported pretraining Map words to vectors) [[0.3,0.2,0.5], [0.6,0.1,0.1], [0.8,01.0.4],.] I. dataset segmentation
Generally speaking, in machine learning we split the data into a training set and a test set. In deep learning, however, training runs for multiple rounds, and each round includes training and validation before the final test, so the data needs to be split into training, validation, and test sets.
import pandas as pd
import numpy as np

def split_csv(infile, trainfile, valtestfile, seed=999, ratio=0.2):
    df = pd.read_csv(infile)
    df["text"] = df.text.str.replace("\n", " ")  # flatten line breaks inside the text
    idxs = np.arange(df.shape[0])
    np.random.seed(seed)
    np.random.shuffle(idxs)
    val_size = int(len(idxs) * ratio)
    df.iloc[idxs[:val_size], :].to_csv(valtestfile, index=False)
    df.iloc[idxs[val_size:], :].to_csv(trainfile, index=False)

# split sms_spam.csv into train.csv and test.csv
split_csv(infile='data/sms_spam.csv',
          trainfile='data/train.csv',
          valtestfile='data/test.csv',
          seed=999,
          ratio=0.2)

# then split train.csv into dataset_train.csv and dataset_valid.csv
split_csv(infile='data/train.csv',
          trainfile='data/dataset_train.csv',
          valtestfile='data/dataset_valid.csv',
          seed=999,
          ratio=0.2)

1.1 Parameter interpretation of split_csv(infile, trainfile, valtestfile, seed, ratio)
infile: the csv file to be split
trainfile: the output training csv file
valtestfile: the output validation or test csv file
seed: the random seed, which ensures that the random split is reproducible
ratio: the proportion of the data that goes to the test (validation) set
After doing the above, we have constructed the data needed for the experiment (a quick check is sketched after the list below):
Training data (dataset_train.csv, not train.csv here)
Validation data (dataset_valid.csv)
Test data (test.csv).
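As a quick sanity check, we can compare the number of rows in the three files. This is a minimal sketch (not part of the original workflow), assuming pandas is installed and the file paths used above:

import pandas as pd

for f in ['data/dataset_train.csv', 'data/dataset_valid.csv', 'data/test.csv']:
    print(f, pd.read_csv(f).shape[0])  # number of rows in each split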
II. Word segmentation
The imported data is text in the form of strings, and we need to segment each string into a list of words. The spacy tokenizer is the most accurate choice for English; a simpler split-based tokenizer and a Chinese tokenizer (jieba) are also shown below.
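Note: tokenize1 below relies on spacy's en_core_web_sm model. If spacy.load('en_core_web_sm') raises an error because the model is not installed, a minimal sketch for downloading it first (assuming network access is available) looks like this:

import spacy

try:
    NLP = spacy.load('en_core_web_sm')
except OSError:
    # the English model is not installed yet; download it, then load it
    spacy.cli.download('en_core_web_sm')
    NLP = spacy.load('en_core_web_sm')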
import re
import spacy
import jieba

# English tokenizer
NLP = spacy.load('en_core_web_sm')
MAX_CHARS = 20000  # to reduce the amount of data processed, set a maximum text length and ignore the excess

def tokenize1(text):
    text = re.sub(r"\s", " ", text)
    if len(text) > MAX_CHARS:
        text = text[:MAX_CHARS]
    return [x.text for x in NLP.tokenizer(text) if x.text != " " and len(x.text) > 1]

# if tokenize1 does not work for you, tokenize2 can be used instead
def tokenize2(text):
    text = re.sub(r"\s", " ", text)
    if len(text) > MAX_CHARS:
        text = text[:MAX_CHARS]
    return [w for w in text.split(' ') if len(w) > 1]

# the Chinese tokenizer is relatively simple
def tokenize3(text):
    if len(text) > MAX_CHARS:
        text = text[:MAX_CHARS]
    return [w for w in jieba.lcut(text) if len(w) > 1]

print(tokenize1('Python is powerful and beautiful!'))
print(tokenize2('Python is powerful and beautiful!'))
print(tokenize3('Python 强大而美丽!'))  # a Chinese example sentence for the Chinese tokenizer
Run
['Python', 'is', 'powerful', 'and', 'beautiful']
['Python', 'is', 'powerful', 'and', 'beautiful!']
['Python', '强大', '美丽']

III. Import data
torchtext.data.TabularDataset is used to import our own dataset. Before importing, we need to define the data type of each field, and the field definitions must follow the order of the columns in the csv. Our csv file has two fields: label and text.
import pandas as pd
df = pd.read_csv('data/train.csv')
df.head()
import torch
import torchtext
from torchtext import data
import logging

LABEL = data.LabelField(dtype=torch.float)
TEXT = data.Field(tokenize=tokenize1,
                  lower=True,
                  fix_length=100,
                  stop_words=None)

train, valid, test = data.TabularDataset.splits(
    path='data',  # folder where the data resides
    train='dataset_train.csv',
    validation='dataset_valid.csv',
    test='test.csv',
    format='csv',
    skip_header=True,
    fields=[('label', LABEL), ('text', TEXT)])

train
Run
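As a rough check of what was loaded, the resulting Dataset objects can be inspected like this (a minimal sketch assuming the train, valid, and test objects created above):

print(len(train), len(valid), len(test))  # number of examples in each split
print(vars(train.examples[0]))            # the first example as a dict: {'label': ..., 'text': [...]}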
IV. Build a dictionary
Build a dictionary based on the training dataset (the train object obtained above). There are two ways to build it, one without word vectors and one with word vectors; the only difference is whether the vectors parameter is passed in.
vects = torchtext.vocab.Vectors(name='glove.6B.100d.txt', cache='data/')

TEXT.build_vocab(train,
                 max_size=2000,
                 min_freq=50,
                 vectors=vects,  # passing None here instead of vects builds the dictionary without word vectors
                 unk_init=torch.Tensor.normal_)

4.1 TEXT is a Field object
The type of this object, and of its vocab, can be checked as follows:
print(type(TEXT))
print(type(TEXT.vocab))
Run
The dictionary in word-list (itos) form; only the first 20 entries are shown here:
TEXT.vocab.itos[:20]

['<unk>', '<pad>', 'to', 'you', 'the', '...', 'and', 'is', 'in', 'me', 'it', 'my', 'for', 'your', '.', 'do', 'of', 'have', 'that', 'call']
The dictionary in dict (stoi) form:
TEXT.vocab.stoi

defaultdict(..., {'<unk>': 0, '<pad>': 1, 'to': 2, 'you': 3, 'the': 4, '...': 5, 'and': 6, 'is': 7, 'in': 8, ..., 'mother': 0, 'english': 0, 'son': 0, 'gradfather': 0, 'father': 0, 'german': 0})

4.2 Notes
The dictionary generated from the train data raises two things to note:
<unk> means an unknown word; words that are not in the dictionary are encoded as <unk>, i.e. index 0.
'german', 'father' and so on are also encoded as 0, because we required that a word appear at least 50 times (min_freq=50) to enter the dictionary; words below that threshold are all uniformly assigned the <unk> index 0 (see the quick check below).
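As a quick check (a sketch, not from the original article): stoi is a defaultdict, so looking up any word that is missing from the dictionary falls back to index 0.

print(TEXT.vocab.stoi['<unk>'])       # 0
print(TEXT.vocab.stoi['qwertyuiop'])  # also 0: an arbitrary out-of-vocabulary word maps to <unk>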
The word vector corresponding to the word 'you' (index 3):
TEXT.vocab.vectors[3]

tensor([-0.4989,  0.7660,  0.8975, -0.7855, -0.6855,  0.6261, -0.3965,  0.3491,
         0.3333, -0.4523,  0.6122,  0.0759,  0.2253,  0.1637,  0.2810, -0.2476,
         0.0099,  0.7111, -0.7586,  0.8742,  0.0031,  0.3580, -0.3523, -0.6650,
         0.3845,  0.6268, -0.5154, -0.9665,  0.6152, -0.7545, -0.0124,  1.1188,
         0.3572,  0.0072,  0.2025,  0.5011, -0.4405,  0.1066,  0.7939, -0.8095,
        -0.0156, -0.2289, -0.3420, -1.0065, -0.8763,  0.1516, -0.0853, -0.6465,
        -0.1673, -1.4499, -0.0066,  0.0048, -0.0124,  1.0474,  0.1938, -2.5991,
         0.4053,  0.4380,  1.9332,  0.4581, -0.0488,  1.4308, -0.7864, -0.2079,
         1.0900,  0.2482,  1.1487,  0.5148, -0.2183, -0.4572,  0.1389, -0.2637,
         0.1365, -0.6054,  0.0996,  0.2334,  0.1365, -0.1846, -0.0477,  0.1839,
         0.5272, -0.2885,  1.0742, -0.0467, -1.8302, -0.2120,  0.0298, -0.3096,
        -0.4339, -0.3646, -0.3274, -0.0093,  0.4721, -0.5169, -0.5918, -0.3234,
         0.2005, -0.4118,  0.4054,  0.7850])

4.3 Calculate the similarity of words
Word vectors retain more information (the relationships between words) when building features: the direction of a word's vector hints at whether two words are close or opposite in meaning, and the distance between vectors tells how near or far they are. Here we simply use cosine similarity to measure the relationship between two words; it cannot distinguish synonyms from antonyms, it only reflects distance (similarity).
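For intuition: cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends only on direction, not magnitude. A tiny illustrative sketch (the vectors here are made up, not taken from the dataset):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # approximately 1.0: parallel vectors are maximally similar regardless of magnitude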
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def similarity(word1, word2):
    word_vec1 = TEXT.vocab.vectors[TEXT.vocab.stoi[word1]].tolist()
    word_vec2 = TEXT.vocab.vectors[TEXT.vocab.stoi[word2]].tolist()
    vectors = np.array([word_vec1, word_vec2])
    return cosine_similarity(vectors)

print(similarity('you', 'your'))
Run
[[1.         0.83483314]
 [0.83483314 1.        ]]

V. The get_dataset function
Merging related code into functions improves the readability of the code. Here we combine the work of sections III and IV into a single get_dataset function.
from torchtext import data
import torchtext
import torch
import logging

LOGGER = logging.getLogger("import data")

def get_dataset(stop_words=None):
    # define the data types of the fields
    LABEL = data.LabelField(dtype=torch.float)
    TEXT = data.Field(tokenize=tokenize1,
                      lower=True,
                      fix_length=100,
                      stop_words=stop_words)

    LOGGER.debug("Ready to read the csv data...")
    train, valid, test = data.TabularDataset.splits(
        path='data',  # folder where the data is located
        train='dataset_train.csv',
        validation='dataset_valid.csv',
        test='test.csv',
        format='csv',
        skip_header=True,
        fields=[('label', LABEL), ('text', TEXT)])

    LOGGER.debug("Ready to import the word vectors...")
    vectors = torchtext.vocab.Vectors(name='glove.6B.100d.txt', cache='data/')

    LOGGER.debug("Ready to build the dictionary...")
    TEXT.build_vocab(train,
                     max_size=2000,
                     min_freq=50,
                     vectors=vectors,
                     unk_init=torch.Tensor.normal_)

    LOGGER.debug("Data import complete!")
    return train, valid, test, TEXT
Interpretation of the internal parameters of the get_dataset function
data.Field(tokenize, fix_length) defines a field:
tokenize=tokenize1 uses the English tokenizer function tokenize1.
fix_length=100 fixes the length of each tokenized text at 100 words: shorter texts are padded up to 100 words, and longer texts keep only the first 100.
data.TabularDataset.splits(train, validation, test, format, skip_header, fields) reads the training, validation, and test data, and can read multiple files at once:
train/validation/test: the csv file names for the training, validation, and test sets.
skip_header=True: if the csv has a header row, set this to True so that the header is not treated as a data record.
fields=[('label', LABEL), ('text', TEXT)] defines the field types. Note that fields must be listed in the same order as the columns in the csv header.
torchtext.vocab.Vectors(name, cache) imports the word vector file:
name='glove.6B.100d.txt': the pre-trained GloVe word vector file downloaded from the Internet (trained on a 6-billion-token corpus; each word vector has length 100).
cache='data/': the folder where the GloVe file is stored, here the data folder.
TEXT.build_vocab(max_size, min_freq, unk_init) builds the dictionary, where:
max_size=2000 sets the maximum number of words in the dictionary.
min_freq=50 requires that a word appear at least 50 times to be included in the dictionary.
unk_init=torch.Tensor.normal_: vectors for words that are not covered by the pre-trained word vectors are initialized with torch.Tensor.normal_.
A minimal usage sketch of get_dataset is shown below.
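Assuming the csv files from section I and glove.6B.100d.txt are already under data/, calling it looks like this (a sketch, not output from the original article):

train, valid, test, TEXT = get_dataset()
print(len(train), len(valid), len(test))  # number of examples in each split
print(len(TEXT.vocab))                    # dictionary size (at most max_size plus <unk> and <pad>)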
VI. Batching
If the dataset is too large to be read by the machine at once, memory will run out. The solution is to split the large dataset into smaller batches and process them batch by batch.
def split2batches(batch_size=32, device='cpu'):
    train, valid, test, TEXT = get_dataset()  # get_dataset returns train, valid, test and TEXT in order
    LOGGER.debug("Preparing to split the data into batches...")
    train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
        (train, valid, test),
        batch_size=batch_size,
        sort=False,
        device=device)
    LOGGER.debug("Batching complete!")
    return train_iterator, valid_iterator, test_iterator, TEXT

6.1 Parameter interpretation of split2batches(batch_size=32, device='cpu')
batch_size: the maximum number of samples (here, SMS messages) in each batch.
device: device='cpu' runs on the CPU and device='cuda' runs on the GPU; an ordinary computer has only the CPU. The splits call returns BucketIterator objects. A short sketch of consuming the iterators follows.
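As a hedged sketch (not from the original article) of how the iterators are typically consumed, each batch exposes the fields defined earlier as attributes; with the default Field settings and batch_size=32, the shapes below are expected:

train_iterator, valid_iterator, test_iterator, TEXT = split2batches(batch_size=32, device='cpu')

for batch in train_iterator:
    print(batch.text.shape)   # torch.Size([100, 32]): [fix_length, batch_size] (the last batch may be smaller)
    print(batch.label.shape)  # torch.Size([32])
    break                     # only look at the first batch in this sketch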
train_iterator, valid_iterator, test_iterator, TEXT = split2batches()
train_iterator
Run
View the data type of train_iterator:
type(train_iterator)
torchtext.data.iterator.BucketIterator

6.2 The BucketIterator object
Take train_iterator as an example (valid_iterator and test_iterator are the same kind of object). Because the data in this example has two fields, label and text, we can inspect them as follows.
Get the dataset of train_iterator
train_iterator.dataset
Get the eighth object in train_iterator
train_iterator.dataset.examples[7]
Get the contents of the label field of the eighth object in train_iterator
train_iterator.dataset.examples[7].label
'ham'
Get the contents of the text field of the eighth object in train_iterator
train_iterator.dataset.examples[7].text
['were', 'trying', 'to', 'find', 'chinese', 'food', 'place', 'around', 'here']

At this point, I believe you have a deeper understanding of how to use torchtext to import NLP datasets. You might as well try it out in practice.