Many readers are unfamiliar with how torchtext processes text data, so this article summarizes the topic with detailed content, clear steps, and practical reference value. I hope you get something out of it after reading.
First, an overall introduction to the components of torchtext.
Torchtext consists of the following components:
Field: holds the configuration for data preprocessing, such as the tokenization method, whether to lowercase, the start-of-sequence token, the end-of-sequence token, the padding token, the vocabulary, and so on.
Dataset: inherits from the PyTorch Dataset and is used to load data. TabularDataset can load data easily given a path, a format, and Field information. torchtext also provides pre-built Dataset objects for common datasets that can be loaded directly, and their splits method loads the training, validation, and test sets at the same time.
Iterator: an iterator that produces batches for the model; it supports customizing the batching.
1. Field
Field holds the general parameters for text processing as well as a vocabulary object, which maps the text to numeric form so that it can be turned into the desired tensor type.
The following are the parameters contained in the Field object:
sequential: whether to represent the data as a sequence; if False, no tokenization is applied. Default: True.
use_vocab: whether to use a vocabulary object; if False, the data must already be numeric. Default: True.
init_token: a token prepended to each example. Default: None.
eos_token: a token appended to each example. Default: None.
fix_length: pad or truncate every example to this length, filling with pad_token when too short. Default: None.
tensor_type: the tensor type the data is converted to. Default: torch.LongTensor.
preprocessing: a pipeline applied after tokenization and before numericalization. Default: None.
postprocessing: a pipeline applied after numericalization and before conversion to a tensor. Default: None.
lower: whether to lowercase the text. Default: False.
tokenize: the tokenization function. Default: str.split.
include_lengths: whether to return a tuple of the padded minibatch and a list of the lengths of each example. Default: False.
batch_first: whether to produce tensors with the batch dimension first. Default: False.
pad_token: the token used for padding. Default: "<pad>".
unk_token: the token used for words not in the vocabulary. Default: "<unk>".
pad_first: whether to pad at the beginning of the sequence instead of the end. Default: False.
Several important methods:
pad(minibatch): pads every example in a minibatch to the same length.
build_vocab(): builds the vocabulary.
numericalize(): converts text data to numbers and returns a tensor.
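To make these methods concrete, here is a minimal sketch (assuming the legacy torchtext data API used throughout this article; the two toy sentences are invented for illustration):

from torchtext import data

# A toy Field: whitespace tokenization, lowercasing, pad/truncate to length 6.
TEXT = data.Field(tokenize=str.split, lower=True, fix_length=6)

# preprocess() tokenizes and lowercases each raw string.
examples = [TEXT.preprocess("Hello torchtext world"), TEXT.preprocess("short")]

padded = TEXT.pad(examples)         # pad(minibatch): both examples end up length 6
TEXT.build_vocab(examples)          # build_vocab(): create the string-to-index vocabulary
tensor = TEXT.numericalize(padded)  # numericalize(): LongTensor of shape (6, 2), since batch_first=False
print(tensor)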
A simple example follows: create a Field object.

TEXT = data.Field(tokenize=data.get_tokenizer('spacy'), init_token='<sos>', eos_token='<eos>', lower=True)

2. Dataset
torchtext's Dataset inherits from the PyTorch Dataset and adds the ability to download compressed data and extract it (.zip, .gz, and .tgz are supported).
The splits method can read the training, validation, and test sets at the same time.
TabularDataset can easily read files in CSV, TSV, or JSON format. An example:
train, val, test = data.TabularDataset.splits(
    path='./data/', train='train.tsv', validation='val.tsv', test='test.tsv',
    format='tsv', fields=[('Text', TEXT), ('Label', LABEL)])
After loading the data, you can build the vocabulary, and while building it you can load pretrained word vectors:

TEXT.build_vocab(train, vectors="glove.6B.100d")

3. Iterator
Iterator produces the batches fed to the model. It provides general data handling such as shuffling and sorting, it can dynamically change the batch size, and it also has a splits method that builds iterators for the training, validation, and test sets at the same time.
The parameters are as follows:
dataset: the loaded dataset.
batch_size: the batch size.
batch_size_fn: a function that produces a dynamic batch size.
sort_key: the key used for sorting.
train: whether this is a training set.
repeat: whether to repeat the iterator across epochs.
shuffle: whether to shuffle the data.
sort: whether to sort the data.
sort_within_batch: whether to sort within each batch.
device: the device on which to build the batches. -1: CPU; 0, 1, ...: the corresponding GPU.
Usage is as follows:

train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.Text),
    batch_sizes=(32, 256, 256), device=-1)

4. Other
Torchtext provides common text datasets that can be loaded directly and used:
train, val, test = datasets.WikiText2.splits(text_field=TEXT)
The datasets currently included are:
Sentiment analysis: SST and IMDb
Question classification: TREC
Entailment: SNLI
Language modeling: WikiText-2
Machine translation: Multi30k, IWSLT, WMT14
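As an illustration, a brief sketch (under the same legacy API) of loading the IMDb sentiment dataset; the first call downloads the data to a local .data directory:

from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)
# IMDb ships only train and test splits, so splits() returns two datasets here.
train, test = datasets.IMDB.splits(TEXT, LABEL)
print(len(train), len(test))  # 25000 25000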
The complete example below builds the vocabulary and batches the data in just a few lines.
import spacy
import torch
from torchtext import data, datasets

spacy_en = spacy.load('en')

def tokenizer(text):  # create a tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text)]

TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=150)
LABEL = data.Field(sequential=False, use_vocab=False)
train, val, test = data.TabularDataset.splits(
    path='./data/', train='train.tsv', validation='val.tsv', test='test.tsv',
    format='tsv', fields=[('Text', TEXT), ('Label', LABEL)])
TEXT.build_vocab(train, vectors="glove.6B.100d")
train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.Text),
    batch_sizes=(32, 256, 256), device=-1)
vocab = TEXT.vocab
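A common follow-up, sketched here as an assumption rather than part of the original example: once build_vocab has attached the GloVe vectors, you can copy them into an embedding layer so the model starts from pretrained weights.

import torch.nn as nn

# vocab = TEXT.vocab from the example above; glove.6B.100d vectors have 100 dimensions.
embedding = nn.Embedding(len(vocab), 100)
embedding.weight.data.copy_(vocab.vectors)  # initialize the layer with the pretrained vectors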
Addendum: using TorchText to process our own datasets
TorchText can read three data formats: JSON, TSV (tab-separated values), and CSV (comma-separated values).
Processing JSON data
Starting with JSON: your data must be in JSON lines format, that is, it must look like this:
{"name": "John", "location": "United Kingdom", "age": 42, "quote": ["I", "love", "the", "united kingdom"]} {"name": "Mary", "location": "United States", "age": 36, "quote": ["I", "want", "more", "telescopes"]}
That is, each line is a JSON object. We will take data/train.json as the example.
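If your data is not in this format yet, producing it needs only the standard json module; a minimal sketch (the examples and the data/train.json path are the ones used in this article):

import json

examples = [
    {"name": "John", "location": "United Kingdom", "age": 42,
     "quote": ["I", "love", "the", "united kingdom"]},
    {"name": "Mary", "location": "United States", "age": 36,
     "quote": ["I", "want", "more", "telescopes"]},
]

# JSON lines: one object per line, no enclosing list and no commas between lines.
with open('data/train.json', 'w') as f:
    for example in examples:
        f.write(json.dumps(example) + '\n')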
Then we define the fields:
from torchtext import data
from torchtext import datasets

NAME = data.Field()
SAYING = data.Field()
PLACE = data.Field()
Next, we must tell TorchText which Field applies to which element of the JSON object.
For JSON data, we must create a dictionary in which:
The key matches the key of the JSON object.
The value is a tuple, where:
The first element becomes the attribute name of the batch object.
The second element is the Field itself.
Some considerations:
The order of keys in the fields dictionary does not matter, as long as each key matches a key in the JSON data.
The Field name does not have to match the key of the JSON object; for example, we use PLACE for the "location" key.
When working with JSON data, not all keys have to be used; for example, we do not use the "age" key.
If the value of a JSON key is a string, the Field tokenizes it (by default, splitting on spaces); if the value is a list, no tokenization is applied. In general, it is a good idea to tokenize your data in advance and store it as a list, which saves time because you do not have to wait for TorchText to do it.
The values of a JSON key do not all have to be the same type: for some examples the quote could be a string and for others a list. Tokenization is applied only to the quotes that are strings.
If you use a JSON key, every example must contain an instance of it; for example, every example here must have a name, location, and quote. However, since we do not use the age key, it does not matter if an example lacks one.
fields = {'name': ('n', NAME), 'location': ('p', PLACE), 'quote': ('s', SAYING)}
Now, in the training loop, we can iterate over the data iterator and access the name via batch.n, the location via batch.p, and the quote via batch.s, as shown in the sketch below.
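A minimal sketch of that loop (assuming a train_iterator built as in the Iterator section further below):

for batch in train_iterator:
    names = batch.n   # tensor built from the 'name' values
    places = batch.p  # tensor built from the 'location' values
    quotes = batch.s  # tensor built from the 'quote' values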
Then we use the TabularDataset.splits function to create our datasets (train_data and test_data).
The path argument specifies the top-level folder common to both datasets, and the train and test arguments specify the file name of each dataset; here, the training set is located at data/train.json.
We tell the function that we are using JSON data and pass it the fields dictionary defined earlier.
train_data, test_data = data.TabularDataset.splits(
    path='data', train='train.json', test='test.json',
    format='json', fields=fields)
If you already have a validation dataset, you can pass its path as the validation argument.
train_data, valid_data, test_data = data.TabularDataset.splits(
    path='data', train='train.json', validation='valid.json', test='test.json',
    format='json', fields=fields)
We can then look at an example to make sure it works correctly.
Notice how the attribute names (n, p, and s) match what was defined in the fields dictionary.
Also notice how "United Kingdom" in p is split into two tokens, while "united kingdom" in s is not: as mentioned earlier, TorchText assumes any JSON value that is a list is already tokenized and applies no further tokenization.
print(vars(train_data[0]))
{'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['I', 'love', 'the', 'united kingdom']}
Now we can use train_data, test_data, and valid_data to build the vocabulary and create iterators. We can use batch.n, batch.p and batch.s to access all the properties that represent names, places, and sayings, respectively.
Processing CSV/TSV data
CSV is very similar to TSV, except that the elements of a CSV file are separated by commas while those of a TSV file are separated by tabs.
Using the example above, our TSV data would be:

name    location          age    quote
John    United Kingdom    42     i love the united kingdom
Mary    United States     36     i want more telescopes

That is, each row holds one example, with its elements separated by tabs. The first row is usually a header (the name of each column), but your data may also have no header.
There can be no lists in tsv or csv data.
Fields are defined slightly differently than for JSON. Now we use a list of tuples, where each element is itself a tuple. The first element of these inner tuples becomes the attribute name of the batch object, and the second element is the Field.
Unlike with JSON data, the tuples must be in the same order as the columns of the TSV data. Therefore, when skipping a column we need a (None, None) tuple; if we left it out, our SAYING field would be applied to the age column of the TSV data, and the quote column would not be used.
However, if you only wanted to use the name and location columns, you could use just two tuples, because they are the first two columns.
We change the TabularDataset call to read the correct .tsv files and change the format argument to 'tsv'.
If your data has a header row, as ours does, it must be skipped by passing skip_header=True; otherwise TorchText will treat the header as an example. By default, skip_header is False.
fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]
train_data, valid_data, test_data = data.TabularDataset.splits(
    path='data', train='train.tsv', validation='valid.tsv', test='test.tsv',
    format='tsv', fields=fields, skip_header=True)
print(vars(train_data[0]))
{'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united', 'kingdom']}
Finally, let us look at the CSV file.
This is almost exactly the same as the TSV case, except that the format argument is set to 'csv'.

fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]
train_data, valid_data, test_data = data.TabularDataset.splits(
    path='data', train='train.csv', validation='valid.csv', test='test.csv',
    format='csv', fields=fields, skip_header=True)
print(vars(train_data[0]))
{'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united', 'kingdom']}

Why is JSON better than CSV/TSV?
CSV and TSV data cannot store lists, which means the data cannot be stored pre-tokenized: every time you run a Python script that reads the data through TorchText, it has to be tokenized again. Using an advanced tokenizer, such as the spaCy tokenizer, takes a non-trivial amount of time. It is therefore best to tokenize your datasets once and store them in JSON lines format.
If a tab appears inside TSV data, or a comma inside CSV data, TorchText will treat it as a delimiter between columns and parse the data incorrectly. Worst of all, TorchText will not warn you, because it cannot tell the difference between a tab or comma inside a field and one acting as a delimiter. Because JSON data is essentially a dictionary whose values are accessed by key, you do not have to worry about such "surprise" delimiters.
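To act on this advice, here is a hedged sketch that tokenizes a CSV file once with spaCy and stores the result in JSON lines format (the data/train.csv and data/train.json paths are illustrative assumptions):

import csv
import json
import spacy

nlp = spacy.load('en')

with open('data/train.csv') as f_in, open('data/train.json', 'w') as f_out:
    reader = csv.DictReader(f_in)  # the csv module handles quoted commas correctly
    for row in reader:
        # Replace the raw quote string with a list of tokens, done once up front.
        row['quote'] = [tok.text for tok in nlp.tokenizer(row['quote'])]
        f_out.write(json.dumps(row) + '\n')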
Iterator
Using any of the above datasets, we can build a vocabulary and create iterators.
NAME.build_vocab(train_data)
SAYING.build_vocab(train_data)
PLACE.build_vocab(train_data)
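As a quick sanity check (a sketch, assuming the fields built above), each vocabulary exposes an itos list and a stoi mapping:

print(len(PLACE.vocab))            # vocabulary size, including <unk> and <pad>
print(PLACE.vocab.itos[:6])        # the first few index-to-string entries
print(PLACE.vocab.stoi['United'])  # string-to-index lookup for the token 'United'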
We can then create the iterators, after defining the batch size and device.
By default, the training data is shuffled at each epoch, while the validation/test data is sorted. However, TorchText does not know what to sort our data by, and it will throw an error if we do not tell it.
There are two ways to handle this: you can tell the iterator not to sort the validation/test data by passing sort=False, or you can tell it how to sort the data by passing a sort_key, a function that returns a key on which to sort. For example, lambda x: x.s will sort the examples by their s attribute (that is, their quote). Ideally, you want to use a sort_key, because the BucketIterator can then sort the examples in a way that minimizes the amount of padding in each batch.
We can then iterate over the iterators to get batches of data. Note that by default, TorchText puts the batch dimension second.
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 1

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort=False,  # don't sort test/validation data
    batch_size=BATCH_SIZE,
    device=device)

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort_key=lambda x: x.s,  # sort by the s attribute (the quote)
    batch_size=BATCH_SIZE,
    device=device)

print('Train:')
for batch in train_iterator:
    print(batch)

print('Valid:')
for batch in valid_iterator:
    print(batch)

print('Test:')
for batch in test_iterator:
    print(batch)

Train:
[torchtext.data.batch.Batch of size 1]
    [.n]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.p]: [torch.cuda.LongTensor of size 2x1 (GPU 0)]
    [.s]: [torch.cuda.LongTensor of size 5x1 (GPU 0)]
[torchtext.data.batch.Batch of size 1]
    [.n]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.p]: [torch.cuda.LongTensor of size 2x1 (GPU 0)]
    [.s]: [torch.cuda.LongTensor of size 4x1 (GPU 0)]
Valid:
[torchtext.data.batch.Batch of size 1]
    [.n]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.p]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.s]: [torch.cuda.LongTensor of size 2x1 (GPU 0)]
[torchtext.data.batch.Batch of size 1]
    [.n]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.p]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.s]: [torch.cuda.LongTensor of size 4x1 (GPU 0)]
Test:
[torchtext.data.batch.Batch of size 1]
    [.n]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.p]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.s]: [torch.cuda.LongTensor of size 3x1 (GPU 0)]
[torchtext.data.batch.Batch of size 1]
    [.n]: [torch.cuda.LongTensor of size 1x1 (GPU 0)]
    [.p]: [torch.cuda.LongTensor of size 2x1 (GPU 0)]
    [.s]: [torch.cuda.LongTensor of size 3x1 (GPU 0)]

That is all for this article on how torchtext processes text data. I hope the content shared here has been helpful to you and that you now have a working understanding of the topic.