Introduction to the usage of torchtext

2025-01-18 Update | SLTechnology News&Howtos > Internet Technology


Shulou(Shulou.com)06/02 Report--

This article explains the usage of torchtext in a way that is simple, clear, and easy to learn.

torchtext contains the following components:

Field: holds the configuration for data preprocessing, such as the tokenization method, whether to lowercase, the start token, the end token, the padding token, the vocabulary, and so on.

Dataset: inherits from pytorch's Dataset and is used to load data. TabularDataset lets you load data conveniently by pointing it at a path, a format, and Field information. torchtext also provides pre-built Dataset objects for common datasets that can be loaded directly, and splits can load the training, validation, and test sets simultaneously.

Iterator: the iterator that feeds data batches to the model; it supports batch customization.

Field

Field holds the common parameters for text processing, together with a vocabulary object that maps text to numeric ids, so that the text can in turn be represented as the desired tensor type.

The following parameters are included in the Field object:

sequential: whether to represent the data as a sequence; if False, no tokenization is applied. Default: True.

use_vocab: whether to use a vocabulary object. If False, the data must already be numeric. Default: True.

init_token: token prepended to each example. Default: None.

eos_token: token appended to each example. Default: None.

fix_length: pad or truncate every example to this length, using pad_token when the example is too short. Default: None.

tensor_type: Tensor type to convert data to Default: torch.LongTensor.

preprocessing: pipeline applied after tokenization and before numericalization. Default: None.

postprocessing: pipeline applied after numericalization and before conversion to tensors. Default: None.

lower: Whether to convert data to lowercase Default: False.

tokenize: the tokenization function. Default: str.split.

include_lengths: whether to return a tuple of the padded minibatch and a list of each example's length. Default: False.

batch_first: Whether to produce tensors with the batch dimension first. Default: False.

pad_token: the token used for padding. Default: "&lt;pad&gt;".

unk_token: the token used for words not in the vocabulary. Default: "&lt;unk&gt;".

pad_first: whether to pad at the beginning of the sequence instead of the end. Default: False.

Important methods:

pad(minibatch): pad every example in a batch to the same length

build_vocab(): Build a dictionary

numericalize(): converts text data to ids and returns a tensor
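The three methods above mirror a simple text-to-tensor pipeline. As an illustrative pure-Python sketch of what build_vocab, pad, and numericalize conceptually do (this is not torchtext's actual implementation, just the idea behind it):

```python
from collections import Counter

# Toy corpus: already-tokenized examples, as Field's tokenize would produce.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# build_vocab: map each token to an integer id; reserve ids for special tokens.
specials = ["<unk>", "<pad>"]
counter = Counter(tok for example in corpus for tok in example)
itos = specials + sorted(counter)              # id -> string
stoi = {tok: i for i, tok in enumerate(itos)}  # string -> id

# pad: align every example in the batch to the longest one using the pad token.
max_len = max(len(ex) for ex in corpus)
padded = [ex + ["<pad>"] * (max_len - len(ex)) for ex in corpus]

# numericalize: replace each token with its id (unknown tokens map to <unk>).
numericalized = [[stoi.get(tok, stoi["<unk>"]) for tok in ex] for ex in padded]
print(numericalized)
```

In torchtext these steps produce a tensor rather than nested lists, but the vocabulary lookup and padding logic are the same in spirit.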

A simple example of creating a Field object:

TEXT = data.Field(tokenize=data.get_tokenizer('spacy'), init_token='&lt;sos&gt;', eos_token='&lt;eos&gt;', lower=True)

Dataset

torchtext's Dataset inherits from pytorch's Dataset and provides methods to download and extract compressed data (supports .zip, .gz, .tgz).

splits can read the training, validation, and test sets simultaneously.

TabularDataset can easily read CSV, TSV, or JSON files. Examples are as follows:

train, val, test = data.TabularDataset.splits(
    path='./data/', train='train.tsv', validation='val.tsv', test='test.tsv',
    format='tsv', fields=[('Text', TEXT), ('Label', LABEL)])
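The fields argument pairs each column of the file with the Field that should process it. As an illustrative sketch of what that column-to-Field mapping means, here is plain Python that parses an in-memory stand-in for train.tsv into (Text, Label) examples (the file contents are invented for the example; this is not torchtext's internal code):

```python
import csv
import io

# A tiny in-memory stand-in for train.tsv: a text column and a label column.
tsv_data = "the movie was great\t1\nterrible plot\t0\n"

# Conceptually, TabularDataset reads each row and hands each column to the
# Field named for it in fields=[('Text', TEXT), ('Label', LABEL)].
examples = []
for row in csv.reader(io.StringIO(tsv_data), delimiter="\t"):
    text, label = row
    examples.append({"Text": text.split(), "Label": int(label)})

print(examples[0])
```

The sequential TEXT field tokenizes its column, while the non-sequential LABEL field keeps its column as a single value.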

After loading the data, you can build the vocabulary; pretrained word vectors can be used when building it.

TEXT.build_vocab(train, vectors="glove.6B.100d")

Iterator

Iterator is torchtext's output to the model. It provides general data-handling methods such as shuffling and sorting, can dynamically adjust the batch size, and also has a splits method that can output the training, validation, and test sets at the same time.

The parameters are as follows:

dataset: loaded dataset

batch_size: Batch size.

batch_size_fn: A function that generates dynamic batch sizes

sort_key: sort key

train: whether this is the training set

repeat: whether to repeat iterations in different epochs

shuffle: whether to shuffle data

sort: whether to sort the data

sort_within_batch: Sort within batch

device: the device on which to create batches. -1: CPU; 0, 1, ...: the corresponding GPU.

Use as follows:

train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.Text),
    batch_sizes=(32, 256, 256), device=-1)

Other

torchtext provides common text datasets and can be loaded directly using:

train, val, test = datasets.WikiText2.splits(text_field=TEXT)

The datasets currently included are:

Sentiment analysis: SST and IMDb

Question classification: TREC

Entailment: SNLI

Language modeling: WikiText-2

Machine translation: Multi30k, IWSLT, WMT14

The complete example is as follows: building the vocabulary and batching the data takes just a few lines.

import spacy
import torch
from torchtext import data, datasets

spacy_en = spacy.load('en')

def tokenizer(text):  # create a tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text)]

TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=150)
LABEL = data.Field(sequential=False, use_vocab=False)

train, val, test = data.TabularDataset.splits(
    path='./data/', train='train.tsv', validation='val.tsv', test='test.tsv',
    format='tsv', fields=[('Text', TEXT), ('Label', LABEL)])

TEXT.build_vocab(train, vectors="glove.6B.100d")

train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.Text),
    batch_sizes=(32, 256, 256), device=-1)

vocab = TEXT.vocab

Thank you for reading. That concludes the introduction to the usage of torchtext; after studying this article, you should have a deeper understanding of how torchtext is used, though the specifics still need to be verified in practice.
