This article gives a brief, easy-to-follow introduction to the usage of torchtext.
torchtext contains the following components:
Field: holds the configuration for data preprocessing, such as the tokenization method, whether to lowercase, the start token, end token, padding token, the vocabulary, and so on.
Dataset: inherits from PyTorch's Dataset and is used to load data. TabularDataset takes a path, a format, and Field information to load data conveniently. torchtext also ships pre-built Dataset objects for common datasets that can be loaded directly, and the splits method can load the training, validation, and test sets at the same time.
Iterator: the iterator that feeds data to the model; it supports batch customization.
Field
A Field holds the common parameters for text processing, together with a vocabulary object that maps text data to numeric form, which in turn lets the text be represented as the desired tensor type.
A Field object takes the following parameters (a sketch combining several of them follows the list):
sequential: Whether the data represents a sequence; if False, no tokenization is applied. Default: True.
use_vocab: Whether to use a vocabulary object; if False, the data must already be numeric. Default: True.
init_token: Token prepended to each example. Default: None.
eos_token: Token appended to each example. Default: None.
fix_length: Pad or truncate every example to this length, padding with pad_token where needed. Default: None.
tensor_type: Tensor type to convert the data to. Default: torch.LongTensor.
preprocessing: Pipeline applied after tokenization and before numericalizing. Default: None.
postprocessing: Pipeline applied after numericalizing and before conversion to tensors. Default: None.
lower: Whether to lowercase the text. Default: False.
tokenize: The tokenization function. Default: str.split.
include_lengths: Whether to return a tuple of the padded minibatch and a list of the length of each example. Default: False.
batch_first: Whether to produce tensors with the batch dimension first. Default: False.
pad_token: Token used for padding. Default: "<pad>".
unk_token: Token used for out-of-vocabulary words. Default: "<unk>".
pad_first: Whether to pad at the beginning of the sequence. Default: False.
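To make a few of these concrete, here is a minimal sketch combining several options (the '<sos>'/'<eos>' strings are conventional choices for illustration, not torchtext defaults):

from torchtext import data

# Lowercase, pad/truncate every example to 10 tokens, put the batch
# dimension first, and have numericalize return (tensor, lengths) tuples.
TEXT = data.Field(sequential=True, lower=True, fix_length=10,
                  batch_first=True, include_lengths=True,
                  init_token='<sos>', eos_token='<eos>')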
Important methods:
pad(minibatch): pad every example in a batch to the same length
build_vocab(): build the vocabulary
numericalize(): convert text data to numbers and return a tensor
A simple example of creating a Field object is as follows:
TEXT = data.Field(tokenize=data.get_tokenizer('spacy'), init_token='<sos>', eos_token='<eos>', lower=True)
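A minimal sketch of how the three methods above fit together on toy, already-tokenized data (the sentences are made up for illustration):

from torchtext import data

TOY = data.Field(sequential=True, lower=True)
minibatch = [['hello', 'world'], ['torchtext', 'makes', 'preprocessing', 'easy']]

TOY.build_vocab(minibatch)                    # build the vocabulary from the tokens
padded = TOY.pad(minibatch)                   # pad every example to the longest length
tensor = TOY.numericalize(padded, device=-1)  # LongTensor of shape (seq_len, batch); -1 keeps it on the CPU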
Dataset
torchtext's Dataset inherits from PyTorch's Dataset and adds the ability to download compressed data and extract it (.zip, .gz, and .tgz are supported).
The splits method can read the training, validation, and test sets at the same time.
TabularDataset can easily read CSV, TSV, or JSON files. An example follows:
train, val, test = data.TabularDataset.splits(
    path='./data/', train='train.tsv', validation='val.tsv', test='test.tsv',
    format='tsv', fields=[('Text', TEXT), ('Label', LABEL)])
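Each loaded example then exposes the declared fields as attributes; a quick way to sanity-check the load (the printed output is illustrative):

print(len(train))               # number of training examples
print(vars(train.examples[0]))  # e.g. {'Text': ['some', 'tokens'], 'Label': '1'}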
After loading the data, you can build the vocabulary; pre-trained word vectors can be used when building it:
TEXT.build_vocab(train, vectors="glove.6B.100d")
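The pre-trained vectors then live in TEXT.vocab.vectors and can be copied into an embedding layer; a minimal sketch (the 100 matches the glove.6B.100d dimensionality):

import torch.nn as nn

vocab = TEXT.vocab
embed = nn.Embedding(len(vocab), 100)   # one row per vocabulary entry
embed.weight.data.copy_(vocab.vectors)  # initialize with the GloVe weights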
Iterator
Iterator is what torchtext feeds to the model. It provides general data handling such as shuffling and sorting, supports dynamically changing the batch size, and also has a splits method that produces the training, validation, and test iterators at the same time.
The parameters are as follows:
dataset: the loaded dataset
batch_size: the batch size
batch_size_fn: a function that produces dynamic batch sizes
sort_key: the key to sort by
train: whether this is the training set
repeat: whether to repeat the iteration across epochs
shuffle: whether to shuffle the data
sort: whether to sort the data
sort_within_batch: whether to sort within each batch
device: the device on which to create batches. -1: CPU; 0, 1, ...: the corresponding GPU
Usage is as follows:
train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.Text),
    batch_sizes=(32, 256, 256), device=-1)
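Each batch then carries the fields under the names declared earlier; a minimal consumption sketch:

for batch in train_iter:
    x = batch.Text   # token-id LongTensor, shape (seq_len, batch) since batch_first=False
    y = batch.Label
    # ... feed x and y to the model
    break            # one batch shown for illustration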
Other
torchtext also provides common text datasets that can be loaded directly:
train, val, test = datasets.WikiText2.splits(text_field=TEXT)
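For a language-modeling corpus like WikiText-2, the data is usually consumed with BPTTIterator, which cuts the corpus into fixed-length backpropagation-through-time windows; a minimal sketch (bptt_len=30 is an arbitrary choice):

TEXT.build_vocab(train)  # the vocabulary must exist before iterating
train_iter, val_iter, test_iter = data.BPTTIterator.splits(
    (train, val, test), batch_size=32, bptt_len=30, device=-1)

batch = next(iter(train_iter))
# batch.text and batch.target are the same window shifted by one token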
The datasets currently included are:
Sentiment analysis: SST and IMDb
Question classification: TREC
Entailment: SNLI
Language modeling: WikiText-2
Machine translation: Multi30k, IWSLT, WMT14
The complete example is as follows; building the vocabulary and batching the data takes only a few lines.
import spacy
import torch
from torchtext import data, datasets

spacy_en = spacy.load('en')

def tokenizer(text):  # create a tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text)]

TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=150)
LABEL = data.Field(sequential=False, use_vocab=False)

train, val, test = data.TabularDataset.splits(
    path='./data/', train='train.tsv', validation='val.tsv', test='test.tsv',
    format='tsv', fields=[('Text', TEXT), ('Label', LABEL)])

TEXT.build_vocab(train, vectors="glove.6B.100d")

train_iter, val_iter, test_iter = data.Iterator.splits(
    (train, val, test), sort_key=lambda x: len(x.Text),
    batch_sizes=(32, 256, 256), device=-1)

vocab = TEXT.vocab

Thank you for reading. The above covers the introduction to the usage of torchtext; a deeper understanding will come from verifying the specifics in practice.