How should tsv, csv, txt, and json files be processed? Many newcomers are unsure where to start, so this article summarizes the common approaches and walks through working code. Hopefully it helps you solve the problem.
Data in tsv, csv, txt, and json formats can generally be loaded with TabularDataset from torchtext.
Data requirements (a small sketch of these layouts follows this list):
tsv: the first line contains the field names, separated by tabs; every following line is a data row, with each field's value again separated by tabs.
csv: the first line contains the field names; every following line is a data row.
json: one dictionary per line; the dictionary keys are the field names and the values are the data.
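To make the three layouts concrete, here is a minimal sketch (the toy file names and rows are invented purely for illustration) that writes one tiny example file per format:

import json

# tsv: header row with tab-separated field names; each later row is one example,
# with the field values again separated by tabs
with open('toy.tsv', 'w', encoding='utf-8') as f:
    f.write("Phrase\tSentiment\n")
    f.write("a great movie\t4\n")

# csv: same idea, but comma-separated
with open('toy.csv', 'w', encoding='utf-8') as f:
    f.write("Phrase,Sentiment\n")
    f.write("a great movie,4\n")

# json: one dictionary per line; keys are the field names, values are the data
with open('toy.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps({"Phrase": "a great movie", "Sentiment": 4}) + "\n")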
This time, the dataset is in the following tsv format:
Sentiment-analysis-on-movie-reviews.zip
Format of the dataset:
Note: if some fields are missing from the test set, torchtext will have trouble processing it, so make sure the train, val, and test sets are handled with matching fields (a hedged workaround is sketched below).
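If a split really is missing a column (for example a test file without the Sentiment column), one possible workaround, sketched here with an invented file name and using the TEXT Field defined in Method 1 below, is to load that split separately with its own fields list so that the list matches the columns that actually exist:

# assumption: this test file has only the PhraseId, SentenceId and Phrase columns
test_fields = [("PhraseId", None), ("SentenceId", None), ("Phrase", TEXT)]
test_data = TabularDataset(path='data/movie-sentiment_test_nolabel.tsv',
                           format='tsv',
                           skip_header=True,
                           fields=test_fields)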
Method 1: torchtext
Task: construct a sentiment-classification dataset
Inputs: [Phrase (token sequence)]  target: [Sentiment (class label)]

from torchtext.data import Field, TabularDataset, BucketIterator
import torch

batch_size = 6
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenize_x = lambda x: x.split()
tokenize_y = lambda y: y

TEXT = Field(sequential=True, use_vocab=True, tokenize=tokenize_x, lower=True,
             batch_first=True, init_token='<sos>', eos_token='<eos>')
LABEL = Field(sequential=False, use_vocab=False, tokenize=tokenize_y,
              batch_first=True, init_token=None, eos_token=None)

# fields = {'english': ('en', ENGLISH), 'chinese': ('cn', CHINESE)}
# The first element of each tuple is the column (field) name in the tsv file
fields = [("PhraseId", None), ("SentenceId", None), ("Phrase", TEXT), ("Sentiment", LABEL)]

train_data, test_data = TabularDataset.splits(path='data',
                                              train='movie-sentiment_train.tsv',
                                              test='movie-sentiment_test.tsv',
                                              format='tsv',
                                              skip_header=True,
                                              fields=fields)

TEXT.build_vocab(train_data, max_size=10000, min_freq=2)
VOCAB_SIZE = len(TEXT.vocab)

# Operations on the vocabulary
print("vocabulary size:", VOCAB_SIZE)
print(TEXT.vocab.freqs)
print(TEXT.vocab.itos[:10])
for i, v in enumerate(TEXT.vocab.stoi):
    if i == 10:
        break
    print(v)
print(TEXT.vocab.stoi['apple'])
print('<sos> index is', TEXT.vocab.stoi['<sos>'])
print('<eos> index is', TEXT.vocab.stoi['<eos>'])

UNK_STR = TEXT.unk_token
PAD_STR = TEXT.pad_token
UNK_IDX = TEXT.vocab.stoi[UNK_STR]
PAD_IDX = TEXT.vocab.stoi[PAD_STR]
print(f'{UNK_STR} index is {UNK_IDX}')
print(f'{PAD_STR} index is {PAD_IDX}')

# Operations on the dataset
print(len(train_data))
print(train_data[0].__dict__.keys())
print(train_data[0].__dict__.values())
# vars returns the attributes of an object
print(vars(train_data.examples[0]))
print(train_data[0].Phrase)
print(train_data[0].Sentiment)

"""
batch_sizes: Tuple of batch sizes to use for the different splits,
or None to use the same batch_size for all splits.
"""
train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data),
    batch_size=32,
    batch_sizes=None,
    device=device,
    repeat=False,
    # shuffle=True,
    sort_key=lambda x: len(x.Phrase),
    sort=False,
    sort_within_batch=True)

for batch in train_iterator:
    print(batch.Phrase.shape)
    print([TEXT.vocab.itos[idx] for idx in batch.Phrase[0]])
    print(batch.Sentiment)
    break
If there is only one file to process, drop the splits method and adjust the initialization parameters accordingly. The modified code is as follows:
Fields= [("PhraseId", None), ("SentenceId", None), ("Phrase", TEXT), ("Sentiment", LABEL)] train_data = TabularDataset (path='data/movie-sentiment_train.tsv', format='tsv', skip_header=True, fields=fields) train_iterator = BucketIterator (train_data) Batch_size=batch_size, device=device, shuffle=False, repeat = False, sort_key=lambda x: len (x.Phrase), sort_within_batch=False)
Whether a field needs use_vocab=True, i.e. whether a vocabulary has to be built:
For the input text a vocabulary is needed. For labels, if the label is numeric (even though it is stored as a string in the file), no vocabulary is needed: the iterator simply applies int() to convert it to a LongTensor. If the label is not numeric, a vocabulary must be built so that the field can still be converted to a LongTensor in the iterator, as sketched below.
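A short sketch of the two cases (these Field definitions mirror the ones used elsewhere in this article and are only illustrative):

# labels stored as numeric strings such as '0'..'4': no vocabulary needed,
# the iterator casts them with int() and packs them into a LongTensor
LABEL_NUM = Field(sequential=False, use_vocab=False)

# labels stored as words such as 'positive' / 'negative': a vocabulary is required
LABEL_STR = Field(sequential=False, use_vocab=True)
# LABEL_STR.build_vocab(train_data) would then map each label string to an index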
About the difference between passing a list and a dict as the fields argument of TabularDataset:
List
The fields must be constructed in the same order as the columns in the dataset. Advantage: the first row of the dataset does not need to contain field names. Disadvantage: the train, test, and val datasets must all have exactly the same columns.
The skip_header argument of TabularDataset should be set to True or False depending on whether the first row of the dataset contains field names.
Fields = [("PhraseId", None), ("SentenceId", None), ("Phrase", TEXT), ("Sentiment", LABEL)]
Dict
When constructing the fields you can pick out only the columns you need. Advantage: the train, test, and val datasets do not have to have exactly the same columns. Disadvantage: the first row of each dataset must contain the field names. A short usage sketch follows the fields example below.
The skip_header argument of TabularDataset must be False.
fields = {'Phrase': ('Phrase', TEXT), 'Sentiment': ('Sentiment', LABEL)}
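A hedged sketch of the dict form in use, on the same training file as above (skip_header stays False because the header row is what the dictionary keys are matched against):

fields_dict = {'Phrase': ('Phrase', TEXT), 'Sentiment': ('Sentiment', LABEL)}
train_data = TabularDataset(path='data/movie-sentiment_train.tsv',
                            format='tsv',
                            skip_header=False,
                            fields=fields_dict)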
Sorting and shuffling in BucketIterator:
shuffle controls whether the order of batches is shuffled; keeping the default is recommended, i.e. the train dataset is shuffled and the other datasets are not.
sort_key=lambda x: len(x.Phrase): the key used for sorting.
sort: sorts the entire dataset in descending order; False is recommended.
sort_within_batch: sorts each batch in ascending order; True is recommended.
Method 2: writing the code by hand
Task: construct a dataset of translation type
Inputs: [english, chinese] target: [(english, en_len, chinese, cn_len), (...)]
Steps:
Tokenize the text to produce two-dimensional lists
Build a vocabulary for English and Chinese separately
Replace the English and Chinese words with their vocabulary indices
Construct the batches:
Build the index groups for each batch from the number of English sentences and the batch size
Using the batch indices, build the batch data and also return the length of each sentence.
import torch
import numpy as np
import nltk
import jieba
from collections import Counter

UNK_IDX = 0
PAD_IDX = 1
batch_size = 64
train_file = 'data/translate_train.txt'
dev_file = 'data/translate_dev.txt'

"""
Data format: english\tchinese
Reads the English-Chinese translation file and returns two lists,
with BOS and EOS added at the beginning and end of each sentence.
"""
def load_data(in_file):
    cn = []
    en = []
    with open(in_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip().split("\t")
            en.append(['BOS'] + nltk.word_tokenize(line[0].lower()) + ['EOS'])
            # cn.append(['BOS'] + [c for c in line[1]] + ['EOS'])
            cn.append(['BOS'] + jieba.lcut(line[1]) + ['EOS'])
    return en, cn

"""Create the vocabulary."""
def build_dict(sentences, max_words=50000):
    word_count = Counter()
    for sentence in sentences:
        for s in sentence:
            word_count[s] += 1
    ls = word_count.most_common(max_words)
    total_words = len(ls) + 2
    word_dict = {w[0]: index for index, w in enumerate(ls, 2)}
    word_dict['UNK'] = UNK_IDX
    word_dict['PAD'] = PAD_IDX
    return word_dict, total_words

# Turn the sentences into indices
def encode(en_sentences, cn_sentences, en_dict, cn_dict, sort_by_len=True):
    """Encode the sequences."""
    length = len(en_sentences)
    # convert the words of a sentence into the corresponding dictionary indices
    out_en_sentences = [[en_dict.get(w, 0) for w in sent] for sent in en_sentences]
    out_cn_sentences = [[cn_dict.get(w, 0) for w in sent] for sent in cn_sentences]

    def len_argsort(seq):
        return sorted(range(len(seq)), key=lambda x: len(seq[x]))

    if sort_by_len:
        sorted_index = len_argsort(out_en_sentences)
        out_en_sentences = [out_en_sentences[i] for i in sorted_index]
        out_cn_sentences = [out_cn_sentences[i] for i in sorted_index]
    return out_en_sentences, out_cn_sentences

def get_minibatches(n, minibatch_size, shuffle=False):
    # start indices of each batch: [0, minibatch_size, 2 * minibatch_size, ...]
    idx_list = np.arange(0, n, minibatch_size)
    if shuffle:
        np.random.shuffle(idx_list)
    minibatches = []
    for idx in idx_list:
        minibatches.append(np.arange(idx, min(idx + minibatch_size, n)))
    return minibatches

def prepare_data(seqs, padding_idx):
    lengths = [len(seq) for seq in seqs]
    n_samples = len(seqs)
    max_len = np.max(lengths)
    x = np.full((n_samples, max_len), padding_idx).astype('int32')
    x_lengths = np.array(lengths).astype("int32")
    for idx, seq in enumerate(seqs):
        x[idx, :lengths[idx]] = seq
    return x, x_lengths  # x_mask

def gen_examples(en_sentences, cn_sentences, batch_size):
    minibatches = get_minibatches(len(en_sentences), batch_size)
    all_ex = []
    for minibatch in minibatches:
        mb_en_sentences = [en_sentences[t] for t in minibatch]
        mb_cn_sentences = [cn_sentences[t] for t in minibatch]
        mb_x, mb_x_len = prepare_data(mb_en_sentences, PAD_IDX)
        mb_y, mb_y_len = prepare_data(mb_cn_sentences, PAD_IDX)
        all_ex.append((mb_x, mb_x_len, mb_y, mb_y_len))
    return all_ex

train_en, train_cn = load_data(train_file)
dev_en, dev_cn = load_data(dev_file)
en_dict, en_total_words = build_dict(train_en)
cn_dict, cn_total_words = build_dict(train_cn)
inv_en_dict = {v: k for k, v in en_dict.items()}
inv_cn_dict = {v: k for k, v in cn_dict.items()}
train_en, train_cn = encode(train_en, train_cn, en_dict, cn_dict)
dev_en, dev_cn = encode(dev_en, dev_cn, en_dict, cn_dict)
print(" ".join([inv_cn_dict[i] for i in train_cn[100]]))
print(" ".join([inv_en_dict[i] for i in train_en[100]]))
train_data = gen_examples(train_en, train_cn, batch_size)
dev_data = gen_examples(dev_en, dev_cn, batch_size)
print(len(train_data))
print(train_data[0])
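Each element of train_data is a (mb_x, mb_x_len, mb_y, mb_y_len) tuple of numpy arrays. As a hedged usage sketch (the variable names are only for illustration), a training loop would typically turn one batch into tensors like this:

for mb_x, mb_x_len, mb_y, mb_y_len in train_data:
    src = torch.from_numpy(mb_x).long()          # (batch, max_en_len) English indices
    src_len = torch.from_numpy(mb_x_len).long()  # true lengths before padding
    trg = torch.from_numpy(mb_y).long()          # (batch, max_cn_len) Chinese indices
    trg_len = torch.from_numpy(mb_y_len).long()
    print(src.shape, trg.shape)
    break  # only checking shapes here; a model forward pass would go in this loop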
Have you mastered how to handle tsv, csv, txt, and json files? Thank you for reading!