
The method of applying the NLP counting method to the PTB dataset


Today, I would like to share with you the relevant knowledge points of the method of applying NLP counting to the PTB dataset. The content is detailed and the logic is clear. I believe most people still don't know much about this topic, so I am sharing this article for your reference. I hope you get something out of it after reading.

PTB data set

The format is as follows:

One sentence is saved per line.

Rare words are replaced with the special token <unk>.

Concrete numbers are replaced with "N".

For example, an excerpt from the training data looks like this:

we 're talking about years ago before anyone heard of asbestos having any questionable properties there is no asbestos in our products now neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute dr. <unk> led a team of researchers from the national cancer institute and the medical schools of harvard university and boston university
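
As a rough illustration of those conventions (this is not the official preprocessing script, and the toy vocabulary below is an assumption made only for this example), a minimal Python sketch of PTB-style replacement might look like this:

import re

def ptb_style(line, vocab):
    # Lowercase the line, turn bare numbers into 'N' and out-of-vocabulary words into '<unk>'.
    out = []
    for tok in line.lower().split():
        if re.fullmatch(r'\d+(\.\d+)?', tok):
            out.append('N')
        elif tok not in vocab:
            out.append('<unk>')
        else:
            out.append(tok)
    return ' '.join(out)

vocab = {'years', 'old', 'will', 'join', 'the', 'board'}  # toy vocabulary, purely illustrative
print(ptb_style('Pierre Vinken 61 years old will join the board', vocab))
# -> <unk> <unk> N years old will join the board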

Use the PTB dataset:

From the following line we can see that when the PTB dataset is used, all the sentences are joined end to end into one sequence:

words = open(file_path).read().replace('\n', '<eos>').strip().split()
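
That is, every newline is turned into an '<eos>' token, so the whole file becomes one flat word list in which '<eos>' marks the end of each original sentence. A small sketch (with a made-up word list) of how the individual sentences could be recovered from such a stream:

# Illustrative only: split a flat word stream back into sentences at '<eos>'.
words = ['there', 'is', 'no', 'asbestos', 'in', 'our', 'products', 'now', '<eos>',
         'we', 'have', 'no', 'useful', 'information', '<eos>']

sentences, current = [], []
for w in words:
    if w == '<eos>':
        sentences.append(current)
        current = []
    else:
        current.append(w)

print(len(sentences))  # 2
print(sentences[0])    # ['there', 'is', 'no', 'asbestos', 'in', 'our', 'products', 'now']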

ptb.py downloads the PTB dataset, saves it to a location in the dataset folder, and then processes it to produce corpus, word_to_id and id_to_word.

import sys
import os
sys.path.append('..')
try:
    import urllib.request
except ImportError:
    raise ImportError('Use Python 3!')
import pickle
import numpy as np

url_base = 'https://raw.githubusercontent.com/tomsercu/lstm/master/data/'
key_file = {
    'train': 'ptb.train.txt',
    'test': 'ptb.test.txt',
    'valid': 'ptb.valid.txt'
}
save_file = {
    'train': 'ptb.train.npy',
    'test': 'ptb.test.npy',
    'valid': 'ptb.valid.npy'
}
vocab_file = 'ptb.vocab.pkl'

dataset_dir = os.path.dirname(os.path.abspath(__file__))


def _download(file_name):
    file_path = dataset_dir + '/' + file_name
    if os.path.exists(file_path):
        return

    print('Downloading ' + file_name + ' ...')
    try:
        urllib.request.urlretrieve(url_base + file_name, file_path)
    except urllib.error.URLError:
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib.request.urlretrieve(url_base + file_name, file_path)
    print('Done')


def load_vocab():
    vocab_path = dataset_dir + '/' + vocab_file

    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name

    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()

    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word


def load_data(data_type='train'):
    '''
    :param data_type: type of data: 'train' or 'test' or 'valid (val)'
    :return: corpus, word_to_id, id_to_word
    '''
    if data_type == 'val':
        data_type = 'valid'
    save_path = dataset_dir + '/' + save_file[data_type]

    word_to_id, id_to_word = load_vocab()

    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word

    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name
    _download(file_name)

    words = open(file_path).read().replace('\n', '<eos>').strip().split()
    corpus = np.array([word_to_id[w] for w in words])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word


if __name__ == '__main__':
    for data_type in ('train', 'val', 'test'):
        load_data(data_type)

Using ptb.py

corpus holds the list of word IDs, id_to_word is the dictionary that converts word IDs back into words, and word_to_id is the dictionary that converts words into IDs.
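
As a quick sanity check (a hedged sketch, assuming ptb.py sits in the dataset package as above), the two dictionaries are inverses of each other, so a slice of corpus can be mapped to words and back again:

import sys
sys.path.append('..')
from dataset import ptb

corpus, word_to_id, id_to_word = ptb.load_data('train')

first_words = [id_to_word[int(i)] for i in corpus[:5]]
print(first_words)                           # the first five words of the training text
print([word_to_id[w] for w in first_words])  # maps back to [0, 1, 2, 3, 4]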

Use ptb.load_data() to load the data. The arguments 'train', 'test' and 'valid' correspond to the training data, test data and validation data respectively.

import sys
sys.path.append('..')
from dataset import ptb

corpus, word_to_id, id_to_word = ptb.load_data('train')

print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['car']:", word_to_id['car'])
print("word_to_id['happy']:", word_to_id['happy'])
print("word_to_id['lexus']:", word_to_id['lexus'])

Results:

corpus size: 929589
corpus[:30]: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]
id_to_word[0]: aer
id_to_word[1]: banknote
id_to_word[2]: berlitz

word_to_id['car']: 3856
word_to_id['happy']: 4428
word_to_id['lexus']: 7426

Process finished with exit code 0

Counting method applied to the PTB dataset

In fact, the only difference between using the PTB dataset and not using it lies in this line:

corpus, word_to_id, id_to_word = ptb.load_data('train')
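
For comparison, without the PTB dataset the same three variables are usually built by hand from a tiny corpus. A minimal sketch of that case (the sample sentence and the helper below are illustrative, not taken from the book's common.util module):

def build_corpus(text):
    # Build corpus, word_to_id and id_to_word from a single sentence.
    words = text.lower().replace('.', ' .').split()
    word_to_id, id_to_word = {}, {}
    for w in words:
        if w not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[w] = new_id
            id_to_word[new_id] = w
    corpus = [word_to_id[w] for w in words]
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = build_corpus('You say goodbye and I say hello.')
print(corpus)      # [0, 1, 2, 3, 4, 1, 5, 6]
print(word_to_id)  # {'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}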

The following line performs the dimensionality reduction:

word_vecs = U[:, :wordvec_size]
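
To make that slicing concrete, here is a toy sketch of the same idea (the matrix is made up): the SVD factorizes the matrix, and keeping only the first wordvec_size columns of U gives a dense, lower-dimensional vector for each row (each word).

import numpy as np

# A made-up 4 x 6 "PPMI-like" matrix: 4 words, 6 context dimensions.
X = np.array([[1., 0., 2., 0., 1., 0.],
              [0., 3., 0., 1., 0., 0.],
              [2., 0., 1., 0., 0., 1.],
              [0., 1., 0., 2., 0., 0.]])

U, S, Vt = np.linalg.svd(X)
wordvec_size = 2
word_vecs = U[:, :wordvec_size]  # each of the 4 words becomes a 2-dimensional vector

print(word_vecs.shape)  # (4, 2)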

The most time-consuming part of the whole program is the following function call:

W = ppmi(C, verbose=True)
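
ppmi comes from the book's common.util module. As a rough sketch of what such a function presumably computes (an assumption about its internals, not the exact library code), positive PMI is max(0, log2(C[i,j] * N / (S[i] * S[j]))), and the double loop over the roughly 10,000 x 10,000 PTB co-occurrence matrix is what makes this step so slow:

import numpy as np

def ppmi_sketch(C, eps=1e-8):
    # C: co-occurrence matrix; returns the positive pointwise mutual information matrix.
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)          # total number of co-occurrences
    S = np.sum(C, axis=0)  # per-word occurrence counts
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[i] * S[j]) + eps)
            M[i, j] = max(0, pmi)  # clip negative PMI to zero
    return M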

Complete code:

import sys
sys.path.append('..')
import numpy as np
from common.util import most_similar, create_co_matrix, ppmi
from dataset import ptb

window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('counting co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
# try:
#     truncated SVD (fast!)
print("ok")
from sklearn.utils.extmath import randomized_svd
U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5, random_state=None)
# except ImportError:
#     SVD (slow)
#     U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
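
most_similar also comes from common.util. A hedged sketch of the kind of cosine-similarity ranking it presumably performs (again an assumption about its internals, not the book's exact code):

import numpy as np

def most_similar_sketch(query, word_to_id, id_to_word, word_vecs, top=5):
    # Rank all words by cosine similarity to the query word's vector and print the top matches.
    q = word_vecs[word_to_id[query]]
    norms = np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(q) + 1e-8
    sims = word_vecs @ q / norms
    print('[query]', query)
    for i in (-sims).argsort()[1:top + 1]:  # skip the query word itself at rank 0
        print(' %s: %s' % (id_to_word[int(i)], sims[int(i)]))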

The following is the result obtained with the ordinary np.linalg.svd(W):

[query] you
 i: 0.7016294002532959
 we: 0.6388039588928223
 anybody: 0.5868048667907715
 do: 0.5612815618515015
 'll: 0.512611985206604

[query] year
 month: 0.6957005262374878
 quarter: 0.691483736038208
 earlier: 0.6661213636398315
 last: 0.6327787041664124
 third: 0.6230476498603821

[query] car
 luxury: 0.6767407655715942
 auto: 0.69930295944214
 vehicle: 0.5927712635993958
 cars: 0.58883762361914
 truck: 0.5693153154211

[query] toyota
 motor: 0.748813785362437
 nissan: 0.771931664014
 motors: 0.3664292929292443344660.677

The following result uses the randomized_svd method from the sklearn module. It performs a truncated SVD based on random sampling and only computes the components with the largest singular values, so it is much faster than the conventional SVD.
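
As a small check of that claim on toy data (a hedged sketch; the matrix below is made up and much smaller than the PTB PPMI matrix), randomized_svd recovers essentially the same leading singular values as the full SVD while computing only n_components of them:

import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
A = rng.rand(300, 20).astype(np.float32)
W_toy = A @ A.T  # a 300 x 300 matrix of rank at most 20

U_full, S_full, V_full = np.linalg.svd(W_toy)
U_rand, S_rand, V_rand = randomized_svd(W_toy, n_components=10, n_iter=5, random_state=0)

print(S_full[:3])  # largest singular values from the full SVD
print(S_rand[:3])  # nearly identical values from the truncated, randomized SVD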

calculating SVD ...
ok
[query] you
 i: 0.6678948998451233
 we: 0.6213737726211548
 something: 0.560122013092041
 do: 0.5594725608825684
 someone: 0.5490139126777649

[query] year
 month: 0.6444296836853027
 quarter: 0.6192560791969299
 next: 0.6152222156524658
 fiscal: 0.5712860226631165
 earlier: 0.5641934871673584

[query] car
 luxury: 0.6612467765808105
 auto: 0.6166062355041504
 corsica: 0.5270425081253052
 cars: 0.5142025947570801
 truck: 0.5030257105827332

[query] toyota
 motor: 0.7747215628623962
 motors: 0.6871038675308228
 lexus: 0.6786072850227356
 nissan: 0.6618651151657104
 mazda: 0.6237337589263916

Process finished with exit code 0

The above is the full content of the article "The method of applying the NLP counting method to the PTB dataset". Thank you for reading! I believe you will gain a lot from it.
