Many people who are new to NLP don't know how to build an English parser in Python, so this article works through the problem and a solution. I hope it helps you solve the same problem yourself.
A parser describes the grammatical structure of a sentence in a form that other applications can use for reasoning. Natural language is full of ambiguities that our knowledge of the world resolves instantly. Here is an example I like very much:
They ate the pizza with anchovies
The correct parse attaches "with" to "pizza", while the incorrect parse attaches "with" to "ate".
Over the past few years, the natural language processing (NLP) community has made great progress in parsing. It is now possible for a small Python implementation to perform better than the widely used Stanford parser.
Parser      Accuracy   Speed (words/sec)   Language   Lines of code
Stanford    89.6%      19                  Java       > 50,000 [1]
parser.py   89.8%      2,020               Python
Redshift    93.6%      2,580               Cython     ~ 4,000
The rest of this article first sets up the problem and then walks you through a concise implementation. The first 200 lines of parser.py, which implement the part-of-speech tagger and its learner, are described in a separate article. Unless you are very familiar with NLP research, you should at least skim that article before studying this one.
Problem description
It would be very friendly to be able to give your phone a command like:
Set volume to zero when I'm in a meeting, unless John's school calls.
and have it set up the appropriate policy. On Android you can already do this kind of thing with Tasker, but a natural-language interface would be much better. It would be especially friendly to get back an editable semantic representation, so you could see what it thinks you meant and correct it.
There are many problems to solve to make this work, but some kind of syntactic representation is absolutely necessary. We need to know that:
Unless John's school calls, when I'm in a meeting, set volume to zero
is another way of phrasing the same instruction, while
Unless John's school, call when I'm in a meeting
means something completely different.
A dependency parser returns a graph of word-to-word relationships that makes this kind of reasoning easier. The graph is a tree: its edges are directed, and every node (word) has exactly one incoming arc, pointing from its head (the word it depends on).
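As a hand-worked illustration (not parser output, and using a simplified convention of 0-based token positions with -1 marking the root word), the head indices for the earlier sentence "They ate the pizza with anchovies" would look like this:

tokens = ['They', 'ate', 'the', 'pizza', 'with', 'anchovies']
heads = [1, -1, 3, 1, 3, 4]
# heads[0] == 1: "They" depends on "ate"
# heads[1] == -1: "ate" is the root of the sentence
# heads[4] == 3: "with" attaches to "pizza" (the correct reading);
# the incorrect reading would set heads[4] == 1, attaching "with" to "ate"

The parser.py example below uses its own indexing conventions, but the idea is the same: one head index per word.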
Usage example:
>>> parser = parser.Parser()
>>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
>>> tags, heads = parser.parse(tokens)
>>> heads
[-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
>>> for i, h in enumerate(heads):
...     head = tokens[h] if h >= 1 else 'None'
...     print(tokens[i] + ' <-- ' + head)

The features the model scores are extracted from the parser's current state: the top words of the stack, the first words of the buffer, and the children recorded in the partial parse built so far:

def extract_features(words, tags, n0, n, stack, parse):
    def get_stack_context(depth, stack, data):
        if depth >= 3:
            return data[stack[-1]], data[stack[-2]], data[stack[-3]]
        elif depth >= 2:
            return data[stack[-1]], data[stack[-2]], ''
        elif depth == 1:
            return data[stack[-1]], '', ''
        else:
            return '', '', ''

    def get_buffer_context(i, n, data):
        if i + 1 >= n:
            return data[i], '', ''
        elif i + 2 >= n:
            return data[i], data[i + 1], ''
        else:
            return data[i], data[i + 1], data[i + 2]

    def get_parse_context(word, deps, data):
        if word == -1:
            return 0, '', ''
        deps = deps[word]
        valency = len(deps)
        if not valency:
            return 0, '', ''
        elif valency == 1:
            return 1, data[deps[-1]], ''
        else:
            return valency, data[deps[-1]], data[deps[-2]]

    features = {}
    # Set up the context pieces --- the word, W, and tag, T, of:
    # s0, s1, s2: top three words on the stack
    # n0, n1, n2: first three words of the buffer
    # n0b1, n0b2: two leftmost children of the first word of the buffer
    # s0b1, s0b2: two leftmost children of the top word of the stack
    # s0f1, s0f2: two rightmost children of the top word of the stack
    depth = len(stack)
    s0 = stack[-1] if depth else -1

    Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words)
    Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags)

    Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words)
    Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags)

    Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words)
    Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags)

    Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words)
    _, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags)

    Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words)
    _, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags)

    Vs0f, Ws0f1, Ws0f2 = get_parse_context(s0, parse.rights, words)
    _, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags)

    # Cap numeric features at 5
    # String distance between s0 and n0
    Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0

    features['bias'] = 1
    # Add word and tag unigrams
    for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2):
        if w:
            features['w=%s' % w] = 1
    for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2):
        if t:
            features['t=%s' % t] = 1

    # Add word/tag pairs
    for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))):
        if w or t:
            features['%d w=%s, t=%s' % (i, w, t)] = 1

    # Add some bigrams
    features['s0w=%s, n0w=%s' % (Ws0, Wn0)] = 1
    features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1
    features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1
    features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1
    features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1
    features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1
    features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1
    features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1

    # Add some tag trigrams
    trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0),
                (Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1),
                (Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2),
                (Ts0, Ts1, Ts1))
    for i, (t1, t2, t3) in enumerate(trigrams):
        if t1 or t2 or t3:
            features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1

    # Add some valency and distance features
    vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b))
    vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b))
    d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0),
         ('t' + Tn0 + Ts0, Ds0n0), ('w' + Wn0 + Ws0, Ds0n0))
    for i, (w_t, v_d) in enumerate(vw + vt + d):
        if w_t or v_d:
            features['val/d-%d %s %d' % (i, w_t, v_d)] = 1
    return features
Training

We learn the weights using the same algorithm as for part-of-speech tagging: the averaged perceptron. Its main strength is that it is an online learning algorithm: examples stream in one at a time, we make a prediction, check the true answer, and adjust the weights if the prediction was wrong.
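The model object that the training loop scores and updates is not reproduced here, so the following is only a rough sketch of what an averaged perceptron over these sparse binary features might look like; the class name and internals are illustrative assumptions, not parser.py's exact code. score sums the weight of every active feature for each move, update shifts weight from the wrongly guessed move to the correct one, and the totals and timestamps implement the averaging trick lazily.

from collections import defaultdict

class PerceptronModel(object):
    """Illustrative averaged perceptron over sparse binary features."""
    def __init__(self, classes):
        self.classes = list(classes)                 # e.g. the parser's moves
        self.weights = defaultdict(lambda: defaultdict(float))   # feature -> class -> weight
        self._totals = defaultdict(lambda: defaultdict(float))   # accumulators for averaging
        self._tstamps = defaultdict(lambda: defaultdict(int))    # last update time per weight
        self.i = 0                                   # number of updates seen so far

    def score(self, features):
        # Sum the weights of the active features for each class (move).
        scores = dict((clas, 0.0) for clas in self.classes)
        for feat, value in features.items():
            if value == 0 or feat not in self.weights:
                continue
            for clas, weight in self.weights[feat].items():
                scores[clas] += value * weight
        return scores

    def update(self, truth, guess, features):
        # Standard perceptron update: reward the correct move, penalise the guess.
        self.i += 1
        if truth == guess:
            return
        for feat in features:
            self._update_feat(truth, feat, 1.0)
            self._update_feat(guess, feat, -1.0)

    def _update_feat(self, clas, feat, delta):
        # Lazily accumulate the weight's value over the steps it was untouched,
        # so average_weights can compute the running mean later.
        self._totals[feat][clas] += (self.i - self._tstamps[feat][clas]) * self.weights[feat][clas]
        self._tstamps[feat][clas] = self.i
        self.weights[feat][clas] += delta

    def average_weights(self):
        # Replace each weight by its average value over all updates.
        for feat, weights in self.weights.items():
            for clas in list(weights):
                self._totals[feat][clas] += (self.i - self._tstamps[feat][clas]) * weights[clas]
                weights[clas] = self._totals[feat][clas] / float(self.i or 1)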
The training loop looks like this:
class Parser(object):
    ...
    def train_one(self, itn, words, gold_tags, gold_heads):
        n = len(words)
        i = 2; stack = [1]; parse = Parse(n)
        tags = self.tagger.tag(words)
        while stack or (i + 1) < n:
            features = extract_features(words, tags, i, n, stack, parse)
            scores = self.model.score(features)
            valid_moves = get_valid_moves(i, n, len(stack))
            guess = max(valid_moves, key=lambda move: scores[move])
            gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads)
            best = max(gold_moves, key=lambda move: scores[move])
            self.model.update(best, guess, features)
            i = transition(guess, i, stack, parse)
        # Return the number of correct heads
        return len([i for i in range(n - 1) if parse.heads[i] == gold_heads[i]])
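The training loop relies on a small transition-system layer: a Parse object that records heads and children, get_valid_moves, and transition. Here is a minimal sketch of how these might look for the arc-hybrid system used here; treat it as illustrative rather than parser.py's exact code. SHIFT pushes the first buffer word onto the stack, LEFT attaches the stack top to the first buffer word, and RIGHT attaches the stack top to the word beneath it on the stack.

SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = (SHIFT, RIGHT, LEFT)

class Parse(object):
    """Partial parse state: a head for each word, plus left/right children."""
    def __init__(self, n):
        self.n = n
        self.heads = [None] * n
        self.lefts = [[] for _ in range(n + 1)]
        self.rights = [[] for _ in range(n + 1)]

    def add(self, head, child):
        self.heads[child] = head
        if child < head:
            self.lefts[head].append(child)
        else:
            self.rights[head].append(child)

def get_valid_moves(i, n, stack_depth):
    moves = []
    if (i + 1) < n:       # buffer not exhausted, so we can shift
        moves.append(SHIFT)
    if stack_depth >= 2:  # need s0 and s1 to build a rightward arc
        moves.append(RIGHT)
    if stack_depth >= 1:  # need s0 to build a leftward arc to n0
        moves.append(LEFT)
    return moves

def transition(move, i, stack, parse):
    """Apply a move to the state; return the new buffer index."""
    if move == SHIFT:
        stack.append(i)
        return i + 1
    elif move == RIGHT:
        parse.add(stack[-2], stack.pop())  # s0 becomes a right child of s1
        return i
    elif move == LEFT:
        parse.add(i, stack.pop())          # s0 becomes a left child of n0
        return i
    raise ValueError("unknown move: %r" % move)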
The most interesting part of the training process is get_gold_moves. Thanks to Goldberg and Nivre (2012), our parser's performance can be improved: they showed that we had been doing this wrong for years.

In the part-of-speech tagging article, I reminded you that during training you must pass in the last two predicted tags as features for the current tag, not the last two gold tags. At test time you only have predicted tags, so if your training features are based on the gold sequence, the training contexts won't resemble the test contexts, and you will learn the wrong weights.

In parsing, the problem was that we didn't know how to pass in a predicted sequence! Training worked by taking the gold-standard tree and finding a transition sequence that produces it: you got back a sequence of moves which, if executed, is guaranteed to yield the gold-standard dependencies.

The trouble is that we didn't know how to teach the parser the "correct" move if it was in any state that was not on the gold-standard sequence. Once the parser had made an error, we didn't know how to train from that example. That was a big problem, because it meant that once the parser started making mistakes, it would end up in states unlike anything in its training data, leading to even more mistakes. The problem was specific to greedy parsers: once you use a beam, there is a natural way to do structured prediction.

Like all the best breakthroughs, the solution seems obvious once you understand it. What we do is define a function that asks "how many gold-standard dependencies can still be recovered from this state?". If you can define that function, you can apply each move in turn and ask the same question of the resulting state. If a move leaves fewer gold-standard dependencies reachable, it is sub-optimal.

There is a lot to take in here. So we have a function Oracle(state):

Oracle(state) = | gold_arcs ∩ reachable_arcs(state) |

We also have a set of moves, each of which returns a new state. We want to know:

shift_cost = Oracle(state) - Oracle(shift(state))
right_cost = Oracle(state) - Oracle(right(state))
left_cost = Oracle(state) - Oracle(left(state))

Now, at least one of these moves must cost 0. Oracle(state) asks "what is the cost of the best path forward?", and the first step of that best path has to be a Shift, a Right, or a Left.

It turns out that the Oracle simplifies a lot for many transition systems, including the one we are using, arc-hybrid, a variant proposed by Goldberg and Nivre (2013).

We implement the oracle as a method that returns the zero-cost moves, rather than implementing a function Oracle(state). This saves us a lot of expensive copy operations. Hopefully the reasoning in the code isn't too hard to follow; if you're confused and want to get to the bottom of it, consult Goldberg and Nivre's papers.

def get_gold_moves(n0, n, stack, heads, gold):
    def deps_between(target, others, gold):
        for word in others:
            if gold[word] == target or gold[target] == word:
                return True
        return False

    valid = get_valid_moves(n0, n, len(stack))
    if not stack or (SHIFT in valid and gold[n0] == stack[-1]):
        return [SHIFT]
    if gold[stack[-1]] == n0:
        return [LEFT]
    costly = set([m for m in MOVES if m not in valid])
    # If the word behind s0 is its gold head, Left is incorrect
    if len(stack) >= 2 and gold[stack[-1]] == stack[-2]:
        costly.add(LEFT)
    # If there are any dependencies between n0 and the stack,
    # pushing n0 will lose them.
    if SHIFT not in costly and deps_between(n0, stack, gold):
        costly.add(SHIFT)
    # If there are any dependencies between s0 and the buffer, popping
    # s0 will lose them.
    if deps_between(stack[-1], range(n0 + 1, n - 1), gold):
        costly.add(LEFT)
        costly.add(RIGHT)
    return [m for m in MOVES if m not in costly]
The "dynamic oracle" training process makes a big difference in accuracy-usually 1-2%, no different from the way it is run. The old "static oracle" greedy training process is completely out of date; there is no reason to do that.
Summary
I used to feel that language technology, especially anything involving grammar, was mysterious. I couldn't imagine what kind of program could do the job.
I think it's natural for people to assume the solutions must be hugely complex; a 200,000-line Java package feels about right.
However, when you only need to implement a single algorithm, the code can be very short. When you implement just one algorithm, you know what to write before you write it, and you don't need any of the unnecessary abstractions, which carry a real performance cost.
Notes
[1] I'm really not sure how to count the lines of code in the Stanford parser. Its jar file holds around 200k of content, including a large number of different models. It's not important, but around 50k seems a safe estimate.
[2] For example, how do you parse "John's school of music calls"? You need to make sure the phrase "John's school" is analysed with the same structure in "John's school calls" and "John's school of music calls". Reasoning about the different "slots" a phrase can fit into is a key way we reason about syntax. You can think of each phrase as a connector with a particular shape that plugs into a particular kind of slot, and each phrase also provides a certain number of slots of different shapes. We are trying to figure out which connector goes where, so we can work out how the sentence fits together.
[3] There is a newer version of the Stanford parser that uses "deep learning" and is somewhat more accurate. However, its accuracy still ranks behind the best shift-reduce parsers. It is a great paper, and the ideas are implemented on top of a parser; it doesn't matter much whether that underlying parser is state of the art.
[4] One detail: the Stanford dependencies are actually generated automatically from the gold-standard phrase-structure trees. See the Stanford dependency converter page: http://nlp.stanford.edu/software/stanford-dependencies.shtml.
Unfounded speculation
For a long time, incremental language-processing algorithms were mainly of scientific interest. If you want to write a parser to test a theory about how the human sentence processor works, that parser needs to build partial interpretations. There is ample evidence, including common-sense introspection, that we do not buffer the input and analyse it only once the speaker has finished.
But now algorithms with those neat scientific properties are winning! As far as I can tell, the secret to that victory is:
Incremental. Earlier words constrain the search.
Error-driven. Training works on the assumption that errors will occur, and updates the weights when they do.
The connection to human sentence processing looks tempting. I look forward to seeing whether these engineering breakthroughs lead to any psycholinguistic advances.
References
The NLP literature is almost entirely open access. All of the relevant papers can be found here: http://aclweb.org/anthology/.
The parser I describe is an implementation of a dynamic oracle arc-hybrid system:
Goldberg, Yoav; Nivre, Joakim
Training Deterministic Parsers with Non-Deterministic Oracles
TACL 2013
However, I wrote my own features for it. The arc-hybrid system was originally described here:
Kuhlmann, Marco; Gomez-Rodriguez, Carlos; Satta, Giorgio
Dynamic programming algorithms for transition-based dependency parsers
ACL 2011
The dynamic oracle training method is initially described here:
Goldberg, Yoav; Nivre, Joakim
A Dynamic Oracle for Arc-Eager Dependency Parsing
COLING 2012
This work depended on a major breakthrough in the accuracy of transition-based parsers, achieved when Zhang and Clark investigated beam search. They have published many papers, but the preferred citation is:
Zhang, Yue; Clark, Stephen
Syntactic Processing Using the Generalized Perceptron and Beam Search
Computational Linguistics 2011 (1)
Another important article is this short feature engineering article, which further improves accuracy:
Zhang, Yue; Nivre, Joakim
Transition-based Dependency Parsing with Rich Non-local Features
ACL 2011
The generalised perceptron, the learning framework used for these beam parsers, comes from this paper:
Collins, Michael
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
EMNLP 2002
Experimental details
The results at the start of the article were measured on section 22 of the Wall Street Journal corpus. The Stanford parser was run as follows:
java -mx10000m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $*
A little post-processing was applied to undo the retokenisation the Stanford parser applies to numbers, so that they match the PTB tokenisation:
"Stanford parser retokenises numbers. Split them." Import sys import re qp_re = re.compile ('\ xc2\ xa0') for line in sys.stdin: line = line.rstrip () if qp_re.search (line): line = line.replace ('(CD','(QP (CD', 1) +') 'line = line.replace ('\ xc2\ xa0',') (CD') print line
The resulting PTB-format files were then converted into dependencies using the Stanford converter:
for f in $1/*.mrg; do
    echo $f
    grep -v CODE $f > "$f.2"
    out="$f.dep"
    java -mx800m -cp "$scriptdir/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure \
        -treeFile "$f.2" -basic -makeCopulaHead -conllx > $out
done
I can't read Java easily, but this should simply be using the standard settings from the relevant literature to convert each .mrg file in a directory into a CoNLL-format file of Stanford basic dependencies.
I then converted the gold-standard trees from section 22 of the Wall Street Journal corpus for evaluation. Accuracy here is the unlabelled attachment score, i.e. whether each token's head index is correct, over the tokens scored.
To train parser.py, I fed the gold-standard PTB trees for Wall Street Journal sections 02-21 through the same conversion script.
In a nutshell: the Stanford model and parser.py were trained on the same set of sentences, and made their predictions on a held-out test set for which we know the answers. Accuracy is the proportion of head words we get right.
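As a concrete illustration of that metric (a hypothetical helper, not code from the article), the unlabelled attachment score is just the fraction of tokens whose predicted head index matches the gold-standard head:

def unlabelled_attachment_score(pred_heads, gold_heads):
    """Fraction of tokens whose predicted head index matches the gold head."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(1 for p, g in zip(pred_heads, gold_heads) if p == g)
    return float(correct) / len(gold_heads)

# e.g. unlabelled_attachment_score([1, -1, 3, 1], [1, -1, 3, 0]) returns 0.75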
Speed was tested on a 2.4GHz Xeon processor. I ran the Stanford parser experiments on a server to give it more memory; the parser.py system runs fine on my MacBook Air. For the parser.py experiments I used PyPy, which was about half as fast as an earlier benchmark.
One reason parser.py is so fast is that it does unlabelled parsing. Based on previous experiments, a labelled parser would probably be around 400 times slower and about 1% more accurate. Adapting the program to labelled parsing would be a good exercise for the reader, if you have access to the data.
The Redshift parser results were produced from commit b6b624c9900f3bf, run as follows:
./scripts/train.py -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp
./scripts/parse.py ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/
./scripts/evaluate.py /tmp/parse/parses ~/data/stanford/dev.conll