How to write a spelling error corrector with Python 04/18 Update SLTechnology News&Howtos

How to write a spelling error corrector with Python

2025-04-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/02 Report--

This article mainly introduces "how to write a spelling error corrector with Python". In daily operation, I believe many people have doubts about how to write a spelling error corrector with Python. The editor consulted all kinds of materials and sorted out simple and easy-to-use methods of operation. I hope it will be helpful for you to answer the doubts of "how to write a spelling error corrector with Python". Next, please follow the editor to study!

The code is as follows:

Python

# coding:utf-8

Import re

From collections import Counter

Def words (text):

Return re.findall (r'\ wrought, text.lower ())

# Statistics of word frequency

WORDS = Counter (words (open ('big.txt'). Read ()

Def P (word, N=sum (WORDS.values ():

"probability of the word 'word'"

Return float (works [word]) / N

Def correction (word):

"" the most likely corrective candidate ""

Return max (candidates (word), key=P)

Def candidates (word):

"generate a candidate set of spelling correction words"

Return (known ([word]) or known (edits1 (word)) or known (edits2 (word)) or [word])

Def known (words):

"" subset of elements that appear in the WORDS collection in the words' ""

Return set (w for w in words if w in WORDS)

Def edits1 (word):

"" all results with an editing distance of 1 from 'word' "

Letters = 'abcdefghijklmnopqrstuvwxyz'

Splits = [(word [: I], word [I:]) for i in range (len (word) + 1)]

Deletes = [L + R [1:] for L, R in splits if R]

Transposes = [L + R [1] + R [0] + R [2:] for L, R in splits if len (R) > 1]

Replaces = [L + c + R [1:] for L, R in splits for c in letters]

Inserts = [L + c + R for L, R in splits for c in letters]

Return set (deletes + transposes + replaces + inserts)

Def edits2 (word):

"" all results with an editing distance of 2 from 'word' "

Return (e2 for E1 in edits1 (word) for e2 in edits1 (E1))

The function correction (word) returns the most likely error correction restore word:

Python

> correction ('speling')

'spelling'

> correction ('korrectud')

'corrected'

How it works: the Python section

The four parts of the program:

1. Selection mechanism: in Python, the max () function with key can realize the function of argmax.

two。 Candidate model: first introduce a new concept: simple editing of a word refers to deletion (removal of a letter), substitution (interchange of two letters within a word), substitution (change of a letter within a word), and insertion (adding a letter). The function edits1 (word) returns a collection of all simple edits of a word, regardless of whether it is a legal word after editing:

Python

Def edits1 (word):

"" all results with an editing distance of 1 from 'word' "

Letters = 'abcdefghijklmnopqrstuvwxyz'

Splits = [(word [: I], word [I:]) for i in range (len (word) + 1)]

Deletes = [L + R [1:] for L, R in splits if R]

Transposes = [L + R [1] + R [0] + R [2:] for L, R in splits if len (R) > 1]

Replaces = [L + c + R [1:] for L, R in splits for c in letters]

Inserts = [L + c + R for L, R in splits for c in letters]

Return set (deletes + transposes + replaces + inserts)

This collection can be very large. A word of length n, with n delete edits, natively 1 replacement edits, 26n replacement edits, and 26 (nasty 1) insert edits, for a total of 54n+25 simple edits (there are duplicates). For example:

Python

> len (edits1 ('something'))

four hundred and forty two

However, if we limit words to known, the set of words will shrink significantly:

Python

Def known (words):

"" subset of elements that appear in the WORDS collection in the words' ""

Return set (w for w in words if w in WORDS)

> known (edits1 ('something'))

['something',' soothing']

We also need to consider the words obtained by secondary editing (the editing distance is 2, where the author cleverly uses the idea of recursion to return the function edits1 to each element in the set and get it again through edits1 processing), which is larger, but still only a small number of known words:

Python

Def edits2 (word):

"" all results with an editing distance of 2 from 'word' "

Return (e2 for E1 in edits1 (word) for e2 in edits1 (E1))

> len (set (edits2 ('something'))

90902

> known (edits2 ('something'))

{'seething',' smoothing', 'something',' soothing'}

> known (edits2 ('somthing'))

{'loathing',' nothing', 'scathing',' seething', 'smoothing',' something', 'soothing',' sorting'}

We call each word in the edits2 (w) result a distance of 2 from w.

3. Language model: we estimate P (w) by counting the frequency of each word in a million-level text big.txt, whose data comes from excerpts from the public domain in the Gutenberg project and the most frequent words in the Wikipedia dictionary, as well as the British National Corpus, where the function words (text) divides the text into phrases and calculates the frequency of each word in the variable WORDS. P evaluate the probability of each word based on this statistics:

Python

Def words (text):

Return re.findall (r'\ wrought, text.lower ())

# Statistics of word frequency

WORDS = Counter (words (open ('big.txt'). Read ()

Def P (word, N=sum (WORDS.values ():

"probability of the word 'word'"

Return float (works [word]) / N

As you can see, there are 32192 words after repetition, and they appear a total of 1115504 times. "the" is the word with the highest frequency, with a total of 79808 times (about 7%). The probability of other words is lower.

Python

> len (WORDS)

32192

> > sum (WORDS.values ())

1115504

> > WORDS.most_common (10)

[('the', 79808)

('of', 40024)

('and', 38311)

('to', 28765)

('in', 22020)

('asides, 21124)

('that', 12512)

('he', 12401)

('was', 11410)

('it', 10681)

('his', 10034)

('is', 9773)

('with', 9739)

('as', 8064)

('iTunes, 7679)

('had', 7383)

('for', 6938)

('at', 6789)

('by', 6735)

('on', 6639)]

> max (WORDS, key=P)

'the'

> P ('the')

0.07154434228832886

> P ('outrivaled')

8.9645577245801e-07

> P ('unmentioned')

0.0

4. Error model: when I was sitting in the cabin writing this program in 2007, I had no misspelled data or Internet connection (which I know might be hard to imagine today). A spelling error model cannot be built without data, so I took a shortcut and defined such a simple and flawed model: it is determined that editors with a distance of 1 for all known words must have a higher probability than those with a distance of 2. and the probability must be lower than the word with distance 0 (that is, the original word). So the priority of the function candidates (word) is as follows:

1. The original word (if known), otherwise to 2.

two。 All words with a distance of 1, if empty to 3.

3. All words with a distance of 2, if empty to 4.

4. The original word, even if it is not a known word.

At this point, the study on "how to write a spelling error corrector with Python" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.