How to write a language detector with Python code

2025-01-17 Update From: SLTechnology News&Howtos

Many people new to the topic are unsure how to write a language detector in Python. This article explains the ideas behind the problem and outlines a solution.

Have you ever wondered how the Chrome browser knows the language of a web page and offers to translate foreign-language pages? Or how Facebook translates the foreign-language posts your friends write on your home page? Detecting a language is actually very simple, improves the user experience, and doesn't require the user to do anything.

I stumbled upon an ActiveState recipe for a language detector in Python. It is a very good program, but I decided to make some small improvements and to provide some background for readers who are not familiar with natural language processing or computational linguistics.

If you are an experienced programmer, you may be able to skip straight to the program at the bottom. It's surprisingly simple.

You need to be familiar with Python syntax.

What identifies a language?

Before you write a program that tells languages apart, you need to answer a question: what distinguishes one language from another?

Interestingly, the answer to this question varies from language to language. For example:

女性は牛乳を飲んだ。 (Translator's note: Japanese for "The woman drank milk.")

How do you know this sentence is not in English? You may not be familiar with Japanese, but you certainly know that these characters are not English, and you don't even need to know which characters don't exist in the English alphabet.

La femme boit du lait. (Translator's note: French for "The woman drinks milk.")

How do you know this sentence is not in English? This one is trickier. Every letter also occurs in English, and even the sentence structure closely matches its English equivalent, "The woman drinks milk." Your brain relies on a different feature here: although the letters are the same, the two sentences sound nothing alike.

There are many more sophisticated ways to tell two languages apart (grammar and syntax, for example), but the two features mentioned above seem to be sufficient to distinguish a great deal of written text.

Question: can you think of a counterexample, two languages that cannot be told apart by either characters or pronunciation? (Translator's note: this example is my own, not the author's. Hindi, a language of India, and Nepali, the official language of Nepal, are very hard to distinguish: their characters differ little and their pronunciation is about 50 percent similar. They do, of course, belong to the same language family.)

How do you use a computer to detect these features?

The first feature already exists on any modern machine: character encodings. A character encoding lets a computer represent each character as binary code. We're going to use Unicode in the Python program.
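
To make the first feature concrete, here is a minimal sketch of how Unicode metadata can hint at the writing system of a text. The function name script_counts is hypothetical, and using the first word of each character's Unicode name as a stand-in for its script is a simplifying assumption, not the method of the original recipe.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode character name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, and punctuation
        # unicodedata.name("a") is "LATIN SMALL LETTER A"; the first word
        # serves here as a rough stand-in for the writing system.
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split()[0]] += 1
    return counts


print(script_counts("La femme boit du lait."))
print(script_counts("ミルク"))
```

A text whose letters are all counted under LATIN could still be English, French, or many other languages, which is why the second feature is needed.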

The second feature is more interesting. How can a computer detect the pronunciation of a string? The answer is simpler than you might expect: the order of the characters encodes the sounds. The correspondence between spelling and sound is direct and stable, because a language changes very slowly.

Therefore, you can use the following two features to detect the language of a line of text:

The frequency of single characters

The frequency of character sequences

In fact, these two features collapse into one: the frequency of character sequences. A single character is just a sequence of length one.

Quick background: in computational linguistics, a string of length n is called an n-gram. "a" is a 1-gram, or unigram. "bc" is a 2-gram, or bigram. "def" is a 3-gram, or trigram, and so on.
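
As a small illustration of the definition, the sketch below lists every n-gram contained in a string. The helper name ngrams is an assumption made for this example.

```python
def ngrams(text, n):
    """Return every substring of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("def", 1))  # ['d', 'e', 'f']
print(ngrams("def", 2))  # ['de', 'ef']
print(ngrams("def", 3))  # ['def']
```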

Implementing it in Python

First, we need to count how many times each character sequence appears in a particular text. To encapsulate the result, we will create an NGram class.
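
One possible shape for such a class is sketched below. It is not the code from the ActiveState recipe: the attribute name table, the default n=3, the whitespace normalization, and the use of cosine similarity to compare two texts are all assumptions made for illustration, and the sample sentences are far too short to serve as real language profiles.

```python
import math
from collections import Counter


class NGram:
    """Counts of the overlapping n-grams in one piece of text."""

    def __init__(self, text, n=3):
        self.n = n
        # Lowercase and collapse whitespace so formatting does not affect counts.
        cleaned = " ".join(text.lower().split())
        self.table = Counter(
            cleaned[i:i + n] for i in range(len(cleaned) - n + 1)
        )

    def similarity(self, other):
        """Cosine similarity between two count tables (0.0 to 1.0)."""
        dot = sum(count * other.table[gram] for gram, count in self.table.items())
        norm_self = math.sqrt(sum(c * c for c in self.table.values()))
        norm_other = math.sqrt(sum(c * c for c in other.table.values()))
        if norm_self == 0 or norm_other == 0:
            return 0.0  # an empty table has nothing to compare
        return dot / (norm_self * norm_other)


# Tiny illustrative "profiles"; a real detector would use much more text.
english = NGram("the woman drinks milk and the man drinks water")
french = NGram("la femme boit du lait et l'homme boit de l'eau")
sample = NGram("the man drinks milk")

print(sample.similarity(english) > sample.similarity(french))
```

Detecting a language then amounts to building one table per known language from reference text and picking the language whose table is most similar to the table of the unknown text.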
