
pyhanlp: Pinyin Conversion and Character Normalization

2025-02-14 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

This article covers pinyin conversion and character normalization with pyhanlp, the Python interface to HanLP. The methods shown are simple, fast, and practical.

Simplified and Traditional Chinese conversion

For the algorithm behind it, see "Java implementation of Chinese characters to Pinyin and Simplified/Traditional conversion".

From pyhanlp import *

# conversion of complexity and simplicity

Print (HanLP.convertToTraditionalChinese) ("when you become queen, you can buy strawberries to celebrate". Found a gray hair)

Print (HanLP.convertToSimplifiedChinese ("HanLP"))

# simplified Chinese to traditional Taiwanese

Print (HanLP.s2tw ("hankcs writes Code in Taiwan"))

# Taiwan traditional Chinese to simplified

Print (HanLP.tw2s ("hankcs is checking the code"))

# simplified Chinese to Hong Kong traditional

Print (HanLP.s2hk ("hankcs writes Code in Hong Kong")

# from traditional Hong Kong to simplified

Print (HanLP.hk2s ("hankcs in Hong Kong Code")

# from Hong Kong traditional to Taiwan traditional

Print (HanLP.hk2tw ("hankcs in the code"))

# Taiwan traditional to Hong Kong traditional

Print (HanLP.tw2hk ("hankcs in Hong Kong Code")

# conversion between Hong Kong / Taiwan traditional Chinese and HanLP standard traditional Chinese

Print (HanLP.t2tw ("hankcs in the code"))

Print (HanLP.t2hk ("hankcs in the code"))

Print (HanLP.tw2t ("hankcs is checking the code"))

Print (HanLP.hk2t ("hankcs license Code"))

Later, when you become queen, you will be able to wish strawberries. Found a piece of jelly.

Write program HanLP with notebook computer

Hankcs is checking the code.

Hankcs writes code in Taiwan

Hankcs password in Hong Kong

Hankcs writes code in Hong Kong

Hankcs is checking the code.

Hankcs password in Hong Kong

Hankcs is checking the code.

Hankcs writes the code in Taiwan

The code of hankcs is registered.

The code of hankcs is registered.
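The conversion methods above follow a `<source>2<target>` naming pattern, so picking one at run time can be reduced to a lookup. The following is only a sketch: `convert`, `SUPPORTED`, and `StubConverter` are hypothetical names, not part of pyhanlp. The helper accepts any object that exposes the methods shown above, and the stub exists only so the snippet runs without a JVM.

```python
# Region codes as they appear in the method names above:
# s = Simplified, t = HanLP standard Traditional, tw = Taiwan, hk = Hong Kong.
SUPPORTED = {
    ("s", "tw"), ("tw", "s"), ("s", "hk"), ("hk", "s"),
    ("hk", "tw"), ("tw", "hk"), ("t", "tw"), ("t", "hk"),
    ("tw", "t"), ("hk", "t"),
}


def convert(converter, text, source, target):
    """Call converter.<source>2<target>(text), for example converter.s2tw(text)."""
    if (source, target) not in SUPPORTED:
        raise ValueError("no conversion from %r to %r" % (source, target))
    method = getattr(converter, "%s2%s" % (source, target))
    return str(method(text))


class StubConverter:
    """Stand-in for HanLP so the example runs without pyhanlp installed."""

    def s2tw(self, text):
        return "s2tw(%s)" % text


print(convert(StubConverter(), "hankcs", "s", "tw"))
```

With pyhanlp installed, passing `HanLP` in place of the stub should dispatch to the real methods, assuming they are exposed as in the demo above.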

Chinese characters to pinyin

HanLP's Chinese-to-pinyin conversion is just as capable.

Description

Besides basic Chinese-to-pinyin conversion, HanLP can return the initial (shengmu), the final (yunmu), the tone, the tone-marked form, and the input-method head of each syllable.

HanLP recognizes polyphonic characters and can also produce pinyin for Traditional Chinese text.

Most importantly, HanLP's pattern matching has been upgraded to AhoCorasickDoubleArrayTrie, which greatly improves performance and gives millisecond-level response times.
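To make the pattern-matching idea concrete, here is a minimal Aho-Corasick sketch. It is not HanLP's implementation: HanLP stores the automaton in a double-array trie, whereas this version uses plain Python dicts, and the class name, method names, and the three dictionary entries are assumptions made for illustration. It shows how a single pass over the text reports every dictionary word, which is what allows a word-level entry to override the default reading of a polyphonic character.

```python
from collections import deque


class AhoCorasick:
    """Dictionary matcher: a dict-based trie with failure links."""

    def __init__(self, entries):
        self.values = dict(entries)
        self.children = [{}]   # node 0 is the root
        self.fail = [0]
        self.outputs = [[]]    # keys ending at each node (own + inherited)
        for key in self.values:
            self._insert(key)
        self._build_failure_links()

    def _insert(self, key):
        node = 0
        for ch in key:
            nxt = self.children[node].get(ch)
            if nxt is None:
                nxt = len(self.children)
                self.children[node][ch] = nxt
                self.children.append({})
                self.fail.append(0)
                self.outputs.append([])
            node = nxt
        self.outputs[node].append(key)

    def _build_failure_links(self):
        # Breadth-first, so a node's failure target is always finished first.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                self.outputs[child] += self.outputs[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start, end, key, value) for every dictionary hit in text."""
        hits, node = [], 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for key in self.outputs[node]:
                hits.append((i - len(key) + 1, i + 1, key, self.values[key]))
        return hits


ac = AhoCorasick({"重": "zhong4", "重载": "chong2,zai3", "重任": "zhong4,ren4"})
for hit in ac.find_all("重载不是重任！"):
    print(hit)
```

Keeping only the longest hit at each start position is what lets the entry for 重载 (chong2) win over the single-character reading of 重 (zhong4).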

Algorithm details

"Java implementation of Chinese characters to Pinyin and Simplified/Traditional conversion"

from pyhanlp import *

# Chinese characters to pinyin
Pinyin = JClass("com.hankcs.hanlp.dictionary.py.Pinyin")
text = "重载不是重任！"
pinyin_list = HanLP.convertToPinyinList(text)
print("Original text,", end=" ")
print(text)
print("Pinyin (numeric tone),", end=" ")
print(pinyin_list)
print("Pinyin (tone mark),", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getPinyinWithToneMark(), end=" ")
print("\nPinyin (no tone),", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getPinyinWithoutTone(), end=" ")
print("\nTone,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getTone(), end=" ")
print("\nInitial,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getShengmu(), end=" ")
print("\nFinal,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getYunmu(), end=" ")
print("\nInput method head,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getHead(), end=" ")
print()
# The last argument chooses what to emit for characters without pinyin:
# True prints "none", False keeps the original character.
print(HanLP.convertToPinyinString("截至2012年，", " ", True))
print(HanLP.convertToPinyinString("截至2012年，", " ", False))

Output:
Original text, 重载不是重任！
Pinyin (numeric tone), [chong2, zai3, bu2, shi4, zhong4, ren4, none5]
Pinyin (tone mark), chóng, zǎi, bú, shì, zhòng, rèn, none,
Pinyin (no tone), chong, zai, bu, shi, zhong, ren, none,
Tone, 2, 3, 2, 4, 4, 4, 5,
Initial, ch, z, b, sh, zh, r, none,
Final, ong, ai, u, i, ong, en, none,
Input method head, ch, z, b, sh, zh, r, none,
jie zhi none none none none nian none
jie zhi 2 0 1 2 nian ，
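The relationship between the numeric-tone form and the other forms printed above can be sketched in plain Python. This is an illustration of the usual pinyin conventions, not HanLP's code: the function names are hypothetical, the tone mark is placed with the common rule (on a, else on e, else on the o of ou, else on the last vowel), and treating y and w as initials is an assumption.

```python
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}
# Two-letter initials come first so that "zh" is tried before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_tone(syllable):
    """Split 'chong2' into ('chong', 2); no digit means neutral tone 5."""
    if syllable[-1:].isdigit():
        return syllable[:-1], int(syllable[-1])
    return syllable, 5


def mark_position(base):
    """Index of the vowel that carries the tone mark, or -1 if none."""
    if "a" in base:
        return base.index("a")
    if "e" in base:
        return base.index("e")
    if "ou" in base:
        return base.index("o")
    for i in range(len(base) - 1, -1, -1):
        if base[i] in TONE_MARKS:
            return i
    return -1


def with_tone_mark(syllable):
    """Convert numeric-tone pinyin such as 'zai3' to tone-marked 'zǎi'."""
    base, tone = split_tone(syllable)
    base = base.replace("v", "ü")
    pos = mark_position(base)
    if tone not in (1, 2, 3, 4) or pos < 0:
        return base
    return base[:pos] + TONE_MARKS[base[pos]][tone - 1] + base[pos + 1:]


def split_initial_final(base):
    """Split tone-less pinyin into (initial, final)."""
    for initial in INITIALS:
        if base.startswith(initial):
            return initial, base[len(initial):]
    return "", base


for s in ["chong2", "zai3", "bu2", "shi4", "zhong4", "ren4"]:
    base, tone = split_tone(s)
    print(s, with_tone_mark(s), base, tone, split_initial_final(base))
```

For the six syllables of the example sentence this reproduces the tone-marked, tone-less, tone, initial, and final values listed in the output above.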

Pinyin to Chinese

HanLP's data structures and interfaces are flexible, and combining them lets you build new features yourself. Here we use the longest-match segmenter built on AhoCorasickDoubleArrayTrie, which requires the user to supply an AhoCorasickDoubleArrayTrie by calling setTrie().

from pyhanlp import *

StringDictionary = JClass("com.hankcs.hanlp.corpus.dictionary.StringDictionary")
CommonAhoCorasickDoubleArrayTrieSegment = JClass(
    "com.hankcs.hanlp.seg.Other.CommonAhoCorasickDoubleArrayTrieSegment")
Config = JClass("com.hankcs.hanlp.HanLP$Config")
TreeMap = JClass("java.util.TreeMap")
TreeSet = JClass("java.util.TreeSet")

# Load the word -> pinyin dictionary and invert it into pinyin -> set of words.
dictionary = StringDictionary()
dictionary.load(Config.PinyinDictionaryPath)
m_map = TreeMap()
for entry in dictionary.entrySet():
    # Note: replace() treats "[\\d,]" as a literal string, not a regex,
    # so tone digits and commas stay in the keys.
    pinyins = entry.getValue().replace("[\\d,]", "")
    words = m_map.get(pinyins)
    if words is None:
        words = TreeSet()
        m_map.put(pinyins, words)
    words.add(entry.getKey())

# Add two words that share the pinyin "lvse" by hand.
words = TreeSet()
words.add("绿色")  # "green"
words.add("滤色")  # "color filtering"
m_map.put("lvse", words)

segment = CommonAhoCorasickDoubleArrayTrieSegment(m_map)
print(segment.segment("renmenrenweiyalujiangbujianlvse"))
print(segment.segment("lvsehaihaodajiadongxidayinji"))

Output:
[renmenrenweiyalujiangbujian/null, lvse/[滤色, 绿色]]
[lvse/[滤色, 绿色], haihaodajiadongxidayinji/null]
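The same idea can be sketched without a JVM. The code below is a simplified stand-in, not HanLP's segmenter: it uses a plain dict and a greedy longest-match scan instead of an AhoCorasickDoubleArrayTrie, and `build_index`, `segment`, and the two-entry dictionary are hypothetical. Unlike the demo above, it strips tone digits and commas with a real regex when building the keys.

```python
import re


def build_index(word_to_pinyin):
    """Invert {word: 'pin1,yin1'} into {tone-less pinyin: sorted words}."""
    index = {}
    for word, pinyin in word_to_pinyin.items():
        key = re.sub(r"[\d,]", "", pinyin)
        index.setdefault(key, set()).add(word)
    return {key: sorted(words) for key, words in index.items()}


def segment(text, index):
    """Greedy longest match; unmatched runs are returned with None."""
    max_len = max(map(len, index), default=0)
    result, unmatched, i = [], "", 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            key = text[i:i + n]
            if key in index:
                if unmatched:
                    result.append((unmatched, None))
                    unmatched = ""
                result.append((key, index[key]))
                i += n
                break
        else:
            # No dictionary key starts here; extend the unmatched run.
            unmatched += text[i]
            i += 1
    if unmatched:
        result.append((unmatched, None))
    return result


index = build_index({"绿色": "lv4,se4", "滤色": "lv4,se4"})
print(segment("renmenrenweiyalujiangbujianlvse", index))
print(segment("lvsehaihaodajiadongxidayinji", index))
```

With only the `lvse` key in the index, everything else falls into an unmatched run, which mirrors the `/null` segments in the output above.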

Character normalization

This demo shows the effect of the character normalization option (Traditional -> Simplified, full-width -> half-width, uppercase -> lowercase).

The option lives in hanlp.properties and is enabled with Normalization=true; it can now also be switched on directly through HanLP.Config.Normalization.

After changing the setting, the CustomDictionary.txt.bin cache must be deleted; otherwise only dynamically inserted words are affected.

About a week before this article was written, a contributor added automatic cache deletion when a custom dictionary is added (see https://github.com/hankcs/HanLP/pull/954), so now you only need to turn on normalization.

CustomDictionary = JClass ("com.hankcs.hanlp.dictionary.CustomDictionary")

Print ("HanLP.Config.Normalization = False\ n")

HanLP.Config.Normalization = False

CustomDictionary.insert ("Love 4G", "nz 1000")

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("subscription 4G"))

Print (HanLP.segment ("like 4G"))

Print (HanLP.segment ("hankcs in the code"))

Print ("\ nHanLP.Config.Normalization = True\ n")

HanLP.Config.Normalization = True

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("Love 4G"))

Print (HanLP.segment ("subscription 4G"))

Print (HanLP.segment ("like 4G"))

Print (HanLP.segment ("hankcs in the code"))

HanLP.Config.ShowTermNature = False

Text = HanLP.s2tw ("now HanLP has added the ability to automatically delete the cache after adding a custom dictionary, now you only need to turn on regularization")

Print (text)

Print (HanLP.segment (text))

HanLP.Config.ShowTermNature = False

HanLP.Config.Normalization = False

Output (tokens shown as English glosses):
[love to listen to 4G]
[love listening to 4G]
[love, listen, 4, G]
[love, listen, 4, G]
[love, love, 4, G]
[like, 4, G]
[hankcs, write, write, code]

HanLP.Config.Normalization = True

[love to listen to 4G]
[love to listen to 4G]
[love to listen to 4G]
[love to listen to 4G]
[love to listen to 4G]
[like, 4, g]
[hankcs, in, Taiwan, write, code]
Now the HanLP has added the function of automatically deleting the cache after adding the custom definition dictionary. Now you only need to turn on the normalization.
[now, hanlp, already, add, add, customize, dictionary, and then, automatically, delete, fast, fetch, function, now, only, need, open, correct, turn, transform, can]
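The three mappings that normalization applies can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than HanLP's implementation: `T2S` is a tiny hypothetical Traditional-to-Simplified table (a real one has thousands of entries), and `to_half_width` and `normalize` are made-up names. Full-width ASCII variants occupy U+FF01 to U+FF5E, exactly 0xFEE0 above their half-width counterparts.

```python
# Tiny illustrative Traditional -> Simplified table (assumption, not HanLP's data).
T2S = {"愛": "爱", "聽": "听", "臺": "台", "灣": "湾", "寫": "写", "碼": "码"}


def to_half_width(ch):
    """Map a full-width ASCII variant or ideographic space to half-width."""
    code = ord(ch)
    if code == 0x3000:              # ideographic space
        return " "
    if 0xFF01 <= code <= 0xFF5E:    # full-width '!' through '~'
        return chr(code - 0xFEE0)
    return ch


def normalize(text):
    """Apply full-width -> half-width, Traditional -> Simplified, then lowercase."""
    out = []
    for ch in text:
        ch = to_half_width(ch)
        ch = T2S.get(ch, ch)
        out.append(ch.lower())
    return "".join(out)


for sample in ["爱听4g", "爱听4G", "爱听４G", "爱听４Ｇ", "愛聽４Ｇ"]:
    print(sample, "->", normalize(sample))
```

All five spellings collapse to the same string, which is why a single dictionary entry can match every variant once normalization is on.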

That covers pinyin conversion and character normalization with pyhanlp. The best way to absorb it is to run the examples yourself.
