2025-02-14 update, from SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article covers Pinyin conversion and character normalization in pyhanlp. The methods shown here are simple, fast, and practical.
For the underlying algorithm, see "Java implementation of Chinese-character-to-Pinyin conversion and Simplified/Traditional conversion".
from pyhanlp import *

# The string literals below are English renderings of the original Chinese inputs.
# Simplified/Traditional Chinese conversion
print(HanLP.convertToTraditionalChinese("\"When you become queen, you can buy strawberries to celebrate.\" Found a gray hair"))
print(HanLP.convertToSimplifiedChinese("HanLP"))
# Simplified Chinese to Taiwan Traditional
print(HanLP.s2tw("hankcs writes Code in Taiwan"))
# Taiwan Traditional to Simplified Chinese
print(HanLP.tw2s("hankcs is checking the code"))
# Simplified Chinese to Hong Kong Traditional
print(HanLP.s2hk("hankcs writes Code in Hong Kong"))
# Hong Kong Traditional to Simplified Chinese
print(HanLP.hk2s("hankcs in Hong Kong Code"))
# Hong Kong Traditional to Taiwan Traditional
print(HanLP.hk2tw("hankcs in the code"))
# Taiwan Traditional to Hong Kong Traditional
print(HanLP.tw2hk("hankcs in Hong Kong Code"))
# Conversion between Hong Kong / Taiwan Traditional and HanLP standard Traditional
print(HanLP.t2tw("hankcs in the code"))
print(HanLP.t2hk("hankcs in the code"))
print(HanLP.tw2t("hankcs is checking the code"))
print(HanLP.hk2t("hankcs license Code"))
Later, when you become queen, you will be able to wish strawberries. Found a piece of jelly.
Write program HanLP with notebook computer
Hankcs is checking the code.
Hankcs writes code in Taiwan
Hankcs password in Hong Kong
Hankcs writes code in Hong Kong
Hankcs is checking the code.
Hankcs password in Hong Kong
Hankcs is checking the code.
Hankcs writes the code in Taiwan
The code of hankcs is registered.
The code of hankcs is registered.
Chinese characters to Pinyin
HanLP's Chinese-character-to-Pinyin conversion is also very capable.
Description
HanLP supports not only basic character-to-Pinyin conversion but also initials (shengmu), finals (yunmu), tones, tone-marked Pinyin, and the first letter or first initial used by input methods.
HanLP recognizes polyphonic characters and can also annotate Traditional Chinese text with Pinyin.
Most importantly, HanLP's pattern matching has been upgraded to AhoCorasickDoubleArrayTrie, which greatly improves performance and gives millisecond-level response times.
Algorithm details
"Java implementation of Chinese-character-to-Pinyin conversion and Simplified/Traditional conversion"
# Chinese characters to Pinyin
Pinyin = JClass("com.hankcs.hanlp.dictionary.py.Pinyin")
text = "重载不是重任！"  # "Overloading is not a heavy responsibility!"
pinyin_list = HanLP.convertToPinyinList(text)
print("Original,", end=" ")
print(text)
print("Pinyin (numeric tone),", end=" ")
print(pinyin_list)
print("Pinyin (tone marks),", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getPinyinWithToneMark(), end=" ")
print("\nPinyin (no tone),", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getPinyinWithoutTone(), end=" ")
print("\nTone,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getTone(), end=" ")
print("\nInitial,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getShengmu(), end=" ")
print("\nFinal,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getYunmu(), end=" ")
print("\nInput-method head,", end=" ")
for pinyin in pinyin_list:
    print("%s," % pinyin.getHead(), end=" ")
print()
# Pinyin conversion can optionally keep the original characters that have no Pinyin.
# The third argument (remainNone) controls this: True prints "none", False keeps the character.
print(HanLP.convertToPinyinString("截至2012年，", " ", True))
print(HanLP.convertToPinyinString("截至2012年，", " ", False))
Original, 重载不是重任！
Pinyin (numeric tone), [chong2, zai3, bu2, shi4, zhong4, ren4, none5]
Pinyin (tone marks), chóng, zǎi, bú, shì, zhòng, rèn, none
Pinyin (no tone), chong, zai, bu, shi, zhong, ren, none
Tone, 2, 3, 2, 4, 4, 4, 5
Initial, ch, z, b, sh, zh, r, none
Final, ong, ai, u, i, ong, en, none
Input-method head, ch, z, b, sh, zh, r, none
jie zhi none none none none nian none
jie zhi 2 0 1 2 nian
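The relationship between the numeric-tone and tone-mark forms in the output above can be sketched in plain Python. This is only an illustration of the standard Pinyin tone-placement rule, not HanLP's implementation; the function name `numeric_to_tone_mark` and the `TONE_MARKS` table are hypothetical names introduced for this example.

```python
# Illustrative only: not HanLP's implementation.
# Each vowel maps to its four tone-marked forms (tones 1 to 4).
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def numeric_to_tone_mark(syllable: str) -> str:
    """Convert e.g. 'chong2' to 'chóng'. Tone 5 (neutral) gets no mark."""
    if not syllable or not syllable[-1].isdigit():
        return syllable  # no tone digit: return unchanged
    base = syllable[:-1].replace("v", "ü")  # 'v' is a common stand-in for 'ü'
    tone = int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base
    # Placement rule: 'a' or 'e' takes the mark; in 'ou' the 'o' takes it;
    # otherwise the last vowel in the syllable takes it.
    if "a" in base:
        pos = base.index("a")
    elif "e" in base:
        pos = base.index("e")
    elif "ou" in base:
        pos = base.index("o")
    else:
        vowel_positions = [i for i, ch in enumerate(base) if ch in TONE_MARKS]
        if not vowel_positions:
            return base
        pos = vowel_positions[-1]
    marked = TONE_MARKS[base[pos]][tone - 1]
    return base[:pos] + marked + base[pos + 1:]


syllables = ["chong2", "zai3", "bu2", "shi4", "zhong4", "ren4"]
print(", ".join(numeric_to_tone_mark(s) for s in syllables))
```

In pyhanlp itself you would call `getPinyinWithToneMark()` on each Pinyin object, as in the article's code; the sketch only shows how the two notations correspond.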
Pinyin to Chinese
HanLP's data structures and interfaces are flexible, and you can combine them to build new functionality yourself. Here we use the longest-match segmenter built on AhoCorasickDoubleArrayTrie. It requires the user to supply an AhoCorasickDoubleArrayTrie by calling setTrie(); the example below passes a map to the constructor instead.
StringDictionary = JClass(
    "com.hankcs.hanlp.corpus.dictionary.StringDictionary")
CommonAhoCorasickDoubleArrayTrieSegment = JClass(
    "com.hankcs.hanlp.seg.Other.CommonAhoCorasickDoubleArrayTrieSegment")
Config = JClass("com.hankcs.hanlp.HanLP$Config")
TreeMap = JClass("java.util.TreeMap")
TreeSet = JClass("java.util.TreeSet")
# Load the Pinyin dictionary (character or word -> Pinyin)
dictionary = StringDictionary()
dictionary.load(Config.PinyinDictionaryPath)
entry = {}
# Build the reverse map: Pinyin string -> set of words
m_map = TreeMap()
for entry in dictionary.entrySet():
    pinyins = entry.getValue().replace("[\\d,]", "")
    words = m_map.get(pinyins)
    if words is None:
        words = TreeSet()
        m_map.put(pinyins, words)
    words.add(entry.getKey())
# Add one entry by hand: "lvse" maps to two candidate words
words = TreeSet()
words.add("green")
words.add("filter")
m_map.put("lvse", words)
segment = CommonAhoCorasickDoubleArrayTrieSegment(m_map)
print(segment.segment("renmenrenweiyalujiangbujianlvse"))
print(segment.segment("lvsehaihaodajiadongxidayinji"))
[renmenrenweiyalujiangbujian/null, lvse/[filter, green]]
[lvse/[filter, green], haihaodajiadongxidayinji/null]
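To make the output above easier to read, here is a minimal pure-Python sketch of longest-match segmentation over a Pinyin-to-words map. It uses a naive dictionary scan in place of AhoCorasickDoubleArrayTrie, so it shows the matching behaviour only, not HanLP's data structure or performance. The function name `longest_match_segment` and the `lexicon` variable are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def longest_match_segment(
    text: str, lexicon: Dict[str, Set[str]]
) -> List[Tuple[str, Optional[List[str]]]]:
    """Split text into (span, candidates) pairs by forward longest match.

    Runs of characters not covered by any lexicon key are merged into one
    span and paired with None.
    """
    max_len = max((len(key) for key in lexicon), default=0)
    result: List[Tuple[str, Optional[List[str]]]] = []
    unmatched = ""  # buffer for characters not covered by any key
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible key first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            unmatched += text[i]
            i += 1
            continue
        if unmatched:
            result.append((unmatched, None))
            unmatched = ""
        result.append((match, sorted(lexicon[match])))
        i += len(match)
    if unmatched:
        result.append((unmatched, None))
    return result


# Only the hand-added "lvse" entry from the article is included here.
lexicon = {"lvse": {"green", "filter"}}
print(longest_match_segment("renmenrenweiyalujiangbujianlvse", lexicon))
print(longest_match_segment("lvsehaihaodajiadongxidayinji", lexicon))
```

As in the article's output, only "lvse" is recognized and the rest of each string is returned as a single unmatched span.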
Character normalization
This demonstrates the effect of the character-normalization configuration option (Traditional -> Simplified, full-width -> half-width, uppercase -> lowercase).
The option lives in hanlp.properties and is enabled with Normalization=true; it can now also be switched on directly through HanLP.Config.Normalization.
The CustomDictionary.txt.bin cache must be deleted after switching this setting; otherwise only dynamically inserted new words are affected.
About a week before this article was written, a contributor added automatic cache deletion after a custom dictionary is added (see https://github.com/hankcs/HanLP/pull/954). With that change, you only need to turn on normalization.
CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")
print("HanLP.Config.Normalization = False\n")
HanLP.Config.Normalization = False
CustomDictionary.insert("Love 4G", "nz 1000")
# The inputs below were originally variants of the same phrase
# (different letter case, full-/half-width characters, Traditional script).
print(HanLP.segment("Love 4G"))
print(HanLP.segment("Love 4G"))
print(HanLP.segment("Love 4G"))
print(HanLP.segment("Love 4G"))
print(HanLP.segment("subscription 4G"))
print(HanLP.segment("like 4G"))
print(HanLP.segment("hankcs in the code"))
print("\nHanLP.Config.Normalization = True\n")
HanLP.Config.Normalization = True
print(HanLP.segment("Love 4G"))
print(HanLP.segment("Love 4G"))
print(HanLP.segment("Love 4G"))
print(HanLP.segment("Love 4G"))
print(HanLP.segment("subscription 4G"))
print(HanLP.segment("like 4G"))
print(HanLP.segment("hankcs in the code"))
# Hide part-of-speech tags, then segment a Taiwan Traditional sentence
HanLP.Config.ShowTermNature = False
text = HanLP.s2tw("now HanLP has added the ability to automatically delete the cache after adding a custom dictionary, now you only need to turn on regularization")
print(text)
print(HanLP.segment(text))
HanLP.Config.ShowTermNature = False
HanLP.Config.Normalization = False
[love to listen to 4G]
[love listening to 4G]
[love, listen, 4, G]
[love, listen, 4, G]
[love, love, 4, G]
[like, 4, G]
[hankcs, write, write, code]
HanLP.Config.Normalization = True
[love to listen to 4G]
[love to listen to 4G]
[love to listen to 4G]
[love to listen to 4G]
[love to listen to 4G]
[like, 4, g]
[hankcs, in, Taiwan, write, code]
Now the HanLP has added the function of automatically deleting the cache after adding the custom definition dictionary. Now you only need to turn on the normalization.
[now, hanlp, already, add, add, customize, dictionary, and then, automatically, delete, fast, fetch, function, now, only, need, open, correct, turn, transform, can]
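The three normalization steps named in this section (Traditional -> Simplified, full-width -> half-width, uppercase -> lowercase) can be sketched in plain Python as follows. This is an assumption-laden illustration, not HanLP's code: the names `normalize`, `normalize_char` and `T2S_SAMPLE` are hypothetical, and the Traditional-to-Simplified table holds only two characters, whereas a real converter needs a full dictionary.

```python
# Hypothetical, tiny Traditional-to-Simplified table for illustration only.
T2S_SAMPLE = {"愛": "爱", "聽": "听"}


def normalize_char(ch: str) -> str:
    """Normalize one character: width, then script variant, then case."""
    code = ord(ch)
    if code == 0x3000:  # ideographic (full-width) space
        ch = " "
    elif 0xFF01 <= code <= 0xFF5E:  # full-width forms of printable ASCII
        ch = chr(code - 0xFEE0)
    ch = T2S_SAMPLE.get(ch, ch)  # Traditional -> Simplified, if known
    return ch.lower()  # uppercase -> lowercase


def normalize(text: str) -> str:
    return "".join(normalize_char(ch) for ch in text)


# Half-width, full-width, and Traditional full-width variants of one phrase
for sample in ["爱听4G", "爱听４Ｇ", "愛聽４Ｇ"]:
    print(normalize(sample))
```

All three variants normalize to the same string, which is why a single custom-dictionary entry can match every spelling once normalization is enabled.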