2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains in detail how to use HanLP to load a large dictionary. The editor finds the approach quite practical and shares it here for reference. I hope you gain something from reading it.
Problem
I needed to load a dictionary of nearly 1G into HanLP. At first I used the CustomDictionary.add() method to add the entries one by one. As expected, partway through, the cost of maintaining the DoubleArrayTrie (DAT) became too high, and adding a single node took a long time. The long build time would not matter by itself: once the .bin file has been trained, the second load is very fast. However, a DAT trades space for time, so its memory consumption is very large, and as expected an out-of-memory (heap size) error appeared. Later I tried loading the 1G dictionary directly, and memory was obviously still not enough.
Ideas
I read part of the HanLP source code and asked the original author a few questions, then decided to start from the source code. The initial idea was to split the original dictionary into multiple parts, train the small dictionaries into multiple small .bin files, and then load all of them into memory, on the principle that loading two 10M dictionaries costs less than loading one 20M dictionary.
After some further optimization, loading a dictionary of about 1G now takes about 3G+ of memory, which is already usable.
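The splitting step can be sketched roughly as follows. This is only an illustration, not the author's code and not part of HanLP: the class DictionarySplitter, its methods, and the shard count are hypothetical, and it assumes each dictionary line starts with the word followed by whitespace. It assigns a line to a shard by the first character of its word, which matches the mapping settled on in the flow described later in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of HanLP): splits dictionary lines into shards. */
final class DictionarySplitter {

    /** Maps a word to a shard index using only its first code point. */
    static int shardOf(String word, int shardCount) {
        return Math.floorMod(word.codePointAt(0), shardCount);
    }

    /**
     * Groups non-empty lines by shard index.
     * Assumption: the word is the text before the first whitespace on each line.
     */
    static Map<Integer, List<String>> split(List<String> lines, int shardCount) {
        Map<Integer, List<String>> shards = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String word = trimmed.split("\\s+", 2)[0];
            shards.computeIfAbsent(shardOf(word, shardCount), k -> new ArrayList<>())
                  .add(trimmed);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("apple n 10", "apricot n 3", "banana n 7");
        // Each map value could then be written to its own small dictionary file.
        System.out.println(split(lines, 4));
    }
}
```

Each resulting group would be written to its own small file and trained into its own small .bin. The modulo on the first code point is just one simple choice; any stable function of the first character would keep same-prefix entries together.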
Approximate flow
Modify CustomDictionary.java to set up a HashMap or a List that stores all the small DATs.
After all the DATs are loaded, there is no longer any distinction between the primary and secondary dictionaries.
Modify the combineByCustomDictionary function in Segment.java. The original source uses only one DAT, so here we need to select one DAT from our container to do the matching. My earlier scheme was to traverse all the DATs until one produced a match, but its defect is obvious: it cannot handle the case where several dictionaries match the same string. My current solution is to map entries that begin with the same character to the same file, so that this problem cannot occur.
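The routing idea in the last step can be illustrated with a minimal sketch. It is not the modified HanLP source: ShardedDictionary and its methods are hypothetical names, and plain HashMaps stand in for the small DATs so that the example runs without HanLP. The point it shows is that the first character at the current position selects exactly one small dictionary, so no other dictionary can contain a competing match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a container of small DATs; HashMaps replace the tries. */
final class ShardedDictionary {
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardedDictionary(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    /** The same first character always selects the same shard. */
    private Map<String, String> shardFor(int firstCodePoint) {
        return shards.get(Math.floorMod(firstCodePoint, shards.size()));
    }

    /** Stores an entry in the shard chosen by the word's first character. */
    void put(String word, String attribute) {
        shardFor(word.codePointAt(0)).put(word, attribute);
    }

    /**
     * Returns the longest entry that starts at the given offset, or null if none.
     * Only one shard is consulted, so the other shards are never traversed.
     */
    String longestMatch(String text, int offset) {
        Map<String, String> shard = shardFor(text.codePointAt(offset));
        for (int end = text.length(); end > offset; end--) {
            String candidate = text.substring(offset, end);
            if (shard.containsKey(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

A real DAT would walk the text once along the trie instead of testing every substring as this sketch does. The naive loop is used here only to keep the example short.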
That concludes this article on how to use HanLP to load a large dictionary. I hope the content above is of some help.