Many developers are unfamiliar with how to use a UDF to implement text segmentation in Hive. This article walks through the approach step by step; by the end you should be able to apply it yourself.
Introduction to UDF
As a SQL query engine, Hive ships with basic built-in functions such as count and sum. Sometimes these built-ins cannot meet our needs, and we have to write our own Hive UDF (user defined function). Writing a Hive UDF involves the following steps:
Create a project and add the related dependencies. I use Maven as the build tool, so I created a Maven project (you need to pick dependency versions that match your cluster, mainly Hadoop and Hive; you can check them with hadoop version and hive --version respectively)
Extend the org.apache.hadoop.hive.ql.exec.UDF class, implement the evaluate method, and then package the project into a jar
Add the jar to the distributed cache with the ADD JAR command. If the jar has already been uploaded to the $HIVE_HOME/lib/ directory, the ADD command is not needed
Register the function with CREATE TEMPORARY FUNCTION; drop the TEMPORARY keyword to create a permanent function
Use the UDF you created in SQL
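To make the steps above concrete, here is a minimal sketch of a simple one-argument UDF; the class name SimpleUpperUDF and its upper-casing behaviour are invented purely for illustration and are not part of the segmentation UDF developed later in this article.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal UDF sketch: extend UDF and implement evaluate(), which Hive resolves by reflection.
public class SimpleUpperUDF extends UDF {
    public Text evaluate(Text input) {
        // Returning null for null input keeps the function NULL-safe.
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}

After packaging the jar, steps 3-5 (ADD JAR, CREATE TEMPORARY FUNCTION, and calling the function in a query) apply to it in exactly the same way as to the segmentation UDFs below.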
A UDF for word segmentation
This is a fairly common scenario. For example, the company's products generate a large number of bullet-screen comments or reviews every day, and we may want to analyze which hot topics people care about most, or what the recent trends on the network are. One problem here is building the dictionary: a general-purpose dictionary may not segment well, especially because many popular internet words are not in it. There is also the problem of stop words. Stop words are usually meaningless for analysis, so we need to filter them out, and the way to do that is with a stop-word dictionary.
At this point there are mainly two solutions: use a dictionary provided by a third party, or build our own dictionary and have dedicated people maintain it, which is also a fairly common situation.
Finally there is the choice of segmentation tool. There are many mainstream word segmentation tools, and choosing a different tool can have a big impact on the segmentation results.
Word segmentation tools
1: IK Analysis, Elasticsearch's open source Chinese analyzer plugin (Star: 2471)
IK Chinese word segmentation for Elasticsearch. The native IK analyzer reads its dictionaries from the file system; es-ik itself can be extended to read dictionaries from different sources, and reading from a sqlite3 database is currently supported. To use es-ik-plugin-sqlite3, set the location of your sqlite3 dictionary in elasticsearch.yml: ik_analysis_db_path: /opt/ik/dictionary.db
2: IKAnalyzer, an open source Java Chinese word segmentation toolkit (Star: 343)
IK Analyzer is an open source, lightweight Java-based Chinese word segmentation toolkit. Since the release of version 1.0 in December 2006, IKAnalyzer has gone through four major versions. It started as a Chinese segmentation component for the open source Lucene project, combining dictionary-based segmentation with grammar analysis. Since version 3.0, IK has developed into a general-purpose Java segmentation component independent of Lucene.
3: Ansj, an open source Java Chinese word segmenter (Star: 3019)
Ansj is a Java implementation of ICTCLAS. It basically rewrites all of the data structures and algorithms, with the dictionary provided by the open source version of ICTCLAS plus some manual optimization. Segmentation speed is about 2 million words per second, and accuracy can reach above 96%.
It currently implements Chinese word segmentation, Chinese name recognition, part-of-speech tagging, user-defined dictionaries, keyword extraction, automatic summarization, keyword tagging and other functions.
It can be applied to natural language processing and similar tasks, and is suitable for all kinds of projects with high requirements on segmentation quality.
4: the Jieba ("stutter") word segmentation ElasticSearch plugin (Star: 188)
Elasticsearch officially provides only smartcn as a Chinese segmentation plugin, and its results are not great. Fortunately there are two Chinese segmentation plugins written by medcl (one of the earliest people to work on ES in China), one for ik and one for mmseg.
5: word, a distributed Java Chinese word segmentation component (Star: 672)
word is a distributed Chinese word segmentation component implemented in Java. It provides a variety of dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It can accurately recognize English, numbers, dates, times and other quantities, and can recognize unknown words such as person names, place names and organization names.
6: jcseg, an open source Java Chinese word segmenter (Star: 400)
What is Jcseg? Jcseg is a lightweight open source Chinese word segmenter based on the mmseg algorithm. It integrates keyword extraction, key phrase extraction, key sentence extraction and automatic article summarization, and provides segmentation interfaces for the latest versions of lucene, solr and elasticsearch. Jcseg ships with a jcseg.properties configuration file.
7: Paoding, a Chinese word segmentation library
The Paoding Chinese word segmentation library is a Chinese search-engine segmentation component developed in Java that can be integrated into Lucene applications for the Internet and intranets. Paoding fills the gap of open source Chinese segmentation components in China and aims to become the first-choice open source Chinese segmentation component for Internet websites, pursuing both high efficiency and a good user experience.
8: mmseg4j, a Chinese word segmenter
mmseg4j implements a Chinese word segmenter using Chih-Hao Tsai's MMSeg algorithm (http://technology.chtsai.org/mmseg/), and implements Lucene's analyzer and Solr's TokenizerFactory so that it can easily be used from Lucene and Solr.
9: Ansj, Chinese word segmentation (Star: 3015)
Ansj is a Java implementation of ICTCLAS. It basically rewrites all of the data structures and algorithms, with the dictionary provided by the open source version of ICTCLAS plus some manual optimization. In-memory segmentation runs at about 1 million words per second (faster than ICTCLAS), file-based segmentation at about 300,000 words per second, and accuracy currently reaches above 96%.
10: ICTCLAS4J, a Chinese word segmentation library for Lucene
ictclas4j is an open source Java word segmentation project completed by sinboy on the basis of FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It simplifies the complexity of the original segmentation program and aims to provide a good learning resource for Chinese word segmentation enthusiasts.
Code implementation
Step 1: add the dependencies
Here we add two dependencies, which are the two different word segmentation tools we will use.
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.6</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>
Before we begin, let's write a demo to play with, so that we can have a basic understanding.
@Test
public void testAnsjSeg() {
    String str = "my name is Li Taibai, I am a poet and I lived in the Tang Dynasty";
    // choose which analyzer to use: BaseAnalysis, ToAnalysis, NlpAnalysis, IndexAnalysis
    Result result = ToAnalysis.parse(str);
    System.out.println(result);

    KeyWordComputer kwc = new KeyWordComputer(5);
    Collection keywords = kwc.computeArticleTfidf(str);
    System.out.println(keywords);
}
Output result
I/r, my name is/v, Li Taibai/nr, ,/w, I/r, is/v, a/m, poet/n, ,/w, I/r, life/vn, in/p, Tang Dynasty/t
[Li Taibai/24.72276098504223, poet/3.0502185968368885, Tang Dynasty/0.8965677022546215, life/0.6892230219652541]
Step 2: add a stop-word dictionary
Because the stop-word dictionary is not very big, I put it directly in the project resources. Of course, you can also put it somewhere else, such as HDFS.
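The IknalyzerSeg implementation below reads this file from stopwords/baidu_stopwords.txt and adds every line it reads to a set, so the expected layout is simply one stop word per line. A tiny sample; the specific entries here are chosen purely for illustration and are not taken from the real Baidu list:

的
了
是
啊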
Step 3: write the UDF
The code is fairly simple, so I will not explain it in detail. What we need to pay attention to are the rules for using the methods of GenericUDF; we will talk later about the quality of the code design and possible improvements. The two implementations below follow almost the same idea, the only difference being which segmentation tool is used.
The Ansj implementation
The information in the @Description annotation is what is returned every time you run DESC FUNCTION against the function.

/**
 * Chinese word segmentation with the user dict in com.kingcall.dic,
 * using Ansj (a Java open source analyzer).
 */
@Description(name = "ansj_seg",
        value = "_FUNC_(str) - chinese words segment using ansj. Return list of words.",
        extended = "Example: select _FUNC_('I am a test string') from src limit 1;\n"
                + "[\"I\", \"Yes\", \"Test\", \"string\"]")
public class AnsjSeg extends GenericUDF {
    private transient ObjectInspectorConverters.Converter[] converters;
    private static final String userDic = "/app/stopwords/com.kingcall.dic";

    // load userDic from HDFS
    static {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path(userDic));
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String line = null;
            String[] strs = null;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) {
                    strs = line.split("\t");
                    strs[0] = strs[0].toLowerCase();
                    DicLibrary.insert(DicLibrary.DEFAULT, strs[0]); // ignore nature and freq
                }
            }
            MyStaticValue.isNameRecognition = Boolean.FALSE;
            MyStaticValue.isQuantifierRecognition = Boolean.TRUE;
        } catch (Exception e) {
            System.out.println("Error when load userDic " + e.getMessage());
        }
    }

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 1 || arguments.length > 2) {
            throw new UDFArgumentLengthException("The function AnsjSeg(str) takes 1 or 2 arguments.");
        }
        converters = new ObjectInspectorConverters.Converter[arguments.length];
        converters[0] = ObjectInspectorConverters.getConverter(arguments[0],
                PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        if (2 == arguments.length) {
            converters[1] = ObjectInspectorConverters.getConverter(arguments[1],
                    PrimitiveObjectInspectorFactory.writableIntObjectInspector);
        }
        return ObjectInspectorFactory.getStandardListObjectInspector(
                PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        boolean filterStop = false;
        if (arguments[0].get() == null) {
            return null;
        }
        if (2 == arguments.length) {
            IntWritable filterParam = (IntWritable) converters[1].convert(arguments[1].get());
            if (1 == filterParam.get()) filterStop = true;
        }
        Text s = (Text) converters[0].convert(arguments[0].get());
        ArrayList<Text> result = new ArrayList<>();
        if (filterStop) {
            // segment and drop the terms contained in the stop-word library
            for (Term words : DicAnalysis.parse(s.toString()).recognition(StopLibrary.get())) {
                if (words.getName().trim().length() > 0) {
                    result.add(new Text(words.getName().trim()));
                }
            }
        } else {
            for (Term words : DicAnalysis.parse(s.toString())) {
                if (words.getName().trim().length() > 0) {
                    result.add(new Text(words.getName().trim()));
                }
            }
        }
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return getStandardDisplayString("ansj_seg", children);
    }
}
The IKAnalyzer implementation
@Description(name = "ansj_seg",
        value = "_FUNC_(str) - chinese words segment using Iknalyzer. Return list of words.",
        extended = "Example: select _FUNC_('I am a test string') from src limit 1;\n"
                + "[\"I\", \"Yes\", \"Test\", \"string\"]")
public class IknalyzerSeg extends GenericUDF {
    private transient ObjectInspectorConverters.Converter[] converters;
    // set used to store the stop words
    Set<String> stopWordSet = new HashSet<>();

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 1 || arguments.length > 2) {
            throw new UDFArgumentLengthException("The function AnsjSeg(str) takes 1 or 2 arguments.");
        }
        // read the stop-word file
        BufferedReader stopWordFileBr = null;
        try {
            stopWordFileBr = new BufferedReader(new InputStreamReader(
                    new FileInputStream(new File("stopwords/baidu_stopwords.txt"))));
            String stopWord = null;
            while ((stopWord = stopWordFileBr.readLine()) != null) {
                stopWordSet.add(stopWord);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        converters = new ObjectInspectorConverters.Converter[arguments.length];
        converters[0] = ObjectInspectorConverters.getConverter(arguments[0],
                PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        if (2 == arguments.length) {
            converters[1] = ObjectInspectorConverters.getConverter(arguments[1],
                    PrimitiveObjectInspectorFactory.writableIntObjectInspector);
        }
        return ObjectInspectorFactory.getStandardListObjectInspector(
                PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        boolean filterStop = false;
        if (arguments[0].get() == null) {
            return null;
        }
        if (2 == arguments.length) {
            IntWritable filterParam = (IntWritable) converters[1].convert(arguments[1].get());
            if (1 == filterParam.get()) filterStop = true;
        }
        Text s = (Text) converters[0].convert(arguments[0].get());
        StringReader reader = new StringReader(s.toString());
        IKSegmenter iks = new IKSegmenter(reader, true); // true = smart mode
        List<Text> list = new ArrayList<>();
        if (filterStop) {
            try {
                Lexeme lexeme;
                while ((lexeme = iks.next()) != null) {
                    if (!stopWordSet.contains(lexeme.getLexemeText())) {
                        list.add(new Text(lexeme.getLexemeText()));
                    }
                }
            } catch (IOException e) {
                // ignored
            }
        } else {
            try {
                Lexeme lexeme;
                while ((lexeme = iks.next()) != null) {
                    list.add(new Text(lexeme.getLexemeText()));
                }
            } catch (IOException e) {
                // ignored
            }
        }
        return list;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "Usage: evaluate(String str)";
    }
}
Step 4: write test cases
GenericUDF provides us with methods that can be used to build the environment and parameters needed for testing, so that we can test the code.
@Test
public void testAnsjSegFunc() throws HiveException {
    AnsjSeg udf = new AnsjSeg();
    ObjectInspector valueOI0 = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector valueOI1 = PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    ObjectInspector[] init_args = {valueOI0, valueOI1};
    udf.initialize(init_args);

    Text str = new Text("I am a test string");
    GenericUDF.DeferredObject valueObj0 = new GenericUDF.DeferredJavaObject(str);
    GenericUDF.DeferredObject valueObj1 = new GenericUDF.DeferredJavaObject(0);
    GenericUDF.DeferredObject[] args = {valueObj0, valueObj1};
    ArrayList res = (ArrayList) udf.evaluate(args);
    System.out.println(res);
}

@Test
public void testIkSegFunc() throws HiveException {
    IknalyzerSeg udf = new IknalyzerSeg();
    ObjectInspector valueOI0 = PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    ObjectInspector valueOI1 = PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    ObjectInspector[] init_args = {valueOI0, valueOI1};
    udf.initialize(init_args);

    Text str = new Text("I am the test string");
    GenericUDF.DeferredObject valueObj0 = new GenericUDF.DeferredJavaObject(str);
    GenericUDF.DeferredObject valueObj1 = new GenericUDF.DeferredJavaObject(0);
    GenericUDF.DeferredObject[] args = {valueObj0, valueObj1};
    ArrayList res = (ArrayList) udf.evaluate(args);
    System.out.println(res);
}
We can see that loading the user dictionary fails in the first test, because the file on HDFS cannot be read in the local test environment, but the test as a whole still runs.
Our second example does not need to load anything from HDFS, so its test runs cleanly.
Note that, in order to be able to update the file externally later on, I put it on HDFS, as the code in AnsjSeg does.
Step 5: create the UDF and use it
add jar /Users/liuwenqiang/workspace/code/idea/HiveUDF/target/HiveUDF-0.0.4.jar;

create temporary function ansjSeg as 'com.kingcall.bigdata.HiveUDF.AnsjSeg';
select ansjSeg("I am a string, what are you");
-- enable stop-word filtering
select ansjSeg("I am a string, what are you", 1);

create temporary function ikSeg as 'com.kingcall.bigdata.HiveUDF.IknalyzerSeg';
select ikSeg("I am a string, what are you");
select ikSeg("I am a string, what are you", 1);
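As mentioned in the introduction, dropping the TEMPORARY keyword creates a permanent function. A minimal sketch of that variant, assuming the jar has been uploaded to an HDFS location of your choosing; the HDFS path below is invented for illustration and is not from the original project:

-- permanent function; the jar location is an illustrative HDFS path
create function ansjSeg as 'com.kingcall.bigdata.HiveUDF.AnsjSeg'
using jar 'hdfs:///user/hive/udf/HiveUDF-0.0.4.jar';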
The second parameter of the functions above controls whether stop-word filtering is enabled. Let's use the ikSeg function to demonstrate it.
Let's try to get the description of the function.
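The description comes from the @Description annotation and can be displayed with Hive's DESC FUNCTION command, shown here for the ikSeg function registered above (the exact output depends on your Hive version):

desc function ikSeg;
desc function extended ikSeg;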
If the @Description annotation is not written, the output falls back to a generic message.
Other application scenarios
Writing a Hive UDF can easily help us implement many common requirements. Other scenarios include:
IP address to region: convert the ip field in reported user logs into country-province-city format, to make it easy to analyze the regional distribution (see the sketch after this list)
Importing label data computed with Hive SQL without writing a Spark program: initialize a connection pool in the UDF's static code block, and use the parallel MR tasks started by Hive to import large amounts of data into codis in parallel, which can serve some recommendation services
Other relatively complex tasks that are awkward to express in SQL can also be handled by writing a permanent Hive UDF
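As a rough illustration of the first scenario, here is a minimal sketch of an IP-to-region UDF. The class name IpToRegion, the hard-coded lookup table and the prefix-matching logic are all invented for illustration; a real implementation would query an IP geolocation database instead.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: map an IP string to "country-province-city" by prefix lookup.
public class IpToRegion extends UDF {
    // A real UDF would load an IP geolocation database; this tiny table is illustrative only.
    private static final Map<String, String> PREFIX_TO_REGION = new LinkedHashMap<>();
    static {
        PREFIX_TO_REGION.put("10.1.", "China-Beijing-Beijing");
        PREFIX_TO_REGION.put("10.2.", "China-Zhejiang-Hangzhou");
    }

    public Text evaluate(Text ip) {
        if (ip == null) {
            return null;
        }
        for (Map.Entry<String, String> entry : PREFIX_TO_REGION.entrySet()) {
            if (ip.toString().startsWith(entry.getKey())) {
                return new Text(entry.getValue());
            }
        }
        return new Text("unknown");
    }
}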
After reading the above, have you mastered how to use a UDF to implement text segmentation in Hive? Thank you for reading!