Notes on Apache Pig and Solr (1)

Updated 2025-10-25. From: SLTechnology News&Howtos (shulou), Internet Technology


These notes record three problems with Pig 0.12.0 and Solr 4.10.2 that Sanxian ran into at work over the past two days.

Question 1: How to load and split data in Pig using ASCII control characters, written in decimal or hexadecimal notation, as delimiters

This question comes up in two scenarios in Pig:

First, when Pig loads data with load.

Second, when Pig splits data or extracts parts of it with a regular expression.

First, a word on why we use control characters as field delimiters instead of common characters such as spaces, commas, colons, and semicolons. Those common characters work too, but if the data itself contains one of them, parsing produces unexpected bugs. Choosing a non-printable control character is the safer option, although the right choice still depends on the case at hand.
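To make the collision concrete, here is a minimal sketch. The record contents and the class and method names are invented for illustration; only the idea (a comma inside a value versus ASCII 1 as the separator) comes from the text above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DelimiterCollisionDemo {
    /** Splits a record on a literal delimiter; the limit -1 keeps trailing empty fields. */
    static List<String> splitOn(String record, String delimiter) {
        // Pattern.quote stops the delimiter from being interpreted as a regex.
        return Arrays.asList(record.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // Hypothetical record with three fields: keyword, category name, count.
        // The category name itself contains a comma.
        String commaRecord = "phone case,Phones, Accessories,42";
        String ctrlRecord = "phone case\u0001Phones, Accessories\u000142";

        System.out.println(splitOn(commaRecord, ",").size());     // 4, one field too many
        System.out.println(splitOn(ctrlRecord, "\u0001").size()); // 3, as intended
    }
}
```

With the comma as separator the category name is cut in two; with ASCII 1 the record keeps its three fields.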

For detailed documentation on ASCII and on decimal and hexadecimal notation, please refer to Wikipedia.

Back to the point. In this example, the data is stored in the following format:

One record per line, UTF-8 encoded.

Each record consists of fields, and each field has a name and content.

Fields are separated by ASCII code 1.

A field's name and content are separated by ASCII code 2.
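The format described above can be parsed in a few lines of Java. This is only a sketch: the class name `RecordParser`, the `kw` field, and its value are invented for illustration, while `prod_cate_disp_id` and `019` are taken from the article's own Java example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RecordParser {
    static final String FIELD_SEP = "\u0001";      // ASCII 1 separates fields
    static final String NAME_VALUE_SEP = "\u0002"; // ASCII 2 separates name and content

    /** Parses one line into an ordered map of field name to field content. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String field : line.split(FIELD_SEP, -1)) {
            // Limit 2: split only on the first ASCII 2 inside the field.
            String[] nameAndValue = field.split(NAME_VALUE_SEP, 2);
            if (nameAndValue.length == 2) {
                fields.put(nameAndValue[0], nameAndValue[1]);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "kw\u0002phone case\u0001prod_cate_disp_id\u0002019";
        System.out.println(parse(line)); // {kw=phone case, prod_cate_disp_id=019}
    }
}
```

Fields that do not contain ASCII 2 are skipped rather than reported, which is a design choice of this sketch and not something the article specifies.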


A small example, run in Eclipse, is as follows.

Java code

public class SplitDemo {
    public static void main(String[] args) {
        // Note: \1 and \2 are displayed differently in an IDE, in NotePad++,
        // and on a Linux terminal; see Wikipedia for how each one is rendered.

        // Sample data: field name and content joined by ASCII 2 (written here as \u0002)
        String s = "prod_cate_disp_id\u0002019";

        // Split rule: "\2" is the octal escape for ASCII 2
        String[] ss = s.split("\2");
        for (String st : ss) {
            System.out.println(st);
        }
    }
}


For the delimiter types that load functions accept, refer to the documentation on the official website.
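The Pig script in this article writes its delimiters in notations such as `\\u001` (described there as decimal) and `\\x0A` (hexadecimal). As a rough illustration of how such notations could map to a single byte, here is a sketch that follows the article's description only. It is not Pig's source code, and `DelimiterNotation.decode` is a hypothetical helper.

```java
class DelimiterNotation {
    /**
     * Hypothetical helper: converts a delimiter notation into one byte.
     * Assumes digits after "\\u" are decimal and digits after "\\x" are hexadecimal,
     * as the article describes.
     */
    static byte decode(String notation) {
        if (notation.length() == 1) {
            return (byte) notation.charAt(0); // a plain single character such as ","
        }
        if (notation.equals("\\t")) {
            return (byte) '\t';
        }
        if (notation.startsWith("\\u")) {
            return (byte) Integer.parseInt(notation.substring(2)); // decimal digits
        }
        if (notation.startsWith("\\x")) {
            return (byte) Integer.parseInt(notation.substring(2), 16); // hex digits
        }
        throw new IllegalArgumentException("Unsupported delimiter notation: " + notation);
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u001")); // 1
        System.out.println(decode("\\x0A"));  // 10
        System.out.println(decode("\\x0B"));  // 11
    }
}
```

Checking the decoded byte this way is a quick sanity test that the delimiter you typed is the one you meant.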

Now let's look at the Pig script.

Pig code

-- Hadoop Technology Exchange Group 415886155
/*
Delimiters supported by Pig include:
1. any string
2. any escape character
3. decimal characters such as \\u001 or \\u002
4. hexadecimal characters such as \\x0A or \\x0B
*/
-- Note: the delimiter in this load statement stands for ASCII 1; Pig parses it directly as a decimal value.
a = load '/tmp/dongliang/20150401/20150301/tmp_search_keywords_cate_stat/' using PigStorage('\\u001');
/*
Note: the separator ^B below is only how a terminal displays the control character;
it stands for ASCII 2.
*/
a = foreach a generate REGEX_EXTRACT($0, '(.*)^B(.*)', 2) as time,
    REGEX_EXTRACT($1, '(.*)^B(.*)', 2) as kw,
    REGEX_EXTRACT($2, '(.*)^B(.*)', 2) as ic,
    REGEX_EXTRACT($3, '(.*)^B(.*)', 2) as cid,
    REGEX_EXTRACT($4, '(.*)^B(.*)', 2) as cname,
    REGEX_EXTRACT($5, '(.*)^B(.*)', 2) as pname,
    REGEX_EXTRACT($6, '(.*)^B(.*)', 2) as snt,
    REGEX_EXTRACT($7, '(.*)^B(.*)', 2) as cnt,
    REGEX_EXTRACT($8, '(.*)^B(.*)', 2) as fnt,
    REGEX_EXTRACT($9, '(.*)^B(.*)', 2) as ant,
    REGEX_EXTRACT($10, '(.*)^B(.*)', 2) as pnt;
-- get the string length
a = foreach a generate SIZE(cid) as len;
-- group by length
b = group a by len;
-- count the records under each length
c = foreach b generate group, COUNT($1);
-- print the output
dump c;
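For readers more at home in Java, the extraction and length-counting steps of the script can be approximated locally. This is a sketch under assumptions: the sample `cid` values are invented, the class and method names are hypothetical, and it uses `java.util.regex` rather than Pig's own REGEX_EXTRACT, so edge-case behavior may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LengthHistogram {
    // Same shape as the script's pattern, with ASCII 2 written as a Unicode escape.
    static final Pattern NAME_VALUE = Pattern.compile("(.*)\u0002(.*)");

    /** Returns capture group 2 (the field content), or null if the field does not match. */
    static String extractValue(String field) {
        Matcher m = NAME_VALUE.matcher(field);
        return m.matches() ? m.group(2) : null;
    }

    /** Counts how many extracted values have each length, smallest length first. */
    static Map<Integer, Long> countByLength(List<String> fields) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (String field : fields) {
            String value = extractValue(field);
            if (value != null) {
                counts.merge(value.length(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample fields in the name + ASCII 2 + content layout.
        List<String> cids = Arrays.asList("cid\u0002019", "cid\u0002123456", "cid\u0002654321");
        System.out.println(countByLength(cids)); // {3=1, 6=2}
    }
}
```

Running a check like this on a small sample before submitting the Pig job makes it easy to see whether the pattern extracts anything at all.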


Question 2: How to query the length of a non-tokenized field in Apache Solr

Solr does not directly provide a function like length() in Java or SIZE in Pig, so how can we query by length?

Although Solr does not support such queries directly, regular-expression queries achieve the same result indirectly, as follows:

1. Fixed length: cid:/.{6}/ keeps only records whose cid has length 6.

2. Length range: cid:/.{6,9}/ keeps only records whose cid has length 6 to 9.

3. Minimum length: cid:/.{6}.*/ keeps only records whose cid has length at least 6.
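The three patterns can be tried out locally before sending them to Solr. The sketch below builds the query clauses as strings and checks sample values with `java.util.regex`. The helper names are hypothetical, and Solr's regular-expression syntax is not identical to Java's in general, although the constructs used here (`.`, `{n}`, `{n,m}`, `.*`) behave the same way in both, and both require the whole value to match.

```java
import java.util.regex.Pattern;

class LengthPatterns {
    /** Clause for values of exactly n characters, e.g. cid:/.{6}/ */
    static String exactLength(String field, int n) {
        return field + ":/.{" + n + "}/";
    }

    /** Clause for values of min to max characters, e.g. cid:/.{6,9}/ */
    static String lengthBetween(String field, int min, int max) {
        return field + ":/.{" + min + "," + max + "}/";
    }

    /** Clause for values of at least min characters, e.g. cid:/.{6}.*&#47; without the entity. */
    static String minLength(String field, int min) {
        return field + ":/.{" + min + "}.*/";
    }

    /** Applies the regex body of a clause to a value locally, matching the whole value. */
    static boolean matchesLocally(String clause, String value) {
        String regex = clause.substring(clause.indexOf(":/") + 2, clause.length() - 1);
        return Pattern.matches(regex, value);
    }

    public static void main(String[] args) {
        System.out.println(exactLength("cid", 6));      // cid:/.{6}/
        System.out.println(lengthBetween("cid", 6, 9)); // cid:/.{6,9}/
        System.out.println(minLength("cid", 6));        // cid:/.{6}.*/
        System.out.println(matchesLocally(exactLength("cid", 6), "123456")); // true
        System.out.println(matchesLocally(exactLength("cid", 6), "12345"));  // false
    }
}
```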

Question 3: When using Pig + MapReduce to add documents to a Solr index in batch, why does the job report no errors or exceptions while the index stays empty?

This was a rather weird problem. At first Sanxian suspected a bug in the program, but the same code added data to another collection without any trouble. Solr's log then showed the following.

Log output

INFO - 2015-04-01 21:08:36.097; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO - 2015-04-01 21:08:36.098; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
INFO - 2015-04-01 21:08:36.101; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
INFO - 2015-04-01 21:08:36.102; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush


In other words, the indexing run finished, but Solr found no uncommitted data and therefore skipped the commit. This looked very strange while the program was running, because the data source on HDFS held at least 1.1 million records, so how could there be no data? A Google search showed that other people had run into similarly strange cases, where rebuilding the index reported success yet the index contained no data. Most puzzling of all, none of those cases came with a solution.

With no other option, Sanxian went back to the program and printed the intermediate data that was about to be indexed. The output was row after row of empty records: the delimiter had stopped working in the regular-expression extraction step. That pinpointed the problem. The Solr index was empty because no data was ever submitted, which also explains the strange log output. After the bug was fixed and the index rebuilt, the run succeeded and the data could be queried normally in Solr. If you hit a similar situation, first make sure the data is actually being read and parsed correctly, whether it comes from a remote source or from a Word, Excel, or txt file. Only if the problem persists after that should you work from the Solr log or the thrown exceptions.
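A cheap guard against this kind of silent failure is to count empty extractions before anything is sent to Solr. The sketch below assumes one possible cause, namely that the pattern contains the two visible characters `^` and `B` instead of the real ASCII 2 character; the article does not say exactly how the delimiter became invalid. All names and sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreIndexCheck {
    /** Returns capture group 2, or null when the pattern does not match the field. */
    static String extract(Pattern pattern, String field) {
        Matcher m = pattern.matcher(field);
        return m.matches() ? m.group(2) : null;
    }

    /** Counts fields whose extracted value is null or blank. */
    static long countEmpty(Pattern pattern, List<String> fields) {
        return fields.stream()
                .map(f -> extract(pattern, f))
                .filter(v -> v == null || v.trim().isEmpty())
                .count();
    }

    public static void main(String[] args) {
        // Invented sample fields: name + ASCII 2 + content.
        List<String> rows = Arrays.asList("cid\u0002019", "cid\u0002123456");

        Pattern working = Pattern.compile("(.*)\u0002(.*)"); // real ASCII 2
        Pattern broken = Pattern.compile("(.*)^B(.*)");      // caret and letter B typed literally

        System.out.println(countEmpty(working, rows)); // 0
        System.out.println(countEmpty(broken, rows));  // 2, every row comes out empty
    }
}
```

If the empty count equals the row count, the job is about to index nothing, and it is better to stop there than to look for the cause in Solr's log.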
