I have been selling large amounts of Weibo data and travel-website review data for a long time, and I also provide custom data-crawling services. Message me at YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768
Background
I took part in the Teddy Cup data mining competition and really learned a lot this time. In the end I completed almost all of the required work, and the accuracy was quite good. The whole thing, including the intermediate processing, is no more than 500 lines of code, and the idea is fairly simple: it relies mainly on the short-text characteristics of forum posts and the similarity between floors (replies). Put colloquially, it denoises the page and keeps only the relatively regular dates and content.
Preparation in advance
Software and development environment: PyCharm, Python 2.7, Linux
The main Python packages used: jieba, requests, BeautifulSoup, goose, selenium, PhantomJS, pymongo, etc. (I covered the installation of some of these in a previous blog post.)
Web page preprocessing
First of all, because many forum pages are rendered dynamically, some information cannot be obtained directly with bs4, so we use selenium with PhantomJS to render the page, save it locally, and then process it.
The related code is
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

def save(baseUrl):
    driver = webdriver.PhantomJS()
    driver.get(baseUrl)
    try:
        # wait up to 10 seconds for the JS-rendered content to appear
        element = WebDriverWait(driver, 10).until(lambda d: isload(d) is True)
    except Exception, e:
        print e
    finally:
        data = driver.page_source  # fetch the page content after js is loaded
        driver.quit()
        return data
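The isload() helper referenced above is not shown in the original post. A minimal sketch of what such a check might look like, purely as an assumption on my part:

# Hypothetical isload() (not from the original post): report whether the
# JS-rendered page body has any visible text yet.
def isload(driver):
    try:
        body = driver.find_element_by_tag_name('body')
        return len(body.text.strip()) > 0
    except Exception:
        return False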
Since there is a lot of noise in the page (advertisements, images, and so on), we first need to remove as much of the noise that is irrelevant to the content we want to extract as possible. We start by removing tags that typically carry only noise, such as script, and we use BeautifulSoup to do this.
The code looks something like this.
from bs4 import Comment

# strip HTML comments
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()

# strip typical noise tags
[s.extract() for s in soup('script')]
[s.extract() for s in soup('meta')]
[s.extract() for s in soup('style')]
[s.extract() for s in soup('link')]
[s.extract() for s in soup('img')]
[s.extract() for s in soup('input')]
[s.extract() for s in soup('br')]
[s.extract() for s in soup('li')]
[s.extract() for s in soup('ul')]

print(soup.prettify())
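For reference, the soup object used above would be built from the page saved earlier; a minimal sketch, assuming the save() function from the previous step:

from bs4 import BeautifulSoup

# render the page with PhantomJS, then parse the saved HTML
data = save(baseUrl)
soup = BeautifulSoup(data, 'html.parser')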
Comparison of web pages after processing
We can see that there is much less noise on the page, but it is still too noisy to extract what we want directly.
Since we do not need the tags themselves, only the text inside them, we can use BeautifulSoup to extract the text and analyze it.
import os

for string in soup.stripped_strings:
    print(string)
    with open(os.path.join(os.getcwd()) + "/data/3.txt", 'a') as f:
        f.writelines(string.encode('utf-8') + '\n')
It can be seen that the text is still quite messy, but it is very regular. We can see that the text in each floor follows essentially the same pattern, with a great deal of repetition, and that there are some forum-specific words such as "jump to floor", "bench", "sofa" (forum slang), and so on. We need to delete these words before further analysis.
The method I use is to segment the obtained web page text with jieba and count the most frequent words, which are also the ones most likely to appear in the noise. The code is as follows.
import jieba
import jieba.analyse

text = open(r"./data/get.txt", "r").read()

# count raw term frequencies with the search-mode tokenizer
dic = {}
cut = jieba.cut_for_search(text)
for fc in cut:
    if fc in dic:
        dic[fc] += 1
    else:
        dic[fc] = 1

# extract the top-weighted keywords and write them out with their counts
blog = jieba.analyse.extract_tags(text, topK=1000, withWeight=True)
for word_weight in blog:
    # print(word_weight[0].encode('utf-8'), dic.get(word_weight[0], 'not found'))
    with open('cut.txt', 'a') as f:
        f.writelines(word_weight[0].encode('utf-8') + " " + str(dic.get(word_weight[0], 'not found')) + '\n')
After counting, testing, and screening, the stop words obtained include the following:
Reply post
integration
Post
Log in
Forum
Register
Offline
time
Author
Sign in
Theme
Essence
Client
Mobile phone
download
Share
At present, about 200 such words have been collected.
And then there's the work of removing duplicate text.
import re

# deduplication function
def remove_dup(items):
    pattern1 = re.compile(r'published on')   # "published on" marker (Chinese in the original post)
    pattern2 = re.compile(r'\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}:\d{2}')
    pattern3 = re.compile(r'\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern4 = re.compile(r'\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern5 = re.compile(r'[^0-9a-zA-Z]{7,}')
    # A set is used as the container for part of the duplicate check; the rest
    # is decided by the regex matches. yield turns the function into a
    # generator, so the caller iterates over the deduplicated text outside
    # the function.
    seen = set()
    for item in items:
        match2 = pattern1.match(item)
        match3 = pattern2.match(item)
        match4 = pattern3.match(item)
        match5 = pattern4.match(item)
        match6 = pattern5.match(item)
        if item not in seen or match2 or match3 or match4 or match5 or match6:
            yield item
            seen.add(item)  # the set automatically ignores repeated additions
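A minimal usage sketch, assuming the stripped strings saved earlier are read back from data/3.txt:

# read the stripped strings back and drop literal duplicates;
# date-like lines and long non-alphanumeric runs are always kept
raw_lines = [line.strip() for line in open('./data/3.txt')]
after_string = list(remove_dup(raw_lines))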
After observing the processed web page text, we find another kind of noise that cannot be ignored: pure numbers. The page text contains many pure numbers that do not repeat, such as like counts, so my plan was to match pure numbers with a regular expression and delete them. But this causes a problem: some user names are purely numeric, so they would be deleted as well. To work around this, we only keep pure-number lines with more than 7 digits; this removes most of the useless numbers while preserving numeric user names as much as possible.
The related code is as follows
# stop_words: the screened stop-word list read from file (one word per line)
st = []
for stop_word in stop_words:
    st.append(stop_word.strip('\n'))
t = tuple(st)  # startswith() takes a tuple; unlike a list, a tuple is immutable

lines = []
# remove stop words and short pure-number lines from the deduplicated text
for j in after_string:
    # keep the line only if it does not begin with a stop word
    if not j.startswith(t):
        # keep the line if it is not all digits, or if it has more than 7
        # digits (to distinguish irrelevant numbers from numeric user names)
        if not re.match(r'\d+$', j) or len(j) > 7:
            lines.append(j.strip())  # strip surrounding whitespace and output
            print(j.strip())
The processed text is shown below, and the pattern is now very obvious.
And then it's time for us to extract the content.
Content extraction
Content extraction is essentially a matter of finding the comment blocks, and the comment blocks are already very clear in the picture above, so we naturally want to split them by date. After observation, dates appear in only 5 formats across all the forums examined (only 5 for now; more can be added later). We can use regular expressions to match the lines containing dates, and then extract the comment content and user name sandwiched between two date lines.
Pass in the processed text and record the line numbers on which the dates appear.
# match dates and return get_list
def match_date(lines):
    pattern1 = re.compile(r'published on')     # "published on" marker (Chinese in the original post)
    pattern2 = re.compile(r'\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}:\d{2}')
    pattern3 = re.compile(r'\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern4 = re.compile(r'\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}')
    pattern5 = re.compile(r'publication date')  # "publication date" marker (Chinese in the original post)
    pre_count = -1
    get_list = []
    # record every line that looks like a date
    for string in lines:
        match2 = pattern1.match(string)
        match3 = pattern2.match(string)
        match4 = pattern3.match(string)
        match5 = pattern4.match(string)
        match6 = pattern5.match(string)
        pre_count += 1
        if match2 or match3 or match4 or match5 or match6:
            get_dic = {'count': pre_count, 'date': string}
            get_list.append(get_dic)
    # return the matched date information
    return get_list
Posts with replies and posts without replies are not handled in the same way, so we need to treat the cases separately. We know that a comment's content lies between two matched date lines, which leaves a problem: the content area of the last comment has no following date to delimit it. Considering that most final posts fit on one line, we can temporarily assume a block size of 3 (sub == 3, i.e. one comment line plus one user-name line), and later adopt a more scientific method, such as judging the text density of the following lines: if it is very low, the comment most likely occupies only one line. A rough sketch of that idea follows.
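The text-density idea mentioned above is not implemented in the original code; a rough sketch of it, under my own assumptions (the threshold of 40 characters is invented for illustration), might look like this:

# Hypothetical sketch (not the author's code): estimate how many lines the
# last comment block occupies by looking at the average length ("density")
# of the lines after the last matched date; if those lines are short, the
# comment probably fits on a single line, so the block is about 3 lines.
def guess_last_block_size(lines, last_date_line, default=3):
    tail = lines[last_date_line + 1:]
    if not tail:
        return default
    density = sum(len(l) for l in tail) / float(len(tail))
    # short tail lines -> one-line comment plus user name -> keep the default 3
    return default if density < 40 else len(tail) + 1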
The following code records the line numbers on which the dates appear and the differences in line number between adjacent dates.
# return my_count
def get_count(get_list):
    my_count = []
    date = []
    # collect the line numbers on which the dates appear
    for i in get_list:
        k, t = i.get('count'), i.get('date')
        my_count.append(k)
        date.append(t)
    if len(get_list) > 1:
        # the last comment block is temporarily assumed to span 3 lines
        my_count.append(my_count[-1] + 3)
        return my_count
    else:
        return my_count

# get the line-number differences between adjacent dates
def get_sub(my_count):
    sub = []
    for i in range(len(my_count) - 1):
        sub.append(my_count[i + 1] - my_count[i])
    return sub
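Putting the pieces together, the flow from the cleaned lines to the line-number differences looks roughly like this (a sketch, where lines is the cleaned text from the previous steps):

# chain the helpers: find date lines, collect their line numbers,
# then compute the gaps between consecutive dates
get_list = match_date(lines)
my_count = get_count(get_list)
sub = get_sub(my_count)
print(sub)  # e.g. mostly 3s for a "date - comment - user name" layout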
Next we discuss the cases one by one.
If there is only the original poster's post and no comments (that is, my_count contains a single entry), we can use the open-source text-extraction library goose to extract the main text.
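For reference, basic usage of goose with the Chinese stop-word class looks roughly like this (the full goose_content() function appears later; my_url is the post's URL):

from goose import Goose
from goose.text import StopWordsChinese

# extract the main text of a post that has no replies
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=my_url)
print(article.cleaned_text)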
If there are comments, we classify them according to the value of sub. If sub == 2 is in the majority (or more common than sub == 3), we assume the user names were deleted. There are several possible reasons: someone replied inside the floor, so deduplication removed the repeated user name; or the site wraps user names in unusual tags, so they were stripped along with the noise. These cases are more complex and not very frequent, so they are not handled for now; in any case they do not prevent us from extracting the comment content, so we simply classify them and move on.
Note: the cosine-similarity step below was over-thinking on my part at the beginning! In most cases the layout is date - comment - user name. Later I dropped the cosine-similarity classification: the code became shorter and the accuracy did not drop. I keep it here to show the thought process; just read the code, and the revised source is given at the end.
Then there is the most common case, where sub == 3 is in the majority. Since most comments are a single line of text, what we need to decide when sub == 3 is which of the three lines holds the comment text. In plain terms: are the three lines date - comment - user name, or date - user name - comment? The first case is far more common, but we cannot ignore the second. How do we tell them apart? It took me a long time, and then it occurred to me that cosine similarity could solve it (see any introduction to cosine similarity for the background). Simply put, user names all have similar lengths, while comment lengths vary greatly: a user name is around 7 characters, but a comment can be hundreds of characters or just one. So we compare the candidate lines pairwise by cosine similarity, take the average, and the position with the highest average similarity is the user name; the remaining position is the comment content, which we can then extract. That is the main idea; the rest is implementation.
Here is the relevant code.
import math
import numpy
from goose import Goose
from goose.text import StopWordsChinese

# get_title() and the MongoDB collection SpiderBBS_info are defined elsewhere
# in the project and are used as-is here.

# when there are no replies, extract the main post with goose
def goose_content(my_count, lines, my_url):
    g = Goose({'stopwords_class': StopWordsChinese})
    content_1 = g.extract(url=my_url)
    host = {}
    my_list = []
    host['content'] = content_1.cleaned_text
    host['date'] = lines[my_count[0]]
    host['title'] = get_title(my_url)
    result = {"post": host, "replys": my_list}
    SpiderBBS_info.insert(result)

# calculate the cosine similarity of two equal-length vectors
def cos_dist(a, b):
    if len(a) != len(b):
        return None
    part_up = 0.0
    a_sq = 0.0
    b_sq = 0.0
    for a1, b1 in zip(a, b):
        part_up += a1 * b1
        a_sq += a1 ** 2
        b_sq += b1 ** 2
    part_down = math.sqrt(a_sq * b_sq)
    if part_down == 0.0:
        return None
    else:
        return part_up / part_down

# determine which line of a 3-line comment block holds the comment:
# is it the middle line of the block, or the last line of the block?
def get_3_comment(my_count, lines):
    get_pd_1 = []
    get_pd_2 = []
    # length vectors of the candidate lines taken from blocks whose interval is 3
    test_sat_1 = []
    test_sat_2 = []
    for num in range(len(my_count) - 1):
        if my_count[num + 1] - 3 == my_count[num]:
            pd_1 = (len(lines[my_count[num]]), len(lines[my_count[num] + 2]))
            get_pd_1.append(pd_1)
            pd_2 = (len(lines[my_count[num]]), len(lines[my_count[num] + 1]))
            get_pd_2.append(pd_2)
    for i_cos in range(len(get_pd_1) - 1):
        for j_cos in range(i_cos + 1, len(get_pd_1)):
            # pairwise cosine similarity of the length vectors
            test_sat_1.append(cos_dist(get_pd_1[j_cos], get_pd_1[i_cos]))
            test_sat_2.append(cos_dist(get_pd_2[j_cos], get_pd_2[i_cos]))
    # average cosine similarity for each candidate position
    get_mean_1 = numpy.array(test_sat_1)
    print(get_mean_1.mean())
    get_mean_2 = numpy.array(test_sat_2)
    print(get_mean_2.mean())
    # the more uniform position is taken to be the user name
    if get_mean_1.mean() >= get_mean_2.mean():
        return 1
    elif get_mean_1.mean() < get_mean_2.mean():
        return 2

# get the comment content
def solve__3(num, my_count, sub, lines, my_url):
    # if get_3_comment() returned 1, the last line of each block is more likely
    # to be the user name; otherwise the line right after the date is
    if num == 1:
        host = {}
        my_list = []
        host['content'] = ''.join(lines[my_count[0] + 1: my_count[0] + sub[0] - 1])
        host['date'] = lines[my_count[0]]
        host['title'] = get_title(my_url)
        for use in range(1, len(my_count) - 1):
            pl = {'content': ''.join(lines[my_count[use] + 1: my_count[use + 1] - 1]),
                  'date': lines[my_count[use]],
                  'title': get_title(my_url)}
            my_list.append(pl)
        result = {"post": host, "replys": my_list}
        SpiderBBS_info.insert(result)
    if num == 2:
        host = {}
        my_list = []
        host['content'] = ''.join(lines[my_count[0] + 2: my_count[0] + sub[0]])
        host['date'] = lines[my_count[0]]
        host['title'] = get_title(my_url)
        for use in range(1, len(my_count) - 1):
            pl = {'content': ''.join(lines[my_count[use] + 2: my_count[use + 1]]),
                  'date': lines[my_count[use]],
                  'title': get_title(my_url)}
            my_list.append(pl)
        result = {"post": host, "replys": my_list}
        SpiderBBS_info.insert(result)
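To illustrate why the length-vector trick works, here is a small made-up example using cos_dist() (the lengths are invented for illustration only): user-name lines have similar lengths from floor to floor, so their vectors are nearly parallel, while comment lengths vary wildly.

# made-up (date-line length, candidate-line length) pairs for two floors
username_vecs = [(7, 8), (6, 7)]     # candidate position holding user names
comment_vecs = [(7, 120), (6, 3)]    # candidate position holding comments

print(cos_dist(username_vecs[0], username_vecs[1]))  # roughly 1.0: very similar
print(cos_dist(comment_vecs[0], comment_vecs[1]))    # roughly 0.5: much less similar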
Prospect
To improve extraction accuracy, we should analyze more BBS sites, refine the removal of repeated words (it is too crude at present), refine the stop-word list, better handle short posts with no replies, accurately extract the original poster's user name, and so on, but time was too tight for further optimization. I have only been learning Python for a few months, so the code inevitably has shortcomings; I hope you will offer your valuable suggestions.
Personal blog
8aoy1.cn