
What is the method of jspXCMS user collection and management?

2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article describes how collection is set up and managed in jspXCMS: how collection works, how to define the list address and article URL rules, how to configure collection fields, and how to run a collection and view its results.

The collection feature pulls articles and news from other websites into your own system. It can also be used during a migration, to import data from an old system into the new one.

The system ships with collection rules for some websites, but if one of those websites changes its layout, the rules may no longer collect correctly.

Principle

Collection analyzes two types of pages: the column list page and the article detail page. Websites generally group their articles into columns. First find the column list page to be collected and analyze its source code to locate the article list and extract the URL of each article. Then analyze the source code of the article detail page and parse out the title, release date, body text and other data.

How to view the HTML source code of a web page

Right-click on a blank area of the page in the browser (not on a picture or on text) to open the context menu (a few websites block the right-click), then click "View Page Source Code" (the exact name differs slightly between browsers). The HTML source code of the page is displayed.

Collection list

In the back-end navigation, click "generate" - "Collection Management" to open the collection list page.

Adding a collection

Click "add" on the Collection Management-list page.

This opens the new collection page.

Name: name of the collection.

Save to column: the column in which the collected data is saved.

Page coding: the character encoding of the collected page, usually UTF-8 or GBK. If the encoding is set incorrectly, the collected text will be garbled. Check the source code of the page to be collected to confirm its encoding, which is normally declared in a meta tag in the page head. If the page declares GB2312, the setting can also be GBK, because GBK is a superset of GB2312.

Whether to submit or not: with "No", the collected data is given the "collected" status and must be audited before it is displayed on the website. With "Yes", the collected data is submitted on behalf of the collecting user; if that user has final approval authority, the data is given the "released" status and is displayed on the website directly.
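As a rough illustration of the check described above, the following sketch reads the charset declared in a page's meta tag and maps GB2312 to GBK. The class and method names are hypothetical and are not part of jspXCMS; the sketch only looks at meta tags and ignores HTTP headers.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a jspXCMS class.
class PageCharsetSniffer {
    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in the HTML, upper-cased, if any. */
    static Optional<String> declaredCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find()
                ? Optional.of(m.group(1).toUpperCase(Locale.ROOT))
                : Optional.empty();
    }

    /** GBK is a superset of GB2312, so GB2312 pages can be read as GBK. */
    static String collectionEncoding(String declared) {
        return "GB2312".equalsIgnoreCase(declared) ? "GBK" : declared;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"gb2312\"></head></html>";
        String declared = declaredCharset(html).orElse("UTF-8");
        System.out.println(declared + " -> " + collectionEncoding(declared));
    }
}
```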

Interval time: the pause between collecting one piece of data and the next, chosen at random between the minimum and the maximum value. Some websites block frequent requests; random intervals make the collection look more like a normal user browsing the site.
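The random interval can be pictured with a minimal sketch like the one below. It is an illustration only, assuming the minimum and maximum are given in milliseconds; the class and method names are hypothetical and the actual jspXCMS implementation may differ.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not a jspXCMS class.
class RandomInterval {
    /** Returns a delay in milliseconds chosen uniformly from [minMs, maxMs]. */
    static long nextDelayMs(long minMs, long maxMs) {
        if (minMs < 0 || maxMs < minMs) {
            throw new IllegalArgumentException("require 0 <= minMs <= maxMs");
        }
        // The upper bound of nextLong is exclusive, so add 1 to include maxMs.
        return ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 1; i <= 3; i++) {
            long delay = nextDelayMs(500, 1500);
            System.out.println("item " + i + ": waiting " + delay + " ms");
            Thread.sleep(delay); // pause before fetching the next page
        }
    }
}
```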

User agent: the User Agent string sent with each request, which imitates a browser. The default is usually "Mozilla/5.0". When a browser visits a website, it sends User Agent information that includes the browser version, operating system version and similar details. Some websites use this information to decide whether a visitor is a normal user or a crawler, and may refuse access or return a different page to a crawler. If you run into this problem, set a User Agent that looks more like a real browser.
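For readers unfamiliar with the header, here is a small sketch of a request that carries a browser-like User Agent, written with the standard java.net.http client (Java 11 or later). The User Agent string and the class name are example values chosen for illustration; how jspXCMS itself sends requests is not shown in this article.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not a jspXCMS class.
class UserAgentRequest {
    // Example value only; any real browser's User Agent string could be used.
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    /** Builds a GET request that sends the given User-Agent header. */
    static HttpRequest build(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Only builds the request and prints the header; nothing is sent.
        HttpRequest request = build("http://example.com/", BROWSER_UA);
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```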

List address: the address of the list page to be collected. You can enter several addresses, one per line. The placeholder (*) is replaced with the page number. For example, with http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_(*).shtml and pages 2 to 10, the result is equivalent to http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml, http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_3.shtml, and so on up to http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_10.shtml.

Reverse acquisition: collect the pages in reverse order. If the pages are 2 to 10, collection starts from page 10.

Article URL address: the rules for parsing the detail page address of each article from the column list page. Region HTML selects the area of the list page that contains the article list; entry HTML selects the URL of each article detail page from within the region HTML. Whether regular expressions: whether the rules are matched as regular expressions.
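The placeholder expansion and the reverse option described above can be sketched as follows. The class and method names are hypothetical; the sketch simply replaces (*) with each page number and optionally reverses the resulting order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a jspXCMS class.
class ListAddressExpander {
    /** Replaces the (*) placeholder with every page number from firstPage to lastPage. */
    static List<String> expand(String template, int firstPage, int lastPage, boolean reverse) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            // String.replace treats "(*)" literally, not as a regular expression.
            urls.add(template.replace("(*)", Integer.toString(page)));
        }
        if (reverse) {
            Collections.reverse(urls); // reverse acquisition: start from the last page
        }
        return urls;
    }

    public static void main(String[] args) {
        String template = "http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_(*).shtml";
        for (String url : expand(template, 2, 10, true)) {
            System.out.println(url);
        }
    }
}
```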

Article URL address setting

After setting the "list address", click "Settings" next to "URL address of the article" to open the settings page. On this page you can test the matching rules and verify that they are correct.

Some garbled characters may appear in the source shown here. In this example that is because Sina's list page encoding (GB2312) differs from its detail page encoding (UTF-8). Since the collected content comes mainly from the detail page, UTF-8 is used as the page encoding, and the garbled list page does not affect the collection result. It is very rare for the list page and detail page of the same site to use different encodings; the site may be partway through a redesign, with only half of it converted so far.

URL address set: the drop-down box at the top lists the URLs generated from the "list address" on the new collection page. If the list pages are not all exactly the same, select different pages to verify that the matching rules work for all of them.

HTML source code: the area on the left shows the HTML source code of the column list page to be collected. Click "get" to reload the HTML source code of the current URL.

Region HTML: this rule first matches the area of the list page that contains the list of detail pages. (*) is a placeholder for the content to be matched. Matching rules are sensitive to spaces and line breaks, which you can use to make a match more precise. After setting the rule, click "match"; the "HTML source code" area on the left then shows the matching result. If the result is not what you want, click "get", modify the rule, and match again. For complex pages, check "regular expression or not" to write the rule as a Java regular expression.

Entry HTML: once the region HTML is settled, click the "match" button of the region HTML so that the "HTML source code" area on the left shows the matching result. Then set the entry HTML rule and click its "match" button to extract the detail page URLs from the region HTML result. (*) is again a placeholder for the content to be matched. If the "HTML source code" area on the left now shows the detail page URLs, the matching rule is set correctly. Click "OK", and the settings are written back to the new collection page.
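The two-step matching (region HTML first, then entry HTML inside the region) can be sketched roughly as below. This is an assumption about how a (*) rule could be evaluated, not the jspXCMS implementation: literal text in the rule is quoted, and each (*) becomes a lazy capture group that also spans line breaks. The class name, method names and sample HTML are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a jspXCMS class.
class PlaceholderMatcher {
    /** Converts a rule containing (*) into a regex with one capture group per placeholder. */
    static Pattern compile(String rule) {
        if (!rule.contains("(*)")) {
            throw new IllegalArgumentException("rule must contain the (*) placeholder");
        }
        String[] parts = rule.split(Pattern.quote("(*)"), -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append("([\\d\\D]*?)"); // lazy, matches across line breaks
            }
            regex.append(Pattern.quote(parts[i])); // literal text, spaces included
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the text captured by the first placeholder for every match. */
    static List<String> matchAll(String rule, String html) {
        List<String> results = new ArrayList<>();
        Matcher m = compile(rule).matcher(html);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String html = "<div>nav</div>\n<ul class=\"list\">\n"
                + "<li><a href=\"/a/1.html\">One</a></li>\n"
                + "<li><a href=\"/a/2.html\">Two</a></li>\n</ul>\n"
                + "<a href=\"/about.html\">About</a>";
        // Step 1: region HTML narrows the page down to the article list.
        String region = matchAll("<ul class=\"list\">(*)</ul>", html).get(0);
        // Step 2: entry HTML extracts each detail page URL from the region.
        System.out.println(matchAll("<a href=\"(*)\"", region));
    }
}
```

Matching the region first keeps links outside the article list, such as the "About" link in the sample, out of the result.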

Regular expression matching

For complex pages, the (*) placeholder may not be able to express the match, and you can use the more powerful regular expressions instead. Check "regular expression" to turn on regular expression mode, in which the content to extract is marked with parentheses ().

Because HTML contains line breaks, . cannot be used directly to match any character; use [\d\D] instead, which matches any character including line breaks.

Change the regular expression to ([\d\D]*?)
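The difference between . and [\d\D] can be checked with a short sketch using java.util.regex. The sample HTML and the class name are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of jspXCMS.
class MultilineRegexDemo {
    /** Returns the first capture group of the first match, or null if nothing matches. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<ul>\n<li>a</li>\n</ul>";
        // By default . does not match line breaks, so this finds nothing.
        System.out.println(firstGroup("<ul>(.*?)</ul>", html));
        // [\d\D] matches every character, including line breaks.
        System.out.println(firstGroup("<ul>([\\d\\D]*?)</ul>", html));
    }
}
```

In plain Java code, compiling the pattern with Pattern.DOTALL is another way to let . match line breaks; whether that flag can be set from the collection settings page is not covered here.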

Change the regular expression to

Collection field list

The new collection defines the list pages to be collected and how to parse the detail page URLs out of them, while the collection fields define how to parse the title, release date, body text and other content out of each detail page.

After saving the new collection, click "Field list".

This opens the Collection Field list page. No fields have been set at this point, so the list is empty.

Adding a collection field

On the "Collection Management-Field list" page, click "Field add".

This opens the new collection field page.

The fields shown here come from the document model. You do not need to add all of them; the commonly used fields are title, body, and release time. Check the fields you want and click "Save".

Collection field settings

The release date field takes a date format (following Java's date formatting rules), which must match the format of the collected date text. For example, a date written as 2016-03-24 13:41:58 corresponds to the format yyyy-MM-dd HH:mm:ss, and a time written as 23:14 corresponds to HH:mm.
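To show how such a format maps onto date text, here is a minimal sketch using java.time. Which Java date class the system uses internally is not stated in this article, so the choice of DateTimeFormatter and the class name below are assumptions; the pattern letters used here (y, M, d, H, m, s) have the same meaning in SimpleDateFormat.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not a jspXCMS class.
class ReleaseDateParser {
    /** Parses the collected date text using the configured pattern. */
    static LocalDateTime parse(String text, String pattern) {
        // Throws DateTimeParseException when the text does not fit the pattern.
        return LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        LocalDateTime released = parse("2016-03-24 13:41:58", "yyyy-MM-dd HH:mm:ss");
        System.out.println(released.getYear() + " / " + released.getHour());
    }
}
```

If the pattern and the text disagree, for instance a slash-separated date with a hyphenated pattern, parsing fails, which is why the format has to match the collected page exactly.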

Click the Settings button in the field to enter the settings page.

Filter expressions: Java regular expressions are supported. Text that matches the expression is deleted, which removes unwanted data such as advertisements.
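A filter expression of this kind can be illustrated with String.replaceAll. The sample HTML, the ad markup and the class name are invented for the example, and it is an assumption that the system applies the filter by deleting every match.

```java
// Hypothetical demo class, not part of jspXCMS.
class FilterExpressionDemo {
    /** Deletes every part of the text that matches the filter expression. */
    static String filter(String html, String filterRegex) {
        return html.replaceAll(filterRegex, "");
    }

    public static void main(String[] args) {
        String body = "<p>Body text.</p>\n"
                + "<div class=\"ad\">\nBuy now!\n</div>\n"
                + "<p>More text.</p>";
        // [\d\D]*? is used instead of .*? because the ad block spans several lines.
        String cleaned = filter(body, "<div class=\"ad\">[\\d\\D]*?</div>\\n?");
        System.out.println(cleaned);
    }
}
```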

Collection start and stop

After setting up and saving the collection rules, click "start" on the "Collection Management-list" page. The collection stops automatically when it is finished. To force it to stop while it is running, click the "stop" button.

View the collection results

The collection results appear under "document" management in the back end. Collection takes time, so the collected data grows gradually rather than appearing all at once.

The document list is sorted by release date by default. If the collected data has an earlier release date, it may not appear on the first page of the document list but on one of the following pages.

If "submit" was set to "No" for the collection, click the Collection tab on the document list page to view the collected data.

