2025-01-16 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/02 Report
This article introduces practical skills for Python web crawlers in some detail; interested readers may find it a useful reference.
The importance of data
We are now in the era of big data: data drives business development and operations. With data, users can be profiled and experiences customized, and data points the way for solution design and decision optimization, so the development of Internet products is inseparable from data collection and analysis. One way to collect data is to capture user interactions on your own platform and report them to an API. Another is to crawl data from competing platforms with a crawler program. Below we focus on crawler application scenarios, the problems you will meet in practice, and common anti-crawler techniques and counter-measures.
Application scenario
Crawling customer information, for Internet platforms and sales-oriented companies
Crawling customer information frees salespeople from hunting for leads manually and makes market development more efficient.
Crawl customer information from the relevant platforms, push it into a CRM system, and hand it to the sales staff to follow up.
Crawling content and applying it to the platform's own business
When you browse news feeds you will notice that the trending content on many platforms is very similar; platforms that respect copyright will indicate the source.
Crawling content and feeding it into a news or content business reduces the load on content editors. If you do not need original content, the whole pipeline can be handed over to an automated program.
Mining, analysis and application of key data from competitors
Key business data on competitor platforms, for example: car X home model information, X-where hotel information, return-X commodity information, and so on.
Crawl the competitor's key data, filter and process it, then surface it in your own business to grow the data volume and reduce the load on your own operations and editorial staff.
......
Crawler development
Developing crawlers in Python (recommended)
Getting started is relatively easy, the code is short and lean, and a wealth of modules and frameworks make crawler development convenient.
Other languages
Many other languages can be used to write crawlers too, though their ecosystems are less complete. Choose according to your actual tech stack and scenario: the language is just a tool, the ideas are universal.
Skills a crawler developer needs
Crawler development requires a fairly comprehensive and deep understanding of the web; only then can you handle anti-crawler measures with ease.
Learn about HTML
Know how HTML tags build a page, how to parse tags in the DOM, and how to extract the data content you want.
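As a minimal sketch of parsing tags to extract data, the standard library's HTMLParser can pull the text out of specific tags; the markup and class name below are made-up examples.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text inside <span class="price"> tags."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

page = '<div><span class="price">19.9</span><span class="price">25.0</span></div>'
parser = PriceParser()
parser.feed(page)
```

In real projects a dedicated parsing library (such as BeautifulSoup or lxml) is usually more convenient, but the idea is the same: walk the DOM and keep only the tags that hold the data.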
Learn about CSS
Understand CSS well enough to parse data content hidden in styles.
Learn about JS
Basic JS syntax: able to read and write it, familiar with JS libraries such as jQuery and Vue, and able to debug JS with the browser's developer tools.
Learn about JSON
Understand JSON: serialize and deserialize data, and extract the content you need by parsing JSON objects.
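A minimal sketch of the JSON handling a crawler does constantly; the payload shape here is a made-up example, not any real API's response.

```python
import json

# Hypothetical API response body
payload = '{"code": 0, "data": {"items": [{"name": "hotel A", "score": 4.7}]}}'

obj = json.loads(payload)                   # deserialize: string -> dict
items = obj["data"]["items"]                # navigate to the data content
back = json.dumps(obj, ensure_ascii=False)  # serialize: dict -> string
```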
Learn about HTTP/HTTPS
Able to analyze request and response information, and to construct requests in code.
Able to use regular expressions
Extract the desired data content by matching strings against regular expressions.
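As a small sketch of regex-based extraction, the pattern below pulls numeric prices out of an HTML fragment; the markup is a made-up example.

```python
import re

fragment = '<li>price: 19.9</li><li>price: 25.0</li>'
# Capture the number that follows "price:" in each list item
prices = re.findall(r'price:\s*([\d.]+)', fragment)
```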
Know how to operate a database
Store crawled data via database operations, e.g. MySQL syntax.
Able to use a packet-capture tool
The browser's F12 developer tools (Chrome recommended): the Network tab shows captured request information.
Tools: Charles, Fiddler (can capture HTTPS traffic and app traffic)
With a packet-capture tool you can filter out the data API or address, analyze request and response information, and locate the field or HTML tag that holds the data.
Able to use developer tools
Press F12 in the browser to open the developer tools.
You need to be able to debug HTML, CSS and JS with them.
Able to simulate requests
Tools: Charles, Fiddler, Postman
By replaying a request you work out which information it really needs, such as parameters, cookies and request headers, and learn how to construct the same request in code.
Able to locate the data
Data in an API: the front end or native app requests a data API; most APIs return JSON, which is then rendered for display.
Data in HTML: view the page's HTML source; if the data you want appears there, it was bound into the HTML on the server side.
Data in JS code: view the page source; if the data is neither in the HTML nor fetched from a data API, check whether it is bound to a JS variable.
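For the last case, data bound to a JS variable can often be recovered by matching the literal in the page source and parsing it as JSON. A minimal sketch, where the variable name `listData` and the page snippet are assumptions:

```python
import json
import re

# Made-up page source with data assigned to a JS variable
page = '<script>var listData = {"total": 2, "ids": [101, 102]};</script>'

# Capture the object literal assigned to the variable, then parse it
match = re.search(r'var\s+listData\s*=\s*(\{.*?\});', page)
data = json.loads(match.group(1))
```

This only works when the literal is valid JSON; if it uses JS-only syntax (unquoted keys, trailing commas), a tolerant parser or a JS engine is needed instead.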
Able to deploy
Deploy to a Windows or Linux server, monitor the crawler process with tooling, and schedule periodic crawls.
Countering anti-crawler measures
Anti-crawler measures fall into server-side restrictions and front-end restrictions.
Server-side restrictions: the server limits requests to stop crawlers from fetching data.
Front-end restrictions: the front end obfuscates key data with CSS and HTML tags so crawlers cannot read it easily.
Set request headers (server-side restriction)
Referer
User-Agent
......
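A minimal sketch of attaching the headers a server commonly checks, using only the standard library; the URL and header values are examples, not a real endpoint.

```python
import urllib.request

# Build a request carrying a browser-like User-Agent and a Referer
req = urllib.request.Request(
    "https://example.com/list",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "https://example.com/",
    },
)
# urllib.request.urlopen(req) would now send the request with these headers.
```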
Signature rules (server-side restriction)
If the request is initiated from JS, the signing rules can be found in a JS function, and the signature can then be reconstructed according to those rules.
If the request is initiated by an app, the front end may call a natively wrapped method, or the request may originate natively. This is trickier, and the app package may need to be decompiled, which is not guaranteed to succeed.
Delay, or random delay (server-side restriction)
If requests are being limited, try delaying them (by some milliseconds or seconds), tuning the interval to the actual situation.
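A small sketch of the random-delay idea; the bounds are tunable parameters, not recommended values.

```python
import random
import time

def polite_sleep(min_s=1.0, max_s=3.0):
    """Sleep a random interval between requests to avoid rate limits."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

slept = polite_sleep(0.01, 0.02)  # tiny bounds here just for demonstration
```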
Proxy IP (server-side restriction)
If delayed requests are still limited, or the delay needed to avoid limits is too long, consider using proxy IPs, applied according to the actual restrictions. Usually, switching the request's proxy IP whenever it gets blocked is enough to bypass the limit.
There are many paid proxy-IP service platforms with a variety of service models; search around and compare, the fees are generally within an acceptable range.
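A minimal sketch of proxy rotation: cycle through a pool and switch whenever a request is blocked. The addresses are placeholders; a paid proxy service would supply real ones.

```python
import itertools

# Hypothetical proxy pool
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Return the next proxy in the rotation, e.g. after a blocked request."""
    return next(_pool)
```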
Login restrictions (server-side restriction)
Send requests carrying the logged-in user's cookie.
If the user's cookie expires after a fixed period, find the login endpoint, simulate the login, store the cookie, and then make data requests; when the cookie expires, repeat this step.
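A sketch of that login flow with the standard library: a CookieJar attached to an opener keeps the session cookie set by a (hypothetical) login endpoint, so later data requests carry it automatically. The URLs and form fields are assumptions for illustration, not a real API.

```python
import http.cookiejar
import urllib.parse
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def login_and_fetch(login_url, data_url, credentials):
    """Log in once, then request data with the stored session cookie.

    `login_url`, `data_url` and the credential fields are hypothetical.
    """
    body = urllib.parse.urlencode(credentials).encode()
    opener.open(login_url, body)         # server sets the session cookie in `jar`
    return opener.open(data_url).read()  # cookie is sent automatically
```

When the cookie expires, calling the login step again refreshes the jar, matching the retry loop described above.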
CAPTCHA limits (server-side restriction)
Simple CAPTCHAs, letters or digits in an image, can be read with an image-recognition module.
Complex CAPTCHAs that image recognition cannot handle can be outsourced to a paid third-party service.
CSS/HTML obfuscation and interference (front-end restriction)
The front end interferes with and obfuscates key data through CSS or HTML tags. Breaking it requires sampling, finding the pattern, and substituting the correct data back in.
1. @font-face custom-font interference
Examples: car X home comment posts, Cat X movie ratings
```css
@font-face {
  font-family: 'myfont';
  src: url('//k2.autoimg...eot');
  src: url('//k3.autoimg.cn/...') format('embedded-opentype'),
       url('//k3.autoimg.cn/...woff') format('woff');
}
```
Crack the idea:
Find the URL of the ttf font file and download it, then parse the ttf with a font-parsing module to get the set of glyph codes. Map the codes used in the DOM to the glyph indices in the ttf, then map those indices to the real characters.
You can open a ttf file with FontForge or FontCreator to analyze it.
2. Pseudo-element hiding
Important data content is rendered through pseudo-elements.
Example: car X home
```css
.hs_kw60_configod::before { content: "FAW"; }
.hs_kw23_configod::before { content: "Volkswagen"; }
.hs_kw26_configod::before { content: "Audi"; }
```
Crack the idea:
Find the style file, then match the class names in the HTML tags against the `content` values of the corresponding CSS classes and substitute them in.
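A minimal sketch of that substitution: build a class-to-content mapping from the CSS, then look up the class used in the HTML. The sample data mirrors the article's example; the regexes assume this exact minified style and would need adjusting for real stylesheets.

```python
import re

# CSS with data hidden in ::before content (from the article's example)
css = ('.hs_kw60_configod::before{content:"FAW";}'
       '.hs_kw23_configod::before{content:"Volkswagen";}')

# class name -> content value
mapping = dict(re.findall(r'\.(\w+)::before\{content:"([^"]+)";\}', css))

tag = '<span class="hs_kw23_configod"></span>'
cls = re.search(r'class="(\w+)"', tag).group(1)
recovered = mapping.get(cls, "")
```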
3. Background-image
Digits and symbols such as prices or ratings are displayed via the position offset of a background image.
Crack the idea: map background-position values to the digits in the image.
4. HTML tag interference
Hidden tags with junk content are inserted into the tags holding the important data to interfere with extraction.
For example, an xxIP proxy platform interleaves hidden junk digits with the real ones, so the raw text of the element contains extra digits while only the real address (something like 202.109.237.35:80) is visible. In the browser console the visible value can be recovered with:

```js
$(".IP:eq(0) > *:hidden").remove()
$(".IP:eq(0)").text()
```
Crack the idea:
Filter out the interfering HTML tags, or read only the HTML tags that contain valid data.
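A small sketch of that counter-measure in code: strip the tags hidden with `display:none`, then read the remaining text. The obfuscated markup below is a made-up stand-in for the proxy-site example.

```python
import re

# Real IP digits interleaved with hidden junk digits
raw = ('<span class="ip">202.'
       '<i style="display:none">9</i>'
       '109.<i style="display:none">88</i>237.35:80</span>')

# Drop the hidden elements, then strip the remaining tags
cleaned = re.sub(r'<i style="display:none">[^<]*</i>', '', raw)
visible = re.sub(r'<[^>]+>', '', cleaned)
```

Real pages vary the hiding technique (classes, inline styles, off-screen positioning), so in practice the filter must be derived from sampling the page, as the article says.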
...... (anti-crawler schemes are limited only by their authors' imagination, and the counter-measures have to be just as inventive.)
Prevent poisoning
Some platforms, on detecting a crawler, do not restrict or block it but instead feed it misleading data, leading competitors to make wrong decisions; this is poisoning.
To guard against poisoning, the crawled data needs to be sampled and verified.
In conclusion: most small and medium platforms are still weakly defended against crawlers, which keeps crawling widespread; through crawlers one can gain sizable benefits at relatively small cost.
Mining, analyzing and applying competitor data plays an important role in business growth, and crawler development has become an indispensable technique for Internet product companies.
At present no technique can block crawlers completely; anti-crawler strategies only raise the bar, and with enough technical skill they can still be broken through.
Crawling versus anti-crawling is a contest of technique, and this war without gunpowder will never end. (Why must programmers make life hard for programmers?)
Reference code
Font parsing: C# and Python implementations
C#
```csharp
// Requires a reference to PresentationCore.dll
private void Test()
{
    string path = @"F:\font.ttf";
    // Read the font file
    PrivateFontCollection pfc = new PrivateFontCollection();
    pfc.AddFontFile(path);
    // Instantiate the font
    Font f = new Font(pfc.Families[0], 16);
    // Set the font on the text box
    txt_mw.Font = f;
    // Traverse the glyph map and output it
    var families = Fonts.GetFontFamilies(path);
    foreach (System.Windows.Media.FontFamily family in families)
    {
        var typefaces = family.GetTypefaces();
        foreach (Typeface typeface in typefaces)
        {
            GlyphTypeface glyph;
            typeface.TryGetGlyphTypeface(out glyph);
            IDictionary<int, ushort> characterMap = glyph.CharacterToGlyphMap;
            var datas = characterMap.OrderBy(d => d.Value).ToList();
            foreach (KeyValuePair<int, ushort> kvp in datas)
            {
                var str = $"[{kvp.Value}] [{kvp.Key}] [{(char)kvp.Key}]\r\n";
                txt_mw.AppendText(str);
            }
        }
    }
}
```
Python
```python
# pip install fonttools
from fontTools.ttLib import TTFont

font = TTFont('./font.ttf')
cmaps = font.getBestCmap()      # character code -> glyph name mapping
orders = font.getGlyphOrder()   # glyph order in the font file
# font.saveXML('font.xml')      # optionally dump the font as XML
print(cmaps)
print(orders)
```

That's all for the practical skills of Python crawlers. I hope the content above is helpful and leaves you knowing a little more. If you found this article useful, feel free to share it so more people can see it.