2025-02-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
HTTP (HyperText Transfer Protocol):
A protocol for publishing and receiving HTML pages.
HTTPS (HyperText Transfer Protocol over Secure Socket Layer):
It can be understood as the secure version of HTTP: an SSL layer is added beneath the HTTP protocol.
SSL (Secure Sockets Layer):
It is mainly used for secure transmission on the web, encrypting the data carried over the transport-layer connection. It was proposed by Netscape.
Protocol port numbers:
HTTP: 80
HTTPS: 443
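As a small illustration of the default ports, the sketch below works out which port a URL would connect to. The helper name `effective_port` and the `localhost:8080` URL are made up for this example; only `urllib.parse.urlsplit` from the standard library is used.

```python
from urllib.parse import urlsplit

# Default ports from the list above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url: str) -> int:
    """Return the port written in the URL, or the scheme's default port."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


print(effective_port("http://www.douban.com/"))            # no port given, HTTP default
print(effective_port("https://www.douban.com/"))           # no port given, HTTPS default
print(effective_port("http://localhost:8080/index.html"))  # explicit port wins
```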
How crawlers work:
A web crawler's crawling process can be understood as simulating what a browser does.
The main function of a browser is to send requests to a server and display the resources the server returns in its window.
HTTP requests and responses:
HTTP communication consists of two parts: the client's request message and the server's response message.
How the browser sends an HTTP request:
When the user types a URL into the address bar and presses Enter, the browser sends an HTTP request to the HTTP server. The most commonly used HTTP request methods are GET and POST.
The browser sends a request for the server's HTML file, and the server returns a response.
The browser parses the HTML in the response, finds that it references other files such as images, CSS, and JS, and sends further requests to fetch those resources.
Once all the files have been downloaded, the browser renders the final page according to the structure of the HTML.
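The step where the browser scans the HTML for referenced files can be imitated in a crawler. The following is a minimal sketch using the standard library's `html.parser`; the class name `ResourceCollector`, the sample HTML, and the `/static/...` paths are invented for illustration, and a real browser discovers many more kinds of references than the three tags handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.douban.com/"  # page the sample HTML is assumed to come from


class ResourceCollector(HTMLParser):
    """Collect the URLs of images, scripts and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.resources.append(urljoin(self.base_url, attrs["href"]))


SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body><img src="/static/logo.png"></body></html>
"""

collector = ResourceCollector(BASE_URL)
collector.feed(SAMPLE_HTML)
for url in collector.resources:
    print(url)  # each of these would need its own follow-up request
```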
Client HTTP request:
The client sends an HTTP request to the server in the following format:
Request line | request header | blank line | request data
(Figure: general format of the request message.)
An example of a typical HTTP request:
GET https://www.douban.com/ HTTP/1.1
Host: www.douban.com
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3253.3 Safari/537.36
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
#Accept-Encoding: gzip, deflate, br
Accept-Language: zh,en-US;q=0.9,en;q=0.8,zh-TW;q=0.7,zh-CN;q=0.6
Cookie: bid=2MYBpxuz2yQ; __yadk_uid=bxInROHOuKKEb7tkiSiEZygLYuYP2kxO; gr_user_id=14916ea7-aee0-43ad-83ee-7a236df37d47; viewed="20451827mm 25861795"; _vwo_uuid_v2=C055442D3B3854F97DDE6AC4D757E5BC|34bccb8c4f1faab1336ba5e19cea3c; ll="108288"; _ga=GA1.2.310445079.1508424221; ps=y; push_noty_num=0; push_doumail_num=0; __utmv=30149280.14370; ap=1; __utmz=30149280.1509712941.8.4.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1509723845%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DMLpogEZkCppQDqzj-PhnXBPTzkvUx6DiIQSWuIdGr7pLuzgf-AdrA2UCWYNaYEjf%26wd%3D%26eqid%3Dfb0db111000170200000000359fc642a%22%5D; _pk_id.100001.8cb4=280d7bc2f732b51c.1508424213.8.1509723845.1509712941.; _pk_ses.100001.8cb4=*; __utma=30149280.310445079.1508424221.1509712941.1509723846.9; __utmc=30149280; __utmt=1; __utmb=30149280.1.10.1509723846
Generally, a crawler requests the data uncompressed, so the following header line is commented out of the request:
Accept-Encoding: gzip, deflate, sdch, br
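The four-part layout (request line, request headers, blank line, request data) can be assembled by hand. This is only a sketch: `build_request` is a hypothetical helper, the header set is shortened, and the request target is written as `/` rather than the full URL shown in the captured example. The `Accept-Encoding` header is deliberately left out so that the server is not asked for a compressed body.

```python
def build_request(method, target, headers, body=""):
    """Assemble a raw HTTP/1.1 request: request line, headers, blank line, body."""
    request_line = f"{method} {target} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # The empty string produces the blank line that separates headers from the body.
    return "\r\n".join([request_line, *header_lines, "", body])


headers = {
    "Host": "www.douban.com",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0",
    # "Accept-Encoding": "gzip, deflate, br",  # left out: ask for uncompressed data
}

message = build_request("GET", "/", headers)
print(message)
```

A GET request normally has no body, so the message ends right after the blank line.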
Request methods:
HTTP defines several request methods, but the most commonly used are GET and POST:
GET retrieves data from the server, and POST sends data to the server.
The parameters of a GET request appear in the browser's URL: they become part of the URL, and the HTTP server generates the response according to them.
The parameters of a POST request are carried in the request body (usually as form data) and do not appear in the URL. HTTP itself does not limit the length of the body, so POST is usually used to submit large amounts of data to the server. The "Content-Type" header states the media type and encoding of the body.
GET is generally not used to submit forms, because its parameters are exposed in the URL and may reveal sensitive information.
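The difference in where the parameters travel can be seen without sending anything. In this sketch the `example.com` URLs and the form fields are placeholders; `urllib.request.Request` only builds the request object, and nothing goes over the network unless it is passed to `urlopen`.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded into the URL itself.
get_req = Request("https://example.com/search?" + urlencode({"q": "python", "page": 2}))

# POST: the parameters are encoded into the request body instead.
form = urlencode({"user": "alice", "password": "secret"}).encode("utf-8")
post_req = Request(
    "https://example.com/login",
    data=form,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Supplying `data` is what makes `urllib` choose POST; the password is kept out of the URL, although it is still readable in the body unless the request goes over HTTPS.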
Commonly used request headers:
1. Host (host name and port number): corresponds to the host name and port number in the URL, and specifies the Internet host and port of the requested resource.
2. Connection (connection type): indicates the type of connection between the client and the server.
3. Upgrade-Insecure-Requests (upgrade to HTTPS requests): tells the server that the client prefers a secure response and supports upgrading insecure requests, so HTTP resources are requested over HTTPS instead and the browser no longer shows warnings on HTTPS pages.
4. User-Agent (browser name): identifies the client software making the request.
5. Accept (accepted content types): the MIME (Multipurpose Internet Mail Extensions) types that the browser or other client can accept; the server uses it to decide which format to return.
Accept: */* means anything can be received.
image/gif means a GIF image.
6. Referer (referring page): the URL of the page that generated the request, that is, the page from which the user reached the currently requested page. It can be used to track which page or source site a web request came from.
Downloading an image from some sites requires the matching Referer, otherwise the download fails. This is because the site uses hotlink protection: it checks whether the Referer is one of its own addresses, serves the image if it is, and refuses if it is not.
7. Accept-Encoding (content encodings): the encodings the browser can accept. An encoding is not a file format; it compresses the file to speed up the transfer. The browser decodes the response after receiving it and then checks the file format.
8. Accept-Language (languages): the languages the browser can accept, such as en, en-us, or zh-cn. It is used when the server can provide more than one language version.
9. Accept-Charset (character encodings): the character encodings the browser can accept. If this field is not set in the request, any character set is accepted by default.
10. Cookie: the browser uses this header to send cookies to the server. A cookie is a small piece of data stored in the browser; it can record user information related to the server and can also be used to implement sessions.
11. Content-Type (POST data type): the type of the content carried in a POST request.
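A crawler usually sets several of these headers itself so that its request resembles a browser's. The sketch below attaches a few of them to a `urllib.request.Request`; the header values are shortened examples rather than the ones a particular site requires, and nothing is sent over the network.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)",
    "Referer": "https://www.douban.com/",  # some sites check this for hotlink protection
    "Accept": "text/html,*/*;q=0.8",
    "Accept-Language": "zh-CN,en;q=0.8",
}

req = Request("https://www.douban.com/", headers=headers)

# urllib stores header names with only the first letter capitalized.
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Because of that capitalization, a stored header is looked up as `req.get_header("User-agent")` rather than `"User-Agent"`.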
Server-side HTTP response:
The HTTP response also consists of four parts: the status line, the message headers, a blank line, and the response body.
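To make the four parts concrete, the sketch below splits a hand-written response message into them. `RAW_RESPONSE` and `parse_response` are made up for this example; real clients such as `http.client` do this parsing for you and also handle cases this sketch ignores (repeated headers, chunked bodies, binary data).

```python
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"                          # status line
    "Content-Type: text/html; charset=utf-8\r\n"   # message headers
    "Content-Length: 13\r\n"
    "\r\n"                                         # blank line
    "<html></html>"                                # response body
)


def parse_response(raw):
    """Split a raw response into version, status code, reason, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, code, reason)
print(headers)
print(body)
```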
Common status codes:
100-199: the server has successfully received part of the request and the client should continue submitting the rest to complete the process.
200-299: the server has successfully received the request and has completed processing it.
300-399: the client must take further action to complete the request, for example because the requested resource has moved to a new address (302 and 307 indicate that the requested page has been temporarily moved to a new URL; 304 indicates that the cached copy can be used).
400-499: the client's request is in error (404: the server could not find the requested page; 403: the server denied access because of insufficient permissions).
500-599: an error occurred on the server side. 500 is the most common (the server met an unexpected condition and could not complete the request).
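A crawler typically branches on the range a status code falls into. A minimal sketch: `status_class` is a hypothetical helper mirroring the five ranges above, and `http.HTTPStatus` from the standard library supplies the official reason phrases.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Map a status code to one of the five ranges listed above."""
    if 100 <= code <= 199:
        return "informational"
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "redirection"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    raise ValueError(f"not an HTTP status code: {code}")


for code in (200, 302, 304, 403, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```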
Cookie and Session:
The interaction between the client and the server is limited to a request and its response. Once it ends the connection is dropped, and the next interaction is treated as a new connection. For the server to keep track of a user's state, the user's information has to be recorded somewhere.
Cookie: confirms identity through information recorded on the client
Session: confirms identity through information recorded on the server
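The two mechanisms are often combined: the server keeps the session data and hands the client a cookie that holds only the session ID. The sketch below imitates that with an in-memory dictionary. The names `SESSIONS`, `login`, and `identify` and the cookie name `session_id` are assumptions for illustration, not how any particular site or framework does it.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # server side: session ID -> user information


def login(username):
    """Create a server-side session and return the cookie text for the client."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": username}
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    return cookie["session_id"].OutputString()  # e.g. the value of a Set-Cookie header


def identify(cookie_header):
    """Look up the user from the Cookie header the client sends back."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("session_id")
    if morsel is None:
        return None
    session = SESSIONS.get(morsel.value)
    return session["user"] if session else None


set_cookie = login("alice")    # client stores this cookie
print(identify(set_cookie))    # later request: server recognizes the user
print(identify("session_id=unknown"))  # unknown ID: no session found
```

A crawler that needs to stay logged in does the client half of this: it saves the cookie from the login response and sends it back with every later request.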