This article explains in detail how to block and ban web crawlers with the Nginx server. The method is very practical, so it is shared here as a reference; I hope you will get something out of it after reading.
Every website usually encounters many non-search-engine crawlers, most of them content scrapers or scripts written by beginners. Unlike search-engine crawlers, they have no frequency control, so they tend to consume a lot of server resources and waste bandwidth.
In fact, Nginx can easily filter requests based on the User-Agent header. We only need a simple regular expression in the location block that serves as the URL entry point to filter out unwanted crawler requests:
location / {
    if ($http_user_agent ~* "python|curl|java|wget|httpclient|okhttp") {
        return 503;
    }
    # other normal configuration ...
}
Note: $http_user_agent is a built-in Nginx variable that can be referenced directly inside a location block. ~* means case-insensitive regular-expression matching, and roughly 80% of Python crawlers can be filtered out by the python keyword alone.
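You can verify the rule by sending requests with different User-Agent values and confirming that only the blacklisted ones are rejected. A minimal sketch, assuming the snippet above is deployed on a host reachable as www.xxx.com (the placeholder host used in the examples below):

# Blocked: curl's default User-Agent (curl/x.y.z) matches "curl", so Nginx returns 503
curl -I http://www.xxx.com/

# Blocked: a typical Python client identifies as python-requests, which matches "python" case-insensitively
curl -I -A "python-requests/2.28.1" http://www.xxx.com/

# Allowed: a browser-like User-Agent does not match the pattern and gets the normal response
curl -I -A "Mozilla/5.0" http://www.xxx.com/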
Blocking and banning web crawlers in Nginx

A more complete example returns 403 for a blacklist of well-known spider User-Agents inside a full server block, proxying all other requests to a backend on port 8080:
server {
    listen       80;
    server_name  www.xxx.com;

    #charset koi8-r;

    #access_log  logs/host.access.log  main;

    #location / {
    #    root   html;
    #    index  index.html index.htm;
    #}

    if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
        return 403;
    }

    location ~ ^/(.*)$ {
        proxy_pass http://localhost:8080;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        client_max_body_size 10m;
        client_body_buffer_size 128k;
        proxy_connect_timeout 90;
        proxy_send_timeout 90;
        proxy_read_timeout 90;
        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
    }

    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   html;
    }

    # proxy the PHP scripts to Apache listening on 127.0.0.1:80
    #
    #location ~ \.php$ {
    #    proxy_pass   http://127.0.0.1;
    #}

    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
    #
    #location ~ \.php$ {
    #    root           html;
    #    fastcgi_pass   127.0.0.1:9000;
    #    fastcgi_index  index.php;
    #    fastcgi_param  SCRIPT_FILENAME  /scripts$fastcgi_script_name;
    #    include        fastcgi_params;
    #}

    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    #
    #location ~ /\.ht {
    #    deny  all;
    #}
}
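After saving the changes, check the syntax and reload Nginx so the new rules take effect. These are the standard nginx command-line switches; the sketch assumes the nginx binary is on your PATH:

# check the configuration for syntax errors without restarting anything
nginx -t

# tell the running master process to reload the configuration
nginx -s reload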
You can test it with curl:

curl -I -A "qihoobot" www.xxx.com

A blacklisted User-Agent such as qihoobot should receive an HTTP 403 Forbidden response.

This is the end of the article on how to block and ban web crawlers on the Nginx server. I hope the above content is helpful to you and that you learn something new from it. If you think the article is good, please share it so that more people can see it.