My site is getting hammered by crawlers to the point that it can't keep up, so I want to block them all.
Nginx serves several sites here. nginx.conf itself has no server block; only each site's .conf file has one.
Following advice found online, I added the following to a server block in nginx.conf (that server block in nginx.conf was also added by me):
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
return 403;
}
But curl -I -A "Googlebot" www.XXX.com still does not return 403.
This is really not working out.
Could an expert point me in the right direction?
Also, robots.txt is no use here; it relies entirely on crawlers behaving themselves. I want to actively block them, because some rogue crawlers obviously can't be handled that way.
Here is my nginx.conf:
#user nobody;
worker_processes 2;
#error_log logs/error.log;
#error_log logs/error.log notice;
#error_log logs/error.log info;
#location of the pid file
pid nginx.pid;
events {
worker_connections 10240;
}
http {
include mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
#access_log logs/access.log main;
sendfile on;
#tcp_nopush on;
#keepalive_timeout 0;
keepalive_timeout 65;
#gzip on;
# open(OUTFILE, ">>/home/wamdm/perl_learn/a");
# print OUTFILE ($r->uri,"\n");
# close (OUTFILE);
perl_set $fix_upper_lower_case '
use File::Basename;
sub {
my $r = shift;
my $uri = $r->uri;
my $filepath = $r->filename;
my $uri_prefix = substr($uri, 0, rindex($uri, "/") + 1);
my $dir = dirname($filepath);
my $filename = basename($filepath);
# look for a file in the same directory whose name matches case-insensitively
opendir(my $dh, $dir) || die ("~~fail to open dir $dir");
my @files = grep { /^\Q$filename\E$/i && -f "$dir/$_" } readdir($dh);
closedir($dh);
if (@files > 0) {
return "$uri_prefix$files[0]";
}
return $r->uri;
}
';
server {
if ($http_user_agent ~* "MJ12bot|qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|YandexBot|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot")
{
return 403;
}
# server_name localhost;
#charset koi8-r;
#access_log logs/host.access.log main;
# location / {
# root html;
# index index.html index.htm;
# }
#error_page 404 /404.html;
# redirect server error pages to the static page /50x.html
#
# error_page 500 502 503 504 /50x.html;
# location = /50x.html {
# root html;
# }
# proxy the PHP scripts to Apache listening on 127.0.0.1:80
#
#location ~ \.php$ {
# proxy_pass http://127.0.0.1;
#}
# pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
#
#location ~ \.php$ {
# root html;
# fastcgi_pass 127.0.0.1:9000;
# fastcgi_index index.php;
# fastcgi_param SCRIPT_FILENAME /scripts$fastcgi_script_name;
# include fastcgi_params;
#}
# deny access to .htaccess files, if Apache's document root
# concurs with nginx's one
#
#location ~ /\.ht {
# deny all;
#}
}
# another virtual host using mix of IP-, name-, and port-based configuration
#
#server {
# listen 8000;
# listen somename:8080;
# server_name somename alias another.alias;
# location / {
# root html;
# index index.html index.htm;
# }
#}
# HTTPS server
#
#server {
# listen 443;
# server_name localhost;
# ssl on;
# ssl_certificate cert.pem;
# ssl_certificate_key cert.key;
# ssl_session_timeout 5m;
# ssl_protocols SSLv2 SSLv3 TLSv1;
# ssl_ciphers HIGH:!aNULL:!MD5;
# ssl_prefer_server_ciphers on;
# location / {
# root html;
# index index.html index.htm;
# }
#}
}
A site's .conf file looks like this:
server {
listen 80;
server_name computer.cdblp.cn;
access_log /home/wamdm/sites/logs/computer.access.log main;
error_log /home/wamdm/sites/logs/computer.error.log error;
root /home/wamdm/sites/searchscholar/computer;
index index.php index.html index.htm;
rewrite "^/conference/([^/]+)$" /con_detail.php?con_title=$1 last;
rewrite "^/conference/([^/]+)/$" /con_detail.php?con_title=$1 last;
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
return 403;
}
#case-fix patch for a site migrated from Windows (case-insensitive) to Ubuntu (case-sensitive)
#it does not work for requests that depend on URL rewriting
#if ( !-e $request_filename ) {
# rewrite ^(.*)$ $fix_upper_lower_case last;
#}
#location /{
# include agent_deny.conf;
# }
#don't log favicon.ico requests
location = /favicon.ico {
log_not_found off;
access_log off;
}
#deny access to hidden files
location ~ /\. {
deny all;
access_log off;
log_not_found off;
}
#don't log requests for images, flash files, etc.
location ~ .*\.(gif|jpg|jpeg|png|bmp|swf)$ {
expires 7d; #these files expire after 7 days
access_log off;
}
#don't log requests for js and css files
location ~ .*\.(js|css)?$ {
expires 1d; #these files expire after 1 day
access_log off;
}
#php-cgi settings
location ~ [^/]\.php(/|$) {
fastcgi_split_path_info ^(.+?\.php)(/.*)$;
#reject requests for php pages that don't exist
if (!-f $document_root$fastcgi_script_name) {
return 404;
}
}
}
Just configure robots.txt to forbid crawling. The catch is rogue crawlers that ignore robots.txt; the big ones like Google, Baidu and Sogou mostly respect it.
User-agent: *
Disallow: /
There are two key points. The first is whether the robots.txt that blocks crawlers is configured correctly.
Reference: http://bar.baidu.com/robots/
The second is to check whether the visiting spider's IP actually belongs to Google. Some knock-off rogue search engines, for example a certain "digital" company, spoof the user agents of well-known companies while ignoring robots.txt completely. For those rogue engines, blocking their IPs is the only option.
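A minimal sketch of blocking by IP in nginx (ngx_http_access_module), placed inside the affected site's server block; the addresses below are made-up placeholders, take the real ones from your access log:
# block specific rogue-crawler IPs or ranges
deny 203.0.113.10;
deny 198.51.100.0/24;
allow all;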
After adding this configuration, first verify the syntax with the nginx -t command; if there is no problem, apply the configuration with nginx -s reload.
The configuration only takes effect after the reload.
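For example (the binary and config path may differ on your machine):
nginx -t          # check the configuration syntax
nginx -s reload   # tell the running master process to reload the configuration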
I tried it. Still not working.
wamdm@WAMDM52:~$ curl -I -A "Googlebot" cdblp.cn/index.php
HTTP/1.1 200 OK
Server: nginx/1.4.1
Date: Mon, 15 Dec 2014 03:47:08 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.10
What is going on here? Also, does that robots.txt go in the root directory of the site's front-end code?
Try changing the quotes to parentheses:
if ($http_user_agent ~* (qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot)) {
return 403;
}
I tested your site, and the block is indeed not taking effect.
You said you added it to a server block in nginx.conf (a server block you created yourself); how is that server block written? Put the if ($http_user_agent ~* ...) check into each site's .conf file (you can add it to the cdblp.cn server block first to test). It is probably not working because the server block you created and the cdblp.cn server block are independent of each other, so requests to that site never hit your check.
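One way to keep that tidy is to put the user-agent check in a small shared file and include it from every site's server block. This is just a sketch: agent_deny.conf is an arbitrary file name (the same one that appears commented out in your site's conf), the list is shortened, and placing it next to nginx.conf is an assumption.
# agent_deny.conf -- shared user-agent blacklist (shortened)
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|YoudaoBot|Sosospider|MSNBot|ia_archiver") {
    return 403;
}
Then in each site's .conf:
server {
    listen 80;
    server_name computer.cdblp.cn;
    include agent_deny.conf;
    ...
}
After reloading, test against the specific virtual host, e.g. curl -I -A "Googlebot" -H "Host: computer.cdblp.cn" http://127.0.0.1/, so the request is actually matched by that server block rather than by whichever block is the default.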
If you still have problems, feel free to ask.
My configuration is like this, with the check written directly under the server block, not wrapped in an extra location. How is your server block written? You can paste it here; of course, replace any security-sensitive settings first.
#user nobody;
worker_processes 1;
#error_log logs/error.log;
#error_log logs/error.log notice;
#error_log logs/error.log info;
#pid logs/nginx.pid;
events {
worker_connections 1024;
}
http {
include mime.types;
default_type application/octet-stream;
#log_format main '$remote_addr - $remote_user [$time_local] "$request" '
# '$status $body_bytes_sent "$http_referer" '
# '"$http_user_agent" "$http_x_forwarded_for"';
#access_log logs/access.log main;
sendfile on;
#tcp_nopush on;
#keepalive_timeout 0;
keepalive_timeout 650;
#tcp_nodelay on;
fastcgi_connect_timeout 3000;
fastcgi_send_timeout 3000;
fastcgi_read_timeout 3000;
fastcgi_buffer_size 128k;
fastcgi_buffers 4 128k;
fastcgi_busy_buffers_size 256k;
fastcgi_temp_file_write_size 256k;
#gzip on;
gzip on;
gzip_min_length 1k;
gzip_buffers 4 32k;
gzip_http_version 1.1;
gzip_comp_level 2;
gzip_types text/plain application/x-javascript text/css application/xml;
gzip_vary on;
gzip_disable "MSIE [1-6].";
server_names_hash_bucket_size 128;
client_max_body_size 100m;
client_header_buffer_size 256k;
large_client_header_buffers 4 256k;
server {
#charset koi8-r;
#access_log logs/host.access.log main;
listen 80 default;
## SSL directives might go here
server_name 127.0.0.1 localhost; ## Domain is here twice so server_name_in_redirect will favour the www
root G:\WWW;
location / {
index index.html index.php; ## Allow a static html file to be shown first
try_files $uri $uri/ @handler; ## If missing pass the URI to Magento's front handler
expires 30d; ## Assume all files are cachable
}
location /. { ## Disable .htaccess and other hidden files
return 404;
}
location ~ .php/ { ## Forward paths like /js/index.php/x.js to relevant handler
rewrite ^(.*.php)/ $1 last;
}
location ~ .php$ { ## Execute PHP scripts
if (!-e $request_filename) { rewrite / /index.php last; } ## Catch 404s that try_files miss
expires off; ## Do not cache dynamic content
fastcgi_pass 127.0.0.1:9000;
#fastcgi_param HTTPS $fastcgi_https;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
include fastcgi_params; ## See /etc/nginx/fastcgi_params
}
if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
return 403;
}
error_page 500 502 503 504 /50x.html;
location = /50x.html {
root html;
}
}
}