How can I block crawlers such as spyder/Nutch-2 from accessing a specific page?

I have a Windows client application that consumes a PHP page hosted on a shared commercial web server.

This PHP page returns encrypted JSON. The page also contains a piece of code that keeps track of which IPs visit it, and I have noticed that a spyder/Nutch-2 crawler is visiting this page.

I am wondering how it is possible for a crawler to find a page that is not published to any search engine. Is there a way to block crawlers from visiting this specific page?

Should I use an .htaccess file to configure this?

You can indeed use .htaccess. A robots.txt file is another option, but some crawlers will ignore it. You can also block specific user-agent strings (they differ from crawler to crawler).

robots.txt:

User-agent: *
Disallow: /

This example tells all robots to stay out of the entire website. You can also block specific directories:

Disallow: /demo/

More information about robots.txt

You can forbid specific crawlers with the following mod_rewrite rules:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (spyder/Nutch-2) [NC]
# To block multiple agents at once:
# RewriteCond %{HTTP_USER_AGENT} (spyder/Nutch-2|baidu|google|...) [NC]
RewriteRule .* - [R=403,L]

That crawler can change its user-agent name, so this may not be enough. If needed, you can block the crawler by its IP address instead:

Order Deny,Allow
Deny from x.x.x.x

However, the bot can also change its IP address. This means you need to monitor your access logs, decide which agents should be blocked, and add them to the list manually.
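A minimal sketch of that kind of log review, assuming Apache's combined log format (the sample line below stands in for your real access log, whose path and format vary by host):

```shell
# Write one sample combined-format log line; in practice you would read
# your host's real access log instead of this file.
printf '%s\n' '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /api.php HTTP/1.1" 200 512 "-" "spyder/Nutch-2"' > sample_access.log

# The user agent is the 6th field when splitting on double quotes;
# counting occurrences shows which agents hit the page most often.
awk -F'"' '{print $6}' sample_access.log | sort | uniq -c | sort -rn
```

From this summary you can pick the agents (or, via the first column of the log, the IPs) worth adding to your block list.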

You can ban a particular IP address with an .htaccess file:

Order Deny,Allow
Deny from xxx.xx.xx.xx

where xxx.xx.xx.xx represents the IP address you want to block.
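Note that Order/Deny is the older Apache 2.2 access-control syntax; on Apache 2.4 (with mod_authz_core) the equivalent would look like this, with the placeholder address substituted for the one you want to block:

    <RequireAll>
        Require all granted
        Require not ip xxx.xx.xx.xx
    </RequireAll>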

Close, but it would be better to use a robots.txt file. The linked page goes through why you would want to set one up and how to do so. In summary:

  1. It avoids wasting server resources on spiders and bots running the scripts on the page.
  2. It can save bandwidth.
  3. It removes clutter from your web stats.
  4. You can fine-tune it to exclude only certain robots.
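For example, point 4's fine-tuning could look like this (the path is a placeholder for your page, and this assumes the crawler honors robots.txt and identifies itself with a `Nutch` user-agent token):

    # Keep one specific crawler away from one page, allow everyone else.
    User-agent: Nutch
    Disallow: /your-page.php

    User-agent: *
    Disallow: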

One caveat I should mention. Some spiders are coded to disregard the robots.txt file and will even examine it to see what you don't want them to visit. However, spiders from legit sources will obey the robots.txt directives.

You could use .htaccess, or another option is to do it in PHP. At the top of your PHP script, simply put something like this:

if (isset($_SERVER['HTTP_USER_AGENT']) &&
    stripos($_SERVER['HTTP_USER_AGENT'], 'spyder/Nutch-2') !== false) {
    header('HTTP/1.0 403 Forbidden'); // refuse the request outright
    exit;
}
// rest of code here