Disable checking of page URLs on external links in a PHP link crawler

I have created a standalone link crawler script for finding broken links on a site, based on the following example: http://phpcrawl.cuab.de/example.html.

It works fine for crawling the links, but it also follows each external link and checks the URLs found on that external page. That part is not needed. I only want to check the internal links, the URLs on the internal pages, and whether each external link itself is broken. So I need to disable checking of the URLs (and image src attributes) on external pages: only check whether the external link is broken, and don't check the URLs found in that link's content.

If you read the documentation for the framework you are using, you will find the addURLFollowRule() method, which forces the crawler to follow only URLs that match specific patterns.

Add this to your code and adjust the regex pattern to match your internal URL(s):

$crawler->addURLFollowRule("#https?://internal/.*# i");

Documentation: http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_addURLFollowRule.htm
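If the internal host is not fixed, the pattern can be built from the host name rather than typed by hand. The following is only a minimal sketch: buildFollowRule() is a hypothetical helper (not part of PHPCrawl), www.example.com is a placeholder host, and it assumes the rule is matched against the full absolute URL.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): build a follow rule that
// matches only URLs on the given host.
function buildFollowRule(string $host): string
{
    // preg_quote() escapes the dots in the host name; "#" is the delimiter.
    // "(:\d+)?" allows an optional port, "(/|$)" ends the host part.
    return '#^https?://' . preg_quote($host, '#') . '(:\d+)?(/|$)#i';
}

// Placeholder host, replace with your own site.
$rule = buildFollowRule('www.example.com');

// Quick check of the pattern without running a crawl.
var_dump(preg_match($rule, 'http://www.example.com/about.html'));   // int(1)
var_dump(preg_match($rule, 'https://other.example.org/page.html')); // int(0)

// With PHPCrawl loaded, the rule would be passed on as shown above:
// $crawler->addURLFollowRule($rule);
```

Escaping the host matters because an unescaped dot matches any character, so a hand-written pattern can accept hosts that only look similar.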

Or simply use one of the setFollowMode() settings:

http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_setFollowMode.htm

E.g. $crawler->setFollowMode(2); // Crawler stays on the same host
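As far as I understand the documentation, a crawler that stays on its host will not request external URLs at all, so whether an external link itself is broken would have to be checked separately. A rough sketch of that idea, using only plain PHP functions: isInternalUrl() and isBrokenLink() are hypothetical helpers, the URLs are placeholders, and how you collect the external links from each page is left open.

```php
<?php
// Hypothetical helper: true if $url points to the same host as the site.
function isInternalUrl(string $url, string $siteHost): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    // Relative or malformed URLs have no host and are treated as not internal.
    return is_string($host) && strcasecmp($host, $siteHost) === 0;
}

// Hypothetical helper: request the URL and look only at the status line,
// without parsing the page for further links.
function isBrokenLink(string $url): bool
{
    $headers = @get_headers($url);
    if ($headers === false) {
        return true; // no response at all
    }
    // $headers[0] is the first status line, e.g. "HTTP/1.1 200 OK".
    // 2xx and 3xx responses are treated as "not broken" here.
    return preg_match('#^HTTP/\S+\s+[23]\d\d#', $headers[0]) !== 1;
}

// Placeholder data for illustration.
$siteHost = 'www.example.com';
$links = [
    'http://www.example.com/contact.html',
    'https://other.example.org/page.html',
];

foreach ($links as $link) {
    if (isInternalUrl($link, $siteHost)) {
        echo "internal, left to the crawler: $link\n";
    } else {
        echo "external, status check only: $link\n";
        // $broken = isBrokenLink($link); // performs a network request
    }
}
```

This way the crawler handles internal pages and their links, while each external link gets a single request and its content is never crawled.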