Writing a web bot [closed]

Today I decided to write a web bot/crawler/spider/etc. in PHP that only crawls news websites. I started by reading articles about crawlers and then ran into this issue:

How can a bot recognize that a URL/post/article/text is news-related?

The only solution I came up with is to check them for particular keywords, but I don't think that's a good, workable practice. At least it's not perfect!

So any ideas for better solutions would be appreciated.

You could use preg_match to match the keywords. The technique is simple and it works:

$text = "News: Flooding is expected today";
// preg_match returns 1 if the pattern matches, 0 if it does not, or false on error
$news_found = preg_match("/(news|sensation|discovery)/i", $text);

There is no reason to think that this is not a good solution.

You are right, you can't depend on keywords alone.

Here is my contribution:

  • Match the URL against some keywords
  • Search in the page description
  • Search in the page keywords
  • Look at other pages that link to this page (pages that your crawler visited earlier)

All of the above are factors that help you determine the type of website. You may also have a categorized database of sites, like arts sites, ... etc.
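
The first three factors can be combined into a rough score. This is only a minimal sketch, not a tested classifier: the keyword list, the function name news_score and the one-point-per-signal weighting are all assumptions made for illustration, and it relies on PHP's DOM extension being enabled.

```php
<?php
// Hypothetical keyword list; tune it for the sites you actually crawl.
const NEWS_PATTERN = '/\b(news|breaking|headline|press)\b/i';

/**
 * Returns a rough score: one point for each signal (URL, meta description,
 * meta keywords) that contains a news keyword. Illustrative only.
 */
function news_score(string $url, string $html): int
{
    $score = 0;

    // Signal 1: the URL itself.
    if (preg_match(NEWS_PATTERN, $url) === 1) {
        $score++;
    }

    // DOMDocument::loadHTML rejects an empty string, so stop here.
    if ($html === '') {
        return $score;
    }

    // Signals 2 and 3: <meta name="description"> and <meta name="keywords">.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if (($name === 'description' || $name === 'keywords')
            && preg_match(NEWS_PATTERN, $meta->getAttribute('content')) === 1) {
            $score++;
        }
    }

    return $score;
}
```

You would then treat a page as a news candidate when news_score($url, $html) reaches some threshold that you pick by experiment. The fourth factor (pages your crawler visited earlier) could add to the same score once you store that history.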

And remember, every algorithm just needs a start; the ideas will come to you as you go.