Writing a web bot [closed]

Today it occurred to me to write a web bot/crawler/spider in PHP that crawls only news websites. I started by reading articles about crawlers and then ran into this issue:

How can a bot recognize that a URL/post/article/text is related to news?

The only solution I came up with is to check them for particular keywords, but I don't think that's a good, workable practice. At least it's not perfect!

So any ideas about better solutions are appreciated.

You could use preg_match to match the keywords. The technique is simple and it works:

$text = "News: Flooding is expected today";
// preg_match returns 1 if any keyword is found, 0 if none is (false on error)
$news_found = preg_match("/(news|sensation|discovery)/i", $text);

There's no reason to think this isn't a good solution.
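
One possible refinement of the keyword check above, as a rough sketch: count whole-word hits instead of stopping at the first one, so a single stray word counts for less. The helper name count_news_keywords, the keyword list, and the threshold of 2 are all assumptions made up for illustration, not something from the question.

```php
<?php
// Hypothetical helper: counts whole-word, case-insensitive keyword hits in a text.
function count_news_keywords(string $text, array $keywords): int
{
    // Escape each keyword so it is treated literally inside the pattern.
    $escaped = array_map(function ($k) {
        return preg_quote($k, '/');
    }, $keywords);

    // \b anchors matches to word boundaries, so "news" does not match "newsletter".
    $pattern = '/\b(' . implode('|', $escaped) . ')\b/i';

    // preg_match_all returns the number of matches, or false on error.
    $count = preg_match_all($pattern, $text);
    return $count === false ? 0 : $count;
}

$keywords = ['news', 'breaking', 'reported']; // illustrative list (assumption)
$text = "Breaking news: flooding is expected today, officials reported.";
$hits = count_news_keywords($text, $keywords);
$looks_like_news = $hits >= 2; // threshold of 2 is an arbitrary assumption
```

The threshold would need tuning against real pages; this only shows the counting mechanics.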

You are right, you can't depend on this alone.

This is my contribution:

Match the URL against some keywords
Search the page description
Search the page keywords
Look at other links to this page (on pages that your crawler visited earlier)

All of the above are factors that hint at the type of website. You may also keep a categorized database of sites (arts sites, etc.).
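
The first three factors could be combined into a simple score, roughly like the sketch below. The function name news_score, the weights, and the sample URL and HTML are assumptions for illustration only. The fourth factor (links from pages visited earlier) depends on how your crawler stores its history, so it is left out here. It uses DOMDocument, which requires PHP's dom extension.

```php
<?php
// Hypothetical scoring helper; the weights and keyword list are illustrative assumptions.
function news_score(string $url, string $html, array $keywords): int
{
    $escaped = array_map(function ($k) {
        return preg_quote($k, '/');
    }, $keywords);
    $pattern = '/(' . implode('|', $escaped) . ')/i';

    $score = 0;

    // Factor 1: a keyword appears in the URL (e.g. /news/...).
    if (preg_match($pattern, $url) === 1) {
        $score += 2;
    }

    // Factors 2 and 3: <meta name="description"> and <meta name="keywords">.
    $doc = new DOMDocument();
    // The @ suppresses parser warnings from malformed real-world HTML.
    if ($html !== '' && @$doc->loadHTML($html)) {
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            $name = strtolower($meta->getAttribute('name'));
            $content = $meta->getAttribute('content');
            if (($name === 'description' || $name === 'keywords')
                && preg_match($pattern, $content) === 1) {
                $score += 1;
            }
        }
    }

    return $score;
}

// Made-up sample page for illustration.
$html = '<html><head>'
      . '<meta name="description" content="Latest news from the region">'
      . '<meta name="keywords" content="weather, flooding, headlines">'
      . '</head><body></body></html>';
$score = news_score('https://example.com/news/flooding', $html, ['news', 'headlines']);
```

A higher score only means more signals agreed; where to draw the line between news and non-news is something you would have to decide from the sites you actually crawl.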

And remember, every algorithm just needs a start; ideas will come to mind as you go.