Today I decided to write a web bot (crawler/spider) in PHP that only crawls news websites. I started by reading articles about crawlers and then ran into this issue:
How can a bot recognize that a URL/post/article/text is related to news?
The only solution I came up with is to check them for particular keywords, but I don't think that's a good, workable practice. At least it's not perfect!
So any ideas about better solutions are appreciated.
You could use preg_match to match the keywords. The technique is simple and it works:
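$text = "News: Flooding is expected today";
// preg_match returns 1 if the pattern matches, 0 if not (false on error); /i makes it case-insensitive
$news_found = preg_match("/(news|sensation|discovery)/i", $text);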
There is no reason to think that is not a good solution.
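One possible refinement, as a rough sketch only: count how many distinct keywords appear as whole words and compare the count against a threshold, instead of stopping at the first hit. The function name `news_keyword_score`, the keyword list, and the threshold below are all assumptions for illustration, not a tested classifier.

```php
<?php
// Hypothetical helper: counts how many of the given keywords occur in $text.
function news_keyword_score(string $text, array $keywords): int
{
    $score = 0;
    foreach ($keywords as $keyword) {
        // \b requires word boundaries, so "news" does not match inside "newsletter".
        $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
        if (preg_match($pattern, $text) === 1) {
            $score++;
        }
    }
    return $score;
}

// Assumed keyword list and threshold; both would need tuning on real pages.
$keywords = ['news', 'breaking', 'reported', 'headline'];
$text = "News: Flooding is expected today";
$score = news_keyword_score($text, $keywords);
$is_news = $score >= 1;
```

Raising the threshold trades missed news pages for fewer false positives, so it is worth checking against a handful of pages you have labeled by hand.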
You are right, you can't depend on this alone.
This is my contribution.
All of the above are factors that hint at what type of website you are looking at. You may also keep a categorized database of sites (news sites, arts sites, etc.).
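The categorized database idea could be sketched as a simple lookup by host name. Everything here is hypothetical: the `site_category` function, the table contents, and the example.com hosts are placeholders, and a real crawler would more likely load the table from a database.

```php
<?php
// Hypothetical category table: host name => category.
$site_categories = [
    'news.example.com'    => 'news',
    'gallery.example.org' => 'arts',
];

// Returns the known category for a URL's host, or null if the host is unknown.
function site_category(string $url, array $categories): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host component, or the URL could not be parsed
    }
    $host = strtolower($host);
    // Treat "www.news.example.com" and "news.example.com" as the same site.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $categories[$host] ?? null;
}

$category = site_category('https://www.news.example.com/world/story-1', $site_categories);
```

A lookup like this only answers for sites already in the table, so it works best combined with the content-based signals rather than as a replacement for them.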
And remember, every algorithm just needs a start; the ideas will come to you as you go.