Multilingual website and search engines

I'm developing a site for a company that has clients from all over the world, and the site will be served in two languages: Italian (local) and English. When a visitor arrives, I check their IP address: if it's coming from Italy I show the site in Italian; if not, I show it in English. Of course, visitors will have the option to manually override the language. What exactly happens when search engine bots inspect the site to index the pages?

  • crawlers usually have US-based IPs
  • even if the crawlers "click" the "change language" link to reach the Italian pages, they don't accept cookies (and therefore sessions), so I can't persist the language setting or keep track of what was chosen

So the question is: how can I handle this situation so that search engines crawl both languages and index both of them?

Google actually has an article in their Webmaster guidelines on this subject. You may want to take a look, as they specifically address the issues you have raised: http://www.google.com/support/webmasters/bin/answer.py?answer=182192

I'd use subdomains:

eng.mysite.com/whatever
it.mysite.com/whatever

Then have a sitemap that points to the home page of each of those language subdomains (a sketch follows below), and both versions should be crawled just fine.
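
For illustration, a minimal sitemap along these lines might look like the following; the subdomain names are just the hypothetical ones from above. Note that the sitemaps protocol expects a sitemap to list URLs on its own host, so in practice you may end up with one sitemap per subdomain, each submitted separately:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One entry per language home page; "eng" and "it" are the
           hypothetical subdomain names used above -->
      <url>
        <loc>http://eng.mysite.com/</loc>
      </url>
      <url>
        <loc>http://it.mysite.com/</loc>
      </url>
    </urlset>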

You can use the following approach:

  • Scan the Accept-Language header ($_SERVER['HTTP_ACCEPT_LANGUAGE']) for languages that the user agent prefers. This is usually more reliable than checking the IP address for their country.
  • Check the User-Agent header ($_SERVER['HTTP_USER_AGENT']) to see if the request comes from a search engine, such as "Googlebot" or "Yahoo! Slurp" (a sketch combining both checks follows this list).
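
Here is a minimal PHP sketch of both checks, assuming a site that serves Italian ("it") and English ("en"); the helper names and the list of crawler signatures are hypothetical and non-exhaustive:

    <?php
    // Rough check for common search engine crawlers via the User-Agent header.
    function isSearchBot(): bool
    {
        $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        // Small, non-exhaustive list of crawler signatures (assumption)
        return (bool) preg_match('/Googlebot|Slurp|bingbot/i', $ua);
    }

    // Pick a default language for the current request.
    function detectLanguage(): string
    {
        // Don't auto-switch languages for crawlers: serve each URL in one
        // stable language so both versions get indexed predictably
        if (isSearchBot()) {
            return 'en';
        }
        // Accept-Language looks like "it-IT,it;q=0.9,en;q=0.8"; taking the
        // first two letters of the first entry is a rough but common shortcut
        $accept = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
        return (strtolower(substr($accept, 0, 2)) === 'it') ? 'it' : 'en';
    }

This only decides the default for human visitors; you would still want each language to live at its own stable URL (as in the subdomain answer above) so crawlers can reach and index both versions.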