I'm writing a simple query to find URLs on commons.wikimedia.org, but I can't work out which sanitizing rules I should apply to get the exact file names used there.
E.g.: the flag of Ivory Coast is listed in French as Drapeau_de_la_Côte_d%27Ivoire
so I get that apostrophes are being sanitized, but the regular ô isn't. I've seen a lot of other file names with special characters preserved.
Is it safe to assume that all special chars are preserved and all punctuation and/or non-letters are sanitized?
Wikipedia percent-encodes all of its URLs in %XX format, one escape per byte (as the URL RFCs specify), and your browser does the final work of decoding them for display, just to make the URLs friendlier.
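So even though Chrome shows the URL as http://en.wikipedia.org/wiki/Flag_of_Côte_d'Ivoire
the original URL was http://en.wikipedia.org/wiki/Flag_of_C%C3%B4te_d'Ivoire
In other words, the ô is escaped too (as its UTF-8 bytes, %C3%B4); the browser just displays it decoded.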
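
For the query, it may be simplest to percent-encode the whole name yourself and compare encoded forms. Here is a minimal sketch in Python using the standard `urllib.parse` module. The helper names `to_url_form` and `to_display_form` are made up for illustration, and turning spaces into underscores is an assumption about how the title should be normalized, not the full set of rules the site applies.

```python
from urllib.parse import quote, unquote


def to_url_form(title: str, safe: str = "") -> str:
    """Hypothetical helper: percent-encode a page or file title.

    Spaces are turned into underscores first (an assumption for this
    sketch); every byte outside the unreserved set is then written as %XX.
    """
    return quote(title.replace(" ", "_"), safe=safe)


def to_display_form(encoded: str) -> str:
    """Hypothetical helper: decode %XX escapes back to readable text."""
    return unquote(encoded)


name = "Drapeau de la Côte d'Ivoire"

# Everything outside letters, digits and "_.-~" is escaped, including ô and '.
print(to_url_form(name))            # Drapeau_de_la_C%C3%B4te_d%27Ivoire

# Keeping the apostrophe literal gives the form seen in the second URL above.
print(to_url_form(name, safe="'"))  # Drapeau_de_la_C%C3%B4te_d'Ivoire

# Decoding gives the friendly form the browser shows.
print(to_display_form("Flag_of_C%C3%B4te_d%27Ivoire"))  # Flag_of_Côte_d'Ivoire
```

Since the same name can show up with or without escapes, decoding both sides with `unquote` before comparing is probably safer than guessing which characters were escaped.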