HTML to plain text - unknown original encoding

I'm working with PHP: fetching HTML from websites, converting it to plain text, and saving it to the database.

The text needs to be saved to the database as UTF-8. My first problem is that I don't know the original encoding. What's the best way to convert to UTF-8 from an unknown encoding?

The second issue is the HTML to plain text conversion. I tried using html2text, but it messed up all the non-ASCII characters.

What is the best approach?

Edit: It seems the part about plain text is not clear enough. What I need is not just to strip the HTML tags. I want to strip the tags while maintaining a kind of document structure: <p> and <li> tags would convert to line breaks, and so on, while tags like <script> would be removed completely, along with their content.

  • Use mb_detect_encoding() for encoding detection.

  • Use strip_tags() to get rid of HTML tags.

The rest, such as formatting the output, depends on your needs.

Edit: I don't know if a complete solution exists, but this link is really helpful if you want to improve an existing HTML-to-text PHP script on your own.
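
As a rough illustration of keeping some document structure while stripping tags, here is a minimal sketch. The function name htmlToPlainText() is made up for this example, the list of block-level tags is an assumption, and the input is assumed to be UTF-8 already. It relies on regular expressions, so malformed markup may trip it up; a parser such as DOMDocument is likely to be sturdier for real pages.

```php
<?php
// Hypothetical helper: converts an HTML fragment (assumed to be UTF-8 already)
// into plain text, keeping block boundaries as line breaks.
function htmlToPlainText(string $html): string
{
    // Drop <script> and <style> elements together with their content.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html);
    // Turn <br> and the closing tags of some common block elements into newlines.
    $html = preg_replace('#<br\s*/?>|</(p|li|div|tr|h[1-6])\s*>#i', "\n", $html);
    // Remove the remaining tags, then decode entities such as &amp; and &eacute;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of spaces/tabs, trim spaces around newlines, limit blank lines.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    $text = preg_replace('/ ?\n ?/', "\n", $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}
```

The script removal runs before strip_tags() because strip_tags() only removes the tags themselves, so the JavaScript source would otherwise end up in the output text.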

http://www.phpwact.org/php/i18n/utf-8

This function may be useful to you:

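<?php
// Return $x as UTF-8. Input that is not already valid UTF-8 is assumed
// to be ISO-8859-1, which is the same assumption utf8_encode() makes.
function FixEncoding($x){
  // mb_check_encoding() validates the bytes; mb_detect_encoding() only guesses.
  if(mb_check_encoding($x, 'UTF-8')){
    return $x;
  }else{
    // utf8_encode() is deprecated as of PHP 8.2; this call is the equivalent.
    return mb_convert_encoding($x, 'UTF-8', 'ISO-8859-1');
  }
}
?>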
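
FixEncoding() treats everything that is not UTF-8 as ISO-8859-1. If the pages may come in other single-byte encodings, one possible variation is to pass mb_detect_encoding() an explicit candidate list in strict mode. This is only a sketch: the name toUtf8() and the default candidate list are assumptions for illustration, and detection remains a guess, since many byte sequences are valid in several encodings at once.

```php
<?php
// Hypothetical helper: returns $raw converted to UTF-8.
// The default candidate list is an assumption; adjust it to the sites you fetch.
function toUtf8(string $raw, array $candidates = ['Windows-1252', 'ISO-8859-1']): string
{
    // Bytes that already form valid UTF-8 are returned untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Strict mode (third argument) rejects encodings in which $raw is invalid.
    $from = mb_detect_encoding($raw, $candidates, true);
    if ($from === false) {
        // Nothing matched: fall back to ISO-8859-1, in which every byte is valid.
        $from = 'ISO-8859-1';
    }
    return mb_convert_encoding($raw, 'UTF-8', $from);
}
```

For example, toUtf8("caf\xE9") converts the single Latin-1 byte for "é" into its two-byte UTF-8 form, while a string that is already valid UTF-8 passes through unchanged. When the HTTP Content-Type header or a <meta> tag declares a charset, that declaration is usually a better source than any guess.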