I'm working with PHP, getting html from websites, converting them to plain text and saving them to the database.
They need to be saved to the database in utf-8. My first problem is that I don't know the original encoding, what's the best way to encode to utf-8 from an unknown encoding?
the 2nd issue is the html to plain text conversion. I tried using html2text but it messed up all the foreign utf characters.
What is the best approach?
Edit: It seems the part about plain text is not clear enough. What i need not to just strip the html tags. I want to strip the tags while maintaining a kind of document structure. <p>
, <li>
tags would convert to line breaks etc and tags like <script>
would be completely removed with their content.
Use mb_detect_encoding()
for encoding detection.
Use strip_tags()
to get rid of HTML tags.
Rest of the subjects like formatting the output depends on your needs.
Edit: I don't know if a complete solution exists but this link is really helpful to improve existing html to text PHP scripts on your own.
This function may be useful to you:
<?php
function FixEncoding($x){
if(mb_detect_encoding($x)=='UTF-8'){
return $x;
}else{
return utf8_encode($x);
}
}
?>