Stripping input down to plain text

I'm currently finalising the code for my comment system, and I want it to work a little like Stack Overflow does with its posts. I would like my users to be able to use bold, italic and underline only, and to do that I would use the following:

_ Text _ * BOLD * -Italic-

Now, firstly I would like to know a way of stripping a comment completely clean of any tags, html entities and such, so for example, if a user was to use any html / php tags, they would be removed from the input.

I am currently using strip_tags, but that can leave the output looking quite nasty. Even if an abusive or blatant XSS/injection attempt has been made, I would still like the plain text to be output in full and not chopped up; strip_tags seems to make an absolute mess of that.

What I will then do is replace the asterisks with bold HTML tags, and so on, AFTER stripping the content clean of HTML tags.

How do people suggest I do this? Currently this is the comment sanitize function:

function cleanNonSQL( $str )
{
    return strip_tags( stripslashes( trim( $str ) ) );
}

PHP tags are surrounded by <? and ?>, or maybe <% and %> on some ages-old installations, so removing PHP tags can be managed with a regex:

$cleaned=preg_replace('/\<\?.*?\?\>/', '', $dirty);
$cleaned=preg_replace('/\<\%.*?\%\>/', '', $cleaned);

Next you take care of the HTML tags. These are surrounded by < and >. Again you can do this with a regex:

$cleaned=preg_replace('/\<.*?\>/','',$cleaned);

This will transform

// Single quotes, so PHP does not try to interpolate $this inside the string.
$dirty='blah blah blah <?php echo $this; ?> foo foo foo <some> html <tag> and <another /> bar bar';

into

$cleaned="blah blah blah  foo foo foo  html  and  bar bar";
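One thing to watch with these patterns: by default `.` does not match a newline in PCRE, so a tag that spans several lines would survive. A minimal sketch that adds the `s` modifier to the same three regexes; the function name `stripTagsRegex` is made up for illustration:

```php
<?php
// Hypothetical helper: the same three patterns, each with the "s" modifier
// so that "." also matches newlines and multi-line tags are removed.
function stripTagsRegex(string $dirty): string
{
    // PHP open/close tags (short or long form)
    $cleaned = preg_replace('/<\?.*?\?>/s', '', $dirty);
    // old ASP-style tags
    $cleaned = preg_replace('/<%.*?%>/s', '', $cleaned);
    // anything left between angle brackets
    return preg_replace('/<.*?>/s', '', $cleaned);
}

$dirty = "blah <?php\necho 1;\n?> foo <a\nhref='x'>link</a> bar";
echo stripTagsRegex($dirty); // prints "blah  foo link bar"
```

This still removes anything that merely looks like a tag, so it shares the limits of the regex approach in general.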

You could try using regular expressions to strip the tags, such as:

preg_replace("/\<(.+?)\>/", '', $str);

Not sure if that's what you're looking for, but it will remove anything inside < and >. You can also make it a little more foolproof by requiring the first character after the < to be a letter.
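As a rough sketch of that stricter pattern (the exact regex is only my guess at what "first character is a letter" could look like, with an optional `/` so closing tags are caught too):

```php
<?php
// Only treat "<" as the start of a tag when it is followed by a letter,
// optionally preceded by "/" for closing tags.
$pattern = '/<\/?[a-zA-Z][^>]*>/';

$str = 'a <b>bold</b> move, 1 < 2 and 3 > 2 :<';
echo preg_replace($pattern, '', $str); // prints "a bold move, 1 < 2 and 3 > 2 :<"
```

Stray < and > characters in ordinary text are left alone this way, but it is still a heuristic and will not catch everything a browser might treat as markup.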

The correct way is not to delete HTML tags from your user's comment, but to tell the browser that the following text should not be interpreted as HTML, JavaScript, or anything else. Imagine someone wants to post example code like we do here on Stack Overflow. If you just bluntly remove any parts of a comment that seem to be code, you will mess up the user's comment.

The solution is to use htmlentities which will escape symbols used for html markup in the comment so that it will actually show up as just text in the browser.

For example, the browser will interpret a < as the beginning of an HTML tag. If you just want the browser to display a <, you have to write &lt; in the source code. htmlentities will convert all the relevant symbols into their HTML entities for you.

Longer Example

echo htmlentities("<b>this text should not be bold</b><?php echo PHP_SELF;?>");

Outputs

&lt;b&gt;this text should not be bold&lt;/b&gt;&lt;?php echo PHP_SELF;?&gt;

The browser will output

<b>this text should not be bold</b><?php echo PHP_SELF;?>

Consider the following real-life example with the solution you accepted. Imagine a user writing this comment:

i'm in a bad mood today :<. but your blog made me really happy :>

You will now do your preg_replace("/\<(.+?)\>/", '', $comment); on the text and it will remove half the comment:

i'm in a bad mood today :

If that's what you wanted, never mind this answer. If you don't, use htmlentities.
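To tie this back to the question, here is a minimal sketch of escaping first and only then turning the *bold*, -italic- and _underline_ markers into tags. The function name `formatComment` and the marker patterns are assumptions for illustration; `htmlspecialchars` is used as the narrower sibling of htmlentities that escapes only the HTML special characters.

```php
<?php
// Hypothetical helper: escape everything the user typed, then re-introduce
// a small whitelist of formatting tags generated by our own code.
function formatComment(string $comment): string
{
    // 1. Neutralise all markup. ENT_QUOTES also escapes single and double quotes.
    $safe = htmlspecialchars(trim($comment), ENT_QUOTES, 'UTF-8');

    // 2. Convert the markers. The negated classes stop a match at the next marker.
    $safe = preg_replace('/\*([^*]+)\*/', '<b>$1</b>', $safe);
    $safe = preg_replace('/_([^_]+)_/', '<u>$1</u>', $safe);
    $safe = preg_replace('/-([^-]+)-/', '<i>$1</i>', $safe);

    return $safe;
}

echo formatComment('*hi* <script>alert(1)</script> :<');
// prints "<b>hi</b> &lt;script&gt;alert(1)&lt;/script&gt; :&lt;"
```

Nothing the user typed is lost, and the only real tags in the output are the ones the function added itself. The marker patterns are deliberately naive: hyphenated words or snake_case names would also be picked up, so they would need tightening for real use.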

If you want to save the comment as a file and not have the server interpret PHP code inside it, save it with an extension like '.html' or '.txt', so that the web server won't call the PHP interpreter in the first place. There is usually no need to escape PHP code.