The text being stored in the database also includes the CSS styling.
<p>ABC | Min. XYZ
<style type="text/css"><!--td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}-->
</style>
<span data-sheets-userformat="{"2":3011,"3":{"1":0},"4":[null,2,16777215],"9":1,"10":1,"11":4,"12":0,"14":[null,2,0]}" data-sheets-value="{"1":2,"2":"PQR"}" style="font-size: 10pt; font-family: Arial; color: rgb(0, 0, 0); text-align: center;">PQR</span></p>
To get rid of &nbsp; I have used html.UnescapeString() and it works perfectly fine.
When the text is fetched from the database, I want to display it in this format: ABC | Min. XYZ PQR
But the actual result (after using html.UnescapeString()) is:
ABC | Min. XYZ
<style type="text/css">
<!--td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}-->
</style>
<span data-sheets-userformat="{"2":3011,"3":{"1":0},"4":[null,2,16777215],"9":1,"10":1,"11":4,"12":0,"14":[null,2,0]}" data-sheets-value="{"1":2,"2":"PQR"}" style="font-size: 10pt; font-family: Arial; color: rgb(0, 0, 0); text-align: center;">PQR</span></p>
This looks simple, but it requires three steps:
1. Strip HTML tags such as <p> and <style type="text/css">
2. Unescape HTML entities
3. Replace runs of whitespace, including non-breaking spaces (U+00A0), with single spaces
You can do this with github.com/microcosm-cc/bluemonday, html, and strings:
// Your input text
input := `<p>ABC | Min. XYZ
<style type="text/css"><!--td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}-->
</style>
<span data-sheets-userformat="{"2":3011,"3":{"1":0},"4":[null,2,16777215],"9":1,"10":1,"11":4,"12":0,"14":[null,2,0]}" data-sheets-value="{"1":2,"2":"PQR"}" style="font-size: 10pt; font-family: Arial; color: rgb(0, 0, 0); text-align: center;">PQR</span></p>`
// Strip all HTML tags
p := bluemonday.StrictPolicy()
output := p.Sanitize(input)
// Unescape HTML entities
output = html.UnescapeString(output)
// Condense whitespace
output = strings.Join(strings.Fields(strings.TrimSpace(output)), " ")
output is now ABC | Min. XYZ PQR
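To see why the unescape and condense steps belong together, here is a minimal standard-library sketch. It leaves out the tag-stripping step, and the helper name condenseSpaces is made up for illustration. It assumes the stored text contains &nbsp; entities, as described in the question.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// condenseSpaces is a hypothetical helper (not part of any library). It
// unescapes HTML entities and then collapses every run of Unicode
// whitespace into a single ASCII space.
func condenseSpaces(s string) string {
	// html.UnescapeString turns "&nbsp;" into U+00A0, not an ASCII space.
	unescaped := html.UnescapeString(s)
	// strings.Fields splits on unicode.IsSpace, which includes U+00A0,
	// and it also drops leading and trailing whitespace.
	return strings.Join(strings.Fields(unescaped), " ")
}

func main() {
	in := "ABC&nbsp;|&nbsp;Min. XYZ\n\n  PQR&nbsp;"
	fmt.Println(condenseSpaces(in)) // ABC | Min. XYZ PQR
}
```

Unescaping alone would leave U+00A0 characters and the newlines in place, which is why the whitespace step is still needed afterwards.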
For the last step, using strings.Fields looks cleaner than using a regexp since \s doesn't cover non-breaking spaces (U+00A0) and thus requires the following:
// Leading and trailing spaces
output = regexp.MustCompile(`^[\s\p{Zs}]+|[\s\p{Zs}]+$`).ReplaceAllString(output, "")
// middle spaces
output = regexp.MustCompile(`[\s\p{Zs}]{2,}`).ReplaceAllString(output, " ")
See more on matching whitespace here: How to remove redundant spaces/whitespace from a string in Golang?
Finally, you can combine the above into a single function, as is done in github.com/grokify/gotilla/html/htmlutil:
var bluemondayStrictPolicy = bluemonday.StrictPolicy()

func HTMLToTextCondensed(s string) string {
	return strings.Join(
		strings.Fields(
			strings.TrimSpace(
				html.UnescapeString(
					bluemondayStrictPolicy.Sanitize(s),
				),
			)),
		" ",
	)
}