Usually, when I'm replacing newlines I jump to Regexp, like in this PHP
preg_replace('/\R/u', "\n", $String);
Because I know that to be a very durable way to replace any kind of Unicode newline (be it \n, \r, \r\n, etc.)
I was trying to do something like this in Go as well, but I get
error parsing regexp: invalid escape sequence:
\R
On this line
msg = regexp.MustCompilePOSIX("\\R").ReplaceAllString(html.EscapeString(msg), "<br>\n")
I tried using (?:(?>\r\n)|\v)
from https://stackoverflow.com/a/4389171/728236, but it looks like Go's regex implementation doesn't support that either, panicking with invalid or unsupported Perl syntax: '(?>'
What's a good, safe way to replace newlines in Go, Regex or not?
I see this answer here (Golang: Issues replacing newlines in a string from a text file) saying to use \r?\n, but I'm hesitant to believe that it would catch all Unicode newlines, mainly because of this question, which has an answer listing many more newline code points than the 3 that \r?\n covers.
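The two parse failures described above can be reproduced without a panic by calling regexp.Compile, which returns the error instead. The following is a minimal sketch; checkPattern is a hypothetical helper name, not something from the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// checkPattern is a hypothetical helper: it returns the parse error that
// Go's regexp package reports for pattern, or nil if the pattern compiles.
func checkPattern(pattern string) error {
	_, err := regexp.Compile(pattern)
	return err
}

func main() {
	// \R and the atomic group (?> are rejected by Go's RE2-based syntax;
	// \r?\n compiles but only covers LF and CRLF.
	for _, p := range []string{`\R`, `(?:(?>\r\n)|\v)`, `\r?\n`} {
		fmt.Printf("%-20q %v\n", p, checkPattern(p))
	}
}
```

Using Compile rather than MustCompile is convenient while experimenting, since an unsupported pattern becomes an ordinary error value.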
You may "decode" the \R pattern as
U+000DU+000A|[U+000AU+000BU+000CU+000DU+0085U+2028U+2029]
See the Java regex docs explaining the \R shorthand:
Linebreak matcher \R Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
In Go, you may use the following:
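// removeLBR removes every Unicode line break sequence from text.
// The CRLF alternative comes first so that "\r\n" is matched as one unit.
func removeLBR(text string) string {
	re := regexp.MustCompile(`\x{000D}\x{000A}|[\x{000A}\x{000B}\x{000C}\x{000D}\x{0085}\x{2028}\x{2029}]`)
	return re.ReplaceAllString(text, ``)
}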
Here is a Go demo.
Some of the Unicode codes can be replaced with regex escape sequences supported by Go regexp:
re := regexp.MustCompile(`\r\n|[\r\n\v\f\x{0085}\x{2028}\x{2029}]`)
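To tie this back to the question's goal (HTML-escape the message, then turn each line break into "<br>" followed by a newline), here is a minimal sketch that compiles the pattern once at package level. The name nl2br is a hypothetical helper chosen for illustration.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// lineBreak matches one Unicode line break sequence. The CRLF alternative
// is listed first so that "\r\n" is replaced once rather than twice.
var lineBreak = regexp.MustCompile(`\r\n|[\r\n\v\f\x{0085}\x{2028}\x{2029}]`)

// nl2br is a hypothetical helper: it HTML-escapes msg and then replaces
// every line break with "<br>\n".
func nl2br(msg string) string {
	return lineBreak.ReplaceAllString(html.EscapeString(msg), "<br>\n")
}

func main() {
	fmt.Printf("%q\n", nl2br("a<b\r\nc\u2028d"))
	// prints "a&lt;b<br>\nc<br>\nd"
}
```

Compiling the pattern once, rather than inside the function on every call, avoids repeating the compilation work for each message.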
While using regexp usually yields an elegant and compact solution, often it's not the fastest.
For tasks where you have to replace certain substrings with others, the standard library provides a really efficient solution in the form of strings.Replacer:
Replacer replaces a list of strings with replacements. It is safe for concurrent use by multiple goroutines.
You may create a reusable replacer with strings.NewReplacer(), where you list the pairs of strings to replace and their replacements. When you want to perform a replacement, you simply call Replacer.Replace().
Here's what it would look like:
const replacement = "<br>\n"

// "\r\n" is listed before "\r" and "\n" so that a CRLF pair is replaced once.
var replacer = strings.NewReplacer(
	"\r\n", replacement,
	"\r", replacement,
	"\n", replacement,
	"\v", replacement,
	"\f", replacement,
	"\u0085", replacement,
	"\u2028", replacement,
	"\u2029", replacement,
)

func replaceReplacer(s string) string {
	return replacer.Replace(s)
}
Here's what the regexp solution from Wiktor's answer looks like:
var re = regexp.MustCompile(`\r\n|[\r\n\v\f\x{0085}\x{2028}\x{2029}]`)

func replaceRegexp(s string) string {
	return re.ReplaceAllString(s, "<br>\n")
}
The implementation is actually quite fast. Here's a simple benchmark comparing it to the above pre-compiled regexp solution:
const input = "1st\nsecond\r\nthird\r4th\u0085fifth\u2028sixth"

func BenchmarkReplacer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		replaceReplacer(input)
	}
}

func BenchmarkRegexp(b *testing.B) {
	for i := 0; i < b.N; i++ {
		replaceRegexp(input)
	}
}
And the benchmark results:
BenchmarkReplacer-4 3000000 495 ns/op
BenchmarkRegexp-4 500000 2787 ns/op
For our test input, strings.Replacer was more than 5 times faster.
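A speed comparison only means something if both functions produce the same result, so it may be worth checking that first. The sketch below repeats the two definitions to stay self-contained; sameOutput is a hypothetical helper added for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const replacement = "<br>\n"

// "\r\n" comes first so that a CRLF pair is replaced once.
var replacer = strings.NewReplacer(
	"\r\n", replacement,
	"\r", replacement,
	"\n", replacement,
	"\v", replacement,
	"\f", replacement,
	"\u0085", replacement,
	"\u2028", replacement,
	"\u2029", replacement,
)

var re = regexp.MustCompile(`\r\n|[\r\n\v\f\x{0085}\x{2028}\x{2029}]`)

// sameOutput is a hypothetical helper: it reports whether the Replacer and
// the regexp produce identical results for s.
func sameOutput(s string) bool {
	return replacer.Replace(s) == re.ReplaceAllString(s, replacement)
}

func main() {
	const input = "1st\nsecond\r\nthird\r4th\u0085fifth\u2028sixth"
	fmt.Println(sameOutput(input)) // prints "true"
}
```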
There's also another advantage. In the example above we obtain the result as a new string value (in both solutions), which requires a new string allocation. If we need to write the result to an io.Writer (e.g. we're creating an HTTP response or writing the result to a file), strings.Replacer lets us avoid creating that new string: its handy Replacer.WriteString() method takes an io.Writer and writes the result into it without allocating and returning it as a string. This further increases the performance gain compared to the regexp solution.