Stripping comments from Forth source code using regular expressions

I am trying to match all content between parentheses, including the parentheses themselves, in a non-greedy way. There should be a space before and after the opening parenthesis (or the start of a line before it) and a space before and after the closing parenthesis. Take the following text:

 ( )
  ( This is a comment )
    1 2 +
\ a
: square dup * ;
( foo bar 
baz )
(quux)
( ( )
(
( )

The first line should be matched, and so should the second line including its content. The second-to-last line should not be matched (or should raise an error), and the last line should be matched. The two lines containing foo bar baz should be matched together as one comment, but (quux) should not, as it has no space after the opening parenthesis or before the closing one. The line with the extra opening parenthesis inside should be matched.

I tried a few conventional regexes for matching content between parentheses, but without much success. The regex engine is Go's.

re := regexp.MustCompile(`(?s)\(( | .*? )\)`)
s = re.ReplaceAllString(s, "")

Playground: https://play.golang.org/p/t93tc_hWAG

Regular expressions "can't count" (that's over-simplified, but bear with me), so you can't match on an unbounded amount of parenthesis nesting. I guess you're mostly concerned about matching only a single level in this case, so you would need to use something like:

foo := regexp.MustCompile(`^ *\( ([^)]*? )?\)$`)

This requires the closing parenthesis to be the very last character on the line, so it may be better to allow zero or more trailing spaces (" *") before the $. It does NOT match the string "( ( ) )", nor does it try to cater for arbitrary nesting, as that is well beyond the counting that regular expressions can do.
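A minimal sketch of that "zero or more trailing spaces" variant follows. The names lineComment and isCommentLine are made up for illustration, and the (?m) flag and the \n exclusion are my additions so the pattern can be tried against single lines; treat it as one possible reading of the suggestion rather than a definitive pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// lineComment is a hypothetical variant of the single-level pattern:
// (?m) lets ^ and $ match at line boundaries, [^)\n] keeps the match on
// one line, and the trailing " *" tolerates spaces after the ")".
var lineComment = regexp.MustCompile(`(?m)^ *\( ([^)\n]*? )?\) *$`)

// isCommentLine reports whether line consists only of a one-level comment.
func isCommentLine(line string) bool {
	return lineComment.MatchString(line)
}

func main() {
	lines := []string{" ( )", "( This is a comment )  ", "(quux)", "( ( ) )", "1 2 +"}
	for _, line := range lines {
		fmt.Printf("%q -> %v\n", line, isCommentLine(line))
	}
}
```

Because [^)] still admits "(", the line "( ( )" is accepted, while "( ( ) )" is rejected at the first ")".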

What they can do in terms of counting is "count a specific number of times". What they can't do is "count how many blah, then make sure there's the same number of floobs" (that requires going from a regular expression to a context-free grammar).
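If nesting does matter, the counting has to happen outside the regex. Here is a rough sketch of a whitespace-token scanner that keeps an explicit depth counter. The function name stripNested is hypothetical, and the nesting rule itself is an assumption made for illustration (as far as I know, standard Forth ends a comment at the first closing parenthesis). It also ignores \ line comments and collapses runs of whitespace, so indentation is not preserved.

```go
package main

import (
	"fmt"
	"strings"
)

// stripNested drops everything between whitespace-delimited "(" and ")"
// tokens, tracking nesting depth explicitly. It returns an error if a
// comment is still open at the end of the input.
func stripNested(src string) (string, error) {
	depth := 0
	var out []string
	for _, line := range strings.Split(src, "\n") {
		var kept []string
		for _, tok := range strings.Fields(line) {
			switch {
			case tok == "(":
				depth++ // a comment opens, possibly inside another one
			case tok == ")" && depth > 0:
				depth-- // closes the innermost open comment
			case depth == 0:
				kept = append(kept, tok) // ordinary code outside any comment
			}
		}
		if len(kept) > 0 {
			out = append(out, strings.Join(kept, " "))
		}
	}
	if depth != 0 {
		return "", fmt.Errorf("unterminated comment: %d unclosed \"(\"", depth)
	}
	return strings.Join(out, "\n"), nil
}

func main() {
	out, err := stripNested("1 ( a ( b ) c ) 2\n( foo bar\nbaz )\n(quux)\n: square dup * ;")
	fmt.Println(out, err)

	_, err = stripNested("( ( )")
	fmt.Println(err)
}
```

Since tokens are compared whole, (quux) is kept as ordinary code, and a comment that spans several lines is handled by carrying depth across line boundaries.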


Here is a way to match the three single-line comments in question:

(?m)^[\t\p{Zs}]*\([\p{Zs}\t](?:[^()\n]*[\p{Zs}\t])?\)[\p{Zs}\t]*$

See the Go regex demo at the new regex101.com

Details:

(?m) - multiline mode on
^ - due to the above, the start of a line
[\t\p{Zs}]* - 0+ horizontal whitespaces
\( - a (
[\p{Zs}\t] - exactly 1 horizontal whitespace
(?:[^()\n]*[\p{Zs}\t])? - an optional sequence matching:
- [^()\n]* - a negated character class matching 0+ characters other than (, ) and a newline
- [\p{Zs}\t] - horizontal whitespace
\) - a literal )
[\p{Zs}\t]* - 0+ horizontal whitespaces
$ - due to (?m), the end of a line.

Go playground demo:

package main

import (
    "regexp"
    "fmt"
)

func main() {
    // Matches a line that holds nothing but a single-line "( ... )" comment.
    var re = regexp.MustCompile(`(?m)^[\t\p{Zs}]*\([\p{Zs}\t](?:[^()\n]*[\p{Zs}\t])?\)[\p{Zs}\t]*$`)
    var str = ` ( )
  ( This is a comment )
    1 2 +
\ a
: square dup * ;
( foo bar 
baz )
(quux)
( ( )
(
( )`

    // i is the ordinal of the match, not its byte offset in str.
    for i, match := range re.FindAllString(str, -1) {
        fmt.Printf("%q (match #%d)\n", match, i)
    }
}
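
Since the goal is to strip the comments rather than list them, here is a small sketch that plugs the same pattern into ReplaceAllString. The names commentLine and stripCommentLines are hypothetical, and the trailing \n? is my addition so that the line break goes away along with the comment. Multi-line comments are still left untouched.

```go
package main

import (
	"fmt"
	"regexp"
)

// commentLine matches a line holding only a single-line "( ... )" comment.
// The trailing \n? is an assumption of this sketch: it removes the line
// break too, so no blank line is left behind.
var commentLine = regexp.MustCompile(`(?m)^[\t\p{Zs}]*\([\p{Zs}\t](?:[^()\n]*[\p{Zs}\t])?\)[\p{Zs}\t]*$\n?`)

// stripCommentLines deletes every line that is entirely a single-line comment.
func stripCommentLines(src string) string {
	return commentLine.ReplaceAllString(src, "")
}

func main() {
	src := " ( )\n  ( This is a comment )\n    1 2 +\n(quux)\n( )"
	fmt.Printf("%q\n", stripCommentLines(src))
}
```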