In an earlier thread about inserting brackets around "comments" in a chess pgn-like string, I got excellent help finishing a regex that matches move lists and comments separately.
Here is the current regex:
((?:\s?[\(\)]?\s?[\(\)]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[\(\)]?\s?[\(\)]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2}).)+)
The three capture groups are:
In action here: http://regex101.com/r/dQ9lY5
Everything works correctly for "Your regular expression in" PCRE(PHP): it matches all three groups correctly. When I switch to "Your regular expression in" Javascript, however, it matches everything as Capture Group 1. Is there something in my regex that isn't supported by the Javascript regex engine? I tried to research this, but haven't been able to solve it. There is so much information on this topic, and I've already spent hours and hours.
I know one solution is to use the regex as-is, and pass it to PHP through AJAX, etc, but I don't know how to do that yet (it's on my list to learn).
Question 1: But I am also very curious about what it is in this regex that doesn't work on the Javascript regex engine.
Also, here is my Javascript CleanPgnText function. I am most interested in the while loop, but if anything else seems wrong, I would appreciate any help.
function CleanPgnText(pgn) {
var pgnTextEdited = '';
var str;
var pgnInputTextArea = document.getElementById("pgnTextArea");
var pgnOutputArea = document.getElementById("pgnOutputText");
str = pgnInputTextArea.value;
str = str.replace(/\[/g,"("); //sometimes he uses [ incorrectly for variations
str = str.replace(/\]/g,")");
str = str.replace(/[\n¬]*/g,""); // remove newlines and that weird character that MS Word sticks in
str = str.replace(/\s{2,}/g," "); // turn more than one space into one space
while ( str =~ /((?:\s?[\(\)]?\s?[\(\)]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[\(\)]?\s?[\(\)]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2})[^\)\(])+)|((?:\)\s\())/g ) {
if ($1.length > 0) { //
pgnTextEdited += $1;
}
else if ($2.length > 0) {
pgnTextEdited += '{' + $2 + '}';
}
else if ($3.length > 0) {
pgnTextEdited += $3;
}
}
pgnOutputArea.innerHTML = pgnTextEdited;
}
Question 2: Regarding the =~ in the while statement, while ( str =~ ... : I got the =~ from helpful code in my original thread, but it was written in Perl. I don't quite understand how the =~ operator works. Can I use this same operator in Javascript, or should I be using something else?
Question 3: Can I use .length the way I am, when I say if ($1.length > 0), to see if the first capture group had a match?
Thank you in advance for any help. (If the regex101 link doesn't work for you, you can get a sample pgn to test on from the original thread).
I corrected your javascript code and got the following:
Personally, I think the matching (group) problems are related to http://regex101.com/. Your expression definitely works in JavaScript (see the fiddle) and in Java (with escaping corrections). I trimmed your JavaScript slightly and took the PGN data from a parameter instead of a text input.
As far as I know, JavaScript has no =~ operator. In JavaScript you loop through the matches using something like:
var pattern = /myregexp/g; // the g flag is needed, otherwise exec() restarts at index 0 on every call
var match;
while ((match = pattern.exec(mytext)) != null) {
    // do something
}
exec() returns null once there are no more matches. A group that did not take part in the current match is undefined, so a loose test such as match[2] != null still detects it. You address the groups through the match variable from above with an index: match[2] is matching group 2.
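The loop described above can be tried on its own. This is a minimal sketch, assuming a deliberately tiny pattern (not the PGN regex from the question); the names pattern, classifyTokens, tokens and tildeResult are made up for illustration. It also shows what the Perl-looking str =~ /re/ does in JavaScript: the tokenizer reads it as an assignment followed by the bitwise NOT operator, so the variable ends up holding -1.

```javascript
// Hypothetical, simplified pattern: group 1 = a move number such as "12.",
// group 2 = a run of lowercase letters. The g flag makes exec() advance.
var pattern = /(\d+\.)|([a-z]+)/g;

function classifyTokens(text) {
  var result = [];
  var match;
  pattern.lastIndex = 0; // reset in case the regex object was used before
  while ((match = pattern.exec(text)) !== null) {
    // A group that did not participate in this match is undefined.
    if (match[1] !== undefined) {
      result.push("number:" + match[1]);
    } else if (match[2] !== undefined) {
      result.push("word:" + match[2]);
    }
  }
  return result;
}

var tokens = classifyTokens("1. e 2. d");

// What "=~" does in JavaScript: it is "=" followed by "~" (bitwise NOT).
var tildeResult = "abc";
tildeResult =~ /b/; // parsed as: tildeResult = ~(/b/)
```

Checking each group against undefined (or loosely against null) replaces the $1.length test from the question, since JavaScript does not define $1 as a variable.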
I was looking at your new regex, and it's not quite right, even though it appears to work with @wumpz's JS code.
You can't simply exclude parentheses with [^)(] in the comments section, because the only parentheses you match separately are the literal ) ( sequence (in capture group 3).
That could leave a parenthesis unmatched, in which case it never becomes part of the new string that is constructed. It's not likely, because the moves section matches parentheses too.
To fix that, exclude only the ) ( sequence from comments, and match that sequence first (group 1).
Also, I left some notes on the changes made to your new regex.
Try it out. I think @wumpz deserves the credit.
# /(\)\s*\()|((?:\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-])(?!\)\s*\()[\S\s])+)/
   ( \) \s* \( )   # (1), 'Special Comment' configuration (must match first)
 |                 # OR,
   (               # (2 start), 'Moves' configuration
      (?:
         \s?
         [()]? \s? [()]?
         \s?
         [0-9]{1,3} \.{1,3}
         \s
         [NBRQK]? [a-h1-8]? x? [a-hO] [1-8-] [O-]{0,3} [!?+#=]{0,2} [NBRQ]?
         [!?+#]{0,2}
         (?:
            \s
            [NBRQK]? [a-h1-8]? x? [a-hO] [1-8-] [O-]{0,3} [!?+#=]{0,2} [NBRQ]? [!?+#]{0,2}
         )?
         \s?
         [()]? \s? [()]?
         \s?
      )+
   )               # (2 end)
 |                 # OR,
   (               # (3 start), 'Normal Comment' configuration
      (?:
         (?!       # Not the 'Moves configuration'
            \s?
            [()]? \s? [()]?
            \s?
            [0-9]{1,3} \.{1,3}
            \s
            [NBRQK]? [a-h1-8]? x? [a-hO] [1-8-]
            # ----
            # Next line is not needed
            # because all its items are
            # optional
            # ----
            ### [O-]{0,3} [!?+#=]{0,2} [NBRQ]? [!?+#]{0,2} <- not needed
         )
         ### [^)(] <- replaced by '[\S\s]' below
         # ----
         # The above line is replaced by "any char",
         # because it excludes all ()'s, which is not appropriate
         (?! \) \s* \( )   # Also, not the 'Special Comment' configuration
         [\S\s]            # Consume any char
      )+
   )               # (3 end)
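The "consume any character unless a lookahead says stop" construct used for group 3 above can be tried in isolation. This is a minimal sketch, assuming a much-simplified moves pattern (a number, a dot, a space, then a square such as e4) in place of the full one; simplifiedPattern, bracketComments and the sample strings are made up for illustration.

```javascript
// Hypothetical, simplified version of the three-way alternation:
//   group 1: the literal ") (" sequence (tried first)
//   group 2: one or more simplified "moves" such as " 1. e4 "
//   group 3: any run of characters that does not start a move or ") ("
var simplifiedPattern =
  /(\)\s*\()|((?:\s?\d+\.\s[a-h][1-8]\s?)+)|((?:(?!\s?\d+\.\s[a-h][1-8])(?!\)\s*\()[\s\S])+)/g;

function bracketComments(text) {
  var out = "";
  var m;
  simplifiedPattern.lastIndex = 0; // reset before each run
  while ((m = simplifiedPattern.exec(text)) !== null) {
    if (m[1] !== undefined) {
      out += m[1];             // ") (" is copied unchanged
    } else if (m[2] !== undefined) {
      out += m[2];             // moves are copied unchanged
    } else {
      out += "{" + m[3] + "}"; // everything else is wrapped as a comment
    }
  }
  return out;
}

var withVariations = bracketComments("intro 1. e4 first idea) (second idea 2. d4");
var withLoneParen = bracketComments("see (note");
```

Because group 3 consumes any character, a lone parenthesis stays inside the comment instead of being skipped, while the ") (" sequence is still kept outside the braces by the second lookahead.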
Modifying @wumpz's JS code, it would look like this with the modified regex:
function CleanPgnText(pgn) {
var pgnTextEdited = '';
var str;
var pgnOutputArea = document.getElementById("pgnOutputText");
str = pgn;
str = str.replace(/\[/g, "("); //sometimes he uses [ incorrectly for variations
str = str.replace(/\]/g, ")");
str = str.replace(/[\n¬]*/g, ""); // remove newlines and that weird character that MS Word sticks in
str = str.replace(/\s{2,}/g, " "); // turn more than one space into one space
//Start regexp processing
var pattern = /(\)\s*\()|((?:\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-])(?!\)\s*\()[\S\s])+)/g;
var match; // declared locally so it does not leak as an implicit global
while ((match = pattern.exec(str)) != null) {
if (match[1] != null) { // Special Comment configuration, don't add '{}'
pgnTextEdited += match[1];
} else if (match[2] != null) { // Moves configuration
pgnTextEdited += match[2];
} else if (match[3] != null) { // Normal Comment configuration, add '{}'
pgnTextEdited += '{' + match[3] + '}';
}
}
//end regexp processing
pgnOutputArea.innerHTML = pgnTextEdited;
}
Running this in a Perl program, the output is:
{Khabarovsk is the capital of Far East of Russia. My 16-year-old opponent was a promising local prodigy. Now he is a very strong FM with a FIDE rating of 2437 and lives... in the USA, too! A small world.} 1. e4 c5 2. Nf3 e6 3. c3 Nf6 4. e5 Nd5 5. d4 cxd4 6. cxd4 d6 7. Nc3 Nc6 8. Bd3!? Nxc3 9. bxc3 dxe5 10. dxe5 Qa5 11. O-O Be7 12. Qb3 Nxe5 13. Nxe5 Qxe5 14. Bb5+ Kf8 15. Ba3 Qc7 16. Rad1 g6 17. c4! Bxa3 18. Qxa3+ Kg7 19. Rd6 Rd8 20. c5 Bd7 21. Bc4 Bc6 22. Rfd1 Rd7 23. Qg3 Rad8 {Finally with accurate, solid play Black has consolidated yet White still keeps some pressure and has some compensation for the pawn.} 24. h4 {A typical march in such positions, simply nothing else to do better.} 24... h5?! ( 24... h6 {would be a more careful response. }) ({ But the best defense was} 24... Rd6! 25. cd6 Qa5 ) 25. Qe5+ Kh7 26. Bd3 {Very natural} 26... Kh6? ( {Missing} 26... Ba4! 27. Qxh5+ Kg7 28. Qe5+ Kg8! {and now Black has many own threats. White would have to force a perpetual after} 29. h5! Bxd1 30. h6 f6 31. Qxf6 Bh5 32. Qxe6+ Kh7 33. Bxg6+ Bxg6 34. Qxg6+ Kh8 35. Qf6+ {Now, after 26...Kh6 everything is ready for preparing a decisive blow.} ) 27. Qf6! Kh7 ( {There is no} 27... Rxd6 28. cxd6 Rxd6? {due to} 29. Qh8# ) 28. g4! hxg4 29. h5 Rxd6 30. cxd6 Rxd6 31. hxg6+ Kg8 32. g7! {This pawn is the vital factor until the end now. With any other move, White loses.} 32... Qd8! {The only defense against Qh6 and Qh8 checkmating or queening.} 33. Qh6 f5 34. Rd2!! {The idea is the white rook cannot be taken with a check anymore. The bishop will be easily unpinned with the crushing Bxf5 or Bc4. The Black pin on d file was an illusion! In fact it's Black's rook that is pinned and cannot leave d file.} 34... Bd5 ( {The best try - to close d file with protecting more e6 pawn. No help is} 34... Rd7 35. Bf5 ef5 36. Qh8 Kf7 37. Rd7 ) ( {But maybe the best practical chance was} 34... g3!? {and now} 35. Bxf5 {doesn't win because of} 35... gxf2+ 36. Kh2 f1=N+! 37. Kh3 Bg2+! 38. Rxg2 Rd3+! 39. 
Bxd3 Qxd3+ {with an amazing perpetual} 40. Kh4 Qe4+ 41. Rg4 Qh1+ 42. Kg5 Qd5+ 43. Kf6 Qd8+ 44. Kg6 Qd3+ ) ( {But after} 34... g3!? {White wins using another wing tactic:} 35. Bc4! Bd5 36. Bxd5 exd5 37. Qh8+ Kf7 38. Rc2 gxf2+ 39. Kf1! {and there is no defense against Rc8. Now after 35...Bd5 again everything looks well protected.} ) 35. Qh8 Kf7 36. Bb5! {The bishop still makes his way breaking through. The coming Be8 is a killer.} 36... Qg8 37. Be8+! Qxe8 38. Qe8+ Kxe8 39. g8=Q+ Kd7 40. Qg7+ {It was White's 40th move Which means time control was over for me. I was short on time. A piece and three pawns for a queen is not enough. Black resigned. 1-0 }
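The clean-up steps that run before the regex (square brackets to parentheses, stripping newlines and the stray ¬ character, collapsing repeated whitespace) can be pulled out of CleanPgnText and checked without a page. This is a minimal sketch, assuming the character class in the original code held a newline escape; normalizePgnText and the sample input are made up for illustration.

```javascript
// Hypothetical helper: the same four replacements as in CleanPgnText,
// chained in the same order, with no DOM access.
function normalizePgnText(raw) {
  return raw
    .replace(/\[/g, "(")       // [ is sometimes used incorrectly for variations
    .replace(/\]/g, ")")
    .replace(/[\n¬]/g, "")     // drop newlines and the character MS Word inserts
    .replace(/\s{2,}/g, " ");  // collapse runs of whitespace into one space
}

var normalized = normalizePgnText("1. e4 [1. d4]\n  e5¬");
```

One thing to watch with this approach: newlines are deleted, not replaced by a space, so two words separated only by a line break would be joined together.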