使用python和javascript的正则表达式慢,但在go和php中快速失败

I wrote a regex to parse PostgreSQL errors to try and show to the user which field had duplicated data. The regex is this one:

^DETAIL:.[^\(]+.(.[^\)]+).[^\(]+.(.[^\)]+). already exists

It will be pretty fast if you run it against the right message like this one (https://regex101.com/r/GZuREV/1):

ERROR:  duplicate key value violates unique constraint "uq_content_block_internal_name_store_id"
DETAIL:  Key (lower(internal_name::text), store_id)=(some content block-32067683, 0c6d20a7-d843-44f3-af9c-4a2cf2a47e4c) already exists.

But if PostgreSQL emits another message like the following one, it will take python about 30 seconds in my machine to answer (https://regex101.com/r/GZuREV/2).

ERROR:  null value in column "active" violates not-null constraint
DETAIL:  Failing row contains (2018-08-16 14:23:52.214591+00, 2018-08-16 14:23:52.214591+00, null, 6f6d1bc9-c47e-46f8-b220-dae49bd58090, bf24d26e-4871-4335-9f18-83c5a52f1b3a, Some Product-a1c03dde-2de9-401c-92d5-5c1500908984, {"de_DE": "Fugit tempore voluptas quos est vitae.", "en_GB": "Qu..., {"de_DE": "Fuga reprehenderit nobis reprehenderit natus magni es..., {"de_DE": "Fuga provident dolorum. Corrupti sunt in tempore quae..., my-product-53077578, SKU-53075778, 600, 4300dc25-04e2-4193-94c0-8ee97b636739, 52553d24-6d1c-4ce6-89f9-4ad765599040, null, 38089c3c-423f-430c-b211-ab7a57dbcc13, 7d7dc30e-b06b-48b7-b674-26d4f705583b, null, {}, 0, null, 9980, 100, 1, 5).

If go to the regex101 link you can see that if you toggle to different languages like php or go, they all return quite fast saying that no match was found but if you choose python or javascript you will get a timeout.

My quick dirty fix was something like this:

match = 'already exists' in error_message and compiled_regex.search(error_message)

What do you think that could be causing that? Could it be the greedy operators to consume until I reach my desired data?

Update 1

Using re.IGNORECASE in python, makes it around 9 seconds slower as it spends too much time lowercasing something.

With ignorecase

With ignorecase

Without ignorecase

Without ignorecase

Update 2

Playing around I saw that to make it work and fail a simple change will break

^DETAIL:.[^\(]+?\((.[^\)]+?).[^\(]+?.(.[^\)]+?). already exists
                            ^ just changing this to \) make it stop timing out
^DETAIL:.[^\(]+?\((.[^\)]+?)\)[^\(]+?.(.[^\)]+?). already exists

This is a regex monster :)

Why not split the 2 regexes?

  1. Check if already exists is a match (very quick)
  2. extract the data you want to display with your existing regex ^DET.[^\(]+.(.[^\)]+).[^\(]+.(.[^\)]+)

That should speed up your code by a lot. (You can even shorten DETAIL as I did)

Package regexp

import "regexp"

Package regexp implements regular expression search.

The syntax of the regular expressions accepted is the same general syntax used by Perl, Python, and other languages. More precisely, it is the syntax accepted by RE2 and described at https://golang.org/s/re2syntax, except for \C. For an overview of the syntax, run

go doc regexp/syntax

The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.) For more information about this property, see

http://swtch.com/~rsc/regexp/regexp1.html

or any book about automata theory.


By design, Go regular expressions are guaranteed to run in time linear in the size of the input, a property not guaranteed by some other implementations of regular expressions. See Regular Expression Matching Can Be Simple And Fast.

It's not really an answer to the question but I think the problem could be the greedy operators. anyway, I think you should make some part of it lazy to fail fast.

I used this pattern and it's ok on all of the engines of the languages on regex101:

^DETAIL:.+?\((.+)\).+?\((.+)\) already exists.

link to the matched error

link to the error with no match

TL;DR: Use this:

^DETAIL:\s*+Key[^\(]++\((.+)\)[^\(]+\(([^\)]+)\) already exists

See matching example and non-matching one

Explained:

First of all, the original regexp does not seem to match the whole key group, you stopped at lower(internal_name::text, leaving out some columns of the composite key and also an unbalanced parenthesis. If you modify it like this it should work capturing the composite key. If that is not supposed to do so, just let me know:

^DETAIL:.[^\(]+.(.+)\)[^\(]+.(.[^\)]+). already exists

Just by changing this, the regex is 'runnable', but still quite slow.

One of he main reasons is this [^\(]+. It first matches up to DETAIL: Failing row contains(space) and follows on with the rest of the regex. It will not match, so It backtracks to one less character, up to DETAIL: Failing row contains and goes on with the rest of the regexp. It will not match, so will go back to DETAIL: Failing row contain... etc.

One way to avoid this is use a possessive quantifier. That means that once you fetch something, you cannot go back. So using this [^\(]++ instead of this [^\(]+ (that is: ^DETAIL:.[^\(]++.(.+)\)[^\(]+.(.[^\)]+). already exists) makes the regexp decrease the steps from 28590 to 1290.

But you can still improve it. If you know that your required data uses the keyword key, use it! That way, since It is not present on the failing example, it will make the regex fail soon (once it reads DETAIL and the next word)

So if you use ^DETAIL:\s*+Key[^\(]++.(.+)\)[^\(]+.(.[^\)]+). already exists steps are are now just 12.

If you feel like using key is too specific, you could use something less generic trying to find "not 'Fail'". Like this:

^DETAIL:\s*+(?!Fail)[^\(]++.(.+)\)[^\(]+.(.[^\)]+). already exists

That way It is 17 steps.

Finally, you can tune the regex for the matching content.

Change this:

^DETAIL:\s*+Key[^\(]++.(.+)\)[^\(]+
.           # <============= here, use \( instead
(.[^\)]+). already exists

by this:

^DETAIL:\s*+Key[^\(]++.(.+)\)[^\(]+\((.[^\)]+). already exists

That reduces steps from 538 to 215 as you do less backtracking.

Then, after removing several useless dots and replacing some (supposed-to-be-parenthesis) dots by \( or \) (a personal taste) you have the final regex:

^DETAIL:\s*+Key[^\(]++\((.+)\)[^\(]+\(([^\)]+)\) already exists