Shortish version
I am using this regex:
(^|[yY]\s{0,}\=|\+|\-)\s{0,}([0-9]{0,}\.?[0-9]{0,})\s{0,}(\*{0,1}[xX]{0,1})\s{0,}(\^{0,1})(-?)([0-9]{0,}\.?[0-9]{0,})(\s{0,}|$)?
To try to extract the coefficient and order of each element from equations like this:
y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1
I want the regex to ignore the erroneous 4x^, which is missing its power number (it doesn't currently do this), and allow me to get to this final result:
((1.0, 1.0), (3.3, -50.0), (15.0, 25.5), (2.0, 2.0), (3.0, -2.5), (1.1, 0.0))
Where the first value in each pair is the coefficient and the second is the order of that element. Currently the regex above 'nearly' works if I take groups 1&2 and 5&6 to give me the coefficient and order respectively.
It just falls over on the erroneous 4x^, and it feels extremely inelegant, but I am somewhat noob at regex and am not sure what improvements to make.
How can I improve this regex, and also fix it so that 4x^ is considered 'wrong' but 4x2 and 4x^2 are both fine?
tl;dr version
I am trying to parse polynomial equations entered by users in order to validate each equation and then decompose it into a series of elements. The equations will be presented as strings.
Here is an example of how the users are asked to format their string:
y = 2.0x^2.5 - 3.1x + 5.2
Where x is the independent variable (not a times symbol) and y is the dependent variable.
In reality the users commonly make any of the following mistakes:
- Leaving out y =
- Adding * to coefficients such as y = 2.0*x
- Writing whole numbers without a decimal point, e.g. y = 5x
- Leaving out ^ when setting the order e.g. y = x3
However, for all of these I'd say it's still easily understandable what the user is trying to write. By that I mean it is obvious what the coefficient and order are meant to be for each element.
So what I want to do is write some regex that correctly splits the entered string into separate elements and can get me A (the coefficient) and B (the order) of each element, where an element in general is of the form Ax^B and A and B can each be any real number.
I devised the following example:
y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1
Which I believe covers all of the potential issues I outlined above, plus one other straight-up mistake: 4x^+2x^2 is missing the order on the element 4x^.
For this example I'd like to get to: ((1.0, 1.0), (3.3, -50.0), (15.0, 25.5), (2.0, 2.0), (3.0, -2.5), (1.1, 0.0)) where 4x^ has been ignored.
I am somewhat new to regex but I have made an effort using regex101.com to create the following:
(^|[yY]\s{0,}\=|\+|\-)\s{0,}([0-9]{0,}\.?[0-9]{0,})\s{0,}(\*{0,1}[xX]{0,1})\s{0,}(\^{0,1})(-?)([0-9]{0,}\.?[0-9]{0,})(\s{0,}|$)?
This appears to nearly work, with the following issue:
The erroneous 4x^ given above is not rejected. I am not sure how to make the optionality of the order number 'conditional' on the presence of ^ whilst also working when ^ is not present but the order number is, such as y = 4x2
Also please note I am happily ignoring the issue of repeated elements with the same order not being summed, e.g. I am happy to ignore y = x^2 + x^2 not appearing as y = 2x^2.
Thank you for any help.
p.s. Program to be written in Go, but I am also somewhat noob at Go so I am first prototyping in Python. Not sure if this will make any difference to the regex (I really am that new to regex).
The following regex will mostly do:
(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)
I say mostly because this solution treats the "4x^" case as having order 1. The requirements are already pretty lenient, and trying to ignore such a term makes the RE much more complicated, or even impossible, because it creates an ambiguity which cannot be resolved with a RE.
Please note that absent coefficients/exponents will not be captured as '1.0' as in your example result. That has to be done after applying the regex, by treating an empty coefficient group as '1' and an empty exponent group as '1' (or as '0' when the term is a bare constant, i.e. when it was captured by c2).
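As a rough illustration of that post-processing step, here is a minimal sketch that looks the groups up by name with SubexpIndex (available since Go 1.15) rather than by position. The names pattern, firstNumber and decode are made up for this example, and the defaults are the ones described above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Same pattern as in this answer.
const pattern = `(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)`

var re = regexp.MustCompile(pattern)

// Group positions looked up by name, so the code does not depend on group order.
var (
	c1 = re.SubexpIndex("c1")
	e1 = re.SubexpIndex("e1")
	e2 = re.SubexpIndex("e2")
	c2 = re.SubexpIndex("c2")
)

// firstNumber parses the first non-empty candidate (inner spaces removed),
// or returns def when every candidate is empty.
func firstNumber(def float64, candidates ...string) (float64, error) {
	for _, s := range candidates {
		if s != "" {
			return strconv.ParseFloat(strings.ReplaceAll(s, " ", ""), 64)
		}
	}
	return def, nil
}

// decode turns one submatch into (coefficient, exponent), applying the defaults.
func decode(m []string) (coef, exp float64, err error) {
	if coef, err = firstNumber(1, m[c1], m[c2]); err != nil {
		return 0, 0, err
	}
	defExp := 1.0 // a term with x but no exponent has order 1
	if m[c2] != "" {
		defExp = 0 // a bare constant has order 0
	}
	exp, err = firstNumber(defExp, m[e1], m[e2])
	return coef, exp, err
}

func main() {
	for _, s := range []string{"x", "3.3X^-50", "+1.1"} {
		coef, exp, err := decode(re.FindStringSubmatch(s))
		fmt.Println(s, coef, exp, err)
	}
}
```

Looking groups up by name keeps the decoding code valid if the pattern is later reordered, at the cost of four extra package-level variables.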
You can paste the regex into regex101.com to check how it works.
And here is a working program in Go which tests a couple of cases:
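package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Capture groups: 1 = c1 (coefficient of an x term), 2 = e1 (exponent after ^),
// 3 = e2 (exponent without ^), 4 = c2 (bare constant).
const e = `(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)`

var cases = []string{
	"y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1",
	"3.3X^-50",
}

// parse returns the first non-empty string in ss as a float64, ignoring any
// spaces inside it, or the default d if all of them are empty.
func parse(d float64, ss ...string) float64 {
	for _, s := range ss {
		if s != "" {
			// The regex only captures well-formed numbers, so the error is ignored.
			c, _ := strconv.ParseFloat(strings.Replace(s, " ", "", -1), 64)
			return c
		}
	}
	return d
}

func main() {
	re := regexp.MustCompile(e)
	for i, c := range cases {
		fmt.Printf("testing case %v: %q\n", i, c)
		ms := re.FindAllStringSubmatch(c, -1)
		if ms == nil {
			fmt.Println("no match")
			continue
		}
		for i, m := range ms {
			fmt.Printf(" match %v: %q\n", i, m[0])
			// A missing coefficient defaults to 1.
			c := parse(1.0, m[1], m[4])
			// A missing exponent defaults to 1, or to 0 for a bare constant (c2).
			de := 1.0
			if m[4] != "" {
				de = 0.0
			}
			e := parse(de, m[2], m[3])
			fmt.Printf(" c: %v\n", c)
			fmt.Printf(" e: %v\n", e)
		}
	}
}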
Which outputs:
testing case 0: "y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1"
match 0: "x"
c: 1
e: 1
match 1: "+3.3X^-50"
c: 3.3
e: -50
match 2: "+ 15x25.5"
c: 15
e: 25.5
match 3: "- 4x"
c: -4
e: 1
match 4: "+2x^2"
c: 2
e: 2
match 5: "+3*x-2.5"
c: 3
e: -2.5
match 6: "+1.1"
c: 1.1
e: 0
testing case 1: "3.3X^-50"
match 0: "3.3X^-50"
c: 3.3
e: -50
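If the 4x^ term really has to be dropped, one option is to leave the regex alone and filter in code. Match 3 in the output above ("- 4x") ends right before the dangling ^, so a match whose next non-space character is ^ can be skipped. The sketch below assumes that rule is acceptable for your input; Term, group, num and parseTerms are hypothetical names, not part of the program above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Same pattern as in this answer.
const termPattern = `(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)`

var termRe = regexp.MustCompile(termPattern)

// Term is a hypothetical helper type representing Coef*x^Exp.
type Term struct{ Coef, Exp float64 }

// group returns the text of capture group g, or "" if it did not take part in the match.
func group(s string, loc []int, g int) string {
	if loc[2*g] < 0 {
		return ""
	}
	return s[loc[2*g]:loc[2*g+1]]
}

// num parses the first non-empty candidate (inner spaces removed); d is the default.
func num(d float64, candidates ...string) float64 {
	for _, c := range candidates {
		if c == "" {
			continue
		}
		if v, err := strconv.ParseFloat(strings.ReplaceAll(c, " ", ""), 64); err == nil {
			return v
		}
	}
	return d
}

// parseTerms extracts every term, skipping any match that is immediately
// followed (ignoring spaces) by a stray '^', as in "4x^+2x^2".
func parseTerms(s string) []Term {
	var terms []Term
	for _, loc := range termRe.FindAllStringSubmatchIndex(s, -1) {
		rest := strings.TrimLeft(s[loc[1]:], " ")
		if strings.HasPrefix(rest, "^") {
			continue // '^' with no usable exponent after it: treat the term as invalid
		}
		c1, e1 := group(s, loc, 1), group(s, loc, 2)
		e2, c2 := group(s, loc, 3), group(s, loc, 4)
		defExp := 1.0
		if c2 != "" {
			defExp = 0 // bare constant
		}
		terms = append(terms, Term{num(1, c1, c2), num(defExp, e1, e2)})
	}
	return terms
}

func main() {
	fmt.Println(parseTerms("y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1"))
	// prints [{1 1} {3.3 -50} {15 25.5} {2 2} {3 -2.5} {1.1 0}]
}
```

FindAllStringSubmatchIndex is used instead of FindAllStringSubmatch because the check needs the end offset of each match. If you would rather reject the whole equation than silently drop the term, the continue can be replaced with returning an error.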