I'm hoping someone can help me get to the bottom of a problem I am having. I had a script put together about a year ago which parses incoming email and stores details in a database.
I get the email through with headers like so:
-------- Forwarded Message --------
Subject: FS.G02 Fleet Street - j** associates (AG69)
Date: Thu, 14 Apr 2016 11:27:32 +0000
From: Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>
To: 'lucien@********.com' <lucien@********.com>
I use the following regex and PHP code to separate various pieces of data out ($text contains the above email string):
//Set RegEx to parse data out of text/plain email string
$re1 = '~(?<=From: )(.*?)(?: \<)(.*?)(?=\>)~';
$re2 = "~(?<=To: ').*(?=')~";
$re3 = "~(?<=Sent: ).*(?=)~";
$re4 = "~(?<=Subject: ).*(?=)~";
$re5 = "~(?<=Subject:\s)(.*?)(?=\s)(?:.*\s\-\s)(.*)~";
$re6 = "~\((.*?)\)~";
//Pull the data out using above expressions
if(preg_match($re1, $text, $matches1)) {
    $from_name = $matches1[1];
    $from_email = $matches1[2];
}
if(preg_match($re2, $text, $matches2))
    $to_email = $matches2[0];
if(preg_match($re3, $text, $matches3))
    $sent_date = $matches3[0];
if(preg_match($re4, $text, $matches4))
    $subject_line = $matches4[0];
if(preg_match($re5, $text, $matches5)) {
    $unit_code = $matches5[1];
    $company_name = $matches5[2];
}
//Change sent date to timestamp
$sent_date = strtotime($sent_date);
//break the unit code and building code apart
$unit_code = explode('.',$unit_code,2);
$building_code = $unit_code[0];
$unit_code = $unit_code[1];
//break the (C0D3) off the end of the company / subject line
$company_name = preg_replace($re6,'' ,$company_name);
I am trying to separate out these pieces of data so that I can store them in the DB.
My problem is that the script has stopped working properly. My RegEx isn't giving me the timestamp, nor is it breaking the subject line down into its component parts:
FS.G02 Fleet Street - j** associates (AG69)
The code at the beginning is one piece of data I need. I then break it up into the first two letters and the alphanumeric second half.
FS.G02 Fleet Street - j** associates (AG69)
The second part I need is always after the hyphen - it's a company / customer name.
The format of this hasn't changed since I last got it working, so I can't tell whether I have broken the RegEx. Can anyone with a little more RegEx experience than me see where I am going wrong?
Many thanks, Jonathan
Have you tried using imap_rfc822_parse_headers() (docs) instead of a regex? It would certainly make it a lot simpler.
EDIT: Realised the docs don't actually say a lot about the function. Here's a sample output, called on your data there:
object(stdClass)#1 (12) {
  ["date"]=> string(31) "Thu, 14 Apr 2016 11:27:32 +0000"
  ["Date"]=> string(31) "Thu, 14 Apr 2016 11:27:32 +0000"
  ["subject"]=> string(43) "FS.G02 Fleet Street - j** associates (AG69)"
  ["Subject"]=> string(43) "FS.G02 Fleet Street - j** associates (AG69)"
  ["toaddress"]=> string(69) "'lucien@********.com', UNEXPECTED_DATA_AFTER_ADDRESS@".SYNTAX-ERROR.""
  ["to"]=> array(2) {
    [0]=> object(stdClass)#2 (2) {
      ["mailbox"]=> string(7) "'lucien"
      ["host"]=> string(13) "********.com'"
    }
    [1]=> object(stdClass)#3 (2) {
      ["mailbox"]=> string(29) "UNEXPECTED_DATA_AFTER_ADDRESS"
      ["host"]=> string(14) ".SYNTAX-ERROR."
    }
  }
  ["fromaddress"]=> string(55) "Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>"
  ["from"]=> array(1) {
    [0]=> object(stdClass)#4 (3) {
      ["personal"]=> string(19) "Stephanie Zo*****ou"
      ["mailbox"]=> string(18) "Stephanie.Zo****ou"
      ["host"]=> string(14) "********.co.uk"
    }
  }
  ["reply_toaddress"]=> string(55) "Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>"
  ["reply_to"]=> array(1) {
    [0]=> object(stdClass)#5 (3) {
      ["personal"]=> string(19) "Stephanie Zo*****ou"
      ["mailbox"]=> string(18) "Stephanie.Zo****ou"
      ["host"]=> string(14) "********.co.uk"
    }
  }
  ["senderaddress"]=> string(55) "Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>"
  ["sender"]=> array(1) {
    [0]=> object(stdClass)#6 (3) {
      ["personal"]=> string(19) "Stephanie Zo*****ou"
      ["mailbox"]=> string(18) "Stephanie.Zo****ou"
      ["host"]=> string(14) "********.co.uk"
    }
  }
}
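For reference, a dump like the one above could come from a call along these lines. This is only a sketch: it assumes the IMAP extension is installed (hence the function_exists() guard), the header text is copied from the question with the To: line left out, and the variable names are simply the ones the question already uses.

```php
<?php
// Header block copied from the question (addresses are redacted there too).
$headers = "Subject: FS.G02 Fleet Street - j** associates (AG69)\r\n"
         . "Date: Thu, 14 Apr 2016 11:27:32 +0000\r\n"
         . "From: Stephanie Zo*****ou <Stephanie.Zo****ou@********.co.uk>\r\n";

// Defaults, so the variables exist even if nothing could be parsed.
$subject_line = $from_name = $from_email = null;
$sent_date = false;

// Assumption: the IMAP extension is loaded; without it this block is skipped.
if (function_exists('imap_rfc822_parse_headers')) {
    $parsed = imap_rfc822_parse_headers($headers);

    $subject_line = $parsed->subject;
    // The Date: header is exposed as ->date; strtotime() turns it into a timestamp.
    $sent_date = strtotime($parsed->date);
    // Each entry in ->from has personal, mailbox and host, as in the dump.
    $from_name  = $parsed->from[0]->personal;
    $from_email = $parsed->from[0]->mailbox . '@' . $parsed->from[0]->host;
}
```

With the sample headers, the timestamp comes from the Date: line, so no separate date pattern is needed on this route.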
Here's a regex for your subject line as well:
([A-Z0-9]*\.[A-Z0-9]*)\s([A-Za-z\s]*)\s-\s([A-Za-z\s]*)\s(\([A-Z0-9]*\))
When called with preg_match(), like:
$output = [];
$input = "FS.G02 Fleet Street - Something associates (AG69)";
preg_match("/([A-Z0-9]*\.[A-Z0-9]*)\s([A-Za-z\s]*)\s-\s([A-Za-z\s]*)\s(\([A-Z0-9]*\))/", $input, $output);
You will receive something like:
array(
0 => "FS.G02 Fleet Street - Something associates (AG69)",
1 => "FS.G02",
2 => "Fleet Street",
3 => "Something associates",
4 => "(AG69)"
)
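From there the matches can be split up the same way the question's script does. A minimal sketch, reusing the pattern and sample subject from above; $code_in_parens is a made-up name for the bracketed code, and the other variable names are borrowed from the question.

```php
<?php
$subject_line = "FS.G02 Fleet Street - Something associates (AG69)";
$pattern = '/([A-Z0-9]*\.[A-Z0-9]*)\s([A-Za-z\s]*)\s-\s([A-Za-z\s]*)\s(\([A-Z0-9]*\))/';

// Defaults, so the variables exist even when the subject does not match.
$building_code = $unit_code = $company_name = $code_in_parens = null;

if (preg_match($pattern, $subject_line, $output)) {
    // "FS.G02" -> building code "FS" and unit code "G02"
    list($building_code, $unit_code) = explode('.', $output[1], 2);
    // Group 3 is the company / customer name after the hyphen
    $company_name = $output[3];
    // Group 4 is "(AG69)"; trim() strips the surrounding parentheses
    $code_in_parens = trim($output[4], '()');
}
```

One caveat: [A-Za-z\s]* only accepts letters and whitespace, so a company name containing digits or punctuation would not match as written, and the character class would need widening for those cases.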