These are some samples of the data I am working with (I made some comments on the side):
TSG MUM
BS06-312
RQWE. FKB
BS06-204
NM. JAK
BS06-E05
DB. FKB
BS06-312
IGT. resetk
Wender. //--> special CASE
ENG I.
WEHN BS06-E06
ENG II
FLEM BS06-203 //--> special CASE: 2 Subjects
ITSI. MUM
BS06-E02
PQT. RIE
BS11-QCR PQT
MARK BS11-QCR
PQT FIS
BS11-QCR //--> special CASE: several Subjects
INC FEY
BS06-309
FU MAT
SKU BS06-309
ABS. DOE
BS06 ABS
VOG BS06
ABS HEI
BS06 ABS
MOR BS06
ABS REM
BS06 ABS
DEI BS06
ABS THA
BS06
ENG III.
GLIT BS06-209
ENG II
WANN BS06-208
These are subjects in a class schedule. The first letters represent the subject being taught. After that come the teacher's initials, separated by a space. The last position is the building and room number.
Sometimes several subjects are taught at the same time.
The data comes from an ICS calendar file and I simply copied it here. The newline characters also need to be considered.
I need to extract the subject name, the teacher's initials and the room number so I can work with them. Any ideas on how to proceed? A complete regex pattern would be ideal.
I am working with PHP.
Thank you for your help.
I will exclude this line :
FU MAT
SKU BS06-309
What have we got here ?
FU : subject
MAT : teacher
SKU : ???
BS06-309 : room
Solution :
Anyway, for the rest of the block you can use this regex :
(?:\s|\\n)*(?<subject>\S+(?:\s[IVX]+\.?)?)(?:\s|\\n)+(?<teacher>\S+)(?:\s|\\n)+(?<room>\S+)(?:\s|\\n)*
Details :
(?:\s|\\n)*                        # spaces or \n - not captured
(?<subject>\S+(?:\s[IVX]+\.?)?)    # non-spaces plus I., II., III, IV... -> subject
(?:\s|\\n)+                        # spaces or \n - not captured
(?<teacher>\S+)                    # non-spaces -> teacher
(?:\s|\\n)+                        # spaces or \n - not captured
(?<room>\S+)                       # non-spaces -> room
(?:\s|\\n)*                        # spaces or \n - not captured
Result :
+-------+----------+---------+----------+
| MATCH | SUBJECT | TEACHER | ROOM |
+-------+----------+---------+----------+
| 1 | TSG | MUM | BS06-312 |
| 2 | RQWE. | FKB | BS06-204 |
| 3 | NM. | JAK | BS06-E05 |
| 4 | DB. | FKB | BS06-312 |
| 5 | IGT. | resetk | Wender. |
| 6 | ENG I. | WEHN | BS06-E06 |
| 7 | ENG II | FLEM | BS06-203 |
| 8 | ITSI. | MUM | BS06-E02 |
| 9 | PQT. | RIE | BS11-QCR |
| 10 | PQT | MARK | BS11-QCR |
| 11 | PQT | FIS | BS11-QCR |
| 12 | INC | FEY | BS06-309 |
| 13 | ABS. | DOE | BS06 |
| 14 | ABS | VOG | BS06 |
| 15 | ABS | HEI | BS06 |
| 16 | ABS | MOR | BS06 |
| 17 | ABS | REM | BS06 |
| 18 | ABS | DEI | BS06 |
| 19 | ABS | THA | BS06 |
| 20 | ENG III. | GLIT | BS06-209 |
| 21 | ENG II | WANN | BS06-208 |
+-------+----------+---------+----------+
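Since the question mentions PHP, here is a minimal sketch of how the three named groups could be collected with preg_match_all. The helper name parseSchedule and the sample string are made up for illustration. The sketch assumes that any escaped \n sequences (backslash followed by n) left over from the ICS file should be treated as line breaks, so it converts them to real newlines first; otherwise \S+ would swallow them as ordinary non-space characters. After that conversion, \s alone is enough as a separator.

```php
<?php
// Hypothetical helper: returns one array (subject, teacher, room) per entry.
function parseSchedule(string $raw): array
{
    // Assumption: escaped "\n" sequences from the ICS text mean line breaks.
    // '\n' in single quotes is a literal backslash + n, "\n" is a real newline.
    $text = str_replace('\n', "\n", $raw);

    // Same three named groups as in the answer, with \s as the only separator.
    $pattern = '/(?<subject>\S+(?:\s[IVX]+\.?)?)\s+(?<teacher>\S+)\s+(?<room>\S+)/';

    // PREG_SET_ORDER gives one array per match instead of one array per group.
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $entries = [];
    foreach ($matches as $m) {
        $entries[] = [
            'subject' => $m['subject'],
            'teacher' => $m['teacher'],
            'room'    => $m['room'],
        ];
    }
    return $entries;
}

// A few entries taken from the sample data in the question.
$sample = "TSG MUM\nBS06-312\nRQWE. FKB\nBS06-204\nENG I.\nWEHN BS06-E06\nENG II\nFLEM BS06-203";
$entries = parseSchedule($sample);

foreach ($entries as $e) {
    echo $e['subject'], ' | ', $e['teacher'], ' | ', $e['room'], PHP_EOL;
}
```

The four-token entry that was excluded above (FU MAT SKU BS06-309) would still shift every following match by one token, so it has to be filtered out or handled separately before this runs.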
Improve it !
There are Roman numerals sometimes : ENG I., ENG II, ...
I assume you'll only have numbers from 1 to 39, that's why I only use [IVX]. You can improve this part by adding L, C, M... or by using a real regex for Roman numerals.
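As a rough sketch of that last idea, the [IVX]+ part could be swapped for the commonly used strict pattern for Roman numerals from 1 to 3999. The variable names and the sample text (including the made-up subject ENG XLII.) are only for illustration, and \s is assumed to be the only separator.

```php
<?php
// Widely used strict pattern for Roman numerals 1..3999. Every part of it is
// optional, so the lookahead makes sure a numeral letter actually follows.
$roman = '(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})';

// Same structure as the answer's regex, with [IVX]+ replaced by $roman.
$pattern = '/(?<subject>\S+(?:\s' . $roman . '\.?)?)\s+(?<teacher>\S+)\s+(?<room>\S+)/';

// Made-up sample: one subject with a larger numeral, one teacher whose
// initials merely start with a numeral letter (MUM).
$text = "ENG XLII.\nGLIT BS06-209\nTSG MUM\nBS06-312";

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m['subject'], ' | ', $m['teacher'], ' | ', $m['room'], PHP_EOL;
}
```

One caveat with either version: teacher initials that happen to form a complete Roman numeral on their own (VI, or with the wider pattern also MM, DC and so on) would be read as part of the subject, so the narrower [IVX] set may be the safer choice if such initials can occur.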