Stackoverflow: I need your help!
I've been tasked with turning some (fairly) complex work diagrams for railway staff extracted from a Word document into something more usable for further processing, such as into a PHP array.
Here is a sample of one of the work diagrams:
LTP BH 4000
( Link 5)
DVR Su
On 00.22 PASS Barnham 00+34 5H97
Off 08.03 Lham 00+42
Hrs 7:41 PPTC Lham (06+24) 5N08
Traction for the above Service is
Days Su class 377
From 18/05/2014 377 PC Lham 01+46 5S62 DOO
To 24/08/2014 (Via CET)
TC Lham O Sh 01+50
PNB
377 PC Lham O Sh 03+10 5W62 DOO
(Via CWM)
DTCS Lham 03+32
377 PP Lham Shed 04+10 5W00 DOO
(Via CWM)
DTCS Lham Shed 04+24
PPTC Lham Shed (07+39) 5E24
Traction for the above Service is
class 377
PPTC Lham (06+37) 5H92
Traction for the above service is
class 377
377 PP Lham Shed 05+45 5W01 DOO
(Via CET)
377 Lham O Sh 05+57 06+28 5W01 DOO
(Via CWM)
TC Lham Shed 06+42
PPTC Lham Shed (09+58) 5H67
Traction for the above Service is
class 377
PPTC Lham Shed (07+41) 5P29 RP MO
Traction for the above Service is
class 377
(Unit forms part of 22+17
attachment)
PASS Lham 07.54 2P31
(To Bognor Regis)
Barnham 08.02
Routes 919
I've managed to process some of the data using simple regular expressions, but where I am struggling is the "middle" data, which actually shows the work to be done. There is no real structure that defines what each line should look like: many lines differ from one another, and some even include free text notes.
What I am looking to accomplish is to turn each row into an array that looks like the following:
$row = array("stock", "activity", "location", "departure_time", "arrival_time", "train_id", "notes");
The difficulty comes as not every line fits into this format - some lines have every "column", whereas others have one or more columns missing and other lines consist of free text.
I am by no means a text processing expert, but I cannot seem to find a solution to this problem. I'm not after a complete solution, just some pointers would be gratefully received!
Update: Just for clarification, I'm not interested in the free text rows. The data they contain is not important for what I am trying to accomplish.
I found what was causing me grief in solving this. I'm loading the Word document using a tool called "antiword". Antiword seems to strip special characters such as tabs. However, I found that by passing the "-w 0" switch, these characters are preserved, and parsing the diagrams using simple regular expressions became trivial. Many thanks to @Iserni for taking the time to help me, nonetheless.
I'll refine this answer more as soon as more data comes in, but in the meantime I'd go with what amounts to a state machine.
You read the text one line after the other. Initially you are in the "WAITING FOR DIAGRAM" state:
define('ALERT', 500); // assumed threshold: lines allowed in one state before complaining
$status = array(
    'fp'      => $fp,
    'manager' => 'waitForDiagram',
);
$chunk  = 0;
$lineno = 0;
$manage = $status['manager'];
while (($line = fgets($fp, 1024)) !== false) { // is 1 Kb enough? Maybe not.
    $lineno++;
    $manage($status, $line);
    if ($status['manager'] != $manage) {
        // The state changed: reset the counter and check the new state exists.
        $chunk = 0;
        if (!function_exists($status['manager'])) {
            trigger_error("{$manage}({$line}) -> {$status['manager']}: no such state", E_USER_ERROR);
        }
        $manage = $status['manager'];
    }
    if (++$chunk > ALERT) {
        trigger_error("Stuck in state {$manage} since {$chunk} lines!", E_USER_ERROR);
    }
}
Then you define a function for each state, beginning with the first:
function waitForDiagram(&$status, $line) {
    // Part common to most such state functions:
    $tokens = tokenise($line);
    // Quickly check whether anything needs doing.
    if (!isset($tokens[0]) || !in_array($tokens[0], [ "LTP" ])) {
        // if not, return.
        return;
    }
    $status['diagram'] = array(
        'title'    => $tokens[0],
        'whatever' => $tokens[1],
        'comments' => '',
    );
    ...
    // In this case, all information is only in one line, so we can
    // continue to the next state, which in this case is always waitForOnAndGetComments.
    $status['manager'] = 'waitForOnAndGetComments';
}
function waitForOnAndGetComments(&$status, $line) {
    $tokens = tokenise($line);
    // If we get "On" it's the line, otherwise it is still the comment
    if (!isset($tokens[0]) || !in_array($tokens[0], [ "On" ])) {
        $status['diagram']['comments'] .= $line;
        return;
    }
    // Otherwise we have On 00.22 PASS Barnham 00+34
    // and always a next line.
    $offTok = tokenise(fgets($status['fp'], 1024));
    if ($offTok[0] != "Off") {
        trigger_error("Found ON, but next row is not OFF, what gives?", E_USER_ERROR);
    }
    $status['diagram']['on'] = array(
        'time' => $tokens[1],
        ...
    );
    ...
    $status['diagram']['off'] = array(
        'time' => $offTok[1],
        'line' => $offTok[2],
        ...
    );
    $status['manager'] = 'waitForSomethingElse';
}
...and so on...
One important thing is how you tokenise the lines. If you have a clear delimiter (such as a tab) and can use explode, all well and good. Otherwise you can try preg_split('#\\s{2,}#', $line), using sequences of two or more whitespace characters to separate "cells" in each "row".
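The tokenise() function called by the state functions above is never shown, so here is a minimal sketch of what it might look like. It is only an assumption about its behaviour: it splits on tabs when the line contains any, falls back to runs of two or more whitespace characters otherwise, and drops empty cells so that $tokens[0] is always the first visible word.

```php
<?php
// Hypothetical helper (not part of the original answer): split one diagram
// line into trimmed, non-empty "cells".
function tokenise($line)
{
    // fgets() returns false at end of file; treat that as an empty line.
    $line = rtrim((string) $line, "\r\n");

    if (strpos($line, "\t") !== false) {
        // Tabs survived the extraction: use them as the column delimiter.
        $cells = explode("\t", $line);
    } else {
        // Fallback: two or more whitespace characters separate the cells,
        // so single spaces inside a cell ("Lham O Sh") are kept together.
        $cells = preg_split('#\s{2,}#', trim($line));
    }

    // Trim every cell, drop the empty ones, and reindex from zero.
    return array_values(array_filter(array_map('trim', $cells), 'strlen'));
}
```

Dropping empty cells keeps the state functions simple, but it also discards column positions. If you need to know which column is missing on a given row, keep the empty cells from the explode() branch instead of filtering them out.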