I'm using a csv parser class (http://code.google.com/p/php-csv-parser/) to parse and extract data from csv files. The problem I'm encountering is that it only works for certain csv file types. (It seems that there is a csv type for Mac, for Ms-Dos, and for Windows.)
The code works if I use a csv file which was saved on a mac (in excel) using the csv - windows option. However, if I save a file on a windows machine simply as csv, that doesn't work. (You would think that that would be the same format as saving csv-windows on a mac.) It does work from a windows machine if I save it as a csv-MSDOS file. This seems a little ridiculous.
Is there a way to standardize these three file types so that my code can read any type of csv that is uploaded?
i'm thinking it would be something like this:
$standardizedCSV = preg_replace_all('/[^
]/', '
', $csvContent);
I know it has something to do with how each file type handles end of lines, but I'm a little put out trying to figure out those differences. If anybody has any advice, please let me know.
Thanks.
UPDATE: This is the relevant code from the csv parser I'm using which extracts data row by row:
$c = 0;
$d = $this->settings['delimiter'];
$e = $this->settings['escape'];
$l = $this->settings['length'];
$res = fopen($this->_filename, 'r');
while ($keys = fgetcsv($res, $l, $d, $e)) {
if ($c == 0) {
$this->headers = $keys;
} else {
array_push($this->rows, $keys);
}
$c ++;
}
I guess I need to understand how fgetcsv handles eol's, so that I can make sure that csv files of any format are handled in the same manner.
This seems to do the trick:
ini_set("auto_detect_line_endings", true);
The problem was with line endings, but I didn't need to create my own EOL parser. This runtime setting does it for me. See http://us.php.net/manual/en/filesystem.configuration.php#ini.auto-detect-line-endings.
I don't think the line endings is an issue. The thing about CSV is that it's only a "comma separated values" file and not standardized beyond that. So some systems separate the values using commas, some using semicolons (;
). I'm sure there are variations that use even other value separators.
Additionally, the escape character (most often backslash \
) can be different between CSV files, and some CSV files also use quotation marks around each value ("
).
A CSV file can use any variation between the above. For instance, I'm fairly certain that Microsoft Excel exports CSV files separating values using semicolons and without any quotation around the values.
I'm sure there are ways to auto-detect how to parse the CSV file, but the best way would be allowing the user to decide. That's what Excel does.
If you use CSV files, you have to agree on many details which are not properly standardized:
If you create a CSV reader, you can automatically handle different variations of line endings and field quoting. But the rest has to be known to the CSV parser beforehand.
The defacto standard is the CSV format produced by Excel. However, Excel uses different variations of the format: