Byte Order Mark Madness

Last modified 2024/12/07 11:48

We had an issue today where a CSV file was being read via. SplFileInfo.

The array keys were being inferred from the headers, and the values transformed based on rules per column.

The issue was that the column was not found:

// data read from CSV
$data = [ 
   'SALUTATION' => '1',
];
var_dump($data['SALUTATION']); // UNDEFINED INDEX ERROR

Very confusing!

It turns out that the CSV file was prefixed with a byte order mark, which was being incorrectly parsed into the first header (SALUTATION).

As we found out using:

var_dump(array_map(function (string $key) {
    return array_map(function (string $char) {
        return ord($char);
    }, str_split($key));
}, array_keys($result)));

Resulting in:

  0 => array:13 [
    0 => 239                                                                                                      
    1 => 187
    2 => 191
    3 => 83 
    4 => 65       
    5 => 76 
    6 => 85 
    7 => 84     
    8 => 65 
    9 => 84      
    10 => 73
    11 => 79    
    12 => 78     
  ]

Where 83 .. 78 are the chars for SALUTATION and 187, 191 and 83 are the extra chars.

We googled these numbers and discovered that they are a byte-order-mark, and removing them fixes the issue.

Not sure why this BOM causes an issue (other files also have BOM but are seemingly correctly parsed).