UTF-8 BOM and PHP

Discussion in 'Programming/Scripts' started by Brian_A, Aug 31, 2011.

  1. Brian_A

    Brian_A New Member

    We had a site that had to be internationalized and made available in several European languages, so we used UTF-8 encoding throughout. This came with its share of headaches: all sorts of display issues in the browsers, mainly extra blank lines that were not obvious from the HTML source, and of course IE (7, 8 & 9) going into quirks mode. The problem turned out to be that we had BOMs. http://en.wikipedia.org/wiki/Byte_order_mark
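
    For anyone who has not met a BOM before, here is a minimal sketch of what it looks like at the byte level. The variable names are only for illustration, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// The UTF-8 BOM is the code point U+FEFF encoded as three bytes.
$bom = "\xEF\xBB\xBF";

// PHP 7+ accepts a Unicode escape for the same character.
$same = ($bom === "\u{FEFF}");

$len = strlen($bom);   // byte length of the BOM
$hex = bin2hex($bom);  // the same bytes written as hexadecimal

echo "same as U+FEFF: ", var_export($same, true), "\n";
echo "length: $len, bytes: $hex\n";
```

    The three bytes EF BB BF are invisible in most editors, which is why the problem is so hard to spot in the source.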

    So, to help others who go down this route, here is what we found.

    1. If any UTF-8 encoded file that goes into generating your page contains a BOM, the BOM ends up in the resulting file or output stream, because PHP does not strip it. It makes no difference whether you include another PHP file, read in a text file, concatenate a file with other text, echo its contents, or copy or read in an HTML template: if any of these files contains a BOM, the BOM will be part of the final result. The JavaScript and CSS files the page references are worth checking too.
    2. None of the browsers we tested (Firefox, Chrome, and IE) handled the BOMs in our pages correctly.
    3. If you use MS Windows Notepad to save a UTF-8 file, it will automatically add a BOM. So never use Notepad for this. Of course, the browser with the biggest problems with the BOM is IE.
    4. We use NetBeans as an IDE. If a file contains a BOM and you edit and save it with NetBeans, it will still contain the BOM. If you copy/paste a file in NetBeans that has a BOM, the copy will also have a BOM. If you start a new UTF-8 file in NetBeans, it will not have a BOM.
    5. So how did we identify this problem? The browser will identify the encoding from the meta tag if it is present: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>. We loaded the page in Firefox, then told it to change the encoding to ISO-8859-1. The BOM then shows up before the <!DOCTYPE HTML> at the beginning of the file as three strange characters (ï»¿).
    6. So how did we find out which of our files had BOMs? We used the following code. We didn’t write it; we found it on another site but did not make a note of the author. If the original author sees this post, please feel free to add your credit, or add a post and I will do it for you.

    PHP:
    <?php
    // Root folder to scan. You can also try one of these:
    // $HOME = $_SERVER["DOCUMENT_ROOT"];
    // $HOME = dirname(__FILE__);
    $HOME = $_SERVER["DOCUMENT_ROOT"].'/V2';
    // Is this a Windows host? If it is, change this line to $WIN = 1;
    $WIN = 0;

    // That's all I need
    ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>UTF8 BOM FINDER</title>
    <style>
    body { font-size: 10px; font-family: Arial, Helvetica, sans-serif; background: #FFF; color: #000; }
    .FOUND { color: #F30; font-size: 14px; font-weight: bold; }
    </style>
    </head>
    <body>
    <?php

    $BOMBED = array();
    RecursiveFolder($HOME);
    echo '<h2>These files have UTF8 BOM:</h2><p class="FOUND">';
    // Print each path relative to $HOME (the original hard-coded an offset of 30).
    foreach ($BOMBED as $utf) { echo substr($utf, strlen($HOME) + 1) . "<br />\n"; }
    echo '</p>';

    // Recursive finder: collects every file under $sHOME that starts with a BOM
    function RecursiveFolder($sHOME) {
      global $BOMBED, $WIN;

      // Directory separator for the host OS
      $win32 = ($WIN == 1) ? "\\" : "/";

      $folder = dir($sHOME);

      $foundfolders = array();
      // Compare against false so that a file named "0" does not end the loop
      while (false !== ($file = $folder->read())) {
        if ($file != "." and $file != "..") {
          if (filetype($sHOME . $win32 . $file) == "dir") {
            $foundfolders[count($foundfolders)] = $sHOME . $win32 . $file;
          } else {
            $BOM = SearchBOM(file_get_contents($sHOME . $win32 . $file));
            if ($BOM) $BOMBED[count($BOMBED)] = $sHOME . $win32 . $file;
          }
        }
      }
      $folder->close();

      // Descend into the subfolders found above
      if (count($foundfolders) > 0) {
        foreach ($foundfolders as $folder) {
          RecursiveFolder($folder);
        }
      }
    }

    // Searching for BOM in files: true if the string starts with EF BB BF
    function SearchBOM($string) {
        if (substr($string, 0, 3) == pack("CCC", 0xef, 0xbb, 0xbf)) return true;
        return false;
    }
    ?>
    </body>
    </html>
    7. Now to remove the offending BOMs. We didn’t have that many infected files, so we did it by hand. We used a text editor called BabelPad that lets you save a file with or without the BOM. http://www.babelstone.co.uk/Software/BabelPad.html
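
    If you have more files than you want to fix by hand, the removal could also be scripted. The following is only a sketch, not what we actually did: strip_bom() and strip_bom_from_file() are hypothetical helper names, and the type declarations assume PHP 7 or later. Back up your files before running anything like this.

```php
<?php
// Hypothetical helpers, not part of the original post.
const UTF8_BOM = "\xEF\xBB\xBF";

/**
 * Return the string without a leading UTF-8 BOM, if it has one.
 */
function strip_bom(string $contents): string {
    if (strncmp($contents, UTF8_BOM, 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

/**
 * Rewrite a file without its leading BOM.
 * Returns true if the file was changed, false if it had no BOM
 * or could not be read or written.
 */
function strip_bom_from_file(string $path): bool {
    $contents = file_get_contents($path);
    if ($contents === false) {
        return false;
    }
    $clean = strip_bom($contents);
    if ($clean === $contents) {
        return false; // nothing to do
    }
    return file_put_contents($path, $clean) !== false;
}
```

    One way to use it would be to loop over the $BOMBED array that the finder script fills in and call strip_bom_from_file() on each path.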

    Having removed all the BOMs everything on the site compiles and runs without problem.
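
    To double-check that a page's output is clean, one option is to capture what a script prints and search it for the BOM bytes. This is a rough sketch under a few assumptions: output_has_bom() is a hypothetical name, the script being checked must not call exit, and PHP 7 or later is assumed for the type declarations.

```php
<?php
// Hypothetical helper, not part of the original post.

/**
 * Run a PHP script, capture everything it prints, and report whether
 * the captured output contains a UTF-8 BOM anywhere.
 */
function output_has_bom(string $script): bool {
    ob_start();                 // buffer the output instead of sending it
    include $script;            // bytes outside <?php tags are output as-is
    $output = ob_get_clean();   // fetch the buffer and stop buffering
    return strpos($output, "\xEF\xBB\xBF") !== false;
}
```

    The search covers the whole output rather than just the first three bytes, because a BOM from an included template can land in the middle of the page.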

    Hope we can save you the time it took us to identify and solve this problem.
     
    Last edited by a moderator: Sep 1, 2011
  2. Ben

    Ben ISPConfig Developer

    Just wrapped your code in php vbb tags :)

    Besides this, good work!
     
