A working RTF to HTML converter in PHP

In a recent project, I desperately needed an RTF to HTML converter written in PHP. Googling turned up a few candidates, but I could not get any of them to work properly. One of them also called passthru() to run an RTF2HTML executable, which I wanted to avoid: I was looking for a converter written purely in PHP.

Since I couldn’t find anything ready-made, I sat down and coded one up myself. It’s short, and it works, implementing the subset of RTF tags that you’ll need in HTML and ignoring the rest. As it turns out, the RTF format isn’t that complicated when you really look at it, but it isn’t something you code a parser for in 15 minutes either.

How to use it

Include the file rtf.php somewhere in your project. Then do this:

$reader = new RtfReader();
$rtf = file_get_contents("test.rtf"); // or use a string
$reader->Parse($rtf);

If you’d like to see what the parser read, then call this:

$reader->root->dump();

To convert the parser’s parse tree to HTML, call this:

$formatter = new RtfHtml();
echo $formatter->Format($reader->root);

Update 3 Sep ’14:

  • Fixed bug: underlining would start but never end. Now it does.
  • Feature request: images are now filtered out of the output.
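
The underline fix comes down to how on/off control words map onto formatting state. The following is a minimal sketch of that idea, not the code from rtf.php: applyToggle() and the state array are hypothetical names used only for illustration. It assumes, as the reader below does, that a control word without a parameter arrives with the parameter 1.

```php
<?php
// Hypothetical helper (not part of rtf.php): apply one character-formatting
// control word to a state array and return the updated state.
function applyToggle(array $state, $word, $parameter = 1)
{
    switch ($word) {
        case 'b':
            // \b or \b1 turns bold on; \b0 turns it off.
            $state['bold'] = ($parameter != 0);
            break;
        case 'i':
            $state['italic'] = ($parameter != 0);
            break;
        case 'ul':
            // \ul or \ul1 turns underlining on; \ul0 turns it off.
            $state['underline'] = ($parameter != 0);
            break;
        case 'ulnone':
            // \ulnone always turns underlining off, whatever the parameter.
            $state['underline'] = false;
            break;
    }
    return $state;
}

// Usage: \ul \b ... \ulnone \b0 should leave both flags off again.
$state = array('bold' => false, 'italic' => false, 'underline' => false);
$words = array(array('ul', 1), array('b', 1), array('ulnone', 1), array('b', 0));
foreach ($words as $cw) {
    $state = applyToggle($state, $cw[0], $cw[1]);
}
var_dump($state['underline']); // bool(false)
var_dump($state['bold']);      // bool(false)
```

Keeping a single underline flag that both \ul and \ulnone write to means a later \ul can switch underlining back on.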

The code

  /**
   * RTF parser/formatter
   *
   * This code reads RTF files and formats the RTF data to HTML.
   *
   * PHP version 5
   *
   * @author     Alexander van Oostenrijk
   * @copyright  2014 Alexander van Oostenrijk
   * @license    GNU
   * @version    1
   * @link       http://www.websofia.com
   * 
   * Sample of use:
   * 
   * $reader = new RtfReader();
   * $rtf = file_get_contents("itc.rtf"); // or use a string
   * $reader->Parse($rtf);
   * //$reader->root->dump(); // to see what the reader read
   * $formatter = new RtfHtml();
   * echo $formatter->Format($reader->root);   
   */
 
  class RtfElement
  {
    protected function Indent($level)
    {
      for($i = 0; $i < $level * 2; $i++) echo "&nbsp;";
    }
  }
 
  class RtfGroup extends RtfElement
  {
    public $parent;
    public $children;
 
    public function __construct()
    {
      $this->parent = null;
      $this->children = array();
    }
 
    public function GetType()
    {
      // No children?
      if(sizeof($this->children) == 0) return null;
      // First child not a control word?
      $child = $this->children[0];
      if(get_class($child) != "RtfControlWord") return null;
      return $child->word;
    }    
 
    public function IsDestination()
    {
      // No children?
      if(sizeof($this->children) == 0) return null;
      // First child not a control symbol?
      $child = $this->children[0];
      if(get_class($child) != "RtfControlSymbol") return null;
      return $child->symbol == '*';
    }
 
    public function dump($level = 0)
    {
      echo "<div>";
      $this->Indent($level);
      echo "{";
      echo "</div>";
 
      foreach($this->children as $child)
      {
        if(get_class($child) == "RtfGroup")
        {
          if ($child->GetType() == "fonttbl") continue;
          if ($child->GetType() == "colortbl") continue;
          if ($child->GetType() == "stylesheet") continue;
          if ($child->GetType() == "info") continue;
          // Skip any pictures:
          if (substr($child->GetType(), 0, 4) == "pict") continue;
          if ($child->IsDestination()) continue;
        }
        $child->dump($level + 2);
      }
 
      echo "<div>";
      $this->Indent($level);
      echo "}";
      echo "</div>";
    }
  }
 
  class RtfControlWord extends RtfElement
  {
    public $word;
    public $parameter;
 
    public function dump($level)
    {
      echo "<div style='color:green'>";
      $this->Indent($level);
      echo "WORD {$this->word} ({$this->parameter})";
      echo "</div>";
    }
  }
 
  class RtfControlSymbol extends RtfElement
  {
    public $symbol;
    public $parameter = 0;
 
    public function dump($level)
    {
      echo "<div style='color:blue'>";
      $this->Indent($level);
      echo "SYMBOL {$this->symbol} ({$this->parameter})";
      echo "</div>";
    }    
  }
 
  class RtfText extends RtfElement
  {
    public $text;
 
    public function dump($level)
    {
      echo "<div style='color:red'>";
      $this->Indent($level);
      echo "TEXT {$this->text}";
      echo "</div>";
    }    
  }
 
  class RtfReader
  {
    public $root = null;
 
    protected function GetChar()
    {
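      // Yield an empty string at end of input rather than reading past the buffer.
      $this->char = ($this->pos < $this->len) ? $this->rtf[$this->pos] : '';
      $this->pos++;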
    }
 
    protected function ParseStartGroup()
    {
      // Store state of document on stack.
      $group = new RtfGroup();
      if($this->group != null) $group->parent = $this->group;
      if($this->root == null)
      {
        $this->group = $group;
        $this->root = $group;
      }
      else
      {
        array_push($this->group->children, $group);
        $this->group = $group;
      }
    }
 
    protected function is_letter()
    {
      if(ord($this->char) >= 65 && ord($this->char) <= 90) return TRUE;
      if(ord($this->char) >= 97 && ord($this->char) <= 122) return TRUE;
      return FALSE;
    }
 
    protected function is_digit()
    {
      if(ord($this->char) >= 48 && ord($this->char) <= 57) return TRUE;
      return FALSE;
    }
 
    protected function ParseEndGroup()
    {
      // Retrieve state of document from stack.
      $this->group = $this->group->parent;
    }
 
    protected function ParseControlWord()
    {
      $this->GetChar();
      $word = "";
      while($this->is_letter())
      {
        $word .= $this->char;
        $this->GetChar();
      }
 
      // Read parameter (if any) consisting of digits.
      // Parameter may be negative.
      $parameter = null;
      $negative = false;
      if($this->char == '-') 
      {
        $this->GetChar();
        $negative = true;
      }
      while($this->is_digit())
      {
        if($parameter === null) $parameter = 0;
        $parameter = $parameter * 10 + (int)$this->char;
        $this->GetChar();
      }
      if($parameter === null) $parameter = 1;
      if($negative) $parameter = -$parameter;
 
      // If this is \u, then the parameter will be followed by 
      // a character.
      if($word == "u") 
      {
      }
      // If the current character is a space, then
      // it is a delimiter. It is consumed.
      // If it's not a space, then it's part of the next
      // item in the text, so put the character back.
      else
      {
        if($this->char != ' ') $this->pos--; 
      }
 
      $rtfword = new RtfControlWord();
      $rtfword->word = $word;
      $rtfword->parameter = $parameter;
      array_push($this->group->children, $rtfword);
    }
 
    protected function ParseControlSymbol()
    {
      // Read symbol (one character only).
      $this->GetChar();
      $symbol = $this->char;
 
      // Symbols ordinarily have no parameter. However, 
      // if this is \', then it is followed by a 2-digit hex-code:
      $parameter = 0;
      if($symbol == '\'')
      {
        $this->GetChar(); 
        $parameter = $this->char;
        $this->GetChar(); 
        $parameter = hexdec($parameter . $this->char);
      }
 
      $rtfsymbol = new RtfControlSymbol();
      $rtfsymbol->symbol = $symbol;
      $rtfsymbol->parameter = $parameter;
      array_push($this->group->children, $rtfsymbol);
    }
 
    protected function ParseControl()
    {
      // Beginning of an RTF control word or control symbol.
      // Look ahead by one character to see if it starts with
      // a letter (control word) or another symbol (control symbol):
      $this->GetChar();
      $this->pos--;
      if($this->is_letter()) 
        $this->ParseControlWord();
      else
        $this->ParseControlSymbol();
    }
 
    protected function ParseText()
    {
      // Parse plain text up to backslash or brace,
      // unless escaped.
      $text = "";
 
      do
      {
        $terminate = false;
        $escape = false;
 
        // Is this an escape?
        if($this->char == '\\')
        {
          // Perform lookahead to see if this
          // is really an escape sequence.
          $this->GetChar();
          switch($this->char)
          {
            case '\\':
            case '{':
            case '}':
              // Escaped literal: $this->char now holds it, and the
              // shared append below adds it to the text exactly once.
              break;
            default:
              // Not an escape. Roll back.
              $this->pos = $this->pos - 2;
              $terminate = true;
              break;
          }
        }
        else if($this->char == '{' || $this->char == '}')
        {
          $this->pos--;
          $terminate = true;
        }
 
        if(!$terminate && !$escape)
        {
          $text .= $this->char;
          $this->GetChar();
        }
      }
      while(!$terminate && $this->pos < $this->len);
 
      $rtftext = new RtfText();
      $rtftext->text = $text;
      array_push($this->group->children, $rtftext);
    }
 
    public function Parse($rtf)
    {
      $this->rtf = $rtf;
      $this->pos = 0;
      $this->len = strlen($this->rtf);
      $this->group = null;
      $this->root = null;
 
      while($this->pos < $this->len)
      {
        // Read next character:
        $this->GetChar();
 
        // Ignore \r and \n
        if($this->char == "\n" || $this->char == "\r") continue;
 
        // What type of character is this?
        switch($this->char)
        {
          case '{':
            $this->ParseStartGroup();
            break;
          case '}':
            $this->ParseEndGroup();
            break;
          case '\\':
            $this->ParseControl();
            break;
          default:
            $this->ParseText();
            break;
        }
      }
    }
  }
 
  class RtfState
  {
    public function __construct()
    {
      $this->Reset();
    }
 
    public function Reset()
    {
      $this->bold = false;
      $this->italic = false;
      $this->underline = false;
      $this->end_underline = false;
      $this->strike = false;
      $this->hidden = false;
      $this->fontsize = 0;
    }
  }
 
  class RtfHtml
  {
    public function Format($root)
    {
      $this->output = "";
      // Create a stack of states:
      $this->states = array();
      // Put an initial standard state onto the stack:
      $this->state = new RtfState();
      array_push($this->states, $this->state);
      $this->FormatGroup($root);
      return $this->output;
    }
 
    protected function FormatGroup($group)
    {
      // Can we ignore this group?
      if ($group->GetType() == "fonttbl") return;
      if ($group->GetType() == "colortbl") return;
      if ($group->GetType() == "stylesheet") return;
      if ($group->GetType() == "info") return;
      // Skip any pictures:
      if (substr($group->GetType(), 0, 4) == "pict") return;
      if ($group->IsDestination()) return;
 
      // Push a new state onto the stack:
      $this->state = clone $this->state;
      array_push($this->states, $this->state);
 
      foreach($group->children as $child)
      {
        if(get_class($child) == "RtfGroup") $this->FormatGroup($child);
        if(get_class($child) == "RtfControlWord") $this->FormatControlWord($child);
        if(get_class($child) == "RtfControlSymbol") $this->FormatControlSymbol($child);
        if(get_class($child) == "RtfText") $this->FormatText($child);
      }
 
      // Pop state from stack.
      array_pop($this->states);
      $this->state = $this->states[sizeof($this->states)-1];
    }
 
    protected function FormatControlWord($word)
    {
      if($word->word == "plain") $this->state->Reset();
      if($word->word == "b") $this->state->bold = $word->parameter;
      if($word->word == "i") $this->state->italic = $word->parameter;
      if($word->word == "ul") $this->state->underline = $word->parameter;
      // \ulnone ends underlining, so a later \ul can switch it back on.
      if($word->word == "ulnone") $this->state->underline = false;
      if($word->word == "strike") $this->state->strike = $word->parameter;
      if($word->word == "v") $this->state->hidden = $word->parameter;
      if($word->word == "fs") $this->state->fontsize = ceil(($word->parameter / 24) * 16);
 
      if($word->word == "par") $this->output .= "<p>";
 
      // Characters:
      if($word->word == "lquote") $this->output .= "&lsquo;";
      if($word->word == "rquote") $this->output .= "&rsquo;";
      if($word->word == "ldblquote") $this->output .= "&ldquo;";
      if($word->word == "rdblquote") $this->output .= "&rdquo;";
      if($word->word == "emdash") $this->output .= "&mdash;";
      if($word->word == "endash") $this->output .= "&ndash;";
      if($word->word == "bullet") $this->output .= "&bull;";
      if($word->word == "u") $this->output .= "&loz;";
    }
 
    protected function BeginState()
    {
      $span = "";
      if($this->state->bold) $span .= "font-weight:bold;";
      if($this->state->italic) $span .= "font-style:italic;";
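      // Combine decorations so underline and strike-through can coexist.
      $decorations = array();
      if($this->state->underline && !$this->state->end_underline) $decorations[] = "underline";
      if($this->state->strike) $decorations[] = "line-through";
      if(count($decorations) > 0) $span .= "text-decoration:" . implode(" ", $decorations) . ";";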
      if($this->state->hidden) $span .= "display:none;";
      if($this->state->fontsize != 0) $span .= "font-size: {$this->state->fontsize}px;";
      $this->output .= "<span style='{$span}'>";
    }
 
    protected function EndState()
    {
      $this->output .= "</span>";
    }
 
    protected function FormatControlSymbol($symbol)
    {
      if($symbol->symbol == '\'')
      {
        $this->BeginState();
        $this->output .= htmlentities(chr($symbol->parameter), ENT_QUOTES, 'ISO-8859-1');
        $this->EndState();
      }
    }
 
    protected function FormatText($text)
    {
      $this->BeginState();
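      // Escape HTML metacharacters so document text cannot inject markup.
      $this->output .= htmlspecialchars($text->text, ENT_QUOTES, 'ISO-8859-1');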
      $this->EndState();
    }
  }
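
The formatter above emits &loz; as a placeholder for every \u control word. A minimal sketch of one way to output the actual character instead is shown here. rtfUnicodeToEntity() is a hypothetical helper, not part of rtf.php. It assumes the \u parameter is a signed 16-bit value, so code points above 32767 arrive as negative numbers. The fallback character(s) that follow \u in the RTF (their count is set by \uc) would still need to be skipped, which this sketch leaves out.

```php
<?php
// Hypothetical helper: turn the numeric parameter of a \u control word
// into an HTML numeric character reference.
function rtfUnicodeToEntity($parameter)
{
    $codepoint = (int) $parameter;
    // Negative values wrap around: add 65536 to recover the code point.
    if ($codepoint < 0) {
        $codepoint += 65536;
    }
    return '&#' . $codepoint . ';';
}

echo rtfUnicodeToEntity(8217), "\n";  // &#8217;
echo rtfUnicodeToEntity(-3913), "\n"; // &#61623;
```

In FormatControlWord, the "u" branch could then append rtfUnicodeToEntity($word->parameter) rather than the placeholder.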

21 Comments

  1. Eugene Valeriano says:

    You Save my Life dude.. Thanks!

  2. Anne says:

    Very useful script, thanks! Just one question. This code doesn’t filter embedded images, so the output may contain large text strings. Might it be useful to add an image filter or something? :)

    • alex says:

      That’s a good idea. The principle of this script is simplicity, so I would want to filter out images rather than including them because deciphering them would be expensive for the server. I’ll add an image filter when I update the code.

    • alex says:

      I’ve updated the code. Images are now filtered out.

  3. Pratiksha says:

    Nice Article ..

    It helped me a lot.

    Thank you.

  4. Klaus Delacroix says:

    Hi, I found your module, which really helped me a lot.

    I have found 2 problems, though:

    1. The underline of underlined words is not terminated after the word, but is in effect until the end of the text.

    2. Colours are not supported

    Any idea on how I could teach the module to behave correctly also in these cases?

    • alex says:

      Actually looking at the code I don’t see why underlining doesn’t get turned off. You could make a simple RTF test file with one word underlined, and see what the dump() method says. There’s probably a typo somewhere.

      For colours I would have to go back and look at the RTF spec again. For my project needs I didn’t require underlining or colours, so this was never tested.

      • alex says:

        Actually, Sebastien posted below about the same underline issue. It turns out that RTF’s \ul tag gets closed with \ulnone, not \ul0. I thought it would work like the bold \b tag, which is closed with \b0, but no. This would have to be changed in the code.

    • alex says:

      I’ve updated the code. Underlining now terminates correctly.
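
On the colour question: RTF keeps its colours in a {\colortbl ...} group, and \cfN selects entry N as the text colour. The sketch below shows one possible way to turn the raw table into CSS colours. parseColorTable() is a hypothetical helper that works on the raw group text, whereas the formatter above currently skips the colortbl group entirely, so wiring it in is left open.

```php
<?php
// Hypothetical helper: convert the raw text of a {\colortbl ...} group
// into a list of CSS hex colours, indexed the way \cfN refers to them.
function parseColorTable($colortbl)
{
    // Strip the surrounding braces and the \colortbl keyword.
    $body = preg_replace('/^\{\\\\colortbl\s*|\}\s*$/', '', trim($colortbl));

    // Every entry is terminated by ';', so the final piece is always empty.
    $entries = explode(';', $body);
    array_pop($entries);

    $colors = array();
    foreach ($entries as $entry) {
        if (preg_match('/\\\\red(\d+)/', $entry, $r)
            && preg_match('/\\\\green(\d+)/', $entry, $g)
            && preg_match('/\\\\blue(\d+)/', $entry, $b)) {
            $colors[] = sprintf('#%02x%02x%02x', (int) $r[1], (int) $g[1], (int) $b[1]);
        } else {
            // An empty entry means "automatic": leave the colour to the browser.
            $colors[] = null;
        }
    }
    return $colors;
}

$table = parseColorTable('{\colortbl ;\red0\green0\blue0;\red255\green0\blue0;}');
echo $table[1], "\n"; // #000000
echo $table[2], "\n"; // #ff0000
```

A colour field on the state object could then be set from $table[N] when \cfN is seen, and written out as "color:...;" next to the other span styles.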

  5. Sergio Gabriel says:

    Thanks a lot! I have a little problem: the class doesn’t convert tab controls to “&nbsp;”. How can I do this?

  6. Sergio Gabriel says:

    In the FormatControlWord method of the RtfHtml class, I added this: if($word->word == "tab") $this->output .= "&nbsp;"; and it works!

    • alex says:

      Nice work. In order for this code to be more robust, the HTML formatter should be made into a separate class, so you can more easily extend it as you have done. Also, you could have plaintext and XML formatters. But oh well, this was meant to be simple and solve only the RTF to HTML problem.
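
One way to make the formatter easier to extend, and to allow a plain-text variant, is to move the control-word output into a lookup table that subclasses supply. This is only a sketch of that design, not how rtf.php is organised. ControlWordMap, HtmlControlWords and PlainTextControlWords are hypothetical names, and only three control words are shown.

```php
<?php
// Hypothetical base class: subclasses only supply a lookup table
// from control word to output text.
abstract class ControlWordMap
{
    abstract protected function table();

    // Return the output for a control word, or '' if it is not handled.
    public function translate($word)
    {
        $table = $this->table();
        return isset($table[$word]) ? $table[$word] : '';
    }
}

class HtmlControlWords extends ControlWordMap
{
    protected function table()
    {
        return array(
            'par'    => '<p>',
            'tab'    => '&nbsp;',
            'bullet' => '&bull;',
        );
    }
}

class PlainTextControlWords extends ControlWordMap
{
    protected function table()
    {
        return array(
            'par'    => "\n",
            'tab'    => "\t",
            'bullet' => '*',
        );
    }
}

$html = new HtmlControlWords();
$plain = new PlainTextControlWords();
echo $html->translate('tab'), "\n";     // &nbsp;
echo $plain->translate('bullet'), "\n"; // *
```

Supporting another control word then means adding one table entry instead of another if statement.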

  7. Chris says:

    This is really A W E S O M E !
    Thank you! :)

  8. Sebastien says:

    I’m trying out this script for a project and everything is underlined. Any clue why?

    RTF:

    {\rtf1\ansi\ansicpg1252\deff0\nouicompat{\fonttbl{\f0\fnil\fcharset0 Tahoma;}{\f1\fnil Tahoma;}{\f2\fnil\fcharset2 Symbol;}}
    {\colortbl ;\red0\green0\blue0;}
    {\*\generator Riched20 6.3.9600}\viewkind4\uc1
    \pard\cf1\ul\b\f0\fs22\lang3084 Maintenance des transporteurs\ulnone\b0\par
    \fs18\par
    Ajouter une nouvel onglet ‘Param\'e8tres EDI’ dans laquelle on retrouvera :\par
    \par
    \par

    \pard{\pntext\f2\'B7\tab}{\*\pn\pnlvlblt\pnf2\pnindent0{\pntxtb\'B7}}\fi-200\li200 Dans le haut : \f1\par

    \pard{\pntext\f2\'B7\tab}{\*\pn\pnlvlblt\pnf2\pnindent0{\pntxtb\'B7}}\fi-200\li520\f0 Un titre ‘EDI – Achats’\f1\par
    {\pntext\f2\'B7\tab}\f0 Nom du transporteur \'e0 exporter\f1\par
    {\pntext\f2\'B7\tab}\f0 No de compte d\'e9faut\f1\par

    \pard\par

    \pard{\pntext\f2\'B7\tab}{\*\pn\pnlvlblt\pnf2\pnindent0{\pntxtb\'B7}}\fi-200\li200\f0 Une grille\f1\par

    \pard{\pntext\f2\'B7\tab}{\*\pn\pnlvlblt\pnf2\pnindent0{\pntxtb\'B7}}\fi-200\li520\f0 No division\f1\par
    {\pntext\f2\'B7\tab}\f0 Nom de la division (en affichage)\f1\par
    {\pntext\f2\'B7\tab}\f0 No de de compte \'e0 utiliser pour cette division\f1\par

    \pard\par
    \ul\f0 Note\ulnone\par
    \par
    Le no de compte qui sera utilis\'e9 en priorit\'e9 sera :\par

    \pard\li320\par
    1) Selon la division de la commande\par
    2) Selon la division m\'e8re de la commande\par
    3) No de compte d\'e9faut\par

    \pard\li800\par

    \pard\ul\b\fs22 Module ‘Bon d’achats’\par
    \ulnone\b0\fs18\par
    Dans l’onglet ‘Ent\'eate’ / Sous onglet ‘Termes’, ajouter une section pour ‘EDI’ dans laquelle on pourra saisir un code du transporteur exig\'e9.\par
    \par
    \par
    \ul\b\fs22 Module ‘R\'e9quisition’\par
    \ulnone\b0\fs18\par
    Dans la fen\'eatre ‘Cr\'e9ation des bons d’achat’, onglet ‘Bon d’achats” / Section ‘Termes’ de la grille des BA’s \'e0 \'e9mettre, ajouter une colonne ‘Trp-EDI’ repr\'e9sentant le code du transporteur exig\'e9.\par
    \f1\par
    \ul\b\f0\fs22 Module ‘EDI’\par
    \ulnone\b0\fs18\par

    \pard{\pntext\f2\'B7\tab}{\*\pn\pnlvlblt\pnf2\pnindent0{\pntxtb\'B7}}\fi-200\li200 Cr\'e9er une nouvelle proc\'e9dure ‘EDI_SP_Export_BonAchEnt_Trp’ qui retournera les infos requises pour construire le segment ‘PO_TRANSPORTEUR’.\par
    {\pntext\f2\'B7\tab}Ajouter cette proc\'e9dure aux proc\'e9dures disponibles dans le catalogue EDI.\par

    \pard\par
    \ul\b Important\ulnone\b0 \par
    \par
    La d\'e9finition du segment / colonnes EDI et l’ajout du segment \'e0 l’enveloppe n’est pas incluse dans cet estim\'e9. Normalement, vous pouvea effectuer cette t\'e2che. Si toutefois, vous aviez besoin d’assistance, un formateur pourra intervenir en appliquant son temps contre votre banque d’heures.\par
    \par
    \f1\par
    \par
    }

    thx for your time.

  9. Anon says:

    Sensational: there’s only a couple of missing features that I feel would expand it to be able to convert rtf from the majority of basic rtf editors:

    Font (face and colour)
    Alignment (left, center, right)
    List items (bullets and numbering)
    Super / subscript

    Love it – keep up the good work!
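
Two of the features on this list map fairly directly onto CSS. The sketch below is a hedged starting point, not code from rtf.php: paragraphAlignCss() and verticalAlignCss() are hypothetical helpers, and they assume the RTF control words \ql, \qc, \qr, \qj for alignment and \super, \sub, \nosupersub for super/subscript.

```php
<?php
// Hypothetical helper: CSS for a paragraph alignment control word,
// or '' if the word is not an alignment word.
function paragraphAlignCss($word)
{
    $map = array(
        'ql' => 'left',
        'qc' => 'center',
        'qr' => 'right',
        'qj' => 'justify',
    );
    return isset($map[$word]) ? 'text-align:' . $map[$word] . ';' : '';
}

// Hypothetical helper: CSS for a super/subscript control word.
function verticalAlignCss($word)
{
    $map = array(
        'super'      => 'super',
        'sub'        => 'sub',
        'nosupersub' => 'baseline',
    );
    return isset($map[$word]) ? 'vertical-align:' . $map[$word] . ';' : '';
}

echo paragraphAlignCss('qc'), "\n";   // text-align:center;
echo verticalAlignCss('super'), "\n"; // vertical-align:super;
```

Note that text-align only takes effect on a block element, so it would have to go on a paragraph tag rather than on the spans the formatter emits today.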

  10. Serge says:

    Very nice Alexander !

    A question: when I have a text like “Poëzie” it converts to (shortened):
    Po
    ë
    zie

    Which gives as result : Po ë zie

    Any idea about a quick solution?

    Greetings & thanks Serge
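
The formatter wraps every text run and every \'xx escape in its own span, which is why a word like this ends up split into three pieces. One possible approach, sketched here under the assumption that output is first collected as a list of runs, is to merge neighbouring runs that share the same style before writing any spans. mergeRuns() and renderRuns() are hypothetical helpers, not part of rtf.php.

```php
<?php
// Hypothetical helper: merge adjacent runs that carry the same style string.
// Each run is array('style' => <css string>, 'html' => <escaped content>).
function mergeRuns(array $runs)
{
    $merged = array();
    foreach ($runs as $run) {
        $last = count($merged) - 1;
        if ($last >= 0 && $merged[$last]['style'] === $run['style']) {
            // Same formatting as the previous run: extend it.
            $merged[$last]['html'] .= $run['html'];
        } else {
            $merged[] = $run;
        }
    }
    return $merged;
}

// Hypothetical helper: write one span per merged run.
function renderRuns(array $runs)
{
    $out = '';
    foreach (mergeRuns($runs) as $run) {
        $out .= "<span style='" . $run['style'] . "'>" . $run['html'] . "</span>";
    }
    return $out;
}

echo renderRuns(array(
    array('style' => 'font-size: 12px;', 'html' => 'Po'),
    array('style' => 'font-size: 12px;', 'html' => '&euml;'),
    array('style' => 'font-size: 12px;', 'html' => 'zie'),
)), "\n";
// <span style='font-size: 12px;'>Po&euml;zie</span>
```

With the runs merged, the whole word sits inside a single span.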

  11. Nick says:

    Everything works, except there are no “br” or “p” line breaks.
    So my output is one run-on line of text.
    There’s lots of “span”s, but adding a “br” after each didn’t work, because it wrapped a pair of “span”s around the “bullet point” symbol. My adding a “br” after “/span” therefore also put a “br” after the bullet, which put the bullet point text on a new line.

    Provided web hosting services, and website development, in English and Spanish.

    Managed on-going work schedules.
    Is there supposed to be some css that goes with this?
    Is there something wrong with my RTF file?

  12. Doug says:

    Finally a PHP RTF to HTML class that works. Everything else I’ve found is horribly written, buggy, throws errors, doesn’t work, or all of the above.

    Thank you… a million times, thank you. I wasn’t looking forward to writing my own.

