Andre's Blog
Perfection is when there is nothing left to take away
BBCode parser

While most content management systems, such as blogs, allow users to edit HTML directly, more specialized ones, such as discussion forums, let users write in an alternative syntax that is easier to control and adapt to particular needs. BBCode, which stands for Bulletin Board Code, is one example of such an alternative.

BBCode tags are enclosed in square brackets instead of the angle brackets used in HTML, which makes it easy to mix BBCode and HTML because square brackets have no special meaning in the latter.

In other words, if a forum page is being rendered, posts can be HTML-encoded first to avoid any HTML security issues and then BBCode tags may be converted to HTML using regular expressions. Any mismatched BBCode tags are either ignored or forced to close to generate well-formed HTML.
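
The order of operations described above can be sketched as follows. The escapeHtml and renderPost functions are hypothetical helpers written for this illustration and are not part of the parser in this post; the sketch handles only the [b] tag.

```javascript
// Hypothetical helper (not part of the parser): HTML-encode the characters
// that are significant in HTML text and attribute values.
function escapeHtml(text) {
   var map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", "\"": "&quot;", "'": "&#39;" };
   return text.replace(/[&<>"']/g, function (ch) { return map[ch]; });
}

// Encode first, then convert a single BBCode tag pair to HTML.
// Square brackets are not touched by the encoding step.
function renderPost(raw) {
   var encoded = escapeHtml(raw);
   return encoded.replace(/\[b\](.+?)\[\/b\]/gi, "<b>$1</b>");
}

var html = renderPost("[b]1 < 2[/b] <script>alert(1)</script>");
// html is now: <b>1 &lt; 2</b> &lt;script&gt;alert(1)&lt;/script&gt;
```

Because the angle brackets typed by the user are already encoded when the BBCode step runs, the only HTML tags in the output are the ones the conversion itself produces.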

A typical approach to replacing BBCode tags is to use a set of regular expressions, one for each tag, similar to the one below, which replaces every matched pair of [b] and [/b] tags in the given string variable with the HTML equivalents:

text = text.replace( /\[b\](.+?)\[\/b\]/gi, "<b>$1</b>" );

This regular expression guarantees that an incomplete tag, one missing either its start or its end tag, is ignored. It will, however, still replace mismatched tags such as these:

[b][i]abc[/b]def[/i]

which produces malformed HTML.
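
Both behaviours can be reproduced with a few lines of JavaScript. The replaceSimpleTags function below is a hypothetical helper written for this illustration; it applies one regular expression per tag for [b], [i] and [u] only.

```javascript
// Hypothetical helper: apply one regular expression per tag
function replaceSimpleTags(text) {
   var tags = ["b", "i", "u"];

   for (var i = 0; i < tags.length; i++) {
      // builds the equivalent of /\[b\](.+?)\[\/b\]/gi for each tag name
      var re = new RegExp("\\[" + tags[i] + "\\](.+?)\\[\\/" + tags[i] + "\\]", "gi");
      text = text.replace(re, "<" + tags[i] + ">$1</" + tags[i] + ">");
   }

   return text;
}

var incomplete = replaceSimpleTags("[b]abc");
// incomplete is still "[b]abc": there is no end tag, so nothing matches

var mismatched = replaceSimpleTags("[b][i]abc[/b]def[/i]");
// mismatched is "<b><i>abc</b>def</i>": <i> is opened inside <b> but closed after </b>
```

Each expression sees only its own tag, so none of them can tell that the [i] pair overlaps the [b] pair.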

One way to control mismatched tags is to process start and end tags individually and maintain a parsing state, which can be used to detect malformed BBCode markup. The source code below is written in JavaScript and uses regular expressions to match three patterns and call the textToHtmlCB function every time a match is found. The function keeps all open tags in a stack and checks the top of the stack when closing tags.

// -----------------------------------------------------------------------
// Copyright (c) 2008, Stone Steps Inc. 
// All rights reserved
// http://www.stonesteps.ca/legal/bsd-license/
//
// This is a BBCode parser written in JavaScript. The parser is intended
// to demonstrate how to parse text containing BBCode tags in one pass 
// using regular expressions.
//
// The parser may be used as a backend component in ASP or in the browser, 
// after the text containing BBCode tags has been served to the client. 
//
// Following BBCode expressions are recognized:
//
// [b]bold[/b]
// [i]italic[/i]
// [u]underlined[/u]
// [s]strike-through[/s]
// [samp]sample[/samp]
//
// [color=red]red[/color]
// [color=#FF0000]red[/color]
// [size=1.2]1.2em[/size]
//
// [url]http://blogs.stonesteps.ca/showpost.asp?pid=33[/url]
// [url=http://blogs.stonesteps.ca/showpost.asp?pid=33][b]BBCode[/b] Parser[/url]
//
// [q=http://blogs.stonesteps.ca/showpost.asp?pid=33]inline quote[/q]
// [q]inline quote[/q]
// [blockquote=http://blogs.stonesteps.ca/showpost.asp?pid=33]block quote[/blockquote]
// [blockquote]block quote[/blockquote]
//
// [pre]formatted 
//     text[/pre]
// [code]if(a == b) 
//   print("done");[/code]
//
// text containing [noparse] [brackets][/noparse]
//
// -----------------------------------------------------------------------
var opentags;           // open tag stack
var crlf2br = true;     // convert CRLF to <br>?
var noparse = false;    // ignore BBCode tags?
var urlstart = -1;      // beginning of the URL if zero or greater (ignored if -1)

// acceptable BBCode tags, optionally prefixed with a slash
var tagname_re = /^\/?(?:b|i|u|pre|samp|code|colou?r|size|noparse|url|s|q|blockquote)$/;

// color names or hex color
var color_re = /^(?:black|silver|gray|white|maroon|red|purple|fuchsia|green|lime|olive|yellow|navy|blue|teal|aqua|#(?:[0-9a-f]{3})?[0-9a-f]{3})$/i;

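// numbers: digits with at most one decimal point, so the value never parses as NaN
var number_re = /^(?:[0-9]{1,4}(?:\.[0-9]{0,3})?|\.[0-9]{1,3})$/;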

// reserved, unreserved, escaped and alpha-numeric [RFC2396]
var uri_re = /^[-;\/\?:@&=\+\$,_\.!~\*'\(\)%0-9a-z]{1,512}$/i;

// main regular expression: CRLF, [tag=option], [tag] or [/tag]
var postfmt_re = /([\r\n])|(?:\[([a-z]{1,16})(?:=([^\x00-\x1F"'\(\)<>\[\]]{1,256}))?\])|(?:\[\/([a-z]{1,16})\])/ig;

// stack frame object
function taginfo_t(bbtag, etag)
{
   this.bbtag = bbtag;
   this.etag = etag;
}

// check if it's a valid BBCode tag
function isValidTag(str)
{
   if(!str || !str.length)
      return false;

   return tagname_re.test(str);
}

//
// m1 - CR or LF
// m2 - the tag of the [tag=option] expression
// m3 - the option of the [tag=option] expression
// m4 - the end tag of the [/tag] expression
//
function textToHtmlCB(mstr, m1, m2, m3, m4, offset, string)
{
   //
   // CR LF sequences
   //
   if(m1 && m1.length) {
      if(!crlf2br)
         return mstr;

      switch (m1) {
         case '\r':
            return "";
         case '\n':
            return "<br>";
      }
   }

   //
   // handle start tags
   //
   if(isValidTag(m2)) {
      // if in the noparse state, just echo the tag, including any option
      if(noparse)
         return mstr;

      // ignore any tags if there's an open option-less [url] tag
      if(opentags.length && opentags[opentags.length-1].bbtag == "url" && urlstart >= 0)
         return mstr;

      switch (m2) {
         case "code":
            opentags.push(new taginfo_t(m2, "</code></pre>"));
            crlf2br = false;
            return "<pre><code>";

         case "pre":
            opentags.push(new taginfo_t(m2, "</pre>"));
            crlf2br = false;
            return "<pre>";

         case "color":
         case "colour":
            if(!m3 || !color_re.test(m3))
               m3 = "inherit";
            opentags.push(new taginfo_t(m2, "</span>"));
            return "<span style=\"color: " + m3 + "\">";

         case "size":
            if(!m3 || !number_re.test(m3))
               m3 = "1";
            opentags.push(new taginfo_t(m2, "</span>"));
            return "<span style=\"font-size: " + Math.min(Math.max(m3, 0.7), 3) + "em\">";

         case "s":
            opentags.push(new taginfo_t(m2, "</span>"));
            return "<span style=\"text-decoration: line-through\">";

         case "noparse":
            noparse = true;
            return "";

         case "url":
            opentags.push(new taginfo_t(m2, "</a>"));
            
            // check if there's a valid option
            if(m3 && uri_re.test(m3)) {
               // if there is, output a complete start anchor tag
               urlstart = -1;
               return "<a href=\"" + m3 + "\">";
            }

            // otherwise, remember the URL offset 
            urlstart = mstr.length + offset;

            // and treat the text following [url] as a URL
            return "<a href=\"";

         case "q":
         case "blockquote":
            opentags.push(new taginfo_t(m2, "</" + m2 + ">"));
            return m3 && m3.length && uri_re.test(m3) ? "<" + m2 + " cite=\"" + m3 + "\">" : "<" + m2 + ">";

         default:
            // [samp], [b], [i] and [u] don't need special processing
            opentags.push(new taginfo_t(m2, "</" + m2 + ">"));
            return "<" + m2 + ">";
            
      }
   }

   //
   // process end tags
   //
   if(isValidTag(m4)) {
      if(noparse) {
         // if it's the closing noparse tag, flip the noparse state
         if(m4 == "noparse")  {
            noparse = false;
            return "";
         }
         
         // otherwise just output the original text
         return "[/" + m4 + "]";
      }
      
      // highlight mismatched end tags
      if(!opentags.length || opentags[opentags.length-1].bbtag != m4)
         return "<span style=\"color: red\">[/" + m4 + "]</span>";

      if(m4 == "url") {
         // if there was no option, use the content of the [url] tag
         if(urlstart >= 0)
            return "\">" + string.substr(urlstart, offset-urlstart) + opentags.pop().etag;
         
         // otherwise just close the tag
         return opentags.pop().etag;
      }
      else if(m4 == "code" || m4 == "pre")
         crlf2br = true;

      // other tags require no special processing, just output the end tag
      return opentags.pop().etag;
   }

   return mstr;
}

//
// post must be HTML-encoded
//
function parseBBCode(post)
{
   var result, endtags, tag;

   // convert CRLF to <br> by default
   crlf2br = true;

   // create a new array for open tags
   if(opentags == null || opentags.length)
      opentags = new Array(0);

   // run the text through main regular expression matcher
   result = post.replace(postfmt_re, textToHtmlCB);

   // reset noparse, if it was unbalanced
   if(noparse)
      noparse = false;
   
   // if there are any unbalanced tags, make sure to close them
   if(opentags.length) {
      endtags = new String();
      
      // if there's an open [url] at the top, close it
      if(opentags[opentags.length-1].bbtag == "url") {
         opentags.pop();

         // only an option-less [url] still has an unterminated href attribute
         if(urlstart >= 0)
            endtags += "\">" + post.substr(urlstart, post.length-urlstart);

         endtags += "</a>";
      }
      
      // close remaining open tags
      while(opentags.length)
         endtags += opentags.pop().etag;
   }

   return endtags ? result + endtags : result;
}
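
For readers who want to see the stack idea in isolation, here is a reduced sketch. It is not the parser above: the parseSimpleTags function and its variable names are hypothetical, it recognizes only [b], [i] and [u], and it leaves a mismatched end tag as plain text instead of highlighting it in red.

```javascript
// Reduced sketch: only [b], [i] and [u], no options, no [url], no [noparse]
function parseSimpleTags(text) {
   var open = [];   // stack of currently open tag names

   var html = text.replace(/\[(\/?)(b|i|u)\]/g, function (match, slash, tag) {
      // start tag: remember it and output the HTML equivalent
      if (!slash) {
         open.push(tag);
         return "<" + tag + ">";
      }

      // end tag that does not match the top of the stack: leave it as text
      if (!open.length || open[open.length - 1] != tag)
         return match;

      // matching end tag: pop the stack and close the element
      open.pop();
      return "</" + tag + ">";
   });

   // force-close whatever is still open, innermost first
   while (open.length)
      html += "</" + open.pop() + ">";

   return html;
}

var fixed = parseSimpleTags("[b][i]abc[/b]def[/i]");
// fixed is "<b><i>abc[/b]def</i></b>"
```

The mismatched [/b] is passed through untouched, [/i] closes the element at the top of the stack, and the [b] that is still open at the end of the text is closed by the final loop, so the result is well-formed.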

The HTML below can be used to see the parser in action. Save the parser in a file called bbcode.js and save the HTML in another file in the same directory.

<html>
<head>
<title>BBCode Test</title>
<script type="text/javascript" src="bbcode.js"></script>
<script type="text/javascript">
function outputBBCode(textarea)
{
   var out = document.getElementById("out");
   var out_html = document.getElementById("out_html");
   var html = parseBBCode(textarea.value);
   
   if(!out.firstChild)
      out.appendChild(document.createTextNode(html));
   else   
      out.replaceChild(document.createTextNode(html), out.firstChild);
      
   out_html.innerHTML = html;
}
</script>
</head>
<body>
<textarea id="in" rows="12" cols="80">[b]bold[/b], [i]italic[/i], [u]underlined[/u], [s]strike-through[/s], [samp]sample[/samp] 
[url]http://blogs.stonesteps.ca/showpost.asp?pid=33[/url]
[url=http://blogs.stonesteps.ca/showpost.asp?pid=33][i]BBCode[/i] Parser[/url]
Inline [q=http://blogs.stonesteps.ca/showpost.asp?pid=33]quote[/q]
[blockquote=http://blogs.stonesteps.ca/showpost.asp?pid=33]Block quote[/blockquote][pre]formatted 
     text[/pre][code]if(a == b) 
   print("done");[/code]text containing [noparse] [brackets] [/noparse]
c[b][color=red]o[/color][/b][b][color=green]l[/color][/b][b][color=blue]o[/color][/b]rs and [size=1.2]text size[/size]
[b][i]mismatched [/b] tags[/i] 
remaining text should not affect page HTML.
</textarea>
<div>
<input type="submit" value="Submit" onclick="outputBBCode(document.getElementById(&quot;in&quot;))" style="vertical-align: top">
</div>
<div id="out_html" style="border: 1px solid #777; margin: 1em auto; padding: 5px 3px;">&nbsp;</div>
<p>This paragraph should not be formatted in any way after 
BBCode is converted to HTML, even if there are mismatched 
or mixed BBCode tags.</p>
<div id="out" style="border: 1px solid #777; margin: 1em auto; padding: 5px 3px;">&nbsp;</div>
</body>
</html>

The parser may be used with ASP, as long as the script language is identified as JScript. Alternatively, it can be used as a client script to convert BBCode tags to HTML directly in the browser.

Comments:
Posted Thu Jun 30 07:11:06 EDT 2011 by fduch

url=foo.bar#id doesn't work. The following patch fixes it:

--- bbcode.js.orig 2011-06-30 14:54:18.000000000 +0400
+++ bbcode.js 2011-06-30 15:08:48.000000000 +0400
@@ -53,7 +53,7 @@
var number_re = /^[\\.0-9]{1,8}$/i;

// reserved, unreserved, escaped and alpha-numeric [RFC2396]
-var uri_re = /^[-;\/\?:@&=\+\$,_\.!~\*'\(\)%0-9a-z]{1,512}$/i;
+var uri_re = /^[-;\/\?:@&=\+\$,_\.!~\*'\(\)%0-9a-z#]{1,512}$/i;

// main regular expression: CRLF, [tag=option], [tag] or [/tag]
var postfmt_re = /([\r\n])|(?:\[([a-z]{1,16})(?:=([^\x00-\x1F"'\(\)<>\[\]]{1,256}))?\])|(?:\[\/([a-z]{1,16})\])/ig;
 

Posted Fri Nov 28 08:09:32 EST 2014 by Kodi

Hi.

Nice post man, thank you for this.
