Is your HTML code Tidy?

Written by Greg Winiarski on January 3, 2008 in PHP

When creating large Internet applications, it is sometimes hard to keep your HTML code in good shape, especially if users are allowed to submit their own articles or posts. There is always the possibility of an unclosed tag or some similar small problem with big consequences.

Basically there are two solutions to this problem: write complex scripts to check the syntax of submitted text, or configure the Tidy extension, write a few lines of code and never think about it again. In my opinion the second solution sounds more tempting, and in this article I want to show you how it is done.

But before we start: what is the Tidy extension? According to the PHP manual: Tidy is a binding for the Tidy HTML clean and repair utility which allows you to clean and otherwise manipulate HTML documents.

Installation

If you are using PHP5 on a Windows system, all you need to do is uncomment the extension=php_tidy.dll line in your php.ini file and restart your server. After that you should be able to use Tidy.

On Linux systems you need to compile PHP with the --with-tidy configuration option, but in order to do that you need to download and install TidyLib (http://tidy.sourceforge.net/) first.

Once you have TidyLib installed, you can compile PHP5.

Alternatively, if you have PEAR available on your Linux system, you can use its pecl installer to install the Tidy extension with the following command: pecl install tidy
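Whichever installation route you take, it is worth confirming that PHP actually sees the extension before going further. A minimal sketch using PHP's extension_loaded() function; the variable name $tidyAvailable is just my choice for illustration:

```php
<?php
// Ask PHP whether the Tidy extension is loaded in this build.
$tidyAvailable = extension_loaded('tidy');

echo $tidyAvailable
    ? "Tidy is loaded.\n"
    : "Tidy is not loaded; check php.ini or your build options.\n";
```

If the second message appears, the most likely causes are a php.ini line that is still commented out or a server that was not restarted.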

Tidy Dual Nature

Like other PHP5 extensions, Tidy has a set of procedural functions and an object whose methods correspond to those functions and offer the same functionality.

A very simple procedural script can look like this:

<?php
$tidy = tidy_parse_file('file.html');
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);
?>

An object-oriented script, on the other hand, will look like this:

<?php
$tidy = new tidy();
$tidy->parseFile('file.html');
$tidy->cleanRepair();
echo $tidy;
?>

Exactly the same functionality, but quite a different approach. Moreover, procedural and object-oriented syntax can be mixed. Check out the example below, which was taken directly from the PHP manual.

<?php
$html = '<p>paragraph</i>';
$tidy = tidy_parse_string($html);
$tidy->CleanRepair();
echo tidy_get_output($tidy);
?>

As you can see, the syntaxes can be mixed easily, but I do not recommend it; it is best to select the syntax which suits you and stick to it. So the obvious question is: which one is better? Actually, neither of these two solutions is better or worse. Most of the time you will need the same number of variables and even lines of code in both cases.

However, I prefer object-oriented programming over procedural programming, so in this article I will use objects in all examples.

Using Tidy To Clean Up Files

We have installed Tidy on our computers and selected a syntax, so now it is time for the fun part … coding.

Let’s start with an HTML document, which I called dirty.html:

<html>
<body>
</i><p>Tidy tutorial</i></p>
</html>

It looks dirty, doesn’t it? Now it’s time for the index.php file:

<?php
$tidy = new tidy();
$tidy->parseFile('dirty.html');
$tidy->cleanRepair();
echo $tidy;
?>

What we are doing here is very simple: first we create a new object, then we parse the dirty.html file, and finally we execute the clean and repair operations on the parsed markup.

Now let’s run this script in a web browser and see the result. If you are not creating the scripts along with me, below is the source code from my browser.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<p>Tidy tutorial</p>
</body>
</html>

From a completely unreadable document to a nice-looking HTML document, not bad. We can achieve the same result even without the dirty.html file. Take a look at the code below and note that only one line has changed:

<?php
$tidy = new tidy();
$tidy->parseString('<p>Tidy tutorial</p>');
$tidy->cleanRepair();
echo $tidy;
?>

In this example, instead of using the tidy->parseFile() method and a file, we simply used the tidy->parseString() method. This can be very helpful if we want to parse an article or some other text submitted by the user.

The only problem would be the html, head and body tags added to the article. Most of the time we do not want these tags in articles. Fortunately there is an easy way to skip them, and to do so we have to take advantage of …

Tidy Configuration and Options

Before we go any further, let’s look again at the tidy->parseString() and tidy->parseFile() methods. We only used one parameter with them, mainly because it was all we needed, but by doing so we were limiting ourselves to the default parsing configuration.

Here is a definition of tidy->parseString() method:

bool tidy->parseString(
string input
[, mixed config
[, string encoding]] )

We already know what the input variable is for, so take a look at the next one, mixed config. This is the parameter which interests us most.

Mixed means that it can be any kind of variable, but in this particular instance it can be a string or an array. If we pass config as a string, tidy->parseString() will treat it as a path to a configuration file. If we pass config as an array, on the other hand, PHP will treat it as the options themselves. Note that you need to have key/value pairs in the array, not just values; we will get to that later.

The last parameter is encoding. We won’t be using it in any of the provided examples, but here is what the PHP manual has to say about it: The encoding parameter sets the encoding for input/output documents. The possible values for encoding are: ascii, latin0, latin1, raw, utf8, iso2022, mac, win1252, ibm858, utf16, utf16le, utf16be, big5 and shiftjis.

The tidy->parseFile() method definition looks pretty much the same as the tidy->parseString() definition; there is only one more parameter:

bool tidy->parseFile (
string filename
[, mixed config
[, string encoding
[, bool use_include_path]]] )

The Boolean use_include_path determines whether PHP should also search for the file in the include_path, which is set in the php.ini file. Tidy has a default configuration file as well, and its path is also written in php.ini; we will get to that later.

Now let’s have some fun with the config variable, but first things first. In the paragraph about the config variable I mentioned that config can be an array or a path to a file. What does this file look like, then?

It is just a normal text file containing key/value pairs, each on its own line, for example:

indent-spaces: 4
wrap: 80
indent: true

There are a lot of configuration options available, but listing them all here is not my goal. I want to share with you the options that are the most useful (in my opinion, of course). If you want to find more interesting options, and I encourage you to do so, visit http://tidy.sourceforge.net/docs/quickref.html for the full list.

In our examples we won’t be using configuration files; we will go with arrays. Here is how the same configuration as above looks as an array:

<?php
$config = array(
'indent-spaces' => 4,
'wrap' => 80,
'indent' => true
);
?>
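To see that both forms of the config parameter really lead to the same settings, here is a small sketch that loads the three options once from a file and once from an array, then reads them back with tidy->getOpt(). The temporary file and the variable names are my own assumptions, made only so the example is self-contained:

```php
<?php
// Write the three options to a temporary configuration file.
$configFile = tempnam(sys_get_temp_dir(), 'tidy');
file_put_contents($configFile, "indent-spaces: 4\nwrap: 80\nindent: true\n");

// The same options expressed as an array of key/value pairs.
$configArray = array(
    'indent-spaces' => 4,
    'wrap' => 80,
    'indent' => true
);

$html = '<p>Tidy tutorial</p>';

$fromFile = new tidy();
$fromFile->parseString($html, $configFile);   // string: path to a config file

$fromArray = new tidy();
$fromArray->parseString($html, $configArray); // array: the options themselves

// Both objects should report the same values for each option.
echo $fromFile->getOpt('wrap') . ' / ' . $fromArray->getOpt('wrap') . "\n";
echo $fromFile->getOpt('indent-spaces') . ' / ' . $fromArray->getOpt('indent-spaces') . "\n";

// Remove the temporary file again.
unlink($configFile);
```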

If you are using Tidy 1.0 (you can check the version with the phpinfo() function), there is also a third way to change configuration options: they can be updated using the tidy_setopt() function. In Tidy 2.0 this function is unavailable.

<?php
tidy_setopt('indent', true);
?>

Repairing User Submitted Texts

As we discussed earlier, it would be nice to parse user-submitted HTML code with Tidy. As I said in the beginning, there is always the possibility of an unclosed tag in submitted text, which can break the whole page layout. Better safe than sorry, so we need to prevent this from happening.

How do we do it? Again it is very easy. The PHP code will not differ much from the previous code; all we need for this operation is the show-body-only option.

<?php
$tidy = new tidy();
$config = array('show-body-only' => true);
$txt = '<p>Tidy tutorial</i></p>';
$tidy->parseString($txt, $config);
$tidy->cleanRepair();
echo $tidy;
?>

When you run this and view the source code, you will see only this text: <p>Tidy tutorial</p>. This is exactly what we wanted to achieve.
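In a real application you would probably wrap these few lines in a function and call it wherever user text is saved or displayed. A minimal sketch; the function name repairFragment() is hypothetical and not part of the Tidy extension:

```php
<?php
/**
 * Hypothetical helper: repair a user-submitted HTML fragment and
 * return only the cleaned body markup.
 */
function repairFragment($html)
{
    $tidy = new tidy();
    $tidy->parseString($html, array('show-body-only' => true));
    $tidy->cleanRepair();

    // Casting the object to a string gives the repaired markup,
    // just as "echo $tidy" does in the examples above.
    return trim((string) $tidy);
}

echo repairFragment('<p>Tidy tutorial</i></p>');
```

Keeping the options in one place like this also makes it harder to forget show-body-only in one of the calls.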

What about newsletters? Each line should have at most 80 characters. How do we do it with Tidy? Simply add the wrap option to the configuration array:

<?php
$tidy = new tidy();
$config = array(
'wrap' => 80,
'show-body-only' => true);
$tidy->parseFile('http://php.net', $config);
$tidy->cleanRepair();
echo $tidy;
?>

There you go, an article ready to be sent with a newsletter. Note that I changed tidy->parseString() to tidy->parseFile(), as our text was too short to see the effect.

Handling Errors

Tidy is a utility for repairing broken HTML code, and with broken code come errors and warnings. Tidy will take care of them right away, so most of the time there is no reason to bother with errors, but in case you need to catch them, here are a few interesting features to use.

Every time an error occurs in the HTML document, Tidy adds it to the tidy->errorBuffer variable, with each error on a new line.

<?php
$tidy = new tidy();
$tidy->parseFile('dirty.html');
$tidy->cleanRepair();
echo $tidy->errorBuffer . "\n\n";
$tidy->diagnose();
echo $tidy->errorBuffer;
?>

echo $tidy->errorBuffer; displays all errors, but take a look at what happens to the tidy->errorBuffer variable if we call tidy->diagnose() first.

Here is the output:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected </i>
line 3 column 21 - Warning: discarding unexpected </i>
line 2 column 1 - Warning: inserting missing 'title' element
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 3 column 1 - Warning: discarding unexpected </i>
line 3 column 21 - Warning: discarding unexpected </i>
line 2 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML 3.2
4 warnings, 0 errors were found!

What the tidy->diagnose() method does is run diagnostic tests on the given tidy object and add some more information about the document to the error buffer.
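Because the error buffer is plain text with one message per line, it is easy to turn into an array if you want to log or count the problems. The sketch below assumes the message format shown in the output above ("line N column N - Warning: ..."), which may differ between Tidy versions; the function name tidyMessages() is hypothetical:

```php
<?php
/**
 * Hypothetical helper: split Tidy's error buffer into separate messages
 * and keep only the lines that are warnings or errors.
 */
function tidyMessages($errorBuffer)
{
    $messages = array();
    foreach (preg_split('/\r?\n/', (string) $errorBuffer) as $line) {
        // Expected shape: "line 3 column 1 - Warning: some text"
        if (preg_match('/^line (\d+) column (\d+) - (Warning|Error): (.*)$/', $line, $m)) {
            $messages[] = array(
                'line'   => (int) $m[1],
                'column' => (int) $m[2],
                'type'   => $m[3],
                'text'   => $m[4],
            );
        }
    }
    return $messages;
}

// Sample buffer copied from the output above; in a real script you
// would pass $tidy->errorBuffer instead.
$sample = "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
        . "line 3 column 1 - Warning: discarding unexpected </i>";

foreach (tidyMessages($sample) as $message) {
    echo $message['type'] . ' at line ' . $message['line'] . ': ' . $message['text'] . "\n";
}
```

Lines that do not match the pattern, such as the summary lines added by tidy->diagnose(), are simply skipped.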

Converting Documents

Using Tidy we can easily convert a document from HTML to XHTML or even XML, and again all we need to do is change one configuration option. Here are the options you may consider using:

  • output-html outputs data in HTML format.
  • output-xhtml outputs data in XHTML format.
  • output-xml outputs data as XML file.

All of these options are Boolean, and we may use only one of them in the options array; obviously a document can’t be XML and XHTML at the same time.
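As a quick illustration, here is a sketch that converts a small HTML snippet to XHTML with the output-xhtml option. The input string is my own example; I would expect the result to carry the XHTML namespace on the html tag and a self-closing br tag, but the exact output depends on your Tidy version:

```php
<?php
$tidy = new tidy();

// Ask Tidy to write the document out as XHTML instead of HTML.
$tidy->parseString('<p>Tidy<br>tutorial</p>', array('output-xhtml' => true));
$tidy->cleanRepair();

$xhtml = (string) $tidy;
echo $xhtml;
```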

Reducing Document Size

There are a few options that allow us to reduce document size. The savings won’t be huge, but when a website gets a lot of traffic even a small reduction is useful.

Here are the options you can use, along with their descriptions:

  • drop-proprietary-attributes removes proprietary attributes, that is, attributes that are not part of a web standard. Unknown tags are a separate matter: in the test below, Tidy discards the <foo> tag itself and keeps its text.
  • drop-font-tags the font tag is deprecated, but occasionally you can still find it in your website’s HTML code, and this option helps you get rid of it. Note that this option only removes the <font> tag and does not replace it with equivalent code, so be careful with it, as it is a double-edged sword.
  • drop-empty-paras removes all empty <p> tags. It does not seem like a big reduction, but you will be amazed how many empty tags people have in their code.
  • hide-comments this is my favorite option; it removes all comments from the parsed document.

All these options are Boolean. I believe you already know how to make use of them, but in case you need the code, here it is:

<?php
$tidy = new tidy();
$options = array(
"drop-proprietary-attributes" => true,
"drop-empty-paras" => true,
"hide-comments" => true,
"drop-font-tags" => true
);

$tidy->parseFile('dirty.html', $options);
$tidy->cleanRepair();
echo $tidy;
?>

I also updated the dirty.html file with even dirtier code to test whether our new options work correctly. Now it looks like this:

<html>
<body>
<!-- i am a comment -->
</i><p>Tidy tutorial</i></p><p></p><br>
<foo>This is text in foo tag</foo><br>
<font color="red">red text</font>
</html>

And this is what we get as output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<p>Tidy tutorial</p>
<br>
This is text in foo tag<br>
red text
</body>
</html>
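If you are curious how much an option actually saves, you can clean the same markup twice and compare the lengths. A rough sketch using only hide-comments; the helper name tidyWith() and the sample markup are assumptions made for this illustration:

```php
<?php
// Hypothetical helper: clean $html with the given options and return the markup.
function tidyWith($html, $options)
{
    $tidy = new tidy();
    $tidy->parseString($html, $options);
    $tidy->cleanRepair();
    return (string) $tidy;
}

$html = '<p>Tidy tutorial</p><!-- internal note for editors --><p>More text</p>';

// Same input, cleaned once with comments kept and once with comments hidden.
$plain   = tidyWith($html, array('show-body-only' => true));
$compact = tidyWith($html, array('show-body-only' => true, 'hide-comments' => true));

$saved = strlen($plain) - strlen($compact);
echo $saved . " bytes saved\n";
```

On a single short fragment the difference is tiny, but the same comparison on a real page gives a better feeling for whether an option is worth enabling.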

Output Buffering With TidyHandler

Remembering to pass every piece of output through Tidy can sometimes be a lot of work, and the end result may not be as good as expected if we use the wrong options when parsing data.

Fortunately there is a simple solution for this. We can turn output buffering on and use ob_tidyhandler() as the callback for the ob_start() function, so before any data is sent to the user it will be buffered and parsed by Tidy. This means less work for the programmer and even cleaner HTML code.

This is how it is done:

<?php
ob_start('ob_tidyhandler');
include('dirty.html');
?>

Just two lines of code. The only problem in this situation is that ob_tidyhandler() uses the options from the default configuration file, so we have to edit that file if we want to customize parsing, and in most cases we will want to.
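If editing the default configuration is not possible (on shared hosting, for instance), one workaround I can imagine is to give ob_start() your own callback instead of ob_tidyhandler(). This is only a sketch: the function name cleanOutputBuffer() and the chosen options are my assumptions, not something the extension provides.

```php
<?php
/**
 * Hypothetical output callback: clean the buffered page with our own
 * options instead of the defaults used by ob_tidyhandler().
 */
function cleanOutputBuffer($buffer)
{
    $tidy = new tidy();
    $tidy->parseString($buffer, array('hide-comments' => true, 'wrap' => 80));
    $tidy->cleanRepair();
    return (string) $tidy;
}

ob_start('cleanOutputBuffer');
echo '<!-- debug note --><p>Tidy tutorial</i></p>';
ob_end_flush(); // runs the callback and sends the cleaned page
```

The trade-off is that the options now live in your code rather than in one server-wide file.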

Tidy and php.ini

Earlier in this article I mentioned the “default path to configuration file”. Where is this default path set? In the php.ini file, of course.

Open the php.ini file on your computer and find the [Tidy] section. The whole section should look similar to this:

[Tidy]
; The path to a default tidy configuration file to use when using tidy
;tidy.default_config = d:\htdocs\tidy\default.tcfg
; Should tidy clean and repair output automatically?
; WARNING: Do not use this option if you are generating non-html content
; such as dynamic images
tidy.clean_output = Off

Now uncomment tidy.default_config and set it to point to the file which contains the default configuration. This file probably doesn’t exist yet, so you may set the path to whatever you want. I chose d:\htdocs\tidy\default.tcfg because this is the directory in which I keep the files for this article.

Once that is done, it is time to create this file, so that tidy.default_config won’t point to a nonexistent file. This is what my file looks like:

output-xhtml: true
hide-comments: true

Restart Apache and navigate with your web browser to the index.php file. The result should be a valid XHTML document with comments removed (if you set the same options as I did).
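If the output does not change after the restart, it helps to check which values PHP actually picked up. A small sketch using PHP's ini_get() function to read the two settings from the [Tidy] section; the $settings array is just an illustrative name:

```php
<?php
// Read the two Tidy settings discussed in this section.
$settings = array(
    'tidy.default_config' => ini_get('tidy.default_config'),
    'tidy.clean_output'   => ini_get('tidy.clean_output'),
);

foreach ($settings as $name => $value) {
    // ini_get() returns false when the setting is unknown to PHP,
    // for example when the Tidy extension is not loaded.
    echo $name . ' = ' . var_export($value, true) . "\n";
}
```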

We can also take advantage of the second option (tidy.clean_output) and set it to On. Let’s modify the index.php file and leave only one line of code there:

<?php
include('dirty.html');
?>

If you restart Apache again and run this script, you will see that the HTML code sent to your browser is the same as last time. This is because Tidy’s clean and repair operations are now run automatically when output is sent to the client.

If you are going to use tidy.clean_output, keep in mind that you won’t be able to generate non-HTML content such as images.

Conclusion

I like Tidy because, as we saw, it does not require a lot of coding; none of the provided examples was longer than 10 lines of code. In fact, by correctly configuring PHP in php.ini we can use Tidy without writing a single line of code.

So basically Tidy is more about selecting the right options and configuring the extension than about writing complex code. Of course, Tidy is not limited to the options we used. There are many more of them, and if you are interested in exploring them, visit the Tidy homepage at SourceForge.

About the Author

Greg Winiarski is a freelance PHP and JavaScript programmer. He specializes in web applications and WordPress development.
