HTMLdoc: Parser & Minifier

A tokeniser-based HTML document parser and minifier, written in PHP.

Project Description

An HTML parser primarily designed for minifying HTML documents. It also enables the document structure to be queried, allowing attribute and TextNode values to be extracted.

The parser is designed around a tokeniser to make document processing more reliable than regex-based minifiers, which are blunt instruments and can cause problems when they match patterns in the wrong places.

The software is also capable of processing and minifying SVG documents.

How HTMLdoc Works

Under the hood, processing is split into four stages:

Tokenisation

The input HTML is loaded into the tokeniser as a string, and a regular expression splits it up into categorised tokens.
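As a rough illustration of regex-driven tokenisation, the sketch below splits a string into categorised tokens using one alternation with named groups. The function name, token names, and pattern are assumptions made for this example; they are not HTMLdoc's actual tokeniser, which handles far more cases.

```php
<?php
// Hypothetical sketch: tokenise() and the token categories are illustrative only.
function tokenise(string $html): array
{
    // One named group per token category; earlier branches take priority.
    $pattern = '/(?<comment><!--.*?-->)'
        . '|(?<tagclose><\/[a-zA-Z][^>]*>)'
        . '|(?<tagopen><[a-zA-Z][^>]*>)'
        . '|(?<text>[^<]+|<)/s';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $match) {
        // Exactly one named group is non-empty for each match.
        foreach (['comment', 'tagclose', 'tagopen', 'text'] as $type) {
            if (isset($match[$type]) && $match[$type] !== '') {
                $tokens[] = ['type' => $type, 'value' => $match[$type]];
                break;
            }
        }
    }
    return $tokens;
}

$tokens = tokenise('<p class="a">Hi <!-- note --><b>there</b></p>');
```

Because every character of the input belongs to exactly one token, later stages can reason about token types instead of re-matching patterns against raw markup.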

Parsing

The tokens are passed to the parser, which loops through and consumes each token using an object-based finite state machine to build an internal object structure that represents the document. This allows irregular tokens to be ignored, so the document is parsed more reliably.

Once parsed, the document contains an array of child objects, each of a type corresponding to the node it represents in the original document. Objects of the tag type can contain their own child objects, and so on down the tree.
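The idea of consuming tokens into a tree while skipping irregular ones can be sketched as follows. This is a deliberately simplified stand-in: the Node class, the token shape, and the single-loop design are assumptions for illustration, and it does not reproduce the object-based state machine the parser itself uses.

```php
<?php
// Hypothetical node type; HTMLdoc's own classes and properties may differ.
final class Node
{
    /** @var Node[] */
    public array $children = [];

    public function __construct(
        public string $type,        // 'root', 'tag' or 'text'
        public string $value = '',  // tag name or text content
        public ?Node $parent = null
    ) {}
}

function parseTokens(array $tokens): Node
{
    $root = new Node('root');
    $current = $root; // the element currently being filled

    foreach ($tokens as [$type, $value]) {
        switch ($type) {
            case 'tagopen':
                $node = new Node('tag', $value, $current);
                $current->children[] = $node;
                $current = $node; // descend into the new element
                break;
            case 'tagclose':
                // Close only when the name matches the open element;
                // a stray closing tag is ignored.
                if ($current->type === 'tag' && $current->value === $value) {
                    $current = $current->parent;
                }
                break;
            case 'text':
                $current->children[] = new Node('text', $value, $current);
                break;
            // Any other token type is skipped.
        }
    }
    return $root;
}

$doc = parseTokens([
    ['tagopen', 'p'],
    ['text', 'Hello '],
    ['tagclose', 'span'], // irregular token, ignored
    ['tagopen', 'b'],
    ['text', 'world'],
    ['tagclose', 'b'],
    ['tagclose', 'p'],
]);
```

The stray closing tag leaves the tree untouched, so the result is a root with one p element holding a text node and a b element.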

Minification

Minification is performed by each object on its own structure, with the command passed down each level from the one above.

Each object has its own minification process, and some (such as the text object) reference their siblings or parent through the parent object.

As an example of a process, when whitespace is removed from the document, each text object will remove non-significant whitespace from its content property.
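A minimal sketch of this delegation is shown below. The interface, class names, the 'whitespace' option key, and the preformatted flag are all hypothetical; in particular, the flag stands in for the lookups a real text object would make through its parent to decide whether whitespace is significant.

```php
<?php
// Hypothetical classes illustrating the delegation pattern; not HTMLdoc's real API.
interface Minifiable
{
    public function minify(array $options): void;
}

final class TextNode implements Minifiable
{
    public function __construct(
        public string $content,
        public bool $preformatted = false // assumed flag, e.g. text inside <pre>
    ) {}

    public function minify(array $options): void
    {
        // Only collapse whitespace where it is not significant.
        if (!empty($options['whitespace']) && !$this->preformatted) {
            $this->content = preg_replace('/\s+/', ' ', $this->content) ?? $this->content;
        }
    }
}

final class TagNode implements Minifiable
{
    /** @param Minifiable[] $children */
    public function __construct(public string $name, public array $children = []) {}

    public function minify(array $options): void
    {
        // Pass the same command down to every child.
        foreach ($this->children as $child) {
            $child->minify($options);
        }
    }
}

$pre = new TextNode("  keep \n this ", true);
$text = new TextNode("Hello \n\t  world  ");
$doc = new TagNode('div', [$text, new TagNode('pre', [$pre])]);
$doc->minify(['whitespace' => true]);
```

Calling minify() once on the top-level object is enough: each tag forwards the options to its children, and each text object decides for itself what to change.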

Compiling

The compilation process reconstructs the HTML from its object representation. Each object generates itself as a string, and then requests that its children generate themselves. The results are concatenated together and either output as a string or saved to the requested location.
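The recursive string generation can be sketched like this. The interface, class names, and the compile() method name are assumptions for the example, attribute values are assumed to be already encoded, and void elements such as br are not handled.

```php
<?php
// Hypothetical classes; the compile() method name and node layout are assumptions.
interface Compilable
{
    public function compile(): string;
}

final class TextPart implements Compilable
{
    public function __construct(private string $content) {}

    public function compile(): string
    {
        return $this->content;
    }
}

final class TagPart implements Compilable
{
    /**
     * @param array<string,string> $attributes values assumed already encoded
     * @param Compilable[] $children
     */
    public function __construct(
        private string $name,
        private array $attributes = [],
        private array $children = []
    ) {}

    public function compile(): string
    {
        // Generate this object's own opening tag.
        $html = '<' . $this->name;
        foreach ($this->attributes as $key => $value) {
            $html .= ' ' . $key . '="' . $value . '"';
        }
        $html .= '>';

        // Ask each child to generate itself and concatenate the results.
        foreach ($this->children as $child) {
            $html .= $child->compile();
        }
        return $html . '</' . $this->name . '>';
    }
}

$doc = new TagPart('p', ['class' => 'intro'], [
    new TextPart('Hello '),
    new TagPart('b', [], [new TextPart('world')]),
]);
$html = $doc->compile();
```

The resulting string can then be echoed or written to disk, for example with file_put_contents().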

Performance

HTMLdoc has been written with performance in mind, as its main purpose is the minification of HTML. Since PHP is widely used to generate HTML, it is designed for the use case where minification happens on the fly.

For most HTML documents, speed and memory usage will be well within acceptable bounds, although large, complex documents with a large number of nodes will be slower to process.

For the most part the object is memory efficient. Most memory is used by the tokenisation process, which uses a regular expression to split the input string into tokens, and this part of the program has been optimised to minimise memory use.
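To check whether speed and memory are acceptable for your own documents, a generic timing helper along these lines may be useful. The measure() function is hypothetical, and the closure below is only a placeholder that collapses whitespace; it stands in for whatever minification call you actually use.

```php
<?php
// Generic measurement helper; $minify is any callable taking and returning a string.
function measure(callable $minify, string $html): array
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    $output = $minify($html);
    $elapsedMs = (hrtime(true) - $start) / 1e6;

    return [
        'ms' => $elapsedMs,
        'peak_memory_bytes' => memory_get_peak_usage(),
        'bytes_in' => strlen($html),
        'bytes_out' => strlen($output),
    ];
}

// Placeholder standing in for a real minifier call, which is not shown here.
$placeholder = fn (string $html): string => preg_replace('/\s+/', ' ', $html) ?? $html;

$stats = measure($placeholder, str_repeat("<p>  some   text </p>\n", 1000));
```

Note that memory_get_peak_usage() reports the peak for the whole script so far, so run the measurement in a fresh process if you want a figure for a single document.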

You can try out HTMLdoc in my apps.