Class: Messy extends XML_HTMLSax3

Messy is a package that parses and cleans invalid HTML markup.



Properties


$comment = false

Whether the parser is in a comment or not.


$block = false

Whether the parser is in a block or not.


$list = array ()

  • Access: public

The structure created from the previous call to parse().


$output = array ()

  • Access: public

The output from the last call to parse().


$selfClosing = array (
        'img',
        'br',
        'hr',
        'meta',
        'link',
        'area',
    )

  • Access: public

Contains a list of tags that are self-closing (i.e.,
they do not contain any data, such as a br tag).


$stripTags = array (
        'font',
        'spacer',
        'blink',
        'xml:namespace',
        'o:p',
        'st1:city',
        'st1:address',
        'st1:street',
        'st1:state',
        'st1:place',
        'st1:placename',
        'st1:placetype',
        'st1:personname',
        'st1:country-region',
        'v:shapetype',
        'span',
        'del',
        'frame',
        'frameset',
        'layer',
        'ilayer',
        'link',
        'meta',
        'xml',
        'minmax_bound',
        'place',
        'placename',
        'placetype',
        'city',
        'state',
        'street',
        'personname',
        'country-region',
    )

  • Access: public

Contains the list of tags that should be stripped
from the output when $safe is false.


$stripTagsSafe = array (
        'font',
        'spacer',
        'blink',
        'xml:namespace',
        'o:p',
        'st1:city',
        'st1:address',
        'st1:street',
        'st1:state',
        'st1:place',
        'st1:placename',
        'st1:placetype',
        'st1:personname',
        'st1:country-region',
        'v:shapetype',
        'span',
        'del',
        'script',
        'applet',
        'object',
        'iframe',
        'frame',
        'frameset',
        'layer',
        'ilayer',
        'embed',
        'bgsound',
        'link',
        'meta',
        'xml',
        'minmax_bound',
        'place',
        'placename',
        'placetype',
        'city',
        'state',
        'street',
        'personname',
        'country-region',
    )

  • Access: public

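Contains the list of tags that should be stripped
from the output when $safe is true. It extends $stripTags
with potentially unsafe tags: script, applet, object,
iframe, embed, and bgsound.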


$stripAttrs = array ()

  • Access: public

Contains the list of attributes that should be stripped
from the output when $safe is false. It is empty by default.


$stripAttrsSafe = array (
        'onclick',
        'onsubmit',
        'onselect',
        'onchange',
        'onmouseover',
        'onmouseout',
        'onfocus',
        'onblur',
        'ondblclick',
        'onhelp',
        'onkeydown',
        'onkeypress',
        'onkeyup',
        'onmousedown',
        'onmousemove',
        'onmouseup',
        'onresize',
        'dataformatas',
        'data',
        'datafld',
        'datasrc',
        'dynsrc',
    )

  • Access: public

Contains the list of attributes that should be stripped
from the output when $safe is true.


$transform = array (
        'b' => 'strong', // direct switches can just list the new tag name
        'i' => 'em',
        'center' => array ( // array allows you to set attributes on transformations
            'tag' => 'div',
            'attrs' => array (
                'align' => 'center',
            ),
        ),
    )

  • Access: public

Contains a list of tags that should be transformed
into other tags in the output.
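
The two rule forms in $transform (a plain string, or an array with 'tag' and 'attrs') can be read as in the following minimal sketch. The function apply_transform_sketch() is hypothetical and is not part of Messy. How Messy itself merges rule attributes with existing ones is not documented here, so the override behaviour below is an assumption.

```php
<?php
// Hypothetical helper, not part of Messy: apply a $transform-style map
// to a single tag. Returns array(newTagName, newAttributes).
function apply_transform_sketch($tag, array $attrs, array $transform)
{
    if (!isset($transform[$tag])) {
        return array($tag, $attrs);      // no rule: leave the tag untouched
    }

    $rule = $transform[$tag];

    if (is_string($rule)) {
        return array($rule, $attrs);     // direct switch, e.g. b => strong
    }

    // Array rule: new tag name plus attributes to set on it.
    // Assumption: rule attributes override existing ones of the same name.
    $extra = isset($rule['attrs']) ? $rule['attrs'] : array();
    return array($rule['tag'], array_merge($attrs, $extra));
}

// Same rules as the default $transform property.
$transform = array(
    'b' => 'strong',
    'i' => 'em',
    'center' => array('tag' => 'div', 'attrs' => array('align' => 'center')),
);
```

For example, apply_transform_sketch('center', array(), $transform) yields the tag name 'div' with an align attribute of 'center'.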


$levels = array ()

  • Access: public

This array is used to compare opening and closing
tags within the document structure, and to try to repair
them by inserting missing tags where necessary.
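
The general idea of repairing mismatched tags can be illustrated with a stack of open tag names. This is a generic sketch of the technique, not Messy's actual algorithm: the function name, the event format, and the decision to drop unmatched closing tags are all assumptions made for the example.

```php
<?php
// Hypothetical sketch, not Messy's code: repair a flat list of tag events.
// Each event is array('open' | 'close', tagName).
function balance_tags_sketch(array $events)
{
    $stack = array();   // names of currently open tags, innermost last
    $out = array();

    foreach ($events as $event) {
        list($type, $tag) = $event;

        if ($type === 'open') {
            $stack[] = $tag;
            $out[] = $event;
            continue;
        }

        // Closing tag with no matching opener: drop it (an assumption).
        if (!in_array($tag, $stack, true)) {
            continue;
        }

        // Close any inner tags that were left open, then the tag itself.
        while (($top = array_pop($stack)) !== $tag) {
            $out[] = array('close', $top);
        }
        $out[] = array('close', $tag);
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        $out[] = array('close', array_pop($stack));
    }

    return $out;
}
```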


$safe = true

  • Access: public

This tells Messy whether to use the stripTags and stripAttrs lists
or the stripTagsSafe and stripAttrsSafe lists, which contain
additional tags and attributes that are considered potentially
unsafe. The default is to use the latter and be more secure by
default.
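
The selection described above amounts to a simple switch on $safe. The class below is a hypothetical stand-in that reuses Messy's property names with shortened lists; the method activeLists() is invented for illustration and does not exist in Messy.

```php
<?php
// Hypothetical stand-in, not the Messy class: same property names,
// shortened lists, and an invented helper method.
class StripListsSketch
{
    public $safe = true;
    public $stripTags = array('font', 'spacer');
    public $stripTagsSafe = array('font', 'spacer', 'script', 'iframe');
    public $stripAttrs = array();
    public $stripAttrsSafe = array('onclick', 'onmouseover');

    // Return array(tagList, attributeList) for the current $safe setting.
    public function activeLists()
    {
        return $this->safe
            ? array($this->stripTagsSafe, $this->stripAttrsSafe)
            : array($this->stripTags, $this->stripAttrs);
    }
}
```

With $safe left at true, the tag list includes script and iframe; setting it to false falls back to the shorter lists.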




Methods


Messy () 

  • Access: public

Constructor method.


parse ($data, $isXml = false)

  • Access: public
  • Return: array

Parses the given HTML or XML $data into an array of
"tokens". Each token is an associative array with the
following keys: tag (the name of the tag), attributes (a
key/value array of the tag's attributes), level (the depth
of the tag within the document), type ('open', 'complete'
for a self-closing tag, 'cdata' for character data, or
'close'), and value (the contents of the tag). The result
is also stored in the $output property of the Messy
object.
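
To make the token shape concrete, here is a hand-written illustration using the five keys listed above. It was not captured from Messy: the level numbering and the fields of cdata tokens in particular are assumptions, and token_text_sketch() is a hypothetical helper.

```php
<?php
// Hand-written illustration for the input: <p class="x">Hi<br></p>
// NOT real Messy output; level numbering and the exact contents of
// cdata tokens are assumptions.
$tokens = array(
    array('tag' => 'p',  'attributes' => array('class' => 'x'), 'level' => 1, 'type' => 'open',     'value' => ''),
    array('tag' => '',   'attributes' => array(),               'level' => 2, 'type' => 'cdata',    'value' => 'Hi'),
    array('tag' => 'br', 'attributes' => array(),               'level' => 2, 'type' => 'complete', 'value' => ''),
    array('tag' => 'p',  'attributes' => array(),               'level' => 1, 'type' => 'close',    'value' => ''),
);

// Hypothetical helper: collect the character data from a token list.
function token_text_sketch(array $tokens)
{
    $text = '';
    foreach ($tokens as $token) {
        if ($token['type'] === 'cdata') {
            $text .= $token['value'];
        }
    }
    return $text;
}
```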


handle_data (&$parser, $data)

  • Access: private

The internal character data handler.


handle_comment (&$parser, $data)

  • Access: private

The internal comment handler.


handle_doctype (&$parser, $data)

  • Access: private

The internal doctype handler.


handle_start_tag (&$parser, $tag, $attrs = array ())

  • Access: private

The open and complete tag handler.


handle_end_tag (&$parser, $tag)

  • Access: private

The close tag handler.


pad ($length)

  • Access: public
  • Return: string

Returns a string of empty space, whose length
is determined by the $length parameter.
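
A function with this behaviour could look like the sketch below. pad_sketch() is a hypothetical equivalent, and it assumes the padding character is a plain space, which this reference does not specify.

```php
<?php
// Hypothetical equivalent of pad(); assumes the padding character is a
// plain space. Negative lengths are treated as zero.
function pad_sketch($length)
{
    return str_repeat(' ', max(0, (int) $length));
}
```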


toXML () 

  • Access: public
  • Return: string

Uses the internal $output array from a previous
call to parse() and returns an XML representation of
the document.


clean ($doc, $isXml = false)

  • Access: public
  • Return: string

Returns a "clean" version of the HTML or XML data
provided, by calling parse() and then toXML() for you
and returning the result.
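
A typical call might look like the following sketch. It assumes the Messy class has already been loaded by the caller (the include path is not documented here, so none is shown), and the wrapper function clean_with_messy_sketch() is hypothetical.

```php
<?php
// Hypothetical wrapper around the documented clean() method.
// Assumes the Messy class has been loaded elsewhere.
function clean_with_messy_sketch($html, $safe = true)
{
    if (!class_exists('Messy')) {
        return null;               // library not loaded
    }

    $messy = new Messy();
    $messy->safe = $safe;          // public $safe property of Messy
    return $messy->clean($html);   // documented as parse() then toXML()
}
```

Passing false as the second argument switches to the shorter stripTags and stripAttrs lists, which is only advisable for trusted input.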


&toXMLDoc () 

  • Access: public
  • Return: object reference

Uses the internal $output array from a previous
call to parse() and returns an XMLDoc object representation
of the document. If an error occurs, sets $error, $err_code,
$err_line, etc. from the SloppyDOM error values and returns
false. This can easily happen, because "cleaned up" markup
is not guaranteed to be correctly formatted markup.
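
Because false is a possible return value, callers should check the result before using it. The sketch below shows one way to do that; messy_to_doc_sketch() is a hypothetical wrapper, it assumes the Messy class is already loaded, and it assumes $error and $err_line are readable as public properties.

```php
<?php
// Hypothetical wrapper: parse, convert to an XMLDoc object, and turn a
// false return value into an exception. Assumes Messy is loaded elsewhere.
function messy_to_doc_sketch($html)
{
    if (!class_exists('Messy')) {
        return null;               // library not loaded
    }

    $messy = new Messy();
    $messy->parse($html);          // fills $output for toXMLDoc()
    $doc = $messy->toXMLDoc();

    if ($doc === false) {
        // $error and $err_line are the properties named in the description.
        throw new RuntimeException($messy->error . ' (line ' . $messy->err_line . ')');
    }

    return $doc;
}
```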
