Class: Messy extends XML_HTMLSax3
- Package: saf.HTML
Messy is a class that parses and cleans up invalid HTML markup.
Properties
$comment = false
Whether the parser is in a comment or not.
$block = false
Whether the parser is in a block or not.
$list = array ()
- Access: public
The structure created from the previous call to parse().
$output = array ()
- Access: public
The output from the last call to parse().
$selfClosing = array (
'img',
'br',
'hr',
'meta',
'link',
'area',
)
- Access: public
Contains a list of tags that are self-closing (i.e.,
they do not contain any data, such as a br tag).
$stripTags = array (
'font',
'spacer',
'blink',
'xml:namespace',
'o:p',
'st1:city',
'st1:address',
'st1:street',
'st1:state',
'st1:place',
'st1:placename',
'st1:placetype',
'st1:personname',
'st1:country-region',
'v:shapetype',
'span',
'del',
'frame',
'frameset',
'layer',
'ilayer',
'link',
'meta',
'xml',
'minmax_bound',
'place',
'placename',
'placetype',
'city',
'state',
'street',
'personname',
'country-region',
)
- Access: public
Contains a list of tags that should be stripped
from the output when $safe is false.
$stripTagsSafe = array (
'font',
'spacer',
'blink',
'xml:namespace',
'o:p',
'st1:city',
'st1:address',
'st1:street',
'st1:state',
'st1:place',
'st1:placename',
'st1:placetype',
'st1:personname',
'st1:country-region',
'v:shapetype',
'span',
'del',
'script',
'applet',
'object',
'iframe',
'frame',
'frameset',
'layer',
'ilayer',
'embed',
'bgsound',
'link',
'meta',
'xml',
'minmax_bound',
'place',
'placename',
'placetype',
'city',
'state',
'street',
'personname',
'country-region',
)
- Access: public
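Contains a list of tags that should be stripped
from the output when $safe is true. In addition to the
$stripTags entries, it includes tags considered potentially
unsafe: script, applet, object, iframe, embed, and bgsound.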
$stripAttrs = array (
)
- Access: public
Contains a list of attributes that should be stripped
from the output when $safe is false. Empty by default.
$stripAttrsSafe = array (
'onclick',
'onsubmit',
'onselect',
'onchange',
'onmouseover',
'onmouseout',
'onfocus',
'onblur',
'ondblclick',
'onhelp',
'onkeydown',
'onkeypress',
'onkeyup',
'onmousedown',
'onmousemove',
'onmouseup',
'onresize',
'dataformatas',
'data',
'datafld',
'datasrc',
'dynsrc',
)
- Access: public
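Contains a list of attributes that should be stripped
from the output when $safe is true. The defaults are inline
event handlers (onclick, onmouseover, etc.) and data-binding
attributes (datafld, datasrc, dynsrc, etc.).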
$transform = array (
'b' => 'strong', // direct switches can just list the new tag name
'i' => 'em',
'center' => array ( // array allows you to set attributes on transformations
'tag' => 'div',
'attrs' => array (
'align' => 'center',
),
),
)
- Access: public
Contains a list of tags that should be transformed
into other tags in the output.
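The two rule forms in $transform can be read as follows. This is a minimal sketch only: applyTransform() is a hypothetical helper written for illustration, not a Messy method, and merging the rule's attributes over the tag's existing ones is an assumption about how such a rule might be applied.

```php
<?php
// Hypothetical helper (not part of Messy): applies a $transform-style map
// to a single tag and returns array(newTagName, newAttributes).
function applyTransform(array $transform, string $tag, array $attrs = array()): array
{
    if (!isset($transform[$tag])) {
        return array($tag, $attrs); // no rule for this tag: leave it untouched
    }
    $rule = $transform[$tag];
    if (is_string($rule)) {
        return array($rule, $attrs); // direct switch: only the tag name changes
    }
    // Array rule: a new tag name plus attributes to set on it.
    // Assumption: rule attributes override existing ones of the same name.
    $extra = isset($rule['attrs']) ? $rule['attrs'] : array();
    return array($rule['tag'], array_merge($attrs, $extra));
}

$transform = array(
    'b' => 'strong',
    'i' => 'em',
    'center' => array('tag' => 'div', 'attrs' => array('align' => 'center')),
);

list($tag, $attrs) = applyTransform($transform, 'center', array('class' => 'intro'));
echo $tag, ' ', json_encode($attrs), "\n";
```

A string rule keeps the tag's attributes as they are, while an array rule is the only form that can add attributes such as align.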
$levels = array ()
- Access: public
This array is used to compare opening and closing
tags within the document structure, and to try to repair
them by inserting missing tags where necessary.
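One common way to do this kind of repair is to keep a stack of currently open tags. The sketch below illustrates that general technique under stated assumptions; repairNesting() and its event format are hypothetical, and this reference does not say whether Messy's own use of $levels works this way.

```php
<?php
// Hypothetical sketch: repairs tag nesting with a stack of open tag names.
// $events is a list of array('open'|'close', tagName) pairs.
function repairNesting(array $events): array
{
    $stack = array(); // names of tags that are currently open
    $out = array();   // repaired markup, one tag per element
    foreach ($events as $event) {
        list($type, $tag) = $event;
        if ($type === 'open') {
            $stack[] = $tag;
            $out[] = "<$tag>";
            continue;
        }
        if (!in_array($tag, $stack, true)) {
            continue; // stray close tag with no matching open tag: drop it
        }
        // Insert the missing close tags for anything still open inside $tag.
        while (($top = array_pop($stack)) !== $tag) {
            $out[] = "</$top>";
        }
        $out[] = "</$tag>";
    }
    // Close whatever is still open at the end of the document.
    while ($stack) {
        $out[] = '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo implode('', repairNesting(array(
    array('open', 'div'),
    array('open', 'p'),
    array('open', 'em'),
    array('close', 'p'),    // em was never closed
    array('close', 'span'), // span was never opened
))), "\n";
```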
$safe = true
- Access: public
Tells Messy whether to use the stripTags and stripAttrs lists
(false) or the stripTagsSafe and stripAttrsSafe lists (true), which
contain additional tags and attributes that are considered
potentially unsafe. The default, true, uses the latter, more
secure lists.
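The relationship between $safe and the four strip lists can be sketched as below. StripConfig and its two methods are hypothetical stand-ins written for illustration, with shortened lists; the lower-casing of names before lookup is an assumption, not documented Messy behaviour.

```php
<?php
// Hypothetical stand-in for the documented properties (lists shortened).
class StripConfig
{
    public $stripTags = array('font', 'span');
    public $stripTagsSafe = array('font', 'span', 'script', 'iframe');
    public $stripAttrs = array();
    public $stripAttrsSafe = array('onclick', 'onmouseover');
    public $safe = true;

    // Safe mode consults the *Safe list, otherwise the shorter list.
    public function shouldStripTag($tag)
    {
        $list = $this->safe ? $this->stripTagsSafe : $this->stripTags;
        return in_array(strtolower($tag), $list, true); // assumption: case-insensitive
    }

    public function shouldStripAttr($name)
    {
        $list = $this->safe ? $this->stripAttrsSafe : $this->stripAttrs;
        return in_array(strtolower($name), $list, true);
    }
}

$config = new StripConfig();
var_dump($config->shouldStripTag('script'));   // stripped in safe mode
$config->safe = false;
var_dump($config->shouldStripTag('script'));   // kept when $safe is false
var_dump($config->shouldStripAttr('onclick')); // kept: $stripAttrs is empty
```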
Methods
Messy ()
- Access: public
Constructor method.
parse ($data, $isXml = false)
- Access: public
- Return: array
Parses the given HTML or XML $data into an array of
"tokens". Each token is an associative array with the following
properties: tag (the name of the tag), attributes (a key/value
array of the tag's attributes), level (the depth of the tag
within the document), type ('open', 'complete' for a
self-closing tag, 'cdata' for character data, or 'close'),
and value (the contents of the tag). The array is also stored
in the $output property of the Messy object.
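To make the token shape concrete, here is a hand-written example in that form together with a small walker. The token values are illustrative and were not captured from a real parse() call; the 'value' key name, the 1-based level numbering, and the tag carried by the cdata token are assumptions, and outlineTokens() is a hypothetical helper.

```php
<?php
// Hand-written tokens in the documented shape (illustrative values only).
$tokens = array(
    array('tag' => 'p', 'attributes' => array('class' => 'intro'),
          'level' => 1, 'type' => 'open', 'value' => ''),
    array('tag' => 'br', 'attributes' => array(),
          'level' => 2, 'type' => 'complete', 'value' => ''),
    array('tag' => 'p', 'attributes' => array(),
          'level' => 2, 'type' => 'cdata', 'value' => 'Hello'),
    array('tag' => 'p', 'attributes' => array(),
          'level' => 1, 'type' => 'close', 'value' => ''),
);

// Hypothetical helper: renders one indented outline line per token.
function outlineTokens(array $tokens): array
{
    $lines = array();
    foreach ($tokens as $token) {
        $indent = str_repeat('  ', max(0, $token['level'] - 1));
        $label = $token['type'] === 'cdata'
            ? 'text: ' . $token['value']
            : $token['type'] . ' ' . $token['tag'];
        $lines[] = $indent . $label;
    }
    return $lines;
}

echo implode("\n", outlineTokens($tokens)), "\n";
```

Because the list is flat, nesting has to be recovered from level and type rather than from the array structure itself.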
handle_data (&$parser, $data)
- Access: private
The internal character data handler.
handle_comment (&$parser, $data)
- Access: private
The internal comment handler.
handle_doctype (&$parser, $data)
- Access: private
The internal doctype handler.
handle_start_tag (&$parser, $tag, $attrs = array ())
- Access: private
The open and complete tag handler.
handle_end_tag (&$parser, $tag)
- Access: private
The close tag handler.
pad ($length)
- Access: public
- Return: string
Returns a string of blank space whose length
is determined by the $length parameter.
toXML ()
- Access: public
- Return: string
Uses the internal $output array from a previous
call to parse() and returns an XML representation of
the document.
clean ($doc, $isXml = false)
- Access: public
- Return: string
Returns a "clean" version of the HTML or XML data
provided by calling parse() and then toXML() for you
and returning the result.
&toXMLDoc ()
- Access: public
- Return: object reference
Uses the internal $output array from a previous
call to parse() and returns an XMLDoc object representation
of the document. If an error occurs, sets $error, $err_code,
$err_line, etc. from the SloppyDOM error values and returns
false. Errors are quite possible, because there is no
guarantee that "cleaned up" markup is correctly formatted
markup.
