Messy is a package that parses and cleans invalid HTML markup.
Properties
$comment = false
Whether the parser is in a comment or not.
$block = false
Whether the parser is in a block or not.
$list = array ()
The structure created from the previous call to parse().
$output = array ()
The output from the last call to parse().
$selfClosing = array ( 'img', 'br', 'hr', 'meta', 'link', 'area', )
Contains a list of tags that are self-closing (i.e.,
they do not contain any data, such as a br tag).
$stripTags = array ( 'font', 'spacer', 'blink', 'xml:namespace', 'o:p', 'st1:city', 'st1:address', 'st1:street', 'st1:state', 'st1:place', 'st1:placename', 'st1:placetype', 'st1:personname', 'st1:country-region', 'v:shapetype', 'span', 'del', 'frame', 'frameset', 'layer', 'ilayer', 'link', 'meta', 'xml', 'minmax_bound', 'place', 'placename', 'placetype', 'city', 'state', 'street', 'personname', 'country-region', )
Contains a list of tags that should be stripped
from the output when $safe is false.
$stripTagsSafe = array ( 'font', 'spacer', 'blink', 'xml:namespace', 'o:p', 'st1:city', 'st1:address', 'st1:street', 'st1:state', 'st1:place', 'st1:placename', 'st1:placetype', 'st1:personname', 'st1:country-region', 'v:shapetype', 'span', 'del', 'script', 'applet', 'object', 'iframe', 'frame', 'frameset', 'layer', 'ilayer', 'embed', 'bgsound', 'link', 'meta', 'xml', 'minmax_bound', 'place', 'placename', 'placetype', 'city', 'state', 'street', 'personname', 'country-region', )
Contains a list of tags that should be stripped
from the output when $safe is true. It extends $stripTags
with script, applet, object, iframe, embed, and bgsound.
$stripAttrs = array ( )
Contains a list of attributes that should be stripped
from the output when $safe is false. Empty by default.
$stripAttrsSafe = array ( 'onclick', 'onsubmit', 'onselect', 'onchange', 'onmouseover', 'onmouseout', 'onfocus', 'onblur', 'ondblclick', 'onhelp', 'onkeydown', 'onkeypress', 'onkeyup', 'onmousedown', 'onmousemove', 'onmouseup', 'onresize', 'dataformatas', 'data', 'datafld', 'datasrc', 'dynsrc', )
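Contains a list of attributes that should be stripped
from the output when $safe is true. The defaults cover inline
event handlers (onclick, onmouseover, and so on) and a few
legacy attributes such as datasrc and dynsrc.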
$transform = array (
    'b' => 'strong', // direct switches can just list the new tag name
    'i' => 'em',
    'center' => array ( // array allows you to set attributes on transformations
        'tag' => 'div',
        'attrs' => array (
            'align' => 'center',
        ),
    ),
)
Contains a list of tags that should be transformed
into other tags in the output.
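To illustrate the two entry shapes in $transform, here is a minimal sketch of how such a map could be applied to a single tag. The applyTransform() helper is hypothetical and is not part of Messy. Merging the rule's attrs over the tag's existing attributes is also an assumption; the package may handle this differently.

```php
<?php
// Hypothetical helper, not part of Messy: applies a $transform-style map to one tag.
function applyTransform(array $transform, string $tag, array $attrs = array()): array
{
    if (!isset($transform[$tag])) {
        // No rule for this tag: leave it unchanged.
        return array('tag' => $tag, 'attrs' => $attrs);
    }

    $rule = $transform[$tag];
    if (is_string($rule)) {
        // Direct switch: only the tag name changes.
        return array('tag' => $rule, 'attrs' => $attrs);
    }

    // Array form: new tag name plus attributes to set (assumed to be merged
    // over the existing attributes).
    $extra = isset($rule['attrs']) ? $rule['attrs'] : array();
    return array('tag' => $rule['tag'], 'attrs' => array_merge($attrs, $extra));
}

// Same shape as the documented default value of $transform.
$transform = array(
    'b' => 'strong',
    'i' => 'em',
    'center' => array('tag' => 'div', 'attrs' => array('align' => 'center')),
);

$result = applyTransform($transform, 'center', array('class' => 'intro'));
// $result is array('tag' => 'div', 'attrs' => array('class' => 'intro', 'align' => 'center'))
```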
$levels = array ()
This array is used to compare opening and closing
tags within the document structure, and to try to repair
them by inserting missing tags where necessary.
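The general idea of repairing mismatched tags can be sketched with a stack of open tags. This is an illustration of the technique only, not Messy's actual algorithm. The balanceTags() function and its event format are hypothetical.

```php
<?php
// Hypothetical sketch, not Messy's implementation: balances a flat list of
// array(type, tag) events, where type is 'open' or 'close'.
function balanceTags(array $events): array
{
    $stack = array(); // tags that are currently open
    $out = array();

    foreach ($events as $event) {
        list($type, $tag) = $event;

        if ($type === 'open') {
            $stack[] = $tag;
            $out[] = $event;
            continue;
        }

        // Close event with no matching open tag: drop it.
        if (!in_array($tag, $stack, true)) {
            continue;
        }

        // Insert the missing close tags for anything still open inside $tag.
        while (($top = array_pop($stack)) !== $tag) {
            $out[] = array('close', $top);
        }
        $out[] = array('close', $tag);
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        $out[] = array('close', array_pop($stack));
    }

    return $out;
}

$fixed = balanceTags(array(
    array('open', 'p'),
    array('open', 'strong'),
    array('close', 'p'),   // strong was never closed
    array('close', 'em'),  // em was never opened
    array('open', 'div'),  // div is never closed
));
```

Here $fixed gains a close event for strong before the close of p, loses the stray close of em, and ends with a close event for div.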
$safe = true
Tells Messy which strip lists to use. When false, the stripTags
and stripAttrs lists apply. When true (the default), the
stripTagsSafe and stripAttrsSafe lists apply instead; these contain
additional tags and attributes that are considered potentially
unsafe.
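The selection rule described for $safe can be sketched as follows. pickStripLists() is a hypothetical helper written for illustration, and the lists are abbreviated versions of the defaults documented above.

```php
<?php
// Hypothetical helper, not part of Messy: mirrors the documented $safe rule.
function pickStripLists(bool $safe, array $config): array
{
    return $safe
        ? array('tags' => $config['stripTagsSafe'], 'attrs' => $config['stripAttrsSafe'])
        : array('tags' => $config['stripTags'], 'attrs' => $config['stripAttrs']);
}

// Abbreviated lists, for illustration only.
$config = array(
    'stripTags'      => array('font', 'blink'),
    'stripTagsSafe'  => array('font', 'blink', 'script', 'iframe'),
    'stripAttrs'     => array(),
    'stripAttrsSafe' => array('onclick', 'onmouseover'),
);

$safeLists  = pickStripLists(true, $config);  // includes script and onclick
$plainLists = pickStripLists(false, $config); // leaves script and onclick alone
```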
Methods
Messy ()
Constructor method.
parse ($data, $isXml = false)
- Access: public
- Return: array
Parses the given HTML or XML $data into an array of
"tokens". Each token is an associative array with the following
properties: tag (the name of the tag), attributes (a key/value
array of the tag's attributes), level (the depth of the tag
within the document), type ('open', 'complete' for a
self-closing tag, 'cdata' for character data, or 'close'),
and the value of the tag (its contents).
The array is also stored in the $output property of your Messy
object.
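As a rough picture of the token structure, the sketch below walks a hand-written token list. The tokens are illustrative and are not real parse() output. The exact key spellings ('attributes', 'level', 'value'), the level numbering starting at 1, and the empty tag name on cdata tokens are all assumptions. describeTokens() is a hypothetical helper.

```php
<?php
// Hand-written tokens in the documented shape; NOT actual parse() output.
// Key names and level numbering are assumptions for illustration.
$tokens = array(
    array('tag' => 'p',  'attributes' => array('class' => 'intro'), 'level' => 1, 'type' => 'open',     'value' => ''),
    array('tag' => '',   'attributes' => array(),                   'level' => 2, 'type' => 'cdata',    'value' => 'Hello'),
    array('tag' => 'br', 'attributes' => array(),                   'level' => 2, 'type' => 'complete', 'value' => ''),
    array('tag' => 'p',  'attributes' => array(),                   'level' => 1, 'type' => 'close',    'value' => ''),
);

// Hypothetical helper: renders one indented line per token.
function describeTokens(array $tokens): array
{
    $lines = array();
    foreach ($tokens as $token) {
        $indent = str_repeat('  ', max(0, $token['level'] - 1));
        $label = $token['type'] === 'cdata'
            ? 'text: ' . $token['value']
            : $token['type'] . ' ' . $token['tag'];
        $lines[] = $indent . $label;
    }
    return $lines;
}

$lines = describeTokens($tokens);
```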
handle_data (&$parser, $data)
The internal character data handler.
handle_comment (&$parser, $data)
The internal comment handler.
handle_doctype (&$parser, $data)
The internal doctype handler.
handle_start_tag (&$parser, $tag, $attrs = array ())
The open and complete tag handler.
handle_end_tag (&$parser, $tag)
The close tag handler.
pad ($length)
- Access: public
- Return: string
Returns a string of empty space, whose length
is determined by the $length parameter.
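A function with this behaviour could look like the sketch below. It assumes the padding character is a plain space and that negative lengths yield an empty string; neither detail is stated in this reference. padSketch() is a hypothetical stand-in, not the package's own code.

```php
<?php
// Hypothetical stand-in for pad(): assumes padding with plain spaces.
function padSketch($length): string
{
    // Clamp to zero so a negative length cannot make str_repeat() fail.
    return str_repeat(' ', max(0, (int) $length));
}

$indent = padSketch(4); // four spaces
```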
toXML ()
- Access: public
- Return: string
Uses the internal $output array from a previous
call to parse() and returns an XML representation of
the document.
clean ($doc, $isXml = false)
- Access: public
- Return: string
Returns a "clean" version of the HTML or XML data
provided, by calling parse() and then toXML() for you
and returning the result.
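A typical call might look like the sketch below. It uses only the constructor, the $safe property, and clean() as documented here. The cleanMarkup() wrapper is hypothetical, and the sketch assumes the file defining Messy has already been loaded by your application, since the include path is not given in this reference.

```php
<?php
// Hypothetical wrapper. Assumes the Messy class has already been loaded;
// returns null when it has not.
function cleanMarkup($html)
{
    if (!class_exists('Messy')) {
        return null;
    }

    $messy = new Messy();
    $messy->safe = true; // the documented default, set explicitly for clarity

    // clean() calls parse() and then toXML() and returns the resulting string.
    return $messy->clean($html);
}

$cleaned = cleanMarkup('<p><b>Unclosed <i>markup</p>');
```

Setting $messy->safe to false before calling clean() would switch to the shorter stripTags and stripAttrs lists.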
&toXMLDoc ()
- Access: public
- Return: object reference
Uses the internal $output array from a previous
call to parse() and returns an XMLDoc object representation
of the document. Should an error occur, sets $error, $err_code,
$err_line, etc. from the SloppyDOM error values and returns
false. Errors can easily happen, because there is
no guarantee that "cleaned up" markup is correctly
formatted markup.