The HTML DOM specification explicitly states that element and attribute names follow the semantics, including case-sensitivity, specified in the HTML 4 specification. In addition, section 1.2.1 of the HTML 4.01 specification states:
Element names are written in uppercase letters (e.g., BODY). Attribute names are written in lowercase letters (e.g., lang, onsubmit).
The Xerces HTML DOM implementation (used by default in the NekoHTML DOMParser class) follows this convention.
Therefore, even if the "http://cyberneko.org/html/properties/names/elems" property is set to "lower", the DOM will still uppercase the element names.
To get around this problem, instantiate a Xerces2 DOMParser
object using the NekoHTML parser configuration. By default, the
Xerces DOM parser class creates a standard XML DOM tree, not
an HTML DOM tree. Therefore, the element and attribute names
will follow the settings for the
"http://cyberneko.org/html/properties/names/elems" and
"http://cyberneko.org/html/properties/names/attrs" properties.
Be aware, however, that the application will then not be able to
cast the document nodes to the HTML DOM interfaces for accessing
the document's information.
The following sample code shows how to instantiate a DOM parser using the NekoHTML parser configuration:
// import org.apache.xerces.parsers.DOMParser;
// import org.cyberneko.html.HTMLConfiguration;
DOMParser parser = new DOMParser(new HTMLConfiguration());
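If the application needs to keep the HTML DOM interfaces, one possible alternative is to leave the element names uppercased and compare them without regard to case. The following is a minimal sketch that uses only the standard JDK DOM classes. The class name TagNames and its helper methods are hypothetical and are not part of NekoHTML or Xerces.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class; not part of NekoHTML or Xerces.
final class TagNames {

    private TagNames() {
    }

    /** Returns true if the node is an element whose name matches, ignoring case. */
    static boolean hasTagName(Node node, String name) {
        return node != null
            && node.getNodeType() == Node.ELEMENT_NODE
            && node.getNodeName().equalsIgnoreCase(name);
    }

    /** Creates an empty DOM document using the JDK's default parser factory. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        // The same check succeeds whether the tree stores "BODY" or "body".
        System.out.println(hasTagName(doc.createElement("BODY"), "body"));
        System.out.println(hasTagName(doc.createElement("body"), "BODY"));
    }
}
```

With a helper like this, code that walks the tree does not depend on whether the parser produced an HTML DOM tree with uppercase names or a standard XML DOM tree that follows the names properties.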
The NekoHTML parser has a property that allows you to append custom filter components at the end of the parser pipeline, as detailed in the Pipeline Filters documentation. Filters appended this way process the document after the tag-balancer has done its job. However, the same property can also be used to insert custom components before the tag-balancer.
The secret is to disable the tag-balancing feature and then add another instance of the HTMLTagBalancer component at the end of your custom filter pipeline. The following example shows how to add a custom filter before the tag-balancer in the DOM parser. (This also works on all other types of parsers that use the HTMLConfiguration.)
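// import org.cyberneko.html.HTMLConfiguration;
// import org.cyberneko.html.HTMLTagBalancer;
// import org.cyberneko.html.parsers.DOMParser;
// import org.apache.xerces.xni.parser.XMLDocumentFilter;
DOMParser parser = new DOMParser();
// Disable the built-in tag-balancer so that it can be re-added after the custom filter.
parser.setFeature("http://cyberneko.org/html/features/balance-tags", false);
// MyFilter stands for the application's own XMLDocumentFilter implementation.
XMLDocumentFilter[] filters = { new MyFilter(), new HTMLTagBalancer() };
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);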
Frequently, HTML is used within applications and online forms to allow users to enter rich-text. In these situations, it is useful to be able to parse the entered text as a document fragment. In other words, the entered text represents content within the HTML <body> element — it is not a full HTML document.
Starting with version 0.7.0, NekoHTML has added an experimental feature that allows the application to parse HTML document fragments. Setting the "http://cyberneko.org/features/document-fragment" feature to true instructs the tag-balancer to balance only tags found within the HTML <body> element. The surrounding <body> and <html> elements are not inserted.
Note: The document-fragment feature should not be used on the DOMParser class, because that class relies on balanced elements in order to correctly construct the DOM tree. However, a new parser class has been added to NekoHTML to allow you to parse DOM document fragments. Please refer to the Usage Instructions for more information.
While NekoHTML is a rather small library, many users complain about the size of the Xerces2 library. However, the full Xerces2 library is not required in order to use the NekoHTML parser. Because the CyberNeko HTML parser is written using the Xerces Native Interface (XNI) framework that forms the foundation of the Xerces2 implementation, only that part is required to write applications using NekoHTML.
For convenience, a small Jar file containing only the necessary parts of the framework and utility classes from Xerces2 is distributed with the NekoHTML package. The Jar file, called xercesMinimal.jar, can be found in the lib/ directory of the distribution. Simply add this file to your classpath along with nekohtml.jar.
However, there are a few restrictions if you choose to use the xercesMinimal.jar file instead of the full Xerces2 package. First, you cannot use the DOM and SAX parsers included with NekoHTML because they use the Xerces2 base classes. Second, because you cannot use the convenience parser classes, your application must be written using the XNI framework. However, using the XNI framework is not difficult for programmers familiar with SAX. [Note: future versions of NekoHTML may include custom implementations of the DOM and SAX parsers to avoid this dependence on the Xerces2 library.]
Most users of the CyberNeko HTML parser will not have a problem including the full Xerces2 package because the application is likely to need an XML parser implementation. However, for users who are concerned about Jar file size, the xercesMinimal.jar file may be a useful alternative.