com.google.streamhtmlparser
Interface HtmlParser

All Superinterfaces:
Parser
All Known Implementing Classes:
HtmlParserImpl

public interface HtmlParser
extends Parser

Methods exposed for HTML parsing of text, to facilitate the implementation of automatic context-aware escaping. The HTML parser also embeds a Javascript parser for processing Javascript fragments. In the future, it will also embed other specialized parsers and will hence most likely remain the main interface for callers of this package.

Note: These are the exact methods exposed in the original C++ Parser. Only the names are modified, to conform to Java naming conventions.


Nested Class Summary
static class HtmlParser.ATTR_TYPE
          Indicates the type of HTML attribute that the parser is currently in or NONE if the parser is not currently in an attribute.
static class HtmlParser.Mode
          The Parser Mode requested for parsing a given template.
 
Field Summary
static ExternalState STATE_ATTR
           
static ExternalState STATE_COMMENT
           
static ExternalState STATE_CSS_FILE
           
static ExternalState STATE_JS_FILE
           
static ExternalState STATE_TAG
           
static ExternalState STATE_TEXT
          All the states in which the parser can be.
static ExternalState STATE_VALUE
           
 
Fields inherited from interface com.google.streamhtmlparser.Parser
STATE_ERROR
 
Method Summary
 String getAttribute()
          Returns the name of the HTML attribute the parser is currently processing.
 HtmlParser.ATTR_TYPE getAttributeType()
          Returns the type of the attribute that the parser is in or ATTR_TYPE.NONE if we are not parsing an attribute.
 ExternalState getJavascriptState()
          Returns the state the Javascript parser is in.
 String getTag()
          Returns the name of the HTML tag if the parser is currently within one.
 String getValue()
          Returns the value of an HTML attribute if the parser is currently within one.
 int getValueIndex()
          Returns the current position of the parser within the HTML attribute value, zero being the position of the first character in the value.
 boolean inAttribute()
          Returns true if and only if the parser is currently within an attribute, be it within the attribute name or the attribute value.
 boolean inCss()
          Returns true if and only if the parser is currently within a CSS context.
 boolean inJavascript()
          Returns true if the parser is currently processing Javascript.
 void insertText()
          A specialized directive to tell the parser there is some content that will be inserted here but that it will not get to parse.
 boolean isAttributeQuoted()
          Returns true if and only if the parser is currently within an attribute value and that attribute value is quoted.
 boolean isJavascriptQuoted()
          Returns true if the parser is currently processing a Javascript literal that is quoted.
 boolean isUrlStart()
          Returns true if and only if the current position of the parser is at the start of a URL HTML attribute value.
 void resetMode(HtmlParser.Mode mode)
          Resets the state of the parser, allowing for reuse of the HtmlParser object.
 
Methods inherited from interface com.google.streamhtmlparser.Parser
getColumnNumber, getLineNumber, getState, parse, parse, reset, setColumnNumber, setLineNumber
 

Field Detail

STATE_TEXT

static final ExternalState STATE_TEXT
All the states in which the parser can be. These are external states. The parser has many more internal states that are not exposed and which are instead mapped to one of these external ones.

  STATE_TEXT - the parser is in HTML proper.
  STATE_TAG - the parser is inside an HTML tag name.
  STATE_COMMENT - the parser is inside an HTML comment.
  STATE_ATTR - the parser is inside an HTML attribute name.
  STATE_VALUE - the parser is inside an HTML attribute value.
  STATE_JS_FILE - the parser is inside Javascript code.
  STATE_CSS_FILE - the parser is inside CSS code.

All these states map exactly to those exposed in the C++ (original) version of the HtmlParser.


STATE_TAG

static final ExternalState STATE_TAG

STATE_COMMENT

static final ExternalState STATE_COMMENT

STATE_ATTR

static final ExternalState STATE_ATTR

STATE_VALUE

static final ExternalState STATE_VALUE

STATE_JS_FILE

static final ExternalState STATE_JS_FILE

STATE_CSS_FILE

static final ExternalState STATE_CSS_FILE
Method Detail

inJavascript

boolean inJavascript()
Returns true if the parser is currently processing Javascript. This is the case if and only if the parser is processing an attribute that takes Javascript, is inside a Javascript script block, or was (re)set with HtmlParser.Mode.JS.

Returns:
true if the parser is processing Javascript, false otherwise

isJavascriptQuoted

boolean isJavascriptQuoted()
Returns true if the parser is currently processing a Javascript literal that is quoted. The caller will typically invoke this method after determining that the parser is processing Javascript. Knowing whether the literal is quoted helps determine which escaping to apply to it when needed.

Returns:
true if and only if the parser is inside a quoted Javascript literal
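
As an illustration of why the quoted/unquoted distinction matters, here is a minimal sketch of a caller-side helper. The class name JsValueEscaping, the method escapeForJs and the escaping rules themselves are assumptions made for this example and are not part of this package; the boolean parameter stands in for the result of isJavascriptQuoted().

```java
final class JsValueEscaping {
    /**
     * Escapes a value for insertion into Javascript.
     * inQuotedLiteral stands in for the result of HtmlParser.isJavascriptQuoted().
     * The rules below are an illustrative policy, not the library's behavior.
     */
    static String escapeForJs(boolean inQuotedLiteral, String value) {
        if (!inQuotedLiteral) {
            // Outside quotes, string escaping cannot help: only let through values
            // that are plainly a number, a boolean or null, and fall back to null.
            return value.matches("-?\\d+(\\.\\d+)?|true|false|null") ? value : "null";
        }
        // Inside quotes, neutralize characters that could end the literal or the script block.
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\'': out.append("\\x27"); break;
                case '"':  out.append("\\x22"); break;
                case '<':  out.append("\\x3c"); break;
                case '>':  out.append("\\x3e"); break;
                case '&':  out.append("\\x26"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

A caller would typically check inJavascript() first and then pass the result of isJavascriptQuoted() as the first argument.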

inAttribute

boolean inAttribute()
Returns true if and only if the parser is currently within an attribute, be it within the attribute name or the attribute value.

Returns:
true if and only if inside an attribute

inCss

boolean inCss()
Returns true if and only if the parser is currently within a CSS context. A CSS context is one of the below:

Returns:
true if and only if the parser is inside CSS

getAttributeType

HtmlParser.ATTR_TYPE getAttributeType()
Returns the type of the attribute that the parser is in or ATTR_TYPE.NONE if we are not parsing an attribute. The caller will typically invoke this method after determining that the parser is processing an attribute.

This is useful to determine which escaping to apply based on the type of value this attribute expects.

Returns:
type of the attribute
See Also:
HtmlParser.ATTR_TYPE
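
A minimal sketch of how a caller might turn the attribute type into an escaping decision follows. The enum AttrType is a stand-in for HtmlParser.ATTR_TYPE so that the example compiles on its own; only NONE and URI are named on this page, and the REGULAR, JS and STYLE constants, as well as the Escaper labels, are assumptions made for illustration.

```java
final class AttributeEscaping {
    /** Stand-in for HtmlParser.ATTR_TYPE; constants other than NONE and URI are assumed. */
    enum AttrType { NONE, REGULAR, URI, JS, STYLE }

    /** Hypothetical escaper labels; a template system would map these to real escaping functions. */
    enum Escaper { HTML, HTML_ATTRIBUTE, URL, JAVASCRIPT, CSS }

    /** Picks an escaper based on the kind of value the current attribute expects. */
    static Escaper escaperFor(AttrType type) {
        switch (type) {
            case URI:     return Escaper.URL;
            case JS:      return Escaper.JAVASCRIPT;
            case STYLE:   return Escaper.CSS;
            case REGULAR: return Escaper.HTML_ATTRIBUTE;
            case NONE:
            default:      return Escaper.HTML; // not in an attribute: plain HTML escaping
        }
    }
}
```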

isAttributeQuoted

boolean isAttributeQuoted()
Returns true if and only if the parser is currently within an attribute value and that attribute value is quoted.

Returns:
true if and only if the attribute value is quoted

getTag

String getTag()
Returns the name of the HTML tag if the parser is currently within one. Note that the name may be incomplete if the parser is still parsing the name. Returns an empty String if the parser is not in a tag, as determined by Parser.getState().

Returns:
the name of the HTML tag or an empty String if we are not within an HTML tag

getAttribute

String getAttribute()
Returns the name of the HTML attribute the parser is currently processing. If the parser is still parsing the name, then the returned name may be incomplete. Returns an empty String if the parser is not in an attribute, as determined by Parser.getState().

Returns:
the name of the HTML attribute or an empty String if we are not within an HTML attribute

getValue

String getValue()
Returns the value of an HTML attribute if the parser is currently within one. If the parser is currently parsing the value, the returned value may be incomplete. The caller will typically first determine that the parser is processing a value by calling Parser.getState().

Returns:
the value, which may be an empty String if the parser is not in an HTML attribute value

getValueIndex

int getValueIndex()
Returns the current position of the parser within the HTML attribute value, zero being the position of the first character in the value. The caller will typically first determine that the parser is processing a value by calling Parser.getState().

Returns:
the index or zero if the parser is not processing a value

isUrlStart

boolean isUrlStart()
Returns true if and only if the current position of the parser is at the start of a URL HTML attribute value. This is the case when the following three conditions are all met:

  1. The parser is in an HTML attribute value.
  2. The HTML attribute expects a URL, as determined by getAttributeType() returning HtmlParser.ATTR_TYPE.URI.
  3. The parser has not yet seen any characters from that URL.

This method may be used by an HTML sanitizer or an auto-escape system to determine whether to validate the URL for well-formedness and to check that the scheme of the URL (e.g. HTTP, HTTPS) is safe. In particular, it is recommended to use this method instead of checking that getValueIndex() is 0, in order to support attribute types where the URL does not start at index zero, such as the content attribute of the meta HTML tag.

Returns:
true if and only if the parser is at the start of the URL
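
The scheme check described above could look like the following sketch. The class UrlStartCheck, the method isSafeToInsert and the allowlist of schemes are assumptions made for this example (the safe set is application policy, not something this package defines); the boolean parameter stands in for the result of isUrlStart().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class UrlStartCheck {
    // Assumption: which schemes count as safe is decided by the application.
    private static final Set<String> SAFE_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https", "mailto"));

    /**
     * Decides whether a value may be inserted at the current position.
     * atUrlStart stands in for the result of HtmlParser.isUrlStart().
     */
    static boolean isSafeToInsert(boolean atUrlStart, String value) {
        if (!atUrlStart) {
            return true; // not at the start of a URL, so the scheme check does not apply
        }
        int colon = value.indexOf(':');
        if (colon < 0) {
            return true; // no scheme at all, e.g. a relative URL
        }
        // A '/', '?' or '#' before the first ':' means the colon does not delimit a scheme.
        for (int i = 0; i < colon; i++) {
            char c = value.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                return true;
            }
        }
        String scheme = value.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        return SAFE_SCHEMES.contains(scheme);
    }
}
```

Using an allowlist rather than a blocklist means that unusual spellings of a dangerous scheme are rejected by default instead of having to be anticipated.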

resetMode

void resetMode(HtmlParser.Mode mode)
Resets the state of the parser, allowing for reuse of the HtmlParser object.

See the HtmlParser.Mode enum for information on all the valid modes.

Parameters:
mode - an enum representing the high-level state of the parser

insertText

void insertText()
                throws ParseException
A specialized directive to tell the parser there is some content that will be inserted here but that it will not get to parse. It is used by a template system that cannot pass some content to the parser but wants the parser to know that content will typically be inserted at that point. This is a hint used in corner cases within the parsing of HTML attribute names and values, where content the parser does not get to see could affect parsing and alter the current state.

Throws a ParseException if and only if the parser encountered a fatal error which prevents it from continuing further parsing.

Note: This differs from the C++ Parser, whose equivalent call always returns true.

Throws:
ParseException - if an unrecoverable error occurred during parsing
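
The following sketch shows how a template system might interleave literal text with insertText() calls. The interface ParserLike is a hypothetical stand-in that mirrors only the parse and insertText methods listed on this page, with exact signatures and checked exceptions omitted; the {{...}} placeholder syntax and the class TemplateFeeder are likewise assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class TemplateFeeder {
    /** Hypothetical stand-in for the parser; not the library's actual interface. */
    interface ParserLike {
        void parse(String html);
        void insertText();
    }

    /** Feeds literal chunks to the parser and signals each {{...}} placeholder with insertText(). */
    static void feed(String template, ParserLike parser) {
        int pos = 0;
        while (pos < template.length()) {
            int open = template.indexOf("{{", pos);
            int close = open < 0 ? -1 : template.indexOf("}}", open + 2);
            if (close < 0) {
                // No further complete placeholder: the rest is literal text.
                parser.parse(template.substring(pos));
                return;
            }
            if (open > pos) {
                parser.parse(template.substring(pos, open));
            }
            parser.insertText(); // content goes here that the parser will not see
            pos = close + 2;
        }
    }

    /** Records the calls made by feed() so the sketch can be exercised without the real library. */
    static List<String> trace(String template) {
        final List<String> calls = new ArrayList<>();
        feed(template, new ParserLike() {
            public void parse(String html) { calls.add("parse:" + html); }
            public void insertText() { calls.add("insertText"); }
        });
        return calls;
    }
}
```

For the template `<a href="{{url}}">`, trace records a parse call for the text before the placeholder, one insertText call, and a parse call for the remaining text.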

getJavascriptState

ExternalState getJavascriptState()
Returns the state the Javascript parser is in.

See JavascriptParser for more information on the valid external states. The caller will typically first determine that the parser is processing Javascript and then invoke this method to obtain more fine-grained state information.

Returns:
external state of the Javascript parser


Copyright © 2010-2012 Google. All Rights Reserved.