com.google.streamhtmlparser.impl
Class HtmlParserImpl

java.lang.Object
  extended by com.google.streamhtmlparser.impl.GenericParser
      extended by com.google.streamhtmlparser.impl.HtmlParserImpl
All Implemented Interfaces:
HtmlParser, Parser

public class HtmlParserImpl
extends GenericParser
implements HtmlParser

A custom specialized parser - ported from the main C++ version - used to implement context-aware escaping of run-time data in web-application templates.

This is the main class in the package. It implements the HtmlParser interface.

This class is not thread-safe. In particular, you cannot invoke any state-changing operations (such as parse) from multiple threads on the same object.
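
One common way to respect this restriction is to give each thread its own parser instance. The sketch below illustrates the pattern with ThreadLocal. CountingParser is a hypothetical stand-in for a stateful parser such as HtmlParserImpl, used so the example compiles without the library; with the real class the initializer would presumably be HtmlParserImpl::new.

```java
// Hypothetical stand-in for a stateful, non-thread-safe parser.
// It is NOT part of streamhtmlparser; it only mimics "parse mutates state".
final class CountingParser {
    private int consumed; // mutable state that must not be shared across threads

    void parse(String input) {
        consumed += input.length();
    }

    int consumed() {
        return consumed;
    }
}

final class PerThreadParsers {
    // Each thread lazily receives its own parser; instances are never shared.
    private static final ThreadLocal<CountingParser> PARSER =
            ThreadLocal.withInitial(CountingParser::new);

    static CountingParser current() {
        return PARSER.get();
    }

    // Parses the input on a new thread and reports how many characters
    // that thread's own parser consumed.
    static int parseOnNewThread(String input) {
        int[] result = new int[1];
        Thread worker = new Thread(() -> {
            current().parse(input);
            result[0] = current().consumed();
        });
        worker.start();
        try {
            worker.join(); // join makes the worker's write to result[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        current().parse("abc");
        System.out.println(parseOnNewThread("hello")); // prints 5
        System.out.println(current().consumed());      // prints 3
    }
}
```

The calling thread's parser is unaffected by the work done on the other thread, which is the property a shared instance could not guarantee.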

If you are looking at this class, chances are very high you are implementing Auto-Escaping for a new template system. Please see the landing page including a design document at Auto-Escape Landing Page.


Nested Class Summary
 
Nested classes/interfaces inherited from interface com.google.streamhtmlparser.HtmlParser
HtmlParser.ATTR_TYPE, HtmlParser.Mode
 
Field Summary
 
Fields inherited from class com.google.streamhtmlparser.impl.GenericParser
columnNumber, currentState, initialState, intToExtStateTable, lineNumber, parserStateTable
 
Fields inherited from interface com.google.streamhtmlparser.HtmlParser
STATE_ATTR, STATE_COMMENT, STATE_CSS_FILE, STATE_JS_FILE, STATE_TAG, STATE_TEXT, STATE_VALUE
 
Fields inherited from interface com.google.streamhtmlparser.Parser
STATE_ERROR
 
Constructor Summary
HtmlParserImpl()
          Creates an HtmlParserImpl object.
HtmlParserImpl(HtmlParserImpl aHtmlParserImpl)
          Creates an HtmlParserImpl that is a copy of the one provided.
 
Method Summary
 String getAttribute()
          Returns the name of the HTML attribute the parser is currently processing.
 HtmlParser.ATTR_TYPE getAttributeType()
          Returns the type of the attribute that the parser is in or ATTR_TYPE.NONE if we are not parsing an attribute.
 ExternalState getJavascriptState()
          Returns the state the Javascript parser is in.
 String getTag()
          Returns the name of the HTML tag if the parser is currently within one.
 String getValue()
          Returns the value of an HTML attribute if the parser is currently within one.
 int getValueIndex()
          Returns the current position of the parser within the HTML attribute value, zero being the position of the first character in the value.
protected  com.google.streamhtmlparser.impl.InternalState handleEnterState(com.google.streamhtmlparser.impl.InternalState currentState, com.google.streamhtmlparser.impl.InternalState expectedNextState, char input)
          Invoked when the parser enters a new state.
protected  com.google.streamhtmlparser.impl.InternalState handleExitState(com.google.streamhtmlparser.impl.InternalState currentState, com.google.streamhtmlparser.impl.InternalState expectedNextState, char input)
          Invoked when the parser exits a state.
protected  com.google.streamhtmlparser.impl.InternalState handleInState(com.google.streamhtmlparser.impl.InternalState currentState, char input)
          Invoked for each character read when no state change occurred.
 boolean inAttribute()
          Returns true if and only if the parser is currently within an attribute, be it within the attribute name or the attribute value.
 boolean inCss()
          Returns true if and only if the parser is currently within a CSS context.
 boolean inJavascript()
          Returns true if the parser is currently processing Javascript.
 void insertText()
          A specialized directive to tell the parser there is some content that will be inserted here but that it will not get to parse.
 boolean isAttributeQuoted()
          Returns true if and only if the parser is currently within an attribute value and that attribute value is quoted.
 boolean isJavascriptQuoted()
          Returns true if the parser is currently processing a Javascript literal that is quoted.
 boolean isUrlStart()
          Returns true if and only if the current position of the parser is at the start of a URL HTML attribute value.
protected  void record(char input)
          Invokes recording on all CharacterRecorder objects.
 void reset()
          Resets the state of the parser to the initial state of parsing HTML.
 void resetMode(HtmlParser.Mode mode)
          Resets the state of the parser, allowing for reuse of the HtmlParser object.
 
Methods inherited from class com.google.streamhtmlparser.impl.GenericParser
getColumnNumber, getLineNumber, getState, parse, parse, setColumnNumber, setLineNumber, setNextState
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface com.google.streamhtmlparser.Parser
getColumnNumber, getLineNumber, getState, parse, parse, setColumnNumber, setLineNumber
 

Constructor Detail

HtmlParserImpl

public HtmlParserImpl()
Creates an HtmlParserImpl object.

Both for performance reasons and to leverage a state-flow machine that is automatically generated from Python for multiple target languages, this object uses a static, read-only ParserStateTable obtained from the generated code in HtmlParserFsm. That code also maintains the mapping from internal states (InternalState) to external states (ExternalState).


HtmlParserImpl

public HtmlParserImpl(HtmlParserImpl aHtmlParserImpl)
Creates an HtmlParserImpl that is a copy of the one provided.

Parameters:
aHtmlParserImpl - the HtmlParserImpl object to copy

Method Detail

inJavascript

public boolean inJavascript()
Description copied from interface: HtmlParser
Returns true if the parser is currently processing Javascript. This is the case if and only if the parser is processing an attribute that takes Javascript or a Javascript script block, or the parser was (re)set with HtmlParser.Mode.JS.

Specified by:
inJavascript in interface HtmlParser
Returns:
true if the parser is processing Javascript, false otherwise

isJavascriptQuoted

public boolean isJavascriptQuoted()
Description copied from interface: HtmlParser
Returns true if the parser is currently processing a Javascript literal that is quoted. The caller will typically invoke this method after determining that the parser is processing Javascript. Knowing whether the element is quoted or not helps determine which escaping to apply to it when needed.

Specified by:
isJavascriptQuoted in interface HtmlParser
Returns:
true if and only if the parser is inside a quoted Javascript literal
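
To illustrate why the quoted/unquoted distinction matters, here is a minimal sketch of how a caller might escape a run-time value differently in each case. It assumes the caller has already obtained the boolean from isJavascriptQuoted(). The class JsValueEscaping and its methods are hypothetical and not part of this package, and the escaping rules are deliberately simplified (for example, U+2028 and U+2029 are not handled).

```java
// Hypothetical helper, not part of streamhtmlparser.
final class JsValueEscaping {

    // Inside a quoted literal: hex-escape characters that could end the
    // string or the enclosing script block.
    static String escapeQuoted(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': case '\'': case '"':
                case '<': case '>': case '&':
                case '\n': case '\r':
                    out.append(String.format("\\x%02x", (int) c));
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Outside quotes nothing can be escaped safely, so only plain
    // numeric or boolean literals are passed through; anything else
    // is replaced by the literal null.
    static String escapeUnquoted(String value) {
        boolean isBoolean = value.equals("true") || value.equals("false");
        boolean isNumber = value.matches("-?\\d+(\\.\\d+)?");
        return (isBoolean || isNumber) ? value : "null";
    }

    // 'quoted' is assumed to come from HtmlParser.isJavascriptQuoted().
    static String escape(String value, boolean quoted) {
        return quoted ? escapeQuoted(value) : escapeUnquoted(value);
    }

    public static void main(String[] args) {
        System.out.println(escape("</script>", true)); // prints \x3c/script\x3e
        System.out.println(escape("alert(1)", false)); // prints null
    }
}
```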

inAttribute

public boolean inAttribute()
Description copied from interface: HtmlParser
Returns true if and only if the parser is currently within an attribute, be it within the attribute name or the attribute value.

Specified by:
inAttribute in interface HtmlParser
Returns:
true if and only if inside an attribute

inCss

public boolean inCss()
Returns true if and only if the parser is currently within a CSS context. A CSS context is one of the below:

Specified by:
inCss in interface HtmlParser
Returns:
true if and only if the parser is inside CSS

getAttributeType

public HtmlParser.ATTR_TYPE getAttributeType()
Description copied from interface: HtmlParser
Returns the type of the attribute that the parser is in or ATTR_TYPE.NONE if we are not parsing an attribute. The caller will typically invoke this method after determining that the parser is processing an attribute.

This is useful to determine which escaping to apply based on the type of value this attribute expects.

Specified by:
getAttributeType in interface HtmlParser
Returns:
type of the attribute
See Also:
HtmlParser.ATTR_TYPE
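
As a rough illustration of choosing an escaping strategy from the attribute type, the sketch below maps each type to an escaper label. The local enum AttrType is a stand-in for HtmlParser.ATTR_TYPE so the example compiles on its own; only NONE and URI are confirmed by this page, and the remaining constants (REGULAR, JS, STYLE) as well as the escaper labels are assumptions.

```java
// Hypothetical helper, not part of streamhtmlparser.
final class AttributeEscaping {

    // Stand-in for HtmlParser.ATTR_TYPE. NONE and URI appear in the
    // documentation above; REGULAR, JS and STYLE are assumed.
    enum AttrType { NONE, REGULAR, URI, JS, STYLE }

    // Returns an illustrative label for the escaper a template system
    // might apply to a value inserted into an attribute of this type.
    static String escaperFor(AttrType type) {
        switch (type) {
            case URI:
                return "url";
            case JS:
                return "javascript";
            case STYLE:
                return "css";
            case REGULAR:
                return "html-attribute";
            case NONE:
            default:
                // Not inside an attribute: fall back to plain HTML escaping.
                return "html";
        }
    }

    public static void main(String[] args) {
        System.out.println(escaperFor(AttrType.URI));  // prints url
        System.out.println(escaperFor(AttrType.NONE)); // prints html
    }
}
```

With the real parser, the argument would be the value returned by getAttributeType() at the point where the template inserts its data.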

getJavascriptState

public ExternalState getJavascriptState()
Description copied from interface: HtmlParser
Returns the state the Javascript parser is in.

See JavascriptParser for more information on the valid external states. The caller will typically first determine that the parser is processing Javascript and then invoke this method to obtain more fine-grained state information.

Specified by:
getJavascriptState in interface HtmlParser
Returns:
external state of the javascript parser

isAttributeQuoted

public boolean isAttributeQuoted()
Description copied from interface: HtmlParser
Returns true if and only if the parser is currently within an attribute value and that attribute value is quoted.

Specified by:
isAttributeQuoted in interface HtmlParser
Returns:
true if and only if the attribute value is quoted

getTag

public String getTag()
Description copied from interface: HtmlParser
Returns the name of the HTML tag if the parser is currently within one. Note that the name may be incomplete if the parser is currently still parsing the name. Returns an empty String if the parser is not in a tag as determined by getCurrentExternalState.

Specified by:
getTag in interface HtmlParser
Returns:
the name of the HTML tag or an empty String if we are not within an HTML tag

getAttribute

public String getAttribute()
Description copied from interface: HtmlParser
Returns the name of the HTML attribute the parser is currently processing. If the parser is still parsing the name, then the returned name may be incomplete. Returns an empty String if the parser is not in an attribute as determined by getCurrentExternalState.

Specified by:
getAttribute in interface HtmlParser
Returns:
the name of the HTML attribute or an empty String if we are not within an HTML attribute

getValue

public String getValue()
Description copied from interface: HtmlParser
Returns the value of an HTML attribute if the parser is currently within one. If the parser is currently parsing the value, the returned value may be incomplete. The caller will typically first determine that the parser is processing a value by calling getCurrentExternalState.

Specified by:
getValue in interface HtmlParser
Returns:
the value, could be an empty String if the parser is not in an HTML attribute value

getValueIndex

public int getValueIndex()
Description copied from interface: HtmlParser
Returns the current position of the parser within the HTML attribute value, zero being the position of the first character in the value. The caller will typically first determine that the parser is processing a value by calling Parser.getState().

Specified by:
getValueIndex in interface HtmlParser
Returns:
the index or zero if the parser is not processing a value

isUrlStart

public boolean isUrlStart()
Description copied from interface: HtmlParser
Returns true if and only if the current position of the parser is at the start of a URL HTML attribute value. This is the case when the following three conditions are all met:

  1. The parser is in an HTML attribute value.
  2. The HTML attribute expects a URL, as determined by HtmlParser.getAttributeType() returning HtmlParser.ATTR_TYPE.URI.
  3. The parser has not yet seen any characters from that URL.

This method may be used by an HTML sanitizer or an Auto-Escape system to determine whether to validate the URL for well-formedness and to validate that the scheme of the URL (e.g. HTTP, HTTPS) is safe. In particular, it is recommended to use this method instead of checking that HtmlParser.getValueIndex() is 0, in order to support attribute types where the URL does not start at index zero, such as the content attribute of the meta HTML tag.

Specified by:
isUrlStart in interface HtmlParser
Returns:
true if and only if the parser is at the start of the URL
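
The scheme validation mentioned above could look roughly like the following sketch, which a caller might run on the value about to be inserted whenever isUrlStart() returns true. UrlSchemeCheck and its allow-list are hypothetical and not part of this package; a production check would likely need to be stricter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of streamhtmlparser.
final class UrlSchemeCheck {

    // Assumed allow-list; adjust to the schemes the application trusts.
    private static final Set<String> SAFE_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https"));

    // Returns true if the URL is relative or uses an allowed scheme.
    static boolean hasSafeScheme(String url) {
        String trimmed = url.trim();
        int colon = trimmed.indexOf(':');
        if (colon < 0) {
            return true; // no scheme at all: relative URL
        }
        // A ':' that follows '/', '?' or '#' belongs to the path, query
        // or fragment, so the URL has no scheme.
        for (int i = 0; i < colon; i++) {
            char c = trimmed.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                return true;
            }
        }
        String scheme = trimmed.substring(0, colon).toLowerCase(Locale.ROOT);
        return SAFE_SCHEMES.contains(scheme);
    }

    public static void main(String[] args) {
        System.out.println(hasSafeScheme("https://example.com/")); // prints true
        System.out.println(hasSafeScheme("javascript:alert(1)"));  // prints false
    }
}
```

Using an allow-list rather than a deny-list means obfuscated schemes are rejected by default instead of having to be anticipated one by one.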

resetMode

public void resetMode(HtmlParser.Mode mode)
Resets the state of the parser, allowing for reuse of the HtmlParser object.

Resets the state of the parser to a state consistent with the Mode provided. This also resets finer-grained state information to its default values, so use it only when you want to parse text from a clean slate.

See the HtmlParser.Mode enum for information on all the valid modes.

Specified by:
resetMode in interface HtmlParser
Parameters:
mode - is an enum representing the high-level state of the parser

reset

public void reset()
Resets the state of the parser to the initial state of parsing HTML.

Specified by:
reset in interface Parser
Overrides:
reset in class GenericParser

insertText

public void insertText()
                throws ParseException
A specialized directive to tell the parser there is some content that will be inserted here but that it will not get to parse. Used by the template system that may not be able to give some content to the parser but wants it to know there typically will be content inserted at that point. This is a hint used in corner cases within parsing of HTML attribute names and values where content we do not get to see could affect our parsing and alter our current state.

The two cases where insertText() affects our parsing are:

Specified by:
insertText in interface HtmlParser
Throws:
ParseException - if an unrecoverable error occurred during parsing

handleEnterState

protected com.google.streamhtmlparser.impl.InternalState handleEnterState(com.google.streamhtmlparser.impl.InternalState currentState,
                                                                          com.google.streamhtmlparser.impl.InternalState expectedNextState,
                                                                          char input)
Description copied from class: GenericParser
Invoked when the parser enters a new state.

Overrides:
handleEnterState in class GenericParser
Parameters:
currentState - the current state of the parser
expectedNextState - the next state according to the state table definition
input - the last character parsed
Returns:
the state to change to, could be the same as the expectedNextState provided

handleExitState

protected com.google.streamhtmlparser.impl.InternalState handleExitState(com.google.streamhtmlparser.impl.InternalState currentState,
                                                                         com.google.streamhtmlparser.impl.InternalState expectedNextState,
                                                                         char input)
Description copied from class: GenericParser
Invoked when the parser exits a state.

Overrides:
handleExitState in class GenericParser
Parameters:
currentState - the current state of the parser
expectedNextState - the next state according to the state table definition
input - the last character parsed
Returns:
the state to change to, could be the same as the expectedNextState provided

handleInState

protected com.google.streamhtmlparser.impl.InternalState handleInState(com.google.streamhtmlparser.impl.InternalState currentState,
                                                                       char input)
                                                                throws ParseException
Description copied from class: GenericParser
Invoked for each character read when no state change occurred.

Overrides:
handleInState in class GenericParser
Parameters:
currentState - the current state of the parser
input - the last character parsed
Returns:
the state to change to, could be the same as the currentState provided
Throws:
ParseException - if an unrecoverable error occurred during parsing

record

protected void record(char input)
Invokes recording on all CharacterRecorder objects. Currently we do not check that one and only one of them is recording. I did a fair bit of testing on the C++ parser and was not convinced there is such a guarantee.

Overrides:
record in class GenericParser
Parameters:
input - the input character to operate on


Copyright © 2010-2012 Google. All Rights Reserved.