com.google.streamhtmlparser.util
Class HtmlUtils

java.lang.Object
  extended by com.google.streamhtmlparser.util.HtmlUtils

public final class HtmlUtils
extends Object

Utility functions for HTML and Javascript that are most likely not interesting to users outside this package.

The HtmlParser will be open-sourced, so we decided to keep these utilities in this package rather than leverage others that may exist in the google3 code base.

The functionality exposed is designed to be 100% compatible with the corresponding logic in the C version of the HtmlParser. As such, we are particularly concerned with cross-language compatibility.

Note: The words Javascript and ECMAScript are used interchangeably unless otherwise noted.


Nested Class Summary
static class HtmlUtils.META_REDIRECT_TYPE
          Indicates the type of content contained in the content HTML attribute of the meta HTML tag.
 
Method Summary
static String encodeCharForAscii(char chr)
          Encodes the specified character using Ascii for convenient insertion into a single-quote enclosed String.
static boolean isAttributeJavascript(String attribute)
          Determines if the HTML attribute specified expects javascript for its value.
static boolean isAttributeStyle(String attribute)
          Determines if the HTML attribute specified expects a style for its value.
static boolean isAttributeUri(String attribute)
          Determines if the HTML attribute specified expects a URI for its value.
static boolean isHtmlSpace(char chr)
          Determines if the specified character is an HTML whitespace character.
static boolean isJavascriptIdentifier(char chr)
          Determines if the specified character is a valid character in an ECMAScript identifier.
static boolean isJavascriptRegexpPrefix(String input)
          Determines if the input token provided is a valid token prefix to a javascript regular expression.
static boolean isJavascriptWhitespace(char chr)
          Determines if the specified character is an ECMAScript whitespace or line terminator character.
static HtmlUtils.META_REDIRECT_TYPE parseContentAttributeForUrl(String value)
          Parses the given String to determine if it contains a URL in the format followed by the content attribute of the meta HTML tag.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

isAttributeJavascript

public static boolean isAttributeJavascript(String attribute)
Determines if the HTML attribute specified expects javascript for its value. Such is the case for example with the onclick attribute.

Currently returns true for any attribute name that starts with "on". This is not exactly correct, but we trust developers not to use non-spec-compliant attribute names (e.g. onbogus).

Parameters:
attribute - the name of an HTML attribute
Returns:
true if the attribute is one that expects javascript code; false if the input is null or is not such an attribute
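
The prefix rule described above can be sketched in a few lines. This is a stand-in written for illustration, not the HtmlUtils source: the class and method names are hypothetical, and case-sensitive matching is an assumption, since the page does not say how letter case is handled.

```java
/** Hypothetical stand-in for illustration; not the HtmlUtils implementation. */
final class OnPrefixSketch {

  // Documented rule: null yields false, any name starting with "on" yields true.
  // Matching is case-sensitive here, which is an assumption of this sketch.
  static boolean looksLikeJavascriptAttribute(String attribute) {
    return attribute != null && attribute.startsWith("on");
  }

  public static void main(String[] args) {
    System.out.println(looksLikeJavascriptAttribute("onclick")); // true
    System.out.println(looksLikeJavascriptAttribute("onbogus")); // true, the known imprecision
    System.out.println(looksLikeJavascriptAttribute("href"));    // false
    System.out.println(looksLikeJavascriptAttribute(null));      // false
  }
}
```

The onbogus case shows the trade-off the description mentions: a prefix test needs no attribute table, at the cost of accepting names that no specification defines.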

isAttributeStyle

public static boolean isAttributeStyle(String attribute)
Determines if the HTML attribute specified expects a style for its value. Currently this is only true for the style HTML attribute.

Parameters:
attribute - the name of an HTML attribute
Returns:
true iff the attribute name is one that expects a style for a value; otherwise false

isAttributeUri

public static boolean isAttributeUri(String attribute)
Determines if the HTML attribute specified expects a URI for its value. For example, both href and src expect a URI but style does not. Returns false if the attribute given was null.

Parameters:
attribute - the name of an HTML attribute
Returns:
true if the attribute name is one that expects a URI for a value; otherwise false
See Also:
ATTRIBUTE_EXPECTS_URI
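
A set lookup is one plausible way to implement this kind of check. The sketch below uses hypothetical names and only the two attributes named in the description (href and src); the actual contents of ATTRIBUTE_EXPECTS_URI are not listed on this page, and case-sensitive matching is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the HtmlUtils implementation. */
final class UriAttributeSketch {

  // Illustrative subset only: the two attributes the description names as expecting a URI.
  private static final Set<String> EXPECTS_URI =
      Collections.unmodifiableSet(new HashSet<String>(Arrays.asList("href", "src")));

  // Returns false for null, matching the documented behaviour.
  static boolean expectsUri(String attribute) {
    return attribute != null && EXPECTS_URI.contains(attribute);
  }

  public static void main(String[] args) {
    System.out.println(expectsUri("href"));  // true
    System.out.println(expectsUri("style")); // false
    System.out.println(expectsUri(null));    // false
  }
}
```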

isHtmlSpace

public static boolean isHtmlSpace(char chr)
Determines if the specified character is an HTML whitespace character. A character is an HTML whitespace character if and only if it is one of the characters below. Note: The list includes the zero-width space (U+200B) which is not included in the C version.

Parameters:
chr - the char to check
Returns:
true if the character is an HTML whitespace character; otherwise false (see: White space)

isJavascriptWhitespace

public static boolean isJavascriptWhitespace(char chr)
Determines if the specified character is an ECMAScript whitespace or line terminator character. A character is a whitespace or line terminator if and only if it is one of the characters below:

Encompasses the characters in sections 7.2 and 7.3 of ECMAScript 3. In particular, this list is quite different from that in Character.isWhitespace. See the ECMAScript Language Specification.

Parameters:
chr - the char to check
Returns:
true if the character is an ECMAScript whitespace or line terminator character; otherwise false
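
To make the contrast with Character.isWhitespace concrete, here is a sketch built directly from sections 7.2 and 7.3 of ECMAScript 3. It is not the HtmlUtils source and its names are hypothetical. In particular, whether HtmlUtils also accepts the open-ended Unicode space separator category from section 7.2 is not stated on this page, so that branch is an assumption.

```java
/** Hypothetical stand-in for illustration; built from ECMAScript 3 sections 7.2 and 7.3. */
final class EcmaWhitespaceSketch {

  // Hex constants are used instead of Unicode escapes, because Java translates
  // such escapes before lexing and the line-feed one would break a char literal.
  static boolean isEs3WhitespaceOrLineTerminator(char chr) {
    switch (chr) {
      case 0x0009: // tab
      case 0x000B: // vertical tab
      case 0x000C: // form feed
      case 0x0020: // space
      case 0x00A0: // no-break space
      case 0x000A: // line feed
      case 0x000D: // carriage return
      case 0x2028: // line separator
      case 0x2029: // paragraph separator
        return true;
      default:
        // Section 7.2 also admits any other Unicode space separator (category Zs).
        return Character.getType(chr) == Character.SPACE_SEPARATOR;
    }
  }

  public static void main(String[] args) {
    char noBreakSpace = (char) 0x00A0;
    char fileSeparator = (char) 0x001C;
    System.out.println(isEs3WhitespaceOrLineTerminator(noBreakSpace));  // true
    System.out.println(Character.isWhitespace(noBreakSpace));           // false
    System.out.println(isEs3WhitespaceOrLineTerminator(fileSeparator)); // false
    System.out.println(Character.isWhitespace(fileSeparator));          // true
  }
}
```

The two characters in main disagree in both directions: Java excludes the no-break space that ECMAScript accepts, and accepts the file separator control character that ECMAScript does not.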

isJavascriptIdentifier

public static boolean isJavascriptIdentifier(char chr)
Determines if the specified character is a valid character in an ECMAScript identifier. This determination is currently not exact, in particular: We are considering leveraging Character.isJavaIdentifierStart and Character.isJavaIdentifierPart, given that Java and Javascript follow similar identifier naming rules, but doing so would lose compatibility with the C version.

Parameters:
chr - char to check
Returns:
true if the chr is a valid character in an ECMAScript identifier; otherwise false
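
For reference, this is how the two java.lang.Character methods mentioned above behave. The wrapper class and method are hypothetical and only illustrate the alternative under consideration; they say nothing about what isJavascriptIdentifier currently accepts.

```java
/** Hypothetical wrapper showing the standard Java identifier checks. */
final class JavaIdentifierRulesSketch {

  // Java distinguishes the first character of an identifier from the rest.
  static boolean isIdentifierChar(char chr, boolean first) {
    return first
        ? Character.isJavaIdentifierStart(chr)
        : Character.isJavaIdentifierPart(chr);
  }

  public static void main(String[] args) {
    System.out.println(isIdentifierChar('$', true));       // true
    System.out.println(isIdentifierChar('7', true));       // false: digits cannot start
    System.out.println(isIdentifierChar('7', false));      // true
    System.out.println(isIdentifierChar('-', false));      // false
    System.out.println(isIdentifierChar((char) 0, false)); // true: ignorable control character
  }
}
```

The last line shows one place where the Java rules are similar to, but not the same as, Javascript's: Java treats certain control characters as ignorable identifier parts.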

isJavascriptRegexpPrefix

public static boolean isJavascriptRegexpPrefix(String input)
Determines if the input token provided is a valid token prefix to a javascript regular expression. The input is compared against a Set of identifiers that can precede a regular expression in the javascript grammar; the method returns true if the provided String is in that Set.

Parameters:
input - the String token to check
Returns:
true iff the token is a valid prefix of a regexp
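
The point of such a check is that a slash following a keyword like return starts a regular expression, whereas a slash following an ordinary identifier is a division. The sketch below uses hypothetical names and a small illustrative subset of keywords; the real Set is not listed on this page, and returning false for null is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the HtmlUtils implementation. */
final class RegexpPrefixSketch {

  // Illustrative subset chosen for this sketch, not the library's actual Set.
  private static final Set<String> REGEXP_PREFIXES =
      Collections.unmodifiableSet(
          new HashSet<String>(Arrays.asList("return", "typeof", "case", "in", "throw")));

  // Null handling is an assumption of this sketch.
  static boolean canPrecedeRegexp(String token) {
    return token != null && REGEXP_PREFIXES.contains(token);
  }

  public static void main(String[] args) {
    System.out.println(canPrecedeRegexp("return")); // true: a slash after it opens a regexp
    System.out.println(canPrecedeRegexp("count"));  // false: a slash after it is a division
  }
}
```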

encodeCharForAscii

public static String encodeCharForAscii(char chr)
Encodes the specified character using Ascii for convenient insertion into a single-quote enclosed String. Printable characters are returned as-is. Carriage Return, Line Feed, Horizontal Tab, backslash and single quote are all backslash-escaped. All other characters are returned hex-encoded.

Parameters:
chr - char to encode
Returns:
an Ascii-friendly encoding of the given char
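
The three cases described above can be sketched as follows. The names are hypothetical, and two details are assumptions because the page does not specify them: "printable" is taken to mean the range 0x20 to 0x7E, and the hex form is taken to be a backslash, the letter u, and four lowercase hex digits.

```java
/** Hypothetical stand-in for illustration; not the HtmlUtils implementation. */
final class AsciiEncodeSketch {

  static String encode(char chr) {
    switch (chr) {
      // The five characters the description says are backslash-escaped.
      case '\r': return "\\r";
      case '\n': return "\\n";
      case '\t': return "\\t";
      case '\\': return "\\\\";
      case '\'': return "\\'";
      default:
        // Assumption: printable means the ASCII range 0x20 to 0x7E.
        if (chr >= 0x20 && chr <= 0x7E) {
          return String.valueOf(chr);
        }
        // Assumption: four-digit lowercase hex escape for everything else.
        return String.format("\\u%04x", (int) chr);
    }
  }

  public static void main(String[] args) {
    System.out.println(encode('a'));  // a
    System.out.println(encode('\'')); // \'
    System.out.println(encode((char) 0x00E9));
  }
}
```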

parseContentAttributeForUrl

public static HtmlUtils.META_REDIRECT_TYPE parseContentAttributeForUrl(String value)
Parses the given String to determine if it contains a URL in the format followed by the content attribute of the meta HTML tag.

This function expects to receive the value of the content HTML attribute. This attribute takes on different meanings depending on the value of the http-equiv HTML attribute of the same meta tag. Since we may not have access to the http-equiv attribute, we instead rely on parsing the given value to determine if it contains a URL. The specification of the meta HTML tag can be found in: http://dev.w3.org/html5/spec/Overview.html#attr-meta-http-equiv-refresh

We return HtmlUtils.META_REDIRECT_TYPE indicating whether the value contains a URL and whether we are at the start of the URL or past the start. We are at the start of the URL if and only if one of the two conditions below is true:

Examples:

Parameters:
value - String to parse
Returns:
HtmlUtils.META_REDIRECT_TYPE indicating the presence of a URL in the given value
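
As a rough illustration of a three-way result (no URL, at the start of a URL, past the start), consider the sketch below. Everything in it is hypothetical: the enum constants are invented because the constants of META_REDIRECT_TYPE are not listed on this page, the only recognised shape is a numeric delay, a semicolon, and "url=", and "at the start" is simplified to mean that nothing follows "url=". The two conditions the description refers to are not reproduced on this page, so this rule should not be read as the library's.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not the HtmlUtils implementation. */
final class MetaRefreshSketch {

  /** Invented constants; the real META_REDIRECT_TYPE values may differ. */
  enum UrlState { NO_URL, AT_URL_START, PAST_URL_START }

  // Optional whitespace, a delay in digits, a semicolon, then "url=" in any case.
  private static final Pattern REFRESH =
      Pattern.compile("^\\s*\\d+\\s*;\\s*url\\s*=\\s*(.*)$",
          Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  static UrlState classify(String value) {
    if (value == null) {
      return UrlState.NO_URL;
    }
    Matcher m = REFRESH.matcher(value);
    if (!m.matches()) {
      return UrlState.NO_URL;
    }
    // Simplifying assumption: an empty remainder means the URL has not begun yet.
    return m.group(1).isEmpty() ? UrlState.AT_URL_START : UrlState.PAST_URL_START;
  }

  public static void main(String[] args) {
    System.out.println(classify("text/html; charset=utf-8"));   // NO_URL
    System.out.println(classify("5; URL="));                    // AT_URL_START
    System.out.println(classify("5; URL=http://example.com/")); // PAST_URL_START
  }
}
```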


Copyright © 2010-2012 Google. All Rights Reserved.