it.unimi.dsi.mg4j.util.parser
Class BulletParser

java.lang.Object
  extended by it.unimi.dsi.mg4j.util.parser.BulletParser

Deprecated. Moved to dsiutils.

@Deprecated
public class BulletParser
extends Object

A fast, lightweight, on-demand (X)HTML parser.

The bullet parser has been written with two specific goals in mind: web crawling and targeted data extraction from massive web data sets. To be usable in such environments, a parser must obey a number of restrictions:

Thus, the bullet parser is in fact not a parser. It is a bunch of spaghetti code that analyses a stream of characters pretending that it is an (X)HTML document. It has a very defensive attitude towards the character stream it is parsing, but at the same time it is forgiving of all typical (X)HTML mistakes.

The bullet parser is officially StringFree™. MutableStrings are used for internal processing, and Java strings are used only to return attribute values. All internal maps are reference-based maps from fastutil, which further speeds up the parsing process.

HTML data

The bullet parser uses attributes and methods of HTMLFactory, Element, Attribute and Entity. Thus, for instance, whenever an element is to be passed around it is one of the shared objects contained in Element (e.g., Element.BODY).

Callbacks

The result of the parsing process is the invocation of a callback. The callback interface of the bullet parser closely resembles SAX2, but it has some additional methods targeted at (X)HTML, such as Callback.cdata(it.unimi.dsi.mg4j.util.parser.Element,char[],int,int), which returns characters found in a CDATA section (e.g., a stylesheet).

Each callback must configure the parser by requesting the analysis and the callbacks it requires. A callback that wants to extract and tokenise text, for instance, will certainly require parseText(true), but not parseTags(true). On the other hand, a callback wishing to extract links will need to selectively parse certain attribute types.

A more precise description follows.

Writing callbacks

The first important issue is what has to be requested from the parser. A newly created parser does not invoke any callback. It is up to every callback to add features so that it can do its job. Remember that since many callbacks can be composed, you must always add features, never remove them, and moreover your callbacks must be ready to be invoked with features they did not request (e.g., attribute types added by another callback).
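The additive rule can be illustrated with a minimal, self-contained sketch. It does not use the MG4J classes: FeatureSketchParser, FeatureSketchCallback and FeatureSketch are hypothetical stand-ins, and attributes are modelled as plain strings. The only point it makes is that each callback switches features on and none switches them off, so the order of composition does not matter.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for a parser: it only holds feature switches. */
final class FeatureSketchParser {
    boolean parseText, parseTags, parseAttributes; // all false by default
    final Set<String> parsedAttributes = new LinkedHashSet<>();
}

/** Hypothetical stand-in for the configuration half of a callback. */
interface FeatureSketchCallback {
    void configure(FeatureSketchParser parser);
}

final class FeatureSketch {
    /** A text extractor only switches text handling on. */
    static final FeatureSketchCallback TEXT = p -> p.parseText = true;

    /** A link extractor switches on tags, attributes and the one attribute it needs. */
    static final FeatureSketchCallback LINKS = p -> {
        p.parseTags = true;
        p.parseAttributes = true;
        p.parsedAttributes.add("href");
    };

    /** Composes callbacks: each one adds features, none clears them. */
    static FeatureSketchParser configureAll(FeatureSketchCallback... callbacks) {
        FeatureSketchParser parser = new FeatureSketchParser();
        for (FeatureSketchCallback c : callbacks) c.configure(parser);
        return parser;
    }
}
```

After FeatureSketch.configureAll(FeatureSketch.TEXT, FeatureSketch.LINKS) the text extractor runs on a parser that also parses tags and the href attribute, which is why a callback must tolerate features it did not request.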

The following parse features may be configured; most of them are just boolean features, a.k.a. flags. Unless otherwise specified, all flags are set to false by default (e.g., by default the parser will not parse tags):

Invoking the parser

After setting the parser callback, you just call parse(char[], int, int).


Field Summary
protected  Reference2ObjectMap<Attribute,MutableString> attrMap
          Deprecated. A map from attributes to attribute values.
protected  Callback callback
          Deprecated. The callback of this parser.
protected static TextPattern CLOSED_CDATA
          Deprecated. Closed section (conditional, CDATA, etc.).
protected static TextPattern CLOSED_COMMENT
          Deprecated. Closed comment.
protected static TextPattern CLOSED_PERCENT
          Deprecated. Closed ASP or similar tag.
protected static TextPattern CLOSED_PIC
          Deprecated. Closed processing instruction.
protected static TextPattern CLOSED_SECTION
          Deprecated. Closed section (conditional, etc.).
 ParsingFactory factory
          Deprecated. The parsing factory used by this parser.
protected static int HEXADECIMAL
          Deprecated. The base for non-decimal entities.
protected  char lastEntity
          Deprecated. The character represented by the last scanned entity.
protected static int MAX_DEC_ENTITY_LENGTH
          Deprecated. The maximum number of digits of a decimal numeric entity.
protected static int MAX_ENTITY_VALUE
          Deprecated. The maximum Unicode value accepted for a numeric entity.
protected static int MAX_HEX_ENTITY_LENGTH
          Deprecated. The maximum number of digits of a hexadecimal numeric entity.
protected static char[] NONSPACE_WHITESPACE
          Deprecated. An array containing the non-space whitespace.
protected  boolean parseAttributes
          Deprecated. Whether we should parse attributes.
protected  boolean parseCDATA
          Deprecated. Whether we should invoke the CDATA section handler.
 ReferenceSet<Attribute> parsedAttributes
          Deprecated. An externally visible, immutable subset of attributes whose values will be actually parsed.
protected  ReferenceArraySet<Attribute> parsedAttrs
          Deprecated. The subset of attributes whose values will be actually parsed (if, of course, parseAttributes is true).
protected  boolean parseTags
          Deprecated. Whether we should parse tags.
protected  boolean parseText
          Deprecated. Whether we should invoke the text handler.
protected static TextPattern SCRIPT_CLOSE_TAG_PATTERN
          Deprecated. Closing tag for a script element.
protected static char[] SPACE
          Deprecated. An array, parallel to NONSPACE_WHITESPACE, containing spaces.
protected static int STATE_BEFORE_END_TAG_NAME
          Deprecated. Scanning a closing tag.
protected static int STATE_BEFORE_START_TAG_NAME
          Deprecated. Scanning attribute name/value pairs.
protected static int STATE_IN_END_TAG
          Deprecated. Scanning a closing tag.
protected static int STATE_IN_START_TAG
          Deprecated. Scanning attribute name/value pairs.
protected static int STATE_TEXT
          Deprecated. Scanning text.
protected static TextPattern STYLE_CLOSE_TAG_PATTERN
          Deprecated. Closing tag for a style element.
 
Constructor Summary
BulletParser()
          Deprecated. Creates a new bullet parser using the default factory HTMLFactory.INSTANCE.
BulletParser(ParsingFactory factory)
          Deprecated. Creates a new bullet parser.
 
Method Summary
protected  char entity2Char(MutableString name)
          Deprecated. Returns the character corresponding to a given entity name.
protected  int handleMarkup(char[] text, int pos, int end)
          Deprecated. Handles markup.
protected  int handleProcessingInstruction(char[] text, int pos, int end)
          Deprecated. Handles processing instructions, ASP tags, etc.
 void parse(char[] text)
          Deprecated. Analyze the text document to extract information.
 void parse(char[] text, int offset, int length)
          Deprecated. Analyze the text document to extract information.
 BulletParser parseAttribute(Attribute attribute)
          Deprecated. Adds the given attribute to the set of attributes to be parsed.
 boolean parseAttributes()
          Deprecated. Returns whether this parser will parse attributes.
 BulletParser parseAttributes(boolean parseAttributes)
          Deprecated. Sets the attribute parsing flag.
 boolean parseCDATA()
          Deprecated. Returns whether this parser will invoke the CDATA-section handler.
 BulletParser parseCDATA(boolean parseCDATA)
          Deprecated. Sets the CDATA-section handler flag.
 boolean parseTags()
          Deprecated. Returns whether this parser will parse tags and invoke element handlers.
 BulletParser parseTags(boolean parseTags)
          Deprecated. Sets whether this parser will parse tags and invoke element handlers.
 boolean parseText()
          Deprecated. Returns whether this parser will invoke the text handler.
 BulletParser parseText(boolean parseText)
          Deprecated. Sets the text handler flag.
protected  void replaceEntities(MutableString s, MutableString entity, boolean loose)
          Deprecated. Replaces entities with the corresponding characters.
protected  int scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity)
          Deprecated. Searches for the end of an entity.
 BulletParser setCallback(Callback callback)
          Deprecated. Sets the callback for this parser, resetting at the same time all parsing flags.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

STATE_TEXT

protected static final int STATE_TEXT
Deprecated. 
Scanning text.

See Also:
Constant Field Values

STATE_BEFORE_START_TAG_NAME

protected static final int STATE_BEFORE_START_TAG_NAME
Deprecated. 
Scanning attribute name/value pairs.

See Also:
Constant Field Values

STATE_BEFORE_END_TAG_NAME

protected static final int STATE_BEFORE_END_TAG_NAME
Deprecated. 
Scanning a closing tag.

See Also:
Constant Field Values

STATE_IN_START_TAG

protected static final int STATE_IN_START_TAG
Deprecated. 
Scanning attribute name/value pairs.

See Also:
Constant Field Values

STATE_IN_END_TAG

protected static final int STATE_IN_END_TAG
Deprecated. 
Scanning a closing tag.

See Also:
Constant Field Values

MAX_ENTITY_VALUE

protected static final int MAX_ENTITY_VALUE
Deprecated. 
The maximum Unicode value accepted for a numeric entity.

See Also:
Constant Field Values

HEXADECIMAL

protected static final int HEXADECIMAL
Deprecated. 
The base for non-decimal entities.

See Also:
Constant Field Values

MAX_HEX_ENTITY_LENGTH

protected static final int MAX_HEX_ENTITY_LENGTH
Deprecated. 
The maximum number of digits of a hexadecimal numeric entity.

See Also:
Constant Field Values

MAX_DEC_ENTITY_LENGTH

protected static final int MAX_DEC_ENTITY_LENGTH
Deprecated. 
The maximum number of digits of a decimal numeric entity.

See Also:
Constant Field Values
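How the four numeric-entity constants above might work together can be sketched as follows. This is not the parser's code: NumericEntitySketch and decode are hypothetical names, and the constant values used here (16, 4, 5 and 65535) are assumptions; the actual values are the ones listed under Constant Field Values.

```java
final class NumericEntitySketch {
    // Assumed values, for illustration only.
    static final int HEXADECIMAL = 16;
    static final int MAX_HEX_ENTITY_LENGTH = 4;
    static final int MAX_DEC_ENTITY_LENGTH = 5;
    static final int MAX_ENTITY_VALUE = 65535;

    /**
     * Decodes the body of a numeric entity, i.e. the part between "&#" and ";".
     * Returns the character, or ASCII NUL if the body is not acceptable.
     */
    static char decode(String body) {
        boolean hex = body.startsWith("x") || body.startsWith("X");
        String digits = hex ? body.substring(1) : body;
        int maxLength = hex ? MAX_HEX_ENTITY_LENGTH : MAX_DEC_ENTITY_LENGTH;
        if (digits.isEmpty() || digits.length() > maxLength) return 0;
        int radix = hex ? HEXADECIMAL : 10;
        int value = 0;
        for (int i = 0; i < digits.length(); i++) {
            int d = Character.digit(digits.charAt(i), radix);
            if (d < 0) return 0; // not a digit in this base
            value = value * radix + d;
        }
        // Values above the accepted maximum are rejected rather than truncated.
        return value <= MAX_ENTITY_VALUE ? (char) value : 0;
    }
}
```

The length limits bound the work done on a malformed entity before any arithmetic happens, which fits the defensive attitude described in the class comment.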

SCRIPT_CLOSE_TAG_PATTERN

protected static final TextPattern SCRIPT_CLOSE_TAG_PATTERN
Deprecated. 
Closing tag for a script element.


STYLE_CLOSE_TAG_PATTERN

protected static final TextPattern STYLE_CLOSE_TAG_PATTERN
Deprecated. 
Closing tag for a style element.


NONSPACE_WHITESPACE

protected static final char[] NONSPACE_WHITESPACE
Deprecated. 
An array containing the non-space whitespace.


SPACE

protected static final char[] SPACE
Deprecated. 
An array, parallel to NONSPACE_WHITESPACE, containing spaces.
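Two parallel arrays of this kind lend themselves to a simple character-for-character replacement. The sketch below assumes the array contents (newline, carriage return and tab mapped to spaces), since the page only states that the arrays are parallel; WhitespaceSketch and replace are hypothetical names, not methods of this class.

```java
final class WhitespaceSketch {
    // Assumed contents: the documentation only says the two arrays are parallel.
    static final char[] NONSPACE_WHITESPACE = { '\n', '\r', '\t' };
    static final char[] SPACE = { ' ', ' ', ' ' };

    /**
     * Replaces, in place, each character of the window text[offset .. offset + length)
     * that occurs in from with the character at the same index in to.
     */
    static void replace(char[] text, int offset, int length, char[] from, char[] to) {
        for (int i = offset; i < offset + length; i++) {
            for (int j = 0; j < from.length; j++) {
                if (text[i] == from[j]) {
                    text[i] = to[j];
                    break;
                }
            }
        }
    }
}
```

Working in place on a char array avoids allocating intermediate strings, in the spirit of the StringFree claim in the class comment.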


CLOSED_COMMENT

protected static final TextPattern CLOSED_COMMENT
Deprecated. 
Closed comment. It should be "-->", but mistakes are common.


CLOSED_PERCENT

protected static final TextPattern CLOSED_PERCENT
Deprecated. 
Closed ASP or similar tag.


CLOSED_PIC

protected static final TextPattern CLOSED_PIC
Deprecated. 
Closed processing instruction.


CLOSED_SECTION

protected static final TextPattern CLOSED_SECTION
Deprecated. 
Closed section (conditional, etc.).


CLOSED_CDATA

protected static final TextPattern CLOSED_CDATA
Deprecated. 
Closed section (conditional, CDATA, etc.).


factory

public final ParsingFactory factory
Deprecated. 
The parsing factory used by this parser.


callback

protected Callback callback
Deprecated. 
The callback of this parser.


attrMap

protected Reference2ObjectMap<Attribute,MutableString> attrMap
Deprecated. 
A map from attributes to attribute values.


parseText

protected boolean parseText
Deprecated. 
Whether we should invoke the text handler.


parseCDATA

protected boolean parseCDATA
Deprecated. 
Whether we should invoke the CDATA section handler.


parseTags

protected boolean parseTags
Deprecated. 
Whether we should parse tags.


parseAttributes

protected boolean parseAttributes
Deprecated. 
Whether we should parse attributes.


parsedAttrs

protected ReferenceArraySet<Attribute> parsedAttrs
Deprecated. 
The subset of attributes whose values will be actually parsed (if, of course, parseAttributes is true).


parsedAttributes

public ReferenceSet<Attribute> parsedAttributes
Deprecated. 
An externally visible, immutable subset of attributes whose values will be actually parsed.


lastEntity

protected char lastEntity
Deprecated. 
The character represented by the last scanned entity.

Constructor Detail

BulletParser

public BulletParser(ParsingFactory factory)
Deprecated. 
Creates a new bullet parser.


BulletParser

public BulletParser()
Deprecated. 
Creates a new bullet parser using the default factory HTMLFactory.INSTANCE.

Method Detail

parseText

public boolean parseText()
Deprecated. 
Returns whether this parser will invoke the text handler.

Returns:
whether this parser will invoke the text handler.
See Also:
parseText(boolean)

parseText

public BulletParser parseText(boolean parseText)
Deprecated. 
Sets the text handler flag.

Parameters:
parseText - the new value.
Returns:
this parser.

parseCDATA

public boolean parseCDATA()
Deprecated. 
Returns whether this parser will invoke the CDATA-section handler.

Returns:
whether this parser will invoke the CDATA-section handler.
See Also:
parseCDATA(boolean)

parseCDATA

public BulletParser parseCDATA(boolean parseCDATA)
Deprecated. 
Sets the CDATA-section handler flag.

Parameters:
parseCDATA - the new value.
Returns:
this parser.

parseTags

public boolean parseTags()
Deprecated. 
Returns whether this parser will parse tags and invoke element handlers.

Returns:
whether this parser will parse tags and invoke element handlers.
See Also:
parseTags(boolean)

parseTags

public BulletParser parseTags(boolean parseTags)
Deprecated. 
Sets whether this parser will parse tags and invoke element handlers.

Parameters:
parseTags - the new value.
Returns:
this parser.

parseAttributes

public boolean parseAttributes()
Deprecated. 
Returns whether this parser will parse attributes.

Returns:
whether this parser will parse attributes.
See Also:
parseAttributes(boolean)

parseAttributes

public BulletParser parseAttributes(boolean parseAttributes)
Deprecated. 
Sets the attribute parsing flag.

Parameters:
parseAttributes - the new value for the flag.
Returns:
this parser.

parseAttribute

public BulletParser parseAttribute(Attribute attribute)
Deprecated. 
Adds the given attribute to the set of attributes to be parsed.

Parameters:
attribute - an attribute that should be parsed.
Returns:
this parser.
Throws:
IllegalStateException - if parseAttributes(true) has not been invoked on this parser.
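Because every setter returns this parser, configuration calls can be chained, and the IllegalStateException makes the required ordering explicit. The miniature below only imitates that contract: FluentFlagsSketch is a hypothetical class, attributes are plain strings rather than Attribute objects, and nothing here is the parser's actual implementation.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical miniature of the fluent setters documented above. */
final class FluentFlagsSketch {
    private boolean parseAttributes;
    private final Set<String> parsedAttrs = new LinkedHashSet<>();

    FluentFlagsSketch parseAttributes(boolean parseAttributes) {
        this.parseAttributes = parseAttributes;
        return this; // returning this is what makes chaining possible
    }

    FluentFlagsSketch parseAttribute(String attribute) {
        // Mirrors the documented contract: the flag must be set first.
        if (!parseAttributes)
            throw new IllegalStateException("parseAttributes(true) has not been invoked");
        parsedAttrs.add(attribute);
        return this;
    }

    /** A read-only view, analogous in spirit to the public parsedAttributes field. */
    Set<String> parsedAttributes() {
        return Collections.unmodifiableSet(parsedAttrs);
    }
}
```

With this sketch, new FluentFlagsSketch().parseAttributes(true).parseAttribute("href") succeeds, while calling parseAttribute("href") on a fresh instance throws.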

setCallback

public BulletParser setCallback(Callback callback)
Deprecated. 
Sets the callback for this parser, resetting at the same time all parsing flags.

Parameters:
callback - the new callback.
Returns:
this parser.

entity2Char

protected char entity2Char(MutableString name)
Deprecated. 
Returns the character corresponding to a given entity name.

Parameters:
name - the name of an entity.
Returns:
the character corresponding to the entity, or an ASCII NUL if no entity with that name was found.

scanEntity

protected int scanEntity(char[] a,
                         int offset,
                         int length,
                         boolean loose,
                         MutableString entity)
Deprecated. 
Searches for the end of an entity.

This method will search for the end of an entity starting at the given offset (the offset must correspond to the ampersand).

Real-world HTML pages often contain hundreds of misplaced ampersands, due to the unfortunate idea of using the ampersand as query separator (please use the comma in new code!). All such ampersands should be specified as &amp;. If named entities are delimited using a transition from alphabetical to non-alphabetical characters, we can easily get false positives. If the parameter loose is false, named entities can be delimited only by whitespace or by a comma.

Parameters:
a - a character array containing the entity.
offset - the offset at which the entity starts (the offset must point at the ampersand).
length - an upper bound to the maximum returned position.
loose - if true, named entities can be terminated by any non-alphabetical character (instead of whitespace or comma).
entity - a support mutable string used to query ParsingFactory.getEntity(MutableString).
Returns:
the position of the last character of the entity, or -1 if no entity was found.
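The difference between loose and strict termination can be sketched as follows. This is a simplified illustration, not the method's implementation: EntityScanSketch and scan are hypothetical names, only named entities are handled, the lookup through the parsing factory is omitted, length is assumed to be a character count starting at offset, and accepting a semicolon as terminator in both modes is an assumption of the sketch.

```java
final class EntityScanSketch {
    /**
     * Scans a named entity whose ampersand is at a[offset].
     * Returns the index of the last character of the entity, or -1.
     */
    static int scan(char[] a, int offset, int length, boolean loose) {
        int end = Math.min(a.length, offset + length); // never look past this bound
        if (offset < 0 || offset >= end || a[offset] != '&') return -1;
        int i = offset + 1;
        while (i < end && Character.isLetter(a[i])) i++;
        // Empty name, or the window ends before anything terminates the name.
        if (i == offset + 1 || i == end) return -1;
        char c = a[i];
        if (c == ';') return i; // the semicolon belongs to the entity
        // Loose: any non-alphabetical character ends the name.
        // Strict: only whitespace or a comma does.
        if (loose || Character.isWhitespace(c) || c == ',') return i - 1;
        return -1;
    }
}
```

On the query string a=1&b=2 the ampersand is followed by b and then an equals sign: loose scanning reports an entity ending at b (the kind of false positive the description warns about), whereas strict scanning returns -1.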

replaceEntities

protected void replaceEntities(MutableString s,
                               MutableString entity,
                               boolean loose)
Deprecated. 
Replaces entities with the corresponding characters.

This method will modify the mutable string s so that all legal occurrences of entities are replaced by the corresponding character.

Parameters:
s - a mutable string whose entities will be replaced by the corresponding characters.
entity - a support mutable string used by scanEntity(char[], int, int, boolean, MutableString).
loose - a parameter that will be passed to scanEntity(char[], int, int, boolean, MutableString).
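In-place replacement can be illustrated with a StringBuilder standing in for MutableString. Everything in this sketch is an assumption made for illustration: the class name, the tiny entity table, and the simplification that only semicolon-terminated named entities are recognised (the loose parameter is not modelled).

```java
import java.util.Map;

final class ReplaceEntitiesSketch {
    // A tiny, assumed entity table; the real one is provided by the parsing factory.
    static final Map<String, Character> ENTITIES =
        Map.of("amp", '&', "lt", '<', "gt", '>', "quot", '"');

    /** Returns the character for a name, or ASCII NUL if the name is unknown. */
    static char entity2Char(String name) {
        Character c = ENTITIES.get(name);
        return c == null ? '\0' : c;
    }

    /** Replaces, in place, every known semicolon-terminated entity in s. */
    static void replaceEntities(StringBuilder s) {
        int i = 0;
        while ((i = s.indexOf("&", i)) != -1) {
            int semi = s.indexOf(";", i + 1);
            if (semi == -1) break; // no terminator left: nothing more to replace
            char c = entity2Char(s.substring(i + 1, semi));
            if (c != '\0') s.replace(i, semi + 1, String.valueOf(c));
            i++; // skip the replacement, or an ampersand that was left as it is
        }
    }
}
```

Advancing past each replacement means that the text &amp;lt; becomes &lt; and is not decoded a second time, and unknown names such as &bogus; are left untouched.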

handleMarkup

protected int handleMarkup(char[] text,
                           int pos,
                           int end)
Deprecated. 
Handles markup.

Parameters:
text - the text.
pos - the first character in the markup after <!.
end - the end of text.
Returns:
the position of the first character after the markup.

handleProcessingInstruction

protected int handleProcessingInstruction(char[] text,
                                          int pos,
                                          int end)
Deprecated. 
Handles processing instructions, ASP tags, etc.

Parameters:
text - the text.
pos - the first character in the markup after <%.
end - the end of text.
Returns:
the position of the first character after the processing instruction.

parse

public void parse(char[] text)
Deprecated. 
Analyze the text document to extract information.

Parameters:
text - a char array of text to be parsed.

parse

public void parse(char[] text,
                  int offset,
                  int length)
Deprecated. 
Analyze the text document to extract information.

Parameters:
text - a char array of text to be parsed.
offset - the offset in the array from which the parsing will begin.
length - the number of characters to be parsed.
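The offset and length parameters select a window of the array, so a document can be parsed out of a larger buffer without copying. The toy scanner below shows that convention together with a two-state loop; it is a sketch under stated assumptions, far simpler than the parser itself, and WindowScanSketch, textOf and the state values are hypothetical. That the one-argument overload covers the whole array is likewise an assumption of the sketch.

```java
final class WindowScanSketch {
    static final int STATE_TEXT = 0;   // values assumed for this sketch
    static final int STATE_IN_TAG = 1;

    /** Collects the characters outside angle brackets in text[offset .. offset + length). */
    static String textOf(char[] text, int offset, int length) {
        StringBuilder out = new StringBuilder();
        int state = STATE_TEXT;
        for (int i = offset; i < offset + length; i++) {
            char c = text[i];
            if (state == STATE_TEXT) {
                if (c == '<') state = STATE_IN_TAG; // markup starts
                else out.append(c);
            } else if (c == '>') {
                state = STATE_TEXT; // markup ends
            }
        }
        return out.toString();
    }

    /** Convenience overload covering the whole array. */
    static String textOf(char[] text) {
        return textOf(text, 0, text.length);
    }
}
```

For the input <p>Hello <b>world</b></p>, the whole-array call yields Hello world, while a window starting at offset 3 with length 5 yields Hello.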