edu.harvard.hul.ois.jhove.module
Class HtmlModule

java.lang.Object
  extended by edu.harvard.hul.ois.jhove.ModuleBase
      extended by edu.harvard.hul.ois.jhove.module.HtmlModule
All Implemented Interfaces:
Module

public class HtmlModule
extends ModuleBase

Module for identification and validation of HTML files. HTML differs from most other document formats in that sloppy construction is practically assumed in the specification. This module attempts to report as many errors as possible and to recover reasonably from them. To do this, it has more heuristic behavior built in than the more straightforward modules.

XHTML is recognized by this module, but is handed off to the XML module for processing. If the XML module is missing (which it shouldn't be if you've installed the JHOVE application without modifications), this module won't be able to deal with XHTML files. HTML should be placed ahead of XML in the module order. If the XML module sees an XHTML file first, it will recognize it as XHTML, but won't be able to report the complete properties.

The HTML module uses code created with the JavaCC parser generator and lexical analyzer generator. There is apparently a bug in JavaCC which causes blank lines not to be counted in certain cases, so lexical errors may be reported with incorrect line numbers.

Author:
Gary McGath

Field Summary
protected  ChecksumInputStream _cstream
          PRIVATE INSTANCE FIELDS.
protected  java.lang.String _doctype
           
protected  java.io.DataInputStream _dstream
           
protected  TextMDMetadata _textMD
           
protected  boolean _withTextMD
           
static int HTML_3_2
           
static int HTML_4_0_FRAMESET
           
static int HTML_4_0_STRICT
           
static int HTML_4_0_TRANSITIONAL
           
static int HTML_4_01_FRAMESET
           
static int HTML_4_01_STRICT
           
static int HTML_4_01_TRANSITIONAL
           
static int XHTML_1_0_FRAMESET
           
static int XHTML_1_0_STRICT
           
static int XHTML_1_0_TRANSITIONAL
           
static int XHTML_1_1
           
 
Fields inherited from class edu.harvard.hul.ois.jhove.ModuleBase
_app, _bigEndian, _checksumFinished, _countStream, _coverage, _crc32, _date, _defaultParams, _features, _format, _init, _isRandomAccess, _je, _logger, _md5, _mimeType, _name, _nByte, _note, _param, _release, _repInfoNote, _rights, _sha1, _signature, _specification, _validityNote, _vendor, _verbosity, _wellFormedNote
 
Fields inherited from interface edu.harvard.hul.ois.jhove.Module
MAXIMUM_VERBOSITY, MINIMUM_VERBOSITY
 
Constructor Summary
HtmlModule()
          Instantiate an HtmlModule object.
 
Method Summary
protected  int checkDoctype(java.util.List elements)
           
 void checkSignatures(java.io.File file, java.io.InputStream stream, RepInfo info)
          Check if the digital object conforms to this Module's internal signature information.
protected static boolean isXmlAvailable()
           
 int parse(java.io.InputStream stream, RepInfo info, int parseIndex)
          Parse the content of a purported HTML stream digital object and store the results in RepInfo.
protected  int seemsToBeXHTML(java.util.List elements)
           
protected  java.lang.String stripQuotes(java.lang.String str)
           
 
Methods inherited from class edu.harvard.hul.ois.jhove.ModuleBase
addIntegerProperty, addIntegerProperty, applyDefaultParams, calcRAChecksum, checkSignatures, getApp, getBase, getBufferedDataStream, getCoverage, getCRC32, getDate, getDefaultParams, getFeatures, getFormat, getMimeType, getName, getNByte, getNote, getRelease, getRepInfoNote, getRights, getSignature, getSpecification, getValidityNote, getVendor, getWellFormedNote, hasFeature, init, initFeatures, initParse, isBigEndian, isRandomAccess, param, parse, readByteBuf, readDouble, readDouble, readDouble, readFloat, readFloat, readSignedByte, readSignedByte, readSignedByte, readSignedInt, readSignedInt, readSignedInt, readSignedLong, readSignedRational, readSignedRational, readSignedShort, readSignedShort, readSignedShort, readUnsignedByte, readUnsignedByte, readUnsignedByte, readUnsignedInt, readUnsignedInt, readUnsignedInt, readUnsignedRational, readUnsignedRational, readUnsignedRational, readUnsignedShort, readUnsignedShort, readUnsignedShort, resetParams, setApp, setBase, setChecksums, setCRC32, setDefaultParams, setMD5, setNByte, setSHA1, setValidityNote, setVerbosity, show, skipBytes, skipBytes, vectorToPropArray
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_cstream

protected ChecksumInputStream _cstream
PRIVATE INSTANCE FIELDS.


_dstream

protected java.io.DataInputStream _dstream

_doctype

protected java.lang.String _doctype

HTML_3_2

public static final int HTML_3_2
See Also:
Constant Field Values

HTML_4_0_STRICT

public static final int HTML_4_0_STRICT
See Also:
Constant Field Values

HTML_4_0_FRAMESET

public static final int HTML_4_0_FRAMESET
See Also:
Constant Field Values

HTML_4_0_TRANSITIONAL

public static final int HTML_4_0_TRANSITIONAL
See Also:
Constant Field Values

HTML_4_01_STRICT

public static final int HTML_4_01_STRICT
See Also:
Constant Field Values

HTML_4_01_FRAMESET

public static final int HTML_4_01_FRAMESET
See Also:
Constant Field Values

HTML_4_01_TRANSITIONAL

public static final int HTML_4_01_TRANSITIONAL
See Also:
Constant Field Values

XHTML_1_0_STRICT

public static final int XHTML_1_0_STRICT
See Also:
Constant Field Values

XHTML_1_0_TRANSITIONAL

public static final int XHTML_1_0_TRANSITIONAL
See Also:
Constant Field Values

XHTML_1_0_FRAMESET

public static final int XHTML_1_0_FRAMESET
See Also:
Constant Field Values

XHTML_1_1

public static final int XHTML_1_1
See Also:
Constant Field Values

_withTextMD

protected boolean _withTextMD

_textMD

protected TextMDMetadata _textMD
Constructor Detail

HtmlModule

public HtmlModule()
Instantiate an HtmlModule object.

Method Detail

parse

public int parse(java.io.InputStream stream,
                 RepInfo info,
                 int parseIndex)
          throws java.io.IOException
Parse the content of a purported HTML stream digital object and store the results in RepInfo.

Specified by:
parse in interface Module
Overrides:
parse in class ModuleBase
Parameters:
stream - An InputStream, positioned at its beginning, which is generated from the object to be parsed. If multiple calls to parse are made on the basis of a nonzero value being returned, a new InputStream must be provided each time.
info - A fresh (on the first call) RepInfo object which will be modified to reflect the results of the parsing. If multiple calls to parse are made on the basis of a nonzero value being returned, the same RepInfo object should be passed with each call.
parseIndex - Must be 0 in the first call to parse. If parse returns a nonzero value, it must be called again with parseIndex equal to that return value.
Throws:
java.io.IOException
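
The calling convention described by the parameters above (start at 0, pass a new stream and the same result object on each call, stop when 0 is returned) can be sketched as a small driver loop. This is only an illustration under stated assumptions: MultiPassParser and ParseLoop are hypothetical stand-ins written for this sketch, not JHOVE types, and the toy parser in main simply requests one extra pass.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical stand-in for the parse contract; not part of JHOVE.
interface MultiPassParser<R> {
    int parse(InputStream stream, R info, int parseIndex) throws IOException;
}

final class ParseLoop {
    // Calls parse until it returns 0 and reports how many calls were made.
    static <R> int parseFully(MultiPassParser<R> parser,
                              Supplier<InputStream> streams,
                              R info) throws IOException {
        int parseIndex = 0;   // must be 0 on the first call
        int calls = 0;
        do {
            // A new stream, positioned at its beginning, for every call.
            try (InputStream in = streams.get()) {
                // The same result object is passed each time.
                parseIndex = parser.parse(in, info, parseIndex);
            }
            calls++;
        } while (parseIndex != 0);
        return calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<html><title>t</title></html>"
                .getBytes(StandardCharsets.US_ASCII);
        StringBuilder info = new StringBuilder();
        // Toy parser (assumption): asks for a second pass once, then finishes.
        MultiPassParser<StringBuilder> toy = (stream, result, index) -> {
            result.append("pass").append(index).append(';');
            return index == 0 ? 1 : 0;
        };
        int calls = parseFully(toy, () -> new ByteArrayInputStream(data), info);
        System.out.println(calls + " " + info);
    }
}
```

The stream is supplied through a Supplier so that each pass gets its own InputStream, while the result object is created once outside the loop.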

checkSignatures

public void checkSignatures(java.io.File file,
                            java.io.InputStream stream,
                            RepInfo info)
                     throws java.io.IOException
Check if the digital object conforms to this Module's internal signature information. HTML is among the most ill-defined of the open formats, so checking a "signature" really means applying some heuristics. The only required tag is TITLE, but that could occur well into the file. So we look for any of three strings, taking into account case-independence and white space, within the first sigBytes bytes, and call that a signature check.

Specified by:
checkSignatures in interface Module
Overrides:
checkSignatures in class ModuleBase
Parameters:
file - A File object for the object being parsed
stream - An InputStream, positioned at its beginning, which is generated from the object to be parsed
info - A fresh RepInfo object which will be modified to reflect the results of the test
Throws:
java.io.IOException
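
A heuristic check of the kind described above might look roughly like the following. It is a minimal sketch, not the module's code: the class HtmlSignatureSketch, the method looksLikeHtml, and in particular the three marker strings are assumptions chosen for illustration, since the strings actually tested are not listed in this documentation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of a heuristic signature check; not JHOVE code.
final class HtmlSignatureSketch {
    // Assumed markers, written lower-case with white space removed.
    private static final String[] MARKERS = {
        "<!doctypehtml", "<html", "<title"
    };

    // Returns true if any marker occurs within the first sigBytes bytes.
    static boolean looksLikeHtml(byte[] data, int sigBytes) {
        int n = Math.max(0, Math.min(data.length, sigBytes));
        // ISO-8859-1 maps every byte to one character, so decoding cannot fail.
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        // Remove all white space and fold case before comparing.
        String normalized = head.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (normalized.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] doc = "<!DOCTYPE  HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\">"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeHtml(doc, 1024));
    }
}
```

Removing white space before matching is a deliberately loose choice: it accepts input such as "< TITLE >" at the cost of occasional false positives, which is in keeping with a heuristic rather than a strict signature.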

checkDoctype

protected int checkDoctype(java.util.List elements)

seemsToBeXHTML

protected int seemsToBeXHTML(java.util.List elements)

stripQuotes

protected java.lang.String stripQuotes(java.lang.String str)

isXmlAvailable

protected static boolean isXmlAvailable()