edu.harvard.hul.ois.jhove.module
Class HtmlModule
java.lang.Object
edu.harvard.hul.ois.jhove.ModuleBase
edu.harvard.hul.ois.jhove.module.HtmlModule
- All Implemented Interfaces:
- Module
public class HtmlModule
- extends ModuleBase
Module for identification and validation of HTML files.
HTML differs from most other formats in that
sloppy construction is practically assumed in the specification.
This module attempts to report as many errors as possible and
to recover reasonably from them. To do this, it relies on more
heuristic behavior than the more straightforward modules do.
XHTML is recognized by this module, but is handed off to the
XML module for processing. If the XML module is missing (which
it shouldn't be if you've installed the JHOVE application without
modifications), this module won't be able to deal with XHTML files.
HTML should be placed ahead of XML in the module order. If the
XML module sees an XHTML file first, it will recognize it as XHTML,
but won't be able to report the complete properties.
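The two requirements above (the XML module must be present, and HTML must come before XML in the module order) can be checked mechanically. The following is a minimal sketch, not part of JHOVE: the class ModuleOrderCheck and its methods are hypothetical, and the XML module's class name is an assumption based on the package of HtmlModule. It treats the module order as a plain list of class names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JHOVE.
final class ModuleOrderCheck {
    static final String HTML = "edu.harvard.hul.ois.jhove.module.HtmlModule";
    // Assumed class name for the XML module; adjust to match your installation.
    static final String XML = "edu.harvard.hul.ois.jhove.module.XmlModule";

    /** True only if both modules are listed and HTML appears before XML. */
    static boolean htmlBeforeXml(List<String> moduleClassNames) {
        int html = moduleClassNames.indexOf(HTML);
        int xml = moduleClassNames.indexOf(XML);
        return html >= 0 && xml >= 0 && html < xml;
    }

    /** True if the named class can be located without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ModuleOrderCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> order = Arrays.asList(HTML, XML);
        System.out.println("HTML before XML: " + htmlBeforeXml(order));
        System.out.println("XML module loadable: " + isLoadable(XML));
    }
}
```

Whether isXmlAvailable() performs a comparable class lookup is not stated here; the isLoadable method only illustrates one common way to test for an optional class.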
The HTML module uses code created with the JavaCC parser generator
and lexical analyzer generator. There is apparently a bug in
JavaCC which causes blank lines not to be counted in certain cases,
causing lexical errors to be reported with incorrect line numbers.
- Author:
- Gary McGath
Fields inherited from class edu.harvard.hul.ois.jhove.ModuleBase
_app, _bigEndian, _checksumFinished, _countStream, _coverage, _crc32, _date, _defaultParams, _features, _format, _init, _isRandomAccess, _je, _logger, _md5, _mimeType, _name, _nByte, _note, _param, _release, _repInfoNote, _rights, _sha1, _signature, _specification, _validityNote, _vendor, _verbosity, _wellFormedNote
Constructor Summary
HtmlModule()
- Instantiate an HtmlModule object.
Method Summary
protected int checkDoctype(java.util.List elements)
void checkSignatures(java.io.File file, java.io.InputStream stream, RepInfo info)
- Check if the digital object conforms to this Module's internal signature information.
protected static boolean isXmlAvailable()
int parse(java.io.InputStream stream, RepInfo info, int parseIndex)
- Parse the content of a purported HTML stream digital object and store the results in RepInfo.
protected int seemsToBeXHTML(java.util.List elements)
protected java.lang.String stripQuotes(java.lang.String str)
Methods inherited from class edu.harvard.hul.ois.jhove.ModuleBase
addIntegerProperty, addIntegerProperty, applyDefaultParams, calcRAChecksum, checkSignatures, getApp, getBase, getBufferedDataStream, getCoverage, getCRC32, getDate, getDefaultParams, getFeatures, getFormat, getMimeType, getName, getNByte, getNote, getRelease, getRepInfoNote, getRights, getSignature, getSpecification, getValidityNote, getVendor, getWellFormedNote, hasFeature, init, initFeatures, initParse, isBigEndian, isRandomAccess, param, parse, readByteBuf, readDouble, readDouble, readDouble, readFloat, readFloat, readSignedByte, readSignedByte, readSignedByte, readSignedInt, readSignedInt, readSignedInt, readSignedLong, readSignedRational, readSignedRational, readSignedShort, readSignedShort, readSignedShort, readUnsignedByte, readUnsignedByte, readUnsignedByte, readUnsignedInt, readUnsignedInt, readUnsignedInt, readUnsignedRational, readUnsignedRational, readUnsignedRational, readUnsignedShort, readUnsignedShort, readUnsignedShort, resetParams, setApp, setBase, setChecksums, setCRC32, setDefaultParams, setMD5, setNByte, setSHA1, setValidityNote, setVerbosity, show, skipBytes, skipBytes, vectorToPropArray
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
_cstream
protected ChecksumInputStream _cstream
- PRIVATE INSTANCE FIELDS.
_dstream
protected java.io.DataInputStream _dstream
_doctype
protected java.lang.String _doctype
HTML_3_2
public static final int HTML_3_2
- See Also:
- Constant Field Values
HTML_4_0_STRICT
public static final int HTML_4_0_STRICT
- See Also:
- Constant Field Values
HTML_4_0_FRAMESET
public static final int HTML_4_0_FRAMESET
- See Also:
- Constant Field Values
HTML_4_0_TRANSITIONAL
public static final int HTML_4_0_TRANSITIONAL
- See Also:
- Constant Field Values
HTML_4_01_STRICT
public static final int HTML_4_01_STRICT
- See Also:
- Constant Field Values
HTML_4_01_FRAMESET
public static final int HTML_4_01_FRAMESET
- See Also:
- Constant Field Values
HTML_4_01_TRANSITIONAL
public static final int HTML_4_01_TRANSITIONAL
- See Also:
- Constant Field Values
XHTML_1_0_STRICT
public static final int XHTML_1_0_STRICT
- See Also:
- Constant Field Values
XHTML_1_0_TRANSITIONAL
public static final int XHTML_1_0_TRANSITIONAL
- See Also:
- Constant Field Values
XHTML_1_0_FRAMESET
public static final int XHTML_1_0_FRAMESET
- See Also:
- Constant Field Values
XHTML_1_1
public static final int XHTML_1_1
- See Also:
- Constant Field Values
_withTextMD
protected boolean _withTextMD
_textMD
protected TextMDMetadata _textMD
HtmlModule
public HtmlModule()
- Instantiate an HtmlModule object.
parse
public int parse(java.io.InputStream stream,
RepInfo info,
int parseIndex)
throws java.io.IOException
- Parse the content of a purported HTML stream digital object and store the
results in RepInfo.
- Specified by:
parse in interface Module
- Overrides:
parse in class ModuleBase
- Parameters:
stream - An InputStream, positioned at its beginning, which is generated from the object to be parsed. If multiple calls to parse are made on the basis of a nonzero value being returned, a new InputStream must be provided each time.
info - A fresh (on the first call) RepInfo object which will be modified to reflect the results of the parsing. If multiple calls to parse are made on the basis of a nonzero value being returned, the same RepInfo object should be passed with each call.
parseIndex - Must be 0 in first call to parse. If parse returns a nonzero value, it must be called again with parseIndex equal to that return value.
- Throws:
java.io.IOException
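The calling convention described for parse (start with parseIndex 0, supply a new stream on every pass, reuse the same info object, and stop when the return value is 0) can be sketched as a driver loop. To keep the example self-contained, it uses hypothetical stand-in types rather than the JHOVE classes: MultiPassParser mirrors only the shape of parse, StreamSource is an assumed way to reopen the object, and the info parameter is generic instead of RepInfo.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the multi-pass calling convention; not JHOVE code.
final class ParseLoopSketch {
    /** Stand-in for the shape of parse(stream, info, parseIndex). */
    interface MultiPassParser<I> {
        int parse(InputStream stream, I info, int parseIndex) throws IOException;
    }

    /** Assumed helper that opens a new stream positioned at the beginning. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Calls parse until it returns 0; returns the number of passes made. */
    static <I> int parseUntilDone(MultiPassParser<I> parser, StreamSource source, I info)
            throws IOException {
        int parseIndex = 0;   // must be 0 on the first call
        int passes = 0;
        do {
            // A new stream for every pass, the same info object throughout.
            try (InputStream stream = source.open()) {
                parseIndex = parser.parse(stream, info, parseIndex);
            }
            passes++;
        } while (parseIndex != 0);
        return passes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<html><title>t</title></html>".getBytes("US-ASCII");
        // Toy parser that asks for exactly one extra pass.
        MultiPassParser<StringBuilder> toy =
                (stream, info, index) -> { info.append(index).append(' '); return index == 0 ? 1 : 0; };
        int passes = parseUntilDone(toy, () -> new ByteArrayInputStream(data), new StringBuilder());
        System.out.println("passes: " + passes);
    }
}
```

With the real module, the source would typically reopen the underlying file on each pass, since a consumed InputStream cannot generally be rewound.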
checkSignatures
public void checkSignatures(java.io.File file,
java.io.InputStream stream,
RepInfo info)
throws java.io.IOException
- Check if the digital object conforms to this Module's
internal signature information.
HTML is among the most ill-defined of the open formats, so
checking a "signature" really means using some heuristics. The only
required tag is TITLE, but that could occur well into the file. So we
look for any of three strings (allowing for differences in case
and white space) within the first sigBytes bytes, and call that
a signature check.
- Specified by:
checkSignatures in interface Module
- Overrides:
checkSignatures in class ModuleBase
- Parameters:
file - A File object for the object being parsed
stream - An InputStream, positioned at its beginning, which is generated from the object to be parsed
info - A fresh RepInfo object which will be modified to reflect the results of the test
- Throws:
java.io.IOException
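The heuristic described for checkSignatures can be illustrated with a small standalone sketch. It is not the module's implementation: the description does not name the three strings, so the markers below are assumptions chosen for illustration, and white space is handled in the simplest possible way, by removing it before matching. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of a heuristic HTML signature check; not JHOVE code.
final class HtmlSignatureSketch {
    // Assumed markers, written without white space because white space is
    // stripped before matching. The real module's strings may differ.
    static final String[] MARKERS = {"<!DOCTYPEHTML", "<HTML", "<TITLE"};

    /** True if any marker occurs within the first sigBytes bytes of data. */
    static boolean looksLikeHtml(byte[] data, int sigBytes) {
        int n = Math.min(data.length, Math.max(sigBytes, 0));
        // ISO-8859-1 maps each byte to one char, so no decoding errors occur.
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        String normalized = head.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (normalized.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] sample = "<!doctype  html PUBLIC ...>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeHtml(sample, 1024));
    }
}
```

Removing all white space is deliberately permissive: it also accepts input such as "< html", which a stricter matcher might reject. The window size passed as sigBytes in the example is likewise an arbitrary choice.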
checkDoctype
protected int checkDoctype(java.util.List elements)
seemsToBeXHTML
protected int seemsToBeXHTML(java.util.List elements)
stripQuotes
protected java.lang.String stripQuotes(java.lang.String str)
isXmlAvailable
protected static boolean isXmlAvailable()