edu.harvard.hul.ois.jhove.module
Class XmlModule

java.lang.Object
  extended by edu.harvard.hul.ois.jhove.ModuleBase
      extended by edu.harvard.hul.ois.jhove.module.XmlModule
All Implemented Interfaces:
Module

public class XmlModule
extends ModuleBase

Module for identification and validation of XML files.

Author:
Gary McGath

Field Summary
protected  java.lang.String _baseURL
           
protected  Checksummer _ckSummer
          PRIVATE INSTANCE FIELDS.
protected  ChecksumInputStream _cstream
           
protected  java.io.DataInputStream _dstream
           
protected  java.util.Map<java.lang.String,java.io.File> _localSchemas
           
protected  Property _metadata
           
protected  boolean _parseFromSig
           
protected  java.util.List<Property> _propList
           
protected  boolean _sigWantsDecl
           
protected  TextMDMetadata _textMD
           
protected  boolean _withTextMD
           
protected  java.lang.String _xhtmlDoctype
           
 
Fields inherited from class edu.harvard.hul.ois.jhove.ModuleBase
_app, _bigEndian, _checksumFinished, _countStream, _coverage, _crc32, _date, _defaultParams, _features, _format, _init, _isRandomAccess, _je, _logger, _md5, _mimeType, _name, _nByte, _note, _param, _release, _repInfoNote, _rights, _sha1, _signature, _specification, _validityNote, _vendor, _verbosity, _wellFormedNote
 
Fields inherited from interface edu.harvard.hul.ois.jhove.Module
MAXIMUM_VERBOSITY, MINIMUM_VERBOSITY
 
Constructor Summary
XmlModule()
          Instantiate an XmlModule object.
 
Method Summary
 void checkSignatures(java.io.File file, java.io.InputStream stream, RepInfo info)
          Check if the digital object conforms to this Module's internal signature information.
protected  void initParse()
          Initializes the state of the module for parsing.
protected static java.lang.String intTo4DigitHex(int n)
           
protected static boolean isNotEmpty(java.lang.String value)
          Verification that the string contains something useful.
protected static boolean nameInCollection(java.lang.String name, java.util.Collection<java.lang.String> coll)
           
 void param(java.lang.String param)
          Per-action initialization.
 int parse(java.io.InputStream stream, RepInfo info, int parseIndex)
          Parse the content of a purported XML digital object and store the results in RepInfo.
 void resetParams()
          Reset parameter settings.
 void setXhtmlDoctype(java.lang.String doctype)
          Sets the value of the doctype string, assumed to have been forced to upper case.
 
Methods inherited from class edu.harvard.hul.ois.jhove.ModuleBase
addIntegerProperty, addIntegerProperty, applyDefaultParams, calcRAChecksum, checkSignatures, getApp, getBase, getBufferedDataStream, getCoverage, getCRC32, getDate, getDefaultParams, getFeatures, getFormat, getMimeType, getName, getNByte, getNote, getRelease, getRepInfoNote, getRights, getSignature, getSpecification, getValidityNote, getVendor, getWellFormedNote, hasFeature, init, initFeatures, isBigEndian, isRandomAccess, parse, readByteBuf, readDouble, readDouble, readDouble, readFloat, readFloat, readSignedByte, readSignedByte, readSignedByte, readSignedInt, readSignedInt, readSignedInt, readSignedLong, readSignedRational, readSignedRational, readSignedShort, readSignedShort, readSignedShort, readUnsignedByte, readUnsignedByte, readUnsignedByte, readUnsignedInt, readUnsignedInt, readUnsignedInt, readUnsignedRational, readUnsignedRational, readUnsignedRational, readUnsignedShort, readUnsignedShort, readUnsignedShort, setApp, setBase, setChecksums, setCRC32, setDefaultParams, setMD5, setNByte, setSHA1, setValidityNote, setVerbosity, show, skipBytes, skipBytes, vectorToPropArray
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_ckSummer

protected Checksummer _ckSummer
PRIVATE INSTANCE FIELDS.


_cstream

protected ChecksumInputStream _cstream

_dstream

protected java.io.DataInputStream _dstream

_propList

protected java.util.List<Property> _propList

_metadata

protected Property _metadata

_xhtmlDoctype

protected java.lang.String _xhtmlDoctype

_baseURL

protected java.lang.String _baseURL

_sigWantsDecl

protected boolean _sigWantsDecl

_parseFromSig

protected boolean _parseFromSig

_withTextMD

protected boolean _withTextMD

_textMD

protected TextMDMetadata _textMD

_localSchemas

protected java.util.Map<java.lang.String,java.io.File> _localSchemas

Constructor Detail

XmlModule

public XmlModule()
Instantiate an XmlModule object.

Method Detail

setXhtmlDoctype

public void setXhtmlDoctype(java.lang.String doctype)
Sets the value of the doctype string, assumed to have been forced to upper case. This is set only when the HTML module invokes the XML module for an XHTML document.


resetParams

public void resetParams()
                 throws java.lang.Exception
Reset parameter settings. Returns to a default state without any parameters.

Specified by:
resetParams in interface Module
Overrides:
resetParams in class ModuleBase
Throws:
java.lang.Exception

param

public void param(java.lang.String param)
Per-action initialization.

Specified by:
param in interface Module
Overrides:
param in class ModuleBase
Parameters:
param - The module parameter; under command-line JHOVE, the -p parameter. If the parameter starts with "schema", the part to the right of the equals sign associates a URI with a local path (URI, then semicolon, then path). If the first character is 's' and the parameter does not start with "schema", signature checking requires a document declaration, and the rest of the parameter is interpreted as follows. If what remains (the whole parameter when there is no leading 's') begins with 'b' or 'B', the text after that letter is used as a base URL. Otherwise it is ignored, and there is no base URL.
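The parameter rules above can be sketched as follows. This is a minimal illustration that follows the wording of the description literally; it is not the module's actual source. The class XmlParamSketch, its fields, and the method apply are hypothetical names introduced here, and the example URI and path are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the settings named in the parameter description. */
public class XmlParamSketch {
    /** Schema URI mapped to a local path (assumed representation). */
    public final Map<String, String> localSchemas = new HashMap<>();
    /** True when signature checking should require a document declaration. */
    public boolean sigWantsDecl = false;
    /** Base URL, or null when none was given. */
    public String baseUrl = null;

    /** Applies one parameter string using the documented rules. */
    public void apply(String param) {
        if (param == null) {
            return;
        }
        if (param.startsWith("schema")) {
            // Right of the equals sign: URI, then semicolon, then path.
            int eq = param.indexOf('=');
            if (eq < 0) {
                return;
            }
            String value = param.substring(eq + 1);
            int semi = value.indexOf(';');
            if (semi > 0 && semi < value.length() - 1) {
                localSchemas.put(value.substring(0, semi), value.substring(semi + 1));
            }
            return;
        }
        String rest = param;
        if (rest.startsWith("s")) {
            // A leading 's' asks signature checking to require a declaration.
            sigWantsDecl = true;
            rest = rest.substring(1);
        }
        if (rest.startsWith("b") || rest.startsWith("B")) {
            // Everything after the 'b' or 'B' is taken as the base URL.
            baseUrl = rest.substring(1);
        }
        // Anything else is ignored, leaving no base URL.
    }
}
```

Under these assumptions, apply("sbhttp://example.org/base/") sets sigWantsDecl and records http://example.org/base/ as the base URL, while apply("schema=http://example.org/ns;/tmp/ns.xsd") records one local schema mapping.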

parse

public int parse(java.io.InputStream stream,
                 RepInfo info,
                 int parseIndex)
          throws java.io.IOException
Parse the content of a purported XML digital object and store the results in RepInfo. This is designed to be called in two passes. On the first pass, a nonvalidating parse is done. If this succeeds, and the presence of DTDs or schemas is detected, then parse returns 1 so that it will be called again to do a validating parse. If there is nothing to validate, we consider it "valid."

Specified by:
parse in interface Module
Overrides:
parse in class ModuleBase
Parameters:
stream - An InputStream, positioned at its beginning, which is generated from the object to be parsed. If multiple calls to parse are made on the basis of a nonzero value being returned, a new InputStream must be provided each time.
info - A fresh (on the first call) RepInfo object which will be modified to reflect the results of the parsing. If multiple calls to parse are made on the basis of a nonzero value being returned, the same RepInfo object should be passed with each call.
parseIndex - Must be 0 in first call to parse. If parse returns a nonzero value, it must be called again with parseIndex equal to that return value.
Throws:
java.io.IOException
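The calling convention described above (start with parseIndex 0, supply a new stream on every call, reuse the same info object, and stop when 0 is returned) can be sketched as a small driver loop. To keep the example self-contained it does not use the JHOVE classes: MultiPassParser is a hypothetical stand-in for the parse signature, and parseFully is an illustrative helper, not part of this API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

public class TwoPassParseSketch {
    /** Hypothetical stand-in for the parse(stream, info, parseIndex) contract. */
    public interface MultiPassParser<I> {
        int parse(InputStream stream, I info, int parseIndex) throws IOException;
    }

    /**
     * Calls parse until it returns 0 and reports how many passes were made.
     * The supplier must open a new stream, positioned at its beginning,
     * every time it is asked.
     */
    public static <I> int parseFully(MultiPassParser<I> parser,
                                     Supplier<InputStream> streams,
                                     I info) throws IOException {
        int parseIndex = 0;   // must be 0 on the first call
        int passes = 0;
        do {
            // A new stream for each pass; the same info object throughout.
            try (InputStream in = streams.get()) {
                parseIndex = parser.parse(in, info, parseIndex);
            }
            passes++;
        } while (parseIndex != 0);  // nonzero means "call me again with this value"
        return passes;
    }
}
```

With a parser that behaves as documented for XML with a DTD or schema, this loop makes two passes: the nonvalidating one, then the validating one.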

checkSignatures

public void checkSignatures(java.io.File file,
                            java.io.InputStream stream,
                            RepInfo info)
                     throws java.io.IOException
Check if the digital object conforms to this Module's internal signature information. XML is a particularly messy case; in general, there's no even moderately good way to check "signatures" without parsing the whole file, since the document declaration is optional. We provide the user two choices, based on the "s" parameter. If 's' is the first character of the module parameter, then we look for an XML document declaration, and say there's no signature if it's missing. (This can reject well-formed XML files, though not valid ones.) Otherwise, if there's no document declaration, we parse the whole file.

Specified by:
checkSignatures in interface Module
Overrides:
checkSignatures in class ModuleBase
Parameters:
file - A File object for the object being parsed
stream - An InputStream, positioned at its beginning, which is generated from the object to be parsed
info - A fresh RepInfo object which will be modified to reflect the results of the test
Throws:
java.io.IOException
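The declaration check that the 's' option relies on can be illustrated with a simplified sketch. It assumes an ASCII-compatible encoding such as UTF-8 with no byte-order mark, which the real module may well handle differently; XmlDeclSketch and startsWithXmlDeclaration are hypothetical names, not part of this class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class XmlDeclSketch {
    /** The bytes that open an XML declaration in an ASCII-compatible encoding. */
    private static final byte[] DECL = "<?xml".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the stream, read from its current position, begins
     * with "<?xml". Consumes up to five bytes of the stream.
     */
    public static boolean startsWithXmlDeclaration(InputStream stream) throws IOException {
        for (byte expected : DECL) {
            int b = stream.read();          // -1 at end of stream never matches
            if (b != (expected & 0xFF)) {
                return false;
            }
        }
        return true;
    }
}
```

A well-formed document such as `<a/>` fails this test even though it is legal XML, which is the trade-off the description above points out.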

initParse

protected void initParse()
Description copied from class: ModuleBase
Initializes the state of the module for parsing. This should be called early in each module's parse() method. If a module overrides it to provide additional functionality, the module's initParse() should call super.initParse().

Overrides:
initParse in class ModuleBase

nameInCollection

protected static boolean nameInCollection(java.lang.String name,
                                          java.util.Collection<java.lang.String> coll)

intTo4DigitHex

protected static java.lang.String intTo4DigitHex(int n)

isNotEmpty

protected static boolean isNotEmpty(java.lang.String value)
Verification that the string contains something useful.

Parameters:
value - string to test
Returns:
boolean