edu.harvard.hul.ois.jhove.module.pdf
Class Parser

java.lang.Object
  extended by edu.harvard.hul.ois.jhove.module.pdf.Parser

public class Parser
extends java.lang.Object

The Parser class implements limited syntactic analysis for PDF. It is not intended to be a full parser; its main job is to track the nesting of syntactic elements, such as the starts and ends of dictionaries and arrays.


Constructor Summary
Parser(Tokenizer tokenizer)
          Creates a Parser that takes its tokens from the given Tokenizer.
 
Method Summary
 int getArrayDepth()
          Returns the number of array starts not yet matched by array ends.
 int getDictDepth()
          Returns the number of dictionary starts not yet matched by dictionary ends.
 java.util.Set getLanguageCodes()
          Returns the language code set from the Tokenizer.
 Token getNext()
          Gets a token.
 Token getNext(java.lang.Class clas, java.lang.String errMsg)
          A class-sensitive version of getNext.
 Token getNext(long max)
          Gets a token, subject to a maximum allowable token size.
 long getOffset()
          Returns the current offset into the file.
 boolean getPDFACompliant()
          Returns false if either the parser or the tokenizer has detected non-compliance with PDF/A restrictions.
 java.lang.String getWSString()
          Returns the Tokenizer's current whitespace string.
 PdfArray readArray()
          Reads an array.
 PdfDictionary readDictionary()
          Reads a dictionary.
 PdfObject readObject()
          Reads an object.
 PdfObject readObjectDef()
          Reads an object definition, from wherever we are in the stream to the completion of one full object after the obj keyword.
 PdfObject readObjectDef(Numeric objNumTok)
          Reads an object definition, given the first numeric object, which has already been read and is passed as an argument.
 void reset()
          Clears the state of the parser so that it can start reading at a different place in the file.
 void resetLoose()
          Clears the state of the parser so that it can start reading at a different place in the file and ignore any nesting errors.
 void scanMode(boolean flag)
          If flag is true, does not attempt to parse tokens that are not delimited by whitespace, e.g., literal and hexadecimal strings.
 void seek(long offset)
          Positions the file to the specified offset, and resets the state for a new token stream.
 void setEncrypted(boolean encrypted)
          Tells this Parser, and its Tokenizer, whether the file is encrypted.
 void setObjectMap(java.util.Map objectMap)
          Sets the object map on which the parser will work.
 void setPDFACompliant(boolean pdfACompliant)
          Sets the value of the pdfACompliant flag.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Parser

public Parser(Tokenizer tokenizer)
Constructor. A Parser works with a Tokenizer that feeds it tokens.

Parameters:
tokenizer - The Tokenizer which the parser will use

Method Detail

setObjectMap

public void setObjectMap(java.util.Map objectMap)
Sets the object map on which the parser will work.


reset

public void reset()
Clears the state of the parser so that it can start reading at a different place in the file. Clears the stack and the dictionary and array depth counters.


resetLoose

public void resetLoose()
Clears the state of the parser so that it can start reading at a different place in the file and ignore any nesting errors. Sets the stack and the dictionary and array depth counters to a large number so that nesting exceptions are not thrown.
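
The difference between reset and resetLoose can be illustrated with a minimal sketch. It is a hypothetical model, not the JHOVE implementation: the class name ResetSketch, the method endDictionary, the constant LOOSE_DEPTH and its value, and the use of IllegalStateException are all assumptions made for illustration, and the sketch does not model the stack.

```java
// Hypothetical model of the two reset modes; not the JHOVE implementation.
class ResetSketch {
    // Assumed stand-in for the "large number"; the actual value is not stated here.
    private static final int LOOSE_DEPTH = 1_000_000;

    private int dictDepth;
    private int arrayDepth;

    // Strict reset: counters start at zero, so an unmatched end is an error.
    void reset() {
        dictDepth = 0;
        arrayDepth = 0;
    }

    // Loose reset: counters start so high that an unmatched end never reaches zero.
    void resetLoose() {
        dictDepth = LOOSE_DEPTH;
        arrayDepth = LOOSE_DEPTH;
    }

    // Called when a dictionary end is seen; fails only if no start is outstanding.
    void endDictionary() {
        if (dictDepth == 0) {
            throw new IllegalStateException("dictionary end without start");
        }
        dictDepth--;
    }

    int getDictDepth() {
        return dictDepth;
    }

    public static void main(String[] args) {
        ResetSketch p = new ResetSketch();
        p.resetLoose();
        p.endDictionary();                    // tolerated after a loose reset
        System.out.println(p.getDictDepth()); // prints 999999

        p.reset();
        try {
            p.endDictionary();                // rejected after a strict reset
        } catch (IllegalStateException e) {
            System.out.println("nesting error: " + e.getMessage());
        }
    }
}
```

A loose reset is the natural choice when reading starts in the middle of a structure whose opening tokens were never seen.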


getNext

public Token getNext()
              throws java.io.IOException,
                     PdfException
Gets a token. Uses Tokenizer.getNext, and keeps track of the depth of dictionary and array nesting.

Throws:
java.io.IOException
PdfException
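
The depth bookkeeping described above might look roughly like the following sketch. It is an assumption-laden illustration rather than the actual Parser code: tokens are represented by their raw PDF spellings instead of Token objects, and the class name NestingSketch, the method accept, and the exception type are hypothetical.

```java
// Hypothetical stand-in for the nesting bookkeeping; not the JHOVE implementation.
class NestingSketch {
    private int dictDepth = 0;
    private int arrayDepth = 0;

    // Updates the counters for one token, given here as its raw PDF spelling.
    void accept(String token) {
        switch (token) {
            case "<<":
                dictDepth++;
                break;
            case ">>":
                if (dictDepth == 0) {
                    throw new IllegalStateException("unmatched >>");
                }
                dictDepth--;
                break;
            case "[":
                arrayDepth++;
                break;
            case "]":
                if (arrayDepth == 0) {
                    throw new IllegalStateException("unmatched ]");
                }
                arrayDepth--;
                break;
            default:
                break; // other tokens do not affect nesting
        }
    }

    int getDictDepth() {
        return dictDepth;
    }

    int getArrayDepth() {
        return arrayDepth;
    }

    public static void main(String[] args) {
        NestingSketch s = new NestingSketch();
        // An open dictionary containing an open array.
        String[] tokens = {"<<", "/Kids", "[", "1", "0", "R"};
        for (String t : tokens) {
            s.accept(t);
        }
        System.out.println(s.getDictDepth() + " " + s.getArrayDepth()); // prints "1 1"
    }
}
```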

getNext

public Token getNext(long max)
              throws java.io.IOException,
                     PdfException
Gets a token, with max as the maximum allowable token size. Uses Tokenizer.getNext, and keeps track of the depth of dictionary and array nesting.

Parameters:
max - Maximum allowable size of the token
Throws:
java.io.IOException
PdfException

getNext

public Token getNext(java.lang.Class clas,
                     java.lang.String errMsg)
              throws java.io.IOException,
                     PdfException
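A class-sensitive version of getNext. The token obtained must be of the specified class (or a subclass thereof); otherwise a PdfInvalidException with the message errMsg is thrown.

Parameters:
clas - The class of which the returned token must be an instance
errMsg - The message for the PdfInvalidException thrown if the token is of the wrong class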
Throws:
java.io.IOException
PdfException
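
The class check can be pictured with the following self-contained sketch. The token hierarchy, the exception type InvalidTokenException, and the class TypedNextSketch are hypothetical stand-ins named for illustration; only the idea of accepting the requested class or any subclass comes from the description above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical model of a class-sensitive token fetch; not the JHOVE implementation.
class TypedNextSketch {
    // Stand-in token types, loosely modeled on names used on this page.
    static class Token {}
    static class Numeric extends Token {}
    static class Keyword extends Token {}

    // Stand-in for the exception thrown on a class mismatch.
    static class InvalidTokenException extends Exception {
        InvalidTokenException(String message) {
            super(message);
        }
    }

    private final Deque<Token> tokens;

    TypedNextSketch(Token... tokens) {
        this.tokens = new ArrayDeque<>(Arrays.asList(tokens));
    }

    // Returns the next token if it is an instance of clas (or of a subclass);
    // otherwise throws with the caller-supplied message.
    Token getNext(Class<? extends Token> clas, String errMsg) throws InvalidTokenException {
        Token tok = tokens.poll(); // null when no tokens remain
        if (!clas.isInstance(tok)) { // isInstance(null) is false
            throw new InvalidTokenException(errMsg);
        }
        return tok;
    }

    public static void main(String[] args) {
        TypedNextSketch p = new TypedNextSketch(new Numeric(), new Keyword());
        try {
            p.getNext(Numeric.class, "Expected a number");  // succeeds
            p.getNext(Numeric.class, "Expected a number");  // next token is a Keyword
        } catch (InvalidTokenException e) {
            System.out.println(e.getMessage()); // prints "Expected a number"
        }
    }
}
```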

getDictDepth

public int getDictDepth()
Returns the number of dictionary starts not yet matched by dictionary ends.


setEncrypted

public void setEncrypted(boolean encrypted)
Tells this Parser, and its Tokenizer, whether the file is encrypted.


getArrayDepth

public int getArrayDepth()
Returns the number of array starts not yet matched by array ends.


getWSString

public java.lang.String getWSString()
Returns the Tokenizer's current whitespace string.


getLanguageCodes

public java.util.Set getLanguageCodes()
Returns the language code set from the Tokenizer.


getPDFACompliant

public boolean getPDFACompliant()
Returns false if either the parser or the tokenizer has detected non-compliance with PDF/A restrictions. A value of true is no guarantee that the file is compliant.


setPDFACompliant

public void setPDFACompliant(boolean pdfACompliant)
Sets the value of the pdfACompliant flag. This may be used to clear a previous detection of noncompliance. If the parameter is true, the tokenizer's pdfACompliant flag is also set to true.
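
How the parser's and the tokenizer's flags interact, as described for getPDFACompliant and setPDFACompliant, can be sketched as follows. This is a hypothetical model: the classes ComplianceSketch and TokenizerFlags are invented for illustration, and the initial value of true for both flags is an assumption.

```java
// Hypothetical model of the two compliance flags; not the JHOVE implementation.
class ComplianceSketch {
    // Stand-in for the tokenizer's own flag (initial value assumed).
    static class TokenizerFlags {
        boolean pdfACompliant = true;
    }

    private final TokenizerFlags tokenizer;
    private boolean pdfACompliant = true; // initial value assumed

    ComplianceSketch(TokenizerFlags tokenizer) {
        this.tokenizer = tokenizer;
    }

    // False if either side has detected non-compliance.
    boolean getPDFACompliant() {
        return pdfACompliant && tokenizer.pdfACompliant;
    }

    // Setting true also clears the tokenizer's detection;
    // setting false leaves the tokenizer's flag untouched.
    void setPDFACompliant(boolean value) {
        pdfACompliant = value;
        if (value) {
            tokenizer.pdfACompliant = true;
        }
    }

    public static void main(String[] args) {
        TokenizerFlags t = new TokenizerFlags();
        ComplianceSketch p = new ComplianceSketch(t);
        t.pdfACompliant = false;                  // tokenizer detects a violation
        System.out.println(p.getPDFACompliant()); // prints false
        p.setPDFACompliant(true);                 // clears both flags
        System.out.println(p.getPDFACompliant()); // prints true
    }
}
```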


readObjectDef

public PdfObject readObjectDef()
                        throws java.io.IOException,
                               PdfException
Reads an object definition, from wherever we are in the stream to the completion of one full object after the obj keyword.

Throws:
java.io.IOException
PdfException

readObjectDef

public PdfObject readObjectDef(Numeric objNumTok)
                        throws java.io.IOException,
                               PdfException
Reads an object definition, given the first numeric object, which has already been read and is passed as an argument. This is called by the no-argument readObjectDef; the only other case in which it will be called is for a cross-reference stream, which can be distinguished from a cross-reference table only once the first token is read.

Throws:
java.io.IOException
PdfException
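
The relationship between the two readObjectDef variants can be sketched with a simplified header reader. In PDF syntax an object definition begins with an object number, a generation number, and the obj keyword (for example, 12 0 obj), while a cross-reference table begins with the xref keyword, so a caller that has already read a leading number can hand it over. The sketch below is a hypothetical illustration: it works on whitespace-separated strings rather than Token objects, reads only the header rather than the full object, and the names ObjectDefSketch, ObjectHeader, and readHeader are assumptions.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical model of reading an object definition header; not the JHOVE implementation.
class ObjectDefSketch {
    // Object number and generation number from a header such as "12 0 obj".
    static final class ObjectHeader {
        final int objNumber;
        final int genNumber;

        ObjectHeader(int objNumber, int genNumber) {
            this.objNumber = objNumber;
            this.genNumber = genNumber;
        }
    }

    // No-argument style: reads the object number itself, then delegates.
    static ObjectHeader readHeader(Iterator<String> tokens) {
        int objNumber = Integer.parseInt(tokens.next());
        return readHeader(objNumber, tokens);
    }

    // Used when the caller has already consumed the first numeric token.
    static ObjectHeader readHeader(int objNumber, Iterator<String> tokens) {
        int genNumber = Integer.parseInt(tokens.next());
        String keyword = tokens.next();
        if (!"obj".equals(keyword)) {
            throw new IllegalArgumentException("expected obj keyword, found " + keyword);
        }
        return new ObjectHeader(objNumber, genNumber);
    }

    public static void main(String[] args) {
        Iterator<String> it = Arrays.asList("12 0 obj".split(" ")).iterator();
        ObjectHeader h = readHeader(it);
        System.out.println(h.objNumber + " " + h.genNumber); // prints "12 0"
    }
}
```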

readObject

public PdfObject readObject()
                     throws java.io.IOException,
                            PdfException
Reads an object. By design, this reader has a number of limitations. The methods it uses may call it recursively to build up nested structures. If it encounters a token that cannot start an object, it throws a PdfException on which getToken() may be called to retrieve that token.

Throws:
java.io.IOException
PdfException

readArray

public PdfArray readArray()
                   throws java.io.IOException,
                          PdfException
Reads an array. When this is called, we have already read the ArrayStart token, and arrayDepth has been incremented to reflect this.

Throws:
java.io.IOException
PdfException

readDictionary

public PdfDictionary readDictionary()
                             throws java.io.IOException,
                                    PdfException
Reads a dictionary. When this is called, we have already read the DictionaryStart token, and dictDepth has been incremented to reflect this. Use this method only in the special case of picking up a dictionary in midstream.

Throws:
java.io.IOException
PdfException

getOffset

public long getOffset()
Returns the current offset into the file.


seek

public void seek(long offset)
          throws java.io.IOException,
                 PdfException
Positions the file to the specified offset, and resets the state for a new token stream.

Throws:
java.io.IOException
PdfException

scanMode

public void scanMode(boolean flag)
If flag is true, does not attempt to parse tokens that are not delimited by whitespace, e.g., literal and hexadecimal strings.

Parameters:
flag - Scan mode flag