edu.harvard.hul.ois.jhove.module.pdf
Class Tokenizer

java.lang.Object
  extended by edu.harvard.hul.ois.jhove.module.pdf.Tokenizer
Direct Known Subclasses:
FileTokenizer, StreamTokenizer

public abstract class Tokenizer
extends java.lang.Object

Tokenizer for PDF files. This is used in conjunction with the Parser, which assembles Tokens into higher-level constructs.


Field Summary
protected  int _ch
          Character code of current character.
protected  java.io.RandomAccessFile _file
          Source from which to read bytes.
static char[] PDFDOCENCODING
          Mapping between PDFDocEncoding and Unicode code points.
 
Constructor Summary
Tokenizer()
          Constructor.
 
Method Summary
 void addLanguageCode(java.lang.String langCode)
          Add a string to the set of language codes.
abstract  void backupChar()
          Back up a byte so it will be read again.
 java.util.Set getLanguageCodes()
          Return the set of language codes.
 Token getNext()
          Parses out and returns a token from the input file.
 Token getNext(long max)
          Parses out and returns a token from the input file.
 long getOffset()
          Return the current offset into the file.
 boolean getPDFACompliant()
          Returns the value of the pdfACompliant flag, which indicates that the tokenizer hasn't detected non-compliance.
 java.lang.String getWSString()
          Returns the value of the last white space string read by the tokenizer.
protected abstract  void initStream(Stream token)
          Initialization code for Stream object.
abstract  int readChar()
          Get a character from the file or stream, using a buffer.
 int readChar1(boolean utf16)
          Read a character in one-byte or two-byte format, as requested.
 void scanMode(boolean flag)
          If true, do not attempt to parse non-whitespace delimited tokens, e.g., literal and hexadecimal strings.
abstract  void seek(long offset)
          Set the Tokenizer to a new position in the file.
protected  void seekReset(long offset)
          Reset after a seek.
 void setEncrypted(boolean encrypted)
          Tell this object that the file is or isn't encrypted.
 void setPDFACompliant(boolean pdfACompliant)
          Set the value of the pdfACompliant flag.
protected abstract  void setStreamOffset(Stream token)
          Sets the offset of a Stream to the current file position.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

PDFDOCENCODING

public static char[] PDFDOCENCODING
Mapping between PDFDocEncoding and Unicode code points.
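
A table of this kind is typically indexed by byte value and yields the matching Unicode character. The following is a minimal sketch of that lookup, not the module's actual table: the class name PdfDocEncodingSketch, the TABLE field and the decode helper are hypothetical, and only four of the positions where PDFDocEncoding differs from Latin-1 are filled in.

```java
final class PdfDocEncodingSketch {
    // Hypothetical, partial table: index = byte value, entry = Unicode character.
    static final char[] TABLE = buildTable();

    private static char[] buildTable() {
        char[] t = new char[256];
        // Assumption for this sketch: start from the identity mapping, which
        // holds for printable ASCII and for most of the range above 0xA0.
        for (int i = 0; i < 256; i++) {
            t[i] = (char) i;
        }
        // A few positions where PDFDocEncoding departs from Latin-1.
        t[0x80] = '\u2022'; // bullet
        t[0x84] = '\u2014'; // em dash
        t[0x85] = '\u2013'; // en dash
        t[0xA0] = '\u20AC'; // Euro sign
        return t;
    }

    /** Decodes a byte string by looking each byte up in the table. */
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append(TABLE[b & 0xFF]); // mask so the byte is treated as unsigned
        }
        return sb.toString();
    }
}
```

The mask with 0xFF matters because Java bytes are signed; without it, any byte of 0x80 or above would produce a negative index.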


_file

protected java.io.RandomAccessFile _file
Source from which to read bytes.


_ch

protected int _ch
Character code of current character.

Constructor Detail

Tokenizer

public Tokenizer()
Constructor.

Method Detail

getNext

public Token getNext()
              throws java.io.IOException,
                     PdfException
Parses out and returns a token from the input file. If it hits the end of the file, returns null. Other parsing problems cause an exception to be thrown. When an exception is thrown, the state is changed to WHITESPACE, so the parser can get back in sync more easily.

Throws:
java.io.IOException
PdfException

getNext

public Token getNext(long max)
              throws java.io.IOException,
                     PdfException
Parses out and returns a token from the input file. If it hits the end of the file, returns null. Other parsing problems cause an exception to be thrown. When an exception is thrown, the state is changed to WHITESPACE, so the parser can get back in sync more easily.

Parameters:
max - Maximum allowable size of the token
Throws:
java.io.IOException
PdfException
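
The calling pattern implied by both getNext variants (loop until null, catch the exception, keep going) can be sketched as follows. This is an illustration under stated assumptions, not JHOVE code: MiniTokenizer and SketchException are hypothetical stand-ins for Tokenizer and PdfException, tokens are plain Strings rather than Token objects, and the stand-in splits on PDF white space only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    /** Hypothetical stand-in for PdfException. */
    static final class SketchException extends Exception {
        SketchException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for Tokenizer: splits on PDF white space only. */
    static final class MiniTokenizer {
        private final byte[] data;
        private int pos = 0;

        MiniTokenizer(byte[] data) { this.data = data; }

        private static boolean isSpace(int c) {
            // The six PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
            return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
        }

        /** Returns the next token, or null at end of input. */
        String getNext(long max) throws SketchException {
            while (pos < data.length && isSpace(data[pos])) pos++;
            if (pos >= data.length) return null;
            int start = pos;
            while (pos < data.length && !isSpace(data[pos])) pos++;
            // The position is already past the oversized token when we throw,
            // so the next call resumes at the following white space.
            if (pos - start > max) {
                throw new SketchException("token exceeds " + max + " bytes");
            }
            return new String(data, start, pos - start, StandardCharsets.US_ASCII);
        }
    }

    /** Drains the tokenizer, recording a marker for each token that failed. */
    static List<String> collect(byte[] data, long max) {
        MiniTokenizer tok = new MiniTokenizer(data);
        List<String> out = new ArrayList<>();
        while (true) {
            try {
                String t = tok.getNext(max);
                if (t == null) break;   // end of input
                out.add(t);
            } catch (SketchException e) {
                out.add("<error>");     // resynchronize and continue
            }
        }
        return out;
    }
}
```

Because the stand-in leaves its position at the white space following a bad token, the caller can simply call getNext again after catching the exception, which is the recovery behavior the description above attributes to the WHITESPACE state.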

getOffset

public long getOffset()
Return the current offset into the file.


getLanguageCodes

public java.util.Set getLanguageCodes()
Return the set of language codes. Members of the set are Strings.


setEncrypted

public void setEncrypted(boolean encrypted)
Tell this object that the file is or isn't encrypted.


getPDFACompliant

public boolean getPDFACompliant()
Returns the value of the pdfACompliant flag, which indicates that the tokenizer hasn't detected non-compliance. A value of true is no guarantee that the file is compliant.


setPDFACompliant

public void setPDFACompliant(boolean pdfACompliant)
Set the value of the pdfACompliant flag. This may be used to clear previous detection of non-compliance.


getWSString

public java.lang.String getWSString()
Returns the value of the last white space string read by the tokenizer. Repositioning clears the white space string.


seek

public abstract void seek(long offset)
                   throws java.io.IOException,
                          PdfException
Set the Tokenizer to a new position in the file.

Parameters:
offset - The offset in bytes from the start of the file.
Throws:
java.io.IOException
PdfException

seekReset

protected void seekReset(long offset)
Reset after a seek.


readChar

public abstract int readChar()
                      throws java.io.IOException
Get a character from the file or stream, using a buffer.

Throws:
java.io.IOException

readChar1

public int readChar1(boolean utf16)
              throws java.io.IOException
Read a character in one-byte or two-byte format, as requested.

Throws:
java.io.IOException
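
One plausible reading of the one-byte versus two-byte behavior is sketched below. It assumes that the two-byte form is big-endian (high byte first) and that end of input is signalled by -1; neither detail is stated in this documentation. The class ReadChar1Sketch and the readAll helper are hypothetical, and the sketch reads from a plain InputStream rather than from the tokenizer's own readChar.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class ReadChar1Sketch {
    /**
     * Reads one character code: a single byte, or two bytes combined
     * high-byte-first when utf16 is true. Returns -1 at end of input.
     */
    static int readChar1(InputStream in, boolean utf16) throws IOException {
        int hi = in.read();
        if (hi < 0 || !utf16) {
            return hi;
        }
        int lo = in.read();
        if (lo < 0) {
            return -1; // truncated pair: treat as end of input
        }
        return (hi << 8) | lo;
    }

    /** Convenience helper: decodes a whole byte array into character codes. */
    static int[] readAll(byte[] data, boolean utf16) {
        int[] out = new int[data.length];
        int n = 0;
        try (InputStream in = new ByteArrayInputStream(data)) {
            int c;
            while ((c = readChar1(in, utf16)) >= 0) {
                out[n++] = c;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(out, n);
    }
}
```

For the bytes 00 41 20 AC, readAll yields 0x41 and 0x20AC with utf16 set to true, and four separate codes with utf16 set to false.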

backupChar

public abstract void backupChar()
Back up a byte so it will be read again.
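
How a file-based subclass might combine buffered reads, a one-byte backup, seeking and offset tracking can be sketched as below. This is an assumption-laden illustration rather than the FileTokenizer implementation: the class BufferedCharSourceSketch, its field names, the 4096-byte buffer size and the demo method are all hypothetical. Only a single backup immediately after a successful read is supported.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class BufferedCharSourceSketch {
    private final RandomAccessFile file;
    private final byte[] buf = new byte[4096];
    private long bufStart = 0; // file offset of buf[0]
    private int bufLen = 0;    // number of valid bytes in buf
    private int bufPos = 0;    // index of the next byte to hand out

    BufferedCharSourceSketch(RandomAccessFile file) { this.file = file; }

    /** Returns the next byte as 0..255, or -1 at end of file. */
    int readChar() throws IOException {
        if (bufPos >= bufLen) {
            // Buffer exhausted: refill from the position just past it.
            bufStart += bufLen;
            file.seek(bufStart);
            int n = file.read(buf);
            bufLen = Math.max(n, 0);
            bufPos = 0;
            if (n <= 0) return -1;
        }
        return buf[bufPos++] & 0xFF;
    }

    /** Steps back one byte; valid once after each successful readChar. */
    void backupChar() {
        if (bufPos > 0) bufPos--;
    }

    /** Discards the buffer so the next read starts at the given offset. */
    void seek(long offset) {
        bufStart = offset;
        bufLen = 0;
        bufPos = 0;
    }

    long getOffset() { return bufStart + bufPos; }

    /** Small demonstration against a temporary file. */
    static String demo() {
        try {
            File tmp = File.createTempFile("tokenizer-sketch", ".bin");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
                BufferedCharSourceSketch src = new BufferedCharSourceSketch(raf);
                StringBuilder sb = new StringBuilder();
                sb.append((char) src.readChar()); // first byte
                src.backupChar();
                sb.append((char) src.readChar()); // same byte again
                src.seek(5);
                sb.append((char) src.readChar()); // byte at offset 5
                sb.append(src.getOffset());       // offset after that read
                return sb.toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the backup as a simple index decrement is what lets backupChar avoid declaring an IOException, matching the signature shown above; the cost is that it cannot cross a buffer boundary.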


addLanguageCode

public void addLanguageCode(java.lang.String langCode)
Add a string to the set of language codes.


scanMode

public void scanMode(boolean flag)
If true, do not attempt to parse non-whitespace delimited tokens, e.g., literal and hexadecimal strings.

Parameters:
flag - Scan mode flag

initStream

protected abstract void initStream(Stream token)
                            throws java.io.IOException,
                                   PdfException
Initialization code for Stream object. This is meaningful only for the FileTokenizer subclass.

Throws:
java.io.IOException
PdfException

setStreamOffset

protected abstract void setStreamOffset(Stream token)
                                 throws java.io.IOException,
                                        PdfException
Sets the offset of a Stream to the current file position. Only the file-based tokenizer can do this, which is why the method is abstract here and implemented by the subclasses.

Throws:
java.io.IOException
PdfException