Tokenizer (JHOVE Documentation)

edu.harvard.hul.ois.jhove.module.pdf
Class Tokenizer

java.lang.Object
  edu.harvard.hul.ois.jhove.module.pdf.Tokenizer

Direct Known Subclasses:: FileTokenizer, StreamTokenizer

public abstract class Tokenizer
extends java.lang.Object

Tokenizer for PDF files. This is used in conjunction with the Parser, which assembles Tokens into higher-level constructs.

Field Summary
`protected int`	`_ch` Character code of current character.
`protected java.io.RandomAccessFile`	`_file` Source from which to read bytes.
`static char[]`	`PDFDOCENCODING` Mapping between PDFDocEncoding and Unicode code points.

Constructor Summary
`Tokenizer()` Constructor.

Method Summary
`void`	`addLanguageCode(java.lang.String langCode)` Add a string to the set of language codes.
`abstract void`	`backupChar()` Back up a byte so it will be read again.
`java.util.Set`	`getLanguageCodes()` Return the set of language codes.
`Token`	`getNext()` Parses out and returns a token from the input file.
`Token`	`getNext(long max)` Parses out and returns a token from the input file.
`long`	`getOffset()` Return the current offset into the file.
`boolean`	`getPDFACompliant()` Returns the value of the pdfACompliant flag, which indicates that the tokenizer hasn't detected non-compliance.
`java.lang.String`	`getWSString()` Returns the value of the last white space string read by the tokenizer.
`protected abstract void`	`initStream(Stream token)` Initialization code for Stream object.
`abstract int`	`readChar()` Get a character from the file or stream, using a buffer.
`int`	`readChar1(boolean utf16)` Read a character in one-byte or two-byte format, as requested.
`void`	`scanMode(boolean flag)` If true, do not attempt to parse non-whitespace delimited tokens, e.g., literal and hexadecimal strings.
`abstract void`	`seek(long offset)` Set the Tokenizer to a new position in the file.
`protected void`	`seekReset(long offset)` Reset after a seek.
`void`	`setEncrypted(boolean encrypted)` Tell this object that the file is or isn't encrypted.
`void`	`setPDFACompliant(boolean pdfACompliant)` Set the value of the pdfACompliant flag.
`protected abstract void`	`setStreamOffset(Stream token)` Sets the offset of a Stream to the current file position.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

PDFDOCENCODING

public static char[] PDFDOCENCODING

Mapping between PDFDocEncoding and Unicode code points.
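A table of this kind is typically indexed by the unsigned byte value and yields the corresponding Unicode character. The sketch below illustrates that lookup pattern only. The class `PdfDocEncodingSketch`, its `TABLE`, and `decode` are hypothetical names, and the table is deliberately partial: it fills in two entries known from PDFDocEncoding (0x80 is the bullet, 0xA0 is the euro sign) and uses a placeholder identity mapping elsewhere. It is not the contents of `PDFDOCENCODING`.

```java
final class PdfDocEncodingSketch {
    // Hypothetical, partial table: index = encoded byte, value = Unicode char.
    static final char[] TABLE = buildTable();

    private static char[] buildTable() {
        char[] t = new char[256];
        for (int i = 0; i < 256; i++) {
            t[i] = (char) i;      // placeholder identity mapping (assumption)
        }
        t[0x80] = '\u2022';       // bullet
        t[0xA0] = '\u20AC';       // euro sign
        return t;
    }

    // Decode raw bytes by looking each one up in the table.
    static String decode(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) {
            sb.append(TABLE[b & 0xFF]);   // mask to treat the byte as unsigned
        }
        return sb.toString();
    }
}
```

The `b & 0xFF` mask matters because Java bytes are signed; without it, values of 0x80 and above would produce negative array indices.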

_file

protected java.io.RandomAccessFile _file

Source from which to read bytes.

_ch

protected int _ch

Character code of current character.

Constructor Detail

Tokenizer

public Tokenizer()

Constructor.

Method Detail

getNext

public Token getNext()
              throws java.io.IOException,
                     PdfException

Parses out and returns a token from the input file. If it hits the end of the file, returns null. Other parsing problems cause an exception to be thrown. When an exception is thrown, the state is changed to WHITESPACE, so the parser can get back in sync more easily.

Throws:: java.io.IOException; PdfException
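The null-at-end-of-file contract suggests a simple read loop on the caller's side. The following is a minimal, self-contained sketch of that loop; `MiniTokenizer` is a hypothetical stand-in that splits on whitespace and returns strings, not the JHOVE `Tokenizer` or `Token` classes, and it omits the exception handling a real caller would need.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    /** Hypothetical stand-in: yields whitespace-delimited words, then null. */
    static final class MiniTokenizer {
        private final String input;
        private int pos = 0;

        MiniTokenizer(String input) {
            this.input = input;
        }

        String getNext() {
            // Skip leading white space.
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos >= input.length()) {
                return null;              // end of input, mirrors the null-at-EOF contract
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            return input.substring(start, pos);
        }
    }

    // Collect tokens until getNext() signals end of input with null.
    static List<String> drain(MiniTokenizer tok) {
        List<String> out = new ArrayList<>();
        String t;
        while ((t = tok.getNext()) != null) {
            out.add(t);
        }
        return out;
    }
}
```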

getNext

public Token getNext(long max)
              throws java.io.IOException,
                     PdfException

Parses out and returns a token from the input file.

Parameters:: max - Maximum allowable size of the token
Throws:: java.io.IOException; PdfException

getOffset

public long getOffset()

Return the current offset into the file.

getLanguageCodes

public java.util.Set getLanguageCodes()

Return the set of language codes. Members of the set are Strings.

setEncrypted

public void setEncrypted(boolean encrypted)

Tell this object that the file is or isn't encrypted.

getPDFACompliant

public boolean getPDFACompliant()

Returns the value of the pdfACompliant flag, which indicates that the tokenizer hasn't detected non-compliance. A value of true is no guarantee that the file is compliant.

setPDFACompliant

public void setPDFACompliant(boolean pdfACompliant)

Set the value of the pdfACompliant flag. This may be used to clear previous detection of noncompliance.

getWSString

public java.lang.String getWSString()

Returns the value of the last white space string read by the tokenizer. Repositioning clears the white space string.

seek

public abstract void seek(long offset)
                   throws java.io.IOException,
                          PdfException

Set the Tokenizer to a new position in the file.

Parameters:: offset - The offset in bytes from the start of the file.
Throws:: java.io.IOException; PdfException

seekReset

protected void seekReset(long offset)

Reset after a seek.

readChar

public abstract int readChar()
                      throws java.io.IOException

Get a character from the file or stream, using a buffer.

Throws:: java.io.IOException

readChar1

public int readChar1(boolean utf16)
              throws java.io.IOException

Read a character in one-byte or two-byte format, as requested.

Throws:: java.io.IOException
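One way to picture the one-byte versus two-byte choice is sketched below. This is an illustration under stated assumptions, not the JHOVE implementation: the class `TwoByteReadSketch` is hypothetical, the two-byte form is assumed to be big-endian (high byte first), and returning -1 at end of input is also an assumption.

```java
final class TwoByteReadSketch {
    private final byte[] data;
    private int pos = 0;

    TwoByteReadSketch(byte[] data) {
        this.data = data;
    }

    // Read one unsigned byte, or -1 at end of input (assumed convention).
    private int readByte() {
        return pos < data.length ? (data[pos++] & 0xFF) : -1;
    }

    // Read one character as a single byte, or as two bytes combined
    // high-byte-first when utf16 is true (big-endian is an assumption).
    int readChar1(boolean utf16) {
        int hi = readByte();
        if (!utf16 || hi < 0) {
            return hi;
        }
        int lo = readByte();
        if (lo < 0) {
            return -1;                    // truncated two-byte character
        }
        return (hi << 8) | lo;
    }
}
```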

backupChar

public abstract void backupChar()

Back up a byte so it will be read again.
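Backing up one byte is the usual way to let a scanner stop at a delimiter without consuming it. The sketch below shows that read-then-push-back pattern with a hypothetical in-memory `ByteSource`; its `readChar` and `backupChar` only imitate the method names documented here, and returning -1 at end of input is an assumption.

```java
final class PushbackSketch {
    /** Hypothetical in-memory source with a one-byte back-up operation. */
    static final class ByteSource {
        private final byte[] data;
        private int pos = 0;

        ByteSource(byte[] data) {
            this.data = data;
        }

        int readChar() {
            return pos < data.length ? (data[pos++] & 0xFF) : -1;
        }

        void backupChar() {
            if (pos > 0) {
                pos--;                    // the same byte will be returned again
            }
        }
    }

    // Read a run of ASCII digits, leaving the terminating byte unread.
    static String readDigits(ByteSource src) {
        StringBuilder sb = new StringBuilder();
        int ch;
        while ((ch = src.readChar()) >= 0) {
            if (ch < '0' || ch > '9') {
                src.backupChar();         // push the delimiter back for the next reader
                break;
            }
            sb.append((char) ch);
        }
        return sb.toString();
    }
}
```

After `readDigits` returns, the next `readChar` call yields the byte that ended the number, so whatever reads next sees the delimiter intact.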

addLanguageCode

public void addLanguageCode(java.lang.String langCode)

Add a string to the set of language codes.

scanMode

public void scanMode(boolean flag)

If true, do not attempt to parse non-whitespace delimited tokens, e.g., literal and hexadecimal strings.

Parameters:: flag - Scan mode flag

initStream

protected abstract void initStream(Stream token)
                            throws java.io.IOException,
                                   PdfException

Initialization code for Stream object. This is meaningful only for the FileTokenizer subclass.

Throws:: java.io.IOException; PdfException

setStreamOffset

protected abstract void setStreamOffset(Stream token)
                                 throws java.io.IOException,
                                        PdfException

Sets the offset of a Stream to the current file position. Only the file-based tokenizer can do this, so the operation is meaningful only for the FileTokenizer subclass.

Throws:: java.io.IOException; PdfException
