org.pdfbox.searchengine.lucene
Class LucenePDFDocument

java.lang.Object
  extended by org.pdfbox.searchengine.lucene.LucenePDFDocument

public final class LucenePDFDocument
extends Object

This class creates a document for the Lucene search engine. It should plug easily into the IndexHTML or IndexFiles classes that come with the Lucene project. The class populates the following fields.

Lucene Field Name   Description
path                File system path if loaded from a file
url                 URL to PDF document
contents            Entire contents of PDF document, indexed but not stored
summary             First 500 characters of content
modified            The modified date/time according to the url or path
uid                 A unique identifier for the Lucene document
CreationDate        From PDF meta-data if available
Creator             From PDF meta-data if available
Keywords            From PDF meta-data if available
ModificationDate    From PDF meta-data if available
Producer            From PDF meta-data if available
Subject             From PDF meta-data if available
Trapped             From PDF meta-data if available
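
The summary field is described above as the first 500 characters of the content. The following is a minimal sketch of that truncation rule using only the standard library. It is not taken from the PDFBox source; the class name SummarySketch, the method summarize, and the handling of null input are assumptions made for illustration.

```java
// Illustrative only: mirrors the documented meaning of the "summary" field
// (first 500 characters of the extracted text). Not the PDFBox implementation.
final class SummarySketch {

    // Length taken from the field table; the constant name is hypothetical.
    static final int SUMMARY_LENGTH = 500;

    // Returns at most the first SUMMARY_LENGTH characters of contents.
    // Treating null as an empty string is an assumption of this sketch.
    static String summarize(String contents) {
        if (contents == null) {
            return "";
        }
        if (contents.length() <= SUMMARY_LENGTH) {
            return contents;
        }
        return contents.substring(0, SUMMARY_LENGTH);
    }

    public static void main(String[] args) {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 600; i++) {
            longText.append('x');
        }
        System.out.println(summarize(longText.toString()).length()); // prints 500
        System.out.println(summarize("short text"));                 // prints short text
    }
}
```

Because contents is indexed but not stored, summary is the field a search front end would most likely display alongside a hit.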

Version:
$Revision: 1.18 $
Author:
Ben Litchfield

Method Summary
static Document getDocument(File file)
          This will get a Lucene document from a PDF file.
static Document getDocument(InputStream is)
          This will get a Lucene document from a PDF input stream.
static Document getDocument(URL url)
          This will get a Lucene document from a PDF at a URL.
static void main(String[] args)
          This will test creating a document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getDocument

public static Document getDocument(InputStream is)
                            throws IOException
This will get a Lucene document from a PDF input stream.

Parameters:
is - The stream to read the PDF from.
Returns:
The Lucene document.
Throws:
IOException - If there is an error parsing or indexing the document.

getDocument

public static Document getDocument(File file)
                            throws IOException
This will get a Lucene document from a PDF file.

Parameters:
file - The file to get the document for.
Returns:
The Lucene document.
Throws:
IOException - If there is an error parsing or indexing the document.

getDocument

public static Document getDocument(URL url)
                            throws IOException
This will get a Lucene document from a PDF at a URL.

Parameters:
url - The URL to get the document for.
Returns:
The Lucene document.
Throws:
IOException - If there is an error parsing or indexing the document.
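
A caller that indexes a directory tree, in the manner of Lucene's IndexFiles, first has to find the PDF files and then pass each one to getDocument(File). The following is a minimal sketch of that traversal using only the standard library. PdfCollector and collectPdfs are hypothetical names, not part of PDFBox or Lucene, and selecting files by a ".pdf" extension is an assumption. The getDocument call appears only in a comment because it needs the PDFBox and Lucene jars on the classpath.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox or Lucene.
final class PdfCollector {

    // Recursively gathers files whose names end in ".pdf" (case-insensitive).
    static List<File> collectPdfs(File root) {
        List<File> found = new ArrayList<File>();
        collect(root, found);
        return found;
    }

    private static void collect(File current, List<File> found) {
        if (current.isDirectory()) {
            File[] children = current.listFiles();
            if (children == null) {
                return; // unreadable directory: skip it
            }
            for (File child : children) {
                collect(child, found);
            }
        } else if (current.isFile()
                && current.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            found.add(current);
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File pdf : collectPdfs(root)) {
            // With PDFBox and Lucene available, a caller would build and
            // index the document here, for example:
            //   Document doc = LucenePDFDocument.getDocument(pdf);
            //   writer.addDocument(doc); // writer: an already open IndexWriter
            System.out.println(pdf.getPath());
        }
    }
}
```

Since every getDocument overload declares IOException, a real indexing loop would probably catch it per file so that one unreadable PDF does not abort the whole run.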

main

public static void main(String[] args)
                 throws IOException
This will test creating a document.
Usage: java org.pdfbox.searchengine.lucene.LucenePDFDocument <pdf-document>

Parameters:
args - command line arguments.
Throws:
IOException - If there is an error.