Class PDFMarkedContentExtractor


  • public class PDFMarkedContentExtractor
    extends PDFStreamEngine
    This is a stream engine to extract the marked content of a PDF.
    Author:
    Johannes Koch
    • Constructor Detail

      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor()
                                  throws java.io.IOException
        Instantiate a new PDFMarkedContentExtractor object.
        Throws:
        java.io.IOException
      • PDFMarkedContentExtractor

        public PDFMarkedContentExtractor(java.lang.String encoding)
                                  throws java.io.IOException
        Constructor. Will apply encoding-specific conversions to the output text.
        Parameters:
        encoding - The encoding that the output will be written in.
        Throws:
        java.io.IOException
    • Method Detail

      • xobject

        public void xobject(PDXObject xobject)
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        Parameters:
        text - The text to process.
      • getMarkedContents

        public java.util.List<PDMarkedContent> getMarkedContents()
      • processPage

        public void processPage(PDPage page)
                         throws java.io.IOException
        This will initialize and process the contents of the stream.
        Overrides:
        processPage in class PDFStreamEngine
        Parameters:
        page - the page to process
        Throws:
        java.io.IOException - if there is an error accessing the stream.
      • showGlyph

        protected void showGlyph(Matrix textRenderingMatrix,
                                 PDFont font,
                                 int code,
                                 java.lang.String unicode,
                                 Vector displacement)
                          throws java.io.IOException
        This method was originally written by Ben Litchfield for PDFStreamEngine.
        Overrides:
        showGlyph in class PDFStreamEngine
        Parameters:
        textRenderingMatrix - the current text rendering matrix, Trm
        font - the current font
        code - internal PDF character code for the glyph
        unicode - the Unicode text for this glyph, or null if the PDF does not provide it
        displacement - the displacement (i.e. advance) of the glyph in text space
        Throws:
        java.io.IOException - if the glyph cannot be processed
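
    Usage sketch (not part of the generated documentation). The members above suggest a simple flow: construct the extractor, call processPage for a page, then read the result with getMarkedContents. The example below is a minimal sketch of that flow. It assumes the PDFBox 2.x package layout (org.apache.pdfbox.text and org.apache.pdfbox.pdmodel.documentinterchange.markedcontent), and that PDMarkedContent exposes getTag() and getContents(); adjust the imports if your version differs. The class name MarkedContentDump and the helpers describePage and describe are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical helper class; package names assume PDFBox 2.x.
public class MarkedContentDump {

    /** Returns one indented line per marked-content section found on the page. */
    static List<String> describePage(PDPage page) throws IOException {
        PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
        extractor.processPage(page);

        List<String> lines = new ArrayList<>();
        for (PDMarkedContent section : extractor.getMarkedContents()) {
            describe(section, 0, lines);
        }
        return lines;
    }

    /** Appends "tag: text" for this section, then recurses into nested sections. */
    private static void describe(PDMarkedContent section, int depth, List<String> out) {
        StringBuilder text = new StringBuilder();
        List<PDMarkedContent> nested = new ArrayList<>();

        // A section's contents can mix text positions, nested sections and
        // other objects; only the first two are handled in this sketch.
        for (Object item : section.getContents()) {
            if (item instanceof TextPosition) {
                String unicode = ((TextPosition) item).getUnicode();
                if (unicode != null) {
                    text.append(unicode);
                }
            } else if (item instanceof PDMarkedContent) {
                nested.add((PDMarkedContent) item);
            }
        }

        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        line.append(section.getTag()).append(": ").append(text);
        out.add(line.toString());

        for (PDMarkedContent child : nested) {
            describe(child, depth + 1, out);
        }
    }

    public static void main(String[] args) throws IOException {
        // A blank in-memory page keeps the sketch independent of any file;
        // in practice the pages would come from a loaded document.
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);
            for (String line : describePage(page)) {
                System.out.println(line);
            }
        }
    }
}
```

    A fresh extractor is created per page here so that the returned list only covers that page; whether a single instance may be reused across pages is not stated above, so the sketch avoids relying on it.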