`sparknlp.reader.pdf_to_text`

Module Contents

Classes

PdfToText: Extract text from PDF documents, either as a single string per document or as one string per page.

class PdfToText

Extract text from PDF documents, either as a single string per document or as one string per page. The input is a column containing the binary content of PDF files. The output is a column with the extracted text, with options to include page numbers or to split the text by page.

Parameters:

pageNumCol (str, optional): Page number output column name.
partitionNum (int, optional): Number of partitions (default is 0).
storeSplittedPdf (bool, optional): Whether to store the content of split PDFs (default is False).
splitPage (bool, optional): Enable/disable splitting per page (default is True).
onlyPageNum (bool, optional): Whether to extract only page numbers (default is False).
textStripper (str or TextStripperType, optional): Defines the layout and formatting type.
sort (bool, optional): Enable/disable sorting content per page (default is False).

Examples

>>> import sparknlp
>>> from sparknlp.reader.pdf_to_text import PdfToText
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()  # create the Spark session used below
>>> pdf_path = "Documents/files/pdf"
>>> data_frame = spark.read.format("binaryFile").load(pdf_path)
>>> pdf_to_text = PdfToText().setStoreSplittedPdf(True)
>>> pipeline = Pipeline(stages=[pdf_to_text])
>>> pipeline_model = pipeline.fit(data_frame)
>>> pdf_df = pipeline_model.transform(data_frame)
>>> pdf_df.show()
+--------------------+--------------------+
|                path|    modificationTime|
+--------------------+--------------------+
|file:/Users/paula...|2025-05-15 11:33:...|
|file:/Users/paula...|2025-05-15 11:33:...|
+--------------------+--------------------+
>>> pdf_df.printSchema()
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- text: string (nullable = true)
 |-- height_dimension: integer (nullable = true)
 |-- width_dimension: integer (nullable = true)
 |-- content: binary (nullable = true)
 |-- exception: string (nullable = true)
 |-- pagenum: integer (nullable = true)
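
The schema above suggests that, with page splitting enabled, each row carries the text of one page along with its `pagenum` and an `exception` string. As a minimal sketch of how such rows could be stitched back into whole documents, the helper below works on plain dictionaries (for instance, the result of `row.asDict()` on collected rows). The function name `join_pages`, the sample rows, and the assumption that failed pages carry a non-empty `exception` are illustrative only and are not part of the library.

```python
from collections import defaultdict


def join_pages(rows, separator="\n"):
    """Group per-page rows by `path` and join their text in page order.

    Assumes each row is a dict with the `path`, `pagenum`, `text`, and
    `exception` keys from the schema above. Rows with a non-empty
    `exception` are skipped (an assumption about how failures are reported).
    """
    pages_by_path = defaultdict(list)
    for row in rows:
        if row.get("exception"):
            continue  # skip pages that reported an extraction error
        pages_by_path[row["path"]].append((row["pagenum"], row["text"] or ""))
    return {
        path: separator.join(
            text for _, text in sorted(pages, key=lambda page: page[0])
        )
        for path, pages in pages_by_path.items()
    }


# Hypothetical rows, shaped like the collected output of `pdf_df`.
sample_rows = [
    {"path": "a.pdf", "pagenum": 1, "text": "second page", "exception": None},
    {"path": "a.pdf", "pagenum": 0, "text": "first page", "exception": None},
    {"path": "b.pdf", "pagenum": 0, "text": None, "exception": "corrupt file"},
]
documents = join_pages(sample_rows)
```

On large inputs the same grouping would more naturally be done in Spark before collecting; the pure-Python version is only meant to show the shape of the rows.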

pageNumCol

partitionNum

storeSplittedPdf

splitPage

textStripper

sort

onlyPageNum

extractCoordinates

normalizeLigatures

setInputCol(value): Sets the value of inputCol.

setOutputCol(value): Sets the value of outputCol.

setPageNumCol(value): Sets the value of pageNumCol.

setPartitionNum(value): Sets the value of partitionNum.

setStoreSplittedPdf(value): Sets the value of storeSplittedPdf.

setSplitPage(value): Sets the value of splitPage.

setOnlyPageNum(value): Sets the value of onlyPageNum.

setTextStripper(value): Sets the value of textStripper.

setSort(value): Sets the value of sort.

setExtractCoordinates(value): Sets the value of extractCoordinates.

setNormalizeLigatures(value): Sets the value of normalizeLigatures.
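
Each setter listed here is named `set` followed by the capitalized parameter name. As a small sketch of how that convention could be used to apply several options at once, the hypothetical helper `configure` below maps keyword arguments to setter calls. It is not part of Spark NLP, and the stand-in class `FakeStage` exists only so the snippet runs without a Spark session.

```python
def configure(stage, **params):
    """Apply each keyword as stage.set<Name>(value), e.g. sort -> setSort."""
    for name, value in params.items():
        setter_name = "set" + name[0].upper() + name[1:]
        setter = getattr(stage, setter_name, None)
        if setter is None:
            raise AttributeError(
                f"{type(stage).__name__} has no {setter_name}()"
            )
        setter(value)
    return stage


class FakeStage:
    """Stand-in that records setter calls; not a Spark NLP class."""

    def __init__(self):
        self.values = {}

    def setSplitPage(self, value):
        self.values["splitPage"] = value
        return self

    def setPageNumCol(self, value):
        self.values["pageNumCol"] = value
        return self


stage = configure(FakeStage(), splitPage=False, pageNumCol="page")
```

With the real class, the corresponding call would be `configure(PdfToText(), splitPage=False, pageNumCol="page")`, which should have the same effect as calling `setSplitPage(False)` and `setPageNumCol("page")` directly.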
