sparknlp.reader.pdf_to_text#

Module Contents#

Classes#

PdfToText

Extract text from PDF documents as either a single string per document or one string per page.

class PdfToText[source]#

Extract text from PDF documents as either a single string per document or one string per page. The input is a column containing the binary content of PDF files. The output is a column containing the extracted text, optionally with page numbers and per-page splitting.

Parameters:
pageNumCol : str, optional

Page number output column name.

partitionNum : int, optional

Number of partitions (default is 0).

storeSplittedPdf : bool, optional

Whether to store content of split PDFs (default is False).

splitPage : bool, optional

Enable/disable splitting per page (default is True).

onlyPageNum : bool, optional

Whether to extract only page numbers (default is False).

textStripper : str or TextStripperType, optional

Defines layout and formatting type.

sort : bool, optional

Enable/disable sorting content per page (default is False).

Examples

>>> import sparknlp
>>> from sparknlp.reader.pdf_to_text import PdfToText
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> pdf_path = "Documents/files/pdf"
>>> data_frame = spark.read.format("binaryFile").load(pdf_path)
>>> pdf_to_text = PdfToText().setStoreSplittedPdf(True)
>>> pipeline = Pipeline(stages=[pdf_to_text])
>>> pipeline_model = pipeline.fit(data_frame)
>>> pdf_df = pipeline_model.transform(data_frame)
>>> pdf_df.show()
+--------------------+--------------------+
|                path|    modificationTime|
+--------------------+--------------------+
|file:/Users/paula...|2025-05-15 11:33:...|
|file:/Users/paula...|2025-05-15 11:33:...|
+--------------------+--------------------+
>>> pdf_df.printSchema()
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- text: string (nullable = true)
 |-- height_dimension: integer (nullable = true)
 |-- width_dimension: integer (nullable = true)
 |-- content: binary (nullable = true)
 |-- exception: string (nullable = true)
 |-- pagenum: integer (nullable = true)
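
When pages are split, each page arrives as its own row, so a common follow-up is to stitch the pages of one file back together in order. The sketch below shows one way to do that in plain Python on already-collected rows. It assumes the rows carry the path, pagenum and text columns from the schema above, that pagenum is never null, and that the result is small enough to collect to the driver. The helper name join_pages and the sample rows are hypothetical, not part of Spark NLP.

```python
from collections import defaultdict


def join_pages(rows):
    """Group per-page rows by file path and join their text in page order.

    `rows` is an iterable of dicts with "path", "pagenum" and "text" keys.
    This is a hypothetical helper, not part of the Spark NLP API.
    """
    pages_by_path = defaultdict(list)
    for row in rows:
        # Skip pages where no text was extracted.
        if row.get("text") is None:
            continue
        pages_by_path[row["path"]].append((row["pagenum"], row["text"]))
    return {
        path: "\n".join(text for _, text in sorted(pages, key=lambda p: p[0]))
        for path, pages in pages_by_path.items()
    }


# Made-up stand-in for rows collected from the transformed DataFrame, e.g.
# [r.asDict() for r in pdf_df.select("path", "pagenum", "text").collect()]
rows = [
    {"path": "a.pdf", "pagenum": 1, "text": "second page"},
    {"path": "a.pdf", "pagenum": 0, "text": "first page"},
    {"path": "b.pdf", "pagenum": 0, "text": "only page"},
]
documents = join_pages(rows)
print(documents["a.pdf"])
```

For large corpora the same grouping would normally be done inside Spark rather than on the driver; the plain-Python form is used here only to keep the ordering logic easy to follow.
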
pageNumCol[source]#
partitionNum[source]#
storeSplittedPdf[source]#
splitPage[source]#
textStripper[source]#
sort[source]#
onlyPageNum[source]#
extractCoordinates[source]#
normalizeLigatures[source]#
setInputCol(value)[source]#

Sets the value of inputCol.

setOutputCol(value)[source]#

Sets the value of outputCol.

setPageNumCol(value)[source]#

Sets the value of pageNumCol.

setPartitionNum(value)[source]#

Sets the value of partitionNum.

setStoreSplittedPdf(value)[source]#

Sets the value of storeSplittedPdf.

setSplitPage(value)[source]#

Sets the value of splitPage.

setOnlyPageNum(value)[source]#

Sets the value of onlyPageNum.

setTextStripper(value)[source]#

Sets the value of textStripper.

setSort(value)[source]#

Sets the value of sort.

setExtractCoordinates(value)[source]#

Sets the value of extractCoordinates.

setNormalizeLigatures(value)[source]#

Sets the value of normalizeLigatures.
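
Every setter listed above follows the same naming pattern: the parameter name with its first letter capitalized and a set prefix. As a minimal sketch, that pattern can be used to apply a group of settings from keyword arguments. The helpers setter_name and configure and the RecordingStage stand-in class are hypothetical and exist only so the example runs without a Spark session; in real use the stage would be a PdfToText instance.

```python
def setter_name(param):
    """Map a parameter name such as 'pageNumCol' to 'setPageNumCol'."""
    return "set" + param[0].upper() + param[1:]


def configure(stage, **params):
    """Call the matching setter on `stage` for each keyword argument."""
    for name, value in params.items():
        setter = getattr(stage, setter_name(name), None)
        if setter is None:
            raise AttributeError(
                f"{type(stage).__name__} has no method {setter_name(name)}"
            )
        setter(value)
    return stage


class RecordingStage:
    """Hypothetical stand-in for PdfToText that records setter calls."""

    def __init__(self):
        self.calls = []

    def setSplitPage(self, value):
        self.calls.append(("setSplitPage", value))
        return self

    def setSort(self, value):
        self.calls.append(("setSort", value))
        return self


stage = configure(RecordingStage(), splitPage=True, sort=False)
print(stage.calls)  # [('setSplitPage', True), ('setSort', False)]
```

Raising on an unknown name rather than ignoring it keeps misspelled parameters from silently falling back to their defaults.
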