sparknlp.reader.pdf_to_text
Module Contents
Classes
PdfToText: Extract text from PDF documents as either a single string or multiple strings per page.
- class PdfToText
Extract text from PDF documents as either a single string or multiple strings per page. Input is a column with binary content of PDF files. Output is a column with extracted text, with options to include page numbers or split pages.
- Parameters:
- pageNumCol : str, optional
Page number output column name.
- partitionNum : int, optional
Number of partitions (default is 0).
- storeSplittedPdf : bool, optional
Whether to store content of split PDFs (default is False).
- splitPage : bool, optional
Enable/disable splitting per page (default is True).
- onlyPageNum : bool, optional
Whether to extract only page numbers (default is False).
- textStripper : str or TextStripperType, optional
Defines layout and formatting type.
- sort : bool, optional
Enable/disable sorting content per page (default is False).
Examples
>>> import sparknlp
>>> from sparknlp.reader.pdf_to_text import PdfToText
>>> from pyspark.ml import Pipeline
>>> # Start a Spark session so that `spark` is defined
>>> spark = sparknlp.start()
>>> pdf_path = "Documents/files/pdf"
>>> # Load each PDF as one row with its raw bytes in the `content` column
>>> data_frame = spark.read.format("binaryFile").load(pdf_path)
>>> pdf_to_text = PdfToText().setStoreSplittedPdf(True)
>>> pipeline = Pipeline(stages=[pdf_to_text])
>>> pipeline_model = pipeline.fit(data_frame)
>>> pdf_df = pipeline_model.transform(data_frame)
>>> pdf_df.show()
+--------------------+--------------------+
|                path|    modificationTime|
+--------------------+--------------------+
|file:/Users/paula...|2025-05-15 11:33:...|
|file:/Users/paula...|2025-05-15 11:33:...|
+--------------------+--------------------+
>>> pdf_df.printSchema()
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- text: string (nullable = true)
 |-- height_dimension: integer (nullable = true)
 |-- width_dimension: integer (nullable = true)
 |-- content: binary (nullable = true)
 |-- exception: string (nullable = true)
 |-- pagenum: integer (nullable = true)
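When pages are split, downstream code often needs the text of each file as one string again. The following is a minimal sketch of that step in plain Python, assuming the rows have already been collected as dictionaries with the `path`, `pagenum`, and `text` fields from the schema above. The helper `join_pages` and the sample rows are hypothetical and not part of the library.

```python
from collections import defaultdict


def join_pages(rows, page_col="pagenum", sep="\n"):
    """Concatenate per-page text into one string per file, ordered by page number."""
    pages_by_path = defaultdict(list)
    for row in rows:
        # A page with no extracted text (None) contributes an empty string.
        pages_by_path[row["path"]].append((row[page_col], row["text"] or ""))
    return {
        path: sep.join(text for _, text in sorted(pages, key=lambda page: page[0]))
        for path, pages in pages_by_path.items()
    }


# Made-up rows shaped like the `path`, `pagenum`, and `text` columns,
# collected as dicts; the page numbers and texts are illustrative only.
rows = [
    {"path": "file:/docs/a.pdf", "pagenum": 2, "text": "second page"},
    {"path": "file:/docs/a.pdf", "pagenum": 1, "text": "first page"},
    {"path": "file:/docs/b.pdf", "pagenum": 1, "text": None},
]
documents = join_pages(rows)
```

For large collections the same grouping would normally be done in Spark before collecting; the sketch only illustrates the ordering and joining logic.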
- setPageNumCol(value)
Sets the value of pageNumCol.
- setPartitionNum(value)
Sets the value of partitionNum.
- setStoreSplittedPdf(value)
Sets the value of storeSplittedPdf.
- setOnlyPageNum(value)
Sets the value of onlyPageNum.
- setTextStripper(value)
Sets the value of textStripper.
- setExtractCoordinates(value)
Sets the value of extractCoordinates.
- setNormalizeLigatures(value)
Sets the value of normalizeLigatures.
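Each setter listed above is named `set` followed by the capitalized parameter name, so a configuration dictionary can be applied generically. The sketch below shows one possible way to do this; `setter_name`, `apply_params`, and the `RecordingStage` stand-in are hypothetical helpers written for illustration, not part of the library. The stand-in only records calls so that the example runs without a Spark session.

```python
def setter_name(param):
    """Map a parameter name such as 'pageNumCol' to its setter name, 'setPageNumCol'."""
    return "set" + param[0].upper() + param[1:]


def apply_params(stage, params):
    """Call stage.set<Param>(value) for each entry and fail early on unknown names."""
    for name, value in params.items():
        setter = getattr(stage, setter_name(name), None)
        if not callable(setter):
            raise AttributeError(f"{type(stage).__name__} has no setter for {name!r}")
        setter(value)
    return stage


class RecordingStage:
    """Hypothetical stand-in for PdfToText: it stores the values passed to two setters."""

    def __init__(self):
        self.values = {}

    def setPageNumCol(self, value):
        self.values["pageNumCol"] = value
        return self

    def setOnlyPageNum(self, value):
        self.values["onlyPageNum"] = value
        return self


stage = apply_params(RecordingStage(), {"pageNumCol": "page", "onlyPageNum": False})
```

With the real class, the same call would take the form `apply_params(PdfToText(), {...})`, assuming every key corresponds to one of the setters listed above.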