class SparkNLPReader extends Serializable
Instance Constructors
- new SparkNLPReader(params: Map[String, String] = new java.util.HashMap(), headers: Map[String, String] = new java.util.HashMap())
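Both constructor arguments default to an empty java.util.HashMap, so options written as a Scala Map need converting first. The following is a minimal sketch of one way to do that; the helper name toJavaParams and the key "exampleOption" are hypothetical placeholders, not documented Spark NLP options.

```scala
import java.util.{HashMap => JHashMap}

object ReaderParams {
  // Hypothetical helper: copies a Scala Map into a mutable java.util.HashMap,
  // the type used for the constructor defaults shown above.
  def toJavaParams(options: Map[String, String]): JHashMap[String, String] = {
    val javaMap = new JHashMap[String, String]()
    options.foreach { case (key, value) => javaMap.put(key, value) }
    javaMap
  }
}

object ReaderParamsExample {
  def main(args: Array[String]): Unit = {
    // "exampleOption" is a placeholder key, not a documented reader option.
    val params = ReaderParams.toJavaParams(Map("exampleOption" -> "true"))
    println(params)
    // With Spark NLP on the classpath, the map would then be passed as:
    // val sparkNLPReader = new SparkNLPReader(params)
  }
}
```

The headers argument has the same type, so the same helper could be reused for it.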
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##(): Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
- def doc(content: Array[Byte]): Seq[HTMLElement]
- def doc(docPath: String): DataFrame
Instantiates class to read Word files.
docPath: a path to a directory of Word files or to a single Word file, e.g. "path/word/files"
Example
val docsPath = "home/user/word-directory"
val sparkNLPReader = new SparkNLPReader()
val docsDf = sparkNLPReader.doc(docsPath)
Example 2
You can also use SparkNLP to do the same in one line of code:
val docsDf = SparkNLP.read.doc(docsPath)
docsDf.select("doc").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|doc                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|[{Table, Header Col 1, {}}, {Table, Header Col 2, {}}, {Table, Lorem ipsum, {}}, {Table, A Link example, {}}, {NarrativeText, Dolor sit amet, {}}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------+

docsDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- doc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
- def email(content: Array[Byte]): Seq[HTMLElement]
- def email(emailPath: String): DataFrame
Instantiates class to read email files.
emailPath: a path to a directory of email files or to a single email file, e.g. "path/email/files"
Example
val emailsPath = "home/user/emails-directory"
val sparkNLPReader = new SparkNLPReader()
val emailDf = sparkNLPReader.email(emailsPath)
Example 2
You can also use SparkNLP to do the same in one line of code:
val emailDf = SparkNLP.read.email(emailsPath)
emailDf.select("email").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|email                                                                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{Title, Email Text Attachments, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>}}, {NarrativeText, Email test with two text attachments\r\n\r\nCheers,\r\n\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {NarrativeText, <html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Email test with two text attachments</span>\r\n \r\n<br>\r\n\r\n \r\nCheers,\r\n \r\n<br>\r\n\r\n</body>\r\n</html>\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/html}}, {Attachment, filename.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename.txt"}}, {NarrativeText, This is the content of the file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {Attachment, filename2.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename2.txt"}}, {NarrativeText, This is an additional content file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

emailDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- email: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
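Because each reader returns the parsed elements as a single array column, a common next step is to turn that array into one row per element. The sketch below does this with Spark's explode function. The names ElementFlattener, flatten and ofType are illustrative and not part of Spark NLP; the field names are taken from the printed schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

object ElementFlattener {
  // Turns the array column produced by a reader (e.g. "email")
  // into one row per parsed element, keeping the source path.
  def flatten(df: DataFrame, outputCol: String): DataFrame =
    df.select(col("path"), explode(col(outputCol)).as("element"))
      .select(
        col("path"),
        col("element.elementType").as("elementType"),
        col("element.content").as("content"),
        col("element.metadata").as("metadata"))

  // Keeps only elements of one type, such as "Attachment" or "NarrativeText".
  def ofType(flat: DataFrame, elementType: String): DataFrame =
    flat.filter(col("elementType") === elementType)
}
```

For the email sample, ElementFlattener.ofType(ElementFlattener.flatten(emailDf, "email"), "Attachment") would be expected to keep the two attachment rows. The same helper should apply to the other readers that share this schema by passing their output column name.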
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
- final def getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def getOutputColumn: String
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def html(urls: List[String]): DataFrame
- def html(urls: Array[String]): DataFrame
- def html(htmlPath: String): DataFrame
- def htmlToHTMLElement(html: String): Seq[HTMLElement]
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def md(mdPath: String): DataFrame
Instantiates class to read Markdown (.md) files.
This method loads a Markdown file or a directory of .md files and parses the content into structured elements such as headers, narrative text, lists, and code blocks.
Example
val mdPath = "home/user/markdown-files"
val sparkNLPReader = new SparkNLPReader()
val mdDf = sparkNLPReader.md(mdPath)
Example 2
Use SparkNLP in one line:
val mdDf = SparkNLP.read.md(mdPath)
mdDf.select("md").show(false)
+----------------------------------------------------------------------------------------------------------------------------+
|md                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------+
|[{Title, Introduction, {level -> 1, paragraph -> 0}}, {NarrativeText, This is a Markdown paragraph., {paragraph -> 0}}, ...]|
+----------------------------------------------------------------------------------------------------------------------------+

mdDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- md: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
- mdPath
Path to a single .md file or a directory of Markdown files.
- returns
A DataFrame with parsed Markdown content as structured HTMLElements.
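The Title elements in the sample output carry a "level" entry in their metadata, which is enough to rebuild a document outline once the elements are available as local values. The sketch below assumes that; ParsedElement is a hypothetical local model that mirrors the element struct in the schema, and MarkdownOutline is an illustrative name, not a Spark NLP class.

```scala
import scala.util.Try

// Hypothetical local model mirroring the element struct in the schema.
final case class ParsedElement(
    elementType: String,
    content: String,
    metadata: Map[String, String])

object MarkdownOutline {
  // Builds an indented outline from Title elements using the "level"
  // metadata key; titles without a parsable level are treated as level 1.
  def outline(elements: Seq[ParsedElement]): Seq[String] =
    elements.collect {
      case ParsedElement("Title", text, meta) =>
        val level = meta.get("level").flatMap(s => Try(s.toInt).toOption).getOrElse(1)
        ("  " * math.max(level - 1, 0)) + text
    }
}
```

Non-title elements such as NarrativeText are skipped, so the result contains one line per heading, indented by two spaces per level below the first.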
- def mdToHTMLElement(mdContent: String): Seq[HTMLElement]
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def pdf(pdfPath: String): DataFrame
Instantiates class to read PDF files.
pdfPath: a path to a directory of PDF files or to a single PDF file, e.g. "path/pdfs/"
Example
val pdfsPath = "home/user/pdfs-directory"
val sparkNLPReader = new SparkNLPReader()
val pdfDf = sparkNLPReader.pdf(pdfsPath)
Example 2
You can also use SparkNLP to do the same in one line of code:
val pdfDf = SparkNLP.read.pdf(pdfsPath)
pdfDf.show()
+--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
|                path|    modificationTime|length|                text|height_dimension|width_dimension|             content|exception|pagenum|
+--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
|file:/content/pdf...|2025-01-15 20:48:...| 25803|This is a Title \...|             842|            596|[25 50 44 46 2D 3...|     NULL|      0|
|file:/content/pdf...|2025-01-15 20:48:...|  9487|This is a page.\n...|             841|            595|[25 50 44 46 2D 3...|     NULL|      0|
+--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+

pdfDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- text: string (nullable = true)
 |-- height_dimension: integer (nullable = true)
 |-- width_dimension: integer (nullable = true)
 |-- content: binary (nullable = true)
 |-- exception: string (nullable = true)
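Unlike the other readers, the PDF output is a flat table with an exception column. The sketch below separates rows by that column, assuming it holds an error message for files that could not be parsed, as the NULL values in the sample suggest. PdfResults and its method names are illustrative, not part of Spark NLP.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object PdfResults {
  // Rows whose "exception" column is null, i.e. no error text was recorded.
  def parsed(pdfDf: DataFrame): DataFrame =
    pdfDf.filter(col("exception").isNull).select("path", "text")

  // Rows where an exception message was recorded.
  def failed(pdfDf: DataFrame): DataFrame =
    pdfDf.filter(col("exception").isNotNull).select("path", "exception")
}
```

Checking PdfResults.failed(pdfDf).count() before further processing is one way to notice unreadable files early.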
- def ppt(content: Array[Byte]): Seq[HTMLElement]
- def ppt(docPath: String): DataFrame
Instantiates class to read PowerPoint files.
docPath: a path to a directory of PowerPoint files or to a single PowerPoint file, e.g. "path/power-point/files"
Example
val docsPath = "home/user/power-point-directory"
val sparkNLPReader = new SparkNLPReader()
val pptDf = sparkNLPReader.ppt(docsPath)
Example 2
You can also use SparkNLP to do the same in one line of code:
val pptDf = SparkNLP.read.ppt(docsPath)
xlsDf.select("ppt").show(false) +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |ppt | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[{Title, Adding a Bullet Slide, {}}, {ListItem, • Find the bullet slide layout, {}}, {ListItem, – Use _TextFrame.text for first bullet, {}}, {ListItem, • Use _TextFrame.add_paragraph() for subsequent bullets, {}}, {NarrativeText, Here is a lot of text!, {}}, {NarrativeText, Here is some text in a text box!, {}}]| +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ pptDf.printSchema() root |-- path: string (nullable = true) |-- content: binary (nullable = true) |-- ppt: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true)
- def setOutputColumn(value: String): Unit
- final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- def txt(filePath: String): DataFrame
Instantiates class to read TXT files.
filePath: a path to a directory of TXT files or to a single TXT file, e.g. "path/txt/files"
Example
val filePath = "home/user/txt/files"
val sparkNLPReader = new SparkNLPReader()
val txtDf = sparkNLPReader.txt(filePath)
Example 2
You can also use SparkNLP to do the same in one line of code:
val txtDf = SparkNLP.read.txt(filePath)
txtDf.select("txt").show(false) +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |txt | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[{Title, BIG DATA ANALYTICS, {paragraph -> 0}}, {NarrativeText, Apache Spark is a fast and general-purpose cluster computing system.\nIt provides high-level APIs in Java, Scala, Python, and R., {paragraph -> 0}}, {Title, MACHINE LEARNING, {paragraph -> 1}}, {NarrativeText, Spark's MLlib provides scalable machine learning algorithms.\nIt includes tools for classification, regression, clustering, and more., {paragraph -> 1}}]| +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ emailDf.printSchema() root |-- path: string (nullable = true) |-- content: binary (nullable = true) |-- txt: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: 
string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true)
- def txtContent(content: String): DataFrame
- def txtToHTMLElement(text: String): Seq[HTMLElement]
- def urlToHTMLElement(url: String): Seq[HTMLElement]
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
- def xls(content: Array[Byte]): Seq[HTMLElement]
- def xls(docPath: String): DataFrame
Instantiates class to read Excel files.
docPath: a path to a directory of Excel files or to a single Excel file, e.g. "path/excel/files"
Example
val docsPath = "home/user/excel-directory"
val sparkNLPReader = new SparkNLPReader()
val xlsDf = sparkNLPReader.xls(docsPath)
Example 2
You can also use SparkNLP to do the same in one line of code:
val xlsDf = SparkNLP.read.xls(docsPath)
xlsDf.select("xls").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|xls                                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{Title, Financial performance, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Quarterly revenue\tNine quarters to 30 June 2023\t\t\t1.0, {SheetName -> Index}}, {NarrativeText, Group financial performance\tFY 22\tFY 23\t\t2.0, {SheetName -> Index}}, {NarrativeText, Segmental results\tFY 22\tFY 23\t\t3.0, {SheetName -> Index}}, {NarrativeText, Segmental analysis\tFY 22\tFY 23\t\t4.0, {SheetName -> Index}}, {NarrativeText, Cash flow\tFY 22\tFY 23\t\t5.0, {SheetName -> Index}}, {Title, Operational metrics, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Mobile customers\tNine quarters to 30 June 2023\t\t\t6.0, {SheetName -> Index}}, {NarrativeText, Fixed broadband customers\tNine quarters to 30 June 2023\t\t\t7.0, {SheetName -> Index}}, {NarrativeText, Marketable homes passed\tNine quarters to 30 June 2023\t\t\t8.0, {SheetName -> Index}}, {NarrativeText, TV customers\tNine quarters to 30 June 2023\t\t\t9.0, {SheetName -> Index}}, {NarrativeText, Converged customers\tNine quarters to 30 June 2023\t\t\t10.0, {SheetName -> Index}}, {NarrativeText, Mobile churn\tNine quarters to 30 June 2023\t\t\t11.0, {SheetName -> Index}}, {NarrativeText, Mobile data usage\tNine quarters to 30 June 2023\t\t\t12.0, {SheetName -> Index}}, {NarrativeText, Mobile ARPU\tNine quarters to 30 June 2023\t\t\t13.0, {SheetName -> Index}}, {Title, Other, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Average foreign exchange rates\tNine quarters to 30 June 2023\t\t\t14.0, {SheetName -> Index}}, {NarrativeText, Guidance rates\tFY 23/24\t\t\t14.0, {SheetName -> Index}}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

xlsDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- xls: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
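In the Excel sample, each element's content packs a whole spreadsheet row into one string with cells separated by \t. The sketch below splits such a string back into cells, assuming the \t sequences in the displayed output stand for actual tab characters in the data. SheetRows is an illustrative name, not a Spark NLP class.

```scala
object SheetRows {
  // Splits one element's content on tab characters.
  // The limit of -1 keeps empty cells, including trailing ones.
  def cells(content: String): Vector[String] =
    content.split("\t", -1).toVector

  // Same as cells, but with the empty cells dropped.
  def nonEmptyCells(content: String): Vector[String] =
    cells(content).filter(_.nonEmpty)
}
```

Keeping the empty cells preserves column positions, which matters when rows such as the "Topic / Period / Page" headers need to line up with the data rows beneath them.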
- def xml(xmlPath: String): DataFrame
Instantiates class to read XML files.
xmlPath: a path to a directory of XML files or to a single XML file, e.g. "path/xml/files"
Example
val xmlPath = "home/user/xml-directory"
val sparkNLPReader = new SparkNLPReader()
val xmlDf = sparkNLPReader.xml(xmlPath)
Example 2
You can also use SparkNLP to do the same in one line of code:
val xmlDf = SparkNLP.read.xml(xmlPath)
xmlDf.select("xml").show(false)
+-----------------------------------------------------------------------------------------------------------------+
|xml                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------+
|[{Title, John Smith, {elementId -> ..., tag -> title}}, {UncategorizedText, Some content..., {elementId -> ...}}]|
+-----------------------------------------------------------------------------------------------------------------+

xmlDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- xml: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
- xmlPath
Path to the XML file or directory
- returns
A DataFrame with parsed XML as structured elements
- def xmlToHTMLElement(xml: String): Seq[HTMLElement]