com.johnsnowlabs.reader

SparkNLPReader

class SparkNLPReader extends Serializable

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new SparkNLPReader(params: Map[String, String] = new java.util.HashMap(), headers: Map[String, String] = new java.util.HashMap())

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. def doc(content: Array[Byte]): Seq[HTMLElement]
  7. def doc(docPath: String): DataFrame

    Instantiates class to read Word files.

    docPath: a path to a directory of Word files or to a single Word file, e.g. "path/word/files"

    Example

    val docsPath = "home/user/word-directory"
    val sparkNLPReader = new SparkNLPReader()
    val docsDf = sparkNLPReader.doc(docsPath)

    Example 2

    Alternatively, read the files in one line through SparkNLP:

    val docsDf = SparkNLP.read.doc(docsPath)
    docsDf.select("doc").show(false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------+
    |doc                                                                                                                                                 |
    +----------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Table, Header Col 1, {}}, {Table, Header Col 2, {}}, {Table, Lorem ipsum, {}}, {Table, A Link example, {}}, {NarrativeText, Dolor sit amet, {}}]  |
    +----------------------------------------------------------------------------------------------------------------------------------------------------+
    
    docsDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- doc: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
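
    Because doc is an array of structs, it is often easier to work with one row per parsed element. The following is a minimal sketch of that step using Spark's explode. It does not call the reader: DocElement, sampleDocs and the file name sample.docx are hypothetical stand-ins that only mimic the schema printed above (without the binary content column), so the snippet runs with Spark alone.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical stand-in for one entry of the `doc` array; field names follow the printed schema.
case class DocElement(elementType: String, content: String, metadata: Map[String, String])

object DocElements {
  // Builds a small DataFrame shaped like the reader output shown above (assumed sample data).
  def sampleDocs(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("path/word/files/sample.docx", Seq(
        DocElement("Table", "Header Col 1", Map.empty),
        DocElement("Table", "Header Col 2", Map.empty),
        DocElement("Table", "Lorem ipsum", Map.empty),
        DocElement("Table", "A Link example", Map.empty),
        DocElement("NarrativeText", "Dolor sit amet", Map.empty)))
    ).toDF("path", "doc")
  }

  // Produces one output row per parsed element, keeping the source path.
  def explodeElements(df: DataFrame): DataFrame =
    df.select(col("path"), explode(col("doc")).as("element"))
      .select(col("path"), col("element.elementType"), col("element.content"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("doc-elements").getOrCreate()
    explodeElements(sampleDocs(spark)).show(false)
    spark.stop()
  }
}
```

    With a real read, explodeElements would be applied to the DataFrame returned by doc(docPath) instead of sampleDocs.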
  8. def email(content: Array[Byte]): Seq[HTMLElement]
  9. def email(emailPath: String): DataFrame

    Instantiates class to read email files.

    emailPath: a path to a directory of email files or to a single email file, e.g. "path/email/files"

    Example

    val emailsPath = "home/user/emails-directory"
    val sparkNLPReader = new SparkNLPReader()
    val emailDf = sparkNLPReader.email(emailsPath)

    Example 2

    Alternatively, read the files in one line through SparkNLP:

    val emailDf = SparkNLP.read.email(emailsPath)
    emailDf.select("email").show(false)
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    |email                                                                                                                                                 |
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, Email Text Attachments, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>}}, {NarrativeText, Email  test with two text attachments\r\n\r\nCheers,\r\n\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {NarrativeText, <html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Email&nbsp; test with two text attachments</span>\r\n\r\n<br>\r\n\r\n\r\nCheers,\r\n\r\n<br>\r\n\r\n</body>\r\n</html>\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/html}}, {Attachment, filename.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename.txt"}}, {NarrativeText, This is the content of the file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {Attachment, filename2.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename2.txt"}}, {NarrativeText, This is an additional content file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}]|
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    emailDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- email: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
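
    The metadata map carries per-element details such as sent_from, sent_to, mimeType and contentType, as the sample output shows. A minimal sketch of listing the attachments of each email from those fields follows. EmailElement, sampleEmails and the file name sample.eml are hypothetical stand-ins that copy a few values from the output above, so the snippet needs only Spark and no email files.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical stand-in for one entry of the `email` array; field names follow the printed schema.
case class EmailElement(elementType: String, content: String, metadata: Map[String, String])

object EmailAttachments {
  private val sender = "Danilo Burbano <danilo@johnsnowlabs.com>"
  private val common = Map("sent_to" -> sender, "sent_from" -> sender)

  // A few elements copied from the sample output above (assumed sample data).
  def sampleEmails(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("path/email/files/sample.eml", Seq(
        EmailElement("Title", "Email Text Attachments", common),
        EmailElement("Attachment", "filename.txt",
          common + ("contentType" -> "text/plain; name=\"filename.txt\"")),
        EmailElement("NarrativeText", "This is the content of the file.\n",
          common + ("mimeType" -> "text/plain")),
        EmailElement("Attachment", "filename2.txt",
          common + ("contentType" -> "text/plain; name=\"filename2.txt\""))))
    ).toDF("path", "email")
  }

  // Keeps only Attachment elements and lifts selected metadata keys into columns.
  def attachments(df: DataFrame): DataFrame =
    df.select(col("path"), explode(col("email")).as("element"))
      .where(col("element.elementType") === "Attachment")
      .select(
        col("path"),
        col("element.content").as("fileName"),
        col("element.metadata").getItem("sent_from").as("sentFrom"),
        col("element.metadata").getItem("contentType").as("contentType"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("email-attachments").getOrCreate()
    attachments(sampleEmails(spark)).show(false)
    spark.stop()
  }
}
```

    A key that is missing from an element's metadata yields null in the corresponding column, so the same selection can be reused for elements that carry mimeType instead of contentType.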
  10. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  12. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  14. def getOutputColumn: String
  15. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  16. def html(urls: List[String]): DataFrame
  17. def html(urls: Array[String]): DataFrame
  18. def html(htmlPath: String): DataFrame
  19. def htmlToHTMLElement(html: String): Seq[HTMLElement]
  20. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  21. def md(mdPath: String): DataFrame

    Instantiates class to read Markdown (.md) files.

    This method loads a Markdown file or directory of .md files and parses the content into structured elements such as headers, narrative text, lists, and code blocks.

    Example

    val mdPath = "home/user/markdown-files"
    val sparkNLPReader = new SparkNLPReader()
    val mdDf = sparkNLPReader.md(mdPath)

    Example 2

    Use SparkNLP in one line:

    val mdDf = SparkNLP.read.md(mdPath)
    mdDf.select("md").show(false)
    
    +-----------------------------------------------------------------------------------------------------------------------------------+
    |md                                                                                                                                 |
    +-----------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, Introduction, {level -> 1, paragraph -> 0}}, {NarrativeText, This is a Markdown paragraph., {paragraph -> 0}}, ...]        |
    +-----------------------------------------------------------------------------------------------------------------------------------+
    
    mdDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- md: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
    mdPath

    Path to a single .md file or a directory of Markdown files.

    returns

    A DataFrame with parsed Markdown content as structured HTMLElements.
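
    When only some element types are needed, the md array can be narrowed without exploding it. The sketch below uses Spark SQL's higher-order functions filter and transform (available from Spark 2.4) to collect the heading texts of each file. MdElement, sampleMd and the file name README.md are hypothetical stand-ins built from the two elements visible in the sample output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr}

// Hypothetical stand-in for one entry of the `md` array; field names follow the printed schema.
case class MdElement(elementType: String, content: String, metadata: Map[String, String])

object MdTitles {
  // The two elements visible in the sample output above (assumed sample data).
  def sampleMd(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("home/user/markdown-files/README.md", Seq(
        MdElement("Title", "Introduction", Map("level" -> "1", "paragraph" -> "0")),
        MdElement("NarrativeText", "This is a Markdown paragraph.", Map("paragraph" -> "0"))))
    ).toDF("path", "md")
  }

  // Keeps Title elements inside the array, then maps each one to its text.
  def titles(df: DataFrame): DataFrame =
    df.select(
      col("path"),
      expr("transform(filter(md, e -> e.elementType = 'Title'), e -> e.content)").as("titles"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("md-titles").getOrCreate()
    titles(sampleMd(spark)).show(false)
    spark.stop()
  }
}
```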

  22. def mdToHTMLElement(mdContent: String): Seq[HTMLElement]
  23. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  24. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  25. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  26. def pdf(pdfPath: String): DataFrame

    Instantiates class to read PDF files.

    pdfPath: a path to a directory of PDF files or to a single PDF file, e.g. "path/pdfs/"

    Example

    val pdfsPath = "home/user/pdfs-directory"
    val sparkNLPReader = new SparkNLPReader()
    val pdfDf = sparkNLPReader.pdf(pdfsPath)

    Example 2

    Alternatively, read the files in one line through SparkNLP:

    val pdfDf = SparkNLP.read.pdf(pdfsPath)
    pdfDf.show(false)
    +--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
    |                path|    modificationTime|length|                text|height_dimension|width_dimension|             content|exception|pagenum|
    +--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
    |file:/content/pdf...|2025-01-15 20:48:...| 25803|This is a Title \...|             842|            596|[25 50 44 46 2D 3...|     NULL|      0|
    |file:/content/pdf...|2025-01-15 20:48:...|  9487|This is a page.\n...|             841|            595|[25 50 44 46 2D 3...|     NULL|      0|
    +--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
    
    pdfDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- modificationTime: timestamp (nullable = true)
     |-- length: long (nullable = true)
     |-- text: string (nullable = true)
     |-- height_dimension: integer (nullable = true)
     |-- width_dimension: integer (nullable = true)
     |-- content: binary (nullable = true)
     |-- exception: string (nullable = true)
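
    Unlike the other readers, the PDF output exposes flat text and exception columns rather than an element array. A minimal sketch of separating usable rows from failed ones follows, under the assumption (not stated above) that a non-null exception marks a file that could not be parsed. The sample paths, text and error message are hypothetical, and only the three columns used here are modelled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PdfRows {
  // Hypothetical rows with the path, text and exception columns from the printed schema.
  def samplePdfs(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("path/pdfs/title.pdf", Option("This is a Title"), Option.empty[String]),
      ("path/pdfs/broken.pdf", Option.empty[String], Option("hypothetical parse error"))
    ).toDF("path", "text", "exception")
  }

  // Rows without a recorded exception (assumed to be successfully parsed).
  def parsed(df: DataFrame): DataFrame =
    df.where(col("exception").isNull).select("path", "text")

  // Rows with a recorded exception, kept for inspection.
  def failed(df: DataFrame): DataFrame =
    df.where(col("exception").isNotNull).select("path", "exception")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pdf-rows").getOrCreate()
    val pdfs = samplePdfs(spark)
    parsed(pdfs).show(false)
    failed(pdfs).show(false)
    spark.stop()
  }
}
```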
  27. def ppt(content: Array[Byte]): Seq[HTMLElement]
  28. def ppt(docPath: String): DataFrame

    Instantiates class to read PowerPoint files.

    docPath: a path to a directory of PowerPoint files or to a single PowerPoint file, e.g. "path/power-point/files"

    Example

    val docsPath = "home/user/power-point-directory"
    val sparkNLPReader = new SparkNLPReader()
    val pptDf = sparkNLPReader.ppt(docsPath)

    Example 2

    Alternatively, read the files in one line through SparkNLP:

    val pptDf = SparkNLP.read.ppt(docsPath)
    pptDf.select("ppt").show(false)
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |ppt                                                                                                                                                                                                                                                                                                                      |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, Adding a Bullet Slide, {}}, {ListItem, • Find the bullet slide layout, {}}, {ListItem, – Use _TextFrame.text for first bullet, {}}, {ListItem, • Use _TextFrame.add_paragraph() for subsequent bullets, {}}, {NarrativeText, Here is a lot of text!, {}}, {NarrativeText, Here is some text in a text box!, {}}]|
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    pptDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- ppt: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
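
    A quick way to inspect what was extracted from a presentation is to count elements per elementType. The sketch below does this with explode and groupBy. PptElement, samplePpt and the file name sample.pptx are hypothetical stand-ins that reuse the elements from the sample output above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical stand-in for one entry of the `ppt` array; field names follow the printed schema.
case class PptElement(elementType: String, content: String, metadata: Map[String, String])

object PptElementCounts {
  // Elements copied from the sample output above (assumed sample data).
  def samplePpt(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("path/power-point/files/sample.pptx", Seq(
        PptElement("Title", "Adding a Bullet Slide", Map.empty),
        PptElement("ListItem", "• Find the bullet slide layout", Map.empty),
        PptElement("ListItem", "– Use _TextFrame.text for first bullet", Map.empty),
        PptElement("ListItem", "• Use _TextFrame.add_paragraph() for subsequent bullets", Map.empty),
        PptElement("NarrativeText", "Here is a lot of text!", Map.empty),
        PptElement("NarrativeText", "Here is some text in a text box!", Map.empty)))
    ).toDF("path", "ppt")
  }

  // Flattens the array, then counts how many elements of each type were extracted.
  def countByType(df: DataFrame): DataFrame =
    df.select(explode(col("ppt")).as("element"))
      .select(col("element.elementType").as("elementType"))
      .groupBy("elementType")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ppt-counts").getOrCreate()
    countByType(samplePpt(spark)).show(false)
    spark.stop()
  }
}
```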
  29. def setOutputColumn(value: String): Unit
  30. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  31. def toString(): String
    Definition Classes
    AnyRef → Any
  32. def txt(filePath: String): DataFrame

    Instantiates class to read txt files.

    filePath: a path to a directory of TXT files or to a single TXT file, e.g. "path/txt/files"

    Example

    val filePath = "home/user/txt/files"
    val sparkNLPReader = new SparkNLPReader()
    val txtDf = sparkNLPReader.txt(filePath)

    Example 2

    Alternatively, read the files in one line through SparkNLP:

    val txtDf = SparkNLP.read.txt(filePath)
    txtDf.select("txt").show(false)
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |txt                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, BIG DATA ANALYTICS, {paragraph -> 0}}, {NarrativeText, Apache Spark is a fast and general-purpose cluster computing system.\nIt provides high-level APIs in Java, Scala, Python, and R., {paragraph -> 0}}, {Title, MACHINE LEARNING, {paragraph -> 1}}, {NarrativeText, Spark's MLlib provides scalable machine learning algorithms.\nIt includes tools for classification, regression, clustering, and more., {paragraph -> 1}}]|
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    txtDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- txt: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
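
    In the sample output each Title shares a paragraph value with the NarrativeText that follows it, which makes it possible to rebuild titled sections on the driver. The sketch below is plain Scala and does not depend on Spark. TxtElement is a hypothetical stand-in whose fields mirror the printed schema; whether the library's HTMLElement exposes the same field names is an assumption.

```scala
import scala.util.Try

// Hypothetical stand-in for a parsed element; field names follow the printed schema.
case class TxtElement(elementType: String, content: String, metadata: Map[String, String])

object TxtSections {
  // Elements copied from the sample output above (assumed sample data).
  val sample: Seq[TxtElement] = Seq(
    TxtElement("Title", "BIG DATA ANALYTICS", Map("paragraph" -> "0")),
    TxtElement("NarrativeText",
      "Apache Spark is a fast and general-purpose cluster computing system.\nIt provides high-level APIs in Java, Scala, Python, and R.",
      Map("paragraph" -> "0")),
    TxtElement("Title", "MACHINE LEARNING", Map("paragraph" -> "1")),
    TxtElement("NarrativeText",
      "Spark's MLlib provides scalable machine learning algorithms.\nIt includes tools for classification, regression, clustering, and more.",
      Map("paragraph" -> "1")))

  // Pairs each Title with the NarrativeText that shares its numeric `paragraph` value.
  // Elements without a parseable paragraph index, and paragraphs without a Title, are skipped.
  def sections(elements: Seq[TxtElement]): Seq[(String, String)] = {
    val indexed = elements.flatMap { e =>
      e.metadata.get("paragraph").flatMap(p => Try(p.toInt).toOption).map(idx => idx -> e)
    }
    indexed.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (_, pairs) =>
      val group = pairs.map(_._2)
      val title = group.find(_.elementType == "Title").map(_.content)
      val body = group.filter(_.elementType == "NarrativeText").map(_.content).mkString("\n")
      title.map(t => t -> body)
    }
  }

  def main(args: Array[String]): Unit =
    sections(sample).foreach { case (title, body) => println(s"$title\n$body\n") }
}
```

    Sorting on the parsed integer rather than the raw string keeps paragraph 10 after paragraph 2 in longer files.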
  33. def txtContent(content: String): DataFrame
  34. def txtToHTMLElement(text: String): Seq[HTMLElement]
  35. def urlToHTMLElement(url: String): Seq[HTMLElement]
  36. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  37. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  39. def xls(content: Array[Byte]): Seq[HTMLElement]
  40. def xls(docPath: String): DataFrame

    Instantiates class to read Excel files.

    docPath: a path to a directory of Excel files or to a single Excel file, e.g. "path/excel/files"

    Example

    val docsPath = "home/user/excel-directory"
    val sparkNLPReader = new SparkNLPReader()
    val xlsDf = sparkNLPReader.xls(docsPath)

    Example 2

    Alternatively, read the files in one line through SparkNLP:

    val xlsDf = SparkNLP.read.xls(docsPath)
    xlsDf.select("xls").show(false)
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |xls                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, Financial performance, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Quarterly revenue\tNine quarters to 30 June 2023\t\t\t1.0, {SheetName -> Index}}, {NarrativeText, Group financial performance\tFY 22\tFY 23\t\t2.0, {SheetName -> Index}}, {NarrativeText, Segmental results\tFY 22\tFY 23\t\t3.0, {SheetName -> Index}}, {NarrativeText, Segmental analysis\tFY 22\tFY 23\t\t4.0, {SheetName -> Index}}, {NarrativeText, Cash flow\tFY 22\tFY 23\t\t5.0, {SheetName -> Index}}, {Title, Operational metrics, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Mobile customers\tNine quarters to 30 June 2023\t\t\t6.0, {SheetName -> Index}}, {NarrativeText, Fixed broadband customers\tNine quarters to 30 June 2023\t\t\t7.0, {SheetName -> Index}}, {NarrativeText, Marketable homes passed\tNine quarters to 30 June 2023\t\t\t8.0, {SheetName -> Index}}, {NarrativeText, TV customers\tNine quarters to 30 June 2023\t\t\t9.0, {SheetName -> Index}}, {NarrativeText, Converged customers\tNine quarters to 30 June 2023\t\t\t10.0, {SheetName -> Index}}, {NarrativeText, Mobile churn\tNine quarters to 30 June 2023\t\t\t11.0, {SheetName -> Index}}, {NarrativeText, Mobile data usage\tNine quarters to 30 June 2023\t\t\t12.0, {SheetName -> Index}}, {NarrativeText, Mobile ARPU\tNine quarters to 30 June 2023\t\t\t13.0, {SheetName -> Index}}, {Title, Other, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Average foreign exchange rates\tNine quarters to 30 June 2023\t\t\t14.0, {SheetName -> Index}}, {NarrativeText, Guidance rates\tFY 23/24\t\t\t14.0, {SheetName -> Index}}]|
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    xlsDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- xls: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
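Each element's content in the output above appears to hold one spreadsheet row, with cells joined by `\t`. A minimal sketch of splitting such a string back into cells, assuming the `\t` in the output really are tab characters; the `SplitXlsRow` object and its helper names are hypothetical and not part of SparkNLPReader:

```scala
object SplitXlsRow {

  // Split one element's content into cells.
  // The -1 limit keeps trailing empty cells instead of dropping them.
  def cells(content: String): Array[String] = content.split("\t", -1)

  // Same split, with blank cells discarded.
  def nonEmptyCells(content: String): Seq[String] =
    cells(content).map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Content string copied from the sample output above
    val content = "Quarterly revenue\tNine quarters to 30 June 2023\t\t\t1.0"
    println(nonEmptyCells(content).mkString(" | "))
  }
}
```

Keeping the empty cells (via `cells`) preserves column positions, which matters if the cells are later matched against a header row such as the `Topic\tPeriod\t\t\tPage` element.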
  41. def xml(xmlPath: String): DataFrame

    Instantiates class to read XML files.

    xmlPath: a path to a single XML file or to a directory of XML files, e.g. "path/xml/files"

    Example

    val xmlPath = "home/user/xml-directory"
    val sparkNLPReader = new SparkNLPReader()
    val xmlDf = sparkNLPReader.xml(xmlPath)

    Example 2

    Alternatively, read the files in a single line with SparkNLP.read:

    val xmlDf = SparkNLP.read.xml(xmlPath)
    xmlDf.select("xml").show(false)
    +------------------------------------------------------------------------------------------------------------------------+
    |xml                                                                                                                    |
    +------------------------------------------------------------------------------------------------------------------------+
    |[{Title, John Smith, {elementId -> ..., tag -> title}}, {UncategorizedText, Some content..., {elementId -> ...}}]     |
    +------------------------------------------------------------------------------------------------------------------------+
    
    xmlDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- xml: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
    xmlPath

    Path to the XML file or directory

    returns

    A DataFrame with parsed XML as structured elements
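Because the xml column is an array of structs, downstream code will often want one row per element. The sketch below shows one way to do that with Spark's `explode`. To stay self-contained it builds a small stand-in DataFrame from the sample output above instead of calling `SparkNLP.read.xml`; the `Element` case class, the `FlattenXmlElements` object and its method names are assumptions made for illustration, not part of the library.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Mirrors the element struct printed by xmlDf.printSchema() (hypothetical helper type)
case class Element(elementType: String, content: String, metadata: Map[String, String])

object FlattenXmlElements {

  // Stand-in for SparkNLP.read.xml(xmlPath), using the values from the sample output
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("home/user/xml-directory",
        Seq(
          Element("Title", "John Smith", Map("elementId" -> "...", "tag" -> "title")),
          Element("UncategorizedText", "Some content...", Map("elementId" -> "..."))))
    ).toDF("path", "xml")
  }

  // One output row per parsed element, with the struct fields as top-level columns
  def flatten(xmlDf: DataFrame): DataFrame =
    xmlDf
      .select(col("path"), explode(col("xml")).as("element"))
      .select(
        col("path"),
        col("element.elementType"),
        col("element.content"),
        col("element.metadata"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-xml-elements")
      .getOrCreate()
    flatten(sample(spark)).show(false)
    spark.stop()
  }
}
```

With a real reader result, `flatten(SparkNLP.read.xml(xmlPath))` should work the same way as long as the DataFrame has the path and xml columns shown in the schema.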

  42. def xmlToHTMLElement(xml: String): Seq[HTMLElement]
