Packages

c

com.johnsnowlabs.reader

SparkNLPReader

class SparkNLPReader extends AnyRef

Linear Supertypes
AnyRef, Any
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. SparkNLPReader
  2. AnyRef
  3. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. All

Instance Constructors

  1. new SparkNLPReader(params: Map[String, String] = new java.util.HashMap())

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. def doc(docPath: String): DataFrame

    Instantiates class to read Word files.

    Instantiates class to read Word files.

    docPath: this is a path to a directory of Word files or a path to an HTML file E.g. "path/word/files"

    Example

    val docsPath = "home/user/word-directory"
    val sparkNLPReader = new SparkNLPReader()
    val docsDf = sparkNLPReader.email(docsPath)

    Example 2

    You can use SparkNLP for one line of code

    val docsDf = SparkNLP.read.doc(docsPath)
    docsDf.select("doc").show(false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------+
    |doc                                                                                                                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
    +----------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Table, Header Col 1, {}}, {Table, Header Col 2, {}}, {Table, Lorem ipsum, {}}, {Table, A Link example, {}}, {NarrativeText, Dolor sit amet, {}}]  |
    +----------------------------------------------------------------------------------------------------------------------------------------------------+
    
    docsDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- doc: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
  7. def email(emailPath: String): DataFrame

    Instantiates class to read email files.

    Instantiates class to read email files.

    emailPath: this is a path to a directory of HTML files or a path to an HTML file E.g. "path/email/files"

    Example

    val emailsPath = "home/user/emails-directory"
    val sparkNLPReader = new SparkNLPReader()
    val emailDf = sparkNLPReader.email(emailsPath)

    Example 2

    You can use SparkNLP for one line of code

    val emailDf = SparkNLP.read.email(emailsPath)
    emailDf.select("email").show(false)
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |email                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, Email Text Attachments, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>}}, {NarrativeText, Email  test with two text attachments\r\n\r\nCheers,\r\n\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {NarrativeText, <html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Email&nbsp; test with two text attachments</span>\r\n
    
    \r\n<br>\r\n\r\n
    
    \r\nCheers,\r\n
    
    \r\n<br>\r\n\r\n</body>\r\n</html>\r\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/html}}, {Attachment, filename.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename.txt"}}, {NarrativeText, This is the content of the file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}, {Attachment, filename2.txt, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, contentType -> text/plain; name="filename2.txt"}}, {NarrativeText, This is an additional content file.\n, {sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>, mimeType -> text/plain}}]|
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    emailDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- email: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
  8. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  13. def html(urls: List[String]): DataFrame
  14. def html(urls: Array[String]): DataFrame
  15. def html(htmlPath: String): DataFrame

    Instantiates class to read HTML files.

    Instantiates class to read HTML files.

    Two types of input paths are supported,

    htmlPath: this is a path to a directory of HTML files or a path to an HTML file E.g. "path/html/files"

    url: this is the URL or set of URLs of a website . E.g., "https://www.wikipedia.org"

    Example

    val url = "https://www.wikipedia.org"
    val sparkNLPReader = new SparkNLPReader()
    val htmlDf = sparkNLPReader.html(url)

    Example 2

    You can use SparkNLP for one line of code

    val htmlDf = SparkNLP.read.html(url)
    htmlDf.show(false)
    
    +--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |url                 |html                                                                                                                                                                                                                                                                                                                            |
    +--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |https://example.com/|[{Title, Example Domain, {pageNumber -> 1}}, {NarrativeText, 0, This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission., {pageNumber -> 1}}, {NarrativeText, 0, More information... More information..., {pageNumber -> 1}}]   |
    +--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    htmlDf.printSchema()
    root
     |-- url: string (nullable = true)
     |-- html: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
  16. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  17. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  18. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  19. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  20. def pdf(pdfPath: String): DataFrame

    Instantiates class to read PDF files.

    Instantiates class to read PDF files.

    pdfPath: this is a path to a directory of PDF files or a path to an PDF file E.g. "path/pdfs/"

    Example

    val pdfsPath = "home/user/pdfs-directory"
    val sparkNLPReader = new SparkNLPReader()
    val pdfDf = sparkNLPReader.pdf(pdfsPath)

    Example 2

    You can use SparkNLP for one line of code

    val pdfDf = SparkNLP.read.pdf(pdfsPath)
    pdfDf.show(false)
    +--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
    |                path|    modificationTime|length|                text|height_dimension|width_dimension|             content|exception|pagenum|
    +--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
    |file:/content/pdf...|2025-01-15 20:48:...| 25803|This is a Title \...|             842|            596|[25 50 44 46 2D 3...|     NULL|      0|
    |file:/content/pdf...|2025-01-15 20:48:...|  9487|This is a page.\n...|             841|            595|[25 50 44 46 2D 3...|     NULL|      0|
    +--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
    
    pdf_df.printSchema()
    root
     |-- path: string (nullable = true)
     |-- modificationTime: timestamp (nullable = true)
     |-- length: long (nullable = true)
     |-- text: string (nullable = true)
     |-- height_dimension: integer (nullable = true)
     |-- width_dimension: integer (nullable = true)
     |-- content: binary (nullable = true)
     |-- exception: string (nullable = true)
     |-- pagenum: integer (nullable = true)
  21. def ppt(docPath: String): DataFrame

    Instantiates class to read PowerPoint files.

    Instantiates class to read PowerPoint files.

    docPath: this is a path to a directory of Excel files or a path to an HTML file E.g. "path/power-point/files"

    Example

    val docsPath = "home/user/power-point-directory"
    val sparkNLPReader = new SparkNLPReader()
    val pptDf = sparkNLPReader.ppt(docsPath)

    Example 2

    You can use SparkNLP for one line of code

    val pptDf = SparkNLP.read.ppt(docsPath)
    xlsDf.select("ppt").show(false)
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |ppt                                                                                                                                                                                                                                                                                                                      |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, Adding a Bullet Slide, {}}, {ListItem, • Find the bullet slide layout, {}}, {ListItem, – Use _TextFrame.text for first bullet, {}}, {ListItem, • Use _TextFrame.add_paragraph() for subsequent bullets, {}}, {NarrativeText, Here is a lot of text!, {}}, {NarrativeText, Here is some text in a text box!, {}}]|
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    pptDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- ppt: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
  22. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  23. def toString(): String
    Definition Classes
    AnyRef → Any
  24. def txt(filePath: String): DataFrame

    Instantiates class to read txt files.

    Instantiates class to read txt files.

    filePath: this is a path to a directory of TXT files or a path to an TXT file E.g. "path/txt/files"

    Example

    val filePath = "home/user/txt/files"
    val sparkNLPReader = new SparkNLPReader()
    val txtDf = sparkNLPReader.txt(filePath)

    Example 2

    You can use SparkNLP for one line of code

    val txtDf = SparkNLP.read.txt(filePath)
    txtDf.select("txt").show(false)
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |txt                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, BIG DATA ANALYTICS, {paragraph -> 0}}, {NarrativeText, Apache Spark is a fast and general-purpose cluster computing system.\nIt provides high-level APIs in Java, Scala, Python, and R., {paragraph -> 0}}, {Title, MACHINE LEARNING, {paragraph -> 1}}, {NarrativeText, Spark's MLlib provides scalable machine learning algorithms.\nIt includes tools for classification, regression, clustering, and more., {paragraph -> 1}}]|
    +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    emailDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- txt: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
  25. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  28. def xls(docPath: String): DataFrame

    Instantiates class to read Excel files.

    Instantiates class to read Excel files.

    docPath: this is a path to a directory of Excel files or a path to an HTML file E.g. "path/excel/files"

    Example

    val docsPath = "home/user/excel-directory"
    val sparkNLPReader = new SparkNLPReader()
    val xlsDf = sparkNLPReader.xls(docsPath)

    Example 2

    You can use SparkNLP for one line of code

    val xlsDf = SparkNLP.read.xls(docsPath)
    xlsDf.select("xls").show(false)
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |xls                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{Title, Financial performance, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Quarterly revenue\tNine quarters to 30 June 2023\t\t\t1.0, {SheetName -> Index}}, {NarrativeText, Group financial performance\tFY 22\tFY 23\t\t2.0, {SheetName -> Index}}, {NarrativeText, Segmental results\tFY 22\tFY 23\t\t3.0, {SheetName -> Index}}, {NarrativeText, Segmental analysis\tFY 22\tFY 23\t\t4.0, {SheetName -> Index}}, {NarrativeText, Cash flow\tFY 22\tFY 23\t\t5.0, {SheetName -> Index}}, {Title, Operational metrics, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Mobile customers\tNine quarters to 30 June 2023\t\t\t6.0, {SheetName -> Index}}, {NarrativeText, Fixed broadband customers\tNine quarters to 30 June 2023\t\t\t7.0, {SheetName -> Index}}, {NarrativeText, Marketable homes passed\tNine quarters to 30 June 2023\t\t\t8.0, {SheetName -> Index}}, {NarrativeText, TV customers\tNine quarters to 30 June 2023\t\t\t9.0, {SheetName -> Index}}, {NarrativeText, Converged customers\tNine quarters to 30 June 2023\t\t\t10.0, {SheetName -> Index}}, {NarrativeText, Mobile churn\tNine quarters to 30 June 2023\t\t\t11.0, {SheetName -> Index}}, {NarrativeText, Mobile data usage\tNine quarters to 30 June 2023\t\t\t12.0, {SheetName -> Index}}, {NarrativeText, Mobile ARPU\tNine quarters to 30 June 2023\t\t\t13.0, {SheetName -> Index}}, {Title, Other, {SheetName -> Index}}, {Title, Topic\tPeriod\t\t\tPage, {SheetName -> Index}}, {NarrativeText, Average foreign exchange rates\tNine quarters to 30 June 2023\t\t\t14.0, {SheetName -> Index}}, {NarrativeText, Guidance rates\tFY 23/24\t\t\t14.0, {SheetName -> Index}}]|
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    xlsDf.printSchema()
    root
     |-- path: string (nullable = true)
     |-- content: binary (nullable = true)
     |-- xls: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)

Inherited from AnyRef

Inherited from Any

Ungrouped