sparknlp.reader.sparknlp_reader
Module Contents#
Classes#
SparkNLPReader: Instantiates class to read documents in various formats.
- class SparkNLPReader(spark, params=None, headers=None)[source]#
Instantiates a class to read documents in various formats.
- Parameters:
- spark : SparkSession
Active Spark session.
- params : dict, optional
Dictionary with custom configuration parameters (reader-specific options).
- headers : dict, optional
HTTP headers to send when reading content from URLs.
Notes
This class can read HTML, email, PDF, MS Word, Excel, PowerPoint, and text files.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> reader = SparkNLPReader(spark)
Reading HTML
>>> html_df = reader.html("https://www.wikipedia.org")
>>> # Or with shorthand
>>> import sparknlp
>>> html_df = sparknlp.read().html("https://www.wikipedia.org")
Reading PDF
>>> pdf_df = reader.pdf("home/user/pdfs-directory")
>>> # Or with shorthand
>>> pdf_df = sparknlp.read().pdf("home/user/pdfs-directory")
Reading Email
>>> email_df = reader.email("home/user/emails-directory")
>>> # Or with shorthand
>>> email_df = sparknlp.read().email("home/user/emails-directory")
- html(htmlPath)[source]#
Reads HTML files or URLs and returns a Spark DataFrame.
- Parameters:
- htmlPath : str or list of str
Path(s) to HTML file(s) or a list of URLs.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing the parsed HTML content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> html_df = SparkNLPReader(spark).html("https://www.wikipedia.org")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> html_df = sparknlp.read().html("https://www.wikipedia.org")
>>> html_df.show(truncate=False)
+---+----------------------------------------------------------------------+
|url|html                                                                  |
+---+----------------------------------------------------------------------+
|...|[{Title, Example Domain, {pageNumber -> 1}}, {NarrativeText, 0, Thi...|
+---+----------------------------------------------------------------------+
>>> html_df.printSchema()
root
 |-- url: string (nullable = true)
 |-- html: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
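Every reader in this module emits elements shaped like the schema above (elementType, content, metadata). As a plain-Python sketch of post-processing a collected row (the sample data below is hypothetical and only mirrors that schema), you could filter elements by type:

```python
# Plain-Python sketch: filter parsed HTML elements by elementType after
# collecting a row to the driver (e.g. row = html_df.collect()[0]).
# The sample row below is hypothetical; it only mirrors the schema above.
row = {
    "url": "https://example.com",
    "html": [
        {"elementType": "Title", "content": "Example Domain",
         "metadata": {"pageNumber": "1"}},
        {"elementType": "NarrativeText", "content": "This domain is for use...",
         "metadata": {"pageNumber": "1"}},
    ],
}

def texts_of(elements, element_type):
    """Return the content of every parsed element of the given type."""
    return [e["content"] for e in elements if e["elementType"] == element_type]

print(texts_of(row["html"], "Title"))  # ['Example Domain']
```

The same helper works unchanged on rows from the other readers, since they all share the elementType/content/metadata struct.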
- email(filePath)[source]#
Reads email files and returns a Spark DataFrame.
- Parameters:
- filePath : str
Path to an email file or a directory containing emails.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed email data.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> email_df = SparkNLPReader(spark).email("home/user/emails-directory")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> email_df = sparknlp.read().email("home/user/emails-directory")
>>> email_df.show()
+---------------------------------------------------+
|email                                              |
+---------------------------------------------------+
|[{Title, Email Text Attachments, {sent_to -> Danilo|
+---------------------------------------------------+
>>> email_df.printSchema()
root
 |-- path: string (nullable = true)
 |-- content: array (nullable = true)
 |-- email: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
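The metadata map carries per-element details; for emails these include keys such as sent_to, as in the truncated output above. A hedged plain-Python sketch of reading a metadata key from a collected element (the sample element below is hypothetical and only mirrors the schema):

```python
# Plain-Python sketch: look up a metadata key on a parsed email element.
# The element below is hypothetical; real values depend on the input file.
element = {
    "elementType": "Title",
    "content": "Email Text Attachments",
    "metadata": {"sent_to": "Danilo"},
}

def metadata_value(element, key, default=None):
    """Fetch a metadata key from a parsed element, tolerating its absence."""
    return element.get("metadata", {}).get(key, default)

print(metadata_value(element, "sent_to"))        # Danilo
print(metadata_value(element, "cc", "unknown"))  # unknown
```

Using `.get` with a default keeps the lookup safe, since not every element carries every metadata key.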
- doc(docPath)[source]#
Reads Word document files and returns a Spark DataFrame.
- Parameters:
- docPath : str
Path to a Word document file.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed document content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> doc_df = SparkNLPReader(spark).doc("home/user/word-directory")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> doc_df = sparknlp.read().doc("home/user/word-directory")
>>> doc_df.show()
+-------------------------------------------------+
|doc                                              |
+-------------------------------------------------+
|[{Table, Header Col 1, {}}, {Table, Header Col 2,|
+-------------------------------------------------+
>>> doc_df.printSchema()
root
 |-- path: string (nullable = true)
 |-- content: array (nullable = true)
 |-- doc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
- xls(docPath)[source]#
Reads Excel document files and returns a Spark DataFrame.
- Parameters:
- docPath : str
Path to an Excel document file.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed document content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> xlsDf = SparkNLPReader(spark).xls("home/user/excel-directory")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> xlsDf = sparknlp.read().xls("home/user/excel-directory")
>>> xlsDf.show()
+--------------------------------------------+
|xls                                         |
+--------------------------------------------+
|[{Title, Financial performance, {SheetNam}}]|
+--------------------------------------------+
>>> xlsDf.printSchema()
root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- xls: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
- ppt(docPath)[source]#
Reads PowerPoint document files and returns a Spark DataFrame.
- Parameters:
- docPath : str
Path to a PowerPoint document file.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed document content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> pptDf = SparkNLPReader(spark).ppt("home/user/powerpoint-directory")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> pptDf = sparknlp.read().ppt("home/user/powerpoint-directory")
>>> pptDf.show(truncate=False)
+-------------------------------------+
|ppt                                  |
+-------------------------------------+
|[{Title, Adding a Bullet Slide, {}},]|
+-------------------------------------+
- txt(docPath)[source]#
Reads TXT files and returns a Spark DataFrame.
- Parameters:
- docPath : str
Path to a TXT file.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed document content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> txtDf = SparkNLPReader(spark).txt("home/user/txt/files")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> txtDf = sparknlp.read().txt("home/user/txt/files")
>>> txtDf.show(truncate=False)
+-----------------------------------------------+
|txt                                            |
+-----------------------------------------------+
|[{Title, BIG DATA ANALYTICS, {paragraph -> 0}}]|
+-----------------------------------------------+
- xml(docPath)[source]#
Reads XML files and returns a Spark DataFrame.
- Parameters:
- docPath : str
Path to an XML file or a directory containing XML files.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed XML content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> xml_df = SparkNLPReader(spark).xml("home/user/xml-directory")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> xml_df = sparknlp.read().xml("home/user/xml-directory")
>>> xml_df.show(truncate=False)
+-------------------------------------------------------+
|xml                                                    |
+-------------------------------------------------------+
|[{Title, John Smith, {elementId -> ..., tag -> title}}]|
+-------------------------------------------------------+
>>> xml_df.printSchema()
root
 |-- path: string (nullable = true)
 |-- xml: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
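Since XML elements keep their originating tag in metadata (tag -> title in the output above), a common post-processing step is grouping content by tag. A plain-Python sketch under the same schema (the sample elements below are hypothetical):

```python
# Plain-Python sketch: group parsed XML element contents by source tag,
# read from the metadata map in the schema above. Sample data hypothetical.
elements = [
    {"elementType": "Title", "content": "John Smith",
     "metadata": {"tag": "title"}},
    {"elementType": "UncategorizedText", "content": "Some body text",
     "metadata": {"tag": "p"}},
]

def by_tag(elements):
    """Map each metadata tag to the list of contents carrying it."""
    grouped = {}
    for e in elements:
        grouped.setdefault(e["metadata"].get("tag"), []).append(e["content"])
    return grouped

print(by_tag(elements))  # {'title': ['John Smith'], 'p': ['Some body text']}
```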
- md(filePath)[source]#
Reads Markdown files and returns a Spark DataFrame.
- Parameters:
- filePath : str
Path to a Markdown file or a directory containing Markdown files.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed Markdown content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> md_df = SparkNLPReader(spark).md("home/user/markdown-directory")
You can also use SparkNLP to simplify the process:
>>> import sparknlp
>>> md_df = sparknlp.read().md("home/user/markdown-directory")
>>> md_df.show(truncate=False)
+---------------------------------------------------------------------+
|md                                                                   |
+---------------------------------------------------------------------+
|[{Title, Sample Markdown Document, {elementId -> ..., tag -> title}}]|
+---------------------------------------------------------------------+
>>> md_df.printSchema()
root
 |-- path: string (nullable = true)
 |-- md: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)