sparknlp.reader.sparknlp_reader#
Module Contents#
Classes#
| Instantiates class to read documents in various formats. | 
- class SparkNLPReader(spark, params=None, headers=None)[source]#
- Instantiates class to read documents in various formats. - Parameters:
- paramsspark
- Spark session 
- paramsdict, optional
- Parameter with custom configuration 
 
 - Notes - This class can read HTML, email, PDF, MS Word, Excel, PowerPoint, and text files. - Examples - >>> from sparknlp.reader import SparkNLPReader >>> reader = SparkNLPReader(spark) - Reading HTML - >>> html_df = reader.html("https://www.wikipedia.org") >>> # Or with shorthand >>> import sparknlp >>> html_df = sparknlp.read().html("https://www.wikipedia.org") - Reading PDF - >>> pdf_df = reader.pdf("home/user/pdfs-directory") >>> # Or with shorthand >>> pdf_df = sparknlp.read().pdf("home/user/pdfs-directory") - Reading Email - >>> email_df = reader.email("home/user/emails-directory") >>> # Or with shorthand >>> email_df = sparknlp.read().email("home/user/emails-directory") - html(htmlPath)[source]#
- Reads HTML files or URLs and returns a Spark DataFrame. - Parameters:
- htmlPathstr or list of str
- Path(s) to HTML file(s) or a list of URLs. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing the parsed HTML content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> html_df = SparkNLPReader().html("https://www.wikipedia.org") - You can also use SparkNLP to simplify the process: - >>> import sparknlp >>> html_df = sparknlp.read().html("https://www.wikipedia.org") >>> html_df.show(truncate=False) - url - html - [{Title, Example Domain, {pageNumber -> 1}}, {NarrativeText, 0, This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission., {pageNumber -> 1}}, {NarrativeText, 0, More information… More information…, {pageNumber -> 1}}] - >>> html_df.printSchema() root |-- url: string (nullable = true) |-- html: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) 
 - email(filePath)[source]#
- Reads email files and returns a Spark DataFrame. - Parameters:
- filePathstr
- Path to an email file or a directory containing emails. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed email data. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> email_df = SparkNLPReader(spark).email("home/user/emails-directory") - You can also use SparkNLP to simplify the process: - >>> import sparknlp >>> email_df = sparknlp.read().email("home/user/emails-directory") >>> email_df.show() +---------------------------------------------------+ |email | +---------------------------------------------------+ |[{Title, Email Text Attachments, {sent_to -> Danilo| +---------------------------------------------------+ >>> email_df.printSchema() root |-- path: string (nullable = true) |-- content: array (nullable = true) |-- email: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) 
 - doc(docPath)[source]#
- Reads word document files and returns a Spark DataFrame. - Parameters:
- docPathstr
- Path to a word document file. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed document content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> doc_df = SparkNLPReader().doc(spark, "home/user/word-directory") - You can use SparkNLP for one line of code - >>> import sparknlp >>> doc_df = sparknlp.read().doc("home/user/word-directory") >>> doc_df.show() +-------------------------------------------------+ |doc | | +-------------------------------------------------+ |[{Table, Header Col 1, {}}, {Table, Header Col 2,| +-------------------------------------------------+ - >>> doc_df.printSchema() root |-- path: string (nullable = true) |-- content: array (nullable = true) |-- doc: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) 
 - xls(docPath)[source]#
- Reads excel document files and returns a Spark DataFrame. - Parameters:
- docPathstr
- Path to an excel document file. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed document content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> xlsDf = SparkNLPReader().xls(spark, "home/user/excel-directory") - You can use SparkNLP for one line of code - >>> import sparknlp >>> xlsDf = sparknlp.read().xls("home/user/excel-directory") >>> xlsDf.show() +--------------------------------------------+ |xls | +--------------------------------------------+ |[{Title, Financial performance, {SheetNam}}]| +--------------------------------------------+ - >>> xlsDf.printSchema() root |-- path: string (nullable = true) |-- content: binary (nullable = true) |-- xls: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) 
 - ppt(docPath)[source]#
- Reads power point document files and returns a Spark DataFrame. - Parameters:
- docPathstr
- Path to an power point document file. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed document content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> pptDf = SparkNLPReader().ppt(spark, "home/user/powerpoint-directory") - You can use SparkNLP for one line of code - >>> import sparknlp >>> pptDf = sparknlp.read().ppt("home/user/powerpoint-directory") >>> pptDf.show(truncate=False) +-------------------------------------+ |ppt | +-------------------------------------+ |[{Title, Adding a Bullet Slide, {}},]| +-------------------------------------+ 
 - txt(docPath)[source]#
- Reads TXT files and returns a Spark DataFrame. - Parameters:
- docPathstr
- Path to a TXT file. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed document content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> txtDf = SparkNLPReader().txt(spark, "home/user/txt/files") - You can use SparkNLP for one line of code - >>> import sparknlp >>> txtDf = sparknlp.read().txt("home/user/txt/files") >>> txtDf.show(truncate=False) +-----------------------------------------------+ |txt | +-----------------------------------------------+ |[{Title, BIG DATA ANALYTICS, {paragraph -> 0}}]| +-----------------------------------------------+ 
 - xml(docPath)[source]#
- Reads XML files and returns a Spark DataFrame. - Parameters:
- docPathstr
- Path to an XML file or a directory containing XML files. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed XML content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> xml_df = SparkNLPReader(spark).xml("home/user/xml-directory") - You can use SparkNLP for one line of code - >>> import sparknlp >>> xml_df = sparknlp.read().xml("home/user/xml-directory") >>> xml_df.show(truncate=False) +-----------------------------------------------------------+ |xml | +-----------------------------------------------------------+ |[{Title, John Smith, {elementId -> ..., tag -> title}}] | +-----------------------------------------------------------+ - >>> xml_df.printSchema() root |-- path: string (nullable = true) |-- xml: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) 
 - md(filePath)[source]#
- Reads Markdown files and returns a Spark DataFrame. - Parameters:
- filePathstr
- Path to a Markdown file or a directory containing Markdown files. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed Markdown content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> md_df = SparkNLPReader(spark).md("home/user/markdown-directory") - You can use SparkNLP for one line of code - >>> import sparknlp >>> md_df = sparknlp.read().md("home/user/markdown-directory") >>> md_df.show(truncate=False) +-----------------------------------------------------------+ |md | +-----------------------------------------------------------+ |[{Title, Sample Markdown Document, {elementId -> ..., tag -> title}}]| +-----------------------------------------------------------+ - >>> md_df.printSchema() root |-- path: string (nullable = true) |-- md: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true) 
 - csv(csvPath)[source]#
- Reads CSV files and returns a Spark DataFrame. - Parameters:
- docPathstr
- Path to an CSV file or a directory containing CSV files. 
 
- Returns:
- pyspark.sql.DataFrame
- A DataFrame containing parsed CSV content. 
 
 - Examples - >>> from sparknlp.reader import SparkNLPReader >>> csv_df = SparkNLPReader(spark).csv("home/user/csv-directory") - You can use SparkNLP for one line of code - >>> import sparknlp >>> csv_df = sparknlp.read().csv("home/user/csv-directory") >>> csv_df.show(truncate=False) +-----------------------------------------------------------------------------------------------------------------------------------------+ |csv | +-----------------------------------------------------------------------------------------------------------------------------------------+ |[{NarrativeText, Alice 100 Bob 95, {}}, {Table, <table><tr><td>Alice</td><td>100</td></tr><tr><td>Bob</td><td>95</td></tr></table>, {}}] | +-----------------------------------------------------------------------------------------------------------------------------------------+ - >>> csv_df.printSchema() root |-- path: string (nullable = true) |-- csv: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- elementType: string (nullable = true) | | |-- content: string (nullable = true) | | |-- metadata: map (nullable = true) | | | |-- key: string | | | |-- value: string (valueContainsNull = true)