sparknlp.reader.sparknlp_reader

Module Contents

Classes

SparkNLPReader

Reads HTML, email, and document files into Spark DataFrames.

class SparkNLPReader(spark, params=None)

Reads HTML, email, and document files into Spark DataFrames.

For HTML input, two kinds of source are supported:

  • htmlPath: A path to a directory of HTML files or a single HTML file (e.g., “path/html/files”).

  • url: A single URL or a set of URLs (e.g., “https://www.wikipedia.org”).

Parameters:
spark : SparkSession

The active Spark session.

params : dict, optional

A dictionary with custom configurations.
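The two kinds of HTML source described above (a local htmlPath and a url) can be told apart before they are handed to the reader. The following is a minimal sketch using only the Python standard library; `classify_html_input` is a hypothetical helper written for illustration and is not part of Spark NLP.

```python
from urllib.parse import urlparse


def classify_html_input(source: str) -> str:
    """Return "url" for http(s) addresses and "path" for anything else.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    scheme = urlparse(source).scheme.lower()
    return "url" if scheme in ("http", "https") else "path"


# The two sample inputs are the ones used in the class description.
for candidate in ["https://www.wikipedia.org", "path/html/files"]:
    print(candidate, "->", classify_html_input(candidate))
```

A check like this is only a convenience for validating input early; the reader itself receives the same string either way.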

html(htmlPath)

Reads HTML files or URLs and returns a Spark DataFrame.

Parameters:
htmlPath : str or list of str

Path to an HTML file or a directory of HTML files, a single URL, or a list of URLs.

Returns:
pyspark.sql.DataFrame

A DataFrame containing the parsed HTML content.

Examples

>>> from sparknlp.reader import SparkNLPReader
>>> html_df = SparkNLPReader(spark).html("https://www.wikipedia.org")

You can also use SparkNLP to simplify the process:

>>> import sparknlp
>>> html_df = sparknlp.read().html("https://www.wikipedia.org")
>>> html_df.show(truncate=False)
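The examples above read from a URL. Reading from a local directory might look like the sketch below, which first writes a small HTML file to a temporary directory so there is something to read. The helper names `write_sample_html` and `read_html_directory` and the sample page are assumptions made for illustration; only `sparknlp.start()` and `SparkNLPReader(spark).html(...)` come from the library, and the Spark call is kept inside a function so the rest runs without Spark installed.

```python
import tempfile
from pathlib import Path


def write_sample_html(directory: str) -> Path:
    """Write one small HTML file into `directory` and return its path."""
    page = Path(directory) / "sample.html"
    page.write_text(
        "<html><head><title>Sample</title></head>"
        "<body><h1>Heading</h1><p>Some text.</p></body></html>",
        encoding="utf-8",
    )
    return page


def read_html_directory(directory: str):
    """Read the HTML files in `directory` into a Spark DataFrame.

    Requires Spark NLP; imported lazily so the helper above works without it.
    """
    import sparknlp
    from sparknlp.reader import SparkNLPReader

    spark = sparknlp.start()  # starts (or reuses) a Spark session
    return SparkNLPReader(spark).html(directory)


html_dir = tempfile.mkdtemp(prefix="html_files_")
sample_page = write_sample_html(html_dir)
print(sample_page)

# With Spark NLP installed, the directory can then be read with:
# read_html_directory(html_dir).show(truncate=False)
```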

email(filePath)

Reads email files and returns a Spark DataFrame.

Parameters:
filePath : str

Path to an email file or a directory containing emails.

Returns:
pyspark.sql.DataFrame

A DataFrame containing parsed email data.

Examples

>>> from sparknlp.reader import SparkNLPReader
>>> email_df = SparkNLPReader(spark).email("home/user/emails-directory")

Using SparkNLP:

>>> import sparknlp
>>> email_df = sparknlp.read().email("home/user/emails-directory")
>>> email_df.show(truncate=False)
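To try the email reader without real mail on disk, a test message can be generated with the standard library first. This is a sketch under assumptions: the helper names, the addresses, and the `.eml` file name are made up for illustration, and it assumes the reader accepts standard `.eml` files. The Spark call is wrapped in a function so the file-writing part runs on its own.

```python
import tempfile
from email import message_from_bytes
from email.message import EmailMessage
from pathlib import Path


def write_sample_email(directory: str) -> Path:
    """Write a minimal plain-text message as an .eml file and return its path."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg["Subject"] = "Test message"
    msg.set_content("Hello from a sample email.")
    eml_path = Path(directory) / "sample.eml"
    eml_path.write_bytes(msg.as_bytes())
    return eml_path


def read_email_directory(directory: str):
    """Read the emails in `directory` into a Spark DataFrame (needs Spark NLP)."""
    import sparknlp
    from sparknlp.reader import SparkNLPReader

    spark = sparknlp.start()  # starts (or reuses) a Spark session
    return SparkNLPReader(spark).email(directory)


email_dir = tempfile.mkdtemp(prefix="emails_")
sample_email = write_sample_email(email_dir)

# Parse the file back to confirm it is a well-formed message.
parsed = message_from_bytes(sample_email.read_bytes())
print(parsed["Subject"])

# With Spark NLP installed:
# read_email_directory(email_dir).show(truncate=False)
```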

doc(docPath)

Reads document files and returns a Spark DataFrame.

Parameters:
docPath : str

Path to a document file.

Returns:
pyspark.sql.DataFrame

A DataFrame containing parsed document content.