sparknlp.reader.sparknlp_reader
Module Contents
Classes
SparkNLPReader | Instantiates class to read HTML, email, and document files.
- class SparkNLPReader(spark, params=None)
Reader class for HTML, email, and document files.
For HTML input, two types of source are supported:
- htmlPath: A path to a directory of HTML files or to a single HTML file (e.g., “path/html/files”).
- url: A single URL or a set of URLs (e.g., “https://www.wikipedia.org”).
- Parameters:
- spark : SparkSession
The active Spark session.
- params : dict, optional
A dictionary with custom configurations.
- html(htmlPath)
Reads HTML files or URLs and returns a Spark DataFrame.
- Parameters:
- htmlPath : str or list of str
A path to an HTML file or a directory of HTML files, or one or more URLs.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing the parsed HTML content.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> html_df = SparkNLPReader(spark).html("https://www.wikipedia.org")
You can also use the sparknlp.read() shortcut, which avoids constructing the reader directly:
>>> import sparknlp
>>> html_df = sparknlp.read().html("https://www.wikipedia.org")
>>> html_df.show(truncate=False)
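Because htmlPath accepts either local paths or URLs, it can be useful to separate the two kinds of input before calling html(). The following is a minimal sketch using only the Python standard library. The helper name split_html_sources is hypothetical and not part of Spark NLP, and treating only http and https schemes as URLs is an assumption of this example.

```python
from urllib.parse import urlparse


def split_html_sources(sources):
    """Split inputs into (urls, local_paths).

    Hypothetical helper for illustration; not part of Spark NLP.
    Only http/https schemes are treated as URLs (an assumption).
    """
    # Accept a single string as well as a list of strings,
    # mirroring the "str or list of str" type of htmlPath.
    if isinstance(sources, str):
        sources = [sources]

    urls, paths = [], []
    for src in sources:
        scheme = urlparse(src).scheme.lower()
        if scheme in ("http", "https"):
            urls.append(src)
        else:
            paths.append(src)
    return urls, paths


urls, paths = split_html_sources(
    ["https://www.wikipedia.org", "path/html/files"]
)
```

Each resulting group could then be passed to html() in its own call. Whether a single call accepts a list that mixes URLs and local paths is not stated here, so keeping them separate is the cautious choice.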
- email(filePath)
Reads email files and returns a Spark DataFrame.
- Parameters:
- filePath : str
Path to an email file or a directory containing emails.
- Returns:
- pyspark.sql.DataFrame
A DataFrame containing parsed email data.
Examples
>>> from sparknlp.reader import SparkNLPReader
>>> email_df = SparkNLPReader(spark).email("home/user/emails-directory")
Using the sparknlp.read() shortcut:
>>> import sparknlp
>>> email_df = sparknlp.read().email("home/user/emails-directory")
>>> email_df.show(truncate=False)
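To try email() without real mail data, a small test directory can be generated with Python's standard email package. This is a minimal sketch: the helper write_sample_email, the addresses, and the file name sample.eml are all made up for illustration, and it assumes the reader accepts standard `.eml` files, which this page does not state explicitly.

```python
import tempfile
from email.message import EmailMessage
from pathlib import Path


def write_sample_email(directory):
    """Write one small .eml file into `directory` and return its path.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Test message"
    msg.set_content("Hello from a sample email.")

    path = Path(directory) / "sample.eml"
    path.write_bytes(msg.as_bytes())
    return path


# Create a temporary directory holding a single sample email.
emails_dir = tempfile.mkdtemp(prefix="emails-")
sample_path = write_sample_email(emails_dir)

# With an active SparkSession named `spark`, the directory could then
# be passed to the reader as documented above:
# email_df = SparkNLPReader(spark).email(emails_dir)
```

The reader call is left as a comment so that the snippet runs without a Spark installation.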