sparknlp.training.pos

Contains helper classes for part-of-speech tagging.

Module Contents

Classes

POS

Helper class for creating DataFrames for training a part-of-speech tagger.

class POS

Helper class for creating DataFrames for training a part-of-speech tagger.

The dataset must contain one sentence per line, with each word joined to its tag by a delimiter.

Input File Format:

A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
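To make the format concrete, here is a minimal pure-Python sketch that splits one such line into (word, tag) pairs. The helper `parse_tagged_line` is hypothetical and not part of Spark NLP; `readDataset()` does its own parsing, which may differ in detail.

```python
def parse_tagged_line(line, delimiter="|"):
    """Split one training line into (word, tag) pairs.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    pairs = []
    for token in line.split():
        # Split on the last delimiter so that a word which itself
        # contains the delimiter keeps everything before the final tag.
        word, tag = token.rsplit(delimiter, 1)
        pairs.append((word, tag))
    return pairs


line = "A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN"
pairs = parse_tagged_line(line)
print(pairs[:3])  # [('A', 'DT'), ('few', 'JJ'), ('months', 'NNS')]
```

Splitting on the last delimiter is a design choice of this sketch, made so that tokens whose word part contains the delimiter still yield a single tag.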

The sentence can then be parsed with readDataset() into a column with annotations of type POS.

Can be used to train a PerceptronApproach.
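As a rough sketch of that workflow, the function below reads a tagged corpus and fits a PerceptronApproach on it. The wrapper `train_pos_tagger` is a hypothetical name introduced here, the iteration count is only illustrative, and the sketch assumes a Spark session that was started with Spark NLP available.

```python
def train_pos_tagger(spark, path, delimiter="|"):
    """Train a POS tagger from a tagged corpus (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without Spark NLP installed.
    from sparknlp.annotator import PerceptronApproach
    from sparknlp.training import POS

    # With the default column names this yields "text", "document" and "tags".
    training_df = POS().readDataset(
        spark, path, delimiter=delimiter, outputPosCol="tags"
    )

    tagger = (
        PerceptronApproach()
        .setInputCols(["document", "token"])
        .setOutputCol("pos")
        .setPosColumn("tags")  # must match outputPosCol above
        .setNIterations(5)     # illustrative value, not a recommendation
    )
    return tagger.fit(training_df)
```

The fitted model can then be placed in a pipeline after a document assembler and a tokenizer so that the "document" and "token" input columns exist at prediction time.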

Examples

In this example, the file test-training.txt contains sentences in the format shown above.

>>> from sparknlp.training import POS
>>> pos = POS()
>>> path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
>>> posDf = pos.readDataset(spark, path, "|", "tags")
>>> posDf.selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 16, 17, CD, [word -> 61], []]          |
|[pos, 19, 23, NNS, [word -> years], []]      |
|[pos, 25, 27, JJ, [word -> old], []]         |
|[pos, 29, 29, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
|[pos, 51, 52, IN, [word -> as], []]          |
|[pos, 47, 47, DT, [word -> a], []]           |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []]    |
|[pos, 78, 81, NNP, [word -> Nov.], []]       |
|[pos, 83, 84, CD, [word -> 29], []]          |
|[pos, 81, 81, ., [word -> .], []]            |
+---------------------------------------------+
readDataset(spark, path, delimiter='|', outputPosCol='tags', outputDocumentCol='document', outputTextCol='text')

Reads the dataset from an external resource.

Parameters:
spark : pyspark.sql.SparkSession

Initialized Spark session with Spark NLP

path : str

Path to the resource

delimiter : str, optional

Delimiter between each word and its POS tag, by default “|”

outputPosCol : str, optional

Name of the output POS column, by default “tags”

outputDocumentCol : str, optional

Name of the output document column, by default “document”

outputTextCol : str, optional

Name of the output text column, by default “text”

Returns:
pyspark.sql.DataFrame

Spark DataFrame with the data
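The sketch below shows one way the optional parameters might be used together: it writes a small corpus with a non-default delimiter and then reads it back with every parameter passed by keyword. Both `write_pos_corpus` and `read_pos_corpus` are hypothetical helpers, and the underscore delimiter and the column name "pos_tags" are assumptions chosen only for illustration.

```python
from pathlib import Path


def write_pos_corpus(sentences, path, delimiter="|"):
    """Write tagged sentences to `path`, one sentence per line.

    Hypothetical helper for illustration; not part of Spark NLP.
    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    """
    lines = [
        " ".join(f"{word}{delimiter}{tag}" for word, tag in sentence)
        for sentence in sentences
    ]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_pos_corpus(spark, path, delimiter="_"):
    """Call readDataset with every documented parameter spelled out."""
    # Imported lazily so the writer above works without Spark NLP installed.
    from sparknlp.training import POS

    return POS().readDataset(
        spark,
        path,
        delimiter=delimiter,
        outputPosCol="pos_tags",  # non-default name, chosen for illustration
        outputDocumentCol="document",
        outputTextCol="text",
    )
```

The delimiter passed to the reader has to match the one used when the file was written, and any later annotator that consumes the POS column needs the same column name that was passed as `outputPosCol`.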