`sparknlp.training.pos`

Contains helper classes for part-of-speech tagging.

Module Contents

Classes

POS

Helper class for creating DataFrames for training a part-of-speech tagger.

class POS

Helper class for creating DataFrames for training a part-of-speech tagger.

The dataset must contain one sentence per line, with each word joined to its tag by a delimiter.

Input File Format:

A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
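The format above can be produced and checked with a few lines of plain Python before handing the file to Spark NLP. The following is a minimal sketch, not part of the library: the helper names `format_sentence` and `parse_line` are hypothetical, and it assumes that words and tags contain no whitespace and that tags do not contain the delimiter.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def format_sentence(pairs: TaggedSentence, delimiter: str = "|") -> str:
    """Join (word, tag) pairs into one line of the training file."""
    return " ".join(f"{word}{delimiter}{tag}" for word, tag in pairs)


def parse_line(line: str, delimiter: str = "|") -> TaggedSentence:
    """Split one line of the training file back into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last delimiter so the word itself may contain it.
        word, tag = token.rsplit(delimiter, 1)
        pairs.append((word, tag))
    return pairs


line = "A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN"
pairs = parse_line(line)
print(pairs[:3])  # [('A', 'DT'), ('few', 'JJ'), ('months', 'NNS')]
```

Round-tripping a line through `parse_line` and `format_sentence` is a cheap way to catch malformed tokens before training.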

The sentence can then be parsed with readDataset() into a column with annotations of type POS.

Can be used to train a PerceptronApproach.
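As a rough sketch of that training step, the function below reads a training file and fits a PerceptronApproach on the result. The function name `train_pos_tagger` and its `iterations` parameter are hypothetical, and the code assumes a Spark session already started with Spark NLP and the default output column names of readDataset().

```python
def train_pos_tagger(spark, path, delimiter="|", iterations=5):
    """Fit a PerceptronApproach on a POS training file (illustrative sketch)."""
    # Imported lazily so this function can be defined without Spark NLP installed.
    from sparknlp.annotator import PerceptronApproach
    from sparknlp.training import POS

    # With the defaults, readDataset produces "text", "document" and "tags" columns.
    training_df = POS().readDataset(
        spark, path, delimiter=delimiter, outputPosCol="tags"
    )

    return (
        PerceptronApproach()
        .setInputCols(["document", "token"])
        .setOutputCol("pos")
        .setPosColumn("tags")  # must match outputPosCol above
        .setNIterations(iterations)
        .fit(training_df)
    )


# Example call, assuming `spark` is an active Spark NLP session:
# model = train_pos_tagger(
#     spark, "src/test/resources/anc-pos-corpus-small/test-training.txt"
# )
```

The fitted model can then be placed in a pipeline after a document assembler and tokenizer to tag new text.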

Examples

In this example, the file test-training.txt contains tagged sentences in the format shown above.

>>> from sparknlp.training import POS
>>> pos = POS()
>>> path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
>>> posDf = pos.readDataset(spark, path, "|", "tags")
>>> posDf.selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 16, 17, CD, [word -> 61], []]          |
|[pos, 19, 23, NNS, [word -> years], []]      |
|[pos, 25, 27, JJ, [word -> old], []]         |
|[pos, 29, 29, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
|[pos, 51, 52, IN, [word -> as], []]          |
|[pos, 47, 47, DT, [word -> a], []]           |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []]    |
|[pos, 78, 81, NNP, [word -> Nov.], []]       |
|[pos, 83, 84, CD, [word -> 29], []]          |
|[pos, 81, 81, ., [word -> .], []]            |
+---------------------------------------------+

readDataset(spark, path, delimiter='|', outputPosCol='tags', outputDocumentCol='document', outputTextCol='text')

Reads the dataset from an external resource.

Parameters:

spark (pyspark.sql.SparkSession): Initiated Spark Session with Spark NLP
path (str): Path to the resource
delimiter (str, optional): Delimiter between word and POS tag, by default "|"
outputPosCol (str, optional): Name of the output POS column, by default "tags"
outputDocumentCol (str, optional): Name of the output document column, by default "document"
outputTextCol (str, optional): Name of the output text column, by default "text"

Returns:

pyspark.sql.DataFrame: Spark DataFrame containing the parsed data
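The optional parameters matter when a corpus does not follow the defaults. The sketch below is illustrative only: it assumes a file whose tokens look like `letter_NN`, the function name `read_underscore_corpus` and the column names `pos_tags`, `doc` and `sentence_text` are hypothetical, and the final projection assumes the standard Spark NLP annotation fields `result` and `metadata`.

```python
def read_underscore_corpus(spark, path):
    """Read a word_TAG corpus and flatten it to (word, pos) rows (sketch)."""
    # Imported lazily so this function can be defined without Spark NLP installed.
    from sparknlp.training import POS

    df = POS().readDataset(
        spark,
        path,
        delimiter="_",                   # tokens look like "letter_NN"
        outputPosCol="pos_tags",         # annotation column with the tags
        outputDocumentCol="doc",         # document annotation column
        outputTextCol="sentence_text",   # plain sentence text column
    )

    # One row per token: the word is kept in the annotation metadata,
    # and the tag is the annotation result.
    return df.selectExpr("explode(pos_tags) as tag").selectExpr(
        "tag.metadata['word'] as word", "tag.result as pos"
    )


# Example call, assuming `spark` is an active Spark NLP session and
# "corpus.txt" is a hypothetical file in the word_TAG format:
# read_underscore_corpus(spark, "corpus.txt").show(truncate=False)
```

If the renamed tag column is later used for training, the same name has to be passed to the tagger's POS column setting.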
