sparknlp.training.pos
Contains helper classes for part-of-speech tagging.
Module Contents
Classes
POS | Helper class for creating DataFrames for training a part-of-speech tagger.
- class POS
Helper class for creating DataFrames for training a part-of-speech tagger.
The dataset must contain one sentence per line, with each word joined to its respective tag by a delimiter.
Input File Format:
A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
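To make the input format concrete, here is a minimal pure-Python sketch of how one line in this format breaks down into word/tag pairs. It only illustrates the file layout and is not how readDataset() is implemented; the helper name parse_tagged_line is hypothetical, and splitting on the last occurrence of the delimiter is an assumption made for this sketch.

```python
from typing import List, Tuple


def parse_tagged_line(line: str, delimiter: str = "|") -> List[Tuple[str, str]]:
    """Split one training line into (word, tag) pairs.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    pairs = []
    for item in line.split():
        # Assumption: the tag follows the last delimiter in each item,
        # so a word that itself contains the delimiter stays intact.
        word, sep, tag = item.rpartition(delimiter)
        if not sep or not word or not tag:
            raise ValueError(f"Malformed item (expected word{delimiter}tag): {item!r}")
        pairs.append((word, tag))
    return pairs


sentence = "A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN"
pairs = parse_tagged_line(sentence)
print(pairs[:3])  # [('A', 'DT'), ('few', 'JJ'), ('months', 'NNS')]
```

Running a check like this over a corpus before loading it can surface lines where a word is missing its tag.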
The sentence can then be parsed with readDataset() into a column with annotations of type POS.
The resulting DataFrame can be used to train a PerceptronApproach.
Examples
In this example, the file test-training.txt has the content of the sentence above.
>>> from sparknlp.training import POS
>>> pos = POS()
>>> path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
>>> posDf = pos.readDataset(spark, path, "|", "tags")
>>> posDf.selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 16, 17, CD, [word -> 61], []]          |
|[pos, 19, 23, NNS, [word -> years], []]      |
|[pos, 25, 27, JJ, [word -> old], []]         |
|[pos, 29, 29, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
|[pos, 51, 52, IN, [word -> as], []]          |
|[pos, 47, 47, DT, [word -> a], []]           |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []]    |
|[pos, 78, 81, NNP, [word -> Nov.], []]       |
|[pos, 83, 84, CD, [word -> 29], []]          |
|[pos, 81, 81, ., [word -> .], []]            |
+---------------------------------------------+
- readDataset(spark, path, delimiter='|', outputPosCol='tags', outputDocumentCol='document', outputTextCol='text')
Reads the dataset from an external resource.
- Parameters:
  - spark : pyspark.sql.SparkSession
    Initiated Spark Session with Spark NLP
  - path : str
    Path to the resource
  - delimiter : str, optional
    Delimiter of word and POS, by default “|”
  - outputPosCol : str, optional
    Name of the output POS column, by default “tags”
  - outputDocumentCol : str, optional
    Name of the output document column, by default “document”
  - outputTextCol : str, optional
    Name of the output text column, by default “text”
- Returns:
pyspark.sql.DataFrame
Spark DataFrame with the data
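When a corpus uses a separator other than the default, the delimiter argument has to match the file. The sketch below writes a small training file with "_" as the separator using only the standard library; the helper write_pos_file and the output file name are hypothetical, and the sample words and tags are taken from the input format example above. The commented-out call at the end shows how such a file might then be passed to readDataset(), assuming an active Spark NLP session named spark.

```python
import os
import tempfile
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def write_pos_file(sentences: Iterable[TaggedSentence], path: str, delimiter: str = "|") -> None:
    """Write one sentence per line, each word joined to its tag by `delimiter`.

    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(" ".join(f"{word}{delimiter}{tag}" for word, tag in sentence))
            handle.write("\n")


# Sample data reusing part of the example sentence from the input format.
sample = [[("you", "PRP"), ("received", "VBD"), ("a", "DT"), ("letter", "NN")]]

out_path = os.path.join(tempfile.mkdtemp(), "custom-training.txt")
write_pos_file(sample, out_path, delimiter="_")

with open(out_path, encoding="utf-8") as handle:
    written = handle.read()
print(written, end="")  # you_PRP received_VBD a_DT letter_NN

# With a Spark NLP session available as `spark`, the file could then be read with:
# pos_df = POS().readDataset(spark, out_path, delimiter="_", outputPosCol="tags")
```

Choosing a delimiter that never occurs inside the words of the corpus avoids ambiguity about where a word ends and its tag begins.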