sparknlp.annotator.pos.perceptron#
Contains classes for the Perceptron Annotator.
Module Contents#
Classes#
PerceptronApproach
    Trains an averaged Perceptron model to tag words part-of-speech. Sets a POS tag to each word within a sentence.
PerceptronModel
    Averaged Perceptron model to tag words part-of-speech. Sets a POS tag to each word within a sentence.
- class PerceptronApproach[source]#
Trains an averaged Perceptron model to tag words part-of-speech. Sets a POS tag to each word within a sentence.
For pretrained models please see the PerceptronModel.

The training data needs to be in a Spark DataFrame, where the column needs to consist of Annotations of type POS. The Annotation needs to have the member result set to the POS tag and a "word" mapping to its word inside of the member metadata. This DataFrame for training can easily be created by the helper class POS.

>>> POS().readDataset(spark, datasetPath) \
...     .selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
...
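The corpus read by the helper is plain text with one tagged sentence per line, each token joined to its tag by a delimiter (the ANC sample corpus uses "|", e.g. Pierre|NNP Vinken|NNP). Below is a minimal sketch of the call with the keyword arguments spelled out; the keyword names and defaults shown are assumptions based on the helper's usual signature, so verify them against the POS documentation:

>>> from sparknlp.training import POS
>>> trainingData = POS().readDataset(
...     spark,
...     "src/test/resources/anc-pos-corpus-small/test-training.txt",
...     delimiter="|",                 # assumed default delimiter between word and tag
...     outputPosCol="tags",           # column that will hold the POS annotations
...     outputDocumentCol="document",
...     outputTextCol="text"
... )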
For extended examples of usage, see the Examples.
Input Annotation types: TOKEN, DOCUMENT
Output Annotation type: POS
- Parameters:
- posCol
Column name for Array of POS tags that match tokens
- nIterations
Number of iterations in training; more iterations converge to better accuracy. By default 5. See the sketch below for setting it.
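Both parameters are plain annotator settings configured before fitting. A minimal sketch follows; the setter name setIterations is an assumption mirroring the nIterations parameter, while setPosColumn is documented below:

>>> trainer = PerceptronApproach() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos") \
...     .setPosColumn("tags") \
...     .setIterations(3)  # assumed setter for nIterations; fewer iterations train faster at some accuracy cost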
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
>>> trainingPerceptronDF = POS().readDataset(spark, datasetPath)
>>> trainedPos = PerceptronApproach() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos") \
...     .setPosColumn("tags") \
...     .fit(trainingPerceptronDF)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     trainedPos
... ])
>>> data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
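The fitted tagger is a PerceptronModel and supports standard Spark ML persistence, so it can be saved once and reloaded without retraining. A minimal sketch, with an illustrative path:

>>> trainedPos.write().overwrite().save("/tmp/trained_pos_model")  # path is illustrative
>>> loadedPos = PerceptronModel.load("/tmp/trained_pos_model")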
- setPosColumn(value)[source]#
Sets column name for Array of POS tags that match tokens.
- Parameters:
- value : str
Name of column for Array of POS tags
- class PerceptronModel(classname='com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel', java_model=None)[source]#
Averaged Perceptron model to tag words part-of-speech. Sets a POS tag to each word within a sentence.
This is the instantiated model of the PerceptronApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos")
The default model is "pos_anc", if no name is provided.

For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Examples.
Input Annotation types: TOKEN, DOCUMENT
Output Annotation type: POS
- Parameters:
- None
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     posTagger
... ])
>>> data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(pos) as pos").show(truncate=False)
+-------------------------------------------+
|pos                                        |
+-------------------------------------------+
|[pos, 0, 4, NNP, [word -> Peter], []]      |
|[pos, 6, 11, NNP, [word -> Pipers], []]    |
|[pos, 13, 21, NNS, [word -> employees], []]|
|[pos, 23, 25, VBP, [word -> are], []]      |
|[pos, 27, 33, VBG, [word -> picking], []]  |
|[pos, 35, 39, NNS, [word -> pecks], []]    |
|[pos, 41, 42, IN, [word -> of], []]        |
|[pos, 44, 50, JJ, [word -> pickled], []]   |
|[pos, 52, 58, NNS, [word -> peppers], []]  |
+-------------------------------------------+
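To read each token next to its tag rather than as two parallel arrays, the result arrays can be zipped with plain Spark SQL functions; a minimal sketch on the result from above:

>>> result.selectExpr("token.result as tokens", "pos.result as tags") \
...     .selectExpr("explode(arrays_zip(tokens, tags)) as pair") \
...     .selectExpr("pair.tokens as token", "pair.tags as tag") \
...     .show(truncate=False)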
- static pretrained(name='pos_anc', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default “pos_anc”
- lang : str, optional
Language of the pretrained model, by default “en”
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- PerceptronModel
The restored model
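A model other than the default can be requested by passing a name and language explicitly. The model name below is illustrative only; look up actual names on the Models Hub:

>>> # "pos_ud_gsd" is an illustrative name; check the Models Hub for available models
>>> frenchPos = PerceptronModel.pretrained("pos_ud_gsd", lang="fr") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("pos")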