sparknlp.annotator.ner.ner_crf#
Contains classes for NerCrf.
Module Contents#
Classes#
NerCrfApproach - Algorithm for training a Named Entity Recognition Model.
NerCrfModel - Extracts Named Entities based on a CRF Model.
- class NerCrfApproach[source]#
Algorithm for training a Named Entity Recognition Model.
For instantiated/pretrained models, see NerCrfModel.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.
Excluding the label, these columns can be produced with, for example, a Tokenizer, a PerceptronModel and a WordEmbeddingsModel.
Optionally the user can provide an entity dictionary file with setExternalFeatures() for better accuracy.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
Output Annotation type: NAMED_ENTITY
- Parameters:
- labelColumn
Column with label per each token
- entities
Entities to recognize
- minEpochs
Minimum number of epochs to train, by default 0
- maxEpochs
Maximum number of epochs to train, by default 1000
- verbose
Level of verbosity during training, by default 4
- randomSeed
Random seed
- l2
L2 regularization coefficient, by default 1.0
- c0
c0 value defining decay speed for gradient, by default 2250000
- lossEps
If the relative improvement per epoch is less than this epsilon, training is stopped, by default 0.001
- minW
Features with weights less than this value will be filtered
- includeConfidence
Whether to include confidence scores in annotation metadata, by default False
- externalFeatures
Additional dictionary paths to use as features
See also
NerDLApproach
for a deep learning based approach
NerConverter
to further process the results
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline
This CoNLL dataset already includes a sentence, token, POS tags and label column with their respective annotator types. If a custom dataset is used, these need to be defined with for example:
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("pos")
Then training can start:
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings") \
...     .setCaseSensitive(False)
>>> nerTagger = NerCrfApproach() \
...     .setInputCols(["sentence", "token", "pos", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setMinEpochs(1) \
...     .setMaxEpochs(3) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     embeddings,
...     nerTagger
... ])
We use the sentences, tokens, POS tags and labels from the CoNLL dataset.
>>> conll = CoNLL()
>>> trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> pipelineModel = pipeline.fit(trainingData)
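Once fitted, the PipelineModel can be applied to annotated data with transform(). A minimal sketch, assuming the CoNLL training data (which already carries the sentence, token and pos columns the tagger needs) is reused for prediction:
>>> predictions = pipelineModel.transform(trainingData)
>>> predictions.select("ner.result").show(1, truncate=80)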
- setL2(l2value)[source]#
Sets L2 regularization coefficient, by default 1.0.
- Parameters:
- l2valuefloat
L2 regularization coefficient
- setC0(c0value)[source]#
Sets the c0 value defining decay speed for gradient, by default 2250000.
- Parameters:
- c0valueint
c0 value defining decay speed for gradient
- setLossEps(eps)[source]#
Sets the loss epsilon: if the relative improvement per epoch is less than this value, training is stopped. By default 0.001.
- Parameters:
- epsfloat
The threshold
- setMinW(w)[source]#
Sets minimum weight value.
Features with weights less than this value will be filtered.
- Parameters:
- wfloat
Minimum weight value
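The CRF hyperparameter setters above can be chained on the approach. A minimal sketch (the L2, c0 and loss epsilon values shown are just the documented defaults; the minW value is an arbitrary illustration, not a recommendation):
>>> nerTagger = NerCrfApproach() \
...     .setInputCols(["sentence", "token", "pos", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setL2(1.0) \
...     .setC0(2250000) \
...     .setLossEps(0.001) \
...     .setMinW(0.0)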
- setExternalFeatures(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets additional dictionary paths to use as features.
- Parameters:
- pathstr
Path to the source files
- delimiterstr
Delimiter for the dictionary file. Can also be set in options.
- read_asstr, optional
How to read the file, by default ReadAs.TEXT
- optionsdict, optional
Options to read the resource, by default {“format”: “text”}
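For example, a plain-text entity dictionary could be supplied as follows (the file path and delimiter below are illustrative assumptions, not a shipped resource):
>>> nerTagger = NerCrfApproach() \
...     .setInputCols(["sentence", "token", "pos", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setExternalFeatures("external_features.txt", ",")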
- class NerCrfModel(classname='com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel', java_model=None)[source]#
Extracts Named Entities based on a CRF Model.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS. These can be extracted with, for example, a SentenceDetector, a Tokenizer and a PerceptronModel.
This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained() of the companion object:
>>> nerTagger = NerCrfModel.pretrained() \
...     .setInputCols(["sentence", "token", "word_embeddings", "pos"]) \
...     .setOutputCol("ner")
The default model is "ner_crf", if no name is provided. For available pretrained models please see the Models Hub.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
Output Annotation type: NAMED_ENTITY
- Parameters:
- includeConfidence
Whether to include confidence scores in annotation metadata, by default False
See also
NerDLModel
for a deep learning based approach
NerConverter
to further process the results
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
First extract the prerequisites for the NerCrfModel
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("word_embeddings")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("pos")
Then NER can be extracted
>>> nerTagger = NerCrfModel.pretrained() \
...     .setInputCols(["sentence", "token", "word_embeddings", "pos"]) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     posTagger,
...     nerTagger
... ])
>>> data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+
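As noted under See also, the IOB tags can be post-processed into entity chunks with NerConverter. A minimal sketch, assuming the stages defined above (the "entities" output column name is just a choice for this example):
>>> converter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("entities")
>>> pipelineWithConverter = Pipeline().setStages([
...     documentAssembler, sentence, tokenizer, embeddings, posTagger, nerTagger, converter
... ])
>>> pipelineWithConverter.fit(data).transform(data).select("entities.result").show(truncate=False)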
- setIncludeConfidence(b)[source]#
Sets whether to include confidence scores in annotation metadata, by default False.
- Parameters:
- bbool
Whether to include the confidence value in the output.
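A minimal sketch of enabling confidence scores on a pretrained model; the scores are then written into each annotation's metadata:
>>> nerTagger = NerCrfModel.pretrained() \
...     .setInputCols(["sentence", "token", "word_embeddings", "pos"]) \
...     .setOutputCol("ner") \
...     .setIncludeConfidence(True)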
- static pretrained(name='ner_crf', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- namestr, optional
Name of the pretrained model, by default “ner_crf”
- langstr, optional
Language of the pretrained model, by default “en”
- remote_locstr, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- NerCrfModel
The restored model
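For example, the default English model can be requested explicitly by name and language (downloaded from Spark NLP's repositories unless remote_loc is given):
>>> nerModel = NerCrfModel.pretrained("ner_crf", "en") \
...     .setInputCols(["sentence", "token", "word_embeddings", "pos"]) \
...     .setOutputCol("ner")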