package crf
Type Members
- case class DictionaryFeatures(dict: Map[String, String]) extends Product with Serializable
- case class FeatureGenerator(dictFeatures: DictionaryFeatures) extends Product with Serializable
Generates features for CrfBasedNer
- class NerCrfApproach extends AnnotatorApproach[NerCrfModel] with NerApproach[NerCrfApproach]
Algorithm for training a Named Entity Recognition Model
For instantiated/pretrained models, see NerCrfModel.
This Named Entity Recognition annotator allows a generic model to be trained with a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB, with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example,
- a SentenceDetector,
- a Tokenizer,
- a PerceptronModel and
- a WordEmbeddingsModel.
Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
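To make the idea of an entity dictionary concrete, here is a minimal sketch in plain Scala that turns dictionary lines into the Map[String, String] shape that DictionaryFeatures wraps. The object DictionarySketch, its parse method, and the assumed line layout (token, delimiter, feature name) are hypothetical and only illustrative; the file format that setExternalFeatures actually expects is not specified on this page.

```scala
// Hypothetical helper, not part of Spark NLP.
object DictionarySketch {
  // Assumed line layout: "<token><delimiter><feature>", e.g. "Baghdad,location".
  def parse(lines: Seq[String], delimiter: String): Map[String, String] =
    lines
      .map(_.trim)
      .filter(_.nonEmpty)
      .flatMap { line =>
        val idx = line.indexOf(delimiter)
        // Skip lines without a delimiter or with an empty token or feature
        if (idx <= 0 || idx + delimiter.length >= line.length) None
        else Some(line.substring(0, idx).trim -> line.substring(idx + delimiter.length).trim)
      }
      .toMap
}
```

Under these assumptions, the resulting map could be passed to the DictionaryFeatures case class constructor listed under Type Members.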
For extended examples of usage, see the Examples and the NerCrfApproachTestSpec.
Example
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.training.CoNLL
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
import org.apache.spark.ml.Pipeline

// This CoNLL dataset already includes a sentence, token, POS tags and label
// column with their respective annotator types. If a custom dataset is used,
// these need to be defined with for example:

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

// Then the training can start

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "embeddings")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  embeddings,
  nerTagger
))

// We use the sentences, tokens, POS tags and labels from the CoNLL dataset.
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)
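For readers unfamiliar with the training file used above, the following plain-Scala sketch shows how CoNLL 2003-style lines map to sentences of tokens, POS tags and labels. It is not how the CoNLL reader is implemented; ConllSketch, TaggedToken and parse are hypothetical names, and the layout (four whitespace-separated fields per line, blank lines between sentences, -DOCSTART- marker lines) is the commonly distributed CoNLL 2003 format, assumed here.

```scala
// Hypothetical illustration, not the Spark NLP CoNLL reader.
object ConllSketch {
  final case class TaggedToken(token: String, pos: String, label: String)

  // Assumed layout per line: "<token> <POS> <chunk> <NER label>".
  // Blank lines end a sentence; -DOCSTART- marker lines are skipped.
  def parse(lines: Seq[String]): Vector[Vector[TaggedToken]] = {
    val start = (Vector.empty[Vector[TaggedToken]], Vector.empty[TaggedToken])
    val (done, last) = lines.foldLeft(start) {
      case ((sentences, current), raw) =>
        val line = raw.trim
        if (line.isEmpty || line.startsWith("-DOCSTART-")) {
          // Sentence boundary: flush the current sentence if it has tokens
          if (current.nonEmpty) (sentences :+ current, Vector.empty)
          else (sentences, current)
        } else {
          line.split("\\s+") match {
            case Array(tok, pos, _, label) =>
              (sentences, current :+ TaggedToken(tok, pos, label))
            case _ =>
              (sentences, current) // ignore malformed lines
          }
        }
    }
    // Flush the final sentence when the input does not end with a blank line
    if (last.nonEmpty) done :+ last else done
  }
}
```

Each inner vector corresponds to one sentence; the label field is what ends up in the label column passed to setLabelColumn in the example.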
- See also
NerDLApproach for a deep learning based approach
NerConverter to further process the results
- class NerCrfModel extends AnnotatorModel[NerCrfModel] with HasSimpleAnnotate[NerCrfModel] with HasStorageRef
Extracts Named Entities based on a CRF Model.
This Named Entity Recognition annotator allows a generic model to be trained with a CRF machine learning algorithm. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS. These can be extracted with, for example,
- a SentenceDetector,
- a Tokenizer and
- a PerceptronModel.
This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained of the companion object:
val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")
The default model is "ner_crf" if no name is provided. For available pretrained models, please see the Models Hub.
For extended examples of usage, see the Examples.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
import org.apache.spark.ml.Pipeline

// First extract the prerequisites for the NerCrfModel
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("word_embeddings")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

// Then NER can be extracted
val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  posTagger,
  nerTagger
))

val data = Seq("U.N. official Ekeus heads for Baghdad.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("ner.result").show(false)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+
- See also
NerDLModel for a deep learning based approach
NerConverter to further process the results
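The model emits one IOB tag per token, as in the result column of the example. As a rough illustration of the kind of post-processing a converter performs, the plain-Scala sketch below groups consecutive tags of the same entity type into chunks. This is not NerConverter's implementation; IobSketch, Chunk and toChunks are hypothetical names, and the grouping rule (an I-X tag extends a directly preceding chunk of type X, anything else tagged B-X or I-X starts a new chunk) is an assumption based on the IOB scheme.

```scala
// Hypothetical illustration, not the NerConverter implementation.
object IobSketch {
  final case class Chunk(entity: String, tokens: Vector[String])

  def toChunks(tokens: Seq[String], tags: Seq[String]): Vector[Chunk] = {
    require(tokens.length == tags.length, "expected exactly one tag per token")
    // State: chunks built so far, plus the entity type of the chunk that is
    // still open (None when the previous token was tagged "O")
    val start = (Vector.empty[Chunk], Option.empty[String])
    val (chunks, _) = tokens.zip(tags).foldLeft(start) {
      case ((acc, open), (token, tag)) =>
        tag.split("-", 2) match {
          // "I-X" directly after a token of type X extends the open chunk
          case Array("I", entity) if open.contains(entity) =>
            (acc.init :+ acc.last.copy(tokens = acc.last.tokens :+ token), open)
          // Any other "B-X" or "I-X" opens a new chunk of type X
          case Array("B" | "I", entity) =>
            (acc :+ Chunk(entity, Vector(token)), Some(entity))
          // "O" (or an unrecognised tag) closes the open chunk
          case _ =>
            (acc, None)
        }
    }
    chunks
  }
}
```

Calling IobSketch.toChunks with a token sequence and its tag sequence returns one Chunk per entity mention, each holding the entity type and the tokens it spans.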
- trait ReadablePretrainedNerCrf extends ParamsAndFeaturesReadable[NerCrfModel] with HasPretrained[NerCrfModel]
Value Members
- object DictionaryFeatures extends Serializable
- object NerCrfApproach extends DefaultParamsReadable[NerCrfApproach] with Serializable
This is the companion object of NerCrfApproach. Please refer to that class for the documentation.
- object NerCrfModel extends ReadablePretrainedNerCrf with Serializable
This is the companion object of NerCrfModel. Please refer to that class for the documentation.