crf

package crf

Type Members

case class DictionaryFeatures(dict: Map[String, String]) extends Product with Serializable
case class FeatureGenerator(dictFeatures: DictionaryFeatures) extends Product with Serializable
Generates features for CrfBasedNer

class NerCrfApproach extends AnnotatorApproach[NerCrfModel] with NerApproach[NerCrfApproach]

Algorithm for training a Named Entity Recognition Model

For instantiated/pretrained models, see NerCrfModel.

This named entity recognition annotator trains a generic model using a CRF (conditional random field) machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB, with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS and WORD_EMBEDDINGS, plus an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example, a DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel and WordEmbeddingsModel, as in the example below.

Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
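
To illustrate what such an entity dictionary might contain, here is a minimal, self-contained sketch in plain Scala. It assumes one "phrase,feature" pair per line; that layout, the DictionaryFeaturesSketch stand-in class and the DictionarySketch helper are hypothetical and are not the library's own parsing code. Only the shape of the data (a Map[String, String], as in DictionaryFeatures) is taken from this page.

```scala
import java.util.regex.Pattern

// Hypothetical stand-in for DictionaryFeatures(dict: Map[String, String]),
// so the sketch runs without Spark NLP on the classpath.
final case class DictionaryFeaturesSketch(dict: Map[String, String])

object DictionarySketch {
  // Assumed layout: one "phrase<delimiter>feature" pair per line.
  def parse(lines: Seq[String], delimiter: String): DictionaryFeaturesSketch = {
    val pairs = lines
      .map(_.trim)
      .filter(_.nonEmpty)
      .flatMap { line =>
        // Split once, treating the delimiter literally rather than as a regex.
        line.split(Pattern.quote(delimiter), 2) match {
          case Array(phrase, feature) =>
            Some(phrase.trim.toLowerCase -> feature.trim)
          case _ =>
            None // skip lines without a delimiter
        }
      }
    DictionaryFeaturesSketch(pairs.toMap)
  }

  // Case-insensitive lookup of the dictionary feature for a token.
  def lookup(features: DictionaryFeaturesSketch, token: String): Option[String] =
    features.dict.get(token.toLowerCase)

  def main(args: Array[String]): Unit = {
    val lines = Seq("Baghdad,location", "Ekeus,person", "", "malformed line")
    val features = parse(lines, ",")
    println(lookup(features, "baghdad")) // Some(location)
    println(lookup(features, "heads"))   // None
  }
}
```

A lookup of this kind would give the CRF one extra categorical feature per token. How the library actually combines dictionary entries with its other features is not covered by this sketch.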

For extended examples of usage, see the Examples and the NerCrfApproachTestSpec.

Example

import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.training.CoNLL
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
import org.apache.spark.ml.Pipeline

// This CoNLL dataset already includes a sentence, token, POS tags and label
// column with their respective annotator types. If a custom dataset is used,
// these need to be defined with for example:

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

// Then the training can start
val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "embeddings")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  embeddings,
  nerTagger
))

// We use the sentences, tokens, POS tags and labels from the CoNLL dataset.
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)

See also: NerDLApproach for a deep learning based approach
NerConverter to further process the results
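
The training example above reads a CoNLL 2003 file. As a rough illustration of that file format, the following self-contained sketch parses CoNLL 2003 style lines (word, POS tag, chunk tag and entity label separated by whitespace, with blank lines between sentences) into sentences. TaggedToken and ConllSketch are hypothetical names, the sample lines are hand-written for illustration, and this is not how the library's CoNLL reader is implemented.

```scala
// One token row of a CoNLL 2003 style file.
final case class TaggedToken(word: String, pos: String, chunk: String, label: String)

object ConllSketch {
  // Groups token rows into sentences; blank lines and -DOCSTART- rows are boundaries.
  def parseSentences(lines: Seq[String]): Seq[Seq[TaggedToken]] = {
    val start = (Vector.empty[Vector[TaggedToken]], Vector.empty[TaggedToken])
    val (done, last) = lines.foldLeft(start) {
      case ((sentences, current), line) =>
        val fields = line.trim.split("\\s+")
        if (line.trim.isEmpty || fields.head == "-DOCSTART-") {
          // Boundary: close the current sentence if it has any tokens.
          if (current.nonEmpty) (sentences :+ current, Vector.empty[TaggedToken])
          else (sentences, current)
        } else if (fields.length == 4) {
          (sentences, current :+ TaggedToken(fields(0), fields(1), fields(2), fields(3)))
        } else {
          (sentences, current) // skip rows that do not have four columns
        }
    }
    if (last.nonEmpty) done :+ last else done
  }

  def main(args: Array[String]): Unit = {
    // Hand-written sample rows, not copied from the dataset.
    val sample = Seq(
      "-DOCSTART- -X- -X- O",
      "",
      "Ekeus NNP I-NP I-PER",
      "heads VBZ I-VP O",
      "for IN I-PP O",
      "Baghdad NNP I-NP I-LOC",
      "",
      "Talks NNS I-NP O",
      "continue VBP I-VP O"
    )
    val sentences = parseSentences(sample)
    println(sentences.length)               // 2
    println(sentences.head.map(_.label))    // Vector(I-PER, O, O, I-LOC)
  }
}
```

The word, POS and label columns correspond to the TOKEN, POS and NAMED_ENTITY inputs that NerCrfApproach expects. Word embeddings are not part of the file and are added by the pipeline.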

class NerCrfModel extends AnnotatorModel[NerCrfModel] with HasSimpleAnnotate[NerCrfModel] with HasStorageRef

Extracts Named Entities based on a CRF Model.

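This named entity recognition annotator applies a model trained with a CRF (conditional random field) machine learning algorithm. The input data should have columns of type DOCUMENT, TOKEN, POS and WORD_EMBEDDINGS. These can be produced with, for example, a DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel and PerceptronModel, as in the example below.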

This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained of the companion object:

val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")

The default model is "ner_crf", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples.

Example

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
import org.apache.spark.ml.Pipeline

// First extract the prerequisites for the NerCrfModel
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("word_embeddings")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

// Then NER can be extracted
val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  posTagger,
  nerTagger
))

val data = Seq("U.N. official Ekeus heads for Baghdad.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("ner.result").show(false)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+

See also: NerDLModel for a deep learning based approach
NerConverter to further process the results
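
The result column above holds one IOB tag per token. As a rough idea of the kind of post-processing that turns such tags into entities, here is a self-contained sketch that merges consecutive tokens with the same label. Entity and IobSketch are hypothetical names, the token list is an assumed tokenization of the example sentence, and the grouping rule is a simplification, not NerConverter's implementation.

```scala
// A recognized entity: its label (e.g. ORG) and the joined token text.
final case class Entity(label: String, text: String)

object IobSketch {
  // Merges runs of tokens that share an entity label; "O" tokens end a run,
  // and a "B-" tag always starts a new entity.
  def group(tokens: Seq[String], tags: Seq[String]): Seq[Entity] = {
    require(tokens.length == tags.length, "one tag per token expected")
    val start = (Vector.empty[Entity], Option.empty[Entity])
    val (done, open) = tokens.zip(tags).foldLeft(start) {
      case ((entities, current), (token, tag)) =>
        // Strip the "I-" or "B-" prefix; "O" means no entity.
        val label = if (tag == "O") None else Some(tag.drop(2))
        (current, label) match {
          case (Some(e), Some(l)) if e.label == l && !tag.startsWith("B-") =>
            (entities, Some(e.copy(text = e.text + " " + token)))
          case (_, Some(l)) =>
            (entities ++ current.toList, Some(Entity(l, token)))
          case (_, None) =>
            (entities ++ current.toList, Option.empty[Entity])
        }
    }
    done ++ open.toList
  }

  def main(args: Array[String]): Unit = {
    // Assumed tokenization matching the eight tags in the output above.
    val tokens = Seq("U.N", ".", "official", "Ekeus", "heads", "for", "Baghdad", ".")
    val tags = Seq("I-ORG", "O", "O", "I-PER", "O", "O", "I-LOC", "O")
    group(tokens, tags).foreach(println)
    // Entity(ORG,U.N)
    // Entity(PER,Ekeus)
    // Entity(LOC,Baghdad)
  }
}
```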

trait ReadablePretrainedNerCrf extends ParamsAndFeaturesReadable[NerCrfModel] with HasPretrained[NerCrfModel]

Value Members

object DictionaryFeatures extends Serializable
object NerCrfApproach extends DefaultParamsReadable[NerCrfApproach] with Serializable
This is the companion object of NerCrfApproach. Please refer to that class for the documentation.
object NerCrfModel extends ReadablePretrainedNerCrf with Serializable
This is the companion object of NerCrfModel. Please refer to that class for the documentation.
