Packages

package crf

Type Members

  1. case class DictionaryFeatures(dict: Map[String, String]) extends Product with Serializable
  2. case class FeatureGenerator(dictFeatures: DictionaryFeatures) extends Product with Serializable

    Generates features for CrfBasedNer

  3. class NerCrfApproach extends AnnotatorApproach[NerCrfModel] with NerApproach[NerCrfApproach]

    Algorithm for training a Named Entity Recognition Model

    For instantiated/pretrained models, see NerCrfModel.

    This Named Entity Recognition annotator trains a generic model using a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB, with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example, a SentenceDetector, a Tokenizer, a PerceptronModel and a WordEmbeddingsModel.

    Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
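To illustrate what an entity dictionary boils down to, here is a minimal sketch in plain Scala that turns dictionary lines into a `Map[String, String]`, the same shape as the `dict` field of `DictionaryFeatures`. The one-entry-per-line `phrase<delimiter>label` layout, the `DictionarySketch` object and its `parse` method are assumptions made for this example; they are not Spark NLP's reader, and the file layout that `setExternalFeatures` actually expects should be checked against the library documentation.

```scala
object DictionarySketch {
  // Hypothetical parser for illustration only (not part of Spark NLP).
  // Assumed layout: one "phrase<delimiter>label" entry per line.
  def parse(lines: Seq[String], delimiter: String): Map[String, String] =
    lines
      .map(_.trim)
      .filter(_.nonEmpty) // ignore blank lines
      .flatMap { line =>
        val idx = line.indexOf(delimiter)
        if (idx <= 0) None // skip lines without a delimiter or without a phrase
        else Some(line.substring(0, idx).trim -> line.substring(idx + delimiter.length).trim)
      }
      .toMap
}
```

For instance, `DictionarySketch.parse(Seq("Baghdad,LOC", "Ekeus,PER"), ",")` yields a map from each phrase to its entity label.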

    For extended examples of usage, see the Examples and the NerCrfApproachTestSpec.

    Example

    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.training.CoNLL
    import com.johnsnowlabs.nlp.annotator.NerCrfApproach
    import org.apache.spark.ml.Pipeline
    
    // This CoNLL dataset already includes a sentence, token, POS tags and label
    // column with their respective annotator types. If a custom dataset is used,
    // these need to be defined with for example:
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
    
    val posTagger = PerceptronModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("pos")
    
    // Then the training can start
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(false)
    
    val nerTagger = new NerCrfApproach()
      .setInputCols("sentence", "token", "pos", "embeddings")
      .setLabelColumn("label")
      .setMinEpochs(1)
      .setMaxEpochs(3)
      .setOutputCol("ner")
    
    val pipeline = new Pipeline().setStages(Array(
      embeddings,
      nerTagger
    ))
    
    // We use the sentences, tokens, POS tags and labels from the CoNLL dataset.
    val conll = CoNLL()
    val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
    
    val pipelineModel = pipeline.fit(trainingData)

    See also

    NerDLApproach for a deep learning based approach

    NerConverter to further process the results
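The training example above reads a CoNLL 2003 file, so it may help to see what that file holds. The sketch below, in plain Scala, parses the four whitespace-separated columns of the CoNLL 2003 layout (token, POS tag, chunk tag, NER label), treating blank lines and `-DOCSTART-` lines as sentence boundaries. `ConllSketch`, `TaggedToken` and `parse` are hypothetical names for this illustration; in a real pipeline the `CoNLL` helper does this work and also produces the annotation columns.

```scala
object ConllSketch {
  final case class TaggedToken(token: String, pos: String, label: String)

  // Hypothetical reader for illustration only (not Spark NLP's CoNLL class).
  def parse(lines: Seq[String]): Seq[Seq[TaggedToken]] = {
    val sentences = Vector.newBuilder[Seq[TaggedToken]]
    var current = Vector.empty[TaggedToken]

    // Close the sentence collected so far, if it has any tokens.
    def flush(): Unit = {
      if (current.nonEmpty) sentences += current
      current = Vector.empty
    }

    lines.foreach { raw =>
      val line = raw.trim
      if (line.isEmpty || line.startsWith("-DOCSTART-")) flush()
      else line.split("\\s+") match {
        // CoNLL 2003 columns: token, POS tag, chunk tag, NER label.
        case Array(token, pos, _, label) => current :+= TaggedToken(token, pos, label)
        case _ => () // skip lines that do not have exactly four columns
      }
    }
    flush()
    sentences.result()
  }
}
```

Each inner sequence corresponds to one sentence, which is the unit the CRF is trained on.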

  4. class NerCrfModel extends AnnotatorModel[NerCrfModel] with HasSimpleAnnotate[NerCrfModel] with HasStorageRef

    Extracts Named Entities based on a CRF Model.

    This Named Entity Recognition annotator applies a generic model trained with a CRF machine learning algorithm. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS. These can be extracted with, for example, a SentenceDetector, a Tokenizer, a PerceptronModel and a WordEmbeddingsModel.

    This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with pretrained of the companion object:

    val nerTagger = NerCrfModel.pretrained()
      .setInputCols("sentence", "token", "word_embeddings", "pos")
      .setOutputCol("ner")

    The default model is "ner_crf", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Examples.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
    import org.apache.spark.ml.Pipeline
    
    // First extract the prerequisites for the NerCrfModel
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("word_embeddings")
    
    val posTagger = PerceptronModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("pos")
    
    // Then NER can be extracted
    val nerTagger = NerCrfModel.pretrained()
      .setInputCols("sentence", "token", "word_embeddings", "pos")
      .setOutputCol("ner")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentence,
      tokenizer,
      embeddings,
      posTagger,
      nerTagger
    ))
    
    val data = Seq("U.N. official Ekeus heads for Baghdad.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("ner.result").show(false)
    +------------------------------------+
    |result                              |
    +------------------------------------+
    |[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
    +------------------------------------+

    See also

    NerDLModel for a deep learning based approach

    NerConverter to further process the results
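The model emits one IOB tag per token, as in the output above. As a rough illustration of the kind of post-processing that turns such tags into entity chunks, the plain Scala sketch below merges consecutive tokens of the same entity type. `IobGrouping`, `Chunk` and `groupTags` are hypothetical names, and this is not NerConverter's implementation; the token list in the usage note is likewise made up for the example.

```scala
object IobGrouping {
  final case class Chunk(entity: String, text: String)

  // Hypothetical helper for illustration only (not part of Spark NLP).
  def groupTags(tokens: Seq[String], tags: Seq[String]): List[Chunk] = {
    require(tokens.length == tags.length, "one tag per token expected")
    val chunks = List.newBuilder[Chunk]
    var current: Option[(String, Vector[String])] = None // (entity type, tokens so far)

    // Emit the open chunk, if any, and reset the state.
    def flush(): Unit = {
      current.foreach { case (entity, words) => chunks += Chunk(entity, words.mkString(" ")) }
      current = None
    }

    tokens.zip(tags).foreach { case (token, tag) =>
      tag.split("-", 2) match {
        case Array(prefix, entity) if prefix == "B" || prefix == "I" =>
          current match {
            // An I- tag of the same type continues the open chunk.
            case Some((open, words)) if prefix == "I" && open == entity =>
              current = Some((open, words :+ token))
            // A B- tag, or a change of type, starts a new chunk.
            case _ =>
              flush()
              current = Some((entity, Vector(token)))
          }
        case _ => flush() // "O" or an unrecognised tag closes the open chunk
      }
    }
    flush()
    chunks.result()
  }
}
```

With tokens `Seq("Ekeus", "visits", "New", "York")` and tags `Seq("I-PER", "O", "I-LOC", "I-LOC")`, the helper returns a `PER` chunk for "Ekeus" and a single `LOC` chunk for "New York".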

  5. trait ReadablePretrainedNerCrf extends ParamsAndFeaturesReadable[NerCrfModel] with HasPretrained[NerCrfModel]

Value Members

  1. object DictionaryFeatures extends Serializable
  2. object NerCrfApproach extends DefaultParamsReadable[NerCrfApproach] with Serializable

    This is the companion object of NerCrfApproach. Please refer to that class for the documentation.

  3. object NerCrfModel extends ReadablePretrainedNerCrf with Serializable

    This is the companion object of NerCrfModel. Please refer to that class for the documentation.
