package btm

Type Members

  1. class BigTextMatcher extends AnnotatorApproach[BigTextMatcherModel] with HasStorage

    Annotator to match exact phrases (by token) provided in a file against a Document.

    A text file of predefined phrases must be provided with setStoragePath. The text file can also be set directly as an ExternalResource.

    In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.

    For extended examples of usage, see the BigTextMatcherTestSpec.

    Example

    In this example, the entities file is of the form

    ...
    dolore magna aliqua
    lorem ipsum dolor. sit
    laborum
    ...

    where each line represents an entity phrase to be extracted.

    // Assumes an active SparkSession named `spark` (as in spark-shell); needed for toDF
    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.BigTextMatcher
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
    val entityExtractor = new BigTextMatcher()
      .setInputCols("document", "token")
      .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
      .setOutputCol("entity")
      .setCaseSensitive(false)
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
    val results = pipeline.fit(data).transform(data)
    results.selectExpr("explode(entity)").show(false)
    +--------------------------------------------------------------------+
    |col                                                                 |
    +--------------------------------------------------------------------+
    |[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
    |[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
    +--------------------------------------------------------------------+

  2. class BigTextMatcherModel extends AnnotatorModel[BigTextMatcherModel] with HasSimpleAnnotate[BigTextMatcherModel] with HasStorageModel

    Instantiated model of the BigTextMatcher. For usage and examples see the documentation of the main class.

  3. trait ReadablePretrainedBigTextMatcher extends StorageReadable[BigTextMatcherModel] with HasPretrained[BigTextMatcherModel]
  4. class TMEdgesReadWriter extends TMEdgesReader with StorageReadWriter[Int]
  5. class TMEdgesReader extends StorageReader[Int]
  6. class TMNodesReader extends StorageReader[TrieNode]
  7. class TMNodesWriter extends StorageBatchWriter[TrieNode]
  8. class TMVocabReadWriter extends TMVocabReader with StorageReadWriter[Int]
  9. class TMVocabReader extends StorageReader[Int]
  10. case class TrieNode(pi: Int, isLeaf: Boolean, length: Int, lastLeaf: Int) extends Product with Serializable
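
The TrieNode type and the TM* readers and writers listed above suggest that phrases are stored as a trie over tokens. This page does not document the TrieNode fields or the storage layout, so the following is only a minimal in-memory sketch of token-level trie matching in plain Scala, not the library's implementation. The names PhraseTrie, add and findAll are hypothetical, and lowercasing stands in for case-insensitive matching.

```scala
import scala.collection.mutable

// Hypothetical token trie for illustration only; it does not mirror the
// library's TrieNode(pi, isLeaf, length, lastLeaf) layout or its storage.
final class PhraseTrie {
  private final class Node {
    val children = mutable.Map.empty[String, Node]
    var isLeaf = false // true when some phrase ends at this node
  }
  private val root = new Node

  /** Adds one phrase, given as its sequence of tokens (lowercased). */
  def add(tokens: Seq[String]): Unit = {
    var node = root
    tokens.foreach { t =>
      node = node.children.getOrElseUpdate(t.toLowerCase, new Node)
    }
    node.isLeaf = true
  }

  /** Returns (start, endInclusive) token spans, taking the longest match at each start. */
  def findAll(tokens: IndexedSeq[String]): Seq[(Int, Int)] = {
    val hits = mutable.ArrayBuffer.empty[(Int, Int)]
    var i = 0
    while (i < tokens.length) {
      var node = root
      var j = i
      var lastEnd = -1 // index of the last token that completed a phrase
      var walking = true
      while (walking && j < tokens.length) {
        node.children.get(tokens(j).toLowerCase) match {
          case Some(next) =>
            node = next
            if (node.isLeaf) lastEnd = j
            j += 1
          case None => walking = false
        }
      }
      if (lastEnd >= 0) { hits += ((i, lastEnd)); i = lastEnd + 1 }
      else i += 1
    }
    hits.toSeq
  }
}

// Usage: phrases and text are split on spaces as a stand-in for a real tokenizer.
val trie = new PhraseTrie
Seq("dolore magna aliqua", "lorem ipsum dolor. sit", "laborum")
  .foreach(p => trie.add(p.split(" ").toSeq))

val tokens =
  "Hello dolore magna aliqua . Lorem ipsum dolor . sit in laborum".split(" ").toIndexedSeq
val spans = trie.findAll(tokens)
// spans contains (1, 3) for "dolore magna aliqua" and (11, 11) for "laborum"
```

In this sketch the phrase "lorem ipsum dolor. sit" is not found because the space-split phrase contains the single token "dolor." while the text has "dolor" and "." as separate tokens. A match requires the phrase and the text to be tokenized the same way.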

Value Members

  1. object BigTextMatcher extends DefaultParamsReadable[BigTextMatcher] with Serializable

    This is the companion object of BigTextMatcher. Please refer to that class for the documentation.

  2. object BigTextMatcherModel extends ReadablePretrainedBigTextMatcher with Serializable

    This is the companion object of BigTextMatcherModel. Please refer to that class for the documentation.
