btm

package btm

Type Members

class BigTextMatcher extends AnnotatorApproach[BigTextMatcherModel] with HasStorage

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setStoragePath. The text file can also be set directly as an ExternalResource.

In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.

For extended examples of usage, see the BigTextMatcherTestSpec.
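
To make "matching exact phrases by token" concrete, the following is a minimal sketch of a token-level trie lookup in plain Scala. It is not Spark NLP's implementation: the names TokenTrieSketch, Node, tokenize, build and search are hypothetical, whitespace tokenization is assumed for brevity, and the longest-match policy is an assumption of this sketch rather than documented behavior of BigTextMatcher.

```scala
import scala.collection.mutable

// Illustrative only: hypothetical names, not part of Spark NLP.
object TokenTrieSketch {
  // One trie node per token prefix; isLeaf marks the end of a complete phrase.
  final class Node {
    val children = mutable.Map.empty[String, Node]
    var isLeaf = false
  }

  // Assumption: whitespace tokenization stands in for a real tokenizer.
  def tokenize(s: String): Vector[String] =
    s.trim.split("\\s+").filter(_.nonEmpty).toVector

  private def key(token: String, caseSensitive: Boolean): String =
    if (caseSensitive) token else token.toLowerCase

  // Insert every phrase into the trie, one token per level.
  def build(phrases: Seq[String], caseSensitive: Boolean): Node = {
    val root = new Node
    for (phrase <- phrases) {
      var node = root
      for (tok <- tokenize(phrase))
        node = node.children.getOrElseUpdate(key(tok, caseSensitive), new Node)
      if (node ne root) node.isLeaf = true // ignore empty lines
    }
    root
  }

  // Returns (startToken, endTokenInclusive) for the longest phrase found
  // at each start position, scanning left to right without overlaps.
  def search(root: Node, tokens: Vector[String], caseSensitive: Boolean): Vector[(Int, Int)] = {
    val hits = Vector.newBuilder[(Int, Int)]
    var i = 0
    while (i < tokens.length) {
      var node = root
      var j = i
      var lastEnd = -1
      var walking = true
      while (walking && j < tokens.length) {
        node.children.get(key(tokens(j), caseSensitive)) match {
          case Some(next) =>
            node = next
            if (node.isLeaf) lastEnd = j
            j += 1
          case None =>
            walking = false
        }
      }
      if (lastEnd >= 0) { hits += ((i, lastEnd)); i = lastEnd + 1 }
      else i += 1
    }
    hits.result()
  }
}
```

For instance, building the trie from Seq("dolore magna aliqua", "laborum") and searching the tokens of "Hello dolore magna aliqua sit in laborum" yields the token spans (1, 3) and (6, 6). Lookup cost depends on the length of the text and of the phrases, not on how many phrases are stored, which is one reason a trie suits large phrase lists.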

Example

In this example, the entities file is of the form

...
dolore magna aliqua
lorem ipsum dolor. sit
laborum
...

where each line represents an entity phrase to be extracted.
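
A phrases file in this one-phrase-per-line format could be produced as sketched below. The object name PhraseFileSketch, the helper writePhrases and the temporary file location are hypothetical choices for illustration; only the line-per-phrase layout comes from the description above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Illustrative only: hypothetical helper, not part of Spark NLP.
object PhraseFileSketch {
  // Writes one phrase per line as UTF-8 and returns the path written to.
  def writePhrases(path: Path, phrases: Seq[String]): Path =
    Files.write(path, phrases.mkString("\n").getBytes(StandardCharsets.UTF_8))

  // Reads the file back as a single string, e.g. to inspect its layout.
  def readBack(path: Path): String =
    new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
}
```

A call such as PhraseFileSketch.writePhrases(Files.createTempFile("phrases", ".txt"), Seq("dolore magna aliqua", "laborum")) writes two lines; the resulting path is what would then be handed to setStoragePath.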

// Assumes an active SparkSession named `spark` (as in spark-shell).
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.BigTextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new BigTextMatcher()
  .setInputCols("document", "token")
  .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
  .setOutputCol("entity")
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(false)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+
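
The second and third fields of each row above are the begin and end character offsets of the chunk in the input text, with the end offset inclusive. A small plain-Scala check (the names OffsetCheck and span are hypothetical) reproduces them for the example sentence:

```scala
// Illustrative only: recomputes the offsets shown in the output table.
object OffsetCheck {
  val text = "Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"

  // Returns (begin, end) with an inclusive end, or None if the phrase is absent.
  def span(phrase: String): Option[(Int, Int)] = {
    val begin = text.indexOf(phrase)
    if (begin < 0) None else Some((begin, begin + phrase.length - 1))
  }
}
```

Here OffsetCheck.span("dolore magna aliqua") gives Some((6,24)) and OffsetCheck.span("laborum") gives Some((53,59)), matching the two chunk rows.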

class BigTextMatcherModel extends AnnotatorModel[BigTextMatcherModel] with HasSimpleAnnotate[BigTextMatcherModel] with HasStorageModel
Instantiated model of the BigTextMatcher. For usage and examples see the documentation of the main class.
trait ReadablePretrainedBigTextMatcher extends StorageReadable[BigTextMatcherModel] with HasPretrained[BigTextMatcherModel]
class TMEdgesReadWriter extends TMEdgesReader with StorageReadWriter[Int]
class TMEdgesReader extends StorageReader[Int]
class TMNodesReader extends StorageReader[TrieNode]
class TMNodesWriter extends StorageBatchWriter[TrieNode]
class TMVocabReadWriter extends TMVocabReader with StorageReadWriter[Int]
class TMVocabReader extends StorageReader[Int]
case class TrieNode(pi: Int, isLeaf: Boolean, length: Int, lastLeaf: Int) extends Product with Serializable

Value Members

object BigTextMatcher extends DefaultParamsReadable[BigTextMatcher] with Serializable
This is the companion object of BigTextMatcher. Please refer to that class for the documentation.
object BigTextMatcherModel extends ReadablePretrainedBigTextMatcher with Serializable
This is the companion object of BigTextMatcherModel. Please refer to that class for the documentation.
