sparknlp.annotator.matcher.big_text_matcher

Contains classes for the BigTextMatcher.

Module Contents

Classes

BigTextMatcher: Annotator to match exact phrases (by token) provided in a file against a Document.
BigTextMatcherModel: Instantiated model of the BigTextMatcher.
- class BigTextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setStoragePath. In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.

Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
  - entities
    ExternalResource for entities
  - caseSensitive
    Whether to ignore case in index lookups, by default True
  - mergeOverlapping
    Whether to merge overlapping matched chunks, by default False
  - tokenizer
    TokenizerModel to use to tokenize the input file for building a Trie
Examples
In this example, the entities file is of the form:
...
dolore magna aliqua
lorem ipsum dolor. sit
laborum
...
where each line represents an entity phrase to be extracted.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
>>> entityExtractor = BigTextMatcher() \
...     .setInputCols("document", "token") \
...     .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
...     .setOutputCol("entity") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+
- setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})

Sets ExternalResource for entities.

- Parameters:
  - path : str
    Path to the resource
  - read_as : str, optional
    How to read the resource, by default ReadAs.TEXT
  - options : dict, optional
    Options for reading the resource, by default {"format": "text"}
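A minimal sketch of configuring the entities resource directly; the file path below is a placeholder and must point to a text file with one phrase per line:

>>> entityExtractor = BigTextMatcher() \
...     .setInputCols("document", "token") \
...     .setOutputCol("entity") \
...     .setEntities("/path/to/entity-phrases.txt", ReadAs.TEXT, {"format": "text"})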
- setCaseSensitive(b)

Sets whether to ignore case in index lookups, by default True.

- Parameters:
  - b : bool
    Whether to ignore case in index lookups
- class BigTextMatcherModel(classname='com.johnsnowlabs.nlp.annotators.btm.TextMatcherModel', java_model=None)

Instantiated model of the BigTextMatcher.

This is the instantiated model of the BigTextMatcher. For training your own model, please see the documentation of that class.

Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
  - caseSensitive
    Whether to ignore case in index lookups
  - mergeOverlapping
    Whether to merge overlapping matched chunks, by default False
  - searchTrie
    SearchTrie
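As a minimal sketch (reusing the pipeline and data from the BigTextMatcher example above), fitting a pipeline that contains a BigTextMatcher yields this model as the corresponding fitted stage:

>>> model = pipeline.fit(data)
>>> matcherModel = model.stages[-1]  # fitted BigTextMatcherModel stage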
- setMergeOverlapping(b)

Sets whether to merge overlapping matched chunks, by default False.

- Parameters:
  - b : bool
    Whether to merge overlapping matched chunks
- setCaseSensitive(v)

Sets whether to ignore case in index lookups.

- Parameters:
  - v : bool
    Whether to ignore case in index lookups
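A small sketch of chaining both setters on a model instance; matcherModel is a placeholder for any BigTextMatcherModel (for example, the fitted stage above):

>>> matcherModel = matcherModel \
...     .setCaseSensitive(False) \
...     .setMergeOverlapping(True)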
- static pretrained(name, lang='en', remote_loc=None)

Downloads and loads a pretrained model.

- Parameters:
  - name : str, optional
    Name of the pretrained model
  - lang : str, optional
    Language of the pretrained model, by default "en"
  - remote_loc : str, optional
    Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
  - BigTextMatcherModel
    The restored model
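A hedged sketch of loading a pretrained model; the model name below is a placeholder, and available names should be checked in the Spark NLP Models Hub:

>>> matcherModel = BigTextMatcherModel.pretrained("some_matcher_model", "en") \
...     .setInputCols("document", "token") \
...     .setOutputCol("entity")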
- static loadStorage(path, spark, storage_ref)

Loads the model from storage.

- Parameters:
  - path : str
    Path to the model
  - spark : pyspark.sql.SparkSession
    The current SparkSession
  - storage_ref : str
    Identifiers for the model parameters
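A hedged sketch of a loadStorage call; the path and storage reference are placeholders and must match the storage_ref the index was saved with:

>>> BigTextMatcherModel.loadStorage("/path/to/saved/storage", spark, "bigtextmatcher_ref")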