sparknlp.annotator.matcher.text_matcher
Contains classes for the TextMatcher.
Module Contents
Classes

TextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.

TextMatcherModel
Instantiated model of the TextMatcher.
- class TextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setEntities(). For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
- entities
ExternalResource for entities
- caseSensitive
Whether matching is case sensitive, by default True
- mergeOverlapping
Whether to merge overlapping matched chunks, by default False
- entityValue
Value for the entity metadata field
- buildFromTokens
Whether the TextMatcher should take the CHUNK from TOKEN
See also
BigTextMatcher : to match large amounts of text
Examples
In this example, the entities file is of the form:

dolore magna aliqua
lorem ipsum dolor. sit
laborum

where each line represents an entity phrase to be extracted.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
>>> entityExtractor = TextMatcher() \
...     .setInputCols(["document", "token"]) \
...     .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
...     .setOutputCol("entity") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(entity) as result").show(truncate=False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+
- setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})
Sets the external resource for the entities.
- Parameters:
- path : str
Path to the external resource
- read_as : str, optional
How to read the resource, by default ReadAs.TEXT
- options : dict, optional
Options for reading the resource, by default {"format": "text"}
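For instance, a minimal sketch of wiring up the entities resource, assuming a hypothetical phrases file at /tmp/phrases.txt with one phrase per line:

>>> # /tmp/phrases.txt is a placeholder path; each line of the file is one matchable phrase
>>> from sparknlp.annotator import TextMatcher
>>> from sparknlp.common import ReadAs
>>> entityExtractor = TextMatcher() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("entity") \
...     .setEntities("/tmp/phrases.txt", ReadAs.TEXT, {"format": "text"})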
- setCaseSensitive(b)
Sets whether matching is case sensitive, by default True.
- Parameters:
- b : bool
Whether matching is case sensitive
- setMergeOverlapping(b)
Sets whether to merge overlapping matched chunks, by default False.
- Parameters:
- b : bool
Whether to merge overlapping matched chunks
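As a sketch of the effect: if the hypothetical phrases file above contained both "dolore magna" and "magna aliqua", the default (False) returns two overlapping chunks, while enabling the flag yields a single chunk spanning "dolore magna aliqua":

>>> # continuing the hypothetical entityExtractor from the sketch above
>>> entityExtractor = entityExtractor.setMergeOverlapping(True)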
- class TextMatcherModel(classname='com.johnsnowlabs.nlp.annotators.TextMatcherModel', java_model=None)
Instantiated model of the TextMatcher.
This is the instantiated model of the TextMatcher. For training your own model, please see the documentation of that class.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
- mergeOverlapping
Whether to merge overlapping matched chunks, by default False
- entityValue
Value for the entity metadata field
- buildFromTokens
Whether the TextMatcher should take the CHUNK from TOKEN
- setMergeOverlapping(b)
Sets whether to merge overlapping matched chunks, by default False.
- Parameters:
- b : bool
Whether to merge overlapping matched chunks
- setEntityValue(b)
Sets the value for the entity metadata field.
- Parameters:
- b : str
Value for the entity metadata field
- setBuildFromTokens(b)
Sets whether the TextMatcher should take the CHUNK from TOKEN.
- Parameters:
- b : bool
Whether the TextMatcher should take the CHUNK from TOKEN
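Taken together, a sketch of adjusting a fitted model's output, reusing the pipeline and data from the TextMatcher example above; the stage index assumes the matcher is the last pipeline stage:

>>> pipelineModel = pipeline.fit(data)
>>> matcherModel = pipelineModel.stages[-1]  # the fitted TextMatcherModel
>>> matcherModel = matcherModel.setMergeOverlapping(True) \
...     .setEntityValue("entity") \
...     .setBuildFromTokens(False)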
- static pretrained(name, lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str
Name of the pretrained model
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- TextMatcherModel
The restored model
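A minimal sketch of loading and wiring a pretrained model, where "text_matcher_model" is a hypothetical name standing in for a real entry in the Spark NLP Models Hub:

>>> # "text_matcher_model" is a placeholder model name, not a published model
>>> from sparknlp.annotator import TextMatcherModel
>>> entityMatcher = TextMatcherModel.pretrained("text_matcher_model", lang="en") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("entity")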