sparknlp.annotator.matcher.text_matcher
Contains classes for the TextMatcher.
Module Contents
Classes

TextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.

TextMatcherModel
Instantiated model of the TextMatcher.
- class TextMatcher
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setEntities(). For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
- entities
ExternalResource for entities
- caseSensitive
Whether matching is case sensitive, by default True
- mergeOverlapping
Whether to merge overlapping matched chunks, by default False
- entityValue
Value for the entity metadata field
- buildFromTokens
Whether the TextMatcher should take the CHUNK from TOKEN
See also
BigTextMatcher : to match large amounts of text
Examples
In this example, the entities file is of the form:

dolore magna aliqua
lorem ipsum dolor. sit
laborum

where each line represents an entity phrase to be extracted.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
>>> entityExtractor = TextMatcher() \
...     .setInputCols(["document", "token"]) \
...     .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
...     .setOutputCol("entity") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(entity) as result").show(truncate=False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+
- setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})
Sets the external resource for the entities.
- Parameters:
- path : str
Path to the external resource
- read_as : str, optional
How to read the resource, by default ReadAs.TEXT
- options : dict, optional
Options for reading the resource, by default {"format": "text"}
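For instance, a minimal sketch of wiring up the entities resource, assuming a hypothetical phrases file at /tmp/phrases.txt with one phrase per line:

>>> # /tmp/phrases.txt is a placeholder path; each line of the file is one matchable phrase
>>> from sparknlp.annotator import TextMatcher
>>> from sparknlp.common import ReadAs
>>> entityExtractor = TextMatcher() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("entity") \
...     .setEntities("/tmp/phrases.txt", ReadAs.TEXT, {"format": "text"})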
- setCaseSensitive(b)
Sets whether matching is case sensitive, by default True.
- Parameters:
- b : bool
Whether matching is case sensitive
- setMergeOverlapping(b)
Sets whether to merge overlapping matched chunks, by default False.
- Parameters:
- b : bool
Whether to merge overlapping matched chunks
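As a sketch of the effect: if the hypothetical phrases file above contained both "dolore magna" and "magna aliqua", the default (False) returns two overlapping chunks, while enabling the flag yields a single chunk spanning "dolore magna aliqua":

>>> # continuing the hypothetical entityExtractor from the sketch above
>>> entityExtractor = entityExtractor.setMergeOverlapping(True)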
- class TextMatcherModel(classname='com.johnsnowlabs.nlp.annotators.TextMatcherModel', java_model=None)
Instantiated model of the TextMatcher.
This is the instantiated model of the TextMatcher. For training your own model, please see the documentation of that class.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
- mergeOverlapping
Whether to merge overlapping matched chunks, by default False
- entityValue
Value for the entity metadata field
- buildFromTokens
Whether the TextMatcher should take the CHUNK from TOKEN
- setMergeOverlapping(b)
Sets whether to merge overlapping matched chunks, by default False.
- Parameters:
- b : bool
Whether to merge overlapping matched chunks
- setEntityValue(b)
Sets the value for the entity metadata field.
- Parameters:
- b : str
Value for the entity metadata field
- setBuildFromTokens(b)
Sets whether the TextMatcher should take the CHUNK from TOKEN.
- Parameters:
- b : bool
Whether the TextMatcher should take the CHUNK from TOKEN
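Taken together, a sketch of adjusting a fitted model's output, reusing the pipeline and data from the TextMatcher example above; the stage index assumes the matcher is the last pipeline stage:

>>> pipelineModel = pipeline.fit(data)
>>> matcherModel = pipelineModel.stages[-1]  # the fitted TextMatcherModel
>>> matcherModel = matcherModel.setMergeOverlapping(True) \
...     .setEntityValue("entity") \
...     .setBuildFromTokens(False)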
- static pretrained(name, lang='en', remote_loc=None)
Downloads and loads a pretrained model.
- Parameters:
- name : str
Name of the pretrained model
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- TextMatcherModel
The restored model
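A minimal sketch of loading and wiring a pretrained model, where "text_matcher_model" is a hypothetical name standing in for a real entry in the Spark NLP Models Hub:

>>> # "text_matcher_model" is a placeholder model name, not a published model
>>> from sparknlp.annotator import TextMatcherModel
>>> entityMatcher = TextMatcherModel.pretrained("text_matcher_model", lang="en") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("entity")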