sparknlp.annotator.matcher.big_text_matcher

Contains classes for the BigTextMatcher.

Module Contents

Classes

BigTextMatcher: Annotator to match exact phrases (by token) provided in a file against a Document.
BigTextMatcherModel: Instantiated model of the BigTextMatcher.
- class BigTextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setStoragePath. In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.

Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
  - entities
    ExternalResource for entities
  - caseSensitive
    Whether to ignore case in index lookups, by default True
  - mergeOverlapping
    Whether to merge overlapping matched chunks, by default False
  - tokenizer
    TokenizerModel to use to tokenize the input file for building a Trie
Examples
In this example, the entities file is of the form:
...
dolore magna aliqua
lorem ipsum dolor. sit
laborum
...
where each line represents an entity phrase to be extracted.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
>>> entityExtractor = BigTextMatcher() \
...     .setInputCols("document", "token") \
...     .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
...     .setOutputCol("entity") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+
- setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})

Sets ExternalResource for entities.

- Parameters:
  - path : str
    Path to the resource
  - read_as : str, optional
    How to read the resource, by default ReadAs.TEXT
  - options : dict, optional
    Options for reading the resource, by default {"format": "text"}
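A minimal sketch of configuring the entities resource directly; the file path below is a placeholder and must point to a text file with one phrase per line:

>>> entityExtractor = BigTextMatcher() \
...     .setInputCols("document", "token") \
...     .setOutputCol("entity") \
...     .setEntities("/path/to/entity-phrases.txt", ReadAs.TEXT, {"format": "text"})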
- setCaseSensitive(b)

Sets whether to ignore case in index lookups, by default True.

- Parameters:
  - b : bool
    Whether to ignore case in index lookups
- class BigTextMatcherModel(classname='com.johnsnowlabs.nlp.annotators.btm.TextMatcherModel', java_model=None)

Instantiated model of the BigTextMatcher.

This is the instantiated model of the BigTextMatcher. For training your own model, please see the documentation of that class.

Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
  - caseSensitive
    Whether to ignore case in index lookups
  - mergeOverlapping
    Whether to merge overlapping matched chunks, by default False
  - searchTrie
    SearchTrie
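As a minimal sketch (reusing the pipeline and data from the BigTextMatcher example above), fitting a pipeline that contains a BigTextMatcher yields this model as the corresponding fitted stage:

>>> model = pipeline.fit(data)
>>> matcherModel = model.stages[-1]  # fitted BigTextMatcherModel stage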
- setMergeOverlapping(b)

Sets whether to merge overlapping matched chunks, by default False.

- Parameters:
  - b : bool
    Whether to merge overlapping matched chunks
- setCaseSensitive(v)

Sets whether to ignore case in index lookups.

- Parameters:
  - v : bool
    Whether to ignore case in index lookups
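A small sketch of chaining both setters on a model instance; matcherModel is a placeholder for any BigTextMatcherModel (for example, the fitted stage above):

>>> matcherModel = matcherModel \
...     .setCaseSensitive(False) \
...     .setMergeOverlapping(True)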
- static pretrained(name, lang='en', remote_loc=None)

Downloads and loads a pretrained model.

- Parameters:
  - name : str, optional
    Name of the pretrained model
  - lang : str, optional
    Language of the pretrained model, by default "en"
  - remote_loc : str, optional
    Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
  - BigTextMatcherModel
    The restored model
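A hedged sketch of loading a pretrained model; the model name below is a placeholder, and available names should be checked in the Spark NLP Models Hub:

>>> matcherModel = BigTextMatcherModel.pretrained("some_matcher_model", "en") \
...     .setInputCols("document", "token") \
...     .setOutputCol("entity")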
- static loadStorage(path, spark, storage_ref)

Loads the model from storage.

- Parameters:
  - path : str
    Path to the model
  - spark : pyspark.sql.SparkSession
    The current SparkSession
  - storage_ref : str
    Identifiers for the model parameters
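A hedged sketch of a loadStorage call; the path and storage reference are placeholders and must match the storage_ref the index was saved with:

>>> BigTextMatcherModel.loadStorage("/path/to/saved/storage", spark, "bigtextmatcher_ref")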