sparknlp.annotator.matcher.big_text_matcher#
Contains classes for the BigTextMatcher.
Module Contents#
Classes#
- BigTextMatcher: Annotator to match exact phrases (by token) provided in a file against a Document.
- BigTextMatcherModel: Instantiated model of the BigTextMatcher.
- class BigTextMatcher[source]#
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setStoragePath. In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
- entities
ExternalResource for entities
- caseSensitive
Whether to ignore case in index lookups, by default True
- mergeOverlapping
Whether to merge overlapping matched chunks, by default False
- tokenizer
TokenizerModel to use to tokenize input file for building a Trie
Examples
In this example, the entities file is of the form:
...
dolore magna aliqua
lorem ipsum dolor. sit
laborum
...
where each line represents an entity phrase to be extracted.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
>>> entityExtractor = BigTextMatcher() \
...     .setInputCols("document", "token") \
...     .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
...     .setOutputCol("entity") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+
- setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets ExternalResource for entities.
- Parameters:
- path : str
Path to the resource
- read_as : str, optional
How to read the resource, by default ReadAs.TEXT
- options : dict, optional
Options for reading the resource, by default {"format": "text"}
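As a minimal sketch (the file path below is a placeholder, and the file is assumed to hold one phrase per line, as in the Examples above), the resource can be set explicitly:
>>> from sparknlp.common import ReadAs
>>> entityExtractor = BigTextMatcher() \
...     .setInputCols("document", "token") \
...     .setOutputCol("entity") \
...     .setEntities("path/to/entities.txt", read_as=ReadAs.TEXT, options={"format": "text"})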
- setCaseSensitive(b)[source]#
Sets whether to ignore case in index lookups, by default True.
- Parameters:
- b : bool
Whether to ignore case in index lookups
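For example, to let the lower-case phrase file match capitalized input (continuing the sketch above):
>>> entityExtractor = entityExtractor.setCaseSensitive(False)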
- class BigTextMatcherModel(classname='com.johnsnowlabs.nlp.annotators.btm.TextMatcherModel', java_model=None)[source]#
Instantiated model of the BigTextMatcher.
This is the instantiated model of the BigTextMatcher. For training your own model, please see the documentation of that class.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
- caseSensitive
Whether to ignore case in index lookups
- mergeOverlapping
Whether to merge overlapping matched chunks, by default False
- searchTrie
SearchTrie
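A BigTextMatcherModel is what a fitted pipeline stage holds. A minimal sketch of retrieving it from the pipeline in the Examples above (the stage index assumes the three-stage pipeline shown there):
>>> model = pipeline.fit(data)
>>> entityModel = model.stages[2]  # the fitted BigTextMatcherModel
>>> model.transform(data).select("entity.result").show(truncate=False)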
- setMergeOverlapping(b)[source]#
Sets whether to merge overlapping matched chunks, by default False.
- Parameters:
- b : bool
Whether to merge overlapping matched chunks, by default False
- setCaseSensitive(v)[source]#
Sets whether to ignore case in index lookups.
- Parameters:
- v : bool
Whether to ignore case in index lookups
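The setters return the model itself, so they can be chained; a sketch, continuing from the fitted model above:
>>> entityModel.setMergeOverlapping(True).setCaseSensitive(False)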
- static pretrained(name, lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str
Name of the pretrained model
- lang : str, optional
Language of the pretrained model, by default "en"
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- BigTextMatcherModel
The restored model
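A sketch of loading a pretrained model; the model name below is a placeholder, since the names actually available depend on what is published in the Spark NLP model repositories:
>>> from sparknlp.annotator import BigTextMatcherModel
>>> entityModel = BigTextMatcherModel.pretrained("my_big_text_matcher", lang="en") \
...     .setInputCols("document", "token") \
...     .setOutputCol("entity")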
- static loadStorage(path, spark, storage_ref)[source]#
Loads the model from storage.
- Parameters:
- path : str
Path to the model
- spark : pyspark.sql.SparkSession
The current SparkSession
- storage_ref : str
Identifiers for the model parameters
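A sketch of pointing a session at a previously saved index; both the path and the storage reference below are placeholders and must match the values used when the model's storage was written:
>>> BigTextMatcherModel.loadStorage("/tmp/btm_index", spark, "btm_entities_ref")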