sparknlp.annotator.matcher.regex_matcher
Contains classes for the RegexMatcher.
Module Contents
Classes
RegexMatcher | Uses rules to match a set of regular expressions and associate them with a provided identifier.
RegexMatcherModel | Instantiated model of the RegexMatcher.
- class RegexMatcher
Uses rules to match a set of regular expressions and associate them with a provided identifier.
A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be “\d{4}\/\d\d\/\d\d,date”, which will match strings like “1970/01/01” to the identifier “date”.
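The rule format described above can be illustrated outside Spark with the standard `re` module. This is only a sketch: `parse_rule` is a hypothetical helper, and splitting on the first occurrence of the delimiter is an assumption, not a statement about how RegexMatcher parses rules internally.

```python
import re


def parse_rule(rule: str, delimiter: str = ","):
    """Split a rule string into (compiled regex, identifier).

    Assumption: the rule is split on the first occurrence of the delimiter.
    """
    pattern, identifier = rule.split(delimiter, 1)
    return re.compile(pattern), identifier.strip()


# The example rule from the text: a date pattern tagged with "date".
regex, identifier = parse_rule(r"\d{4}\/\d\d\/\d\d,date")
match = regex.search("Born on 1970/01/01 in Lisbon.")
print(identifier, match.group(0))
```

A pattern that itself contains the delimiter (for example a `{2,4}` quantifier with a comma delimiter) would be split in the wrong place by this helper, which is one reason to choose a delimiter that does not occur in the patterns.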
Rules must be provided by either setRules() (followed by setDelimiter()) or an external file.
To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules(). The dictionary can be set in the form of a delimited text file.
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Examples.
Input Annotation types | Output Annotation type
DOCUMENT | CHUNK
- Parameters:
- strategy
Matching strategy, one of MATCH_FIRST, MATCH_ALL or MATCH_COMPLETE, by default “MATCH_ALL”
- rules
Regex rules to match the identifier with
- delimiter
Delimiter for rules provided with setRules
- externalRules
External resource containing the rules, needs ‘delimiter’ in options
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
In this example, rules.txt has the form:
the\s\w+, followed by 'the'
ceremonies, ceremony
where each regex is separated from its identifier by the delimiter ",".
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> regexMatcher = RegexMatcher() \
...     .setExternalRules("src/test/resources/regex-matcher/rules.txt", ",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")
>>> pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])
>>> data = spark.createDataFrame([[
...     "My first sentence with the first rule. This is my second sentence with ceremonies rule."
... ]]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+
- setStrategy(value)
Sets matching strategy, by default “MATCH_ALL”.
Can be either MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE.
- Parameters:
- value : str
Matching Strategy
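This page does not spell out what the three strategies do. The sketch below shows one plausible reading in plain Python, under the assumption that MATCH_ALL returns every match, MATCH_FIRST returns only the first match, and MATCH_COMPLETE returns a match only when the pattern covers the entire input. `apply_strategy` is a hypothetical helper for illustration and is not part of Spark NLP.

```python
import re


def apply_strategy(pattern: str, text: str, strategy: str = "MATCH_ALL"):
    """Return matched substrings under an assumed reading of each strategy."""
    regex = re.compile(pattern)
    if strategy == "MATCH_ALL":
        # Assumed: every non-overlapping match in the text.
        return [m.group(0) for m in regex.finditer(text)]
    if strategy == "MATCH_FIRST":
        # Assumed: only the first match, if any.
        m = regex.search(text)
        return [m.group(0)] if m else []
    if strategy == "MATCH_COMPLETE":
        # Assumed: a result only if the pattern matches the whole text.
        m = regex.fullmatch(text)
        return [m.group(0)] if m else []
    raise ValueError(f"Unknown strategy: {strategy}")


for name in ("MATCH_ALL", "MATCH_FIRST", "MATCH_COMPLETE"):
    print(name, apply_strategy(r"\d+", "10 and 20", name))
```

Under this reading, the same rule can yield several chunks, one chunk, or none for the same input, so the choice of strategy matters most when a pattern can occur more than once per sentence.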
- setExternalRules(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})
Sets the external resource containing the rules, needs ‘delimiter’ in options.
Set only one of the parameters rules and externalRules.
- Parameters:
- path : str
Path to the source files
- delimiter : str
Delimiter for the dictionary file. Can also be set in options.
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {“format”: “text”}
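A delimited rules file of the kind described above can be produced with a few lines of Python. This is a minimal sketch: `write_rules_file` and `build_matcher` are hypothetical helpers, the file is written to a temporary directory chosen for illustration, and `build_matcher` assumes Spark NLP is installed and a Spark session is already running.

```python
import os
import tempfile


def write_rules_file(rules, delimiter=",", directory=None):
    """Write (pattern, identifier) pairs as one delimited rule per line."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "rules.txt")
    with open(path, "w", encoding="utf-8") as handle:
        for pattern, identifier in rules:
            handle.write(f"{pattern}{delimiter}{identifier}\n")
    return path


def build_matcher(path, delimiter=","):
    """Point a RegexMatcher at the rules file (needs a running Spark session)."""
    # Imported lazily so the file helper above works without Spark NLP.
    from sparknlp.annotator import RegexMatcher

    return (
        RegexMatcher()
        .setExternalRules(path, delimiter)
        .setInputCols(["sentence"])
        .setOutputCol("regex")
    )


rules_path = write_rules_file(
    [(r"the\s\w+", "followed by 'the'"), ("ceremonies", "ceremony")]
)
print(rules_path)
```

The delimiter passed to `build_matcher` has to be the same one used when the file was written.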
- setRules(value)
Sets the regex rules to match the identifier with.
Each rule must consist of a regex pattern and an identifier for that pattern. The pattern and the identifier must be separated by a delimiter character, which must also be set with setDelimiter.
Set only one of the parameters rules and externalRules.
- Parameters:
- value : List[str]
List of rules
Examples
>>> # Raw strings keep the regex backslashes intact.
>>> regexMatcher = RegexMatcher() \
...     .setRules([r"\d{4}\/\d\d\/\d\d,date", r"\d{2}\/\d\d\/\d\d,short_date"]) \
...     .setDelimiter(",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")
- class RegexMatcherModel(classname='com.johnsnowlabs.nlp.annotators.RegexMatcherModel', java_model=None)
Instantiated model of the RegexMatcher.
This is the instantiated model of the RegexMatcher. For training your own model, please see the documentation of that class.
Input Annotation types | Output Annotation type
DOCUMENT | CHUNK
- Parameters:
- None