`sparknlp.annotator.matcher.regex_matcher`#

Contains classes for the RegexMatcher.

Module Contents#

Classes#

`RegexMatcher`	Uses rules to match a set of regular expressions and associate them with a provided identifier.
`RegexMatcherModel`	Instantiated model of the RegexMatcher.

class RegexMatcher[source]#

Uses rules to match a set of regular expressions and associate them with a provided identifier.

A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example is “\d{4}\/\d\d\/\d\d,date”, which matches strings like “1970/01/01” and associates them with the identifier “date”.
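To make the rule format concrete, here is a minimal sketch in plain Python. The `split_rule` helper is hypothetical and not part of Spark NLP, and splitting at the first delimiter is an assumption. Python's `re` module stands in for the JVM regex engine the annotator runs on.

```python
import re


def split_rule(rule, delimiter=","):
    """Split a 'pattern<delimiter>identifier' rule into its two parts.

    Assumption: the first occurrence of the delimiter separates the
    pattern from the identifier.
    """
    pattern, _, identifier = rule.partition(delimiter)
    return pattern, identifier.strip()


# The rule quoted in the text above.
pattern, identifier = split_rule(r"\d{4}\/\d\d\/\d\d,date")

# Search a sample string for the pattern.
match = re.search(pattern, "Unix time starts on 1970/01/01.")
print(identifier, match.group(0))
```

A pattern that itself contains the delimiter (for example a quantifier such as `{1,3}` with a comma delimiter) would be split in the wrong place by this helper, so choosing a delimiter that does not occur in the patterns seems the safer option.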

Rules must be provided by either setRules() (followed by setDelimiter()) or an external file.

To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules(). The dictionary can be set in the form of a delimited text file.
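Before handing a delimited rules file to setExternalRules(), it can help to sanity-check it locally. The sketch below is an illustration only: `check_rules_file` is a hypothetical helper, skipping blank lines is its own choice, and compiling with Python's `re` is just a rough syntax check because the annotator itself uses the JVM regex engine.

```python
import re
import tempfile
from pathlib import Path


def check_rules_file(path, delimiter=","):
    """Return (pattern, identifier) pairs, raising ValueError on a malformed line."""
    rules = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # this helper ignores blank lines
        pattern, found, identifier = line.partition(delimiter)
        if not found or not identifier.strip():
            raise ValueError(f"line {number}: expected 'pattern{delimiter}identifier'")
        re.compile(pattern)  # raises re.error if the pattern is not valid for Python's re
        rules.append((pattern, identifier.strip()))
    return rules


# Write a small rules file into a temporary directory and check it.
with tempfile.TemporaryDirectory() as tmp:
    rules_path = Path(tmp) / "rules.txt"
    rules_path.write_text(
        "the\\s\\w+, followed by 'the'\nceremonies, ceremony\n", encoding="utf-8"
    )
    rules = check_rules_file(rules_path)

print(rules)
```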

Pretrained pipelines are available for this module, see Pipelines.

For extended examples of usage, see the Examples.

Input Annotation types	Output Annotation type
`DOCUMENT`	`CHUNK`

Parameters:

strategy: Can be either MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE, by default “MATCH_ALL”
rules: Regex rules to match the identifier with
delimiter: Delimiter for rules provided with setRules
externalRules: external resource to rules, needs ‘delimiter’ in options

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()

In this example, the rules.txt has the form of:

the\s\w+, followed by 'the'
ceremonies, ceremony

where each regex is separated from its identifier by the delimiter ","

>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> regexMatcher = RegexMatcher() \
...     .setExternalRules("src/test/resources/regex-matcher/rules.txt", ",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")
>>> pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])
>>> data = spark.createDataFrame([[
...     "My first sentence with the first rule. This is my second sentence with ceremonies rule."
... ]]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+
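The begin and end values in this output can be reproduced outside Spark. The sketch below uses Python's `re` on the same text and rules. It assumes, based on the rows shown, that offsets are relative to the whole document text and that the end offset is inclusive.

```python
import re

text = (
    "My first sentence with the first rule. "
    "This is my second sentence with ceremonies rule."
)

# (pattern, identifier) pairs taken from the rules.txt shown in the example.
rules = [(r"the\s\w+", "followed by 'the'"), ("ceremonies", "ceremony")]

chunks = []
for pattern, identifier in rules:
    for m in re.finditer(pattern, text):
        # Python's m.end() is exclusive; subtract 1 to get an inclusive end offset.
        chunks.append((m.start(), m.end() - 1, m.group(0), identifier))

chunks.sort()
for chunk in chunks:
    print(chunk)
```

Under these assumptions the two tuples carry the same begin, end, text, and identifier values as the two rows of the table.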

setStrategy(value)[source]#

Sets matching strategy, by default “MATCH_ALL”.

Can be either MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE.

Parameters:

value (str): Matching strategy
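This page does not spell out what the three strategies do. The sketch below emulates one plausible reading in plain Python: MATCH_ALL returns every match, MATCH_FIRST returns only the first, and MATCH_COMPLETE returns a match only when the pattern covers the entire input. Both that reading and the `apply_strategy` helper are assumptions for illustration, not the library's implementation.

```python
import re


def apply_strategy(pattern, text, strategy="MATCH_ALL"):
    """Emulate the assumed semantics of the three strategies with Python's re."""
    if strategy == "MATCH_ALL":
        # Every non-overlapping match in the text.
        return [m.group(0) for m in re.finditer(pattern, text)]
    if strategy == "MATCH_FIRST":
        # Only the first match, if any.
        m = re.search(pattern, text)
        return [m.group(0)] if m else []
    if strategy == "MATCH_COMPLETE":
        # A match only if the pattern covers the whole text.
        m = re.fullmatch(pattern, text)
        return [m.group(0)] if m else []
    raise ValueError(f"unknown strategy: {strategy}")


date = r"\d{4}/\d\d/\d\d"
two_dates = "1970/01/01 and 1999/12/31"

all_hits = apply_strategy(date, two_dates, "MATCH_ALL")
first_hit = apply_strategy(date, two_dates, "MATCH_FIRST")
complete_miss = apply_strategy(date, two_dates, "MATCH_COMPLETE")
complete_hit = apply_strategy(date, "1970/01/01", "MATCH_COMPLETE")
print(all_hits, first_hit, complete_miss, complete_hit)
```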

setExternalRules(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets the external resource containing the rules; requires ‘delimiter’ in options.

Set only one of the parameters rules or externalRules, not both.

Parameters:

path (str): Path to the source files
delimiter (str): Delimiter for the dictionary file. Can also be set in options.
read_as (str, optional): How to read the file, by default ReadAs.TEXT
options (dict, optional): Options to read the resource, by default {“format”: “text”}

setRules(value)[source]#

Sets the regex rules to match the identifier with.

Each rule must consist of a regex pattern and an identifier for that pattern. The pattern and the identifier must be separated by a delimiter character, which must also be set with setDelimiter.

Set only one of the parameters rules or externalRules, not both.

Parameters:

value (List[str]): List of rules

Examples

>>> regexMatcher = RegexMatcher() \
...     .setRules([r"\d{4}\/\d\d\/\d\d,date", r"\d{2}\/\d\d\/\d\d,short_date"]) \
...     .setDelimiter(",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")
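The two rules in this example overlap: the short_date pattern can also match inside a four-digit-year date. The sketch below shows this with Python's `re`, which should agree with the JVM regex engine for patterns this simple, though that is an assumption. The word-boundary variant is a hypothetical tightening, not something this page prescribes.

```python
import re

rules = [r"\d{4}\/\d\d\/\d\d,date", r"\d{2}\/\d\d\/\d\d,short_date"]
text = "Released on 1970/01/01."

hits = []
for rule in rules:
    # Split each rule into pattern and identifier at the "," delimiter.
    pattern, _, identifier = rule.partition(",")
    hits += [(identifier, m.group(0)) for m in re.finditer(pattern, text)]
print(hits)  # short_date also fires on the tail of the long date

# Hypothetical stricter pattern: word boundaries keep short_date from
# matching inside a longer run of digits.
strict_short_date = r"\b\d{2}\/\d\d\/\d\d\b"
strict_on_long = re.findall(strict_short_date, text)
strict_on_short = re.findall(strict_short_date, "Released on 70/01/01.")
print(strict_on_long, strict_on_short)
```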

setDelimiter(value)[source]#

Sets the delimiter for rules.

Parameters:

value (str): Delimiter for the rules

class RegexMatcherModel(classname='com.johnsnowlabs.nlp.annotators.RegexMatcherModel', java_model=None)[source]#

Instantiated model of the RegexMatcher.

This is the instantiated model of the RegexMatcher. For training your own model, please see the documentation of that class.

Input Annotation types	Output Annotation type
`DOCUMENT`	`CHUNK`

Parameters:

None
