sparknlp.annotator.matcher.regex_matcher

Contains classes for the RegexMatcher.

Module Contents

Classes

RegexMatcher

Uses rules to match a set of regular expressions and associate them with a provided identifier.

RegexMatcherModel

Instantiated model of the RegexMatcher.

class RegexMatcher

Uses rules to match a set of regular expressions and associate them with a provided identifier.

A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be “\d{4}\/\d\d\/\d\d,date”, which will match strings like “1970/01/01” to the identifier “date”.
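As a rough illustration of the rule format only (not of how Spark NLP parses rules internally), the sketch below splits a rule at the first delimiter and applies the pattern with Python's re module. The helper name split_rule is hypothetical, trimming whitespace around both parts is an assumption, and Python's regex dialect is assumed to behave like the Java one for this simple pattern.

```python
import re


def split_rule(rule, delimiter=","):
    """Split a rule string into (pattern, identifier) at the first delimiter.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    pattern, sep, identifier = rule.partition(delimiter)
    if not sep:
        raise ValueError(f"delimiter {delimiter!r} not found in rule {rule!r}")
    return pattern.strip(), identifier.strip()


# Raw string so that the backslashes reach the regex engine unchanged.
pattern, identifier = split_rule(r"\d{4}\/\d\d\/\d\d,date")

match = re.search(pattern, "Unix time starts on 1970/01/01.")
matched_text = match.group(0) if match else None
print(matched_text, "->", identifier)
```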

Rules must be provided by either setRules() (followed by setDelimiter()) or an external file.

To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules(). The dictionary can be set in the form of a delimited text file.

Pretrained pipelines are available for this module, see Pipelines.

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT

Output Annotation type: CHUNK

Parameters:
strategy

Can be either MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE, by default “MATCH_ALL”

rules

Regex rules to match the identifier with

delimiter

Delimiter for rules provided with setRules

externalRules

External resource containing the rules; requires ‘delimiter’ in options

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the rules.txt has the form of:

the\s\w+, followed by 'the'
ceremonies, ceremony

where each regex is separated from its identifier by the delimiter ","

>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> regexMatcher = RegexMatcher() \
...     .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")
>>> pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])
>>> data = spark.createDataFrame([[
...     "My first sentence with the first rule. This is my second sentence with ceremonies rule."
... ]]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+

setStrategy(value)

Sets matching strategy, by default “MATCH_ALL”.

Can be either MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE.

Parameters:
value : str

Matching Strategy
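The three strategy names are listed here without a definition. The sketch below shows one plausible reading in plain Python: MATCH_ALL returns every match, MATCH_FIRST only the first one, and MATCH_COMPLETE a match only when the pattern covers the whole input. These semantics are an assumption rather than something this page confirms, and apply_strategy is a hypothetical helper, not a Spark NLP function.

```python
import re


def apply_strategy(pattern, text, strategy="MATCH_ALL"):
    """Return matched substrings under an assumed reading of each strategy."""
    if strategy == "MATCH_ALL":
        # Every non-overlapping match in the text.
        return [m.group(0) for m in re.finditer(pattern, text)]
    if strategy == "MATCH_FIRST":
        # Only the first match, if there is one.
        m = re.search(pattern, text)
        return [m.group(0)] if m else []
    if strategy == "MATCH_COMPLETE":
        # Only when the pattern matches the entire text.
        m = re.fullmatch(pattern, text)
        return [m.group(0)] if m else []
    raise ValueError(f"unknown strategy: {strategy!r}")


text = "the first and the second"
all_matches = apply_strategy(r"the\s\w+", text, "MATCH_ALL")
first_match = apply_strategy(r"the\s\w+", text, "MATCH_FIRST")
complete_match = apply_strategy(r"the\s\w+", text, "MATCH_COMPLETE")
```

Under this reading, MATCH_COMPLETE is the strictest option and suits inputs that should consist of nothing but the pattern.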

setExternalRules(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})

Sets the external resource containing the rules; requires ‘delimiter’ in options.

Exactly one of the parameters rules and externalRules must be set.

Parameters:
path : str

Path to the source files

delimiter : str

Delimiter for the dictionary file. Can also be set in options.

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {“format”: “text”}
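A minimal sketch of preparing such a delimited text file with the standard library follows. It writes the two rules from the rules.txt example above, one per line; the temporary directory and the variable names are illustrative assumptions, and the Spark NLP call is left as a comment so that the snippet runs without a Spark session.

```python
import os
import tempfile

# The two rules from the rules.txt example, as (pattern, identifier) pairs.
rules = [
    (r"the\s\w+", "followed by 'the'"),
    ("ceremonies", "ceremony"),
]
delimiter = ","

# Write one rule per line: pattern, delimiter, identifier.
rules_dir = tempfile.mkdtemp()
rules_path = os.path.join(rules_dir, "rules.txt")
with open(rules_path, "w", encoding="utf-8") as f:
    for pattern, identifier in rules:
        f.write(f"{pattern}{delimiter} {identifier}\n")

# Read the file back to check what was written.
with open(rules_path, encoding="utf-8") as f:
    written_lines = f.read().splitlines()

# With Spark NLP installed, the file could then be passed on, for example:
# regexMatcher = RegexMatcher().setExternalRules(rules_path, delimiter)
```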

setRules(value)

Sets the regex rules to match the identifier with.

Each rule must consist of a regex pattern and an identifier for that pattern. The pattern and the identifier must be separated by a delimiter character, which must also be set with setDelimiter.

Exactly one of the parameters rules and externalRules must be set.

Parameters:
value : List[str]

List of rules

Examples

>>> regexMatcher = RegexMatcher() \
...     .setRules([r"\d{4}\/\d\d\/\d\d,date", r"\d{2}\/\d\d\/\d\d,short_date"]) \
...     .setDelimiter(",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")

setDelimiter(value)

Sets the delimiter for rules.

Parameters:
value : str

Delimiter for the rules
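This page does not say what happens when the delimiter character also occurs inside a pattern, so it seems safest to pick a delimiter that appears in neither part of any rule. The sketch below assumes that approach; build_rules is a hypothetical helper, not part of Spark NLP.

```python
def build_rules(pairs, delimiter):
    """Join (pattern, identifier) pairs, rejecting ambiguous delimiters."""
    rules = []
    for pattern, identifier in pairs:
        if delimiter in pattern or delimiter in identifier:
            raise ValueError(
                f"delimiter {delimiter!r} also occurs in rule "
                f"({pattern!r}, {identifier!r}); choose another one"
            )
        rules.append(f"{pattern}{delimiter}{identifier}")
    return rules


# The quantifier {1,3} contains a comma, so "," would be ambiguous; ";" is not.
pairs = [(r"\d{1,3}%", "percentage")]
rules = build_rules(pairs, ";")

try:
    build_rules(pairs, ",")
    comma_rejected = False
except ValueError:
    comma_rejected = True
```

The resulting list could then be passed to setRules, with the same delimiter given to setDelimiter.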

class RegexMatcherModel(classname='com.johnsnowlabs.nlp.annotators.RegexMatcherModel', java_model=None)

Instantiated model of the RegexMatcher.

To train your own model, see the documentation of RegexMatcher.

Input Annotation types: DOCUMENT

Output Annotation type: CHUNK

Parameters:
None