sparknlp.annotator.matcher.regex_matcher
Contains classes for the RegexMatcher.
Module Contents
Classes
RegexMatcher | Uses rules to match a set of regular expressions and associate them with a provided identifier.
RegexMatcherModel | Instantiated model of the RegexMatcher.
- class RegexMatcher
Uses rules to match a set of regular expressions and associate them with a provided identifier.
A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be “\d{4}\/\d\d\/\d\d,date”, which will match strings like “1970/01/01” to the identifier “date”.
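As a rough illustration of the rule format described above, the following plain-Python sketch splits a rule at its delimiter and applies the pattern with the standard `re` module. It is not Spark NLP's implementation: the helper names `parse_rule` and `match_rules` are hypothetical, the sketch assumes the delimiter occurs exactly once per rule and that surrounding whitespace is trimmed, and Python's regex dialect may differ from the one the annotator uses.

```python
import re
from typing import List, Tuple


def parse_rule(rule: str, delimiter: str = ",") -> Tuple["re.Pattern[str]", str]:
    """Split 'pattern<delimiter>identifier' into a compiled pattern and its identifier."""
    parts = rule.split(delimiter)
    # Assumption of this sketch: the delimiter appears exactly once in the rule.
    if len(parts) != 2:
        raise ValueError(f"Expected exactly one {delimiter!r} in rule: {rule!r}")
    pattern, identifier = parts
    # Assumption of this sketch: whitespace around both parts is trimmed.
    return re.compile(pattern.strip()), identifier.strip()


def match_rules(text: str, rules: List[str], delimiter: str = ",") -> List[Tuple[str, str]]:
    """Return (matched text, identifier) pairs for every match of every rule."""
    results = []
    for rule in rules:
        pattern, identifier = parse_rule(rule, delimiter)
        for match in pattern.finditer(text):
            results.append((match.group(0), identifier))
    return results


matches = match_rules("Born on 1970/01/01.", [r"\d{4}\/\d\d\/\d\d,date"])
print(matches)
```

Choosing a delimiter that cannot occur inside any pattern or identifier keeps the split unambiguous, which is why the sketch rejects rules with more than one delimiter.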
Rules must be provided by either setRules() (followed by setDelimiter()) or an external file.
To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules(). The dictionary can be set in the form of a delimited text file.
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: CHUNK
- Parameters:
- strategy
Matching strategy, one of MATCH_FIRST, MATCH_ALL, or MATCH_COMPLETE, by default “MATCH_ALL”
- rules
Regex rules to match the identifier with
- delimiter
Delimiter for rules provided with setRules
- externalRules
External resource containing the rules, needs ‘delimiter’ in options
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the rules.txt has the form of:

the\s\w+, followed by 'the'
ceremonies, ceremony

where each regex is separated from its identifier by the delimiter ",".

>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
>>> regexMatcher = RegexMatcher() \
...     .setExternalRules("src/test/resources/regex-matcher/rules.txt", ",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")
>>> pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])
>>> data = spark.createDataFrame([[
...     "My first sentence with the first rule. This is my second sentence with ceremonies rule."
... ]]).toDF("text")
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+
- setStrategy(value)
Sets matching strategy, by default “MATCH_ALL”.
Can be either MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE.
- Parameters:
- value : str
Matching Strategy
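The three strategy names can be read as follows, sketched here in plain Python with the standard `re` module. This is an interpretation and not the annotator's code: the page only lists the names, so the sketch assumes that MATCH_ALL returns every match, MATCH_FIRST returns only the first one, and MATCH_COMPLETE returns a match only when the whole input matches the pattern. The function `apply_strategy` is a hypothetical helper.

```python
import re
from typing import List


def apply_strategy(pattern: str, text: str, strategy: str = "MATCH_ALL") -> List[str]:
    """Return matched substrings according to the assumed meaning of each strategy."""
    regex = re.compile(pattern)
    if strategy == "MATCH_ALL":
        # Assumed: every non-overlapping match in the text.
        return [m.group(0) for m in regex.finditer(text)]
    if strategy == "MATCH_FIRST":
        # Assumed: only the first match, if any.
        first = regex.search(text)
        return [first.group(0)] if first else []
    if strategy == "MATCH_COMPLETE":
        # Assumed: a result only when the entire text matches the pattern.
        complete = regex.fullmatch(text)
        return [complete.group(0)] if complete else []
    raise ValueError(f"Unknown strategy: {strategy}")


text = "the first and the second"
all_matches = apply_strategy(r"the\s\w+", text, "MATCH_ALL")
first_match = apply_strategy(r"the\s\w+", text, "MATCH_FIRST")
complete_match = apply_strategy(r"the\s\w+", text, "MATCH_COMPLETE")
print(all_matches, first_match, complete_match)
```

Under these assumptions, MATCH_COMPLETE yields nothing for the sample text because the pattern covers only part of it, while the same pattern applied to just "the first" would match in full.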
- setExternalRules(path, delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})
Sets the external resource containing the rules, needs ‘delimiter’ in options.
Exactly one of the parameters rules and externalRules must be set.
- Parameters:
- path : str
Path to the source files
- delimiter : str
Delimiter for the dictionary file. Can also be set in options.
- read_as : str, optional
How to read the file, by default ReadAs.TEXT
- options : dict, optional
Options to read the resource, by default {“format”: “text”}
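A delimited rules file of the kind that setExternalRules() reads can be produced with a few lines of plain Python. The sketch below is only an illustration: `write_rules_file` is a hypothetical helper, the file is written to a temporary directory chosen for the example, and the check that the delimiter does not appear inside a pattern or identifier is a precaution of this sketch, not a documented requirement.

```python
import os
import tempfile
from typing import Iterable, List, Tuple


def write_rules_file(rules: Iterable[Tuple[str, str]], path: str, delimiter: str = ",") -> None:
    """Write one 'pattern<delimiter>identifier' rule per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for pattern, identifier in rules:
            # Precaution of this sketch: keep each line unambiguous to split.
            if delimiter in pattern or delimiter in identifier:
                raise ValueError(f"Delimiter {delimiter!r} occurs inside rule {pattern!r}")
            handle.write(f"{pattern}{delimiter}{identifier}\n")


rules: List[Tuple[str, str]] = [
    (r"the\s\w+", "followed by 'the'"),
    ("ceremonies", "ceremony"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    rules_path = os.path.join(tmp_dir, "rules.txt")
    write_rules_file(rules, rules_path, ",")
    # Read the file back to show the on-disk format.
    with open(rules_path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()

print(lines)
```

The path and the same delimiter would then be passed as `setExternalRules(rules_path, ",")`. In real use the file has to outlive the temporary directory used here, since it must still exist when the pipeline is fitted.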
- setRules(value)
Sets the regex rules to match the identifier with.
The rules must consist of a regex pattern and an identifier for that pattern. The regex pattern and the identifier must be delimited by a character that must also be set with setDelimiter.
Exactly one of the parameters rules and externalRules must be set.
- Parameters:
- value : List[str]
List of rules
Examples
>>> regexMatcher = RegexMatcher() \
...     .setRules([r"\d{4}\/\d\d\/\d\d,date", r"\d{2}\/\d\d\/\d\d,short_date"]) \
...     .setDelimiter(",") \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("regex") \
...     .setStrategy("MATCH_ALL")
- class RegexMatcherModel(classname='com.johnsnowlabs.nlp.annotators.RegexMatcherModel', java_model=None)
Instantiated model of the RegexMatcher.
This is the instantiated model of the RegexMatcher. For training your own model, please see the documentation of that class.
Input Annotation types: DOCUMENT
Output Annotation type: CHUNK
- Parameters:
- None