sparknlp.annotator.sentence.sentence_detector
Contains classes for the SentenceDetector.
Module Contents
Classes
SentenceDetectorParams: Base class for SentenceDetector parameters.
SentenceDetector: Annotator that detects sentence boundaries using regular expressions.
- class SentenceDetector
Annotator that detects sentence boundaries using regular expressions.
The following characters are checked as sentence boundaries:
Lists (“(i), (ii)”, “(a), (b)”, “1., 2.”)
Numbers
Abbreviations
Punctuations
Multiple Periods
Geo-Locations/Coordinates (“N°. 1026.253.553.”)
Ellipsis (”…”)
In-between punctuations
Quotation marks
Exclamation Points
Basic Breakers (“.”, “;”)
For the explicit regular expressions used for detection, refer to the source of PragmaticContentFormatter.
To add custom bounds, set the customBounds parameter to an array:
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setCustomBounds(["\n\n"])
If only the custom bounds should be used, set the useCustomBoundsOnly parameter to True.
Each extracted sentence can be returned in an Array or exploded to separate rows if explodeSentences is set to True.
For extended examples of usage, see the Examples.
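As a rough sketch of how the two switches described here might be combined, the helper below configures a detector that splits only on blank lines and emits one row per sentence. The function name build_paragraph_splitter and its argument names are hypothetical; the setters are the ones documented on this page. The import is deferred into the function on the assumption that constructing an annotator needs an active Spark session (for example one created with sparknlp.start()).

```python
def build_paragraph_splitter(input_col="document", output_col="sentence"):
    """Hypothetical helper: a SentenceDetector that relies only on custom bounds."""
    # Imported lazily: constructing annotators is assumed to need a running
    # Spark session, so nothing Spark-related happens at definition time.
    from sparknlp.annotator import SentenceDetector

    return (
        SentenceDetector()
        .setInputCols([input_col])
        .setOutputCol(output_col)
        .setCustomBounds(["\n\n"])     # a blank line marks a sentence boundary
        .setUseCustomBoundsOnly(True)  # skip the built-in boundary rules
        .setExplodeSentences(True)     # one output row per detected sentence
    )
```

The returned annotator can be placed in a Pipeline after a DocumentAssembler, in the same way as in the Examples.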
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- useAbbreviations
whether to apply abbreviations at sentence detection, by default True
- customBounds
characters used to explicitly mark sentence bounds, by default []
- useCustomBoundsOnly
Only utilize custom bounds in sentence detection, by default False
- customBoundsStrategy
Sets how to return matched custom bounds, by default “none”.
Will have no effect if no custom bounds are used. Possible values are:
“none” - Will not return the matched bound
“prepend” - Prepends a sentence break to the match
“append” - Appends a sentence break to the match
- explodeSentences
whether to explode each sentence into a different row, for better parallelization, by default False
- splitLength
length at which sentences will be forcibly split
- minLength
Set the minimum allowed length for each sentence, by default 0
- maxLength
Set the maximum allowed length for each sentence, by default 99999
- detectLists
whether to detect lists during sentence detection, by default True
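To make the minLength and maxLength parameters concrete, here is a small pure-Python model of a length filter. It assumes that sentences outside the allowed range are dropped rather than truncated, which this page does not state explicitly; filter_by_length is a hypothetical name and not part of Spark NLP.

```python
def filter_by_length(sentences, min_length=0, max_length=99999):
    """Toy model: keep sentences whose character length is within bounds.

    Assumption: out-of-range sentences are discarded, not shortened.
    The defaults mirror the documented defaults for minLength and maxLength.
    """
    return [s for s in sentences if min_length <= len(s) <= max_length]


candidates = ["Ok.", "This is my first sentence.", "How about a third?"]
kept = filter_by_length(candidates, min_length=5, max_length=20)
# "Ok." is shorter than 5 characters and the 26-character sentence is
# longer than 20, so only the 18-character sentence remains.
```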
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setCustomBounds(["\n\n"])
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence
... ])
>>> data = spark.createDataFrame([["This is my first sentence. This my second.\n\nHow about a third?"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(sentence) as sentences").show(truncate=False)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+
- setCustomBounds(value)
Sets characters used to explicitly mark sentence bounds, by default [].
- Parameters:
- value : List[str]
Characters used to explicitly mark sentence bounds
- setCustomBoundsStrategy(value)
Sets how to return matched custom bounds, by default “none”.
Will have no effect if no custom bounds are used. Possible values are:
“none” - Will not return the matched bound
“prepend” - Prepends a sentence break to the match
“append” - Appends a sentence break to the match
- Parameters:
- value : str
Strategy to use
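The three strategies can be illustrated with a small pure-Python model. This is a sketch of the behaviour described above, not Spark NLP's implementation: split_on_bound is a hypothetical function, and treating the bound as a regular expression is an assumption.

```python
import re


def split_on_bound(text, bound, strategy="none"):
    """Toy model of customBoundsStrategy for a single bound pattern."""
    if strategy not in ("none", "prepend", "append"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    pieces, start = [], 0
    for match in re.finditer(bound, text):
        if strategy == "none":
            # The matched bound is dropped from both neighbours.
            pieces.append(text[start:match.start()])
            start = match.end()
        elif strategy == "prepend":
            # Break before the match: it begins the next sentence.
            pieces.append(text[start:match.start()])
            start = match.start()
        else:  # "append"
            # Break after the match: it ends the current sentence.
            pieces.append(text[start:match.end()])
            start = match.end()
    pieces.append(text[start:])
    return [p for p in pieces if p]  # discard empty fragments


text = "first;second;third"
none_result = split_on_bound(text, ";", "none")        # ['first', 'second', 'third']
prepend_result = split_on_bound(text, ";", "prepend")  # ['first', ';second', ';third']
append_result = split_on_bound(text, ";", "append")    # ['first;', 'second;', 'third']
```

In this model, "prepend" and "append" differ only in which side of the break keeps the matched text.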
- setUseAbbreviations(value)
Sets whether to apply abbreviations at sentence detection, by default True
- Parameters:
- value : bool
Whether to apply abbreviations at sentence detection
- setDetectLists(value)
Sets whether to detect lists during sentence detection, by default True
- Parameters:
- value : bool
Whether to detect lists during sentence detection
- setUseCustomBoundsOnly(value)
Sets whether to only utilize custom bounds in sentence detection, by default False.
- Parameters:
- value : bool
Whether to only utilize custom bounds
- setExplodeSentences(value)
Sets whether to explode each sentence into a different row, for better parallelization, by default False.
- Parameters:
- value : bool
Whether to explode each sentence into a different row
- setSplitLength(value)
Sets length at which sentences will be forcibly split.
- Parameters:
- value : int
Length at which sentences will be forcibly split.
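One way to picture forced splitting is the greedy word-packing sketch below. It assumes that splits fall on single-space word boundaries and that length is counted in characters; neither detail is confirmed by this page, and force_split is a hypothetical name rather than a Spark NLP function.

```python
def force_split(sentence, split_length):
    """Toy model: pack whole words into chunks of at most split_length chars.

    Assumption: a word longer than split_length is kept intact in its own
    chunk rather than being cut in the middle.
    """
    chunks, current, current_len = [], [], 0
    for word in sentence.split(" "):
        # Start a new chunk when adding this word would exceed the limit.
        if current and current_len + len(word) > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += len(word) + 1  # +1 for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks


parts = force_split("one two three four five", 9)
# parts == ['one two', 'three', 'four five']
```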