`sparknlp.annotator.sentence.sentence_detector`#

Contains classes for the SentenceDetector.

Module Contents#

Classes#

`SentenceDetectorParams`	Base class for SentenceDetector parameters
`SentenceDetector`	Annotator that detects sentence boundaries using regular expressions.

class SentenceDetectorParams[source]#: Base class for SentenceDetector parameters

class SentenceDetector[source]#

Annotator that detects sentence boundaries using regular expressions.

The following patterns are considered when detecting sentence boundaries:

Lists (“(i), (ii)”, “(a), (b)”, “1., 2.”)
Numbers
Abbreviations
Punctuations
Multiple Periods
Geo-Locations/Coordinates (“N°. 1026.253.553.”)
Ellipsis (“…”)
In-between punctuations
Quotation marks
Exclamation Points
Basic Breakers (“.”, “;”)

For the explicit regular expressions used for detection, refer to the source of PragmaticContentFormatter.

To add additional custom bounds, the parameter customBounds can be set with an array:

>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setCustomBounds(["\n\n"])

If only the custom bounds should be used, set the parameter useCustomBoundsOnly to True.

By default, all sentences extracted from a document are returned together in one array. Setting explodeSentences to True returns each sentence in its own row instead.
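As a minimal sketch of combining these options, the hypothetical helper below configures a detector that splits only on blank lines and returns one sentence per row. It uses only the setters documented on this page. The function name is an assumption made for illustration, and the import is deferred so that the snippet can be loaded without a running Spark session.

```python
def build_blank_line_detector():
    """Build a SentenceDetector that splits only on blank lines.

    Hypothetical helper for illustration. Calling it requires an active
    Spark session (for example one created with sparknlp.start()).
    """
    # Deferred import: Spark NLP is only needed when the helper is called.
    from sparknlp.annotator import SentenceDetector

    return (
        SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentence")
        .setCustomBounds(["\n\n"])     # treat a blank line as a sentence bound
        .setUseCustomBoundsOnly(True)  # ignore the built-in boundary rules
        .setExplodeSentences(True)     # one sentence per output row
    )
```

The returned annotator can be placed in a Pipeline after a DocumentAssembler, in the same way as in the Examples section.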

For extended examples of usage, see the Examples.

Input Annotation types	Output Annotation type
`DOCUMENT`	`DOCUMENT`

Parameters:

useAbbreviations

whether to apply abbreviations at sentence detection, by default True

customBounds

characters used to explicitly mark sentence bounds, by default []

useCustomBoundsOnly

Only utilize custom bounds in sentence detection, by default False

customBoundsStrategy

Sets how to return matched custom bounds, by default “none”.

Will have no effect if no custom bounds are used. Possible values are:

“none” - Will not return the matched bound
“prepend” - Prepends a sentence break to the match
“append” - Appends a sentence break to the match

explodeSentences

whether to explode each sentence into a different row, for better parallelization, by default False

splitLength

length at which sentences will be forcibly split

minLength

Set the minimum allowed length for each sentence, by default 0

maxLength

Set the maximum allowed length for each sentence, by default 99999

detectLists

whether to detect lists during sentence detection, by default True

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setCustomBounds(["\n\n"])
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence
... ])
>>> data = spark.createDataFrame([["This is my first sentence. This my second.\n\nHow about a third?"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(sentence) as sentences").show(truncate=False)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+

setCustomBounds(value)[source]#

Sets characters used to explicitly mark sentence bounds, by default [].

Parameters:

value (List[str]): Characters used to explicitly mark sentence bounds

setCustomBoundsStrategy(value)[source]#

Sets how to return matched custom bounds, by default “none”.

Will have no effect if no custom bounds are used. Possible values are:

“none” - Will not return the matched bound
“prepend” - Prepends a sentence break to the match
“append” - Appends a sentence break to the match

Parameters:

value (str): Strategy to use
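The three strategies differ only in where the break falls relative to the matched bound. The following pure-Python sketch illustrates that idea and is not Spark NLP's implementation. The helper `split_on_bounds` is hypothetical, and it assumes that each custom bound is interpreted as a regular expression and that surrounding whitespace is trimmed from every piece.

```python
import re
from typing import List


def split_on_bounds(text: str, bounds: List[str], strategy: str = "none") -> List[str]:
    """Split text at custom bounds (illustrative sketch only)."""
    if strategy not in ("none", "prepend", "append"):
        raise ValueError(f"unknown strategy: {strategy}")

    # Assumption: every bound is a regular expression.
    pattern = re.compile("|".join(f"(?:{b})" for b in bounds))
    pieces = []
    start = 0
    for match in pattern.finditer(text):
        if strategy == "none":
            # The matched bound is dropped from the output.
            end, next_start = match.start(), match.end()
        elif strategy == "prepend":
            # Break before the match: the bound opens the next sentence.
            end, next_start = match.start(), match.start()
        else:
            # "append": break after the match, the bound closes this sentence.
            end, next_start = match.end(), match.end()
        pieces.append(text[start:end])
        start = next_start
    pieces.append(text[start:])
    # Drop empty pieces and trim whitespace (an assumption of this sketch).
    return [p.strip() for p in pieces if p.strip()]


text = "First part; second part; third"
print(split_on_bounds(text, [";"], "none"))
print(split_on_bounds(text, [";"], "prepend"))
print(split_on_bounds(text, [";"], "append"))
```

With "none" the semicolons disappear, with "prepend" each one starts the following piece, and with "append" each one ends the preceding piece.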

setUseAbbreviations(value)[source]#

Sets whether to apply abbreviations at sentence detection, by default True

Parameters:

value (bool): Whether to apply abbreviations at sentence detection

setDetectLists(value)[source]#

Sets whether to detect lists during sentence detection, by default True

Parameters:

value (bool): Whether to detect lists during sentence detection

setUseCustomBoundsOnly(value)[source]#

Sets whether to only utilize custom bounds in sentence detection, by default False.

Parameters:

value (bool): Whether to only utilize custom bounds

setExplodeSentences(value)[source]#

Sets whether to explode each sentence into a different row, for better parallelization, by default False.

Parameters:

value (bool): Whether to explode each sentence into a different row

setSplitLength(value)[source]#

Sets length at which sentences will be forcibly split.

Parameters:

value (int): Length at which sentences will be forcibly split.
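This page does not say where inside an over-long sentence the forced split happens. The sketch below shows one plausible reading, under the assumption that splits fall at spaces so that words stay intact. The helper `split_long_sentence` is hypothetical and does not reproduce the library's code.

```python
from typing import List


def split_long_sentence(sentence: str, split_length: int) -> List[str]:
    """Greedily pack words into chunks of at most split_length characters.

    Illustrative sketch only. A single word longer than split_length is
    kept whole, so such a chunk can exceed the limit.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in sentence.split(" "):
        # A joining space is needed only if the chunk already has a word.
        added = len(word) + (1 if current else 0)
        if current and current_len + added > split_length:
            # The word does not fit: close the chunk and start a new one.
            chunks.append(" ".join(current))
            current = [word]
            current_len = len(word)
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_long_sentence("one two three four five", 9))
```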

setMinLength(value)[source]#

Sets minimum allowed length for each sentence, by default 0

Parameters:

value (int): Minimum allowed length for each sentence

setMaxLength(value)[source]#

Sets the maximum allowed length for each sentence, by default 99999

Parameters:

value (int): Maximum allowed length for each sentence
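One way to picture minLength and maxLength together is as a length filter applied to the detected sentences. The sketch below assumes that sentences outside the allowed range are dropped rather than truncated and that both bounds are inclusive. Neither detail is stated on this page, and `filter_by_length` is a hypothetical helper.

```python
from typing import List


def filter_by_length(
    sentences: List[str], min_length: int = 0, max_length: int = 99999
) -> List[str]:
    """Keep sentences whose character count lies in [min_length, max_length].

    Illustrative sketch only. The defaults mirror the documented defaults
    of minLength and maxLength.
    """
    return [s for s in sentences if min_length <= len(s) <= max_length]


sentences = ["Hi.", "A longer sentence.", "x"]
print(filter_by_length(sentences, min_length=2, max_length=10))
```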
