sparknlp.annotator.sentence.sentence_detector

Contains classes for the SentenceDetector.

Module Contents

Classes

SentenceDetectorParams    Base class for SentenceDetector parameters

SentenceDetector          Annotator that detects sentence boundaries using regular expressions.

class SentenceDetectorParams

Base class for SentenceDetector parameters

class SentenceDetector

Annotator that detects sentence boundaries using regular expressions.

The following characters are checked as sentence boundaries:

  1. Lists (“(i), (ii)”, “(a), (b)”, “1., 2.”)

  2. Numbers

  3. Abbreviations

  4. Punctuations

  5. Multiple Periods

  6. Geo-Locations/Coordinates (“N°. 1026.253.553.”)

  7. Ellipsis (“…”)

  8. In-between punctuations

  9. Quotation marks

  10. Exclamation Points

  11. Basic Breakers (“.”, “;”)

For the explicit regular expressions used for detection, refer to the source of PragmaticContentFormatter.

To add additional custom bounds, the parameter customBounds can be set with an array:

>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setCustomBounds(["\n\n"])

If only the custom bounds should be used, set the parameter useCustomBoundsOnly to True.

By default, the extracted sentences are returned together in an array. If explodeSentences is set to True, each sentence is exploded into a separate row instead.

For extended examples of usage, see the Examples.

Input Annotation types    Output Annotation type
DOCUMENT                  DOCUMENT

Parameters:
useAbbreviations

whether to apply abbreviations at sentence detection, by default True

customBounds

characters used to explicitly mark sentence bounds, by default []

useCustomBoundsOnly

Only utilize custom bounds in sentence detection, by default False

customBoundsStrategy

Sets how to return matched custom bounds, by default “none”.

Will have no effect if no custom bounds are used. Possible values are:

  • “none” - Will not return the matched bound

  • “prepend” - Prepends a sentence break to the match

  • “append” - Appends a sentence break to the match

explodeSentences

whether to explode each sentence into a different row, for better parallelization, by default False

splitLength

length at which sentences will be forcibly split

minLength

Set the minimum allowed length for each sentence, by default 0

maxLength

Set the maximum allowed length for each sentence, by default 99999

detectLists

whether to detect lists during sentence detection, by default True

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence") \
...     .setCustomBounds(["\n\n"])
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence
... ])
>>> data = spark.createDataFrame([["This is my first sentence. This my second.\n\nHow about a third?"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(sentence) as sentences").show(truncate=False)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+

setCustomBounds(value)

Sets characters used to explicitly mark sentence bounds, by default [].

Parameters:
value : List[str]

Characters used to explicitly mark sentence bounds

setCustomBoundsStrategy(value)

Sets how to return matched custom bounds, by default “none”.

Will have no effect if no custom bounds are used. Possible values are:

  • “none” - Will not return the matched bound

  • “prepend” - Prepends a sentence break to the match

  • “append” - Appends a sentence break to the match

Parameters:
value : str

Strategy to use
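
To make the three strategies concrete, the following is a minimal pure-Python sketch. It is not Spark NLP's implementation: the function name split_on_custom_bounds and the BREAK marker are hypothetical, and the bounds are treated as regular-expression patterns, which is an assumption of this sketch.

```python
import re

# Hypothetical break marker used only inside this sketch.
BREAK = "\x00"


def split_on_custom_bounds(text, bounds, strategy="none"):
    """Split text at every match of the given bounds.

    Assumption: each bound is a regular-expression pattern.
    """
    pattern = "|".join(f"(?:{b})" for b in bounds)

    if strategy == "none":
        # The matched bound is replaced by the break and is not returned.
        def repl(match):
            return BREAK
    elif strategy == "prepend":
        # The break goes before the match, so the match starts the next sentence.
        def repl(match):
            return BREAK + match.group(0)
    elif strategy == "append":
        # The break goes after the match, so the match ends the current sentence.
        def repl(match):
            return match.group(0) + BREAK
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    marked = re.sub(pattern, repl, text)
    # Drop empty pieces produced by a bound at the start or end of the text.
    return [piece for piece in marked.split(BREAK) if piece]


print(split_on_custom_bounds("a;b;c", [";"], "none"))
print(split_on_custom_bounds("a;b;c", [";"], "prepend"))
print(split_on_custom_bounds("a;b;c", [";"], "append"))
```

In this sketch, "none" discards the separator, "prepend" keeps it at the start of the following piece, and "append" keeps it at the end of the preceding piece.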

setUseAbbreviations(value)

Sets whether to apply abbreviations at sentence detection, by default True

Parameters:
value : bool

Whether to apply abbreviations at sentence detection

setDetectLists(value)

Sets whether to detect lists during sentence detection, by default True

Parameters:
value : bool

Whether to detect lists during sentence detection

setUseCustomBoundsOnly(value)

Sets whether to only utilize custom bounds in sentence detection, by default False.

Parameters:
value : bool

Whether to only utilize custom bounds

setExplodeSentences(value)

Sets whether to explode each sentence into a different row, for better parallelization, by default False.

Parameters:
value : bool

Whether to explode each sentence into a different row
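
The difference between the array form and the exploded form can be pictured with plain Python lists. This is only an illustration of the resulting row shape, not Spark code; explode_sentences and the sample rows are hypothetical.

```python
def explode_sentences(rows):
    """Turn (doc_id, [sentences]) rows into one (doc_id, sentence) row per sentence."""
    return [(doc_id, sentence) for doc_id, sentences in rows for sentence in sentences]


# Array form: one row per document, holding all of its sentences.
rows = [
    ("doc1", ["First sentence.", "Second one."]),
    ("doc2", ["Only one."]),
]

# Exploded form: one row per sentence.
exploded = explode_sentences(rows)
for row in exploded:
    print(row)
```

Two document rows become three sentence rows here, which is what lets later stages work on sentences independently.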

setSplitLength(value)

Sets length at which sentences will be forcibly split.

Parameters:
value : int

Length at which sentences will be forcibly split.
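
One way to picture a forced split is the sketch below. It assumes that an over-long sentence is broken at whitespace and that words are packed greedily into chunks no longer than the limit; the library may use a different rule. The name force_split is hypothetical.

```python
def force_split(sentence, split_length):
    """Greedily pack whitespace-separated words into chunks of at most split_length.

    Assumption of this sketch: a single word longer than split_length is kept
    whole as its own chunk rather than being cut in the middle.
    """
    if len(sentence) <= split_length:
        return [sentence]

    chunks = []
    current = ""
    for word in sentence.split():
        candidate = word if not current else current + " " + word
        if current and len(candidate) > split_length:
            # Adding this word would exceed the limit: close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(force_split("aaa bbb ccc ddd", 7))
```

Sentences that already fit within the limit are returned unchanged in this sketch.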

setMinLength(value)

Sets minimum allowed length for each sentence, by default 0

Parameters:
value : int

Minimum allowed length for each sentence

setMaxLength(value)

Sets the maximum allowed length for each sentence, by default 99999

Parameters:
value : int

Maximum allowed length for each sentence
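
The combined effect of minLength and maxLength can be sketched as a simple filter. This assumes that both limits are inclusive and that sentences outside the range are dropped rather than truncated; filter_by_length is a hypothetical helper, not part of the library.

```python
def filter_by_length(sentences, min_length=0, max_length=99999):
    """Keep only sentences whose character length lies within [min_length, max_length]."""
    return [s for s in sentences if min_length <= len(s) <= max_length]


sentences = ["Hi.", "This is my first sentence.", "How about a third?"]
print(filter_by_length(sentences, min_length=5, max_length=20))
```

With the defaults taken from the parameter descriptions above (0 and 99999), the filter keeps every sentence in this example.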