sparknlp.annotator.sentence.sentence_detector_dl#

Contains classes for SentenceDetectorDL.

Module Contents#

Classes#

SentenceDetectorDLApproach

Trains an annotator that detects sentence boundaries using a deep learning approach.

SentenceDetectorDLModel

Annotator that detects sentence boundaries using a deep learning approach.

class SentenceDetectorDLApproach(classname='com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach')[source]#

Trains an annotator that detects sentence boundaries using a deep learning approach.

Currently, only the CNN model is supported for training, but in the future the architecture of the model can be set with setModel().

For pretrained models see SentenceDetectorDLModel.

Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to True.

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Parameters:
modelArchitecture

Model architecture (CNN)

impossiblePenultimates

Impossible penultimates - list of strings which a sentence can’t end with

validationSplit

Proportion of the training dataset to be validated against the model on each epoch, by default 0.0 and off

epochsNumber

Number of epochs for the optimization process

outputLogsPath

Path to folder where logs will be saved. If no path is specified, no logs are generated

explodeSentences

Whether to explode each sentence into a different row, for better parallelization. Defaults to False.

References

The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed) and uses a CNN architecture. The original implementation was also modified slightly to cover broken sentences and some impossible end-of-line characters.

Examples

The training process needs data, where each data point is a sentence. In this example the file train.txt has the following form:

...
Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
...

where each line is one sentence.

Training can then be started like so:

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> trainingData = spark.read.text("train.txt").toDF("text")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetectorDLApproach() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setEpochsNumber(100)
>>> pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])
>>> model = pipeline.fit(trainingData)
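
The remaining training parameters documented above can be set in the same way. A minimal sketch with illustrative values (the log path and penultimate list are examples, not defaults):

>>> sentenceDetector = SentenceDetectorDLApproach() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setEpochsNumber(100) \
...     .setValidationSplit(0.1) \
...     .setOutputLogsPath("training_logs") \
...     .setImpossiblePenultimates(["Mr", "Dr"]) \
...     .setExplodeSentences(False)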
setModel(model_architecture)[source]#

Sets the Model architecture. Currently only "cnn" is available.

Parameters:
model_architecture : str

Model architecture

setValidationSplit(validation_split)[source]#

Sets the proportion of training dataset to be validated against the model on each Epoch, by default it is 0.0 and off. The value should be between 0.0 and 1.0.

Parameters:
validation_split : float

Proportion of training dataset to be validated

setEpochsNumber(epochs_number)[source]#

Sets number of epochs to train.

Parameters:
epochs_number : int

Number of epochs

setOutputLogsPath(output_logs_path)[source]#

Sets folder path to save training logs.

Parameters:
output_logs_path : str

Folder path to save training logs

setImpossiblePenultimates(impossible_penultimates)[source]#

Sets impossible penultimates - list of strings which a sentence can’t end with.

Parameters:
impossible_penultimates : List[str]

List of strings which a sentence can’t end with

setExplodeSentences(value)[source]#

Sets whether to explode each sentence into a different row, for better parallelization, by default False.

Parameters:
value : bool

Whether to explode each sentence into a different row

class SentenceDetectorDLModel(classname='com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel', java_model=None)[source]#

Annotator that detects sentence boundaries using a deep learning approach.

Instantiated Model of the SentenceDetectorDLApproach. Detects sentence boundaries using a deep learning approach.

Pretrained models can be loaded with pretrained() of the companion object:

>>> sentenceDL = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL")

The default model is "sentence_detector_dl", if no name is provided. For available pretrained models please see the Models Hub.

Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to True.
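
For example, sentence explosion can be enabled directly on the loaded model; a minimal sketch using the setter documented below:

>>> sentenceDL = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL") \
...     .setExplodeSentences(True)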

For extended examples of usage, see the Examples.

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Parameters:
modelArchitecture

Model architecture (CNN)

explodeSentences

Whether to explode each sentence into a different row, for better parallelization. Defaults to False.

customBounds

Characters used to explicitly mark sentence bounds, by default []

useCustomBoundsOnly

Only utilize custom bounds in sentence detection, by default False

splitLength

Length at which sentences will be forcibly split

minLength

Minimum allowed length for each sentence, by default 0

maxLength

Maximum allowed length for each sentence, by default 99999

impossiblePenultimates

Impossible penultimates - list of strings which a sentence can’t end with

Examples

In this example, the normal SentenceDetector is compared to the SentenceDetectorDLModel. In a pipeline, SentenceDetectorDLModel can be used as a replacement for the SentenceDetector.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences")
>>> sentenceDL = SentenceDetectorDLModel \
...     .pretrained("sentence_detector_dl", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     sentenceDL
... ])
>>> data = spark.createDataFrame([["""John loves Mary.Mary loves Peter
...     Peter loves Helen .Helen loves John;
...     Total: four people involved."""]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(sentences.result) as sentences").show(truncate=False)
+----------------------------------------------------------+
|sentences                                                 |
+----------------------------------------------------------+
|John loves Mary.Mary loves Peter\n     Peter loves Helen .|
|Helen loves John;                                         |
|Total: four people involved.                              |
+----------------------------------------------------------+
>>> result.selectExpr("explode(sentencesDL.result) as sentencesDL").show(truncate=False)
+----------------------------+
|sentencesDL                 |
+----------------------------+
|John loves Mary.            |
|Mary loves Peter            |
|Peter loves Helen .         |
|Helen loves John;           |
|Total: four people involved.|
+----------------------------+
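
Each detected sentence is returned as a regular Spark NLP annotation, so its character offsets can be inspected as well. A minimal sketch continuing the example above:

>>> result.selectExpr("explode(sentencesDL) as s") \
...     .selectExpr("s.result", "s.begin", "s.end") \
...     .show(truncate=False)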
setModel(modelArchitecture)[source]#

Sets the Model architecture. Currently only "cnn" is available.

Parameters:
model_architecture : str

Model architecture

setExplodeSentences(value)[source]#

Sets whether to explode each sentence into a different row, for better parallelization, by default False.

Parameters:
value : bool

Whether to explode each sentence into a different row

setCustomBounds(value)[source]#

Sets characters used to explicitly mark sentence bounds, by default [].

Parameters:
value : List[str]

Characters used to explicitly mark sentence bounds

setUseCustomBoundsOnly(value)[source]#

Sets whether to only utilize custom bounds in sentence detection, by default False.

Parameters:
value : bool

Whether to only utilize custom bounds
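
Custom bounds and the bounds-only flag are typically set together. A minimal sketch with an illustrative delimiter:

>>> sentenceDL = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL") \
...     .setCustomBounds(["\n\n"]) \
...     .setUseCustomBoundsOnly(False)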

setSplitLength(value)[source]#

Sets length at which sentences will be forcibly split.

Parameters:
value : int

Length at which sentences will be forcibly split.

setMinLength(value)[source]#

Sets the minimum allowed length for each sentence, by default 0

Parameters:
value : int

Minimum allowed length for each sentence

setMaxLength(value)[source]#

Sets the maximum allowed length for each sentence, by default 99999

Parameters:
value : int

Maximum allowed length for each sentence
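
The length-related setters can be combined to constrain the produced sentences. A minimal sketch with illustrative values, not defaults:

>>> sentenceDL = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL") \
...     .setMinLength(5) \
...     .setMaxLength(512) \
...     .setSplitLength(512)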

setImpossiblePenultimates(impossible_penultimates)[source]#

Sets impossible penultimates - list of strings which a sentence can’t end with.

Parameters:
impossible_penultimates : List[str]

List of strings which a sentence can’t end with

static pretrained(name='sentence_detector_dl', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default "sentence_detector_dl"

lang : str, optional

Language of the pretrained model, by default "en"

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
SentenceDetectorDLModel

The restored model