sparknlp.annotator.sentence.sentence_detector_dl#
Contains classes for SentenceDetectorDL.
Module Contents#
Classes#
- SentenceDetectorDLApproach
Trains an annotator that detects sentence boundaries using a deep learning approach.
- SentenceDetectorDLModel
Annotator that detects sentence boundaries using a deep learning approach.
- class SentenceDetectorDLApproach(classname='com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach')[source]#
Trains an annotator that detects sentence boundaries using a deep learning approach.
Currently, only the CNN model is supported for training, but in the future the architecture of the model can be set with setModel(). For pretrained models see SentenceDetectorDLModel.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to True.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- modelArchitecture
Model architecture (CNN)
- impossiblePenultimates
Impossible penultimates, a list of strings which a sentence can’t end with
- validationSplit
Choose the proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0; by default it is 0.0 and validation is off
- epochsNumber
Number of epochs for the optimization process
- outputLogsPath
Path to folder where logs will be saved. If no path is specified, no logs are generated
- explodeSentences
Whether to explode each sentence into a different row, for better parallelization. Defaults to False.
References
The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed) using a CNN architecture. We also modified the original implementation slightly to cover broken sentences and some impossible end-of-line characters.
Examples
The training process needs data, where each data point is a sentence. In this example the train.txt file has the form of:
...
Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
...
where each line is one sentence.
Training can then be started like so:
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> trainingData = spark.read.text("train.txt").toDF("text")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetectorDLApproach() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setEpochsNumber(100)
>>> pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])
>>> model = pipeline.fit(trainingData)
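Once fitted, the detector is a regular Spark ML stage, so it can be persisted and reloaded with the standard Spark ML writer and reader. A minimal sketch (the save path is illustrative):
>>> trainedDetector = model.stages[1]
>>> trainedDetector.write().overwrite().save("./sentence_detector_dl_model")
>>> loadedDetector = SentenceDetectorDLModel.load("./sentence_detector_dl_model")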
- setModel(model_architecture)[source]#
Sets the Model architecture. Currently only "cnn" is available.
- Parameters:
- model_architecture : str
Model architecture
- setValidationSplit(validation_split)[source]#
Sets the proportion of the training dataset to be validated against the model on each epoch. By default this is 0.0, meaning validation is off. The value should be between 0.0 and 1.0.
- Parameters:
- validation_split : float
Proportion of training dataset to be validated
- setEpochsNumber(epochs_number)[source]#
Sets the number of epochs to train.
- Parameters:
- epochs_number : int
Number of epochs
- setOutputLogsPath(output_logs_path)[source]#
Sets folder path to save training logs.
- Parameters:
- output_logs_path : str
Folder path to save training logs
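The setters above compose in the usual Spark NLP builder pattern. A sketch (epoch count, split, and log path are illustrative) that validates on 20% of the data each epoch and writes training logs to a folder:
>>> sentenceDetector = SentenceDetectorDLApproach() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setModel("cnn") \
...     .setEpochsNumber(20) \
...     .setValidationSplit(0.2) \
...     .setOutputLogsPath("sentence_detector_logs")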
- class SentenceDetectorDLModel(classname='com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel', java_model=None)[source]#
Annotator that detects sentence boundaries using a deep learning approach.
Instantiated Model of the SentenceDetectorDLApproach. Detects sentence boundaries using a deep learning approach.
Pretrained models can be loaded with pretrained() of the companion object:
>>> sentenceDL = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL")
The default model is "sentence_detector_dl", if no name is provided. For available pretrained models please see the Models Hub.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to True.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- modelArchitecture
Model architecture (CNN)
- explodeSentences
Whether to explode each sentence into a different row, for better parallelization. Defaults to False.
- customBounds
Characters used to explicitly mark sentence bounds, by default []
- useCustomBoundsOnly
Whether to only utilize custom bounds in sentence detection, by default False
- splitLength
Length at which sentences will be forcibly split
- minLength
The minimum allowed length for each sentence, by default 0
- maxLength
The maximum allowed length for each sentence, by default 99999
- impossiblePenultimates
Impossible penultimates, a list of strings which a sentence can’t end with
Examples
In this example, the normal SentenceDetector is compared to the SentenceDetectorDLModel. In a pipeline, SentenceDetectorDLModel can be used as a replacement for the SentenceDetector.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences")
>>> sentenceDL = SentenceDetectorDLModel \
...     .pretrained("sentence_detector_dl", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     sentenceDL
... ])
>>> data = spark.createDataFrame([["""John loves Mary.Mary loves Peter
...     Peter loves Helen .Helen loves John;
...     Total: four people involved."""]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(sentences.result) as sentences").show(truncate=False)
+----------------------------------------------------------+
|sentences                                                 |
+----------------------------------------------------------+
|John loves Mary.Mary loves Peter\n     Peter loves Helen .|
|Helen loves John;                                         |
|Total: four people involved.                              |
+----------------------------------------------------------+
>>> result.selectExpr("explode(sentencesDL.result) as sentencesDL").show(truncate=False)
+----------------------------+
|sentencesDL                 |
+----------------------------+
|John loves Mary.            |
|Mary loves Peter            |
|Peter loves Helen .         |
|Helen loves John;           |
|Total: four people involved.|
+----------------------------+
- setModel(modelArchitecture)[source]#
Sets the Model architecture. Currently only "cnn" is available.
- Parameters:
- modelArchitecture : str
Model architecture
- setExplodeSentences(value)[source]#
Sets whether to explode each sentence into a different row, for better parallelization, by default False.
- Parameters:
- value : bool
Whether to explode each sentence into a different row
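With exploding enabled, each detected sentence lands in its own output row rather than in one array-valued row, which helps downstream parallelization. A minimal sketch:
>>> explodingDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setExplodeSentences(True)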
- setCustomBounds(value)[source]#
Sets characters used to explicitly mark sentence bounds, by default [].
- Parameters:
- value : List[str]
Characters used to explicitly mark sentence bounds
- setUseCustomBoundsOnly(value)[source]#
Sets whether to only utilize custom bounds in sentence detection, by default False.
- Parameters:
- value : bool
Whether to only utilize custom bounds
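setCustomBounds() and setUseCustomBoundsOnly() are typically used together. A sketch (the bound string is illustrative) that additionally treats blank lines as sentence bounds while keeping the model's own predictions:
>>> boundedDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setCustomBounds(["\n\n"]) \
...     .setUseCustomBoundsOnly(False)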
- setSplitLength(value)[source]#
Sets the length at which sentences will be forcibly split.
- Parameters:
- value : int
Length at which sentences will be forcibly split.
- setMinLength(value)[source]#
Sets the minimum allowed length for each sentence, by default 0.
- Parameters:
- value : int
Minimum allowed length for each sentence
- setMaxLength(value)[source]#
Sets the maximum allowed length for each sentence, by default 99999.
- Parameters:
- value : int
Maximum allowed length for each sentence
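setSplitLength(), setMinLength() and setMaxLength() together constrain sentence size: overly long candidates are forcibly split, and anything outside the allowed range is filtered out. A sketch with illustrative values:
>>> lengthAwareDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setSplitLength(512) \
...     .setMinLength(4) \
...     .setMaxLength(512)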
- setImpossiblePenultimates(impossible_penultimates)[source]#
Sets the impossible penultimates, a list of strings which a sentence can’t end with.
- Parameters:
- impossible_penultimates : List[str]
List of strings which a sentence can’t end with
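Impossible penultimates suppress sentence breaks immediately after the given strings, which helps with abbreviations followed by periods. A sketch (the list of strings is illustrative):
>>> abbreviationAwareDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setImpossiblePenultimates(["Mr", "Dr", "Prof", "No"])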
- static pretrained(name='sentence_detector_dl', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default “sentence_detector_dl”
- lang : str, optional
Language of the pretrained model, by default “en”
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- SentenceDetectorDLModel
The restored model
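The name and language select which checkpoint to download. For instance, a multilingual variant is published under the "xx" language code (check the Models Hub for current availability); a minimal sketch:
>>> multilingualDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences")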