sparknlp.annotator.sentence.sentence_detector_dl#
Contains classes for SentenceDetectorDL.
Module Contents#
Classes#
- SentenceDetectorDLApproach
Trains an annotator that detects sentence boundaries using a deep learning approach.
- SentenceDetectorDLModel
Annotator that detects sentence boundaries using a deep learning approach.
- class SentenceDetectorDLApproach(classname='com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach')[source]#
Trains an annotator that detects sentence boundaries using a deep learning approach.
Currently, only the CNN model is supported for training, but in the future the architecture of the model can be set with setModel(). For pretrained models see SentenceDetectorDLModel.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to True.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- modelArchitecture
Model architecture (CNN)
- impossiblePenultimates
Impossible penultimates, a list of strings which a sentence can’t end with
- validationSplit
Choose the proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0; by default it is 0.0 and validation is off
- epochsNumber
Number of epochs for the optimization process
- outputLogsPath
Path to folder where logs will be saved. If no path is specified, no logs are generated
- explodeSentences
Whether to explode each sentence into a different row, for better parallelization. Defaults to False.
References
The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed) using a CNN architecture. We also modified the original implementation slightly to cover broken sentences and some impossible end-of-line characters.
Examples
The training process needs data, where each data point is a sentence. In this example the train.txt file has the form of:
...
Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
...
where each line is one sentence.
Training can then be started like so:
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> trainingData = spark.read.text("train.txt").toDF("text")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetectorDLApproach() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setEpochsNumber(100)
>>> pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])
>>> model = pipeline.fit(trainingData)
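Once fitted, the detector is a regular Spark ML stage, so it can be persisted and reloaded with the standard Spark ML writer and reader. A minimal sketch (the save path is illustrative):
>>> trainedDetector = model.stages[1]
>>> trainedDetector.write().overwrite().save("./sentence_detector_dl_model")
>>> loadedDetector = SentenceDetectorDLModel.load("./sentence_detector_dl_model")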
- setModel(model_architecture)[source]#
Sets the Model architecture. Currently only "cnn" is available.
- Parameters:
- model_architecture : str
Model architecture
- setValidationSplit(validation_split)[source]#
Sets the proportion of the training dataset to be validated against the model on each epoch. By default this is 0.0, meaning validation is off. The value should be between 0.0 and 1.0.
- Parameters:
- validation_split : float
Proportion of training dataset to be validated
- setEpochsNumber(epochs_number)[source]#
Sets the number of epochs to train.
- Parameters:
- epochs_number : int
Number of epochs
- setOutputLogsPath(output_logs_path)[source]#
Sets folder path to save training logs.
- Parameters:
- output_logs_path : str
Folder path to save training logs
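The setters above compose in the usual Spark NLP builder pattern. A sketch (epoch count, split, and log path are illustrative) that validates on 20% of the data each epoch and writes training logs to a folder:
>>> sentenceDetector = SentenceDetectorDLApproach() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setModel("cnn") \
...     .setEpochsNumber(20) \
...     .setValidationSplit(0.2) \
...     .setOutputLogsPath("sentence_detector_logs")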
- class SentenceDetectorDLModel(classname='com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel', java_model=None)[source]#
Annotator that detects sentence boundaries using a deep learning approach.
Instantiated Model of the SentenceDetectorDLApproach. Detects sentence boundaries using a deep learning approach.
Pretrained models can be loaded with pretrained() of the companion object:
>>> sentenceDL = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL")
The default model is "sentence_detector_dl", if no name is provided. For available pretrained models please see the Models Hub.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to True.
For extended examples of usage, see the Examples.
Input Annotation types: DOCUMENT
Output Annotation type: DOCUMENT
- Parameters:
- modelArchitecture
Model architecture (CNN)
- explodeSentences
Whether to explode each sentence into a different row, for better parallelization. Defaults to False.
- customBounds
Characters used to explicitly mark sentence bounds, by default []
- useCustomBoundsOnly
Whether to only utilize custom bounds in sentence detection, by default False
- splitLength
Length at which sentences will be forcibly split
- minLength
The minimum allowed length for each sentence, by default 0
- maxLength
The maximum allowed length for each sentence, by default 99999
- impossiblePenultimates
Impossible penultimates, a list of strings which a sentence can’t end with
Examples
In this example, the normal SentenceDetector is compared to the SentenceDetectorDLModel. In a pipeline, SentenceDetectorDLModel can be used as a replacement for the SentenceDetector.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences")
>>> sentenceDL = SentenceDetectorDLModel \
...     .pretrained("sentence_detector_dl", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentencesDL")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     sentenceDL
... ])
>>> data = spark.createDataFrame([["""John loves Mary.Mary loves Peter
...     Peter loves Helen .Helen loves John;
...     Total: four people involved."""]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(sentences.result) as sentences").show(truncate=False)
+----------------------------------------------------------+
|sentences                                                 |
+----------------------------------------------------------+
|John loves Mary.Mary loves Peter\n     Peter loves Helen .|
|Helen loves John;                                         |
|Total: four people involved.                              |
+----------------------------------------------------------+
>>> result.selectExpr("explode(sentencesDL.result) as sentencesDL").show(truncate=False)
+----------------------------+
|sentencesDL                 |
+----------------------------+
|John loves Mary.            |
|Mary loves Peter            |
|Peter loves Helen .         |
|Helen loves John;           |
|Total: four people involved.|
+----------------------------+
- setModel(modelArchitecture)[source]#
Sets the Model architecture. Currently only "cnn" is available.
- Parameters:
- modelArchitecture : str
Model architecture
- setExplodeSentences(value)[source]#
Sets whether to explode each sentence into a different row, for better parallelization, by default False.
- Parameters:
- value : bool
Whether to explode each sentence into a different row
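With exploding enabled, each detected sentence lands in its own output row rather than in one array-valued row, which helps downstream parallelization. A minimal sketch:
>>> explodingDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setExplodeSentences(True)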
- setCustomBounds(value)[source]#
Sets characters used to explicitly mark sentence bounds, by default [].
- Parameters:
- value : List[str]
Characters used to explicitly mark sentence bounds
- setUseCustomBoundsOnly(value)[source]#
Sets whether to only utilize custom bounds in sentence detection, by default False.
- Parameters:
- value : bool
Whether to only utilize custom bounds
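setCustomBounds() and setUseCustomBoundsOnly() are typically used together. A sketch (the bound string is illustrative) that additionally treats blank lines as sentence bounds while keeping the model's own predictions:
>>> boundedDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setCustomBounds(["\n\n"]) \
...     .setUseCustomBoundsOnly(False)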
- setSplitLength(value)[source]#
Sets the length at which sentences will be forcibly split.
- Parameters:
- value : int
Length at which sentences will be forcibly split.
- setMinLength(value)[source]#
Sets the minimum allowed length for each sentence, by default 0.
- Parameters:
- value : int
Minimum allowed length for each sentence
- setMaxLength(value)[source]#
Sets the maximum allowed length for each sentence, by default 99999.
- Parameters:
- value : int
Maximum allowed length for each sentence
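setSplitLength(), setMinLength() and setMaxLength() together constrain sentence size: overly long candidates are forcibly split, and anything outside the allowed range is filtered out. A sketch with illustrative values:
>>> lengthAwareDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setSplitLength(512) \
...     .setMinLength(4) \
...     .setMaxLength(512)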
- setImpossiblePenultimates(impossible_penultimates)[source]#
Sets the impossible penultimates, a list of strings which a sentence can’t end with.
- Parameters:
- impossible_penultimates : List[str]
List of strings which a sentence can’t end with
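Impossible penultimates suppress sentence breaks immediately after the given strings, which helps with abbreviations followed by periods. A sketch (the list of strings is illustrative):
>>> abbreviationAwareDetector = SentenceDetectorDLModel.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences") \
...     .setImpossiblePenultimates(["Mr", "Dr", "Prof", "No"])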
- static pretrained(name='sentence_detector_dl', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
- name : str, optional
Name of the pretrained model, by default “sentence_detector_dl”
- lang : str, optional
Language of the pretrained model, by default “en”
- remote_loc : str, optional
Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- SentenceDetectorDLModel
The restored model
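The name and language select which checkpoint to download. For instance, a multilingual variant is published under the "xx" language code (check the Models Hub for current availability); a minimal sketch:
>>> multilingualDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentences")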