package pragmatic
Type Members
- class CustomPragmaticMethod extends PragmaticMethod with Serializable
Inspired by Kevin Dias's Ruby implementation: https://github.com/diasks2/pragmatic_segmenter. This approach extracts sentence bounds by first formatting the data with RuleSymbols and then extracting the bounds with a strong regex-based rule application.
- class DefaultPragmaticMethod extends PragmaticMethod with Serializable
- class MixedPragmaticMethod extends PragmaticMethod with Serializable
- class PragmaticContentFormatter extends AnyRef
Rule-based formatter that adds regex rules at the different marking steps. Symbols protect ambiguous bounds from being considered splitters.
- trait PragmaticMethod extends AnyRef
Attributes: protected
- class PragmaticSentenceExtractor extends AnyRef
Reads through the symbolized data and computes the bounds based on regex rules that follow the symbol meaning.
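Taken together, CustomPragmaticMethod, PragmaticContentFormatter and PragmaticSentenceExtractor describe a two-stage flow: the formatter marks the text with RuleSymbols so that ambiguous bounds are protected, and the extractor then pulls the sentence bounds out of the marked text. The following is a minimal sketch of that flow; the constructor arguments and the formatCustomBounds/finish/pull calls are assumptions inferred from this package's layout, not a verified public API.

import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.{PragmaticContentFormatter, PragmaticSentenceExtractor}

// Hypothetical sketch of the internal two-stage flow (method names assumed).
val raw = "First paragraph ends here.\n\nSecond paragraph starts here."

// Stage 1: mark the raw text with RuleSymbols, here only for a custom "\n\n" bound.
val symbolized = new PragmaticContentFormatter(raw)
  .formatCustomBounds(Array("\n\n")) // assumed marking step for custom bounds
  .finish                            // assumed: returns the symbol-marked text

// Stage 2: compute sentence bounds from the marked text via the regex rules.
val sentences = new PragmaticSentenceExtractor(symbolized, raw).pull
sentences.foreach(s => println(s"(${s.start}, ${s.end}) ${s.content}"))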
- trait RuleSymbols extends AnyRef
Base symbols that may be extended later on. For now kept in the pragmatic scope.
- class SentenceDetector extends AnnotatorModel[SentenceDetector] with HasSimpleAnnotate[SentenceDetector] with SentenceDetectorParams
Annotator that detects sentence boundaries using regular expressions.
The following characters are checked as sentence boundaries:
- Lists ("(i), (ii)", "(a), (b)", "1., 2.")
- Numbers
- Abbreviations
- Punctuations
- Multiple Periods
- Geo-Locations/Coordinates ("N°. 1026.253.553.")
- Ellipsis ("...")
- In-between punctuations
- Quotation marks
- Exclamation Points
- Basic Breakers (".", ";")
For the explicit regular expressions used for detection, refer to the source of PragmaticContentFormatter: https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/sbd/pragmatic/PragmaticContentFormatter.scala
To add additional custom bounds, the parameter customBounds can be set with an array:

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setCustomBounds(Array("\n\n"))

If only the custom bounds should be used, then the parameter useCustomBoundsOnly should be set to true.
Each extracted sentence can be returned in an Array or exploded into separate rows if explodeSentences is set to true (a sketch of both parameters follows the example below).
For extended examples of usage, see the SentenceDetector_advanced_examples notebook: https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/sentence-detection/SentenceDetector_advanced_examples.ipynb
Example
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setCustomBounds(Array("\n\n"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence
))

val data = Seq("This is my first sentence. This my second.\n\nHow about a third?").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(sentence) as sentences").show(false)
+------------------------------------------------------------------+
|sentences                                                          |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+
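The example above only configures customBounds. Below is a minimal sketch of the other two parameters discussed earlier, reusing the documentAssembler and data from the example: useCustomBoundsOnly restricts splitting to the custom bounds, and explodeSentences returns each sentence in a separate row. The expected output noted in the comment is illustrative, not verified output.

// Sketch: same documentAssembler and data as in the example above.
val customOnlySentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setCustomBounds(Array("\n\n"))
  .setUseCustomBoundsOnly(true)  // skip the built-in boundary rules, split only on "\n\n"
  .setExplodeSentences(true)     // one output row per detected sentence

val customPipeline = new Pipeline().setStages(Array(documentAssembler, customOnlySentence))

customPipeline.fit(data).transform(data)
  .selectExpr("explode(sentence.result)")
  .show(false)
// Expected (illustrative): two rows, one per "\n\n"-delimited chunk of the input.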
- See also: SentenceDetectorDLModel for pretrained models
Value Members
- object PragmaticContentFormatter
- object PragmaticDictionaries
This is a dictionary that contains common English abbreviations that should be considered sentence bounds.
- object PragmaticSymbols extends RuleSymbols
Extends RuleSymbols with specific symbols used for the pragmatic approach. Right now, the only one.
- object SentenceDetector extends DefaultParamsReadable[SentenceDetector] with Serializable
This is the companion object of SentenceDetector. Please refer to that class for the documentation.