  1. case class Annotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends IAnnotation with Product with Serializable

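    Represents an annotator's output parts and their details.

    annotatorType: the type of annotation

    begin: the index of the first character under this annotation

    end: the index of the last character under this annotation (inclusive; the example outputs on this page show end = 51 for a 52-character text)

    metadata: associated metadata for this annotation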
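
    The begin and end fields can be illustrated with a minimal sketch. It uses a hypothetical stand-in case class (SpanAnnotation, not the library's Annotation) and assumes that end is the inclusive index of the last character, which is what the example outputs on this page suggest ([document, 0, 51, ...] for a 52-character text).

```scala
object AnnotationSpanSketch {
  // Hypothetical stand-in that mirrors some fields of Annotation; not the library class.
  final case class SpanAnnotation(
      annotatorType: String,
      begin: Int,
      end: Int,
      result: String,
      metadata: Map[String, String])

  // Builds a document-level annotation covering the whole text.
  // Assumption: `end` is the inclusive index of the last character.
  def wholeDocument(text: String): SpanAnnotation =
    SpanAnnotation("document", 0, text.length - 1, text, Map("sentence" -> "0"))

  // Recovers the covered text from the source string using the inclusive span.
  def coveredText(source: String, annotation: SpanAnnotation): String =
    source.substring(annotation.begin, annotation.end + 1)
}
```

    For the sentence used in the DocumentAssembler example, wholeDocument yields begin = 0 and end = 51.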

  2. case class AnnotationAudio(annotatorType: String, result: Array[Float], metadata: Map[String, String]) extends IAnnotation with Product with Serializable

    Represents AudioAssembler's output parts and their details.

  3. case class AnnotationImage(annotatorType: String, origin: String, height: Int, width: Int, nChannels: Int, mode: Int, result: Array[Byte], metadata: Map[String, String]) extends IAnnotation with Product with Serializable

    Represents ImageAssembler's output parts and their details.

    annotatorType: Image annotator type

    origin: The origin of the image

    height: Height of the image in pixels

    width: Width of the image in pixels

    nChannels: Number of image channels

    mode: OpenCV-compatible type

    result: Result of the annotation

    metadata: Metadata of the annotation

  4. abstract class AnnotatorApproach[M <: Model[M]] extends Estimator[M] with HasInputAnnotationCols with HasOutputAnnotationCol with HasOutputAnnotatorType with DefaultParamsWritable with CanBeLazy

    This class should grow once we start training on datasets and sharing params. For now it stands as a placeholder for future reference.

  5. abstract class AnnotatorModel[M <: Model[M]] extends Model[M] with RawAnnotator[M] with CanBeLazy

    This class implements the logic that applies NLP using Spark ML Pipeline transformers. It should change substantially once UserDefinedTypes are allowed.

  6. class AudioAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol

    Prepares audio read by Spark into a format that is processable by Spark NLP.

    Input col is a single record that contains the raw content and metadata of the file.


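    import com.johnsnowlabs.nlp.AudioAssembler
    import org.apache.spark.ml.Pipeline
    // Assumed: audioDF is a DataFrame whose "audio" column holds the audio content (loading it is not shown here)
    val audioAssembler = new AudioAssembler().setInputCol("audio").setOutputCol("audio_assembler")
    val pipeline = new Pipeline().setStages(Array(audioAssembler))
    val pipelineDF = pipeline.fit(audioDF).transform(audioDF)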
  7. trait CanBeLazy extends AnyRef
  8. class Doc2Chunk extends Model[Doc2Chunk] with RawAnnotator[Doc2Chunk]

    Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. The chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.


    import spark.implicits._
    import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
    import org.apache.spark.ml.Pipeline
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    // Read the chunks from the "target" array column and emit them as CHUNK annotations
    val chunkAssembler = new Doc2Chunk()
      .setInputCols("document")
      .setChunkCol("target")
      .setOutputCol("chunk")
      .setIsArray(true)
    val data = Seq(
      ("Spark NLP is an open-source text processing library for advanced natural language processing.",
        Seq("Spark NLP", "text processing library", "natural language processing"))
    ).toDF("text", "target")
    val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
    val result = pipeline.transform(data)
    result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
    |result                                                           |annotatorType        |
    |[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
    See also

    Chunk2Doc for converting CHUNK annotations to DOCUMENT
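
    The requirement that the chunk text be contained within the input DOCUMENT can be sketched in plain Scala. This is only an illustration: the helper name locate is hypothetical, it does not reproduce how Doc2Chunk computes offsets internally, and it assumes an inclusive end index as in the example outputs on this page.

```scala
object ChunkOffsetSketch {
  // Returns (begin, end) of the first occurrence of `chunk` in `document`,
  // with an inclusive end index, or None when the chunk is not contained.
  def locate(document: String, chunk: String): Option[(Int, Int)] = {
    val begin = document.indexOf(chunk)
    if (chunk.isEmpty || begin < 0) None
    else Some((begin, begin + chunk.length - 1))
  }
}
```

    A chunk that does not occur in the document has no offsets, which is why such chunks cannot be turned into CHUNK annotations in this sketch.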

  9. class DocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol

    Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads String columns. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.

    For more extended examples on document pre-processing see the Examples.


    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val result = documentAssembler.transform(data)
    result.select("document").show(false)
    |document                                                                                      |
    |[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
    result.select("document").printSchema()
     |-- document: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- annotatorType: string (nullable = true)
     |    |    |-- begin: integer (nullable = false)
     |    |    |-- end: integer (nullable = false)
     |    |    |-- result: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
     |    |    |-- embeddings: array (nullable = true)
     |    |    |    |-- element: float (containsNull = false)
  10. class EmbeddingsFinisher extends Transformer with DefaultParamsWritable

    Extracts embeddings from Annotations into a more easily usable form.

    This is useful, for example, with WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.

    With EmbeddingsFinisher you can transform your embeddings into arrays of floats or vectors that are compatible with Spark ML functions such as LDA, K-means, the Random Forest classifier or any other function that requires a featureCol.

    For more extended examples see the Examples.


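    import spark.implicits._
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.linalg.DenseVector
    import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
    import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}
    // Intermediate column names ("token", "normalized", "embeddings") are assumed for illustration
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token")
    val normalizer = new Normalizer().setInputCols("token").setOutputCol("normalized")
    val stopwordsCleaner = new StopWordsCleaner().setInputCols("normalized").setOutputCol("cleanTokens")
    val gloveEmbeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("document", "cleanTokens")
      .setOutputCol("embeddings")
    // Output Spark ML vectors rather than plain float arrays
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_sentence_embeddings")
      .setOutputAsVector(true)
    val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler, tokenizer, normalizer, stopwordsCleaner, gloveEmbeddings, embeddingsFinisher
    )).fit(data)
    val result = pipeline.transform(data)
    val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
      .map { row =>
        val vector = row.getAs[DenseVector](0)
        (vector.size, vector)
      }.toDF("size", "vector")
    resultWithSize.show(5, 80)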
    |size|                                                                          vector|
    | 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
    | 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
    | 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
    | 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
    | 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
    See also

    Finisher for finishing Strings

  11. class FeaturesReader[T <: HasFeatures] extends MLReader[T]
  12. class FeaturesWriter[T] extends MLWriter with HasFeatures
  13. class Finisher extends Transformer with DefaultParamsWritable

    Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP Pipelines. The Finisher outputs annotation values as Strings.

    For more extended examples on document pre-processing see the Examples.


    import spark.implicits._
    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
    import com.johnsnowlabs.nlp.Finisher
    val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
    // Extracts Named Entities amongst other things
    val pipeline = PretrainedPipeline("explain_document_dl")
    val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
    val explainResult = pipeline.transform(data)
    explainResult.select("entities").show(false)
    |entities                                                                                                                                              |
    |[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
    val result = finisher.transform(explainResult)
    result.select("output").show(false)
    |output                |
    |[New York, New Jersey]|
    See also

    EmbeddingsFinisher for finishing embeddings
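
    Conceptually, finishing keeps the result string of each annotation and drops the remaining fields. The following minimal sketch shows that idea in plain Scala; the Chunk case class and the finish function are hypothetical stand-ins, not the Finisher implementation.

```scala
object FinisherSketch {
  // Hypothetical stand-in that mirrors some fields of Annotation; not the library class.
  final case class Chunk(
      annotatorType: String,
      begin: Int,
      end: Int,
      result: String,
      metadata: Map[String, String])

  // Keeps only the `result` value of each annotation, in the original order.
  def finish(annotations: Seq[Chunk]): Seq[String] =
    annotations.map(_.result)
}
```

    Applied to the two entities in the example above, finish returns the strings New York and New Jersey.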

  14. class GraphFinisher extends Transformer

    Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.


    This is a continuation of the example of GraphExtraction. To see how the graph is extracted, see the documentation of that class.

    import com.johnsnowlabs.nlp.GraphFinisher
    val graphFinisher = new GraphFinisher()
    val finishedResult = graphFinisher.transform(result)
    finishedResult.select("text", "graph_finished").show(false)
    |text                                                 |graph_finished                                                         |
    |You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
    See also

    GraphExtraction to extract the graph.

  15. trait HasAudioFeatureProperties extends ParamsAndFeaturesWritable

    Example of required parameters:

    {
      "do_normalize": true,
      "feature_size": 1,
      "padding_side": "right",
      "padding_value": 0.0,
      "return_attention_mask": false,
      "sampling_rate": 16000
    }
  16. trait HasBatchedAnnotate[M <: Model[M]] extends AnyRef
  17. trait HasBatchedAnnotateAudio[M <: Model[M]] extends AnyRef
  18. trait HasBatchedAnnotateImage[M <: Model[M]] extends AnyRef
  19. trait HasCandidateLabelsProperties extends ParamsAndFeaturesWritable
  20. trait HasCaseSensitiveProperties extends ParamsAndFeaturesWritable
  21. trait HasClassifierActivationProperties extends ParamsAndFeaturesWritable
  22. trait HasEnableCachingProperties extends ParamsAndFeaturesWritable
  23. trait HasEngine extends ParamsAndFeaturesWritable
  24. trait HasFeatures extends AnyRef
  25. trait HasGeneratorProperties extends AnyRef

    Parameters to configure beam search text generation.

  26. trait HasImageFeatureProperties extends ParamsAndFeaturesWritable

    Example of required parameters:

    {
      "do_normalize": true,
      "do_resize": true,
      "feature_extractor_type": "ViTFeatureExtractor",
      "image_mean": [...],
      "image_std": [...],
      "resample": 2,
      "size": 224
    }
  27. trait HasInputAnnotationCols extends Params
  28. trait HasLlamaCppProperties extends AnyRef

    Contains settable parameters for the AutoGGUFModel.

  29. trait HasMultipleInputAnnotationCols extends HasInputAnnotationCols

    Trait used to create annotators with input columns of variable length.

  30. trait HasOutputAnnotationCol extends Params
  31. trait HasOutputAnnotatorType extends AnyRef
  32. trait HasPretrained[M <: PipelineStage] extends AnyRef
  33. trait HasProtectedParams extends AnyRef

    Enables a class to protect a parameter, which means that it can only be set once.

    This trait enables an implicit conversion from Param to ProtectedParam. The new set method for ProtectedParam then checks whether the value was already set. If so, a warning is output and the value is not set again.
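
    The set-once behaviour can be sketched without Spark. The SetOnce class below is hypothetical and only mirrors the described behaviour (first value wins, later attempts produce a warning); the library's ProtectedParam wraps a Spark ML Param instead.

```scala
object SetOnceSketch {
  // Hypothetical holder that can be assigned only once.
  final class SetOnce[T](val name: String) {
    private var value: Option[T] = None

    // Stores the value on the first call; later calls warn and keep the first value.
    def set(newValue: T): this.type = {
      if (value.isDefined) {
        Console.err.println(s"Parameter '$name' is protected and was already set; ignoring the new value.")
      } else {
        value = Some(newValue)
      }
      this
    }

    // Returns the stored value, if any.
    def get: Option[T] = value
  }
}
```

    Calling set twice on the same holder leaves the first value in place, so a model's critical parameter cannot be overwritten by accident later in a pipeline definition.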

  34. trait HasRecursiveFit[M <: Model[M]] extends AnyRef

    AnnotatorApproach subclasses may extend this trait so that a RecursivePipeline can pass them the PipelineModel trained from the intermediate steps.

  35. trait HasRecursiveTransform[M <: Model[M]] extends AnyRef
  36. trait HasSimpleAnnotate[M <: Model[M]] extends AnyRef
  37. trait IAnnotation extends AnyRef

    The IAnnotation trait abstracts the annotator's output for each NLP task available in Spark NLP.

    Currently Spark NLP supports three types of outputs: text (Annotation), image (AnnotationImage) and audio (AnnotationAudio).

    LightPipeline models in Java/Scala return an IAnnotation collection. All of these outputs are structs with the required data types to represent Text, Image and Audio.

    If one wants to access the data as Annotation, AnnotationImage or AnnotationAudio, one just needs casting to the desired output.


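    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.cv.ViTForImageClassification
    import com.johnsnowlabs.nlp.{Annotation, AnnotationImage, ImageAssembler}
    import com.johnsnowlabs.nlp.LightPipeline
    import com.johnsnowlabs.util.PipelineModels
    import org.apache.spark.ml.Pipeline
    // Assumed: images are read with Spark's "image" data source from the ./images directory
    val imageDf = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("./images")
    // Column names "image", "image_assembler" and "class" are assumed for illustration
    val imageAssembler = new ImageAssembler().setInputCol("image").setOutputCol("image_assembler")
    val imageClassifier = ViTForImageClassification.pretrained()
      .setInputCols("image_assembler")
      .setOutputCol("class")
    val pipeline: Pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
    val vitModel = pipeline.fit(imageDf)
    val lightPipeline = new LightPipeline(vitModel)
    val predictions = lightPipeline.fullAnnotate("./images/hen.JPEG")
    // Cast each IAnnotation to its concrete type
    val result = predictions.flatMap(prediction => {
      prediction._2.map {
        case annotationText: Annotation => annotationText
        case annotationImage: AnnotationImage => annotationImage
      }
    })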
  38. class ImageAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol

    Prepares images read by Spark into a format that is processable by Spark NLP. This component is needed to process images.


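    import com.johnsnowlabs.nlp.ImageAssembler
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.sql.DataFrame
    // "path/to/images" is a placeholder; point it at a directory of image files
    val imageDF: DataFrame = spark.read
      .format("image")
      .option("dropInvalid", value = true)
      .load("path/to/images")
    val imageAssembler = new ImageAssembler().setInputCol("image").setOutputCol("image_assembler")
    val pipeline = new Pipeline().setStages(Array(imageAssembler))
    val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
    pipelineDF.select("image_assembler").printSchema()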
     |-- image_assembler: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- annotatorType: string (nullable = true)
     |    |    |-- origin: string (nullable = true)
     |    |    |-- height: integer (nullable = false)
     |    |    |-- width: integer (nullable = false)
     |    |    |-- nChannels: integer (nullable = false)
     |    |    |-- mode: integer (nullable = false)
     |    |    |-- result: binary (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
  39. case class JavaAnnotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends IAnnotation with Product with Serializable
  40. class LightPipeline extends AnyRef
  41. class MultiDocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType

    Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.

    For more extended examples on document pre-processing see the Examples.


    import spark.implicits._
    import com.johnsnowlabs.nlp.MultiDocumentAssembler
    val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
    val multiDocumentAssembler = new MultiDocumentAssembler().setInputCols("text").setOutputCols("document")
    val result = multiDocumentAssembler.transform(data)
    result.select("document").show(false)
    |document                                                                                      |
    |[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
    result.select("document").printSchema()
     |-- document: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- annotatorType: string (nullable = true)
     |    |    |-- begin: integer (nullable = false)
     |    |    |-- end: integer (nullable = false)
     |    |    |-- result: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
     |    |    |-- embeddings: array (nullable = true)
     |    |    |    |-- element: float (containsNull = false)
  42. trait ParamsAndFeaturesReadable[T <: HasFeatures] extends DefaultParamsReadable[T]
  43. trait ParamsAndFeaturesWritable extends DefaultParamsWritable with Params with HasFeatures
  44. trait RawAnnotator[M <: Model[M]] extends Model[M] with ParamsAndFeaturesWritable with HasOutputAnnotatorType with HasInputAnnotationCols with HasOutputAnnotationCol
  45. class RecursivePipeline extends Pipeline
  46. class RecursivePipelineModel extends Model[RecursivePipelineModel] with MLWritable with Logging
  47. class TableAssembler extends AnnotatorModel[TokenAssembler] with HasSimpleAnnotate[TokenAssembler]

    This transformer parses text into a tabular representation. The input consists of DOCUMENT annotations and the output consists of TABLE annotations. The source format can be either JSON or CSV. The format of the JSON files should be:

    {
      "header": [col1, col2, ..., colN],
      "rows": [
        [val11, val12, ..., val1N],
        [val21, val22, ..., val2N],
        ...
      ]
    }

    The CSV format supports alternative delimiters (e.g. tab), as well as escaping delimiters by surrounding cell values with double quotes. For example:

    column1, column2, "column with, comma"
    value1, value2, value3
    "escaped value", "value with, comma", "value with double ("") quote"

    The transformer stores tabular data internally as JSON. The default input format is also JSON.
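
    The CSV quoting rule described above can be sketched as a small line splitter. This is an illustrative sketch, not the parser used by TableAssembler: the function name splitLine is hypothetical, and as a simplification it trims whitespace around every cell.

```scala
object CsvLineSketch {
  // Splits one CSV line into cells. A cell wrapped in double quotes may contain
  // the delimiter, and a doubled quote ("") inside it stands for a literal quote.
  def splitLine(line: String, delimiter: Char = ','): Vector[String] = {
    val cells = Vector.newBuilder[String]
    val cell = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          cell.append('"') // doubled quote: keep one literal quote
          i += 1           // and skip the second quote character
        } else if (c == '"') {
          inQuotes = false // closing quote
        } else {
          cell.append(c)
        }
      } else if (c == '"' && cell.forall(_.isWhitespace)) {
        cell.clear()       // opening quote: drop padding before it
        inQuotes = true
      } else if (c == delimiter) {
        cells += cell.toString.trim
        cell.clear()
      } else {
        cell.append(c)
      }
      i += 1
    }
    cells += cell.toString.trim
    cells.result()
  }
}
```

    Passing '\t' as the delimiter handles tab-separated input in the same way.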


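    import spark.implicits._
    import org.apache.spark.ml.Pipeline
    import com.johnsnowlabs.nlp.{DocumentAssembler, TableAssembler}
    val csvData =
      """
        |"name", "money", "age"
        |"Donald Trump", "$100,000,000", "75"
        |"Elon Musk", "$20,000,000,000,000", "55"
        |""".stripMargin.trim
    val data = Seq(csvData).toDF("csv")
    val documentAssembler = new DocumentAssembler().setInputCol("csv").setOutputCol("document")
    // Parse the DOCUMENT as CSV (the default input format is JSON)
    val tableAssembler = new TableAssembler()
      .setInputCols("document")
      .setOutputCol("table")
      .setInputFormat("csv")
    val pipeline = new Pipeline()
      .setStages(
        Array(documentAssembler, tableAssembler)
      ).fit(data)
    val result = pipeline.transform(data)
    result
      .selectExpr("explode(table) AS table")
      .select("table.result", "table.metadata.input_format")
      .show(false)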
    |result                                      |input_format |
    |{                                           |csv          |
    | "header": ["name","money","age"],          |             |
    |  "rows":[                                  |             |
    |   ["Donald Trump","$100,000,000","75"],    |             |
    |   ["Elon Musk","$20,000,000,000,000","55"] |             |
    |  ]                                         |             |
    |}                                           |             |
  48. class TokenAssembler extends AnnotatorModel[TokenAssembler] with HasSimpleAnnotate[TokenAssembler]

    This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.

    For more extended examples on document pre-processing see the Examples.


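    import spark.implicits._
    import org.apache.spark.ml.Pipeline
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
    import com.johnsnowlabs.nlp.TokenAssembler
    // First, the text is tokenized and cleaned
    // (the "document", "token" and "normalized" column names are assumed for illustration)
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentences")
    val tokenizer = new Tokenizer().setInputCols("sentences").setOutputCol("token")
    val normalizer = new Normalizer().setInputCols("token").setOutputCol("normalized")
    val stopwordsCleaner = new StopWordsCleaner().setInputCols("normalized").setOutputCol("cleanTokens")
    // Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
    val tokenAssembler = new TokenAssembler()
      .setInputCols("sentences", "cleanTokens")
      .setOutputCol("cleanText")
    val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
      .toDF("text")
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler, sentenceDetector, tokenizer, normalizer, stopwordsCleaner, tokenAssembler
    )).fit(data)
    val result = pipeline.transform(data)
    result.select("cleanText").show(false)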
    |cleanText                                                                                                                  |
    |[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
    See also

    DocumentAssembler on the data structure
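
    The reconstruction step can be sketched in plain Scala as joining the cleaned tokens with single spaces and recomputing the span. The AssembledDocument type and assemble function are hypothetical, and the actual TokenAssembler may handle spacing, sentences and offsets differently; the sketch assumes an inclusive end index as in the example outputs on this page.

```scala
object TokenAssemblerSketch {
  // Hypothetical stand-in for the resulting DOCUMENT annotation.
  final case class AssembledDocument(begin: Int, end: Int, result: String)

  // Joins tokens with single spaces and computes an inclusive end index.
  def assemble(tokens: Seq[String]): AssembledDocument = {
    val text = tokens.mkString(" ")
    AssembledDocument(0, text.length - 1, text)
  }
}
```

    For the ten cleaned tokens of the example sentence, this gives begin = 0 and end = 80, which matches the cleanText output shown above.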

Value Members

  1. object ActivationFunction
  2. object Annotation extends Serializable
  3. object AnnotationAudio extends Serializable
  4. object AnnotationImage extends Serializable
  5. object AnnotatorType
  6. object AudioAssembler extends DefaultParamsReadable[AudioAssembler] with Serializable

    This is the companion object of AudioAssembler.

  7. object Doc2Chunk extends DefaultParamsReadable[Doc2Chunk] with Serializable

    This is the companion object of Doc2Chunk.

  8. object DocumentAssembler extends DefaultParamsReadable[DocumentAssembler] with Serializable

    This is the companion object of DocumentAssembler.

  9. object EmbeddingsFinisher extends DefaultParamsReadable[EmbeddingsFinisher] with Serializable

    This is the companion object of EmbeddingsFinisher.

  10. object Finisher extends DefaultParamsReadable[Finisher] with Serializable

    This is the companion object of Finisher.

  11. object ImageAssembler extends DefaultParamsReadable[ImageAssembler] with Serializable

    This is the companion object of ImageAssembler.

  12. object MultiDocumentAssembler extends DefaultParamsReadable[MultiDocumentAssembler] with Serializable

    This is the companion object of MultiDocumentAssembler.

  13. object SparkNLP
  14. object TableAssembler extends DefaultParamsReadable[DocumentAssembler] with Serializable

    This is the companion object of TableAssembler.

  15. object TokenAssembler extends DefaultParamsReadable[TokenAssembler] with Serializable

    This is the companion object of TokenAssembler.

  16. object functions
