package nlp
Type Members
- case class Annotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends IAnnotation with Product with Serializable
Represents an annotator's output parts and their details.
- annotatorType
 the type of annotation
- begin
 the index of the first character under this annotation
- end
 the index after the last character under this annotation
- metadata
 associated metadata for this annotation
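For illustration, an Annotation can be constructed and inspected directly. A minimal sketch; the field values below are made up and the "token" annotator type is only an example:
import com.johnsnowlabs.nlp.Annotation

// Hypothetical token annotation for the word "Spark" at character offsets 0 to 4
val token = Annotation(
  annotatorType = "token",
  begin = 0,
  end = 4,
  result = "Spark",
  metadata = Map("sentence" -> "0"))

token.result               // "Spark"
token.metadata("sentence") // "0"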
- case class AnnotationAudio(annotatorType: String, result: Array[Float], metadata: Map[String, String]) extends IAnnotation with Product with Serializable
Represents AudioAssembler's output parts and their details.
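As a minimal sketch (sample values made up for illustration), an audio annotation simply wraps the raw float samples together with metadata:
import com.johnsnowlabs.nlp.AnnotationAudio

// Hypothetical one-second clip at a 16 kHz sampling rate, silent (all-zero) samples
val audio = AnnotationAudio(
  annotatorType = "audio",
  result = Array.fill(16000)(0.0f),
  metadata = Map("length" -> "16000"))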
- case class AnnotationImage(annotatorType: String, origin: String, height: Int, width: Int, nChannels: Int, mode: Int, result: Array[Byte], metadata: Map[String, String], text: String = "") extends IAnnotation with Product with Serializable
Represents ImageAssembler's output parts and their details.
- annotatorType
 Image annotator type
- origin
 The origin of the image
- height
 Height of the image in pixels
- width
 Width of the image in pixels
- nChannels
 Number of image channels
- mode
 OpenCV-compatible type
- result
 Result of the annotation
- metadata
 Metadata of the annotation
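A minimal construction sketch (the pixel bytes and metadata below are made up for illustration):
import com.johnsnowlabs.nlp.AnnotationImage

// Hypothetical 2x2 single-channel image
val image = AnnotationImage(
  annotatorType = "image",
  origin = "file:/tmp/example.png",
  height = 2,
  width = 2,
  nChannels = 1,
  mode = 0, // OpenCV type code, e.g. CV_8UC1
  result = Array[Byte](0, 64, 127, -1),
  metadata = Map("height" -> "2", "width" -> "2"))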
- abstract class AnnotatorApproach[M <: Model[M]] extends Estimator[M] with HasInputAnnotationCols with HasOutputAnnotationCol with HasOutputAnnotatorType with DefaultParamsWritable with CanBeLazy
This class should grow once we start training on datasets and share params. For now it stands as a dummy placeholder for future reference.
- abstract class AnnotatorModel[M <: Model[M]] extends Model[M] with RawAnnotator[M] with CanBeLazy
This class implements logic that applies NLP using Spark ML Pipeline transformers. It should change significantly once UserDefinedTypes are allowed (https://issues.apache.org/jira/browse/SPARK-7768).
- class AudioAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol
Prepares audio read by Spark into a format that is processable by Spark NLP. This component is needed to process audio.
Input col is a single record that contains the raw content and metadata of the file.
Example
import com.johnsnowlabs.nlp.AudioAssembler
import org.apache.spark.ml.Pipeline

val audioAssembler = new AudioAssembler()
  .setInputCol("audio")
  .setOutputCol("audio_assembler")

val pipeline = new Pipeline().setStages(Array(audioAssembler))
val pipelineDF = pipeline.fit(wavDf).transform(wavDf)
pipelineDF.printSchema()
 -  trait CanBeLazy extends AnyRef
- class Doc2Chunk extends Model[Doc2Chunk] with RawAnnotator[Doc2Chunk]
Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within the input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
- See also
 Chunk2Doc for converting CHUNK annotations to DOCUMENT
- class DocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads String columns. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Examples.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                        |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
- class EmbeddingsFinisher extends Transformer with DefaultParamsWritable
Extracts embeddings from Annotations into a more easily usable form.
This is useful, for example, for WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.
By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier or any other function that requires a featureCol.
For more extended examples see the Examples.
Example
import spark.implicits._ import org.apache.spark.ml.Pipeline import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher} import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel} val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val normalizer = new Normalizer() .setInputCols("token") .setOutputCol("normalized") val stopwordsCleaner = new StopWordsCleaner() .setInputCols("normalized") .setOutputCol("cleanTokens") .setCaseSensitive(false) val gloveEmbeddings = WordEmbeddingsModel.pretrained() .setInputCols("document", "cleanTokens") .setOutputCol("embeddings") .setCaseSensitive(false) val embeddingsFinisher = new EmbeddingsFinisher() .setInputCols("embeddings") .setOutputCols("finished_sentence_embeddings") .setOutputAsVector(true) .setCleanAnnotations(false) val data = Seq("Spark NLP is an open-source text processing library.") .toDF("text") val pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer, normalizer, stopwordsCleaner, gloveEmbeddings, embeddingsFinisher )).fit(data) val result = pipeline.transform(data) val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)") .map { row => val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0) (vector.size, vector) }.toDF("size", "vector") resultWithSize.show(5, 80) +----+--------------------------------------------------------------------------------+ |size| vector| +----+--------------------------------------------------------------------------------+ | 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...| | 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...| | 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...| | 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...| | 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...| +----+--------------------------------------------------------------------------------+
- See also
 Finisher for finishing Strings
- class FeaturesFallbackReader[T <: HasFeatures] extends MLReader[T]
MLReader that loads a model with params and features, and has a fallback mechanism.
The fallback load will be called in case there is an exception during Spark loading (i.e. missing parameters or features).
Usually, you might want to call loadSavedModel in the fallbackLoad method to load a model with default params.
- T
 The type of the model that extends HasFeatures
 -  class FeaturesReader[T <: HasFeatures] extends MLReader[T]
 -  class FeaturesWriter[T] extends MLWriter with HasFeatures
- class Finisher extends Transformer with DefaultParamsWritable
Converts annotation results into a format that is easier to use. It is useful to extract the results from Spark NLP Pipelines. The Finisher outputs the annotation values as Strings.
For more extended examples on document pre-processing see the Examples.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val finisher = new Finisher().setInputCols("entities").setOutputCols("output")

val explainResult = pipeline.transform(data)
explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+

val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
- See also
 EmbeddingsFinisher for finishing embeddings
- class GraphFinisher extends Transformer
Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.
Example
This is a continuation of the example of GraphExtraction. To see how the graph is extracted, see the documentation of that class.
import com.johnsnowlabs.nlp.GraphFinisher

val graphFinisher = new GraphFinisher()
  .setInputCol("graph")
  .setOutputCol("graph_finished")
  .setOutputAsArray(false)

val finishedResult = graphFinisher.transform(result)
finishedResult.select("text", "graph_finished").show(false)
+-----------------------------------------------------+-----------------------------------------------------------------------+
|text                                                 |graph_finished                                                         |
+-----------------------------------------------------+-----------------------------------------------------------------------+
|You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
+-----------------------------------------------------+-----------------------------------------------------------------------+
- See also
 GraphExtraction to extract the graph.
- trait HasAudioFeatureProperties extends ParamsAndFeaturesWritable
Example of required parameters:
{
  "do_normalize": true,
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}
- trait HasBatchedAnnotate[M <: Model[M]] extends AnyRef
 -  trait HasBatchedAnnotateAudio[M <: Model[M]] extends AnyRef
 -  trait HasBatchedAnnotateImage[M <: Model[M]] extends AnyRef
 -  trait HasBatchedAnnotateTextImage[M <: Model[M]] extends AnyRef
 -  trait HasCandidateLabelsProperties extends ParamsAndFeaturesWritable
 -  trait HasCaseSensitiveProperties extends ParamsAndFeaturesWritable
 -  trait HasClassifierActivationProperties extends ParamsAndFeaturesWritable
 -  trait HasClsTokenProperties extends ParamsAndFeaturesWritable
 -  trait HasEnableCachingProperties extends ParamsAndFeaturesWritable
 -  trait HasEngine extends ParamsAndFeaturesWritable
 -  trait HasFeatures extends AnyRef
- trait HasGeneratorProperties extends AnyRef
Parameters to configure beam search text generation.
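These parameters are exposed as setters on the text generation annotators that mix in this trait. A minimal sketch, assuming an annotator such as BartTransformer mixes in this trait; the parameter values are illustrative, not recommendations:
import com.johnsnowlabs.nlp.annotators.seq2seq.BartTransformer

// Illustrative generation settings on a pretrained seq2seq model
val bart = BartTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generated")
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setBeamSize(4) // beam search width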
- trait HasImageFeatureProperties extends ParamsAndFeaturesWritable
Example of required parameters:
{
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [0.5, 0.5, 0.5],
  "image_std": [0.5, 0.5, 0.5],
  "resample": 2,
  "size": 224
}
- trait HasInputAnnotationCols extends Params
- trait HasLlamaCppInferenceProperties extends AnyRef
Contains settable inference parameters for the AutoGGUFModel.
- trait HasLlamaCppModelProperties extends AnyRef
Contains settable model parameters for the AutoGGUFModel.
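Both HasLlamaCppModelProperties and HasLlamaCppInferenceProperties surface their parameters as setters on AutoGGUFModel. A minimal sketch; the chosen setters and values are assumptions meant to mirror common llama.cpp options:
import com.johnsnowlabs.nlp.annotators.seq2seq.AutoGGUFModel

// Model-level options (loading/layout) and inference-level options (sampling) in one chain;
// the specific values here are only for illustration
val autoGGUF = AutoGGUFModel.pretrained()
  .setInputCols("document")
  .setOutputCol("completions")
  .setNCtx(4096)        // model property: context size
  .setNGpuLayers(99)    // model property: layers offloaded to the GPU
  .setTemperature(0.4f) // inference property: sampling temperature
  .setNPredict(128)     // inference property: number of tokens to generate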
- trait HasMultipleInputAnnotationCols extends HasInputAnnotationCols
Trait used to create annotators with input columns of variable length.
 -  trait HasOutputAnnotationCol extends Params
 -  trait HasOutputAnnotatorType extends AnyRef
 -  trait HasPretrained[M <: PipelineStage] extends AnyRef
- trait HasProtectedParams extends AnyRef
Enables a class to protect a parameter, which means that it can only be set once.
This trait enables an implicit conversion from Param to ProtectedParam. In addition, the new set for ProtectedParam will check whether or not the value was already set. If so, a warning will be output and the value will not be set again.
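The set-once semantics can be illustrated with a small self-contained sketch that mimics what a protected parameter does (this is not the actual Spark NLP implementation, only the behavior described above):
// Minimal illustration of "protected" (set-once) semantics
class ProtectedValue[T](name: String) {
  private var value: Option[T] = None
  def set(v: T): this.type = {
    if (value.isDefined)
      println(s"Warning: $name is protected and was already set, keeping the first value.")
    else
      value = Some(v)
    this
  }
  def get: Option[T] = value
}

val dimension = new ProtectedValue[Int]("dimension")
dimension.set(512) // sets the value
dimension.set(768) // prints a warning, the value stays 512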
- trait HasRecursiveFit[M <: Model[M]] extends AnyRef
AnnotatorApproaches may extend this trait in order to allow RecursivePipelines to include the trained PipelineModels of intermediate steps.
 -  trait HasRecursiveTransform[M <: Model[M]] extends AnyRef
 -  trait HasSimpleAnnotate[M <: Model[M]] extends AnyRef
- trait IAnnotation extends AnyRef
The IAnnotation trait is used to abstract the annotator's output for each NLP task available in Spark NLP.
Currently Spark NLP supports three types of outputs:
- Text Output: com.johnsnowlabs.nlp.Annotation
 - Image Output: com.johnsnowlabs.nlp.AnnotationImage
 - Audio Output: com.johnsnowlabs.nlp.AnnotationAudio
 
LightPipeline models in Java/Scala return an IAnnotation collection. All of these outputs are structs with the required data types to represent Text, Image and Audio.
If one wants to access the data as Annotation, AnnotationImage or AnnotationAudio, one just needs to cast to the desired output type.
Example
import com.johnsnowlabs.nlp.annotators.cv.ViTForImageClassification import org.apache.spark.ml.Pipeline import com.johnsnowlabs.nlp.annotators.Tokenizer import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector import com.johnsnowlabs.nlp.ImageAssembler import com.johnsnowlabs.nlp.LightPipeline import com.johnsnowlabs.util.PipelineModels val imageDf = spark.read .format("image") .option("dropInvalid", value = true) .load("./images") val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained() .setInputCols("image_assembler") .setOutputCol("class") val pipeline: Pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val vitModel = pipeline.fit(imageDf) val lightPipeline = new LightPipeline(vitModel) val predictions = lightPipeline.fullAnnotate("./images/hen.JPEG") val result = predictions.flatMap(prediction => prediction._2.map { case annotationText: Annotation => annotationText case annotationImage: AnnotationImage => annotationImage })
- class ImageAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol
Prepares images read by Spark into a format that is processable by Spark NLP. This component is needed to process images.
Example
import com.johnsnowlabs.nlp.ImageAssembler
import org.apache.spark.ml.Pipeline

val imageDF: DataFrame = spark.read
  .format("image")
  .option("dropInvalid", value = true)
  .load("src/test/resources/image/")

val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val pipeline = new Pipeline().setStages(Array(imageAssembler))
val pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF.printSchema()
root
 |-- image_assembler: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- origin: string (nullable = true)
 |    |    |-- height: integer (nullable = false)
 |    |    |-- width: integer (nullable = false)
 |    |    |-- nChannels: integer (nullable = false)
 |    |    |-- mode: integer (nullable = false)
 |    |    |-- result: binary (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 -  case class JavaAnnotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends IAnnotation with Product with Serializable
 -  class LightPipeline extends AnyRef
- class MultiDocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The MultiDocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
For more extended examples on document pre-processing see the Examples.
Example
import spark.implicits._
import com.johnsnowlabs.nlp.MultiDocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val multiDocumentAssembler = new MultiDocumentAssembler().setInputCols("text").setOutputCols("document")

val result = multiDocumentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                        |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
- trait ParamsAndFeaturesFallbackReadable[T <: HasFeatures] extends ParamsAndFeaturesReadable[T]
Enables loading models with params and features with a fallback mechanism. The fallbackLoad function will be called in case there is an exception during Spark loading (i.e. missing parameters or features). Usually, you might want to call loadSavedModel in the fallbackLoad method to load a model with default params.
- T
 The type of the model that extends HasFeatures
 -  trait ParamsAndFeaturesReadable[T <: HasFeatures] extends DefaultParamsReadable[T]
 -  trait ParamsAndFeaturesWritable extends DefaultParamsWritable with Params with HasFeatures
- class PromptAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol
Assembles a sequence of messages into a single string using a template. These strings can then be used as prompts for large language models.
This annotator expects an array of two-tuples as the type of the input column (one array of tuples per row). The first element of the tuples should be the role and the second element is the text of the message. Possible roles are "system", "user" and "assistant".
An assistant header can be added to the end of the generated string by using setAddAssistant(true).
At the moment, this annotator uses llama.cpp as a backend to parse and apply the templates. llama.cpp uses basic pattern matching to determine the type of the template, then applies a basic version of the template to the messages. This means that more advanced templates are not supported.
For an extended example see the example notebook.
Example
// Batches (whole conversations) of arrays of messages val data: Seq[Seq[(String, String)]] = Seq( Seq( ("system", "You are a helpful assistant."), ("assistant", "Hello there, how can I help you?"), ("user", "I need help with organizing my room."))) val dataDF = data.toDF("messages") // llama3.1 val template = "{{- bos_token }} {%- if custom_tools is defined %} {%- set tools = custom_tools %} {%- " + "endif %} {%- if not tools_in_user_message is defined %} {%- set tools_in_user_message = true %} {%- " + "endif %} {%- if not date_string is defined %} {%- set date_string = \"26 Jul 2024\" %} {%- endif %} " + "{%- if not tools is defined %} {%- set tools = none %} {%- endif %} {#- This block extracts the " + "system message, so we can slot it into the right place. #} {%- if messages[0]['role'] == 'system' %}" + " {%- set system_message = messages[0]['content']|trim %} {%- set messages = messages[1:] %} {%- else" + " %} {%- set system_message = \"\" %} {%- endif %} {#- System message + builtin tools #} {{- " + "\"<|start_header_id|>system<|end_header_id|>\\n\\n\" }} {%- if builtin_tools is defined or tools is " + "not none %} {{- \"Environment: ipython\\n\" }} {%- endif %} {%- if builtin_tools is defined %} {{- " + "\"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}} " + "{%- endif %} {{- \"Cutting Knowledge Date: December 2023\\n\" }} {{- \"Today Date: \" + date_string " + "+ \"\\n\\n\" }} {%- if tools is not none and not tools_in_user_message %} {{- \"You have access to " + "the following functions. To call a function, please respond with JSON for a function call.\" }} {{- " + "'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its" + " value}.' }} {{- \"Do not use variables.\\n\\n\" }} {%- for t in tools %} {{- t | tojson(indent=4) " + "}} {{- \"\\n\\n\" }} {%- endfor %} {%- endif %} {{- system_message }} {{- \"<|eot_id|>\" }} {#- " + "Custom tools are passed in a user message with some extra guidance #} {%- if tools_in_user_message " + "and not tools is none %} {#- Extract the first user message so we can plug it in here #} {%- if " + "messages | length != 0 %} {%- set first_user_message = messages[0]['content']|trim %} {%- set " + "messages = messages[1:] %} {%- else %} {{- raise_exception(\"Cannot put tools in the first user " + "message when there's no first user message!\") }} {%- endif %} {{- " + "'<|start_header_id|>user<|end_header_id|>\\n\\n' -}} {{- \"Given the following functions, please " + "respond with a JSON for a function call \" }} {{- \"with its proper arguments that best answers the " + "given prompt.\\n\\n\" }} {{- 'Respond in the format {\"name\": function name, \"parameters\": " + "dictionary of argument name and its value}.' 
}} {{- \"Do not use variables.\\n\\n\" }} {%- for t in " + "tools %} {{- t | tojson(indent=4) }} {{- \"\\n\\n\" }} {%- endfor %} {{- first_user_message + " + "\"<|eot_id|>\"}} {%- endif %} {%- for message in messages %} {%- if not (message.role == 'ipython' " + "or message.role == 'tool' or 'tool_calls' in message) %} {{- '<|start_header_id|>' + message['role']" + " + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }} {%- elif 'tool_calls' in " + "message %} {%- if not message.tool_calls|length == 1 %} {{- raise_exception(\"This model only " + "supports single tool-calls at once!\") }} {%- endif %} {%- set tool_call = message.tool_calls[0]" + ".function %} {%- if builtin_tools is defined and tool_call.name in builtin_tools %} {{- " + "'<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- \"<|python_tag|>\" + tool_call.name + " + "\".call(\" }} {%- for arg_name, arg_val in tool_call.arguments | items %} {{- arg_name + '=\"' + " + "arg_val + '\"' }} {%- if not loop.last %} {{- \", \" }} {%- endif %} {%- endfor %} {{- \")\" }} {%- " + "else %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}} {{- '{\"name\": \"' + " + "tool_call.name + '\", ' }} {{- '\"parameters\": ' }} {{- tool_call.arguments | tojson }} {{- \"}\" " + "}} {%- endif %} {%- if builtin_tools is defined %} {#- This means we're in ipython mode #} {{- " + "\"<|eom_id|>\" }} {%- else %} {{- \"<|eot_id|>\" }} {%- endif %} {%- elif message.role == \"tool\" " + "or message.role == \"ipython\" %} {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }} {%- " + "if message.content is mapping or message.content is iterable %} {{- message.content | tojson }} {%- " + "else %} {{- message.content }} {%- endif %} {{- \"<|eot_id|>\" }} {%- endif %} {%- endfor %} {%- if " + "add_generation_prompt %} {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }} {%- endif %} " val promptAssembler = new PromptAssembler() .setInputCol("messages") .setOutputCol("prompt") .setChatTemplate(template) promptAssembler.transform(dataDF).select("prompt.result").show(truncate = false) +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello there, how can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI need help with organizing my room.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n]| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 -  trait RawAnnotator[M <: Model[M]] extends Model[M] with ParamsAndFeaturesWritable with HasOutputAnnotatorType with HasInputAnnotationCols with HasOutputAnnotationCol
 -  class RecursivePipeline extends Pipeline
 -  class RecursivePipelineModel extends Model[RecursivePipelineModel] with MLWritable with Logging
- class TableAssembler extends AnnotatorModel[TokenAssembler] with HasSimpleAnnotate[TokenAssembler]
This transformer parses text into a tabular representation. The input consists of DOCUMENT annotations and the output is TABLE annotations. The source format can be either JSON or CSV. The format of the JSON files should be:
{ "header": [col1, col2, ..., colN], "rows": [ [val11, val12, ..., val1N], [val22, va22, ..., val2N], ... ] }The CSV format support alternative delimiters (e.g. tab), as well as escaping delimiters by surrounding cell values with double quotes. For example:
column1, column2, "column with, comma" value1, value2, value3 "escaped value", "value with, comma", "value with double ("") quote"
The transformer stores tabular data internally as JSON. The default input format is also JSON.
Example
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import org.apache.spark.ml.Pipeline val csvData = """ |"name", "money", "age" |"Donald Trump", "$100,000,000", "75" |"Elon Musk", "$20,000,000,000,000", "55" |""".stripMargin.trim val data =Seq(csvData).toDF("csv") val documentAssembler = new DocumentAssembler() .setInputCol("csv") .setOutputCol("document") val tableAssembler = new TableAssembler() .setInputCols(Array("document")) .setOutputCol("table") .setInputFormat("csv") val pipeline = new Pipeline() .setStages( Array(documentAssembler, tableAssembler) ).fit(data) val result = pipeline.transform(data) result .selectExpr("explode(table) AS table") .select("table.result", "table.metadata.input_format") .show(false) +--------------------------------------------+-------------+ |result |input_format | +--------------------------------------------+-------------+ |{ |csv | | "header": ["name","money","age"], | | | "rows":[ | | | ["Donald Trump","$100,000,000","75"], | | | ["Elon Musk","$20,000,000,000,000","55"] | | | ] | | |} | | +--------------------------------------------+-------------+
- class TokenAssembler extends AnnotatorModel[TokenAssembler] with HasSimpleAnnotate[TokenAssembler]
This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.
For more extended examples on document pre-processing see the Examples.
Example
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.SentenceDetector import com.johnsnowlabs.nlp.annotator.Tokenizer import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner} import com.johnsnowlabs.nlp.TokenAssembler import org.apache.spark.ml.Pipeline // First, the text is tokenized and cleaned val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("token") val normalizer = new Normalizer() .setInputCols("token") .setOutputCol("normalized") .setLowercase(false) val stopwordsCleaner = new StopWordsCleaner() .setInputCols("normalized") .setOutputCol("cleanTokens") .setCaseSensitive(false) // Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure. val tokenAssembler = new TokenAssembler() .setInputCols("sentences", "cleanTokens") .setOutputCol("cleanText") val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.") .toDF("text") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, normalizer, stopwordsCleaner, tokenAssembler )).fit(data) val result = pipeline.transform(data) result.select("cleanText").show(false) +---------------------------------------------------------------------------------------------------------------------------+ |cleanText | +---------------------------------------------------------------------------------------------------------------------------+ |[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]| +---------------------------------------------------------------------------------------------------------------------------+
- See also
 DocumentAssembler on the data structure
 
Value Members
-  object ActivationFunction
 -  object Annotation extends Serializable
 -  object AnnotationAudio extends Serializable
 -  object AnnotationImage extends Serializable
 -  object AnnotatorType
- object AudioAssembler extends DefaultParamsReadable[AudioAssembler] with Serializable
This is the companion object of AudioAssembler. Please refer to that class for the documentation.
- object Doc2Chunk extends DefaultParamsReadable[Doc2Chunk] with Serializable
This is the companion object of Doc2Chunk. Please refer to that class for the documentation.
- object DocumentAssembler extends DefaultParamsReadable[DocumentAssembler] with Serializable
This is the companion object of DocumentAssembler. Please refer to that class for the documentation.
- object EmbeddingsFinisher extends DefaultParamsReadable[EmbeddingsFinisher] with Serializable
This is the companion object of EmbeddingsFinisher. Please refer to that class for the documentation.
- object Finisher extends DefaultParamsReadable[Finisher] with Serializable
This is the companion object of Finisher. Please refer to that class for the documentation.
- object ImageAssembler extends DefaultParamsReadable[ImageAssembler] with Serializable
This is the companion object of ImageAssembler. Please refer to that class for the documentation.
- object MultiDocumentAssembler extends DefaultParamsReadable[MultiDocumentAssembler] with Serializable
This is the companion object of MultiDocumentAssembler. Please refer to that class for the documentation.
- object PromptAssembler extends DefaultParamsReadable[PromptAssembler] with Serializable
This is the companion object of PromptAssembler. Please refer to that class for the documentation.
 -  object SparkNLP
- object TableAssembler extends DefaultParamsReadable[DocumentAssembler] with Serializable
This is the companion object of TableAssembler. Please refer to that class for the documentation.
- object TokenAssembler extends DefaultParamsReadable[TokenAssembler] with Serializable
This is the companion object of TokenAssembler. Please refer to that class for the documentation.
 -  object functions