com.johnsnowlabs.nlp

package annotators

Type Members

  1. class Chunk2Doc extends AnnotatorModel[Chunk2Doc] with HasSimpleAnnotate[Chunk2Doc]

    Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.

    Example

    Location entities are extracted and converted back into DOCUMENT type for further processing.

    import spark.implicits._
    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
    import com.johnsnowlabs.nlp.annotators.Chunk2Doc
    
    val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
    
    // Extracts Named Entities amongst other things
    val pipeline = PretrainedPipeline("explain_document_dl")
    
    val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
    val explainResult = pipeline.transform(data)
    
    val result = chunkToDoc.transform(explainResult)
    result.selectExpr("explode(chunkConverted)").show(false)
    +------------------------------------------------------------------------------+
    |col                                                                           |
    +------------------------------------------------------------------------------+
    |[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
    |[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
    +------------------------------------------------------------------------------+
    See also

    PretrainedPipeline on how to use the PretrainedPipeline

    Doc2Chunk for converting DOCUMENT annotations to CHUNK

  2. class ChunkTokenizer extends Tokenizer

    Tokenizes and flattens extracted NER chunks.

    The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.

    For extended examples of usage, see the ChunkTokenizerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{ChunkTokenizer, TextMatcher, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val entityExtractor = new TextMatcher()
      .setInputCols("sentence", "token")
      .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT)
      .setOutputCol("entity")
    
    val chunkTokenizer = new ChunkTokenizer()
      .setInputCols("entity")
      .setOutputCol("chunk_token")
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        entityExtractor,
        chunkTokenizer
      ))
    
    val data = Seq(
      "Hello world, my name is Michael, I am an artist and I work at Benezar",
      "Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(false)
    +-----------------------------------------------+---------------------------------------------------+
    |entity                                         |chunk_token                                        |
    +-----------------------------------------------+---------------------------------------------------+
    |[world, Michael, work at Benezar]              |[world, Michael, work, at, Benezar]                |
    |[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
    +-----------------------------------------------+---------------------------------------------------+
  3. class ChunkTokenizerModel extends TokenizerModel

    Instantiated model of the ChunkTokenizer. For usage and examples see the documentation of the main class.

  4. class Chunker extends AnnotatorModel[Chunker] with HasSimpleAnnotate[Chunker]

    This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from a document. Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped by angle brackets <> to be easily distinguishable in the text itself. This example sentence will result in the following form:

    "Peter Pipers employees are picking pecks of pickled peppers."
    "<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"

    To then extract these tags, the regexParsers parameter needs to be set, e.g.:

    val chunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("chunk")
      .setRegexParsers(Array("<NNP>+", "<NNS>+"))

    When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means 1 or more nouns in succession. Additional patterns can also be set with addRegexParsers.

    For more extended examples see the Examples and the ChunkerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val POSTag = PerceptronModel.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("pos")
    
    val chunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("chunk")
      .setRegexParsers(Array("<NNP>+", "<NNS>+"))
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        POSTag,
        chunker
      ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(chunk) as result").show(false)
    +-------------------------------------------------------------+
    |result                                                       |
    +-------------------------------------------------------------+
    |[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
    |[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
    |[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
    |[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
    +-------------------------------------------------------------+
    See also

    PerceptronModel for Part-Of-Speech tagging

  5. class Date2Chunk extends AnnotatorModel[Date2Chunk] with HasSimpleAnnotate[Date2Chunk]

    Converts DATE type Annotations to CHUNK type.

    This can be useful if annotators following the DateMatcher or MultiDateMatcher require CHUNK types. The entity name in the metadata can be changed with setEntityName.
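
    For instance, the entity name could be set like this (a minimal sketch; "dateEntity" is just an illustrative value):

    val namedDate2Chunk = new Date2Chunk()
      .setInputCols("date")
      .setOutputCol("date_chunk")
      .setEntityName("dateEntity") // replaces the default entity name stored in the chunk metadata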

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.annotator._
    
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val inputFormats = Array("yyyy", "yyyy/dd/MM", "MM/yyyy", "yyyy")
    val outputFormat = "yyyy/MM/dd"
    
    val date = new DateMatcher()
      .setInputCols("document")
      .setOutputCol("date")
      .setInputFormats(inputFormats)
      .setOutputFormat(outputFormat)
    
    
    val date2Chunk = new Date2Chunk()
      .setInputCols("date")
      .setOutputCol("date_chunk")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      date,
      date2Chunk
    ))
    
    val data = Seq(
    """Omicron is a new variant of COVID-19, which the World Health Organization designated a variant of concern on Nov. 26, 2021/26/11.""",
    """Neighbouring Austria has already locked down its population this week for at until 2021/10/12, becoming the first to reimpose such restrictions."""
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("date_chunk").show(false)
    +----------------------------------------------------+
    |date_chunk                                          |
    +----------------------------------------------------+
    |[{chunk, 118, 121, 2021/01/01, {sentence -> 0}, []}]|
    |[{chunk, 83, 86, 2021/01/01, {sentence -> 0}, []}]  |
    +----------------------------------------------------+
  6. class DateMatcher extends AnnotatorModel[DateMatcher] with HasSimpleAnnotate[DateMatcher] with DateMatcherUtils

    Matches standard date formats into a provided format. Reads different forms of date and time expressions and converts them to the provided date format.

    Extracts only one date per document. Use with sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.

    Reads the following kind of dates:

    "1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
    "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
    "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
    "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
    "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

    For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

    Pretrained pipelines are available for this module, see Pipelines.

    For extended examples of usage, see the Examples and the DateMatcherTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.DateMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val date = new DateMatcher()
      .setInputCols("document")
      .setOutputCol("date")
      .setAnchorDateYear(2020)
      .setAnchorDateMonth(1)
      .setAnchorDateDay(11)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      date
    ))
    
    val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("date").show(false)
    +-------------------------------------------------+
    |date                                             |
    +-------------------------------------------------+
    |[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
    |[[date, 0, 8, 2020/01/18, [sentence -> 0], []]]  |
    |[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
    +-------------------------------------------------+
    See also

    MultiDateMatcher for matching multiple dates in a document

  7. class DateMatcherTranslator extends Serializable
  8. sealed trait DateMatcherTranslatorPolicy extends AnyRef
  9. trait DateMatcherUtils extends Params
  10. class DocumentCharacterTextSplitter extends AnnotatorModel[DocumentCharacterTextSplitter] with HasSimpleAnnotate[DocumentCharacterTextSplitter]

    Annotator which splits large documents into chunks of roughly given size.

    DocumentCharacterTextSplitter takes a list of separators. It takes the separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.

    For example, given chunk size 20 and overlap 5:

    He was, I take it, the most perfect reasoning and observing machine that the world has seen.
    
    ["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]

    Additionally, you can set whether to explode the resulting splits into individual rows with setExplodeSplits, as shown in the example below.

    For extended examples of usage, see the DocumentCharacterTextSplitterTest.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import org.apache.spark.ml.Pipeline
    
    val textDF =
      spark.read
        .option("wholetext", "true")
        .text("src/test/resources/spell/sherlockholmes.txt")
        .toDF("text")
    
    val documentAssembler = new DocumentAssembler().setInputCol("text")
    val textSplitter = new DocumentCharacterTextSplitter()
      .setInputCols("document")
      .setOutputCol("splits")
      .setChunkSize(20000)
      .setChunkOverlap(200)
      .setExplodeSplits(true)
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
    val result = pipeline.fit(textDF).transform(textDF)
    
    result
      .selectExpr(
        "splits.result",
        "splits[0].begin",
        "splits[0].end",
        "splits[0].end - splits[0].begin as length")
      .show(8, truncate = 80)
    +--------------------------------------------------------------------------------+---------------+-------------+------+
    |                                                                          result|splits[0].begin|splits[0].end|length|
    +--------------------------------------------------------------------------------+---------------+-------------+------+
    |[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
    |["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...|          19798|        39395| 19597|
    |["How did that help you?"\n\n"It was all-important. When a woman thinks that ...|          39371|        59242| 19871|
    |["'But,' said I, 'there would be millions of red-headed men who\nwould apply....|          59166|        77833| 18667|
    |[My friend was an enthusiastic musician, being himself not only a\nvery capab...|          77835|        97769| 19934|
    |["And yet I am not convinced of it," I answered. "The cases which\ncome to li...|          97771|       117248| 19477|
    |["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...|         117250|       137242| 19992|
    |["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...|         137244|       157171| 19927|
    +--------------------------------------------------------------------------------+---------------+-------------+------+
  11. class DocumentNormalizer extends AnnotatorModel[DocumentNormalizer] with HasSimpleAnnotate[DocumentNormalizer]

    Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can apply unwanted character removal with a specific policy, as well as lowercase normalization.

    For extended examples of usage, see the Examples.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val cleanUpPatterns = Array("<[^>]*>")
    
    val documentNormalizer = new DocumentNormalizer()
      .setInputCols("document")
      .setOutputCol("normalizedDocument")
      .setAction("clean")
      .setPatterns(cleanUpPatterns)
      .setReplacement(" ")
      .setPolicy("pretty_all")
      .setLowercase(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentNormalizer
    ))
    
    val text =
      """
    
    
    
      THE WORLD'S LARGEST WEB DEVELOPER SITE
    
    = THE WORLD'S LARGEST WEB DEVELOPER SITE =
    
    
    
    Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..
    
    
    """
    val data = Seq(text).toDF("text")
    val pipelineModel = pipeline.fit(data)
    
    val result = pipelineModel.transform(data)
    result.selectExpr("normalizedDocument.result").show(truncate=false)
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  12. class DocumentTokenSplitter extends AnnotatorModel[DocumentTokenSplitter] with HasSimpleAnnotate[DocumentTokenSplitter]

    Annotator that splits large documents into smaller documents based on the number of tokens in the text.

    Currently, DocumentTokenSplitter splits the text by whitespaces to create the tokens. The number of these tokens will then be used as a measure of the text length. In the future, other tokenization techniques will be supported.

    For example, given 3 tokens and overlap 1:

    He was, I take it, the most perfect reasoning and observing machine that the world has seen.
    
    ["He was, I", "I take it,", "it, the most", "most perfect reasoning", "reasoning and observing", "observing machine that", "that the world", "world has seen."]

    Additionally, you can set whether to explode the resulting splits into individual rows with setExplodeSplits, as shown in the example below.

    For extended examples of usage, see the DocumentTokenSplitterTest.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import org.apache.spark.ml.Pipeline
    
    val textDF =
      spark.read
        .option("wholetext", "true")
        .text("src/test/resources/spell/sherlockholmes.txt")
        .toDF("text")
    
    val documentAssembler = new DocumentAssembler().setInputCol("text")
    val textSplitter = new DocumentTokenSplitter()
      .setInputCols("document")
      .setOutputCol("splits")
      .setNumTokens(512)
      .setTokenOverlap(10)
      .setExplodeSplits(true)
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
    val result = pipeline.fit(textDF).transform(textDF)
    
    result
      .selectExpr(
        "splits.result as result",
        "splits[0].begin as begin",
        "splits[0].end as end",
        "splits[0].end - splits[0].begin as length",
        "splits[0].metadata.numTokens as tokens")
      .show(8, truncate = 80)
    +--------------------------------------------------------------------------------+-----+-----+------+------+
    |                                                                          result|begin|  end|length|tokens|
    +--------------------------------------------------------------------------------+-----+-----+------+------+
    |[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|    0| 3018|  3018|   512|
    |[study of crime, and occupied his\nimmense faculties and extraordinary powers...| 2950| 5707|  2757|   512|
    |[but as I have changed my clothes I can't imagine how you\ndeduce it. As to M...| 5659| 8483|  2824|   512|
    |[quarters received. Be in your chamber then at that hour, and do\nnot take it...| 8427|11241|  2814|   512|
    |[a pity\nto miss it."\n\n"But your client--"\n\n"Never mind him. I may want y...|11188|13970|  2782|   512|
    |[person who employs me wishes his agent to be unknown to\nyou, and I may conf...|13918|16898|  2980|   512|
    |[letters back."\n\n"Precisely so. But how--"\n\n"Was there a secret marriage?...|16836|19744|  2908|   512|
    |[seven hundred in\nnotes," he said.\n\nHolmes scribbled a receipt upon a shee...|19683|22551|  2868|   512|
    +--------------------------------------------------------------------------------+-----+-----+------+------+
  13. class GraphExtraction extends AnnotatorModel[GraphExtraction] with HasSimpleAnnotate[GraphExtraction]

    Extracts a dependency graph between entities.

    The GraphExtraction class takes e.g. extracted entities from a NerDLModel and creates a dependency tree which describes how the entities relate to each other. For that a triple store format is used. Nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.

    Both the DependencyParserModel and TypedDependencyParserModel need to be present in the pipeline. There are two ways to set them:

    1. Both Annotators are present in the pipeline already. The dependencies are taken implicitly from these two Annotators.
    2. Setting setMergeEntities to true will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel and setTypedDependencyParserModel:
    val graph_extraction = new GraphExtraction()
      .setInputCols("document", "token", "ner")
      .setOutputCol("graph")
      .setRelationshipTypes(Array("prefer-LOC"))
      .setMergeEntities(true)
    //.setDependencyParserModel(Array("dependency_conllu", "en",  "public/models"))
    //.setTypedDependencyParserModel(Array("dependency_typed_conllu", "en",  "public/models"))

    To transform the resulting graph into a more generic form such as RDF, see the GraphFinisher.
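
    A minimal sketch of wiring the graph output into a GraphFinisher (column names follow the example below; the parameter names are assumptions based on the GraphFinisher documentation):

    import com.johnsnowlabs.nlp.GraphFinisher

    val graphFinisher = new GraphFinisher()
      .setInputCol("graph")
      .setOutputCol("graph_finished")
      .setOutputAsArray(false)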

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
    import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserModel
    import org.apache.spark.ml.Pipeline
    import com.johnsnowlabs.nlp.annotators.GraphExtraction
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
    
    val nerTagger = NerDLModel.pretrained()
      .setInputCols("sentence", "token", "embeddings")
      .setOutputCol("ner")
    
    val posTagger = PerceptronModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("pos")
    
    val dependencyParser = DependencyParserModel.pretrained()
      .setInputCols("sentence", "pos", "token")
      .setOutputCol("dependency")
    
    val typedDependencyParser = TypedDependencyParserModel.pretrained()
      .setInputCols("dependency", "pos", "token")
      .setOutputCol("dependency_type")
    
    val graph_extraction = new GraphExtraction()
      .setInputCols("document", "token", "ner")
      .setOutputCol("graph")
      .setRelationshipTypes(Array("prefer-LOC"))
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentence,
      tokenizer,
      embeddings,
      nerTagger,
      posTagger,
      dependencyParser,
      typedDependencyParser,
      graph_extraction
    ))
    
    val data = Seq("You and John prefer the morning flight through Denver").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("graph").show(false)
    +-----------------------------------------------------------------------------------------------------------------+
    |graph                                                                                                            |
    +-----------------------------------------------------------------------------------------------------------------+
    |[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
    +-----------------------------------------------------------------------------------------------------------------+
    See also

    GraphFinisher to output the paths in a more generic format, like RDF

  14. class Lemmatizer extends AnnotatorApproach[LemmatizerModel]

    Class to find lemmas out of words with the objective of returning a base dictionary word. Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource. Pretrained models can be loaded with LemmatizerModel.pretrained.

    For available pretrained models please see the Models Hub. For extended examples of usage, see the Examples and the LemmatizerTestSpec.

    Example

    In this example, the lemma dictionary lemmas_small.txt has the form of

    ...
    pick	->	pick	picks	picking	picked
    peck	->	peck	pecking	pecked	pecks
    pickle	->	pickle	pickles	pickled	pickling
    pepper	->	pepper	peppers	peppered	peppering
    ...

    where each key is delimited by -> and values are delimited by \t

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Lemmatizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val lemmatizer = new Lemmatizer()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")
      .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        lemmatizer
      ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
      .toDF("text")
    
    val result = pipeline.fit(data).transform(data)
    result.selectExpr("lemma.result").show(false)
    +------------------------------------------------------------------+
    |result                                                            |
    +------------------------------------------------------------------+
    |[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
    +------------------------------------------------------------------+
    See also

    LemmatizerModel for the instantiated model and pretrained models.

  15. class LemmatizerModel extends AnnotatorModel[LemmatizerModel] with HasSimpleAnnotate[LemmatizerModel]

    Instantiated Model of the Lemmatizer. For usage and examples, please see the documentation of that class. For available pretrained models please see the Models Hub.

    Example

    The lemmatizer from the example of the Lemmatizer can be replaced with:

    val lemmatizer = LemmatizerModel.pretrained()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")

    This will load the default pretrained model which is "lemma_antbnc".
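
    A specific model name and language can also be requested explicitly, for example (a small sketch assuming the English "lemma_antbnc" model named above):

    val lemmatizer = LemmatizerModel.pretrained("lemma_antbnc", "en")
      .setInputCols(Array("token"))
      .setOutputCol("lemma")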

    See also

    Lemmatizer

  16. class MultiDateMatcher extends AnnotatorModel[MultiDateMatcher] with HasSimpleAnnotate[MultiDateMatcher] with DateMatcherUtils

    Matches standard date formats into a provided format.

    Reads the following kind of dates:

    "1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
    "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
    "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
    "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
    "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

    For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

    For extended examples of usage, see the Examples and the MultiDateMatcherTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.MultiDateMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val date = new MultiDateMatcher()
      .setInputCols("document")
      .setOutputCol("date")
      .setAnchorDateYear(2020)
      .setAnchorDateMonth(1)
      .setAnchorDateDay(11)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      date
    ))
    
    val data = Seq("I saw him yesterday and he told me that he will visit us next week")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(date) as dates").show(false)
    +-----------------------------------------------+
    |dates                                          |
    +-----------------------------------------------+
    |[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
    |[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
    +-----------------------------------------------+
  17. class NGramGenerator extends AnnotatorModel[NGramGenerator] with HasSimpleAnnotate[NGramGenerator]

    A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

    When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.

    For more extended examples see the Examples and the NGramGeneratorTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.NGramGenerator
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val nGrams = new NGramGenerator()
      .setInputCols("token")
      .setOutputCol("ngrams")
      .setN(2)
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        nGrams
      ))
    
    val data = Seq("This is my sentence.").toDF("text")
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(ngrams) as result").show(false)
    +------------------------------------------------------------+
    |result                                                      |
    +------------------------------------------------------------+
    |[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
    |[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
    |[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
    |[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
    +------------------------------------------------------------+
  18. class Normalizer extends AnnotatorApproach[NormalizerModel]

    Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.

    For extended examples of usage, see the Examples.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer}
    import org.apache.spark.ml.Pipeline
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val normalizer = new Normalizer()
      .setInputCols("token")
      .setOutputCol("normalized")
      .setLowercase(true)
      .setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuation (keep alphanumeric chars)
    // if cleanupPatterns is not set, only alphabetic characters are kept ([^A-Za-z])
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      normalizer
    ))
    
    val data = Seq("John and Peter are brothers. However they don't support each other that much.")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("normalized.result").show(truncate = false)
    +----------------------------------------------------------------------------------------+
    |result                                                                                  |
    +----------------------------------------------------------------------------------------+
    |[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
    +----------------------------------------------------------------------------------------+
  19. class NormalizerModel extends AnnotatorModel[NormalizerModel] with HasSimpleAnnotate[NormalizerModel]

    Instantiated Model of the Normalizer. For usage and examples, please see the documentation of that class.

    See also

    Normalizer for the base class

  20. trait ReadablePretrainedLemmatizer extends ParamsAndFeaturesReadable[LemmatizerModel] with HasPretrained[LemmatizerModel]
  21. trait ReadablePretrainedStopWordsCleanerModel extends ParamsAndFeaturesReadable[StopWordsCleaner] with HasPretrained[StopWordsCleaner]
  22. trait ReadablePretrainedTextMatcher extends ParamsAndFeaturesReadable[TextMatcherModel] with HasPretrained[TextMatcherModel]
  23. trait ReadablePretrainedTokenizer extends ParamsAndFeaturesReadable[TokenizerModel] with HasPretrained[TokenizerModel]
  24. class RecursiveTokenizer extends AnnotatorApproach[RecursiveTokenizerModel] with ParamsAndFeaturesWritable

    Tokenizes raw text recursively based on a handful of definable rules.

    Unlike the Tokenizer, the RecursiveTokenizer operates based on these array string parameters only (a sketch of setting them explicitly follows the list):

    • prefixes: Strings that will be split when found at the beginning of a token.
    • suffixes: Strings that will be split when found at the end of a token.
    • infixes: Strings that will be split when found in the middle of a token.
    • whitelist: Whitelist of strings not to split.
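
    A minimal sketch of setting these parameters (the values below are illustrative, not the defaults):

    val customRecursiveTokenizer = new RecursiveTokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .setPrefixes(Array("\"", "(", "["))                          // split these off at the start of a token
      .setSuffixes(Array(".", ",", "\"", ")", "]", "!", ";", ":")) // split these off at the end of a token
      .setInfixes(Array("(", ")"))                                 // split on these inside a token
      .setWhitelist(Array("it's", "don't"))                        // never split these strings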

    For extended examples of usage, see the Examples and the TokenizerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new RecursiveTokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer
    ))
    
    val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("token.result").show(false)
    +------------------------------------------------------------------+
    |result                                                            |
    +------------------------------------------------------------------+
    |[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
    +------------------------------------------------------------------+
  25. class RecursiveTokenizerModel extends AnnotatorModel[RecursiveTokenizerModel] with HasSimpleAnnotate[RecursiveTokenizerModel] with ParamsAndFeaturesWritable

    Instantiated model of the RecursiveTokenizer. For usage and examples see the documentation of the main class.

  26. class RegexMatcher extends AnnotatorApproach[RegexMatcherModel]

    Uses rules to match a set of regular expressions and associate them with a provided identifier.

    A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be \d{4}\/\d\d\/\d\d,date which will match strings like "1970/01/01" to the identifier "date".

    Rules must be provided by either setRules (followed by setDelimiter) or an external file.
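
    For instance, rules could be provided inline like this (a minimal sketch; the two rules are only illustrative):

    val inlineRegexMatcher = new RegexMatcher()
      .setRules(Array("""\d{4}\/\d\d\/\d\d,date""", """\d{2}\/\d\d\/\d\d,short_date"""))
      .setDelimiter(",")
      .setInputCols(Array("sentence"))
      .setOutputCol("regex")
      .setStrategy("MATCH_ALL")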

    To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource.

    Pretrained pipelines are available for this module, see Pipelines.

    For extended examples of usage, see the Examples and the RegexMatcherTestSpec.

    Example

    In this example, the rules.txt has the form of

    the\s\w+, followed by 'the'
    ceremonies, ceremony

    where each regex is separated from its identifier by ","

    import ResourceHelper.spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.RegexMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    
    val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    
    val regexMatcher = new RegexMatcher()
      .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",")
      .setInputCols(Array("sentence"))
      .setOutputCol("regex")
      .setStrategy("MATCH_ALL")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher))
    
    val data = Seq(
      "My first sentence with the first rule. This is my second sentence with ceremonies rule."
    ).toDF("text")
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(regex) as result").show(false)
    +--------------------------------------------------------------------------------------------+
    |result                                                                                      |
    +--------------------------------------------------------------------------------------------+
    |[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
    |[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
    +--------------------------------------------------------------------------------------------+
  27. class RegexMatcherModel extends AnnotatorModel[RegexMatcherModel] with HasSimpleAnnotate[RegexMatcherModel]

    Instantiated model of the RegexMatcher. For usage and examples see the documentation of the main class.

  28. class RegexTokenizer extends AnnotatorModel[RegexTokenizer] with HasSimpleAnnotate[RegexTokenizer]

    A tokenizer that splits text by a regex pattern.

    The pattern needs to be set with setPattern; it defines the delimiting pattern, i.e. how the tokens should be split. By default this pattern is \s+, which means that tokens are split by one or more whitespace characters.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.RegexTokenizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val regexTokenizer = new RegexTokenizer()
      .setInputCols("document")
      .setOutputCol("regexToken")
      .setToLowercase(true)
      .setPattern("\\s+")
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        regexTokenizer
      ))
    
    val data = Seq("This is my first sentence.\nThis is my second.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("regexToken.result").show(false)
    +-------------------------------------------------------+
    |result                                                 |
    +-------------------------------------------------------+
    |[this, is, my, first, sentence., this, is, my, second.]|
    +-------------------------------------------------------+
  29. class Stemmer extends AnnotatorModel[Stemmer] with HasSimpleAnnotate[Stemmer]

    Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. For extended examples of usage, see the Examples.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer}
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val stemmer = new Stemmer()
      .setInputCols("token")
      .setOutputCol("stem")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      stemmer
    ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("stem.result").show(truncate = false)
    +-------------------------------------------------------------+
    |result                                                       |
    +-------------------------------------------------------------+
    |[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
    +-------------------------------------------------------------+
  30. class StopWordsCleaner extends AnnotatorModel[StopWordsCleaner] with HasSimpleAnnotate[StopWordsCleaner]

    This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.

    By default, it uses stop words from MLlib's StopWordsRemover. Stop words can also be defined by explicitly setting them with setStopWords(value: Array[String]) or loaded from pretrained models using pretrained of its companion object.
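
    For example, a custom list can be set directly (the stop words below are only illustrative):

    val customStopWords = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setStopWords(Array("this", "is", "and"))
      .setCaseSensitive(false)

    Alternatively, pretrained stop words can be loaded: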

    val stopWords = StopWordsCleaner.pretrained()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    // will load the default pretrained model `"stopwords_en"`.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Examples and StopWordsCleanerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val stopWords = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        stopWords
      ))
    
    val data = Seq(
      "This is my first sentence. This is my second.",
      "This is my third sentence. This is my forth."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("cleanTokens.result").show(false)
    +-------------------------------+
    |result                         |
    +-------------------------------+
    |[first, sentence, ., second, .]|
    |[third, sentence, ., forth, .] |
    +-------------------------------+
  31. class TextMatcher extends AnnotatorApproach[TextMatcherModel] with ParamsAndFeaturesWritable

    Annotator to match exact phrases (by token) provided in a file against a Document.

    A text file of predefined phrases must be provided with setEntities. The text file can also be set directly as an ExternalResource.

    For extended examples of usage, see the Examples and the TextMatcherTestSpec.

    Example

    In this example, the entities file is of the form

    ...
    dolore magna aliqua
    lorem ipsum dolor. sit
    laborum
    ...

    where each line represents an entity phrase to be extracted.

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.TextMatcher
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
    val entityExtractor = new TextMatcher()
      .setInputCols("document", "token")
      .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
      .setOutputCol("entity")
      .setCaseSensitive(false)
      .setTokenizer(tokenizer.fit(data))
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(entity) as result").show(false)
    +------------------------------------------------------------------------------------------+
    |result                                                                                    |
    +------------------------------------------------------------------------------------------+
    |[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
    |[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
    |[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
    +------------------------------------------------------------------------------------------+
    See also

    BigTextMatcher to match large amounts of text

  32. class TextMatcherModel extends AnnotatorModel[TextMatcherModel] with HasSimpleAnnotate[TextMatcherModel]

    Instantiated model of the TextMatcher. For usage and examples see the documentation of the main class.

  33. class TextSplitter extends AnyRef

    Splits texts recursively to match a given length.

  34. class Token2Chunk extends AnnotatorModel[Token2Chunk] with HasSimpleAnnotate[Token2Chunk]

    Converts TOKEN type Annotations to CHUNK type.

    This can be useful if entities have already been extracted as TOKEN and the following annotators require CHUNK types.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer}
    
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val token2chunk = new Token2Chunk()
      .setInputCols("token")
      .setOutputCol("chunk")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      token2chunk
    ))
    
    val data = Seq("One Two Three Four").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(chunk) as result").show(false)
    +------------------------------------------+
    |result                                    |
    +------------------------------------------+
    |[chunk, 0, 2, One, [sentence -> 0], []]   |
    |[chunk, 4, 6, Two, [sentence -> 0], []]   |
    |[chunk, 8, 12, Three, [sentence -> 0], []]|
    |[chunk, 14, 17, Four, [sentence -> 0], []]|
    +------------------------------------------+
  35. class Tokenizer extends AnnotatorApproach[TokenizerModel]

    Tokenizes raw text in document type columns into TokenizedSentence.

    This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.

    Identifies tokens with tokenization open standards. A few rules help customize it if the defaults do not fit user needs, as sketched below.
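
    A minimal sketch of such customization (the split characters and exceptions below are illustrative, not defaults):

    val customTokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .setSplitChars(Array("-"))        // additionally split tokens on hyphens
      .setExceptions(Array("New York")) // keep listed multi-token phrases as single tokens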

    For extended examples of usage, see the Examples and the Tokenizer test class.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import org.apache.spark.ml.Pipeline
    
    val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text")
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data)
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data)
    val result = pipeline.transform(data)
    
    result.selectExpr("token.result").show(false)
    +-----------------------------------------------------------------------+
    |result                                                                 |
    +-----------------------------------------------------------------------+
    |[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
    +-----------------------------------------------------------------------+
  36. class TokenizerModel extends AnnotatorModel[TokenizerModel] with HasSimpleAnnotate[TokenizerModel]

    Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards. A few rules help customize it if the defaults do not fit user needs.

    This class represents an already fitted Tokenizer model.

    See the main class Tokenizer for more examples of usage.

Value Members

  1. object Chunk2Doc extends DefaultParamsReadable[Chunk2Doc] with Serializable

    This is the companion object of Chunk2Doc. Please refer to that class for the documentation.

  2. object ChunkTokenizer extends DefaultParamsReadable[ChunkTokenizer] with Serializable

    This is the companion object of ChunkTokenizer. Please refer to that class for the documentation.

  3. object ChunkTokenizerModel extends ParamsAndFeaturesReadable[ChunkTokenizerModel] with Serializable
  4. object Chunker extends DefaultParamsReadable[Chunker] with Serializable

    This is the companion object of Chunker. Please refer to that class for the documentation.

  5. object Date2Chunk extends DefaultParamsReadable[Date2Chunk] with Serializable

    This is the companion object of Date2Chunk. Please refer to that class for the documentation.

  6. object DateMatcher extends DefaultParamsReadable[DateMatcher] with Serializable

    This is the companion object of DateMatcher. Please refer to that class for the documentation.

  7. object DocumentCharacterTextSplitter extends DefaultParamsReadable[DocumentCharacterTextSplitter] with Serializable

    This is the companion object of DocumentCharacterTextSplitter. Please refer to that class for the documentation.

  8. object DocumentNormalizer extends DefaultParamsReadable[DocumentNormalizer] with Serializable

    This is the companion object of DocumentNormalizer. Please refer to that class for the documentation.

  9. object DocumentTokenSplitter extends DefaultParamsReadable[DocumentTokenSplitter] with Serializable

    This is the companion object of DocumentTokenSplitter. Please refer to that class for the documentation.

  10. object EnglishStemmer
  11. object Lemmatizer extends DefaultParamsReadable[Lemmatizer] with Serializable

    This is the companion object of Lemmatizer. Please refer to that class for the documentation.

  12. object LemmatizerModel extends ReadablePretrainedLemmatizer with Serializable

    This is the companion object of LemmatizerModel. Please refer to that class for the documentation.

  13. object LookAroundManager
  14. object MultiDateMatcher extends DefaultParamsReadable[MultiDateMatcher] with Serializable

    This is the companion object of MultiDateMatcher. Please refer to that class for the documentation.

  15. object MultiDatePolicy extends DateMatcherTranslatorPolicy with Product with Serializable
  16. object NGramGenerator extends ParamsAndFeaturesReadable[NGramGenerator] with Serializable
  17. object Normalizer extends DefaultParamsReadable[Normalizer] with Serializable

    This is the companion object of Normalizer. Please refer to that class for the documentation.

  18. object NormalizerModel extends ParamsAndFeaturesReadable[NormalizerModel] with Serializable
  19. object PretrainedAnnotations
  20. object RecursiveTokenizerModel extends ParamsAndFeaturesReadable[RecursiveTokenizerModel] with Serializable
  21. object RegexMatcher extends DefaultParamsReadable[RegexMatcher] with Serializable

    This is the companion object of RegexMatcher. Please refer to that class for the documentation.

  22. object RegexMatcherModel extends ParamsAndFeaturesReadable[RegexMatcherModel] with Serializable
  23. object RegexTokenizer extends DefaultParamsReadable[RegexTokenizer] with Serializable

    This is the companion object of RegexTokenizer. Please refer to that class for the documentation.

  24. object SingleDatePolicy extends DateMatcherTranslatorPolicy with Product with Serializable
  25. object Stemmer extends DefaultParamsReadable[Stemmer] with Serializable

    This is the companion object of Stemmer. Please refer to that class for the documentation.

  26. object StopWordsCleaner extends ParamsAndFeaturesReadable[StopWordsCleaner] with ReadablePretrainedStopWordsCleanerModel with Serializable
  27. object TextMatcher extends DefaultParamsReadable[TextMatcher] with Serializable

    This is the companion object of TextMatcher. Please refer to that class for the documentation.

  28. object TextMatcherModel extends ReadablePretrainedTextMatcher with Serializable

    This is the companion object of TextMatcherModel. Please refer to that class for the documentation.

  29. object Token2Chunk extends DefaultParamsReadable[Token2Chunk] with Serializable

    This is the companion object of Token2Chunk. Please refer to that class for the documentation.

  30. object Tokenizer extends DefaultParamsReadable[Tokenizer] with Serializable

    This is the companion object of Tokenizer. Please refer to that class for the documentation.

  31. object TokenizerModel extends ReadablePretrainedTokenizer with Serializable

    This is the companion object of TokenizerModel. Please refer to that class for the documentation.
