Training Datasets
These are classes to load common datasets to train annotators for tasks such as part-of-speech tagging, named entity recognition, spell checking and more.
POS Dataset
In order to train a Part of Speech Tagger annotator, we need to get corpus data as a Spark dataframe. There is a component that does this for us: it reads a plain text file and transforms it to a Spark dataset.
Input File Format:
A|DT few|JJ months|NNS ago|RB you|PRP received|VBD a|DT letter|NN
Constructor Parameters:
None
Parameters for readDataset:
- spark: Initiated Spark Session with Spark NLP
- path: Path to the resource
- delimiter: Delimiter of word and POS, by default “|”
- outputPosCol: Name of the output POS column, by default “tags”
- outputDocumentCol: Name of the output document column, by default “document”
- outputTextCol: Name of the output text column, by default “text”
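These defaults can also be overridden explicitly when calling readDataset. A minimal sketch (the corpus path here is only a placeholder):
from sparknlp.training import POS
pos = POS()
posDf = pos.readDataset(
spark,
"path/to/your/pos/corpus.txt",
delimiter="|",
outputPosCol="tags",
outputDocumentCol="document",
outputTextCol="text"
)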
Refer to the documentation for more details on the API:
Python API: POS | Scala API: POS | Source: POS.scala |
Show Example
from sparknlp.training import POS
pos = POS()
path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
posDf = pos.readDataset(spark, path, "|", "tags")
posDf.selectExpr("explode(tags) as tags").show(truncate=False)
+---------------------------------------------+
|tags |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []] |
|[pos, 7, 12, NNP, [word -> Vinken], []] |
|[pos, 14, 14, ,, [word -> ,], []] |
|[pos, 16, 17, CD, [word -> 61], []] |
|[pos, 19, 23, NNS, [word -> years], []] |
|[pos, 25, 27, JJ, [word -> old], []] |
|[pos, 29, 29, ,, [word -> ,], []] |
|[pos, 31, 34, MD, [word -> will], []] |
|[pos, 36, 39, VB, [word -> join], []] |
|[pos, 41, 43, DT, [word -> the], []] |
|[pos, 45, 49, NN, [word -> board], []] |
|[pos, 51, 52, IN, [word -> as], []] |
|[pos, 54, 54, DT, [word -> a], []] |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []] |
|[pos, 78, 81, NNP, [word -> Nov.], []] |
|[pos, 83, 84, CD, [word -> 29], []] |
|[pos, 86, 86, ., [word -> .], []] |
+---------------------------------------------+
import com.johnsnowlabs.nlp.training.POS
val pos = POS()
val path = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val posDf = pos.readDataset(spark, path, "|", "tags")
posDf.selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []] |
|[pos, 7, 12, NNP, [word -> Vinken], []] |
|[pos, 14, 14, ,, [word -> ,], []] |
|[pos, 16, 17, CD, [word -> 61], []] |
|[pos, 19, 23, NNS, [word -> years], []] |
|[pos, 25, 27, JJ, [word -> old], []] |
|[pos, 29, 29, ,, [word -> ,], []] |
|[pos, 31, 34, MD, [word -> will], []] |
|[pos, 36, 39, VB, [word -> join], []] |
|[pos, 41, 43, DT, [word -> the], []] |
|[pos, 45, 49, NN, [word -> board], []] |
|[pos, 51, 52, IN, [word -> as], []] |
|[pos, 54, 54, DT, [word -> a], []] |
|[pos, 56, 67, JJ, [word -> nonexecutive], []]|
|[pos, 69, 76, NN, [word -> director], []] |
|[pos, 78, 81, NNP, [word -> Nov.], []] |
|[pos, 83, 84, CD, [word -> 29], []] |
|[pos, 86, 86, ., [word -> .], []] |
+---------------------------------------------+
CoNLL Dataset
In order to train a Named Entity Recognition DL annotator, we need to get CoNLL format data as a Spark dataframe. There is a component that does this for us: it reads a plain text file and transforms it to a Spark dataset.
The dataset should be in the format of CoNLL 2003 and needs to be specified with readDataset(), which will create a dataframe with the data.
Input File Format:
-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
Constructor Parameters:
- documentCol: Name of the DocumentAssembler column, by default ‘document’
- sentenceCol: Name of the SentenceDetector column, by default ‘sentence’
- tokenCol: Name of the Tokenizer column, by default ‘token’
- posCol: Name of the part-of-speech tag column, by default ‘pos’
- conllLabelIndex: Index of the label column in the dataset, by default 3
- conllPosIndex: Index of the POS tags in the dataset, by default 1
- textCol: Name of the text column in the dataset, by default ‘text’
- labelCol: Name of the label column, by default ‘label’
- explodeSentences: Whether to explode sentences to separate rows, by default True
- delimiter: Delimiter used to separate columns inside CoNLL file
Parameters for readDataset:
- spark: Initiated Spark Session with Spark NLP
- path: Path to the resource
- read_as: How to read the resource, by default ReadAs.TEXT
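Any of the constructor parameters listed above can be adjusted before reading. For example, to keep all sentences of a document in a single row instead of exploding them, a minimal sketch could look like this:
from sparknlp.training import CoNLL
conll = CoNLL(explodeSentences=False)
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")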
Refer to the documentation for more details on the API:
Python API: CoNLL | Scala API: CoNLL | Source: CoNLL.scala |
Show Example
from sparknlp.training import CoNLL
trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
trainingData.selectExpr(
"text",
"token.result as tokens",
"pos.result as pos",
"label.result as label"
).show(3, False)
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|text |tokens |pos |label |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|EU rejects German call to boycott British lamb .|[EU, rejects, German, call, to, boycott, British, lamb, .]|[NNP, VBZ, JJ, NN, TO, VB, JJ, NN, .]|[B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]|
|Peter Blackburn |[Peter, Blackburn] |[NNP, NNP] |[B-PER, I-PER] |
|BRUSSELS 1996-08-22 |[BRUSSELS, 1996-08-22] |[NNP, CD] |[B-LOC, O] |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
import com.johnsnowlabs.nlp.training.CoNLL
val trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
trainingData.selectExpr("text", "token.result as tokens", "pos.result as pos", "label.result as label")
.show(3, false)
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|text |tokens |pos |label |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|EU rejects German call to boycott British lamb .|[EU, rejects, German, call, to, boycott, British, lamb, .]|[NNP, VBZ, JJ, NN, TO, VB, JJ, NN, .]|[B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]|
|Peter Blackburn |[Peter, Blackburn] |[NNP, NNP] |[B-PER, I-PER] |
|BRUSSELS 1996-08-22 |[BRUSSELS, 1996-08-22] |[NNP, CD] |[B-LOC, O] |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
CoNLL-U Dataset
In order to train a DependencyParserApproach annotator, we need to get CoNLL-U format data as a Spark dataframe. There is a component that does this for us: it reads a plain text file and transforms it to a Spark dataset.
The dataset should be in the format of CoNLL-U and needs to be specified with readDataset(), which will create a dataframe with the data.
Input File Format:
# sent_id = 1
# text = They buy and sell books.
1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
3 and and CONJ CC _ 4 cc 4:cc _
4 sell sell VERB VBP Number=Plur|Person=3|Tense=Pres 2 conj 0:root|2:conj _
5 books book NOUN NNS Number=Plur 2 obj 2:obj|4:obj SpaceAfter=No
6 . . PUNCT . _ 2 punct 2:punct _
Constructor Parameters:
- explodeSentences: Whether to explode each sentence to a separate row
Parameters for readDataset:
- spark: Initiated Spark Session with Spark NLP
- path: Path to the resource
- read_as: How to read the resource, by default ReadAs.TEXT
Refer to the documentation for more details on the API:
Python API: CoNLLU | Scala API: CoNLLU | Source: CoNLLU.scala |
Show Example
from sparknlp.training import CoNLLU
conlluFile = "src/test/resources/conllu/en.test.conllu"
conllDataSet = CoNLLU(False).readDataset(spark, conlluFile)
conllDataSet.selectExpr(
"text",
"form.result as form",
"upos.result as upos",
"xpos.result as xpos",
"lemma.result as lemma"
).show(1, False)
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|text |form |upos |xpos |lemma |
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|What if Google Morphed Into GoogleOS? |[What, if, Google, Morphed, Into, GoogleOS, ?]|[PRON, SCONJ, PROPN, VERB, ADP, PROPN, PUNCT]|[WP, IN, NNP, VBD, IN, NNP, .]|[what, if, Google, morph, into, GoogleOS, ?]|
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
import com.johnsnowlabs.nlp.training.CoNLLU
import com.johnsnowlabs.nlp.util.io.ResourceHelper
val conlluFile = "src/test/resources/conllu/en.test.conllu"
val conllDataSet = CoNLLU(false).readDataset(ResourceHelper.spark, conlluFile)
conllDataSet.selectExpr("text", "form.result as form", "upos.result as upos", "xpos.result as xpos", "lemma.result as lemma")
.show(1, false)
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|text |form |upos |xpos |lemma |
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
|What if Google Morphed Into GoogleOS? |[What, if, Google, Morphed, Into, GoogleOS, ?]|[PRON, SCONJ, PROPN, VERB, ADP, PROPN, PUNCT]|[WP, IN, NNP, VBD, IN, NNP, .]|[what, if, Google, morph, into, GoogleOS, ?]|
+---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
PubTator Dataset
The PubTator format includes medical papers’ titles, abstracts, and tagged chunks (see PubTator Docs and MedMentions Docs for more information). We can create a Spark DataFrame from a PubTator text file.
Input File Format:
25763772 0 5 DCTN4 T116,T123 C4308010
25763772 23 63 chronic Pseudomonas aeruginosa infection T047 C0854135
25763772 67 82 cystic fibrosis T047 C0010674
25763772 83 120 Pseudomonas aeruginosa (Pa) infection T047 C0854135
25763772 124 139 cystic fibrosis T047 C0010674
Constructor Parameters:
None
Parameters for readDataset:
- spark: Initiated Spark Session with Spark NLP
- path: Path to the resource
- isPaddedToken: Whether tokens are padded
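Following the parameter list above, isPaddedToken can be passed as the third argument of readDataset. A minimal sketch (the value shown is only illustrative):
from sparknlp.training import PubTator
pubTatorDataSet = PubTator().readDataset(spark, "./src/test/resources/corpus_pubtator_sample.txt", False)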
Refer to the documentation for more details on the API:
Python API: PubTator | Scala API: PubTator | Source: PubTator.scala |
Show Example
from sparknlp.training import PubTator
pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
pubTatorDataSet = PubTator().readDataset(spark, pubTatorFile)
pubTatorDataSet.show(1)
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
| doc_id| finished_token| finished_pos| finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...| [[sentence, 0], [...| [[word, DCTN4], [...| [[word, DCTN4], [...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
import com.johnsnowlabs.nlp.training.PubTator
import com.johnsnowlabs.nlp.util.io.ResourceHelper
val pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
val pubTatorDataSet = PubTator().readDataset(ResourceHelper.spark, pubTatorFile)
pubTatorDataSet.show(1)
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
| doc_id| finished_token| finished_pos| finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...| [[sentence, 0], [...| [[word, DCTN4], [...| [[word, DCTN4], [...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
Spell Checkers Dataset (Corpus)
In order to train a Norvig or Symmetric Delete spell checker, we need to get corpus data as a Spark dataframe. We can read a plain text file and transform it to a Spark dataset.
Example:
train_corpus = spark.read \
.text("./sherlockholmes.txt") \
.withColumnRenamed("value", "text")
val trainCorpus = spark.read
.text("./sherlockholmes.txt")
.withColumnRenamed("value", "text")
Text Processing
These are annotators that can be trained to process text for tasks such as dependency parsing, lemmatisation, part-of-speech tagging, sentence detection and word segmentation.
DependencyParserApproach
Trains an unlabeled parser that finds grammatical relations between two words in a sentence.
The dependency parser provides information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.
The required training data can be set in two different ways (only one can be chosen for a particular model):
- Dependency treebank in the Penn Treebank format, set with setDependencyTreeBank. Data format:
(S (S-TPC-1 (NP-SBJ (NP (NP (DT A) (NN form)) (PP (IN of) (NP (NN asbestos)))) (RRC ...)...)...) ... (VP (VBD reported) (SBAR (-NONE- 0) (S (-NONE- *T*-1)))) (. .))
- Dataset in the CoNLL-U format, set with setConllU. Data format:
1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
Apart from that, no additional training data is needed.
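As a sketch of the second option, the setDependencyTreeBank call in the example below could be swapped for setConllU (the CoNLL-U file path here is hypothetical):
dependencyParserApproach = DependencyParserApproach() \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependency") \
.setConllU("path/to/your/train.conllu")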
See DependencyParserApproachTestSpec for further reference on how to use this API.
Input Annotator Types: DOCUMENT, POS, TOKEN
Output Annotator Type: DEPENDENCY
Python API: DependencyParserApproach | Scala API: DependencyParserApproach | Source: DependencyParserApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols("document") \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols("sentence") \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols("sentence", "token") \
.setOutputCol("pos")
dependencyParserApproach = DependencyParserApproach() \
.setInputCols("sentence", "pos", "token") \
.setOutputCol("dependency") \
.setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParserApproach
])
# Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParserApproach = new DependencyParserApproach()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
.setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParserApproach
))
// Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)
Lemmatizer
Class to find lemmas out of words with the objective of returning a base dictionary word.
Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary. The dictionary can be set as a delimited text file.
Pretrained models can be loaded with LemmatizerModel.pretrained.
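For instance, a pretrained lemmatizer can be loaded instead of training one; a minimal sketch using the default pretrained model:
lemmatizer = LemmatizerModel.pretrained() \
.setInputCols(["token"]) \
.setOutputCol("lemma")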
For extended examples of usage, see the Examples.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: Lemmatizer | Scala API: Lemmatizer | Source: Lemmatizer |
Show Example
# In this example, the lemma dictionary `lemmas_small.txt` has the form of
#
# ...
# pick -> pick picks picking picked
# peck -> peck pecking pecked pecks
# pickle -> pickle pickles pickled pickling
# pepper -> pepper peppers peppered peppering
# ...
#
# where each key is delimited by `->` and values are delimited by `\t`
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
lemmatizer = Lemmatizer() \
.setInputCols(["token"]) \
.setOutputCol("lemma") \
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
pipeline = Pipeline() \
.setStages([
documentAssembler,
sentenceDetector,
tokenizer,
lemmatizer
])
data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
.toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
// In this example, the lemma dictionary `lemmas_small.txt` has the form of
//
// ...
// pick -> pick picks picking picked
// peck -> peck pecking pecked pecks
// pickle -> pickle pickles pickled pickling
// pepper -> pepper peppers peppered peppering
// ...
//
// where each key is delimited by `->` and values are delimited by `\t`
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val lemmatizer = new Lemmatizer()
.setInputCols(Array("token"))
.setOutputCol("lemma")
.setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
lemmatizer
))
val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(false)
+------------------------------------------------------------------+
|result |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
PerceptronApproach (Part of speech tagger)
Trains an averaged Perceptron model to tag words with their part-of-speech. Sets a POS tag to each word within a sentence.
The training data needs to be in a Spark DataFrame, where the column needs to consist of Annotations of type POS. The Annotation needs to have member result set to the POS tag and have a "word" mapping to its word inside of member metadata.
This DataFrame for training can easily be created by the helper class POS.
POS().readDataset(spark, datasetPath).selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []] |
|[pos, 7, 12, NNP, [word -> Vinken], []] |
|[pos, 14, 14, ,, [word -> ,], []] |
|[pos, 31, 34, MD, [word -> will], []] |
|[pos, 36, 39, VB, [word -> join], []] |
|[pos, 41, 43, DT, [word -> the], []] |
|[pos, 45, 49, NN, [word -> board], []] |
...
For extended examples of usage, see the Examples and PerceptronApproach tests.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: POS
Python API: PerceptronApproach | Scala API: PerceptronApproach | Source: PerceptronApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
trainingPerceptronDF = POS().readDataset(spark, datasetPath)
trainedPos = PerceptronApproach() \
.setInputCols(["document", "token"]) \
.setOutputCol("pos") \
.setPosColumn("tags") \
.fit(trainingPerceptronDF)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
trainedPos
])
data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val trainingPerceptronDF = POS().readDataset(spark, datasetPath)
val trainedPos = new PerceptronApproach()
.setInputCols("document", "token")
.setOutputCol("pos")
.setPosColumn("tags")
.fit(trainingPerceptronDF)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
trainedPos
))
val data = Seq("To be or not to be, is this the question?").toDF("text")
val result = pipeline.fit(data).transform(data)
result.selectExpr("pos.result").show(false)
+--------------------------------------------------+
|result |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
SentenceDetectorDLApproach
Trains an annotator that detects sentence boundaries using a deep learning approach.
For pretrained models see SentenceDetectorDLModel.
Currently, only the CNN model is supported for training, but in the future the architecture of the model can be set with setModelArchitecture.
The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed) using a CNN architecture. We also modified the original implementation a little bit to cover broken sentences and some impossible end of line chars.
Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
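As a sketch, both settings mentioned above would be applied through their setters (the values shown are only illustrative):
sentenceDetector = SentenceDetectorDLApproach() \
.setInputCols(["document"]) \
.setOutputCol("sentences") \
.setModelArchitecture("cnn") \
.setExplodeSentences(True)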
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: SentenceDetectorDLApproach | Scala API: SentenceDetectorDLApproach | Source: SentenceDetectorDLApproach |
Show Example
# The training process needs data, where each data point is a sentence.
# In this example the `train.txt` file has the form of
#
# ...
# Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
# His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
# ...
#
# where each line is one sentence.
# Training can then be started like so:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
trainingData = spark.read.text("train.txt").toDF("text")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLApproach() \
.setInputCols(["document"]) \
.setOutputCol("sentences") \
.setEpochsNumber(100)
pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])
model = pipeline.fit(trainingData)
// The training process needs data, where each data point is a sentence.
// In this example the `train.txt` file has the form of
//
// ...
// Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
// His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
// ...
//
// where each line is one sentence.
// Training can then be started like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach
import org.apache.spark.ml.Pipeline
val trainingData = spark.read.text("train.txt").toDF("text")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetectorDLApproach()
.setInputCols(Array("document"))
.setOutputCol("sentences")
.setEpochsNumber(100)
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector))
val model = pipeline.fit(trainingData)
TypedDependencyParser
Labeled parser that finds a grammatical relation between two words in a sentence. Its input is either a CoNLL 2009 or a CoNLL-U dataset.
For instantiated/pretrained models, see TypedDependencyParserModel.
Dependency parsers provide information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.
The parser requires the dependent tokens beforehand with e.g. DependencyParser. The required training data can be set in two different ways (only one can be chosen for a particular model):
- Dataset in the CoNLL 2009 format, set with setConll2009. Data format:
1 The the the DT DT _ _ 4 4 NMOD NMOD _ _ _ _
2 most most most RBS RBS _ _ 3 3 AMOD AMOD _ _ _ _
3 troublesome troublesome troublesome JJ JJ _ _ 4 4 NMOD NMOD _ _ _ _
- Dataset in the CoNLL-U format, set with setConllU. Data format:
1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
Apart from that, no additional training data is needed.
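As a sketch of the first option, the setConllU call in the example below could be replaced with setConll2009 (the CoNLL 2009 file path here is hypothetical):
typedDependencyParser = TypedDependencyParserApproach() \
.setInputCols(["dependency", "pos", "token"]) \
.setOutputCol("dependency_type") \
.setConll2009("path/to/your/train.conll2009")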
See TypedDependencyParserApproachTestSpec for further reference on this API.
Input Annotator Types: TOKEN, POS, DEPENDENCY
Output Annotator Type: LABELED_DEPENDENCY
Python API: TypedDependencyParserApproach | Scala API: TypedDependencyParserApproach | Source: TypedDependencyParserApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
dependencyParser = DependencyParserModel.pretrained() \
.setInputCols(["sentence", "pos", "token"]) \
.setOutputCol("dependency")
typedDependencyParser = TypedDependencyParserApproach() \
.setInputCols(["dependency", "pos", "token"]) \
.setOutputCol("dependency_type") \
.setConllU("src/test/resources/parser/labeled/train_small.conllu.txt") \
.setNumberOfIterations(1)
pipeline = Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParser,
typedDependencyParser
])
# Additional training data is not needed, the dependency parser relies on CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
val dependencyParser = DependencyParserModel.pretrained()
.setInputCols("sentence", "pos", "token")
.setOutputCol("dependency")
val typedDependencyParser = new TypedDependencyParserApproach()
.setInputCols("dependency", "pos", "token")
.setOutputCol("dependency_type")
.setConllU("src/test/resources/parser/labeled/train_small.conllu.txt")
.setNumberOfIterations(1)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
posTagger,
dependencyParser,
typedDependencyParser
))
// Additional training data is not needed, the dependency parser relies on CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)
WordSegmenterApproach
Trains a WordSegmenter which tokenizes non-English or non-whitespace separated texts.
Many languages are not whitespace separated and their sentences are a concatenation of many symbols, like Korean, Japanese or Chinese. Without understanding the language, splitting the words into their corresponding tokens is impossible. The WordSegmenter is trained to understand these languages and split them into semantically correct parts.
To train your own model, a training dataset consisting of Part-Of-Speech tags is required. The data has to be loaded into a dataframe, where the column is an Annotation of type "POS". This can be set with setPosColumn.
Tip: The helper class POS might be useful to read training data into data frames.
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT
Output Annotator Type: TOKEN
Python API: WordSegmenterApproach | Scala API: WordSegmenterApproach | Source: WordSegmenterApproach |
Show Example
# In this example, `"chinese_train.utf8"` is in the form of
#
# 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
#
# and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
wordSegmenter = WordSegmenterApproach() \
.setInputCols(["document"]) \
.setOutputCol("token") \
.setPosColumn("tags") \
.setNIterations(5)
pipeline = Pipeline().setStages([
documentAssembler,
wordSegmenter
])
trainingDataSet = POS().readDataset(
spark,
"src/test/resources/word-segmenter/chinese_train.utf8"
)
pipelineModel = pipeline.fit(trainingDataSet)
// In this example, `"chinese_train.utf8"` is in the form of
//
// 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
//
// and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterApproach
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val wordSegmenter = new WordSegmenterApproach()
.setInputCols("document")
.setOutputCol("token")
.setPosColumn("tags")
.setNIterations(5)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
wordSegmenter
))
val trainingDataSet = POS().readDataset(
ResourceHelper.spark,
"src/test/resources/word-segmenter/chinese_train.utf8"
)
val pipelineModel = pipeline.fit(trainingDataSet)
Spell Checkers
These are annotators that can be trained to correct text.
ContextSpellCheckerApproach
Trains a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.
Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:
- Different correction candidates for each word — word level.
- The surrounding text of each word, i.e. its context — sentence level.
- The relative cost of different correction candidates according to the edit operations at the character level it requires — subword level.
For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.
For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language and the Examples.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: ContextSpellCheckerApproach | Scala API: ContextSpellCheckerApproach | Source: ContextSpellCheckerApproach |
Show Example
# For this example, we use the first Sherlock Holmes book as the training dataset.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")
spellChecker = ContextSpellCheckerApproach() \
.setInputCols("token") \
.setOutputCol("corrected") \
.setWordMaxDistance(3) \
.setBatchSize(24) \
.setEpochs(8) \
.setLanguageModelClasses(1650) # dependent on vocabulary size
# .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path) \
.toDF("text")
pipelineModel = pipeline.fit(dataset)
// For this example, we use the first Sherlock Holmes book as the training dataset.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new ContextSpellCheckerApproach()
.setInputCols("token")
.setOutputCol("corrected")
.setWordMaxDistance(3)
.setBatchSize(24)
.setEpochs(8)
.setLanguageModelClasses(1650) // dependent on vocabulary size
// .addVocabClass("_NAME_", names) // Extra classes for correction could be added like this
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
NorvigSweeting Spellchecker
Trains an annotator that retrieves tokens and makes corrections automatically if they are not found in an English dictionary.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and
dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster
(than the standard approach with deletes + transposes + replaces + inserts) and language independent.
A dictionary of correct spellings must be provided with setDictionary as a text file, where each word is parsed by a regex pattern.
For example, a file "words.txt":
...
gummy
gummic
gummier
gummiest
gummiferous
...
can be parsed with the regular expression \S+, which is the default for setDictionary.
This dictionary is then set to be the basis of the spell checker.
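As a sketch, the token pattern can also be passed explicitly as the second argument of setDictionary (here the default pattern \S+ is spelled out):
spellChecker = NorvigSweetingApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt", "\\S+")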
Inspired by Norvig model and SymSpell.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: NorvigSweetingApproach | Scala API: NorvigSweetingApproach | Source: NorvigSweetingApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = NorvigSweetingApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new NorvigSweetingApproach()
.setInputCols("token")
.setOutputCol("spell")
.setDictionary("src/test/resources/spell/words.txt")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val pipelineModel = pipeline.fit(trainingData)
SymmetricDelete Spellchecker
Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.
A dictionary of correct spellings must be provided with setDictionary as a text file, where each word is parsed by a regex pattern.
For example, a file "words.txt":
...
gummy
gummic
gummier
gummiest
gummiferous
...
can be parsed with the regular expression \S+, which is the default for setDictionary.
This dictionary is then set to be the basis of the spell checker.
Inspired by SymSpell.
For instantiated/pretrained models, see SymmetricDeleteModel.
See SymmetricDeleteModelTestSpec for further reference.
Input Annotator Types: TOKEN
Output Annotator Type: TOKEN
Python API: SymmetricDeleteApproach | Scala API: SymmetricDeleteApproach | Source: SymmetricDeleteApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
spellChecker = SymmetricDeleteApproach() \
.setInputCols(["token"]) \
.setOutputCol("spell") \
.setDictionary("src/test/resources/spell/words.txt")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
spellChecker
])
pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteApproach
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val spellChecker = new SymmetricDeleteApproach()
.setInputCols("token")
.setOutputCol("spell")
.setDictionary("src/test/resources/spell/words.txt")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
spellChecker
))
val pipelineModel = pipeline.fit(trainingData)
Token Classification
These are annotators that can be trained to recognize named entities in text.
NerCrfApproach
Algorithm for training a Named Entity Recognition Model.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.
Excluding the label, this can be done with for example
- a SentenceDetector,
- a Tokenizer,
- a PerceptronModel and
- a WordEmbeddingsModel (any word embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).
Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
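A minimal sketch of this option (the dictionary file and delimiter here are hypothetical):
nerTagger = NerCrfApproach() \
.setInputCols(["sentence", "token", "pos", "embeddings"]) \
.setLabelColumn("label") \
.setOutputCol("ner") \
.setExternalFeatures("path/to/entity_dictionary.txt", ",")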
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: NerCrfApproach | Scala API: NerCrfApproach | Source: NerCrfApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# This CoNLL dataset already includes a sentence, token, POS tags and label
# column with their respective annotator types. If a custom dataset is used,
# these need to be defined with for example:
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
posTagger = PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
# Then training can start:
embeddings = WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
nerTagger = NerCrfApproach() \
.setInputCols(["sentence", "token", "pos", "embeddings"]) \
.setLabelColumn("label") \
.setMinEpochs(1) \
.setMaxEpochs(3) \
.setOutputCol("ner")
pipeline = Pipeline().setStages([
embeddings,
nerTagger
])
# We use the sentences, tokens, POS tags and labels from the CoNLL dataset.
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.training.CoNLL
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
import org.apache.spark.ml.Pipeline
// This CoNLL dataset already includes a sentence, token, POS tags and label
// column with their respective annotator types. If a custom dataset is used,
// these need to be defined with for example:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("pos")
// Then the training can start
val embeddings = WordEmbeddingsModel.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
.setCaseSensitive(false)
val nerTagger = new NerCrfApproach()
.setInputCols("sentence", "token", "pos", "embeddings")
.setLabelColumn("label")
.setMinEpochs(1)
.setMaxEpochs(3)
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
embeddings,
nerTagger
))
// We use the sentences, tokens, POS tags and labels from the CoNLL dataset.
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
NerDLApproach
This Named Entity recognition annotator allows you to train a generic NER model based on neural networks.
The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art results in most datasets.
The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.
Excluding the label, this can be done with for example
- a SentenceDetector,
- a Tokenizer and
- a WordEmbeddingsModel (any word embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).
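Beyond the required columns, NerDLApproach exposes further training options; a minimal sketch (the parameter values here are only illustrative):
nerTagger = NerDLApproach() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label") \
.setOutputCol("ner") \
.setMaxEpochs(10) \
.setLr(0.003) \
.setBatchSize(8) \
.setValidationSplit(0.2) \
.setEnableOutputLogs(True)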
For extended examples of usage, see the Examples.
Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: NerDLApproach | Scala API: NerDLApproach | Source: NerDLApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# This CoNLL dataset already includes a sentence, token and label
# column with their respective annotator types. If a custom dataset is used,
# these need to be defined with for example:
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
# Then the training can start:
embeddings = BertEmbeddings.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
nerTagger = NerDLApproach() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label") \
.setOutputCol("ner") \
.setMaxEpochs(1) \
.setRandomSeed(0) \
.setVerbose(0)
pipeline = Pipeline().setStages([
embeddings,
nerTagger
])
# We use the sentences, tokens, and labels from the CoNLL dataset.
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline
// This CoNLL dataset already includes a sentence, token and label
// column with their respective annotator types. If a custom dataset is used,
// these need to be defined with for example:
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
// Then the training can start
val embeddings = BertEmbeddings.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val nerTagger = new NerDLApproach()
.setInputCols("sentence", "token", "embeddings")
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(1)
.setRandomSeed(0)
.setVerbose(0)
val pipeline = new Pipeline().setStages(Array(
embeddings,
nerTagger
))
// We use the sentences, tokens and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
Text Classification
These are annotators that can be trained to classify text into different classes, such as sentiment.
ClassifierDLApproach
Trains a ClassifierDL for generic Multi-class Text Classification.
ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.
For extended examples of usage, see the Examples [1] [2].
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Note: This annotator accepts a label column of a single item in either type of String, Int, Float, or Double. UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol
Python API: ClassifierDLApproach | Scala API: ClassifierDLApproach | Source: ClassifierDLApproach |
Show Example
# In this example, the training data `"sentiment.csv"` has the form of
#
# text,label
# This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
# ...
#
# Then training can be done like so:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
useEmbeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("sentence_embeddings")
docClassifier = ClassifierDLApproach() \
.setInputCols("sentence_embeddings") \
.setOutputCol("category") \
.setLabelColumn("label") \
.setBatchSize(64) \
.setMaxEpochs(20) \
.setLr(5e-3) \
.setDropout(0.5)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
useEmbeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(smallCorpus)
// In this example, the training data `"sentiment.csv"` has the form of
//
// text,label
// This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
// ...
//
// Then training can be done like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
import org.apache.spark.ml.Pipeline
val smallCorpus = spark.read.option("header","true").csv("src/test/resources/classifier/sentiment.csv")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val useEmbeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val docClassifier = new ClassifierDLApproach()
.setInputCols("sentence_embeddings")
.setOutputCol("category")
.setLabelColumn("label")
.setBatchSize(64)
.setMaxEpochs(20)
.setLr(5e-3f)
.setDropout(0.5f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
useEmbeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
MultiClassifierDLApproach
Trains a MultiClassifierDL for Multi-label Text Classification.
MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.
The input to MultiClassifierDL is Sentence Embeddings such as state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.
In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
For extended examples of usage, see the Examples.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Python API: MultiClassifierDLApproach | Scala API: MultiClassifierDLApproach | Source: MultiClassifierDLApproach |
Show Example
# In this example, the training data has the form
#
# +----------------+--------------------+--------------------+
# | id| text| labels|
# +----------------+--------------------+--------------------+
# |ed58abb40640f983|PN NewsYou mean ... | [toxic]|
# |a1237f726b5f5d89|Dude. Place the ...| [obscene, insult]|
# |24b0d6c8733c2abe|Thanks - thanks ...| [insult]|
# |8c4478fb239bcfc0|" Gee, 5 minutes ...|[toxic, obscene, ...|
# +----------------+--------------------+--------------------+
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# Process training data to create text with associated array of labels
trainDataset.printSchema()
# root
# |-- id: string (nullable = true)
# |-- text: string (nullable = true)
# |-- labels: array (nullable = true)
# | |-- element: string (containsNull = true)
# Then create pipeline for training
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document") \
.setCleanupMode("shrink")
embeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols("document") \
.setOutputCol("embeddings")
docClassifier = MultiClassifierDLApproach() \
.setInputCols("embeddings") \
.setOutputCol("category") \
.setLabelColumn("labels") \
.setBatchSize(128) \
.setMaxEpochs(10) \
.setLr(1e-3) \
.setThreshold(0.5) \
.setValidationSplit(0.1)
pipeline = Pipeline() \
.setStages(
[
documentAssembler,
embeddings,
docClassifier
]
)
pipelineModel = pipeline.fit(trainDataset)
// In this example, the training data has the form (Note: labels can be arbitrary)
//
// mr,ref
// "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
// "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
// ...
//
// It needs some pre-processing first, so the labels are of type `Array[String]`. This can be done like so:
import spark.implicits._
import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, udf}
// Process training data to create text with associated array of labels
def splitAndTrim = udf { labels: String =>
labels.split(", ").map(x=>x.trim)
}
val smallCorpus = spark.read
.option("header", true)
.option("inferSchema", true)
.option("mode", "DROPMALFORMED")
.csv("src/test/resources/classifier/e2e.csv")
.withColumn("labels", splitAndTrim(col("mr")))
.withColumn("text", col("ref"))
.drop("mr")
smallCorpus.printSchema()
// root
// |-- ref: string (nullable = true)
// |-- labels: array (nullable = true)
// | |-- element: string (containsNull = true)
// Then create pipeline for training
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
.setCleanupMode("shrink")
val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
val docClassifier = new MultiClassifierDLApproach()
.setInputCols("embeddings")
.setOutputCol("category")
.setLabelColumn("labels")
.setBatchSize(128)
.setMaxEpochs(10)
.setLr(1e-3f)
.setThreshold(0.5f)
.setValidationSplit(0.1f)
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
embeddings,
docClassifier
)
)
val pipelineModel = pipeline.fit(smallCorpus)
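For a quick sanity check after training, the fitted pipelineModel from the Python example above (trained on the toxic-comment data) can be applied to new text. This is a minimal sketch: the sample sentence is hypothetical, while the category output column and the 0.5 threshold are the ones configured above.
testData = spark.createDataFrame([
    ["This was a waste of time and an insult to the viewer."]
]).toDF("text")
# Each row receives zero or more labels, depending on which class probabilities exceed the threshold
pipelineModel.transform(testData) \
    .selectExpr("text", "category.result as predicted_labels") \
    .show(truncate=False)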
SentimentDLApproach
Trains a SentimentDL, an annotator for multi-class sentiment analysis.
In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is whether a product review or a tweet can be categorized as positive or negative.
The input to SentimentDL is Sentence Embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.
For extended examples of usage, see the Examples.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Python API: SentimentDLApproach | Scala API: SentimentDLApproach | Source: SentimentDLApproach |
ViveknSentimentApproach
Trains a sentiment analyser inspired by the algorithm by Vivek Narayanan https://github.com/vivekn/sentiment/.
The algorithm is based on the paper “Fast and accurate sentiment classification using an enhanced Naive Bayes model”.
The analyzer requires sentence boundaries to give a score in context, and tokenization to make sure tokens are within bounds. Transitivity requirements between these annotators also apply.
The training data needs to consist of a column for normalized text and a label column (either "positive" or "negative").
For extended examples of usage, see the Examples.
Input Annotator Types: TOKEN, DOCUMENT
Output Annotator Type: SENTIMENT
Python API: ViveknSentimentApproach | Scala API: ViveknSentimentApproach | Source: ViveknSentimentApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
document = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
token = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normal")
vivekn = ViveknSentimentApproach() \
.setInputCols(["document", "normal"]) \
.setSentimentCol("train_sentiment") \
.setOutputCol("result_sentiment")
finisher = Finisher() \
.setInputCols(["result_sentiment"]) \
.setOutputCols("final_sentiment")
pipeline = Pipeline().setStages([document, token, normalizer, vivekn, finisher])
training = spark.createDataFrame([
("I really liked this movie!", "positive"),
("The cast was horrible", "negative"),
("Never going to watch this again or recommend it to anyone", "negative"),
("It's a waste of time", "negative"),
("I loved the protagonist", "positive"),
("The music was really really good", "positive")
]).toDF("text", "train_sentiment")
pipelineModel = pipeline.fit(training)
data = spark.createDataFrame([
["I recommend this movie"],
["Dont waste your time!!!"]
]).toDF("text")
result = pipelineModel.transform(data)
result.select("final_sentiment").show(truncate=False)
+---------------+
|final_sentiment|
+---------------+
|[positive] |
|[negative] |
+---------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.Normalizer
import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach
import com.johnsnowlabs.nlp.Finisher
import org.apache.spark.ml.Pipeline
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val token = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normal")
val vivekn = new ViveknSentimentApproach()
.setInputCols("document", "normal")
.setSentimentCol("train_sentiment")
.setOutputCol("result_sentiment")
val finisher = new Finisher()
.setInputCols("result_sentiment")
.setOutputCols("final_sentiment")
val pipeline = new Pipeline().setStages(Array(document, token, normalizer, vivekn, finisher))
val training = Seq(
("I really liked this movie!", "positive"),
("The cast was horrible", "negative"),
("Never going to watch this again or recommend it to anyone", "negative"),
("It's a waste of time", "negative"),
("I loved the protagonist", "positive"),
("The music was really really good", "positive")
).toDF("text", "train_sentiment")
val pipelineModel = pipeline.fit(training)
val data = Seq(
"I recommend this movie",
"Dont waste your time!!!"
).toDF("text")
val result = pipelineModel.transform(data)
result.select("final_sentiment").show(false)
+---------------+
|final_sentiment|
+---------------+
|[positive] |
|[negative] |
+---------------+
Text Representation
These are annotators that can be trained to turn text into a numerical representation.
Doc2VecApproach
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in that vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation in Spark ML, which uses the skip-gram model and hierarchical softmax to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Doc2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Input Annotator Types: TOKEN
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: Doc2VecApproach | Scala API: Doc2VecApproach | Source: Doc2VecApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = Doc2VecApproach() \
.setInputCols(["token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Doc2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new Doc2VecApproach()
.setInputCols("token")
.setOutputCol("embeddings")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
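As a brief follow-up sketch in Python (reusing the fitted pipelineModel and dataset from the example above), each document ends up with a single sentence embedding in the embeddings column, and its dimension can be inspected like so:
result = pipelineModel.transform(dataset)
# Every annotation in the "embeddings" column carries the document vector in its `embeddings` field
result.selectExpr("explode(embeddings) as doc_embedding") \
    .selectExpr("size(doc_embedding.embeddings) as embedding_dimension") \
    .show(5)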
Word2VecApproach
Trains a Word2Vec model that creates vector representations of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in that vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.
We use the Word2Vec implementation in Spark ML, which uses the skip-gram model and hierarchical softmax to train the model. The variable names in the implementation match the original C implementation.
For instantiated/pretrained models, see Word2VecModel.
Sources :
For the original C implementation, see https://code.google.com/p/word2vec/
For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
Input Annotator Types: TOKEN
Output Annotator Type: WORD_EMBEDDINGS
Python API: Word2VecApproach | Scala API: Word2VecApproach | Source: Word2VecApproach |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = Word2VecApproach() \
.setInputCols(["token"]) \
.setOutputCol("embeddings")
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings
])
path = "sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")
pipelineModel = pipeline.fit(dataset)
import spark.implicits._
import com.johnsnowlabs.nlp.annotator.{Tokenizer, Word2VecApproach}
import com.johnsnowlabs.nlp.base.DocumentAssembler
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val embeddings = new Word2VecApproach()
.setInputCols("token")
.setOutputCol("embeddings")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings
))
val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
.toDF("text")
val pipelineModel = pipeline.fit(dataset)
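If the raw word vectors are needed by downstream Spark ML stages, an EmbeddingsFinisher can be appended to the pipeline. The following is a minimal Python sketch that reuses the stage names from the Python example above; the finished_embeddings column name is an arbitrary choice:
from sparknlp.base import EmbeddingsFinisher
# Convert the token-level WORD_EMBEDDINGS annotations into plain Spark ML vectors
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["embeddings"]) \
    .setOutputCols(["finished_embeddings"]) \
    .setOutputAsVector(True)
pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher
    ])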
External Trainable Models
These are annotators that are trained in an external library and then loaded into Spark NLP.
AlbertForTokenClassification
AlbertForTokenClassification can load Albert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
Since Spark NLP 3.2.x, this annotator needs to be trained externally using the transformers library. After the training process is done, the model checkpoint can be loaded by this annotator. This is done with loadSavedModel (for loading the transformers model) and load for the saved Spark NLP model.
For an extended example see the Examples.
Example for loading a saved transformers model:
# Installing prerequisites
!pip install -q transformers==4.10.0 tensorflow==2.4.1 sentencepiece
# Loading the external transformers model
from transformers import TFAlbertForTokenClassification, AlbertTokenizer
MODEL_NAME = 'HooshvareLab/albert-fa-zwnj-base-v2-ner'
tokenizer = AlbertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
# just in case if there is no TF/Keras file provided in the model
# we can just use `from_pt` and convert PyTorch to TensorFlow
try:
print('try downloading TF weights')
model = TFAlbertForTokenClassification.from_pretrained(MODEL_NAME)
except:
print('try downloading PyTorch weights')
model = TFAlbertForTokenClassification.from_pretrained(MODEL_NAME, from_pt=True)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
# Extracting the tokenizer resources
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
!cp {MODEL_NAME}_tokenizer/spiece.model {asset_path}
# Get label2id dictionary
labels = model.config.label2id
# Sort the dictionary based on the id
labels = sorted(labels, key=labels.get)
with open(asset_path+'/labels.txt', 'w') as f:
f.write('\n'.join(labels))
The model can then be loaded into Spark NLP and used as shown in the following examples:
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: AlbertForTokenClassification | Scala API: AlbertForTokenClassification | Source: AlbertForTokenClassification |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
MODEL_NAME = 'HooshvareLab/albert-fa-zwnj-base-v2-ner'
tokenClassifier = AlbertForTokenClassification\
.loadSavedModel('{}/saved_model/1'.format(MODEL_NAME), spark)\
.setInputCols(["document",'token'])\
.setOutputCol("ner")\
.setCaseSensitive(False)\
.setMaxSentenceLength(128)
# Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
tokenClassifier_loaded = AlbertForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
.setInputCols(["sentence",'token'])\
.setOutputCol("ner")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
tokenClassifier
])
data = spark.createDataFrame([["دفتر مرکزی شرکت کامیکو در شهر ساسکاتون ساسکاچوان قرار دارد."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("ner.result").show(truncate=False)
+-----------------------------------------------------+
|result |
+-----------------------------------------------------+
|[O, O, B-ORG, I-ORG, O, B-LOC, I-LOC, I-LOC, O, O, O]|
+-----------------------------------------------------+
// The model needs to be trained with the transformers library.
// Afterwards it can be loaded into the scala version of Spark NLP.
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val ModelName = "HooshvareLab/albert-fa-zwnj-base-v2-ner"
val tokenClassifier = AlbertForTokenClassification
.loadSavedModel(s"$ModelName/saved_model/1", spark)
.setInputCols("document", "token")
.setOutputCol("ner")
.setCaseSensitive(false)
.setMaxSentenceLength(128)
// Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write.overwrite().save(s"${ModelName}_spark_nlp")
val tokenClassifierLoaded = AlbertForTokenClassification.load(s"${ModelName}_spark_nlp")
.setInputCols("document", "token")
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifierLoaded
))
val data = Seq("دفتر مرکزی شرکت کامیکو در شهر ساسکاتون ساسکاچوان قرار دارد.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("ner.result").show(truncate = false)
+-----------------------------------------------------+
|result |
+-----------------------------------------------------+
|[O, O, B-ORG, I-ORG, O, B-LOC, I-LOC, I-LOC, O, O, O]|
+-----------------------------------------------------+
BertForSequenceClassification
BertForSequenceClassification can load BERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks.
Since Spark NLP 3.2.x, this annotator needs to be trained externally using the transformers library. After the training process is done, the model checkpoint can be loaded by this annotator. This is done with loadSavedModel (for loading the transformers model) and load for the saved Spark NLP model.
For an extended example see the Examples.
Example for loading a saved transformers model:
# Installing prerequisites
!pip install -q transformers==4.8.1 tensorflow==2.4.1
# Loading the external transformers model
from transformers import TFBertForSequenceClassification, BertTokenizer
MODEL_NAME = 'finiteautomata/beto-sentiment-analysis'
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
try:
model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME)
except:
model = TFBertForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
# Extracting the tokenizer resources
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
# Get label2id dictionary
labels = model.config.label2id
# Sort the dictionary based on the id
labels = sorted(labels, key=labels.get)
with open(asset_path+'/labels.txt', 'w') as f:
f.write('\n'.join(labels))
The model can then be loaded into Spark NLP and used as shown in the following examples:
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CATEGORY
Python API: BertForSequenceClassification | Scala API: BertForSequenceClassification | Source: BertForSequenceClassification |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
MODEL_NAME = 'finiteautomata/beto-sentiment-analysis'
tokenClassifier = BertForSequenceClassification.loadSavedModel(
'{}/saved_model/1'.format(MODEL_NAME),
spark) \
.setInputCols(["document",'token']) \
.setOutputCol("ner") \
.setCaseSensitive(True) \
.setMaxSentenceLength(128)
# Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
tokenClassifier = BertForSequenceClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
.setInputCols(["document",'token'])\
.setOutputCol("label")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
tokenClassifier
])
data = spark.createDataFrame([["¡La película fue genial!"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[POS] |
+------+
// The model needs to be trained with the transformers library.
// Afterwards it can be loaded into the scala version of Spark NLP.
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val ModelName = "finiteautomata/beto-sentiment-analysis"
val tokenClassifier = BertForSequenceClassification
.loadSavedModel(s"$ModelName/saved_model/1", spark)
.setInputCols("document", "token")
.setOutputCol("ner")
.setCaseSensitive(false)
.setMaxSentenceLength(128)
// Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write.overwrite().save(s"${ModelName}_spark_nlp")
val tokenClassifierLoaded = BertForSequenceClassification.load(s"${ModelName}_spark_nlp")
.setInputCols("document", "token")
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifierLoaded
))
val data = Seq("¡La película fue genial!").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("ner.result").show(truncate = false)
+------+
|result|
+------+
|[POS] |
+------+
BertForTokenClassification
BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
Since Spark NLP 3.2.x, this annotator needs to be trained externally using the transformers library. After the training process is done, the model checkpoint can be loaded by this annotator. This is done with loadSavedModel (for loading the transformers model) and load for the saved Spark NLP model.
For an extended example see the Examples.
Example for loading a saved transformers model:
# Installing prerequisites
!pip install -q transformers==4.8.1 tensorflow==2.4.1 spark-nlp>=3.2.0
# Loading the external transformers model
from transformers import TFBertForTokenClassification, BertTokenizer
MODEL_NAME = 'dslim/bert-base-NER'
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
model = TFBertForTokenClassification.from_pretrained(MODEL_NAME)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
# Extracting the tokenizer resources
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
# Get label2id dictionary
labels = model.config.label2id
# Sort the dictionary based on the id
labels = sorted(labels, key=labels.get)
with open(asset_path+'/labels.txt', 'w') as f:
f.write('\n'.join(labels))
The model can then be loaded into Spark NLP and used as shown in the following examples:
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: BertForTokenClassification | Scala API: BertForTokenClassification | Source: BertForTokenClassification |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
MODEL_NAME = 'dslim/bert-base-NER'
tokenClassifier = BertForTokenClassification.loadSavedModel(
'{}/saved_model/1'.format(MODEL_NAME),
spark) \
.setInputCols(["document",'token']) \
.setOutputCol("ner") \
.setCaseSensitive(True) \
.setMaxSentenceLength(128)
# Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
tokenClassifier = BertForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
.setInputCols(["document",'token'])\
.setOutputCol("label")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
tokenClassifier
])
data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
// The model needs to be trained with the transformers library.
// Afterwards it can be loaded into the scala version of Spark NLP.
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val modelName = "dslim/bert-base-NER"
var tokenClassifier = BertForTokenClassification.loadSavedModel(s"$modelName/saved_model/1", spark)
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
.setMaxSentenceLength(128)
// Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write.overwrite.save(s"${modelName}_spark_nlp")
tokenClassifier = BertForTokenClassification.load(s"${modelName}_spark_nlp")
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifier
))
val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("label.result").show(false)
+------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
DistilBertForSequenceClassification
DistilBertForSequenceClassification can load DistilBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks.
Since Spark NLP 3.2.x, this annotator needs to be trained externally using the transformers library. After the training process is done, the model checkpoint can be loaded by this annotator. This is done with loadSavedModel (for loading the transformers model) and load for the saved Spark NLP model.
For an extended example see the Examples.
Example for loading a saved transformers model:
# Installing prerequisites
!pip install -q transformers==4.8.1 tensorflow==2.4.1
# Loading the external transformers model
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
try:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
except:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, from_pt=True)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
# Extracting the tokenizer resources
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
# Get label2id dictionary
labels = model.config.label2id
# Sort the dictionary based on the id
labels = sorted(labels, key=labels.get)
with open(asset_path+'/labels.txt', 'w') as f:
f.write('\n'.join(labels))
The model can then be loaded into Spark NLP and used as shown in the following examples:
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CATEGORY
Python API: DistilBertForSequenceClassification | Scala API: DistilBertForSequenceClassification | Source: DistilBertForSequenceClassification |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenClassifier = DistilBertForSequenceClassification.loadSavedModel(
'{}/saved_model/1'.format(MODEL_NAME),
spark) \
.setInputCols(["document",'token']) \
.setOutputCol("ner") \
.setCaseSensitive(True) \
.setMaxSentenceLength(128)
# Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
tokenClassifier = DistilBertForSequenceClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
.setInputCols(["document",'token'])\
.setOutputCol("label")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
tokenClassifier
])
data = spark.createDataFrame([["The movie was great!"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+----------+
|result |
+----------+
|[POSITIVE]|
+----------+
// The model needs to be trained with the transformers library.
// Afterwards it can be loaded into the scala version of Spark NLP.
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val ModelName = "distilbert-base-uncased-finetuned-sst-2-english"
val tokenClassifier = DistilBertForSequenceClassification
.loadSavedModel(s"$ModelName/saved_model/1", spark)
.setInputCols("document", "token")
.setOutputCol("ner")
.setCaseSensitive(false)
.setMaxSentenceLength(128)
// Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write.overwrite().save(s"${ModelName}_spark_nlp")
val tokenClassifierLoaded = DistilBertForSequenceClassification.load(s"${ModelName}_spark_nlp")
.setInputCols("document", "token")
.setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifierLoaded
))
val data = Seq("The movie was great!").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("ner.result").show(truncate = false)
+----------+
|result |
+----------+
|[POSITIVE]|
+----------+
DistilBertForTokenClassification
DistilBertForTokenClassification can load DistilBERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.
Since Spark NLP 3.2.x, this annotator needs to be trained externally using the transformers library. After the training process is done, the model checkpoint can be loaded by this annotator. This is done with loadSavedModel (for loading the transformers model) and load for the saved Spark NLP model.
For an extended example see the Examples.
Example for loading a saved transformers model:
# Installing prerequisites
!pip install -q transformers==4.8.1 tensorflow==2.4.1 spark-nlp>=3.2.0
# Loading the external transformers model
from transformers import TFDistilBertForTokenClassification, DistilBertTokenizer
MODEL_NAME = 'elastic/distilbert-base-cased-finetuned-conll03-english'
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
# let's add from_pt=True since there is no TF weights available for this model
# from_pt=True will convert the pytorch model to tf model
model = TFDistilBertForTokenClassification.from_pretrained(MODEL_NAME, from_pt=True)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
# Extracting the tokenizer resources
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
# Get label2id dictionary
labels = model.config.label2id
# Sort the dictionary based on the id
labels = sorted(labels, key=labels.get)
with open(asset_path+'/labels.txt', 'w') as f:
f.write('\n'.join(labels))
The model can then be loaded into Spark NLP and used as shown in the following examples:
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: DistilBertForTokenClassification | Scala API: DistilBertForTokenClassification | Source: DistilBertForTokenClassification |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
spark = sparknlp.start()
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
MODEL_NAME = 'elastic/distilbert-base-cased-finetuned-conll03-english'
tokenClassifier = DistilBertForTokenClassification.loadSavedModel(
'{}/saved_model/1'.format(MODEL_NAME),
spark) \
.setInputCols(["document",'token']) \
.setOutputCol("label") \
.setCaseSensitive(True) \
.setMaxSentenceLength(128)
# Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
tokenClassifier = DistilBertForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
.setInputCols(["document",'token'])\
.setOutputCol("label")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
tokenClassifier
])
data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
// The model needs to be trained with the transformers library.
// Afterwards it can be loaded into the scala version of Spark NLP.
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val modelName = "elastic/distilbert-base-cased-finetuned-conll03-english"
var tokenClassifier = DistilBertForTokenClassification.loadSavedModel(s"$modelName/saved_model/1", spark)
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
.setMaxSentenceLength(128)
// Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write.overwrite.save(s"${modelName}_spark_nlp")
tokenClassifier = DistilBertForTokenClassification.load(s"${modelName}_spark_nlp")
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifier
))
val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("label.result").show(false)
+------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
RoBertaForTokenClassification
RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
Since Spark NLP 3.2.x, this annotator needs to be trained externally using the transformers library. After the training process is done, the model checkpoint can be loaded by this annotator. This is done with loadSavedModel (for loading the transformers model) and load for the saved Spark NLP model.
For an extended example see the Examples.
Example for loading a saved transformers model:
# Installing prerequisites
!pip install -q transformers==4.10.0 tensorflow==2.4.1
# Loading the external transformers model
from transformers import TFRobertaForTokenClassification, RobertaTokenizer
MODEL_NAME = 'philschmid/distilroberta-base-ner-wikiann-conll2003-3-class'
tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
# just in case if there is no TF/Keras file provided in the model
# we can just use `from_pt` and convert PyTorch to TensorFlow
try:
print('try downloading TF weights')
model = TFRobertaForTokenClassification.from_pretrained(MODEL_NAME)
except:
print('try downloading PyTorch weights')
model = TFRobertaForTokenClassification.from_pretrained(MODEL_NAME, from_pt=True)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
# get assets
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
# let's save the vocab as txt file
with open('{}_tokenizer/vocab.txt'.format(MODEL_NAME), 'w') as f:
for item in tokenizer.get_vocab().keys():
f.write("%s\n" % item)
# let's copy both vocab.txt and merges.txt files to saved_model/1/assets
!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}
!cp {MODEL_NAME}_tokenizer/merges.txt {asset_path}
# get label2id dictionary
labels = model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)
with open(asset_path+'/labels.txt', 'w') as f:
f.write('\n'.join(labels))
The model can then be loaded into Spark NLP and used as shown in the following examples:
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: RoBertaForTokenClassification | Scala API: RoBertaForTokenClassification | Source: RoBertaForTokenClassification |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
MODEL_NAME = 'philschmid/distilroberta-base-ner-wikiann-conll2003-3-class'
tokenClassifier = RoBertaForTokenClassification.loadSavedModel(
'{}/saved_model/1'.format(MODEL_NAME),
spark) \
.setInputCols(["document",'token']) \
.setOutputCol("ner") \
.setCaseSensitive(True) \
.setMaxSentenceLength(128)
# Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
tokenClassifier = RoBertaForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
.setInputCols(["document",'token'])\
.setOutputCol("label")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
tokenClassifier
])
data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
// The model needs to be trained with the transformers library.
// Afterwards it can be loaded into the scala version of Spark NLP.
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val modelName = "philschmid/distilroberta-base-ner-wikiann-conll2003-3-class"
var tokenClassifier = RoBertaForTokenClassification.loadSavedModel(s"$modelName/saved_model/1", spark)
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
.setMaxSentenceLength(128)
// Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write.overwrite.save(s"${modelName}_spark_nlp")
tokenClassifier = RoBertaForTokenClassification.load(s"${modelName}_spark_nlp")
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifier
))
val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("label.result").show(false)
+------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
XlmRoBertaForTokenClassification
XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
Since Spark NLP 3.2.x, this annotator needs to be trained externally using the transformers library. After the training process is done, the model checkpoint can be loaded by this annotator. This is done with loadSavedModel (for loading the transformers model) and load for the saved Spark NLP model.
For an extended example see the Examples.
Example for loading a saved transformers model:
# Installing prerequisites
!pip install -q transformers==4.10.0 tensorflow==2.4.1
# Loading the external transformers model
from transformers import TFXLMRobertaForTokenClassification, XLMRobertaTokenizer
MODEL_NAME = 'wpnbos/xlm-roberta-base-conll2002-dutch'
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))
# just in case if there is no TF/Keras file provided in the model
# we can just use `from_pt` and convert PyTorch to TensorFlow
try:
print('try downloading TF weights')
model = TFXLMRobertaForTokenClassification.from_pretrained(MODEL_NAME)
except:
print('try downloading PyTorch weights')
model = TFXLMRobertaForTokenClassification.from_pretrained(MODEL_NAME, from_pt=True)
model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True)
# get assets
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)
# let's copy sentencepiece.bpe.model file to saved_model/1/assets
!cp {MODEL_NAME}_tokenizer/sentencepiece.bpe.model {asset_path}
# get label2id dictionary
labels = model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)
with open(asset_path+'/labels.txt', 'w') as f:
f.write('\n'.join(labels))
The model can then be loaded into Spark NLP and used as shown in the following examples:
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: XlmRoBertaForTokenClassification | Scala API: XlmRoBertaForTokenClassification | Source: XlmRoBertaForTokenClassification |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
MODEL_NAME = 'wpnbos/xlm-roberta-base-conll2002-dutch'
tokenClassifier = XlmRoBertaForTokenClassification.loadSavedModel(
'{}/saved_model/1'.format(MODEL_NAME),
spark) \
.setInputCols(["document",'token']) \
.setOutputCol("ner") \
.setCaseSensitive(True) \
.setMaxSentenceLength(128)
# Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))
tokenClassifier = XlmRoBertaForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
.setInputCols(["document",'token'])\
.setOutputCol("label")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
tokenClassifier
])
data = spark.createDataFrame([["John Lenon is geboren in Londen en heeft in Parijs gewoond. Mijn naam is Sarah en ik woon in Londen"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[LABEL_1, LABEL_2, LABEL_0, LABEL_0, LABEL_0, LABEL_5, LABEL_0, LABEL_0, LABEL_0, LABEL_5, LABEL_0, LABEL_0, LABEL_0, LABEL_0, LABEL_0, LABEL_1, LABEL_0, LABEL_0, LABEL_0, LABEL_0, LABEL_5]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
// The model needs to be trained with the transformers library.
// Afterwards it can be loaded into the scala version of Spark NLP.
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val modelName = "wpnbos/xlm-roberta-base-conll2002-dutch"
var tokenClassifier = XlmRoBertaForTokenClassification.loadSavedModel(s"$modelName/saved_model/1", spark)
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
.setMaxSentenceLength(128)
// Optionally the classifier can be saved to load it more conveniently into Spark NLP
tokenClassifier.write.overwrite.save(s"${modelName}_spark_nlp")
tokenClassifier = XlmRoBertaForTokenClassification.load(s"${modelName}_spark_nlp")
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifier
))
val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("label.result").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[LABEL_1, LABEL_2, LABEL_0, LABEL_0, LABEL_0, LABEL_5, LABEL_0, LABEL_0, LABEL_0, LABEL_5, LABEL_0, LABEL_0, LABEL_0, LABEL_0, LABEL_0, LABEL_1, LABEL_0, LABEL_0, LABEL_0, LABEL_0, LABEL_5]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
TensorFlow Graphs
NerDL uses a Char CNN - BiLSTM - CRF neural network architecture. Spark NLP defines this architecture through a TensorFlow graph, which requires the following parameters:
- Tags
- Embeddings Dimension
- Number of Chars
Spark NLP infers these values from the training dataset used in the NerDLApproach annotator and tries to load a matching graph embedded in the spark-nlp package. Currently, Spark NLP provides graphs for the most common combinations of tags, embeddings dimension, and number of chars:
Tags | Embeddings Dimension |
---|---|
10 | 100 |
10 | 200 |
10 | 300 |
10 | 768 |
10 | 1024 |
25 | 300 |
All of these graphs use an LSTM of size 128 and a number of chars of 100.
If your training dataset has a combination of tags, embeddings dimension, number of chars, and LSTM size that differs from those shown in the table above, NerDLApproach will raise an IllegalArgumentException at runtime with the message below:
Graph [parameter] should be [value]: Could not find a suitable tensorflow graph for embeddings dim: [value] tags: [value] nChars: [value]. Check https://sparknlp.org/docs/en/graph for instructions to generate the required graph.
To overcome this exception, follow these steps:
- Clone the spark-nlp GitHub repo.
- Run the Python script create_models with the number of tags, embeddings dimension, and number of chars mentioned in your exception message:
cd spark-nlp/python/tensorflow
export PYTHONPATH=lib/ner
python create_models.py [number_of_tags] [embeddings_dimension] [number_of_chars] [output_path]
- This will generate a graph in the directory defined by the output_path argument.
- Retry training with the NerDLApproach annotator, but this time use the parameter setGraphFolder with the path to your graph (see the sketch after these notes).
Note: Make sure that you have Python 3 and TensorFlow 1.15.0 installed on your system, since create_models requires those versions to generate the graph successfully.
Note: There is also a notebook in the same directory if you prefer a Jupyter notebook to create your custom graph.
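As a minimal sketch of that last step in Python (the column names, epoch count, and graph folder path below are hypothetical and should match your own NER pipeline), NerDLApproach can be pointed at the directory containing the generated graph:
from sparknlp.annotator import NerDLApproach
# setGraphFolder tells the approach where to look for the custom TensorFlow graph generated by create_models.py
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setGraphFolder("custom_ner_graphs/") \
    .setRandomSeed(0)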