sparknlp.training.conllu
Contains classes for CoNLLU.
Module Contents
Classes
CoNLLU | Instantiates the class to read a CoNLL-U dataset.
- class CoNLLU(textCol='text', documentCol='document', sentenceCol='sentence', formCol='form', uposCol='upos', xposCol='xpos', lemmaCol='lemma', explodeSentences=True)
Instantiates the class to read a CoNLL-U dataset.
The dataset must be in CoNLL-U format and is loaded with readDataset(), which creates a DataFrame from the data. The resulting DataFrame can be used to train a DependencyParserApproach.
Input File Format:
    # sent_id = 1
    # text = They buy and sell books.
    1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
    2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
    3 and and CONJ CC _ 4 cc 4:cc _
    4 sell sell VERB VBP Number=Plur|Person=3|Tense=Pres 2 conj 0:root|2:conj _
    5 books book NOUN NNS Number=Plur 2 obj 2:obj|4:obj SpaceAfter=No
    6 . . PUNCT . _ 2 punct 2:punct _
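To make the layout of the sample concrete, here is a minimal pure-Python sketch that splits one sentence block into its comment metadata and its ten token fields. It is not how Spark NLP reads the file; the names `FIELDS`, `SAMPLE` and `parse_sentence` are hypothetical helpers introduced for illustration, and the sketch assumes tab-separated fields as the CoNLL-U specification prescribes.

```python
# Field names in file order, following the CoNLL-U specification.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# The sample sentence from "Input File Format", written with tab separators.
SAMPLE = "\n".join([
    "# sent_id = 1",
    "# text = They buy and sell books.",
    "1\tThey\tthey\tPRON\tPRP\tCase=Nom|Number=Plur\t2\tnsubj\t2:nsubj|4:nsubj\t_",
    "2\tbuy\tbuy\tVERB\tVBP\tNumber=Plur|Person=3|Tense=Pres\t0\troot\t0:root\t_",
    "3\tand\tand\tCONJ\tCC\t_\t4\tcc\t4:cc\t_",
    "4\tsell\tsell\tVERB\tVBP\tNumber=Plur|Person=3|Tense=Pres\t2\tconj\t0:root|2:conj\t_",
    "5\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t2:obj|4:obj\tSpaceAfter=No",
    "6\t.\t.\tPUNCT\t.\t_\t2\tpunct\t2:punct\t_",
])


def parse_sentence(block):
    """Return (metadata dict, list of token dicts) for one sentence block."""
    meta, tokens = {}, []
    for line in block.splitlines():
        if not line.strip():
            continue  # blank lines separate sentences; nothing to parse
        if line.startswith("#"):
            # Comment lines carry "key = value" metadata such as sent_id and text.
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        tokens.append(dict(zip(FIELDS, values)))
    return meta, tokens


meta, tokens = parse_sentence(SAMPLE)
print(meta["text"])
print([(t["form"], t["upos"], t["head"], t["deprel"]) for t in tokens])
```

Each token dict keeps every field as a string, including the `_` placeholder for empty values, so nothing from the file is interpreted or lost.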
Examples
    >>> from sparknlp.training import CoNLLU
    >>> conlluFile = "src/test/resources/conllu/en.test.conllu"
    >>> # explodeSentences is the last parameter, so pass it by keyword
    >>> conllDataSet = CoNLLU(explodeSentences=False).readDataset(spark, conlluFile)
    >>> conllDataSet.selectExpr(
    ...     "text",
    ...     "form.result as form",
    ...     "upos.result as upos",
    ...     "xpos.result as xpos",
    ...     "lemma.result as lemma"
    ... ).show(1, False)
    +---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
    |text                                   |form                                          |upos                                         |xpos                          |lemma                                       |
    +---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
    |What if Google Morphed Into GoogleOS?  |[What, if, Google, Morphed, Into, GoogleOS, ?]|[PRON, SCONJ, PROPN, VERB, ADP, PROPN, PUNCT]|[WP, IN, NNP, VBD, IN, NNP, .]|[what, if, Google, morph, into, GoogleOS, ?]|
    +---------------------------------------+----------------------------------------------+---------------------------------------------+------------------------------+--------------------------------------------+
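The example above selects the form, upos, xpos and lemma annotations for each sentence. As a rough illustration of where those values come from, the following pure-Python sketch projects the sample sentence from "Input File Format" onto the same four columns. It does not use Spark NLP and is not the library's implementation; `COLUMN_TO_FIELD_INDEX` and `row` are hypothetical names, and the field positions follow the CoNLL-U specification (FORM, LEMMA, UPOS and XPOS are fields 2 to 5).

```python
# Zero-based position of each CoNLL-U field that feeds a DataFrame column.
COLUMN_TO_FIELD_INDEX = {"form": 1, "upos": 3, "xpos": 4, "lemma": 2}

# The sample sentence from "Input File Format", written with tab separators.
SAMPLE = "\n".join([
    "# sent_id = 1",
    "# text = They buy and sell books.",
    "1\tThey\tthey\tPRON\tPRP\tCase=Nom|Number=Plur\t2\tnsubj\t2:nsubj|4:nsubj\t_",
    "2\tbuy\tbuy\tVERB\tVBP\tNumber=Plur|Person=3|Tense=Pres\t0\troot\t0:root\t_",
    "3\tand\tand\tCONJ\tCC\t_\t4\tcc\t4:cc\t_",
    "4\tsell\tsell\tVERB\tVBP\tNumber=Plur|Person=3|Tense=Pres\t2\tconj\t0:root|2:conj\t_",
    "5\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t2:obj|4:obj\tSpaceAfter=No",
    "6\t.\t.\tPUNCT\t.\t_\t2\tpunct\t2:punct\t_",
])

# Keep only token lines; lines starting with "#" are sentence-level comments.
token_rows = [
    line.split("\t")
    for line in SAMPLE.splitlines()
    if line and not line.startswith("#")
]

# One list per column, in token order, analogous to the arrays in the output above.
row = {
    column: [fields[index] for fields in token_rows]
    for column, index in COLUMN_TO_FIELD_INDEX.items()
}

for column, values in row.items():
    print(column, values)
```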
- readDataset(spark, path, read_as=ReadAs.TEXT)
Reads the dataset from an external resource.
- Parameters:
    - spark : pyspark.sql.SparkSession
      Initiated Spark Session with Spark NLP
    - path : str
      Path to the resource
    - read_as : str, optional
      How to read the resource, by default ReadAs.TEXT
- Returns:
pyspark.sql.DataFrame
Spark Dataframe with the data