Contains classes for CoNLLU.

Module Contents




class CoNLLU(textCol='text', documentCol='document', sentenceCol='sentence', formCol='form', uposCol='upos', xposCol='xpos', lemmaCol='lemma', explodeSentences=True)

Instantiates the class to read a CoNLL-U dataset.

The dataset must be in CoNLL-U format. Pass its path to readDataset(), which returns a DataFrame containing the data.

Can be used to train a DependencyParserApproach.

Input File Format:

# sent_id = 1
# text = They buy and sell books.
1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _
3   and      and     CONJ    CC     _                                  4   cc      4:cc              _
4   sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2   conj    0:root|2:conj     _
5   books    book    NOUN    NNS    Number=Plur                        2   obj     2:obj|4:obj       SpaceAfter=No
6   .        .       PUNCT   .      _                                  2   punct   2:punct           _
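
To make the column layout above concrete, here is a minimal plain-Python sketch that reads one sentence block and collects the same columns that CoNLLU exposes (text, form, lemma, upos, xpos). It only illustrates the file format and is not how Spark NLP parses the file; the helper name `parse_conllu_sentence` and the `CONLLU_FIELDS` list are hypothetical names introduced for this example.

```python
# Illustrative only: a plain-Python reader for the CoNLL-U layout shown above.
# This is NOT Spark NLP's parser; all names here are hypothetical.

# The ten CoNLL-U columns, in file order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(block):
    """Parse one sentence block into its text and per-token column lists."""
    sentence = {"text": None, "form": [], "lemma": [], "upos": [], "xpos": []}
    for line in block.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment lines carry metadata such as "# text = ..."
            key, _, value = line.lstrip("#").partition("=")
            if key.strip() == "text":
                sentence["text"] = value.strip()
            continue
        # Real files separate columns with tabs; the sample above uses spaces.
        values = line.split("\t") if "\t" in line else line.split()
        token = dict(zip(CONLLU_FIELDS, values))
        for column in ("form", "lemma", "upos", "xpos"):
            sentence[column].append(token[column])
    return sentence


sample = """
# sent_id = 1
# text = They buy and sell books.
1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _
3   and      and     CONJ    CC     _                                  4   cc      4:cc              _
4   sell     sell    VERB    VBP    Number=Plur|Person=3|Tense=Pres    2   conj    0:root|2:conj     _
5   books    book    NOUN    NNS    Number=Plur                        2   obj     2:obj|4:obj       SpaceAfter=No
6   .        .       PUNCT   .      _                                  2   punct   2:punct           _
"""

parsed = parse_conllu_sentence(sample)
print(parsed["text"])
print(parsed["form"])
print(parsed["upos"])
```

The sketch splits on whitespace only because the sample is space-aligned; a real CoNLL-U file is tab-separated, which the tab branch handles.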


>>> from sparknlp.training import CoNLLU
>>> conlluFile = "src/test/resources/conllu/en.test.conllu"
>>> conllDataSet = CoNLLU(explodeSentences=False).readDataset(spark, conlluFile)
>>> conllDataSet.selectExpr(
...     "text",
...     "form.result as form",
...     "upos.result as upos",
...     "xpos.result as xpos",
...     "lemma.result as lemma"
... ).show(1, False)
|text                                   |form                                          |upos                                         |xpos                          |lemma                                       |
|What if Google Morphed Into GoogleOS?  |[What, if, Google, Morphed, Into, GoogleOS, ?]|[PRON, SCONJ, PROPN, VERB, ADP, PROPN, PUNCT]|[WP, IN, NNP, VBD, IN, NNP, .]|[what, if, Google, morph, into, GoogleOS, ?]|

readDataset(spark, path, read_as=ReadAs.TEXT)

Reads the dataset from an external resource.


Parameters:

spark
    Initiated Spark Session with Spark NLP

path
    Path to the resource

read_as : str, optional
    How to read the resource, by default ReadAs.TEXT

Returns:

Spark DataFrame with the data