sparknlp.training.conll#
Contains classes for CoNLL.
Module Contents#
Classes#
CoNLL: Instantiates the class to read a CoNLL dataset.
- class CoNLL(documentCol='document', sentenceCol='sentence', tokenCol='token', posCol='pos', conllLabelIndex=3, conllPosIndex=1, conllDocIdCol='doc_id', textCol='text', labelCol='label', explodeSentences=True, delimiter=' ', includeDocId=False)[source]#
Instantiates the class to read a CoNLL dataset.
The dataset should be in the format of CoNLL 2003 and needs to be specified with readDataset(), which will create a DataFrame with the data. It can be used to train a NerDLApproach.
Input File Format:
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
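As noted above, the resulting DataFrame can be used to train a NerDLApproach. A minimal training sketch, assuming the default pretrained WordEmbeddingsModel is available and using the same eng.train path as in the Examples below; the embedding choice and the single training epoch are illustrative only, not a recommended setup:

>>> from pyspark.ml import Pipeline
>>> from sparknlp.training import CoNLL
>>> from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
>>> # CoNLL provides the document, sentence, token, pos and label columns
>>> trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerTagger = NerDLApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setMaxEpochs(1)
>>> pipelineModel = Pipeline(stages=[embeddings, nerTagger]).fit(trainingData)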
- Parameters:
- documentCol : str, optional
Name of the DocumentAssembler column, by default 'document'
- sentenceCol : str, optional
Name of the SentenceDetector column, by default 'sentence'
- tokenCol : str, optional
Name of the Tokenizer column, by default 'token'
- posCol : str, optional
Name of the PerceptronModel column, by default 'pos'
- conllLabelIndex : int, optional
Index of the label column in the dataset, by default 3
- conllPosIndex : int, optional
Index of the POS tags in the dataset, by default 1
- textCol : str, optional
Name of the text column in the dataset, by default 'text'
- labelCol : str, optional
Name of the label column, by default 'label'
- explodeSentences : bool, optional
Whether to explode sentences to separate rows, by default True
- delimiter : str, optional
Delimiter used to separate columns inside the CoNLL file, by default ' '
- includeDocId : bool, optional
Whether to try to parse the document id from the third item of the -DOCSTART- line ('X' if not found), by default False
Examples
>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> trainingData.selectExpr(
...     "text",
...     "token.result as tokens",
...     "pos.result as pos",
...     "label.result as label"
... ).show(3, False)
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|text                                            |tokens                                                    |pos                                  |label                                    |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|EU rejects German call to boycott British lamb .|[EU, rejects, German, call, to, boycott, British, lamb, .]|[NNP, VBZ, JJ, NN, TO, VB, JJ, NN, .]|[B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]|
|Peter Blackburn                                 |[Peter, Blackburn]                                        |[NNP, NNP]                           |[B-PER, I-PER]                           |
|BRUSSELS 1996-08-22                             |[BRUSSELS, 1996-08-22]                                    |[NNP, CD]                            |[B-LOC, O]                               |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
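The constructor parameters above can be adjusted for non-standard corpora. A hedged sketch, assuming a hypothetical tab-delimited file at /path/to/tab_delimited.conll; the doc_id column name follows the conllDocIdCol default when includeDocId is enabled:

>>> from sparknlp.training import CoNLL
>>> # keep whole documents per row and parse the document id from -DOCSTART- lines
>>> conll = CoNLL(delimiter="\t", explodeSentences=False, includeDocId=True)
>>> trainingData = conll.readDataset(spark, "/path/to/tab_delimited.conll")
>>> trainingData.select("doc_id", "text", "label.result").show(truncate=False)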
- readDataset(spark, path, read_as=ReadAs.TEXT, partitions=8, storage_level=pyspark.StorageLevel.DISK_ONLY)[source]#
Reads the dataset from an external resource.
- Parameters:
- spark : pyspark.sql.SparkSession
Initiated Spark session with Spark NLP
- path : str
Path to the resource. It can take two forms: a path to a CoNLL file, or a path to a folder containing multiple CoNLL files. When the path points to a folder, it must end in '*'. Examples:
"/path/to/single/file.conll"
"/path/to/folder/containing/multiple/files/*"
- read_as : str, optional
How to read the resource, by default ReadAs.TEXT
- partitions : int, optional
Sets the minimum number of partitions for the case of lifting multiple files in parallel into a single DataFrame, by default 8
- storage_level : pyspark.StorageLevel, optional
Sets the persistence level according to PySpark definitions, by default StorageLevel.DISK_ONLY. Applies only when lifting multiple files.
- Returns:
pyspark.sql.DataFrame
Spark Dataframe with the data
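A sketch for lifting a folder of CoNLL files in parallel, assuming a hypothetical folder path; note the trailing '*' required for folders and the optional partitions and storage_level arguments:

>>> from pyspark import StorageLevel
>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(
...     spark,
...     "/path/to/folder/containing/multiple/files/*",
...     partitions=16,  # minimum partitions when reading multiple files
...     storage_level=StorageLevel.MEMORY_AND_DISK
... )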