sparknlp.training.conll#
Contains classes for CoNLL.
Module Contents#
Classes#
CoNLL: Instantiates the class to read a CoNLL dataset.
- class CoNLL(documentCol='document', sentenceCol='sentence', tokenCol='token', posCol='pos', conllLabelIndex=3, conllPosIndex=1, conllDocIdCol='doc_id', textCol='text', labelCol='label', explodeSentences=True, delimiter=' ', includeDocId=False)[source]#
Instantiates the class to read a CoNLL dataset.
The dataset should be in the format of CoNLL 2003 and needs to be specified with readDataset(), which will create a DataFrame with the data. It can be used to train a NerDLApproach.
Input File Format:
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
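As noted above, the resulting DataFrame can be used to train a NerDLApproach. A minimal training sketch, assuming the default pretrained WordEmbeddingsModel is available and using the same eng.train path as in the Examples below; the embedding choice and the single training epoch are illustrative only, not a recommended setup:

>>> from pyspark.ml import Pipeline
>>> from sparknlp.training import CoNLL
>>> from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
>>> # CoNLL provides the document, sentence, token, pos and label columns
>>> trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerTagger = NerDLApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setMaxEpochs(1)
>>> pipelineModel = Pipeline(stages=[embeddings, nerTagger]).fit(trainingData)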
- Parameters:
- documentCol : str, optional
Name of the DocumentAssembler column, by default 'document'
- sentenceCol : str, optional
Name of the SentenceDetector column, by default 'sentence'
- tokenCol : str, optional
Name of the Tokenizer column, by default 'token'
- posCol : str, optional
Name of the PerceptronModel column, by default 'pos'
- conllLabelIndex : int, optional
Index of the label column in the dataset, by default 3
- conllPosIndex : int, optional
Index of the POS tags in the dataset, by default 1
- textCol : str, optional
Name of the text column in the dataset, by default 'text'
- labelCol : str, optional
Name of the label column, by default 'label'
- explodeSentences : bool, optional
Whether to explode sentences to separate rows, by default True
- delimiter : str, optional
Delimiter used to separate columns inside the CoNLL file, by default ' '
- includeDocId : bool, optional
Whether to try to parse the document id from the third item of the -DOCSTART- line ('X' if not found), by default False
Examples
>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> trainingData.selectExpr(
...     "text",
...     "token.result as tokens",
...     "pos.result as pos",
...     "label.result as label"
... ).show(3, False)
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|text                                            |tokens                                                    |pos                                  |label                                    |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
|EU rejects German call to boycott British lamb .|[EU, rejects, German, call, to, boycott, British, lamb, .]|[NNP, VBZ, JJ, NN, TO, VB, JJ, NN, .]|[B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]|
|Peter Blackburn                                 |[Peter, Blackburn]                                        |[NNP, NNP]                           |[B-PER, I-PER]                           |
|BRUSSELS 1996-08-22                             |[BRUSSELS, 1996-08-22]                                    |[NNP, CD]                            |[B-LOC, O]                               |
+------------------------------------------------+----------------------------------------------------------+-------------------------------------+-----------------------------------------+
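The constructor parameters above can be adjusted for non-standard corpora. A hedged sketch, assuming a hypothetical tab-delimited file at /path/to/tab_delimited.conll; the doc_id column name follows the conllDocIdCol default when includeDocId is enabled:

>>> from sparknlp.training import CoNLL
>>> # keep whole documents per row and parse the document id from -DOCSTART- lines
>>> conll = CoNLL(delimiter="\t", explodeSentences=False, includeDocId=True)
>>> trainingData = conll.readDataset(spark, "/path/to/tab_delimited.conll")
>>> trainingData.select("doc_id", "text", "label.result").show(truncate=False)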
- readDataset(spark, path, read_as=ReadAs.TEXT, partitions=8, storage_level=pyspark.StorageLevel.DISK_ONLY)[source]#
Reads the dataset from an external resource.
- Parameters:
- spark : pyspark.sql.SparkSession
Initiated Spark session with Spark NLP
- path : str
Path to the resource. It can take two forms: a path to a CoNLL file, or a path to a folder containing multiple CoNLL files. When the path points to a folder, it must end in '*'. Examples:
"/path/to/single/file.conll"
"/path/to/folder/containing/multiple/files/*"
- read_as : str, optional
How to read the resource, by default ReadAs.TEXT
- partitions : int, optional
Sets the minimum number of partitions for the case of lifting multiple files in parallel into a single DataFrame, by default 8
- storage_level : pyspark.StorageLevel, optional
Sets the persistence level according to PySpark definitions, by default StorageLevel.DISK_ONLY. Applies only when lifting multiple files.
- Returns:
pyspark.sql.DataFrame
Spark Dataframe with the data
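A sketch for lifting a folder of CoNLL files in parallel, assuming a hypothetical folder path; note the trailing '*' required for folders and the optional partitions and storage_level arguments:

>>> from pyspark import StorageLevel
>>> from sparknlp.training import CoNLL
>>> trainingData = CoNLL().readDataset(
...     spark,
...     "/path/to/folder/containing/multiple/files/*",
...     partitions=16,  # minimum partitions when reading multiple files
...     storage_level=StorageLevel.MEMORY_AND_DISK
... )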