sparknlp.training.pub_tator#
Contains helper classes for PubTator datasets.
Module Contents#
Classes#
PubTator
    The PubTator format includes medical papers’ titles, abstracts, and tagged chunks.
class PubTator[source]#
The PubTator format includes medical papers’ titles, abstracts, and tagged chunks.

For more information see PubTator Docs and MedMentions Docs.

readDataset() is used to create a Spark DataFrame from a PubTator text file.

Input File Format:

25763772 0 5 DCTN4 T116,T123 C4308010
25763772 23 63 chronic Pseudomonas aeruginosa infection T047 C0854135
25763772 67 82 cystic fibrosis T047 C0010674
25763772 83 120 Pseudomonas aeruginosa (Pa) infection T047 C0854135
25763772 124 139 cystic fibrosis T047 C0010674

(each annotation line’s fields are unpacked in the sketch after the examples below)

Examples

>>> from sparknlp.training import PubTator
>>> pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
>>> pubTatorDataSet = PubTator().readDataset(spark, pubTatorFile)
>>> pubTatorDataSet.show(1)
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|  doc_id|      finished_token|        finished_pos|        finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...|   [[sentence, 0], [...| [[word, DCTN4], [...|   [[word, DCTN4], [...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
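To make the annotation lines above concrete, here is a minimal sketch in plain Python (the variable names are descriptive labels chosen for illustration, not official field names; it assumes the fields are tab-separated, as in the MedMentions-style resource files):

>>> line = "25763772\t0\t5\tDCTN4\tT116,T123\tC4308010"
>>> doc_id, begin, end, chunk, semantic_types, concept_id = line.split("\t")
>>> (doc_id, begin, end, chunk, semantic_types, concept_id)
('25763772', '0', '5', 'DCTN4', 'T116,T123', 'C4308010')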
readDataset(spark, path, isPaddedToken=True)[source]#

Reads the dataset from an external resource.

Parameters:
spark : pyspark.sql.SparkSession
    Initiated Spark Session with Spark NLP
path : str
    Path to the resource
isPaddedToken : bool, optional
    Whether tokens are padded, by default True
Returns:

pyspark.sql.DataFrame
    Spark DataFrame with the data
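As a hedged sketch of downstream use with the DataFrame returned above (standard PySpark only; pubTatorDataSet is the result of the earlier example, and arrays_zip/explode are regular pyspark.sql.functions, not part of PubTator), the finished token and label arrays can be zipped and exploded into one row per token/tag pair:

>>> from pyspark.sql.functions import arrays_zip, explode
>>> pairs = pubTatorDataSet.select(
...     "doc_id",
...     explode(arrays_zip("finished_token", "finished_ner")).alias("pair")
... )  # one row per aligned token/label pair
>>> pairs.select("doc_id", "pair.finished_token", "pair.finished_ner").show(5)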