sparknlp.annotator.graph_extraction

Contains classes for GraphExtraction.

Module Contents

Classes

GraphExtraction : Extracts a dependency graph between entities.
- class GraphExtraction(classname='com.johnsnowlabs.nlp.annotators.GraphExtraction', java_model=None)[source]#
Extracts a dependency graph between entities.
The GraphExtraction class takes e.g. extracted entities from a
NerDLModel
and creates a dependency tree which describes how the entities relate to each other. For that, a triple store format is used: nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.

Both the
DependencyParserModel
and TypedDependencyParserModel
need to be present in the pipeline. There are two ways to set them:

1. Both Annotators are present in the pipeline already. The dependencies are taken implicitly from these two Annotators.
2. Setting
setMergeEntities()
to True
will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel()
and setTypedDependencyParserModel():

>>> graph_extraction = GraphExtraction() \
...     .setInputCols(["document", "token", "ner"]) \
...     .setOutputCol("graph") \
...     .setRelationshipTypes(["prefer-LOC"]) \
...     .setMergeEntities(True)
>>> #.setDependencyParserModel(["dependency_conllu", "en", "public/models"])
>>> #.setTypedDependencyParserModel(["dependency_typed_conllu", "en", "public/models"])
Input Annotation types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotation type: NODE
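To make the triple-store idea concrete, the following pure-Python sketch (not the Spark NLP internals; the dependency triples below are hypothetical) stores a dependency tree as (child, relation, head) triples and finds a labeled path between two nodes with a breadth-first search:

```python
# Minimal sketch of the node/edge idea: a dependency tree as
# (child, relation, head) triples, plus a path search between two nodes.
# The triples below are hypothetical, hand-written for illustration.
from collections import deque

# Hypothetical dependency triples for:
# "You and John prefer the morning flight through Denver"
triples = [
    ("You", "nsubj", "prefer"),
    ("John", "conj", "You"),
    ("flight", "obj", "prefer"),
    ("morning", "compound", "flight"),
    ("Denver", "nmod", "flight"),
]

def find_path(start, goal, triples):
    """BFS over the undirected dependency graph, collecting edge labels."""
    adjacency = {}
    for child, rel, head in triples:
        adjacency.setdefault(child, []).append((rel, head))
        adjacency.setdefault(head, []).append((rel, child))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [rel, neighbor]))
    return None

print(find_path("prefer", "Denver", triples))
# → ['prefer', 'obj', 'flight', 'nmod', 'Denver']
```

The annotator's path output follows the same node, edge, node, ... alternation, joined with the configured delimiter.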
- Parameters:
- relationshipTypes
Paths to find between a pair of token and entity
- entityTypes
Paths to find between a pair of entities
- explodeEntities
When set to true, finds paths between entities
- rootTokens
Tokens to be considered as roots to start traversing the paths. Use it along with explodeEntities
- maxSentenceSize
Maximum sentence size that the annotator will process, by default 1000. Above this, the sentence is skipped.
- minSentenceSize
Minimum sentence size that the annotator will process, by default 2. Below this, the sentence is skipped.
- mergeEntities
Merge same neighboring entities as a single token
- includeEdges
Whether to include edges when building paths
- delimiter
Delimiter symbol used for path output
- posModel
Coordinates (name, lang, remoteLoc) to a pretrained POS model
- dependencyParserModel
Coordinates (name, lang, remoteLoc) to a pretrained Dependency Parser model
- typedDependencyParserModel
Coordinates (name, lang, remoteLoc) to a pretrained Typed Dependency Parser model
See also
GraphFinisher
to output the paths in a more generic format, like RDF
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerTagger = NerDLModel.pretrained() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("pos")
>>> dependencyParser = DependencyParserModel.pretrained() \
...     .setInputCols(["sentence", "pos", "token"]) \
...     .setOutputCol("dependency")
>>> typedDependencyParser = TypedDependencyParserModel.pretrained() \
...     .setInputCols(["dependency", "pos", "token"]) \
...     .setOutputCol("dependency_type")
>>> graph_extraction = GraphExtraction() \
...     .setInputCols(["document", "token", "ner"]) \
...     .setOutputCol("graph") \
...     .setRelationshipTypes(["prefer-LOC"])
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     nerTagger,
...     posTagger,
...     dependencyParser,
...     typedDependencyParser,
...     graph_extraction
... ])
>>> data = spark.createDataFrame([["You and John prefer the morning flight through Denver"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("graph").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------+
|graph                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+
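The path in the metadata above (path1) is a flat, delimiter-separated alternation of nodes and edge labels. A small sketch of how such a path string could be post-processed into (head, relation, dependent) triples, similar in spirit to what GraphFinisher does for RDF-like output (the helper name is hypothetical, not a Spark NLP API):

```python
# Sketch: split a delimiter-separated path from the example output above
# into (head, relation, dependent) triples. `path_to_triples` is a
# hypothetical helper, not part of the Spark NLP API.
def path_to_triples(path, delimiter=","):
    parts = path.split(delimiter)
    # parts alternate node, edge, node, edge, ..., node
    return [
        (parts[i], parts[i + 1], parts[i + 2])
        for i in range(0, len(parts) - 2, 2)
    ]

path1 = "prefer,nsubj,morning,flat,flight,flat,Denver"
for triple in path_to_triples(path1):
    print(triple)
# → ('prefer', 'nsubj', 'morning')
#   ('morning', 'flat', 'flight')
#   ('flight', 'flat', 'Denver')
```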
- setRelationshipTypes(value)[source]#
Sets paths to find between a pair of token and entity.
- Parameters:
- value : List[str]
Paths to find between a pair of token and entity
- setEntityTypes(value)[source]#
Sets paths to find between a pair of entities.
- Parameters:
- value : List[str]
Paths to find between a pair of entities
- setExplodeEntities(value)[source]#
Sets whether to find paths between entities.
- Parameters:
- value : bool
Whether to find paths between entities.
- setRootTokens(value)[source]#
Sets tokens to be considered as the root to start traversing the paths.
Use it along with explodeEntities.
- Parameters:
- value : List[str]
Tokens to be considered as roots to start traversing the paths.
- setMaxSentenceSize(value)[source]#
Sets the maximum sentence size that the annotator will process, by default 1000.
Above this, the sentence is skipped.
- Parameters:
- value : int
Maximum sentence size that the annotator will process
- setMinSentenceSize(value)[source]#
Sets the minimum sentence size that the annotator will process, by default 2.
Below this, the sentence is skipped.
- Parameters:
- value : int
Minimum sentence size that the annotator will process
- setMergeEntities(value)[source]#
Sets whether to merge same neighboring entities as a single token.
- Parameters:
- value : bool
Whether to merge same neighboring entities as a single token.
- setMergeEntitiesIOBFormat(value)[source]#
Sets IOB format to apply when merging entities.
- Parameters:
- value : str
IOB format to apply when merging entities. Valid values: IOB or IOB2.
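How merging neighboring entity tokens works can be sketched in plain Python (an illustration of standard IOB2 tag handling, not the Spark NLP implementation): tokens tagged B- start an entity, and following I- tokens of the same type are folded into it, so the merged span behaves as a single token in the graph.

```python
# Sketch (assumes standard IOB2 tags; not Spark NLP internals): fold
# neighboring tokens of one entity into a single token, as
# setMergeEntities(True) does conceptually.
def merge_iob2(tokens, tags):
    merged = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and merged and merged[-1][1] == tag[2:]:
            # continuation of the previous entity: extend its token text
            merged[-1] = (merged[-1][0] + " " + token, tag[2:])
        elif tag.startswith("B-"):
            # beginning of a new entity
            merged.append((token, tag[2:]))
        else:
            merged.append((token, "O"))
    return merged

tokens = ["John", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(merge_iob2(tokens, tags))
# → [('John', 'PER'), ('lives', 'O'), ('in', 'O'), ('New York', 'LOC')]
```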
- setIncludeEdges(value)[source]#
Sets whether to include edges when building paths.
- Parameters:
- value : bool
Whether to include edges when building paths
- setDelimiter(value)[source]#
Sets delimiter symbol used for path output.
- Parameters:
- value : str
Delimiter symbol used for path output
- setPosModel(value)[source]#
Sets the coordinates (name, lang, remoteLoc) of a pretrained POS model.
- Parameters:
- value : List[str]
Coordinates (name, lang, remoteLoc) to a pretrained POS model