sparknlp.annotator.graph_extraction

Contains classes for GraphExtraction.

Module Contents

Classes

GraphExtraction : Extracts a dependency graph between entities.
- class GraphExtraction(classname='com.johnsnowlabs.nlp.annotators.GraphExtraction', java_model=None)[source]#
Extracts a dependency graph between entities.
The GraphExtraction class takes e.g. extracted entities from a
NerDLModel
and creates a dependency tree which describes how the entities relate to each other. For that, a triple store format is used: nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.

Both the
DependencyParserModel
and TypedDependencyParserModel
need to be present in the pipeline. There are two ways to set them:

1. Both Annotators are present in the pipeline already. The dependencies are taken implicitly from these two Annotators.
2. Setting
setMergeEntities()
to True
will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel()
and setTypedDependencyParserModel():

>>> graph_extraction = GraphExtraction() \
...     .setInputCols(["document", "token", "ner"]) \
...     .setOutputCol("graph") \
...     .setRelationshipTypes(["prefer-LOC"]) \
...     .setMergeEntities(True)
>>> #.setDependencyParserModel(["dependency_conllu", "en", "public/models"])
>>> #.setTypedDependencyParserModel(["dependency_typed_conllu", "en", "public/models"])
Input Annotation types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotation type: NODE
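To make the triple-store idea concrete, the following pure-Python sketch (not the Spark NLP internals; the dependency triples below are hypothetical) stores a dependency tree as (child, relation, head) triples and finds a labeled path between two nodes with a breadth-first search:

```python
# Minimal sketch of the node/edge idea: a dependency tree as
# (child, relation, head) triples, plus a path search between two nodes.
# The triples below are hypothetical, hand-written for illustration.
from collections import deque

# Hypothetical dependency triples for:
# "You and John prefer the morning flight through Denver"
triples = [
    ("You", "nsubj", "prefer"),
    ("John", "conj", "You"),
    ("flight", "obj", "prefer"),
    ("morning", "compound", "flight"),
    ("Denver", "nmod", "flight"),
]

def find_path(start, goal, triples):
    """BFS over the undirected dependency graph, collecting edge labels."""
    adjacency = {}
    for child, rel, head in triples:
        adjacency.setdefault(child, []).append((rel, head))
        adjacency.setdefault(head, []).append((rel, child))
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [rel, neighbor]))
    return None

print(find_path("prefer", "Denver", triples))
# → ['prefer', 'obj', 'flight', 'nmod', 'Denver']
```

The annotator's path output follows the same node, edge, node, ... alternation, joined with the configured delimiter.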
- Parameters:
- relationshipTypes
Paths to find between a pair of token and entity
- entityTypes
Paths to find between a pair of entities
- explodeEntities
When set to true, finds paths between entities
- rootTokens
Tokens to be considered as roots to start traversing the paths. Use it along with explodeEntities
- maxSentenceSize
Maximum sentence size that the annotator will process, by default 1000. Above this, the sentence is skipped.
- minSentenceSize
Minimum sentence size that the annotator will process, by default 2. Below this, the sentence is skipped.
- mergeEntities
Merge same neighboring entities as a single token
- includeEdges
Whether to include edges when building paths
- delimiter
Delimiter symbol used for path output
- posModel
Coordinates (name, lang, remoteLoc) to a pretrained POS model
- dependencyParserModel
Coordinates (name, lang, remoteLoc) to a pretrained Dependency Parser model
- typedDependencyParserModel
Coordinates (name, lang, remoteLoc) to a pretrained Typed Dependency Parser model
See also
GraphFinisher
to output the paths in a more generic format, like RDF
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> nerTagger = NerDLModel.pretrained() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("pos")
>>> dependencyParser = DependencyParserModel.pretrained() \
...     .setInputCols(["sentence", "pos", "token"]) \
...     .setOutputCol("dependency")
>>> typedDependencyParser = TypedDependencyParserModel.pretrained() \
...     .setInputCols(["dependency", "pos", "token"]) \
...     .setOutputCol("dependency_type")
>>> graph_extraction = GraphExtraction() \
...     .setInputCols(["document", "token", "ner"]) \
...     .setOutputCol("graph") \
...     .setRelationshipTypes(["prefer-LOC"])
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentence,
...     tokenizer,
...     embeddings,
...     nerTagger,
...     posTagger,
...     dependencyParser,
...     typedDependencyParser,
...     graph_extraction
... ])
>>> data = spark.createDataFrame([["You and John prefer the morning flight through Denver"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("graph").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------+
|graph                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
+-----------------------------------------------------------------------------------------------------------------+
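The path in the metadata above (path1) is a flat, delimiter-separated alternation of nodes and edge labels. A small sketch of how such a path string could be post-processed into (head, relation, dependent) triples, similar in spirit to what GraphFinisher does for RDF-like output (the helper name is hypothetical, not a Spark NLP API):

```python
# Sketch: split a delimiter-separated path from the example output above
# into (head, relation, dependent) triples. `path_to_triples` is a
# hypothetical helper, not part of the Spark NLP API.
def path_to_triples(path, delimiter=","):
    parts = path.split(delimiter)
    # parts alternate node, edge, node, edge, ..., node
    return [
        (parts[i], parts[i + 1], parts[i + 2])
        for i in range(0, len(parts) - 2, 2)
    ]

path1 = "prefer,nsubj,morning,flat,flight,flat,Denver"
for triple in path_to_triples(path1):
    print(triple)
# → ('prefer', 'nsubj', 'morning')
#   ('morning', 'flat', 'flight')
#   ('flight', 'flat', 'Denver')
```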
- setRelationshipTypes(value)[source]#
Sets paths to find between a pair of token and entity.
- Parameters:
- value : List[str]
Paths to find between a pair of token and entity
- setEntityTypes(value)[source]#
Sets paths to find between a pair of entities.
- Parameters:
- value : List[str]
Paths to find between a pair of entities
- setExplodeEntities(value)[source]#
Sets whether to find paths between entities.
- Parameters:
- value : bool
Whether to find paths between entities.
- setRootTokens(value)[source]#
Sets tokens to be considered as the root to start traversing the paths.
Use it along with explodeEntities.
- Parameters:
- value : List[str]
Tokens to be considered as roots to start traversing the paths.
- setMaxSentenceSize(value)[source]#
Sets the maximum sentence size that the annotator will process, by default 1000.
Above this, the sentence is skipped.
- Parameters:
- value : int
Maximum sentence size that the annotator will process
- setMinSentenceSize(value)[source]#
Sets the minimum sentence size that the annotator will process, by default 2.
Below this, the sentence is skipped.
- Parameters:
- value : int
Minimum sentence size that the annotator will process
- setMergeEntities(value)[source]#
Sets whether to merge same neighboring entities as a single token.
- Parameters:
- value : bool
Whether to merge same neighboring entities as a single token.
- setMergeEntitiesIOBFormat(value)[source]#
Sets IOB format to apply when merging entities.
- Parameters:
- value : str
IOB format to apply when merging entities. Valid values: IOB or IOB2.
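How merging neighboring entity tokens works can be sketched in plain Python (an illustration of standard IOB2 tag handling, not the Spark NLP implementation): tokens tagged B- start an entity, and following I- tokens of the same type are folded into it, so the merged span behaves as a single token in the graph.

```python
# Sketch (assumes standard IOB2 tags; not Spark NLP internals): fold
# neighboring tokens of one entity into a single token, as
# setMergeEntities(True) does conceptually.
def merge_iob2(tokens, tags):
    merged = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and merged and merged[-1][1] == tag[2:]:
            # continuation of the previous entity: extend its token text
            merged[-1] = (merged[-1][0] + " " + token, tag[2:])
        elif tag.startswith("B-"):
            # beginning of a new entity
            merged.append((token, tag[2:]))
        else:
            merged.append((token, "O"))
    return merged

tokens = ["John", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(merge_iob2(tokens, tags))
# → [('John', 'PER'), ('lives', 'O'), ('in', 'O'), ('New York', 'LOC')]
```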
- setIncludeEdges(value)[source]#
Sets whether to include edges when building paths.
- Parameters:
- value : bool
Whether to include edges when building paths
- setDelimiter(value)[source]#
Sets delimiter symbol used for path output.
- Parameters:
- value : str
Delimiter symbol used for path output
- setPosModel(value)[source]#
Sets the coordinates (name, lang, remoteLoc) of a pretrained POS model.
- Parameters:
- value : List[str]
Coordinates (name, lang, remoteLoc) to a pretrained POS model