sparknlp.annotator.ner.ner_converter

Contains classes for the NerConverter.

Module Contents

Classes

NerConverter

Converts an IOB or IOB2 representation of NER to a user-friendly one.

class NerConverter

Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their labels. The result is of CHUNK Annotation type.

NER chunks can then be filtered by setting a whitelist with setWhiteList. Tokens with no associated entity (tagged “O”) are always filtered out.

See also Inside–outside–beginning (tagging) for more information.
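
The grouping described above can be illustrated with a short pure-Python sketch. It is not the library's implementation: the function name iob_to_chunks, the (begin, end, tag) tuple layout, and the sample sentence are assumptions made for illustration only. Offsets are inclusive, as in the annotation output shown under Examples.

```python
from typing import Any, Dict, List, Optional, Tuple

# (begin, end, tag): inclusive character offsets plus an IOB/IOB2 tag.
TaggedToken = Tuple[int, int, str]


def iob_to_chunks(text: str, tokens: List[TaggedToken]) -> List[Dict[str, Any]]:
    """Merge IOB/IOB2-tagged tokens into labelled chunks (illustrative sketch)."""
    chunks: List[Dict[str, Any]] = []
    current: Optional[Dict[str, Any]] = None  # chunk being built, if any

    def flush() -> None:
        # Close the open chunk and copy its text from the document.
        nonlocal current
        if current is not None:
            current["text"] = text[current["begin"]:current["end"] + 1]
            chunks.append(current)
            current = None

    for begin, end, tag in tokens:
        if tag == "O":
            # Tokens outside any entity end the open chunk and are dropped.
            flush()
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current["entity"] == label:
            # Continuation of the open chunk: extend its end offset.
            current["end"] = end
        else:
            # "B-" tag, or an "I-" tag that cannot continue: start a new chunk.
            flush()
            current = {"begin": begin, "end": end, "entity": label}
    flush()
    return chunks


# Hypothetical sentence with two multi-token entities.
text = "John Smith visited New York"
tokens = [(0, 3, "B-PER"), (5, 9, "I-PER"), (11, 17, "O"),
          (19, 21, "B-LOC"), (23, 26, "I-LOC")]
for chunk in iob_to_chunks(text, tokens):
    print(chunk["text"], chunk["entity"])
```

In this sketch the IOB prefix is removed from the label, which is why whitelist entries are written without it.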

Input Annotation types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotation type: CHUNK

Parameters:
whiteList

If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.

preservePosition

Whether to preserve the original position of the tokens in the original document or use the modified tokens. Defaults to True.

Examples

This is a continuation of the example of the NerDLModel. See that class for how to extract the entities. The output of the NerDLModel follows the Annotator schema and can be converted like so:

>>> result.selectExpr("explode(ner)").show(truncate=False)
+----------------------------------------------------+
|col                                                 |
+----------------------------------------------------+
|[named_entity, 0, 2, B-ORG, [word -> U.N], []]      |
|[named_entity, 3, 3, O, [word -> .], []]            |
|[named_entity, 5, 12, O, [word -> official], []]    |
|[named_entity, 14, 18, B-PER, [word -> Ekeus], []]  |
|[named_entity, 20, 24, O, [word -> heads], []]      |
|[named_entity, 26, 28, O, [word -> for], []]        |
|[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
|[named_entity, 37, 37, O, [word -> .], []]          |
+----------------------------------------------------+

After the converter is used:

>>> converter = NerConverter() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("entities")
>>> converter.transform(result).selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------+
|col                                                                     |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []]      |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []]  |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+

setWhiteList(entities)

Sets the list of entities to process. The rest will be ignored.

Do not include the IOB prefix on labels.

Parameters:
entities : List[str]

If defined, list of entities to process. The rest will be ignored.
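
The effect of a whitelist can be sketched in plain Python. This is an illustration of the filtering idea, not the library's code: filter_by_whitelist and the dictionary layout are hypothetical, and the chunk values are taken from the example output above.

```python
from typing import Dict, Iterable, List


def filter_by_whitelist(chunks: List[Dict[str, str]],
                        white_list: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only chunks whose entity label appears in the whitelist."""
    allowed = set(white_list)
    return [chunk for chunk in chunks if chunk["entity"] in allowed]


# Chunks as produced in the example above; labels carry no IOB prefix.
chunks = [
    {"text": "U.N", "entity": "ORG"},
    {"text": "Ekeus", "entity": "PER"},
    {"text": "Baghdad", "entity": "LOC"},
]

kept = filter_by_whitelist(chunks, ["PER", "LOC"])
print([chunk["text"] for chunk in kept])

# In this sketch a prefixed label matches nothing, because chunk labels
# are compared without the "B-" or "I-" prefix.
print(filter_by_whitelist(chunks, ["B-PER"]))
```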

setPreservePosition(value)

Sets whether to preserve the original position of the tokens in the original document or use the modified tokens, by default True.

Parameters:
value : bool

Whether to preserve the original position of the tokens in the original document or use the modified tokens.

setNerHasNoSchema(value)

Set this to True if your NER tags come from a model that does not use an IOB/IOB2 schema.

Parameters:
value : bool

Set this to True if your NER tags come from a model that does not use an IOB/IOB2 schema.