sparknlp.annotator.ner.llm_entity_extractor#

Contains classes for the LLMEntityExtractor annotator.

Module Contents#

Classes#

LLMEntityExtractor

End-to-end LLM-based entity extraction using AutoGGUF with BNF grammars.

class LLMEntityExtractor(classname='com.johnsnowlabs.nlp.annotators.ner.dl.LLMEntityExtractor', java_model=None)[source]#

End-to-end LLM-based entity extraction using AutoGGUF with BNF grammars.

LLMEntityExtractor is an end-to-end annotator that performs entity extraction from text using Large Language Models (LLMs) with structured JSON output via BNF grammars. It embeds AutoGGUFModel directly and uses string matching to compute accurate character indices for extracted entities.

This annotator follows the LangExtract pattern from Google Research, combining few-shot prompting with constrained generation through llama.cpp BNF grammars to ensure valid JSON output.
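Constrained decoding works by restricting token sampling to strings derivable from a grammar. llama.cpp expresses such grammars in its GBNF format; a hypothetical grammar enforcing the JSON shape shown below might look like the following sketch (illustrative only — the grammar the annotator actually ships is not documented here):

# Hypothetical GBNF sketch, not the annotator's actual grammar
root       ::= "{" ws "\"extractions\":" ws "[" ws extraction ("," ws extraction)* ws "]" ws "}"
extraction ::= "{" ws "\"entity\":" ws string "," ws "\"text\":" ws string ws "}"
string     ::= "\"" [^"]* "\""
ws         ::= [ \t\n]*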

The LLM generates responses in this format (enforced by grammar):

{
  "extractions": [
    {"entity": "MEDICATION", "text": "aspirin"},
    {"entity": "DOSAGE", "text": "250mg"}
  ]
}

The annotator performs string matching to find the exact character positions of each entity in the original text, outputting CHUNK annotations with accurate begin/end indices and chunk indexing similar to other Spark NLP annotators.
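The index computation described above can be illustrated with a small sketch (plain Python, not the annotator's internal implementation): given the original text and an extracted surface form, a case-insensitive search yields the begin/end offsets that end up in the CHUNK annotation.

```python
def find_span(text, extracted, case_sensitive=False, start=0):
    """Locate `extracted` in `text` and return (begin, end) character
    indices (end inclusive, as in Spark NLP annotations), or None if
    the surface form does not occur."""
    haystack = text if case_sensitive else text.lower()
    needle = extracted if case_sensitive else extracted.lower()
    begin = haystack.find(needle, start)
    if begin == -1:
        return None
    return (begin, begin + len(needle) - 1)

text = "Patient prescribed 500mg amoxicillin PO TID"
print(find_span(text, "Amoxicillin"))  # -> (25, 35), case-insensitive by default
```

The `start` offset allows repeated occurrences of the same surface form to be assigned distinct, advancing spans.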

The model is loaded via LLMEntityExtractor.pretrained() to download a pretrained model, or LLMEntityExtractor.loadSavedModel() to load a local GGUF model:

>>> entity_extractor = LLMEntityExtractor.pretrained("qwen3_4b_bf16_gguf") \
...     .setInputCols(["document"]) \
...     .setOutputCol("entities") \
...     .setEntityTypes(["PERSON", "ORGANIZATION", "LOCATION"])

Input Annotation types

Output Annotation type

DOCUMENT

CHUNK

Parameters:
promptTemplate : str, optional

Custom prompt template for entity extraction. Use {entityTypes} placeholder.

entityTypes : List[str], optional

List of entity types to extract (used in the prompt), by default ["PERSON", "ORGANIZATION", "LOCATION", "DATE", "TIME"]

caseSensitive : bool, optional

Whether entity matching is case-sensitive, by default False

fewShotExamples : List[Tuple[str, str]], optional

Few-shot examples as (input, output_json) tuples to guide the model
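The (input, output_json) shape of a few-shot example can be sketched as follows (the example sentence and entities are invented for this sketch; only the tuple structure and the JSON output format come from the docs):

```python
import json

# One few-shot pair: an input sentence and the JSON the model should
# produce for it, matching the grammar-enforced output format.
few_shot = [
    (
        "Take aspirin 250mg twice daily",
        '{"extractions": [{"entity": "MEDICATION", "text": "aspirin"}, '
        '{"entity": "DOSAGE", "text": "250mg"}]}',
    ),
]

# The JSON side should parse cleanly before being handed to the annotator:
parsed = json.loads(few_shot[0][1])
print(parsed["extractions"])
```

Such a list would then be passed via `entity_extractor.setFewShotExamples(few_shot)`.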

See also

NerDLModel

for traditional BiLSTM-CRF NER

NerConverter

to further process NER results

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> entity_extractor = LLMEntityExtractor.pretrained("qwen3_4b_bf16_gguf") \
...     .setInputCols(["document"]) \
...     .setOutputCol("entities") \
...     .setEntityTypes(["MEDICATION", "DOSAGE", "ROUTE", "FREQUENCY"]) \
...     .setNPredict(500) \
...     .setTemperature(0.1)
>>> pipeline = Pipeline().setStages([documentAssembler, entity_extractor])
>>> data = spark.createDataFrame([["Patient prescribed 500mg amoxicillin PO TID"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("entities.result", "entities.metadata").show(truncate=False)
+------------------------------+--------------------------------+
|result                        |metadata                        |
+------------------------------+--------------------------------+
|[500mg, amoxicillin, PO, TID] |[{entity -> DOSAGE}, ...]       |
+------------------------------+--------------------------------+
name = 'LLMEntityExtractor'[source]#
inputAnnotatorTypes[source]#
outputAnnotatorType = 'chunk'[source]#
promptTemplate[source]#
entityTypes[source]#
caseSensitive[source]#
fewShotExamples[source]#
setPromptTemplate(value)[source]#

Set custom prompt template for entity extraction.

Parameters:
value : str

Custom prompt template. Use {entityTypes} and {text} as placeholders.

Returns:
LLMEntityExtractor

The updated model
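A custom template might look like the following (the template wording is illustrative; only the {entityTypes} and {text} placeholders are part of the documented contract):

```python
# Illustrative template; {entityTypes} and {text} are the documented
# placeholders, everything else is example wording.
template = (
    "Extract all entities of the following types: {entityTypes}.\n"
    "Respond with JSON only.\n"
    "Text: {text}"
)

# Preview what a rendered prompt would look like for one input:
rendered = template.format(entityTypes="PERSON, LOCATION", text="Ada lived in London.")
print(rendered)
```

The template itself would be passed via `setPromptTemplate(template)`; the annotator fills in the placeholders per row at inference time.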

setEntityTypes(value)[source]#

Set the list of entity types to extract.

Parameters:
value : List[str]

List of entity type names

Returns:
LLMEntityExtractor

The updated model

setCaseSensitive(value)[source]#

Set whether entity matching is case-sensitive.

Parameters:
value : bool

True for case-sensitive matching, False for case-insensitive

Returns:
LLMEntityExtractor

The updated model

setFewShotExamples(value)[source]#

Set few-shot examples to guide the model.

Parameters:
value : List[Tuple[str, str]]

List of (input_text, json_output) tuples as examples

Returns:
LLMEntityExtractor

The updated model

getPromptTemplate()[source]#

Get the custom prompt template for entity extraction.

getEntityTypes()[source]#

Get the list of entity types to extract.

getCaseSensitive()[source]#

Get whether entity matching is case-sensitive.

getFewShotExamples()[source]#

Get the few-shot examples.

classmethod loadSavedModel(path, spark_session)[source]#

Loads a locally saved GGUF model for LLM-based entity extraction.

classmethod pretrained(name='qwen3_4b_bf16_gguf', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

close()[source]#

Closes the underlying llama.cpp model backend, freeing its resources.