sparknlp.annotator.token.chunk_tokenizer
Contains classes for the ChunkTokenizer.
Module Contents
Classes
ChunkTokenizer | Tokenizes and flattens extracted NER chunks.
ChunkTokenizerModel | Instantiated model of the ChunkTokenizer.
- class ChunkTokenizer
Tokenizes and flattens extracted NER chunks.
The ChunkTokenizer splits the extracted NER CHUNK type Annotations and creates TOKEN type Annotations. The result is then flattened into a single array.
Input Annotation types | Output Annotation type
CHUNK | TOKEN
- Parameters:
- None
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.common import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> entityExtractor = TextMatcher() \
...     .setInputCols(["sentence", "token"]) \
...     .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT) \
...     .setOutputCol("entity")
>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     entityExtractor,
...     chunkTokenizer
... ])
>>> data = spark.createDataFrame([
...     ["Hello world, my name is Michael, I am an artist and I work at Benezar"],
...     ["Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("entity.result as entity", "chunk_token.result as chunk_token").show(truncate=False)
+-----------------------------------------------+---------------------------------------------------+
|entity                                         |chunk_token                                        |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar]              |[world, Michael, work, at, Benezar]                |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+
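The "tokenize, then flatten" behaviour described for ChunkTokenizer can be illustrated without Spark. The following is only a minimal sketch: the helper name `flatten_chunk_tokens` is hypothetical and not part of Spark NLP, and plain whitespace splitting is an assumed stand-in for the annotator's actual tokenization rules. The real annotator also emits Annotation objects rather than bare strings.

```python
from typing import List


def flatten_chunk_tokens(chunks: List[str]) -> List[str]:
    """Split each chunk into tokens and flatten them into one list.

    Hypothetical helper for illustration only; whitespace splitting is an
    assumption, not the rule set used by Spark NLP.
    """
    tokens: List[str] = []
    for chunk in chunks:
        # str.split() with no arguments splits on runs of whitespace
        tokens.extend(chunk.split())
    return tokens


entities = ["world", "Michael", "work at Benezar"]
print(flatten_chunk_tokens(entities))
# ['world', 'Michael', 'work', 'at', 'Benezar']
```

Applied to the first row of the example output, the three extracted chunks become five tokens in a single flat list, which matches the chunk_token column shown for that row.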
- class ChunkTokenizerModel(classname='com.johnsnowlabs.nlp.annotators.ChunkTokenizerModel', java_model=None)
Instantiated model of the ChunkTokenizer.
This is the instantiated model of the ChunkTokenizer. For training your own model, please see the documentation of that class.
Input Annotation types | Output Annotation type
CHUNK | TOKEN
- Parameters:
- None