sparknlp.annotator.similarity.document_similarity_ranker
Contains classes for DocumentSimilarityRanker.
Module Contents#
Classes#
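DocumentSimilarityRankerApproach | Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings.
DocumentSimilarityRankerModel | Base class for Model classes that wrap Java/Scala implementations.
DocumentSimilarityRankerFinisher | Instantiated model of the DocumentSimilarityRankerApproach. For usage and examples see the documentation of the main class.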
- class DocumentSimilarityRankerApproach[source]#
Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings.
It aims to capture the semantic meaning of a document in a dense, continuous vector space and return it to the ranker search.
For instantiated/pretrained models, see DocumentSimilarityRankerModel.
For extended examples of usage, see the Jupyter notebook Document Similarity Ranker for Spark NLP.
Input Annotation types | Output Annotation type
SENTENCE_EMBEDDINGS | DOC_SIMILARITY_RANKINGS
- Parameters:
- enableCaching
Whether to enable caching of DataFrames or RDDs during training.
- similarityMethod
The similarity method used to calculate the neighbours. (Default: ‘brp’, Bucketed Random Projection for Euclidean distance)
- numberOfNeighbours
The number of neighbours the model will return. (Default: 10)
- bucketLength
Controls the average size of hash buckets. A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket, which increases the number of both true and false positives.
- numHashTables
Number of hash tables. Increasing it lowers the false negative rate; decreasing it improves running performance.
- visibleDistances
Whether to make distances visible in the ranking output. (Default: False)
- identityRanking
Whether to include identity in the ranking result set. Useful for debugging. (Default: False)
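The interplay of bucketLength and numHashTables can be sketched in plain Python. This is only an illustration of the Bucketed Random Projection idea, where each hash table maps a vector to floor(dot(vector, projection) / bucketLength) and two documents become candidate neighbours if they collide in at least one table. The names brp_hash and is_candidate_pair are hypothetical, and the projections are fixed here for reproducibility, whereas a real LSH model draws them at random.

```python
import math

def brp_hash(vector, projections, bucket_length):
    """Return one bucket index per hash table (one projection per table)."""
    return tuple(
        math.floor(sum(x * p for x, p in zip(vector, proj)) / bucket_length)
        for proj in projections
    )

def is_candidate_pair(hash_a, hash_b):
    """Two vectors are candidates if they share a bucket in any table."""
    return any(a == b for a, b in zip(hash_a, hash_b))

# Assumption: fixed unit-length projections, i.e. three hash tables.
projections = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
doc_a = (0.5, 0.5)
doc_b = (1.5, 0.5)

# Narrow buckets: the two documents collide only in the second table.
narrow = [brp_hash(d, projections, 1.0) for d in (doc_a, doc_b)]
# Wide buckets: the two documents collide in every table.
wide = [brp_hash(d, projections, 2.0) for d in (doc_a, doc_b)]

print(narrow, is_candidate_pair(*narrow))
print(wide, is_candidate_pair(*wide))
```

With a single hash table (only the first projection) and the narrow bucket length, the pair would be missed entirely, which is the false negative that additional tables help to avoid.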
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> from sparknlp.annotator.similarity.document_similarity_ranker import *
>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence_embeddings = E5Embeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence_embeddings")
>>> document_similarity_ranker = DocumentSimilarityRankerApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("doc_similarity_rankings") \
...     .setSimilarityMethod("brp") \
...     .setNumberOfNeighbours(1) \
...     .setBucketLength(2.0) \
...     .setNumHashTables(3) \
...     .setVisibleDistances(True) \
...     .setIdentityRanking(False)
>>> document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
...     .setInputCols("doc_similarity_rankings") \
...     .setOutputCols(
...         "finished_doc_similarity_rankings_id",
...         "finished_doc_similarity_rankings_neighbors") \
...     .setExtractNearestNeighbor(True)
>>> pipeline = Pipeline(stages=[
...     document_assembler,
...     sentence_embeddings,
...     document_similarity_ranker,
...     document_similarity_ranker_finisher
... ])
>>> # `data` is a Spark DataFrame with a "text" column
>>> docSimRankerPipeline = pipeline.fit(data).transform(data)
>>> (
...     docSimRankerPipeline
...         .select(
...             "finished_doc_similarity_rankings_id",
...             "finished_doc_similarity_rankings_neighbors"
...         ).show(10, False)
... )
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1634839239,0.12448559591306324)]        |
|1634839239                         |[(1510101612,0.12448559591306324)]        |
|-612640902                         |[(1274183715,0.1220122862046063)]         |
|1274183715                         |[(-612640902,0.1220122862046063)]         |
|-1320876223                        |[(1293373212,0.17848855164122393)]        |
|1293373212                         |[(-1320876223,0.17848855164122393)]       |
|-1548374770                        |[(-1719102856,0.23297156732534166)]       |
|-1719102856                        |[(-1548374770,0.23297156732534166)]       |
+-----------------------------------+------------------------------------------+
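The neighbours column in the output above prints as a bracketed list of (id, distance) pairs. If the values reach Python as strings in that same shape (an assumption, since the column type is not stated here), a small helper along the following lines could turn them into tuples. The name parse_neighbors is hypothetical and is not part of Spark NLP.

```python
import re

# Matches "(id,distance)" where id is a possibly negative integer.
_PAIR = re.compile(r"\((-?\d+),([^)]+)\)")

def parse_neighbors(cell):
    """Parse a string such as "[(1634839239,0.12)]" into [(id, distance), ...]."""
    return [(int(doc_id), float(dist)) for doc_id, dist in _PAIR.findall(cell)]

# Two cells copied from the example output.
first = parse_neighbors("[(1634839239,0.12448559591306324)]")
second = parse_neighbors("[(-612640902,0.1220122862046063)]")
print(first, second)
```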
- setSimilarityMethod(value)[source]#
Sets the similarity method used to calculate the neighbours. (Default: “brp”, Bucketed Random Projection for Euclidean distance)
- Parameters:
- value : str
The similarity method used to calculate the neighbours.
- setNumberOfNeighbours(value)[source]#
Sets the number of neighbours the model will return for each document. (Default: 10)
- Parameters:
- value : int
The number of neighbours the model will return for each document.
- setBucketLength(value)[source]#
Sets the bucket length that controls the average size of hash buckets. (Default: 2.0)
- Parameters:
- value : float
The bucket length that controls the average size of hash buckets.
- setNumHashTables(value)[source]#
Sets the number of hash tables.
- Parameters:
- value : int
The number of hash tables.
- setVisibleDistances(value)[source]#
Sets whether document distances are visible in the result set.
- Parameters:
- value : bool
Whether document distances are visible in the result set. (Default: False)
- setIdentityRanking(value)[source]#
Sets whether the document's identity ranking is included in the result set.
- Parameters:
- value : bool
Whether the document's identity ranking is included in the result set. Useful for debugging. (Default: False)
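To make the meaning of numberOfNeighbours, visibleDistances, and identityRanking concrete, here is a minimal sketch that ranks documents by exact Euclidean distance. It is a stand-in for illustration only: the annotator uses approximate LSH search rather than this brute-force loop, its output format may differ, and rank_neighbours with its arguments is a hypothetical helper.

```python
import math

def rank_neighbours(embeddings, doc_id, number_of_neighbours=10,
                    visible_distances=False, identity_ranking=False):
    """Rank other documents by Euclidean distance to `doc_id` (brute force)."""
    query = embeddings[doc_id]
    scored = []
    for other_id, vector in embeddings.items():
        # Skip the document itself unless identity ranking is requested.
        if other_id == doc_id and not identity_ranking:
            continue
        scored.append((other_id, math.dist(query, vector)))
    scored.sort(key=lambda pair: pair[1])
    top = scored[:number_of_neighbours]
    # Hide the distances unless they were asked for.
    return top if visible_distances else [other_id for other_id, _ in top]

# Toy two-dimensional "embeddings" keyed by document id.
embeddings = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 0.0)}

print(rank_neighbours(embeddings, "a", number_of_neighbours=1))
print(rank_neighbours(embeddings, "a", 2, visible_distances=True))
print(rank_neighbours(embeddings, "a", 2, True, identity_ranking=True))
```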
- asRetriever(value)[source]#
Sets the query to use the document similarity ranker as a retriever in a RAG fashion. (Default: “”, empty if this annotator is not used as a retriever)
- Parameters:
- value : str
The query used to select nearest neighbors in the retrieval process.
- setAggregationMethod(value)[source]#
Sets the method used to aggregate multiple sentence embeddings into a single vector representation.
- Parameters:
- value : str
The aggregation method. Options: ‘AVERAGE’ (compute the mean of all embeddings), ‘FIRST’ (use the first embedding only), ‘MAX’ (compute the element-wise maximum across embeddings). (Default: ‘AVERAGE’)
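The three aggregation options can be illustrated with a short plain-Python sketch. It follows the descriptions given for ‘AVERAGE’, ‘FIRST’, and ‘MAX’ and is not the library's implementation; aggregate_embeddings is a hypothetical name.

```python
def aggregate_embeddings(embeddings, method="AVERAGE"):
    """Collapse several equal-length embeddings into one vector."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    if method == "FIRST":
        # Use the first embedding only.
        return list(embeddings[0])
    # Group the values of each dimension together.
    columns = list(zip(*embeddings))
    if method == "AVERAGE":
        return [sum(col) / len(col) for col in columns]
    if method == "MAX":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregation method: {method}")

# Two toy sentence embeddings belonging to the same document.
sentences = [[1.0, 4.0], [3.0, 2.0]]

for name in ("AVERAGE", "FIRST", "MAX"):
    print(name, aggregate_embeddings(sentences, name))
```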
- class DocumentSimilarityRankerModel(classname='com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerModel', java_model=None)[source]#
Base class for Model classes that wrap Java/Scala implementations. Subclasses should inherit this class before param mix-ins, because this sets the UID from the Java model.
- class DocumentSimilarityRankerFinisher[source]#
Instantiated model of the DocumentSimilarityRankerApproach. For usage and examples see the documentation of the main class.
Input Annotation types | Output Annotation type
SENTENCE_EMBEDDINGS | DOC_SIMILARITY_RANKINGS
- Parameters:
- extractNearestNeighbor
Whether to extract the nearest neighbor document
- setInputCols(*value)[source]#
Sets the names of the input annotation columns containing embeddings.
- Parameters:
- *value : str
Input columns for the annotator.
- setOutputCols(*value)[source]#
Sets the names of the finished output columns.
- Parameters:
- *value : List[str]
Output columns for the annotator.