`sparknlp.annotator.similarity.document_similarity_ranker`#

Contains classes for DocumentSimilarityRanker.

Module Contents#

Classes#

`DocumentSimilarityRankerApproach`	Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings.
`DocumentSimilarityRankerModel`	Base class for `Model`s that wrap Java/Scala implementations.
`DocumentSimilarityRankerFinisher`	Instantiated model of the DocumentSimilarityRankerApproach. For usage and examples see the documentation of the main class.

class DocumentSimilarityRankerApproach[source]#

Annotator that uses LSH techniques present in Spark ML lib to execute approximate nearest neighbors search on top of sentence embeddings.

The sentence embeddings capture the semantic meaning of a document in a dense, continuous vector space, and the ranker runs its search over those vectors.

For instantiated/pretrained models, see DocumentSimilarityRankerModel.

For extended usage examples, see the Jupyter notebook Document Similarity Ranker for Spark NLP.

Input Annotation types	Output Annotation type
`SENTENCE_EMBEDDINGS`	`DOC_SIMILARITY_RANKINGS`

Parameters:

enableCaching: Whether to cache DataFrames or RDDs during training.
similarityMethod: The similarity method used to calculate the neighbours (Default: ‘brp’, Bucketed Random Projection for Euclidean Distance).
numberOfNeighbours: The number of neighbours the model will return (Default: 10).
bucketLength: Controls the average size of hash buckets. A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket, which increases the number of both true and false positives.
numHashTables: Number of hash tables. Increasing it lowers the false negative rate; decreasing it improves running performance.
visibleDistances: Whether to show distances in the ranking output (Default: false).
identityRanking: Whether to include each document's own identity in the ranking result set. Useful for debugging (Default: false).

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> from sparknlp.annotator.similarity.document_similarity_ranker import *
>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence_embeddings = E5Embeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence_embeddings")
>>> document_similarity_ranker = DocumentSimilarityRankerApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("doc_similarity_rankings") \
...     .setSimilarityMethod("brp") \
...     .setNumberOfNeighbours(1) \
...     .setBucketLength(2.0) \
...     .setNumHashTables(3) \
...     .setVisibleDistances(True) \
...     .setIdentityRanking(False)
>>> document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
...     .setInputCols("doc_similarity_rankings") \
...     .setOutputCols(
...         "finished_doc_similarity_rankings_id",
...         "finished_doc_similarity_rankings_neighbors") \
...     .setExtractNearestNeighbor(True)
>>> pipeline = Pipeline(stages=[
...             document_assembler,
...             sentence_embeddings,
...             document_similarity_ranker,
...             document_similarity_ranker_finisher
...         ])
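>>> # `data` is not defined in this example; it is assumed to be a Spark DataFrame with a string column named "text"
>>> docSimRankerPipeline = pipeline.fit(data).transform(data)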
>>> (
...     docSimRankerPipeline
...         .select(
...                "finished_doc_similarity_rankings_id",
...                "finished_doc_similarity_rankings_neighbors"
...         ).show(10, False)
... )
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1634839239,0.12448559591306324)]        |
|1634839239                         |[(1510101612,0.12448559591306324)]        |
|-612640902                         |[(1274183715,0.1220122862046063)]         |
|1274183715                         |[(-612640902,0.1220122862046063)]         |
|-1320876223                        |[(1293373212,0.17848855164122393)]        |
|1293373212                         |[(-1320876223,0.17848855164122393)]       |
|-1548374770                        |[(-1719102856,0.23297156732534166)]       |
|-1719102856                        |[(-1548374770,0.23297156732534166)]       |
+-----------------------------------+------------------------------------------+

setSimilarityMethod(value)[source]#

Sets the similarity method used to calculate the neighbours (Default: “brp”, Bucketed Random Projection for Euclidean Distance).

Parameters:

value (str): the similarity method used to calculate the neighbours.
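
For orientation, the search that the default “brp” method approximates can be written as an exact Euclidean nearest-neighbour ranking. The sketch below is plain Python, not Spark NLP code; `rank_neighbours`, its `include_identity` flag (mirroring the idea behind identityRanking), and the toy vectors are hypothetical names chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_neighbours(doc_id, embeddings, k, include_identity=False):
    """Return the k closest documents to doc_id as (id, distance) pairs."""
    query = embeddings[doc_id]
    scored = [
        (other_id, euclidean(query, vec))
        for other_id, vec in embeddings.items()
        # Skip the query document itself unless identity is requested.
        if include_identity or other_id != doc_id
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings" keyed by a made-up document id.
embeddings = {
    "a": [0.0, 0.0],
    "b": [3.0, 4.0],
    "c": [0.0, 1.0],
}
nearest = rank_neighbours("a", embeddings, k=1)
with_identity = rank_neighbours("a", embeddings, k=2, include_identity=True)
```

An exact ranking like this compares every pair of documents, which is the cost that locality-sensitive hashing is meant to avoid on large collections.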

setNumberOfNeighbours(value)[source]#

Sets the number of neighbours the model will return for each document (Default: 10).

Parameters:

value (int): the number of neighbours the model will return for each document.

setBucketLength(value)[source]#

Sets the bucket length that controls the average size of hash buckets (Default: 2.0).

Parameters:

value (float): the bucket length that controls the average size of hash buckets.
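
To make the effect of the bucket length concrete, here is a minimal sketch of a bucketed random projection hash, assuming the form given in the Spark ML LSH documentation: the bucket index is the floor of the dot product with a random unit vector, divided by the bucket length. It is plain Python for illustration only; `brp_hash`, `random_unit_vector`, and `distinct_buckets` are hypothetical helpers, not part of Spark NLP.

```python
import math
import random


def brp_hash(vector, direction, bucket_length):
    """Bucket index: floor(dot(vector, direction) / bucket_length)."""
    projection = sum(x * d for x, d in zip(vector, direction))
    return math.floor(projection / bucket_length)


def random_unit_vector(dim, rng):
    """Random direction with unit Euclidean norm."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


rng = random.Random(0)
direction = random_unit_vector(4, rng)
points = [[rng.uniform(-5.0, 5.0) for _ in range(4)] for _ in range(200)]


def distinct_buckets(bucket_length):
    """Number of different buckets the sample points fall into."""
    return len({brp_hash(p, direction, bucket_length) for p in points})


narrow = distinct_buckets(0.5)  # short buckets: points spread over many buckets
wide = distinct_buckets(8.0)    # long buckets: more points share a bucket
print(narrow, wide)
```

With the longer bucket length the same points land in no more buckets than before, so more pairs collide, which is the source of both the extra true positives and the extra false positives.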

setNumHashTables(value)[source]#

Sets the number of hash tables.

Parameters:

value (int): the number of hash tables.
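
The trade-off behind the number of hash tables can be illustrated with a small calculation. The sketch assumes that two similar documents collide in any single table with probability `p_collide`, that tables are independent, and that a pair becomes a candidate when it collides in at least one table (OR-amplification). These are simplifying assumptions for illustration, and `miss_probability` is a hypothetical helper, not a Spark NLP function.

```python
def miss_probability(p_collide, num_hash_tables):
    """Chance that a truly similar pair collides in none of the tables."""
    return (1.0 - p_collide) ** num_hash_tables


# More tables: fewer missed neighbours, but more hashing work per document.
for tables in (1, 3, 10):
    print(tables, miss_probability(0.4, tables))
```

Under these assumptions the false negative rate shrinks geometrically with each added table, while the hashing cost grows linearly.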

setVisibleDistances(value)[source]#

Sets whether document distances are visible in the result set.

Parameters:

value (bool): whether document distances are visible in the result set (Default: False).

setIdentityRanking(value)[source]#

Sets whether each document's own identity ranking is included in the result set.

Parameters:

value (bool): whether each document's own identity ranking is included in the result set. Useful for debugging (Default: False).

asRetriever(value)[source]#

Sets the query used when the document similarity ranker acts as a retriever in a RAG fashion (Default: “”, i.e. empty when this annotator is not used as a retriever).

Parameters:

value (str): the query used to select nearest neighbors in the retrieval process.

setAggregationMethod(value)[source]#

Sets the method used to aggregate multiple sentence embeddings into a single vector representation.

Parameters:

value (str): the aggregation method. Options are ‘AVERAGE’ (compute the mean of all embeddings), ‘FIRST’ (use the first embedding only), and ‘MAX’ (compute the element-wise maximum across embeddings) (Default: ‘AVERAGE’).
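
The three options can be sketched in plain Python as follows. This only illustrates the arithmetic described above on toy vectors; `aggregate` is a hypothetical helper and not the annotator's implementation.

```python
def aggregate(embeddings, method="AVERAGE"):
    """Collapse a list of equal-length vectors into one vector."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    if method == "AVERAGE":
        count = len(embeddings)
        # zip(*embeddings) iterates over dimensions (columns).
        return [sum(column) / count for column in zip(*embeddings)]
    if method == "FIRST":
        return list(embeddings[0])
    if method == "MAX":
        return [max(column) for column in zip(*embeddings)]
    raise ValueError(f"unknown aggregation method: {method}")


# Two toy sentence embeddings belonging to the same document.
sentences = [[1.0, 4.0], [3.0, 2.0]]
averaged = aggregate(sentences, "AVERAGE")
first = aggregate(sentences, "FIRST")
maximum = aggregate(sentences, "MAX")
```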

class DocumentSimilarityRankerModel(classname='com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerModel', java_model=None)[source]#

Base class for `Model`s that wrap Java/Scala implementations. Subclasses should inherit this class before param mix-ins, because this sets the UID from the Java model.

class DocumentSimilarityRankerFinisher[source]#

Instantiated model of the DocumentSimilarityRankerApproach. For usage and examples see the documentation of the main class.

Input Annotation types	Output Annotation type
`SENTENCE_EMBEDDINGS`	`DOC_SIMILARITY_RANKINGS`

Parameters:

extractNearestNeighbor: Whether to extract the nearest neighbor document

setInputCols(*value)[source]#

Sets name of input annotation columns containing embeddings.

Parameters:

*value (str): Input columns for the annotator

setOutputCols(*value)[source]#

Sets names of finished output columns.

Parameters:

*value (List[str]): Output columns for the annotator

setExtractNearestNeighbor(value)[source]#

Sets whether to extract the nearest neighbor document, by default False.

Parameters:

value (bool): Whether to extract the nearest neighbor document

getInputCols()[source]#

Gets the names of the input annotation columns.

getOutputCols()[source]#

Gets the names of the output annotation columns.
