package similarity
Type Members
- class DocumentSimilarityRankerApproach extends AnnotatorApproach[DocumentSimilarityRankerModel] with HasEnableCachingProperties
Annotator that uses the locality-sensitive hashing (LSH) techniques in Spark MLlib to run approximate nearest-neighbor search over sentence embeddings.
It aims to capture the semantic meaning of a document in a dense, continuous vector space and return it to the ranker search.
For instantiated/pretrained models, see DocumentSimilarityRankerModel.
For extended usage examples, see the Jupyter notebook Document Similarity Ranker for Spark NLP.
Example
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerApproach
import com.johnsnowlabs.nlp.finisher.DocumentSimilarityRankerFinisher
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceEmbeddings = RoBertaSentenceEmbeddings
  .pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val documentSimilarityRanker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val documentSimilarityRankerFinisher = new DocumentSimilarityRankerFinisher()
  .setInputCols("doc_similarity_rankings")
  .setOutputCols(
    "finished_doc_similarity_rankings_id",
    "finished_doc_similarity_rankings_neighbors")
  .setExtractNearestNeighbor(true)

// Let's use a dataset where we can visually control similarity.
// Documents are paired as 1-2, 3-4, 5-6, 7-8, and they were created to be similar on purpose.
val data = Seq(
  "First document, this is my first sentence. This is my second sentence.",
  "Second document, this is my second sentence. This is my second sentence.",
  "Third document, climate change is arguably one of the most pressing problems of our time.",
  "Fourth document, climate change is definitely one of the most pressing problems of our time.",
  "Fifth document, Florence in Italy, is among the most beautiful cities in Europe.",
  "Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.",
  "Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France.",
  "Eighth document, the warmest place in France is the French Riviera coast in Southern France.")
  .toDF("text")

val pipeline = new Pipeline().setStages(
  Array(
    documentAssembler,
    sentenceEmbeddings,
    documentSimilarityRanker,
    documentSimilarityRankerFinisher))

val result = pipeline.fit(data).transform(data)

result
  .select("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
  .show(10, truncate = false)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1634839239,0.12448559591306324)]        |
|1634839239                         |[(1510101612,0.12448559591306324)]        |
|-612640902                         |[(1274183715,0.1220122862046063)]         |
|1274183715                         |[(-612640902,0.1220122862046063)]         |
|-1320876223                        |[(1293373212,0.17848855164122393)]        |
|1293373212                         |[(-1320876223,0.17848855164122393)]       |
|-1548374770                        |[(-1719102856,0.23297156732534166)]       |
|-1719102856                        |[(-1548374770,0.23297156732534166)]       |
+-----------------------------------+------------------------------------------+
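To give some intuition for the bucketLength and numHashTables settings used in the example, here is a minimal, self-contained sketch of bucketed random projection LSH, assuming that is what the "brp" similarity method refers to. It is not the Spark MLlib or Spark NLP implementation. The object BrpLshSketch and all of its helpers are hypothetical names introduced only for illustration, and the hash rule floor((x · v) / bucketLength) is the commonly documented form of this technique.

```scala
import scala.util.Random

// Hypothetical illustration only; not part of Spark NLP or Spark MLlib.
object BrpLshSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def euclidean(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One random unit vector per hash table.
  def randomProjections(dim: Int, numHashTables: Int, seed: Long): Array[Vec] = {
    val rng = new Random(seed)
    Array.fill(numHashTables) {
      val v = Array.fill(dim)(rng.nextGaussian())
      val norm = math.sqrt(dot(v, v))
      v.map(_ / norm)
    }
  }

  // Bucket index of x in each table: floor((x . v) / bucketLength).
  def hash(x: Vec, projections: Array[Vec], bucketLength: Double): Array[Long] =
    projections.map(v => math.floor(dot(x, v) / bucketLength).toLong)

  // Candidates share a bucket with the query in at least one table;
  // they are then ranked by exact Euclidean distance and the top k are kept.
  def nearestNeighbors(
      data: Map[Int, Vec],
      queryId: Int,
      k: Int,
      projections: Array[Vec],
      bucketLength: Double): Seq[(Int, Double)] = {
    val query = data(queryId)
    val queryHash = hash(query, projections, bucketLength)
    data.toSeq
      .filter { case (id, v) =>
        id != queryId &&
          hash(v, projections, bucketLength).zip(queryHash).exists { case (a, b) => a == b }
      }
      .map { case (id, v) => (id, euclidean(query, v)) }
      .sortBy(_._2)
      .take(k)
  }
}
```

In this sketch, a smaller bucketLength produces finer buckets and therefore fewer candidates per query, while more hash tables raise the chance that two genuinely close vectors collide in at least one table. The search is approximate because a true neighbor that shares no bucket with the query is never compared.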
- class DocumentSimilarityRankerModel extends AnnotatorModel[DocumentSimilarityRankerModel] with HasSimpleAnnotate[DocumentSimilarityRankerModel] with HasEmbeddingsProperties with ParamsAndFeaturesWritable
Instantiated model of the DocumentSimilarityRankerApproach. For usage and examples see the documentation of the main class.
- case class IndexedNeighbors(neighbors: Array[Int]) extends NeighborAnnotation with Product with Serializable
- case class IndexedNeighborsWithDistance(neighbors: Array[(Int, Double)]) extends NeighborAnnotation with Product with Serializable
- sealed trait NeighborAnnotation extends AnyRef
- case class NeighborsResultSet(result: (Int, NeighborAnnotation)) extends Product with Serializable
- trait ReadableDocumentSimilarityRanker extends ParamsAndFeaturesReadable[DocumentSimilarityRankerModel]
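Because NeighborAnnotation is a sealed trait with two case-class implementations, code that consumes a NeighborsResultSet can pattern match exhaustively on it. The sketch below redeclares local stand-ins that mirror the signatures listed above so that it runs without Spark NLP on the classpath. The describe helper is hypothetical, and reading the leading Int as a document id and the array entries as neighbor ids is an assumption based on the example output shown for the approach.

```scala
// Local stand-ins mirroring the listed signatures; not imported from Spark NLP.
sealed trait NeighborAnnotation
case class IndexedNeighbors(neighbors: Array[Int]) extends NeighborAnnotation
case class IndexedNeighborsWithDistance(neighbors: Array[(Int, Double)]) extends NeighborAnnotation
case class NeighborsResultSet(result: (Int, NeighborAnnotation))

object NeighborSketch {
  // Hypothetical helper: render a result set as a readable string.
  def describe(rs: NeighborsResultSet): String = {
    val (id, annotation) = rs.result
    annotation match {
      // Neighbor ids only.
      case IndexedNeighbors(ns) =>
        s"$id -> ${ns.mkString("[", ",", "]")}"
      // Neighbor ids paired with their distances.
      case IndexedNeighborsWithDistance(ns) =>
        s"$id -> ${ns.map { case (n, d) => s"($n,$d)" }.mkString("[", ",", "]")}"
    }
  }
}
```

Since the trait is sealed, the compiler can warn if a match leaves out one of the two cases.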
Value Members
- object DocumentSimilarityRankerApproach extends DefaultParamsReadable[DocumentSimilarityRankerApproach] with Serializable
This is the companion object of DocumentSimilarityRankerApproach. Please refer to that class for the documentation.
- object DocumentSimilarityRankerModel extends ReadableDocumentSimilarityRanker with Serializable
- object DocumentSimilarityUtil