similarity

package similarity

Type Members

class DocumentSimilarityRankerApproach extends AnnotatorApproach[DocumentSimilarityRankerModel] with HasEnableCachingProperties

Annotator that uses the locality-sensitive hashing (LSH) techniques available in Spark MLlib to run an approximate nearest-neighbor search over sentence embeddings.

Each document is represented by its sentence embedding, a dense, continuous vector that captures the document's semantic meaning, and the ranker searches that vector space for the closest documents.

For instantiated/pretrained models, see DocumentSimilarityRankerModel.

For extended usage examples, see the Jupyter notebook Document Similarity Ranker for Spark NLP.
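To make the LSH-based search described above more concrete, here is a minimal, dependency-free sketch of bucketed random projection hashing, the scheme that Spark MLlib documents for Euclidean distance, where each hash value is floor((x · v) / bucketLength). This is not the annotator's or Spark's implementation: the object `BrpLshSketch` and all of its helpers are hypothetical names introduced only for illustration, and the candidate filter is a simplification.

```scala
import scala.util.Random

// Hypothetical illustration only; not part of Spark NLP or Spark MLlib.
object BrpLshSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def euclidean(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One random unit-length projection vector per hash table.
  def randomUnitVectors(numHashTables: Int, dim: Int, seed: Long): Array[Vec] = {
    val rng = new Random(seed)
    Array.fill(numHashTables) {
      val v = Array.fill(dim)(rng.nextGaussian())
      val norm = math.sqrt(dot(v, v))
      v.map(_ / norm)
    }
  }

  // h(x) = floor((x . v) / bucketLength), one bucket index per hash table.
  def hash(x: Vec, projections: Array[Vec], bucketLength: Double): Array[Long] =
    projections.map(v => math.floor(dot(x, v) / bucketLength).toLong)

  // Approximate k nearest neighbours of one document:
  // keep only documents that share a bucket with the query in at least one
  // hash table, then rank those candidates by exact Euclidean distance.
  def nearest(
      queryId: Int,
      docs: Map[Int, Vec],
      projections: Array[Vec],
      bucketLength: Double,
      k: Int): List[(Int, Double)] = {
    val query = docs(queryId)
    val queryHash = hash(query, projections, bucketLength)
    docs.toList
      .filter { case (id, _) => id != queryId }
      .filter { case (_, vec) =>
        hash(vec, projections, bucketLength)
          .zip(queryHash)
          .exists { case (a, b) => a == b }
      }
      .map { case (id, vec) => (id, euclidean(query, vec)) }
      .sortBy(_._2)
      .take(k)
  }
}
```

In this sketch, `bucketLength`, the number of projection vectors, and `k` play roles loosely analogous to `setBucketLength`, `setNumHashTables`, and `setNumberOfNeighbours` in the example that follows. A larger bucket length or more hash tables makes it more likely that true neighbours share a bucket, at the cost of comparing more candidates.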

Example

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerApproach
import com.johnsnowlabs.nlp.finisher.DocumentSimilarityRankerFinisher
import org.apache.spark.ml.Pipeline

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceEmbeddings = RoBertaSentenceEmbeddings
  .pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val documentSimilarityRanker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val documentSimilarityRankerFinisher = new DocumentSimilarityRankerFinisher()
  .setInputCols("doc_similarity_rankings")
  .setOutputCols(
    "finished_doc_similarity_rankings_id",
    "finished_doc_similarity_rankings_neighbors")
  .setExtractNearestNeighbor(true)

// Let's use a dataset where we can visually control similarity
// Documents are paired as 1-2, 3-4, 5-6 and 7-8; each pair was written to be similar on purpose
val data = Seq(
  "First document, this is my first sentence. This is my second sentence.",
  "Second document, this is my second sentence. This is my second sentence.",
  "Third document, climate change is arguably one of the most pressing problems of our time.",
  "Fourth document, climate change is definitely one of the most pressing problems of our time.",
  "Fifth document, Florence in Italy, is among the most beautiful cities in Europe.",
  "Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.",
  "Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France.",
  "Eighth document, the warmest place in France is the French Riviera coast in Southern France.")
  .toDF("text")

val pipeline = new Pipeline().setStages(
  Array(
    documentAssembler,
    sentenceEmbeddings,
    documentSimilarityRanker,
    documentSimilarityRankerFinisher))

val result = pipeline.fit(data).transform(data)

result
  .select("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
  .show(10, truncate = false)
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1634839239,0.12448559591306324)]        |
|1634839239                         |[(1510101612,0.12448559591306324)]        |
|-612640902                         |[(1274183715,0.1220122862046063)]         |
|1274183715                         |[(-612640902,0.1220122862046063)]         |
|-1320876223                        |[(1293373212,0.17848855164122393)]        |
|1293373212                         |[(-1320876223,0.17848855164122393)]       |
|-1548374770                        |[(-1719102856,0.23297156732534166)]       |
|-1719102856                        |[(-1548374770,0.23297156732534166)]       |
+-----------------------------------+------------------------------------------+
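The neighbors column above prints as a bracketed list of (id, distance) pairs. The following is a small sketch of how such a value could be parsed on the driver, assuming the column is available as a string in exactly the format shown in the table. That string type is an assumption, and `NeighborsParser` is a hypothetical helper, not part of Spark NLP.

```scala
// Hypothetical helper; assumes input formatted like "[(1634839239,0.12448559591306324)]".
object NeighborsParser {
  // Matches one "(id,distance)" pair: a signed integer id and a decimal distance.
  private val Pair =
    """\((-?\d+),\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\)""".r

  // Returns every (id, distance) pair found in the string, in order of appearance.
  def parse(neighbors: String): List[(Int, Double)] =
    Pair
      .findAllMatchIn(neighbors)
      .map(m => (m.group(1).toInt, m.group(2).toDouble))
      .toList
}
```

For instance, `NeighborsParser.parse("[(1634839239,0.12448559591306324)]")` yields a single pair holding the neighbor id and its distance. An input with no recognizable pairs yields an empty list.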

class DocumentSimilarityRankerModel extends AnnotatorModel[DocumentSimilarityRankerModel] with HasSimpleAnnotate[DocumentSimilarityRankerModel] with HasEmbeddingsProperties with ParamsAndFeaturesWritable
Instantiated model of the DocumentSimilarityRankerApproach. For usage and examples see the documentation of the main class.
case class IndexedNeighbors(neighbors: Array[Int]) extends NeighborAnnotation with Product with Serializable
case class IndexedNeighborsWithDistance(neighbors: Array[(Int, Double)]) extends NeighborAnnotation with Product with Serializable
sealed trait NeighborAnnotation extends AnyRef
case class NeighborsResultSet(result: (Int, NeighborAnnotation)) extends Product with Serializable
trait ReadableDocumentSimilarityRanker extends ParamsAndFeaturesReadable[DocumentSimilarityRankerModel]

Value Members

object DocumentSimilarityRankerApproach extends DefaultParamsReadable[DocumentSimilarityRankerApproach] with Serializable
This is the companion object of DocumentSimilarityRankerApproach. Please refer to that class for the documentation.
object DocumentSimilarityRankerModel extends ReadableDocumentSimilarityRanker with Serializable
object DocumentSimilarityUtil
