package similarity

Type Members

  1. class DocumentSimilarityRankerApproach extends AnnotatorApproach[DocumentSimilarityRankerModel] with HasEnableCachingProperties

    Annotator that uses the LSH (locality-sensitive hashing) techniques in Spark MLlib to run approximate nearest-neighbor search over sentence embeddings.

    Each document is represented as a dense, continuous vector that captures its semantic meaning, and the ranker searches those vectors for each document's nearest neighbors.
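
    To illustrate the underlying idea, here is a minimal sketch that calls Spark ML's BucketedRandomProjectionLSH directly on toy vectors. It is not the annotator's internal code: the object name LshSketch, the helper nearestNeighborIds, the column names, and the 2-dimensional vectors are assumptions made for illustration. Because the search is approximate, it may return fewer than k rows.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LshSketch {

  // Hypothetical helper: ids of (up to) k rows approximately closest to a fixed query vector.
  def nearestNeighborIds(spark: SparkSession, k: Int): Array[Int] = {
    import spark.implicits._

    // Toy 2-dimensional stand-ins for sentence embeddings (assumption for illustration).
    val vectors = Seq(
      (0, Vectors.dense(1.0, 1.0)),
      (1, Vectors.dense(1.0, 1.1)),
      (2, Vectors.dense(-1.0, -1.0)),
      (3, Vectors.dense(-1.0, -1.2))
    ).toDF("id", "features")

    // Bucketed random projection is the LSH family for Euclidean distance.
    val lsh = new BucketedRandomProjectionLSH()
      .setInputCol("features")
      .setOutputCol("hashes")
      .setBucketLength(2.0)
      .setNumHashTables(3)

    val model = lsh.fit(vectors)

    // Approximate search; the result carries a "distCol" column with the distances.
    val query = Vectors.dense(1.0, 1.0)
    model
      .approxNearestNeighbors(vectors, query, k)
      .select("id")
      .collect()
      .map(_.getInt(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("lsh-sketch").getOrCreate()
    println(nearestNeighborIds(spark, 2).mkString(", "))
    spark.stop()
  }
}
```

    The bucket length and number of hash tables here mirror the values used in the example below; larger bucket lengths and more hash tables generally trade speed for recall.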

    For instantiated/pretrained models, see DocumentSimilarityRankerModel.

    For extended usage examples, see the Jupyter notebook Document Similarity Ranker for Spark NLP.

    Example

    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerApproach
    import com.johnsnowlabs.nlp.finisher.DocumentSimilarityRankerFinisher
    import org.apache.spark.ml.Pipeline
    
    import spark.implicits._
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceEmbeddings = RoBertaSentenceEmbeddings
      .pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val documentSimilarityRanker = new DocumentSimilarityRankerApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("doc_similarity_rankings")
      .setSimilarityMethod("brp")
      .setNumberOfNeighbours(1)
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setVisibleDistances(true)
      .setIdentityRanking(false)
    
    val documentSimilarityRankerFinisher = new DocumentSimilarityRankerFinisher()
      .setInputCols("doc_similarity_rankings")
      .setOutputCols(
        "finished_doc_similarity_rankings_id",
        "finished_doc_similarity_rankings_neighbors")
      .setExtractNearestNeighbor(true)
    
    // Use a dataset where similarity is easy to check by eye.
    // Documents are paired (1-2, 3-4, 5-6, 7-8) and were written to be similar on purpose.
    val data = Seq(
      "First document, this is my first sentence. This is my second sentence.",
      "Second document, this is my second sentence. This is my second sentence.",
      "Third document, climate change is arguably one of the most pressing problems of our time.",
      "Fourth document, climate change is definitely one of the most pressing problems of our time.",
      "Fifth document, Florence in Italy, is among the most beautiful cities in Europe.",
      "Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.",
      "Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France.",
      "Eighth document, the warmest place in France is the French Riviera coast in Southern France.")
      .toDF("text")
    
    val pipeline = new Pipeline().setStages(
      Array(
        documentAssembler,
        sentenceEmbeddings,
        documentSimilarityRanker,
        documentSimilarityRankerFinisher))
    
    val result = pipeline.fit(data).transform(data)
    
    result
      .select("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
      .show(10, truncate = false)
    +-----------------------------------+------------------------------------------+
    |finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
    +-----------------------------------+------------------------------------------+
    |1510101612                         |[(1634839239,0.12448559591306324)]        |
    |1634839239                         |[(1510101612,0.12448559591306324)]        |
    |-612640902                         |[(1274183715,0.1220122862046063)]         |
    |1274183715                         |[(-612640902,0.1220122862046063)]         |
    |-1320876223                        |[(1293373212,0.17848855164122393)]        |
    |1293373212                         |[(-1320876223,0.17848855164122393)]       |
    |-1548374770                        |[(-1719102856,0.23297156732534166)]       |
    |-1719102856                        |[(-1548374770,0.23297156732534166)]       |
    +-----------------------------------+------------------------------------------+
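
    The neighbor entries in the table above are rendered as lists of (id, distance) pairs. As a rough sketch, and assuming the neighbors column is available as a string in exactly that form, the pairs could be recovered as follows. The object NeighborParsing and the function parseNeighbors are hypothetical helpers, not part of Spark NLP.

```scala
object NeighborParsing {
  // Matches one "(id,distance)" pair, e.g. "(1634839239,0.12448559591306324)".
  private val Pair = """\((-?\d+),\s*([0-9.eE+-]+)\)""".r

  /** Parses a string such as "[(1,0.5),(2,0.7)]" into (id, distance) pairs. */
  def parseNeighbors(raw: String): List[(Int, Double)] =
    Pair.findAllMatchIn(raw)
      .map(m => (m.group(1).toInt, m.group(2).toDouble))
      .toList
}
```

    For instance, NeighborParsing.parseNeighbors("[(1634839239,0.12448559591306324)]") yields a single pair holding the neighbor id and its distance; an empty list "[]" yields no pairs.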

  2. class DocumentSimilarityRankerModel extends AnnotatorModel[DocumentSimilarityRankerModel] with HasSimpleAnnotate[DocumentSimilarityRankerModel] with HasEmbeddingsProperties with ParamsAndFeaturesWritable

    Instantiated model of the DocumentSimilarityRankerApproach. For usage and examples, see the documentation of that class.

  3. case class IndexedNeighbors(neighbors: Array[Int]) extends NeighborAnnotation with Product with Serializable
  4. case class IndexedNeighborsWithDistance(neighbors: Array[(Int, Double)]) extends NeighborAnnotation with Product with Serializable
  5. sealed trait NeighborAnnotation extends AnyRef
  6. case class NeighborsResultSet(result: (Int, NeighborAnnotation)) extends Product with Serializable
  7. trait ReadableDocumentSimilarityRanker extends ParamsAndFeaturesReadable[DocumentSimilarityRankerModel]
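
    Since NeighborAnnotation is a sealed trait with two case-class implementations, a result can be handled with an exhaustive pattern match. The sketch below uses local stand-ins that mirror the signatures listed above rather than the library classes themselves; the object NeighborAnnotationSketch and the function describe are hypothetical and only illustrate the shape of the data.

```scala
object NeighborAnnotationSketch {
  // Local stand-ins mirroring the listed signatures (not the Spark NLP classes).
  sealed trait NeighborAnnotation
  case class IndexedNeighbors(neighbors: Array[Int]) extends NeighborAnnotation
  case class IndexedNeighborsWithDistance(neighbors: Array[(Int, Double)])
      extends NeighborAnnotation
  case class NeighborsResultSet(result: (Int, NeighborAnnotation))

  /** Renders a result set as "id -> [neighbors]", with or without distances. */
  def describe(rs: NeighborsResultSet): String = {
    val (id, annotation) = rs.result
    // The trait is sealed, so the compiler can check that both cases are covered.
    val rendered = annotation match {
      case IndexedNeighbors(ids) =>
        ids.mkString("[", ",", "]")
      case IndexedNeighborsWithDistance(pairs) =>
        pairs.map { case (n, d) => s"($n,$d)" }.mkString("[", ",", "]")
    }
    s"$id -> $rendered"
  }
}
```

    With these stand-ins, describe(NeighborsResultSet((1, IndexedNeighbors(Array(2, 3))))) returns "1 -> [2,3]".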

Value Members

  1. object DocumentSimilarityRankerApproach extends DefaultParamsReadable[DocumentSimilarityRankerApproach] with Serializable

    This is the companion object of DocumentSimilarityRankerApproach. Please refer to that class for the documentation.

  2. object DocumentSimilarityRankerModel extends ReadableDocumentSimilarityRanker with Serializable
