Alexander Chepurnoy

The Web of Mind

Faster Cosine Similarity Between Two Dicuments With Scala&Lucene

| Comments

For calculating cosine similarity between two documents I used modified(and rewritten to Scala) example by Sujit Pal/Mark Butler(http://sujitpal.blogspot.ch/2011/10/computing-document-similarity-using.html, http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene?rq=1). It’s working, but under big workload(hundreds of comparisons in second in my case) it consumes a lot of CPU / memory and even worse, some of threads getting stuck for few minutes(!) somewhere inside the Lucene code. So I had to rewrite that code to avoid RAMDirectory / IndexReader usage(I need no to store documents to Lucene storage at all). Here are two functions I have written and now want to share:

  1. extractTerms function which extracts terms from document(presented as String) with stemming & stopwords removing:

     def extractTerms(content: String): Map[String, Int] = {    
         val analyzer = new StopAnalyzer(Version.LUCENE_46)
         val ts = new EnglishMinimalStemFilter(analyzer.tokenStream("c", content))
         val charTermAttribute = ts.addAttribute(classOf[CharTermAttribute])
    
         val m = scala.collection.mutable.Map[String, Int]()
    
         ts.reset()
         while (ts.incrementToken()) {
             val term = charTermAttribute.toString
             val newCount = m.get(term).map(_ + 1).getOrElse(1)
             m += term -> newCount       
         }
    
         m.toMap
     }  
    
  2. similarity function which calculates similarity between two term vectors

     def similarity(t1: Map[String, Int], t2: Map[String, Int]): Double = {
         //word, t1 freq, t2 freq
         val m = scala.collection.mutable.HashMap[String, (Int, Int)]()
    
         val sum1 = t1.foldLeft(0d) {case (sum, (word, freq)) =>
             m += word ->(freq, 0)
             sum + freq
         }
    
         val sum2 = t2.foldLeft(0d) {case (sum, (word, freq)) =>
             m.get(word) match {
                 case Some((freq1, _)) => m += word ->(freq1, freq)
                 case None => m += word ->(0, freq)
             }
             sum + freq
         }
    
         val (p1, p2, p3) = m.foldLeft((0d, 0d, 0d)) {case ((s1, s2, s3), e) =>
             val fs = e._2
             val f1 = fs._1 / sum1
             val f2 = fs._2 / sum2
             (s1 + f1 * f2, s2 + f1 * f1, s3 + f2 * f2)
         }
    
         val cos = p1 / (Math.sqrt(p2) * Math.sqrt(p3))
         cos
     }
    

To calculate cosine similarity between text1 and text2 just call similarity(extractTerms(text1), extractTerms(text2))

In tests my code is 7-10 times faster and has less memory footprint. Enjoy!

Comments