Alexander Chepurnoy

The Web of Mind

Faster Cosine Similarity Between Two Dicuments With Scala&Lucene

| Comments

For calculating cosine similarity between two documents I used modified(and rewritten to Scala) example by Sujit Pal/Mark Butler(http://sujitpal.blogspot.ch/2011/10/computing-document-similarity-using.html, http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene?rq=1). It’s working, but under big workload(hundreds of comparisons in second in my case) it consumes a lot of CPU / memory and even worse, some of threads getting stuck for few minutes(!) somewhere inside the Lucene code. So I had to rewrite that code to avoid RAMDirectory / IndexReader usage(I need no to store documents to Lucene storage at all). Here are two functions I have written and now want to share:

  1. extractTerms function which extracts terms from document(presented as String) with stemming & stopwords removing:

     def extractTerms(content: String): Map[String, Int] = {    
         val analyzer = new StopAnalyzer(Version.LUCENE_46)
         val ts = new EnglishMinimalStemFilter(analyzer.tokenStream("c", content))
         val charTermAttribute = ts.addAttribute(classOf[CharTermAttribute])
    
         val m = scala.collection.mutable.Map[String, Int]()
    
         ts.reset()
         while (ts.incrementToken()) {
             val term = charTermAttribute.toString
             val newCount = m.get(term).map(_ + 1).getOrElse(1)
             m += term -> newCount       
         }
    
         m.toMap
     }  
    
  2. similarity function which calculates similarity between two term vectors

     def similarity(t1: Map[String, Int], t2: Map[String, Int]): Double = {
         //word, t1 freq, t2 freq
         val m = scala.collection.mutable.HashMap[String, (Int, Int)]()
    
         val sum1 = t1.foldLeft(0d) {case (sum, (word, freq)) =>
             m += word ->(freq, 0)
             sum + freq
         }
    
         val sum2 = t2.foldLeft(0d) {case (sum, (word, freq)) =>
             m.get(word) match {
                 case Some((freq1, _)) => m += word ->(freq1, freq)
                 case None => m += word ->(0, freq)
             }
             sum + freq
         }
    
         val (p1, p2, p3) = m.foldLeft((0d, 0d, 0d)) {case ((s1, s2, s3), e) =>
             val fs = e._2
             val f1 = fs._1 / sum1
             val f2 = fs._2 / sum2
             (s1 + f1 * f2, s2 + f1 * f1, s3 + f2 * f2)
         }
    
         val cos = p1 / (Math.sqrt(p2) * Math.sqrt(p3))
         cos
     }
    

To calculate cosine similarity between text1 and text2 just call similarity(extractTerms(text1), extractTerms(text2))

In tests my code is 7-10 times faster and has less memory footprint. Enjoy!

Investigating Namecoin Database Pt. 1 : Introduction & Objects Counting

| Comments

Introduction

Probably you get some knowledge about cryptocurrencies from all that massive buzz around. But cryptocurrencies are not only buzz about another merchant accepting Bitcoin, market news, fun story involving dogecoin etc. A lot of tech possibilities we have now behind just money as digital property being transferring from one peer to another.

One of the notable alternative cryptocurrencies is Namecoin, which could be considered as currency and key-value database as well with auto-expiration of records injected in the blockchain. Owner of record has to pay (0.01 NMC for now) to have it in database for period of time of 36,000 blocks(~250 days).

You can insert any name-value record in the database, though there are some formal namespaces, two for now: DNS system for .bit domains(names starting with “d/”) and NameId which is fully decentralized OpenId alternative(names starting with “id/”).

Objects Counting

We are starting to get deeper into the Namecoin database with counting named objects in it.

First, install Namecoin daemon namecoind (very like Bitcoin’s bitcoind installation), for details, visit namecoin.info or namecoin.org . Wait while blockchain will be downloaded.

Then pull out all active records into out.txt file with

namecoind name_scan "" 1000000 > out.txt

To count all objects in database dump you have now use

cat out.txt | grep '"name"' | wc -l

To count .bit domains being registered

cat out.txt | grep '"name" : "d/' | wc -l

To count NameId entries

cat out.txt | grep '"name" : "id/' | wc -l

As of today, Feb 26th, 2014, Namecoin database contains 156276 entries in total, including 140237 .bit domains and 2919 NameIds.

Flattening Scala Futures(Future[Future[T]] –> Future[T])

| Comments

Well, as scala.concurrent.Future class has no .flatten method, you could be wondered how to convert Future of Future e.g. Future[Future[T]] to just Future[T]. That’s can be easy done with combination of flatMap and identity functions:

//having fft:Future[Future[T]]
val ft:Future[Int] = fft.flatMap(x=> x) 

Learn Play! Framework by Example

| Comments

Play! Framework is probably the most popular Web framework for Scala Language(and one of the most popular Web frameworks for Java too). It’s stateless, it’s elegant, it’s a good friend of functional reactive programming.

To learn framework, the site contains very good tutorial, also there are some samples within ‘samples’ folder. And I’m happy to present another option to dive into the framework.

The sample application was developed with aim to test Mechanize Framework, but GistLabs published it as standalone application.

Application contains following examples:

  • Controllers with different response codes: 302 redirect, 302 infinite redirect, 466 code, internal server error with custom message
  • Cookies examples: printing cookies got in request, setting cookie
  • XML examples: simple XML document output and echoing value passed in an input XML document passed with PUT request
  • JSON example: echoing value passed in an input JSON document passed with PUT request
  • Not-Modified example: on first request Etag/Cache-Control headers are sent in response, then Not-Modified result sent
  • Forms: GET/POST forms, POST form with validation
  • Files: single file upload, multiple files upload, POST/PUT/GET service example(create a file with random filename on POST, create a file with specified filename on PUT, then get file with GET request)
  • Auth: login to see secret token, or try to access secret area directly(with redirect to index page)

Pull sources from GitHub and learn Play! by example!

P.S. Next part of the story will be about different Play! sample applications around the Web

Console Applications With Play Framework 2.x

| Comments

What if you want to launch a part of your application within IDE to see log output? What if you want to have console launcher for actor system which is the part of your Play 2.x app? If your code uses Play classes(e.g. play.api.Logger) or it needs , you can’t write ordinary console app because it will throw “There is no started application” error during run.

How to get Play context within console application? Just extend your launcher class from play.core.StaticApplication class, pass path to application root to StaticApplication instance constructor. Simple example:

object ConsoleLauncher extends StaticApplication(new File(".")) {
    def main(args: Array[String]) {
        DataExtractionActorSystem.start()
    }
}

How to Remove Task Dependency in Gradle

| Comments

What if you want to remove dependency for Gradle task? For example, remove dependency from compileGroovy for compileJava(to add dependency in opposite dirrection and avoid cyclic dependency problem). Should be simple, but I didn’t find ready solution using search engines, though two minutes experiment provided me working code:

compileGroovy.taskDependencies.values -= "compileJava"

Ready solution to compile first Groovy classes then Java:

compileGroovy.taskDependencies.values -= "compileJava"
compileJava.dependsOn(compileGroovy)

Scala Clients for BTC-e Trade and Public Data APIs (My First Opensource Released)

| Comments

I just released my first open-source component, Scala Client for BTC-e Trade and Public Data APIs! BTC-e.com is broker for Bitcoin/Litecoin/Namecoin/other cryprocurrencies trading. This post is about some choices made during development and how to use the clients.

Usage Details and Examples

  • Implement ClientCredentials trait to connect to the Trade API :

      object MyCredentials extends ClientCredentials {
          val Key = "my key"
          val Secret = "my secret"
      }
    
  • Initialize Trade API client as

      val tradeClient = new DefaultTradeApiClient(MyCredentials)
    
  • Get free funds info with

      client.getInfo.map(println(_)) 
    
    It will print something like
      USD: 4.7 RUR: 2399 EUR: 0.0 BTC: 10 LTC: 19.99 NMC: 0.0 NVC: 0.0 TRC: 0.0 PPC: 0.0
    
  • Cancel all open orders with following code

      tradeClient.orderList.getOrElse(List()).foreach{order=>
          tradeClient.cancelOrder(order.orderId)
      } 
    
  • Create order to sell 200 litecoins for $4.99 each

      tradeClient.trade(Currency.LTC, Currency.USD, Direction.Sell, 4.99, 200.00)
    
  • Close connections at the end

      tradeClient.releaseConnections 
    
  • Initialize Public Data API as

      val pubClient = new DefaultMarketDataApiClient
    
  • Get last deal price from ticker data for BTC/USD and print it:

      pubClient.ticker(Currency.BTC, Currency.USD).map{td=>
          println(td.last)
      }
    

See MarketDataApiClient and TradeApiClient classes for more functions.

Implementation and Customization Details

  1. Common functions located in btce.scala, Trade API client in btce-trade.scala, Public Data API client in btce-marketdata.scala, Specs2 tests in BtceSpec.scala

  2. There are many HTTP layer implementations. I implemented http requests/responses using WS framework from PlayFramework 2(Scala wrapper for Ning framework). If your project doesn’t use PlayFramework and/or already uses another HTTP framework(e.g. Apache HttpClient), make own implementation of HttpApiClient trait. Override functions getRequest(url: String): String (simple get request, it’s used by Public Data API), signedPostRequest(url: String, key: String, Secret: String, postBody: String): String (post request with already signed postBody, used by Trade API ), releaseConnections (shutdown connections pool here, if needed). Then define own Trade API client with code like class MyTradeApiClient(credentials: ClientCredentials) extends TradeApiClient(credentials) with MyHttpApiClient, class MyMarketDataApiClient extends MarketDataApiClient with MyHttpApiClient for Public Data client.

  3. Enumerations chosen over sealed case classes hierarchy, e.g.

     object Direction extends Enumeration {
         type Direction = Value
         val Sell = Value("sell")
         val Buy = Value("buy")
     }
    

    It could be not the best choice in case of having in mind to build trading DSL over it. But I have no plans for trading DSL now.

  4. No logging implemented in the released version to avoid extra dependency. If you incorporate a client into your software, add logging where needed(catch clauses, None results) with a logging framework project uses.

Again, the URL is https://github.com/kushti/btce-scala

Why Scala+PlayFramework Could Be the Best Choice for Your Startup

| Comments

Do you plan to change this world with a web startup? Thinking about technology stack? Monsterous Spring+hundreds of other Java frameworks or elegant, trendy but bit controversial Ruby on Rails? Don’t think about any compromises, think about Scala+PlayFramework 2!

What gives you Scala and PlayFramework combination?

  • Play’s CLI(command line interface), hit refresh workflow, conciseness of Scala code and powerful abstractions provided by the framework(and dependent frameworks too, e.g. Specs2) give you stunning speed of development. In fact, you can have development speed typical for dynamic language while having all benefits of strong static typing. A startup needs for fast prototyping, so get it!

  • A startup should be scalable to handle fast growth of userbase. Stateless framework architecture & built-in Akka support give you highest level of scalability.

  • Scala is the JVM language means you can easily use thousands of opensource frameworks for map-reduce data processing, NLP, ML, genetic algorithms etc… Java was(and is) standard for academic open-source frameworks, #1 language for Apache Software Foundation(more than 100 opensource projects) etc. It adds speed to the prototyping, make your system more simple(one platform means less headache), also makes you development process much cheaper.

  • Type safety gives you more stable and predictable development process(easier refactorings, avoiding of some types of errors etc). Unit tests are not enough, there is no doubt.

  • Built-in asynchronous HTTP support makes modern web applications development easy.

P.S. I’m passionate about PlayFramework 2.x(already used it for 3 Scala and 1 Java projects). Next month I’m thinking about navigation plugin development(a bit like play navigator, but with formal FSM approach). Please write me if you want to contribute.

Play2+Morphia: How to Avoid ‘Can’t Parse Argument Number Interface’ Error

| Comments

Trying to run Play2 + Morphia application, you can get such an error java.lang.IllegalArgumentException: can't parse argument number interface com.google.code.morphia.annotations.Id = @com.google.code.morphia.annotations.Id().

How to avoid it:

  • Add SLF4JExtension for Morphia : http://code.google.com/p/morphia/wiki/SLF4JExtension. Here is the example how to add SBT dependencies, but please mind difference beetwen com.google.code.morphia and com.github.jmkgreen.morphia (make appropriate changes): https://github.com/leodagdag/play2-morphia-plugin/blob/master/project/Build.scala .

  • Add 2 lines to init of your Global object(or beforeStart method)

      import com.google.code.morphia.logging.MorphiaLoggerFactory
      import com.google.code.morphia.logging.slf4j.SLF4JLogrImplFactory
      import play.api.GlobalSettings
    
      object Global extends GlobalSettings{
          MorphiaLoggerFactory.reset()
          MorphiaLoggerFactory.registerLogger(classOf[SLF4JLogrImplFactory])
      }
    

Akka-based Data Extraction System Design

| Comments

Introduction

If you have an experience in data extraction systems, you know how hard it could be to develop. You need to implement workers then combine them in error-prone, scalable and flexible system. Sounds like a lot of pain, isn’t it? But with modern painkillers the job could be done much simpler. I mean Akka.

I already used Akka for some real-world systems, including realtime forex data mashup forexnotions.com, domains value estimation system, data gathering systems for SEO parameters, real estate etc. And I want to publish common approach I use in the simplest form.

The Example

Consider real estate data extraction system, where some sources have XML/RSS output, some only HTML. Workers already written, one for each site, so it’s the time to combine them into higher-level logic. Consider, for example, we have 3 sites to get data from, and we want to recrawl them every 30 seconds(too crazy, but it’s just an example).

A worker is derived from base trait BasePropertyExtractor and returns list of properties or exception. Let’s define sample workers as well as sample property bean

case class Property(name:String)
case class ExtractionResult(value: Either[Throwable, List[Property]])

trait BasePropertyExtractor {
    def extractData:ExtractionResult
    def label:String
}

class SiteAExtractor extends BasePropertyExtractor{
    override def extractData = ExtractionResult(Right(List[Property](Property("Nice beachside boongalow"))))
    override def label = "SiteAExtractor"
}

class SiteBExtractor extends BasePropertyExtractor{
    override def extractData = ExtractionResult(Right(List[Property](Property("Awesome apartments"))))
    override def label = "SiteBExtractor"
}

class SiteCExtractor extends BasePropertyExtractor{
    override def extractData = ExtractionResult(Left(new Exception("XML parsing failed")))
    override def label = "SiteCExtractor"
}

Define control signals to be sent to system’s components

case class ExtractionCommand(extractor:BasePropertyExtractor)
object StartParsing

StartExtraction is signal to start whole extraction process, while ExtractionCommand is signal to start concrete extractor

Data extraction actor:

class ExtractingActor extends Actor with akka.actor.ActorLogging {
    override def receive = {
        case ExtractionCommand(extractor:BasePropertyExtractor) =>
        println("Going to extract data by "+extractor)
        sender ! extractor.extractData
    }
}

Database writer actor(in our example it doesn’t write to a database actually, but just prints result to console:

class DbWriterActor extends Actor with akka.actor.ActorLogging {
    override def receive = {
        case p: Property  => println(p)
    }
}

And we’re going to define main actor incapsulating control logic and implementing ScatterGather design pattern(yeah, meet design patterns in actors field):

class ScatterGather extends Actor with akka.actor.ActorLogging {
    context.setReceiveTimeout(29 seconds)

    private val actorsCommands = Map(
        context.actorOf(Props[ExtractingActor]) -> ExtractionCommand(new SiteAExtractor),
        context.actorOf(Props[ExtractingActor]) -> ExtractionCommand(new SiteBExtractor),
        context.actorOf(Props[ExtractingActor]) -> ExtractionCommand(new SiteCExtractor)
    )

    private val dbWriterActor = context.actorOf(Props[DbWriterActor])

    override def receive = {
        case StartExtraction =>
            actorsCommands foreach {
                case (actor, command) => actor ! command
            }

        case result: ExtractionResult => result.value match{
            case Left(t:Throwable) => log.warning("Exception found instead of result: " + t)
            case Right(l:List[Property]) => l foreach {writer ! _}
        }

        case ReceiveTimeout =>
            context.stop(self)
            actorsCommands foreach {_ => context.stop(_)}
            context.stop(dbWriterActor)
    }
}

And launcher

object RealEstateExtractionLauncher extends App {
    import ExecutionContext.Implicits.global
    val system = ActorSystem("RealEstateExample")

    system.scheduler.schedule(0 seconds, 30 seconds){
        val listeningActor = system.actorOf(Props[ScatterGather])
        listeningActor ! StartParsing
    }
}

Complete Code & Output

Complete code is on Github : https://github.com/kushti/blog-examples/blob/master/scala/akka/DataExtractionSystem.scala

Running it you’ll get something like

Going to extract data by SiteCExtractor
Going to extract data by SiteBExtractor
Going to extract data by SiteAExtractor
Property(Awesome apartments)
Property(Nice beachside boongalow)
[WARN] [02/27/2013 13:22:40.331] [RealEstateExample-akka.actor.default-dispatcher-2] [akka://RealEstateExample/user/$a] Exception found instead of result: java.lang.Exception: XML parsing failed
Going to extract data by SiteAExtractor
Going to extract data by SiteCExtractor
Going to extract data by SiteBExtractor
Property(Nice beachside boongalow)
Property(Awesome apartments)
[WARN] [02/27/2013 13:23:10.297] [RealEstateExample-akka.actor.default-dispatcher-7] [akka://RealEstateExample/user/$b] Exception found instead of result: java.lang.Exception: XML parsing failed

Conclusion

In less than 100 lines of code we got fully scalable recurrent data extraction example with simple logging and error handling. And as Akka is a close friend of Play 2.x framework, to say more preciously, Play includes Akka, it’s easy to build Web Application on top of our system.

You can now start to play with remote actors to build distibuted system. Or implement real-world application starting with the example design provided. Or visit “Hire me” section.