
Using Scala to Perform a Multi-Key Get on a Couchbase View


Retrieving documents from Couchbase by anything other than the document key requires querying a view, and views are defined by map and reduce functions written in JavaScript. Consider a view that returns data like this:

key         value
---         -----
"key1"      { "term": "red", "count": 2 }
"key2"      { "term": "red", "count": 1 }
"key3"      { "term": "blue", "count": 4 }
...

And here’s the Scala case class to hold the documents retrieved from the view:

case class TermOccurrence(term: String, count: Int)

It’s a common scenario to retrieve multiple documents at once, and the Java driver has a pretty straightforward API for that. The desired keys are simply specified as a JSON array.

import com.couchbase.client.CouchbaseClient
import com.couchbase.client.protocol.views.Query
import play.api.libs.json.Json
import scala.collection.JavaConverters._
import CouchbaseExtensions._

@Log
def findByTerms(terms: List[String]): List[TermOccurrence] = {

  val keys = Json.stringify(Json.toJson(terms map (_.toLowerCase)))

  val view = client.getView("view1", "by_term")
  val query = new Query()
  query.setIncludeDocs(false)
  query.setKeys(keys)

  val response = client.query(view, query).asScala

  response.toList map (_.as[TermOccurrence])
}

The Java driver deals with strings, so it’s up to the client application to handle the JSON parsing. That was an excellent design decision and it makes using the Java driver from Scala pretty painless. I’m using the Play Framework JSON library and an extension method _.as[TermOccurrence] defined on ViewRow to simplify mapping the response documents to Scala objects.

import com.couchbase.client.protocol.views.ViewRow
import play.api.libs.json.{Format, Json}

object CouchbaseExtensions {

  implicit class RichViewRow(row: ViewRow) {
    def as[A](implicit format: Format[A]): A = {
      val document = row.getValue
      val modelJsValue = Json.parse(document)
      Json.fromJson[A](modelJsValue).get
    }
  }

}

In order for this extension method to work, it requires an implicit Format[TermOccurrence] which is defined on the TermOccurrence companion object.

object TermOccurrence {
  implicit val formatTermOccurrence = Json.format[TermOccurrence]
}

Hadoop MapReduce Join Optimization With a Bloom Filter


Doing a join in Hadoop with Java is painful. A one-liner in Pig Latin can easily explode into hundreds of lines of Java. However, the additional control in Java can yield significant performance gains and simplify complex logic that is difficult to express in Pig Latin.

In my case, the left side of the join contained about 100K records while the right side was closer to 1B. Emitting all join keys from the mapper means that all 1B records from the right side of the join are shuffled, sorted and sent to a reducer. The reducer then ends up discarding most of the join keys because they don’t match the left side.

Any best practices guide will tell you to push more work into the mapper. In the case of a join, that means dropping records in the mapper that will end up getting dropped by the reducer anyway. In order to do that, the mapper needs to know if a particular join key exists on the left hand side.

An easy way to accomplish this is to put the smaller dataset into the DistributedCache and then load all the join keys into a HashSet that the mapper can do a lookup against.

for (FileStatus status : fileSystem.globStatus(theSmallSideOfTheJoin)) {
    DistributedCache.addCacheFile(status.getPath().toUri(), job.getConfiguration());
}
@Override
protected void setup(Mapper.Context context) {
    buildJoinKeyHashMap(context);
}

@Override
protected void map(LongWritable key, Text value, Context context) {

    ...

    if (!joinKeys.contains(joinKey))
        return;

    ...

    context.write(outKey, outVal);
}
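The buildJoinKeyHashMap helper isn’t shown above, so here’s a minimal sketch of what it might look like, assuming the cached files contain one join key per line. The joinKeys field name and the file format are illustrative, not from the original job.

// Minimal sketch of buildJoinKeyHashMap. Assumes each cached file
// contains one join key per line; adjust the parsing to your format.
private Set<String> joinKeys = new HashSet<String>();

private void buildJoinKeyHashMap(Context context) {
    try {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        for (Path cacheFile : cacheFiles) {
            BufferedReader reader = new BufferedReader(new FileReader(cacheFile.toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null)
                    joinKeys.add(line.trim());
            } finally {
                reader.close();
            }
        }
    } catch (IOException e) {
        throw new RuntimeException("failed to load join keys from the distributed cache", e);
    }
}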

This totally works, but it consumes enough memory that I was occasionally getting java.lang.OutOfMemoryError: Java heap space from the mappers. Enter the Bloom filter.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. Elements can be added to the set, but not removed. The more elements that are added to the set, the larger the probability of false positives. -Wikipedia

I hadn’t heard of a Bloom filter before taking Algorithms: Design and Analysis, Part 1. If not for the course, I’m pretty sure I would have skimmed over the innocuous reference while poking around the Hadoop documentation. Fortunately, recent exposure made the term jump out at me and I quickly recognized it was exactly what I was looking for.

When I took the course, I thought the Bloom filter was an interesting idea that I wasn’t likely to use anytime soon because I haven’t needed one yet and I’ve been programming professionally for more than a few years. But you don’t know what you don’t know, right? It’s like thinking about buying a car you didn’t notice before and now seeing it everywhere.

Configuration

The documentation is thin, with little more than variable names to glean meaning from.

public BloomFilter(int vectorSize,
                   int nbHash,
                   int hashType)
  • vectorSize - The vector size of this filter.
  • nbHash - The number of hash function to consider.
  • hashType - type of the hashing function (see Hash).

I know what you’re thinking. What could be more helpful than The vector size of this filter as a description for vectorSize? Well, the basic idea is there’s a trade-off between space, speed and probability of a false positive. Here’s how I think about it:

  • vectorSize - The amount of memory used to store hash keys. Larger values are less likely to yield false positives. If the value is too large, you might as well use a HashSet.
  • nbHash - The number of times to hash the key. Larger numbers are less likely to yield false positives at the expense of additional computation effort. Expect diminishing returns on larger values.
  • hashType - type of the hashing function (see Hash). The Hash documentation was reasonable so I’m not going to add anything.

I used trial and error to figure out numbers that were good for my constraints.
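If you want a starting point before the trial and error, the textbook Bloom filter approximations relate the false positive rate p to the number of keys n, the bit vector size m and the number of hashes k. Here’s a small helper that computes them; this is back-of-the-envelope math on my part, not anything from the Hadoop API:

// Back-of-the-envelope sizing using the standard Bloom filter approximations:
//   m = -n * ln(p) / (ln 2)^2   bits needed for n keys at false positive rate p
//   k = (m / n) * ln 2          optimal number of hash functions
public class BloomFilterSizing {

    public static int vectorSize(int numberOfKeys, double falsePositiveRate) {
        return (int) Math.ceil(-numberOfKeys * Math.log(falsePositiveRate)
                / (Math.log(2) * Math.log(2)));
    }

    public static int nbHash(int numberOfKeys, double falsePositiveRate) {
        return (int) Math.round(vectorSize(numberOfKeys, falsePositiveRate)
                / (double) numberOfKeys * Math.log(2));
    }

    public static void main(String[] args) {
        // 500K keys at a 1% false positive rate: roughly 4.8M bits and 7 hashes
        System.out.println(vectorSize(500000, 0.01)); // 4792530
        System.out.println(nbHash(500000, 0.01));     // 7
    }
}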

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.lessThan;

import java.util.UUID;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;
import org.junit.BeforeClass;
import org.junit.Test;

public class BloomFilterTests {
    private static BloomFilter bloomFilter;
    private static String knownKey = newGuid();
    private static int numberOfKeys = 500000;

    @BeforeClass
    public static void before() {
        bloomFilter = new BloomFilter(numberOfKeys * 20, 8, Hash.MURMUR_HASH);
        bloomFilter.add(newKey(knownKey));

        for (int i = 0; i < numberOfKeys; i++)
            bloomFilter.add(newKey(newGuid()));
    }

    @Test
    public void should_contain_known_key() {
        assertThat(hasKey(knownKey), is(true));
    }

    @Test
    public void false_positive_probability_should_be_low() {

        int count = 0;
        for (int i = 0; i < numberOfKeys; i++)
            if (hasKey(newGuid()))
                count++;

        int onePercent = (int) (numberOfKeys * .01);

        assertThat(count, is(lessThan(onePercent)));
    }

    private static String newGuid() {
        return UUID.randomUUID().toString();
    }

    private static Key newKey(String key) {
        return new Key(key.getBytes());
    }

    private boolean hasKey(String key) {
        return bloomFilter.membershipTest(newKey(key));
    }
}

When you have your numbers worked out, simply swap out the HashSet with the BloomFilter and then blog about it.
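For completeness, here’s roughly what that swap looks like in the mapper. This is a sketch with a few assumptions: the filter was built ahead of time, serialized with its Writable write method, and shipped in the DistributedCache; extractJoinKey is a stand-in for whatever pulls the join key out of the input record.

// The no-arg constructor is fine here; vectorSize, nbHash and hashType
// are restored by readFields when the filter is deserialized.
private BloomFilter joinKeys = new BloomFilter();

@Override
protected void setup(Context context) throws IOException {
    // Deserialize the pre-built filter from the first file in the DistributedCache.
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    DataInputStream in = new DataInputStream(new FileInputStream(cacheFiles[0].toString()));
    try {
        joinKeys.readFields(in); // BloomFilter implements Writable
    } finally {
        in.close();
    }
}

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

    String joinKey = extractJoinKey(value); // stand-in for your record parsing

    // Drop records the reducer would discard anyway; a false positive just
    // means the occasional extra record slips through to the reducer.
    if (!joinKeys.membershipTest(new Key(joinKey.getBytes())))
        return;

    context.write(outKey, outVal);
}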

Extracting Root Domain From a Url


Given a url like http://www.google.com/?q=tld+uk, extract the root domain. In this case, it would be google.com. Sounds easy, right? Well it is, but http://www.google.com.br/ is also legit. Okay, so recursion to the rescue!

Not so fast…

.ac.uk
.co.uk
.gov.uk
.parliament.uk
.police.uk
...
.metro.tokyo.jp
.city.(cityname).(prefecturename).jp
...

http://en.wikipedia.org/wiki/.uk
http://en.wikipedia.org/wiki/.jp

Doh! Surely George Clinton is not happy about this and neither am I because it means I’m stuck doing a lookup against a list of arbitrary domains.

Using java.net.URI takes care of extracting the host, so the interesting part is parsing the host into a list of chunks that decrease in number of segments.

test("should split host into chunks of decreasing parts") {

  val chunks = splitIntoChunks("www.google.com.br")

  chunks.toList should equal(List("www.google.com.br", "google.com.br", "com.br", "br"))
}

An implementation of splitIntoChunks isn’t terribly exciting, but it’s probably a good interview question. How about an implementation that doesn’t mutate state? Sounds fun, but why make this more difficult? It’s not because I want to run this bit of code on multiple cores or distribute it across machines, but because it challenges me to change the way I think about simple problems so that solving more complicated problems using functional idioms feels more natural. After all, when something is painful, you should do it more often. You know, like push ups and deployments.

def splitIntoChunks(host: String): List[String] = {

  val parts = host.split('.').toList

  // scanRight passes (element, accumulator), so name the parameters accordingly
  parts.dropRight(1).scanRight(parts.last) { (part, acc) => s"$part.$acc" }
}

Scala ftw.

Appreciating Algorithm Complexity


Algorithms: Design and Analysis, Part 1 is a free online course from Stanford offered through Coursera.org. An assignment that stuck with me was implementing an algorithm for counting inversions. Counting inversions is a way to quantify the similarity of two ordered lists. The canonical example described in the lectures was comparing lists of favorite movies.

I banged out a naive implementation using nested loops that I could use for comparison to make sure I was implementing the real algorithm correctly.

private long countQuadratic(int[] a) {
    long count = 0;
    for (int i = 0; i < a.length; i++)
        for (int j = i; j < a.length; j++)
            if (a[i] > a[j])
                count++;
    return count;
}

The algorithm described in lecture is a divide and conquer algorithm based on a variation of merge sort and, as you might expect from its lineage, it runs in O(n log n) time.

private long countLinearithmic(int[] a) {

    // REDACTED

    long countLeft = countLinearithmic(left);
    long countRight = countLinearithmic(right);
    long countSplit = 0;

    int i = 0;
    int j = 0;
    for (int k = 0; k < a.length; k++) {

        // REDACTED

    }

    return countLeft + countRight + countSplit;
}

* The guts of the implementation are not shown per the Coursera honor code.

The assignment was to run this algorithm on a list of 100,000 integers and report the number of inversions. I ran both implementations as a sanity check.

@Test
public void should_count_inversions_in_text_file() {

    int[] a = FileUtil.parseInts("Inversions.txt");
    assertThat(a.length, is(100000));

    long start1 = System.currentTimeMillis();
    long count1 = countQuadratic(a); // O(n^2)
    long duration1 = System.currentTimeMillis() - start1;

    System.out.println("countQuadratic: " + duration1);

    long start2 = System.currentTimeMillis();
    long count2 = countLinearithmic(a); // O(n log n)
    long duration2 = System.currentTimeMillis() - start2;

    System.out.println("countLinearithmic: " + duration2);

    assertThat(count1, is(equalTo(count2)));
}

It’s one thing to crunch numbers on a calculator or plot graphs to gain an intuition for algorithm complexity, but it’s quite a bit more meaningful to wait 15 seconds while a 2.7GHz quad-core Intel Core i7 grinds through ~5 billion comparisons (n(n-1)/2 for n = 100,000) with the O(n^2) implementation, versus a mere 140 ms to zip through ~1.6 million comparisons using the O(n log n) implementation.

Yeah, science!

Configuring Decorators With Google Guice


You have a few options and each has its trade-offs. The one I find least annoying requires using a binding annotation. Since I’m stuck using annotations with Guice anyway, using one more to facilitate a decorator seems like an acceptable concession. Before I go on though, I have to take a moment. My beef isn’t about verbose configuration or annotations, it’s that once again the documentation gets it all wrong and sends the impressionable reader down a misguided path. Let’s take a look at this excerpt from the Guice documentation for binding annotations:

public class RealBillingService implements BillingService {

  @Inject
  public RealBillingService(@PayPal CreditCardProcessor processor, ...) {
    ...
  }

This bit of innocuous code encourages the reader to squander the power of dependency inversion and reduce it to a clunky tool that makes unit testing a little bit easier. That sounds harsh, so let’s start by discussing what Guice is and the problem it solves.

Guice and the like are referred to as IoC containers. That’s Inversion of Control. It’s a pretty general principle and when applied to object oriented programming, it manifests itself in the form of a technique called Dependency Inversion. In terms of the BillingService example, it means the code depends on a CreditCardProcessor abstraction rather than new‘ing something specific like a PayPalCreditCardProcessor. Perhaps depends is an overloaded term here. With or without the new keyword, there is a dependency. In one case, a higher level module is responsible for deciding what implementation to use, and in the other case, the class itself decides that it’s using a PayPalCreditCardProcessor, period.

Writing all your classes to declare their dependencies leaves you with the tedious task of building up complex object graphs before you can actually use your object. This is where Guice comes in. It’s a tool to tame the havoc wreaked by inverting your dependencies, and it’s inevitable when guided by a few principles like DRY (Don’t Repeat Yourself). If you don’t believe me, go ahead and see for yourself. Write some truly SOLID code and you’ll end up writing an IoC container in the process.

So now that we’ve covered what Guice is and the problem it solves, we are ready to talk about what’s wrong with @PayPal. Specifying the concrete class you expect with an annotation is pretty much the same as just declaring the dependency explicitly. Sure, you get a few points for using an interface and injecting the object, but it’s really just going through the motions while entirely missing the point. It would be like the Karate Kid going into auto detailing after learning wax-on, wax-off.

Abstractions create seams in your code. They’re how you add new behavior as the application evolves, and they’re the key to managing complexity. Since we’re looking at a billing example, let’s throw out a few requirements that could pop up. How about some protection against running the same transaction twice in a short time period? How about checking a blacklist of credit cards or customers? Or maybe you need a card number that always fails in a particular way so QA can test the sad path. Or maybe your company works with a few payment gateways and wants to choose the least cost option based on the charge amount or card type. In this little snippet of code, we’ve got two seams we can use to work in this behavior: the BillingService and the CreditCardProcessor.

Oh, wait a minute, we’re declaring that we need the PayPalCreditCardProcessor with that annotation, so now our code is rigid and we can’t inject additional behavior by wrapping it in a DoubleChargeCreditCardProcessor, open-closed style. That’s the ‘O’ in SOLID. So you’re probably thinking, why can’t you just change the annotation from @PayPal to @DoubleCharge? Let’s dive a little deeper into this example to find out:

public class DoubleChargeCreditCardProcessor implements CreditCardProcessor {

  @Inject
  public DoubleChargeCreditCardProcessor(CreditCardProcessor processor, ...) {
    ...
  }

I’m not going to rant about how extends is evil and that you’re better off with a decorator because I’ve already done that, and this article is about how to wire up a decorator with Guice. So the challenge here is how to configure the container to supply the correct credit card processor as the first dependency of our double charge processor which itself implements CreditCardProcessor. Looking at the Guice documentation, you would likely think the answer is to do this:

public class RealBillingService implements BillingService {

  @Inject
  public RealBillingService(@DoubleCharge CreditCardProcessor processor, ...) {
    ...
  }
public class DoubleChargeCreditCardProcessor implements CreditCardProcessor {

  @Inject
  public DoubleChargeCreditCardProcessor(@PayPal CreditCardProcessor processor, ...) {
    ...
  }

That’s wrong though. The CreditCardProcessor isn’t a thing, it’s a seam and it’s where you put additional behavior like preventing duplicate charges in a short time period. If you look at the decorator, you’ll notice that it has nothing to do with PayPal. That’s because it’s a business rule and shouldn’t be mixed with integration code. Our business rule code and the PayPal integration code will likely live in different packages and the CreditCardProcessor abstraction could get assembled differently for any number of reasons. Maybe your application supports multi-tenancy and each tenant can use a different payment gateway. We can’t reuse our double charge business rule if it’s hard-coded to wrap a PayPal processor, and that’s a problem.

While I don’t particularly like using annotations for this sort of thing, they’re not the root cause. As a mechanism, it works just fine and can help us accomplish our task. The problem is that the documentation is subtly wrong and encourages misuse of this feature. The better way to use binding annotations without undermining the point of injecting your dependencies is like so:

public class DoubleChargeCreditCardProcessor implements CreditCardProcessor {

  public static final String BASE = "DoubleChargeCreditCardProcessor.base";

  @Inject
  public DoubleChargeCreditCardProcessor(@Named(BASE) CreditCardProcessor processor, ...) {
    ...
  }
import com.google.inject.AbstractModule;
import com.google.inject.name.Names;

public class ConfigureCreditCardProcessor extends AbstractModule {

  @Override
  protected void configure() {

    bind(CreditCardProcessor.class).to(DoubleChargeCreditCardProcessor.class);

    bind(CreditCardProcessor.class)
      .annotatedWith(Names.named(DoubleChargeCreditCardProcessor.BASE))
      .to(PayPalCreditCardProcessor.class);
  }
}

The difference is subtle, but the devil is in the details. In this last example, the DoubleChargeCreditCardProcessor doesn’t know or care what implementation it’s decorating. It simply declares a name for its dependency so it can be referenced unambiguously in a configuration module. This moves the configuration logic to… well, configuration code. Now you can see that the code is once again flexible and you can easily imagine more sophisticated configuration logic that could consider tenant settings or environment variables in selecting the proper combination of credit card processors to assemble.
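And since each decorator names its own dependency, stacking more behavior is just more configuration. Here’s a sketch that layers in a hypothetical BlacklistCreditCardProcessor, written the same way as the double charge processor with its own BASE name:

import com.google.inject.AbstractModule;
import com.google.inject.name.Names;

// BlacklistCreditCardProcessor is hypothetical; it declares its own
// @Named(BASE) inner processor, just like DoubleChargeCreditCardProcessor.
public class ConfigureCreditCardProcessor extends AbstractModule {

  @Override
  protected void configure() {

    bind(CreditCardProcessor.class).to(DoubleChargeCreditCardProcessor.class);

    bind(CreditCardProcessor.class)
      .annotatedWith(Names.named(DoubleChargeCreditCardProcessor.BASE))
      .to(BlacklistCreditCardProcessor.class);

    bind(CreditCardProcessor.class)
      .annotatedWith(Names.named(BlacklistCreditCardProcessor.BASE))
      .to(PayPalCreditCardProcessor.class);
  }
}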