Mike Valenty

Scala Solution to Finding Happy Numbers

2015-05-03T10:45:00-07:00

What is a Happy Number?

Starting with any positive integer, replace the number by the sum of the squares of its digits, and repeat the process until the number equals 1 (where it will stay), or it loops endlessly in a cycle which does not include 1.

http://en.wikipedia.org/wiki/Happy_number

@RunWith(classOf[JUnitRunner])
class HappyTests extends FunSuite with ShouldMatchers {

  def happy(x: Int): Boolean = {

    def sumOfSquares(x: Int): Int = {
      def digits(x: Int): Seq[Int] = if (x < 10) Seq(x) else digits(x / 10) :+ x % 10
      def squared(x: Int): Int = x * x
      digits(x).map(squared).sum
    }

    def loop(seen: Set[Int], x: Int): Boolean = {
      x match {
        case 1 => true
        case _ if seen.contains(x) => false
        case _ => loop(seen + x, sumOfSquares(x))
      }
    }

    loop(Set(), x)
  }

  test("happy numbers") {

    val actual = 1 to 50 filter happy

    val expected = Seq(1, 7, 10, 13, 19, 23, 28, 31, 32, 44, 49)

    actual should equal(expected)
  }
}

Scala Solution to Cheryl's Birthday Problem

2015-04-19T14:54:00-07:00

I ran across a Java 8 solution to Cheryl’s birthday problem and just had to write a Scala version since this is one of those problems the language was made for. Also, I’ve been on the Java train for a few months and this was a good excuse to brush off the dust and work on my Scala chops.

Cheryl’s Birthday Problem

Albert and Bernard just become friends with Cheryl, and they want to know when her birthday is. Cheryl gives them a list of 10 possible dates.

May 15, May 16, May 19
June 17, June 18
July 14, July 16
August 14, August 15, August 17

Cheryl then tells Albert and Bernard separately the month and the day of her birthday respectively.

Albert: I don’t know when Cheryl’s birthday is, but I know that Bernard does not know too.

Bernard: At first I don’t know when Cheryl’s birthday is, but I know now.

Albert: Then I also know when Cheryl’s birthday is.

So when is Cheryl’s birthday?

@RunWith(classOf[JUnitRunner])
class BirthdayProblem extends FunSuite with ShouldMatchers {

  case class Birthday(month: Int, day: Int)

  object Birthday {
    def apply(s: String): Birthday = s.split("-").map(_.toInt) match {
      case Array(m, d) => Birthday(m, d)
    }
  }

  val birthdays: List[Birthday] = List(
    "5-15", "5-16", "5-19",
    "6-17", "6-18",
    "7-14", "7-16",
    "8-14", "8-15", "8-17"
  ).map(Birthday.apply)

  def uniqueBy[A, K](fn: (A) => K)(b: List[A]): List[A] = {
    def groupSizeIs(size: Int)(x: (_, List[_])) = x._2.size == size
    b.groupBy(fn).filter(groupSizeIs(1)).flatMap(_._2).toList
  }

  val uniqueByDay: (List[Birthday]) => List[Birthday] = uniqueBy(_.day)

  val uniqueByMonth: (List[Birthday]) => List[Birthday] = uniqueBy(_.month)

  // Albert tells us that Bernard doesn't know the answer, so we
  // know the answer must be in a month that does not contain a
  // birthday with a unique day.
  val clue1: List[Birthday] = {
    val monthsWithUniqueDay: Set[Int] = uniqueByDay(birthdays).map(_.month).toSet
    birthdays.filterNot(b => monthsWithUniqueDay.contains(b.month))
  }

  // Since Bernard now knows the answer, that tells us that the
  // day must be unique among the remaining birthdays.
  val clue2: List[Birthday] = uniqueByDay(clue1)

  // Since Albert now knows the answer, that tells us the answer
  // has to be unique by month.
  val clue3: List[Birthday] = uniqueByMonth(clue2)

  // The only remaining birthday is the answer.
  val answer = clue3.head

  test("Cheryl's birthday") {

    answer should equal(Birthday("xx-xx")) // REDACTED

  }
}

Yay.

Unit Testing Scalding Jobs

2014-03-15T15:27:00-07:00

Scalding is a powerful framework for writing complex data processing applications on Apache Hadoop. It’s concise and expressive - almost to a fault. It’s dangerously easy to pack gobs of subtle business logic into just a few lines of code. If you’re writing real data processing applications and not just ad-hoc reports, unit testing is a must. However tests can get unwieldy to manage as job complexity grows and the arity of data increases.

For example, consider this scalding job:

class ComplicatedJob(args: Args) extends Job(args) {

  val bloatedTsv = Tsv("input",
    ('user_id,
      'timestamp,
      'host,
      'referer,
      'remote_addr,
      'user_agent,
      'cookie,
      'connection,
      'accept_encoding,
      'accept_language)).read

  bloatedTsv
    .map('timestamp -> 'timestamp) { ts: String => toDateTime(ts) }
    .filter('timestamp) { ts: DateTime => ts.isAfter(DateTime.now.minusDays(30)) }
    .map('user_agent -> 'browser) { userAgent: String => toBrowser(userAgent) }
    .map('remote_addr -> 'country) { ip: String => toCountry(ip) }
    .map('country -> 'country) { c: String => if (c == "us") c else "other" }
    .groupBy('browser, 'country) { _.size('count) }
    .groupBy('browser) { _.pivot(('country, 'count) ->('us, 'other)) }
    .write(Tsv("output"))

  def toDateTime(ts: String): DateTime = { ... }

  ...
}

Testing this job end-to-end would be fragile because there is so much going on and it would be tedious and noisy to build fake data to isolate and highlight edge cases. The pivot operations on lines 20-22 only deal with browser and country yet test data with all 10 fields is required including valid timestamps and user agents just to get to the pivot logic.

There are a few ways to tackle this and an approach I like is to use extension methods to breakdown the logic into smaller chunks of testable code. The result might look something like this.

class ComplicatedJob(args: Args) extends Job(args) {

  ...

  bloatedTsv
    .timestampIsAfter(DateTime.now.minusDays(30))
    .userAgentToBrowser()
    .remoteAddrToCountry()
    .countCountryByBrowser()
    .write(Tsv("output"))
}

Each block of code depends on only a few fields so it doesn’t require mocking the entire input set.

import Dsl._

object ComplicatedJob {

  implicit class ComplicatedJobRichPipe(pipe: Pipe) {

    // this chunk of code is testable in isolation
    def countCountryByBrowser(): Pipe = {
      pipe
        .map('country -> 'country) { c: String => if (c == "us") c else "other" }
        .groupBy('browser, 'country) { _.size('count) }
        .groupBy('browser) { _.pivot(('country, 'count) ->('us, 'other)) }
    }

    ...
  }

}

In this example only browser and country are required so setting up test data is reasonably painless and the intent of the test case isn’t lost in a sea of tuples. Granted, this approach requires creating a helper job to set up the input and capture the output for test assertions, but I think it’s a worthwhile trade off to reveal such a clear test case.

import ComplicatedJob._
import ComplicatedJobTests._

@RunWith(classOf[JUnitRunner])
class ComplicatedJobTests extends FunSuite with ShouldMatchers {

  test("should count and pivot rows into columns") {

    val input = List[InputTuple](
      ("firefox", "us"),
      ("chrome", "us"),
      ("safari", "us"),
      ("firefox", "us"),
      ("firefox", "br"),
      ("chrome", "de")
    )

    val expected = Set[OutputTuple](
      ("firefox", 2, 1),
      ("safari", 1, 0),
      ("chrome", 1, 1)
    )

    count(input) { _.toSet should equal(expected) }
  }

}

object ComplicatedJobTests {
  type InputTuple = (String, String)
  type OutputTuple = (String, Int, Int)

  // this is a helper job to set up the inputs and outputs
  // for the chunk of code we're trying to test
  class CountCountryByBrowser(args: Args) extends Job(args) {

    Tsv("input", ('browser, 'country))
      .read
      .countCountryByBrowser() // this is what we're testing
      .project('browser, 'us, 'other)
      .write(Tsv("output"))

  }

  // helper method to run our test job
  def count(input: List[InputTuple])(fn: List[OutputTuple] => Unit) {
    import Dsl._
    JobTest[CountCountryByBrowser]
      .source(Tsv("input", ('browser, 'country)), input)
      .sink[OutputTuple](Tsv("output")) { b => fn(b.toList) }
      .run
      .finish
  }
}

Parsing Json With Defaults in Scala

2014-01-24T22:57:00-08:00

The json library in Play does this thing where it explodes if the json you’re reading is missing a property. Gson would happily leave the property null, but that’s just not the scala way. In scala, if something is optional, it’s wrapped in an Option. The problem is I’d like to add a property to my data model and I don’t want it to be optional because it’s not.

In Java land, I probably would have added the property and thought about how to deal with migrating old data later, but not in scala. This battle has to be fought right now. That’s what type safe means and it’s why the NullPointerException has all but died in pure scala apps, not unlike the carpal tunnel epidemic.

In my case I’ve got some data in Couchbase and when I add a new property to my data model, I won’t be able to read the old data. What I need to do is transform the json before hydrating the object. Fortunately, this is a snap by using the rich transformation features described in JSON Coast-to-Coast.

My approach was to create a subclass of Reads[A] called ReadsWithDefaults[A] that reads json and uses a transformation to merge default values. It looks like this:

case class MyObject(color: String, shape: String)

object MyObject {

    val defaults = MyObject("blue", "square")

    implicit val readsMyObject = new ReadsWithDefaults(defaults)(Json.format[MyObject])

    implicit val writesMyObject = new Json.writes[MyObject]
}

import play.api.libs.json._
import play.api.libs.json.Reads._

class ReadsWithDefaults[A](defaults: A)(implicit format: Format[A]) extends Reads[A] {

  private val mergeWithDefaults = {
    val defaultJson = Json.fromJson[JsObject](Json.toJson(defaults)).get
    __.json.update(of[JsObject] map {o => defaultJson ++ o})
  }

  def reads(json: JsValue) = {
    val jsObject = json.transform(mergeWithDefaults).get
    Json.fromJson[A](jsObject)(format)
  }
}

Web Framework Scaffolding Considered Harmful

2014-01-02T09:59:00-08:00

If you’re a start up and spending time building back office tools to input application data into web forms, you’re doing it wrong. Sure, it’s a pretty simple task to build a few CRUD screens using the hottest new MVC framework, but you can do better.

Integrating with Google spreadsheets to manage application data unlocks powerful possibilities and will take you further than anything you could scaffold up in a comperable amount of time. To be clear, I’m not suggesting you take a runtime dependency on Google docs. I’m talking about using a lightweight business process that takes advantage of the rich collaborations features of Google docs and simply import the data into your database using the API.

Revision history

Google docs keeps a revision history of what was changed, when and by whom. That’s a pretty cool feature and one you’re not likely to build when there are more pressing customer facing features to work on.

Sometimes release notes aren’t detailed enough when trying to correlate a surprising change in a KPI, so it’s reassuring to have the nitty gritty revision history there when you need to dig a little deeper. It’s just like reviewing source control history when tracking down a bug that was introduced in the latest code release. Your data deserves the same respect.

Concurrent editing

My older brother carries a pocket knife at all times, and when he believes in something, he can’t help but sell everyone else on the idea. He used to tell me I needed to carry a pocket knife so I could cut off my seatbelt when trapped in a burning (or sinking) car. In his defense, GM recalled 240,000 SUVs in 2010 citing “…[the seat belt] may seem to be jammed.”

Naturally, I started carrying a pocket knife too. I figured that along with untrapping myself from a burning car or defending myself at an ATM, it would be handy when I needed to open a box. To my surprise, new opportunities presented themselves and I was using my knife several times a day. I could cut off a hanging thread, scrape bee pollen off my windsheild, improve the vent in my coffee lid, remove a scratchy tag from my son’s shirt and yes, open a box.

It turns out I had a similar experience with concurrent editing. Working on the same document at the same time with multiple people is pretty cool, and you’d be surprised how this powerful ability can change the way you work. There’s no way you would justify an extravagant feature like this in your own back office tools and you get it for free with Google docs.

Formulas

Since it’s a spreadsheet, you can use formulas and this is cooler than you might think. Consider a list of products. Chances are the product prices have some kind of relationship that you can capture with a formula and save yourself from tedious and error prone manual entry.

In reality though, you probably don’t care about the actual product prices as much as you care about your margins or some other derived number. With a spreadsheet, you can keep your formula together with your data. When you change a price, you can see how it affects your bottom line in a calculated field a few cells over. Or work backwards and calculate the price based on a combination of other factors. Or just use a forumla as a sanity check that all your numbers add up to the right amount.

Again, this is the kind of feature that invites new ways of working. You might use a formula to find a bug before the data ever hits your app or add a new forumla so you don’t get burned twice by the same mistake. A spreadsheet gives you a place to capture knowledge about your data. You could capture this knowledge in a wiki, but the fact that it’s in a Google spreadsheet that’s connected to your app, makes it real. It’s the difference between a code comment and a unit test.

Import

A data update is a deployment just like a code deployment and things can go horribly wrong when there’s a mistake in your data. Big companies get this and respond with change advisory boards and more process to protect themselves, but you can do better.

Your data deserves a repeatable one-click deployment process and you shouldn’t settle for anything less. An import is idempotent which means you can run it multiple times and the end result is always the same. In other words, you can practice in a test environment and iterate until you get it right. Oh, and if you spin off a copy prior to making changes, you get one-click rollback too.

Conclusion

One-click deployment and rollback with revision history for your data without developer interaction. That’s a really big deal because the end result is confidence. Confidence means that as an organization, you can make decisions, execute and not look back.

Empower your team (not just developers) with a familiar tool like a spreadsheet and give them the opportunity to impress you. Watch as calculations and charts emerge to add deeper meaning to your data. Watch as collaboration occurs and team members are brought together rather than divided by functional roles. Watch as confidence increases in your ability to deploy changes.

Seriously. Don’t waste your time half-assing back office tools when you can invest a similar amount of effort to import data from Google spreadsheets. It’s a scrappy tool that delivers on the features that accelerate the way you work.

If you can do a half-assed job of anything, you’re a one-eyed man in a kingdom of the blind. – Kurt Vonnegut

Using Scala to Perform a Multi-Key Get on a Couchbase View

2013-11-16T15:26:00-08:00

To retrieve documents from Couchbase by anything other than the document key requires querying a view and views are defined by map and reduce functions written in JavaScript. Consider a view that returns data like this:

key         value
---         -----
"key1"      { "term": "red", "count": 2 }
"key2"      { "term": "red", "count": 1 }
"key3"      { "term": "blue", "count": 4 }
...

And this Scala case class to hold the documents retrieved from the view.

case class TermOccurrence(term: String, count: Int)

It’s a common scenario to retrieve multiple documents at once and the Java driver has a pretty straight forward api for that. The desired keys are simply specified as a json array.

import com.couchbase.client.CouchbaseClient
import play.api.libs.json.Json
import CouchbaseExtensions._

@Log
def findByTerms(terms: List[String]): List[TermOccurrence] = {

  val keys = Json.stringify(Json.toJson(terms map (_.toLowerCase)))

  val view = client.getView("view1", "by_term")
  val query = new Query()
  query.setIncludeDocs(false)
  query.setKeys(keys)

  val response = client.query(view, query).asScala

  response.toList map (_.as[TermOccurrence])
}

The Java driver deals with strings so it’s up to the client application to handle the json parsing. That was an excellent design decision and makes using the Java driver from Scala pretty painless. I’m using the Play Framework json libraries and an extension method _.as[TermOccurrence] defined on ViewRow to simplify the mapping of the response documents to Scala objects.

object CouchbaseExtensions {

  implicit class RichViewRow(row: ViewRow) {
    def as[A](implicit format: Format[A]): A = {
      val document = row.getValue
      val modelJsValue = Json.parse(document)
      Json.fromJson[A](modelJsValue).get
    }
  }

}

In order for this extension method to work, it requires an implicit Format[TermOccurrence] which is defined on of the TermOccurrence compainion object.

object TermOccurrence {
  implicit val formatTermOccurrence = Json.format[TermOccurrence]
}

Hadoop MapReduce Join Optimization With a Bloom Filter

2013-10-15T22:49:00-07:00

Doing a join in hadoop with Java is painful. A one-liner in Pig Latin can easily explode into hundreds of lines of Java. However, the additional control in Java can yield significant performance gains and simplify complex logic that is difficult to express in Pig Latin.

In my case, the left side of the join contained about 100K records while the right side was closer to 1B. Emitting all join keys from the mapper means that all 1B records from the right side of the join are shuffled, sorted and sent to a reducer. The reducer then ends up discarding most of join keys that don’t match the left side.

Any best practices guide will tell you to push more work into the mapper. In the case of a join, that means dropping records in the mapper that will end up getting dropped by the reducer anyway. In order to do that, the mapper needs to know if a particular join key exists on the left hand side.

An easy way to accomplish this is to put the smaller dataset into the DistributedCache and then load all the join keys into a HashSet that the mapper can do a lookup against.

for (FileStatus status : fileSystem.globStatus(theSmallSideOfTheJoin)) {
    DistributedCache.addCacheFile(status.getPath().toUri(), job.getConfiguration());
}

@Override
protected void setup(Mapper.Context context) {
    buildJoinKeyHashMap(context);
}

@Override
protected void map(LongWritable key, Text value, Context context) {

  ...

    if (!joinKeys.contains(joinKey))
      return;

    ...

    context.write(outKey, outVal);
}

This totally works, but consumes enough memory that I was occassionally getting java.lang.OutOfMemoryError: Java heap space from the mappers. Enter the Bloom filter.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. Elements can be added to the set, but not removed. The more elements that are added to the set, the larger the probability of false positives. -Wikipedia

I hadn’t heard of a Bloom filter before taking Algorithms: Design and Analysis, Part 1. If not for the course, I’m pretty sure I would have skimmed over the innocuous reference while pilfering around the hadoop documentation. Fortunately, recent exposure made the term jump out at me and I quickly recognized it was exactly what I was looking for.

When I took the course, I thought the Bloom filter was an interesting idea that I wasn’t likely to use anytime soon because I haven’t needed one yet and I’ve been programming professionally for more than a few years. But you don’t know what you don’t know, right? It’s like thinking about buying a car you didn’t notice before and now seeing it everywhere.

Configuration

The documentation is thin, with little more than variable names to glean meaning from.

public BloomFilter(int vectorSize,
                   int nbHash,
                   int hashType)

vectorSize - The vector size of this filter.
nbHash - The number of hash function to consider.
hashType - type of the hashing function (see Hash).

I know what you’re thinking. What could be more helpful than The vector size of this filter as a description for vectorSize? Well, the basic idea is there’s a trade-off between space, speed and probability of a false positive. Here’s how I think about it:

vectorSize - The amount of memory used to store hash keys. Larger values are less likey to yield false positives. If the value is too large, you might as well use a HashSet.
nbHash - The number of times to hash the key. Larger numbers are less likely to yeild false positives at the expense of additional computation effort. Expect deminishing returns on larger values.
hashType - type of the hashing function (see Hash). The Hash documentation was reasonable so I’m not going to add anything.

I used trial and error to figure out numbers that were good for my constraints.

public class BloomFilterTests {
    private static BloomFilter bloomFilter;
    private static String knownKey = newGuid();
    private static int numberOfKeys = 500000;

    @BeforeClass
    public static void before() {
        bloomFilter = new BloomFilter(numberOfKeys * 20, 8, Hash.MURMUR_HASH);
        bloomFilter.add(newKey(knownKey));

        for (int i = 0; i < numberOfKeys; i++)
            bloomFilter.add(newKey(newGuid()));
    }

    @Test
    public void should_contain_known_key() {
        assertThat(hasKey(knownKey), is(true));
    }

    @Test
    public void false_positive_probability_should_be_low() {

        int count = 0;
        for (int i = 0; i < numberOfKeys; i++)
            if (hasKey(newGuid()))
                count++;

        int onePercent = (int) (numberOfKeys * .01);

        assertThat(count, is(lessThan(onePercent)));
    }

    private static String newGuid() {
        return UUID.randomUUID().toString();
    }

    private static Key newKey(String key) {
        return new Key(key.getBytes());
    }

    private boolean hasKey(String key) {
        return bloomFilter.membershipTest(newKey(key));
    }
}

When you have your numbers worked out, simply swap out the HashMap with the BloomFilter and then blog about it.

Extracting Root Domain From a Url

2013-10-06T17:05:00-07:00

Given a url like http://www.google.com/?q=tld+uk, extract the root domain. In this case, it would be google.com. Sounds easy, right? Well it is, but http://www.google.com.br/ is also legit. Okay, so recursion to the rescue!

Not so fast…

.ac.uk
.co.uk
.gov.uk
.parliament.uk
.police.uk
...
.metro.tokyo.jp
.city.(cityname).(prefecturename).jp
...

http://en.wikipedia.org/wiki/.uk
http://en.wikipedia.org/wiki/.jp

Doh! Surely George Clinton is not happy about this and neither am I because it means I’m stuck doing a lookup against a list of arbitrary domains.

Using java.net.URI takes care of extracting the host, so the interesting part is parsing the host into a list of chunks that decrease in number of segments.

test("should split host into chunks of decreasing parts") {

  val chunks = splitIntoChunks("www.google.com.br")

  chunks.toList should equal(List("www.google.com.br", "google.com.br", "com.br", "br"))
}

An implementation of splitIntoChunks isn’t terribly exciting, but probably a good interview question. How about an implementation that doesn’t mutate state? Sounds fun, but why make this more difficult? It’s not because I want to run this bit of code on mutliple cores or distributed across machines, but because it challenges me to change the way I think about simple problems so that solving more complicated problems using functional idioms feels more natural. After all, when something is painful, you should do it more often. You know, like push ups and deployments.

def splitIntoChunks(host: String): List[String] = {

  val parts = host.split('.').toList

  parts.dropRight(1).scanRight(parts.last) {(acc, p) => s"$acc.$p"}
}

Scala ftw.

Appreciating Algorithm Complexity

2013-10-05T12:52:00-07:00

Algorithms: Design and Analysis, Part 1 is a free online course from Standford offered through Coursera.org. An assignment that stuck with me was implementing an algorithm for counting inversions. Counting inversions is a way to quantify similarity of two ordered lists. The canonical example described in the lectures was comparing lists of favorite movies.

I banged out a naive implementation using nested loops that I could use for comparison to make sure I was implementing the real algorithm correctly.

private long countQuadratic(int[] a) {
    long count = 0;
    for (int i = 0; i < a.length; i++)
        for (int j = i; j < a.length; j++)
            if (a[i] > a[j])
                count++;
    return count;
}

The algorithm described in lecture is a divide and conquer algorithm based on a variation of merge sort, and as you might expect from its lineage, runs in O(nlog(n)) time.

private long countLinearithmic(int[] a) {

    // REDACTED

    long countLeft = countLinearithmic(left);
    long countRight = countLinearithmic(right);
    long countSplit = 0;

    int i = 0;
    int j = 0;
    for (int k = 0; k < a.length; k++) {

        // REDACTED

    }

    return countLeft + countRight + countSplit;
}

* The guts of the implementation are not shown per the Coursera honor code.

The assignment was to run this algorithm on a list of 100,000 integers and report the number of inversions. I ran both implementations as a sanity check.

@Test
public void should_count_inversions_in_text_file() {

    int[] a = FileUtil.parseInts("Inversions.txt");
    assertThat(a.length, is(100000));

    long start1 = System.currentTimeMillis();
    long count1 = countQuadratic(a); // O(n^2)
    long duration1 = start1 - System.currentTimeMillis();

    println("countQuadratic: " + duration1);

    long start2 = System.currentTimeMillis();
    long count2 = countLinearithmic(a); // O(nlog(n))
    long duration2 = start2 - System.currentTimeMillis();

    println("countLinearithmic: " + duration2);

    assertThat(count1, is(equalTo(count2)));
}

It’s one thing to crunch numbers on a calulator or plot graphs to gain an intuition for algorithm complexity, but it’s quite a bit more meaningful to wait for 15 seconds while a 2.7GHz quad-core Intel Core i7 grinds through ~5 Billion comparisions O(n^2) versus a mere 140 ms to zip through 1.6 Million comparisons using an O(nlog(n)) implementation.

Yeah, science!

Configuring Decorators With Google Guice

2012-02-20T07:11:00-08:00

You have a few options and each have their trade-offs. The one I find least annoying requires using a binding annotation. Since I’m stuck using annotations with Guice anyway, using one more to facilitate a decorator seems like an acceptable concession. Before I go on though, I have to take a moment. My beef isn’t about verbose configuration or annotations, it’s that once again the documentation gets it all wrong and sends the impressionable reader down a misguided path. Let’s take a look at this excerpt from the Guice documentation for binding annotations:

public class RealBillingService implements BillingService {

  @Inject
  public RealBillingService(@PayPal CreditCardProcessor processor, ...) {
    ...
  }

This bit of innocuous code encourages the reader to squander the power of dependency inversion and reduce it to a clunky tool that makes unit testing a little bit easier. That sounds harsh, so let’s start by discussing what Guice is and the problem it solves.

Guice and the like are referred to as IoC containers. That’s Inversion of Control. It’s a pretty general principle and when applied to object oriented programming, it manifests itself in the form of a technique called Dependency Inversion. In terms of the BillingService example, it means the code depends on a CreditCardProcessor abstraction rather than new‘ing something specific like a PayPalCreditCardProcessor. Perhaps depends is an overloaded term here. With or without the new keyword, there is a dependency. In one case, a higher level module is responsible for deciding what implementation to use, and in the other case, the class itself decides that it’s using a PayPalCreditCardProcessor, period.

Writing all your classes to declare their dependencies leaves you with the tedious task of building up complex object graphs before you can actually use your object. This is where Guice comes in. It’s a tool to simplify the havoc wreaked by inverting your dependencies and it’s inevitable when guided by a few principles like DRY (Don’t Repeat Yourself). If you don’t believe me, go ahead a see for yourself. Write some truly SOLID code and you’ll end up writing an IoC container in the process.

So now that we’ve covered what Guice is and the problem it solves, we are ready to talk about what’s wrong with @PayPal. Specifying the concrete class you expect with an annotation is pretty much the same as just declaring the dependency explicitly. Sure, you get a few points for using an interface and injecting the object, but it’s really just going through the motions while entirely missing the point. It would be like the Karate Kid going into auto detailing after learning wax-on, wax-off.

Abstractions create seams in your code. It’s how you add new behavior as the application evolves and it’s the key to managing complexity. Since we’re looking at a billing example, let’s throw out a few requirements that could pop up. How about some protection against running the same transaction twice in a short time period. How about checking a blacklist of credit cards or customers. Or maybe you need a card number that always fails in a particular way so QA can test the sad path. Or maybe your company works with a few payment gateways and wants to choose the least cost option based on the charge amount or card type. In this little snippet of code, we’ve got 2 seams we can use to work in this behavior. We’ve got the BillingService and CreditCardProcesor.

Oh, wait a minute we’re declaring that we need the PayPalCreditCardProcessor with that annotation so now our code is rigid and we can’t inject additional behavior by wrapping it in a DoubleChargeCreditCardProcessor, open-closed style. That’s the ‘O’ in SOLID. So you’re probably thinking, why can’t you just change the annotation from @PayPal to @DoubleCharge? Let’s dive a little deeper into this example to find out:

public class DoubleChargeCreditCardProcessor implements CreditCardProcessor {

  @Inject
  public DoubleChargeCreditCardProcessor(CreditCardProcessor processor, ...) {
    ...
  }

I’m not going to rant about how extends is evil and that you’re better off with a decorator because I’ve already done that, and this article is about how to wire up a decorator with Guice. So the challenge here is how to configure the container to supply the correct credit card processor as the first dependency of our double charge processor which itself implements CreditCardProcessor. Looking at the Guice documentation, you would likely think the answer is to do this:

public class RealBillingService implements BillingService {

  @Inject
  public RealBillingService(@DoubleCharge CreditCardProcessor processor, ...) {
    ...
  }

public class DoubleChargeCreditCardProcessor implements CreditCardProcessor {

  @Inject
  public DoubleChargeCreditCardProcessor(@PayPal CreditCardProcessor processor, ...) {
    ...
  }

That’s wrong though. The CreditCardProcessor isn’t a thing, it’s a seam and it’s where you put additional behavior like preventing duplicate charges in a short time period. If you look at the decorator, you’ll notice that it has nothing to do with PayPal. That’s because it’s a business rule and shouldn’t be mixed with integration code. Our business rule code and the PayPal integration code will likely live in different packages and the CreditCardProcessor abstraction could get assembled differently for any number of reasons. Maybe your application supports multi-tenancy and each tenant can use a different payment gateway. We can’t reuse our double charge business rule if it’s hard-coded to wrap a PayPal processor, and that’s a problem.

While I don’t particularly like using annotations for this sort of thing, it’s not the root cause. As a mechanic, it works just fine and can help us accomplish our task. The problem is that the documentation is subtly wrong and encourages mis-use of this feature. The better way to use binding annotations and not undermine the point of injecting your dependencies is like so:

public class DoubleChargeCreditCardProcessor implements CreditCardProcessor {

  public static final String BASE = "DoubleChargeCreditCardProcessor.base";

  public DoubleChargeCreditCardProcessor(@Named(BASE) CreditCardProcessor processor, ...) {
    ...
  }

public class ConfigureCreditCardProcessor extends AbstractModule {

  @Override
  protected void configure() {

    bind(CreditCardProcessor.class).to(DoubleChargeCreditCardProcessor.class);

    bind(CreditCardProcessor.class)
      .annotatedWith(Names.named(DoubleChargeCreditCardProcessor.BASE))
      .to(PayPayCreditCardProcessor.class);
  }
}

The difference is subtle, but the devil is in the details. In this last example, the DoubleChargeCreditCardProcessor doesn’t know or care what implementation it’s decorating. It simply declares a name for it’s dependency so it can be referenced unambiguously in a configuration module. This moves the configuration logic to… well, configuration code. Now you can see that the code is once again flexible and you can easily imagine more sophisticated configuration logic that could consider tenant settings or environment variables in selecting the proper combination of credit card processors to assemble.

Bolt-on Multi-Tenancy in ASP.NET MVC With Unity and NHibernate: Part II – Commingled Data

2011-06-18T11:08:00-07:00

Last time I went over going from separate web applications per tenant to a shared web application for all tenants, but each tenant still had its own database. Now we’re going to take the next step and let multiple tenants share the same database. After we add tenant_id to most of the tables in our database we’ll need the application to take care of a few things. First, we need to apply a where clause to all queries to ensure that each tenant sees only their data. This is pretty painless with NHibernate, we just have to define a parameterized filter:

 xmlns="urn:nhibernate-mapping-2.2">

   name="tenant">
     name="id" type="System.Int32" />
  

And then apply it to each entity:

 name="User" table="[user]">

   name="Id" column="user_id">
     class="identity" />
  

   name="Username" />
   name="Email" />

   name="tenant" condition="tenant_id = :id" />

The last step is to set the value of the filter at runtime. This is done on the ISession like this:

Container
    .RegisterType<ISession>(
        new PerRequestLifetimeManager(),
        new InjectionFactory(c =>
        {
            var session = c.Resolve<ISessionFactory>().OpenSession();
            session.EnableFilter("tenant").SetParameter("id", c.Resolve<Tenant>().Id);
            return session;
        })
    );

The current tenant comes from c.Resolve(). In order for that to work, you have to tell Unity how to find the current tenant. In ASP.NET MVC, we can look at the host header on the request and find our tenant that way. We could just as easily use another strategy though. Maybe if this were a WCF service, we could use an authentication header to establish the current tenant context. You could build out some interfaces and strategies around establishing the current tenant context, however for this article I’ll just bang it out.

Container
    .RegisterType<Tenant>(new InjectionFactory(c =>
    {
        var repository = c.Resolve<ITenantRepository>();

        var context = c.Resolve<HttpContextBase>();

        var host = context.Request.Headers["Host"] ?? context.Request.Url.Host;

        return repository.FindByHost(host);
    }));

Second, we have to set the tenant_id when new entities are saved. This is a bit more complicated with NHibernate and requires a bit of a concession in that we have to add a field to the entity in order for NHibernate to know how to persist the value. I’m using a private nullable int for this.

public class User
{
    private int? tenantId;

    public virtual int Id { get; set; }

    public virtual string Username { get; set; }

    public virtual string Email { get; set; }
}

It’s private because I don’t want the business logic to deal with it and it’s nullable because my tenant table is in a separate database which means I can’t lean on the data model to enforce referential integrity. That’s a problem because the default value for an integer is zero which could be happily saved by the database. By making it nullable I can be sure the database will blow up if the tenant_id is not set.

So, back to the issue at hand. The tenant_id needs to be set when the entity is saved. For this, I’m using an interceptor and setting the value in the OnSave method:

public class MultiTenantInterceptor : EmptyInterceptor
{
    private readonly Func<Tenant> tenant;

    public MultiTenantInterceptor(Func<Tenant> tenant)
    {
        this.tenant = tenant;
    }

    public override bool OnSave(object entity... object[] state, string[] propertyNames...)
    {
        var index = Array.IndexOf(propertyNames, "tenantId");

        if (index == -1)
            return false;

        var tenantId = tenant().Id;

        state[index] = tenantId;

        entity
            .GetType()
            .GetField("tenantId", BindingFlags.Instance | BindingFlags.NonPublic)
            .SetValue(entity, tenantId);

        return false;
    }
}

This IInterceptor mechanism is a little wonky. If you change any data, you have to do it in both the entity instance and the state array that NHibernate uses to hydrate entities. It’s not a big deal, it’s just one of those things you have to accept like the fact that Apple and Google are tracking your every move via your smart phone. Oh, and the interceptor gets wired up like this:

Container
    .RegisterType<ISessionFactory>(
        new ContainerControlledLifetimeManager(),
        new InjectionFactory(c =>
        {
            return new NHibernate.Cfg.Configuration()
                .Configure()
                .SetInterceptor(new MultiTenantInterceptor(() => c.Resolve<Tenant>()))
                .BuildSessionFactory();
        })
    );

We’re almost done. There is one more case that needs to be handled. When NHibernate loads an entity by its primary key, it doesn’t run through the query engine which means the tenant filter isn’t applied. Fortunately, we can take care of this in the interceptor:

public class MultiTenantInterceptor : EmptyInterceptor
{
    ...

    public override bool OnLoad(object entity... object[] state, string[] propertyNames...)
    {
        var index = Array.IndexOf(propertyNames, "tenantId");

        if (index == -1)
            return false;

        var entityTenantId = Convert.ToInt32(state[index]);

        var currentTenantId = tenant().Id;

        if (entityTenantId != currentTenantId)
        {
            throw new AuthorizationException("Permission denied to {0}", entity);
        }

        return false;
    }
}

That’s it. Have fun and happy commingling.

Bolt-on Multi-Tenancy in ASP.NET MVC With Unity and NHibernate

2011-05-14T13:09:00-07:00

The Mission:

Build a web application as though it’s for a single customer (tenant) and add multi-tenancy as a bolt-on feature by writing only new code. There are flavors of multi-tenancy, in this case I want each tenant to have its own database but I want all tenants to share the same web application and figure out who’s who by looking at the host header on the http request.

The Plan:

To pull this off, we’re going to have to rely on our SOLID design principles, especially Single Responsibility and Dependency Inversion. We’ll get some help from these frameworks:

Game on:

Let’s take a look at a controller that uses NHibernate to talk to the database. I’m not going to get into whether you should talk directly to NHibernate from the controller or go through a service layer or repository because it doesn’t affect how we’re going to add multi-tenancy. The important thing here is that the ISession is injected into the controller, and we aren’t using the service locator pattern to request the ISession from a singleton.

public class UserController : Controller
{
    private readonly ISession session;

    public UserController(ISession session)
    {
        this.session = session;
    }

    public ActionResult Edit(int id)
    {
        var user = session.Load<User>(id);

        return View(user);
    }
}

Alright, now it’s time to write some new code and make our web application connect to the correct database based on the host header in the http request. First, we’ll need a database to store a list of tenants along with the connection string for that tenant’s database. Here’s my entity:

public class Tenant
{
    public virtual string Name { get; set; }

    public virtual string Host { get; set; }

    public virtual string ConnectionString { get; set; }
}

I’ll use the repository pattern here so there is a crisp consumer of the ISession that connects to the lookup database rather than one of the tenant shards. This will be important later when we go to configure Unity.

public class NHibernateTenantRepository : ITenantRepository
{
    private readonly ISession session;
    private readonly HttpContextBase context;

    public NHibernateTenantRepository(ISession session, HttpContextBase context)
    {
        this.session = session;
        this.context = context;
    }

    public Tenant Current
    {
        get
        {
            var host = context.Request.Headers["Host"];
            return FindByHost(host);
        }
    }

    public Tenant FindByHost(string host)
    {
        return session
            .Query<Tenant>()
            .SingleOrDefault(t => t.Host == host);
    }
}

So now we need a dedicated ISessionFactory for the lookup database and make sure that our NHibernateTenantRepository gets the right ISession. It’s not too bad, we just need to name them in the container so we can refer to them explicitly.

Container
    .RegisterType<ISessionFactory>(
        "tenant_session_factory",
        new ContainerControlledLifetimeManager(),
        new InjectionFactory(c =>
            new NHibernate.Cfg.Configuration().Configure().BuildSessionFactory())
    );

Container
    .RegisterType<ISession>(
        "tenant_session",
        new PerRequestLifetimeManager(),
        new InjectionFactory(c =>
            c.Resolve<ISessionFactory>("tenant_session_factory").OpenSession())
    );

Container
    .RegisterType<ITenantRepository, NHibernateTenantRepository>()
    .RegisterType<NHibernateTenantRepository>(
        new InjectionConstructor(
            new ResolvedParameter<ISession>("tenant_session"),
            new ResolvedParameter<HttpContextBase>()
        )
    );

Hopefully that’s about what you were expecting since it’s not really the interesting part. The more interesting part is configuring the ISession that gets injected into the UserController to connect to a different database based on the host header in the http request. The Unity feature we’re going to leverage for this is the LifetimeManager. This is an often overlooked feature of IoC containers.

Container
    .RegisterType<ISessionFactory>(
        new PerHostLifetimeManager(() => new HttpContextWrapper(HttpContext.Current)),
        new InjectionFactory(c =>
        {
            var connString = c
                .Resolve<ITenantRepository>()
                .Current
                .ConnectionString;

            return new NHibernate.Cfg.Configuration()
                .Configure()
                .SetProperty(NHibernate.Cfg.Environment.ConnectionString, connString)
                .BuildSessionFactory();
        }));

Container
    .RegisterType<ISession>(
        new PerRequestLifetimeManager(),
        new InjectionFactory(c => c.Resolve<ISessionFactory>().OpenSession())
    );

Here we’re using a custom PerHostLifetimeManager. This tells Unity to maintain a session factory per host. When Unity runs across a host it doesn’t have a session factory for, it will run the InjectionFactory block to create one using the connection string associated with that tenant.

Since multiple simultaneous requests will be trying to get and set values with the same key, we need to make sure our PerHostLifetimeManager is thread safe. That’s pretty easy since Unity comes with a SynchronizedLifetimeManager base class that takes care of the fact that Dictionary isn’t thread safe.

public class PerHostLifetimeManager : SynchronizedLifetimeManager
{
    private readonly Func<HttpContextBase> context;
    private readonly IDictionary<string, object> store;

    public PerHostLifetimeManager(Func<HttpContextBase> context)
    {
        this.context = context;
        store = new Dictionary<string, object>();
    }

    protected override object SynchronizedGetValue()
    {
        var host = GetHost();

        if (!store.ContainsKey(host))
            return null;

        return store[host];
    }

    protected override void SynchronizedSetValue(object newValue)
    {
        store[GetHost()] = newValue;
    }

    private string GetHost()
    {
        return context().Request.Headers["Host"];
    }
}

So what did we accomplish? Well we didn’t touch any of our existing application code. We just wrote new code and through configuration we added multi-tenancy! That’s pretty cool, but was it worth it? Well, the goal in itself isn’t super important, but this exercise can certainly highlight areas of your codebase where you might be violating the single responsibility principle or leaking too many infrastructure concepts into your application logic.

Tamarack: Chain of Responsibility Framework for .NET

2011-04-21T21:39:00-07:00

The Chain of Responsibility is a key building block of extensible software.

Avoid coupling the sender of a request to its receiver by giving more than one object a chance to handle the request. Chain the receiving objects and pass the request along the chain until an object handles it. – Gang of Four

Variations of this pattern are the basis for Servlet Filters, IIS Modules and Handlers and several open source projects I’ve had the opportunity to work with including Sync4J, JAMES, Log4Net, Unity and yes, even Joomla. It’s an essential tool in the OO toolbox and key in transforming rigid procedural code into a composable Domain Specific Language.

I’ve blogged about this pattern before so what’s new this time?

The next filter in the chain is provided via a delegate parameter rather than a property
The project is hosted on github
There is a NuGet package for it

How does it work? It’s pretty simple, there is just one interface to implement and it looks like this:

public interface IFilter<T, TOut>
{
    TOut Execute(T context, Func<T, TOut> executeNext);
}

Basically, you get an input to operate on and a value to return. The executeNext parameter is a delegate for the next filter in the chain. The filters are composed together in a chain which is referred to as a Pipeline in the Tamarack framework. This structure is the essence of the Chain of Responsibility pattern and it facilitates some pretty cool things:

Modify the input before the next filter gets it
Modify the output of the next filter before returning
Short circuit out of the chain by not calling the executeNext delegate

Show me examples!

Consider a block of code to process a blog comment coming from a web-based rich text editor. There are probably several things you’ll want to do before letting the text into your database.

public int Submit(Post post)
{
    var pipeline = new Pipeline<Post, int>()
        .Add(new CanoncalizeHtml())
        .Add(new StripMaliciousTags())
        .Add(new RemoveJavascript())
        .Add(new RewriteProfanity())
        .Add(new GuardAgainstDoublePost())
        .Finally(p => repository.Save(p));

    var newId = pipeline.Execute(post);

    return newId;
}

What about dependency injection for complex filters? Take a look at this user login pipeline. Notice the generic syntax for adding filters by type. Those filters are built-up using the supplied implementation of System.IServiceProvider. My favorite is UnityServiceProvider.

public bool Login(string username, string password)
{
    var pipeline = new Pipeline<LoginContext, bool>(serviceProvider)
        .Add<WriteLoginAttemptToAuditLog>()
        .Add<LockoutOnConsecutiveFailures>()
        .Add<AuthenticateAgainstLocalStore>()
        .Add<AuthenticateAgainstLdap>()
        .Finally(c => false);

    return pipeline.Execute(new LoginContext(username, password));
}

Here’s another place you might see the chain of responsibility pattern. Calculating the spam score of a block of text:

public double CalculateSpamScore(string text)
{
    var pipeline = new Pipeline<string, double>()
        .Add<SpamCopBlacklistFilter>()
        .Add<PerspcriptionDrugFilter>()
        .Add<PornographyFilter>()
        .Finally(score => 0);

    return pipeline.Execute(text);
}

Prefer convention over configuration? Try this instead:

public double CalculateSpamScore(string text)
{
    var pipeline = new Pipeline<string, double>()
        .AddFiltersIn("Tamarack.Example.Pipeline.SpamScorer.Filters")
        .Finally(score => 0);

    return pipeline.Execute(text);
}

Let’s look at the IFilter interface in action. In the spam score calculator example, each filter looks for markers in the text and adds to the overall spam score by modifying the result of the next filter before returning.

public class PerspcriptionDrugFilter : IFilter<string, double>
{
    public double Execute(string text, Func<string, double> executeNext)
    {
        var score = executeNext(text);

        if (text.Contains("viagra"))
            score += .25;

        return score;
    }
}

In the login example, we look for the user in our local user store and if it exists we’ll short-circuit the chain and authenticate the request. Otherwise we’ll let the request continue to the next filter which looks for the user in an LDAP repository.

public class AuthenticateAgainstLocalStore : IFilter<LoginContext, bool>
{
    private readonly IUserRepository repository;

    public AuthenticateAgainstLocalStore(IUserRepository repository)
    {
        this.repository = repository;
    }

    public bool Execute(LoginContext context, Func<LoginContext, bool> executeNext)
    {
        var user = repository.FindByUsername(context.Username);

        if (user != null)
            return user.IsValid(context.Password);

        return executeNext(context);
    }
}

Why should I use it?

It’s pretty much my favorite animal. It’s like a lion and a tiger mixed… bred for its skills in magic. – Napoleon Dynamite

It’s simple and mildly opinionated in effort to guide you and your code into The Pit of Success. It’s easy to write single responsibility classes and use inversion of control and composition and convention over configuration and lots of other goodness. Try it out. Tell a friend.

A Custom HttpModule to Log Request Duration

2011-01-16T02:31:00-08:00

My application has logging of fine-grained operations, but I want to see the duration of the entire web request. The idea is to start a Stopwatch on the BeginRequest event and then log the elapsed time on the EndRequest event. I started by modifying the Global.asax to wire this up, but quickly got turned off because I was violating the Open-Closed Principle. I really just want to bolt-in this behavior while I’m profiling the application and then turn it off when the kinks are worked out. IIS has a pretty slick extension point for this sort of thing that let’s you hook into the request lifecycle events.

public class RequestDurationLoggerModule : IHttpModule
{
    private const string ContextItemKey = "stopwatchContextItemKey";

    public void Init(HttpApplication application)
    {
        application.BeginRequest += (o, args) =>
        {
            application.Context.Items[ContextItemKey] = Stopwatch.StartNew();
        };

        application.EndRequest += (o, args) =>
        {
            var stopwatch = (Stopwatch)application.Context.Items[ContextItemKey];
            stopwatch.Stop();

            var logger = GetLogger(application);

            logger.Debug(
                "{0} -> [{1} ms]",
                application.Context.Request.RawUrl,
                stopwatch.ElapsedMilliseconds);
        };
    }

    private static ILogger GetLogger(HttpApplication application)
    {
        var serviceProvider = application as IServiceProvider;

        if (serviceProvider == null)
        {
            return new NullLogger();
        }

        return serviceProvider.GetService<ILogger>();
    }

    public void Dispose()
    {
    }
}

The only weird part here is getting a handle to the logger. I’m using an IoC container in my application, however I can’t tell IIS how to build up my RequestDurationLoggerModule, so I’m stuck using the Service Locator pattern. The container could be a singleton, but I don’t like singletons, so I implemented IServiceProvider in Global.asax instead. All that’s left now is wiring in the module. Since Cassini behaves like IIS6, you have to use the legacy style configuration, which looks like this:

  
       name="..." type="MyApplication.RequestDurationLoggerModule, MyApplication"/>

For IIS7 though, you add it like this:

  
       name="..." type="MyApplication.RequestDurationLoggerModule, MyApplication"/>

Finally, it’s time to run the application and see the total request duration logged.

A Proper Closure in C#

2010-12-18T20:46:00-08:00

You’ve seen code like this:

public class Order
{
    private ITaxCalculator taxCalculator;
    private decimal? tax;

    public Order(ITaxCalculator taxCalculator)
    {
        this.taxCalculator = taxCalculator;
    }

    public decimal CalculateTax()
    {
        if (!tax.HasValue)
        {
            tax = taxCalculator.Calculate(this);
        }

        return tax.Value;
    }

    ...
}

There are a few things to pick at. I’m looking at the member variable named tax that is null until you call CalculateTax for the first time. There isn’t anything to prevent the rest of the class from using the tax variable directly and possibly repeating the null check code in multiple places. I thought it would be fun to rewrite it using a closure. I don’t mean the kind of accidental closures we write with LINQ, but an honest to goodness proper closure.

public class Order
{
    public Order(ITaxCalculator taxCalculator)
    {
        CalculateTax = new Func<Func<decimal>>(() =>
        {
            decimal? tax = null;
            return () =>
            {
                if (!tax.HasValue)
                {
                    tax = taxCalculator.Calculate(this);
                }
                return tax.Value;
            };
        })();
    }

    public Func<decimal> CalculateTax { get; set; }

    ...
}

A closure is created around the tax variable so only the CalculateTax function has access to it. That’s pretty awesome. I wouldn’t have thought of using this technique before learning JavaScript all over again. It’s fun code to write, but it’s not going to get checked-in to source control. It’s basically a landmine for the next guy that has to make changes. The mental energy it takes to wrap your head around it is like listening to a joke with a really long setup and a lousy punch line. The solution is more complicated than the problem.

I still thought it was worth the mental exercise. Each language has its wheelhouse which makes a certain class of problems easy to solve. Admittedly this was a bit forced here, but opening your mind to other solutions may lead to a breakthrough solving a legitimately tough problem.

Refactoring C# Style

2010-11-27T09:51:00-08:00

Take a look a this function.

[UnitOfWork]
public virtual void Upload(DocumentDto dto)
{
    var entity = new Document().Merge(dto);

    using (var stream = new MemoryStream(dto.Data))
    {
        repository.Save(entity, stream);
    }
}

After doing some performance profiling, it quickly popped up as the top offender. Why? Because it’s holding a database transaction open while saving a file. In order to fix it, we have to ditch our UnitOfWork aspect and implement a finer-grained transaction. Basically what needs to happen is saving the entity and saving the file need to be separate operations so we can commit the transaction as soon as the entity is saved. And since saving the file could fail, we might have to clean up an orphaned file entity.

public virtual void Upload(DocumentDto dto)
{
    var entity = new Document().Merge(dto);

    SaveEntity(entity);

    try
    {
        SaveFile(entity, dto.Data);
    }
    catch
    {
        TryDeleteEntity(entity);
        throw;
    }
}

private void SaveEntity(Document entity)
{
    unitOfWork.Start();
    try
    {
        repository.Save(entity);
        unitOfWork.Commit();
    }
    catch
    {
        unitOfWork.Rollback();
        throw;
    }
}

private void SaveFile(Document entity, byte[] data)
{
    using (var stream = new MemoryStream(data))
    {
        repository.Save(entity, stream);
    }
}

private void TryDeleteEntity(Document entity)
{
    unitOfWork.Start();
    try
    {
        repository.Delete(entity);
        unitOfWork.Commit();
    }
    catch
    {
        unitOfWork.Rollback();
    }
}

That wasn’t too bad except that the number of lines of code exploded and we have a few private methods in our service that only deal with plumbing. It would be nice to push them into the framework. Since C# is awesome, we can use a combination of delegates and extension methods to do that.

public virtual void Upload(DocumentDto dto)
{
    var entity = new Document().Merge(dto);

    unitOfWork.Execute(() => repository.Save(entity));

    try
    {
        repository.SaveFile(entity, dto.Data);
    }
    catch
    {
        unitOfWork.TryExecute(() => repository.Delete(entity));
        throw;
    }
}

public static class DocumentRepositoryExtensions
{
    public static void SaveFile(this IDocumentRepository repository, Document document, byte[] data)
    {
        using (var stream = new MemoryStream(data))
        {
            repository.SaveFile(document, stream);
        }
    }
}

public static class UnitOfWorkExtensions
{
    public static void Execute(this IUnitOfWork unitOfWork, Action action)
    {
        unitOfWork.Start();
        try
        {
            action.Invoke();
            unitOfWork.Commit();
        }
        catch
        {
            unitOfWork.Rollback();
            throw;
        }
    }

    public static void TryExecute(this IUnitOfWork unitOfWork, Action action)
    {
        unitOfWork.Start();
        try
        {
            action.Invoke();
            unitOfWork.Commit();
        }
        catch
        {
            unitOfWork.Rollback();
        }
    }
}

Now we can move the extension methods into the framework and out of our service. In a less awesome language we could define these convenience methods on the IUnitOfWork interface and implement them in an abstract base class, but inheritance is evil and it’s a tradeoff we don’t have to make in C#.

Software Development Team Building – Part 1

2010-11-13T17:26:00-08:00

The NHL season starts in October and ends 82 games later in April. Well, the regular season, that is. Hockey fans consider it an extended pre-season because when playoffs start, it’s like watching a different sport. To win the coveted Stanley Cup, a team must battle through four grueling best-of-seven series, and when it’s all over, it’s not about a bad call by the referee or a fluke fumble or broken tackle. It’s not even about a hot goal-tender or superstar forward. It’s about the whole team digging deep and banding together to create a whole greater than the sum of its parts.

If that wasn’t true, you could simply put all the best players on one team and watch them dominate. Russia tried it in the 2010 Winter Olympics and only walked away with the bronze medal. A successful team has an identity that transcends individuals. If you watch hockey, you’ll hear teams described as up-tempo, finesse, defensive or gritty and hard-hitting. One isn’t better than another, it really depends on what your core strengths are and then getting buy-in from everyone on the team.

On a software team, maybe you’ve got some pre-family twenty-something developers that are wizards with Javascript. Or, maybe you’ve got a veteran team and company with deep enough pockets to invest in a long term product strategy. It really doesn’t matter, just identify your strengths and get buy-in.

Developers are opinionated and not always team players; the cliché herding cats comes to mind. So how do you get developers to buy in? Try out this exercise. Get your team in front of a whiteboard and come up with a list of words that describe positive software quality factors. Here’s a start:

Scalable
Fast
Efficient
Reliable
Maintainable
Elegant
Expressive
Readable
Portable
Concise
Testable
Understandable
Correct
Consistent
Clean
Innovative
Secure
Extensible
Reusable
Composable

Have each team member to pick their top 5 and write them on the board. Compare the lists and allow the team to negotiate with each other to come up with a common top 3. Why is a shared list important? Writing software is about trade-offs. We make dozens of decisions per day while writing code and sharing a set of values is essential to consistent forward progress in a team environment.

If you don’t know where you’re going, you might not get there – Yogi Berra

Consider this list:

Testable
Expressive
Scalable

Compare that with:

Innovative
Efficient
Elegant

Imagine you are sitting at the keyboard working on a feature. Can you see how the first list could influence completely different decisions than the second list? It’s not about right or wrong, it’s just about being explicit about how you make decisions and trusting the rest of your team to make similar choices.

Automatic Deployment From TeamCity Using WebDeploy

2010-10-27T11:07:00-07:00

A solid continuous integration strategy for an enterprise web application is more than just an automated build. You really need a layered approach to maintain a high level of confidence in your code base.

Run unit tests – these are fast running unit tests with no external dependencies. We use NUnit.
Run integration tests – these are tests that have a dependency on a database. The primarily purpose is to test NHibernate mappings.
Run acceptance tests – these tests are written in the Given, When, Then style. We write the tests in BehaveN, but we expect that a stakeholder could read them and verify the business rules are correct.
Deploy to CI and run UI tests – these are qunit and Selenium tests. They require the code to be deployed to a real web server before the tests run. That’s what this article is about.
Deploy to QA – once the automated UI tests have passed, we deploy to our QA server for manual testing.

Step 1: Install WebDeploy on the web server you want to deploy to.

Step 2: Configure Web.config transforms. This will enable you to change connection strings and whatnot based on your build configuration.

Currently this is only supported for web applications, but since it’s built on top of MSBuild tasks, you can do the same thing to an App.config with a little extra work. Take a peak at Microsoft.Web.Publishing.targets (C:\Program Files\MSBuild\Microsoft\VisualStudio\v10.0\Web) to see how to use the build tasks.

 TaskName="TransformXml" AssemblyFile="Microsoft.Web.Publishing.Tasks.dll"/>
 TaskName="MSDeploy" AssemblyFile="Microsoft.Web.Publishing.Tasks.dll"/>

Step 3: Figure out the MSBuild command line arguments that work for your application. This took a bit of trial and error before landing on this:

C:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe MyProject.sln 
  /p:Configuration=QA
  /p:OutputPath=bin
  /p:DeployOnBuild=True 
  /p:DeployTarget=MSDeployPublish 
  /p:MsDeployServiceUrl=https://myserver:8172/msdeploy.axd 
  /p:username=***** 
  /p:password=***** 
  /p:AllowUntrustedCertificate=True 
  /p:DeployIisAppPath=ci
  /p:MSDeployPublishMethod=WMSVC

Step 4: Configure the Build Runner in TeamCity

Paste the command line parameters you figured out in Step 3 into the Build Runner configuration in TeamCity:

Step 5: Configure build dependencies in TeamCity. This means the integration tests will only run if the unit tests have passed and so on.

Step 6: Write some code.

Hacker, Artist, Craftsman or Engineer?

2010-10-10T09:44:00-07:00

I recently read the book Coders at Work on tmont’s recommendation. It was a good read, and at 500+ pages you can repurpose it as a monitor stand when you’re finished reading it.

Based on nearly eighty hours of conversations with fifteen all-time great programmers and computer scientists, the Q&A interviews in Coders at Work provide a multifaceted view into how great programmers learn to program, how they practice their craft, and what they think about the future of programming.

The author, Peter Siebel, asked each of the interviewees “Do you consider yourself a hacker, artist, craftsman or engineer?” It’s a good question and one I forgot about until Uncle Bob wrote an article suggesting a correlation with one’s definition of done. It’s a clever angle but there is more to explore.

Hacker

On a PDP-8 with only eight instructions and two registers, you’re going to need some ingenuity, maybe even the rare kind of intellect that led a 14 year-old to win eight United States Chess Championships in a row.

It’s just you and your opponent at the board and you’re trying to prove something. — Bobby Fischer

The hacker lives in a world of tight constraints, like Houdini in a straight jacket inside a locked box submerged in icy water. The hacker gets a thrill from this kind of challenge and often relies on a toolbox of algorithms and decompilers to get the job done. The hacker isn’t satisfied with mainstream abstractions for IO, instead the hacker knows exactly what happens when you write a file or open a socket and why you might choose UDP instead of TCP. The hacker is the guy that’s going to push the industry forward with new technology and techniques.

It’s a fine line though, finding loopholes and circumventing the intent of your framework can be a recipe for disaster. Often times, it’s not just you and the machine, it’s you and a team of developers writing software for users that expect their cell phone to stream music from Pandora via Bluetooth to their car stereo while rendering a map of the current location with real-time traffic info. And if all this takes a few extra seconds, your software is casually dismissed as a piece of crap.

Artist

The artist believes that software is an expression of one’s self. The process is creative and each solution is subtly unique like a brush stroke on a canvas. An artist believes what they do is special and you could no more teach someone how to be a great programmer than you could teach someone how to write a great song or novel. You’ve either got it or you don’t.

Just like a great band, truly great software comes from a small team with just the right mix of strengths and styles. Too many contributors and it’s a mess. Not enough and it’s too predictable. Boston was a great band, but it was dominated by Tom Schulz, the guitarist, keyboardist, songwriter and producer. He owned every note. It’s not necessarily a bad thing, Boston sold over 31 million albums in the United States. But consider the Beatles, two creative forces pushing and pulling on each other, clashing and meshing to produce a truly dynamic catalog.

Craftsman

If you’re having open heart surgery, you probably want an experienced doctor holding the scalpel. The human body is a complex system and reading a text book just isn’t the same as years of experience.

Programming is difficult. At its core, it is about managing complexity. Computer programs are the most complex things that humans make. Quality is illusive and elusive. – Douglas Crockford

When a fine furniture craftsman looks at a piece of wood, the unique shape and the location of the knots are thoughtfully considered. The craftsman observes how dry or green the wood is, how hard or soft it is and applies a set of best practices honed through years of experience to produce the finest end result.

The craftsman believes that most software problems are the same but knows a solution can’t be stamped out by code monkeys. Physical and financial constraints, subtleties of the problem domain and quirks of a particular framework or language require an experienced and steady hand to produce a quality product. The craftsman believes software development is a highly skilled trade and to become an expert you must start as an apprentice.

Engineer

An engineer writes the kind of software you count on when you drive a car, fly in a plane or dial 911; Where a failure means more than a delayed shipment or double payment on your online order. The stakes are high and the problems are too big to fit in one person’s head. Problems that only process and frameworks can wrangle. You can’t simply start typing and watch the design of your application emerge from nothing, it requires detailed analysis and thoughtful study.

Which one am I? I have a piece of paper that says Engineer on it, and while I enjoy writing frameworks as much as the next guy, I’m a craftsman. Maybe it’s because I’ve been doing this for a dozen years and I’d like to think my experience is worth more than a piece of paper.

Hey, if you want me to take a dump in a box and mark it guaranteed, I will. I got spare time. But for now, for your customer’s sake, for your daughter’s sake, ya might wanna think about buying a quality product from me. – Tommy Boy

Which one are you?

Don't Say Singleton

2010-09-18T22:10:00-07:00

I was in junior high around the time that Master of Puppets was released followed by …And Justice For All. Maybe it’s nostalgia, but those two albums were not only the best Metallica albums, they were perhaps the greatest albums of the entire genre. Something happened though and I don’t mean grunge.

Metallica eventually succumbed to their own success and tried to crank out albums more accessible to wider audiences. This tension between music with something to say and music that appeals to the masses has been the bane of the industry since the invention of the radio.

If you don’t believe me, go ask 10 people what their favorite Metallica song is. You’re going to hear Enter Sandman unless you work at an auto shop or maybe guitar center in which case you’re probably not reading this article. You’re going to hear Enter Sandman because that song was engineered by producers and marketing types to be played on the same radio stations that host morning shows like Jeff and Jer, yet that song couldn’t be further from the band’s roots.

Likewise, if you ask 10 programmers if they use design patterns, you will undoubtedly hear a resounding yes citing the widely accessible singleton pattern. Why is this the case? It’s because the singleton most closely resembles something that looks familiar when writing procedural code and it’s the easiest pattern to insert into a pile of crap. The problem is that the singleton represents design patterns about as well as Enter Sandman captures the essence of why Metallica was a great band.

Pattern catalogs should probably look more like this:

Design Patterns:

Singleton*
Strategy
Decorator
…

* Not a design pattern

Or at least come with a warning label.

Warning: The Singleton is actually a global variable. Use this pattern as a last resort and be sure to hide the fact you are using it from the rest of your application and your manager.

So how do you know if you’re using the Singleton correctly? I’ll give you a hint, if you have any knowledge of the fact that the object you’re using is a singleton, then you’re not using it correctly. Let’s say you’re writing some caching logic since it’s one of those places where singletons pop up.

Let’s start with a simple interface for finding a user object:

public interface IUserRepository
{
    User FindById(int id);
}

You realize you keep doing the same lookup over and over and adding some caching would really perk up the application. Hooray, for the Singleton!

public class CachingUserRepository : SqlUserRepository
{
    public override User FindById(int id)
    {
        var user = MyCache.Instance.Get(id);

        if (user == null)
        {
            user = base.FindById(id);
            MyCache.Instance.Add(id, user);
        }

        return user;
    }
}

Well this probably works great in a console application or Windows service, but what if your code is running in a web application, or as a WCF service. What if your Windows service is distributed across multiple machines? Should you care how your application is deployed in your CachingUserRepository? No, you shouldn’t. That object has one responsibility, and it’s to manage collaboration with the cache. Let’s try this instead:

public class CachingUserRepository : SqlUserRepository
{
    private readonly ICacheProvider<User> cache;

    public CachingUserRepository(ICacheProvider<User> cache)
    {
        this.cache = cache;
    }

    public override User FindById(int id)
    {
        var user = cache.Get(id);

        if (user == null)
        {
            user = base.FindById(id);
            cache.Add(id, user);
        }

        return user;
    }
}

Maybe your cache should live for the lifetime of a single web request, or maybe a web session, or possibly a distributed cache is right for your scenario. The good news is that caching is one of those solved problems, like data access, so you can just pick your favorite library and plug it in to your application with a simple adapter:

HttpRequestCacheProvider
HttpSessionCacheProvider
HttpContextCacheProvider
MemcacheCacheProvider
EnterpriseLibraryCacheProvider
VelocityCacheProvider

Let’s say you still want to roll your own cache and you decide that singleton is right scope for you. That’s okay, just recognize that using a singleton is a configuration decision and should not leak out into your application. The only place that you should see that .Instance getter is where you bootstrap your application.