
In practical programming, it is quite common to need to determine whether an object belongs to a set. For example, a spell checker is essentially a comparison against the set of all correctly spelled words. If the set itself is fixed and never changes, what approaches can we use?

The usual first solution is an associative container, often called a dictionary. We populate the dictionary with the members of the set; looking up whether it contains the search word reveals the outcome of the spell check.

JavaScript, also known as ECMAScript, is unique in the sense that every object can act as an associative array. Abusing objects this way is not what you are supposed to do, but to give an idea:

var valid_words = {
  'foobar': true,
  'bar': true,
  'baz': true,
  'quux': true
};
 
function is_valid(word) {
  return valid_words.hasOwnProperty(word);
}
 
is_valid('fox'); // false

If you are targeting ECMAScript 5 only, you can create valid_words with Object.create(null) so that it does not inherit any properties from Object.prototype.
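A minimal sketch of that ES5 variant, reusing the word list from the example above:

```javascript
// Build the lookup table from Object.create(null) so it has no
// inherited properties such as 'toString' or 'constructor'.
var valid_words = Object.create(null);
['foobar', 'bar', 'baz', 'quux'].forEach(function (w) {
  valid_words[w] = true;
});

function is_valid(word) {
  // 'in' is safe here: the table has no prototype chain to search.
  return word in valid_words;
}

is_valid('toString'); // false, nothing is inherited
```

With a plain object literal, `'toString' in valid_words` would be true via the prototype chain; the null-prototype table avoids that class of false positives.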

Another (better) approach is an array lookup, i.e. the indexOf method, particularly since we may well store the list of valid words in an array anyway:

var valid_words = ['foobar', 'bar', 'baz', 'quux'];
 
function is_valid(word) {
  return valid_words.indexOf(word) >= 0;
}

The problem with the above tricks is that the performance depends heavily on the implementation inside the JavaScript engine. Since this is not the primary use case of an object or an array, the technique often does not yield good performance, especially if the set is large.

How about using the native comparison in the language itself? A switch statement is one possibility; it could even be a series of if statements.

function is_valid(word) {
  switch (word) {
  case 'foobar':
  case 'bar':
  case 'baz':
  case 'quux':
    return true;
  }
  return false;
}
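For completeness, the if-statement variant mentioned above behaves identically to the switch version:

```javascript
// Equivalent membership test written as a chain of if statements;
// the engine still performs one string comparison per candidate.
function is_valid(word) {
  if (word === 'foobar') return true;
  if (word === 'bar') return true;
  if (word === 'baz') return true;
  if (word === 'quux') return true;
  return false;
}

is_valid('baz'); // true
```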

Besides generating a lot of code (since the check is done for each possible set member), a string comparison is not the fastest operation. Depending on the size of the set, the performance may vary.

Something I learned from Roberto Raggi a few years ago is the use of a prefix tree, or trie, to build a sort of decision tree. He used it in his C++ parser to detect whether an identifier is a keyword, which is quite similar to our spell checker example. For the code of that cppparser, take a look at lexer.cpp. How do we apply the same approach to the simple spell checker? It will look like the following. Note that there can be dozens of other variants, but you get the idea.

function is_valid(word) {
  switch (word.charAt(0)) {
  case 'f':
    return word === 'foobar';
 
  case 'b':
    switch (word.charAt(2)) {
    case 'r':
      return word === 'bar';
    case 'z':
      return word === 'baz';
    }
    return false;
 
  case 'q':
    return word === 'quux';
  }
  return false;
}

A simplified construct that I use for Esprima, the fast JavaScript parser, is to take the word length as the first-level comparison and then a simple conditional as the second one. This leads to:

function is_valid(word) {
  switch (word.length) {
  case 3:
    return word === 'bar' || word === 'baz';
  case 4:
    return word === 'quux';
  case 6:
    return word === 'foobar';
  }
  return false;
}

I found that the above method is fast enough (tested with Esprima's benchmark suite) and there is no reason to micro-optimize further. Just like in PGO (profile-guided optimization), combining the short-circuit behavior of the logical conditions with a representative distribution of the words can serve as a form of branch optimization.

If the set is large and some members are quite similar to each other, a DAG, or directed acyclic graph (more precisely a directed acyclic word graph in this case), can be a better solution with respect to memory usage. In many cases, however, generating this graph is too complicated and not worth the effort.

Of course, this discussion would be incomplete without mentioning no-collision hashing, i.e. perfect hashing. This is the approach used to recognize keywords in the tokenizers of various GNU compilers, from C++ to Java, as well as in projects like WebKit and Mozilla. A popular perfect hash function generator is gperf; it generates C code, though it is possible to transform the generated code to JavaScript.

What we need to feed to gperf is the following input:

%{
%}
%%
foobar
bar
baz
quux
%%

Now it will spit out some C code. For your convenience, I have crudely transformed it into JavaScript as follows. I also changed the asso_values look-up table since many entries in it were duplicated. The (perfect) hash function itself is surprisingly primitive. Combined with the short wordlist array, this gperf-based solution works rather well.

var MIN_WORD_LENGTH = 3;
var MAX_WORD_LENGTH = 6;
var MIN_HASH_VALUE = 3;
var MAX_HASH_VALUE = 8;
 
function hash(str) {
  var asso_values = [];
  for (var i = 0; i < 256; ++i)
    asso_values.push(9);
  asso_values[111] = 0;
  asso_values[114] = 5;
  asso_values[117] = 0;
  asso_values[122] = 0;
  return str.length + asso_values[str.charCodeAt(2)];
}
 
function is_valid(word) {
  var wordlist = ['', '', '', 'baz', 'quux', '', 'foobar', '', 'bar'];
 
  if (word.length <= MAX_WORD_LENGTH && word.length >= MIN_WORD_LENGTH) {
    var key = hash(word);
    if (key <= MAX_HASH_VALUE && key >= 0) {
      var value = wordlist[key];
      return word === value;
    }
  }
  return false;
}

Before you create microbenchmarks comparing all the different methods, let me remind you again that it is always good to have the right data for your benchmark. In other words, write the most descriptive implementation first, build the benchmark suite with a representative corpus, and only then do your comparison. In the above example, if you just measure the speed of running is_valid("foo"), you will be led astray.
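As a rough sketch (the corpus and iteration count here are made up for illustration, not taken from any real benchmark suite), a harness along these lines times lookups over a mix of hits and misses rather than a single repeated word:

```javascript
// Hypothetical corpus: a representative mix of valid words and near-misses.
var corpus = ['foobar', 'fox', 'bar', 'bazz', 'quux', 'qux', 'baz', 'foo'];

// Time a candidate is_valid implementation over the whole corpus.
function benchmark(fn, iterations) {
  var start = Date.now();
  for (var i = 0; i < iterations; ++i) {
    for (var j = 0; j < corpus.length; ++j) {
      fn(corpus[j]);
    }
  }
  return Date.now() - start; // elapsed milliseconds
}
```

Feeding every implementation the same corpus keeps the comparison honest; timing a single word exercises only one branch of the decision tree.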

Obviously, the approaches mentioned above are not the only possible solutions. Computer scientists have surely studied and understood a few more tricks. What is your favorite? Share it with us!

  • http://www.facebook.com/jarrednicholls Jarred Nicholls

    Cool story bro

  • MySchizoBuddy

how about you look at the length of the word and then, instead of checking the first character, check the last character? What are the chances that a 3-letter word ends with “r”? Pretty low, I think.

    • http://ariya.ofilabs.com/ Ariya Hidayat

      That’s just another variant: suffix tree.

  • http://twitter.com/pebaryan Peb Ruswono Aryan

    i use a trie for a gazetteer in text processing and triple store. it works fine especially for namespace prefix aliasing. other variants of tree which tolerate spelling error (for searching places or names) are just as nice.

  • http://kirbysayshi.com/ Andrew Petersen

    One thing I’m confused about, is why not just use the associative array/object property lookup?

    I could see not wanting to use it due to memory in certain situations, but how could a hash function be faster than a simple property lookup? Are there other concerns that you didn’t mention (or that I missed) in the article?

    Or I suppose the idea is to use the best method for the job.

  • Peter van der Zee

    For some reason, the length check hadn’t occurred to me. I was only doing the charcodeat prefix check.

So I’ve done some testing between checking the leading charCodeAt and the identifier length, both followed by string comparisons against the proper set of identifier strings. It seems that doing the charCodeAt prefix check is consistently faster than length. Though this is mostly noticeable over an 8mb file and won’t make a difference when parsing jquery (but hey, micro is micro).

    In my case that might have to do with slicing the input when comparing though. I don’t slice unless absolutely necessary, something which Esprima will do regardless, so there’s less overhead (difference) for it there. And seeing how the slicing is pretty significant, my results might be kind of moot for you :)

    Benchmarking is hard

    • http://twitter.com/ariyahidayat Ariya Hidayat

      What do you mean by slicing?

  • Asen Bozhilov

    While your post is interesting, you have explored only linear search approaches. If you have access to the set, I would pre-sort the set and after apply binary search algorithm. The complexity would be O(log n).

    • Asen Bozhilov

Huh, I missed your last hash function. Anyway, binary search is good to mention.