Lately I’ve been addicted to solving challenges at HackerRank.com. I decided early on that I wanted to solve the majority of Algorithm challenges in C. I spend so much time in higher level languages that I’ll generally jump at the chance to work closer to the machine. Sometimes this can be frustrating because you have to do heavy lifting where a higher level language hands you the perfect class or set of functions.

But sometimes…sometimes it can be very rewarding.

The Challenge

The ACM ICPC Team challenge starts off with two numbers. N is the number of people at a conference, and M is the number of topics that the people can know about. The input then consists of N strings with M characters in each string. Each character is a 0 or a 1. The idea is that there’s a line for each person and a character position for each topic. 1 means the person is well versed in the topic, and 0 means they are not. The goal is to find the maximum number of topics any two person team is well versed in, and the count of teams that hit the max.

When I read the challenge my mind ran straight to a specific solution, and I never even considered another approach. After I submitted my solution and got the points I read both the editorial and the discussion. And I was rather surprised at what I found.

You Can Do Better Than O(n^3)

The editorial states that this is a brute force problem with O(n^3) complexity. The C++ code example uses three nested loops to compare every character of every pair of strings, each string stored in a vector.

The PHP example takes advantage of the fact that you can bitwise OR two strings to eliminate a loop. But internally there’s still a character by character operation to produce a new string, followed by another character comparison of the result to count the 1’s. So this solution merely hides the O(n^3) complexity behind language functions.

And finally the Python example converts the strings to integers and does a bitwise OR, but then converts the result back to a string to count the 1’s. The Python code actually touches on an important point, but the potential speed gain gets lost in Python itself.

Surprised by the inefficiency of these solutions I clicked over to the discussions tab.

Nobody Saw It?

I skimmed the comments going back two years. Aside from the normal “why doesn’t my code work?” posts, there were numerous complaints about timeouts. And numerous code postings that squeaked in under the finish line, but took similar approaches to the editorial: O(n^3) value comparisons or counting 1’s in a string which was the result of a bitwise OR.

There were a handful of more creative approaches involving Java’s BitSet (slow) or a 2D array of integers (not bad, but still).

Right before giving up on loading more comments I came across another C solution which was comparable to mine in its approach and speed. But for all the interest expressed in the discussion about faster ways to solve the puzzle, nobody paid much attention to the post or the reason why it was insanely fast.

I was shocked by this point because when I wrote my solution I didn’t think of it as being unique or special. I assumed the editorial samples would include a similar approach, along with many of the user submissions.

Let The CPU Do Its Job

The challenge input is simply a collection of bits represented as human readable characters in the 8-bit ASCII set. Yet almost no one reduced the data to a form that naturally fits the CPU.

The help section mentioned bitwise OR as did people in the discussion trying to help those with code that was timing out. But the majority were using bitwise OR on the individual ASCII characters from the string. In some cases literally, one character pair at a time. Others applied bitwise OR to multiple characters in a single operation, but they were still using 8-bits in memory to represent a single bit of the problem.

By analogy, if you compare two topic characters from the string at a time you are only using 1 lane of a 64 lane highway. If you bitwise OR 8 of the topic characters at a time you’re doing better, but still only using 8 lanes of a 64 lane highway.

The solution I wrote?

  • calloc a properly sized block of memory based on the inputs.
  • Use strtoul to convert the base 2 strings, in chunks determined by sizeof, to unsigned long ints placed into the memory block. Each 0 and 1 character becomes a single bit in memory. On 64-bit CPUs unsigned longs are 8 bytes and can store 64 topic flags.
  • Loop through the two person teams in O(n^2).

The key differences?

  • The CPU is performing bitwise OR operations at its natural word size on a data structure with no waste. For a 64-bit system that means the topics of a two person team are merged with a single instruction for every 64 topics. It also means the code which replaces the innermost loop in the brute force approach performs with a fraction of the loop overhead.
  • The data describing each person’s topics is compressed by a factor of 8 meaning the CPU has less data to move and less chance of a cache miss.
  • With a fairly simple algorithm you can reduce the act of counting the bits to a handful of instructions for the CPU. And in Xcode at least, __builtin_popcountl implements an even faster algorithm for you.

The end result? On test case 7 with 500 people and 500 topics the editorial Python solution took 0.881728 seconds, as tested on my MacBook Pro. The editorial C++ solution took 0.401785 with the same Xcode compiler settings as my solution.

My solution? 0.002911 seconds. 138x faster than the C++ sample. And 303x faster than the Python sample.

Sometimes it pays to be close to the machine.