Saving Humanity (By Vectorizing The ALU)

The Challenge

Oh!! Mankind is in trouble again. This time, it’s a deadly disease spreading at a rate never seen before. The need of the hour is to set up efficient virus detectors. You are the lead at Central Hospital and you need to find a fast and reliable way to detect the footprints of the virus DNA in that of the patient.

The HackerRank challenge Save Humanity can be summarized as follows:

You are given two strings of lowercase letters a-z. We’ll call them dnaP and dnaV.
You have to return the starting index (0-based) of every occurrence of dnaV in dnaP.
dnaV matches a substring in dnaP if all characters match, or if all characters but one match.

This challenge is ranked as expert with a 100 point award. As of the day this post was published only 12.94% of submissions have succeeded, and the Discussion tab is filled with the laments of those who timeout on the most difficult test files.

I coded a solution in less than 15 minutes. (Cue the anger and wrath of anyone who is stuck on this challenge.)

To be fair, I did not initially code an optimal solution. I guessed when I read the challenge that an optimal solution would involve something like a z-algorithm, a suffix array, or some form of tree. But I had a hunch there was an easier way to get 100 points.

First Draft

With this class of challenge it’s a sure bet that a brute force solution will timeout. Since the brute force approach is trivial to code I went ahead and tried it to verify that I wasn’t missing anything about the problem, and so that I could use my hackos to purchase test files.

It timed out on 5 of the test files but succeeded on the others. I bought the files, ran some tests on my MBP, and confirmed my hunch.

Go Big or Go Home, Brutus

The obvious brute force solution is to compare each character in dnaV to each character in dnaP and stop if there’s more than one mismatch. This of course has to be done repeatedly starting at each character index in dnaP. It’s the repeat comparisons that drive the processing time up, and an optimal solution would find a way to eliminate as many repeat comparisons as possible.

But there’s another problem with the brute force solution: it’s comparing one character at a time on a 64-bit processor which can compare 8 characters at a time using the ALU. Situations like this are why I freaking love C. I took my solution and modified it as follows:

Cast the char pointers dnaP and dnaV to long long pointers.
Compare the character arrays as arrays of long long, essentially vectorizing the ALU.
If there’s a mismatch, drop down into a loop which examines the 8 characters at that position to count the mismatches.

In C these were trivial modifications to make and they brought the worst time down to 1.17s for test file 7, passing the challenge with time to spare. We’ll call this the B8 solution (brute force 8-byte compares) and the original solution B1 (brute force 1-byte compares).

Having solved the challenge I wanted to explore the solution further to see if I could make it faster than the editorial solution on the longer test files (3-9).

Saving Humanity with the Vector Processor

The obvious next step in improving the brute force approach is to compare more than 8 bytes per iteration. SSE2 has instructions which can compare 16 bytes at a time. Initially I tried to both compare the values and get a count of mismatches all using SSE2 instructions. However, this proved to be slower than simply performing the compare and counting the mismatches normally if the SSE2 result indicated any mismatches at all. The B16 solution is therefore the same as the B8 solution, it just compares 16-bytes at a time.

Test file 7 is again the worst case and the B16 solution is able to process it in 0.68 seconds. That’s roughly 1.7x faster than using the ALU.

Later extensions to x86-64 vector processing added support for 256-bit operations (AVX2) and 512-bit operations (AVX-512). Unfortunately my MBP processor does not support AVX-512. It does support AVX2, but using 256-bit loads and compares slowed down all test files except 8 and 9, which saw a very slight performance gain. I suspect this has something to do with the algorithm’s inherent lack of alignment on 32-byte boundaries.

Beating the Editorial Solution

I won’t go into detail on the editorial solution here since HackerRank hides the editorial until you either get your points or forfeit those points in order to see the solution. Suffice it to say that the editorial solution involves the use of a suffix array.

The editorial solution is 3-4x faster than the B16 solution on files 3-7. But on test files 8 and 9 B16 is nearly 21x faster. Even brute force using single byte compares is faster than the editorial here by a factor of 3x. What’s going on?

Test files 3-7 involve very long repetitions of the letter ‘a’ with a few other characters interspersed in the text to form a pattern. Test files 8 and 9 involve long repetitions of the alphabet with a few other characters interspersed. The brute force solutions have to repeatedly compare a large percentage of the bytes in test files 3-7. But on test files 8 and 9, 25 out of every 26 comparisons end on the 2nd byte being compared. Here the brute force solutions aren’t comparing nearly as many bytes as the editorial solution which has to sort the suffix array for it to be useful.

To deal with the high character repetition in some of the test files I made the following modifications to the B16 solution. We’ll call this version B16R.

Allocate an array of integers for both dnaP and dnaV.
Scan the strings from end to beginning counting repeat characters and storing the counts in the arrays.
Now at any given character position we can look at the corresponding integer arrays and see how many times the character will repeat in both dnaP and dnaV.
The string matching loop can use this information to jump ahead any time it finds a suitable repetition.

Mission Accomplished

	B1	B8	B16	B16R	Editorial	B16R vs Ed
03	4.576205	0.900688	0.515656	0.032629	0.164616	5.05
04	4.704736	0.946557	0.575980	0.041863	0.170677	4.08
05	3.347541	0.668534	0.396330	0.040929	0.165247	4.04
06	4.699864	0.956699	0.588319	0.050467	0.162984	3.23
07	5.791916	1.167417	0.698043	0.063837	0.178140	2.79
08	0.159862	0.036540	0.026027	0.050608	0.541531	10.70
09	0.169632	0.036569	0.026848	0.057554	0.542238	9.42

All tests: MacBook Pro Retina 15-inch Mid 2015 • 2.8 GHz i7 • Xcode 10B61 • Max Optimization • Loop Unrolling

B16R is roughly 3-5x faster than the editorial solution on files 3-7, and each of the test files completes in under 1/10th of a second. None of the long test files complete in under 1/10 of a second with the editorial solution, and two of them take a half second.

These changes did double the run times on test files 8 and 9. If this were a real world scenario with larger data sets it would be worth keeping track of repetition frequency while building the array for dnaV. At that point the code could decide whether or not to bother looking for repetitions in the search.

Is There An Even Better Way?

String matching is a well researched area of computer science. This web site lists the major algorithms along with their strengths, weaknesses, and sample implementations in C. Most programmers will never need to implement one of these algorithms and will simply rely on library calls whenever they need to search a haystack for a needle of text. It’s still educational to look over the various approaches to a common problem and to understand how each one tried to improve performance.

Time permitting I hope to explore a few of these algorithms, and how well they handle this particular challenge, in a later blog post.

Saving Humanity (By Vectorizing The ALU)

The Challenge

First Draft

Go Big or Go Home, Brutus

Saving Humanity with the Vector Processor

Beating the Editorial Solution

Mission Accomplished

Is There An Even Better Way?

Source Files

Submit a Comment Cancel reply

Recent Posts

Categories

Archives