SPO 600 Project - Stage 2

In this post, I will discuss my findings during Stage 2 of the project. For more details, click here.

Before I talk about how I optimized the Murmurhash2 algorithm, I would like to answer a question I forgot to address in my Stage 1 post - where is this algorithm actually used? After doing some research, I found out that Tengine, just like nginx, uses Murmurhash2 in the ngx_http_split_clients_module, which is often used for A/B testing.

Alright then, let's talk about how I optimized the algorithm. First of all, I think I should mention the optimization strategies I eliminated:
  • In-line assembler: Given that Tengine is meant to support a variety of platforms and CPU architectures, I honestly don't think this is a good option - unless Jack Ma would give me a job afterwards :)
  • Altered build options: At first I considered this option, since it is the easiest one, and planned to change the compile options to -O3/-Ofast. But after checking the cc directory of Tengine on GitHub (which lists the compile options for all the compilers it supports), I quickly noticed that Tengine can still be built with some legacy compilers - the Compaq C compiler and the Open Watcom C compiler, to name a few - and those are things I don't want to mess with, as I don't want to break backward compatibility.
  • Code changes to permit better optimization by the compiler: I also briefly considered changing some parts of the algorithm so it could take advantage of auto-vectorization and SIMD, but I have a concern: what if the software is running on a platform that doesn't support SIMD and/or auto-vectorization? Also, to really take advantage of auto-vectorization and SIMD, my best option would be to compile with -O3, which could potentially wreck the entire software (I assume Igor Sysoev chose -O2 for the original nginx for the same reason).
So, the safest (and the only possible) option for me is Algorithm improvements.

Improve the algorithm 

Since Murmurhash itself is already a simple algorithm, at first I thought there was no way I could optimize it further (for this reason, I even briefly considered choosing the md5 algorithm instead, but later dropped the idea). But after paying SMHasher a visit, I realized I was wrong. I scrolled to the Murmurhash2 section and did a quick comparison with the ngx_murmurhash.c file, and noticed that nginx uses the endian- and alignment-neutral version. An idea quickly popped into my mind: what if I combined the original and the endian- & alignment-neutral versions, so the code could switch to the original whenever the platform is little-endian?

Then another problem arose: is there a compiler-independent way to detect the endianness of a CPU? I know GCC, <sys/param.h>, and <endian.h> provide macros for detecting endianness, but I think it is unfair to penalize users with a slower algorithm just because they aren't using GCC, aren't on a Unix-like OS, or don't have the GNU C library. That means I had to figure out another way to detect the endianness.
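For context, this is roughly what the glibc-specific approach would look like - a minimal sketch of the kind of check I decided against, since it only works where <endian.h> and these macros exist:

#include <endian.h>

#if __BYTE_ORDER == __LITTLE_ENDIAN
    /* take the fast, word-at-a-time path */
#else
    /* fall back to the byte-by-byte path */
#endif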

While I was browsing around Tengine's repository on GitHub (largely because I was bored), I noticed a file called endianness - a small script that checks the endianness of the CPU and defines the NGX_HAVE_LITTLE_ENDIAN macro if the CPU is little-endian.

Thanks to that, I quickly wrote a proof of concept of my improved version and a tester for it, with the original version serving as the baseline. This time I used a 50000000-character string, as recommended by my professor, Chris Tyler:


#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <sys/types.h>   /* for the u_char typedef */

#define DATA_LENGTH 50000000
#define BILLION 1000000000L

/* original version, as in ngx_murmurhash.c: endian- and alignment-neutral */
uint32_t
ngx_murmur_hash2(u_char *data, size_t len)
{
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4)
    {
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len)
    {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

/* returns 0 on a little-endian CPU, 1 otherwise - the same test Tengine's
   configure-time "endianness" script performs */
int ngx_check_endianness()
{
    int i = 0x11223344;
    char *p;

    p = (char *) &i;
    if (*p == 0x44) return 0;
    return 1;
}

/* modified version: on a little-endian CPU, read four bytes at once instead
   of assembling k byte by byte */
uint32_t
ngx_murmur_hash2_mod(u_char *data, size_t len)
{
    uint32_t h, k;

    h = 0 ^ len;

    while (len >= 4)
    {
        if(ngx_check_endianness() == 0)
        {
            k = *((uint32_t *)data);
        }
        else
        {
            k = data[0];
            k |= data[1] << 8;
            k |= data[2] << 16;
            k |= data[3] << 24;
        }
        
        k *= 0x5bd1e995;
        k ^= k >> 24;
        k *= 0x5bd1e995;

        h *= 0x5bd1e995;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch (len)
    {
    case 3:
        h ^= data[2] << 16;
        /* fall through */
    case 2:
        h ^= data[1] << 8;
        /* fall through */
    case 1:
        h ^= data[0];
        h *= 0x5bd1e995;
    }

    h ^= h >> 13;
    h *= 0x5bd1e995;
    h ^= h >> 15;

    return h;
}

int main()
{
    u_char tmp;
    u_char *data;
    struct timespec tval_before, tval_after;

    srand(time(NULL));

    data = calloc(DATA_LENGTH, sizeof(u_char));
    for (int i = 0; i < DATA_LENGTH; i++)
    {
        tmp = (u_char)((rand() % 26) + 'a');
        data[i] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &tval_before);
    uint32_t result = ngx_murmur_hash2(data, sizeof(u_char) * DATA_LENGTH);
    clock_gettime(CLOCK_MONOTONIC, &tval_after);
    uint64_t timedif = BILLION * (tval_after.tv_sec - tval_before.tv_sec) + tval_after.tv_nsec - tval_before.tv_nsec;
    printf("original version took %llu nanoseconds\n", (long long unsigned int) timedif);

    struct timespec tval_before2, tval_after2;
    clock_gettime(CLOCK_MONOTONIC, &tval_before2);
    uint32_t result2 = ngx_murmur_hash2_mod(data, sizeof(u_char) * DATA_LENGTH);
    clock_gettime(CLOCK_MONOTONIC, &tval_after2);
    uint64_t timedif2 = BILLION * (tval_after2.tv_sec - tval_before2.tv_sec) + tval_after2.tv_nsec - tval_before2.tv_nsec;
    printf("modified version took %llu nanoseconds\n", (long long unsigned int) timedif2);

    char match;

    if(result == result2)
    {
        match = 'y';
    }
    else
    {
        match = 'n';
    }

    printf("hash match: %c\ndifference between two versions: %lli nanoseconds\n", match, (long long int) timedif - timedif2);

    free(data);

    return 0;
}

(Note: since this is a proof of concept, I didn't use the macro directly; however, the ngx_check_endianness function performs exactly the same test Tengine uses to detect the CPU's endianness and create the macro. In the final version, of course, I will use the macro itself.)

I ran the tester on aarchie (AArch64), and the results were much better than I expected (my expectation was that the improved version would only be faster than the original by a couple of nanoseconds):

[GeoffWu@aarchie tester]$ ./tester
original version took 33387800 nanoseconds
modified version took 33328578 nanoseconds
hash match: y
difference between two versions: 59222 nanoseconds
[GeoffWu@aarchie tester]$ gprof ./tester
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 57.37      0.04     0.04                             ngx_murmur_hash2
 43.03      0.07     0.03                             ngx_murmur_hash2_mod

Since I was too lazy to type ./tester ten times, I modified the tester to run itself ten times, each time with a freshly generated random string.
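A minimal sketch of that change (the loop body is the same timing code shown above, so it is elided here):

    for (int run = 0; run < 10; run++)
    {
        /* generate a fresh random string for each run */
        for (int i = 0; i < DATA_LENGTH; i++)
        {
            data[i] = (u_char)((rand() % 26) + 'a');
        }

        /* ... time ngx_murmur_hash2() and ngx_murmur_hash2_mod(),
           compare the hashes, and print the results as before ... */
    }

With that change, the results on aarchie were: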

[GeoffWu@aarchie tester]$ ./tester
original version took 33476581 nanoseconds
modified version took 33345553 nanoseconds
hash match: y
difference between two versions: 131028 nanoseconds
original version took 33399119 nanoseconds
modified version took 33353913 nanoseconds
hash match: y
difference between two versions: 45206 nanoseconds
original version took 33400424 nanoseconds
modified version took 33327394 nanoseconds
hash match: y
difference between two versions: 73030 nanoseconds
original version took 33355609 nanoseconds
modified version took 33320899 nanoseconds
hash match: y
difference between two versions: 34710 nanoseconds
original version took 33369537 nanoseconds
modified version took 33380704 nanoseconds
hash match: y
difference between two versions: -11167 nanoseconds
original version took 33400528 nanoseconds
modified version took 33388808 nanoseconds
hash match: y
difference between two versions: 11720 nanoseconds
original version took 33404792 nanoseconds
modified version took 33388615 nanoseconds
hash match: y
difference between two versions: 16177 nanoseconds
original version took 33364073 nanoseconds
modified version took 33322090 nanoseconds
hash match: y
difference between two versions: 41983 nanoseconds
original version took 33376240 nanoseconds
modified version took 33391640 nanoseconds
hash match: y
difference between two versions: -15400 nanoseconds
original version took 33544634 nanoseconds
modified version took 33392719 nanoseconds
hash match: y
difference between two versions: 151915 nanoseconds

I also ran this tester on xerxes (x86_64), and this is the result:

[GeoffWu@xerxes tester]$ ./tester
original version took 24486375 nanoseconds
modified version took 24449778 nanoseconds
hash match: y
difference between two versions: 36597 nanoseconds
original version took 24482464 nanoseconds
modified version took 24497271 nanoseconds
hash match: y
difference between two versions: -14807 nanoseconds
original version took 24500344 nanoseconds
modified version took 24479112 nanoseconds
hash match: y
difference between two versions: 21232 nanoseconds
original version took 24509004 nanoseconds
modified version took 24478274 nanoseconds
hash match: y
difference between two versions: 30730 nanoseconds
original version took 24473804 nanoseconds
modified version took 24488610 nanoseconds
hash match: y
difference between two versions: -14806 nanoseconds
original version took 24471569 nanoseconds
modified version took 24454527 nanoseconds
hash match: y
difference between two versions: 17042 nanoseconds
original version took 24476877 nanoseconds
modified version took 24455086 nanoseconds
hash match: y
difference between two versions: 21791 nanoseconds
original version took 24473525 nanoseconds
modified version took 24444749 nanoseconds
hash match: y
difference between two versions: 28776 nanoseconds
original version took 24496712 nanoseconds
modified version took 24437485 nanoseconds
hash match: y
difference between two versions: 59227 nanoseconds
original version took 24500064 nanoseconds
modified version took 24445308 nanoseconds
hash match: y
difference between two versions: 54756 nanoseconds

Overall, the improved version is faster than the original (although there are several runs where the original comes out ahead; my speculation is that this is noise from using a new random string each time), and I consider that a success.

The final version of my improved Murmurhash can be found here. (Note: initially I only checked the CPU's endianness, but after reading this conversation on the nginx mailing list, I added a check for the NGX_HAVE_NONALIGNED macro as well, so the word-at-a-time read is only used when unaligned access is safe.)
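In other words, the fast path in the final version is only taken when both macros are defined - a minimal sketch of the combined check inside the hashing loop (not necessarily the exact final code):

#if (NGX_HAVE_LITTLE_ENDIAN && NGX_HAVE_NONALIGNED)
        k = *((uint32_t *) data);   /* little-endian and unaligned reads are safe */
#else
        k = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;
#endif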

A quick side note on Tengine's make test

This part is not necessarily relevant to the project, but I really think I should write it all down, so that when someone wants to contribute to Tengine and run the unit tests before opening a pull request, they won't have to spend a week figuring it out (like I did).

Tengine provides a (at least seemingly) convenient way to run all of its unit tests. According to the How to test page (written in Chinese) on its wiki, after you configure it, you just have to type make test to run all the unit tests. Looks pretty easy, doesn't it?

But as it turns out, it is much more complicated than I expected.

At first, I simply typed ./configure --prefix=/home/GeoffWu/app to create the Makefile, then ran make test. Since I hadn't implemented my changes yet, I expected the tests to pass, but they failed spectacularly.

After spending some time analyzing the error messages, I realized it was because I didn't include a module that is not compiled by default: the ngx_http_dyups_module and its Lua API. I also needed to include the lua-nginx-module (I used the one bundled with Tengine), as well as install either LuaJIT 2.0/2.1 or Lua 5.1, before compiling.

After installing LuaJIT and luajit-devel and making sure I included the required modules, I tried again. But this time, it couldn't even compile:

modules/ngx_http_lua_module/src/ngx_http_lua_headers.c: In function ‘ngx_http_lua_ngx_header_set’:
modules/ngx_http_lua_module/src/ngx_http_lua_headers.c:709:13: error: implicit declaration of function ‘luaL_getn’; did you mean ‘luaL_gsub’? [-Werror=implicit-function-declaration]
         n = luaL_getn(L, 3);
             ^~~~~~~~~
             luaL_gsub
cc1: all warnings being treated as errors
make[1]: *** [objs/Makefile:1469: objs/addon/src/ngx_http_lua_headers.o] Error 1
make[1]: Leaving directory '/home/GeoffWu/tengine'

For those who can't tell what this means: the Lua module bundled with Tengine is so old (it was released in 2015) that it still uses luaL_getn, which is deprecated and no longer declared in current Lua/LuaJIT headers. I ended up getting the unit tests to run by going to the official repo of ngx_http_lua_module, cloning it into my directory, and manually adding it to Tengine when compiling.

The exact command is: ./configure --prefix=/home/GeoffWu/app --with-http_dyups_module --with-http_dyups_lua_api --add-module=/home/GeoffWu/lua-nginx-module --with-luajit-inc=/usr/include/luajit-2.1 --with-luajit-lib=/usr/lib64
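As an aside, the underlying incompatibility is easy to see: newer code typically replaces the removed luaL_getn with something like the following compatibility shim (a minimal sketch; the macro name is my own, not necessarily how the upstream lua-nginx-module fixes it):

/* luaL_getn() is no longer available; lua_objlen() (Lua 5.1) and
   lua_rawlen() (Lua 5.2+) return the same table length */
#if LUA_VERSION_NUM >= 502
#define compat_table_len(L, idx)  ((int) lua_rawlen(L, idx))
#else
#define compat_table_len(L, idx)  ((int) lua_objlen(L, idx))
#endif

    /* then, in ngx_http_lua_ngx_header_set(), instead of n = luaL_getn(L, 3); */
    n = compat_table_len(L, 3);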
