Engineyard Competition
Recently I saw on some news site a competition to calculate SHA1 collisions. Naturally, I tried it out, but used Python, my preferred language. Unfortunately, my Python implementation was significantly slower than others using C (which was smoot since both C and Python implementations were completely dwarfed by CUDA implementations). I entered the competition after it was already in progress, and managed to get a pretty good score, but that was with an insane amount of computers crunching it.
So that leads me back to the topic of this post. Convinced that I had more hardware than anyone else in the competition, and that even though I started with less than 12 hours to beat everyone, I was pretty sure I should have won (before I saw the CUDA performance). So I started rewriting my hastily written Python program in C. The competition was already over, so this was purely for self improvement.
A Little Rusty
My first programming language was C, and I was an expert in it when I was 13 or so (probably 1992). Since then, my attention was driven towards Python and PHP, as the defacto languages of the web. This made me realize exactly how much I had forgotten about plain old C (my C++ was not horrible, atleast I could still understand C++ code written by decent programmers). Anyhow, part of the competition included taking numbers and comparing how much each of the 160 bits varied (ie 1011 and 1001 differ by 1 bit, 1100 and 1111 differ by 2 bits etc). So I wanted to make a C function that converts any arbitrary integer to a binary string (which really wasn’t the goal of the exercise, as the hashes were in hex, but it made me try and remember how to convert an integer as it had been so long).
Conclusion to a Very Boring Story
So I ended up writing an int2binary function and a program that converts them and prints them out. It will break with very large number, but I included a function to estimate it’s size (always going to the next number divisible by 4 as I like that padding), so linking that to a malloc would be trivial. I don’t suggest this is run anywhere as it hasn’t been really tested or code reviewed, but I like my answer better than most of the answers I found on the net, so I will share it with you, and hope to get a higher google score than the ones I don’t like:
int2bin.c
#include <stdio.h>
void int2binary(unsigned int n, char *buffer, unsigned int buffer_size);
unsigned int buffer_sizer(unsigned int n);
int main(int argc, char **argv)
{
char buffer[2048];
unsigned int binaryNumbers[9] = { 1, 4, 6, 8, 10, 55, 123, 2000, 2048 };
unsigned int i = 0;
for(i = 0; i < 9; i++)
{
int2binary(binaryNumbers[i], buffer, (buffer_sizer(binaryNumbers[i]) + 1));
printf("%4u: %s\n", binaryNumbers[i], buffer);
}
return 0;
}
void int2binary(unsigned int n, char *buffer, unsigned int buffer_size)
{
unsigned int i = (buffer_size - 1);
buffer[i] = '\0';
while(i > 0)
{
if(n & 0x01)
{
buffer[--i] = '1';
} else
{
buffer[--i] = '0';
}
n >>= 1;
}
}
unsigned int buffer_sizer(unsigned int n)
{
unsigned int x = 1;
unsigned int i = 0;
unsigned int lastFound = 0;
// Takes care of n = 0, and we dont need to loop for 1st case
if(n < 16)
{
return 4;
}
while(1)
{
i++;
if(n / x)
{
lastFound = i;
} else
{
break;
}
x <<= 1;
}
// Not a multiple of 4, how much do we need to add to it
if(lastFound % 4)
{
lastFound += (4 - (lastFound % 4));
}
return lastFound;
}
As you can see I did a pretty poor job commenting it and naming variables, basically though in int2binary we need to start from the right hand side and work our way left, so we start variable i at the end of the string and work our way back. We use & to compare variable n and 0×01 (which means that if the rightmost position has a 1, we add a 1 to the buffer, else add a 0 to the buffer).
For buffer sizer, this just tried to take a guess at how long that string should be. If the integer is less than 16 it will always be 4, so no need to do anything else (I was in recursive mode for a different algorithm so I was cutting things off at beginning mode). Otherwise we loop through, and try to divide variable n by variable x. Each time it is ran, variable x is shifted to the left, which has the effect of squaring the number. This is also super fast on the cpu.
For people not in C development mode, didn’t make a make file, but to compile it:
gcc int2bin.c
Output should be:
[sharms@sparrow ~]$ ./a.out 1: 0001 4: 0100 6: 0110 8: 1000 10: 1010 55: 00110111 123: 01111011 2000: 011111010000 2048: 100000000000
Related posts:
#1 by Joachim Nilsson on July 23, 2009 - 5:57 pm
Quote
http://www.faqs.org/qa/qa-430.html
#2 by Jon on July 23, 2009 - 6:23 pm
Quote
If speed is what you’re seeking, the nibble counter (buffer sizer) is slower than it needs to be. Each loop has: increment, divide, compare, shift. The division is unneeded, and if you only want nibble count (not bit count), you’re doing four times too many loops. Why not something more like:
unsigned int r;
for (r = 0; v; v >>= 4, r += 4) /* do nothing */ ;
return r;
> If the integer is less than 16 it will always be 4, so no need to do anything else (I was in
> recursive mode for a different algorithm so I was cutting things off at beginning mode).
You’ve just introduced a comparison that will shortcut (for random inputs) 1/16th of all cases. The loop above will catch it in: one comparison (first check of v), one shift, one addition, one comparison (last check of v). Yours has one extra operation that is wasted for 15/16ths of all input. The above has two extra operations that are wasted for 1/16ths of all input.
> Each time it is ran, variable x is shifted to the left, which has the
> effect of squaring the number.
Not squared, but multiplied by two.
I applaud you for writing the code! I’m always happy to see people experimenting in source.
#3 by martin on July 23, 2009 - 6:43 pm
Quote
#include
#include
#include
#include
#define ARRAY_SIZE(x) sizeof(x)/sizeof(*x)
const char* inttobin(uint64_t num) {
static char bin[64 + 1] = { 0 };
int i = 0;
memset(bin, ' ', 64);
while (num > 0) {
bin[64 - 1 - i++] = '0' + num % 2;
num >>= 1;
}
return bin;
}
int main(void) {
uint64_t nums[] = { 1, 4, 6, 8, 10, 55, 123, 2000, 2048, 070707070, UINT64_MAX - UINT_MAX };
for (size_t i = 0; i < ARRAY_SIZE(nums); ++i)
printf("%22lu == %s\n", nums[i], inttobin(nums[i]));
return 0;
}
#4 by Haskell Prime on July 23, 2009 - 7:31 pm
Quote
type Bit = Int
int2bin :: Int -> [Bit]
int2bin = unfold (== 0) (`mod`2)(`div`2)
#5 by Kasper Henriksen on July 24, 2009 - 4:00 am
Quote
If you wanted to know how many bits in two integers are different, there should be a much more efficient way of counting that in C and that is with bitwise XOR.
Here’s a short example with just ints:
int countbits(int i) {
int res = 0;
while (i != 0) {
i &= i-1; /* Exercise: What does this do and why does it work? */
res++;
}
return res;
}
int diffbits(int n, int m) {
int diff = n ^ m; /* bitwise xor */
return countbits(diff);
}
Needs a bit more work if you want to do it with a 256-bit hash, of course.
#6 by Malcolm Parsons on July 24, 2009 - 6:24 am
Quote
int bits = __builtin_popcount( x ^ y );
#7 by puccha on July 24, 2009 - 6:31 am
Quote
I’m not sure how fast C’s logaritm implementation is but it seems a more logical than a loop for counting the number of bits.
pseudo (untested):
function nrOfBinDigits(int i) -> int
{
return (int) log(i, 2) + 1
}
Replace ’2′ with any other desired base.
#8 by sharms on July 24, 2009 - 12:55 pm
Quote
@martin – nice I like it
#9 by Andy on July 24, 2009 - 11:50 pm
Quote
puccha – log will be very slow, as it will involve a conversion to floating point and back.
However, since this is an integer log base 2, you can do smarter things (see http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup for some examples)
#10 by puccha on July 25, 2009 - 5:05 am
Quote
@Andy,
While log indeed has a slow nature, it’s still faster* than Steven’s first aproach. And I wouldn’t be surprised if the standard library has a pretty efficient log2 implementation.
Ofcourse other implementations are faster, a simple branching and shifting should do the trick. This way just seemed more natural to me, because it is a more mathematical expression.
* I did a (poorly fundemented) test:
for integers i from 0 to 100,000,000
log function: 15 sec
stevens function: 20 sec
#11 by trixie on July 27, 2009 - 7:34 am
Quote
can you make it very easy…hehehe