Thursday, December 6, 2007

Shifty Bits in Python

I have been working on a small side project which requires some Huffman coding for compression. Unfortunately the algorithm used is a bit custom, so I could not find a way to do it with libraries that are currently available.

Luckily I have an example of the algorithm in C so I decided to try porting it to Python. This seemed simple enough as I have a bit of a C/C++ background and the code was just a lookup table, a couple of loops, and some bit shifts, so how hard could this be?

I had the initial port of the code done fairly quickly, but it didn't work. I banged my head against the code for a while twiddling with various bits until I realized what was going on. Python doesn't overflow bitwise operators, so when you shift left, the bits do not overflow off into the ether as they would in C.

As an example, say you have an unsigned char in C and its value is 150. In binary that would look like 10010110. If you shift that left 1 place, you end up with 00101100 which is the value 44. Note that the 1 bit on the very left disappeared because it overflowed. Now in Python if you shift it left 1 place you end up with 100101101 which is the value 300. As you can see, Python will let the value get as large as it needs to be which can be both a blessing and a curse.

I finally came up with the following code which would more or less "cast" a value from one size to another. In this example it "casts" a 4-byte representation of the number to a 1-byte representation.

value = struct.unpack('B', struct.pack('L', value)[:1])[0]

In peices:

struct.pack('L', value) - Create a byte-encoded version of the number. The number 300 would look like '\x2c\x01\x00\x00'

[:1] - Get the first byte from the byte string. In this case it is '\x2c'

struct.unpack('B', ...) - Create a new number from the byte string that is and unsigned byte. In this case 150.

[0] - Since struct.unpack returns a tuple even if there is one item, this extracts the first item from the tuple.

If you were to want to get an unsigned short int (2 bytes) from an unsigned int (4 bytes), then you would just have to replace 'B' with 'I' and change [:1] to [:2].

This is a bit of a hack, but at least the Huffman coding algorithm now works. Eventually I will probably make this a C module for speed, but for now I just need to have something working as I experiment.

I'm posting this as a reminder to myself in hope that history doesn't repeat itself, and on the off chance that someone else might be running into the same problem.

Update: A couple of friends chimed in with a much better idea. All you need to do is perform a logical and with 0xFF (add as many F's as you need), and that's it. I don't know what I was thinking! Thanks guys for the ideas!


Mark said...

Hey Chuck,

Couldn't you do something like ANDing the result with 11111110 to eliminate the rotated bit. That would seem to be faster than unpacking/packing the result.


Edwin said...

My thoughts exactly.

value &= 0xff

should do the trick.

Chuck said...

Man I don't know what I was thinking (that is what I get for trying to code some stuff in the middle of the night). You guys are right... I have updated the post, and thanks for the tips.