Working with bits in PHP

Say I want to store a sequence of 8 words in PHP, and I don't want to use compression.

Since there are only 8 words, I could assign each one a binary value and then store these binary values in a file instead of the ASCII words.

The possible binary values would be:

000, 001, 010, 011, 100, 101, 110, 111

This would be much more efficient to parse because (1) each word is now the same size and (2) it takes up much less space.

My question is:

How can I do this in PHP? How can I assign a binary value to something, then write this to a file (writing the bits how I want them), then read this back again?

The reason I want to do this is to create an efficient indexing system.

First, if you want to compress data, use PHP's built-in functions for that, such as the gzip functions of the zlib extension.

But as you requested, I've prepared an example of how this can be done in PHP. It is not perfect, just a trivial implementation. The compression rate could be better if I also used the two spare bits (bits 31 and 32) of each integer; maybe I will add this feature later. I've used 32-bit unsigned integers rather than bytes because ten 3-bit codes fit into 30 bits, so the loss is 2 bits per 32 bits instead of 2 bits per byte.
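The arithmetic behind that choice can be sketched on its own. Ten 3-bit codes occupy 30 of the 32 bits in one integer, so only 2 bits per 4 bytes go unused, whereas a single byte holds just two whole codes (6 bits) and wastes 2 bits every byte. The helper names `packTenCodes` and `unpackTenCodes` are made up for this illustration and are not part of the functions below.

```php
<?php

// Hypothetical helper: pack exactly ten 3-bit codes (0..7) into one
// 32-bit big-endian unsigned integer (4 bytes).
function packTenCodes(array $codes): string
{
    $carrier = 0;
    foreach ($codes as $code) {
        $carrier = ($carrier << 3) | ($code & 7); // append the next 3 bits
    }
    return pack('N', $carrier);
}

// Hypothetical helper: the reverse of packTenCodes().
function unpackTenCodes(string $bytes): array
{
    $carrier = unpack('N', $bytes)[1]; // unpack() returns a 1-based array
    $codes = [];
    for ($i = 0; $i < 10; $i++) {
        array_unshift($codes, $carrier & 7); // the lowest 3 bits hold the last code
        $carrier >>= 3;
    }
    return $codes;
}

$codes = [1, 7, 2, 7, 3, 7, 4, 7, 5, 7];
$bytes = packTenCodes($codes);

echo strlen($bytes), PHP_EOL;                       // 4
echo implode(',', unpackTenCodes($bytes)), PHP_EOL; // 1,7,2,7,3,7,4,7,5,7
```

Ten codes stored one per byte would need 10 bytes; here they need 4.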

First we prepare the lookup table that maps each word to a one-byte code; it's the coding table:

<?php

// coding table
$lookupTable = array (
//  'word0' => chr(0), // code 0 is reserved: it pads the last integer when fewer than 10 codes remain
    'word1' => chr(1),
    'word2' => chr(2),
    'word3' => chr(3),
    'word4' => chr(4),
    'word5' => chr(5),
    'word6' => chr(6),
    // reserve one word for white space
    ' ' => chr(7)
);
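To illustrate what the table does, here is a minimal sketch of the translation step in isolation. `strtr()` called with an array replaces every key it finds with the corresponding value, and `array_flip()` produces the reverse mapping. The cut-down table `$demoTable` is an assumption for illustration only.

```php
<?php

// Cut-down, illustrative table: each word maps to a single byte.
$demoTable = array(
    'word1' => chr(1),
    'word2' => chr(2),
    ' '     => chr(7),
);

$encoded = strtr('word1 word2', $demoTable);
echo strlen($encoded), PHP_EOL;   // 3: one byte per word or space
echo bin2hex($encoded), PHP_EOL;  // 010702

// array_flip() swaps keys and values, giving the decoding table.
$decoded = strtr($encoded, array_flip($demoTable));
echo $decoded, PHP_EOL;           // word1 word2
```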

Then comes the compression function:

/**
 * Packs the text into 3-bit codes, ten codes per 32-bit integer.
 * Assumes $text consists only of keys of $lookupTable.
 */
function _3bit_compress($text, $lookupTable) {

    echo 'before compression                  : ' . strlen($text) . ' chars', PHP_EOL;

    // first step: one byte per word, using the lookup table
    $text = strtr($text, $lookupTable);
    echo 'after one byte per word compression : ' . strlen($text) . ' chars', PHP_EOL;

    $bin = ''; // the result
    $carrier = 0; // a 32-bit unsigned int can 'carry' 10 words in 3-bit notation

    for($c = 0; $c < strlen($text); $c++) {
        // every 10 codes (30 bits) we append the 4-byte unsigned integer to $bin;
        // pack('N', ...) writes it as 32 bits in big-endian byte order
        if($c > 0 && $c % 10 === 0) {
            $bin .= pack('N', $carrier);
            $carrier = 0;
        }

        $char = $text[$c];
        $carrier <<= 3; // make space for the next 3 bits
        $carrier += ord($char); // add the next 3-bit pattern
        // echo $carrier, ' added ' . ord($char), PHP_EOL;
    }
    $bin .= pack('N', $carrier); // don't forget the remaining bits
    echo 'after 3 bit compression             : ' . strlen($bin) . ' chars', PHP_EOL;
    return $bin;
}

And the decompression function:

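/**
 * Reverses _3bit_compress(): extracts ten 3-bit codes from each
 * 32-bit integer and translates them back into words.
 */
function _3_bit_uncompress($compressed, $lookupTable) {
    $len = strlen($compressed);
    echo 'compressed length:            : ' . $len . ' chars', PHP_EOL;

    $text = '';
    // unpack the string as 4-byte unsigned big-endian integers
    foreach(unpack('N*', $compressed) as $carrier) {
        $tmp = '';
        for($i = 0; $i < 10; $i++) {
            $code = $carrier & 7; // the lowest 3 bits hold the last code not yet read
            // echo $carrier . ' ' . $code, PHP_EOL;
            $tmp = chr($code) . $tmp; // codes come out in reverse order, so prepend
            $carrier >>= 3; // shift the next 3 bits into place
        }
        $text .= $tmp;
    }
    // code 0 is never assigned to a word; it only appears as padding when
    // the last integer holds fewer than 10 codes, so drop it
    $text = str_replace(chr(0), '', $text);
    // reverse translate from codes to words
    return strtr($text, array_flip($lookupTable));
}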

Now it's time to test the functions :)

$original = <<<EOF
word1 word2 word3 word4 word5 word6 word1 word3 word3  word2
EOF;


$compressed = _3bit_compress($original, $lookupTable);
$restored = _3_bit_uncompress($compressed, $lookupTable);

echo 'compressed size: ' . round(strlen($compressed) * 100 / strlen($original), 2) . '%', PHP_EOL;

echo 'Message before compression  : ' . $original, PHP_EOL;
echo 'Message after decompression : ' . $restored, PHP_EOL;

The example should give you:

before compression                  : 60 chars
after one byte per word compression : 20 chars
after 3 bit compression             : 8 chars
compressed length:            : 8 chars
compressed size: 13,33%
Message before compression  : word1 word2 word3 word4 word5 word6 word1 word3 word3  word2
Message after decompression : word1 word2 word3 word4 word5 word6 word1 word3 word3  word2

If we test with loooong words, the compression rate will of course be even better:

before compression                  : 112 chars
after one byte per word compression : 16 chars
after 3 bit compression             : 8 chars
compressed length:            : 8 chars
compressed size: 7,14%
Message before compression  : wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3 wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3 
Message after decompression : wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3 wooooooooord1 wooooooooord2 wooooooooord2 wooooooooord3
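
The question also asked about writing the bits to a file and reading them back. Since the compressed result is an ordinary binary string, a minimal sketch could use `file_put_contents()` and `file_get_contents()`. The temporary file and the two sample integers are assumptions for illustration; in practice the string would come from the compression function.

```php
<?php

// Two sample 32-bit values, packed big-endian like the carriers above.
$binary = pack('NN', 0x12345678, 42);

// Hypothetical scratch file; any writable path would do.
$path = tempnam(sys_get_temp_dir(), 'bits');

file_put_contents($path, $binary);     // writes the raw bytes unchanged
$readBack = file_get_contents($path);  // reads them back as a binary string
unlink($path);

// unpack('N*') yields a 1-based array, so re-index it from 0.
$values = array_values(unpack('N*', $readBack));

echo strlen($readBack), PHP_EOL;       // 8
echo implode(',', $values), PHP_EOL;
```

Because every integer occupies exactly 4 bytes, the n-th group of ten words starts at byte offset 4 * n, which is what makes fixed-width codes convenient for indexing.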