Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Say I have a single-page application that uses a third party API for content. The app’s logic is in-browser only, and there is no backend I can write to.

To allow deep-linking into the state of the app, I use pushState to keep track of a few variables that determine the state of the app (note that Ubersicht’s public version doesn’t do this yet). In this case repos, labels, milestones and username, show_open (bool) and with_comments (bool) and without_comments (bool). The URL format is ?label=label_1,label_2,label_3&repos=repo_1…. Values are the usual suspects, roughly [a-zA-Z][a-zA-Z0-9_-], or any boolean indicator.

So far so good. Now since the query string can be a bit long and unwieldy and I would like to be able to pass around URLs like http://espy.github.io/ubersicht/?state=SOMOPAQUETOKENTHATLOSSLESSLYDECOMPRESSESINTOTHEORIGINALVALUES#hoodiehq, the shorter the better.

My first attempt was going to be using some zlib like algorithm for this (https://github.com/imaya/zlib.js) and @flipzagging pointed to antirez/smaz (https//github.com/antirez/smaz) which sounds more suitable for short strings (JavaScript version at https://github.com/personalcomputer/smaz.js).

Since = and & are not specifically handled in https://github.com/personalcomputer/smaz.js/blob/master/lib/smaz.js#L9, we might be able to tweak things a little there.

Furthermore, there is an option for encoding the values in a fixed table, e.g. the order of arguments is pre-defined and all we need to keep track of is the actual value. E.g. turn a=hamster&b=cat into 7hamster3cat (length+chars)or hamster|cat (value + |), potentially before the smaz compression.

Is there anything else I should be looking for?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
124 views
Welcome To Ask or Share your Answers For Others

1 Answer

A working solution putting various bits of good (or so I think) ideas together

I did this for fun, mainly because it gave me an opportunity to implement an Huffman encoder in PHP and I could not find a satisfactory existing implementation.

However, this might save you some time if you plan to explore a similar path.

Burrows-Wheeler+move-to-front+Huffman transform

I'm not quite sure BWT would be best suited for your kind of input.
This is no regular text, so recurring patterns would probably not occur as often as in source code or plain English.

Besides, a dynamic Huffman code would have to be passed along with the encoded data which, for very short input strings, would harm the compression gain badly.

I might well be wrong, in which case I would gladly see someone prove me to be.

Anyway, I decided to try another approach.

General principle

1) define a structure for your URL parameters and strip the constant part

for instance, starting from:

repos=aaa,bbb,ccc&
labels=ddd,eee,fff&
milestones=ggg,hhh,iii&
username=kkk&
show_open=0&
show_closed=1&
show_commented=1&
show_uncommented=0

extract:

aaa,bbb,ccc|ddd,eee,fff|ggg,hhh,iii|kkk|0110

where , and | act as string and/or field terminators, while boolean values don't need any.

2) define a static repartition of symbols based on the expected average input and derive a static Huffman code

Since transmitting a dynamic table would take more space than your initial string, I think the only way to achhieve any compression at all is to have a static huffman table.

However, you can use the structure of your data to your advantage to compute reasonable probabilities.

You can start with the repartition of letters in English or other languages and throw in a certain percentage of numbers and other punctuation signs.

Testing with a dynamic Huffman coding, I saw compression rates of 30 to 50%.

This means with a static table you can expect maybe a .6 compression factor (reducing the lenght of your data by 1/3), not much more.

3) convert this binary Huffmann code into something an URI can handle

The 70 regular ASCII 7 bits chars in that list

!'()*-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

would give you an expansion factor of about 30%, practically no better than a base64 encode.

A 30% expansion would ruin the gain from a static Huffman compression, so this is hardly an option!

However, since you control the encoding client and server side, you can use about anything that is not an URI reserved character.

An interesting possiblity would be to complete the above set up to 256 with whatever unicode glyphs, which would allow to encode your binary data with the same number of URI-compliant characters, thus replacing a painful and slow bunch of long integer divisions with a lightning fast table lookup.

Structure description

The codec is meant to be used both client and server side, so it is essential that server and clients share a common data structure definition.

Since the interface is likely to evolve, it seems wise to store a version number for upward compatibility.

The interface definition will use a very minimalistic description language, like so:

v   1               // version number (between 0 and 63)
a   en              // alphabet used (English)
o   10              // 10% of digits and other punctuation characters
f   1               // 1% of uncompressed "foreign" characters
s 15:3 repos        // list of expeced 3 strings of average length 15
s 10:3 labels
s 8:3  milestones
s 10   username     // single string of average length 10
b      show_open    // boolean value
b      show_closed
b      show_commented
b      show_uncommented

Each language supported will have a frequency table for all its used letters

digits and other computerish symbols like -, . or _ will have a global frequency, regardless of languages

separators (, and |) frequencies will be computed according to the number of lists and fields present in the structure.

All other "foreign" characters will be escaped with a specific code and encoded as plain UTF-8.

Implementation

The bidirectional conversion path is as follows:

list of fields <-> UTF-8 data stream <-> huffman codes <-> URI

Here is the main codec

include ('class.huffman.codec.php');
class IRI_prm_codec
{
    // available characters for IRI translation
    static private $translator = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzàá??????èéê?ìí??D?òó???ùú?üYT?àáa?????èéê?ìí??e?òó???ùú?üyt?āā????????????????ēē??????ěě??????????????īī????????????????????????ń???ň???ōō????????????????????????????ūū??????????????????????????";

    const VERSION_LEN = 6; // version number between 0 and 63

    // ========================================================================
    // constructs an encoder
    // ========================================================================
    public function __construct ($config)
    {
        $num_record_terminators = 0;
        $num_record_separators = 0;
        $num_text_sym = 0;

        // parse config file
        $lines = file($config, FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
        foreach ($lines as $line)
        {
            list ($code, $val) = preg_split('/s+/', $line, 2);
            switch ($code)
            {
            case 'v': $this->version = intval($val); break;
            case 'a': $alphabet = $val; break;
            case 'o': $percent_others = $val; break;
            case 'f': $percent_foreign = $val; break;
            case 'b':
                $this->type[$val] = 'b';
                break;
            case 's':
                list ($val, $field) = preg_split('/s+/u', $val, 2);
                @list ($len,$num) = explode (':', $val);
                if (!$num) $num=1;
                $this->type[$field] = 's';
                $num_record_terminators++;
                $num_record_separators+=$num-1;
                $num_text_sym += $num*$len;
                break;

            default: throw new Exception ("Invalid config parameter $code");
            }
        }

        // compute symbol frequencies           
        $total = $num_record_terminators + $num_record_separators + $num_text_sym + 1;

        $num_chars = $num_text_sym * (100-($percent_others+$percent_foreign))/100;
        $num_sym = $num_text_sym * $percent_others/100;
        $num_foreign = $num_text_sym * $percent_foreign/100;

        $this->get_frequencies ($alphabet, $num_chars/$total);
        $this->set_frequencies (" .-_0123456789", $num_sym/$total);
        $this->set_frequencies ("|", $num_record_terminators/$total);
        $this->set_frequencies (",", $num_record_separators/$total);
        $this->set_frequencies ("1", $num_foreign/$total);
        $this->set_frequencies ("", 1/$total);

        // create Huffman codec
        $this->huffman = new Huffman_codec();
        $this->huffman->make_code ($this->frequency);
    }

    // ------------------------------------------------------------------------
    // grab letter frequencies for a given language
    // ------------------------------------------------------------------------
    private function get_frequencies ($lang, $coef)
    {
        $coef /= 100;
        $frequs = file("$lang.dat", FILE_IGNORE_NEW_LINES|FILE_SKIP_EMPTY_LINES);
        foreach ($frequs as $line)
        {
            $vals = explode (" ", $line);
            $this->frequency[$vals[0]] = floatval ($vals[1]) * $coef;
        }
    }

    // ------------------------------------------------------------------------
    // set a given frequency for a group of symbols
    // ------------------------------------------------------------------------
    private function set_frequencies ($symbols, $coef)
    {
        $coef /= strlen ($symbols);
        for ($i = 0 ; $i != strlen($symbols) ; $i++) $this->frequency[$symbols[$i]] = $coef;
    }

    // ========================================================================
    // encodes a parameter block
    // ========================================================================
    public function encode($input)
    {
        // get back input values
        $bools = '';
        foreach (get_object_vars($input) as $prop => $val)
        {
            if (!isset ($this->type[$prop])) throw new Exception ("unknown property $prop");
            switch ($this->type[$prop])
            {
            case 'b': $bools .= $val ? '1' : '0'; break;
            case 's': $strings[] = $val; break;
            default: throw new Exception ("Uh oh... type ".$this->type[$prop]." not handled ?!?");
            }
        }

        // set version number and boolean values in front
        $prefix = sprintf ("%0".self::VERSION_LEN."b$bools", $this->version);

        // pass strings to our Huffman encoder
        $strings = implode ("|", $strings);
        $huff = $this->huffman->encode ($strings, $prefix, "UTF-8");

        // translate into IRI characters
        mb_internal_encoding("UTF-8");
        $res = '';
        for ($i = 0 ; $i != strlen($huff) ; $i++) $res .= mb_substr (self::$translator, ord($huff[$i]), 1);

        // done
        return $res;
    }

    // ========================================================================
    // decodes an IRI string into a lambda object
    // ========================================================================
    public function decode($input)
    {
        // convert IRI characters to binary
        mb_internal_encoding("UTF-8");
        $raw = '';
        $len = mb_strlen ($input);
        for ($i = 0 ; $i != $len ; $i++)
        {
            $c = mb_substr ($input, 0, 1);
            $input = mb_substr ($input, 1);
            $raw .= chr(mb_strpos (self::$translator, $c));
        }

        $this->bin = '';        

        // check version
        $version = $this->read_bits ($raw, self::VERSION_LEN);
        if ($version != $this->version) throw new Exception ("Version mismatch: expected {$this->version}, found $version");

        // read booleans
        foreach ($this->type as $field => $type)
            if ($type == 'b')
                $res->$field = $this->read_bits ($raw, 1) != 0;

        // decode strings
        $strings = explode ('|', $this->huffman->decode ($raw, $this->bin));
        $i = 0;
        foreach ($this->type as $field => $type) 
            if ($type == 's')
                $res->$field = $strings[$i++];

        // done
        return $res;
    }

    // ------------------------------------------------------------------------
    // reads raw bit blocks from a binary string
    // ------------------------------------------------------------------------
    private function read_bits (&$raw, $len)
    {
        while (strlen($this->bin) < $len)
        {
            if ($raw == '') throw new Exception ("premature end of input"); 
            $this->bin .= sprintf ("%08b", ord($raw[0]));
            $raw = substr($raw, 1);
        }
        $res = bindec (substr($this->bin, 0, $len));
        $this->bin = substr ($this->bin, $len);

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...