mzip - short compression

mayank1234cmd · October 12, 2020, 11:00pm

Encoded=>浺楰⁩猠浡摥℠捯灹物杨琡⁣挭批⵳愡
Decoded=>mzip is made! copyright! cc-by-sa!
Yes, this software is licensed under CC BY-SA.
download
len(浺楰⁩猠浡摥℠捯灹物杨琡⁣挭批⵳愡) == 17
len(mzip is made! copyright! cc-by-sa!) == 34

mz("test") - encode
mz_decomp("瑥獴") - decode
mz_file("myfile.txt") - saves the mzipped output in myfile.txt.mz
mz_unzip("myfile.txt.mz") - saves the unzipped output in myfile.txt.mz.unzipped.mz (wow, pretty long filename)

Problem: it could only encode strings of even length ("hi"), not odd length ("h"). (fixed)
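
For readers wondering how a scheme like this could work, here is a minimal Python sketch. It is an assumption, not the actual mzip source: it guesses that each pair of ASCII characters is packed into one 16-bit code point, which reproduces the "test" example above. The `pad` parameter is a hypothetical way to handle odd-length input, so an odd-length string decodes with one extra trailing pad character.

```python
def mz(text: str, pad: str = " ") -> str:
    """Pack each pair of ASCII characters into one 16-bit code point (assumed scheme)."""
    data = text.encode("ascii")
    if len(data) % 2:
        # Odd length: append a pad byte so the bytes pair up (hypothetical fix).
        data += pad.encode("ascii")
    return "".join(
        chr((data[i] << 8) | data[i + 1]) for i in range(0, len(data), 2)
    )


def mz_decomp(packed: str) -> str:
    """Split each code point back into its two ASCII characters."""
    return "".join(chr(ord(c) >> 8) + chr(ord(c) & 0xFF) for c in packed)


print(mz("test"))
print(mz_decomp(mz("hi")))
```

The result has half as many characters, but that says nothing yet about how many bytes it takes on disk.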

ihack2712 · October 13, 2020, 1:53am

Hi! This idea is really cool! But I’m sorry to tell you that this is sadly not compression at all; if anything, it will increase the number of bytes used.

When you use characters that are not included in ASCII, you are most likely using Unicode. When Unicode text is encoded as UTF-8, only the first 128 characters in the set fit into one byte, because the first 1-5 bits of a byte are reserved to say how many bytes are used to encode a single character.

When encoding any of the first 128 characters in UTF-8, the byte looks like 0xxxxxxx; the 0 at the beginning simply says "I’m a complete character on my own". Unicode has far too many characters to fit into 8 bits, which is why we have UTF-8. What UTF-8 does is encode characters that are not in the ASCII character set as sequences of several bytes, while ASCII characters keep their single-byte form.

I’m not gonna go into specifics on how UTF-8 does this, but you can read about it here. Anyway, for UTF-8 to combine multiple bytes, it needs a way to tell whether a byte starts a new character or continues the current one. It does this by stealing the first 1-5 bits of each byte, and the last bit it steals is always 0: a prefix of 0 marks a single-byte character, 110, 1110, or 11110 marks the first byte of a 2-, 3-, or 4-byte character, and 10 marks a continuation byte.
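
To make the stolen bits visible, here is a small Python sketch that prints the UTF-8 bytes of a few characters in binary. The helper name `utf8_bits` is made up for this illustration.

```python
def utf8_bits(ch: str) -> list:
    """Return each UTF-8 byte of a character as an 8-bit binary string."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


for ch in ["h", "é", "浺"]:
    print(ch, utf8_bits(ch))
# h ['01101000']                          -> 1 byte, starts with 0
# é ['11000011', '10101001']              -> 2 bytes, starts with 110, then 10
# 浺 ['11100110', '10110101', '10111010']  -> 3 bytes, starts with 1110, then 10, 10
```

Every character produced by the encoder in the first post falls in the three-byte range, which is where the extra bytes come from.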

What you’re doing here is essentially using character codes larger than 255, and as I said, the idea is good, but the technical details just cannot be defined as compression because of how computers work and deal with text encoding.

Here is the difference between just using ASCII and using your program:

Your program actually increased the file size by 48.1% compared to the original content.
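
You can check this kind of result yourself with a rough Python sketch like the one below. It uses the sample sentence from the first post rather than the file measured above, so the exact percentage will differ, and it assumes the packed text is stored as UTF-8 and that the encoding pairs up ASCII bytes the way a UTF-16 big-endian decode would.

```python
def utf8_size(text: str) -> int:
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


original = "mzip is made! copyright! cc-by-sa!"
# Reinterpret pairs of ASCII bytes as 16-bit code points (assumed scheme).
packed = original.encode("ascii").decode("utf-16-be")

print(len(original), utf8_size(original))  # 34 characters, 34 bytes
print(len(packed), utf8_size(packed))      # 17 characters, 51 bytes

growth = (utf8_size(packed) / utf8_size(original) - 1) * 100
print(f"{growth:.1f}% larger")             # 50.0% larger
```

The character count is halved, but each packed character costs three bytes in UTF-8, so the stored size goes up instead of down.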

If you are interested in learning how text compression works, I recommend you have a look at this video by Tom Scott; it is well explained!
