What are the cases where it'll be useful to compress already compressed data?

Question

As best I understand, the basic principle behind data compression is searching for repeated patterns and getting rid of the duplicates found, so the end result cannot be compressed any further without data loss, and if that is attempted anyway, the data size will increase instead of shrinking. But then there's, for example, ssh compression, which (when ssh is used as a proxy) supposedly speeds up even already gzip-compressed and https-encrypted internet traffic. How and why does it work (if it does)? Can a compressed file be compressed again without data loss via some magic? What are the use cases where that can actually happen, and where would it be useful?

Tags: compression

Answer


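Usually only when the first compression reached, or at least approached, the maximum compression ratio of the compression format. That requires highly redundant data as the uncompressed input. When you approach the maximum compression ratio, some redundancy remains in the compressed data.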

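A simple example is deflate, whose maximum compression ratio is 1032:1. If I start with a billion (10^9) zero bytes, the first compression with gzip reduces that to 970501 bytes, for a ratio of 1030.4:1. That result is itself mostly zeros, so a second compression takes it down to 2476 bytes, for a ratio of 394.8:1. (I am subtracting the gzip header and trailer to compute the ratios.) That is still redundant, though it is no longer long strings of zeros. It compresses a third time to 298 bytes, for a ratio of 8.78:1.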

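An attempt at a fourth compression results in a larger output, as you would normally get when trying to compress already compressed data. That is what happens most of the time, since normal compressed data is indistinguishable from random data to the compressor.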
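The experiment above can be approximated with a short Python sketch. It is not the original gzip run: it assumes raw deflate via the standard `zlib` module (so there is no header or trailer to subtract), uses a smaller input of 10^7 zero bytes to keep it quick, and the helper names `raw_deflate`, `raw_inflate` and `recompress_until_growth` are made up for illustration. With a smaller input, the number of passes that still shrink the data may differ from the three reported above.

```python
import zlib


def raw_deflate(data: bytes, level: int = 9) -> bytes:
    """Compress with raw deflate (wbits=-15: no zlib/gzip header or trailer)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(data: bytes) -> bytes:
    """Inverse of raw_deflate, to show that every pass is lossless."""
    return zlib.decompress(data, -15)


def recompress_until_growth(data: bytes) -> list[int]:
    """Compress repeatedly and record sizes, stopping at the first pass
    whose output is not smaller than its input (that size is recorded too)."""
    sizes = [len(data)]
    while True:
        data = raw_deflate(data)
        sizes.append(len(data))
        if sizes[-1] >= sizes[-2]:
            return sizes


if __name__ == "__main__":
    # Hypothetical smaller stand-in for the billion zero bytes in the answer.
    sizes = recompress_until_growth(bytes(10**7))
    for n, (before, after) in enumerate(zip(sizes, sizes[1:]), start=1):
        print(f"pass {n}: {before} -> {after} bytes ({before / after:.2f}:1)")
```

The last line printed is the pass where the output stops shrinking, which is the point at which further compression is no longer worthwhile.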

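A second compression of already compressed data through ssh/sshd would almost never speed things up. It would only slow them down, not only because of the small expansion of the data, but also because of the time it takes to do the compression.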
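A rough way to see both costs is to time a zlib pass over data that already looks random. This is only a sketch under stated assumptions: `os.urandom` stands in for compressed or encrypted traffic, `zlib.compress` stands in for the compression ssh would apply, and `second_pass_cost` is a hypothetical helper name. It does not measure a real ssh connection.

```python
import os
import time
import zlib


def second_pass_cost(payload: bytes, level: int = 6) -> tuple[int, float]:
    """Return (bytes added, seconds spent) when compressing payload again."""
    start = time.perf_counter()
    recompressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    return len(recompressed) - len(payload), elapsed


if __name__ == "__main__":
    # Assumption: random bytes approximate already compressed/encrypted data.
    payload = os.urandom(1_000_000)
    growth, elapsed = second_pass_cost(payload)
    print(f"output grew by {growth} bytes and took {elapsed * 1000:.1f} ms")
```

The growth is small because deflate falls back to storing the data uncompressed, but the time spent is pure overhead on every transfer.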

