June 2020 - Skywind Inside

Fixed-Point Optimization: Multiplying Performance

June 20th, 2020 skywind No comments

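Fixed-point numbers are nothing new. Back when CPU floating-point performance was weak, fixed-point tricks were used heavily in the hot paths of all kinds of graphics and image processing code. CPU floating-point performance has since caught up, but in many cases integer arithmetic is still faster than floating point, which is why you can still find fixed-point numbers in many libraries such as libcairo (the gnome/quartz backends) and pixman. So today let's see how much faster fixed-point numbers can actually be.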

For simple use, the following few macros are enough:

#include <stdint.h>

/* 16.16 fixed point: 16 integer bits and 16 fraction bits in an int32_t */
typedef int32_t cfixed;

/* conversions to and from int, float and double */
#define cfixed_from_int(i)      (((cfixed)(i)) << 16)
#define cfixed_from_float(x)    ((cfixed)((x) * 65536.0f))
#define cfixed_from_double(d)   ((cfixed)((d) * 65536.0))
#define cfixed_to_int(f)        ((f) >> 16)
#define cfixed_to_float(x)      ((float)((x) / 65536.0f))
#define cfixed_to_double(f)     ((double)((f) / 65536.0))
/* constants: 1.0, 0.5, the smallest step (1/65536), and 1.0 minus that step */
#define cfixed_const_1          (cfixed_from_int(1))
#define cfixed_const_half       (cfixed_const_1 >> 1)
#define cfixed_const_e          ((cfixed)(1))
#define cfixed_const_1_m_e      (cfixed_const_1 - cfixed_const_e)
/* fraction part, floor and ceil (each result is still a cfixed) */
#define cfixed_frac(f)          ((f) & cfixed_const_1_m_e)
#define cfixed_floor(f)         ((f) & (~cfixed_const_1_m_e))
#define cfixed_ceil(f)          (cfixed_floor((f) + 0xffff))
/* multiply and divide through a 64-bit intermediate */
#define cfixed_mul(x, y)        ((cfixed)((((int64_t)(x)) * (y)) >> 16))
#define cfixed_div(x, y)        ((cfixed)((((int64_t)(x)) << 16) / (y)))
/* representable range of the raw 32-bit value */
#define cfixed_const_max        ((int64_t)0x7fffffff)
#define cfixed_const_min        (-((((int64_t)1) << 31)))

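Type-safety enthusiasts can write these as inline functions, and encapsulation enthusiasts can wrap them in a set of operator xx overloads. If you need higher precision, change the 16.16 fixed-point format stored in an int32_t into a 32.32 format stored in an int64_t.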
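As a rough illustration of the inline-function style, here is a minimal sketch. The `fx16` type and the `fx16_*` function names are hypothetical and chosen only for this example; the arithmetic mirrors the 16.16 macros. It assumes a compiler where right-shifting a negative signed integer is an arithmetic shift, which is the case on mainstream compilers but is implementation-defined in C.

```c
#include <stdint.h>

typedef int32_t fx16;  /* hypothetical name for a 16.16 fixed-point value */

/* Conversions. Multiplying by 65536 avoids left-shifting a negative value. */
static inline fx16 fx16_from_int(int i)       { return (fx16)(i * 65536); }
static inline fx16 fx16_from_double(double d) { return (fx16)(d * 65536.0); }
static inline double fx16_to_double(fx16 f)   { return (double)f / 65536.0; }

/* Drops the 16 fraction bits; with an arithmetic shift this rounds toward
   negative infinity (floor), unlike a float-to-int cast. */
static inline int fx16_to_int(fx16 f)         { return f >> 16; }

/* The product of two 16.16 values has 32 fraction bits, so widen to
   64 bits first and shift 16 of them away. */
static inline fx16 fx16_mul(fx16 a, fx16 b) {
  return (fx16)(((int64_t)a * b) >> 16);
}

/* Pre-scale the dividend by 2^16 so the quotient keeps 16 fraction bits. */
static inline fx16 fx16_div(fx16 a, fx16 b) {
  return (fx16)(((int64_t)a * 65536) / b);
}
```

For example, 1.5 is stored as 98304 and 2 as 131072; their product through `fx16_mul` is 196608, which is 3 in 16.16. Note that `fx16_to_int` floors, so -0.5 becomes -1, whereas `(int)(-0.5f)` gives 0.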

Now let's pick a floating-point example to optimize, say the ARGBAffineRow_C function in libyuv:

void ARGBAffineRow_C(const uint8_t* src_argb,
                     int src_argb_stride,
                     uint8_t* dst_argb,
                     const float* uv_dudv,
                     int width) {
  int i;
  // Render a row of pixels from source into a buffer.
  float uv[2];
  uv[0] = uv_dudv[0];
  uv[1] = uv_dudv[1];
  for (i = 0; i < width; ++i) {
    int x = (int)(uv[0]);
    int y = (int)(uv[1]);
    *(uint32_t*)(dst_argb) = *(const uint32_t*)(src_argb + y * src_argb_stride + x * 4);
    dst_argb += 4;
    uv[0] += uv_dudv[2];
    uv[1] += uv_dudv[3];
  }
}

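What is this function for? It performs affine transformations on images. For example, in a 2D graphics library or in ActionScript you can assign a 3×3 matrix to a Bitmap and have the Bitmap drawn transformed by that matrix.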

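Essentially all scaling, rotation, and skewing of 2D images is done through affine transformations. This function starts at the point (u, v) in the source image, samples with step (du, dv), and stores the samples in a temporary buffer so that the next step can write the whole row to the frame buffer in one go.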

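This sampling function has a few characteristics:

Simple arithmetic: there are no complicated operations, the calculations stay in range, and no functions such as log/exp are needed.
Bounded range: most image widths and heights are below 32768, so 16.16 fixed point is enough.
Frequent conversion: every pixel's coordinates have to be converted from float to int, which is expensive.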

That makes it a good fit for a simple fixed-point rewrite. (Click Read more to expand.)
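The author's own rewrite is behind the Read more link and is not reproduced here. Purely as an illustration of the idea, a minimal sketch might look like the following. The function name `ARGBAffineRow_Fixed` is hypothetical and is not part of libyuv. The sketch assumes non-negative coordinates below 32768, and it assumes the caller keeps every sampled (x, y) inside the source image, just as the original function does.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t cfixed;
#define cfixed_from_float(x)    ((cfixed)((x) * 65536.0f))
#define cfixed_to_int(f)        ((f) >> 16)

/* Hypothetical fixed-point variant of ARGBAffineRow_C (not a libyuv API). */
void ARGBAffineRow_Fixed(const uint8_t* src_argb,
                         int src_argb_stride,
                         uint8_t* dst_argb,
                         const float* uv_dudv,
                         int width) {
  /* Convert the start point and the step to 16.16 once, outside the loop. */
  cfixed u = cfixed_from_float(uv_dudv[0]);
  cfixed v = cfixed_from_float(uv_dudv[1]);
  cfixed du = cfixed_from_float(uv_dudv[2]);
  cfixed dv = cfixed_from_float(uv_dudv[3]);
  int i;
  for (i = 0; i < width; ++i) {
    /* Per pixel: two shifts replace two float-to-int conversions. */
    int x = cfixed_to_int(u);
    int y = cfixed_to_int(v);
    /* memcpy copies one 4-byte ARGB pixel without alignment assumptions. */
    memcpy(dst_argb, src_argb + y * src_argb_stride + x * 4, 4);
    dst_argb += 4;
    u += du;
    v += dv;
  }
}
```

The step is quantized to multiples of 1/65536, so over a very long row the sampled positions can drift slightly from the float version; for steps that are exactly representable, such as 0.5, the two produce the same pixels.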

Fast Division by 255: How Fast Can It Really Be?

June 13th, 2020 skywind No comments

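True gold does not fear the fire. Earlier, in 《C 语言有什么奇技淫巧？》 ("What clever tricks are there in C?"), I gave a formula for fast integer division by 255: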

#define div_255_fast(x)    (((x) + (((x) + 257) >> 8)) >> 8)
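A quick way to gain confidence in the formula is to compare it with a plain `x / 255` over every input it is likely to see. The sketch below checks all values from 0 to 65535. The helper names `div_255_fast_mismatches` and `blend_channel` are made up for this example, and alpha blending is assumed here as a typical use case rather than taken from the text above.

```c
#include <stdint.h>

#define div_255_fast(x)    (((x) + (((x) + 257) >> 8)) >> 8)

/* Counts the inputs in [0, 65535] where the macro disagrees with x / 255. */
int div_255_fast_mismatches(void) {
  int mismatches = 0;
  int x;
  for (x = 0; x <= 65535; ++x) {
    if (div_255_fast(x) != x / 255) {
      ++mismatches;
    }
  }
  return mismatches;
}

/* Assumed use case: blend one 8-bit channel with an 8-bit alpha. */
uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha) {
  /* The weighted sum is at most 255 * 255 = 65025, inside the checked range. */
  int t = src * alpha + dst * (255 - alpha);
  return (uint8_t)div_255_fast(t);
}
```

The formula is not valid for arbitrarily large inputs: it first goes wrong at x = 65790, so it should only be used where the operand is known to stay within the 16-bit range.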

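Someone felt it was not much faster and posted a benchmark to show it:

In that benchmark, red is the time taken by the fast division by 255, and it does look only slightly faster. Is that really the case?

Not so. Simply changing long long j to int j in the test case gives a considerable speedup: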

Link: http://quick-bench.com/t3Y2-b4isYIwnKwMaPQi3n9dmtQ

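This is the real performance of the fast division.

The author of the original benchmark uses int everywhere else, but here deliberately switched to 64 bits to line up with the plain / 255, which introduces a confounding factor and produces a slower result. Was that bashing for the sake of bashing, or was there some other reason?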

The way the compiler implements / 255 is to replace x / 255 with the fixed-point multiplication x * (1/255):

(Click Read more to expand.)
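The reciprocal-multiply idea can be written out by hand. The sketch below is an illustration for 32-bit unsigned operands, using the constant ceil(2^39 / 255) = 0x80808081 and a shift by 39. The function name `div_255_by_reciprocal` is hypothetical, and the exact instruction sequence a particular compiler emits may differ from this.

```c
#include <stdint.h>

/* x / 255 computed as a multiply by a scaled reciprocal plus a shift.
   0x80808081 is ceil(2^39 / 255); the 64-bit product holds the quotient
   in its upper bits, and shifting right by 39 discards the fraction. */
uint32_t div_255_by_reciprocal(uint32_t x) {
  return (uint32_t)(((uint64_t)x * 0x80808081ULL) >> 39);
}
```

Because this path needs a widening multiply, the width of the operand matters, which fits the observation above that switching the benchmark variable between long long and int changes the result.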

Fast Range Checks: Yet Another Way to Write Them

June 10th, 2020 skywind No comments

C has countless tricks. In 《C 语言有什么奇技淫巧？》 ("What clever tricks are there in C?") I gave a formula for fast range checking, which turns:

if (x >= minx && x <= maxx) ...

into:

if (((x - minx) | (maxx - x)) >= 0) ...

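which can double the performance. I also mentioned that if 99% of your data falls outside the range, sticking with && is fastest. Today I'll introduce another way to write it that has more balanced performance and still does well in the worst case: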

if ((unsigned)(x - minx) <= (unsigned)(maxx - minx)) ...

This formula performs more evenly across all kinds of test data. Type-safety enthusiasts can write it as:

if (((unsigned)x - (unsigned)minx) <= ((unsigned)maxx - (unsigned)minx)) ...

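It uses a single unsigned integer wraparound to cut down on instructions and branches. In the ordinary case this formula is still nearly twice as fast:

Link: http://quick-bench.com/EbCR9psA3lUEhpn8bYLwVtJ-FWk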
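To make the three forms easy to compare, here is a minimal sketch that wraps each one in a function. The `in_range_*` names are made up for this example. All three assume minx <= maxx, and the OR form additionally assumes that x - minx and maxx - x do not overflow a signed int.

```c
#include <stdbool.h>

/* Baseline: two comparisons joined by a short-circuit &&. */
bool in_range_and(int x, int minx, int maxx) {
  return x >= minx && x <= maxx;
}

/* OR form: either difference being negative sets the sign bit of the result. */
bool in_range_or(int x, int minx, int maxx) {
  return ((x - minx) | (maxx - x)) >= 0;
}

/* Unsigned form: if x < minx, the subtraction wraps around to a huge
   unsigned value that is greater than maxx - minx, so one comparison
   rejects both "too small" and "too large". */
bool in_range_unsigned(int x, int minx, int maxx) {
  return ((unsigned)x - (unsigned)minx) <= ((unsigned)maxx - (unsigned)minx);
}
```

For example, with minx = 0 and maxx = 9, x = -1 becomes the largest unsigned value after the cast and subtraction, which is far above 9, so the check fails without a second comparison.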

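Why do I say it has the best overall performance? Does it only apply to certain special cases? How does it do in the ordinary case? How do the assembly instructions differ? What is the theoretical basis? Does it only work on x86 and break on other platforms? The answers follow in order:

(Click Read more to expand.)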
