LBFGS never converges in large dimensions in pytorch

Problem description

I am playing with Rule 110 of the Wolfram cellular automata. Given a line of zeroes and ones, you can calculate the next line with these rules:

[Image: Rule 110 update rules table]

Starting with 00000000....1 (all zeroes with a single 1 at the end), you get this sequence:

[Image: evolution of Rule 110 starting from a single 1]

Just out of curiosity, I decided to approximate these rules with a polynomial, so that cells could be not only 0 and 1 but also a gray value in between:

def triangle(x,y,z,v0):
    # v is the next-cell value implied by the neighborhood (x, y, z);
    # the squared residual against the actual next cell v0 is zero iff the rule holds
    v=(y + y * y + y * y * y - 3. * (1. + x) * y * z + z * (1. + z + z * z)) / 3.
    return (v-v0)*(v-v0)

so if x, y, z and v0 match any of the rules from the table, it returns 0, and a positive nonzero value otherwise.
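As a quick sanity check (my addition, not part of the original post), one can enumerate all eight neighborhoods and confirm that the polynomial reproduces the Rule 110 table exactly; the RULE_110 dictionary below is purely for illustration:

RULE_110 = {(1,1,1): 0, (1,1,0): 1, (1,0,1): 1, (1,0,0): 0,
            (0,1,1): 1, (0,1,0): 1, (0,0,1): 1, (0,0,0): 0}

for (a, b, c), out in RULE_110.items():
    # residual is exactly 0 on table entries, strictly positive otherwise
    assert triangle(a, b, c, out) == 0.
    assert triangle(a, b, c, 1 - out) > 0.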

Next I added all possible groups of 4 neighbors into a single sum, which will be zero for integer solutions:

def eval():
    # sum of squared Rule 110 residuals over every neighborhood of the W*W board
    # (columns are treated as periodic via the modulo indexing)
    s = 0.
    for i in range(W - 1):
        for j in range(1, W + 1):
            xx = x[i, (j - 1) % W]   # left neighbor
            yy = x[i, j % W]         # center cell
            zz = x[i, (j + 1) % W]   # right neighbor
            r = x[i + 1, j % W]      # cell on the next line
            s += triangle(xx, yy, zz, r)
    # conditions for the first line: all zeroes except the last cell, which is 1
    for j in range(W - 1): s += x[0, j] * x[0, j]
    s += (1 - x[0, W - 1]) * (1 - x[0, W - 1])
    return torch.sqrt(s)

Also, at the bottom of this function I add the ordinary conditions for the first line, so that all elements are 0 except the last one, which is 1. Finally I decided to minimize this sum of squares over a W*W matrix with PyTorch:

import torch
from torch.autograd import Variable

# W*W board of unknowns, optimized in double precision
x = Variable(torch.DoubleTensor(W, W).zero_(), requires_grad=True)
opt = torch.optim.LBFGS([x], lr=.1)
for i in range(15500):
    def closure():
        opt.zero_grad()
        s = eval()
        s.backward()
        return s
    opt.step(closure)

Here is the full code, you can try it yourself. The problem is that for a 10*10 board it converges to the correct solution in ~20 steps:

[Image: converged 10*10 board]

But if I take a 15*15 board, it never finishes converging:

[Image: 15*15 board (left) and sum of squares vs. iteration (right)]

The graph on the right shows how the sum of squares changes with each iteration, and you can see that it never reaches zero. My question is why this happens and how I can fix it. I tried different PyTorch optimisers, but all of them perform worse than LBFGS. I also tried different learning rates. Any ideas why this happens and how I can reach the final point during optimisation?

UPD: improved convergence graph, log of the sum of squares:

[Image: log of the sum of squares vs. iteration]

UPD2: I also tried doing the same in C++ with dlib, and I don't have any convergence issues there; it goes much deeper in much less time:

[Image: dlib convergence graph]

I am using this code for optimisation in C++:

find_min_using_approximate_derivatives(bfgs_search_strategy(),
        objective_delta_stop_strategy(1e-87),
        s, x, -1)

Tags: python, tensorflow, pytorch, nonlinear-optimization

Solution


What you are trying to do here is non-convex optimization, which is a notoriously hard problem. It makes sense once you think about it, because just about any practical mathematical problem can be formulated as an optimization problem.

1. Prelude
So, before hinting at where you can find the solution to your particular problem, I want to illustrate why certain optimization problems are easy to solve.

I'll start by discussing convex problems. These are easy to solve even in the constrained case, and the reason is that when you compute the gradient you actually get a lot of information about where the minimum is (the first-order Taylor expansion of a convex function f is always an underestimate of f), and in addition there is only a single minimum and no saddle points. If you are interested in learning more about convex optimization, I recommend checking out Stephen Boyd's convex optimization course on YouTube.
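To make the underestimate property concrete (my own elaboration, not part of the answer), the first-order condition for convexity is

    f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \text{for all } x, y,

so every gradient evaluation hands you a global lower bound on f, which is exactly the extra information a convex solver gets to exploit.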

So if non-convex optimization is so hard, how come we can solve it in deep learning? The answer is simply that the non-convex functions we minimize in deep learning happen to be quite nice, as demonstrated by Henaff et al.

Machine learning practitioners therefore have to be aware that the procedures used in deep learning will most likely not yield a good minimum on other non-convex problems, if they converge to a minimum at all.

2. Answering your question
Now to answer your question: no, you probably won't find a fast solution, because non-convex optimization is NP-hard. But don't despair, SciPy has several global optimization algorithms to choose from. Here is a link to another Stack Overflow thread that gives a good answer to your question.
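As one possible illustration (a sketch of my own, not code from the answer, and assuming the x, W and eval from the question are already in scope), the PyTorch objective can be wrapped so that SciPy's basinhopping drives it with a local L-BFGS-B minimizer:

import numpy as np
from scipy.optimize import basinhopping

def objective(flat):
    # load the candidate board into the existing tensor and evaluate the loss
    with torch.no_grad():
        x.copy_(torch.from_numpy(flat.reshape(W, W)))
    return float(eval())

result = basinhopping(objective,
                      x0=np.zeros(W * W),
                      minimizer_kwargs={"method": "L-BFGS-B"},
                      niter=50)
print(result.fun)

The global step repeatedly perturbs the variables and restarts the local solver, which is one way around getting stuck in a single basin.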

3. Moral of the story
Finally, I want to remind you that convergence guarantees matter; forgetting about them can lead to collapsed oil rigs.

PS. Please forgive the typos, I am writing this on my phone.

Update: as to why BFGS works with dlib, there could be two reasons: first, BFGS is better at using curvature information than L-BFGS, and second, it uses a line search to find an optimal step size. I would recommend checking whether PyTorch allows line search, and if not, setting a decreasing step size (or just a very low one).
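For what it's worth, recent versions of torch.optim.LBFGS do accept a line_search_fn argument; a minimal sketch of how it could be wired into the code from the question (the hyperparameter values are illustrative, not taken from either post):

opt = torch.optim.LBFGS([x],
                        lr=1.,                        # the line search chooses the actual step
                        max_iter=100,                 # inner L-BFGS iterations per .step() call
                        history_size=100,             # keep more curvature pairs
                        line_search_fn="strong_wolfe")

def closure():
    opt.zero_grad()
    s = eval()
    s.backward()
    return s

for i in range(200):
    opt.step(closure)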

