# Solve Traveling Salesperson Problem With Reinforcement Learning - Part (2)

Last updated: Apr 10, 2019

This post is about Neural Combinatorial Optimization that tackles NP-hard combinatorial optimization problems using recurrent neural networks and reinforcement learning.

Most NP-hard combinatorial problems are tackled by algorithms that rely on handcrafted heuristics. The performance of such heuristics varies wildly depending on the conditions of problem instances. Hence, it is natural to seek a machine learning approach that can potentially generalize across a variety of problem instances or even different problems.

One may ask “why RL over supervised learning?” The paper gives three reasons for applying RL to NP-hard combinatorial optimization problems. First, it is difficult to perform supervised learning as it is hard to obtain globally optimal labels. If it weren’t hard, it would imply we already know how to solve the problems efficiently. Second, even if we use labels generated by heuristics, the performance of a supervised model will be tied to those local optima and will not surpass them. Third, RL may find solutions different from the existing ones, hopefully at a competitive performance. One more reason I would add is that RL may be preferable to supervised learning when one must solve a sequence of problems over time where problem instances are not independent. That is the kind of problem RL can potentially excel at.

Of course, RL requires a good reward function, which is hard to engineer manually in practice. If a misspecified reward function is used, all sorts of weird behaviors can come about. However, in our problem, a reward function arises naturally out of the problem definition: the total distance of a tour.

### Problem formulation

Now, we prepare the ingredients for RL. We first modify the TSP formulation a bit. For the sake of simplicity, we consider the 2D Euclidean TSP. Again, we are given a set of $n$ cities, $\mathbf{x} = \{ x_i \}_{i=1}^n$ with each city $x_i \in \mathbb{R}^2$. Our aim is to learn a stochastic policy modeled as:

$$p(\pi \vert \mathbf{x}; W) = \prod_{i=1}^{n} p(\pi_i \vert \pi_{<i}, \mathbf{x}; W),$$

where $W$ denotes the model parameters and $\pi = (\pi_1, \dots, \pi_n)$ is a permutation (tour) of the cities. The cost function is given by $L_{\mathbf{x}}(\pi) = \sum_{i=1}^{n-1} \lVert \pi_i - \pi_{i+1} \rVert_2 + \lVert \pi_n - \pi_1 \rVert_2$, identifying $\pi_i$ with the coordinates of the $i$-th visited city. We have defined a model that assigns probabilities over the space of tours. Note the identity follows from the probability chain rule.
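To make the cost concrete, here is a minimal numpy sketch of $L_{\mathbf{x}}(\pi)$ (the function name and the toy instance are mine):

```python
import numpy as np

def tour_length(x, pi):
    """Total Euclidean length of the tour visiting cities x (n x 2) in order pi."""
    ordered = x[pi]
    # Sum consecutive leg lengths, then close the loop back to the first city.
    legs = np.linalg.norm(np.diff(ordered, axis=0), axis=1)
    closing = np.linalg.norm(ordered[-1] - ordered[0])
    return legs.sum() + closing

# Unit square: the perimeter tour has length 4.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(tour_length(square, np.array([0, 1, 2, 3])))  # → 4.0
```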

### Sequence-to-Sequence Models

At this point, you may notice the factorized formulation (the right-hand side) is almost the same as that of sequence-to-sequence models. Sequence-to-sequence models use Recurrent Neural Networks (RNNs) to accept input sequences of varying lengths (you can train one model for an arbitrary number of cities) and predict output sequences potentially of different lengths (not so relevant to our problem).

In case you are not familiar, an RNN is a neural network that takes two types of inputs at each step: the actual input given by the problem (e.g. a word in a given sentence) and a state computed by the network itself at the previous step. The latter (also called a hidden state) is computed by $h_t = f(x_t, h_{t-1})$ at each time step $t =1,2,\dots$ where $f$ is usually just a composition of an affine transformation and an activation function. An RNN is nice since $h$ can carry information from the past (at the superficial level…).
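A bare-bones numpy sketch of the recurrence $h_t = f(x_t, h_{t-1})$, with $f$ an affine map followed by a tanh (all sizes and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 2, 8  # input size (city coordinates) and hidden size; both illustrative

# f is an affine transformation followed by a tanh nonlinearity.
W_x = rng.normal(scale=0.1, size=(d_hid, d_in))
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_hid)                      # initial hidden state
for x_t in rng.normal(size=(5, d_in)):   # a length-5 input sequence
    h = rnn_step(x_t, h)                 # h carries information from the past
print(h.shape)  # (8,)
```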

Sequence-to-sequence models usually consist of two phases, encoding and decoding, with typically one RNN for each. Let $\mathbf{e} = (e_1,\dots, e_n), \mathbf{d} = (d_1, \dots, d_m)$ denote the $d$-dimensional hidden states ($h_t$) of an encoder and a decoder, respectively. For encoding, you just feed components (tokens) of an input sequence to an RNN one at a time to collect $\mathbf{e}$. For decoding, you add one more computation to each $d_t$: $p(y \vert x_{\le t}) = \text{softmax}(W_y d_t + b_y)$ where $y$ is the output token.
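The decoding-side softmax can be sketched as follows (toy sizes and random weights; note the output dictionary size is fixed a priori here, which is exactly the limitation discussed next):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, vocab = 8, 5              # hidden size and fixed output dictionary size
rng = np.random.default_rng(1)
W_y, b_y = rng.normal(size=(vocab, d)), np.zeros(vocab)
d_t = rng.normal(size=d)     # a decoder hidden state at step t

p = softmax(W_y @ d_t + b_y) # p(y | x_{<=t}) over the fixed vocabulary
print(p.sum())               # probabilities sum to 1
```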

Can we apply sequence-to-sequence models to our problem out of the box? The answer is no. Unlike the typical seq-to-seq formulation, in our problem the output dictionary is fully determined by a given input sequence: i.e. $\pi_i \in \mathbf{x}, \forall i$. By the output dictionary I mean the set of values $p(\pi_i \vert \pi_{<i}, \mathbf{x})$ can take (ignoring feasibility of a tour for now). Contrast this with a machine translation task from English to French, where the output dictionary (the French vocabulary) is fixed a priori and determined independently of the input.

### Pointer Networks

Now, how can we adjust output dictionary sizes depending on given input sequences? Clearly, we do not want to build separate models for different $n$. One simple idea called Pointer Networks has been proposed. The mechanism is similar to the attention model. At each decoding step, Pointer Networks compute a softmax distribution over all input hidden states:

$$u_j^i = v^\top \tanh(W_1 e_j + W_2 d_i), \qquad p(\pi_i \vert \pi_{<i}, \mathbf{x}) = \text{softmax}(u^i),$$

where $v, W_1, W_2$ are learnable parameters. Notice $u_j^i$ assigns a scalar score to input position $j$ at the $i$-th decoding step: essentially, how much attention to pay to each city at each decoding time. This simple formulation, by design, allows output dictionary sizes to change according to input sequences.
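A numpy sketch of the pointing computation, using the additive attention form $u_j^i = v^\top \tanh(W_1 e_j + W_2 d_i)$ (sizes and random weights are purely illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
n, d = 6, 8                        # number of cities and hidden size (illustrative)
e_states = rng.normal(size=(n, d)) # encoder hidden states e_1 .. e_n
d_i = rng.normal(size=d)           # decoder hidden state at decoding step i

# Attention parameters: u_j^i = v^T tanh(W1 e_j + W2 d_i)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

u = np.array([v @ np.tanh(W1 @ e_j + W2 @ d_i) for e_j in e_states])
p = softmax(u)  # distribution over the n input positions ("pointing")
print(p.shape)  # (6,)
```

The output dictionary size equals the input length $n$, so one set of parameters serves instances of any size.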

*Figure: Seq-to-Seq models vs. Pointer Networks.*

### Neural Combinatorial Optimization with RL

Now, we can finally talk about the main paper. What it does is basically Pointer Networks plus RL. First, we make a minor adjustment to the pointing mechanism above so that our model predicts feasible tours only (no city can be selected twice): at the $i$-th step, we iterate over the cities and assign $u_j^i = -\infty$ to every city $j$ that has already been selected, so its softmax probability becomes zero.
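A minimal sketch of the masking step (the function name is mine):

```python
import numpy as np

def masked_pointer_probs(u, visited):
    """Softmax over attention scores u, with already-visited cities masked out."""
    u = u.copy()
    u[list(visited)] = -np.inf   # -inf attention => zero probability after softmax
    e = np.exp(u - u[np.isfinite(u)].max())
    return e / e.sum()

u = np.array([1.0, 2.0, 0.5, 1.5])
p = masked_pointer_probs(u, visited={0, 2})
print(p)  # cities 0 and 2 get probability exactly 0
```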

From above, we derive the policy $\mathbf{X} \mapsto \Pi$:

$$p(\pi \vert \mathbf{x}; W) = \prod_{i=1}^{n} p(\pi_i \vert \pi_{<i}, \mathbf{x}; W),$$

where each factor is the masked softmax over the remaining cities.

The setup is similar to a multi-armed bandit problem. You have a bandit with $n!$ arms and can pull only one lever at each time. For each instance of a problem (an input graph of cities), you choose one permutation. (Technically, the dimension of $\mathbf{x}$ varies, but we do not want to introduce an infinite sequence…)

#### Policy Optimization with REINFORCE

Now how do we optimize the policy? There are various ways. The paper suggested a model-free policy gradient algorithm called REINFORCE. The idea is quite simple. You have a (possibly non-convex) objective function and want to use a first-order (gradient-based) method that performs a local search. All you need to obtain is, well, a gradient of your objective with respect to the parameters of your policy. Concretely, consider $J(W \vert \mathbf{x}) = \mathbb{E}_{\pi \sim p(\cdot \vert \mathbf{x};W)} \left [ L_{\mathbf{x}}(\pi) \right ]$, the expected tour length under the current policy. Assume there exists some unknown distribution $d(\mathbf{x})$ that generates problem instances $\mathbf{x} \sim d(\mathbf{x})$. Finally, we set our overall objective to be $J(W) = \mathbb{E}_{\mathbf{x}} \left[ J(W \vert \mathbf{x}) \right]$.

Now, how do we compute $\nabla_W J(W \vert \mathbf{x})$? We need the gradient of an expectation, which feels hard at first encounter. In fact, this task is so common in machine learning that there is a special name for it: the log derivative trick.

$$\nabla_W J(W \vert \mathbf{x}) = \nabla_W \mathbb{E}_{\pi \sim p(\cdot \vert \mathbf{x};W)} \left[ L_{\mathbf{x}}(\pi) \right] = \mathbb{E}_{\pi} \left[ L_{\mathbf{x}}(\pi) \nabla_W \log p(\pi \vert \mathbf{x}; W) \right] \approx \frac{1}{B} \sum_{j=1}^{B} L_{\mathbf{x}}(\pi^{(j)}) \nabla_W \log p(\pi^{(j)} \vert \mathbf{x}; W),$$

where $\pi^{(1)}, \dots, \pi^{(B)}$ are tours sampled from $p(\cdot \vert \mathbf{x}; W)$.

We observe the final line is not only an unbiased estimate of the gradient but also easy to evaluate. Given a predicted tour $\pi$, evaluating $L_{\mathbf{x}}(\pi)$ is cheap, and as long as we keep our model a composition of differentiable operations, obtaining $\nabla_W \log{p(\pi^{(j)} \vert \mathbf{x};W)}$ is a matter of applying the derivative chain rule.
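On a problem small enough that the exact gradient is computable, the REINFORCE estimate can be checked directly. A sketch (the three "tours", their costs, and the tiny softmax policy are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.2, -0.1, 0.4])   # logits of a tiny 3-outcome policy
cost = np.array([3.0, 1.0, 2.0])     # L_x(pi) for each outcome ("tour")

# Exact gradient of E[L] w.r.t. theta (tractable because the space is tiny).
p = softmax(theta)
true_grad = p * cost - p * (p @ cost)

# REINFORCE: average of L(pi^(j)) * grad log p(pi^(j)) over B samples.
# For a softmax policy, grad_theta log p(a) = one_hot(a) - p.
B = 200_000
samples = rng.choice(3, size=B, p=p)
grad_log = np.eye(3)[samples] - p
estimate = (cost[samples][:, None] * grad_log).mean(axis=0)

print(np.abs(estimate - true_grad).max())  # small for large B
```

The estimator is unbiased but noisy, which motivates the baseline refinement below.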

Now, let us make one last refinement to our policy gradient estimate. Notice the gradient estimate is multiplied by $L_{\mathbf{x}}(\pi)$. This is not desirable since larger problem instances will tend to have higher costs, regardless of $\pi$, than smaller ones: e.g. tours over 10 cities will likely be shorter than tours over 1,000 cities. Hence, larger problems will have a disproportionately larger effect during training.

Can we adjust the cost term according to some baseline that reflects the intrinsic difficulty of $\mathbf{x}$? The trick is to observe $\mathbb{E}_{\pi} \left [ \nabla_W \log{p(\pi\vert \mathbf{x};W)} \right ] = 0$, which is not hard to prove. It means we can subtract any baseline $b(\mathbf{x})$ that does not depend on $\pi$ without biasing the gradient:

$$\nabla_W J(W \vert \mathbf{x}) = \mathbb{E}_{\pi} \left[ \left( L_{\mathbf{x}}(\pi) - b(\mathbf{x}) \right) \nabla_W \log p(\pi \vert \mathbf{x}; W) \right].$$
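The zero-mean property of the score function can be verified exactly for a small softmax policy (a toy sketch; all names are mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.3, -0.2, 0.1, 0.5])
p = softmax(theta)

# For a softmax policy, grad_theta log p(a) = one_hot(a) - p.
score = np.eye(len(p)) - p          # row a holds grad log p(a)
expectation = p @ score             # E_pi[grad log p], computed exactly

print(np.abs(expectation).max())    # 0 up to floating point
```

Since the expectation is exactly zero, subtracting a $\pi$-independent baseline changes the variance of the estimator but not its mean.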

*Figure: Actor-Critic algorithm.*

Moreover, we can explicitly parameterize the baseline as $b(\mathbf{x}; W_v)$. This is called an Actor-Critic algorithm, where both the policy and the baseline have their own parameters. There are various approaches regarding what $b(\mathbf{x})$ should be. To remain unbiased, we only require $b(\mathbf{x})$ to be some function that does not depend on $\pi$. The choice is especially critical for non-linear models like neural networks; in that case, there is usually no convergence guarantee, and explicit tricks that stabilize training are required. For training the critic, we can use another objective function. We choose the mean squared error (MSE): $L_{W}(W_v) = \mathbb{E}_{\mathbf{x}, \pi}\left[ \left( b(\mathbf{x}; W_v) - L_{\mathbf{x}}(\pi) \right)^2 \right]$. Notice the cost depends on $W$, the current policy. The exact architecture of the critic can be found in the Appendix of the paper. Basically, it begins with an LSTM encoder, a special kind of RNN designed to maintain long-term information (an RNN is still needed to handle variable-length inputs). It is then followed by a process block, a series of attention-like computations over an LSTM hidden state. Finally, a few fully connected layers produce a scalar.

Now, how do we train our actor ($p(\pi\vert \mathbf{x};W)$) and critic ($b(\mathbf{x}; W_v)$)? You can just iteratively apply a first-order stochastic optimization method such as Adam on sampled problem instances: the actor follows the baseline-adjusted policy gradient, and the critic follows the gradient of the MSE loss.
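A toy actor-critic loop on a three-"tour" problem may make the interplay concrete. This is a stand-in for the full pointer network: plain SGD instead of Adam, a softmax over three fixed outcomes instead of a sequence model, and a scalar baseline instead of the paper's critic network — all of which are my simplifications:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

cost = np.array([3.0, 1.0, 2.0])  # per-outcome cost; outcome 1 is optimal
theta = np.zeros(3)               # actor parameters (logits)
b = 0.0                           # critic: a single scalar baseline here
lr_actor, lr_critic = 0.1, 0.1

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(3, p=p)
    advantage = cost[a] - b
    # Actor: REINFORCE step with baseline (descent, since we minimize cost).
    theta -= lr_actor * advantage * (np.eye(3)[a] - p)
    # Critic: MSE gradient step regressing b toward observed costs.
    b -= lr_critic * (b - cost[a])

print(softmax(theta))  # probability mass concentrates on the cheapest outcome
```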

#### Inference: Search Strategies

Generally speaking, doing inference with a multivariate output space (a sequence/tour) is computationally challenging. There are usually three types of questions you can ask of a probability distribution: random samples, the mode, and the mean. Samples of $\pi$ can be generated through $n$ softmax operations (pointing), where each softmax can be simulated by methods like inverse transform sampling, giving $O(n^2)$ per sampled tour.

To control the diversity of samples, we can introduce a temperature hyperparameter: $\text{softmax}(x/T)$. If $T > 1$, the logits are scaled down and the distribution becomes relatively flatter; if $T < 1$, the effect is reversed. The paper found a good $T$ by cross-validation with a grid search.
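A quick numerical check of the temperature effect (toy logits; `softmax_T` is my name):

```python
import numpy as np

def softmax_T(z, T=1.0):
    z = np.asarray(z, dtype=float) / T   # temperature rescales the logits
    e = np.exp(z - z.max())
    return e / e.sum()

u = np.array([2.0, 1.0, 0.5])
print(softmax_T(u, T=1.0).max())   # peaked distribution
print(softmax_T(u, T=5.0).max())   # flatter: max probability shrinks toward 1/3
```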

As for the mean, we can take Monte Carlo samples to estimate $\mathbb{E}_{\pi \sim p(\cdot \vert \mathbf{x}; W)} [\pi]$. As for the mode, we must solve $\arg\max_{\pi \in \Pi} p( \pi \vert \mathbf{x}; W)$. As we know, an exhaustive search is computationally infeasible, for it is $O(n!)$. Recall the factorization $p(\pi \vert \mathbf{x}; W) = \prod_{i=1}^{n} p(\pi_i \vert \pi_{<i}, \mathbf{x}; W)$, which has the form of a tree traversal. One of the easiest approaches is to take $n$ greedy decisions, costing $O(n^2)$. Another popular approach is beam search, where we keep the top $B$ partial sequences at each step, costing $O(n^2 B)$. Had we used supervised learning, the computational burden of mode inference during training would be reduced, as decoding can be parallelized by conditioning on ground-truth labels (teacher forcing). With reinforcement learning, we must unroll our mode prediction sequentially.
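A sketch of the $O(n^2)$ greedy decoding, with a fixed random score matrix standing in for the model's per-step attention scores (in the real model, `u[i]` would depend on the cities chosen so far):

```python
import numpy as np

def greedy_tour(u):
    """Greedy decoding: at each step, pick the highest-scoring unvisited city.
    u[i, j] stands in for the model's attention score for city j at step i."""
    n = u.shape[0]
    visited, tour = set(), []
    for i in range(n):
        scores = u[i].copy()
        scores[list(visited)] = -np.inf   # mask already-visited cities
        j = int(np.argmax(scores))        # O(n) per step => O(n^2) total
        tour.append(j)
        visited.add(j)
    return tour

u = np.random.default_rng(5).normal(size=(6, 6))
tour = greedy_tour(u)
print(sorted(tour))  # a valid permutation of 0..5
```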

Now, what kind of inference did the paper choose at test time? It still used sampling and suggested two sampling-driven search methods: (vanilla) sampling and Active Search. Given some test input $\mathbf{x}$, the (vanilla) sampling approach simply generates a batch of samples from a fixed distribution $p( \pi \vert \mathbf{x}; W)$ and selects the shortest one based on the cost function, which can be evaluated cheaply at $\Theta(n)$. Active Search, however, is a more diligent student: it generates batches of samples sequentially, one batch at a time, and actively updates the parameters based on the loss computed for each batch. Basically, it does extra training fine-tuned on a specific test input. The paper reports a significant performance boost of Active Search over vanilla sampling.
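A sketch of the vanilla sampling strategy, with uniformly random permutations standing in for samples from the trained model $p(\pi \vert \mathbf{x}; W)$ (the instance and batch size are made up):

```python
import numpy as np

rng = np.random.default_rng(6)

def tour_length(x, pi):
    """Theta(n) evaluation of the cost function L_x(pi)."""
    ordered = x[pi]
    return (np.linalg.norm(np.diff(ordered, axis=0), axis=1).sum()
            + np.linalg.norm(ordered[-1] - ordered[0]))

x = rng.random((8, 2))                       # a random 8-city instance
# Stand-in for the model: uniformly random permutations. The paper samples
# from the trained pointer network instead.
batch = [rng.permutation(8) for _ in range(128)]
best = min(batch, key=lambda pi: tour_length(x, pi))  # keep the shortest tour
print(tour_length(x, best) <= tour_length(x, batch[0]))  # → True
```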

Well, that is all. The model produced competitive experimental results both in terms of optimality and wall-clock inference time. It outperformed OR-Tools’ local search and Christofides’ algorithm. This is impressive and promising in that there is not much hand-crafted knowledge or heuristics baked into the model. I can imagine follow-up research producing even more impressive results with more compute and better techniques.

#### Issues Remaining

Of course, the model is not applicable to all NP-hard combinatorial optimization problems. For example, it capitalized on a few nice properties of TSP. First, it is easy to generate a feasible tour (given a complete graph); we just need to prevent our model from repeating the same city. This may not always be the case (e.g. TSP with constraints like time windows). In such cases, relaxing the feasibility constraint at the inference phase may be required, with feasibility enforced post hoc or via regularization. Second, TSP is a one-step problem without any time-varying components in its inputs. It would not be straightforward to apply the model to a multi-step problem like the capacitated Vehicle Routing Problem. We will talk about another paper that deals with the second issue soon.
