I have had a number of friends ask me “Okay, AlphaGo was impressive, but where else can you use Reinforcement Learning (RL)?”.
If you google the question, you will find legitimate lists like this paper or this article. Today I would like to talk about a cool application of RL that is not mentioned as often: combinatorial optimization. For the sake of conciseness, I will quickly go through a few background concepts and discuss the components I think are most interesting for an audience that likes RL.
Combinatorial optimization problems are important in computer science and operations research and correspond to many real-life problems. For example, suppose you have to take a low-cost flight with no free check-in baggage. You now need to somehow pack as many valuable items into your small backpack as possible while keeping the total weight from exceeding the limit. In computer-science speak, you are solving an instance of the Knapsack problem. Combinatorial optimization is about finding a solution that maximizes/minimizes a given objective function defined over a discrete space of solutions.
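As a quick illustration, the packing problem above yields to a classic dynamic program. Here is a minimal sketch; the `knapsack` helper and the example items (value, weight) are made up for this post:

```python
def knapsack(items, capacity):
    """0/1 knapsack: items is a list of (value, weight) pairs,
    capacity is the maximum total weight. Returns the best total value."""
    best = [0] * (capacity + 1)
    for value, weight in items:
        # iterate weights downward so each item is used at most once
        for w in range(capacity, weight - 1, -1):
            best[w] = max(best[w], best[w - weight] + value)
    return best[capacity]

# e.g. laptop (900, 3 kg), camera (600, 1 kg), books (200, 2 kg), 4 kg limit
print(knapsack([(900, 3), (600, 1), (200, 2)], 4))  # 1500
```

The optimum here packs the laptop and the camera; the books would push the bag over the limit.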
Some of these problems are NP-hard which, at this point, means no polynomial-time exact algorithm can exist unless P = NP (so there is still hope!). The field is, to my knowledge, at least as old as the history of computer science, so there are already many approximation algorithms and heuristics. Is there anything new to talk about? (Drumroll…) Well, this is exactly where RL can kick in. In this part of the post, we describe some classic approaches. If you are already familiar with them, I suggest moving on to part 2.
Traveling Salesperson Problem
As a running example, let us consider an NP-hard combinatorial optimization problem: the Traveling Salesperson Problem (TSP). A traveling salesperson must visit $n$ cities $X = \{ x_1, \dots, x_n \}$ where every pair of cities has a road (edge) connecting them with varying distances (a complete, undirected, weighted graph). The goal of the salesperson is to find a route such that a) each city is visited exactly once^{1}, b) the salesperson returns to the city it started from (a.k.a. the depot), and c) the total traveling distance (cost) is minimal.
Let $\Pi$ denote the set of possible routes and $\pi$ a permutation (route) over the cities. We define the cost function simply by summing the distances of all edges along a given $\pi$: $L(\pi) = \sum_{i=1}^{n} d(x_{\pi(i)}, x_{\pi(i+1)})$, with the convention $\pi(n+1) = \pi(1)$ so that the route closes. That is, we want to solve $\pi^\star = \arg\min_{\pi \in \Pi} L(\pi)$. Note the cost function is defined over a discrete space of cities (or their indices), and the distance function is given and symmetric; $(X, d)$ may not even be a metric space, though many fortunate results rely on the triangle inequality.
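In code, evaluating the objective for a candidate route might look like this; a small sketch assuming cities are 2D points with Euclidean distances (the `tour_length` name and the example coordinates are mine):

```python
import math

def tour_length(route, coords):
    """Total length of a closed tour: sum of consecutive distances,
    including the edge from the last city back to the first."""
    return sum(
        math.dist(coords[route[i]], coords[route[(i + 1) % len(route)]])
        for i in range(len(route))
    )

# four cities on a unit square; the perimeter tour has length 4
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(tour_length([0, 1, 2, 3], square))  # 4.0
```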
Exact algorithms
How do we solve this? Without constraints, there are $n!$ possible solutions^{2}. Unless $n$ is really small or the problem belongs to some special case we know how to attack with ease, exact algorithms are likely an impossible dream. Recall that the factorial asymptotically grows faster than the infamous exponential. The best exact algorithm we know is the Held-Karp algorithm. The trick is to capitalize on the invariant that, given the optimal route $\pi^\star$ of minimum distance, any subroute of it is also of minimum distance. If there were a subroute that isn't, $\pi^\star$ would not be optimal; a contradiction. The problem can thus be formulated to recursively compute the optimal path given a start vertex, an end vertex (technically, the second-to-last vertex), and a set of intermediate vertices to go through, such that no computation is repeated at the expense of memory; i.e., dynamic programming. Since there are $2^n$ subsets of intermediate vertices and $n$ choices of end vertex, there are $O(n 2^n)$ subproblems, each evaluated in $O(n)$ time; this gives a worst-case time complexity of $O(n^2 2^n)$ and space of $O(n 2^n)$. Better than factorial, but still exponential.
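To make the recursion concrete, here is a minimal Held-Karp sketch; the `held_karp` function and the toy distance matrix are illustrative, not from any library:

```python
from itertools import combinations

def held_karp(dist):
    """Held-Karp dynamic program over a symmetric distance matrix.
    Returns the length of the optimal tour starting and ending at city 0."""
    n = len(dist)
    # best[(S, j)] = shortest path from 0 visiting exactly the set S, ending at j
    best = {(frozenset([j]), j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = frozenset(subset)
            for j in subset:
                best[(S, j)] = min(
                    best[(S - {j}, k)] + dist[k][j] for k in subset if k != j
                )
    full = frozenset(range(1, n))
    # close the tour by returning to the depot
    return min(best[(full, j)] + dist[j][0] for j in range(1, n))

# four cities arranged in a cycle; the optimal tour has length 4
print(held_karp([[0, 1, 2, 1],
                 [1, 0, 1, 2],
                 [2, 1, 0, 1],
                 [1, 2, 1, 0]]))  # 4
```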
Approximation algorithms and heuristics
What people use in practice are (local) approximation algorithms and/or search algorithms with heuristics. Approximation algorithms guarantee that their performance won't be too bad (up to a multiplicative factor/ratio) compared to the optimal performance (sometimes, in expectation). Local search algorithms find a local optimum, where heuristics help them get stuck at a *good* local optimum. Note that heuristics are problem-dependent to varying degrees, and one that is generally applicable earns the distinguished status of a metaheuristic. While there have been many attempts at TSP, a few have become particularly popular. For brevity, we just name them here.
The simplest is the Nearest Neighbor algorithm. Given a start vertex, it iteratively constructs a route such that at the $i$-th iteration, it looks at the unvisited neighbors of its most recently added vertex and chooses the nearest one: $x^\star = \arg\min_{x \in X \setminus \pi} d(\pi_i, x)$ and $\pi = \pi \cup \{ x^\star \}$. At first, it has $n-1$ unvisited vertices to evaluate, then $n-2$, and so on until one is left. Each evaluation is independent, so we add them up to get $O(n^2)$. Since it is cheap to run, it is often used to produce an initial solution that other techniques then improve iteratively. There is not much in the way of satisfactory performance guarantees: sadly, this algorithm's approximation ratio is not even bounded; it can produce the worst tour. That said, special instances with the triangle inequality give an approximation ratio of $O(\log n)$.
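The greedy construction fits in a few lines; a minimal sketch with a made-up `nearest_neighbor` helper operating on a distance matrix:

```python
def nearest_neighbor(dist, start=0):
    """Greedy tour construction: from the current city, always move
    to the nearest city that has not been visited yet."""
    n = len(dist)
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        current = tour[-1]
        nearest = min(unvisited, key=lambda j: dist[current][j])
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour

# four cities on a line: greedily walking down the line is optimal here
print(nearest_neighbor([[0, 1, 2, 3],
                        [1, 0, 1, 2],
                        [2, 1, 0, 1],
                        [3, 2, 1, 0]]))  # [0, 1, 2, 3]
```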
Christofides' algorithm is the best known approximation algorithm, guaranteeing a ratio of 3/2; the result assumes the distance function forms a metric. It begins by constructing a minimum spanning tree $T$, which is a nice start: $T$ is connected and minimal both in the number of edges and in cost. Plus, $L(T) < L(\pi^\star)$, since removing an edge from $\pi^\star$ leaves a spanning tree, which cannot cost less than $T$. At this point, $T$ is not a valid tour, but we can take the odd-degree vertices of $T$ and find a minimum-weight perfect matching $M$ among them. If we add $M$ back to $T$ (call it $T'$), every vertex has even degree, so there exists a closed walk in $T'$ that visits every edge exactly once. This kind of walk is called an Euler tour, and it is nice because it already contains a TSP tour $\tilde{\pi}$: the TSP tour can be recovered by adding shortcuts past vertices visited more than once, which is where the triangle inequality is used. It follows that $L(\tilde{\pi}) \le L(T') = L(T) + L(M) \le (1 + 1/2)L(\pi^\star)$; adding shortcuts is a non-increasing operation, and $L(M) \le L(\pi^\star)/2$ because the optimal tour restricted to the odd-degree vertices decomposes into two perfect matchings, the cheaper of which costs at most half of $L(\pi^\star)$. The time complexity is polynomial, $O(n^3)$, due to the matching step.
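For the curious, here is a toy-sized sketch that mirrors the steps above: Prim's MST, then a brute-force minimum matching on the odd-degree vertices (a stand-in for a proper polynomial matching algorithm, so it only works for small instances), then Hierholzer's Euler tour, then shortcutting. All names are my own and this is only meant to illustrate the pipeline:

```python
def christofides(dist):
    """Christofides sketch for a small symmetric metric instance."""
    n = len(dist)
    # 1. minimum spanning tree via Prim's algorithm
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        in_tree.add(j)
        edges.append((i, j))
    # 2. vertices with odd degree in the tree (always an even number of them)
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    odd = [v for v in range(n) if degree[v] % 2]
    # 3. minimum-weight perfect matching on odd vertices (brute force)
    def match(vs):
        if not vs:
            return 0.0, []
        v, rest = vs[0], vs[1:]
        best = (float("inf"), [])
        for k, u in enumerate(rest):
            cost, pairs = match(rest[:k] + rest[k + 1:])
            cost += dist[v][u]
            if cost < best[0]:
                best = (cost, pairs + [(v, u)])
        return best
    _, matching = match(odd)
    # 4. Euler tour of tree + matching edges (Hierholzer's algorithm)
    adj = {v: [] for v in range(n)}
    for i, j in edges + matching:
        adj[i].append(j)
        adj[j].append(i)
    stack, circuit = [0], []
    while stack:
        v = stack[-1]
        if adj[v]:
            u = adj[v].pop()
            adj[u].remove(v)
            stack.append(u)
        else:
            circuit.append(stack.pop())
    # 5. shortcut repeated vertices to obtain a TSP tour
    seen, tour = set(), []
    for v in circuit:
        if v not in seen:
            seen.add(v)
            tour.append(v)
    return tour

print(christofides([[0, 1, 2, 3],
                    [1, 0, 1, 2],
                    [2, 1, 0, 1],
                    [3, 2, 1, 0]]))
```

On this four-city line instance, the MST is the path 0-1-2-3, the odd-degree vertices 0 and 3 get matched to each other, and shortcutting the resulting Euler tour yields a valid tour of all four cities.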
Another local heuristic often mentioned alongside it is 2-opt. The algorithm assumes a tour $\pi$ is given and aims to improve it with a series of two-edge exchanges. The trick is to replace a pair of edges that cross over each other (X-shaped) with another pair that does not. It iteratively considers each possible pair of edges that are not adjacent in $\pi$. Removing such a pair leaves two paths, $T_1, T_2$. By a simple argument, one can show there is only one way to recombine $T_1, T_2$ into a new tour $\pi'$. We set $\pi = \pi'$ if an improvement is made, i.e., $L(\pi') < L(\pi)$. Why does this algorithm work? When we pick two non-adjacent edges in a tour, the pair either crosses or it does not. In both cases, there is a unique way to recombine them into a different, valid tour; in the Euclidean plane, a crossing pair can always be improved this way. A single pass runs in $O(n^2)$, since there are at most $\binom{n}{2}$ edge pairs to consider and each swap evaluation is $O(1)$. The catch is that you may have to run this procedure (over all non-adjacent edge pairs) many times, which can lead to an exponential number of exchanges in total. It does converge to a local optimum, as it changes the tour only when strict improvement is possible; it cannot cycle back to a higher-cost solution it has seen before. There is a generalization of this idea called Lin-Kernighan.
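A minimal sketch of the exchange loop, using the observation that removing the edges $(\pi_i, \pi_{i+1})$ and $(\pi_j, \pi_{j+1})$ and reconnecting the unique other way is the same as reversing the segment between them (the `two_opt` name and the square example are mine):

```python
import math

def two_opt(tour, dist):
    """Repeat two-edge exchanges while they strictly shorten the tour.
    Reversing tour[i+1:j+1] swaps edges (i, i+1) and (j, j+1)
    for (i, j) and (i+1, j+1)."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            # start at i + 2 (and skip j = n - 1 when i = 0)
            # so the two edges are never adjacent
            for j in range(i + 2, n - (i == 0)):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d]:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour

# a unit square: the tour [0, 2, 1, 3] has two crossing diagonal edges
coords = [(0, 0), (1, 0), (1, 1), (0, 1)]
d = [[math.dist(p, q) for q in coords] for p in coords]
print(two_opt([0, 2, 1, 3], d))  # [0, 1, 2, 3], the uncrossed perimeter
```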
Google open-sourced a nice tool called OR-Tools. In their first solution strategy section, you can find the algorithms above (or similar) and other, more sophisticated ones that produce an initial solution. Furthermore, it features metaheuristics (local search options) that help find a better local optimum. The tool is designed to handle not only the TSP but also other combinatorial optimization problems, such as the Vehicle Routing Problem (VRP), which subsumes TSP as the special case of a single vehicle, a single depot, and no constraints on things like capacity or time windows.
Next, we talk about RL in part two of this post.

The complete connectivity of the graph assures the existence of a tour that visits each vertex exactly once. Whether the shortest tour is among such visit-once tours generally depends on the triangle inequality, which guarantees that shortcutting a repeated vertex never increases the length. ↩

If we assume a depot city (the first and last city to visit) is assigned as part of the problem definition, there are $(n-1)!$ routes. Similarly, $n!/2$ if we consider two tours of opposite directions identical (e.g. $A \rightarrow B$ vs. $B \rightarrow A$). ↩