In the previous post, we discussed some informed search algorithms and the idea of using heuristics to guide the search process. It’s time to look under the hood to understand what these heuristics are, their properties and how to design one for a problem at hand.
In AI search, the word “heuristic” has a very specific meaning. A heuristic is a function that estimates how close a state is to a goal. The heuristic function takes a state and returns an estimate of its distance (path cost) from a goal state. For example, if you want to go from your home to some restaurant, euclidean distance could be a good estimate of the distance between the two locations. However, do note that a heuristic is just an estimate, it may be nowhere close to the actual value.
There is a special class of heuristics called admissible heuristics, which possess some desirable properties. Let’s understand what this admissibility is. A heuristic is admissible if it never overestimates the cost of reaching a goal from any node i.e., h(n) <= actual cost of reaching the closest goal from n, for any node n, and h(goal) = 0. According to this definition, even the null heuristic (which returns 0 for any node) is admissible so it is not as restrictive as it sounds. The only thing required for a heuristic to be admissible is that it never returns a value greater than the actual path cost to the nearest goal for any node.
A* search requires the use of an admissible heuristic to be optimal (i.e., always returns the cheapest solution). Inadmissible (or pessimistic because they overestimate the cost) heuristics break optimality by trapping good plans on the fringe. Admissible (or optimistic because they can only underestimate the cost) heuristics can only slow down search by assigning a lower cost to a bad plan so that it is explored first but sooner or later A* will find the optimal solution. Let’s see an example to understand how inadmissibility breaks optimality.
Suppose node A has been expanded and nodes B and C are on the fringe. Since, f(C) = g(C)+h(C) = 1+2 = 3 < f(B) = g(B)+h(B) = 1+4 = 5, node C will be expanded next by A*. Now, f(Goal) = g(Goal)+h(Goal) =4+0 = 4 < f(B), so Goal will be taken off the fringe before B and A* will claim victory by returning the more expensive path A->C->Goal (cost 4), instead of the cheaper path A->B->Goal (cost 3). This happened because our heuristic function overestimated the cost of path B->Goal.
We now know what heuristics are and some of their properties. Let’s now see how we can design a heuristic for a problem at hand with the example of 8-puzzle.
Here is an instance of the 8 puzzle problem. We are given the state on the left side and we want to transform it into the state on the right. The only action we have is sliding a tile up, down, left or right into the empty space. All actions are equivalent so let’s assign each action the same cost 1. The cost of a solution is then the number of actions taken and we want to find the cheapest solution.
There is a standard approach of creating heuristics: problem relaxation. We relax a problem by adding some new actions such that search is not required to find the solution cost in the relaxed problem. Heuristics created as solutions to relaxed problems are usually admissible because adding new actions should only reduce the solution cost and not increase it. Let’s see how we can create admissible heuristics for 8-puzzle with problem relaxation.
- Suppose we could take out a tile and place it in its correct position in 1 move. Given any state, how many moves would it take to reach the goal state from that state? It’s precisely the number of misplaced tiles, also called the Hamming distance. In this relaxed version of 8-puzzle, we don’t need to use search to find out the solution cost. We can calculate it directly so it can be used as a heuristic. Hamming distance of the initial state given above is 8 because each tile is misplaced.
- Suppose we could slide a tile into any block, as opposed to only the empty block, in 1 move (Imagine a 3D board, where each tile is in its own 2D plane). Given any state, how many moves would it take to reach the goal state from that state? It’d be the sum of horizontal and vertical distances of each tile from its desired position, also called the Manhattan distance. We can calculate this value directly, without requiring search. Manhattan distance of the initial state given above is 19.
Do note that although the Hamming distance for the initial state given above is 8 and the Manhattan distance is 19, it doesn’t necessarily mean that it can be solved in 8 or 19 moves. The only guarantee is that it can’t be solved in less than 19 moves. This gets us to the question of what makes a heuristic better than another. A heuristic ha is better than another hb (i.e., ha dominates hb; ha ≥ hb) if ∀ node n: ha(n) ≥ hb(n). For example, for 8-puzzle, Manhattan distance dominates Hamming distance. Using a better heuristic means we have to explore fewer nodes before we find the solution.
Heuristics have a trade-off between quality of estimate and work per node. As heuristics get closer to the true cost, we expand fewer nodes but usually do more work per node to compute the heuristic itself. At the two extremes are the null heuristic (which always returns 0 and requires the least amount of work) and the exact cost (which can only be found out by conducting search itself and requires the most amount of work). One good point to note here is that the maximum of two admissible heuristics is also admissible so if we have developed two or more heuristics for the same problem and are unsure whether any of them dominates all others, we can use the maximum of them as the composite heuristic. This composite heuristic will take more time to compute but would be more accurate.
There is another class of heuristics, called consistent heuristics, which places even stricter constraints on the heuristic. It requires that the heuristic estimate must never be greater than the actual cost for each arc along a path to a goal. Formally, for any node n and its any successor p, h(n) ≤ h(p) + cost(n,p) and h(goal) = 0. One consequence of this is that the f value never decreases along a path to a goal. Consistency implies admissibility i.e., a consistent heuristic is also admissible (however the opposite is not necessarily true). Although, most admissible heuristics are consistent, especially if from relaxed problems. For example, Hamming and Manhattan distances are consistent heuristics for 8-puzzle.
Now, let’s bring together all the ideas we have seen so far in this series on AI, implement a complete solution to 8-puzzle and compare the performance of different search algorithms on the above 8-puzzle instance.
If we execute this code, we’ll get the following output:
Hamming Distance: 8
Manhattan Distance: 19
|Algorithm||Nodes Explored||Path Length|
|A* with Hamming distance||12887||23|
|A* with Manhattan distance||608||23|
We can draw following observations from this:
- Although the Hamming and Manhattan distances are 8 and 19 respectively, this problem instance cannot be solved in less than 23 moves.
- DFS is clearly not optimal. BFS is behaving optimally because all action costs are 1.
- A* needs to explore far fewer nodes than the other algorithms to find the optimal solution.
- Just replacing Hamming distance with Manhattan distance as the heuristic reduces the number of nodes explored by 20x. Dominance of Manhattan distance over Hamming distance is clearly visible.