Fast Iterative Alignment of Pose Graphs
with Poor Initial Estimates
Edwin Olson, John Leonard, and Seth Teller
MIT Computer Science and Artiﬁcial Intelligence Laboratory
Cambridge, MA 02139
Email: eolson@mit.edu, jleonard@mit.edu, teller@csail.mit.edu
http://rvsn.csail.mit.edu

Abstract— A robot exploring an environment can estimate
its own motion and the relative positions of features in the
environment. Simultaneous Localization and Mapping (SLAM)
algorithms attempt to fuse these estimates to produce a map and
a robot trajectory. The constraints are generally non-linear, thus
SLAM can be viewed as a non-linear optimization problem. The
optimization can be difﬁcult, due to poor initial estimates arising
from odometry data, and due to the size of the state space.
We present a fast non-linear optimization algorithm that
rapidly recovers the robot trajectory, even when given a poor
initial estimate. Our approach uses a variant of Stochastic
Gradient Descent on an alternative state-space representation
that has good stability and computational properties. We compare
our algorithm to several others, using both real and synthetic data
sets.

Truth

Corrupted Input

Gauss-Seidel
60s, 20000 iters

EKF
6.4s

LU Decomposition
28.6s, 4 iters

Our Method
0.9s 400 iters

I. I NTRODUCTION
Many robotics problems require a robot to build a map of its
environment while simultaneously determining its trajectory;
this process is dubbed Simultaneous Localization and Mapping
(SLAM). At its core, SLAM is an optimization problem: ﬁnd
the map with the greatest probability given sensor observations. Data association is a core problem in SLAM, but in this
paper, we assume that data associations are known.
The SLAM optimization problem is quite difﬁcult. It has
a large search space, since the position of each pose and
feature must be determined. It is also highly nonlinear, since
observations are made relative to the robot’s location (the
constraint equations include trigonometric functions of the
robot’s orientation).
Perhaps the most challenging aspect of SLAM optimization
is that the initial state estimate (arising from the vehicle’s
odometry) can be very noisy. After traveling around a large
loop, the robot’s estimated position may differ markedly from
its true position.
Current SLAM algorithms have a limited ability to deal with
poor initial estimates. The family of linearizing approaches
(Extended Kalman Filter, Extended Information Filter, and
the many approaches that improve upon their computational
complexity [1], [2]), irrevocably introduce linearization error.
We will show the effects of this linearization error.
Nonlinear optimization algorithms do not suffer from linearization error, since they can re-evaluate constraint equations
as the state estimate improves. Current nonlinear methods are

Fig. 1. SLAM experiment with about 600 poses and nine loops. Dots (very
small) represent robot poses, and lines show constraints. The EKF solution
exhibits noticeable degradation due to linearization error. LU Decomposition
and our method both produce good results, though LU decomposition takes
much longer. Notice that Gauss-Seidel has failed: an extra loop has been
inserted in the lower right which will never be removed. See Fig. 2 for
convergence rate plots.

often slow, often failing to recover the global structure of the
map in a reasonable amount of run time.
We present a nonlinear map optimization algorithm which
can optimize a map even when the initial estimate of the map
is poor. Further, our approach is very fast. This paper makes
two main contributions:

Fig. 2. Convergence rates for a simple SLAM experiment (shown in Fig. 1).
Our method was dramatically faster than the others, while achieving lower
error. Note that the runtime for LU decomposition was actually about 320
seconds; it has been compressed for comparison purposes. The second plot
shows the same data as the ﬁrst, with a different vertical scale to show detail.

An alternative state space representation that allows a
single iteration to update many poses without incurring a
large computational cost.
• A variant of Stochastic Gradient Descent that is robust
against local minima and converges quickly.
Our algorithm is uniquely adept at recovering the macroscopic shape of pose graphs, quickly producing maps that
resemble the ground truth. Our algorithm generally does not
ﬁnd an exact minimum due to the approximations involved.
However, given reasonable time bounds, the approximate
minimum is typically better than that of other algorithms. If
an exact minimum is required, our algorithm could be used to
bootstrap another approach.
Our method consumes O(N + M ) memory for N poses
and M constraints, and each iteration of the optimizer (which
considers a single constraint) runs in O(log N ) time. This
paper demonstrates our approach on several data sets.
•

II. P RIOR W ORK
The SLAM problem can be represented as a graph in
which nodes are features and/or robot poses, and edges are
rigid-body constraints between nodes. The information matrix,
the inverse of the Kalman ﬁlter’s covariance matrix, has a
natural interpretation as the adjacency matrix of this graph.
Thrun et al. demonstrated that the information matrix can be
effectively sparsiﬁed [3]. Eustice et al. showed that by tracking
all robot poses, the information matrix becomes exactly sparse
[4]. While information ﬁlters can rapidly incorporate new
observations, exactly recovering the state estimate requires
inversion of the entire information matrix. Another limitation

of both Kalman and Information Filter approaches is that
they irrevocably introduce linearization error when performing
observation updates. When the state estimate is poor, these
linearization errors can lead to poor estimates, even though
the ﬁlter’s covariance estimate may indicate low error.
A promising family of SLAM approaches avoids introducing permanent linearization errors by continuously relinearizing around the current state estimate. These approaches
attempt to compute the fully nonlinear Maximum Likelihood
Estimate (MLE). However, because they perform optimization
in a highly non-linear domain, these approaches can suffer
from local minima and unboundedly slow convergence rates.
A brute-force nonlinear least squares implementation was
suggested by Lu and Milios [5]. Their approach is similar in
formulation to more modern approaches, but their brute-force
implementation makes their algorithm impractical.
Duckett et al. described an early nonlinear SLAM implementation [6] that uses Gauss-Seidel relaxation. However,
they assumed absolute knowledge of the robot’s orientation,
essentially making the problem linear. Frese, Larsson, and
Duckett addressed that limitation and attempted to improve
convergence speed in [7] with the Multi-Level Relaxation
(MLR) algorithm. Multi-resolution methods are typically applied to problems more spatially uniform than that of SLAM,
but they report good results. MLR, given time, can generally
ﬁnd the exact minimum of the graph.
Other nonlinear approaches include GraphSLAM [8], and
Graphical SLAM [9]. Konolige proposes a method [10] for
accelerating convergence by reducing the graph to poses that
have a loop constraint attached, solving for the other nodes
separately. This can save considerable CPU time, but requires
the graph to have low connectivity.
Paskin’s Thin Junction Tree Filter [11] and Frese’s TreeMap
[12] compute nonlinear map estimates, but their approaches
require factorization of the joint probability density, which
they achieve by ignoring small state correlations. These approximations can result in noticeable map artifacts.
A hybrid of linear and nonlinear solutions is Bosse’s Atlas
[13], which uses linearized (EKF-based) submaps but stitches
them together using nonlinear optimization.
Nonlinear optimization algorithms have a rich history outside the SLAM community. SLAM algorithms have typically
limited themselves to Gauss-Seidel or Gradient Descent approaches, but other approaches are commonly used in other
ﬁelds. In particular, Stochastic Gradient Descent [14] is often
used to train artiﬁcial neural networks.
III. O UR M ETHOD
A. Preliminaries
We consider a version of the 2D SLAM problem that
explicitly optimizes only the trajectory of robot poses. As
demonstrated by FastSLAM [15], positions of non-pose features can be trivially computed given the robot trajectory since
they become conditionally independent. This approach reduces
the search space.

In this paper, we assume that constraints between poses
are always given as full-rank rigid-body transformations with
uncertainty matrices. This allows us to treat constraints in
a uniform way, simplifying presentation. However, rankdeﬁcient constraints, such as those arising from data associations between lines, could be incorporated in principle. We
also make the typical assumption that constraints (which arise
from observations) are independent.
Before proceeding, we review how the graph can be optimized by solving a linear problem of the form Ax = b. If x
is the state vector representing robot poses, and f () represents
the constraint equations with expected values u and variances
Σ, we can write the log probability of the node positions as:
− log P (x) ∝ (f (x) − u)T Σ−1 (f (x) − u)

(1)

The constraint equations are nonlinear due to effects of
robot orientation. We proceed by linearizing f (x) = F |x +
J|x ∆x, using column vector F |x and matrix J|x (the Jacobian
of the constraint equations with respect to the state). A
single rigid-body constraint yields three constraint equations,
occupying a block-row of the Jacobian.
Since we always linearize around the current state estimate,
we simply write F and J. We will also write d = ∆x, which is
suggestive of the search direction, and aids readability. Finally,
we also set r = u − F , the residual. Eqn. 1 then becomes:
− log P (x) ∝
cost =

(Jd − r)T Σ−1 (Jd − r)
dT J T Σ−1 Jd − 2dT J T Σ−1 r +
rT Σ−1 r

(2)

We wish to improve our map by ﬁnding a d that maximizes
the probability, or equivalently minimizes the cost. Differentiating with respect to d and setting to zero, we ﬁnd that:
(J T Σ−1 J)d = J T Σ−1 r

(3)

This is a linear algebra problem of the form Ax = b, with
A = J T Σ−1 J, the information matrix. If we solve Eqn. 3 for
d once (thus limiting us to a single linearization point), we
have an Extended Information Filter. (An Extended Kalman
Filter would yield the same result through different means.) If
we solve for d repeatedly (by re-evaluating J around the state
estimate each time), we have the method of nonlinear least
squares.
In the case of SLAM, the state vector is often enormous–
thousands of poses, each pose with either three (2D) or six
(3D) degrees of freedom. If there are 2000 poses and the
A matrix is 6000 × 6000, a 2.8GHz Pentium-IV running
MATLAB requires 127 seconds to solve Eqn. 3 just once
(assuming a dense A). If there are 10000 poses, over four
hours per iteration are needed.
It is often the case that d can be reasonably estimated much
faster than it can be computed exactly. This is acceptable, since
even an “exact” d is only the solution to a linear approximation
of a non-linear problem. This motivates us to consider iterative
optimization algorithms.

B. Iterative Optimization
Iterative methods approach optimization by considering
only a subset of the information available in the problem. A
large family of algorithms consider the curvature of the cost
surface around the current estimate, ignoring the curvature farther away. These include Newton-Raphson, Gradient Descent,
Conjugate Gradient Descent [16], and Simplex methods.
When the state estimate is reasonably close to the optimum,
the curvature around the estimate can quickly guide these
algorithms to it. However, when the state estimate is corrupted
by signiﬁcant noise, the local gradient may point toward a local
minimum, not the global minimum. Unless gradient methods
happen to stumble out of the local basin, they will typically
fail to achieve a satisfactory solution.
For many SLAM problems, the cost function can have long
valleys where the gradient is nearly zero. These arise, for
example, when two clusters of nodes are only loosely coupled
to each other: so long as the individual clusters are wellaligned, the clusters can rotate with respect to each other with
very little impact on the total cost. In this situation, some
methods can converge so slowly that they appear to be stuck
in a local minimum.
Another family of iterative approaches considers only a
subset of the graph’s nodes and edges. Gauss-Seidel, for
example, considers a single node p and its edges. Fixing p’s
neighbors, the location of p is recomputed. Gauss-Seidel has
a major shortcoming: if a cluster of nodes are far from their
optimal position, but are positioned correctly relative to each
other, they will not be moved. Since only those nodes which
border an error are adjusted, it takes many iterations for the
effect of a loop closure to be percolated around the graph.
The basic idea of Stochastic Gradient Descent (SGD) is to
consider a single edge and the gradient with respect to just
that edge. The state is then moved in the direction of greatest
descent. It is “stochastic” in the sense that the constraint is
usually selected randomly.
Typically, different edges will lead to steps in different
directions. SGD thus tends to hop around from one local
minimum to another. SGD is also less likely to be caught in a
long valley, since there is probably at least one edge which has
a signiﬁcant gradient. This edge will cause the state estimate to
teleport to another part of the cost function, where the gradient
may be more helpful.
The distance that SGD travels for each edge is slowly
decreased over time in order prevent oscillation. This makes
it increasingly difﬁcult for the state estimate to transition
from one local minimum to another, with the result that it
becomes increasing likely that SGD will get stuck in the most
popular minimum (which is likely the global minimum). The
degree to which SGD modulates its step size is known as the
learning rate, and is analogous to the cooling rate in simulated
annealing methods.
A great number of learning rate schedules have been
explored in the literature ([17] is an interesting example); the
simplest are simple functions of iteration number, while others

incorporate convergence rate and other information. The simplest strategy is to set the learning rate α ∝ 1/n, where n is the
current iteration of the algorithm. The proportionality constant
is determined, in our case, by examining the “stiffnesses” of
the poses, as will be described in section III-E.
C. Derivation
Using some of the ideas underlying Stochastic Gradient
Descent, we now derive our method. We begin by considering
the cost of a single constraint i and its residual ri , evaluated
at the current state (i.e., the ith term of Eqn. 2 with d = 0):
T
ci = ri Σ−1 ri
i

(4)

The gradient of the cost function with respect to d (the
change in state) can be found by applying the chain rule to
Eqn. 4. With Ji the Jacobian of the ith constraint with respect
to d, we have:
∂ci
∇ci = −
→
∂d

→
ri
∂ci ∂ −
− −
→ →
∂ ri ∂ d
T
= 2ri Σ−1 Ji
i
=

(5)

Stochastic Gradient Descent would then prescribe that we
set d = α∇cT , where α represents the learning rate. (The
i
transpose arises from the convention of gradients being row
vectors and state vectors being column vectors.) In other
words:
d = 2αJiT Σ−1 ri
i

(6)

However, note that summing ∇cT for all constraints yields
i
the right-hand side of Eqn. 3, up to a scale factor. Consider
the left-hand side and suppose that we ﬁnd an easily invertible
matrix M such that M ≈ J T Σ−1 J (note that J is the full
Jacobian, not just the Jacobian for the ith constraint). We
can then write d so that it more accurately approximates the
solution to Eqn. 3:
d ≈ 2αM −1 JiT Σ−1 ri
i

(7)

A natural choice is to pick the diagonal matrix M with the
same diagonal elements as J T Σ−1 J. This is also known as
Jacobi Preconditioning [18].
Thus far, our method can be viewed as incrementally building up a solution to Eqn. 3, one constraint at a time. Note that
we propose to systematically iterate through constraints, rather
than selecting them randomly as would Stochastic Gradient
Descent.
We now make an additional modiﬁcation, motivated by the
fact that we know a priori the value of d which exactly
satisﬁes the constraint. While Stochastic Gradient Descent can
overshoot the minimum, we use this knowledge to clamp d so
that we never overshoot. In general, clamping will only occur
when the learning rate is large.

D. State Space Representation
Our discussion thus far has made no assumptions about
which state space representation is used. However, the choice
of state space has a large impact on the ultimate performance
of the algorithm.
The most widely used state space is global pose. The global
pose representation is simply the set of (x, y, θ) coordinates
for each pose. With this representation, the Jacobian of a single
rigid-body constraint is sparse. Usually this is a desirable
thing (since it permits efﬁcient computation), but the sparsity
also means that only the two poses adjacent to the edge
are directly affected by the constraint. This is undesirable
because the motion of the robot is cumulative: moving one
pose typically has an affect on a large number of other
poses. Naturally, this can be addressed by considering many
constraints simultaneously, but the resulting system rapidly
becomes more difﬁcult to solve.
A second possible state space is the set of rigid-body
transformations that relate successive poses, which we call
relative pose. The pose of the ith pose is the composition of
the ﬁrst (i−1) rigid-body transformations. Each of these rigidbody transformations is parameterized by three quantities:
a forward translation, a sideways translation, and a relative
rotation. A single constraint between two poses now affects
many poses, since the relative position of two poses is a
function of all of the rigid-body transformations between them.
Consequently, each iteration will move a number of poses,
permitting larger steps towards the minimum. However, the
Jacobians that result are highly non-linear (due to the effects
of each pose’s orientation), and non-sparse (since many rigidbody transformations are used).
We propose using a state representation, incremental pose,
that captures the cumulative nature of the robot’s motion,
but has a particularly convenient structure allowing efﬁcient
computation. The state space is composed of the incremental
change in robot pose between two successive poses, i.e.,
the difference between them. If the global robot poses are
(xi , yi , θi ), the state vectors for relative and incremental pose
are written:
 


x0
cos(θ0 )(x1 − x0 ) + sin(θ0 )(y1 − y0 )

 −sin(θ0 )(x1 − x0 ) + cos(θ0 )(y1 − y0 )  
y0

 

 


θ0
θ1 − θ0
 


 cos(θ1 )(x2 − x1 ) + sin(θ1 )(y2 − y1 )   x1 − x0 
 


 −sin(θ1 )(x2 − x1 ) + cos(θ1 )(y2 − y1 )   y1 − y0 
 


  θ1 − θ0 

θ2 − θ1
 


 cos(θ2 )(x3 − x2 ) + sin(θ2 )(y3 − y2 )   x2 − x1 
 


 −sin(θ2 )(x3 − x2 ) + cos(θ2 )(y3 − y2 )   y2 − y1 
 


  θ2 − θ1 

θ3 − θ2
...
...
xrel
xincr
In the xincr state space, rather than having to compose a
number of rigid-body transformations in order to compute the
relative position between two nodes, we simply add a number
of incremental motions. The Jacobian of a constraint between

pose 0 and pose

0 0
Ji =  0 0
0 0

2, for example, is just:
0 1 0 0 1 0 0 0 0 0
0 0 1 0 0 1 0 0 0 0
0 0 0 1 0 0 1 0 0 0



...
... 
...
(8)
With such a simple structure, J need never be explicitly
constructed or stored.
In other words, our method considers a single constraint
at a time. This constraint forms a loop with respect to the
robot trajectory that we are optimizing: the sum of the motions
along a portion of the trajectory must match the constraint.
We compute the residual of the constraint, and the Jacobian
tells us how to modify the robot trajectory to better satisfy
the constraint. In essence, we “spread” the error in the loop
constraint out over the trajectory. This generally reduces our
total error, since many small errors cost less than one large
error (due to the quadratic nature of Eqn. 2).
Intuitively, we would expect that the error should not
be distributed evenly: the position of some nodes may be
more constrained than others. This is precisely the effect
of premultiplying by M −1 . Each constraint links two poses,
adding tension to the portion of the trajectory between those
poses. The “stiffness” of any particular node is the sum of all
the tensions which affect it. (The “tension” of the constraints
are the block-diagonals of Σ−1 , and the product J T Σ−1 J
performs the summing.)
Eqn. 7 multiplies the Jacobian by the weighted residual.
Note, however, that the rows in the Jacobian are not unit
norm: constraints connecting more distant nodes have more
1’s in them. In effect, this ampliﬁes the effect of constraints
according to the length of their loop. This is desirable since
the error in a large loop can be spread out over many nodes;
intuitively, larger loops cost less to satisfy, and should be given
more weight. Our clamping heuristic prevents this behavior
from resulting in inappropriately large steps while α is large.
E. Learning Rate
The learning rate α controls the convergence properties of
stochastic gradient descent. If α decreases too quickly, the
gradient steps will become small before a good solution is
found. Conversely, if α decreases very slowly, convergence
will be needlessly delayed (though the ultimate solution will
likely be good).
We use a typical (though simple) learning rate suggested by
Robbins and Monro [14]. Given the current iteration number
t and a parameter γ:
α = 1/(γt)

(9)

This schedule permits large steps early on (correcting errors
on a macroscopic scale and permitting the graph to visit
local minima), while gradually reducing the step size in later
iterations.
If all of the constraint covariances are scaled by a constant
factor, the graph has the same solution (the constant factor
appears on both sides of Eqn. 3). Parameter γ gives us an

opportunity to account for this scaling. Consequently, we
propose:
γ = mini Σ−1
i

(10)

This learning schedule is likely suboptimal, though it appears to perform well over a variety of problems. “Searchthen-converge” is a slightly more sophisticated schedule that
changes the expression for α so that it stays near 1 for a
longer initial search period [19]. Adaptive rate schedules could
provide an even larger advantage, though we have not explored
them.
F. Implementation
In a practical implementation, much of the algebraic complexity vanishes once the properties of the simple Jacobian
are fully exploited. In this section, we describe our geometric
world model, some notation that is handy in this domain, and
provide a more practical algorithmic description to aid those
wishing to implement our approach. Not including a matrix
library, a simple version of our method can be implemented
in about 80 lines of Java code.
Each constraint is parameterized by two pose numbers (a
and b), a rigid-body transformation relating them (tab ), and
the rigid-body transformation’s covariance matrix, Σab .
A rigid-body transformation in 2D has three parameters, δx,
δy, and δθ; we represent it with a 3×1 vector t, corresponding
to a 3×3 transformation matrix T . We will freely switch back
and forth between the two representations:




cos(δθi ) − sin(δθi ) δxi
δxi
ti =  δyi  ⇐⇒ Ti =  sin(δθi ) cos(δθi ) δyi  (11)
0
0
1
δθi
Vehicle poses are expressed as rigid body transformations
(pi and its dual Pi ) relative to the origin. For example, if the
robot is at pi and observes a feature two meters in front of it,
the position of that feature in the global frame is Pi [2 0 1]T
(with the vector written in homogenous coordinates.) We also
write R = Rot(P ) to set R to just the rotational component of
rigid-body transformation P , i.e., where R1,3 = 0 and R2,3 =
0. This is useful when projecting a covariance matrix into a
different frame.
Each pose has a state vector xi , which is the difference in
absolute poses between it and the prior pose.
xi = pi − pi−1

(12)

Note that we ﬁx x0 = p0 = [0 0 0]T . Consequently:
xj

pi =

(13)

j∈[0,i]

The structure of the Jacobian makes it possible for us to
avoid constructing J or d explicitly. Instead, we compute the
weighted residual M −1 Σ−1 r, clamp it, then add it to the
appropriate state elements.
The straightforward method of adding the weighted residual
to the state elements runs in O(N ) time (for a graph with N

poses). This approach is easy to implement, and is shown in
Alg. 1.
However, it is possible to perform this operation in
O(log N ) time. Factoring out M , all of the affected elements
of xincr move by the same amount. Note that the affected
elements of xincr are always a contiguous portion of the
trajectory.
Consider a binary tree whose leaves are the elements of
xincr . As we iterate over the graph’s constraints, we compute
changes to the elements of xincr ; this tree stores and sums
those changes. If we deﬁne the total change for a leaf to be
the sum of all the tree nodes between it and the root, we can
add a change to any contiguous subset of xincr in O(log N )
time. Reading a value of xincr becomes O(log N ) as well.
The per-pose scaling is implemented by scaling the results
read from the tree by Mi−1 , and by scaling the changes that
are inserted into the tree by the total weight of the poses in
each loop.
Importantly, the cost of computing a pose does not rise to
O(N ); they can be computed in O(log N ), by employing a
similar tree. While fairly straightforward, the implementation
is rather tedious and we omit it here. An implementation is
available on the author’s website.
Also note that it is unnecessary to recompute M at every
iteration since the rough alignment of the map is generally
determined in the ﬁrst few iterations. In our performanceconscious implementation, we recompute M only at iterations
1, 2, 4, 8, and so on.
Our algorithm uses O(N + M ) memory, for N states and
M constraints. The coefﬁcients for each are small: neglecting
language overhead, constraints consume 80 bytes each, and
poses require 24 bytes. A graph with a million poses and two
million constraints would require about 160MB (though we
have not tried such a large problem).
Since each constraint can be evaluated in O(log N ), the
total time per iteration is O(M log N ). However, we do
not claim that our algorithm converges in O(M log N ) time
(i.e., in a constant number of iterations). In fact, deﬁning
“convergence” is tricky: particular choices of the learning-rate
parameter γ can force the step size to be small in a predictable
amount of time, but this does not imply that the solution is
near a minimum. This is a problem common to stochastic
optimization [19], and is a subject for future work.
IV. R ESULTS
We present results on two synthetic problems, a simple
double-loop and a very complex environment, and also on
real data from the large Killian Court data set.
Synthetic experiments on randomly-generated graphs allow
us to explore our algorithm’s behavior on extremely large
environments for which we do not have real datasets. It also
allowed us to test our algorithms with many different noise
characteristics, in order to be able to conﬁdently say that the
results presented here are typical.
We note that our graph generator creates “grid-world”
graphs, as though a robot was traveling the city streets in

Algorithm 1 SGD Optimize Graph
1: iters = 0
2: loop
3:
iters++
4:
γ = [∞ ∞ ∞]T
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:
34:

{Update approximation M = J T Σ−1 J}
M = zeros(numstates, 3)
for all {a, b, tab , Σab } in Constraints do
R = Rot(Pa )
W = (RΣab RT )−1
for all i ∈ [a + 1, b] do
Mi,1:3 = Mi,1:3 + diag(W )
γ = min(γ, diag(W ))
{Modiﬁed Stochastic Gradient Descent}
for all {a, b, tab , Σab } in Constraints do
R = Rot(Pa )
Pb ′ = Pa Tab
r = pb ′ − pb
r3 = mod2pi(r3 )
d1:3 = 2(RT Σab R)−1 r
{Update x, y, and θ}
for all j ∈ [1, 3] do
alpha = 1/(γj ∗ iters)
totalweight = i∈[a+1,b] 1/Mi,j
β = (b − a)dj α
if |β| > |rj | then
β = rj
dpose = 0
for all i ∈ [a + 1, N ] do
if i ∈ [a + 1, b] then
dpose = dpose + β/Mi,j /totalweight
pi,j = pi,j + dpose

upper Manhattan. This makes it easy to visually appraise a
map: corners which appear to be almost ninety degrees should
be ninety degrees. Paths which appear to overlap actually do.
The graph optimizers were in no way tuned to exploit this
property.
A. Double Loop
On the double-loop experiment (see Fig. 1), the graph size
was small enough to allow us to try many different algorithms,
including those with O(N 2 ) or worse complexity.
LU decomposition (nonlinear least squares) converged in
four iterations to a very good solution, but required over 28
seconds of CPU time. The EKF’s performance is tolerable,
though it is noticeably worse than the nonlinear solutions.
Gauss-Seidel struggles with large worlds due to its adjustment of only one node at a time. Nodes that are close to
a loop closure very quickly move into relative alignment,
but the amount of adjustment decreases very quickly with
distance. In other words, it is often the case that the majority

Truth

Corrupted Input

Gauss-Seidel
60s , 2100 iters

MLR
8.6s, 238 iters

Fig. 4. Convergence rates for problem in Fig. 3. Our method converges
much faster than Gauss-Seidel and produces a higher quality map. MLR (not
shown) reduces the cost function faster than our method, but the resulting
graphs do not resemble the ground truth (see Fig. 3)

B. Complex Graph

Proposed Method
2.9s, 100 iters

Proposed Method
10s, 628 iters

Fig. 3. Difﬁcult SLAM problem consisting of 3500 poses and 5600 constraints. MLR and Gauss-Seidel have dramatically reduced the cost function,
but their solutions are unsatisfactory. Given 1833 seconds, MLR converges to
a satisfactory solution. Convergence rates are shown in Fig. 4.

of the “tension” in the graph can be released by adjusting
comparatively few nodes. Once the tension is reduced, the
steps become quite small, even if the graph is far from the
optimum.
Another consequence of Gauss-Seidel relaxation is that
nodes bordering large errors can move dramatically. In this
experiment, a number of nodes (in the lower right) jumped
a large distance, inducing an extraneous loop into the graph.
Based on our experiments, we do not believe this loop will
ever be corrected: Gauss-Seidel has become trapped in a local
minimum.
Our proposed method reduces the error much faster than the
other three algorithms (see Fig. 2). As expected, LU Decomposition ultimately converges to the lowest error (−8.09 × 104 ).
Our approach (−2.75420 × 107 ) outperforms the EKF by a
smaller but still signiﬁcant margin (−5.70×107 ). Gauss Seidel,
after 60 seconds of CPU time, is the worst (−5.63 × 108 ).

In order to demonstrate our algorithm on a difﬁcult problem,
we generated a large graph with signiﬁcant error in both the
initial state and loop closures. The graph consists of 3500
poses and 5600 constraints. In this graph, the robot visits
different areas with highly variable frequencies; some areas
are visited up to 13 times, while others are visited only once
(the mean was about 2).
Since in 2D there are three degrees of freedom, the Jacobian
(were it to be computed explicitly) would be 16800 × 10500.
The size of this problem makes it impossible to run LUDecomposition or an EKF implementation.
Our proposed method outperformed Gauss-Seidel both in
terms of convergence rate and subjective graph quality. On this
data set, we have results using Multilevel Relaxation (MLR)
[7]. MLR’s convergence rate (log probability) is actually
substantially faster than our method, but the subjective quality
of the maps is only marginally better than Gauss-Seidel (see
Fig. 3). After 1800 seconds of CPU time, MLR converges to a
minimum that corresponds to a subjectively high-quality map.
These results illustrate that log probability is not an adequate
measure of graph quality. This is an area for future study.
C. Killian Court
The Killian Court dataset is one of the larger SLAM
datasets, which after processing with our laser scan code,
produced about 1900 poses. Laser scan matching provided the
constraints connecting consecutive poses. Twenty additional
loop closures were identiﬁed manually by aligning laser scans.

The robustness and speed of our approach make it compelling as a stand-alone SLAM algorithm, but it could also
be used by other non-linear algorithms like GraphSLAM or
MLR to improve their convergence on difﬁcult problems by
providing a better initial estimate.
VI. ACKNOWLEDGEMENTS
Corrupted Input

Gauss-Seidel (60s)

We thank Udo Frese for helpful discussions regarding the
O(M log N ) variant of the algorithm, and for running his
MLR code on our datasets. Also, we thank Mike Bosse for
collecting and sharing the Killian Court data set.
This material is based upon work supported by the National
Science Foundation under Grant No. 0514639
R EFERENCES

Proposed Method (1s)
Fig. 5. Killian Court. The Killian data set is composed of about 1900 poses,
with a path length of 1,450 meters. Two hours of robot travel resulted in a
pose graph that was optimized in under a second using our algorithm.

Fig. 6. B21 Robot. This B21 was used to collect the Killian data set. Laser
scan data was processed ofﬂine to produce a pose graph. This pose graph was
then optimized by our algorithm.

Our algorithm rapidly recovered the structure of the environment (in about 1 second), while Gauss-Seidel made poor
progress given 60 seconds. These results demonstrate the
applicability of our algorithm to real-world datasets.
V. C ONCLUSION
We have presented a fast iterative algorithm for optimizing
pose graphs, even in the presence of signiﬁcant open-loop
error. It estimates the maximum likelihood solution, often
achieving lower error than a full EKF solution, and converges
much faster than most other approaches while producing maps
that closely resemble the true state.

[1] J. Knight, A. Davison, and I. Reid, “Towards constant
time SLAM using postponement,” 2001. [Online]. Available:
citeseer.ist.psu.edu/article/knight01towards.html
[2] J. Guivant and E. Nebot, “Compressed ﬁlter for real time implmentation
of simultaneous localization and map building,” FSR 2001 International
Conference on Field and Service Robots, 2001.
[3] S. Thrun, Y. Liu, D. Koller, A. Ng, Z. Ghahramani, and H. DurrantWhyte, “Simultaneous localization and mapping with sparse extended
information ﬁlters,” April 2003.
[4] R. Eustice, H. Singh, and J. Leonard, “Exactly sparse delayed-state
ﬁlters,” in Proceedings of the 2005 IEEE International Conference on
Robotics and Automation, Barcelona, Spain, April 2005, pp. 2428–2435.
[5] F. Lu and E. Milios, “Globally consistent range scan
alignment for environment mapping,” 1997. [Online]. Available:
citeseer.nj.nec.com/lu97globally.html
[6] T. Duckett, S. Marsland, and J. Shapiro, “Learning globally consistent
maps by relaxation,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’2000), San Francisco, CA,
2000.
[7] U. Frese, P. Larsson, and T. Duckett, “A multilevel relaxation algorithm
for simultaneous localisation and mapping,” IEEE Transactions on
Robotics, 2005.
[8] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press,
2005.
[9] J. Folkesson and H. I. Christensen, “Graphical SLAM - a self-correcting
map,” Proceedings of the IEEE International Conference on Robotics
and Automation, pp. 791–798, 2004.
[10] K. Konolige, “Large-scale map-making.” in AAAI, 2004, pp. 457–463.
[11] M. Paskin, “Thin junction tree ﬁlters for simultaneous localization and
mapping,” Ph.D. dissertation, Berkeley, 2002.
[12] U. Frese, “Treemap: An O(log(n)) algorithm for simultaneous localization and mapping,” in Spatial Cognition IV, 2004.
[13] M. Bosse, P. Newman, J. Leonard, and S. Teller, “An Atlas framework
for scalable mapping,” vol. 23, no. 12, pp. 1113–1139, December 2004.
[14] H. Robbins and S. Monro, “A stochastic approximation method,” Annals
of Mathematical Statistics, vol. 22, pp. 400–407, 1951.
[15] M. Montemerlo, “FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association,” Ph.D.
dissertation, Robotics Institute, Carnegie Mellon University, Pittsburgh,
PA, July 2003.
[16] J. R. Shewchuk, “An introduction to the conjugate gradient method
without the agonizing pain,” Pittsburgh, PA, USA, Tech. Rep., 1994.
[17] N. N. Schraudolph, “Local gain adaptation in stochastic gradient descent,
Tech. Rep. IDSIA-09-99, 8, 1999.
[18] G. Strang, Introduction to Linear Algebra. Wellesley-Cambridge Press,
1993.
[19] C. Darken, J. Chang, and J. Moody, “Learning rate schedules
for faster stochastic gradient search,” in Proc. Neural Networks
for Signal Processing 2. IEEE Press, 1992. [Online]. Available:
citeseer.ist.psu.edu/darken92learning.html