On computing the average orientation of vectors and lines

Edwin Olson
University of Michigan
ebolson@umich.edu
http://april.eecs.umich.edu

Abstract— Computing the "average" orientation of lines and rotations is non-trivial due to the need to account for the periodicity of rotation. For example, the average of +10 and +350 degrees is 0 (or 360) degrees, not 180. This problem arises in many disciplines, particularly those in which empirically collected data is processed or filtered. In this paper, we review a common but sub-optimal method for computing average orientation and provide a new geometric interpretation of it. We also propose a new method which provides additional insights into the geometry of the problem. Our new method also produces significantly more accurate results in the regime of operation usually encountered in robotics applications. We characterize this regime and provide guidance regarding which method to use, and when.

I. INTRODUCTION

Given a set of orientations (perhaps the direction of a noisy compass over time), it is often desirable to compute the average orientation. This is not a trivial operation due to the periodicity of rotations. For example, when averaging rotations of +10 degrees and +350 degrees, we want to obtain an answer of 0 (or 360) degrees, rather than 180 degrees. As the number of inputs increases, it becomes more difficult to account for the 2π periodicity of rotations.

Computing average orientations arises in many disciplines. In the specific case of the authors, it arose in a difficult Simultaneous Localization and Mapping (SLAM) problem for which the initial solution was very noisy, but for which the orientation of the robot was known accurately for a handful of poses (due to GPS). Without a reasonable initialization (generally considered to mean heading errors less than π/4), many SLAM algorithms are prone to divergence because the Jacobians do not necessarily point in the direction of the global minimum [1]. By interpolating the orientation of the known points along the robot trajectory, the stability of the method is improved. This operation requires computing weighted averages of the GPS headings.

Closely related to the idea of an "average orientation" is that of a "canonical orientation". In feature descriptor algorithms like SIFT [2] and SURF [3], a histogram of directions is used to find a canonical orientation to describe a pixel patch. The same idea has been applied to LIDAR data [4]. An average orientation could serve a similar role and would have the potential advantage of avoiding the quantization noise that results from constructing the histogram.

Average orientation also arises in evaluation metrics: in judging the performance of a mapping or localization system, it can be useful to characterize the average heading error of the system [5]. The average orientation error also arises in the context of evaluating the dead-reckoning performance of both robotic systems [6] and humans [7]. Human perception of orientation also makes use of directional statistics [8]. More broadly, computing average orientation arises in animal behavior (such as the behavior of homing pigeons), geology (the directions of particles in sediments), meteorology (the direction of the wind), and other fields. These and other examples are discussed in [9], [10].

Fig. 1. Arc versus chord distance. Given an estimate of the average orientation θ̂ and an observation θi, we can consider two distance measures: the arc length (d) and the chord length (c).
Computing averages of orientations falls into the subject of directional statistics, which, despite having several well-written texts [9], [10], is largely treated in an ad-hoc fashion by most application-oriented papers.

This paper begins with a review and analysis of the widely used "vector sum" algorithm. We provide a new geometric interpretation of the algorithm that makes it easier to understand the consequences of its underlying approximations. The central contribution of this paper is a new method for computing average orientation, which we describe in both probabilistic and geometric terms. Our evaluation demonstrates that, for many applications, our approach computes better estimates than the vector sum algorithm. Finally, we discuss an important variant of the average orientation problem that arises when working with undirected lines. These problems have a periodicity of π, rather than 2π. We show how both methods can be adapted to this domain, and compare their performance.

II. THE VECTOR SUM METHOD

The vector sum method is the typical approach used to compute average orientation. We first describe the method and review its properties, then give a new geometric interpretation.

A. Method

Suppose we are given a set of direction measurements θi for i ∈ [1, N]. We first compute the unit vector for each measurement, then sum those vectors:

x = \sum_i \cos(\theta_i), \qquad y = \sum_i \sin(\theta_i)    (1)

Now, we compute the angle of the vector sum using the four-quadrant arc tangent; this is our estimate of the mean orientation:

\bar{\theta} = \arctan(y, x)    (2)

This method produces "reasonable" results; however, its specific properties are not obvious. Many application-oriented papers employ this approach without any additional consideration.
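For concreteness, the following is a minimal Java sketch of Eqs. 1 and 2. The class and method names are illustrative, not the paper's released implementation, and angles are assumed to be in radians.

```java
// Minimal sketch of the vector sum estimator (Eqs. 1-2).
public class VectorSumMean {
    public static double mean(double[] thetas) {
        double x = 0, y = 0;
        for (double t : thetas) {
            x += Math.cos(t);    // Eq. 1: sum of unit-vector components
            y += Math.sin(t);
        }
        return Math.atan2(y, x); // Eq. 2: four-quadrant arc tangent
    }

    public static void main(String[] args) {
        // +10 and +350 degrees average to roughly 0 degrees, not 180.
        double[] obs = { Math.toRadians(10), Math.toRadians(350) };
        System.out.println(Math.toDegrees(mean(obs)));
    }
}
```

A weighted average, of the kind needed for the GPS-heading interpolation mentioned in the introduction, would simply scale each unit vector by its weight before summing.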
One might ask: Is the method optimal? What objective function does it optimize? In particular, many applications assume that angular measurements are contaminated with additive Gaussian noise, i.e., that there is some true orientation θ̂ and that the observations are of the form θi = θ̂ + wi, where wi ∼ N(0, σ²). As we will show in the following sections, the vector sum method is not always an appropriate method under these assumptions. But first, we need to review some of the essential ideas in directional statistics.

B. The wrapped normal distribution

Let us consider more carefully the properties that a probability distribution should have in the case of circular data. Specifically, because an orientation of θ is indistinguishable from a rotation of θ + 2π, the probability density f must satisfy:

f(\theta) = f(\theta + 2\pi)    (3)

Further, suppose we observe θi. This observation is indistinguishable from θi + 2π, θi + 4π, and so on. Our distribution should account for this ambiguity. For example, suppose that we are observing orientations with a mean value of µ = 45 degrees, corrupted by Gaussian noise wi with variance σ². If we observe a value of 50 degrees, it is possible that we have observed a value for which wi = 5, but it is also possible that wi = 365 or, more generally, wi = 5 + 360k degrees for any integer k. All of these events (for different values of k) map to the same observed value, so the density associated with an observation is the sum of the densities over every value of k:

f(\theta) = \frac{1}{\sigma\sqrt{2\pi}} \sum_{k=-\infty}^{\infty} e^{-(\theta + 2\pi k - \mu)^2 / (2\sigma^2)}    (4)

Intuitively, we can imagine the usual Gaussian distribution "wrapped" around the unit circle. Since the Gaussian distribution has infinite support, it wraps around an infinite number of times, adding to the density at every orientation as it goes around. Since the density of a Gaussian distribution drops off rapidly with distance from the mean, the wrapped density is generally dominated by the first few wraps (and often just the first wrap).

As in the non-circular case, we require that a density integrate to 1; however, we only integrate over an interval of width 2π. By Eqn. 3, any contiguous interval of width 2π will do:

\int_{-\pi}^{\pi} f(\theta)\, d\theta = 1    (5)

The wrapped normal distribution is the most rigorous way of representing circular data contaminated by Gaussian noise; however, it is analytically difficult to work with. This is because the density contains a sum: when computing the maximum likelihood value of µ given a set of observations, we typically take the logarithm in order to simplify the expression and solve for µ, but the summation inside the density prevents the logarithm from simplifying.

C. The von Mises distribution

There exists a probability distribution such that, if the wi are distributed according to it, the vector sum method is optimal. This distribution is the von Mises distribution [9]. It is also sometimes called the Circular Normal distribution, but we will not use that name here as it implies a closer relationship to the Normal (Gaussian) distribution than actually exists. The von Mises distribution is:

f(\theta; \mu) = \frac{1}{2\pi I_0(\kappa)} e^{\kappa \cos(\theta - \mu)}    (6)

The κ parameter determines the spread of the distribution, and plays the same role as 1/σ² in a Gaussian distribution. The leading term is simply for normalization; I₀(κ) is the modified Bessel function of the first kind and order zero. For our purposes, only the exponential term is interesting, since it determines the shape of the distribution. It is clearly not of the same form as a Gaussian distribution, which would have a quadratic loss of the form e^{-(\theta - \mu)^2 / (2\sigma^2)}.

The properties of the von Mises distribution are not obvious upon inspection. In contrast, the maximum likelihood estimate of a Gaussian distribution's mean µ is the value that minimizes the squared distance between the mean and each data point. From a geometric point of view, what property does the maximum likelihood solution of the von Mises distribution have?

D. Geometric Interpretation

As we saw above, the vector sum algorithm computes the optimal estimate of θ assuming that the noise is governed by the von Mises distribution. It is not clear, however, what this means in terms of the geometry of the points around the unit circle. In this section, we derive a useful geometric interpretation.

Let us again consider observations θi for i ∈ [1, N]. The unit vectors corresponding to each observation lie on the unit circle. Now, let us find a θ̂ that minimizes the squared distance between θ̂ and each of the θi.

The challenge is to define distance in a way that is computationally convenient. The length of the arc between θ̂ and θi would be exactly right if we assume our noise is normally distributed and that there is no "wrapping" (i.e., that the noise is bounded in magnitude by π); see Fig. 1. However, the arc distance is difficult to incorporate into a closed-form solution. Alternatively, we can use the chord distance (again, see Fig. 1). The chord distance and arc distance are approximately the same when θ̂ and θi are nearby. Thus, we can approximate the maximum likelihood µ (assuming Gaussian noise) by minimizing the squared chord distance.
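As a brief aside (added here for reference, using only the quantities of Fig. 1), the two distances are related by elementary trigonometry, which makes the quality of the approximation explicit:

c = 2\sin\!\left(\tfrac{d}{2}\right), \qquad c^2 = 2 - 2\cos(d), \qquad 2\sin\!\left(\tfrac{d}{2}\right) = d - \frac{d^3}{24} + O(d^5)

so the chord under-estimates the arc only by a cubic term when the angular error d is small; the identity c² = 2 − 2cos(d) is the law-of-cosines relation used at the end of this section.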
Using basic trigonometry, the vector ri between the unit vectors corresponding to θ̂ and θi is:

r_i = \begin{bmatrix} \cos\hat{\theta} \\ \sin\hat{\theta} \end{bmatrix} - \begin{bmatrix} \cos\theta_i \\ \sin\theta_i \end{bmatrix}    (7)

The squared distance between those points is then just r_i^T r_i. We sum this over all observation points, expanding the product:

\chi^2 = \sum_i \left[ \cos^2\hat{\theta} - 2\cos\hat{\theta}\cos\theta_i + \cos^2\theta_i + \sin^2\hat{\theta} - 2\sin\hat{\theta}\sin\theta_i + \sin^2\theta_i \right]    (8)

Noting that cos²(x) + sin²(x) = 1, we can simplify to:

\chi^2 = \sum_i \left[ 2 - 2\cos\hat{\theta}\cos\theta_i - 2\sin\hat{\theta}\sin\theta_i \right]    (9)

We wish to minimize the sum of squared distances, so we differentiate with respect to θ̂ and set the result to zero:

\frac{\partial \chi^2}{\partial \hat{\theta}} = \sum_i \left[ 2\sin\hat{\theta}\cos\theta_i - 2\cos\hat{\theta}\sin\theta_i \right] = 0    (10)

Collecting terms and pulling our unknown θ̂ to the left side, we obtain:

\frac{\sin\hat{\theta}}{\cos\hat{\theta}} = \frac{\sum_i \sin\theta_i}{\sum_i \cos\theta_i}    (11)

\hat{\theta} = \arctan\!\left( \sum_i \sin\theta_i,\ \sum_i \cos\theta_i \right)    (12)

In other words, the estimate of θ̂ that minimizes the squared chord distance is the same as the one produced by the vector sum algorithm. We can relate this directly to the form of the von Mises distribution as well. First, we note that the squared chord length is 2 − 2cos(θi − θ̂); this is easily derived from Fig. 1 using the law of cosines. In minimizing the sum of these costs, the constant 2 is inconsequential and the overall scale does not affect the minimizer: we are left with maximizing Σi cos(θi − θ̂), which is precisely the exponentiated expression in the von Mises distribution.

III. PROPOSED METHOD

We now propose a new method for computing the average orientation. In many cases, this method better approximates the wrapped normal distribution than the von Mises distribution does, while remaining computationally inexpensive.

A. Interpretation

Our method can be described in both geometric and probabilistic terms. In the previous section, we showed that the von Mises distribution corresponds to minimizing the squared chord length associated with each observation. The arc length would be a more accurate measure if we assume Gaussian noise, since the length of the arc is proportional to the difference in orientation. (As a reminder, the maximum likelihood estimator of µ minimizes the sum of squared differences in orientation; this is the reason we use the squared arc length.)

In fact, using the squared arc length is optimal if the noise is bounded by π. When this does not hold, using the arc length is an approximation: for an observation with noise greater than π in magnitude, we will pick the shorter direction around the circle, rather than take the "long way" around the unit circle. This type of error occurs with extremely low probability unless the Gaussian has a very large standard deviation.

Probabilistically, minimizing the squared arc length corresponds to wrapping a normal distribution around the unit circle without overlap, i.e., cutting off the tails of the normal distribution that would otherwise continue to wrap around the unit circle. Because the probability mass in the tails has been removed, the total probability mass would be less than 1.0. We can correct for this by rescaling the distribution:

f(\theta) = \frac{1}{\operatorname{erf}\!\left(\pi / (\sigma\sqrt{2})\right)} \, \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(\theta - \mu)^2 / (2\sigma^2)}    (13)

From an analytic point of view, this density is fairly awkward: it cannot be easily manipulated due to the presence of the error function. However, our method (while producing maximum likelihood estimates for this distribution) never requires us to evaluate the density, so this is not a practical problem.

We have now described the geometric (minimize the squared arc length) and probabilistic (truncate the Gaussian) interpretations of our method. We now show that the maximum likelihood solution can be quickly and exactly computed.
B. Implementation

The trick is to "unwrap" the observations back onto a line, and to use conventional (non-circular) estimators to compute the mean. The challenge is that the position of each observation on the line is not uniquely determined, since it can be freely translated by any multiple of 2π.

Of course, in the minimum-error arrangement, no point will be farther than π away from the mean: otherwise, a lower-error solution could be obtained by moving that point by a multiple of 2π so that it is closer to the mean. Consequently, there exists an arrangement of the observations that spans no more than 2π, and for which all points are within π of the mean (which is still unknown at this point). Supposing that our points were in this arrangement, the simple arithmetic mean of the unwrapped values would be optimal in the sense that the mean squared arc length is minimized.

The basic idea of our method is to construct the possible arrangements of the observations, compute their mean and error, and output the best. At first, it may seem that there are too many possible arrangements of observations: each observation can appear at an infinite number of positions (all spaced 2π apart). By the argument above, however, the only arrangements of consequence are those that span less than 2π. Each of those arrangements has a "left-most" observation, i.e., an observation whose orientation is numerically smaller than all the others. Once a left-most point is identified, the positions of all the other points are fully determined by the constraint that the arrangement spans less than 2π. If we have N observations, then there are only N candidate arrangements. (Each observation takes a turn being the left-most; once it is selected, the positions of all the other observations are uniquely determined.)

Suppose that our observations are mapped to some arbitrary interval of width 2π (e.g., by mapping each point into [0, 2π)). We can construct the N arrangements by repeatedly taking the left-most point and moving it to the right by adding 2π to it; after the move, some other point is the left-most. After N iterations, we will have constructed all N arrangements of the observations.

At each iteration, we compute the mean and squared error. These statistics can be updated incrementally after each iteration by appropriately updating the first and second moments of the observations. For clarity, the first moment M1 and second moment M2 are simply:

M_1 = \sum_i \theta_i    (14)

M_2 = \sum_i \theta_i^2    (15)

We can compute the mean and squared error as:

\hat{\theta} = M_1 / N    (16)

\chi^2 = M_2 - M_1^2 / N    (17)

Now, suppose that we take the left-most point (call it θi) and move it to the far right by adding 2π. We can incrementally update the moments as:

M_1 \leftarrow M_1 + 2\pi    (18)

M_2 \leftarrow M_2 - \theta_i^2 + (\theta_i + 2\pi)^2 = M_2 + 4\pi\theta_i + (2\pi)^2    (19)

We simply repeat the process of moving the left-most point to the right N times, and record the mean that minimizes the squared error. The complete algorithm is given in Alg. 1.
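As a complement to the pseudocode in Alg. 1 below, the following is a compact Java sketch of the same procedure. The identifiers are hypothetical and this is not the implementation released on the paper's web site; the incremental updates follow Eqs. 14-19, and angles are assumed to be in radians.

```java
import java.util.Arrays;

// Sketch of the proposed estimator: minimize the squared arc length by
// trying each of the N "left-most point" arrangements (cf. Alg. 1).
public class ArcLengthMean {
    public static double mean(double[] thetasIn) {
        int n = thetasIn.length;
        double[] t = new double[n];
        double m1 = 0, m2 = 0;                  // first and second moments
        for (int i = 0; i < n; i++) {
            t[i] = mod2pi(thetasIn[i]);         // map into [0, 2*pi)
            m1 += t[i];
            m2 += t[i] * t[i];
        }
        Arrays.sort(t);                         // left-to-right order

        double bestMean = 0, bestErr = Double.POSITIVE_INFINITY;
        for (int i = 0; i <= n; i++) {
            if (i >= 1) {
                // Move the current left-most observation to the far right (Eqs. 18-19).
                m1 += 2 * Math.PI;
                m2 += 4 * Math.PI * t[i - 1] + 4 * Math.PI * Math.PI;
            }
            double mean = m1 / n;                                // Eq. 16
            double err = m2 - 2 * m1 * mean + n * mean * mean;   // sum of squared deviations
            if (err < bestErr) {
                bestErr = err;
                bestMean = mean;
            }
        }
        return mod2pi(bestMean);
    }

    private static double mod2pi(double a) {
        double twoPi = 2 * Math.PI;
        double r = a % twoPi;
        return (r < 0) ? r + twoPi : r;
    }
}
```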
Algorithm 1 Compute mean theta using squared arc length
 1: M1 = 0
 2: M2 = 0
 3: {Map all points to [0, 2π) and initialize the moments}
 4: for i = 1 to N do
 5:   θi = mod2pi(θi)
 6:   M1 = M1 + θi
 7:   M2 = M2 + θi²
 8: end for
 9: Sort the θi into increasing order
10: σ² = ∞
11: for i = 0 to N do
12:   if i ≥ 1 then
13:     {Move the ith observation to the right by 2π}
14:     M1 = M1 + 2π
15:     M2 = M2 + 4πθi + (2π)²
16:   end if
17:   θ̂i = M1/N
18:   σi² = M2 − 2·M1·θ̂i + N·θ̂i²
19:   if σi² < σ² then
20:     θ̂ = θ̂i
21:     σ² = σi²
22:   end if
23: end for
24: return θ̂

Our proposed method exactly computes the maximum likelihood estimate of µ assuming that the noise is distributed according to the truncated Gaussian distribution. As we have also shown, this is equivalent to minimizing the squared arc length.

IV. EVALUATION

In this section, we evaluate the performance of the vector sum algorithm and the proposed method.

A. Approximation accuracy

Both algorithms attempt to approximate the wrapped normal distribution. The distributions are plotted for σ = 0.5, 1.0, 2.0 in Fig. 2. For small σ, both algorithms closely approximate the wrapped normal distribution. As the observation noise increases, however, there are two distinct regimes:

• At moderate noise levels (σ < 1.6), the proposed method out-performs the vector sum method. Intuitively, this is because the "chord" distance takes a short cut across the circle, placing too much probability mass away from the mean.

• At higher noise levels (σ > 1.6), the vector sum method out-performs the proposed method. At these high noise levels, it becomes more likely that observations have more than π of noise. In other words, the "wrapping" of the wrapped normal distribution starts to flatten out the distribution, and the flatter behavior of the von Mises distribution (due to its "short cuts" through the interior of the unit circle) is a better fit.

Fig. 2. Probability density functions for σ = 0.5, 1.0, and 2.0. For small sigmas, the proposed distribution (formed by truncating the Normal distribution) is significantly closer to the wrapped Gaussian than the von Mises distribution (left two figures). The relative improvement of the proposed method increases with σ, until significant probability mass begins to wrap around (right). Also see Fig. 3.

Fig. 3. KL divergence as a function of σ. The KL divergence from the wrapped normal (ideal) distribution to the proposed and von Mises distributions shows the relative accuracy of the two approximations. The proposed distribution not only has significantly better worst-case performance, but also has significantly less error in the low-variance regime (σ < 1.639) in which most applications operate.

Fig. 4. Chord versus arc error as a function of σ. For each value of σ, a Monte Carlo simulation was run in which 50 directional observations were generated. Both the chord (vector sum) and arc (proposed) methods were used to estimate the original mean. The sample standard deviation measures the performance of the estimator; the difference between the two is shown. Values below zero indicate that the proposed method works better for that σ; values above zero indicate that the chord approximation is better. As predicted by the KL divergence results, the cross-over point occurs at around σ = 1.5, or about 85 degrees.

Fig. 2 shows the probability densities at three discrete values of σ. Alternatively, we can compute the KL divergence [11] between the wrapped normal distribution and the von Mises or truncated Gaussian.
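The following Java sketch indicates one way such a comparison could be set up numerically; it is not the code used to produce Fig. 3. The grid resolution, the number of wraps retained from Eq. 4, and the mapping κ = 1/σ² are assumptions made for illustration, and every density is normalized numerically on the grid so that no special functions are needed.

```java
// Numerically compare the wrapped normal density against the von Mises and
// truncated-Gaussian approximations via KL divergence on a grid over [-pi, pi).
public class CircularKL {
    static final int GRID = 2000;
    static final int K = 20;  // wraps retained in the wrapped normal sum

    // Returns { KL(wrapped || vonMises), KL(wrapped || truncated) } for a given sigma.
    public static double[] klVersusWrapped(double sigma) {
        double mu = 0;
        double[] wrapped = new double[GRID];
        double[] vonMises = new double[GRID];
        double[] truncated = new double[GRID];
        double dtheta = 2 * Math.PI / GRID;

        for (int i = 0; i < GRID; i++) {
            double theta = -Math.PI + i * dtheta;
            for (int k = -K; k <= K; k++) {
                double d = theta + 2 * Math.PI * k - mu;
                wrapped[i] += Math.exp(-d * d / (2 * sigma * sigma));
            }
            vonMises[i] = Math.exp(Math.cos(theta - mu) / (sigma * sigma)); // kappa = 1/sigma^2
            double d = theta - mu;
            truncated[i] = Math.exp(-d * d / (2 * sigma * sigma));          // tails beyond +/- pi cut off
        }
        normalize(wrapped, dtheta);
        normalize(vonMises, dtheta);
        normalize(truncated, dtheta);
        return new double[] { kl(wrapped, vonMises, dtheta), kl(wrapped, truncated, dtheta) };
    }

    static void normalize(double[] p, double dtheta) {
        double z = 0;
        for (double v : p) z += v * dtheta;
        for (int i = 0; i < p.length; i++) p[i] /= z;
    }

    static double kl(double[] p, double[] q, double dtheta) {
        double sum = 0;
        for (int i = 0; i < p.length; i++)
            if (p[i] > 0) sum += p[i] * Math.log(p[i] / q[i]) * dtheta;
        return sum;
    }
}
```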
As shown in Fig. 3, the KL divergence between the wrapped normal distribution and the von Mises distribution peaks at around σ = 1, then becomes smaller again. The truncated Gaussian distribution has significantly lower divergence in the low-to-moderate noise regime, but does worse in the high-noise regime.

B. Empirical performance

Our KL divergence data suggests that the truncated Gaussian is a better approximation of the wrapped normal distribution for σ ∈ [0, 1.6]. To verify this, we conducted a Monte Carlo simulation in which we generated noisy observations and estimated the mean using both methods. The error for both methods was compared and plotted in Fig. 4; negative values indicate that the vector sum algorithm had a larger error than the proposed method, and positive values indicate the opposite. The curve is very smooth due to the very large number of trials.

This data reveals a surprising result: while the performance of the proposed algorithm is indeed better for low-to-moderate σ, the vector sum method does much better in the high-noise regime. The KL divergence analysis was approximately correct about the cross-over point (where the vector sum method surpasses the proposed method), but the magnitude of the KL divergence was not especially predictive of the Monte Carlo results. Still, the performance of the proposed method is demonstrably better than the vector sum method over the most useful range of σ; it is not until σ = 1.6 (a standard deviation of roughly 90 degrees) that the vector sum method surpasses it. For maximum accuracy over the broadest possible range of σ, one could use both methods, switching between the two based on the estimate of σ.

C. Line variant

We have so far considered the case of observations with a periodicity of 2π, such as compass readings. In some applications, the data may have a periodicity of π, such as when estimating the average orientation of the walls in a room from LIDAR data. Both methods can be easily adapted to this case.

In the case of our proposed method, the changes are trivial: the scale factor is adjusted so that the density integrates to 1 over the smaller interval [−π/2, π/2].

In the case of the vector sum algorithm, the orientations of the observations can be multiplied by two, rescaling the problem so that it has a periodicity of 2π. This has an interesting consequence, as the probability density function becomes:

f(\theta; \mu) = \frac{1}{\pi I_0(\kappa)} e^{\kappa \cos(2(\theta - \mu))}    (20)

Note that the factor of two is taken inside the cosine expression; the resulting density is not a simple rescaling of the von Mises distribution.

To examine the effect of these modifications, we again plotted the distributions for different values of σ (see Fig. 5) and the KL divergence (see Fig. 6). Note that, in computing κ for the von Mises distribution, we used 1/(2σ)² instead of the original 1/σ², in agreement with the variance of a random variable that has been multiplied by two. Interestingly, the results from this case mirror almost exactly those of the 2π case.

Fig. 5. Probability density functions for σ = 0.5, 1.0, and 2.0 (line case). See also Fig. 6.

Fig. 6. KL divergence as a function of σ (line case).
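For concreteness, here is a minimal Java sketch of the doubled-angle adaptation of the vector sum method described above; the names are illustrative, angles are in radians, and the result is reported in [0, π). The proposed method would instead be run unchanged but with a period of π rather than 2π.

```java
// Average orientation of undirected lines (period pi) via angle doubling.
public class LineMean {
    public static double mean(double[] lineAngles) {
        double x = 0, y = 0;
        for (double t : lineAngles) {
            x += Math.cos(2 * t);   // doubling maps period pi onto period 2*pi
            y += Math.sin(2 * t);
        }
        double doubled = Math.atan2(y, x);
        double mean = 0.5 * doubled;               // back to the original scale
        return (mean < 0) ? mean + Math.PI : mean; // report in [0, pi)
    }
}
```

For example, lines at 10 and 170 degrees (which are symmetric about 0/180) average to 0 degrees, as desired, rather than 90.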
D. Complexity analysis

From a complexity standpoint, the proposed method has additional costs compared to the simpler vector sum algorithm. Both algorithms conceptually handle the observations one at a time, but the proposed method needs them in sorted (left-to-right) order, which incurs an O(N log N) cost. In addition, this requires the observations to be stored in memory; the vector sum algorithm can be applied to datasets of virtually unbounded size since the data does not need to be stored.

Algorithm          Computation    Memory
Vector sum         O(N)           O(1)
Proposed method    O(N log N)     O(N)

E. Runtime performance

While the asymptotic computational complexity of our method is higher than that of the vector sum algorithm, the real-world difference is quite small. For example, our Java implementation of the vector sum method takes 0.30 ms for 1280 points; our method takes 0.35 ms for the same data. Additional runtime data is shown in Fig. 7.

We attribute the similarity in performance (despite the difference in asymptotic costs) to two factors: the vector sum operation uses relatively expensive trigonometric operations, whereas our proposed method uses only simple arithmetic operations; and the sorting method used is the standard Arrays.sort method, which relies only on simple comparisons and is well optimized.

Fig. 7. Computational complexity. Despite having different asymptotic complexities, the runtimes of the two methods are similar.

V. CONCLUSION

This paper considers the problem of computing average orientation, a problem made difficult by the periodicity of the data. In addition to providing an introduction to the commonly used vector sum method, we provide a new geometric interpretation for that method. We also propose and evaluate a new method that better approximates the ideal wrapped normal distribution. Implementations of our algorithm are available from our web site http://april.eecs.umich.edu/.

This research was supported by U.S. DoD grant W56HZV-04-2-0001.

REFERENCES

[1] E. Olson, J. Leonard, and S. Teller, "Fast iterative optimization of pose graphs with poor initial estimates," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2006, pp. 2262–2269.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
[3] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proceedings of the European Conference on Computer Vision (ECCV), 2006, pp. 404–417.
[4] R. Zlot and M. Bosse, "Place recognition using keypoint similarities in 2D lidar maps," in International Symposium on Experimental Robotics, 2008.
[5] A. I. Mourikis and S. I. Roumeliotis, "Performance analysis of multirobot cooperative localization," IEEE Transactions on Robotics, vol. 22, no. 4, pp. 666–681, Aug. 2006.
[6] J. Borenstein and L. Feng, "Gyrodometry: A new method for combining data from gyros and odometry in mobile robots," in Proceedings of the 1996 IEEE International Conference on Robotics and Automation, 1996, pp. 423–428.
[7] E. Hunt and D. Waller, "Orientation and wayfinding: A review," Tech. Rep., 1999.
[8] H. Choo and S. Franconeri, "Perception of average orientation," Journal of Vision, vol. 9, no. 8, 2009.
[9] S. R. Jammalamadaka and A. SenGupta, Topics in Circular Statistics (Multivariate Analysis Vol. 5). World Scientific, 2001.
[10] K. V. Mardia and P. E. Jupp, Directional Statistics. Wiley, 2000.
[11] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2007.