AprilTag: A robust and flexible visual fiducial system

Edwin Olson
University of Michigan
ebolson@umich.edu
http://april.eecs.umich.edu

Abstract— While the use of naturally-occurring features is a central focus of machine perception, artificial features (fiducials) play an important role in creating controllable experiments, ground truthing, and in simplifying the development of systems where perception is not the central objective. We describe a new visual fiducial system that uses a 2D bar code style “tag”, allowing full 6 DOF localization of features from a single image. Our system improves upon previous systems, incorporating a fast and robust line detection system, a stronger digital coding system, and greater robustness to occlusion, warping, and lens distortion. While similar in concept to the ARTag system, our method is fully open and the algorithms are documented in detail.

I. INTRODUCTION

Visual fiducials are artificial landmarks designed to be easy to recognize and distinguish from one another. Although related to other 2D barcode systems such as QR codes [1], they have significantly different goals and applications. With a QR code, a human is typically involved in aligning the camera with the tag and photographing it at fairly high resolution, obtaining hundreds of bytes of data such as a web address. In contrast, a visual fiducial has a small information payload (perhaps 12 bits), but is designed to be automatically detected and localized even when it is at very low resolution, unevenly lit, oddly rotated, or tucked away in the corner of an otherwise cluttered image. Aiding their detection at long ranges, visual fiducials are comprised of many fewer data cells: the alignment markers of a QR tag comprise about 268 pixels (not including required headers or the payload), whereas the visual fiducials described in this paper range from about 49 to 100 pixels, including the payload.

Unlike 2D barcode systems in which the position of the barcode in the image is unimportant, visual fiducial systems provide the camera-relative position and orientation of a tag. Fiducial systems are also designed to detect multiple markers in a single image.

Visual fiducial systems are perhaps best known for their application to augmented reality, which spurred the development of several popular systems including ARToolkit [2] and ARTag [3]. Real-world objects can be augmented with visual fiducials, allowing virtually-generated imagery to be superimposed. Similarly, visual fiducials can be used for basic motion capture [4].

Visual fiducial systems have been used to improve human/robot interaction, allowing humans to signal commands (such as “follow me” or “wait here”) by flashing an appropriate card to a robot [5]. Planar tags have also been used to generate user interfaces that overlay robots’ plans and task assignments onto a head-mounted display [6].

Fig. 1. Example detections. This paper describes a visual fiducial system based on 2D planar targets. The detector is robust to lighting variation and occlusions and produces accurate localizations of the tags.

Performance evaluation and benchmarking of robot systems have become central issues for the research community; visual fiducials are particularly helpful in this domain. For example, fiducials can be used to generate ground-truth robot trajectories and close control loops [7]. Similarly, artificial features can make it possible to evaluate Simultaneous Localization and Mapping (SLAM) algorithms under controlled conditions [8].
Robotics applications have led to the development of additional tag detection systems [9], [10]. Designing robust fiducials while minimizing the size required is challenging both from a marker detection standpoint (which pixels in the image correspond to a tag?) and from an error-tolerant data coding standpoint (which tag is it?).

In this paper, we describe a new visual fiducial system that significantly improves performance over previous systems. The central contributions of this paper are:

• We describe a method for robustly detecting visual fiducials. We propose a graph-based image segmentation algorithm based on local gradients that allows lines to be precisely estimated. We also describe a quad extraction method that can handle significant occlusions.
• We demonstrate that our detection system provides significantly better localization accuracy than previous systems.
• We describe a new coding system that addresses problems unique to 2D barcoding systems: robustness to rotation, and robustness to false positives arising from natural imagery. As demonstrated by our experimental results, our coding system provides significant theoretical and real-world benefits over previous work.
• We specify and provide results on a set of benchmarks which will allow better comparisons of fiducial systems in the future.

In contrast to previous methods (including ARTag and Studierstube Tracker), our implementation is released under an Open Source license, and its algorithms and implementation are well documented. The closed nature of these other systems was a challenge for our experimental evaluation. For the comparisons in this paper, we have used the limited publicly-available information to enable as many objective comparisons as possible. On the other hand, ARToolkitPlus is open source, and so we were able to make a more detailed comparison. In addition to our detection code, we are also making our evaluation code available in order to make it easier for future authors to perform comparisons.

In the next section, we review related work. We describe our method in the following two sections: the tag detector in Section 3, and the coding system in Section 4. In Section 5, we provide an experimental evaluation of our methods, comparing them to previous algorithms.

II. PREVIOUS WORK

ARToolkit [11] was among the first tag tracking systems, and was targeted at artificial reality applications. Like the systems that would follow, its tags contained a square-shaped payload surrounded by a black border. It differed, however, in that its payload was not directly encoded in binary: instead, it used symbols such as the Latin character ’A’. When decoding a tag, the payload of the tag (sampled at high resolution) was correlated against a database of known tags, with the best-correlating tag reported to the user.

A major disadvantage of this approach is the computational cost associated with decoding tags, since each template required a separate, slow correlation operation. A second disadvantage is that it is difficult to generate templates that are approximately orthogonal to each other.

The tag detection scheme used by ARToolkit is based on a simple binarization of the input image using a user-specified threshold. This scheme is very fast, but not robust to changes in illumination. In general, ARToolkit’s detector cannot handle even modest occlusions of the tag’s border.

ARTag [3] provided improved detection and coding schemes. Like our own approach, the detection mechanism was based on the image gradient, making it robust to changes in lighting. While the details of the detector algorithm are not public, ARTag’s detection mechanism is able to detect tags whose border is partially occluded. ARTag also provided the first coding system based on forward error correction, which made tags easier to generate, faster to correlate, and provided greater orthogonality between tags.

The performance of ARTag inspired several improvements to ARToolkit, which evolved into ARToolkitPlus [2] and finally Studierstube Tracker [12]. These versions introduced digitally-encoded payloads like those used in ARTag. Despite being later work, our experiments show that these encoding systems do not perform as well as that used by ARTag, which in turn is outperformed by our coding system.

In addition to monochrome tags, other coding systems have been developed. For example, color information has been used to increase the amount of information that can be encoded [13], [14]. Tags using retro-reflectors [15] have also been used. A particularly interesting approach is that used by Bokode [16], which exploits the bokeh effect to detect extremely small tags by intentionally defocusing the camera.

Fig. 2. Input image. This paper describes the processing of this sample image, which contains two tags. This example is purposefully simple for explanatory reasons, though note that the tags are not rigidly planar. See Fig. 1 for a more challenging result.

Besides two-dimensional barcodes, a number of other artificial landmarks have been developed. Overhead cameras have been used to track robots equipped with blinking LEDs [17]. In contrast, the NorthStar system puts the visual fiducials on the ceiling [18]. Two-dimensional planar systems, like the one described in this paper, have two main advantages over LED-based systems: the targets can be cheaply printed on a standard printer, and they provide 6-DOF position estimation without the need for multiple LEDs.

III. DETECTOR

Our system is composed of two major components: the tag detector and the coding system. In this section, we describe the detector, whose job is to estimate the position of possible tags in an image. Loosely speaking, the detector attempts to find four-sided regions (“quads”) that have a darker interior than their exterior. The tags themselves have black and white borders in order to facilitate this (see Fig. 2).

The detection process is comprised of several distinct phases, which are described in the following subsections and illustrated using the example shown in Fig. 2. Note that the quad detector is designed to have a very low false negative rate, and consequently has a high false positive rate. We rely on the coding system (described in the next section) to reduce this false positive rate to useful levels.

Fig. 3. Early processing steps. The tag detection algorithm begins by computing the gradient at every pixel, computing their magnitudes (first) and direction (second). Using a graph-based method, pixels with similar gradient directions and magnitudes are clustered into components (third). Using weighted least squares, a line segment is then fit to the pixels in each component (fourth). The direction of the line segment is determined by the gradient direction, so that segments are dark on the left, light on the right. The directions of the lines are visualized by short perpendicular “notches” at their midpoints; note that these “notches” always point towards the lighter region.
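For readers who want to follow along in code, the following is a minimal, hypothetical sketch (not the released AprilTag implementation) of the first step illustrated in Fig. 3: computing the per-pixel gradient magnitude and direction. The row-major float-array image layout is an assumption made for this example.

```java
// Hypothetical sketch of the per-pixel gradient computation (Fig. 3, first two panels).
// Assumes a grayscale image stored row-major in a float array; not the released code.
public class GradientSketch {
    /** Returns {magnitude, direction}, each of length width*height. */
    public static float[][] gradients(float[] im, int width, int height) {
        float[] mag = new float[width * height];
        float[] dir = new float[width * height];
        for (int y = 1; y + 1 < height; y++) {
            for (int x = 1; x + 1 < width; x++) {
                // simple central differences; border pixels are left at zero
                float gx = im[y * width + (x + 1)] - im[y * width + (x - 1)];
                float gy = im[(y + 1) * width + x] - im[(y - 1) * width + x];
                mag[y * width + x] = (float) Math.hypot(gx, gy);
                dir[y * width + x] = (float) Math.atan2(gy, gx);
            }
        }
        return new float[][] { mag, dir };
    }
}
```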
A. Detecting line segments

Our approach begins by detecting lines in the image. Similar in basic approach to the ARTag detector, it computes the gradient direction and magnitude at every pixel (see Fig. 3) and agglomeratively clusters the pixels into components with similar gradient directions and magnitudes.

The clustering algorithm is similar to the graph-based method of Felzenszwalb [19]: a graph is created in which each node represents a pixel. Edges are added between adjacent pixels with an edge weight equal to the pixels’ difference in gradient direction. These edges are then sorted and processed in order of increasing edge weight: for each edge, we test whether the connected components that the pixels belong to should be joined together.

Given a component n, we denote the range of gradient directions as D(n) and the range of magnitudes as M(n). Put another way, D(n) and M(n) are scalar values representing the difference between the maximum and minimum values of the gradient direction and magnitude, respectively. In the case of D(), some care must be taken to handle 2π wrap-around. However, since useful edges will have a span of much less than π, this is straightforward. Given two components n and m, we join them together if both of the conditions below are satisfied:

    D(n ∪ m) ≤ min(D(n), D(m)) + K_D / |n ∪ m|
    M(n ∪ m) ≤ min(M(n), M(m)) + K_M / |n ∪ m|        (1)

The conditions are adapted from [19] and can be intuitively understood: small values of D() and M() indicate components with little intra-component variation. Two clusters are joined together if their union is about as uniform as the clusters taken individually. A modest increase in intra-component variation is permitted via the K_D and K_M parameters; however, this allowance rapidly shrinks as the components become larger. During early iterations, the K parameters essentially allow each component to “learn” its intra-cluster variation. In our experiments, we used K_D = 100 and K_M = 1200, though the algorithm works well over a broad range of values.

For performance reasons, the edge weights are quantized and stored as fixed-point numbers. This allows the edges to be sorted using a linear-time counting sort [20]. The actual merging operation can be efficiently carried out by the union-find algorithm [20], with the upper and lower bounds of gradient direction and magnitude stored in a simple array indexed by each component’s representative member.
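To make the merge criterion concrete, here is a minimal sketch of Eqn. 1 implemented on top of union-find, with per-component bounds on gradient direction and magnitude. It is illustrative rather than the paper's released code: edge construction, the counting sort, and the 2π wrap-around handling are omitted, and the flat per-pixel arrays are an assumed representation.

```java
// Sketch of the Eqn. 1 merge test on top of union-find (illustrative assumptions only).
// K_D and K_M follow the values quoted in the text.
public class GradientClusterSketch {
    static final double KD = 100, KM = 1200;

    int[] parent, size;
    double[] minDir, maxDir, minMag, maxMag;

    GradientClusterSketch(double[] dir, double[] mag) {
        int n = dir.length;
        parent = new int[n]; size = new int[n];
        minDir = dir.clone(); maxDir = dir.clone();
        minMag = mag.clone(); maxMag = mag.clone();
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    int find(int a) {                       // representative lookup with path halving
        while (parent[a] != a) { parent[a] = parent[parent[a]]; a = parent[a]; }
        return a;
    }

    /** Try to merge the components containing pixels a and b (Eqn. 1); called in edge order. */
    boolean tryJoin(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return false;

        double dUnion = Math.max(maxDir[ra], maxDir[rb]) - Math.min(minDir[ra], minDir[rb]);
        double mUnion = Math.max(maxMag[ra], maxMag[rb]) - Math.min(minMag[ra], minMag[rb]);
        double dA = maxDir[ra] - minDir[ra], dB = maxDir[rb] - minDir[rb];
        double mA = maxMag[ra] - minMag[ra], mB = maxMag[rb] - minMag[rb];
        int unionSize = size[ra] + size[rb];

        // join only if the union is about as uniform as the parts (Eqn. 1)
        if (dUnion > Math.min(dA, dB) + KD / unionSize) return false;
        if (mUnion > Math.min(mA, mB) + KM / unionSize) return false;

        parent[rb] = ra;                    // merge rb into ra and update the stored bounds
        size[ra] = unionSize;
        minDir[ra] = Math.min(minDir[ra], minDir[rb]);
        maxDir[ra] = Math.max(maxDir[ra], maxDir[rb]);
        minMag[ra] = Math.min(minMag[ra], minMag[rb]);
        maxMag[ra] = Math.max(maxMag[ra], maxMag[rb]);
        return true;
    }
}
```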
This gradient-based clustering method is sensitive to noise in the image: even modest amounts of noise will cause local gradient directions to vary, inhibiting the growth of the components. The solution to this problem is to low-pass filter the image [19], [21]. Unlike other problem domains where this filtering can blur useful information in the image, the edges of a tag are intrinsically large-scale features (particularly in comparison to the data field), and so this filtering does not cause information loss. We recommend a value of σ = 0.8.

Once the clustering operation is complete, line segments are fit to each connected component using a traditional least-squares procedure, weighting each point by its gradient magnitude (see Fig. 3). We adjust each line segment so that the dark side of the line is on its left, and the light side is on its right. In the next phase of processing, this allows us to enforce a winding rule around each quad.

The segmentation algorithm is the slowest phase in our detection scheme. As an option, this segmentation can be performed at half the image resolution with a 4x improvement in speed. The sub-sampling operation can be efficiently combined with the recommended low-pass filter. The consequence of this optimization is a modestly reduced detection range, since very small quads may no longer be detected.

B. Quad detection

At this point, a set of directed line segments has been computed for an image. The next task is to find sequences of line segments that form a 4-sided shape, i.e., a quad. The challenge is to do this while being as robust as possible to occlusions and noise in the line segmentations.

Our approach is based on a recursive depth-first search with a depth of four: each level of the search tree adds an edge to the quad. At depth one, we consider all line segments. At depths two through four, we consider all of the line segments that begin “close enough” to where the previous line segment ended and which obey a counter-clockwise winding order. Robustness to occlusions and segmentation errors is handled by adjusting the “close enough” threshold: by making the threshold large, significant gaps around the edges can be handled. Our threshold for “close enough” is twice the length of the line plus five additional pixels. This is a large threshold which leads to a low false negative rate, but also results in a high false positive rate.

We populate a two-dimensional lookup table to accelerate queries for line segments that begin near a point in space. With this optimization, along with early rejection of candidate quads that do not obey the winding rule or which use a segment more than once, the quad detection algorithm represents a small fraction of the total computational requirements.

Once four lines have been found, a candidate quad detection is created. The corners of this quad are the intersections of the lines that comprise it. Because the lines are fit using data from many pixels, these corner estimates are accurate to a small fraction of a pixel.

Fig. 4. Quad detection and sampling. Four quads are detected in the image (which contains two tags). The third detected quad corresponds to three of the edges of the foreground tag plus the edge of the paper (see Fig. 2). A fourth quad is detected around one of the payload bits of the larger tag. These two extraneous detections are eventually discarded because their payload is invalid. The white dots correspond to samples around the tag’s border which are used to fit a linear model of the intensity of “white” pixels; a model is similarly fit for the black pixels. These two models are used to threshold the data payload bits, shown as yellow dots.

C. Homography and extrinsics estimation

We compute the 3×3 homography matrix that projects 2D points in homogeneous coordinates from the tag’s coordinate system (in which [0 0 1]^T is at the center of the tag and the tag extends one unit in the x and y directions) to the 2D image coordinate system. The homography is computed using the Direct Linear Transform (DLT) algorithm [22]. Note that since the homography projects points in homogeneous coordinates, it is defined only up to scale. Computation of the tag’s position and orientation requires additional information: the camera’s focal length and the physical size of the tag.
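As a rough illustration of this step, the sketch below estimates the homography from the four detected corners. For brevity it solves the inhomogeneous 8×8 system obtained by fixing h22 = 1 instead of the normalized, SVD-based DLT of [22], so it should be read as a simplified stand-in; the tag-corner ordering is likewise an assumption.

```java
// Simplified homography-from-4-corners sketch (inhomogeneous variant, h22 fixed to 1).
// Not the paper's DLT implementation; degenerate if the true h22 is zero.
public class HomographySketch {
    /** src = tag corners, e.g. {(-1,-1),(1,-1),(1,1),(-1,1)}; dst = detected image corners. */
    public static double[][] estimate(double[][] src, double[][] dst) {
        double[][] A = new double[8][9];          // 8x8 system, augmented with the right-hand side
        for (int i = 0; i < 4; i++) {
            double x = src[i][0], y = src[i][1], u = dst[i][0], v = dst[i][1];
            A[2 * i]     = new double[] { x, y, 1, 0, 0, 0, -u * x, -u * y, u };
            A[2 * i + 1] = new double[] { 0, 0, 0, x, y, 1, -v * x, -v * y, v };
        }
        double[] h = solve(A);                    // h00..h21, with h22 = 1
        return new double[][] { { h[0], h[1], h[2] }, { h[3], h[4], h[5] }, { h[6], h[7], 1 } };
    }

    // Gaussian elimination with partial pivoting on the augmented 8x9 matrix.
    static double[] solve(double[][] A) {
        int n = 8;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
            double[] tmp = A[col]; A[col] = A[piv]; A[piv] = tmp;
            for (int r = col + 1; r < n; r++) {
                double f = A[r][col] / A[col][col];
                for (int c = col; c <= n; c++) A[r][c] -= f * A[col][c];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {        // back-substitution
            double s = A[r][n];
            for (int c = r + 1; c < n; c++) s -= A[r][c] * x[c];
            x[r] = s / A[r][r];
        }
        return x;
    }
}
```

The resulting matrix maps tag-relative homogeneous points to image coordinates and plays the role of the h terms in Eqn. 2 below.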
The 3 × 3 homography matrix (computed by the DLT) can be written as the product of the 3 × 4 camera projection matrix P (which we assume is known) and the 4 × 3 truncated extrinsics matrix E. Extrinsics matrices are typically 4 × 4, but every position on the tag is at z = 0 in the tag’s coordinate system. Thus, we can rewrite every tag coordinate as a 2D homogeneous point with z implicitly zero, and remove the third column of the extrinsics matrix, forming the truncated extrinsics matrix. We represent the rotation components of E as Rij and the translation components as Tk. We also represent the unknown scale factor as s:

    [ h00 h01 h02 ]             [ 1/fx   0    0   0 ] [ R00 R01 Tx ]
    [ h10 h11 h12 ]  =  sPE = s [  0    1/fy  0   0 ] [ R10 R11 Ty ]        (2)
    [ h20 h21 h22 ]             [  0     0    1   0 ] [ R20 R21 Tz ]
                                                      [  0    0   1 ]

Note that we cannot directly solve for E because P is rank deficient. We can expand the right-hand side of Eqn. 2 and write the expression for each hij as a set of simultaneous equations:

    h00 = sR00/fx
    h01 = sR01/fx
    h02 = sTx/fx                                                            (3)
    ...

These are all easily solved for the elements of Rij and Tk except for the unknown scale factor s. However, since the columns of a rotation matrix must all be of unit magnitude, we can constrain the magnitude of s. We have two columns of the rotation matrix, so we compute s as the geometric average of their magnitudes. The sign of s can be recovered by requiring that the tag appear in front of the camera, i.e., that Tz < 0. The third column of the rotation matrix can be recovered by computing the cross product of the two known columns, since the columns of a rotation matrix must be orthonormal.

The DLT procedure and the normalization procedure above do not guarantee that the rotation matrix is strictly orthonormal. To correct this, we compute the polar decomposition of R, which yields a proper rotation matrix while minimizing the Frobenius matrix norm of the error [23].

IV. PAYLOAD DECODING

The final task is to read the bits from the payload field. We do this by computing the tag-relative coordinates of each bit field, transforming them into image coordinates using the homography, and then thresholding the resulting pixels. In order to be robust to lighting (which can vary not only from tag to tag, but also within a tag), we use a spatially-varying threshold. Specifically, we build a spatially-varying model of the intensity of “black” pixels, and a second model for the intensity of “white” pixels. We use the border of the tag, which contains known examples of both white and black pixels, to learn this model (see Fig. 4). We use the following intensity model:

    I(x, y) = Ax + Bxy + Cy + D        (4)

This model has four parameters which are easily computed using least-squares regression. We build two such models, one for black, the other for white. The threshold used when decoding data bits is then just the average of the predicted intensity values of the black and white models.
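A small sketch of this thresholding step, under assumptions of my own (the border samples are assumed to be gathered elsewhere, and the basis ordering is illustrative; this is not the released implementation): the two models of Eqn. 4 are fit by least squares via the 4×4 normal equations, and a payload sample is classified against their midpoint.

```java
// Sketch of the spatially-varying intensity model of Eqn. 4 and the resulting threshold.
public class IntensityModelSketch {
    /** xs, ys are tag-relative sample coordinates; vals are the observed intensities. */
    public static double[] fit(double[] xs, double[] ys, double[] vals) {
        double[][] ata = new double[4][5];                       // normal equations, augmented
        for (int i = 0; i < xs.length; i++) {
            double[] row = { xs[i], xs[i] * ys[i], ys[i], 1.0 }; // basis [x, xy, y, 1]
            for (int r = 0; r < 4; r++) {
                for (int c = 0; c < 4; c++) ata[r][c] += row[r] * row[c];
                ata[r][4] += row[r] * vals[i];
            }
        }
        return solve4(ata);                                      // returns {A, B, C, D}
    }

    public static double predict(double[] m, double x, double y) {
        return m[0] * x + m[1] * x * y + m[2] * y + m[3];
    }

    /** Decode one bit: treat the sample as "white" if it is brighter than the midpoint. */
    public static boolean decodeBit(double[] black, double[] white, double x, double y, double observed) {
        double threshold = 0.5 * (predict(black, x, y) + predict(white, x, y));
        return observed > threshold;
    }

    // Tiny Gaussian elimination for the augmented 4x5 system.
    static double[] solve4(double[][] A) {
        int n = 4;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
            double[] t = A[col]; A[col] = A[piv]; A[piv] = t;
            for (int r = col + 1; r < n; r++) {
                double f = A[r][col] / A[col][col];
                for (int c = col; c <= n; c++) A[r][c] -= f * A[col][c];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = A[r][n];
            for (int c = r + 1; c < n; c++) s -= A[r][c] * x[c];
            x[r] = s / A[r][r];
        }
        return x;
    }
}
```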
V. CODING SYSTEM

Once the data payload is decoded from a quad, it is the job of the coding system to determine whether it is valid or not. The goals of a coding system are to:

• Maximize the number of distinguishable codes
• Maximize the number of bit errors that can be detected or corrected
• Minimize the false positive/inter-tag confusion rate
• Minimize the total number of bits per tag (and thus the size of the tag)

These goals are often in conflict, and so a given code represents a trade-off. In this section, we describe a new coding system based on lexicodes that provides significant advantages over previous methods. Our procedure can generate lexicodes with a variety of properties, allowing the user to use a code that best fits their needs.

A. Methodology

We propose the use of a modified lexicode [24]. Classical lexicodes are parameterized by two quantities: the number of bits n in each codeword and the minimum Hamming distance between any two codewords d. Lexicodes can correct (d − 1)/2 bit errors and detect d/2 bit errors. For convenience, we will denote a 36 bit encoding with a minimum Hamming distance of 10 (for example) as a 36h10 code.

Lexicodes derive their name from the heuristic used to generate valid codewords: candidate codewords are considered in lexicographic order (from smallest to largest), adding new codewords to the codebook when they are at least a distance d from every codeword previously added to the codebook. While very simple, this scheme is often very close to optimal [25].

In the case of visual fiducials, the coding scheme must be robust to rotation. In other words, it is critical that when a tag is rotated by 90, 180, or 270 degrees, it still have a Hamming distance of d from every other code. The standard lexicode generation algorithm does not guarantee this property. However, the standard generation algorithm can be trivially extended to support this: when testing a new candidate codeword, we can simply ensure that all four rotations have the required minimum Hamming distance. The fact that the lexicode algorithm can be easily extended to incorporate additional constraints is an advantage of our approach.

Some codewords, despite satisfying the Hamming distance constraint, are poor choices. For example, a codeword consisting of all zeros would result in a tag that looks like a single black square. Such simple geometric patterns commonly occur in natural scenes, resulting in false positives. The ARTag encoding system, for example, explicitly forbids two codes because they are too likely to occur by chance. Rather than identify problematic tags manually, we further modify the lexicode generation algorithm by rejecting candidate codewords that result in simple geometric patterns.

Our metric is based on the number of rectangles required to generate the tag’s 2D pattern. For example, a solid pattern requires just one rectangle, while a black-white-black stripe would require two rectangles (one large black rectangle with a smaller white rectangle drawn second). Our hypothesis, supported by experimental results later in this paper, is that tag patterns with high complexity (which require many rectangles to be reconstructed) occur less frequently in nature and thus lead to lower false positive rates. Using this idea, we again modify the lexicode generation algorithm to reject candidate codewords that are too simple. We approximate the number of rectangles required to generate the tag’s pattern with a simple greedy approach that repeatedly considers all possible rectangles and adds the one that reduces the error the most. Since the tags are generally very small, this computation is not a bottleneck. Tags with a minimum complexity less than a threshold (typically 10 in our experiments) are rejected. The appropriateness and effectiveness of this heuristic are demonstrated in the results section.
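The following sketch shows the rotation-aware lexicode search just described, under an assumed row-major bit layout for the m×m payload. The rectangle-covering complexity filter described above and the prime-stride candidate ordering described next are omitted; like the real generation step, this exhaustive search is an offline computation.

```java
// Sketch of rotation-aware lexicode generation (illustrative bit layout; complexity
// filter and randomized candidate ordering omitted). Fits tag families with m*m <= 36.
import java.util.ArrayList;
import java.util.List;

public class LexicodeSketch {
    /** Generate an m*m-bit tag family with minimum Hamming distance d. */
    public static List<Long> generate(int m, int d) {
        int n = m * m;                               // e.g. m = 6 gives a 36-bit family
        List<Long> codebook = new ArrayList<>();
        for (long cand = 0; cand < (1L << n); cand++) {
            if (acceptable(cand, m, d, codebook)) codebook.add(cand);
        }
        return codebook;
    }

    // A candidate is accepted only if all four of its rotations are at least distance d
    // from every codeword already kept. Because rotation is a bit permutation, checking
    // the candidate's rotations against the stored (unrotated) codewords covers all pairs.
    static boolean acceptable(long cand, int m, int d, List<Long> codebook) {
        long r = cand;
        for (int i = 0; i < 4; i++) {
            for (long code : codebook)
                if (Long.bitCount(r ^ code) < d) return false;
            r = rotate90(r, m);
        }
        return true;
    }

    /** Rotate an m x m bit grid by 90 degrees; bit (row*m + col) holds cell (row, col). */
    static long rotate90(long w, int m) {
        long out = 0;
        for (int row = 0; row < m; row++)
            for (int col = 0; col < m; col++)
                if (((w >> (row * m + col)) & 1) != 0)
                    out |= 1L << (col * m + (m - 1 - row));
        return out;
    }
}
```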
Lastly, we have empirically observed lower false positive scores by making one more modification to the lexicode generation algorithm. Rather than test codewords in order (0, 1, 2, 3, ...), we consider (b, b + p, b + 2p, b + 3p, ...), where b is an arbitrary number, p is a large prime, and the lowest n bits are kept at each step. Intuitively, the tags generated by this method have greater entropy at every bit position; the lexicographic order, on the other hand, favors small-valued codes. The disadvantage of this method is that fewer distinguishable codes are created: the lexicographic ordering tends to pack codewords quite densely, whereas the more random order results in a less efficient packing of codewords.

To summarize, we use a lexicode system that can generate codes for any arbitrary tag size (e.g., 3x3, 4x4, 5x5, 6x6) and minimum Hamming distance. Our approach explicitly guarantees the minimum Hamming distance for all four rotations of each tag and eliminates tags which are of low geometric complexity. Computing the tags can be an expensive operation, but is done offline. Small tags (5x5) can be easily computed in seconds or minutes, but larger tags (6x6) can take several days of CPU time. Many useful code families are already computed and distributed with our software; most users will not need to generate their own code families.

B. Error correction analysis

Theoretical false positive rates can be easily estimated. Assume that a false quad is identified and that the bit pattern is random. The probability of a false positive is the fraction of codewords which are accepted as valid tags versus the total number of possible codewords, 2^n. More aggressive error correction increases this rate, since it increases the number of codewords that are accepted. This unavoidable increase in error rate is illustrated for the 36h10 and 36h15 codes below:

    Bits corrected   36h10 FP (%)   36h15 FP (%)
    0                0.000001       0.000000
    1                0.000041       0.000002
    2                0.000744       0.000029
    3                0.008714       0.000341
    4                0.074459       0.002912
    5                0.495232       0.019370
    6                N/A            0.104403
    7                N/A            0.468827

Of course, the better performance of the 36h15 encoding comes at a price: there are only 27 distinguishable codewords, as opposed to 36h10’s 2221 distinguishable codewords.

Our coding scheme is significantly stronger than previous schemes, including that used by ARTag and both systems used by ARToolkitPlus: our coding system achieves a greater minimum Hamming distance between all pairs of codewords while encoding a larger number of distinguishable ids. This improvement in minimum Hamming distance is illustrated in Fig. 5 and in the table below:

    Encoding Scheme        Length   Unique codes   Min. Hamming
    ARToolkit+ (simple)    36       512            4
    ARToolkit+ (BCH)       36       4096           2
    ARTag                  36       2046           4
    Proposed (36h9)        36       4164           9
    Proposed (36h10)       36       2221           10

Fig. 5. Hamming distances. Good codes have large Hamming distances between valid codewords. Shown above are the Hamming distances for several contemporary systems. Note that our coding scheme guarantees a minimum Hamming distance by construction, whereas other systems have some very similar codewords, which leads to higher inter-tag confusion rates.

In order to decode a possibly-corrupted code word, the Hamming distance between the code word and each valid code word in the code book is computed. If the best match has a Hamming distance less than the user-specified threshold, a detection is reported. By specifying this threshold, the user is able to control the tradeoff between false positives and false negatives.
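A sketch of this decoding step, assuming the same row-major bit layout used in the generation sketch above (again illustrative, not the released decoder):

```java
// Sketch of nearest-codeword decoding with a user-specified error-correction budget.
public class TagDecoderSketch {
    public static class Detection {
        public final int id;        // index of the matched codeword
        public final int rotation;  // number of 90-degree rotations applied to the payload
        public final int hamming;   // number of bit errors corrected
        Detection(int id, int rotation, int hamming) {
            this.id = id; this.rotation = rotation; this.hamming = hamming;
        }
    }

    /** Returns the best match over all rotations, or null if none is within maxHamming bits. */
    public static Detection decode(long observed, long[] codebook, int m, int maxHamming) {
        Detection best = null;
        long r = observed;
        for (int rot = 0; rot < 4; rot++) {
            for (int id = 0; id < codebook.length; id++) {
                int dist = Long.bitCount(r ^ codebook[id]);
                if (dist <= maxHamming && (best == null || dist < best.hamming))
                    best = new Detection(id, rot, dist);
            }
            r = rotate90(r, m);
        }
        return best;
    }

    static long rotate90(long w, int m) {   // same bit-grid rotation as in the generation sketch
        long out = 0;
        for (int row = 0; row < m; row++)
            for (int col = 0; col < m; col++)
                if (((w >> (row * m + col)) & 1) != 0)
                    out |= 1L << (col * m + (m - 1 - row));
        return out;
    }
}
```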
A disadvantage of our method is that this decoding process takes linear time in the size of the codebook, since every valid codeword must be considered. However, the coefficient is so small that the computational complexity is negligible in comparison to the other image processing steps.

For a given coding scheme, larger tags (i.e., those with 36 bits versus 25 bits) have dramatically better coding performance than smaller tags, although this comes at a price. All other things being equal, the range at which a given camera can read a 36 bit tag will be shorter than the range at which the same camera can read a 16 or 25 bit tag. However, the benefit in range for smaller tags is quite modest due to the 4 pixel overhead of the borders; only a 25% improvement in detection range can be expected by using 16 bit tags instead of 36 bit tags. Thus, it is only in the most range-sensitive applications where smaller tags are advantageous.

VI. EXPERIMENTAL RESULTS

A. Empirical Experiments

A key question we wish to answer is whether our analytical predictions regarding false positive rates hold in real-world imagery. To answer this question, we used a standard image corpus from the LabelMe dataset [26], which contains 180,829 images from a wide variety of indoor and outdoor environments. Since none of these images contain one of our tags, we can measure the false positive rate of our coding systems by using these images.

Fig. 6. Empirical false positives versus tag complexity. Our theoretical error rates assume that all codewords are equally likely to occur by chance in a real-world environment. Our hypothesis is that real-world environments are biased towards codes which have a low rectangle covering complexity and that by selecting codewords that have high rectangle covering complexity, we can decrease false positive rates. This hypothesis is validated by the graph above, which shows empirical false positive rates from the LabelMe dataset (solid lines) for rectangle covering complexities from c=2 to c=10. At complexities of c=9 and c=10, the false positive rate drops below the rate predicted by the pessimistic model that real-world payloads are randomly distributed.

Evaluation of complexity heuristic: We first wish to evaluate our hypothesis that the false positive rate can be reduced by imposing our geometric complexity heuristic to reject candidate codewords. To do this, we generated ten variants of the 25h9 family with minimum complexities ranging from 1 to 10. In Fig. 6, the false positive rate is given for each of these complexities as a function of the maximum number of bit errors corrected. Also displayed is the theoretical false positive rate, which was based on the assumption that data payloads are randomly distributed. Fig. 6 clearly demonstrates that our heuristic is effective in reducing the false positive rate. Interestingly, once the complexity exceeds 8, the performance is actually better than predicted by theory.

Comparison to other coding schemes: We next compare the false positive rate of our coding systems to those used by ARToolkitPlus and ARTag. Using the same real-world imagery dataset, we plot the empirical false positive rates for five codes in Fig. 7.

Fig. 7. Empirical false positives. The best performing methods, in terms of the rate of false positives on the LabelMe dataset, are 36h15 and ARToolkitPlus-Simple. These two coding families also have the fewest number of distinguishable codes, which gives them an advantage. The other three systems have approximately comparable numbers of distinguishable codes; ARToolkitPlus-BCH performs very poorly; ARTag does much better, and our proposed 36h10 encoding does even better.

ARToolkitPlus’s BCH coding scheme has the highest false positive rate, followed by ARTag.
Our 36h10 encoding, generated with a minimum complexity of 10, performs better than both of these systems. This is a central result of this paper.

The plot shows data for two additional schemes: ARToolkitPlus-Simple performs about the same as our 36h10 encoding, but because its tag family has one quarter as many distinguishable tags, its false positive rate is correspondingly lower. For comparison purposes, we also include the false positive rate for the 36h15 family with only 27 distinguishable codewords. As expected, it has an exceptionally low false positive rate.

Fig. 8. Example synthetic image. We generated ray-traced images in order to create ground-truthed datasets for our evaluation. In this example, the tag is 10 m from the camera, and its normal vector points 30.3 degrees away from the camera.

B. Localization Accuracy

To evaluate the localization accuracy of the detector, we used a ray tracer to generate images with known ground truth (see Fig. 8 for an example). The true location and orientation of the tags was varied randomly and compared to the detected position. Images were generated at a resolution of 400 × 400 with a pinhole lens and a focal length of 400 pixels.

The main factor in localization accuracy is the size of the target, which is affected by both the distance and orientation of the tag. To decouple these factors, we conducted two experiments. The first experiment measured the orientation accuracy of targets while fixing the distance. The critical parameter is the angle φ between the target’s normal vector and the vector to the camera. When φ is 0, the target is facing directly towards the camera; as φ approaches π/2, the target rotates out of view and we expect performance to decrease. We measure performance in terms of both the localization accuracy and the detection rate.

Fig. 9. Orientation accuracy. Using our ray-tracing simulator (which provides ground truth), we evaluated the localization accuracy of two tag detectors. We fixed the range to the tag and varied the angle between the tag’s normal and the camera direction. For both systems, the performance in localization accuracy and in success rate worsens as the tag rotates away from the camera. However, the proposed system has dramatically lower localization error and is able to detect targets more reliably.

In Fig. 9, we see that our detector significantly outperforms the ARToolkitPlus detector: not only are the orientation and distance estimates more accurate, but it can detect tags over a larger range of φ.

The complementary experiment is to hold φ = 0 and to vary the distance. We expect that as distance increases, accuracy will decrease. In Fig. 10, we see that our detector works reliably to 50 m, while the ARToolkitPlus detector’s detection rate drops to under 50% at around 25 m. In addition, our detector provides significantly more accurate localization results.

Naturally, real-world performance of the system will be lower than these synthetic experiments due to noise, lighting variation, and other non-idealities (such as lens distortion or tag non-planarity). Still, the real-world performance of our system has been very good.

While our methods are generally more computationally expensive than those used by ARToolkitPlus, our Java implementation runs at interactive rates (30 fps) on VGA resolution images (Intel Core2 CPU at 2.6 GHz). Higher resolutions significantly impact runtime due to the graph-based clustering.
We expect significant speedups by making use of SIMD optimizations and accelerated image processing libraries in our ongoing C port.

VII. CONCLUSION

We have described a visual fiducial system that significantly improves upon previous methods. We described a new approach for detecting edges using a graph-based clustering method, along with a coding system that is demonstrably stronger than previous methods. We have also described a set of benchmarks that we hope will make it easier to evaluate other methods in the future. In contrast to other systems (with the notable exception of ARToolkit), our implementation is fully open. Our source code and benchmarking software are freely available: http://april.eecs.umich.edu/

Fig. 10. Distance accuracy. In order to evaluate the accuracy of distance estimation to the target, we again used ground-truthed simulation data. For this experiment, we fixed the target so that it faced the camera but varied the distance between the tag and the camera. Our proposed method significantly outperforms ARToolkitPlus, providing more accurate range estimates, and over twice the working detection range.

REFERENCES

[1] C.-H. Chu, D.-N. Yang, and M.-S. Chen, “Image stabilization for 2d barcode in handheld devices,” in MULTIMEDIA ’07: Proceedings of the 15th International Conference on Multimedia. New York, NY, USA: ACM, 2007, pp. 697–706.
[2] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Pose tracking from natural features on mobile phones,” in ISMAR ’08: Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality. Washington, DC, USA: IEEE Computer Society, 2008, pp. 125–134.
[3] M. Fiala, “ARTag, a fiducial marker system using digital techniques,” in CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2. Washington, DC, USA: IEEE Computer Society, 2005, pp. 590–596.
[4] A. C. Sementille, L. E. Lourenço, J. R. F. Brega, and I. Rodello, “A motion capture system using passive markers,” in VRCAI ’04: Proceedings of the 2004 ACM SIGGRAPH International Conference on Virtual Reality Continuum and its Applications in Industry. New York, NY, USA: ACM, 2004, pp. 440–447.
[5] J. Sattar, P. Giguère, and G. Dudek, “Sensor-based behavior control for an autonomous underwater vehicle,” Int. J. Rob. Res., vol. 28, no. 6, pp. 701–713, 2009.
[6] M. Fiala, “A robot control and augmented reality interface for multiple robots,” in CRV ’09: Proceedings of the 2009 Canadian Conference on Computer and Robot Vision. Washington, DC, USA: IEEE Computer Society, 2009, pp. 31–36.
[7] ——, “Vision guided control of multiple robots,” Computer and Robot Vision, Canadian Conference, vol. 0, pp. 241–246, 2004.
[8] U. Frese, “Deutsches Zentrum für Luft- und Raumfahrt (DLR) dataset,” 2003.
[9] J. Sattar, E. Bourque, P. Giguere, and G. Dudek, “Fourier tags: Smoothly degradable fiducial markers for use in human-robot interaction,” Computer and Robot Vision, Canadian Conference, vol. 0, pp. 165–174, 2007.
[10] T. Lochmatter, P. Roduit, C. Cianci, N. Correll, J. Jacot, and A. Martinoli, “SwisTrack - A Flexible Open Source Tracking Software for Multi-Agent Systems,” in Proceedings of the IEEE/RSJ 2008 International Conference on Intelligent Robots and Systems (IROS 2008). IEEE, 2008, pp. 4004–4010. [Online]. Available: http://iros2008.inria.fr/
[11] H. Kato and M. Billinghurst, “Marker tracking and HMD calibration for a video-based augmented reality conferencing system,” in IWAR ’99: Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality. Washington, DC, USA: IEEE Computer Society, 1999, p. 85.
[12] D. Wagner and D. Schmalstieg, “Making augmented reality practical on mobile phones, part 1,” IEEE Computer Graphics and Applications, vol. 29, pp. 12–15, 2009.
[13] S.-w. Lee, D.-c. Kim, D.-y. Kim, and T.-d. Han, “Tag detection algorithm for improving the instability problem of an augmented reality,” in ISMAR ’06: Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality. Washington, DC, USA: IEEE Computer Society, 2006, pp. 257–258.
[14] D. Parikh and G. Jancke, “Localization and segmentation of a 2d high capacity color barcode,” in WACV ’08: Proceedings of the 2008 IEEE Workshop on Applications of Computer Vision. Washington, DC, USA: IEEE Computer Society, 2008, pp. 1–6.
[15] P. C. Santos, A. Stork, A. Buaes, C. E. Pereira, and J. Jorge, “A real-time low-cost marker-based multiple camera tracking solution for virtual reality applications,” January 2009.
[16] A. Mohan, G. Woo, S. Hiura, Q. Smithwick, and R. Raskar, “Bokode: imperceptible visual tags for camera based interaction from a distance,” ACM Trans. Graph., vol. 28, pp. 98:1–98:8, July 2009. [Online]. Available: http://portal.acm.org/citation.cfm?id=1531326.1531404
[17] J. McLurkin, “Analysis and implementation of distributed algorithms for Multi-Robot systems,” Ph.D. thesis, Massachusetts Institute of Technology, 2008.
[18] Y. Yamamoto, P. Pirjanian, M. Munich, E. Dibernardo, L. Goncalves, J. Ostrowski, and N. Karlsson, “Optical sensing for robot perception and localization,” in 2005 IEEE Workshop on Advanced Robotics and its Social Impacts, 2005, pp. 14–17.
[19] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[20] R. L. Rivest and C. E. Leiserson, Introduction to Algorithms. New York, NY, USA: McGraw-Hill, Inc., 1990.
[21] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, November 2004.
[22] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
[23] K. Shoemake and T. Duff, “Matrix animation and polar decomposition,” in Proceedings of the Conference on Graphics Interface ’92. Morgan Kaufmann Publishers Inc., 1992, pp. 258–264.
[24] A. Trachtenberg, A. T. M. S, E. Vardy, and C. L. Liu, “Computational methods in coding theory,” Tech. Rep., 1996.
[25] R. A. Brualdi and V. S. Pless, “Greedy codes,” J. Comb. Theory Ser. A, vol. 64, no. 1, pp. 10–30, 1993.
[26] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “LabelMe: A database and web-based tool for image annotation,” Massachusetts Institute of Technology, Tech. Rep. MIT-CSAIL-TR-2005-056, 2005.