Learning Convolutional Filters for Interest Point Detection

Andrew Richardson and Edwin Olson

The authors are with the Computer Science and Engineering Department, University of Michigan, Ann Arbor, MI 48104, USA. {chardson,ebolson}@umich.edu http://april.eecs.umich.edu

Abstract— We present a method for learning efficient feature detectors based on in-situ evaluation as an alternative to hand-engineered feature detection methods. We demonstrate our in-situ learning approach by developing a feature detector optimized for stereo visual odometry. Our feature detector parameterization is that of a convolutional filter. We show that feature detectors competitive with the best hand-designed alternatives can be learned by random sampling in the space of convolutional filters, and we provide a way to bias the search toward regions of the search space that produce effective results. Further, we describe our approach for obtaining the ground-truth data needed by our learning system in real, everyday environments.

I. INTRODUCTION

In this paper, we present a method to automatically learn feature detectors that perform as well as the best hand-designed alternatives and are tailored to the desired application. Most current feature detectors are hand-designed. Creating feature detectors by hand allows the designer to leverage human intuition to create intricate and efficient detectors that would be hard to replicate automatically. However, many feature detectors tend to be reused across multiple applications with subtle differences that could be leveraged to improve performance. Automatically-learned detectors have the potential to exploit these differences and improve application performance, as well as help us better understand the properties of top-performing feature detectors. Such a learning process, however, comes with its own challenges, such as defining a sufficiently general parameterization in which good detectors can be found and tractably exploring such a parameter space.

This work approaches automatic feature detector learning by learning fast and effective feature detectors based on convolutional filters. Convolutional filters make up an expressive space of feature detectors yet possess favorable computational properties. We specifically learn these detectors for use in stereo Visual Odometry (VO), an application in which the full set of feature detector invariances, such as affine invariance, is not required to achieve good motion estimates and efficiency is key. Unlike feature detection and matching evaluations limited to planar scenes, these detectors are learned on realistic video sequences in everyday environments to ensure the relevance of the learned results.

Fig. 1: When using convolutional feature detectors, which filter is best? Reasonable examples include the corner and point detectors used in [8] (top, both) and a Difference-of-Gaussians (DOG) filter (bottom left) like those used for scale-space search in SIFT [12]. We automatically generate filters for feature detection to improve the 3D motion estimation of a stereo visual odometry system. Bottom right: the most accurate filter in this work.

Learning high-performance feature detectors also requires a good source of training data, which can be difficult or expensive to obtain. We detail our method for generating ground truth data — instrumenting an environment with 2D fiducial markers known as AprilTags [17].
These markers can be robustly detected with low false-positive rates, allowing us to extract features with known data association across an entire video. We can then solve for the global poses of all cameras and AprilTags and use this information to evaluate the motion estimate when using natural feature detections. Importantly, we also use this global reconstruction to reject any natural feature detections on or near AprilTags, ensuring that we do not bias the learning process to detect fiducial markers.

The feature detectors learned with our method are efficient and accurate. Processing times are comparable to FAST [18], a well-known method for feature detection, and reprojection errors for stereo visual odometry are lower than those with FAST in most cases. In addition, because our filters use a simple convolutional structure, processing times are reduced by both increases in CPU clock rates and SIMD vector instruction widths.

The main contributions of this paper are:
1) We propose a framework for learning a feature detector designed to maximize performance of a specific application.
2) We propose convolutional filters as a family of feature detectors with good computational properties for general-purpose vector instruction hardware (e.g. SSE, NEON).
3) We present a sampling-based search algorithm that can incorporate empirical evidence that suggests where high quality detectors can be found and tolerate both a large search space and a noisy objective function. We detail both the search algorithm and a set of experiments used to steer the search towards effective portions of the search space.
4) We present a method to evaluate the performance of learned filters with easily-collected ground truth data. This enables evaluation of the end application in arbitrary 3D environments.

In Section II, we discuss background material and prior work. We then discuss the general principles of our method for detector learning in Section III. Section IV focuses on the specifics of our application, stereo visual odometry, and our ground-truthing method. Finally, our experiments and evaluation are presented in Section V.

II. BACKGROUND

Many feature detectors have been designed to enhance feature matching repeatability and accuracy through properties such as rotation, scale, lighting, and affine-warp invariance. Some well-known examples include SIFT [12], SURF [2], Harris [9], and FAST [18]. While SIFT and SURF aim to solve the scale-invariant feature detection problem, Harris and FAST detect single-scale point features with rotation invariance at high framerates. Other detectors aim to also achieve affine-warp invariance to better cope with the effects of viewpoint changes [14].

Many comparative evaluations of feature detectors and descriptors are present in the literature [14], [16]. Breaking from previous research, Gauglitz et al. evaluated detectors and descriptors specifically for monocular visual tracking tasks on video streams [7]. This evaluation is beneficial, as continuous motion between sequential frames can limit the range of distortions, such as changes in rotation, scale, or lighting, that the feature detectors and descriptors must handle, especially in comparison to image-based search methods that can make no such assumptions. Our performance-analysis mechanism is similar; however, whereas [7] focused on rotation, scale, and lighting changes for visual tracking, we focus on non-planar 3D scenes.
An alternative to hand-crafted feature detectors and descriptors is automated improvement through machine learning. FAST, which enforces a brightness constraint on a segment of a Bresenham circle via a decision tree, is a hand-designed detector that has been optimized for efficiency via ID3 [18]. An extension of FAST, FAST-ER, used simulated annealing to maximize detection repeatability [19]. In contrast to these approaches, we focus on learning a detector to improve the output of our target application using a method with a low and nearly-constant feature detection time.

Detector-learning work by Trujillo and Olague used genetic programming to assemble feature detectors composed of primitive operations, such as Gaussian blurring and equalization [20]. Their results are promising, though the training dataset size was small. Additionally, they attempt to maximize detector repeatability, whereas our method is focused on the end-to-end system performance of our target application.

In addition to the detector-learning methods, descriptor learning methods like those from Brown et al. learn local image descriptors to improve matching performance [4]. They use discriminative classification and a ground-truthed 3D dataset. Their resulting descriptors perform significantly better than SIFT on the ROC curve, even with shorter descriptors. As this paper focuses on detector learning, we use a common descriptor for all detectors. This descriptor uses a standard pixel-patch representation and allows an even comparison between all detectors evaluated in this work.

III. LEARNING A FEATURE DETECTOR

Feature detector learning requires three main components: a parameterization for the detector, an evaluation metric, and a learning algorithm. The parameterization defines a continuum of detectors, ideally capable of describing the range from fixed-size point or corner detectors to scale-invariant blob detectors, as well as concepts like "zero mean" filters. The evaluation harness computes the error of a proposed detector, which we want to minimize. Given these components, we can construct a method to generate feature detectors that optimize our learning objective. While iterative optimization through gradient or coordinate descent is a popular way to solve such problems, these approaches are problematic in learning feature detectors due to the high-dimensional search space and noise in the objective function. Random sampling allows us to evaluate far more filters than with iterative methods, find good filters despite numerous local minima, and develop an intuition for the constraints on the filter design that yield the best performance.

A. Detector parameterization

We parameterize our feature detector as a convolutional filter [15]. This is an attractive representation due to the convolutional filter's flexibility and the ability to leverage signal processing theory to interpret or constrain the qualities of a detector. In addition to a convolutional filter's flexibility, these filters can also be implemented very efficiently on vector instruction hardware, such as Intel SSE, AVX, and ARM NEON. This hardware is commonly available in modern smartphone processors, as rich media applications can benefit significantly from SIMD parallelism. We want to find the convolutional filter that yields the most accurate result for our target application. This is different from the standard metrics for feature detector evaluation like repeatability, as the best features may not in fact be detectable under all rotations.
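At a high level, the learning procedure amounts to a sampling loop: draw a candidate filter, evaluate it in-situ with the application-level error described in the following sections, and keep the running best. The sketch below is a minimal illustration of that loop in Python, not the authors' implementation; the callables sample_filter and evaluate_error are hypothetical stand-ins for the sampling scheme of Section III-B and the stereo-VO error E(w) of Section IV.

```python
import numpy as np

def random_search(evaluate_error, sample_filter, num_samples=5000, rng=None):
    """Keep the best of num_samples randomly drawn filters.

    evaluate_error(w) -> float is assumed to run the full stereo VO
    pipeline with filter w and return the mean reprojection error E(w).
    sample_filter(rng) -> np.ndarray draws one candidate 8x8 filter.
    Both callables are placeholders for components described later.
    """
    if rng is None:
        rng = np.random.default_rng()
    best_w, best_err = None, np.inf
    for _ in range(num_samples):
        w = sample_filter(rng)
        err = evaluate_error(w)       # one full evaluation per sample
        if err < best_err:            # running-minimum error, as in Fig. 7
            best_w, best_err = w, err
    return best_w, best_err
```

Because each candidate requires only one evaluation of the (noisy) objective, this loop can explore many more filters than a gradient-based optimizer in the same compute budget, which is the trade-off argued for in Section III-C.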
Our objective function (which we minimize) measures the error in the motion estimate against ground truth. This is in contrast to methods which maximize an approximation of end-to-end performance like repeatability [19], [20]. The advantage of our approach is the potential to exploit properties specific to the application. In stereo visual odometry, for example, edges that are vertical from the perspective of the camera are easy to match between the left and right frames due to the epipolar geometry constraints. This property is not captured by standard measures like repeatability, so methods which use these measures cannot be expected to exploit them.

Our detection method can be summarized as follows:
1) Convolve with the filter to compute the image response
2) Detect points exceeding the filter response threshold, which is updated at runtime to detect a constant user-specified number of features
3) Apply non-extrema suppression using the filter response over a 3 × 3 window

This work focuses on 8 × 8 convolutional filters. A naïve parameterization would simply specify the value of each entry in the filter, a space of size R^64. In the following section, we detail alternative parameterizations which allow us to learn filters which both perform better and require less time to learn.

B. Frequency domain parameterizations

Within the general class of convolutional filters, we parameterize our feature detectors with frequency domain representations. In this way, we can apply meaningful constraints on the qualities of these filters that would not be easily specified in the spatial domain. In doing so, we learn about the important characteristics for successful feature detectors built from convolutional filters.

We use the Discrete Cosine Transform (DCT) and Haar Wavelet transform to describe our filters [1], [15]. Unlike the Fast Fourier Transform (FFT), the DCT and Haar Wavelets only use real-valued coefficients and are known to represent image data more compactly than the FFT [3]. This compactness is often exploited in image compression and allows us to sample candidate filters more efficiently. These transformations can be easily represented by orthonormal matrices and computed through linear matrix products. In this work, we make use of three representations for filters — a straightforward pixel representation, the Discrete Cosine Transform (DCT), and the Haar Wavelet Transform.

Fig. 2: Basis sets representing each frequency-domain transform for (a) the Discrete Cosine Transform (DCT) and (b) the Haar Wavelet Transform on a 4 × 4 image patch. Each patch corresponds to a single coefficient in frequency space. The top left basis corresponds to the DC component of the filter. Note that we use 8 × 8 filters in this work.

Fig. 3: Example frequency-domain filter constraints for 8 × 8 filters: (a) bandpass, (b) block-diagonal. Coefficients rendered in black are suppressed (forced to zero). White coefficients can take any value. A typical bandpass filter is shown in (a). We propose the use of the block-diagonal region of support in (b), which exhibited superior performance in our tests.

Figure 2 shows a set of basis patches for the two frequency-domain representations. The pixel values of a filter in the spatial domain can be uniquely described by a weighted linear combination of these basis patches, where the weights are the coefficients determined by each frequency transformation.
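To make the reconstruction concrete, the following sketch builds the orthonormal 8 × 8 DCT basis and recovers a spatial-domain filter from a coefficient matrix as a weighted sum of the basis patches in Fig. 2. It is an illustrative sketch under our own naming, not the authors' implementation.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II transform matrix C (n x n)."""
    C = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            C[k, i] = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] *= np.sqrt(1.0 / n)
    C[1:, :] *= np.sqrt(2.0 / n)
    return C

def filter_from_dct(coeffs):
    """Spatial-domain filter w from an 8x8 DCT coefficient matrix.

    Equivalently, w is the weighted sum of the outer-product basis
    patches shown in Fig. 2, with coeffs[k, l] as the weights.
    """
    C = dct_matrix(coeffs.shape[0])
    return C.T @ coeffs @ C          # inverse 2D DCT (C is orthonormal)

# Example: a coefficient matrix with a single low-frequency component.
coeffs = np.zeros((8, 8))
coeffs[1, 1] = 100.0                 # one non-DC, low-frequency basis patch
w = filter_from_dct(coeffs)
assert np.isclose(w.sum(), 0.0)      # no DC component => zero-mean filter
```

As the final assertion shows, any coefficient matrix whose DC entry is zero produces a zero-mean filter, which is one of the filter qualities the frequency-domain constraints of Fig. 3 let us control directly.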
For convenience, we refer to both the spatial values of the filters (pixel values) and frequency coefficients as w for the remainder of this paper.

C. Error minimization

In our application, our goal is to find a convolutional filter which yields the best motion estimate for a stereo VO system. We represent the error function that evaluates the accuracy of a proposed filter by E(w). As described further in Section IV-B, our error function is the mean reprojection error of the ground truth data, the four corners of the AprilTags, using the known camera calibrations and the camera motion estimated using the detector under evaluation.

While iterative optimization via gradient or coordinate descent methods is a straightforward approach for learning with an error function, we found that such iterative methods get caught in local minima too frequently. We propose instead to learn detectors by randomly sampling frequency coefficient values while varying the size and position of a coefficient mask that zeroes all coefficient values outside of the mask. We refer to our mask of choice as a block-diagonal constraint, as illustrated in Figure 3. We also evaluate the performance of sampling with bandpass constraints and sampling raw pixel values via the naïve approach. In all cases, sampled values are taken from a uniformly-random distribution in the range [-127, 127].

The constraints illustrated in Figure 3 are defined by low and high frequency cutoffs, which define the filter's bandwidth. The filters shown have low and high frequency cutoffs of 0.250 and 0.625, respectively (note that we use frequency ranges normalized to the interval (0, 1)). Thus, the filters have a bandwidth of 0.375.

Iteratively optimizing filters can be very expensive. Steepest descent methods that use the local gradient of the error function require at least n error evaluations for square filters of width √n. After gradient calculation, multiple step lengths may be tried in a line-search minimization algorithm. At a minimum, n + 1 error evaluations are required for every update. Coordinate descent methods require at least 2n evaluations to update every coefficient once. Both methods require more calculations in practice. Because the error surface contains a high number of local minima, step sizes must be small and optimization converges quickly. This results in a great deal of computation for only small changes to the filter. We found that 95% of our best randomly-sampled filters did not reduce their error significantly after iterative optimization. Many did not improve at all. In contrast to iterative methods, computing the error for a new filter only requires one evaluation. The result is that in the time it would take to perform one round of gradient descent for an 8 × 8 filter, we can evaluate a minimum of 65 random samples.
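As a concrete illustration of this sampling scheme, the sketch below draws DCT coefficients uniformly from [-127, 127], zeroes everything outside a caller-supplied 0/1 mask, and inverts the transform to obtain the spatial filter. It corresponds to the sample_filter step of the search loop sketched earlier; the function names, the SciPy inverse-DCT call, and the example mask are our own illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.fft import idctn   # inverse 2D DCT; norm='ortho' matches an orthonormal basis

def sample_masked_filter(mask, rng, low=-127, high=127):
    """Draw one candidate filter under a frequency-domain constraint.

    mask is an 8x8 array of 0s and 1s (e.g. the block-diagonal or
    bandpass regions of Fig. 3): coefficients outside the mask are
    forced to zero, the rest are uniform in [low, high].
    """
    coeffs = rng.uniform(low, high, size=mask.shape) * mask
    return idctn(coeffs, norm='ortho')   # spatial-domain 8x8 filter

# Example usage with a hypothetical mask that keeps a small low-frequency block,
# skipping row/column 0 (the DC and pure horizontal/vertical components).
mask = np.zeros((8, 8))
mask[1:3, 1:3] = 1.0
rng = np.random.default_rng(0)
w = sample_masked_filter(mask, rng)
```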
IV. STEREO VISUAL ODOMETRY

In this section, we describe aspects of our application domain, stereo visual odometry. In principle, our in-situ training methods apply to other stereo visual odometry pipelines and other applications. We use a custom stereo visual odometry pipeline for feature detector learning. Prior work in this area has yielded accurate and reliable results with high-framerate VO using corner detectors [13], [11]. A number of system architectures are possible and have been demonstrated in the literature, including monocular [11], [5] and stereo approaches [13], [8].

A. Visual Odometry overview

Our approach to stereo visual odometry can be divided into a number of sequential steps:
1) Image acquisition - 30 Hz hardware-triggered frames are paired using embedded frame counters before stereo rectification via bilinear interpolation.
2) Feature detection - Features are detected in grayscale images with non-extrema suppression. Zero-mean, 9 × 9 pixel patch descriptors are used for all features.
3) Matching - Features are matched between paired images using a zero-mean Sum of Absolute Differences (SAD) error score. To increase robustness to noise, we search over a +/-1 pixel offset when matching. Unique matches are triangulated and added to the map. Previously-mapped features are projected using their last known 3D position and matched locally.
4) Outlier rejection - We provide robustness to bad matches with both Random Sample Consensus (RANSAC) and robustified cost functions during optimization (specifically, the Cauchy cost function with b = 1.0) [6], [10].
5) Motion estimation - We initialize the motion estimate to the best pose from RANSAC. We then use nonlinear optimization to improve the point positions and camera motion estimates, iterating until convergence.

The result is an updated 3D feature set and an estimate of the camera motion between the two sequential updates.

Fig. 4: Ground truth reconstruction using AprilTags: (a) reconstructed ground truth trajectory, (b) reprojected ground truth features. The stereo camera trajectory (orange) in (a) is reconstructed using interest points set on the tag corners determined by the tag detection algorithm. The shaded region (blue) corresponds to the scene viewed in (b). The mask overlays (red) in (b) are the result of reprojecting the tag corners using the ground truth reconstruction and are used to ensure that no features are detected on the AprilTags added to the scene. Example feature detections from the best filter learned in this work are shown for reference (green).

B. Ground truth using AprilTags

We compute our ground truth camera motion by instrumenting the scene with AprilTags and solving a global nonlinear optimization over all of the tags and camera positions. Specifically, we treat the four tag corners as point features with unique IDs for global data association. During learning, we explicitly reject any feature detections on top of or within a small window around any AprilTag so that we do not bias the detector learning process. In other words, our system rejects detections that would otherwise occur because of the AprilTags in the scene so that we learn to make use of the natural features in the environment. This is illustrated in Figure 4b, where red overlays correspond to regions where all feature detections are discarded. By using the reprojected AprilTag positions, we can reject features on AprilTags even when a tag is not detectable in the current frame.

Fig. 5: Images from the four datasets used in this work: (a) office, (b) conference room, (c) cubicle, (d) lobby. AprilTag masks described previously are shown in red.

Once the ground truth has been computed, we can compute the error of a motion estimate computed with an arbitrary feature detector. For every sequential pair of poses in the dataset, we compute the 3D position of the AprilTag corners with respect to the first of the two poses, transform from the first to the second pose's reference frame with the motion estimate from the arbitrary feature detector, and compute the reprojection error of these points in the images from the second pose. E(w) is the mean reprojection error over all pairs of poses in the dataset using this method. It measures how well, on average, the ground truth features are aligned with their observations when using the candidate detector.
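A minimal sketch of this evaluation is given below, assuming a simple pinhole projection and ground-truth data already prepared as described above. The data layout and function names are our own illustration of the definition of E(w), not the authors' implementation, and the error is shown against one image per pose pair for brevity.

```python
import numpy as np

def project(K, points_cam):
    """Pinhole projection of 3D points (N x 3, camera frame) to pixels (N x 2)."""
    p = points_cam @ K.T
    return p[:, :2] / p[:, 2:3]

def mean_reprojection_error(K, pose_pairs):
    """E(w): mean reprojection error of the AprilTag corners.

    pose_pairs is a list of tuples
        (corners_in_first_frame,   # N x 3 ground-truth tag corners, frame of pose i
         R_est, t_est,             # motion estimated with filter w, pose i -> pose i+1
         observed_px)              # N x 2 ground-truth corner pixels at pose i+1
    produced by the VO run that used the candidate filter w.
    """
    errors = []
    for corners, R_est, t_est, observed_px in pose_pairs:
        predicted_cam = corners @ R_est.T + t_est      # into the second pose's frame
        predicted_px = project(K, predicted_cam)
        errors.append(np.linalg.norm(predicted_px - observed_px, axis=1))
    return float(np.mean(np.concatenate(errors)))
```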
V. EXPERIMENTS

Our experiments focus on randomly sampling filters under different constraints and computing the mean reprojection error when using these filters across multiple datasets. In addition, we compare both error and computation time to existing and widely-used feature detectors.

A. Implementation Details

Training and testing take place on four datasets between 23 and 56 seconds in length with 30 FPS video. These datasets were collected in various indoor environments, including an office, conference room, cubicle, and lobby. In each case, we randomly selected 15 seconds of video for training. Figures 4 and 5 show camera trajectories and imagery from these datasets. Our stereo rig uses two Point Grey FireFly MV USB2.0 color cameras at a resolution of 376 × 240. Experiments were run on a pair of 12-core 2.5 GHz Intel Xeon servers, each with 32 GB of memory. Sampling 5,000 filters takes approximately 7 hours at 9.7 seconds per filter.

In contrast to the substantial computational resources used in learning, our target application is limited to the compute available on a typical mobile robot. A mobile-grade processor such as the OMAP4460, a dual-core ARM Cortex-A9 device used here as an image preprocessing board, is a compelling and scalable computing solution to our needs. With a convolution implementation optimized via vector instructions, specifically ARM NEON, we can detect around 300 features per 376 × 240 image in 3.65 ms for an 8 × 8 filter. In comparison, the ID3-optimized version of the FAST feature detector performs similarly, requiring 3.20 ms. For both methods, we dynamically adjust the detection threshold to ensure the desired number of features is detected even as the environment changes.

Fig. 6: Time comparison between FAST-9 and an 8 × 8 convolutional filter feature detector. Times represent the combined detection and non-extrema suppression time and were computed on the PandaBoard ES (OMAP 4460) over 30 seconds of video with 376 × 240, grayscale imagery.

Fig. 7: Mean reprojection error for the best filter sampled so far (running minimum error) over 5,000 samples on (a) the desk dataset and (b) the conference room dataset. Ground truth system error shown in red.

While both methods are efficient enough for real-time use, the difference in computation time for a large number of features is dramatic. Figure 6 shows the runtime as a function of the desired number of features for both methods. This time measurement includes detection and non-extrema suppression. While both methods show a linear growth in computation time, the growth for the convolutional methods is much slower than for FAST. This is because the convolution time does not change as the detection threshold is reduced. The linear growth is due only to non-extrema suppression. This result is especially important for methods where the desired number of detections is high [11], [8].
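To make the measured pipeline concrete, here is a minimal NumPy/SciPy sketch of the three detection steps from Section III-A (filter response, thresholding, and 3 × 3 non-extrema suppression) together with a simple threshold-adaptation rule. It is an illustrative single-threaded version under our own naming, not the NEON-optimized implementation timed in Fig. 6, and the adaptation rule is an assumption; the paper only states that the threshold is updated at runtime to hit a target feature count.

```python
import numpy as np
from scipy.ndimage import convolve, maximum_filter

def detect(image, w, threshold):
    """Return feature coordinates (N x 2, row/col) and the filter response image."""
    response = convolve(image.astype(np.float32), w, mode='nearest')
    # 3x3 non-extrema suppression: keep local maxima of the response
    # (a full implementation might also keep strong local minima).
    local_max = response == maximum_filter(response, size=3)
    keep = local_max & (response > threshold)
    return np.argwhere(keep), response

def update_threshold(threshold, num_detected, target=300, gain=0.05):
    """Crude runtime adaptation toward a constant number of detections (our assumption)."""
    return threshold * (1.0 + gain * np.sign(num_detected - target))
```

Because the convolution cost is independent of the threshold, lowering the threshold to obtain more features only increases the suppression and bookkeeping work, which is the behavior reflected in Fig. 6.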
B. Randomly-sampled filters

We evaluated our approach by running a suite of random-sampling experiments for block-diagonal filters with both the DCT and Haar parameterizations. We also compare to sampled bandpass and pixel filters. Figure 7 shows the error of the best filter sampled so far as 5,000 filters are sampled. In all four datasets, the pixel and bandpass filters never outperformed the best block-diagonal filter. The final filters of each type and from each dataset are shown in Figure 8 (final coefficients available at umich.edu/~chardson/icra2013feature.html).

Fig. 8: Best learned filter for every type and dataset combination. Best viewed in color.

Fig. 9: Histogram of mean reprojection errors for three filter generation methods ((a) pixels, (b) bandpass, (c) block-diagonal) on the conference room dataset. While randomly-generated bandpass filters yield better performance, on average, than filters with uniformly-random pixel values, block-diagonal filters have both better average performance and a lower error for the best filters. 79% and 72% of sampled bandpass and block-diagonal filters, respectively, had errors under 2.5 pixels, while only 30% of random pixel filters had such low errors.

Figure 9 shows the error distributions for each of the three constraints on the conference room dataset. For each experiment, 5,000 filters were randomly generated with the appropriate set of constraints. For the bandpass and block-diagonal constraints, we generated filters for all combinations of the DCT or Haar transform, low frequency cutoff, and filter bandwidth, defined previously. Of the 27 combinations of low and high-frequency mask cutoffs available for both DCT and Haar filters of size 8 × 8 (in total, C(8,2) − 1 = 27 combinations each for DCT and Haar), only 6 of them include the DC component and, in the case of the block-diagonal filters, the vertical and horizontal edge components.

From these plots, it is clear that limiting the search for a good filter through the bandpass and block-diagonal sampling constraints significantly improved the percentage of filters which yield low reprojection errors. Our interpretation of this result is that filters are very sensitive to nonzero values for specific frequency components. By strictly removing these components in 21 of the 27 constraint combinations, the average filter performance improves significantly.

Fig. 10: Distributions for block-diagonal filters with a bandwidth of 0.250 as a function of the low frequency cutoff on the conference room dataset: (a) sampled DCT block-diagonal filters, (b) sampled Haar block-diagonal filters. For both (a) and (b), the filters which include the DC and vertical/horizontal edge components (LF cutoff of 0) have reprojection errors greater than 4 pixels 84% and 87% of the time for DCT and Haar filters, respectively. Beyond the DC components, only the DCT filters with a low-frequency cutoff of 0.5 or greater have such large reprojection errors. The remaining distributions progress smoothly from low error (left) to high error (right) as the frequency increases. Best viewed in color.

Figure 10 shows separate distributions for block-diagonal filters for each of the possible low-frequency cutoffs for filters with the most narrow filter bandwidth (0.250). One plot is shown for each frequency transformation (DCT and Haar). From these figures, it is clear that filters perform significantly worse when the DC and edge coefficient values are not zero. Beyond that, we see a trend of low error for low-frequency filters, and an increasing error as frequency increases. Finally, the DCT filters seriously degrade when they begin to contain high frequency components; however, the Haar filters do not. Our interpretation of this result is that, because Haar basis patches are not periodic, a simple step transition in an image will often result in a unique maximum. In contrast, the periodic DCT basis patches will yield multiple local maxima, causing a cluster of detections around edges. Similar trends exist for filters with higher bandwidths.
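The exact regions of support are those drawn in Fig. 3. As one plausible reading of that figure, the sketch below builds both mask types from normalized low/high cutoffs, treating the bandpass region as the coefficients whose larger frequency index falls inside the band and the block-diagonal region as a square block along the diagonal of the coefficient matrix. This is our interpretation for illustration only, not the authors' exact definition.

```python
import numpy as np

def bandpass_mask(n=8, low=0.250, high=0.625):
    """Keep coefficients whose larger normalized frequency index lies in [low, high)."""
    k = np.arange(n) / n                       # normalized frequency per index
    freq = np.maximum.outer(k, k)              # dominant frequency of coefficient (i, j)
    return ((freq >= low) & (freq < high)).astype(np.float32)

def block_diagonal_mask(n=8, low=0.250, high=0.625):
    """Keep a square block of coefficients along the coefficient-matrix diagonal."""
    k = np.arange(n) / n
    in_band = (k >= low) & (k < high)
    return np.outer(in_band, in_band).astype(np.float32)

# Both examples use the Fig. 3 cutoffs (bandwidth 0.375). With low > 0, neither
# mask includes the DC coefficient (0, 0); the block-diagonal mask additionally
# excludes the pure horizontal/vertical components in row and column 0.
bp = bandpass_mask()
bd = block_diagonal_mask()
```

Under this reading, the distinction noted above follows directly: a bandpass constraint with a nonzero low cutoff still admits the pure horizontal/vertical edge coefficients, while the block-diagonal constraint removes them along with the DC term.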
The performance of the best sampled block-diagonal filters is compared to baseline methods such as FAST, Shi-Tomasi, a Difference of Gaussians filter, and filters used by Geiger et al. in Table I. The best result from every testing dataset (row) is shown in bold. All detector evaluations were performed with the same system parameters: detect 300 features after non-extrema suppression and filtering with the AprilTag masks, use RANSAC and a Cauchy robust cost function (b = 1.0), etc. These parameters were set via parameter sweeps using the FAST feature detector on the office and conference room datasets. As such, these represent best-case conditions for FAST. Mean values over 25 trials are reported due to the variability induced by RANSAC. The differences in the means between the trained filters and FAST were statistically significant in 10 of the 12 cases, with p-values less than 0.01 for a two-tailed t-test.

TABLE I: Testing error on each dataset using the learned feature detectors and baseline methods. Rows are the testing datasets; columns are the learned filters trained on each dataset, the linear baseline methods (DOG, Geiger Corner, Geiger Blob), and the nonlinear baseline methods (FAST, Shi-Tomasi, SURF). Reported numbers are mean values over 25 trials to compensate for the variability of RANSAC, except for the training errors (gray). Bolded values are the best result for every row. ∗ SURF generated features adjacent to AprilTags that could not be easily filtered out because of SURF's scale invariance. As such, the SURF results are not considered a fair comparison to the other methods. + Large errors are the result of data association failures with the specified features.

Testing dataset | Filters trained on: Office | Conf rm. | Cubicle | Lobby
Office          |                     0.447  | 0.466    | 0.471   | 0.468
Conf rm.        |                     0.981  | 0.929    | 0.996   | 0.981
Cubicle         |                     1.292  | 1.142    | 1.134   | 1.368
Lobby           |                     1.593  | 1.552    | 1.628   | 1.482

[Baseline-method errors for DOG, Geiger Corner, Geiger Blob, FAST, Shi-Tomasi, and SURF: 40.281+, 0.492, 1.047, 2.131, 1.573, 0.595, 1.218, 4.042, 1.927, 0.470, 0.953, 1.441, 1.654, 2.060, 1.141, 4.550, 1.938, 1.322∗, 1.584∗, 0.795∗, 2.032∗, 1.505, 2.962, 1.974.]

These results reinforce the notion that learned convolutional filters can compete with nonlinear detection methods, like the FAST feature detector. Only on the conference room dataset did FAST perform better than a learned filter. On average, learned filters had a lower reprojection error than FAST by a small amount, 0.05 pixels. For other baseline methods, such as Shi-Tomasi, the improvement in reprojection error was substantial. Note that while SURF outperformed all methods on the cubicle dataset, this is due to SURF detections adjacent to AprilTags that cannot be rejected as easily due to SURF's scale invariance.

For the linear baseline methods, the results vary greatly. Geiger et al.'s corner filter performs the best of the three, and yet it and the other linear baselines perform quite poorly on the cubicle dataset, unlike the learned filters. Interestingly, these linear detectors (or an equivalent 8 × 8 filter) are simply a few of the convolutional filters that could have been learned in our framework.

These results also suggest that dataset choice, not learned filter, was the best predictor of testing errors. The cubicle dataset had a high error in a number of cases. From inspecting the video stream, this is not surprising — the cubicle is generally featureless except for a few smooth edges and a narrow strip at the top where the camera sees over the cubicle wall.
VI. SUMMARY

We have presented a framework for automatically learning feature detectors that can be efficiently computed on modern architectures and result in performance that is generally better than existing methods, sometimes substantially so. By sweeping over frequency-domain constraints on the filters during sampling, we learn detectors that outperform obvious alternatives and prior work. In addition, our results indicate that the best detectors are typically those that respond primarily to the lowest non-DC frequency components. These learned detectors perform well on all datasets, despite only using one dataset during training.

ACKNOWLEDGEMENTS

This work was supported by U.S. DoD Grant FA2386-111-4024.

REFERENCES

[1] N. Ahmed, T. Natarajan, and K. Rao. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974.
[2] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. Computer Vision–ECCV 2006, pages 404–417, 2006.
[3] T. Bose, F. Meyer, and M. Chen. Digital Signal and Image Processing. J. Wiley, 2004.
[4] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 43–57, 2010.
[5] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:1052–1067, 2007.
[6] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981.
[7] S. Gauglitz, T. Höllerer, and M. Turk. Evaluation of interest point detectors and feature descriptors for visual tracking. International Journal of Computer Vision, pages 1–26, 2011.
[8] A. Geiger, J. Ziegler, and C. Stiller. StereoScan: Dense 3d reconstruction in real-time. In IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany, June 2011.
[9] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, page 50, Manchester, UK, 1988.
[10] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[11] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.
[13] C. Mei, G. Sibley, M. Cummins, P. Newman, and I. Reid. RSLAM: A system for large-scale mapping in constant-time using stereo. International Journal of Computer Vision, pages 1–17, 2010.
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1):43–72, 2005.
[15] T. Moon and W. Stirling. Mathematical Methods and Algorithms for Signal Processing, volume 204. Prentice Hall, 2000.
[16] P. Moreels and P. Perona. Evaluation of features detectors and descriptors based on 3D objects. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 1, pages 800–807, 2005.
[17] E. Olson. AprilTag: A robust and flexible visual fiducial system. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2011.
[18] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. Computer Vision–ECCV 2006, pages 430–443, 2006.
[19] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, November 2008.
[20] L. Trujillo and G. Olague. Automated design of image operators that detect interest points. Evolutionary Computation, 16(4):483–507, 2008.