
Deep Feature Matching

Mirko edited this page Aug 8, 2023 · 31 revisions

The Classical Feature Matching Pipeline struggles with high image noise and with images whose exposure varies drastically. Another possible point of failure lies in the architecture of the pipeline itself: the individual algorithms run independently from each other, so errors made in one step are amplified by the subsequent steps in the pipeline.

In recent years, the concept of Deep Feature Matching has been proposed and researched. The superior semantic capabilities of the deeper layers of convolutional neural networks are used to make matches more robust against factors such as noise and exposure variation, while the shallower layers are used to refine the features in pixel coordinates. The Deep Feature Matching algorithm offers an all-in-one solution in which feature extraction and feature matching are merged into one step, reducing the number of places where errors can occur.

The following section of this wiki examines the paper DFM: A Performance Baseline for Deep Feature Matching to determine whether it is viable for the proposed application. Afterwards, experiments with different noise types are carried out to compare the performance of DFM against classical algorithms.

Architecture

Architecture of the DFM algorithm
Image: Efe, Ufuk and Ince, Kutalmis Gokalp and Alatan, Aydin. DFM: A Performance Baseline for Deep Feature Matching

The DFM algorithm uses a pretrained VGG-19 convolutional neural network, which consists of 16 convolutional layers and three fully connected layers. Since spatial information is lost in the fully connected layers, only the convolutional layers are used for DFM. They are grouped into five sections, as the table below shows.

| Section | Layers | Kernel Size | Layer Depth |
| --- | --- | --- | --- |
| Section 1 | conv_1_1, conv_1_2 | $3 \times 3$ | 64 |
| Section 2 | conv_2_1, conv_2_2 | $3 \times 3$ | 128 |
| Section 3 | conv_3_1, conv_3_2, conv_3_3, conv_3_4 | $3 \times 3$ | 256 |
| Section 4 | conv_4_1, conv_4_2, conv_4_3, conv_4_4 | $3 \times 3$ | 512 |
| Section 5 | conv_5_1, conv_5_2, conv_5_3, conv_5_4 | $3 \times 3$ | 512 |

Note

These sections are incorrectly called layers in the paper, which is why, in the following parts, the network will be referred to as having only five layers.

Between consecutive sections there is a $2 \times 2$ MaxPooling layer, so by section 5 the image has been downscaled to $\frac{1}{16}$ of its original size.

Stage 1

Generating the Feature Maps

First, the two images A and B that should be matched are propagated through the network, and the feature maps $F^A$ and $F^B$ are extracted at every layer. A feature map is a 3D tensor $F \in \mathbb{R}^{h \times w \times n}$, where $h \times w$ is the spatial resolution and $n$ is the number of channels. See D2-Net for more information.

A descriptor $d$ can now be formed at the spatial coordinates $x$ and $y$ by stacking the values of all $n$ channels into a vector.

$$ d_{x,y} = \begin{pmatrix} F_{x,y,1} \\ F_{x,y,2} \\ F_{x,y,3} \\ \vdots \\ F_{x,y,n} \end{pmatrix} $$

Since the number of channels $n$ in a feature map depends on the depth of the layer, so does the size of the descriptors in DFM: the descriptor size equals the layer depth listed in the table above for the layer the feature map was extracted at.
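As a minimal numpy sketch of the descriptor definition above (the sizes are illustrative toy values; section 5 of VGG-19 would give $n = 512$ channels):

```python
import numpy as np

# Illustrative feature map: h x w spatial resolution with n channels.
h, w, n = 4, 4, 512
F = np.random.rand(h, w, n)

# The descriptor at spatial position (x, y) is simply the vector of
# all channel values at that position.
x, y = 2, 1
d = F[x, y, :]  # shape (n,) -- equal to the layer depth
```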

Dense Nearest Neighbor Search (DNNS)

Next, a Dense Nearest Neighbor Search (DNNS) is carried out in the deepest (5th) layer of the network. Here, potential matches are found by calculating the l2-distance (Euclidean distance) between all descriptors of images A and B. The distance between two descriptors $d_{x_A,y_A}^A$ and $d_{x_B,y_B}^B$ is the square root of the sum of the squared differences of their elements. Note that the spatial coordinates $x_A, y_A$ and $x_B, y_B$ don't have to be identical.

$$ l2_{x_A,y_A,x_B,y_B} = \sqrt{\sum_{j=1}^n{(d_{x_A,y_A,j}^A - d_{x_B,y_B,j}^B)^2}} $$

For a given descriptor, the descriptor in the other image with the lowest l2-distance is saved as the best match with distance $l2_1$, and the descriptor with the second-lowest l2-distance is saved as the second-best match with distance $l2_2$. This is repeated for every descriptor in images A and B.
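The dense search can be sketched in numpy as follows (toy descriptor sets; real layer-5 descriptors would have 512 dimensions):

```python
import numpy as np

# Toy descriptor sets: 5 descriptors from image A, 6 from image B,
# each with 8 dimensions.
rng = np.random.default_rng(0)
dA = rng.random((5, 8))
dB = rng.random((6, 8))

# Pairwise l2-distances between every descriptor of A and B.
dists = np.linalg.norm(dA[:, None, :] - dB[None, :, :], axis=2)  # (5, 6)

# For each descriptor in A: distance of the best and second-best
# candidate in B, needed later for the ratio test.
order = np.argsort(dists, axis=1)
l2_1 = dists[np.arange(5), order[:, 0]]  # best match distances
l2_2 = dists[np.arange(5), order[:, 1]]  # second-best match distances
```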

Ratio Test

After finding all possible matches, they are filtered with a ratio test between the best and second-best matches. If the ratio between the l2-distance of the best match and that of the second-best match is below a certain threshold $rt$, the match is kept; otherwise it's discarded.

$$ \frac{l2_1}{l2_2} < rt $$

For DFM, the ratio threshold differs for every layer. Ordered from first to last layer, the ratio thresholds are 0.9, 0.9, 0.9, 0.9, 0.95.
For SIFT, David G. Lowe found a ratio threshold of $rt = 0.7$ to produce the best results. Since the ratio test is carried out multiple times in DFM, higher values have been chosen: the ratios from the first three DFM layers alone multiply to $rt = 0.9^3 = 0.729$.
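A minimal sketch of the ratio test (the function name is ours; the thresholds are the per-layer values quoted above):

```python
# Keep a match only if its best distance is clearly smaller than the
# second-best one; rt is 0.9 for layers 1-4 and 0.95 for layer 5.
def ratio_test(l2_1, l2_2, rt):
    return l2_1 / l2_2 < rt

# A distinctive match passes, an ambiguous one is discarded.
print(ratio_test(0.4, 0.8, 0.9))   # True  -> keep
print(ratio_test(0.75, 0.8, 0.9))  # False -> discard
```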

Hierarchical Refinement

Hierarchical refinement of feature matches
Image: Schürlein, Mirko

Since the initial matches have been found in the deepest layer, their resolution and therefore their localization is very poor: the original image has been downsampled by a factor of 16 in layer 5, so these matches need to be refined. To do that, for every feature point the area in the layer above that influences this feature point is selected. This area is also called the receptive field. Since the image is downsampled by $2 \times 2$ MaxPooling layers, this area is always $2 \times 2$ pixels.

In this area, DNNS is carried out again, which can have three possible outcomes, shown on the right side of the image above.

  1. The match is refined and propagates another layer up (purple and green matches)
  2. The match is discarded (yellow match)
  3. The match is split into multiple matches (blue match)
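The coordinate mapping between layers can be sketched as follows (a hypothetical helper, assuming the $2 \times 2$ max-pooling described above): a match at $(x, y)$ in layer $k$ corresponds to a $2 \times 2$ candidate area one layer up, inside which DNNS is repeated.

```python
# Receptive field of a feature point at (x, y) one layer up:
# the 2x2 block {2x, 2x+1} x {2y, 2y+1} collapsed by max-pooling.
def receptive_field(x, y):
    return [(2 * x + i, 2 * y + j) for i in range(2) for j in range(2)]

print(receptive_field(3, 5))  # [(6, 10), (6, 11), (7, 10), (7, 11)]
```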

Stage 0

As the authors describe, the architecture described above didn't yield satisfying results for image pairs that are heavily rotated or distorted. They therefore propose a two-stage approach, where another stage is added before Stage 1.

Generating the Feature Maps

First, the images A and B are propagated through the network and the feature maps $F^A$ and $F^B$ are extracted at every layer, as described in Stage 1.

Dense Nearest Neighbor Search

Afterwards, a DNNS is carried out in the fifth layer to generate the initial matches, as described in Stage 1.

Ratio Test

As in Stage 1, a ratio test is used to filter the matches.

Generating an initial Geometric Transformation Estimation

Hierarchical refinement is not used in Stage 0. Instead, an approximate geometric transformation is produced using the cv.findHomography function from the OpenCV library. With this approximate geometric transformation, the initial image B can be warped into an image B', which now approximately lies on image A.

Stage 1 can now be run with the warped image B', which, according to the authors, results in better matching performance.
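The warp itself applies the estimated $3 \times 3$ homography to pixel coordinates. A minimal numpy sketch of that operation (the function name is ours; in DFM the matrix would come from cv.findHomography):

```python
import numpy as np

# Apply a 3x3 homography H to a set of 2D pixel coordinates.
def warp_points(H, pts):
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
    warped = (H @ pts_h.T).T
    return warped[:, :2] / warped[:, 2:]              # back to 2D

# The identity homography leaves the points unchanged.
pts = np.array([[10.0, 20.0], [30.0, 40.0]])
print(warp_points(np.eye(3), pts))
```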

Two Stage Approach

The algorithm can be run in one of the two following configurations.

  • Stage 0 & Stage 1: Promises better performance for images that are heavily distorted or rotated.
  • Stage 1 only: Is generally faster and yields similar performance for images that are well aligned.

Expectation of performance with noisy images

The architecture is expected to have two main effects when used on noisy images.

  1. Because of the downsampling by the MaxPooling layers, the matches found in the fifth layer are expected to remain robust: downsampling usually suppresses image noise, so the matches found in the deep layers should remain stable.
  2. During the Hierarchical Refinement, matches are localized more precisely; however, in the shallow layers of the network the images are not yet downsampled much and are therefore more susceptible to image noise.

It is therefore expected that robust matches are still found on noisy images, but that the localization of these matches could be affected by the noise. To test this hypothesis, the algorithm was compared against classical algorithms in the following section.

Testing the Algorithm

To test the matching performance of the DFM algorithm against classical feature extraction algorithms like SIFT, SURF or AKAZE, the HPatches dataset has been used. It consists of groups of images with different illuminations and viewing angles, as well as the homographies that map the positions of the images onto each other.

The testing of the algorithms has been performed with the Image Matching Evaluation repository. This is a modified version of the original IME repository, in which different ratio thresholds can be tested and additional programs for plotting results and generating noisy images have been added.

Note

If you wish to reproduce the results from the following section, clone the adjusted IME repository and follow the instructions in the README.

Initial Test

Mean Matching Accuracy (MMA)

Mean Matching Accuracy of different algorithms on the HPatches dataset
Image: Schürlein, Mirko

MMA is calculated by first projecting the positions of the features from image A onto image B using the ground-truth homography that comes with the HPatches dataset. Afterwards, the pixel distance between the features from image B and the projected features from image A can be calculated. To plot the results, a pixel threshold is introduced: matches with a distance below the threshold are evaluated with an accuracy of 100%, matches at or above the threshold with an accuracy of 0%. Averaging over all matches results in the MMA plots shown above.
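The MMA computation just described can be sketched in a few lines of numpy (the function name is ours; `H_gt` is the ground-truth homography from HPatches):

```python
import numpy as np

def mma(pts_A, pts_B, H_gt, threshold):
    """Fraction of matches whose reprojection error under the
    ground-truth homography lies below the pixel threshold."""
    pts_h = np.hstack([pts_A, np.ones((len(pts_A), 1))])
    proj = (H_gt @ pts_h.T).T
    proj = proj[:, :2] / proj[:, 2:]              # project A onto B
    err = np.linalg.norm(proj - pts_B, axis=1)    # pixel distances
    return np.mean(err < threshold)

# With an identity homography and identical points, every match counts.
pts = np.array([[1.0, 2.0], [3.0, 4.0]])
print(mma(pts, pts, np.eye(3), threshold=3))  # 1.0
```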

In the initial test with the HPatches dataset, the MMA of the DFM algorithm is better at every pixel threshold, as can be seen in the image above. The numbers of features and matches are displayed in the table below, where DFM also produces better results than the classical algorithms.

Note

For DFM, every feature is automatically a match, since the Dense Nearest Neighbor Search is executed directly inside the algorithm, while for the classical algorithms the Nearest Neighbor Search is executed after the feature extraction step.

| Algorithm | #Features | #Matches |
| --- | --- | --- |
| AKAZE | 2691 | 658 |
| DFM | 7927 | 7927 |
| KAZE | 3116 | 1014 |
| ORB | 499 | 100 |
| SIFT | 4565 | 1291 |
| SURF | 5997 | 1377 |

Homography Estimation Accuracy (HEA)

Homography Estimation Accuracy of different algorithms on the HPatches dataset
Image: Schürlein, Mirko

Using only MMA as a performance measure isn't enough, since an algorithm that produces only a few, poorly distributed matches could still score very well, yet would not be viable for geometric estimation. This is why it is very common to provide a table with the number of matches along with the plots. Another way to measure matching performance is the Homography Estimation Accuracy.

To calculate HEA, a geometric transformation between image A and image B is estimated from the matched features; we call it the estimated homography. Afterwards, image A is transformed with both the estimated homography and the ground-truth homography, resulting in images $A_{estimated}$ and $A_{groundtruth}$. This allows us to calculate the average pixel distance between the four corners of $A_{estimated}$ and $A_{groundtruth}$. As in MMA, a pixel threshold is defined for plotting: the accuracy is 100% if the distance lies below that threshold and 0% otherwise. Lastly, the results of all images are averaged and plotted in the image above.
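The corner-based HEA measure can be sketched as follows (the function name is ours; the image size is an illustrative assumption):

```python
import numpy as np

def hea(H_est, H_gt, corners, threshold):
    """100% if the average corner displacement between the estimated
    and ground-truth warps lies below the pixel threshold, else 0%."""
    c = np.hstack([corners, np.ones((4, 1))])
    def warp(H):
        w = (H @ c.T).T
        return w[:, :2] / w[:, 2:]
    err = np.linalg.norm(warp(H_est) - warp(H_gt), axis=1).mean()
    return 1.0 if err < threshold else 0.0

# Corners of a hypothetical 640x480 image; an estimate that is off by
# 2px in x passes a 5px threshold but fails a 1px threshold.
corners = np.array([[0, 0], [640, 0], [640, 480], [0, 480]], float)
H_est = np.eye(3); H_est[0, 2] = 2.0
print(hea(H_est, np.eye(3), corners, threshold=5))  # 1.0
print(hea(H_est, np.eye(3), corners, threshold=1))  # 0.0
```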

The HEA results show that DFM still produces the best results for illumination changes at every threshold; for viewpoint changes, however, specifically at lower thresholds, most classical algorithms produce better results. At the 10px threshold, DFM catches up to SIFT and even surpasses KAZE. This suggests that the DFM algorithm is most useful for illumination changes and can be used for viewpoint changes, but with a larger localization error than SIFT or AKAZE.

In the table below, the numbers of matches and inliers are displayed, where DFM still produces by far the best results. The number of inliers is returned by the RANSAC algorithm when producing the estimated homography and indicates how many matches have been used to generate it.

| Algorithm | #Matches | #Inliers |
| --- | --- | --- |
| AKAZE | 658 | 472 |
| DFM | 7927 | 7263 |
| KAZE | 1014 | 793 |
| ORB | 100 | 58 |
| SIFT | 1291 | 952 |
| SURF | 1377 | 866 |

Testing Noisy Images

In order to test the matching performance of the algorithms on noisy images, the skimage package was used to add artificial noise to the dataset. Sensor noise can be approximated by adding Gaussian noise to the original images. Other noise types, such as salt & pepper and speckle, were also tested. The parameters used for generating the noise are listed below.

| Noise Type | Parameters |
| --- | --- |
| gaussian | mean: 0, variance: 0.01 |
| speckle | mean: 0.3, variance: 0.01 |
| salt & pepper | amount: 0.005 |
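As a minimal numpy sketch of the Gaussian case (mean 0, variance 0.01), roughly what skimage.util.random_noise does in `gaussian` mode; the function name and the flat test image are ours:

```python
import numpy as np

def add_gaussian_noise(img, mean=0.0, var=0.01, seed=0):
    """Add zero-mean Gaussian noise to an image with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(mean, np.sqrt(var), img.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep values in valid range

img = np.full((8, 8), 0.5)           # flat gray test image
noisy = add_gaussian_noise(img)
```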

Mean Matching Accuracy

Gaussian Noise

Mean Matching Accuracy of different algorithms using original images and images with gaussian noise
Mean Matching Accuracy test with noisy images. The solid line represents the original images, the dotted line represents images with gaussian noise. Image: Schürlein, Mirko

Speckle Noise

Mean Matching Accuracy of different algorithms using original images and images with speckle noise
Mean Matching Accuracy test with noisy images. The solid line represents the original images, the dotted line represents images with speckle noise. Image: Schürlein, Mirko

Salt & Pepper Noise

Mean Matching Accuracy of different algorithms using original images and images with salt & pepper noise
Mean Matching Accuracy test with noisy images. The solid line represents the original images, the dotted line represents images with salt & pepper noise. Image: Schürlein, Mirko

Original Images

| Algorithm | #Features | #Matches |
| --- | --- | --- |
| AKAZE | 2691 | 658 |
| DFM | 7927 | 7927 |
| KAZE | 3116 | 1014 |
| ORB | 499 | 100 |
| SIFT | 4565 | 1291 |
| SURF | 5997 | 1377 |

Gaussian Noise

| Algorithm | #Features | #Matches |
| --- | --- | --- |
| AKAZE | 2893 | 614 |
| DFM | 2372 | 2372 |
| KAZE | 3261 | 985 |
| ORB | 499 | 86 |
| SIFT | 5101 | 889 |
| SURF | 9923 | 1485 |

Speckle Noise

| Algorithm | #Features | #Matches |
| --- | --- | --- |
| AKAZE | 3507 | 813 |
| DFM | 5633 | 5633 |
| KAZE | 4072 | 1297 |
| ORB | 2799 | 95 |
| SIFT | 5431 | 1290 |
| SURF | 7474 | 1480 |

Salt & Pepper Noise

| Algorithm | #Features | #Matches |
| --- | --- | --- |
| AKAZE | 2712 | 644 |
| DFM | 4212 | 4212 |
| KAZE | 3131 | 1006 |
| ORB | 499 | 88 |
| SIFT | 4685 | 1198 |
| SURF | 6889 | 1386 |

As in the initial test, DFM has a better Mean Matching Accuracy than the classical algorithms. SIFT and SURF show a major decrease in MMA, especially for Gaussian noise, while KAZE and AKAZE almost maintain their MMA for speckle and salt & pepper noise and only show a slight decrease for Gaussian noise. ORB is the overall worst algorithm and also loses MMA for all noise types, although not as much as SIFT and SURF do for Gaussian noise.

DFM again shows a bigger decrease for viewpoint changes compared to illumination changes. Another observation for DFM is that the MMA on noisy images seems to catch up to the MMA on the original images at higher pixel thresholds. This suggests that the localization of matches suffers, while the matches found in the deeper layers stay robust, which supports the hypothesis made in the analysis of the DFM architecture.

The numbers of features and matches stay almost identical under any noise type for the classical algorithms, whereas DFM shows a big decrease for Gaussian noise, again indicating that the refinement is influenced by noisy images. Another interesting observation is that the number of detected features increases slightly for all classical algorithms, and especially for SURF on Gaussian noise, while the number of matches is lower. This indicates that the classical feature detection isn't as robust against noise and instead falsely qualifies noise as feature points.

Homography Estimation Accuracy

Gaussian Noise

Homography Estimation Accuracy of different algorithms using original images and images with gaussian noise
Homography Estimation Accuracy test with noisy images. The solid line represents the original images, the dotted line represents images with gaussian noise. Image: Schürlein, Mirko

Speckle Noise

Homography Estimation Accuracy of different algorithms using original images and images with speckle noise
Homography Estimation Accuracy test with noisy images. The solid line represents the original images, the dotted line represents images with speckle noise. Image: Schürlein, Mirko

Salt & Pepper Noise

Homography Estimation Accuracy of different algorithms using original images and images with salt & pepper noise
Homography Estimation Accuracy test with noisy images. The solid line represents the original images, the dotted line represents images with salt & pepper noise. Image: Schürlein, Mirko

Original Images

| Algorithm | #Matches | #Inliers |
| --- | --- | --- |
| AKAZE | 658 | 472 |
| DFM | 7927 | 7263 |
| KAZE | 1014 | 793 |
| ORB | 100 | 58 |
| SIFT | 1291 | 952 |
| SURF | 1377 | 866 |

Gaussian Noise

| Algorithm | #Matches | #Inliers |
| --- | --- | --- |
| AKAZE | 614 | 389 |
| DFM | 2372 | 1875 |
| KAZE | 985 | 723 |
| ORB | 86 | 40 |
| SIFT | 889 | 434 |
| SURF | 1485 | 534 |

Speckle Noise

| Algorithm | #Matches | #Inliers |
| --- | --- | --- |
| AKAZE | 813 | 560 |
| DFM | 5633 | 5002 |
| KAZE | 1297 | 1002 |
| ORB | 95 | 51 |
| SIFT | 1290 | 851 |
| SURF | 1480 | 817 |

Salt & Pepper Noise

| Algorithm | #Matches | #Inliers |
| --- | --- | --- |
| AKAZE | 644 | 453 |
| DFM | 4212 | 3711 |
| KAZE | 1006 | 779 |
| ORB | 88 | 43 |
| SIFT | 1198 | 830 |
| SURF | 1386 | 772 |

DFM is still the best-performing algorithm in HEA for illumination changes, regardless of noise type. However, its performance for viewpoint changes drops significantly. For Gaussian noise, DFM drops below every algorithm except ORB, but it shows the steepest curve, suggesting it could catch up to the other algorithms at higher pixel thresholds. For speckle and salt & pepper noise, DFM doesn't perform quite as badly and lies roughly in the middle of the other algorithms.

Interestingly, the HEA of SIFT and SURF drops less for viewpoint changes than for illumination changes when noise is added to the images. AKAZE and KAZE again show only slight drops in HEA for noisy images, regardless of the noise type and of whether illumination or viewpoint changes are evaluated. However, both perform worse than SIFT due to their worse "starting position" on the original images.

The numbers of matches and inliers again show a picture similar to MMA: both values stay almost identical for the classical algorithms and decrease dramatically for DFM. DFM still manages to find the most inliers, however, regardless of noise type. One thing to note is that KAZE and AKAZE manage to increase both the number of inliers and the number of matches on images with speckle noise. As of now, no definitive explanation for this behavior has been found, except possibly favorably distributed noise.

Conclusion

The Mean Matching Accuracy of the DFM algorithm is very promising and lies significantly above all other tested algorithms. This result is maintained even when artificial noise is added to the test images.
The Homography Estimation Accuracy, specifically for viewpoint changes, shows a different picture: there, DFM is outperformed by SIFT, SURF and KAZE. Adding artificial noise to the images decreases DFM's performance further, so that AKAZE also outperforms DFM.

The difference between the MMA and HEA performance of DFM suggests that DFM largely finds good matches that are concentrated in one area of an image. Since the matches are good, MMA is good; but if the matches are poorly distributed, HEA can't be accurate.
Another possible explanation lies in the fact that DFM seems to catch up to the other algorithms at larger thresholds. This suggests that some bad matches are still propagated to the first layer. If there are too many bad matches, the geometric transformation used to measure HEA is off, resulting in a bad HEA at lower thresholds and a better HEA at higher thresholds.

Following these results, it had to be evaluated whether the algorithm should be used for further photogrammetry research or whether another Deep Feature Matching algorithm would be better suited.
Since the viewpoint angles of the test images weren't given, and the noise of the camera sensors used later was not known either, an objective evaluation of the results with respect to the real-world application cannot really be made. In order to further evaluate the DFM algorithm, it was decided to implement it into Meshroom and then compare the end-to-end results against the algorithms from the test.
Since the research into implementing DFM into Meshroom would also help with implementing other algorithms in case of bad results, this was the next logical step in the research of Deep Feature Matching algorithms.

The next section will take you to the Developer Documentation for implementing new Nodes in Meshroom; a final conclusion of the project can be found afterwards.

Meshroom »
