OpenCV Tutorials - Feature Detection and Description

SIFT 全名為 Scale-Invariant Feature Transform，具有 scale-invariant 的特性，在找到 feature 之後，可以產生相對的 feature descriptor，供不同的影像間作 feature matching。

前面章節所介紹的 Harris Corner ，它具有 rotation-invariant 的特性，也就是說，即使影像旋轉了，還是可以找到相同的 feature，因為 corner 旋轉之後還是 corner，但，如果是放大、縮小呢？ Corner 經過放大、縮小之後，由相同的演算法來計算，未必會仍然判斷為 corner，如下圖所示，所以 Harris Corner 不算是 scale-invariant。

SIFT 在 2004 年被提出來，具有 scale-invariant 的特性，用來取出 feature 並計算其 descriptor。

在 SIFT 演算法中，主要步驟如下：

1，Scale-space Extrema Detection

由上面 corner 的例子來看，對於不同大小的 corner 需要有不同大小的 window 來作偵測，這時就需要 scale-space filtering。
LoG (Laplacian of Gaussian) 可以用來在不同大小的影像中作 blob 的偵測，σ為 scaling parameter。舉例來說，在上圖中，小 corner 的 σ會比較小，而大的 σ值就適用於大 corner。

但由於 LoG 的計算量很大，改用 Difference of Gaussians 來取代，可以得到和 LoG 近似的結果；Difference of Gaussian 可以由同一張影像經過 σ及 kσ 的 Gaussian blurring 後取得，整個過程可以透過 Gaussian Pyramid 完成，如下圖：

在 DoG 建立完成後，local extrema 會在不同的 scale 及 space 中被搜尋，例如該像素會跟相鄰的8個像素以及上一個 scale 及下一個 scale 附近共 18 個像素比較，如果該像素的值是 local extrema ，則該區域為 potential keypoint，相關需要比較的像素位置如下圖：

論文中的相關參數經驗值為： octaves = 4, number of scale levels = 5, 初始值 σ= 1.6, k = sqrt(2)。

2, Keypoint Localization

搜尋完影像中的 potential keypoint 後，再利用 Taylor series expansion of scale space 得到 extrema 的更精確位置，而且如果 intensity of extrema 的值小於某個 threshold (論文中的值為 0.03)，該 keypoint 將被剔除。這個 threshold 在 OpenCV也被稱為 contrastThreshold。

DoG 會把 edge 算進 potential keypoint，為了要移除這些 edge，會計算該 potential keypoint 的 eigen value，如果他們之間的比值超過 edgeThreshold，則該 keypoint 將被剔除，在論文中 edgeThreshold = 10。

到此，剔除了低對比及 edge 的 keypoint，剩下來的，就是所需要的 feature point。

3, Orientation Assignment

為了要達到 rotation-invariant，要提供 orientation 給每個 keypoint，在該 keypoint 會以它周邊的像素來計算各個像素的 gradient magnitude 及 gradient direction，把 gradient direction 分成 36 種，間隔為 10°，經過累加後，找出 gradient direction 強度最強的那個方向，即為該 keypoint 的 orientation，若有其它 gradient direction 超過強度最強的 80%，則該 gradient direction 也會被納入，成為計算該 keypoint 的 orientation 的參考。

4, Keypoint Descriptor

先將在該 keypoint 周圍 16 x 16 區域的像素點，分成 16 個 4x4 的子區塊，對於每個子區塊再分別計算它們的 8 bin orientation，所以總共有 8 x 16 = 128 個 bin values，由這些值構成的向量，即為 keypoint descriptor。

補充：這 128 個 bin value，的紀錄順序可能會以第三步驟所計算的 orientation 作為根據，以達到 rotation-invariant。

5, Keypoint Matching

在兩張影像中，如果要在第二張影像搜尋第一張影像的某個 keypoint，其搜尋方式就是透過該 keypoint 的 descriptor 與第二張影像的所有 keypoints 的 descriptors 來計算距離，通常距離最近的，即為相同的 keypoint，但如果最近距離與第二近距離的比值大於 0.8，則捨棄該 keypoint，因為誤判的機率很大，該作法可以減少 90% 的誤判，然而代價是同時有 5% 的機會捨棄已經正確搜尋到的 keypoints。

有關影像的 gradient magnitude 及 gradient direction 可以參考：
Gradient Magnitude & Gradient Direction

Reference

Introduction to SIFT
What is gradient orientation and gradient magnitude
counting non directional edges from canny

文字內容或影像內容部份參考、引用自網路，如有侵權，請告知，謝謝。