I'm trying to understand Section 5 of this paper about LSH, in particular how to bucket the generated hashes. Quoting the linked paper:
Given bit vectors consisting of $d$ bits each, we choose $N = O(n^{1/(1+\epsilon)})$ random permutations of the bits. For each random permutation $\sigma$, we maintain a sorted order $O_\sigma$ of the bit vectors, in lexicographic order of the bits permuted by $\sigma$. Given a query bit vector $q$, we find the approximate nearest neighbor by doing the following: For each permutation $\sigma$, we perform a binary search on $O_\sigma$ to locate the two bit vectors closest to $q$ (in the lexicographic order obtained by bits permuted by $\sigma$). We now search in each of the sorted orders $O_\sigma$, examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches $q$. This can be done by maintaining two pointers for each sorted order $O_\sigma$ (one moves up and the other down). At each step we move one of the pointers up or down corresponding to the element with the longest matching prefix. (Here the length of the longest matching prefix in $O_\sigma$ is computed relative to $q$ with its bits permuted by $\sigma$.) We examine $2N = O(n^{1/(1+\epsilon)})$ bit vectors in this way. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to $q$.
I'm confused by this algorithm and I don't think I understand how it works.
I already found this question on the topic, but I didn't understand the answer in the comments. Also, the same algorithm is described in point 2 of this other question, but again, I don't understand how it works.
Can you please explain to me, step by step, how it works, keeping it as simple as possible?
I even tried to make a list of the things I don't understand, but the paper is so badly written that I can't parse most of the sentences!
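To show where I currently am, here is a minimal sketch in Python of how I read the algorithm. All the names (`build_index`, `query`, `steps`) are mine, not the paper's, and I've simplified the two-pointer scan to just walking outward in both directions instead of choosing the pointer by longest matching prefix:

```python
import random
from bisect import bisect_left

def hamming(u, v):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(u, v))

def build_index(vectors, N):
    """Pre-processing: pick N random permutations sigma; for each one,
    store the data set sorted lexicographically by the permuted bits (O_sigma)."""
    d = len(vectors[0])
    index = []
    for _ in range(N):
        sigma = random.sample(range(d), d)        # random permutation of bit positions
        order = sorted((tuple(v[i] for i in sigma), v) for v in vectors)
        index.append((sigma, order))              # keep permuted key + original vector
    return index

def query(q, index, steps=2):
    """For each sorted order O_sigma: binary-search q's position, then examine
    a few elements above and below it; return the closest vector seen."""
    best, best_dist = None, float("inf")
    for sigma, order in index:
        qp = tuple(q[i] for i in sigma)
        pos = bisect_left(order, (qp,))           # where q would sit in O_sigma
        lo, hi = pos - 1, pos
        for _ in range(steps):                    # examine O(1) candidates per order
            for j in (lo, hi):
                if 0 <= j < len(order):
                    _, v = order[j]
                    dist = hamming(q, v)          # distance on the ORIGINAL bits
                                                  # (not sure this is what the paper intends)
                    if dist < best_dist:
                        best, best_dist = v, dist
            lo, hi = lo - 1, hi + 1
    return best, best_dist

vecs = [tuple(random.randint(0, 1) for _ in range(16)) for _ in range(100)]
idx = build_index(vecs, N=5)
q = tuple(random.randint(0, 1) for _ in range(16))
print(query(q, idx))
```

If this sketch is roughly right, then the sorting cost lives entirely in `build_index`, and a query only pays for $N$ binary searches plus a constant number of Hamming-distance computations per order.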
EDIT after gsamaras' answer:
I mostly understood the answer, but I still have some doubts:
1. Is it correct to say that the total cost of performing the $N$ permutations is $O(Nn\log n)$, since we have to sort each one of them?
2. Is the permutation+sorting process described above performed only once during pre-processing, or once for every query $q$? It already seems pretty expensive at $O(Nn\log n)$ even in pre-processing; if we have to do it at query time it's a disaster :D
3. At the last point, where we compare $v_0$ and $v_4$ to $q$, do we compare their permuted versions or the original ones (before the permutation)?
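To make doubt 1 concrete, this is the back-of-the-envelope calculation I have in mind ($n$ and $\epsilon$ are just example values I picked):

```python
from math import log2

n = 1_000_000                      # number of bit vectors (example value)
eps = 1.0                          # the epsilon from the paper (example value)

N = round(n ** (1 / (1 + eps)))    # N = O(n^(1/(1+eps))) -> 1000 permutations here
comparisons = N * n * log2(n)      # sorting n keys, N times: O(N * n * log n)

print(N, f"{comparisons:.1e}")     # 1000 permutations, ~2.0e10 comparisons
```

Even at pre-processing time that looks heavy, which is why I'm asking whether it really happens only once.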