It's about the libpuzzle library for PHP ( http://libpuzzle.pureftpd.org/project/libpuzzle ) by Mr. Frank Denis. I'm trying to understand how to index and store the data in my MySQL database. Generating the vector is absolutely no problem.
Example:
# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');
# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);
# Are the pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
    echo "Pictures are looking similar\n";
} else {
    echo "Pictures are different, distance=$d\n";
}
That's all clear to me. But how do I proceed when I have a large number of pictures, say more than 1,000,000? I calculate the vector and store it with the filename in the database? And how do I find the similar pictures then? If I store every vector in MySQL, I have to open each record and calculate the distance with the puzzle_vector_normalized_distance function. That procedure takes a lot of time (open each database entry, put it through the function, ...).
I read the README of the libpuzzle library and found the following:
Will it work with a database that has millions of pictures?
A typical image signature only requires 182 bytes, using the built-in compression/decompression functions.
Similar signatures share identical “words”, i.e. identical sequences of values at the same positions. By using compound indexes (word + position), the set of possible similar vectors is dramatically reduced, and in most cases no vector distance actually needs to be computed.
Indexing through words and positions also makes it easy to split the data into multiple tables and servers.
So yes, the Puzzle library is certainly not incompatible with projects that need to index millions of pictures.
I also found this description about indexing:
------------------------ INDEXING ------------------------
How to quickly find similar pictures, if they are millions of records?
The original paper has a simple, yet efficient answer.
Cut the vector in fixed-length words. For instance, let's consider the following vector:
[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]
With a word length (K) of 10, you can get the following words:
[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1
Then, index your vector with a compound index of (word + position).
Even with millions of images, K = 10 and N = 100 should be enough to have very few entries sharing the same index.
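Just to make that word-cutting step concrete, here is a minimal sketch in plain PHP (no libpuzzle needed; the "signature" is just an array of values, and `cut_into_words` is a hypothetical helper name, not part of the library):

```php
<?php
// Cut a signature vector into overlapping fixed-length words.
// $vector: array of values, $k: word length, $n: max number of words.
function cut_into_words(array $vector, int $k, int $n): array
{
    $words = [];
    $max = min($n, count($vector) - $k + 1);
    for ($pos = 0; $pos < $max; $pos++) {
        // Key each word by its position so (word + position) can be indexed.
        $words[$pos] = implode('', array_slice($vector, $pos, $k));
    }
    return $words;
}

$vector = str_split('abcdefghijklmnopqrstuvwxyz');
$words  = cut_into_words($vector, 10, 100);

echo $words[0], "\n"; // abcdefghij (position 0)
echo $words[1], "\n"; // bcdefghijk (position 1)
```

Each (position, word) pair then becomes one row in the index table, so a lookup on that pair immediately narrows the search down to the handful of signatures that share it.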
Here's a very basic sample database schema:
+-----------------------------+
|         signatures          |
+--------+-----------+--------+
| sig_id | signature | pic_id |
+--------+-----------+--------+

+--------------------------+
|          words           |
+--------------+-----------+
| pos_and_word | fk_sig_id |
+--------------+-----------+
I'd recommend splitting at least the "words" table into multiple tables and/or servers.
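To see how that schema answers the "how do I find similar pictures without opening every record" question, here is a self-contained sketch of the indexing and lookup flow. It uses PDO with an in-memory SQLite database and toy string "signatures" purely for illustration; in production this would be MySQL, the signatures would come from puzzle_fill_cvec_from_file, and the helper names (`words_of`, `index_picture`, `candidates`) are my own, not libpuzzle API:

```php
<?php
// Sketch: index signatures by (position + word) and find candidates by
// word lookup instead of scanning every stored vector.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE signatures (
    sig_id    INTEGER PRIMARY KEY,
    signature BLOB NOT NULL,
    pic_id    TEXT NOT NULL
)');
$db->exec('CREATE TABLE words (
    pos_and_word TEXT NOT NULL,
    fk_sig_id    INTEGER NOT NULL REFERENCES signatures(sig_id)
)');
// The compound index that makes candidate lookup fast.
$db->exec('CREATE INDEX idx_words ON words (pos_and_word, fk_sig_id)');

// Cut a signature into overlapping words, keyed "position:word".
function words_of(string $sig, int $k = 10, int $n = 100): array
{
    $out = [];
    $max = min($n, strlen($sig) - $k + 1);
    for ($pos = 0; $pos < $max; $pos++) {
        $out[] = $pos . ':' . substr($sig, $pos, $k);
    }
    return $out;
}

// Store one picture: one signature row plus one row per word.
function index_picture(PDO $db, string $sig, string $picId): void
{
    $db->prepare('INSERT INTO signatures (signature, pic_id) VALUES (?, ?)')
       ->execute([$sig, $picId]);
    $sigId = (int) $db->lastInsertId();
    $ins = $db->prepare('INSERT INTO words (pos_and_word, fk_sig_id) VALUES (?, ?)');
    foreach (words_of($sig) as $word) {
        $ins->execute([$word, $sigId]);
    }
}

// Return pic_ids sharing at least one (position, word) with $sig.
function candidates(PDO $db, string $sig): array
{
    $words = words_of($sig);
    $placeholders = implode(',', array_fill(0, count($words), '?'));
    $stmt = $db->prepare(
        "SELECT DISTINCT s.pic_id
           FROM words w JOIN signatures s ON s.sig_id = w.fk_sig_id
          WHERE w.pos_and_word IN ($placeholders)"
    );
    $stmt->execute($words);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Toy "signatures" (real ones come from puzzle_fill_cvec_from_file).
index_picture($db, 'abcdefghijklmnopqrst', 'img1.jpg');
index_picture($db, 'zyxwvutsrqponmlkjihg', 'img2.jpg');

// A query signature sharing a prefix with img1 only.
print_r(candidates($db, 'abcdefghijkl--------')); // only img1.jpg comes back
```

The point of the cutting is exactly this last step: the word lookup prunes millions of rows down to a few candidates, and puzzle_vector_normalized_distance then only has to run on those few, not on every record.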
By default (lambdas=9), signatures are 544 bytes long. In order to save storage space, they can be compressed to about one third of their original size with the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().
I think that compressing is the wrong way, because then I would have to uncompress every vector before comparing it.
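If I read the README right, compression and the word index are not in conflict: the compressed blob only needs to be decompressed for the few candidates that survive the word lookup, not for every row. A hedged sketch of that idea (requires the libpuzzle PHP extension; `$candidateBlobs` is a hypothetical variable standing for the compressed signatures already fetched for the candidate rows, keyed by pic_id):

```php
<?php
// Store compressed, compare uncompressed, and only for the candidates.
$cvec       = puzzle_fill_cvec_from_file('query.jpg');
$compressed = puzzle_compress_cvec($cvec); // ~1/3 size, goes into the table

// ... insert $compressed into the signatures table, index its words ...

// Later: $candidateBlobs = compressed signatures of the candidate rows only.
foreach ($candidateBlobs as $picId => $blob) {
    $stored = puzzle_uncompress_cvec($blob); // decompress just this candidate
    $d = puzzle_vector_normalized_distance($cvec, $stored);
    if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
        echo "$picId looks similar (distance=$d)\n";
    }
}
```

So the decompression cost scales with the number of candidates (usually a handful), not with the number of pictures in the database.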
My question is now: what is the way to handle millions of pictures, and how do I compare them in a fast and efficient way? I can't understand how the "cutting of the vector" should help me with my problem.
Many thanks. Maybe I can find someone here who is working with the libpuzzle library.
Cheers.