Duplicate image detection with perceptual hashing in Python

Go to: dHash | Dupe threshold | MySQL bit counting | BK-trees

Recently we implemented a duplicate image detector to avoid importing dupes into Jetsetter’s large image store. To achieve this, we wrote a Python implementation of the dHash perceptual hash algorithm and the nifty BK-tree data structure.

Jetsetter has hundreds of thousands of high-resolution travel photos, and we’re adding lots more every day. The problem is, these come from a variety of sources and are uploaded in a semi-automated way, so there are often duplicates or almost-identical photos that sneak in. And we don’t want our photo search page filled with dupes. So we decided to automate the task of filtering out duplicates using a perceptual hashing algorithm.

A perceptual hash is like a regular hash in that it is a smaller, comparable fingerprint of a much larger piece of data. However, unlike a typical hashing algorithm, the idea with a perceptual hash is that the hashes are “close” (or equal) if the original images are close.

We use a perceptual image hash called dHash (“difference hash”), which was developed by Neal Krawetz in his work on photo forensics. It’s a very simple but surprisingly effective algorithm that involves the following steps (to produce a 128-bit hash value):

1. Downsize the image to a 9x9 square of gray values (or 17x17 for a larger, 512-bit hash).
2. Calculate the “row hash”: for each row, move from left to right, and output a 1 bit if the next gray value is greater than or equal to the previous one, or a 0 bit if it’s less (each 9-pixel row produces 8 bits of output).
3. Calculate the “column hash”: same as above, but for each column, move top to bottom.
4. Combine the row hash and the column hash to get the final 128-bit value.
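To make the steps concrete, here is a minimal sketch of dHash in Python. It assumes Pillow for the grayscale downsizing, and it is a simplified illustration of the algorithm described above, not the production implementation this post refers to:

from PIL import Image

def dhash_128(path, size=8):
    """Compute a 2*size*size-bit dHash (128 bits for size=8) of an image."""
    # Step 1: downsize to a (size+1) x (size+1) square of gray values.
    width = size + 1
    img = Image.open(path).convert("L").resize((width, width), Image.LANCZOS)
    grays = list(img.getdata())

    row_hash = 0
    col_hash = 0
    for y in range(size):
        for x in range(size):
            offset = y * width + x
            # Step 2: row hash -- output a 1 bit if the next gray value
            # (to the right) is greater than or equal to the current one.
            row_bit = grays[offset + 1] >= grays[offset]
            row_hash = (row_hash << 1) | row_bit
            # Step 3: column hash -- same comparison, moving top to bottom.
            col_bit = grays[offset + width] >= grays[offset]
            col_hash = (col_hash << 1) | col_bit

    # Step 4: combine the two halves into one 128-bit value.
    return (row_hash << (size * size)) | col_hash

Two hashes can then be compared by counting the bits in which they differ (the Hamming distance), for example bin(h1 ^ h2).count("1"): identical images give a distance of 0, and near-duplicates a small one. That “closeness” is what makes a perceptual hash useful for dupe detection where a regular hash would not be.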