Similar images detection in Ruby - Part 1

At AmberBit we develop Ruby on Rails web applications for clients that are start-ups in different areas. One of the clients required detection of near-duplicate images, which allowed us to explore the area of content-based image retrieval systems, existing commercial solutions, advanced algorithms such as SURF and computer vision libraries. We’re more than happy to share the knowledge in this multi-part blog posts series, but we’ll start small with simple solutions that might be just fine for you and Ruby on Rails application that you develop.

Detecting identical images

Identical images are the simplest case. Let’s say we have 2 images containing logo of our company, and we want to determine if they are identical. Naive approach means calculating MD5 checksum our of both files as suggested on this Stackoverflow thread.

AmberBit Ruby on Rails development company - logo1.png EXIF data present

AmberBit Ruby on Rails development company - logo2.png No EXIF data

This approach, however, fails pretty badly when original image has been stripped from EXIF metadata (logo2.png above), or metadata was added to it. Calculated MD5 checksum of a file, where at least one byte was changed, will yield in very different checksum. This is common problem, for example when you upload images to Wordpress, it will preserve the image data but will strip it of any EXIF metadata by default.

Better approach is to extract pixel data from image first, and calculate MD5 checksums only from that relevant part of images.

Let’s see checksums for whole files first:

require 'digest/md5'
Digest::MD5.file "logo1.png"
# => #<Digest::MD5: bd872be474f49fc5749136299149c659>
Digest::MD5.file "logo2.png"
# => #<Digest::MD5: df89a3a45d4861e9c70e08e11cea7838>

Nope, that won’t work. Let’s try pixel data only:

# Let's try pixel data only:
require 'digest/md5'
require 'RMagick'
img1 = Magick::Image.read("logo1.png").first
# => logo1.png PNG 200x49 200x49+0+0 DirectClass 8-bit 8kb
img2 = Magick::Image.read("logo2.png").first
# => logo2.png PNG 200x49 200x49+0+0 DirectClass 8-bit 8kb
Digest::MD5.hexdigest img1.export_pixels.join
# => "2b04ded676c6895a69fd221f64bb3e79"
Digest::MD5.hexdigest img2.export_pixels.join
# => "2b04ded676c6895a69fd221f64bb3e79"

Yup, that will do it. If we have Rails application, we can store MD5 value for each file and, if we want to, validate if an image was already uploaded.

# We add md5() method to uploader
class AssetUploader < CarrierWave::Uploader::Base
  include CarrierWave::RMagick
  ...
  def md5
    file = model.send(mounted_as)
    @md5 ||= Digest::MD5.hexdigest(
      Magick::Image.from_blob(file.read).extract_pixels.join
    )
  end
end
# And save attribute and validate uniqueness in model
class Asset < ActiveRecord::Base
  ...
  mount_uploader :data, AssetUploader
  validates_uniqueness_of :md5
  before_save :calculate_md5

  private
  def calculate_md5
    if data.present? && data_changed?
      self.md5  = data.md5
    end
  end
end

Detecting near-identical images with perceptual hash

Our example above will work great if you want to filter out only exactly the same images. But if we amend our company’s logo so it differs by one pixel, make it black-and-white or resize, it will fail to detect it.

AmberBit Ruby on Rails development company - logo3.png Scaled-down version of the image

AmberBit Ruby on Rails development company - logo4.png Slightly amended logo

Simpliest way to detect near-duplicate images in Ruby is to use phashion library. Let’s see how we can use it image:

require 'phashion'
img1 = Phashion::Image.new "logo1.png"
# => #<Phashion::Image:0x00000002799ad0 @filename="logo1.png">
img3 = Phashion::Image.new "logo3.png"
# => #<Phashion::Image:0x000000027ac950 @filename="logo3.png">
img4 = Phashion::Image.new "logo4.png"
# => #<Phashion::Image:0x000000027c24f8 @filename="logo4.png">
img1.duplicate? img4
# => true
img1.duplicate? img3
# => true
img3.duplicate? img4
# => true

img4.fingerprint
# => 54086765383280

Looks like both logo3.png and even logo4.png were detected as duplicates. That is great! But how does that sorcery work? Phashion calculates perceptual hash of images, also known as fingerprints:

img4.fingerprint
# => 54086765383280

Behind the scenes, phashion uses open source pHash library to calculate the values. This hash is then stored in 64-bit integer. Numeric value does not matter and you cannot judge similarity of two images based on it.

What matters is distance between two values, but calculated as hamming distance instead.

Phashion.hamming_distance 54086765383280, 54076765383280
# => 13

This distance is defined as number of bits that differ between two hashes. In example above, the numeric value differs by 10000000000 but in binary, those two number differ only by 13 bits that have different value on the same positions. Phashion, by default, detects images as duplicates if hashes differ by 15 bits or less.

Okay, that’s great, but how exactly does it work? I recommend you looking into this easy to read overview and pHash documentation itself. As it turns out it’s not a rocket science!

Where to go from here?

Phash is very useful but comparing the image against large set of other images will be impractical. We can speed things up a bit by moving the hamming distance function into our database, and create SQL function that will calculate it for us. Even better approach is to use one of existing solutions, such as pg_similarity for PostgreSQL.

pHash approach might work for you just fine. But you might find out it’s limitations unacceptable in long run. Some major drawbacks are:

  • ease of collision - even very different images can generate close hashes. This is particularly true for images having patterns. Detected duplicates will not look visually similar to human eye at all, will have different colors and shapes on them, but the perceptual hashes will be close.
  • performance - even if you use pg_similarity or your own equivalent, you will be hit with performance problems once your database of images get large. You definitely can’t query millions of images, you should think in maximum of thousands instead.
  • difficulty of matching amended images - the algorithm is used in pHash work quite fine with slightly amended images. They can handle scaling and even squeezing without preserving ratio pretty well. However, they will fail pretty badly on any crop operations and quite likely on color manipulations.

In next parts of this series, we will explore a few alternatives that scale nicer and discuss their good and weak sides. But if you are curious, here are a few helpful links:

by Hubert Łępicki, twitter: @hubertlepicki

comments powered by Disqus