December 4, 2019


Using Siamese CNNs for removing duplicate entries from real estate listing databases


Aggregating geospecific real estate databases produces duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach for identifying these duplicates by analyzing the images that accompany real estate listings. The approach uses a transfer learning Siamese architecture based on the VGG-16 CNN topology.

Talk Title Using Siamese CNNs for removing duplicate entries from real estate listing databases
Speakers Sergey Ermolin (Intel), Olga Ermolin (MLS Listings)
Conference Strata Data Conference
Conf Tag Making Data Work
Location London, United Kingdom
Date May 22-24, 2018
URL Talk Page
Slides Talk Slides
Video

Real estate databases are geospecific (e.g., East Bay, North Bay, South Bay). If a house to be put up for sale is located close to a geoboundary, the listing agent will often list it in both adjacent databases. For example, a house located in Milpitas, CA, would often be listed in both the East Bay and South Bay databases, although the content of the two entries may differ to appeal to the different demographics of each area. Real estate brokerage firms enter into cross-area sharing agreements, and there are efforts underway to create a nationwide sharing framework as well. Herein lies the problem: when data feeds from the East Bay and South Bay databases are aggregated, the result contains duplicate listings.

Sergey Ermolin and Olga Ermolin detail an approach for identifying duplicate entries by analyzing the images that accompany real estate listings. The approach uses a transfer learning Siamese architecture based on the VGG-16 CNN topology, implemented in TensorFlow 1.2. The curated dataset, provided by MLS Listings, Inc., consists of over 3,000 images of the fronts of houses. Its entries (sets of JPEG images) include some that are known a priori to belong to duplicate real estate listings and others that are known to be distinct.

Before embarking on building a convolutional neural network, Sergey, Olga, and her team attempted a brute-force approach using a 1-nearest neighbor algorithm to establish a baseline for the accuracy and precision of the prediction. Sergey and Olga explain why the brute-force nearest-neighbor approach was inadequate and how the CNN Siamese network was able to achieve an accuracy of 69% and a precision of 92%. To demonstrate that the implementation scales well as the dataset grows, they also describe an implementation of the same CNN Siamese network in Spark’s BigDL framework and compare the results with those of TensorFlow.
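As a rough illustration of the kind of 1-nearest neighbor baseline and accuracy/precision evaluation mentioned above, the sketch below flags a query image as a duplicate when its closest gallery image lies within a distance threshold. This is not the speakers' code: the function names, the Euclidean distance on raw feature vectors, the threshold value, and the four-element toy vectors standing in for images are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, gallery):
    """Return (index, distance) of the gallery vector closest to the query."""
    best_idx, best_dist = -1, float("inf")
    for i, vec in enumerate(gallery):
        d = euclidean(query, vec)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist

def predict_duplicates(queries, gallery, threshold):
    """Flag a query as a duplicate if its nearest neighbor is within threshold."""
    return [nearest_neighbor(q, gallery)[1] <= threshold for q in queries]

def accuracy_and_precision(predicted, actual):
    """Accuracy = (TP + TN) / total; precision = TP / (TP + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Hypothetical toy data: each vector stands in for a flattened house image.
gallery = [[0, 0, 0, 0], [10, 10, 10, 10]]
queries = [
    [0, 1, 0, 0],      # near-copy of gallery[0]
    [10, 10, 9, 10],   # near-copy of gallery[1]
    [5, 5, 5, 5],      # a different house
    [0, 3, 0, 0],      # same house as gallery[0], but shot differently
]
actual = [True, True, False, True]  # assumed ground-truth duplicate labels

predicted = predict_duplicates(queries, gallery, threshold=1.5)
accuracy, precision = accuracy_and_precision(predicted, actual)
print(predicted, accuracy, precision)
```

On this toy data the baseline never flags a distinct house as a duplicate, so precision is high, but it misses the duplicate whose image differs more from the original, which lowers accuracy. That is one way a detector can show a gap between the two metrics; the numbers here are illustrative only and say nothing about the results reported in the talk.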
