Creating and evaluating a distance measure
Whether we're talking about spam emails, merging records, or investigating clusters, a measure of how alike things are often makes them easier to work with, especially when the data is unstructured and isn't incorporated into your data models. Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value.
Talk Title | Creating and evaluating a distance measure |
Speakers | Melissa Santos |
Conference | Strata + Hadoop World |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 27-29, 2016 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Whether we’re talking about spam emails, merging records, or investigating clusters, a measure of how alike things are often makes them easier to work with. You may have unstructured or vague data that isn’t incorporated into your data models (e.g., information from subject-matter experts who have a sense of whether something is good or bad, similar or different). Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value, giving you the tools to turn that expert information into numbers you can compare and use to quickly see structures in the data. Melissa walks you through setting expectations for a distance, creating distance metrics, iterating with experts to check expectations, validating the distance on a large chunk of the dataset, and then circling back to add more complexity. She also shares real-world examples, such as measuring how far an email is from a domain’s usual emails, computing quality scores for geographic data, and merging person records if they are sufficiently similar. Topics include: