New directions in record linkage
The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, computing capabilities have changed considerably and new techniques have emerged, and the Census Bureau is now reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.
Talk Title | New directions in record linkage |
Speakers | Yves Thibaudeau (US Census Bureau) |
Conference | Strata Data Conference |
Conf Tag | Big Data Expo |
Location | San Francisco, California |
Date | March 26-28, 2019 |
The US Census Bureau has been involved in record linkage (a.k.a. entity resolution) projects for over 40 years. In that time, computing capabilities have changed considerably and new techniques have emerged, including important developments in machine learning algorithms and data science that support and improve record linkage processes. The Census Bureau is reviewing an inventory of linkage methodologies, including multiple homegrown methods and software packages, as it embarks on ever more challenging record linkage projects.
Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications and offers an overview of solutions, such as the homegrown linkage software BigMatch, which implements multikey quicksorting of character strings and is believed to be among the fastest linkage software available. (BigMatch is written in C, a low-level programming language, and is expected to be very efficient because it involves minimal compilation and translation overhead.) Other packages under review include other Census Bureau software written in SAS and C, as well as the Python Record Linkage Toolkit.
Yves details the strengths and weaknesses of these packages and identifies which are most effective in the context of the Census Bureau's many record linkage applications and its mission. Along the way, he covers issues of speed and functionality in various modern environments (linking of business lists, census rosters, etc.), as well as the difficult issue of specifying and estimating error bounds for the linked records and missed links.
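To make the multikey quicksort mentioned above more concrete, here is a minimal Python sketch of the general technique (three-way radix quicksort, as described by Bentley and Sedgewick). It is not BigMatch's code, which is written in C and is not shown in this abstract; the function name `multikey_quicksort`, the helper `char_at`, and the sample names are assumptions made purely for illustration. The idea is to partition the strings on one character position at a time, so that strings sharing a prefix are never compared on that prefix again.

```python
def multikey_quicksort(strings):
    """Return a sorted copy of `strings` using three-way radix (multikey) quicksort.

    Illustrative sketch only; not the BigMatch implementation.
    """
    items = list(strings)
    # Explicit stack of (lo, hi, depth) ranges avoids Python's recursion limit.
    stack = [(0, len(items) - 1, 0)]

    def char_at(s, depth):
        # -1 marks end-of-string, so a string sorts before its own extensions.
        return ord(s[depth]) if depth < len(s) else -1

    while stack:
        lo, hi, depth = stack.pop()
        if hi <= lo:
            continue
        pivot = char_at(items[lo], depth)
        lt, gt, i = lo, hi, lo + 1
        # Three-way partition on the character at position `depth`.
        while i <= gt:
            c = char_at(items[i], depth)
            if c < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif c > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        # items[lo:lt] have a smaller character, items[gt+1:hi+1] a larger one.
        stack.append((lo, lt - 1, depth))
        stack.append((gt + 1, hi, depth))
        # The middle group shares this character; move on to the next position,
        # unless the pivot was end-of-string (those strings are identical here).
        if pivot >= 0:
            stack.append((lt, gt, depth + 1))
    return items


# Hypothetical linkage keys: sorting places identical and similar keys side by side.
names = ["smith, john", "smyth, jon", "smith, jane", "jones, mary", "smith, john"]
print(multikey_quicksort(names))
```

Sorting matters for linkage because once the keys are ordered, records with identical or similar keys sit next to each other, so candidate pairs can be found in a single pass rather than by comparing every record with every other record. A compiled C implementation would be expected to run far faster than this sketch.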