Our updated API uses a statistical model to determine whether your query matches one of the entities in the OpenSanctions database. In doing so, we put a premium on transparency and share both the training data and the scoring code.
[Image: Getting into the fine tuning.]
With the release of version 1.4.0 of yente, our open-source API server, we have significantly improved the precision of its results. The API not only powers the search on this website, but also enables advanced users to perform KYC-style checks against the database.
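To give a feel for what such a KYC-style check might look like, here is a minimal sketch that only builds a request and does not send it. It assumes the matching endpoint lives at `/match/default` on `api.opensanctions.org` and accepts a JSON body of named queries with a `schema` and `properties`; the helper name `build_match_request` and the query key `q1` are made up for illustration, so please check the API documentation before relying on any of this.

```python
import json
from urllib.request import Request

# Assumption: the public matching endpoint and dataset scope. Adjust this
# if you run your own yente instance or want a different dataset.
API_URL = "https://api.opensanctions.org/match/default"


def build_match_request(name, birth_date=None, api_url=API_URL):
    """Build (but do not send) a POST request describing one person to screen."""
    properties = {"name": [name]}
    if birth_date:
        properties["birthDate"] = [birth_date]
    # "q1" is an arbitrary key chosen by the caller to identify the query.
    payload = {"queries": {"q1": {"schema": "Person", "properties": properties}}}
    return Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_match_request("Jane Doe", birth_date="1970-01-01")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the resulting object to `urllib.request.urlopen` would send the query; depending on how the service is deployed, you may also need to supply credentials.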
To match two records, a good matcher does not just check whether both people or companies have the same name. Instead, we want to apply various types of fuzzy name matching, and to consider details like birth dates, nationalities or stated addresses in the comparison.
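The idea of comparing records on several criteria at once can be sketched in a few lines of Python. This is an illustration only, not the code used in yente: the field names, the use of `difflib` for fuzzy name comparison and the neutral value of 0.5 for missing data are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def name_similarity(left, right):
    """Fuzzy similarity between 0 and 1; 1.0 means equal after normalising."""
    return SequenceMatcher(None, normalize(left), normalize(right)).ratio()


def agreement(left, right):
    """1.0 if any value is shared, 0.0 on conflict, 0.5 if a side is missing."""
    if not left or not right:
        return 0.5
    return 1.0 if set(left) & set(right) else 0.0


def compare(query, candidate):
    """Describe how well two records agree, one number per criterion."""
    best_name = max(
        (
            name_similarity(a, b)
            for a in query.get("name", [])
            for b in candidate.get("name", [])
        ),
        default=0.0,
    )
    return {
        "name": best_name,
        "birth_date": agreement(
            query.get("birth_date", []), candidate.get("birth_date", [])
        ),
        "nationality": agreement(
            query.get("nationality", []), candidate.get("nationality", [])
        ),
    }


query = {"name": ["Jane Doe"], "birth_date": ["1970-01-01"]}
candidate = {
    "name": ["DOE Jane", "Jane  doe"],
    "birth_date": ["1970-01-01"],
    "nationality": ["gb"],
}
print(compare(query, candidate))
```

Each record pair is reduced to a handful of numbers. How much each of those numbers should count towards the final decision is the weighting problem.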
Selecting and weighting these matching criteria, though, poses its own challenge. Thankfully, we have been generating relevant knowledge for the last seven months by manually de-duplicating all of the entities in our own database. The 160,000 entity-merging judgements created as part of this effort allowed us to train a simple statistical model that replicates the decision patterns of a human analyst.
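To illustrate how judgements can be turned into weights, the sketch below fits a tiny logistic regression with plain gradient descent. The choice of logistic regression, the two features and the eight hand-written training pairs are assumptions for this example; the real model is trained on the published judgements and may differ.

```python
import math


def sigmoid(value):
    """Numerically stable logistic function."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    expo = math.exp(value)
    return expo / (1.0 + expo)


def predict(weights, bias, features):
    """Estimated probability that two records describe the same entity."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(samples, labels, epochs=2000, rate=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict(weights, bias, features) - label
            weights = [w - rate * error * x for w, x in zip(weights, features)]
            bias -= rate * error
    return weights, bias


# Made-up judgements: [name similarity, birth date agreement] per pair,
# labelled 1 if the analyst merged the pair and 0 if they kept it apart.
samples = [
    [0.95, 1.0], [0.90, 0.5], [1.00, 1.0], [0.85, 1.0],
    [0.92, 0.0], [0.40, 1.0], [0.30, 0.5], [0.35, 0.0],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train(samples, labels)
print("weights:", weights, "bias:", bias)
print("strong pair:", predict(weights, bias, [1.0, 1.0]))
print("weak pair:", predict(weights, bias, [0.3, 0.0]))
```

The learned weights are directly readable, which is part of the appeal of a simple model: anyone can see how much a name match counts relative to a matching birth date.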
But as we introduce more advanced matching, we want to stick to our core values: simplicity and transparency. That's why we're publishing not just the inner workings of our model, but also the training data that we've used to build it.
This way, others can review our work, suggest improvements and, most importantly, build their own (far smarter) technology based on the data and open-source tools we provide. And, of course, they can use the OpenSanctions API to reliably do useful things.
This article is part of OpenSanctions, the open database of sanctions targets and persons of interest.