Improving the way we score matching results

Our updated API uses a statistical model to determine if your query matches one of the entities in the OpenSanctions database. As we do this, we put a premium on transparency and share both the training data and scoring code.

With the release of yente 1.4.0, our open source API server, we have significantly improved the precision of its results. The API not only powers the search on this website, but also lets advanced users perform KYC-style checks against the database.

To match two records, a good matcher does not just check whether both people or companies have the same name. Instead, it should apply various types of fuzzy name matching and consider details like birth dates, nationalities or stated addresses in the comparison.
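
As a rough illustration of that idea, the sketch below compares two records on more than exact name equality. It is a minimal example and not the code used in yente: the function names, the feature names and the choice of `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions made for this example, and the property names (`name`, `birthDate`, `nationality`) are only meant to resemble FollowTheMoney properties.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1], ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def overlap(left: list, right: list) -> float:
    """1.0 if both value lists share at least one value, otherwise 0.0."""
    return 1.0 if set(left) & set(right) else 0.0


def compare(left: dict, right: dict) -> dict:
    """Turn a pair of records into a small set of comparison features.

    Hypothetical helper: each record maps a property name to a list of values.
    """
    names = [
        name_similarity(a, b)
        for a in left.get("name", [])
        for b in right.get("name", [])
    ]
    return {
        # Best fuzzy score across all name combinations.
        "name_similarity": max(names, default=0.0),
        "birth_date_match": overlap(left.get("birthDate", []), right.get("birthDate", [])),
        "nationality_match": overlap(left.get("nationality", []), right.get("nationality", [])),
    }


if __name__ == "__main__":
    query = {"name": ["John Smith"], "birthDate": ["1970-01-01"]}
    candidate = {"name": ["Jon Smith"], "birthDate": ["1970-01-01"], "nationality": ["gb"]}
    print(compare(query, candidate))
```

Each feature is a number between 0 and 1, which makes it straightforward to combine them in a weighted score later on.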

Selecting and weighting these matching criteria, though, poses its own challenge. Thankfully, we have been generating relevant knowledge for the last seven months by manually de-duplicating all of the entities in our own database. The 160,000 entity merging judgements created as part of this effort allowed us to train a simple statistical model that replicates the decision patterns of a human analyst.
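
To make the training step more concrete, here is a small sketch of how labelled pairs can be turned into feature weights. It assumes a logistic regression fitted with plain gradient descent, which may differ from the model we actually use, and the handful of rows below are made-up toy values rather than real judgements. Each row holds two hypothetical features (name similarity, birth date match), and each label says whether an analyst judged the pair to be a match (1) or distinct (0).

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list, bias: float, x: list) -> float:
    """Probability-like score that a feature vector describes a match."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train(rows: list, labels: list, epochs: int = 2000, lr: float = 0.5):
    """Fit logistic regression weights with full-batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            for i, v in enumerate(x):
                grad_w[i] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Toy judgements (invented for illustration): [name_similarity, birth_date_match]
ROWS = [
    [0.95, 1.0], [0.88, 1.0], [0.99, 1.0],               # judged as matching
    [0.92, 0.0], [0.40, 0.0], [0.35, 1.0], [0.30, 0.0],  # judged as distinct
]
LABELS = [1, 1, 1, 0, 0, 0, 0]

weights, bias = train(ROWS, LABELS)

if __name__ == "__main__":
    print("weights:", weights, "bias:", bias)
    print("clear match:", predict(weights, bias, [0.95, 1.0]))
    print("clear non-match:", predict(weights, bias, [0.30, 0.0]))
```

The learned weights stand in for the analyst's implicit rules: a feature that often separates matches from non-matches in the judgements ends up with a larger weight.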

But as we introduce more advanced matching, we want to stick to our core values: simplicity and transparency. That's why we're publishing not just the inner workings of our model, but also the training data that we've used to build it:

  • Our scoring transparency page lists the feature coefficients of the regression model that scores results, giving you a sense of the relative weight that different features have in producing our API decisions.
  • The pairwise judgement data includes 160,000+ pairs of entities in FollowTheMoney format, each pair with an annotation to say if the two records are matching or distinct. Releasing this data is an invitation to the community: it can, of course, be used to train completely different models or matchers.
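
To show how published coefficients can be read, the sketch below applies a set of coefficients to one comparison and breaks the score down per feature. The coefficient values, the intercept and the feature names are hypothetical placeholders chosen for this example, not the figures from the scoring transparency page, and the logistic link is an assumption about the form of the regression.

```python
import math

# Hypothetical coefficients, for illustration only (not the published values).
COEFFICIENTS = {
    "name_similarity": 6.0,
    "birth_date_match": 3.0,
    "nationality_match": 1.0,
}
INTERCEPT = -6.0


def contributions(features: dict) -> dict:
    """Per-feature contribution to the score: coefficient times feature value."""
    return {name: coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()}


def score(features: dict) -> float:
    """Combine the contributions into a score between 0 and 1."""
    z = INTERCEPT + sum(contributions(features).values())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    strong = {"name_similarity": 1.0, "birth_date_match": 1.0, "nationality_match": 1.0}
    weak = {"name_similarity": 0.5}
    for label, features in (("strong", strong), ("weak", weak)):
        print(label, contributions(features), round(score(features), 3))
```

Reading the coefficients this way shows which features drive a decision: with these placeholder values, a perfect name match alone only offsets the intercept, and it takes a second agreeing detail to push the score up.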

This way, others can review our approach, suggest improvements and, most importantly, build their own, far smarter, technology based on the data and open source tools we provide. And, of course, they can use the OpenSanctions API to reliably do useful things.

This article is part of OpenSanctions, the open database of sanctions targets and persons of interest.