New scoring modes in the OpenSanctions API

You can now select from a range of different algorithms that score your results when you use the OpenSanctions API to screen a set of companies and people.

The OpenSanctions API - both the hosted service and the self-hosted option - lets you submit a set of entity descriptions (e.g. a list of customers, business counterparties, or subjects of an investigation) and check whether they appear on a sanctions list or a list of politically exposed persons (PEPs).

Until now, this matching API has used a simple statistical model to assign a match score to each result it has returned. With the new release of yente 3.4, we've made that mechanism more flexible: clients can now select one of a set of supported algorithms to optimise the behaviour of the API for their use case.

This release adds three new scoring systems alongside the existing model, which is now called regression-v1 and remains the default if no other algorithm is specified:

  • regression-v2 is a new statistical model for matching people and companies. Unlike regression-v1, it uses pronunciation-based (phonetic/Soundex) comparison for entity names, and it gives less weight to birth dates as a decision criterion. The new model will generally produce much lower scores for results, so you may want to reduce your matching threshold parameter in the API to 0.5 or 0.6.

  • name-based is a simple scoring mechanism based on name similarity only. It uses two criteria: the Jaro-Winkler string similarity measure and the Soundex phonetic algorithm. This is useful for matching data where you only have entity names, and no other details such as birth dates or nationalities.

  • name-qualified uses the score from the name-based mechanism but then considers other criteria, such as birth dates, nationalities, tax and registration identifiers. If any of these mismatch between the query and the result, the score is lowered. This attempts to anticipate a simple review process that a human analyst might otherwise undertake when a result is found.
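
As a rough illustration of selecting one of these algorithms, the sketch below builds (but does not send) a matching request using only the Python standard library. The base URL, the `/match/{dataset}` path, the `algorithm` and `threshold` query parameters, the `ApiKey` authorization scheme and the shape of the request body are all assumptions here, so please check them against the API documentation of your deployment before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: base URL of the hosted service; a self-hosted yente will differ.
API_BASE = "https://api.opensanctions.org"


def build_match_request(queries, algorithm, threshold=None,
                        dataset="default", api_key=None):
    """Build a POST request that asks for a specific scoring algorithm.

    The parameter names and body layout are assumptions, not a verified
    description of the API.
    """
    params = {"algorithm": algorithm}
    if threshold is not None:
        params["threshold"] = threshold
    url = f"{API_BASE}/match/{dataset}?{urlencode(params)}"
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the hosted service expects an "ApiKey" scheme.
        headers["Authorization"] = f"ApiKey {api_key}"
    body = json.dumps({"queries": queries}).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


# One example entity description; the key "q1" is an arbitrary label.
queries = {
    "q1": {
        "schema": "Person",
        "properties": {"name": ["Jane Doe"], "birthDate": ["1970-01-01"]},
    }
}

# regression-v2 tends to score lower, so a lower threshold is requested.
request = build_match_request(queries, algorithm="regression-v2", threshold=0.5)
print(request.full_url)
```

To actually run the query you would pass `request` to `urllib.request.urlopen` and decode the JSON response. Switching to `name-based` or `name-qualified` should then only require changing the `algorithm` argument.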

You can read more about these mechanisms and inspect their detailed scoring criteria. But what's even more exciting: by making the matching logic of yente a configurable component, we can now keep adding specialised scoring systems without breaking backward compatibility. And because it's open source, so can you.
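
To make the idea of a configurable scoring component more concrete, here is a minimal sketch of how pluggable scorers could be organised. It is not the yente implementation: the `ALGORITHMS` registry, the `register` decorator, the equal weighting of the two name criteria, the property names and the fixed penalty of 0.2 per mismatch are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List

Entity = Dict[str, List[str]]  # property name -> list of values
Scorer = Callable[[Entity, Entity], float]
ALGORITHMS: Dict[str, Scorer] = {}  # hypothetical registry, not yente's own


def register(name: str) -> Callable[[Scorer], Scorer]:
    """Register a scoring function under an algorithm name."""
    def wrap(func: Scorer) -> Scorer:
        ALGORITHMS[name] = func
        return func
    return wrap


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Jaro-Winkler similarity in [0, 1]; 1.0 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    used = [False] * len(b)
    matched_a = []
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used[j] and b[j] == ch:
                used[j] = True
                matched_a.append(ch)
                break
    if not matched_a:
        return 0.0
    matched_b = [b[j] for j in range(len(b)) if used[j]]
    m = len(matched_a)
    transpositions = sum(x != y for x, y in zip(matched_a, matched_b)) / 2
    jaro = (m / len(a) + m / len(b) + (m - transpositions) / m) / 3
    prefix = 0
    for x, y in zip(a[:4], b[:4]):  # shared prefix, at most 4 characters
        if x != y:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)


_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {c: digit for digit, letters in _GROUPS.items() for c in letters}


def soundex(word: str) -> str:
    """Classic four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (out + "000")[:4]


def name_score(left: str, right: str) -> float:
    """Average of string similarity and phonetic overlap (assumed weights)."""
    l, r = left.lower(), right.lower()
    phon_l = {soundex(t) for t in l.split()}
    phon_r = {soundex(t) for t in r.split()}
    phonetic = len(phon_l & phon_r) / max(len(phon_l | phon_r), 1)
    return 0.5 * jaro_winkler(l, r) + 0.5 * phonetic


@register("name-based")
def name_based(query: Entity, result: Entity) -> float:
    pairs = [(q, r) for q in query.get("name", []) for r in result.get("name", [])]
    return max((name_score(q, r) for q, r in pairs), default=0.0)


QUALIFIERS = ("birthDate", "nationality", "taxNumber", "registrationNumber")


@register("name-qualified")
def name_qualified(query: Entity, result: Entity, penalty: float = 0.2) -> float:
    score = name_based(query, result)
    for prop in QUALIFIERS:
        q, r = set(query.get(prop, [])), set(result.get(prop, []))
        if q and r and not (q & r):  # both sides have values, none in common
            score -= penalty
    return max(score, 0.0)


query = {"name": ["Jon Smith"], "birthDate": ["1980-05-01"]}
result = {"name": ["John Smith"], "birthDate": ["1975-11-23"]}
for name, scorer in ALGORITHMS.items():
    print(name, round(scorer(query, result), 3))
```

In this toy setup the two names are phonetically identical, so the name-based scorer rates the pair highly, while the name-qualified scorer lowers the score because the birth dates disagree. Adding another algorithm is then a matter of registering one more function under a new name, which leaves the existing ones untouched.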

In the future, we can add algorithms that introduce a more human-like understanding of names, or use name frequencies to predict the likelihood of a certain name being unique. (Inserting a sentence here about the future application of OpenAI's GPT will be left as an exercise to the inclined reader.)

We're keen to hear any feedback on this change, and on what our next steps with customised scoring should be!


This article is part of OpenSanctions, the open database of sanctions targets and persons of interest.