The OpenSanctions API supports matching of entities using a simple query-by-example mechanism. For transparency, the weighting of the features used by that API is listed below.
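As a rough illustration of what "query by example" means, the sketch below builds an example entity describing the person to look for. The payload layout (`queries`, `schema`, `properties`), the property names `name` and `birthDate`, and the helper `build_match_query` are assumptions made for this sketch, as is the placeholder person; check the API documentation for the actual request format.

```python
import json


def build_match_query(name, birth_date=None, schema="Person"):
    """Build a query-by-example payload (hypothetical helper).

    Assumption: the example entity is described by a schema name and a
    mapping of property names to lists of values.
    """
    properties = {"name": [name]}
    if birth_date is not None:
        properties["birthDate"] = [birth_date]
    # "q1" is an arbitrary key chosen here to identify this query.
    return {"queries": {"q1": {"schema": schema, "properties": properties}}}


# Placeholder example data, not a real record.
payload = build_match_query("Jane Doe", birth_date="1970-01-01")
print(json.dumps(payload, indent=2))
```

The sketch only constructs and prints the payload. Sending it to the API is left out on purpose, because the endpoint and authentication details are not given here.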
The API uses a simple entity comparison model based on logistic regression. Both the training data and the code are fully public, inviting public scrutiny and proposals for improvement.
Feature | Coefficient | Description |
---|---|---|
name_match | 1.151 | Check for exact name matches between the two entities. |
name_token_overlap | 0.028 | Evaluate the proportion of identical words in each name. |
name_numbers | -0.224 | Check whether the names contain numbers, and score the pair if those numbers differ. |
name_levenshtein | 0.542 | Consider the edit distance (as a fraction of name length) between the two most similar names linked to both entities. |
phone_match | 0.036 | Matching phone numbers between the two entities. |
email_match | 0.048 | Matching email addresses between the two entities. |
identifier_match | 0.212 | Matching identifiers (e.g. passports, national ID cards, registration or tax numbers) between the two entities. |
dob_matches | 1.221 | The birth date or incorporation date of the two entities is the same. |
dob_year_matches | 0.236 | The birth date or incorporation year of the two entities is the same. |
first_name_match | 0.041 | Matching first/given name between the two entities. |
family_name_match | 0.117 | Matching family name between the two entities. |
birth_place | -0.102 | Same place of birth. |
gender_mismatch | -0.214 | The two entities have different genders associated with them. |
country_mismatch | -0.296 | The two entities are linked to different countries. |
org_identifier_match | 0.560 | Matching identifiers (e.g. registration or tax numbers) between two organizations or companies. |
address_match | 0.888 | Text similarity between addresses. |
address_numbers | 0.099 | Check whether the addresses contain numbers, and score the pair if those numbers differ. |
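To make the role of the coefficients concrete, here is a minimal sketch of how a logistic regression model turns feature values into a match score. The coefficients are copied from the table above. Everything else is an assumption for illustration: the function name `match_score` is hypothetical, each feature is assumed to produce a value between 0 and 1, and the intercept is set to a placeholder of 0.0 because the table does not publish it.

```python
import math

# Coefficients copied from the table above.
COEFFICIENTS = {
    "name_match": 1.151,
    "name_token_overlap": 0.028,
    "name_numbers": -0.224,
    "name_levenshtein": 0.542,
    "phone_match": 0.036,
    "email_match": 0.048,
    "identifier_match": 0.212,
    "dob_matches": 1.221,
    "dob_year_matches": 0.236,
    "first_name_match": 0.041,
    "family_name_match": 0.117,
    "birth_place": -0.102,
    "gender_mismatch": -0.214,
    "country_mismatch": -0.296,
    "org_identifier_match": 0.560,
    "address_match": 0.888,
    "address_numbers": 0.099,
}

# Assumption: the intercept is not listed in the table; 0.0 is a placeholder.
INTERCEPT = 0.0


def match_score(features, intercept=INTERCEPT):
    """Return a score in (0, 1) for a dict of feature name -> value.

    Features that are not supplied are treated as 0.0.
    """
    unknown = set(features) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"Unknown features: {sorted(unknown)}")
    # Weighted sum of the feature values plus the intercept.
    z = intercept + sum(COEFFICIENTS[name] * value for name, value in features.items())
    # The logistic (sigmoid) function maps the sum onto a 0..1 score.
    return 1.0 / (1.0 + math.exp(-z))


same_name = match_score({"name_match": 1.0})
same_name_and_dob = match_score({"name_match": 1.0, "dob_matches": 1.0})
same_name_other_country = match_score({"name_match": 1.0, "country_mismatch": 1.0})
```

With these weights, adding a matching birth date raises the score of a name match, while a country mismatch lowers it. Because the intercept is a placeholder, the absolute scores from this sketch will not match those returned by the API.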
OpenSanctions is free for non-commercial users. Businesses must acquire a data license to use the dataset.