The request

A deep dive into every aspect of a /match request.

The /match endpoint uses query-by-example: you describe an entity in as much detail as you can, and the API returns ranked candidates from the database. You can experiment with the /match endpoint via the advanced screening search.

An example request

import os
import requests

API_KEY = os.environ.get("OPENSANCTIONS_API_KEY")
BASE_URL = "https://api.opensanctions.org"
API_ENDPOINT = "match"
DATASET = "default"           # the dataset to look for matches in
PARAMS = {
    "topics": [               # only consider entities that have at least one of these topics
        "sanction",
        "sanction.linked",
        "debarment",
    ],
    "include_dataset": [      # only consider entities that are in at least one of these datasets
        "us_ofac_sdn",
        "us_ofac_cons",
    ],
    "algorithm": "logic-v2",  # the current default via `best`; you may want to pin it
    "threshold": 0.8,         # higher than the default (0.7); may lead to fewer results
}
PERSON_QUERY = {              # a description of a person
    "schema": "Person",       # the FtM schema
    "properties": {           # the relevant properties
        "firstName": ["Arkadii"],
        "fatherName": ["Romanovich"],
        "lastName": ["Rotenberg", "Ротенберг"],
        "birthDate": ["1951"],
    },
}
COMPANY_QUERY = {             # a description of a company
    "schema": "Company",      # the FtM schema
    "properties": {           # the relevant properties
        "name": ["Stroygazmontazh"],
        "jurisdiction": ["Russia"],
    },
}

session = requests.Session()
session.headers["Authorization"] = f"ApiKey {API_KEY}"

person_response = session.post(
    f"{BASE_URL}/{API_ENDPOINT}/{DATASET}",
    json={"queries": {"q": PERSON_QUERY}},
    params=PARAMS,
)

company_response = session.post(
    f"{BASE_URL}/{API_ENDPOINT}/{DATASET}",
    json={"queries": {"q": COMPANY_QUERY}},
    params=PARAMS,
)
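
The calls above return JSON. As a minimal sketch of reading it, the helper below lists the ID, caption and score of each candidate. The response shape used here (a `responses` object keyed by the query name, each holding a `results` list whose items carry `id`, `caption` and `score`) is an assumption that this page does not spell out, and `summarise_matches` and the sample payload are hypothetical.

```python
from typing import Any


def summarise_matches(payload: dict[str, Any], key: str = "q") -> list[tuple[str, str, float]]:
    """Return (id, caption, score) for every candidate under one query key."""
    # ASSUMPTION: the body looks like {"responses": {key: {"results": [...]}}}.
    results = payload.get("responses", {}).get(key, {}).get("results", [])
    return [
        (r.get("id", ""), r.get("caption", ""), float(r.get("score", 0.0)))
        for r in results
    ]


# Hypothetical payload for illustration only; this is not real API output.
sample = {
    "responses": {
        "q": {"results": [{"id": "example-id", "caption": "Example Person", "score": 0.93}]}
    }
}
print(summarise_matches(sample))
```

With the example request, this could be called as `summarise_matches(person_response.json())`, ideally after `person_response.raise_for_status()`.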

Each element of the request is covered below.

Scoping

A match query is processed in two stages. First, a search index locates possible candidates; this stage optimises for recall, i.e. it aims to find a broad selection of candidates. Second, these candidates are scored against the query provided by the API consumer.

You can narrow the candidate pool in the first stage by scoping the query, as described below. You can also tune the matching algorithm used in the second stage.
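
To make the two stages concrete, here is a toy sketch of the same idea on an in-memory list. It is not how the API is implemented: the token-overlap lookup, the `difflib` similarity, the entity names and the function names are all stand-ins chosen for illustration, and the scores it produces are not comparable to API scores.

```python
from difflib import SequenceMatcher

# Made-up entity names standing in for the database.
ENTITIES = ["Acme Trading LLC", "Acme Holdings", "Borealis Shipping", "Zenith Bank"]


def find_candidates(query: str, entities: list[str]) -> list[str]:
    """Stage 1: cheap, recall-oriented lookup (any shared token qualifies)."""
    tokens = set(query.lower().split())
    return [e for e in entities if tokens & set(e.lower().split())]


def score_candidates(query: str, candidates: list[str], threshold: float, limit: int):
    """Stage 2: score each candidate against the query and keep the best."""
    scored = [
        (SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    scored.sort(reverse=True)
    return [(c, round(s, 2)) for s, c in scored if s >= threshold][:limit]


candidates = find_candidates("Acme Trading", ENTITIES)
matches = score_candidates("Acme Trading", candidates, threshold=0.8, limit=5)
print(candidates)  # broad: both "Acme" entities are retrieved
print(matches)     # narrow: only the close match clears the toy threshold
```

The point of the split is that the first stage can afford to be generous, because the second stage is where weak candidates are discarded.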

Results are always drawn from the default dataset, which contains all OpenSanctions data, deduplicated and enriched. Scoping narrows which entities are considered as candidates, not what data is returned about matched entities.

The dataset

Every request targets a dataset — a named scope specified in the URL path:

/match/default          # recommended; the combined collection of all datasets
/match/us_ofac_sdn      # only consider candidates that appear in this dataset
/match/eu_sanctions     # only consider candidates that appear in this collection
/match/maritime

The /match endpoint requires a dataset name.
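
A small helper can keep the dataset name out of hand-built strings. This is only a sketch: `match_url` is a hypothetical name, and the base URL is the one used in the example request.

```python
from urllib.parse import quote

BASE_URL = "https://api.opensanctions.org"


def match_url(dataset: str = "default") -> str:
    """Build the /match URL for a dataset scope."""
    if not dataset:
        # The endpoint requires a dataset name in the path.
        raise ValueError("a dataset name is required")
    return f"{BASE_URL}/match/{quote(dataset, safe='')}"


print(match_url())               # https://api.opensanctions.org/match/default
print(match_url("us_ofac_sdn"))  # https://api.opensanctions.org/match/us_ofac_sdn
```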

In the advanced screening search, this is represented by the 'Dataset scope' dropdown.

Some datasets are collections that combine different data sources with similar meaning. The whole database is contained in the default collection. Collections are listed here.

Entities added during enrichment — such as the relatives and close associates (RCAs) of PEPs — are often only found in the default collection. Targeting a narrower dataset scope would cause these entities to be missed as candidates.

To include enriched entities, use the default collection and filter by topic instead.

Filtering

Regardless of which dataset you query, you may not want to screen against every entity it contains.

Filters can be added as query parameters directly to the URL, or passed via the params keyword argument of requests, as in the example request.
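
The two forms are equivalent because a list value is sent as a repeated query parameter. The standard-library sketch below shows the query string that results from a params dictionary; requests encodes list values in the same repeated-key style. The particular filter values are just examples.

```python
from urllib.parse import urlencode

params = {
    "topics": ["sanction", "sanction.linked"],
    "include_dataset": ["us_ofac_sdn"],
    "threshold": 0.8,
}

# doseq=True expands each list into one key=value pair per element.
query_string = urlencode(params, doseq=True)
print(f"/match/default?{query_string}")
# → /match/default?topics=sanction&topics=sanction.linked&include_dataset=us_ofac_sdn&threshold=0.8
```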

Filters are joined with a logical OR (union) within the same parameter, and with AND (intersection) across different parameters.

For example, "topics": ["role.pep", "role.rca"] selects all PEPs and all RCAs, and "include_dataset": ["ca_commons", "ca_senate"] selects all entities in either of those datasets. If you pass both parameters, only PEPs and RCAs that appear in at least one of those two datasets are considered.
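
The OR-within, AND-across rule can be sketched as a predicate over a single entity. This is an illustration of the semantics described above, not the API's own code; `is_candidate` and the two sample entities are hypothetical.

```python
def is_candidate(entity: dict, topics=None, include_dataset=None) -> bool:
    """Apply OR within each parameter and AND across parameters."""
    checks = []
    if topics:
        # OR: sharing any one of the requested topics is enough.
        checks.append(bool(set(entity["topics"]) & set(topics)))
    if include_dataset:
        # OR: appearing in any one of the requested datasets is enough.
        checks.append(bool(set(entity["datasets"]) & set(include_dataset)))
    # AND: every filter that was supplied must pass.
    return all(checks)


# Hypothetical entities, for illustration only.
senator = {"topics": ["role.pep"], "datasets": ["ca_senate"]}
relative = {"topics": ["role.rca"], "datasets": ["some_other_dataset"]}

filters = {
    "topics": ["role.pep", "role.rca"],
    "include_dataset": ["ca_commons", "ca_senate"],
}
print(is_candidate(senator, **filters))   # True: topic and dataset both match
print(is_candidate(relative, **filters))  # False: topic matches, dataset does not
```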

Filtering on topics

Use the topics parameter to restrict results to entities with specific risk tags. Here, entities that have any of the given topics will be considered:

/match/default?topics=sanction&topics=sanction.linked&topics=debarment&topics=role.pep&topics=role.rca

Sanctions data sources sometimes list secondary entities which are not sanctioned. Hence, an entity can have a sanctions dataset as a source without being tagged with the sanction topic. This is another reason to use topic filters rather than relying purely on a dataset.

Filtering on datasets

Use include_dataset to pick a custom set of sources. Here, only entities that appear in at least one of us_ofac_sdn or us_ofac_cons will be considered as possible candidates:

/match/default?include_dataset=us_ofac_sdn&include_dataset=us_ofac_cons

If you're using the include_dataset filter, either match against default or double-check that the datasets you've listed are indeed part of the more specific dataset named in the URL path.

Use exclude_dataset to exclude sources that don't have regulatory relevance or produce false positives for you. Here, entities that appear only in iq_aml_list will be excluded from consideration:

/match/default?exclude_dataset=iq_aml_list
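
One way to read "appear only in": an entity is dropped when every dataset it comes from is excluded, and kept when at least one of its sources is not. The sketch below encodes that reading; it is an interpretation of the sentence above rather than a statement about the API's internals, and `survives_exclusion` and the sample dataset sets are hypothetical.

```python
def survives_exclusion(entity_datasets: set[str], exclude_dataset: set[str]) -> bool:
    """Keep the entity if at least one of its source datasets is not excluded."""
    return bool(entity_datasets - exclude_dataset)


excluded = {"iq_aml_list"}
print(survives_exclusion({"iq_aml_list"}, excluded))                 # False
print(survives_exclusion({"iq_aml_list", "us_ofac_sdn"}, excluded))  # True
```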

The query

The query describes an entity you wish to screen in a way that conforms with the FollowTheMoney (FtM) format. This is the 'example' in 'query-by-example'.

...
query = {
    "schema": "Person",
    "properties": {
        "firstName": ["Arkadii"],
        "fatherName": ["Romanovich"],
        "lastName": ["Rotenberg", "Ротенберг"],
        "birthDate": ["1951"],
        "nationality": ["Russia"],
    },
}

...

To start building a screening process that uses the /match API, write a piece of code that formats each of your counterparties (customers, suppliers, etc.) according to this format.
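
Such a formatter might look like the sketch below. The source field names (`first_name`, `last_name`, `date_of_birth`, `country`) are assumptions about what a customer record could contain, and `person_to_query` is a hypothetical helper; adapt the mapping to your own data model.

```python
# Maps hypothetical internal field names to FtM Person properties.
PERSON_FIELD_MAP = {
    "first_name": "firstName",
    "last_name": "lastName",
    "date_of_birth": "birthDate",
    "country": "nationality",
}


def person_to_query(record: dict) -> dict:
    """Convert one customer record into a /match query for a Person."""
    properties = {}
    for source_field, ftm_property in PERSON_FIELD_MAP.items():
        value = record.get(source_field)
        if value:  # skip missing or empty fields
            properties[ftm_property] = [str(value)]
    return {"schema": "Person", "properties": properties}


customer = {"first_name": "Arkadii", "last_name": "Rotenberg", "date_of_birth": "1951", "country": None}
body = {"queries": {"q": person_to_query(customer)}}
print(body)
```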

The following sections describe each part of the query.

The schema

The schema tells the API which type of entity you are looking for. It must be a valid matchable FtM schema name, and is required.

In the advanced screening search, this is represented by the 'Entity type' dropdown. Notice how different properties become available depending on which schema you choose.

Using the correct schema improves match quality, because the matching algorithm activates different features for different schemata. For example, the IMO or MMSI identifier is only considered an important point of comparison for the Vessel schema.

Use the most specific schema that you can; only use a generic schema like LegalEntity when you don't know which more specific schema applies. In that case, entities of any of its subschemata are considered as candidates too, but any properties that belong only to a given subschema will be ignored.

For example, if you only have the registration number for a legal entity, but you're not sure if it's a Company or an Organization, use LegalEntity — if a match is found, you can check its schema. But do not use LegalEntity as a handy catch-all that you never update regardless of what information you have — a birthDate, which is only a property on the Person schema, would be ignored. Similarly, a supplied imoNumber would be ignored if you used the more generic Vehicle schema for a ship rather than Vessel.
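
In code, this advice amounts to picking the schema from the information you actually hold and falling back to LegalEntity last. The sketch below assumes hypothetical record fields (`imo_number`, `date_of_birth`, `is_company`); only the schema names come from this page.

```python
def choose_schema(record: dict) -> str:
    """Pick the most specific schema the available data supports."""
    if record.get("imo_number"):
        return "Vessel"       # imoNumber is only used for Vessel
    if record.get("date_of_birth"):
        return "Person"       # birthDate is only a Person property
    if record.get("is_company"):
        return "Company"
    return "LegalEntity"      # generic fallback when the type is unknown


print(choose_schema({"date_of_birth": "1951"}))   # Person
print(choose_schema({"registration": "12345"}))  # LegalEntity
```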

The properties

Properties describe the entity you are screening, in as much detail as possible. They are also required.

Provide all values as lists of strings, even when there's only one value.
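
A small normaliser can enforce this before a query is sent. This is a sketch with a hypothetical helper name; it wraps single values in a list, converts everything to strings, and drops empty values.

```python
def normalise_properties(raw: dict) -> dict[str, list[str]]:
    """Coerce every property value into a non-empty list of strings."""
    properties = {}
    for name, value in raw.items():
        if value is None:
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        cleaned = [str(v).strip() for v in values if v is not None and str(v).strip()]
        if cleaned:
            properties[name] = cleaned
    return properties


raw = {"name": "Stroygazmontazh", "birthDate": 1951, "alias": None, "lastName": ["Rotenberg", "Ротенберг"]}
print(normalise_properties(raw))
```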

Consult the schema reference to see which properties are available per schema: if a property you're supplying belongs to a more specific schema than the one you're using, it will be ignored. To debug whether the properties you've supplied have been interpreted as expected, check the parsed query in the response.
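
That check can be automated by comparing what you sent with what came back. The sketch assumes the parsed query is returned under `responses.<key>.query.properties`; that path is an assumption not stated on this page, and `ignored_properties` and the sample payload are hypothetical.

```python
def ignored_properties(sent_query: dict, payload: dict, key: str = "q") -> set[str]:
    """Return the names of properties that were sent but not echoed back."""
    # ASSUMPTION: the parsed query lives at responses[key]["query"]["properties"].
    parsed = (
        payload.get("responses", {}).get(key, {}).get("query", {}).get("properties", {})
    )
    return set(sent_query.get("properties", {})) - set(parsed)


sent = {
    "schema": "LegalEntity",
    "properties": {"name": ["Example Name"], "birthDate": ["1951"]},
}
# Hypothetical payload in which birthDate was dropped; not real API output.
payload = {
    "responses": {
        "q": {"query": {"schema": "LegalEntity", "properties": {"name": ["Example Name"]}}}
    }
}
print(ignored_properties(sent, payload))  # {'birthDate'}
```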

The API internally uses standardized formats for country codes, dates, phone numbers, etc., but you can just supply a country name and the API will attempt to identify the correct country code (ru in our example) for the entity.

Don't worry too much about whether a country name should live in the country or jurisdiction property: the matching happens by FtM data type (in this case: country), not precise field name. The same applies to generic identifiers like registrationNumber, idNumber, taxNumber, or vatCode. However, if a format-specific property exists for your identifier (e.g. leiCode), use that instead.

Guidance per schema:

Person

  • Prefer the more specific firstName, middleName, lastName properties over a single name.
  • Supply a birthDate, even if you only have the year or the year and month.
  • Include any identifier-type fields: idNumber, taxNumber, etc.
  • Any country-type property (nationality, citizenship, country) helps the algorithm penalise mismatches.

Company

  • Pass registrationNumber alongside jurisdiction or country.
  • Format-specific identifiers are evaluated with higher weight than the generic registrationNumber: if available, use leiCode, innCode, ogrnCode, etc.
  • Include name and alias.

Vessel

  • imoNumber is a globally unique identifier and the strongest matching signal for vessels.
  • mmsi is also evaluated.
  • Include flag or country.
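
The guidance above can be turned into a pre-flight check that flags which recommended properties a query lacks. The grouping below is one reading of these lists, with alternatives placed in the same group; `RECOMMENDED` and `missing_recommended` are hypothetical names, not part of the API.

```python
# Each tuple is a group of alternatives; supplying any one of them is enough.
RECOMMENDED = {
    "Person": [
        ("firstName",), ("lastName",), ("birthDate",),
        ("idNumber", "taxNumber"),
        ("nationality", "citizenship", "country"),
    ],
    "Company": [
        ("name",),
        ("registrationNumber", "leiCode", "innCode", "ogrnCode"),
        ("jurisdiction", "country"),
    ],
    "Vessel": [("imoNumber",), ("mmsi",), ("flag", "country")],
}


def missing_recommended(query: dict) -> list[str]:
    """List the recommended property groups that the query does not cover."""
    supplied = set(query.get("properties", {}))
    groups = RECOMMENDED.get(query.get("schema", ""), [])
    return [" or ".join(group) for group in groups if not supplied & set(group)]


company_query = {
    "schema": "Company",
    "properties": {"name": ["Stroygazmontazh"], "jurisdiction": ["Russia"]},
}
print(missing_recommended(company_query))
# → ['registrationNumber or leiCode or innCode or ogrnCode']
```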

Tuning the matching algorithm

When you choose a dataset and provide filters, the API uses those to locate possible candidate results. Once it finds a broad selection of result candidates, it evaluates them against the provided query.

Our matching defaults include:

  • algorithm=best (currently logic-v2): the matching algorithm used for evaluating your query against entities in our database.
    • Each algorithm has its own default weights per feature, which can be tuned via the weights parameter.
  • threshold=0.7: the threshold at or above which a result candidate's score is considered a match. The default algorithm is calibrated to be used with this value. For sanctions screening with low tolerance for false positives, this could be raised to 0.8 or even 0.85.
  • limit=5: the maximum number of results returned per query. This is a sensible default for most cases; increasing it will make queries slower.
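
These defaults can be kept in one place and overridden per use case. The helper below is a sketch: `tuning_params` is a hypothetical name, and the 0 to 1 range check on the threshold is an assumption based on the values quoted above.

```python
def tuning_params(algorithm: str = "best", threshold: float = 0.7, limit: int = 5) -> dict:
    """Build the tuning part of the query parameters, starting from the defaults."""
    # ASSUMPTION: scores, and therefore thresholds, lie between 0 and 1.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is assumed to lie between 0 and 1")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {"algorithm": algorithm, "threshold": threshold, "limit": limit}


defaults = tuning_params()
# Pin the algorithm and raise the threshold for a stricter screening setup.
strict = tuning_params(algorithm="logic-v2", threshold=0.85)
print(defaults)
print(strict)
```

Pinning the algorithm by name, rather than relying on best, keeps scores comparable over time if the default changes; the resulting dictionary can be merged with any filters and passed as params.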