Deploying yente in your infrastructure

yente is an open source data match-making API. It provides functions to search, retrieve or match FollowTheMoney entities, including people, companies or vessels that are subject to international sanctions.


Requirements

Running yente requires a server that can host the main screening application (a lightweight Python application) and the ElasticSearch backend used to store and query entity information. In total, we anticipate 500 MB of memory per Python service, plus 2-4 GB of memory and 8-10 GB of disk volume for the ElasticSearch index. Running ElasticSearch on SSD-backed storage will produce a significant performance gain.
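As a rough planning aid, the memory figures above can be added up for a given deployment. The sketch below is only arithmetic on those numbers; the function name estimate_memory_mb is hypothetical and not part of yente, and the result is a starting estimate rather than a measured requirement.

```shell
#!/bin/sh
# Hypothetical helper: estimate total memory (in MB) for a yente deployment.
# Assumes 500 MB per Python service (figure from the text above) and a
# caller-chosen ElasticSearch allocation in GB (the text suggests 2-4 GB).
# Usage: estimate_memory_mb PYTHON_SERVICES ES_MEMORY_GB
estimate_memory_mb() {
    services=$1
    es_gb=$2
    echo $(( services * 500 + es_gb * 1024 ))
}

# Two API processes plus a 4 GB ElasticSearch node.
estimate_memory_mb 2 4   # prints 5096
```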

Deploy using Docker containers

While it is possible to operate yente outside of Docker, we strongly encourage the use of containers as a simple means of dependency management and deployment. We provide pre-built container images of the latest released version of yente at ghcr.io/opensanctions/yente:latest.

...with docker-compose

For the docker-compose container orchestration tool, we provide an example docker-compose.yml in the repository. You can use it to get started with yente quickly and later modify it to suit your needs.

mkdir -p yente && cd yente
wget https://raw.githubusercontent.com/opensanctions/yente/main/docker-compose.yml
docker-compose up

This will make the service available on port 8000 of the local machine. When the service is first started, you may have to wait five to ten minutes until it has finished indexing the data.
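Because the first start can take several minutes, scripts that depend on the API may want to wait for it rather than fail immediately. The following is a minimal sketch of a generic retry helper; wait_until is a hypothetical name, and the /readyz path in the example comment is an assumption about the API's readiness endpoint that you should check against your yente version.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep between attempts, but not after the last one.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (assumed endpoint; allows up to about 15 minutes):
#   wait_until 90 10 curl -fsS http://localhost:8000/readyz
```

Passing the probe as a command keeps the helper independent of any particular HTTP client, so curl can be swapped for wget or another check.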

Next: Configure yente

...with Kubernetes

If you run the container in a cluster management system like Kubernetes, you will need to run both of the services defined in the compose file (the API and ElasticSearch instance). We provide an example Kubernetes configuration in the repository. You may also need to assign the API container network policy permissions to fetch data from data.opensanctions.org once every hour so that it can update itself.

Note that in this configuration, the yente workers run with YENTE_AUTO_REINDEX disabled. Reindexing is performed by a reindex job that is launched periodically by the cluster management system.

Scaling to handle high loads

yente tries to be gentle on resources: a single process on a reasonably modern CPU core can go a surprisingly long way. When scaling out, we recommend using Kubernetes or another managed cloud service (e.g. Google Cloud Run). In this model, scaling is achieved by launching more containers, each with a single worker process (the default) and access to one vCPU.

