Building a custom data catalog

This page describes the advanced method for adding custom datasets to a yente deployment.

The yente application manages datasets and automatic updates via its manifest file. Inside the manifest, you can specify individual datasets but also catalogs, which are (remote) index files that describe a bundle of datasets. Using a catalog to manage datasets has two advantages:

  • You can add and remove datasets without changing the manifest file and re-deploying the application.
  • You can provide new versions of a dataset and use metadata to trigger the yente indexer and force a data update.

yente automatically re-fetches every catalog listed in its manifest at a regular interval, which is set as a crontab expression in YENTE_CRONTAB. Each catalog URL must be retrievable by yente, either via HTTP(S) or the file:// scheme:

catalogs:
  - url: "https://data.your-infrastructure.net/screening/data/catalog.json"
    # The name of the dataset to be indexed (can also be an array, `scopes`):
    scope: screening
    # The name of the resource to be indexed:
    resource_name: entities.ftm.json
datasets: []

What's inside a catalog?

Catalog files are simple JSON files which define an array of datasets, each of which references a file with JSON-formatted FollowTheMoney entities:

{
    "datasets": [
        {
            "resources": [
                {
                    "name": "entities.ftm.json",
                    "url": "https://data.your-infrastructure.net/screening/data/entities.ftm.json"
                }
            ],
            "name": "screening",
            "title": "Combined screening data",
            "summary": "A collection of several in-house screening lists.",
            "datasets": [
                "local_fiu",
                "customer_blocklist"
            ],
            "type": "collection",
            "last_export": "2023-12-04T08:17:38"
        },
        {
            "name": "local_fiu",
            "title": "Confidential list from a local FIU",
            "summary": "High-risk entities to avoid.",
            "last_export": "2023-12-04T08:17:38"
        },
        {
            "name": "customer_blocklist",
            "title": "Fraudulent customers",
            "summary": "Blocked from doing business in the future.",
            "last_export": "2023-12-04T08:17:38"
        }
    ],
    "updated_at": "2023-12-04T08:17:38"
}

In this catalog, three datasets are defined: screening, local_fiu and customer_blocklist. Each of them defines a name and title (both mandatory) and a last_export field given as an ISO datetime stamp. The latter is used to generate the version ID for the data index, and should be changed each time the data is updated.
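The field requirements above can be checked before a catalog is published. The following is a minimal sketch using only the Python standard library; the function name check_catalog is hypothetical and not part of yente. It also treats a missing last_export as a problem, which is an assumption of this sketch rather than a documented rule.

```python
from datetime import datetime


def check_catalog(catalog: dict) -> list[str]:
    """Return a list of human-readable problems found in a catalog dict."""
    problems = []
    for index, dataset in enumerate(catalog.get("datasets", [])):
        label = dataset.get("name") or f"dataset #{index}"
        # name and title are described as mandatory.
        for field in ("name", "title"):
            if not dataset.get(field):
                problems.append(f"{label}: missing mandatory field '{field}'")
        # Assumption of this sketch: every dataset should carry last_export.
        last_export = dataset.get("last_export")
        if last_export is None:
            problems.append(f"{label}: no last_export given")
            continue
        try:
            datetime.fromisoformat(last_export)
        except (TypeError, ValueError):
            problems.append(f"{label}: last_export is not an ISO timestamp")
    return problems
```

Running it against the parsed catalog file (for example via json.load) and refusing to publish when the returned list is non-empty is one way to catch typos early.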

Note that in the example, two datasets describe data sources, while the third is a collection that combines them into an easy-to-query grouping. Any of them could contain an array of resources, and resource metadata is not limited to FtM data. Your manifest file therefore needs to specify which of the given resources should be indexed (see resource_name, which references the name field of the resource entry).
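To illustrate how scope and resource_name from the manifest relate to a catalog, here is a small sketch of the lookup. It is not yente's actual implementation, and the helper name find_resource_url is hypothetical; it only shows which catalog fields the two manifest settings are matched against.

```python
from typing import Optional


def find_resource_url(catalog: dict, scope: str, resource_name: str) -> Optional[str]:
    """Find the URL of the named resource inside the scoped dataset."""
    for dataset in catalog.get("datasets", []):
        # The manifest's scope is compared with the dataset's name.
        if dataset.get("name") != scope:
            continue
        for resource in dataset.get("resources", []):
            # The manifest's resource_name is compared with the resource's name.
            if resource.get("name") == resource_name:
                return resource.get("url")
    # Either the dataset or the resource is not in the catalog.
    return None
```

With the example catalog, looking up scope "screening" and resource_name "entities.ftm.json" would return the entities.ftm.json URL, while "local_fiu" would return None because that dataset lists no resources.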

Operation

Once you have published this catalog and pointed yente at it using the manifest, the application will start polling the catalog regularly. You can trigger a re-index of all the specified datasets by replacing the catalog with an updated version that defines a new last_export field for the updated datasets.
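The update step can be sketched with the standard library alone. The helper name bump_last_export is hypothetical, and refreshing the top-level updated_at field alongside last_export is an assumption of this sketch, based only on the field appearing in the example catalog.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def bump_last_export(catalog: dict, names: Iterable[str], stamp: Optional[str] = None) -> dict:
    """Set a new last_export on the named datasets, modifying catalog in place."""
    if stamp is None:
        # UTC timestamp without offset, matching the style of the example catalog.
        now = datetime.now(timezone.utc).replace(tzinfo=None)
        stamp = now.isoformat(timespec="seconds")
    wanted = set(names)
    for dataset in catalog.get("datasets", []):
        if dataset.get("name") in wanted:
            dataset["last_export"] = stamp
    # Assumption: keep the catalog-level timestamp in step with the export.
    catalog["updated_at"] = stamp
    return catalog
```

A typical use would be to load catalog.json with json.load, call bump_last_export with the names of the datasets whose data changed, and write the result back with json.dump. If the manifest scopes indexing to a collection such as screening, that collection's name presumably needs to be included as well.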

Code

OpenSanctions uses the nomenklatura library to generate catalog files. This is an internal library which may change over time, but you are welcome to use it if you're aware of that caveat (please pin the dependency). Here's a code example:

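import json
from datetime import datetime

from nomenklatura.dataset import DataCatalog, DataResource

# Note: constructor arguments can differ between nomenklatura releases;
# check the signature in the version you have pinned.
catalog = DataCatalog()
dataset = catalog.make_dataset({'title': "Example", 'name': "example"})
# Stamp the dataset with the time of this export.
dataset.version = datetime.utcnow().isoformat()
# Describe the exported entity file and where yente can download it.
resource = DataResource.from_path('entities.ftm.json')
resource.url = "https://data.your-infrastructure.net/screening/data/entities.ftm.json"
dataset.resources.append(resource)

# Write the catalog file that the manifest's catalog URL points at.
with open('catalog.json', 'w') as fh:
    json.dump(catalog.to_dict(), fh)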