This page describes the advanced method for adding custom datasets to a yente deployment.
The yente application manages datasets and automatic updates via its manifest file. Inside the manifest, you can specify individual datasets as well as catalogs, which are (remote) index files that describe a bundle of datasets. Using a catalog to manage datasets has two advantages:
yente indexer and force a data update.
yente will automatically re-fetch the content of all the catalogs in its manifest at a regular interval, which is specified as a crontab entry in YENTE_CRONTAB. The specified URL needs to be retrievable by yente, either via HTTP(S) or the file:// scheme:
catalogs:
  - url: "https://data.your-infrastructure.net/screening/data/catalog.json"
    # The name of the dataset to be indexed (can also be an array, `scopes`):
    scope: screening
    # The name of the resource to be indexed:
    resource_name: entities.ftm.json
datasets: []
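Before pointing the manifest at a catalog, it can help to confirm that the URL is retrievable and returns valid JSON. The following is a minimal sketch using only the Python standard library; `fetch_catalog` is a hypothetical helper name, and this is not how yente itself fetches catalogs.

```python
import json
from urllib.request import urlopen


def fetch_catalog(url: str, timeout: float = 30.0) -> dict:
    """Download a catalog from an http(s):// or file:// URL and parse it as JSON.

    Hypothetical helper for checking a catalog URL by hand; raises an
    exception if the URL cannot be opened or the body is not valid JSON.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_catalog("file:///path/to/catalog.json")` reads a local file through the same code path as a remote HTTP(S) URL.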
Catalog files are simple JSON files which define an array of datasets, each of which references a file with JSON-formatted FollowTheMoney entities:
{
  "datasets": [
    {
      "resources": [
        {
          "name": "entities.ftm.json",
          "url": "https://data.your-infrastructure.net/screening/data/entities.ftm.json"
        }
      ],
      "name": "screening",
      "title": "Combined screening data",
      "summary": "A collection of several in-house screening lists.",
      "datasets": [
        "local_fiu",
        "customer_blocklist"
      ],
      "type": "collection",
      "last_export": "2023-12-04T08:17:38"
    },
    {
      "name": "local_fiu",
      "title": "Confidential list from a local FIU",
      "summary": "High-risk entities to avoid.",
      "last_export": "2023-12-04T08:17:38"
    },
    {
      "name": "customer_blocklist",
      "title": "Fraudulent customers",
      "summary": "Blocked from doing business in the future.",
      "last_export": "2023-12-04T08:17:38"
    }
  ],
  "updated_at": "2023-12-04T08:17:38"
}
In this catalog, three datasets are defined: screening, local_fiu and customer_blocklist. Each of them defines a name and title (both mandatory) and a last_export field given as an ISO datetime stamp. The latter is used to generate the version ID for the data index, and should be changed each time the data is updated.
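The field requirements described above can be checked before a catalog is published. This is a rough sketch, assuming the catalog has already been parsed into a Python dict; `check_catalog` is a hypothetical helper and covers only the rules stated on this page (name and title present, last_export parseable as an ISO datetime when given).

```python
from datetime import datetime


def check_catalog(catalog: dict) -> list[str]:
    """Return a list of human-readable problems found in a catalog dict."""
    problems = []
    for index, dataset in enumerate(catalog.get("datasets", [])):
        # Fall back to the list position if the dataset has no usable name.
        label = dataset.get("name") or f"#{index}"
        for field in ("name", "title"):
            if not dataset.get(field):
                problems.append(f"dataset {label}: missing {field}")
        last_export = dataset.get("last_export")
        if last_export is not None:
            try:
                datetime.fromisoformat(last_export)
            except (TypeError, ValueError):
                problems.append(
                    f"dataset {label}: last_export is not an ISO datetime"
                )
    return problems
```

An empty list means none of these checks failed; it does not guarantee that yente will accept the catalog.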
Note that in the example, two datasets describe data sources, while one is a collection that combines them into an easy-to-query grouping. Any of them could contain an array of resources. The resources metadata are not limited to FtM data. Your manifest file needs to specify which of the given resources should be indexed (see resource_name, which references the name field of the resource entry).
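To illustrate how scope and resource_name in the manifest relate to the catalog, here is a small sketch of the lookup: find the dataset whose name matches the scope, then the resource whose name matches resource_name. `find_resource_url` is a hypothetical function written for this page; yente's internal resolution logic may differ.

```python
from typing import Optional


def find_resource_url(catalog: dict, scope: str, resource_name: str) -> Optional[str]:
    """Return the URL of the named resource in the scoped dataset, if any."""
    for dataset in catalog.get("datasets", []):
        if dataset.get("name") != scope:
            continue
        # Only resources listed on the scoped dataset itself are considered.
        for resource in dataset.get("resources", []):
            if resource.get("name") == resource_name:
                return resource.get("url")
    return None
```

With the example catalog and manifest above, the scope screening and the resource name entities.ftm.json would select the single resource URL listed on the screening collection.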
Once you have published this catalog and pointed yente at it using the manifest, the application will start polling the catalog regularly. You can trigger a re-index of all the specified datasets by replacing the catalog with an updated version that defines a new last_export field for the updated datasets.
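If you maintain the catalog file by hand rather than generating it, the update step can be scripted. The sketch below assumes a catalog stored as a local JSON file; `bump_last_export` is a hypothetical helper that stamps the chosen datasets with the current UTC time and replaces the file in one step, so that a reader polling the file never sees a half-written catalog.

```python
import json
import os
from datetime import datetime, timezone


def bump_last_export(path: str, names: set[str]) -> str:
    """Set a fresh last_export on the named datasets and rewrite the catalog."""
    # Naive UTC timestamp with second precision, e.g. 2023-12-04T08:17:38.
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    stamp = now.isoformat(timespec="seconds")
    with open(path, "r", encoding="utf-8") as fh:
        catalog = json.load(fh)
    for dataset in catalog.get("datasets", []):
        if dataset.get("name") in names:
            dataset["last_export"] = stamp
    # Write to a temporary file first, then swap it into place.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(catalog, fh, indent=2)
    os.replace(tmp_path, path)
    return stamp
```

For instance, `bump_last_export("catalog.json", {"screening", "local_fiu"})` would update two of the three datasets from the example and leave customer_blocklist unchanged.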
OpenSanctions uses the nomenklatura library to generate catalog files. This is an internal library which may change over time, but you are welcome to use it if you're aware of that caveat (please pin the dependency). Here's a code example:
import json
from datetime import datetime

from nomenklatura.dataset import DataCatalog, DataResource

catalog = DataCatalog()
dataset = catalog.make_dataset({'title': "Example", 'name': "example"})
# Stamp the dataset with the current UTC time as its version:
dataset.version = datetime.utcnow().isoformat()
# Describe the exported entity file and the URL it will be published at:
resource = DataResource.from_path('entities.ftm.json')
resource.url = "https://data.your-infrastructure.net/screening/data/entities.ftm.json"
dataset.resources.append(resource)

# Write out the catalog file that the manifest points to:
with open('catalog.json', 'w') as fh:
    json.dump(catalog.to_dict(), fh)