We're now integrating persons of interest from Wikidata!

The structured-data edition of Wikipedia offers a compelling source of information on many persons in the public eye. Mining the data is, however, not for the faint of heart.

Wikidata is as ambitious a project as you can imagine: a fully structured knowledge graph encyclopedia, describing knowledge from all domains of human thought. These facts are contributed by hundreds of humans and robots - the Peppercat project we wrote about this week is a good example.

Probably the most exciting aspect of Wikidata from an investigative or due diligence perspective is its profiles of living persons: politicians, businesspeople, even terrorists. These profiles are often incredibly detailed, giving not just basic information (birth dates and places, roles in different organizations), but also rich name variations in dozens of alphabets and even details on family relations.

What's more, unlike many other datasets, the Wikidata items are almost perfectly de-duplicated: there is exactly one identifier (a "Q-ID") referring to Vladimir Putin: Q7747. All language-specific Wikipedia articles on this subject are linked to this item. If anyone were to create a new item for the Russian president, it would soon be discarded in favor of the existing one.

Companies and other organizations, on the other hand, are a weakness of Wikidata: the database tends not to retain sufficient detail about subsidiaries, beneficial owners, or branch offices to be useful as a source of corporate information.

But all this comes at a price: Wikidata is notoriously complex in how it represents information. Underlying the platform is a unique attempt to design an ontology - a set of commonly agreed terms and properties - for all human knowledge. Each fact about an item is recorded as a statement that assigns a value to such a property. President Putin, for example, is linked via property #569 to the 7. October 1952: it is his birthdate. Annotating this statement are multiple references and qualifiers, for example links to news articles and reference databases were this fact was sourced.

This data can be queries via the Wikidata Query Service, which lets users run SPARQL queries. Writing these queries is a bit of an art, especially as you dive deeper into the full Wikidata statement model. Thankfully, there are also other tools like PetScan that you can use to identify Wikidata items of interest.

Integrating Wikidata items into OpenSanctions

Give the wealth of material in Wikidata, integrating it into OpenSanctions has been a priority. By working with the Peppercat project, we got access to their curated item lists:

  • Peppercat World Leaders contains national-level cabinet members and key government positions for more than 150 countries and places, kept up-to-date by Peppercats set of crawlers and alerting systems.

  • Peppercat Legislators contains all the members of national parliaments that are correctly represented in Wikidata. This is dataset is an attempt to generate an up-to-date version of EveryPolitican. Unfortunately, many countries don't have complete member lists in Wikidata. In those cases, the crawler scripts created by EveryPolitician will have to be updated to compare what data exists in Wikidata with parliamentary web sites.

Starting with these two lists of Wikidata items, we used the Wikibase API (a simpler alternative to the SPARQL service) to fetch details about each entity and translate the given properties into the data model used by OpenSanctions.

Some of these values require special treatment: countries of birth that don't exist any more (from Rhodesia to the Ottoman Empire), dates with a varied precision (born in the 1910s) and most notably links between people. We decided to import family structures and information about associates/friends recursively, in line with the FATF guidance on relatives and close associates for politically exposed persons. The results of this are surprisingly detailed in some cases, like MBS, Angela Merkel or Donald Trump.

Finally, an important component of the integration we built is around caching item information from Wikidata so that we don't need to send hundreds of thousands of requests to their API every time we update the OpenSanctions lists.

Next steps

Importing the Wikidata items suggested by Peppercat enables us to import large entity sets along topical lines, but there's another way that we're hoping to integrate with Wikidata in the future: by matching and then enriching the profiles of existing entities in OpenSanctions. Many sanctioned individuals have corresponding Wikipedia pages, and linking those to the base data provided on sanctions lists can provide us better alias names, and information on family and associates.

Then, there's the inverse path: creating Wikidata items for every sanctioned entity in the world. Such an upload would make sense, at the least, for all natural persons mentioned on the EU, US or UN sanctions lists. We'll keep you up to date!

Like what we're writing about? Keep the conversation going! You can follow us on Twitter or join the Slack chat to bring in your own ideas and questions. Or, check out the project documentation to learn more about OpenSanctions.

Published:

This article is part of OpenSanctions, the open database of sanctions targets and persons of interest.