Name matching, especially in sensitive arenas like sanctions screening, is only effective when we understand the cultural, linguistic, and legal ambiguities baked into the data. In this post, we explore the unexpected complexities behind something as "simple" as a name. You'll learn what an abugida is, how writing systems collide with international compliance, and why matching a name on a sanctions list is very different from connecting entities across datasets.
All Names Were Created Equal… or Were They?
When we began matching names, we attempted to canonicalize them. This meant selecting a “true” or official form to unify references. It didn’t take long for us to discover the flaw in this idea.
Take Picasso. Or Pablo Picasso. Or shall we say Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad Ruiz y Picasso?
Try picking a canonical version of that.
Names are social constructs. They shift across time, cultures, and contexts. They’re processed by legacy computer systems built on the notion that 26 characters are plenty for everybody. In thinking about names, we’ve had to discard the idea of a “canonical form”: a single written name that ideally describes a person or organization. Instead, we’ve learned that things often have many, equally valid, names.
All are welcome here.
Writing Systems: More Than Letters
Matching name variations is hard enough across languages that use the Latin alphabet (and its Cyrillic cousin). But other global writing systems often play by entirely different rules.
- In syllabaries like Japanese Hiragana, a single symbol (e.g., は) represents a syllable.
- In abugidas such as Ethiopian Geʽez, vowels are secondary marks added to consonants.
- In abjads, like Arabic or Hebrew, many vowel sounds aren’t written at all.
- In logographic systems like Chinese, each character encodes a whole concept or word. These can lose fidelity when transliterated into systems like Pinyin.
As Liuhuaying Yang demonstrates in their visual essay on transliterating Chinese names, converting logographic names into Latin characters (e.g., 杨柳桦樱 to “Liuhuaying Yang”) often leads to information loss.
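Even within Latin-based scripts, the most basic normalization step — folding diacritics away — already discards information. Here is a minimal sketch in plain Python (no transliteration library); note that it does nothing useful for abjads or logographic scripts, where genuine, lossy transliteration is unavoidable:

```python
import unicodedata

def fold_to_ascii(name: str) -> str:
    """Decompose to NFKD and drop combining marks (diacritics).
    This only helps for Latin-based scripts; Arabic, Hebrew, or
    Chinese names require real transliteration, which is lossy."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_to_ascii("José María Sánchez"))  # -> "Jose Maria Sanchez"
print(fold_to_ascii("Müller"))              # -> "Muller"
```

Folding like this makes “Sánchez” and “Sanchez” comparable, but it also means the folded form can no longer be mapped back to the original spelling.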
And then there's the issue of name changes. A person might Anglicize their name upon immigrating. A data trail might contain "Mohammed Al-Fulan," "Mo Alfolan," and "Mike Fuller," all referring to the same individual. These transformations are especially common in diaspora communities.
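A naive string-similarity check illustrates why such variants are hard. This sketch uses Python's stdlib `difflib` purely for illustration; production matchers rely on phonetic, transliteration-aware, and context-sensitive scorers instead:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Plain edit-based similarity over lowercased strings; an
    # illustrative toy, not how a screening engine should score.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A transliteration variant still scores moderately well, but a
# fully Anglicized alias shares almost no characters and would be
# missed entirely by edit-distance approaches.
print(similarity("Mohammed Al-Fulan", "Mo Alfolan"))
print(similarity("Mohammed Al-Fulan", "Mike Fuller"))
```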
People, Companies, and the Problem of Entity Types
Matching people is messy. Matching organizations? Equally chaotic.
Organization names suffer similar transliteration problems, plus structural inconsistencies. Is “ABC GmbH” the same as “ABC Limited”? What about “ABC Inc.” operating in a different jurisdiction? Keep in mind that the people entering company names into watchlists, or into the databases that need to be screened, might not be obsessed with accuracy. For example, “LIMITED LIABILITY COMPANY PROMLOGISTIKA” refers to a Russian entity on the U.S. sanctions list: the original legal prefix, “ОБЩЕСТВО С ОГРАНИЧЕННОЙ ОТВЕТСТВЕННОСТЬЮ”, has simply been translated into English.
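One mitigation is to normalize legal-form tokens away before comparing company names. The sketch below uses a tiny, hand-picked set of forms (the multilingual reference lists discussed later in this post are far more complete), and handles both suffixed forms like “GmbH” and prefixed ones like the translated Russian example:

```python
# Toy set of legal forms, for illustration only; a real system
# needs a much larger multilingual reference list.
LEGAL_FORMS = {
    "gmbh", "limited", "ltd", "inc", "llc", "ooo",
    "limited liability company",
}

def strip_legal_form(name: str) -> str:
    # Normalize case and punctuation, then peel known legal forms
    # off either end of the name (longest forms first).
    norm = " ".join(name.lower().replace(".", " ").split())
    for form in sorted(LEGAL_FORMS, key=len, reverse=True):
        if norm.startswith(form + " "):
            norm = norm[len(form) + 1:]
        if norm.endswith(" " + form):
            norm = norm[: -len(form) - 1]
    return norm

print(strip_legal_form("ABC GmbH"))     # -> "abc"
print(strip_legal_form("ABC Limited"))  # -> "abc"
print(strip_legal_form("LIMITED LIABILITY COMPANY PROMLOGISTIKA"))
# -> "promlogistika"
```

With the legal forms stripped, “ABC GmbH” and “ABC Limited” reduce to the same core name and can be compared on equal footing — though whether they are actually the same entity remains a separate question.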
Sanctions Screening: High Stakes, Low Data
Knowing all this, it becomes obvious that perfection in name matching is not a realistic goal. Like most engineering challenges, it's about choosing the right set of trade-offs for the problem you're trying to solve.
So: why are we doing name matching in the first place? Let’s distinguish two use cases:
- Data integration – Say we're deciding whether "Robin L. Williams" in two datasets is the same person. We have supporting metadata: birth dates, ID numbers, affiliations. This is the category of matching challenge that most “Know Your Customer” or “Customer 360” processes fall into.
- Sanctions screening – We may only have a name, and perhaps a country of residence. Due to strict compliance requirements, even partial or ambiguous matches must be flagged and investigated; vague matches like “ISIS” routinely trigger alerts, resulting in many false positives.
Unlike other domains where you can lean on additional identifiers, sanctions screening often hinges on under-specified data. That makes every letter count: a simple typo, a translation glitch, or a spelling variation can raise false alarms or let a sanctioned actor slip through.
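The difference between the two use cases can be made concrete as a pair of decision thresholds. The cutoff values below are invented for illustration, not recommendations:

```python
def decide(score: float, use_case: str) -> str:
    """Screening errs toward human review (a low cutoff, hence many
    false positives); integration can demand stronger evidence
    before merging records. Cutoffs here are illustrative only."""
    cutoff = 0.5 if use_case == "screening" else 0.85
    return "review" if score >= cutoff else "pass"

print(decide(0.6, "screening"))    # -> "review"
print(decide(0.6, "integration"))  # -> "pass"
```

The same match score leads to opposite decisions, which is why a matching system needs to know which problem it is solving.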
What can we do?
We like to do three things, and they’re the same things we’ve been doing all along:
- We play well with others. Technology vendors like Senzing and Quantexa provide enterprise-grade entity resolution and name matching technology. We work to make sure our database plugs into these tools seamlessly, and that the rich semantics we collect from our data sources are available to support matching processes.
- We make open data and infrastructure. We’ve been creating valuable structured reference data for name matching. In this list of organizational suffixes, we track common legal forms (e.g., S.A., Ltd, GmbH) and their variants across languages. Another resource on symbolic representations captures multilingual equivalencies and formatting quirks for common name components like “Holding”, “Management” and so on. Check out this overview of the data assets we've published.
- We’ll evolve our own stack. A bit more about that below.
Please treat this not just as open data, but as a read/write project: contributions welcome!
Toward a Robust and Explainable Baseline
What about AI and large models?
While large language models hold promise, regulatory frameworks around automated decision-making remain complex: financial institutions must be able to explain and audit every action their systems take.
Our goal at OpenSanctions is therefore to first build easy-to-adopt tooling using traditional and explainable methods. Tools like our Yente API and Nomenklatura offer open-source name matching for sanctions and KYC screening purposes. Given that this is all open and reusable, we think of it as an industry baseline: if your current solution does less than that, ask why.
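For a flavor of what this looks like in practice, here is a sketch of a query payload for Yente's matching endpoint. The example person is invented, and the exact field names and endpoint path should be checked against the current Yente documentation:

```python
import json

# Queries against Yente are expressed in the FollowTheMoney data
# model: a schema (e.g. "Person") plus a bag of properties.
payload = {
    "queries": {
        "q1": {
            "schema": "Person",
            "properties": {
                "name": ["Mohammed Al-Fulan"],  # invented example name
                "country": ["iq"],
            },
        }
    }
}

# POST json.dumps(payload) to the /match/<dataset> endpoint of a
# Yente instance; each candidate in the response carries a score
# and per-feature explanations that an analyst can audit.
print(json.dumps(payload, indent=2))
```

The key property for auditability is that every returned candidate comes with a breakdown of which features matched, rather than a single opaque score.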
Contributions welcome!
We invite developers, investigators, academics and data scientists to explore and improve our reference data, contribute new name patterns or transliteration rules, and use our matching infrastructure to build more accurate, auditable compliance tools.
🔗 Explore our datasets: OpenSanctions.org
🔗 Contribute on GitHub: github.com/opensanctions