Here’s everything that goes wrong when AI does text extraction (and how we’re still using it)

From dropping middle names to hallucinating details that were never mentioned in the first place, mistakes are commonplace in Large Language Model (LLM) data extraction. Here’s how we’re embracing automation to extract entities and turn them into structured risk data (with a healthy dose of scepticism and human moderation).

Our database includes data extracted from a wide range of sources. Everything from the detail-obsessed, precise XML format used by Switzerland to images of typewriter text in PDFs, with dozens of different forms of human-authored spreadsheets and web pages somewhere in between.

Tables in PDFs, spreadsheets, and websites can all be turned into a series of records. Within those records, values can be complex — a single table cell in a "Name" column could include a name, aliases, and even refer to distinct entities, as shown in the Alabama screenshot below.

Cell values like these are not yet suitable as names for reliable entity matching in a risk screening system. Before a company or person can be entered into our database, the associated data must be cleaned. The more precisely we can clean identifying information, the more true positive matches we can find and the fewer false positives we encounter. But crucially, we don’t want to drop any real data values.

Some of the cleaning we need to do includes:

  • Splitting primary names from aliases and trading names
  • Splitting listed and related entities
  • Splitting identifiers like company numbers
  • Stripping text and punctuation that doesn’t form part of the true value, e.g. “a.k.a. ( … )” or “combat name:”

Alabama suspended Medicaid providers list with occasional aliases and related entities

Missed details

Many of these data ambiguities will continue to require human review for the foreseeable future. But the more humans handle the actual data processing, the more error-prone and inconsistent it becomes. At OpenSanctions, we automate where possible and have humans step in where necessary — preferably in an oversight role rather than a basic data-processing role.

In an ideal world, we could simply split on punctuation, treating punctuation marks as separators, and in some cases this works. But as you try to handle more and more complex cases, the splitting and cleaning logic grows, and before you know it you’ve built a chaotic sequence of semi-magical data cleaning incantations. At that point it becomes hard to reason about exactly what will happen when any given value is thrown at your data cleaning monstrosity.
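To make that concrete, here is a minimal sketch of what naive punctuation- and keyword-based splitting can look like. It is purely illustrative (the marker list, the split_names helper and the example values are invented for this post, not our production cleaning code), but it hints at how quickly the special cases pile up:

```python
import re

# Naive cleaner in the spirit described above: split a raw "Name" cell on
# common alias markers and separators. Illustrative only.
ALIAS_MARKERS = re.compile(r"\s*(?:a\.?k\.?a\.?|d/b/a|;|\||\n)\s*", re.IGNORECASE)

def split_names(raw: str) -> list[str]:
    """Split a raw name cell into candidate names, dropping empty fragments."""
    parts = ALIAS_MARKERS.split(raw)
    # Strip stray punctuation left over from the source formatting.
    return [part.strip(" ()\"'.,") for part in parts if part.strip(" ()\"'.,")]

print(split_names("ACME Holdings LLC; a.k.a. ACME Group"))
# ['ACME Holdings LLC', 'ACME Group']
print(split_names("Jean-Claude Dupont (d/b/a Dupont Trading)"))
# ['Jean-Claude Dupont', 'Dupont Trading']

# Each new source adds rules: hyphens separate some values but belong inside
# names and registration numbers, parentheses may hold an alias or a legal
# form, and soon the behaviour of the whole stack is hard to predict.
```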

For example, Switzerland’s highly structured XML format for publishing sanctions designations is a model for others in many respects, but it still includes an “other-information” field where useful identifying information is arbitrarily interspersed with other text and punctuation.

Registration numbers in the Swiss SECO XML, separated by dash and slash

Registration number in the Swiss SECO XML, containing hyphens

Free text, free trouble

Another challenge arises when identifying and extracting information from free text sources. While sanctions, PEP, and debarment lists are usually published in structured or semi-structured formats, many regulatory bodies announce enforcement actions through press releases or online notices. The names and other relevant information listed in these free text sources are not just laborious to extract by hand; formatting inconsistencies also make manual extraction error-prone.

One or more distinct names per list item in a CFTC press release

Extracting and cleaning with LLMs

When expanding our data collection to include enforcement actions and notices, we needed a way to reliably identify and extract risk-related entities without inadvertently labelling bystanders — such as authorities or their officials — as risky. We also needed the extraction approach to be adaptable to the range of entity types and identifying information in these sources, including names, identity numbers, addresses, and associates — all with varying levels of completeness.

This is where LLMs come in. We can prompt an LLM to categorise and extract structured data from semi-structured and completely free text sources. By defining a schema for the expected output, we get back data in a structure we can use directly in our data pipeline.

Really impressive extraction: everything was extracted correctly except the “doing business as” alias
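In code, that pattern looks roughly like this: declare the expected structure up front, ask the model to fill it, and validate whatever comes back before it touches the pipeline. The schema fields, the prompt wording and the call_llm wrapper below are illustrative stand-ins, not our actual pipeline code; the real pipeline maps results onto the FollowTheMoney model rather than this simplified structure:

```python
import json
from pydantic import BaseModel, Field

# Illustrative output schema only; field names are invented for this example.
class ExtractedEntity(BaseModel):
    entity_type: str = Field(description="e.g. 'Person' or 'Company'")
    name: str
    aliases: list[str] = []
    id_numbers: list[str] = []
    relationships: list[str] = []

class ExtractionResult(BaseModel):
    entities: list[ExtractedEntity]

PROMPT = (
    "Extract the sanctioned or penalised parties from the notice below. "
    "Ignore the issuing authority and its officials, and leave out unnamed "
    "individuals. Reply with JSON matching this schema:\n"
    + json.dumps(ExtractionResult.model_json_schema(), indent=2)
    + "\n\nNotice:\n"
)

def extract(notice_text: str, call_llm) -> ExtractionResult:
    """call_llm is a hypothetical wrapper around whichever LLM API is in use;
    it takes a prompt string and returns the model's raw text reply."""
    reply = call_llm(PROMPT + notice_text)
    # Validation fails loudly if the reply doesn't match the declared schema,
    # so malformed output never slips into the pipeline silently.
    return ExtractionResult.model_validate_json(reply)
```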

The results of LLM-based extraction using current-generation models are far from perfect, however. Sometimes a name is cleaned too aggressively — for instance, the “junior” suffix (jr) is dropped — or a reference to “an unnamed individual” is extracted by the LLM as though that phrase were a name.

The article doesn’t name the suspect. The prompt tells the LLM to leave out unnamed individuals, yet it creates an entry with a placeholder name anyway. A data analyst would delete this entry before accepting the result.

Occasionally, the LLM makes things up entirely:

The birthDate, birthPlace and relationship type of “Family” are completely invented by the LLM and not indicated by the original text
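Errors like these are one reason every extracted record goes to a human reviewer, but simple automated checks can at least surface the most suspicious values first. The sketch below is an illustration rather than part of our published tooling: it flags values that look like placeholders, or that never appear verbatim in the source text, as candidates for extra scrutiny.

```python
import re

# Phrases that sometimes come back where a real name should be.
PLACEHOLDER = re.compile(r"\b(?:unnamed|unknown|not named)\b", re.IGNORECASE)

def flag_suspect_fields(source_text: str, extracted: dict[str, str]) -> list[str]:
    """Return warnings for values that look like placeholders or that never
    appear in the source text (a crude signal that something was invented)."""
    warnings = []
    lowered = source_text.lower()
    for field, value in extracted.items():
        if not value:
            continue
        if PLACEHOLDER.search(value):
            warnings.append(f"{field}: looks like a placeholder ({value!r})")
        elif value.lower() not in lowered:
            warnings.append(f"{field}: not found verbatim in the source ({value!r})")
    return warnings

print(flag_suspect_fields(
    "The CFTC fined ACME Group for fraud.",
    {"name": "ACME Group", "birthDate": "1969-04-12", "alias": "An unnamed individual"},
))
# Flags birthDate (not in the source) and alias (placeholder); name passes.
```

A verbatim check like this is deliberately crude: legitimate values are often reformatted, so it only ranks entries for review rather than rejecting anything automatically.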

Human moderation

In most cases, the extracted data is entirely or mostly correct, but given the risk of error, we have decided that all LLM data extraction at OpenSanctions will be reviewed by human moderators. Our moderation system allows analysts to compare the source data and automatically extracted data side-by-side. They can then review the extracted data and, if needed, make edits before accepting the result.

Our extraction review system, now used for human moderation of data extracted by LLMs and by custom logic where more precise rules are needed
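Conceptually, each item in the review queue pairs a source snippet with the machine-extracted data and records what the analyst decided. The record below is an illustrative guess at what that involves (field names and states invented for this post), not our actual implementation:

```python
from dataclasses import dataclass
from enum import Enum

class ReviewStatus(Enum):
    PENDING = "pending"    # waiting for an analyst
    ACCEPTED = "accepted"  # extraction taken as-is
    EDITED = "edited"      # accepted after analyst corrections
    REJECTED = "rejected"  # not entity data we want at all

@dataclass
class ExtractionReview:
    source_snippet: str            # shown on one side of the comparison
    extracted: dict                # what the LLM produced, shown on the other
    status: ReviewStatus = ReviewStatus.PENDING
    corrected: dict | None = None  # analyst edits, kept alongside the original

    def accept(self, edits: dict | None = None) -> dict:
        """Only accepted (and possibly corrected) data flows into a dataset."""
        if edits:
            self.status, self.corrected = ReviewStatus.EDITED, edits
        else:
            self.status = ReviewStatus.ACCEPTED
        return self.corrected or self.extracted
```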

As illustrated below, the extent of corrections needed depends on the complexity of the extraction task and the quality of the source data. Some datasets need a handful of corrections, while others need hundreds. In the longer term, we’re also working on building a dataset that could be used to validate and improve automated extraction.

Reviews result in examples where extraction wasn’t perfect, for evaluation and improvement.
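Those reviewed examples double as evaluation data: comparing what the model produced with what the analyst finally accepted shows which fields needed correction most often. A toy version of that comparison, using invented records, might look like this:

```python
from collections import Counter

def correction_stats(reviews) -> Counter:
    """Count, per field, how often the analyst changed the extracted value.
    `reviews` is an iterable of (extracted, corrected) dict pairs; an entry
    accepted unchanged simply passes the same dict twice."""
    changed = Counter()
    for extracted, corrected in reviews:
        for key in set(extracted) | set(corrected):
            if extracted.get(key) != corrected.get(key):
                changed[key] += 1
    return changed

# Fields that are corrected most often point at prompt or schema tweaks to try.
print(correction_stats([
    ({"name": "ACME Group", "alias": "ACME"}, {"name": "ACME Group LLC", "alias": "ACME"}),
    ({"name": "Jane Doe"}, {"name": "Jane Doe"}),
]))
# Counter({'name': 1})
```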

It’s clear that generative systems are more capable of interacting with human-created text than previous generations of natural language processing systems. Yet they also bring with them an entirely new way of making mistakes: errors that distort the meaning of a text, rather than merely mis-selecting or mis-analysing its raw tokens.

For watchlist entries, these errors could have catastrophic consequences for investigators screening a customer or client, leading to a missed name or even to a name mislabelled as a risky entity.

Full automation would mean a compromise on data quality, which is central to what we do at OpenSanctions. For this reason, data extracted by an LLM will continue to be carefully reviewed by analysts — and if needed, corrected — before becoming part of a dataset.

Recently, we wrote about how we’re using free text sources to build a deeper understanding of entities and their networks — and how we’re tackling the extraction challenges along the way.

Like what we’re writing about? Keep the conversation going! You can follow us on LinkedIn, subscribe to our email newsletter or join the discussion forum to bring in your own ideas and questions. Or, check out the project documentation to learn more about OpenSanctions.
