Every week, someone posts a “fresh leak from Major Company X” on a forum or Telegram channel. Some of them are real and important. Many are recycled from older breaches with the metadata changed. Some are outright fabrications: synthetic data designed to embarrass a target or generate clicks. Publishing the wrong one is a credibility-ending event for a researcher or newsroom. Here’s the verification checklist that catches most of the bad ones.
1. Source provenance
Where did the dataset come from, and what’s the chain of custody? A claim from a known operator on their own leak site is one provenance level. A repost on a Telegram aggregator with no original source is another. A “leaked.zip” that surfaced on a paste site with no claim is the lowest. Note the source. If you can’t establish provenance, that’s reportable in itself, and a reason for skepticism.
2. Sample-record validation
Pull ten records at random from the dataset. For each, attempt independent verification. If the data claims to be from a customer database, do the customer records match real public records? If it’s claimed to be employee data, do the names tie to LinkedIn profiles consistent with employment at the named company? Sample validation catches synthetic data fast, because fabricators rarely make every record consistent, both internally and with what is publicly known.
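The sampling step is worth doing mechanically, so that nobody picks the ten records that "look" plausible. A minimal sketch, assuming the dataset has already been parsed into a list of dicts; the `records` list and its field names are made up for illustration:

```python
import random


def sample_records(records, k=10, seed=None):
    """Return up to k records chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    records = list(records)
    return rng.sample(records, min(k, len(records)))


# Hypothetical stand-in for rows parsed from the leaked file.
records = [{"id": i, "email": f"user{i}@example.com"} for i in range(1000)]

sample = sample_records(records, k=10, seed=42)
for record in sample:
    print(record["id"], record["email"])
```

Recording the seed alongside your notes lets a colleague pull the same ten records and check your verification work.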
3. Recycled-breach check
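Compare the dataset’s email addresses against Have I Been Pwned and similar services. Have I Been Pwned tells you which earlier breaches an address appeared in, not which password went with it, so checking passwords means comparing against a copy of the older data. If around 80% of the emails appear in older breaches, and the passwords are identical wherever you can check them, you’re looking at recycled data. The “fresh” claim is wrong, but the data may still be real, just old. That distinction matters for the reporting.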
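The overlap arithmetic itself is simple. The sketch below assumes you already hold a list of addresses known from older breaches (for example, from a corpus your team maintains); it does not call any breach-lookup service, and the sample addresses and the 0.8 cut-off are illustrative only:

```python
def normalize(email):
    """Lower-case and trim an address so trivial variants compare equal."""
    return email.strip().lower()


def overlap_fraction(new_emails, old_emails):
    """Fraction of distinct addresses in the new dataset already seen in older data."""
    new = {normalize(e) for e in new_emails if e.strip()}
    old = {normalize(e) for e in old_emails}
    if not new:
        return 0.0
    return len(new & old) / len(new)


# Hypothetical inputs: addresses from the "fresh" leak and from older breaches.
new_emails = ["Alice@example.com", "bob@example.com ", "carol@example.com",
              "dave@example.com", "erin@example.com"]
old_emails = ["alice@example.com", "bob@example.com", "carol@example.com",
              "dave@example.com"]

fraction = overlap_fraction(new_emails, old_emails)
likely_recycled = fraction >= 0.8  # rough threshold, not a hard rule
print(f"{fraction:.0%} of addresses seen before; likely recycled: {likely_recycled}")
```

A high overlap is a prompt to dig further, not a verdict: a long-lived customer base will naturally share addresses with older breaches.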
4. Internal-consistency check
Real datasets have anomalies: duplicates, malformed records, encoding errors, fields with non-uniform formatting. Synthetic datasets are often too clean. Run a quick statistical look at the distribution of created-at timestamps, the distribution of email domains, and the length distribution of any free-text field. Real data has the messy distributions you’d expect; faked data tends to be suspiciously uniform.
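A quick first pass can be done with the standard library alone. This is a sketch under assumptions: the field names `email` and `notes` are hypothetical, and it covers duplicates, malformed addresses, email domains, and free-text length; timestamps would follow the same counting pattern.

```python
from collections import Counter
from statistics import mean, pstdev


def consistency_summary(records):
    """Summarize the kinds of messiness real data tends to show."""
    emails = [r.get("email", "").strip().lower() for r in records]
    # Treat an address as well-formed only if it has one "@" with text on both sides.
    valid = [e for e in emails
             if e.count("@") == 1 and not e.startswith("@") and not e.endswith("@")]
    domains = Counter(e.split("@")[1] for e in valid)
    lengths = [len(r.get("notes", "")) for r in records]
    return {
        "records": len(records),
        "duplicate_emails": len(emails) - len(set(emails)),
        "malformed_emails": len(emails) - len(valid),
        "top_domains": domains.most_common(3),
        "notes_length_mean": mean(lengths) if lengths else 0.0,
        "notes_length_stdev": pstdev(lengths) if lengths else 0.0,
    }


# Hypothetical rows; a real run would use the parsed dataset.
records = [
    {"email": "a@example.com", "notes": "called twice"},
    {"email": "a@example.com", "notes": "called twice"},
    {"email": "b@example.org", "notes": ""},
    {"email": "not-an-email", "notes": "??"},
]

summary = consistency_summary(records)
for key, value in summary.items():
    print(key, value)
```

Zero duplicates, zero malformed addresses, and a near-zero length spread across thousands of rows would be the "too clean" pattern worth flagging.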
5. Direct-confirmation attempt
Reach out to the named victim. The standard journalism-ethics version: provide them with a small sample of the data (not the whole dataset), ask whether they recognise it, and give them a deadline to respond. Legitimate victims often confirm, sometimes deny, sometimes hedge, but the conversation itself is signal. A complete refusal to engage is itself a data point. So is “we are investigating.”
When to publish anyway
If verification produces ambiguity (some signals positive, some negative, no clean confirmation), the right move is usually to publish the ambiguity itself. “Operator X claims to have breached Company Y. Independent verification of the dataset shows [these confirmed elements] and [these unconfirmed elements]. The company has [responded or not].” That’s a defensible piece. The lazy version, which takes the leak claim at face value and reports the dataset as fact, is the one that ends careers.
Handling the data itself
Don’t analyse leaked data on your normal work machine. Use a research VM. Don’t share the dataset internally beyond the people who need to see it. Don’t keep it longer than the investigation requires. Privacy harm to victims is real, and well-meaning research can compound it.
The verification process takes hours. The reporting decision sits on those hours. Whether you skip verification to publish first is the choice that matters most for your long-term credibility, and for the victims you’re writing about.
