Why we rebuilt document intelligence from scratch
Every RegTech platform claims to "read" documents. Almost none of them actually do. Here's what we learned after parsing 600-page prospectuses — and why we had to build our own pipeline from the ground up.
The bar nobody measures against
When we started Mercurium, every vendor in the space pitched us the same demo: a 2-page invoice, a driver's licence, maybe a utility bill. OCR extracts the name and address. "Look how clean!"
That is not compliance. Compliance is a 640-page prospectus with 21 chapters, 358 defined terms, 492 internal cross-references, and 31 referenced external documents. It is a Mandarin board resolution scanned at 150 DPI with a red seal overlapping the signature. It is a subscription agreement where §6.3 amends a definition in §2.1(b)(iv), which was itself amended by a side letter you only see on page 312.
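The §6.3 example above is, at heart, an amendment-resolution problem: the effective text of a defined term is the original plus every later amendment, including ones buried in side letters. A minimal sketch of that idea, with invented section numbers and text (none of this is Mercurium's actual data model):

```python
# Toy illustration: resolving a chain of amendments to a defined term.
# Sections, texts, and sources are invented for the example.

definitions = {"2.1(b)(iv)": "Eligible Investor means a person under Annex A."}

# (section_amended, new_text, source), in document order
amendments = [
    ("2.1(b)(iv)", "Eligible Investor means a person under Annex A or B.", "§6.3"),
    ("2.1(b)(iv)", "Eligible Investor means a person approved in writing.", "side letter, p. 312"),
]

def effective_definition(section):
    """Apply every amendment targeting `section`, keeping an audit trail."""
    text = definitions[section]
    trail = ["original"]
    for target, new_text, source in amendments:
        if target == section:
            text = new_text
            trail.append(source)
    return text, trail

text, trail = effective_definition("2.1(b)(iv)")
print(trail)  # ['original', '§6.3', 'side letter, p. 312']
```

A reader that misses the side letter on page 312 returns the wrong effective definition, which is why cross-references have to be resolved, not just extracted.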
No off-the-shelf OCR, no generic LLM, no Document AI vendor handles that. They all quietly assume you will throw the messy 18% of documents back to a human. But the messy 18% is the whole job.
Three things that had to change
1. Layout as a first-class citizen
Text in isolation is almost useless. What matters is where the text sits: which column, which heading it's under, which table cell, which footnote. We rebuilt the layout model to preserve that structure in a canonical form — every paragraph knows its chapter, every number knows its table, every citation knows its target.
2. Images that mean something
A signature is not decoration — it's the compliance evidence. A corporate structure diagram is not decoration — it's the UBO chain. A stamp is not decoration — it's the regulator's authorisation. We train dedicated models to recognise and interpret these — not to discard them as "noise" like most pipelines do.
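The output contract this implies can be sketched as follows. The role names and the keyword rules in `classify_region` are toy stand-ins for the dedicated vision models described above — the real point is that every detected image region gets a compliance role instead of being dropped:

```python
# Sketch: tag non-text regions with a semantic role instead of discarding
# them. A real pipeline would run trained vision models; these keyword
# rules only illustrate the shape of the output.

COMPLIANCE_ROLES = ["signature", "stamp", "structure_diagram"]

def classify_region(label: str) -> str:
    for role in COMPLIANCE_ROLES:
        if role.split("_")[0] in label:
            return role
    return "decoration"

regions = ["red seal stamp", "director signature", "org structure chart", "logo"]
print([classify_region(r) for r in regions])
# ['stamp', 'signature', 'structure_diagram', 'decoration']
```

Only the last region is safe to ignore; the first three are evidence a compliance check has to consume.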
3. One universal format downstream
After the reader runs, every document lands in the same Mercurium Document Format — whether it came in as a scanned PDF, an encrypted .doc, or an XLSX. Everything downstream — classification, extraction, cross-coherence checks, RAG — consumes that single schema. One integration, one debugging surface, one place where quality is measured.
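A minimal sketch of the "one format in, one format out" idea — the names (`UniversalDoc`, `Block`, the stub readers) are assumptions for illustration, not the actual Mercurium Document Format:

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str      # "paragraph", "table", "image", ...
    text: str
    page: int

@dataclass
class UniversalDoc:
    source_format: str   # "pdf-scan", "doc", "xlsx", ...
    blocks: list

# One reader per input format; every reader emits the same UniversalDoc.
def read_xlsx_stub(rows):
    return UniversalDoc("xlsx", [Block("table", "\t".join(r), 1) for r in rows])

def read_pdf_stub(pages):
    return UniversalDoc("pdf-scan",
                        [Block("paragraph", t, i + 1) for i, t in enumerate(pages)])

def count_blocks(doc: UniversalDoc) -> int:
    # Downstream stages (classification, extraction, RAG) see only
    # UniversalDoc — never the original file format.
    return len(doc.blocks)

docs = [read_xlsx_stub([["a", "b"]]), read_pdf_stub(["p1 text", "p2 text"])]
print([count_blocks(d) for d in docs])  # [1, 2]
```

The payoff is exactly the one named above: downstream code is written once against one schema, and every quality question ("why did extraction miss this?") has a single place to look.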
What it unlocks
The biggest thing this buys us isn't accuracy on a leaderboard. It's the right to build everything else. Our cross-coherence checks, our adaptive questionnaire, our M&A Q&A — all of it only works because the foundation is solid. Without the rebuild, we'd be where our competitors are: stuck demoing 2-page invoices and explaining why anything more complex is "roadmap".
It was the hard path. It's also the only one that leads anywhere real.