Machine-Readable Information Backbone

The quiet power of reference data for machine-readability

5 May 2026
9 mins
Governed, shared reference data is the stable vocabulary your information backbone speaks in. Without it, every system upstream and downstream (yes - including AI) is guessing, and you cannot achieve machine readability.

In Post 4, I defined what an information backbone is - the shared, governed, model-centric, machine-readable place where meaning lives, and the foundation from which everything else in your information estate should be built.

In this post, I want to go one level deeper into something that tends to get treated as a technical detail, handed off to a data team, and quietly forgotten until something breaks.

That something is "Reference Data" (aka "Refdata"), and if the data model is the structure of your information backbone, reference data is the language it speaks in. Get it wrong - or fail to govern it - and the structure holds, but the meaning falls apart.

Bottom line up front:

Reference data is the shared, governed vocabulary that gives your information its meaning. Without it, your information backbone has structure but no stable language - and every system downstream is guessing.

The problem most people don’t see coming

By the time a COO is thinking seriously about an information backbone, they have usually understood the core argument: machine readability is a meaning problem, not a format problem. They have thought about the model - the entities, the relationships, the governed structure. They are ready to invest in getting that right.

And then someone mentions "reference data", and it sounds like a technical detail. A configuration task. Something for the data team to sort out once the architecture is agreed.

The instinct is to delegate reference data as a detail and move on to more visible product challenges.

That instinct is wrong. Deprioritizing reference data will damage your operations later.

Reference data is not a configuration task; it is the vocabulary the backbone speaks in. Without it, the model has structure but no stable language, and you will not achieve machine readability.

What reference data actually is

Your information backbone holds structure: a food product has ingredients, ingredients have allergen statuses, allergens have regulatory classifications. That structure tells you how things relate to each other. But it does not, by itself, tell you what those things mean.

Reference data is what fills that gap. It is the agreed, governed list of authorized values - the controlled vocabulary - that the information backbone draws from when it needs to express what a thing is, precisely.

When your system records that a product contains a specific substance, reference data is what defines that substance: its authorized name, its code, its classification, its regulatory status. When your system records a unit of measurement, reference data is what ensures that “gram” in your system means the same gram that your partner’s system, your regulatory authority, and your consumer app all recognise.

Without reference data, a substance code is just a string of characters. With governed reference data, it maps to a defined entity with known properties and known relationships to everything else in the model.

Reference data is the shared dictionary. The agreed-upon terms. The list that makes interoperability real rather than theoretical.

NB: You may also hear reference data referred to as "taxonomy", "controlled vocabularies", "code sets", "lookup data", and more.
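To make this concrete, here is a minimal sketch of what one governed vocabulary entry might hold, mirroring the properties described above (authorized name, code, classification, regulatory status). The class name, field names, and every code and value are invented for illustration; they are not from any real scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RefDataConcept:
    """One governed entry in a controlled vocabulary."""
    code: str               # stable identifier that other systems reference
    authorized_name: str    # the single agreed display name
    classification: str     # position in the governing classification scheme
    regulatory_status: str  # status under the relevant regulation
    synonyms: tuple = ()    # alternate labels that resolve to this concept

# A governed substance entry (all values illustrative, not real codes):
WHEAT_FLOUR = RefDataConcept(
    code="ING-0042",
    authorized_name="Wheat flour",
    classification="Cereal > Wheat",
    regulatory_status="Allergen: gluten-containing cereal",
    synonyms=("flour, wheat", "soft wheat flour"),
)
```

The point of the sketch is the shape, not the values: a substance code stops being "just a string of characters" the moment it resolves to an entry like this, with known properties every consumer can rely on.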

A concrete example: a food product label

A food manufacturer wants to produce a true digital product label - machine-readable, regulatory-compliant, and interoperable with supply chain systems, retail platforms, and consumer apps.

The product contains wheat flour. Simple enough. But consider what “wheat flour” means across the systems that need to consume it. The picture could look something like this:

  1. The manufacturer’s internal system records it under a product code used since the ERP was implemented in 2009.
  2. The supplier’s system uses a different code from their own catalog.
  3. The retailer’s platform expects it expressed against a standard ingredient taxonomy.
  4. The regulatory authority requires it mapped to an allergen classification scheme.
  5. A consumer app wants to display it in plain language with an allergen flag.

Without governed reference data, someone in the middle - almost always a subject matter expert, often the same one who has been doing it for years - manually reconciles those five representations every time the question is asked. They carry the mapping in their head. It works, mostly, until they are not there to add the meaning.

With shared, governed reference data, that reconciliation has already been done once, in the reference data. Wheat flour is wheat flour. The allergen status is the allergen status. The classification is the classification. Every system draws from the same governed source, and nobody downstream has to guess.

That is the operational value of reference data. Not glamorous. But the difference between a system that works and one that depends on a person who might not be there when the meaning needs to be rendered.
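The "reconciliation done once" idea can be sketched as a simple lookup: each consuming system's local code is mapped, once, to the canonical concept, and after that nobody reconciles by hand. The system names and every code below are invented for illustration.

```python
# Each consuming system's local code, mapped once to the canonical concept.
# (System names and all codes are hypothetical.)
CODE_MAPPINGS = {
    ("erp", "MAT-1987-FLR"): "ING-0042",              # manufacturer's ERP code
    ("supplier", "SUP-WF-01"): "ING-0042",            # supplier catalog code
    ("retailer", "tax:cereal.wheat.flour"): "ING-0042",  # retail taxonomy
    ("regulator", "ALG-GLUTEN-WHEAT"): "ING-0042",    # allergen scheme
    ("consumer_app", "wheat flour"): "ING-0042",      # plain-language label
}

def resolve(system: str, local_code: str) -> str:
    """Return the canonical concept code, or fail loudly if unmapped."""
    try:
        return CODE_MAPPINGS[(system, local_code)]
    except KeyError:
        raise LookupError(f"No governed mapping for {local_code!r} in {system}")

canonical = resolve("erp", "MAT-1987-FLR")  # → "ING-0042"
```

Note the failure mode: an unmapped code raises an error rather than letting a guess through, which is exactly the behaviour the subject matter expert in the middle was silently providing.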

Why “governed” is the critical word

Reference data is only as good as its maintenance. An out-of-date reference dataset does not simply fail to help - it actively produces wrong answers with apparent confidence. A substance classified under a regulatory scheme that was revised eighteen months ago is not a neutral error. It is a compliance failure that looks like a correct output.

Governance for reference data means knowing, for every dataset your backbone uses: who owns it, how updates are managed, how changes propagate to the systems that depend on it, and what happens when a standard changes.

Some reference data comes from external sources - regulatory bodies, industry standards organizations, classification schemes maintained by third parties. Governance for those datasets means tracking when they are updated, deciding which changes are adopted, and having a reliable mechanism for propagating those changes through your systems. Some reference data is organization-specific - internal product codes, proprietary classifications, domain-specific terms and workflow controls that exist nowhere else. Governance for those datasets means clear ownership, a defined process for adding or retiring values, and version control that lets you trace when a value changed and why.

Going back to the earlier digital food label example, you may decide to invest in adding synonyms, images, descriptions, localized translations, and more against that singular "wheat flour" concept in order to make it findable, usable and disambiguated in all the required workflow, rendering and integration scenarios. All of this is an ROI decision for your product.

The backbone is not static. Reference data evolves - regulations change, standards are updated, organizations restructure their product portfolios. The governance model needs to support that evolution without requiring a crisis program every time a code changes.
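What "version control that lets you trace when a value changed and why" might look like in its simplest form is an append-only change log, with a named owner and a reason on every entry. This is an illustrative sketch only; the structure, field names, and example values are all invented.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RefDataChange:
    """One auditable change to a governed reference value."""
    concept_code: str
    field_changed: str
    old_value: str
    new_value: str
    effective: date
    owner: str    # the named owner who approved the change
    reason: str   # why — e.g. the regulation revision that triggered it

CHANGE_LOG: list[RefDataChange] = []

def record_change(change: RefDataChange) -> None:
    CHANGE_LOG.append(change)

def history(concept_code: str) -> list[RefDataChange]:
    """Trace when a value changed, who approved it, and why."""
    return [c for c in CHANGE_LOG if c.concept_code == concept_code]

# A hypothetical reclassification driven by a revised external scheme:
record_change(RefDataChange(
    concept_code="ING-0042",
    field_changed="regulatory_status",
    old_value="Allergen: cereal (2019 scheme)",
    new_value="Allergen: gluten-containing cereal (2024 scheme)",
    effective=date(2024, 7, 1),
    owner="regulatory.affairs",
    reason="Classification scheme revised by the issuing authority",
))
```

The design point is that every entry answers the governance questions in one place: what changed, when it takes effect, who owns it, and what prompted it.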

Where reference data comes from - and a word of warning

Most COOs will reasonably ask: do we have to build all of this from scratch? The answer is no. There is a rich ecosystem of existing reference datasets - industry classification schemes, regulatory substance lists, standard unit definitions, recognized allergen codes, geographic standards, currency lists. The skill is not in creating reference data from scratch but in selecting the right external sources, adopting them deliberately, and governing the subset that is specific to your organization.

This is where a very common mistake happens - and it happens for understandable reasons.

Wikidata is a good example. It is genuinely impressive: a massive, open, well-structured knowledge graph covering an enormous range of concepts, with rich properties for each one. It is free, well-maintained, and widely used. The instinct to reach for it is sensible. The problem is that organizations often load in far more than they need - thousands of concepts when they need dozens, and dozens of properties per concept when they need three or four. The overhead of governing all of that, keeping it current, and understanding what you have actually imported quickly becomes unmanageable.

The principle that applies here is the same one that applies to the backbone model itself: load just what you need to answer the questions you will actually ask of your data. If your food product label needs to reference allergen classifications, import the allergen classification scheme - don't be tempted by other surrounding reference data. If your regulatory submission needs substance codes, adopt the relevant regulatory list - not every substance ever identified.

You will need to govern whatever you import. Govern only what earns its place.
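The "load just what you need" principle can be sketched as a deliberate filter applied at import time: name the concepts and properties your questions require, and adopt only those. The toy external dataset below (with Wikidata-style identifiers) is entirely invented for illustration.

```python
def select_subset(external_dataset: dict, needed_codes: set, needed_properties: set) -> dict:
    """Keep only the concepts and properties your questions actually require."""
    return {
        code: {p: props[p] for p in needed_properties if p in props}
        for code, props in external_dataset.items()
        if code in needed_codes
    }

# A toy stand-in for a large external source — in reality thousands of
# concepts with dozens of properties each (all values invented here):
EXTERNAL = {
    "Q-AAA": {"label": "wheat flour", "allergen": "gluten",
              "density": "0.59", "history": "...", "image": "..."},
    "Q-BBB": {"label": "granite", "hardness": "6-7"},
}

ADOPTED = select_subset(
    EXTERNAL,
    needed_codes={"Q-AAA"},                   # only concepts we will reference
    needed_properties={"label", "allergen"},  # only properties we will query
)
# ADOPTED == {"Q-AAA": {"label": "wheat flour", "allergen": "gluten"}}
```

Everything that passes the filter is something you have committed to govern; everything excluded is governance overhead you have declined to take on.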

The unglamorous work that unlocks everything else

Reference data is the work that determines whether everything else holds together. Whether the AI outputs are trustworthy. Whether the regulatory submissions are defensible. Whether the supply chain systems actually agree on what they are talking about. Whether the subject matter expert who retires next year takes institutional knowledge with them - or leaves it in the model.

The data model (or "schema", or "ontology") is your structure; reference data is your language. You need both.

The organizations that treat reference data governance as a strategic asset - rather than a technical chore - are the ones whose information backbones actually deliver on the promise of machine-readability.

Simple Steps for COOs

  1. Ask where your authoritative definitions actually live. For any critical classification in your product or regulatory information - substance codes, allergen statuses, material categories - ask: is this defined once, in a governed model that every system draws from? Or is it defined in a person’s head, a spreadsheet, or a legacy system that nobody fully controls? The answer tells you where your reference data risk lies.
  2. Treat reference data adoption as a scoping decision, not a loading exercise. When evaluating external reference datasets or ontologies, resist the instinct to import everything available. Define the specific questions your backbone needs to answer, and import only the concepts and properties required to answer them. Govern that subset rigorously before expanding.
  3. Make reference data ownership explicit before you build. For every reference dataset your backbone will use - internal or external - a) name an owner, b) define an update process, c) establish how changes propagate, and importantly d) make sure your experts are happy with their reference data management tools.

I hope this helps. Please do get in touch with any questions.

Next in this series:

Post 6 - AI observability in high-trust environments starts with a governed information backbone - If your AI outputs can't be checked against a governed reference, you don't have observability - you have review. And in high-trust environments, review doesn't scale.

Post 7 - What should a real information backbone look like? Seven characteristics to look for - When you take a closer look at an information backbone, what should you see? The answer is a solid set of characteristics, focused on its foundational purpose.

Previously in The COO’s Machine-Readable Information Backbone series:

Bonus Post: Build or buy your information backbone? Why the true cost of building a governed information backbone for a high-trust environment is almost always underestimated - and what that means for your build vs buy decision.

Post 4: What is an information backbone? A plain-language definition for operational leaders - written for organizations that already have systems, already have data, and are still asking why none of it feels reliable.

Post 3: Why AI needs a governed information backbone - not just better prompts. In regulated and high-trust environments, AI reliability isn’t a model problem. It’s a foundation problem.

Post 2: Machine-readable information architecture is better for your people too - better information architecture foundations improve the experience of the humans who work with product data every day.

Post 1: What does “machine-readable” really mean for digital product labels? Machine readability is a meaning problem, not a format problem.
