In the race to deploy AI, many organizations sprint toward the finish line while neglecting the qualifying rounds. They pour resources into the newest models, prompt engineering, and fine-tuning, then wonder why their outputs hallucinate, carry heavy bias, or simply get things wrong. The answer, more often than not, lives upstream in the data pipeline.
Hi, I’m Brad Cooley, and I’m the Lead Senior Data Engineer here at Mutually Human. In honor of the 2026 Winter Olympic Games, and given how often this exact topic has been coming up in conversations with our clients, I started thinking about how the Medallion Architecture serves as the structural foundation not just for a good data platform, but for responsible AI. This article explores exactly that. By segmenting data concerns across Bronze, Silver, and Gold layers, organizations ensure their models are fed high-quality, curated data rather than the raw, unfiltered noise that so often leads to unreliable outcomes (and, honestly, disappointment). Whether you are building your first data platform or looking to close performance gaps in an existing one, understanding this architectural flow can be the difference between models that perform and models that mislead. I’m going to take us on a journey, using the Olympics as my framework for our discussions. Let’s begin, shall we?
I. The Opening Ceremony: Why Your Data Needs a Podium
There is a pattern we see repeatedly across clients and industries. An organization gets excited about AI, spins up a proof of concept using whatever data happens to be lying around, gets promising initial results, and then pushes toward production. Somewhere between the demo and the deployment, things fall apart. The model starts producing outputs that are inconsistent, biased, or, frankly, straight trash. More often than not, leadership loses confidence, hesitates to invest further resources, and the project stalls.
Funny thing is, the root cause is almost never the model itself. It is the data.
The current wave of enthusiasm around Large Language Models (LLMs) and generative AI has amplified this problem. Organizations are being told they need to “leverage AI” to remain competitive. FOMO is real, even in large orgs, and that newly created urgency creates a temptation to skip the foundational work. “It’s slow! We can build the foundation later! I just want to move fast!” But here is the reality: you cannot train a gold-medal model on bronze-level raw data. The discipline of the pipeline matters as much as the sophistication of the model.
Ever used ChatGPT, or seen examples, and had it always agree with you?
“Yes, Brad, that’s an excellent idea! You should always test in production because it is the fastest way to ensure your code works against real, validated, production data. Want me to go ahead and create a script for you to run in production?”
Yeah… about that. Bias is super real in most LLMs because of the training data used, but also because you can help reinforce the bias every time you respond with, “yes, that would be great. Create me a script to run in production!” This can get large organizations into a pickle, real fast.
Responsible AI, at its core, means building systems that are accurate, fair, and transparent. Accuracy requires clean, well-structured inputs. Fairness requires intentional curation that accounts for historical biases in the source data. Transparency requires lineage and traceability so that when a model produces an unexpected output, you can walk backward through the pipeline and understand why. The Medallion Architecture provides a framework for achieving all three.
II. The Qualifying Rounds: Understanding the Medallion Layers
The Medallion Architecture is a traditional multi-hop data design pattern that organizes data processing into three distinct layers, each with its own purpose and quality standards. The naming convention (Bronze, Silver, Gold) is more than a metaphor in this article. It comes from Databricks and reflects an intentional progression from raw, unrefined data to curated, business-ready datasets.
Bronze: The Raw Foundation
The Bronze layer is where data lands in its original form. This means ingesting data exactly as it arrives from source systems, whether from a flat file on an SFTP server, a JSON payload from a REST API, a CDC stream from a transactional database, or event data from an IoT sensor network. Nothing is transformed, filtered, or cleaned at this stage, and that is on purpose: we want to preserve the original format and let the layer act as an immutable historical record. The only caveat is that we tend to append ingestion metadata (timestamps, source identifiers, batch IDs) to records for easier cataloging and troubleshooting of data issues.
Here’s a somewhat typical Bronze-layer ingestion pattern that we would build out for a client:
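A minimal Snowflake-flavored sketch of what that ingestion might look like (the table, stage, and column names here are illustrative assumptions, not a specific client implementation):

```sql
-- Bronze: land the payload exactly as it arrives, plus ingestion metadata
CREATE TABLE IF NOT EXISTS bronze.claims_raw (
    raw_payload    VARIANT,       -- original JSON, no schema enforced
    _source_system STRING,
    _source_file   STRING,
    _batch_id      STRING,
    _ingested_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

COPY INTO bronze.claims_raw (raw_payload, _source_system, _source_file, _batch_id)
FROM (
    SELECT
        $1,                    -- the untouched JSON document
        'claims_sftp',         -- source identifier
        METADATA$FILENAME,     -- which staged file this record came from
        'batch-2026-02-11'     -- batch ID, normally supplied by the orchestrator
    FROM @claims_stage
)
FILE_FORMAT = (TYPE = 'JSON');
```

Databricks and Fabric have their own equivalents, but the principle is the same: one VARIANT-style column holding the untouched payload, plus metadata columns describing how and when it arrived.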
The key detail here is VARIANT (or its equivalent in your platform). The raw payload is stored without schema enforcement. We are not making assumptions about structure at this stage because those assumptions belong in the Silver layer, where they can be validated and documented.
This layer serves two critical purposes. First, as mentioned before, it acts as a system of record. If anything goes wrong downstream, we can always return to the Bronze layer and reprocess from the source of truth. Second, it preserves the full fidelity of the source data. Decisions about what to keep, transform, or discard happen in subsequent layers, where those decisions can be made deliberately and documented clearly.
For one of our clients in the healthcare space, the Bronze layer ingests over 200 distinct data feeds from claims systems, eligibility platforms, provider directories, and clinical data sources. Each feed has its own schema, cadence, and set of quirks. The Bronze layer absorbs all of that complexity without trying to reconcile it, which turned out to be essential when we later discovered discrepancies between source systems that would have been invisible if we had tried to normalize the data on ingest.
Silver: Refinement and Standardization
I consider the Silver layer to be where the real engineering happens. This is the cleansing, conforming, and enrichment phase. Data from the Bronze layer is deduplicated, null values are handled, schemas are enforced, data types are standardized, and business rules are applied. I like to look at it this way: if the Bronze layer is about preservation, the Silver layer is about precision.
Building on our Bronze layer example from before, here is a simplified example of a Silver-layer transformation that cleans and deduplicates our claims data:
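Sketched in Snowflake-flavored SQL, with field names assumed for illustration:

```sql
-- Silver: cleanse, conform, and deduplicate the raw claims
CREATE OR REPLACE TABLE silver.claims AS
SELECT
    claim_id,
    member_id,
    service_date,
    UPPER(TRIM(diagnosis_code)) AS diagnosis_code,  -- 'e11.9 ' -> 'E11.9'
    billed_amount,
    paid_amount,
    -- lineage columns back to Bronze
    _ingested_at        AS _source_ingested_at,
    _batch_id           AS _source_batch_id,
    CURRENT_TIMESTAMP() AS _silver_processed_at
FROM (
    SELECT
        raw_payload:claim_id::STRING            AS claim_id,
        raw_payload:member_id::STRING           AS member_id,
        raw_payload:service_date::DATE          AS service_date,
        raw_payload:diagnosis_code::STRING      AS diagnosis_code,
        raw_payload:billed_amount::NUMBER(12,2) AS billed_amount,
        raw_payload:paid_amount::NUMBER(12,2)   AS paid_amount,
        _ingested_at,
        _batch_id,
        -- keep only the most recently ingested version of each claim
        ROW_NUMBER() OVER (
            PARTITION BY raw_payload:claim_id::STRING
            ORDER BY _ingested_at DESC
        ) AS rn
    FROM bronze.claims_raw
)
WHERE rn = 1
  AND claim_id IS NOT NULL;
```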
A few things worth noting here:
- The ROW_NUMBER() window function handles deduplication by keeping only the most recently ingested version of each claim. This is pretty common dedupe logic, but just wanted to be explicit about what it’s doing.
- The UPPER(TRIM(…)) on diagnosis codes is a small but important normalization step. Without it, ‘E11.9’, ‘e11.9 ‘, and ‘ E11.9’ would all appear as distinct values to a downstream model, when they are the same code. Again, this type of standardization is a pretty normal data cleaning process, and I wanted to give an explicit example here.
- The lineage columns (_source_ingested_at, _source_batch_id, _silver_processed_at) maintain traceability back to the Bronze layer.
The transformations at this stage are designed to be deterministic, testable, and well-documented. Our data team at Mutually Human follows the idea of “every operation should be traceable.” We define that as:
- What changed?
- Why did it change?
- What rule governed the change?
To answer these questions, we lean heavily on data quality frameworks, with Great Expectations and Elementary Data being the two we implement most often. We use these frameworks because validating that the data meets defined expectations before it progresses to the next layer is crucial for documentation and observability.
Continuing with the examples, here’s how we would incorporate a data quality framework with our healthcare claims:
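In practice these checks would be expressed as Great Expectations expectations or Elementary tests; here is a plain-Python sketch of the same gate-check logic, with a simplified ICD-10 pattern and hypothetical field names, so the idea stands on its own:

```python
import re

# Simplified ICD-10 shape: one letter, two digits, optional decimal part.
# Real validation suites are stricter; this is illustrative only.
ICD10_PATTERN = re.compile(r"^[A-Z]\d{2}(\.\d{1,4})?$")

def validate_claims(rows):
    """Return a list of (claim_id, reason) failures; empty means the gate passes."""
    failures = []
    for row in rows:
        if not ICD10_PATTERN.match(row["diagnosis_code"]):
            failures.append((row["claim_id"], "diagnosis_code not ICD-10 formatted"))
        if row["paid_amount"] > row["billed_amount"]:
            failures.append((row["claim_id"], "paid_amount exceeds billed_amount"))
    return failures

claims = [
    {"claim_id": "C1", "diagnosis_code": "E11.9", "billed_amount": 100.0, "paid_amount": 80.0},
    {"claim_id": "C2", "diagnosis_code": "e11.9", "billed_amount": 100.0, "paid_amount": 120.0},
]
issues = validate_claims(claims)
# C2 fails both checks, so the pipeline would halt before promoting to Gold
```

A framework like Great Expectations adds the pieces this sketch omits: documentation of each expectation, run history, and alerting when a gate fails.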
These validations act as gate checks. If the diagnosis codes do not match ICD-10 formatting, or if paid amounts exceed billed amounts, the pipeline stops and surfaces the issue before corrupted data reaches the Gold layer. Without these checks, bad data flows downstream silently, and, in the context of AI and machine learning, the model ends up learning the noise.
Gold: Business-Ready, Purpose-Built
Ah, the Gold layer. I’ve heard a wide range of meanings behind the term “gold layer” or “gold data.” Most of them are valid given specific business use cases and semantics. However, because there are so many definitions out there, let’s align on what it means for this article. The Gold layer contains aggregated, denormalized, and purpose-built datasets. These are not general-purpose tables; they are constructed to serve specific business outcomes. A Gold-layer dataset for a recommendation engine will look very different from one built for financial forecasting, even if they draw from the same Silver-layer sources.
Rounding out the healthcare claims example:
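A sketch of a purpose-built Gold table for a member-level model, again with assumed names and a hypothetical 12-month feature window:

```sql
-- Gold: aggregated, denormalized features for one specific model
CREATE OR REPLACE TABLE gold.member_claims_features AS
SELECT
    member_id,
    COUNT(DISTINCT claim_id)       AS claim_count_12m,
    SUM(paid_amount)               AS total_paid_12m,
    AVG(paid_amount)               AS avg_paid_per_claim_12m,
    COUNT(DISTINCT diagnosis_code) AS distinct_diagnoses_12m,
    MAX(service_date)              AS most_recent_service_date,
    CURRENT_TIMESTAMP()            AS _gold_built_at
FROM silver.claims
WHERE service_date >= DATEADD(month, -12, CURRENT_DATE())
GROUP BY member_id;
```

A forecasting use case drawing on the same Silver table would pick entirely different aggregations and dimensions, which is exactly the point: Gold tables are built per outcome, not per source.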
This is where data engineering and data science converge. The decisions about what to aggregate, how to define features, and which dimensions to include are informed by the specific requirements of the model. The result is a dataset that is not only clean but contextually appropriate, containing exactly what the model needs and nothing it does not.
III. The Training Diet: Data Segmentation and Model Feeding
The relationship between data quality and model performance is not linear, in my opinion; it is exponential. A model trained on clean, well-structured data will not just perform marginally better than one trained on raw data; it will perform categorically better, with fewer hallucinations, lower compute costs, and more predictable behavior. It is widely known that, on average, the better your data and the more of it you have, the better the model will perform.
One of the lesser-known things about LLMs is that when one hallucinates, it is not making a creative leap. Fundamentally, LLMs are rooted in pattern-matching, so a hallucination is simply the model pattern-matching against noise. If the training data contains duplicated records, the model over-indexes on those patterns. If the data contains inconsistent labels, the model learns inconsistent associations. If the data includes irrelevant dimensions, the model wastes capacity trying to find signal where there is none.
To make this concept concrete, consider the difference between feeding a model directly from Bronze versus Gold. Sure, it’s easy to say, “Bronze is messy, Gold is not,” and that’s true, but there’s more to it. The key is what is not in the Gold dataset: no duplicate claims inflating frequency signals, no null values forcing imputation guesswork, no inconsistent diagnosis codes splitting what should be a single feature into dozens of noisy variants. A model trained on Gold data will train faster, generalize better, and produce outputs that make sense and align with what the business needs. It also makes buy-in from leadership easier when both the technical stakeholder and the business stakeholder can trust the output. We’ve seen this play out across multiple client engagements. In one case, a predictive analytics model that had been producing inconsistent results improved its accuracy by over 15% after we rebuilt the data pipeline with proper medallion layering and separation of concerns. The data model itself did not change; only the architecture around it did.
IV. Avoiding the Penalty Box: Responsible AI and Bias
I slightly deviated from the traditional Olympics theme, but hockey is a part of the games, so I think it’s valid. Regardless, the penalty box in the context of this conversation is not a literal one. It still is not a fun place to be, and if you find yourself in it for too long, your competitors might score.
The Bias Problem in Practice
Bias in machine learning is not an abstract concern. It has been around since long before the rise of LLMs, and it should be treated as a concrete, measurable problem that produces real-world harm. It almost always originates in the data, not in the model architecture. We’ll see in a moment how adopting the Medallion Architecture can help with bias, but let’s talk about some history first.
Historical data reflects the world as it was, not necessarily as it should be. Lending data may encode decades of discriminatory practices. Healthcare data may underrepresent certain demographics. Hiring data may reflect systemic preferences that have nothing to do with candidate quality. As we covered before, when a model is trained on this data without intentional intervention, it does not correct these biases; it amplifies them.
This is true for Large Language Models, but it is equally true for traditional machine learning applications like predictive analytics, classification, and regression. LLMs have received the most attention recently not only because of their popularity, but also because their outputs are human-readable, which makes their biases more visible. But a credit scoring model that systematically disadvantages certain zip codes, or a claims adjudication model that denies coverage at disproportionate rates for specific demographics, is no less important just because its bias is harder to see. Oftentimes, these types of models can cause more harm.
Statistical Bias Detection Across the Medallion Layers
The Medallion Architecture provides natural intervention points for identifying and mitigating bias, and the most effective approach is to implement detection at every layer rather than treating it as a final checkpoint.
Bronze Layer
Jumping back to our Bronze healthcare claims table, the primary concern in this layer is representational completeness. Before any transformation occurs, we should profile the incoming data to understand its demographic distribution:
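A sketch of such a profile in Snowflake-flavored SQL; the demographic attributes in the payload are assumptions for illustration:

```sql
-- Profile demographic distribution of raw claims before any transformation
SELECT
    raw_payload:member_sex::STRING      AS member_sex,
    raw_payload:member_age_band::STRING AS member_age_band,
    COUNT(*)                            AS record_count,
    -- share of the total feed, so underrepresentation is visible at a glance
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM bronze.claims_raw
GROUP BY 1, 2
ORDER BY record_count DESC;
```

Running the same profile partitioned by ingestion month is a cheap way to spot the temporal gaps mentioned below, where collection practices changed over time.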
Are certain groups underrepresented? Are there temporal gaps where data collection practices changed? Simple distribution analysis at this stage can surface problems that become invisible once the data is aggregated.
Silver Layer
At the Silver layer, bias detection becomes more rigorous. This is where we can implement stronger statistical tests to measure disparate impact across protected attributes:
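As a sketch, here is a four-fifths-style check in plain Python; the age-band attribute and the “kept after dedup” flag are hypothetical stand-ins for whatever your Silver transforms actually do to each record:

```python
def selection_rates(records, group_key, selected_key):
    """Per-group rate at which records are 'selected' (e.g., survive dedup)."""
    totals, selected = {}, {}
    for r in records:
        g = r[group_key]
        totals[g] = totals.get(g, 0) + 1
        selected[g] = selected.get(g, 0) + (1 if r[selected_key] else 0)
    return {g: selected[g] / totals[g] for g in totals}

def four_fifths_check(rates):
    """Flag groups whose rate falls below 80% of the best-off group's rate."""
    best = max(rates.values())
    return {g: (rate / best) >= 0.8 for g, rate in rates.items()}

# Synthetic example: dedup keeps 90% of one group but only 60% of another
records = (
    [{"age_band": "18-34", "kept_after_dedup": True}] * 90
    + [{"age_band": "18-34", "kept_after_dedup": False}] * 10
    + [{"age_band": "65+", "kept_after_dedup": True}] * 60
    + [{"age_band": "65+", "kept_after_dedup": False}] * 40
)
rates = selection_rates(records, "age_band", "kept_after_dedup")
passes = four_fifths_check(rates)
# 0.60 / 0.90 ≈ 0.67 < 0.8, so the 65+ group fails the four-fifths check
```

In a real pipeline this check would run as part of the Silver-layer quality gates, alongside the schema and business-rule validations.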
The four-fifths rule, commonly used in employment law to assess adverse impact, can be adapted here to evaluate whether transformation logic introduces or amplifies disparities. If a deduplication algorithm disproportionately removes records from a particular demographic, that is a signal worth investigating.
Gold Layer
At the Gold layer, the focus shifts to fairness metrics that are specific to the model’s intended use case:
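A plain-Python sketch of two of these metrics on hypothetical model outputs: demographic parity, plus the per-group true/false-positive rates that equalized odds compares. The prediction, label, and group fields are assumptions for illustration:

```python
def demographic_parity(preds):
    """Positive-prediction rate per group."""
    out = {}
    for p in preds:
        g = p["group"]
        n, pos = out.get(g, (0, 0))
        out[g] = (n + 1, pos + (1 if p["pred"] == 1 else 0))
    return {g: pos / n for g, (n, pos) in out.items()}

def true_false_positive_rates(preds):
    """Per-group TPR and FPR, the two rates equalized odds compares."""
    stats = {}
    for p in preds:
        g = p["group"]
        s = stats.setdefault(g, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if p["label"] == 1:
            s["tp" if p["pred"] == 1 else "fn"] += 1
        else:
            s["fp" if p["pred"] == 1 else "tn"] += 1
    return {
        g: {
            "tpr": s["tp"] / max(s["tp"] + s["fn"], 1),
            "fpr": s["fp"] / max(s["fp"] + s["tn"], 1),
        }
        for g, s in stats.items()
    }

preds = [
    {"group": "A", "pred": 1, "label": 1},
    {"group": "A", "pred": 0, "label": 0},
    {"group": "B", "pred": 1, "label": 1},
    {"group": "B", "pred": 1, "label": 0},
]
parity = demographic_parity(preds)        # A: 0.5, B: 1.0
rates = true_false_positive_rates(preds)  # B's FPR of 1.0 flags a disparity
```

Calibration follows the same shape: bucket predictions by predicted probability and compare the observed positive rate per group within each bucket.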
Equalized odds ensures that true-positive and false-positive rates are consistent across groups. Demographic parity ensures that positive prediction rates are similar across groups. Calibration ensures that a predicted probability of 70% corresponds to an actual rate of approximately 70% across all subgroups. The specific metric that matters depends on the use case. In healthcare, calibration might be paramount. In lending, equalized odds might take priority. Regardless of industry, what matters about statistical testing in the Gold layer is that the prioritization is deliberate and documented.
Data Governance as the Through-Line
The biggest caveat with all of this is that bias detection is only as useful as the governance framework it’s tied to. Connected to a governance framework that can act on what it finds (or at least surface concerns appropriately), a detection framework has far greater impact than one whose findings are logged and left to die. This also means that data lineage needs to be baked into every layer of the architecture, not bolted on after the fact.
When a model produces a biased output, the first question (besides “why does my model suck”) is always “where did this come from?” If you can trace the output back through the Gold layer’s aggregation logic, through the Silver layer’s transformation rules, and all the way back to the specific Bronze-layer records that contributed to the result, you can pinpoint where the bias was introduced and intervene precisely. Without that lineage, we are left guessing, and from experience, guessing at scale is not a great strategy.
Lineage also enables auditability, which is increasingly a regulatory requirement. Industries like healthcare, financial services, and insurance are subject to compliance frameworks that require organizations to demonstrate that their automated decision-making systems are fair and explainable. A well-implemented Medallion Architecture, with lineage and governance built into each layer, provides the documentation trail that these frameworks demand.
V. Finding Your Lane: Two Paths to Success
Path A: Building from the Ground Up
If your organization has not yet adopted the Medallion Architecture, the good news is that you do not need to build all three layers simultaneously. Start with Bronze. Getting your raw data into a centralized, append-only store with proper ingestion metadata is the single highest-leverage thing you can do. It creates the foundation that everything else builds on, and it eliminates the most common failure mode we see: teams trying to transform data on ingest and losing the ability to reprocess when requirements change.
Once Bronze is stable, introduce Silver-layer transformations incrementally. Pick a single domain or data source, define your quality standards, implement your testing framework, and build the transformation pipeline. Prove the pattern works, then expand. Trying to boil the ocean by building Silver-layer transformations for every data source at once is a recipe for stalled projects, frustrated teams, and very annoyed stakeholders.
Gold-layer datasets should be built on demand, driven by specific business or model requirements. Do not try to pre-build Gold-layer datasets for use cases that do not yet exist. The aggregation and feature engineering logic is inherently tied to the problem being solved, and building it speculatively leads to datasets that are either too generic to be useful or too specific to be reusable. Plus, it muddies your environment and can bloat your data catalog.
Path B: Closing Gaps in an Existing Architecture
If you are already running a Medallion Architecture, the most common gaps we see with clients are in the Silver-to-Gold transition and in the handling of edge cases that were not anticipated during the initial implementation.
Late-arriving data is a classic example. Many organizations build their pipelines with the assumption that data arrives in order and on schedule. When that assumption breaks, as it inevitably does, the pipeline either fails silently (producing incomplete Gold-layer datasets) or fails loudly (throwing errors that require manual intervention). Designing for late arrivals means implementing idempotent processing, watermarking strategies, and reconciliation logic that can gracefully handle data that shows up hours, days, or even weeks after it was expected.
Imagine healthcare claims arriving late. Surely that would never happen. But considering it frequently does, let’s design for that based on our previous examples:
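A hedged, Snowflake-flavored sketch with illustrative column names; a real implementation would carry more columns and error handling:

```sql
-- High-water mark: how far we have already processed into Silver
SET last_watermark = (
    SELECT COALESCE(MAX(_source_ingested_at), '1900-01-01'::TIMESTAMP_NTZ)
    FROM silver.claims
);

MERGE INTO silver.claims AS tgt
USING (
    -- Same cleanse/dedup logic as the Silver transform, limited to new arrivals
    SELECT
        raw_payload:claim_id::STRING                    AS claim_id,
        UPPER(TRIM(raw_payload:diagnosis_code::STRING)) AS diagnosis_code,
        raw_payload:paid_amount::NUMBER(12,2)           AS paid_amount,
        _ingested_at                                    AS _source_ingested_at
    FROM bronze.claims_raw
    WHERE _ingested_at > $last_watermark
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY raw_payload:claim_id::STRING
        ORDER BY _ingested_at DESC
    ) = 1
) AS src
ON tgt.claim_id = src.claim_id
-- Only overwrite when the incoming record is genuinely newer
WHEN MATCHED AND src._source_ingested_at > tgt._source_ingested_at THEN UPDATE SET
    diagnosis_code      = src.diagnosis_code,
    paid_amount         = src.paid_amount,
    _source_ingested_at = src._source_ingested_at
WHEN NOT MATCHED THEN INSERT
    (claim_id, diagnosis_code, paid_amount, _source_ingested_at)
    VALUES (src.claim_id, src.diagnosis_code, src.paid_amount, src._source_ingested_at);
```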
The MERGE pattern here is idempotent. Running it twice with the same data produces the same result. The watermark (last_watermark) ensures we only process new arrivals, and the MATCHED condition checks whether the incoming record is actually newer than what we already have. This handles both the happy path (new records arriving on time) and the messy path (corrections, restatements, and late arrivals) without requiring manual intervention. This is a simplified example of course, and other factors can come into play, but I think this serves as a good starting point.
Real-time streaming adds another layer of complexity. Batch-oriented Medallion implementations are relatively straightforward. Applying the same layering pattern to streaming data requires careful consideration of windowing strategies, exactly-once processing guarantees, and how to maintain consistency between streaming and batch pipelines that feed the same Gold-layer datasets.
Across our engagements, we have implemented these patterns on a range of platforms including Snowflake, Databricks, Microsoft Fabric, and combinations thereof. The architectural principles are platform-agnostic, but the implementation details matter, and getting them right often requires experience with the specific tooling and its constraints.
VI. The Medal Ceremony: Conclusion
Whew, we made it. Let’s recap.
The Medallion Architecture is not new, and it is not complicated in concept. Bronze ingests. Silver refines. Gold delivers. What makes it powerful is not the pattern itself, but the discipline it imposes on the data lifecycle. That discipline, the insistence on separating concerns, enforcing quality at each layer, and maintaining lineage throughout, is exactly what Responsible AI requires. As AI experimentation and adoption move at such a rapid pace, we believe the frameworks and systems that enable Responsible AI should already be in place.
Models are only as good as the data that feeds them. A structured architecture should not be seen as an obstacle to AI adoption. Instead, we like to view it as an accelerator and the simplest pathway to Responsible AI. It enables AI adoption that is reliable, fair, and auditable. I hope this deep dive helped you understand where you or your team sits along this journey and provided some useful insights that you can implement immediately. If you’re working through these same challenges, this is the kind of data platform and AI foundation work we help design and implement every day at Mutually Human. If you’d like to explore how these principles translate into your own architecture, we’d welcome the opportunity to connect.