AI Strategy, Data science & AI, Software development · Jun 19, 2026

Why Your ML Data Pipeline Is About to Look Nothing Like It Did Last Year

VECTOR Labs Team

Last updated on: Jun 23, 2026

The architectural assumptions that shaped most enterprise ML data pipelines between 2019 and 2024 were built around a specific set of constraints: CPU-bound transformation, batch-oriented ETL, and storage systems designed for analytical queries rather than model-serving workloads. Those constraints are dissolving faster than most infrastructure roadmaps account for. The convergence of GPU-native data processing, millisecond-latency lakehouse query engines, and object-level metadata at S3 scale is not a gradual evolution of existing patterns. It requires retiring foundational design decisions, not patching them.

The CPU-Centric ETL Assumption Is Breaking

For most of the last decade, the standard ML data pipeline moved data through a sequence of CPU-bound transformation stages: ingest, clean, featurise, store, serve. This made sense when the bottleneck was storage I/O and when model training happened infrequently enough that pipeline latency was acceptable. The problem is that neither condition holds at the workload volumes that characterise 2025 and 2026 production AI systems. Training runs now consume terabytes of preprocessed data per hour, and the CPU transformation layer has become the rate-limiting step in pipeline throughput, not the GPU cluster it feeds.

The mechanism is straightforward. When a 512-GPU training job waits on a data loader that serialises preprocessing across CPU threads, GPU utilisation can fall below 40 percent during data-intensive phases. This is not a tuning problem. It is a structural mismatch between the processor the transformation layer was designed for and the consumption rate of the downstream compute. NVIDIA's RAPIDS ecosystem, and the broader shift toward GPU-accelerated DataFrame operations with cuDF, address this by running transformations in GPU memory alongside the training process, which removes the host-side preprocessing stage and the repeated host-to-device copies of preprocessed batches over PCIe that come with it.
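
The mismatch can be put in back-of-envelope terms. The sketch below is a deliberately simple model, not a profiler: it assumes utilisation is capped by the ratio of the rate at which the loader delivers samples to the rate at which the GPUs can consume them. The function name and the throughput figures are hypothetical.

```python
def gpu_utilisation(loader_samples_per_s: float, gpu_samples_per_s: float) -> float:
    """Upper bound on GPU utilisation when GPUs can only work as fast as data arrives."""
    if gpu_samples_per_s <= 0:
        raise ValueError("gpu_samples_per_s must be positive")
    return min(1.0, loader_samples_per_s / gpu_samples_per_s)


# Hypothetical figures: the cluster could consume 50,000 samples/s,
# but the CPU-bound loader delivers only 18,000 samples/s.
cpu_bound = gpu_utilisation(18_000, 50_000)
print(f"{cpu_bound:.0%}")  # 36%
```

Under this model, no amount of GPU-side tuning raises utilisation until the delivery rate itself increases.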

What GPU-Native Data Processing Actually Changes

The practical implication of GPU-native preprocessing is not just faster pipelines. It changes which transformations are economically viable at training time versus what must be precomputed and materialised. With CPU-bound ETL, teams routinely precompute and store derived features because on-the-fly transformation during training was prohibitively slow. GPU-native processing shifts that trade-off: complex augmentations, tokenisation at scale, and embedding generation can now happen within the training loop without degrading GPU utilisation, which means the feature store layer becomes less necessary for a class of workloads where it previously served as a performance compensator.

This has a direct cost implication. Feature stores carry non-trivial operational overhead: schema management, freshness guarantees, backfill jobs, and the engineering time to maintain consistency between offline and online feature definitions. Where GPU-native preprocessing can replace precomputed feature materialisation, that overhead disappears. The trade-off is compute cost per training run, which increases, but for teams paying for reserved GPU capacity this is often preferable to maintaining a parallel infrastructure layer.
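
A rough way to frame that trade-off is a monthly comparison between the cost of maintaining materialised features and the extra GPU time spent transforming on the fly. The sketch below is illustrative only: every figure is hypothetical, and for reserved capacity that would otherwise sit idle the marginal GPU-hour cost may be close to zero.

```python
def monthly_saving(feature_store_cost: float, extra_gpu_hours_per_run: float,
                   gpu_hour_cost: float, runs_per_month: int) -> float:
    """Positive result: on-the-fly preprocessing is cheaper than materialising features."""
    extra_compute = extra_gpu_hours_per_run * gpu_hour_cost * runs_per_month
    return feature_store_cost - extra_compute


# Hypothetical inputs: 12,000/month to run the feature layer, 40 extra GPU-hours
# per training run at 2.50 per GPU-hour, 30 runs per month.
saving = monthly_saving(12_000, 40, 2.50, 30)
print(saving)  # 9000.0
```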

Real-Time Lakehouse Serving and the End of the Lambda Architecture

The Lambda architecture, which maintains separate batch and streaming pipelines to serve both historical and real-time data, was a reasonable response to the limitations of first-generation data lake query engines. Hive-style query latency made it impractical to serve model features directly from the lake at inference time, so teams built Redis or DynamoDB serving layers on top. Open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, combined with engines like DuckDB running against columnar object storage and transport protocols like Apache Arrow Flight SQL, have reduced point-query latency on lakehouse tables to the single-digit to low-double-digit millisecond range under appropriate partitioning strategies.

The commercial implication is that the Lambda architecture's operational complexity is no longer justified for a growing proportion of ML serving workloads. Maintaining two code paths, two consistency models, and two sets of freshness SLAs doubles the surface area for data quality failures at inference time. When the batch layer and the serving layer diverge, which they do under any non-trivial write volume, model inputs at inference differ from the distribution the model was trained on. Collapsing to a single lakehouse serving layer with real-time ingestion via Apache Kafka or AWS Kinesis feeding directly into Iceberg tables removes that divergence structurally, rather than attempting to manage it operationally.
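
Teams that still run both paths can at least measure the divergence. The sketch below is a minimal check, assuming each path can be sampled into a mapping from entity key to feature value; the function name, tolerance, and sample data are hypothetical.

```python
import math


def skew_rate(batch: dict[str, float], serving: dict[str, float],
              rel_tol: float = 1e-6) -> float:
    """Fraction of shared entity keys whose feature value differs between the two paths."""
    shared = batch.keys() & serving.keys()
    if not shared:
        return 0.0
    mismatched = sum(
        1 for key in shared
        if not math.isclose(batch[key], serving[key], rel_tol=rel_tol)
    )
    return mismatched / len(shared)


# Hypothetical samples: one of the three shared keys disagrees.
batch_features = {"u1": 3.0, "u2": 7.5, "u3": 1.0}
serving_features = {"u1": 3.0, "u2": 8.0, "u3": 1.0, "u4": 2.0}
rate = skew_rate(batch_features, serving_features)
```

A single lakehouse serving layer makes this check unnecessary by construction, since training and inference read the same table.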

Federated Query Layers and Cross-Domain Feature Access

Enterprise ML pipelines increasingly need to join features across organisational boundaries: a fraud model that needs transaction data from payments, device telemetry from security, and account history from CRM. The conventional approach routes all of this through a central ETL job, which creates both a latency problem and a governance problem. Data movement across domain boundaries triggers data residency obligations, access control complexity, and the political friction of centralised data ownership.

Federated query engines, specifically Trino, Starburst, and the emerging class of semantic layer tools built on top of them, allow feature assembly at query time without physically moving data between domains. The query planner pushes predicates down to each domain's native storage engine and assembles the result set in the query coordinator. This matters for ML pipelines because it means feature engineering logic can reference live domain data without requiring a centralised data warehouse to serve as an intermediary. For organisations operating under GDPR or sector-specific data residency requirements, federation also reduces the regulatory surface area of the feature pipeline by keeping data in its originating jurisdiction until the moment of use.
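
The pushdown pattern can be illustrated with a toy coordinator. This is not how Trino is invoked: two in-memory SQLite databases stand in for separate domain stores, and the table names, columns, and rows are hypothetical. The point is only that each domain receives the filter and returns a small result, which the coordinator then assembles.

```python
import sqlite3


def make_domain(ddl: str, insert_sql: str, rows: list[tuple]) -> sqlite3.Connection:
    """Create an in-memory database standing in for one domain's storage engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


payments = make_domain("CREATE TABLE txn (account_id TEXT, amount REAL)",
                       "INSERT INTO txn VALUES (?, ?)",
                       [("a1", 20.0), ("a1", 35.5), ("a2", 900.0)])
crm = make_domain("CREATE TABLE account (account_id TEXT, tenure_days INTEGER)",
                  "INSERT INTO account VALUES (?, ?)",
                  [("a1", 412), ("a2", 3)])


def assemble_features(account_id: str) -> dict:
    # The predicate is evaluated inside each domain; only aggregates or single rows leave it.
    txn_count, txn_total = payments.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM txn WHERE account_id = ?",
        (account_id,)).fetchone()
    row = crm.execute(
        "SELECT tenure_days FROM account WHERE account_id = ?",
        (account_id,)).fetchone()
    # The coordinator only joins the small per-domain results.
    return {"account_id": account_id, "txn_count": txn_count,
            "txn_total": txn_total, "tenure_days": row[0] if row else None}


features = assemble_features("a1")
```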

Object-Level Metadata at S3 Scale and What It Enables for Agentic Workflows

AWS S3 Express One Zone and the broader shift toward object stores with rich, queryable metadata represent a less-discussed but structurally significant change for agentic AI workloads. The constraint that has historically limited what an agent can do with raw data in object storage is discovery: finding the right objects across a bucket containing hundreds of millions of files requires either a pre-built catalogue or full-bucket enumeration, both of which introduce latency and operational dependencies that make real-time agentic data access impractical.

Object-level metadata tagging, combined with S3 Metadata (introduced in preview in late 2024) and AWS Glue Data Catalog integration, allows agents to issue structured queries against object metadata without reading object content. An agent can identify the ten most recent model checkpoints for a specific experiment, filter training shards by data source lineage, or locate all inference logs for a given model version in under a second, without a separate catalogue service in the critical path. For agentic pipelines that need to make data retrieval decisions dynamically, this changes the architecture from one where the agent calls a catalogue API to one where the agent queries storage directly, removing an operational dependency and reducing the number of failure modes.
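
The checkpoint example can be sketched as a plain SQL query over a metadata table. The snippet below uses an in-memory SQLite table as a stand-in; the table name, columns, and rows are hypothetical and do not reproduce the actual S3 Metadata schema, which should be checked in the AWS documentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified metadata table: one row per object, no object content.
conn.execute(
    "CREATE TABLE object_metadata (key TEXT, last_modified TEXT, experiment TEXT, kind TEXT)")
rows = [(f"exp-42/ckpt-{i:03d}.pt", f"2026-01-{i:02d}T00:00:00Z", "exp-42", "checkpoint")
        for i in range(1, 16)]
rows.append(("exp-7/ckpt-001.pt", "2026-02-01T00:00:00Z", "exp-7", "checkpoint"))
rows.append(("exp-42/train.log", "2026-02-02T00:00:00Z", "exp-42", "log"))
conn.executemany("INSERT INTO object_metadata VALUES (?, ?, ?, ?)", rows)


def recent_checkpoints(experiment: str, limit: int = 10) -> list[str]:
    """Return the keys of the most recent checkpoints for one experiment, newest first."""
    cursor = conn.execute(
        "SELECT key FROM object_metadata "
        "WHERE experiment = ? AND kind = 'checkpoint' "
        "ORDER BY last_modified DESC LIMIT ?",  # ISO-8601 strings sort chronologically
        (experiment, limit))
    return [row[0] for row in cursor]


latest = recent_checkpoints("exp-42")
```

The agent's retrieval decision reduces to a filter, a sort, and a limit over metadata, with object reads deferred until the keys are known.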

The Metadata Lineage Problem at Scale

None of these architectural improvements resolve the underlying data lineage problem, which becomes more acute as pipelines grow more dynamic. When GPU-native preprocessing transforms data within the training loop, when lakehouse tables receive continuous updates, and when federated queries assemble features at query time, the provenance chain from raw source to model input becomes harder to reconstruct. This matters for regulated industries where audit requirements demand that a model's training data composition be reproducible, and it matters for debugging, where a model performance regression may trace back to a schema change in a federated source that no downstream system detected.

OpenLineage, an open standard hosted by the LF AI & Data Foundation (part of the Linux Foundation), provides a vendor-neutral specification for emitting lineage events from data processing frameworks including Spark, Airflow, dbt, and Flink. Instrumenting the full pipeline with OpenLineage events and storing them in a lineage backend such as Marquez gives engineering teams a queryable graph of data dependencies that spans the entire pipeline, including federated joins and GPU-side transformations if the training framework emits events. Without this, the architectural flexibility gained from the patterns described above comes at the cost of observability.

Infrastructure Investment Sequencing

The question for engineering leaders is not whether to adopt these patterns but in what order, given that each carries migration cost and operational risk. The most common sequencing mistake we observe is treating the GPU-native preprocessing migration as a prerequisite for everything else. It is not. The lakehouse serving consolidation and the federated query layer are largely independent of the preprocessing stack and often deliver faster return on investment because they reduce operational overhead immediately rather than requiring a training pipeline rebuild.

A practical sequencing starts with the serving layer: migrate feature serving from a dedicated key-value store to a real-time Iceberg table backed by a low-latency query engine, validate latency SLAs under production query patterns, and then extend that infrastructure to cover the batch training feature path. Once the lakehouse is the single source of truth for both training and serving features, the case for GPU-native preprocessing becomes easier to evaluate on its own merits, because the data contract between the preprocessing layer and the training job is now stable.

Where Vector Labs Fits

Vector Labs designs and builds production ML data pipelines for enterprises scaling AI workloads, including end-to-end ETL architecture, feature pipeline construction, and cloud data infrastructure on AWS. In our fraud detection engagement, we constructed a full ETL pipeline and data architecture from scratch on AWS, integrating Elasticsearch and Kibana to deliver a production-ready data engineering layer that supported parallel NLP and image recognition workloads. See the Image Recognition and NLP for Fraud Detection case study for the full scope. If your pipeline architecture is under review, contact us at vector-labs.ai/contacts.

FAQs

When does GPU-native preprocessing deliver a measurable return, and when does it not?

GPU-native preprocessing with tools like RAPIDS cuDF delivers measurable throughput improvement when the training job's GPU utilisation is bottlenecked by data loading rather than by compute. If your GPU utilisation during training is consistently above 85 percent, preprocessing is not the constraint and the migration will not improve training time. The case is strongest for image, video, and multimodal workloads where per-sample transformation cost is high. For tabular workloads with low transformation complexity, the overhead of GPU memory management may outweigh the throughput gain.

What query latency is realistically achievable from a lakehouse for online feature serving?

With Apache Iceberg tables on S3 Express One Zone, well-partitioned on the feature key, and queried with DuckDB or through an engine exposed over Arrow Flight SQL, with metadata caching enabled, point-query latency in the 5–20 millisecond range is achievable for single-entity lookups. Multi-entity joins across large partition ranges will be slower. Whether this meets your inference SLA depends on the model's end-to-end latency budget. For models with sub-10ms inference targets, a dedicated key-value serving layer remains necessary. For models with 50–200ms budgets, lakehouse serving is viable and eliminates the dual-pipeline consistency problem. Budgets between those ranges need to be benchmarked against your own query patterns before deciding.
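
The budget check itself is simple arithmetic once lookup latency has been measured. The sketch below computes a nearest-rank percentile from latency samples and tests whether lookup plus model compute fits the end-to-end budget. The function names, samples, and budgets are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def lakehouse_viable(budget_ms: float, model_compute_ms: float, lookup_p99_ms: float) -> bool:
    """True when feature lookup plus model compute fits inside the latency budget."""
    return model_compute_ms + lookup_p99_ms <= budget_ms


# Hypothetical lookup latencies (ms) measured under production-like query patterns.
lookup_ms = [6, 7, 7, 8, 9, 11, 12, 14, 18, 19]
p99 = percentile(lookup_ms, 99)

fits_relaxed = lakehouse_viable(budget_ms=100, model_compute_ms=40, lookup_p99_ms=p99)
fits_strict = lakehouse_viable(budget_ms=10, model_compute_ms=4, lookup_p99_ms=p99)
```

With these samples the relaxed budget passes and the strict one fails, which mirrors the two ends of the range described above.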

How does federated query execution interact with data residency obligations under GDPR?

Federated query engines like Trino execute queries by pushing predicates to the data source and returning only the result set to the query coordinator. If the coordinator runs in the same jurisdiction as the data source, no cross-border transfer occurs. If the coordinator sits in a different jurisdiction, the result set, where it contains personal data, constitutes a data transfer and GDPR transfer mechanisms apply. The key design decision is coordinator placement: deploying a regional coordinator per jurisdiction and routing queries accordingly keeps data in its originating region until the point of use, which supports residency requirements without requiring data replication.
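
Coordinator placement can be enforced with a small routing rule in front of the query layer. The sketch below assumes a static mapping from datasets to jurisdictions and from jurisdictions to coordinator endpoints. All names and URLs are hypothetical, and a query spanning jurisdictions is rejected rather than routed, so that it goes through an explicit transfer review.

```python
# Hypothetical regional coordinators and dataset-to-jurisdiction mapping.
COORDINATORS = {
    "eu": "https://trino.eu.example.internal",
    "us": "https://trino.us.example.internal",
}
DATASET_JURISDICTION = {
    "payments.txn": "eu",
    "crm.account": "eu",
    "telemetry.device": "us",
}


def route(datasets: list[str]) -> str:
    """Return the coordinator for a query, refusing queries that span jurisdictions."""
    jurisdictions = {DATASET_JURISDICTION[name] for name in datasets}
    if len(jurisdictions) != 1:
        raise ValueError(
            f"query spans jurisdictions {sorted(jurisdictions)}; needs transfer review")
    return COORDINATORS[jurisdictions.pop()]


endpoint = route(["payments.txn", "crm.account"])
```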

What does adopting OpenLineage actually require in terms of engineering effort?

OpenLineage provides native integrations for Apache Spark, Apache Airflow, dbt, and Apache Flink, which cover the majority of transformation workloads in a typical enterprise ML pipeline. Instrumentation for these frameworks requires adding the OpenLineage backend URL to the framework configuration and, in some cases, installing a plugin package. Custom transformation code written outside these frameworks requires manual event emission using the OpenLineage Python or Java client. The ongoing operational cost is running a lineage backend such as Marquez or integrating with a commercial lineage platform. For most teams, the instrumentation work is measured in days per framework, not weeks.

Should we migrate away from a feature store entirely, or maintain it alongside a lakehouse?

The answer depends on whether your feature store is primarily solving a serving latency problem or a feature reuse and governance problem. If it is primarily serving latency, and your inference SLA is compatible with lakehouse query latency, migration is justified. If it is primarily governance, meaning teams across the organisation discover and reuse features through the store's registry, the registry function has value independent of the serving layer and should be preserved, potentially backed by the lakehouse rather than a separate storage system. Feast, for example, can read Iceberg tables through engine-backed offline stores such as Spark or Trino, which allows the registry and feature definition layer to remain while the underlying storage consolidates.

How mature is S3 object-level metadata querying for production use in 2025?

AWS S3 Metadata, which enables SQL queries against object metadata without reading object content, was introduced in public preview in late 2024. For production workloads requiring guaranteed SLAs, teams should confirm its current availability status and regional coverage before committing to it as a critical path dependency. The more mature alternative is AWS Glue Data Catalog with S3 event notifications triggering catalogue updates on object creation, which provides queryable metadata with a small ingestion lag. For agentic workflows where discovery latency of a few seconds is acceptable, the Glue-backed pattern is production-ready today. For sub-second discovery requirements, the native S3 Metadata capability is the target architecture, but its availability and measured query latency should be confirmed before production adoption.
