Factory Data Lakehouse: Architecture, Cost, Benefits


Factories need a data backbone that handles machine time series at scale, powers BI and AI, and stays economically predictable. A lakehouse delivers that by unifying low-cost object storage, robust governance, and SQL-friendly performance. Below is a pragmatic blueprint with modelling methods, a cost playbook, and measurable KPIs for operations.

KEY TAKEAWAYS

• A lakehouse simplifies factory data by consolidating storage, governance, and analytics into one governed platform ready for BI and AI.

• Use Bronze–Silver–Gold with downsampling and SCDs to keep time series usable at any horizon without losing fidelity.

• Bake in cost controls: tier storage, optimise tables, autoscale compute, and publish only what stakeholders actually need.

Reference architecture for a factory lakehouse

Design for the edge-to-cloud path first, then your tables. On site, connect PLCs and machines via OPC UA. Publish telemetry to an MQTT broker at the edge for decoupled, reliable streaming; optionally add a Kafka hub for buffering and fan-out. Land raw events in object storage as Bronze. Curate quality in Silver with schema checks, SCD lookups, and late-arrival handling. Publish Gold for BI and data products with dimensional or wide tables. A single catalog governs ACLs, lineage, and PII. BI tools query Gold directly with SQL, while data scientists access Silver for feature building.
Keep the path observable end-to-end: no ETL sprawl, one storage layer, lakehouse by design. A minimal ingest sketch follows the list below.

·  Edge: OPC UA to MQTT

·  Transport: optional Kafka

·  Storage: object store with ACID tables

·  Layers: Bronze, Silver, Gold

·  Access: catalog, BI, notebooks
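A minimal sketch of the Bronze landing step, assuming PySpark with the Kafka and Delta Lake connectors on the classpath; the broker address, topic name, and storage paths are illustrative placeholders, not a fixed convention:

```python
# Sketch: land raw Kafka telemetry as an append-only Bronze Delta table.
# Broker address, topic, and paths below are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-hub:9092")
    .option("subscribe", "factory.telemetry")
    .load()
)

# Keep the payload as-is (schema-on-read); only add ingest metadata.
bronze = raw.select(
    F.col("key").cast("string").alias("asset_id"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("ingest_ts"),
).withColumn("ingest_date", F.to_date("ingest_ts"))

(
    bronze.writeStream.format("delta")
    .option("checkpointLocation", "/lake/_checkpoints/bronze_telemetry")
    .partitionBy("ingest_date")
    .outputMode("append")
    .start("/lake/bronze/telemetry")
)
```

Contracts and deduplication belong downstream: Bronze stays a cheap, faithful record of what arrived.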

Ingesting and modelling IIoT and time series

Ingest with append-only streams, validate types and units, and tag every record with asset_id, site_id, and capture_ts. Schema-on-read keeps ingest fast; schema-on-write applies contracts at Silver and Gold for reliability. Partition time series tables by date and asset, compact small files, and index frequently filtered columns. Downsampling is essential: 100 ms at the edge, 1 s for monitoring, 1 min for fleet KPIs, daily for trend archives. Keep raw data at full fidelity in Bronze and publish resampled aggregates upward. Join telemetry with contextual masters via SCD Type 2 dimensions for equipment, shifts, and maintenance states. Use Delta or similar ACID tables to unify streaming and batch. For heavy query ranges, precompute rollups and rely on column pruning for speed (see the rollup sketch after the list below). Ingest. Validate. Append. Promote. Repeat.

Good practices:

  • Partition by date and asset
  • Compact small files after bursts
  • Maintain SCD lookup tables for context
  • Precompute 1 s, 1 min, 1 h rollups
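A minimal rollup sketch, assuming a Silver Delta table with asset_id, capture_ts, and value columns; table paths and column names are illustrative:

```python
# Sketch: precompute 1-minute rollups from Silver telemetry.
# Paths and column names are hypothetical assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rollup-1m").getOrCreate()

silver = spark.read.format("delta").load("/lake/silver/telemetry")

rollup_1m = (
    silver.groupBy(
        "asset_id",
        F.window("capture_ts", "1 minute").alias("w"),
    )
    .agg(
        F.avg("value").alias("avg_value"),
        F.min("value").alias("min_value"),
        F.max("value").alias("max_value"),
        F.count(F.lit(1)).alias("samples"),
    )
    .select(
        "asset_id",
        F.col("w.start").alias("window_start"),
        "avg_value", "min_value", "max_value", "samples",
    )
    .withColumn("window_date", F.to_date("window_start"))
)

(
    rollup_1m.write.format("delta")
    .mode("overwrite")
    .partitionBy("window_date", "asset_id")
    .save("/lake/silver/telemetry_1m")
)
```

The same pattern repeats for 1 s and 1 h grains; only the window width and target path change.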

"A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses."

– Ben Lorica, Michael Armbrust, Reynold Xin, Matei Zaharia, Ali Ghodsi, Databricks Blog

Cost model and optimisation levers

Treat cost like latency: design for it. Separate storage, compute, and egress in your model. Storage: columnar formats with compression and automatic file compaction lower I/O. Tiering cuts long-tail spend by pushing historical Silver to cool or archive tiers while keeping Gold hot. Compute: autoscale jobs, cache hot Gold, and prefer SQL pushdown over wide scans. Stream-batch unification avoids duplicate pipelines.
Optimise tables after large ingests and vacuum aggressively within your retention policy (a maintenance sketch follows the levers below). Scheduling: run heavy rollups off-shift, prioritise BI SLAs, and fence runaway notebooks. Governance: per-workload cost budgets and alerts.
Measure cost per query, cost per dataset published, and cost per user served to ensure value lines up with spend. Practical rule: promote only what you'll use. Everything else stays cheap.

Key levers:

  • Compression and file size targets
  • Storage tiering by access pattern
  • Autoscaling and auto-stop on compute
  • Table optimisation and caching
  • Budget alerts and chargeback
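A maintenance sketch, assuming Delta Lake tables with the Delta SQL extensions enabled (OPTIMIZE, ZORDER, and VACUUM are Delta-specific commands; table names are placeholders):

```python
# Sketch: routine Delta table maintenance after large ingests.
# OPTIMIZE / ZORDER / VACUUM are Delta Lake SQL; names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-maintenance").getOrCreate()

# Compact small files from streaming bursts and co-locate hot filter columns.
spark.sql("OPTIMIZE silver.telemetry ZORDER BY (asset_id)")

# Remove files no longer referenced, within the retention policy.
# 168 hours = 7 days; shorter retention limits time travel, so align the
# window with recovery requirements before tightening it.
spark.sql("VACUUM silver.telemetry RETAIN 168 HOURS")
```

Scheduled off-shift, a job like this keeps scan costs flat as streaming bursts accumulate small files.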


Measurable benefits and SLAs

Set SLAs that matter to operations and analytics. Query latency for Gold: under 3 seconds for standard dashboards. Data freshness SLOs: telemetry to Gold under 5 minutes for monitoring, under 60 minutes for finance-quality KPIs. Self-service adoption: active monthly BI users and datasets reused per quarter.
Data product reliability: failed-pipeline rate under 1 percent and time-to-recover under 15 minutes. Backlog throughput: lead time from dataset request to Gold under 10 working days. Security posture: 100 percent of Gold tables cataloged with owners, lineage, and policies. Track these with a shared scorecard visible to manufacturing, quality, and maintenance leaders (a freshness-check sketch follows below). A unified architecture helps here by cutting copies and delays, raising trust in a single governed layer.
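A minimal freshness-check sketch, assuming a Gold Delta table with a capture_ts column; the path, table name, and 300-second threshold are illustrative:

```python
# Sketch: check a telemetry-to-Gold freshness SLO (5 minutes for monitoring).
# Path, column name, and threshold are hypothetical assumptions.
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("freshness-slo").getOrCreate()

# Latest event time landed in the Gold monitoring table.
latest = (
    spark.read.format("delta")
    .load("/lake/gold/machine_monitoring")
    .agg(F.max("capture_ts").alias("latest_ts"))
    .collect()[0]["latest_ts"]
)

# Spark returns a naive datetime in the session timezone; this comparison
# assumes the driver clock uses the same timezone.
lag_s = (datetime.now() - latest).total_seconds()
print(f"telemetry-to-Gold lag: {lag_s:.0f}s (SLO: 300s)")

if lag_s > 300:
    # Wire this into alerting; a scorecard job would loop over all Gold tables.
    raise SystemExit("Freshness SLO breached for machine_monitoring")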

FAQ

What is a factory data lakehouse?

A unified architecture on object storage that supports ACID tables, governance, BI, and AI for industrial data.

How do we ingest OPC UA sources?

Use an OPC UA connector to publish messages to an MQTT broker at the edge, then land the streams in Bronze; a minimal edge-side sketch follows.
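This sketch assumes the python-opcua and paho-mqtt (2.x) libraries; the endpoint, node id, topic, and asset names are hypothetical:

```python
# Sketch: poll one OPC UA node and republish readings over MQTT.
# Endpoint, node id, topic, and asset/site ids are placeholders.
import json
import time

import paho.mqtt.client as mqtt
from opcua import Client as OpcUaClient

opc = OpcUaClient("opc.tcp://plc-01.local:4840")  # hypothetical endpoint
opc.connect()
node = opc.get_node("ns=2;i=1001")                # hypothetical node id

broker = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
broker.connect("edge-broker.local", 1883)
broker.loop_start()

try:
    while True:
        reading = {
            "asset_id": "press-01",
            "site_id": "plant-a",
            "capture_ts": time.time(),
            "value": node.get_value(),
        }
        # QoS 1 gives at-least-once delivery; deduplicate downstream in Silver.
        broker.publish("factory/telemetry/press-01", json.dumps(reading), qos=1)
        time.sleep(0.1)  # ~100 ms at the edge, per the downsampling ladder
finally:
    broker.loop_stop()
    opc.disconnect()
```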

How do we keep costs predictable?

Tier old Silver to cool storage, compact tables, autoscale compute, and monitor cost per dataset and user.

Which KPIs prove value?

Dashboard latency, freshness SLOs, self-service adoption, pipeline failure rate, and lead time to publish new datasets.


About the Author

Liam Rose

I founded this site to share concise, actionable guidance. While RFID is my speciality, I cover the wider Industry 4.0 landscape with the same care, from real-world tutorials to case studies and AI-driven use cases.