Speakaholic

January 25, 2026

By Speakaholic Team

Building a Modern Data Lakehouse Architecture with Apache Iceberg

Discover how to build a scalable data lakehouse with Apache Iceberg and CDC replication from PostgreSQL, and how to implement a bronze-silver-gold zone architecture with comprehensive data curation strategies.


Introduction

In the evolving landscape of data engineering, organizations increasingly demand a unified platform that combines the flexibility of data lakes with the structure and performance of data warehouses. This is where the data lakehouse architecture shines, and Apache Iceberg emerges as a game-changing technology that makes this vision practical.

In this comprehensive guide, we'll explore how to build a production-ready data lakehouse architecture with Apache Iceberg, implement CDC (Change Data Capture) replication from PostgreSQL, and organize data into bronze, silver, and gold zones. We'll dive deep into practical implementation details including S3 bucket layout, data curation strategies, compaction, handling deletes, and modern data modeling approaches.

What is a Data Lakehouse?

A data lakehouse represents the convergence of data lakes and data warehouses. It provides:

  • Flexibility: Store any data format in open standards
  • Scalability: Handle petabyte-scale datasets economically
  • Performance: Query data directly in cloud storage at warehouse speeds
  • Data Quality: ACID transactions and schema enforcement
  • Governance: open table formats such as Apache Iceberg and Delta Lake add transactional metadata, snapshot history, and auditable change tracking on top of open file formats like Apache Parquet

Apache Iceberg stands out among open table formats, providing hidden partitioning, schema evolution, and excellent query performance.
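
To make those capabilities concrete, here is a minimal sketch of the Spark SQL DDL involved, assuming a Spark session whose default catalog is an Iceberg catalog; the bronze.events table and its columns are illustrative, not part of a real schema.

from pyspark.sql import SparkSession

# Assumes the session's default catalog is an Iceberg catalog
spark = SparkSession.builder.appName("IcebergFeatures").getOrCreate()

# Hidden partitioning: writers and readers filter on event_ts,
# and Iceberg transparently maps rows to daily partitions
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution: add a column as a metadata-only change,
# without rewriting existing data files
spark.sql("ALTER TABLE bronze.events ADD COLUMN source_system STRING")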

Understanding the Bronze-Silver-Gold Architecture

The medallion (or multi-hop) architecture organizes data processing into three distinct layers:

Bronze Zone (Raw Data)

The bronze zone captures raw, unprocessed data directly from source systems. This layer acts as an immutable record of source data.

Characteristics:

  • Minimal transformation applied
  • Full historical data retention
  • Schema may vary due to upstream changes
  • CDC captures changes efficiently from PostgreSQL
  • Optimized for append-only patterns
  • Typically highest volume and lowest query frequency

Silver Zone (Cleaned and Unified Data)

The silver layer applies business logic, data quality rules, and combines data from multiple sources.

Characteristics:

  • Consistent schema across similar data types
  • Deduplication and data quality checks applied
  • Historical changes tracked with slowly changing dimensions
  • Optimized for analytical queries
  • Reduced storage through compaction
  • Query-friendly table structure

Gold Zone (Analytics-Ready Data)

The gold layer presents aggregated, business-aligned datasets optimized for specific use cases.

Characteristics:

  • Business metrics and KPIs
  • Pre-aggregated dimensions and facts
  • Optimized schemas (star/snowflake)
  • High query performance
  • Governance and access control applied
  • Smallest dataset size, highest value
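
To illustrate how a gold table is typically derived, the sketch below aggregates curated silver data into a business-facing daily revenue table; the table and column names (silver.orders, silver.customers, gold.daily_revenue) are assumptions for the example, not a prescribed schema.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GoldAggregation").getOrCreate()

# Build a pre-aggregated gold table from curated silver data (illustrative schema)
spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.daily_revenue
    USING iceberg
    PARTITIONED BY (order_date)
    AS
    SELECT
        o.order_date,
        c.customer_segment,
        SUM(o.order_total) AS total_revenue,
        COUNT(DISTINCT o.order_id) AS order_count
    FROM silver.orders o
    JOIN silver.customers c ON o.customer_id = c.customer_id
    GROUP BY o.order_date, c.customer_segment
""")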

CDC Replication Architecture: PostgreSQL to Iceberg

Change Data Capture enables efficient incremental data loading from PostgreSQL by capturing insert, update, and delete operations.


PostgreSQL Setup

-- Enable logical replication (requires a PostgreSQL restart to take effect)
ALTER SYSTEM SET wal_level = logical;

-- Create a publication that the CDC connector will subscribe to
CREATE PUBLICATION cdc_publication FOR ALL TABLES;

-- Create a logical replication slot using the built-in pgoutput plugin
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
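
With the database prepared, changes can be streamed into Kafka by a Debezium PostgreSQL connector. The sketch below registers one through the Kafka Connect REST API; the hostnames, credentials, and connector name are placeholders, and the property names follow Debezium 2.x.

import json
import requests

# Debezium PostgreSQL connector configuration (hostnames and credentials are placeholders)
connector = {
    "name": "postgres-cdc-customers",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "secret",
        "database.dbname": "appdb",
        "plugin.name": "pgoutput",
        "slot.name": "cdc_slot",
        "publication.name": "cdc_publication",
        "topic.prefix": "postgresql",
        "table.include.list": "public.customers,public.orders,public.products",
    },
}

# Register the connector via the Kafka Connect REST API
response = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
response.raise_for_status()

With topic.prefix set to postgresql, change events for public.customers land on the postgresql.public.customers topic consumed in the next section.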

Handling CDC Events in Iceberg

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("CDC-Iceberg").getOrCreate()

# Read CDC events from Kafka
cdc_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "postgresql.public.customers") \
    .load()

# Parse the CDC payload (Debezium JSON envelope)
cdc_df = cdc_df.select(
    get_json_object(col("value").cast("string"), "$.op").alias("operation"),
    get_json_object(col("value").cast("string"), "$.after").alias("data"),
    get_json_object(col("value").cast("string"), "$.ts_ms").alias("cdc_timestamp")
)

# Write to the Iceberg Bronze table as an append-only stream
cdc_df.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("path", "s3://data-lakehouse/bronze/postgresql/customers/") \
    .option("checkpointLocation", "s3://data-lakehouse/checkpoints/customers/") \
    .start() \
    .awaitTermination()

S3 Bucket Layout and Organization

Organizing S3 buckets effectively is crucial for maintainability, security, and cost optimization.


Recommended S3 Structure

s3://org-data-lake/
├── bronze/
│   ├── postgresql/
│   │   ├── customers/
│   │   ├── orders/
│   │   └── products/
│   ├── external-apis/
│   │   ├── weather/
│   │   └── market-data/
│   └── logs/
├── silver/
│   ├── fact_tables/
│   ├── dimension_tables/
│   └── integration/
├── gold/
│   ├── marketing/
│   ├── finance/
│   └── operations/
├── temp/
│   └── spark-jobs/
└── metadata/
    └── schemas/

Best Practices for S3 Organization

  • Partition by zone: keep bronze, silver, and gold under separate prefixes with distinct IAM policies and lifecycle rules
  • Date-based partitioning: organize high-volume tables by ingestion or event date so queries and retention policies can prune efficiently
  • Source tracking: group bronze data by source system so lineage is visible from the path alone
  • Archive strategy: transition cold partitions to cheaper storage classes such as S3 Glacier once they fall out of active use
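
One way to tie this layout to Iceberg itself is to point a Spark catalog's warehouse at the bucket so namespaces and tables resolve to the prefixes above. A minimal sketch, assuming the Iceberg Spark runtime is on the classpath and a Hadoop catalog is used; the catalog name lakehouse is an arbitrary choice.

from pyspark.sql import SparkSession

# Map an Iceberg catalog onto the bucket layout above (catalog name is illustrative)
spark = (
    SparkSession.builder.appName("LakehouseCatalog")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://org-data-lake/")
    .getOrCreate()
)

# Tables created in these namespaces land under the matching prefixes,
# e.g. lakehouse.bronze.customers -> s3://org-data-lake/bronze/customers/
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.bronze")
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.silver")
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.gold")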

Data Curation in the Silver Zone

Data curation involves transforming raw data into reliable, trustworthy datasets.

Curation Pipeline Example

from pyspark.sql.functions import col, current_timestamp, desc, lit, row_number
from pyspark.sql.window import Window

# Read from Bronze
bronze_customers = spark.table("bronze.postgresql_customers")

# Data quality checks
quality_checks = bronze_customers \
    .filter(col("email").isNotNull()) \
    .filter(col("customer_id") != "")

# Deduplication: keep only the latest CDC event per customer
curated_customers = quality_checks \
    .withColumn("row_num",
        row_number().over(Window.partitionBy("customer_id")
            .orderBy(desc("cdc_timestamp")))) \
    .filter(col("row_num") == 1) \
    .drop("row_num")

# Add curation metadata
silver_customers = curated_customers \
    .withColumn("curated_at", current_timestamp()) \
    .withColumn("data_quality_score", lit(0.95))

# Merge into the Silver table (upsert) using Iceberg's MERGE INTO support in Spark SQL
silver_customers.createOrReplaceTempView("silver_customers_updates")

spark.sql("""
    MERGE INTO silver.customers AS target
    USING silver_customers_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

Compaction Strategy

Compaction is critical for maintaining Iceberg table performance by consolidating small files created during incremental updates.

Automatic Compaction with Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Compaction").getOrCreate()

# Compact Bronze tables
spark.sql("""
    CALL system.rewrite_data_files(
        table => 'bronze.customers',
        strategy => 'binpack',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Remove orphaned files
spark.sql("CALL system.remove_orphan_files(table => 'bronze.customers')")

Handling Deletes with Iceberg

Iceberg supports row-level deletes through copy-on-write (rewriting the affected data files) and merge-on-read (writing position or equality delete files), which keeps deletes efficient without sacrificing data correctness.

# Direct Delete by ID
spark.sql("""
    DELETE FROM silver.customers
    WHERE customer_id = '12345'
""")

# Bulk Delete from Staging
spark.sql("""
    DELETE FROM silver.customers
    WHERE customer_id IN (SELECT customer_id FROM staging_deletes)
""")

Data Modeling in the Lakehouse

Modern data modeling for lakehouses combines dimensional modeling with flexibility.

Slowly Changing Dimension Type 2

CREATE TABLE silver.dim_customers (
    customer_key INT NOT NULL,
    customer_id VARCHAR,
    customer_name VARCHAR,
    email VARCHAR,
    start_date DATE,
    end_date DATE,
    is_current BOOLEAN
)
USING iceberg
PARTITIONED BY (year(start_date));
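
Populating this dimension is typically a two-step upsert: close the current row when tracked attributes change, then insert the new version. The sketch below assumes a staging view named staged_customer_changes that already carries the latest attributes and a generated customer_key per customer; the view and its columns are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SCD2Upsert").getOrCreate()

# Step 1: close out current rows whose tracked attributes changed
spark.sql("""
    MERGE INTO silver.dim_customers AS dim
    USING staged_customer_changes AS stg
    ON dim.customer_id = stg.customer_id AND dim.is_current = true
    WHEN MATCHED AND (dim.email <> stg.email OR dim.customer_name <> stg.customer_name)
    THEN UPDATE SET end_date = current_date(), is_current = false
""")

# Step 2: insert a new current row for every customer without one
# (newly arrived customers and those just closed in step 1)
spark.sql("""
    INSERT INTO silver.dim_customers
    SELECT
        stg.customer_key,
        stg.customer_id,
        stg.customer_name,
        stg.email,
        current_date() AS start_date,
        CAST(NULL AS DATE) AS end_date,
        true AS is_current
    FROM staged_customer_changes stg
    LEFT JOIN silver.dim_customers dim
        ON dim.customer_id = stg.customer_id AND dim.is_current = true
    WHERE dim.customer_id IS NULL
""")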

Monitoring and Governance

Effective lakehouse management requires continuous monitoring and governance practices. Iceberg's built-in metadata tables (such as files and snapshots) make per-table checks straightforward:

-- Table size and row count via the Iceberg files metadata table
SELECT
    COUNT(*) AS data_files,
    ROUND(SUM(file_size_in_bytes) / 1e9, 2) AS size_gb,
    SUM(record_count) AS row_count
FROM silver.customers.files;

-- Data freshness: timestamp of the most recent commit
SELECT
    MAX(committed_at) AS latest_load
FROM silver.customers.snapshots;

Best Practices Summary

  1. Isolate zones: Keep bronze, silver, and gold in separate logical areas
  2. Partition strategically: Use date and source partitioning for efficient queries
  3. Automate ingestion: Use managed CDC tools like Debezium for PostgreSQL
  4. Compact regularly: Schedule weekly compaction for all zones
  5. Version schemas: Track schema changes with Git or metadata tools
  6. Monitor quality: Implement automated data quality checks
  7. Encrypt sensitive data: Use column-level encryption in the lakehouse
  8. Archive cold data: Move old partitions to S3 Glacier for cost optimization
  9. Test transformations: Use staging areas for validation before production
  10. Document lineage: Maintain data lineage for compliance and debugging

Conclusion

Building a modern data lakehouse with Apache Iceberg, CDC replication, and the bronze-silver-gold architecture provides a robust foundation for scalable, governed analytics. By following the patterns and practices outlined in this guide—from S3 organization to data curation, compaction, delete handling, and dimensional modeling—organizations can create a platform that balances flexibility with structure, performance with cost-effectiveness.

The combination of Iceberg's reliability, PostgreSQL's CDC capabilities, and thoughtful architectural patterns transforms data infrastructure into a competitive advantage. Start with foundational bronze tables capturing raw data, progress through silver zone curation and integration, and deliver business value through gold zone analytics.

Your journey to a modern data lakehouse starts today.
