Data Platform Reference

DuckDB Platform Evaluation & Implementation Guide

A Recommendation for Grow’s Foundational Data Platform

Version 2.0 · March 2026
Grow Fragrance — Internal
Prepared for evaluation

GROW FRAGRANCE

DuckDB Platform Evaluation

& Implementation Guide

A Recommendation for Grow's Foundational Data Platform

Version 2.0 · March 2026

Prepared for internal evaluation

Executive Summary

Bottom Line Up Front: DuckDB, Python, and Apache Parquet form a strong technical foundation for the analytics platform Grow's strategy describes. The stack delivers enterprise-grade analytical capability at near-zero infrastructure cost — no servers to manage, no cloud subscriptions to commit to. More specifically: it provides the technical execution layer the analytics strategy describes — clean, validated data flowing through defined transformation stages with explicit quality checks at each boundary. The normalization priorities Grow has already identified map directly onto the staging and core layers described in this document.

Scope note: This document describes a standalone data infrastructure implementation — a persistent, organisation-wide analytics platform for the Grow Data Initiative. It is distinct from, and should not be confused with, the DuckDB integration available within Claude's Cowork MCP environment. The Cowork MCP connection allows Claude to query a DuckDB instance interactively during a session; this guide describes the underlying data platform that would feed that environment and every other analytical workload at Grow. The two are complementary: this architecture is the foundation; MCP is one interface layer that sits on top of it.

THE RECOMMENDATION

This document proposes building Grow's analytics platform on three open-source components: DuckDB as the query engine, Python as the orchestration layer, and Apache Parquet as the storage format. Together they form a modern, production-grade data platform that can be stood up in weeks rather than months, operated without dedicated infrastructure staff, and migrated to a cloud data warehouse at any point in the future with minimal friction.

The case for this stack rests on five business-level arguments:

1. NO INFRASTRUCTURE COST OR COMMITMENT

All three components are open-source and free. There is no software license, no monthly compute subscription, and no cloud vendor to negotiate with. The only cost is storage — Parquet files on a shared drive or Google Cloud Storage bucket, at roughly $1–2 per month at Grow's data volumes. A team member with a laptop and an internet connection can run the full analytics stack. This matters especially at a growth stage where capital allocation decisions are still being made: the platform scales in cost exactly as usage scales, starting from zero.

2. SERVERLESS BY DESIGN

DuckDB runs embedded inside a Python script — there is no server to provision, patch, monitor, or restart. Traditional data warehouse platforms (Snowflake, BigQuery, Redshift) require provisioning compute clusters, managing connection pools, and paying for idle time. This platform requires none of that. The nightly pipeline runs as a scheduled script. Analysts query data from their own machines. There is no shared database server to take down or secure. This is not a step down — for Grow's team size and workload pattern, serverless is the architecturally correct choice.

3. PERFORMANCE THAT MATCHES THE WORKLOAD

DuckDB consistently matches or performs comparably to or better than Snowflake and BigQuery on analytical queries for datasets under 500 GB — Grow's likely data volume for the next several years. This is not marketing; it is a function of physics. Distributed systems carry unavoidable overhead: query planning across nodes, network data movement, scheduler latency. When all the data fits on one machine, eliminating that overhead wins. Grow gets cloud warehouse performance at zero cloud warehouse cost.

4. BUILT ON OPEN STANDARDS — NO VENDOR LOCK-IN

Every component is based on open, widely-supported standards. Parquet is the storage format used by Snowflake, BigQuery, Databricks, and Spark. SQL is the transformation language. Python is the orchestration layer. If Grow's data volumes eventually require a distributed system, the migration path is straightforward: point Snowflake or BigQuery at the existing Parquet files, make minor SQL dialect adjustments, and continue. There is no proprietary format to convert, no vendor-specific pipeline to rewrite, and no data to re-import. The architecture is designed to be outgrown gracefully.

5. SPEED TO VALUE

A working pipeline extracting data from Shopify, writing Parquet files to a lake, and querying them with DuckDB can be operational in days, not months. There is no procurement process, no infrastructure provisioning delay, and no lengthy onboarding to a SaaS platform. The team starts generating reliable analytical outputs immediately and builds complexity incrementally as the business demands it. For a company in Grow's position — making time-sensitive decisions about inventory replenishment, DSM-Firmenich ingredient lead times, channel mix, and production batching — speed to reliable data is a direct business advantage.

WHEN THIS APPROACH IS NOT SUFFICIENT

This stack is worth considering for Grow now. It becomes the wrong choice if data volumes consistently exceed 2 TB, if Grow requires real-time streaming analytics, or if enterprise compliance requirements emerge (fine-grained access control, SSO, regulatory audit logging). None of these conditions apply today, and each has a well-documented migration path when they do. Section 11 covers those paths in detail.

This document is structured for two audiences. Sections 1–5 are written for leadership and cover the recommendation, the stack, the key tradeoffs, and the storage options. Sections 6–13 are the technical implementation guide for the engineering team. The Glossary (Section 14) defines technical terms used throughout.

The Recommended Stack: DuckDB + Python + Parquet

The stack described here — DuckDB, Python, and Apache Parquet — is one approach worth considering for Grow's analytics foundation. Each component has a distinct role, and together they address what reliable analytical outputs require: clean data, consistent schemas, and explicit validation at every stage boundary.

THE THREE COMPONENTS

DuckDB is the query engine: it executes SQL across raw files, staged tables, and aggregated mart outputs — all within the same Python process that runs the pipeline. No separate server, no network hop, no credentials beyond file system access.

Python is the orchestration layer: it calls source APIs, writes Parquet files to the lake, runs DuckDB SQL models in sequence, and coordinates validation between stages. It is the glue that connects source APIs, data quality checks, and SQL transformations into a single repeatable process.

Parquet is the storage format: columnar, compressed, open-standard, and natively readable by DuckDB, Spark, BigQuery, Snowflake, and Pandas. Data stored in Parquet is vendor-neutral and portable — it survives any future tool change without conversion.

WHY PARQUET OVER CSV

CSV is familiar. Parquet is faster, smaller, and safer for analytical work. The practical differences compound quickly at production data volumes.

Property	CSV	Parquet
File size	Uncompressed text — baseline	5–10× smaller with ZSTD compression
Read speed	Full file scanned on every query	Only requested columns read — rest skipped
Data types	Everything is text — types inferred at read time	Types embedded in file — no inference, no drift
Schema enforcement	None — columns can change type between files silently	Schema embedded — mismatches caught immediately
Tooling compatibility	Universal	DuckDB, Spark, Pandas, BigQuery, Snowflake
Human readable?	Yes — opens in Excel or a text editor	No — binary format, requires a tool

The one trade-off worth naming: Parquet cannot be opened in Excel. CSV exports from the mart layer cover human-facing review. Parquet handles everything else.

HOW DUCKDB ACCESSES DATA

DuckDB operates in two modes, often used together:

— Mode 1 — Direct File Querying

DuckDB queries Parquet files directly from disk, a mounted share, or cloud storage — no import step, no database to load. This is the primary mode for the raw and staging layers.

-- Reads all daily Parquet partitions for January 2025 directly from disk.
-- hive_partitioning=true exposes year/month/day as filterable columns;
-- DuckDB skips every partition that does not match the WHERE clause.
SELECT order_date, sku_id, SUM(net_revenue) AS revenue
FROM read_parquet('data/lake/raw/shopify/orders/**/*.parquet',

hive_partitioning = true)

WHERE year = 2025 AND month = 1
GROUP BY order_date, sku_id;

— Mode 2 — Persistent .duckdb File

DuckDB can maintain a persistent database file storing pre-computed tables — mart aggregations, reference data, pipeline state. Faster for repeated interactive queries; useful for analyst exploration in DBeaver or notebooks.

# Open the local .duckdb file in read-only mode to avoid write-lock conflicts.
# The mart table was materialised by the nightly pipeline; no re-computation needed.
con = duckdb.connect('data/warehouse/analytics.duckdb', read_only=True)
df  = con.execute(
'SELECT * FROM mart_daily_sales WHERE order_date >= current_date - INTERVAL 30 DAYS'
).df()

	Mode 1: File Lake	Mode 2: .duckdb File
Source of truth	Always — raw Parquet is permanent	Derived — rebuilt from Parquet anytime
Team sharing	Unlimited concurrent readers	Single writer; others use read-only copies
Network storage	Works on NAS, GCS, S3	Local disk only — not safe on network shares
Best for	Pipeline processing, long-term storage	Interactive exploration, notebooks, DBeaver

For Grow, the file lake is the primary architecture. The .duckdb file is a convenience layer — generated nightly, used locally for exploration, discarded when done. A hybrid pattern (Parquet for sharing, .duckdb for fast local queries, CSV for stakeholder exports) covers every access pattern without a shared database server.

The .duckdb file is optional. Parquet alone, queried via DuckDB's direct file mode, is a complete solution.

Key Considerations Before Adopting This Stack

Every data platform involves deliberate tradeoffs. The following are the ones worth examining before committing to this approach — not as objections, but as design decisions any adopting team should make consciously.

RAW DATA: IMMUTABLE OR OVERWRITE?

A common pattern with Parquet lakes is treating raw data as immutable — files written once, never modified. Corrections arrive as new files; history is always preserved. This makes the system fully auditable and reproducible, and it is the pattern that supports backtesting and model validation directly.

The simpler alternative is an overwrite model — re-extract and replace. Easier to manage, but history is lost. For a team that wants to validate model outputs against prior data states, immutability is worth the additional storage and the discipline of a defined retention policy.

VALIDATION: STRICT OR PERMISSIVE?

Pipelines can be configured to halt loudly when data quality checks fail — missing columns, type mismatches, implausible values. This prevents bad data from reaching downstream consumers but requires upfront investment in writing the rules. The alternative — log and continue — produces outputs that may be subtly wrong. For decision-support analytics, strict validation is generally worth the build cost.

BUSINESS LOGIC: SQL OR PYTHON?

SQL-first for transformations means logic is auditable, version-controlled, and readable by anyone who can write a SELECT statement. Python handles orchestration, file management, and model integration. Some teams prefer embedding more logic in Python — more flexible, harder to audit. The SQL-first approach tends to age better as teams grow.

SINGLE MACHINE CONSTRAINTS

DuckDB runs on one machine. For a team where one process runs pipelines and others consume outputs, this tends to be sufficient. The mitigation pattern — Parquet lake for sharing, .duckdb snapshot for local exploration, CSV for non-technical stakeholders — handles Grow's scale without a shared database server. If that changes, Section 11 covers migration paths.

WHAT THIS STACK DOES NOT PROVIDE

Capability	Traditional Warehouse	DuckDB Stack	Risk at Grow's Scale
User authentication	OAuth, SSO, username/password	File system access only	Low — small trusted team
Access roles	Row/column-level permissions	All-or-nothing file permissions	Low — internal use only
Concurrent writes	Full MVCC	.duckdb: single writer. Parquet lake: safe if writing separate partitions	Low — pipelines naturally write distinct date partitions
High availability	Replication, failover	Single machine; no failover	Medium — mitigated by Parquet backups
Audit logging	Who ran what, when	Pipeline logs + Git only	Low — internal analytics use
Real-time ingestion	Streaming inserts	Batch-oriented only	Low — nightly batch is sufficient

None of these represent material risk at Grow's current profile. If any become relevant as the business scales, they have defined migration paths.

CONCURRENCY

The single-writer constraint applies specifically to the .duckdb binary file — only one process should open it in read-write mode at a time. The Parquet lake has no such restriction. Because each pipeline run writes to its own date-partitioned path, multiple pipelines can write to the lake concurrently without any conflict — a Shopify extraction writing year=2025/month=01/day=14/ and an Amazon extraction writing year=2025/month=01/day=15/ are operating on completely separate files. The one scenario to avoid is two processes writing to the exact same partition path simultaneously; in practice this only arises if two backfill jobs are scheduled for the same source and the same date, which is straightforward to prevent with basic scheduling logic.

Scenario	Safe?	Note
Multiple pipelines writing different date partitions	Yes	Separate file paths — no conflict possible
Multiple pipelines writing the same partition	No	Last writer wins or filesystem collision — prevent with scheduler locks
Multiple analysts reading the Parquet lake	Yes	Unlimited concurrent readers on any file
Nightly pipeline writing .duckdb while analyst reads	Risky	Analyst should query Parquet directly or use a local read-only copy
Two processes writing .duckdb simultaneously	No	DuckDB single-writer constraint — serialize with a scheduler

SECURITY

Security is file system level — no row or column permissions. For Grow's internal team: raw directories writable by the pipeline only; mart exports read-only for analysts; credentials in environment variables, never in code; PII segregated at the staging layer.

BACKUP AND RECOVERY

Because raw Parquet is immutable and retained, the entire platform can be rebuilt by replaying the pipeline against the lake. Three tiers:

Tier	What Is Restored	Recovery Time	Data Loss Risk
CSV restore	Latest mart exports	Minutes	Up to 24 hours
.duckdb restore	Modeled tables from backup	< 1 hour	Up to 24 hours
Full rebuild	Complete platform	Hours	None — Parquet is source of truth

Raw Parquet files should be backed up nightly — only new partitions need copying since prior partitions are immutable.

HOW DUCKDB COMPARES TO ALTERNATIVES

System	Type	Perf (<500 GB)	Scalability	Concurrency	Cost	Ops Complexity
DuckDB + Parquet	Embedded OLAP	Excellent	Single node to ~2 TB	Single writer	Free (open source)	Very low
SQLite	Embedded OLTP	Poor for analytics	Small datasets only	Single writer	Free	Very low
PostgreSQL	Client-server OLTP	Good (with indexes)	Vertical scaling	High (MVCC)	Free (hosting cost)	Medium
MariaDB	Client-server OLTP	Good (with indexes)	Vertical + replication	High	Free (hosting cost)	Medium
Snowflake	Cloud OLAP warehouse	Excellent	Virtually unlimited	High	High — compute + storage	Low (managed)
BigQuery	Cloud OLAP warehouse	Excellent	Virtually unlimited	High	Pay-per-query or flat rate	Low (managed)

SQLite is not suited to analytical workloads. PostgreSQL and MariaDB make sense if Grow eventually needs a transactional database alongside analytics — they can coexist with DuckDB. Snowflake and BigQuery are the natural upgrade path if data volumes outgrow single-node performance; both read Parquet natively, making migration low-friction.

At Grow's current data volumes — likely under 50 GB of structured analytical data — DuckDB performs as well as or better than cloud warehouses on the queries that matter, at a fraction of the cost. Revisit Snowflake or BigQuery when volumes approach 500 GB or when multi-user write requirements emerge.

Section 4 Lake Storage Options: Where Should the Data Live?

Lake Storage Options: Where Should the Data Live?

The Parquet data lake is a folder of files — which means it can live almost anywhere that Python and DuckDB can reach. For Grow, operating on Google Workspace without a dedicated shared drive today, there are four realistic options. Each has a different profile of cost, setup complexity, performance, and team accessibility.

One rule applies across all options: the analytics.duckdb warehouse file must never live on a network share or cloud-synced folder. DuckDB's official documentation explicitly warns that read-write mode on NAS, SMB, NFS, or cloud-synced paths can cause data corruption and unpredictable errors. The .duckdb file always stays on the local machine running the pipeline. Only the Parquet lake files are shared.

OPTION 1: GOOGLE CLOUD STORAGE (GCS) — RECOMMENDED

Google Cloud Storage is the cleanest solution for a Google Workspace team. GCS is Google's object storage service — same Google account, same billing, but exposed as a storage bucket that DuckDB and Python can query directly without any desktop sync client.

pip install gcsfs google-cloud-storage
# Python: write Parquet directly to GCS
import gcsfs, pandas as pd
fs = gcsfs.GCSFileSystem(project='grow-fragrance')
df.to_parquet('gs://grow-lake/raw/shopify/orders/year=2025/month=01/day=15/extract.parquet',

filesystem=fs, index=False, compression='zstd')

# DuckDB: query GCS directly -- no download
import duckdb
con = duckdb.connect()  # in-memory; .duckdb stays local
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""CREATE SECRET (TYPE gcs, KEY_ID '...', SECRET '...')""")
df = con.execute("""
SELECT year, month, SUM(net_revenue) AS revenue
FROM read_parquet('gs://grow-lake/raw/shopify/orders/**/*.parquet',

hive_partitioning = true)

WHERE year = 2025 GROUP BY 1,2 ORDER BY 2
""").df()

Aspect	Detail
DuckDB support	Native — read_parquet('gs://...') works out of the box with the httpfs extension
Python support	gcsfs and google-cloud-storage are mature, well-supported libraries
Cross-platform	Identical on Mac, Windows, and Linux — no mount configuration per machine
Google account	Same Google Workspace project and billing as the rest of Grow's tools
Cost	Negligible at Grow's volumes — approximately $1–2/month for storage
Setup	One GCS bucket + HMAC keys for DuckDB; approximately 30 minutes to configure
Requires internet	Pipeline and queries need internet access; no offline operation

OPTION 2: GOOGLE DRIVE FOR DESKTOP (MOUNTED LOCAL PATH)

Google Drive for Desktop mounts your Drive as a folder accessible through File Explorer on Windows (e.g. G:\) or Finder on Mac (/Volumes/GoogleDrive/). DuckDB and Python see it as a local path — no special extensions needed.

# Windows path (Drive for Desktop mounted as G:)
BASE = r'G:\My Drive\grow-analytics\data\lake'
# macOS path
BASE = '/Users/david/Library/CloudStorage/GoogleDrive-.../My Drive/grow-analytics/data/lake'
# DuckDB reads from the mounted path exactly like local disk
con = duckdb.connect('data/warehouse/analytics.duckdb')  # LOCAL -- never on Drive
con.execute(f"SELECT * FROM read_parquet('{BASE}/raw/shopify/orders/**/*.parquet')")

Aspect	Detail
DuckDB support	Full — reads the mounted path as local files; no extensions needed
Python support	Full — standard pathlib and pyarrow; no extra libraries
Cross-platform	Native on Mac and Windows; Linux requires rclone or ocamlfuse (extra setup)
Familiar interface	Files visible in the normal Drive UI
No extra cost	Included in Google Workspace subscription
.duckdb file	Must remain on local disk — never sync the .duckdb file to Drive
Linux support	Not officially supported; third-party tools required

OPTION 3: ON-PREMISES NAS WITH SMB (SAMBA)

For a team that wants the lake on hardware they control — without relying on any cloud service — a Network-Attached Storage device running SMB via Samba is the most cross-platform on-premises option available.

— Why SMB / Samba for a Mixed Mac, Windows, Linux Team

SMB version 3 is the right protocol choice. AFP (Apple Filing Protocol) is deprecated on modern macOS and should not be used for new deployments. NFS has no native Windows client and requires third-party software. SMB v3 is natively supported by all three operating systems and provides the best combination of cross-platform compatibility, security, and performance for a mixed team.

— Hardware Options

Consumer NAS appliance (Synology, QNAP): pre-built, low-power, runs Samba out of the box. A 2-bay unit with 4–8 TB starts around $300–600. Recommended starting point.

Repurposed mini PC or workstation: install Ubuntu Server, configure Samba manually. More control, lower cost if hardware is available.

Raspberry Pi 5: extremely low-cost and low-power, adequate for Grow's data volumes.

— Mounting the Share on Each OS and DuckDB Usage

# macOS -- Finder: Go > Connect to Server > smb://nas-hostname/grow-lake
# Or terminal:
mount -t smbfs //david@nas-hostname/grow-lake /Volumes/grow-lake
# Windows -- Map network drive in File Explorer, or PowerShell:
net use G: \\\\nas-hostname\\grow-lake /persistent:yes
# Linux:
sudo apt install cifs-utils
sudo mount -t cifs //nas-hostname/grow-lake /mnt/grow-lake -o username=david,vers=3.0
# Once mounted, DuckDB reads it as a local path (READ only -- .duckdb stays local)
con = duckdb.connect('~/local/analytics.duckdb')  # local machine only
df = con.execute("""
SELECT * FROM read_parquet('/Volumes/grow-lake/raw/shopify/orders/**/*.parquet',

hive_partitioning = true)

""").df()

# Minimal Samba share config (/etc/samba/smb.conf on the NAS/server):
# [grow-lake]
#    path = /data/grow-lake
#    valid users = david, louis, drew
#    writable = yes
#    create mask = 0664

Aspect	Detail
DuckDB support	Parquet reads work fine over SMB mount; .duckdb file must stay local
Python support	Full — standard pathlib; no extra libraries once mounted
Cross-platform	SMB v3 native on Mac and Windows; cifs-utils on Linux (straightforward)
No cloud dependency	Fully on-premises; no internet required for lake access
Cost	One-time hardware cost (~$300–600 for a Synology starter); no ongoing fees
Setup effort	Moderate — hardware, OS config, Samba setup, mount config per machine
Remote access	VPN required for off-site access; not as seamless as cloud options
.duckdb file	Must remain on local disk — never store it on the NAS share

OPTION 4: GOOGLE DRIVE API (PROGRAMMATIC ONLY)

Python can read and write files to Google Drive via the API without mounting the drive. This works for specific tasks — such as publishing CSV mart exports to a shared Drive folder — but is not a viable primary lake strategy. DuckDB cannot query Drive API paths directly and would require downloading files to a local temp directory first, which negates the partition pruning and direct-file querying advantages. Recommended only for the export/delivery layer, not the lake itself.

RECOMMENDATION SUMMARY

Option	Best For	DuckDB Lake	.duckdb File	Linux	Setup Effort
GCS (Option 1)	Primary recommendation — cloud-native, zero hardware	Native gs:// paths	Local only	Full — no mount	Low (~30 min)
Drive for Desktop (Opt 2)	Team already uses Drive heavily; GCS not desired	Via local mount path	Local only	Third-party tools	Low (per machine)
On-prem NAS / SMB (Opt 3)	Data on-site; no cloud dependency; team on local network	Via SMB mount path	Local only	cifs-utils (easy)	Medium
Drive API (Opt 4)	CSV export delivery to stakeholders only	Not suitable	N/A	Full	Low

For Grow's current setup, GCS is the cleanest path — zero hardware, negligible cost, same Google account, and DuckDB queries it natively. If the team prefers data on local hardware, a Synology NAS running SMB is the on-premises equivalent. Both patterns use identical Python and DuckDB code — only the path prefix changes.

Layered Data Architecture

A layered data architecture is a widely-used pattern for organizing analytical pipelines. The example below illustrates one common approach — a five-layer model in which data flows in one direction, from raw source extracts through to outputs consumed by downstream tools and stakeholders. This is not the only valid structure, and real implementations often compress or split layers based on team size, tooling preferences, and data complexity.

The following is an illustrative example of how this architecture could be applied. Layer boundaries, naming conventions, and storage choices should be treated as starting points rather than fixed requirements — the right structure depends on the specific sources, consumers, and team workflows involved.

Layer	Purpose	Common Storage Format	Typical Owner
Raw	Source data as delivered — unchanged from extraction	Parquet (partitioned by date)	Pipeline / extraction scripts
Staging	Schema normalization, type casting, deduplication	Parquet or DuckDB tables	SQL models
Core	Cross-source integration, unified identifiers, standardized metrics	DuckDB tables or Parquet	SQL models
Marts	Domain-specific aggregations pre-built for reporting or decisions	DuckDB tables + CSV / Parquet exports	SQL models
Consumption Layer	Outputs served to any downstream consumer: BI tools, ML models, forecasting pipelines, reports, or end users	Parquet, CSV, or direct DuckDB queries	Domain owners

RAW LAYER

One common approach treats the raw layer as an immutable record of data exactly as it arrived from source systems — no transformations, no cleaning. Files are typically partitioned by extraction date so that specific periods can be re-pulled or corrected without touching the rest of the history. Some implementations allow targeted overwrites of specific date partitions when source data is corrected; others prefer an append-only model with corrections tracked separately.

STAGING LAYER

The staging layer is where raw data is shaped into a consistent, queryable structure. In the context of Grow's normalization priorities, this is where the foundational work happens: unified SKU taxonomy enforcement, date/time standardization to UTC with daily grain, revenue normalization (net of returns and discounts), and deduplication of records introduced by API pagination. A common principle is that only structural conformance happens here — no business logic — which keeps the layer predictable and straightforward to validate.

CORE LAYER

Core models bring data from multiple sources into alignment — unifying order records from Shopify, Amazon, Faire, and Klaviyo under a shared customer identity, normalizing revenue to a net-of-returns basis across channels, aligning attribution windows across Meta Ads, Google Ads, and Triple Whale, and resolving cost basis differences between QuickBooks and production records. This is where Grow's cross-cutting normalization priorities become executable SQL, and where the single source of truth for cross-functional analysis lives.

MART LAYER

Marts are pre-aggregated, domain-specific views of core data, designed for a particular audience or decision. A mart built for channel P&L analysis looks quite different from one built for warehouse operations, even if both draw from the same core tables. Keeping marts purpose-built and narrow tends to make them faster, easier to maintain, and easier to document — though some teams prefer a smaller number of wider marts depending on their query patterns.

CONSUMPTION LAYER

The consumption layer is the interface between the data platform and everything that uses it. This is deliberately broad — it could be a BI dashboard querying a mart directly, a forecasting model reading a Parquet export, a Python notebook doing ad hoc analysis, an operational CSV report, or a data scientist connecting DBeaver to a local .duckdb snapshot. The key design principle is that this layer should consume clean, validated mart outputs rather than performing its own transformations.

BI and reporting: Tableau, Looker, Metabase, Google Sheets, or Power BI connected to mart exports or direct DuckDB queries

ML and forecasting pipelines: Python reading mart Parquet files as training datasets or feature stores

Operational outputs: scheduled CSV exports, email reports, or Slack-delivered summaries

Interactive analysis: DBeaver, DuckDB CLI, or notebooks pointed at a local .duckdb copy or Parquet lake

File-Based Data Lake Design

The raw and staging layers are implemented as a file-based data lake: a directory tree of Parquet files organized by partition keys. DuckDB reads these files directly using glob patterns, enabling SQL queries across the entire lake without first importing data into a database file.

Partitioning is the single most impactful design decision in a data lake. Without it, every query scans every file regardless of the date range being asked about — a query for last month reads a full year of data. With partitioning, DuckDB skips irrelevant directories at the filesystem level before opening a single file. As data accumulates, the performance gap widens: on a two-year daily lake, a one-month query touches ~4% of files without partitioning costing you. This is not premature optimization — it is a reasonable default from day one.

PARTITIONING STRATEGIES

Partitioning organizes files into sub-directories based on key values, so queries only need to read the subset of files relevant to the question being asked. DuckDB supports three main approaches.

— 1. Hive-Style Directory Partitioning

The most common and best-supported pattern. Directory names encode partition key and value using key=value syntax. DuckDB reads this automatically and exposes the partition keys as queryable columns, enabling efficient partition pruning.

data/lake/raw/shopify/orders/year=2025/month=01/day=15/extract.parquet
data/lake/raw/shopify/orders/year=2025/month=01/day=16/extract.parquet

— 2. Single Flat Folder

All Parquet files for a table sit in one directory. DuckDB uses a glob to read them together. Simplest approach — works well for lower-volume tables or mart outputs where partition pruning is not needed.

data/lake/raw/shopify/orders/orders_2025_01_15.parquet
data/lake/raw/shopify/orders/orders_2025_01_16.parquet

— 3. Filename-Encoded Partitioning

Date values are embedded in the filename rather than directory structure. DuckDB exposes the filename as a virtual column which can be parsed with string functions. Less ergonomic than Hive-style but common when files originate from external systems with fixed naming conventions.

-- Parse date from filename using the filename virtual column
SELECT

regexp_extract(filename, '(\d{4})_(\d{2})_(\d{2})', 0) AS file_date,

COUNT(*) AS rows

FROM read_parquet('data/lake/raw/shopify/orders/*.parquet', filename = true)
WHERE filename LIKE '%2025_01%'
GROUP BY file_date;

Pattern	Best For	DuckDB Support	Tradeoffs
Hive-style directory	Daily extracts, time-series, high-volume tables	Native — partition columns auto-inferred	Deeper directory tree; more folders to manage
Single flat folder	Low-volume tables, reference data, mart outputs	Full glob support	No partition pruning — all files read on every query
Filename-encoded	Files from external systems with fixed naming	Via filename virtual column + string parsing	Manual parsing; fragile if naming changes

PYTHON: IMPLEMENTING PARTITIONED WRITES

The pipeline scripts are responsible for constructing partition paths dynamically. Two functions cover the two main operational patterns.

— Full Historical Load

Pulls all available data from a source API and writes one Parquet file per day-partition. If a partition already exists it is overwritten — making this safe to re-run for historical corrections.

import pandas as pd
from pathlib import Path
from datetime import date, timedelta
# Writes df to a Hive-style partition directory derived from the target date.
# File is ZSTD-compressed; immediately queryable by DuckDB without import.
def write_partition(df: pd.DataFrame, base: str, d: date):
out = Path(base) / f'year={d.year}' / f'month={d.month:02d}' / f'day={d.day:02d}'
out.mkdir(parents=True, exist_ok=True)
df.to_parquet(out / 'extract.parquet', index=False, compression='zstd')
# Iterates over a date range, writing one Parquet partition per day.
# Overwrites existing partitions -- safe to re-run for corrections or backfills.
def full_historical_load(fetch_fn, base: str, start: date, end: date):

d = start

while d <= end:

df = fetch_fn(d)

if not df.empty:

write_partition(df, base, d)

d += timedelta(days=1)

— Incremental Daily Load

Runs each night. Fetches only yesterday's data and writes it to the correct partition. Existing partitions are untouched.

from datetime import date, timedelta
from pathlib import Path
# Nightly load: defaults to yesterday if no date is provided.
# Pass an explicit date to backfill or correct a specific partition.
# Prior partitions are never touched -- safe to schedule without side effects.
def incremental_daily_load(fetch_fn, base: str, target: date = None):

if target is None:

target = date.today() - timedelta(days=1)

df = fetch_fn(target)

if df.empty:

return # zero records on a date is valid -- not an error condition

out = Path(base) / f'year={target.year}' / f'month={target.month:02d}' / f'day={target.day:02d}'
out.mkdir(parents=True, exist_ok=True)
df.to_parquet(out / 'extract.parquet', index=False, compression='zstd')

PARTITIONING BENEFITS SUMMARY

Benefit	How It Works in Practice
Query speed	DuckDB skips partitions that don't match a date filter — a query for January reads only January files
Targeted corrections	Overwrite only the affected day-partitions when source data is corrected
Incremental backups	Only new partitions need copying in nightly backups — prior partitions never change
Parallel processing	Different date ranges can be processed simultaneously by independent pipeline workers
Auditable history	Filesystem timestamps on partition files record exactly when each day's data was written or corrected

Recommended Project Folder Structure

The following structure separates data (gitignored) from code (always versioned), and organizes code by function rather than layer. The analytics.duckdb file is explicitly local — it is never committed or shared via a network path.

📁 grow-analytics/
📁 data/
📁 lake/
📁 raw/
📁 shopify/orders/year=YYYY/month=MM/day=DD/
📁 amazon/orders/
📁 quickbooks/invoices/
📁 faire/orders/

klaviyo/

triple_whale/

📁 staging/
📁 core/
📁 marts/  ─── daily_sales/  channel_pnl/  inventory/
📁 warehouse/
📄 analytics.duckdb  ← LOCAL ONLY — never shared
📁 exports/  ─── csv/   snapshots/
📁 pipelines/
📁 extract/  ─── shopify_orders.py   amazon_orders.py ...
📁 transform/  ─── run_staging.py   run_core.py   run_marts.py
📁 load/  ─── export_csv.py
📁 sql/
📁 staging/  ─── stg_shopify_orders.sql   stg_amazon_orders.sql
📁 core/  ─── core_orders.sql   core_customers.sql
📁 marts/  ─── mart_daily_sales.sql   mart_channel_pnl.sql
📁 validation/  ─── assert_positive_revenue.sql
📁 models/  ─── demand_forecasting/   channel_attribution/   quality_control/ ...
📁 notebooks/  ─── exploration/   analysis/
📁 logs/
📄 config/settings.yaml
📄 README.md

KEY DESIGN DECISIONS

data/ is separate from code/ — Parquet files and the .duckdb binary are gitignored; the scripts that generate them are always versioned

sql/ holds transformation logic as standalone files — models are reviewable and testable independently of Python orchestration

models/ has a directory per decision domain, aligned with the Consumption Layer outputs in Section 6

validation/ SQL scripts run at each layer boundary — not afterthoughts

Faire is included as a raw source from the outset

DuckDB SQL Capabilities

DuckDB's SQL dialect is ANSI-compliant and extends standard SQL with direct file-querying functions, automatic partition discovery, and a range of analytical capabilities that make it well-suited for data engineering and analytical workloads. The examples below progress from basic file access through to pipeline-integrated patterns, each with a description of what the query is doing and why it is written that way.

QUERYING CSV FILES DIRECTLY

read_csv_auto() opens a CSV file and automatically infers column names, data types, and delimiters from the file contents — no schema definition required. The query reads only the three specified columns (DuckDB applies column projection even on CSV) and filters for orders above $50. This is useful for quickly interrogating a raw export without creating any tables or importing data.

-- read_csv_auto infers column names, types, and delimiters from file contents.
-- Column projection and WHERE filtering reduce I/O to only what the query needs.
SELECT order_id, customer_id, total_price
FROM read_csv_auto('data/lake/raw/shopify/orders/2025-01-15.csv')
WHERE total_price > 50;

QUERYING JSON FILES DIRECTLY

DuckDB can query JSON files directly using read_json_auto(), which infers the schema from the file contents including nested objects and arrays. This is useful for interrogating raw API responses — for example, a Klaviyo event export or a Gorgias ticket dump — without first flattening or importing the data. DuckDB exposes nested keys using dot notation and automatically unnests arrays when needed.

-- read_json_auto infers schema from the file, including nested fields.
-- Dot notation accesses nested keys; arrow notation (->) also works.
-- Useful for quickly auditing raw API exports before committing to a schema.
SELECT

event_id,

properties.email AS customer_email,

properties.\$value AS event_value,

datetime AS event_ts

FROM read_json_auto('data/lake/raw/klaviyo/events/2025-01-15.json')
WHERE event_type = 'Placed Order'
ORDER BY event_ts DESC

LIMIT 100;

CONVERTING JSON TO PARQUET FOR THE LAKE

Raw API responses often arrive as JSON. The recommended pattern is to query the JSON file directly with DuckDB — flattening any nested fields into a clean tabular structure — then write the result to a ZSTD-compressed Parquet file at the correct Hive partition path. This converts a verbose, uncompressed JSON export into a compact, queryable Parquet partition in a single SQL statement, with no intermediate DataFrame or pandas step required.

-- Reads a raw JSON export, flattens nested properties into columns,
-- and writes the result directly to the Hive-partitioned lake as Parquet.
-- A 50 MB JSON file typically compresses to ~5-8 MB Parquet with ZSTD.
COPY (
SELECT

event_id::VARCHAR AS event_id,

properties.email::VARCHAR AS customer_email,

properties.\$value::DOUBLE AS event_value,

datetime::TIMESTAMP AS event_ts,

date_trunc('day', datetime::TIMESTAMP)::DATE AS event_date

FROM read_json_auto('data/lake/raw/klaviyo/events/2025-01-15.json')
WHERE event_type = 'Placed Order'
)

TO 'data/lake/raw/klaviyo/events/year=2025/month=01/day=15/extract.parquet'

(FORMAT PARQUET, COMPRESSION ZSTD);

-- The resulting Parquet file is immediately queryable via partition pruning:
-- SELECT * FROM read_parquet('data/lake/raw/klaviyo/events/**/*.parquet',
--                             hive_partitioning = true)
-- WHERE year = 2025 AND month = 1;

QUERYING PARQUET WITH AN EXPLICIT SCHEMA

read_parquet() with a columns parameter tells DuckDB exactly what types to expect for each column rather than inferring them from the file metadata. This matters in production pipelines where type drift between API versions can cause silent coercion — for example, a field that was an integer in one extract becoming a float in a later one. By specifying the schema explicitly, any mismatch raises an error at read time rather than producing subtly wrong downstream values. The query then selects only three of those columns; Parquet's columnar format means the remaining columns in the file are never read from disk.

-- Explicit schema enforcement: type mismatches raise an error at read time
-- rather than silently coercing values -- important for production pipelines.
-- Only the three named columns are read; the rest are skipped by Parquet's columnar layout.
SELECT order_id, order_date, net_revenue
FROM read_parquet(
'data/lake/raw/shopify/orders/year=2025/month=01/day=15/extract.parquet',

columns = {

'order_id': 'VARCHAR',

'order_date': 'DATE',

'net_revenue': 'DOUBLE'

}

);

QUERYING A FLAT FOLDER WITH A GLOB PATTERN

The *.parquet glob pattern tells DuckDB to read every Parquet file in the orders/ directory and treat them as a single unified table. This works regardless of how many files are present — whether the folder contains one file or thousands, the query syntax is identical. date_trunc('month', order_date) truncates the order_date column to the first day of each month, which groups all daily records into monthly buckets. The result is a monthly revenue summary across the entire history of the flat folder.

-- Reads all .parquet files in the folder as one unified table.
-- date_trunc truncates order_date to the first of each month,
-- grouping all daily records into monthly revenue buckets.
SELECT date_trunc('month', order_date) AS month,

SUM(net_revenue) AS revenue

FROM read_parquet('data/lake/raw/shopify/orders/*.parquet')
GROUP BY 1
ORDER BY 1;

HIVE PARTITION DISCOVERY AND PRUNING

The ** glob recursively descends through all sub-directories. With hive_partitioning = true, DuckDB reads the key=value directory names (year=2025, month=01, day=15) and exposes them as virtual columns on every row — without those values actually being stored inside each Parquet file. The WHERE clause filters on those virtual columns before any files are opened: DuckDB physically skips every partition directory that does not match year=2025 AND month=1, reading only the January 2025 files from disk. On a lake with two years of daily data, this reduces I/O from ~730 files to ~31 files for a single-month query.

-- hive_partitioning=true exposes year/month/day as virtual columns from directory names.
-- DuckDB skips all directories that do not match the WHERE clause at the filesystem level --
-- no data is read from non-matching partitions regardless of lake size.
SELECT year, month, COUNT(*) AS order_count
FROM read_parquet(
'data/lake/raw/shopify/orders/**/*.parquet',

hive_partitioning = true

)

WHERE year = 2025 AND month = 1
GROUP BY year, month;

FILENAME-BASED PARTITION DISCOVERY

When files use a naming convention rather than Hive directories — for example orders_2025_01_15.parquet — the filename = true parameter exposes the full file path as a virtual column called filename on every row. regexp_extract() then parses date components out of that string using a capture group pattern: (\d{4})_(\d{2})_(\d{2}) matches four digits, underscore, two digits, underscore, two digits, and the third argument (0) returns the entire match rather than a specific group. The WHERE clause filters by filename string before the file contents are read, limiting I/O to files whose names contain 2025_01.

-- filename=true adds the full file path as a virtual column on every row.
-- regexp_extract parses the YYYY_MM_DD date from the filename string.
-- The LIKE filter limits which files DuckDB opens before reading any data.
SELECT

regexp_extract(filename, '(\d{4})_(\d{2})_(\d{2})', 0) AS file_date,

COUNT(*) AS rows,

SUM(net_revenue) AS revenue

FROM read_parquet(
'data/lake/raw/shopify/orders/*.parquet',

filename = true

)

WHERE filename LIKE '%2025_01%'
GROUP BY file_date
ORDER BY file_date;

UNION ALL ACROSS HETEROGENEOUS SOURCES

UNION ALL concatenates the rows from two queries into a single result set without deduplication — each row from both sources appears in the output. The challenge with combining Shopify and Amazon orders is that each source has source-specific columns that do not exist in the other (shopify_cart_token and shopify_discount_code exist only on Shopify records; amazon_asin and amazon_fulfillment_channel exist only on Amazon records). The EXCLUDE keyword drops those named columns from the SELECT *, leaving only the shared columns that both sources have in common. A literal string 'shopify' AS source and 'amazon' AS source tags each row with its origin so downstream queries can filter or group by channel. The result is a single unified orders table spanning both source systems.

-- EXCLUDE drops source-specific columns so both SELECT * lists produce matching schemas.
-- A literal channel string tags each row with its origin for downstream filtering.
-- UNION ALL appends without deduplication -- every row from both sources is preserved.
SELECT * EXCLUDE (shopify_cart_token, shopify_discount_code),
'shopify' AS source
FROM read_parquet(
'data/lake/raw/shopify/orders/**/*.parquet',

hive_partitioning = true

)

UNION ALL

SELECT * EXCLUDE (amazon_asin, amazon_fulfillment_channel),
'amazon' AS source
FROM read_parquet(
'data/lake/raw/amazon/orders/**/*.parquet',

hive_partitioning = true

);

WRITING QUERY RESULTS TO PARQUET

The COPY statement executes the inner SELECT query and writes its output directly to a Parquet file rather than returning rows to the client. FORMAT PARQUET specifies the output format; COMPRESSION ZSTD applies ZSTD compression, which typically reduces file size by 60–80% compared to uncompressed Parquet while remaining fast to decompress during reads. The inner query aggregates daily sales from core_orders — a DuckDB table built by the staging pipeline — grouping by date, SKU, and channel. This is how the mart layer is materialized: the pipeline runs this COPY statement nightly, replacing the previous mart file with a fresh aggregation over the full history.

-- COPY executes the inner SELECT and writes the result directly to a Parquet file.
-- ZSTD compression reduces output size ~60-80% vs uncompressed Parquet.
-- Replaces the mart file on each run with a fresh full-history aggregation.
COPY (
SELECT

order_date,

sku_id,

channel,

SUM(units_sold) AS units_sold,

SUM(net_revenue) AS net_revenue

FROM core_orders
GROUP BY 1, 2, 3
) TO 'data/lake/marts/daily_sales.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);

EXPORTING QUERY RESULTS TO CSV

CSV exports are how the platform delivers outputs to non-technical stakeholders, dashboards, and any downstream tool that cannot query DuckDB or Parquet directly. The COPY statement works identically to the Parquet export above, but with FORMAT implied by the .csv extension. HEADER writes column names as the first row — required by most dashboard tools. The INTERVAL 90 DAYS expression subtracts 90 days from today's date at query execution time, so the filter window automatically rolls forward each day without any hardcoded dates. The Python loop variant below shows how to export all mart tables in one pass: it connects to the .duckdb file in read-only mode, iterates over a dictionary of mart names and their SQL queries, and writes each result to a named CSV file while printing a row count for the run log.

-- HEADER writes column names as row 1, required by most dashboard tools.
-- current_date - INTERVAL 90 DAYS evaluates at query time -- rolling window,
-- no hardcoded dates to update.
COPY (
SELECT

order_date,

channel,

sku_id,

SUM(units_sold) AS units_sold,

SUM(net_revenue) AS net_revenue

FROM mart_daily_sales
WHERE order_date >= current_date - INTERVAL 90 DAYS
GROUP BY order_date, channel, sku_id
ORDER BY order_date, channel
) TO 'data/exports/csv/daily_sales_90d.csv' (HEADER, DELIMITER ',');
# Opens analytics.duckdb read-only to prevent write-lock conflicts during export.
# Iterates each mart, writes a CSV, and logs row count for pipeline monitoring.
import duckdb
from pathlib import Path
EXPORTS = {
'daily_sales':  'SELECT * FROM mart_daily_sales',
'channel_pnl':  'SELECT * FROM mart_channel_pnl',
'sku_velocity': 'SELECT * FROM mart_sku_velocity',

}

def export_marts(db_path: str, export_dir: str):
con = duckdb.connect(db_path, read_only=True)

Path(export_dir).mkdir(parents=True, exist_ok=True)

for name, query in EXPORTS.items():

out = f'{export_dir}/{name}.csv'
con.execute(f"COPY ({query}) TO '{out}' (HEADER, DELIMITER ',')")

n = con.execute(f'SELECT COUNT(*) FROM ({query})').fetchone()[0]

print(f' {name}.csv: {n:,} rows -> {out}')

con.close()

export_marts('data/warehouse/analytics.duckdb', 'data/exports/csv')

RUNNING SQL FILES FROM PYTHON

In a well-structured pipeline, SQL transformation logic lives in standalone .sql files — not embedded as strings inside Python scripts. This keeps business logic readable, reviewable in a code editor, and independently testable. The Python layer's only job is to read each file in dependency order, execute it against DuckDB, and validate that the result is non-empty before proceeding to the next model. Path(sql_path).read_text() reads the entire .sql file as a string, which is then passed directly to con.execute(). The table name is derived from the filename stem (stg_shopify_orders.sql becomes stg_shopify_orders), enabling a generic row-count check without hardcoding table names. If any model produces zero rows the pipeline aborts with a descriptive error, preventing downstream models from running against empty inputs.

# Reads each SQL model file from disk and executes it against the open DuckDB connection.
# Row count check after each model: empty output halts the pipeline immediately,
# preventing downstream models from running against missing or failed staging data.
import duckdb
from pathlib import Path
import sys
DB    = 'data/warehouse/analytics.duckdb'
SQL   = Path('sql/staging')
MODELS = [
'stg_shopify_orders.sql',
'stg_amazon_orders.sql',
'stg_quickbooks.sql',
]
def run_staging():
con = duckdb.connect(DB)

for model in MODELS:

path = SQL / model

sql = path.read_text() # read the .sql file from disk

con.execute(sql)               # execute it against DuckDB

table = path.stem # derive table name from filename

n = con.execute(f'SELECT COUNT(*) FROM {table}').fetchone()[0]

print(f' {table}: {n:,} rows')

if n == 0:

print(f'ERROR: {table} is empty -- aborting')

con.close()

sys.exit(1)

con.close()

Python Pipeline Architecture

This section focuses on how DuckDB integrates into an extraction pipeline — not how to build the pipeline itself. The assumption is that the implementing engineer is comfortable with scheduling, API clients, and DataFrame processing. What follows is specifically how DuckDB fits into each stage.

WHERE DUCKDB SITS IN THE PIPELINE

DuckDB is not involved in extraction or API calls. It enters the picture at two points: writing validated data to the lake as Parquet (indirectly — via pyarrow or pandas), and querying that data in the transform layer via SQL. The extract step is standard Python; DuckDB takes over from staging onward.

Step	DuckDB Involved?	What DuckDB Does
Extract	No	Standard API client — requests, pagination, auth
Normalize	No	Standard pandas — flatten JSON, cast types, derive fields
Validate	No	Standard assertions — check columns, nulls, value ranges
Write	Indirectly	DataFrame written to Hive partition path via df.to_parquet(..., compression='zstd')
Stage	Yes	read_parquet() scans the raw lake; SQL model normalizes and loads a staging table
Transform	Yes	SQL models build core and mart tables from staged data
Export	Yes	COPY statement writes mart outputs to CSV or Parquet

WRITING THE RAW PARQUET FILE

The extraction step ends by writing the validated DataFrame to the correct Hive partition. This is the only DuckDB-adjacent step in extraction — the partition path convention is what makes the lake queryable later.

# Constructs the Hive partition path from the target date and writes the DataFrame.
# ZSTD compression; the resulting file is immediately queryable by DuckDB.
out_dir = base / f'year={d.year}' / f'month={d.month:02d}' / f'day={d.day:02d}'
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / 'extract.parquet', index=False, compression='zstd')
# Result path: data/lake/raw/shopify/orders/year=2025/month=03/day=15/extract.parquet

READING THE LAKE IN THE TRANSFORM STAGE

Once the Parquet file is on disk, DuckDB reads it directly via read_parquet() with hive_partitioning enabled. The staging SQL model runs against this scan and writes a normalized table. A row count check confirms the table is non-empty before any downstream model runs.

import duckdb
con = duckdb.connect('data/warehouse/analytics.duckdb')
con.execute(Path('sql/staging/stg_shopify_orders.sql').read_text())
# Validate: staging table must be non-empty before proceeding

row_count = con.execute('SELECT COUNT(*) FROM stg_shopify_orders').fetchone()[0]

if row_count == 0:

raise RuntimeError(

'stg_shopify_orders is empty. '

'Check that raw Parquet files exist and the SQL model ran cleanly.'

)

WRITING MART OUTPUTS

The transform stage ends by materializing marts. DuckDB writes these directly to Parquet or CSV via the COPY statement — no pandas, no intermediate objects.

# Materialize mart to Parquet (portable, queryable by any downstream tool)
con.execute("""
COPY (SELECT * FROM mart_daily_sales)

TO 'data/lake/marts/daily_sales/latest.parquet'

(FORMAT PARQUET, COMPRESSION ZSTD)

""")

# Export to CSV for stakeholder delivery or co-work function input
con.execute("""
COPY (
SELECT * FROM mart_channel_pnl
WHERE order_date >= current_date - INTERVAL 90 DAYS
)

TO 'data/exports/csv/channel_pnl_90d.csv' (HEADER, DELIMITER ',')

""")

con.close()

DuckLake: A Possible Next Step

As Grow's data platform matures, there is a named upgrade path that adds more robust data management without moving to cloud infrastructure. It is called DuckLake, and it is worth knowing it exists — even though Grow does not need it at the outset.

WHAT IS DUCKLAKE?

DuckLake is an optional layer that sits on top of the existing Parquet-based lake and adds transaction management. Think of it as giving the data lake a proper change log. Without DuckLake, if a pipeline run fails halfway through writing a new batch of files, the lake could be left in a partial state. With DuckLake, either the entire batch commits successfully, or none of it does — no in-between.

DuckLake also adds the ability to query the lake as it existed at a specific point in time — for example, asking what the sales data looked like on a specific date before a correction was applied. And it allows table schemas to evolve — adding or renaming columns — without rebuilding the entire dataset.

HOW IT FITS INTO THIS ARCHITECTURE

DuckLake is not a replacement for the Parquet lake or DuckDB. It is an extension. The underlying data still lives in Parquet files. DuckDB still executes all queries. Python still orchestrates the pipeline. DuckLake adds a lightweight metadata catalog — stored in DuckDB itself — that tracks what changed, when, and in what order. Existing pipelines can be migrated to DuckLake incrementally.

WHEN WOULD GROW CONSIDER DUCKLAKE?

Pipeline failures are leaving the lake in partial states requiring manual cleanup

The team wants to query historical snapshots for backtesting or audit purposes

Schema changes (adding new source fields) are becoming difficult to manage across existing Parquet files

Multiple pipeline writers need to run concurrently without file conflicts

DuckLake is a well-supported, named upgrade path that keeps the team on the DuckDB ecosystem. It is not a migration to a new tool — it is the same tool with more capabilities turned on when needed.

Migration Paths

One of the strongest arguments for this architecture is that it is designed to migrate cleanly if the business outgrows it. Three properties make this true: business logic lives in plain SQL files, Parquet is supported natively by every major target system, and Python pipeline scripts are engine-agnostic. In most cases, migrating the query layer means changing a connection string and making minor SQL dialect adjustments — the data and the logic move intact.

MIGRATION TO CLOUD WAREHOUSES

— Snowflake

Parquet files load directly into Snowflake's internal stage via the COPY INTO command. SQL transformation models require minimal modification — Snowflake's SQL dialect is highly compatible with DuckDB's. The primary migration effort is authentication setup, storage integration configuration, and warehouse sizing decisions.

— BigQuery

BigQuery can query Parquet files stored in Google Cloud Storage directly as external tables, or load them into native tables. Since GCS is already the recommended lake storage option for Grow, this migration path requires no data movement whatsoever — BigQuery simply points at the existing GCS bucket. SQL models require minor dialect adjustments, primarily around date functions.

— Databricks

Databricks supports Delta Lake natively. The migration involves converting the Parquet lake to Delta tables using the Delta Lake Python library, after which Databricks SQL or Spark SQL can execute equivalent transformation logic with minimal changes.

MIGRATION TO DISTRIBUTED SQL

— Trino

Trino queries Parquet files on local filesystems or cloud object stores via its Hive connector. SQL models are highly portable to Trino's dialect. The primary code change is replacing DuckDB Python connections with Trino JDBC or HTTP connections.

— Apache Spark

Spark reads Parquet natively and supports equivalent SQL via Spark SQL. Migration effort is higher than Trino due to Spark's different execution model and cluster management requirements. The Parquet lake remains the clean handoff point.

MIGRATION TO LOCAL SQL SERVERS

— PostgreSQL

For teams that eventually need multi-user write access or OLTP capabilities alongside analytics, PostgreSQL is the natural local alternative. A foreign data wrapper mechanism allows PostgreSQL to query existing Parquet files via DuckDB during a transitional period — the two systems can coexist while migration proceeds incrementally.

— MySQL / MariaDB

MySQL and its open-source fork MariaDB are widely deployed relational databases suited to transactional workloads. MariaDB's ColumnStore engine adds columnar analytics capability. For teams already operating MySQL infrastructure, MariaDB is a reasonable migration target for the transactional layer, with DuckDB continuing to serve the analytical workloads where columnar performance matters. SQL dialect differences (particularly around window functions and date handling) require some model adjustment.

WHY MIGRATION IS MANAGEABLE

Business logic lives in SQL files — not in vendor-specific stored procedures or proprietary transformation tools

Parquet is a first-class format for every migration target listed above — no data conversion required

Python pipeline orchestration is engine-agnostic — swapping the DuckDB connection for Snowflake or Trino is typically a one-line change per script

No ETL platform lock-in — the pipeline is plain Python, not a vendor's proprietary workflow engine

If the recommendation today is DuckDB, the option tomorrow is Snowflake or BigQuery — and the path between them is well-paved. Choosing DuckDB now does not foreclose anything.

Scaling Guidelines

The architecture evolves in predictable steps as data volume grows. The following thresholds provide a practical framework for when to consider each transition. They are approximate — actual performance depends on query complexity, available RAM, and storage speed — but they give the team a concrete basis for planning.

Data Volume	Recommended Architecture	Notes
< 100 GB	DuckDB + Parquet (current recommendation)	Full dataset fits in memory on modern hardware; queries complete in seconds; Grow is here now
100 GB – 500 GB	DuckDB + Partitioned Parquet Lake	Partition pruning becomes important; increase pipeline machine RAM to 32–64 GB
500 GB – 2 TB	DuckDB + Partitioned Lake + DuckLake	Evaluate DuckLake for transactional guarantees; DuckDB out-of-core execution handles this range
2 TB+	Distributed engine: Snowflake, BigQuery, Trino	Single-node ceiling reached; migrate using paths in Section 14

PERFORMANCE LEVERS BEFORE CONSIDERING MIGRATION

Before concluding that data volume requires a distributed system, exhaust the following optimizations — they often yield 5–10x query speedups on the existing stack:

Partition pruning: ensure queries filter on partition keys (year, month, day) so DuckDB skips irrelevant Parquet files entirely

ZSTD compression: Parquet files compressed with ZSTD are significantly smaller and faster to scan than the Snappy default

Column projection: select only needed columns rather than SELECT * — Parquet only reads what is requested

Pre-aggregation: push aggregations into the mart layer rather than computing them at query time

Memory allocation: explicitly configure DuckDB's memory_limit and thread count for the available hardware

-- Set DuckDB memory and parallelism at the start of a session
SET memory_limit = '24GB';
SET threads = 8;
SET enable_progress_bar = true;

If Adopted: Suggested Next Steps

The approach described in this document is a proposal based on Grow's current data profile, team structure, and the analytical objectives outlined in the existing strategy framework. Whether this is the right path is ultimately a business decision — but if the direction resonates, the following sequencing tends to work well.

WHY THIS STACK TENDS TO WORK WELL AT THIS STAGE

At Grow's data volumes, DuckDB performs comparably to cloud warehouses on the queries that matter — without the billing overhead, cluster management, or provisioning delays. The infrastructure cost is storage only. Every component is based on open standards, so no future migration is foreclosed. And the architecture can be operational in weeks, not months.

More importantly: this approach maps directly onto the data quality and normalization framework already in place. The seven normalization priorities — unified SKU taxonomy, date standardization, revenue normalization, attribution window alignment, cost basis standardization, customer identity resolution, and units of measure — each have a natural home in the staging and core layers of this architecture. Every pipeline stage applies explicit validation rules; every SQL model produces a guaranteed, auditable output. The stack doesn't sit alongside the strategy; it runs it.

SUGGESTED SEQUENCING

Step 1 — Storage: decide where the lake lives. GCS is the path of least resistance for a Google Workspace team. On-premises NAS is a reasonable alternative if cloud storage is not preferred. This decision shapes everything downstream.

Step 2 — Data audit: before writing a line of pipeline code, assess what's actually available from each source — Shopify, Amazon, QuickBooks, Faire. Known gaps (stockout periods, COGS granularity, attribution window mismatches) should be surfaced early so the staging layer can account for them.

Step 3 — First pipeline: stand up one extraction script for Shopify orders. Write the Parquet file. Query it with DuckDB. Validate the output. This is the proof of concept — once it works end-to-end on one source, the pattern is established for all others.

Step 4 — Staging and core models: build the normalization layer with validation rules at each boundary. This is where data quality rules get implemented. It is the highest-leverage investment in the platform.

Step 5 — Mart layer: with reliable core data, build the domain-specific aggregations. These are what feed dashboards, decisions, and the consumption layer.

Step 6 — Consumption layer: connect mart outputs to whatever tools the team already uses — dashboards, notebooks, forecasting pipelines, or direct analyst access via DBeaver.

This document is a reference, not a mandate. The implementation choices that matter most — which sources to prioritize, how strict validation should be, how to structure the mart layer for Grow's specific decision domains — are best made by the team closest to the business. This guide is intended to give that team a well-reasoned starting point.

Glossary and Key Definitions

This glossary defines technical concepts, acronyms, and architectural terms used throughout this document at or above an upper-division computer science level. Terms are grouped by category.

STORAGE AND FILE FORMATS

— Parquet

An open-source, columnar storage file format developed by the Apache Software Foundation. Unlike row-oriented formats (CSV, JSON), Parquet stores data column-by-column, enabling queries that read only the columns they need. It includes built-in compression, schema metadata, and min/max statistics per column chunk for fast filtering.

— Columnar Storage

A data layout where all values of a single column are stored contiguously on disk. Contrast with row storage, where all fields of a single record are stored together. Columnar storage is dramatically more efficient for analytical queries (which typically aggregate a few columns across many rows) because irrelevant columns are never read.

— ZSTD (Zstandard)

A lossless data compression algorithm developed by Facebook. Used in Parquet files to reduce storage size. ZSTD provides a better compression ratio than the older Snappy format (Parquet's default) while maintaining fast decompression speeds. Recommended for all Parquet files in this architecture.

— Glob Pattern

A string using wildcard characters to match file paths. In this architecture, *.parquet matches all Parquet files in a single directory, and **/*.parquet recursively matches all Parquet files in a directory tree. DuckDB uses glob patterns to read multiple files as a single virtual table.

— Arrow (Apache Arrow)

An open-source, in-memory columnar data format and inter-process communication standard. DuckDB uses Arrow internally for its execution engine and can exchange data with Pandas, Spark, and other tools via Arrow without copying or serializing data. When DuckDB returns results as a DataFrame, Arrow is the bridge.

DATABASE AND QUERY ENGINE CONCEPTS

— OLAP (Online Analytical Processing)

A category of database workload optimized for complex analytical queries over large datasets — aggregations, groupings, time-series analysis, and multi-dimensional exploration. Contrasts with OLTP. DuckDB is an OLAP engine.

— OLTP (Online Transaction Processing)

A category of database workload optimized for high-volume, low-latency read and write operations — point lookups, inserts, updates, and deletes. Typical of application databases (e-commerce order systems, CRM records). PostgreSQL and MySQL are OLTP engines.

— Vectorized Execution

A query execution strategy where operations are applied to entire batches (vectors) of column values at once, rather than one row at a time. Vectorized execution exploits modern CPU SIMD (Single Instruction, Multiple Data) instructions, enabling significantly higher throughput for analytical computations.

— Predicate Pushdown

An optimization where filter conditions (WHERE clauses) are applied as early as possible in query execution — ideally at the file scan level — so that irrelevant data is never read into memory. DuckDB pushes predicates down into Parquet scans, using the file's built-in column statistics to skip row groups that cannot satisfy the filter.

— Partition Pruning

The process of skipping entire file system partitions (directories or files) based on query filter conditions. When a query filters on year=2025 AND month=1 against a Hive-partitioned lake, DuckDB's partition pruning means only the year=2025/month=01/ directories are opened — all others are ignored at the filesystem level before any data is read.

— MVCC (Multi-Version Concurrency Control)

A concurrency model used by traditional databases (PostgreSQL, MySQL) that allows multiple transactions to read and write data simultaneously without blocking each other by maintaining multiple versions of each row. DuckDB does not implement MVCC — it uses a single-writer model instead.

— In-Process Database

A database engine that runs within the same operating system process as the application that uses it, rather than as a separate server process. DuckDB is in-process: a Python script imports it as a library, and the query engine runs inside the Python interpreter. No network connection, authentication, or daemon management is required.

DATA LAKE AND PIPELINE ARCHITECTURE

— Data Lake

A storage architecture where raw data from multiple source systems is collected in its original format — typically as files — before any transformation or schema enforcement. In this architecture, the raw/ and staging/ layers of the Parquet directory tree constitute the data lake.

— Hive Partitioning

A directory naming convention originating in the Apache Hive ecosystem where partition key and value are encoded directly in directory names using key=value syntax (e.g. year=2025/month=01/day=15/). DuckDB natively reads and automatically exposes these directory names as virtual columns, enabling partition pruning without any metadata catalog.

— ETL / ELT

Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) are two approaches to data pipeline architecture. ETL transforms data before loading it into the destination system. ELT loads raw data first, then transforms it inside the destination. This architecture follows ELT: raw Parquet files are loaded to the lake first, then transformed by DuckDB SQL models inside the warehouse.

— Idempotent

An operation is idempotent if running it multiple times produces the same result as running it once. In pipeline design, idempotent extraction functions can be re-run for any date without risk of duplicate or corrupted data — the result is always the same Parquet file for that partition. This is what makes historical backfills and correction re-runs safe.

— Medallion Architecture

A layered data architecture pattern (also called bronze/silver/gold or raw/staging/mart) where data is progressively refined through defined layers. Raw data lands in the first layer unchanged; each subsequent layer adds normalization, integration, and aggregation. The five-layer model in Section 5 (Raw, Staging, Core, Mart, Consumption) is an implementation of this pattern.

— DAG (Directed Acyclic Graph)

A mathematical structure used to represent dependencies between tasks where edges have direction and no cycles exist. In pipeline orchestration, a DAG defines the order in which pipeline steps must run — staging must complete before core models, core models before marts, etc. Tools like Prefect and Dagster manage pipeline DAGs explicitly.

INFRASTRUCTURE AND NETWORKING

— NAS (Network-Attached Storage)

A dedicated file storage device connected to a local network, accessible to multiple machines as a shared file system. In this architecture, a NAS running Samba/SMB is one option for hosting the Parquet data lake. DuckDB can read Parquet files from a mounted NAS share but must not run its .duckdb database file on a NAS due to file locking limitations.

— SMB / Samba

Server Message Block (SMB) is a network file sharing protocol natively supported by Windows, macOS, and Linux. Samba is the open-source Linux/Unix implementation of SMB. SMB v3 is the recommended protocol for a cross-platform team sharing a NAS-hosted data lake.

— GCS (Google Cloud Storage)

Google's object storage service — a cloud-based system for storing and accessing files (objects) at any scale. Analogous to AWS S3. In this architecture, GCS is the recommended home for the Parquet data lake: DuckDB queries it natively via the httpfs extension using gs:// paths, and Python writes to it via the gcsfs library.

— NFS (Network File System)

A distributed file system protocol enabling file access over a network, common on Unix/Linux systems. DuckDB's documentation explicitly warns against running a .duckdb file in read-write mode on NFS shares due to locking and consistency issues.

ACRONYMS QUICK REFERENCE

Acronym	Stands For	Context in This Document
OLAP	Online Analytical Processing	DuckDB's workload category — aggregations, reporting, time-series
OLTP	Online Transaction Processing	PostgreSQL/MySQL workload — application databases, point lookups
ETL	Extract, Transform, Load	Pipeline pattern; this architecture uses ELT
ELT	Extract, Load, Transform	Raw data lands first; DuckDB transforms it in place
DAG	Directed Acyclic Graph	Dependency graph for pipeline step ordering
NAS	Network-Attached Storage	On-premises shared storage option for the Parquet lake
SMB	Server Message Block	Cross-platform file sharing protocol; Samba is its Linux implementation
GCS	Google Cloud Storage	Recommended cloud storage for the Parquet lake
NFS	Network File System	Unix/Linux file sharing protocol; not safe for .duckdb files
ZSTD	Zstandard	Compression algorithm used for Parquet files
MVCC	Multi-Version Concurrency Control	Concurrency model in PostgreSQL/MySQL; not present in DuckDB
SIMD	Single Instruction, Multiple Data	CPU feature exploited by DuckDB's vectorized execution engine
API	Application Programming Interface	How the pipeline connects to Shopify, Amazon, QuickBooks, etc.
PII	Personally Identifiable Information	Customer names, emails, addresses — segregated at the staging layer
CSV	Comma-Separated Values	Human-readable tabular file format; used for mart exports to stakeholders
JSON	JavaScript Object Notation	Common API response format; DuckDB queries JSON files directly via read_json_auto()
SQL	Structured Query Language	The query language used by DuckDB and all transformation models
DTC	Direct-to-Consumer	Grow's primary sales channel via Shopify