Data Platform Reference
DuckDB Platform Evaluation & Implementation Guide
A Recommendation for Grow’s Foundational Data Platform
GROW FRAGRANCE
DuckDB Platform Evaluation
& Implementation Guide
A Recommendation for Grow's Foundational Data Platform
Version 2.0 · March 2026
Prepared for internal evaluation
Executive Summary
Bottom Line Up Front: DuckDB, Python, and Apache Parquet form a strong technical foundation for the analytics platform Grow's strategy describes. The stack delivers enterprise-grade analytical capability at near-zero infrastructure cost — no servers to manage, no cloud subscriptions to commit to. More specifically: it provides the technical execution layer the analytics strategy describes — clean, validated data flowing through defined transformation stages with explicit quality checks at each boundary. The normalization priorities Grow has already identified map directly onto the staging and core layers described in this document.
Scope note: This document describes a standalone data infrastructure implementation — a persistent, organisation-wide analytics platform for the Grow Data Initiative. It is distinct from, and should not be confused with, the DuckDB integration available within Claude's Cowork MCP environment. The Cowork MCP connection allows Claude to query a DuckDB instance interactively during a session; this guide describes the underlying data platform that would feed that environment and every other analytical workload at Grow. The two are complementary: this architecture is the foundation; MCP is one interface layer that sits on top of it.
THE RECOMMENDATION
This document proposes building Grow's analytics platform on three open-source components: DuckDB as the query engine, Python as the orchestration layer, and Apache Parquet as the storage format. Together they form a modern, production-grade data platform that can be stood up in weeks rather than months, operated without dedicated infrastructure staff, and migrated to a cloud data warehouse at any point in the future with minimal friction.
The case for this stack rests on five business-level arguments:
1. NO INFRASTRUCTURE COST OR COMMITMENT
All three components are open-source and free. There is no software license, no monthly compute subscription, and no cloud vendor to negotiate with. The only cost is storage — Parquet files on a shared drive or Google Cloud Storage bucket, at roughly $1–2 per month at Grow's data volumes. A team member with a laptop and an internet connection can run the full analytics stack. This matters especially at a growth stage where capital allocation decisions are still being made: the platform scales in cost exactly as usage scales, starting from zero.
2. SERVERLESS BY DESIGN
DuckDB runs embedded inside a Python script — there is no server to provision, patch, monitor, or restart. Traditional data warehouse platforms (Snowflake, BigQuery, Redshift) require provisioning compute clusters, managing connection pools, and paying for idle time. This platform requires none of that. The nightly pipeline runs as a scheduled script. Analysts query data from their own machines. There is no shared database server to take down or secure. This is not a step down — for Grow's team size and workload pattern, serverless is the architecturally correct choice.
3. PERFORMANCE THAT MATCHES THE WORKLOAD
DuckDB consistently matches or performs comparably to or better than Snowflake and BigQuery on analytical queries for datasets under 500 GB — Grow's likely data volume for the next several years. This is not marketing; it is a function of physics. Distributed systems carry unavoidable overhead: query planning across nodes, network data movement, scheduler latency. When all the data fits on one machine, eliminating that overhead wins. Grow gets cloud warehouse performance at zero cloud warehouse cost.
4. BUILT ON OPEN STANDARDS — NO VENDOR LOCK-IN
Every component is based on open, widely-supported standards. Parquet is the storage format used by Snowflake, BigQuery, Databricks, and Spark. SQL is the transformation language. Python is the orchestration layer. If Grow's data volumes eventually require a distributed system, the migration path is straightforward: point Snowflake or BigQuery at the existing Parquet files, make minor SQL dialect adjustments, and continue. There is no proprietary format to convert, no vendor-specific pipeline to rewrite, and no data to re-import. The architecture is designed to be outgrown gracefully.
5. SPEED TO VALUE
A working pipeline extracting data from Shopify, writing Parquet files to a lake, and querying them with DuckDB can be operational in days, not months. There is no procurement process, no infrastructure provisioning delay, and no lengthy onboarding to a SaaS platform. The team starts generating reliable analytical outputs immediately and builds complexity incrementally as the business demands it. For a company in Grow's position — making time-sensitive decisions about inventory replenishment, DSM-Firmenich ingredient lead times, channel mix, and production batching — speed to reliable data is a direct business advantage.
WHEN THIS APPROACH IS NOT SUFFICIENT
This stack is worth considering for Grow now. It becomes the wrong choice if data volumes consistently exceed 2 TB, if Grow requires real-time streaming analytics, or if enterprise compliance requirements emerge (fine-grained access control, SSO, regulatory audit logging). None of these conditions apply today, and each has a well-documented migration path when they do. Section 11 covers those paths in detail.
This document is structured for two audiences. Sections 1–5 are written for leadership and cover the recommendation, the stack, the key tradeoffs, and the storage options. Sections 6–13 are the technical implementation guide for the engineering team. The Glossary (Section 14) defines technical terms used throughout.
The Recommended Stack: DuckDB + Python + Parquet
The stack described here — DuckDB, Python, and Apache Parquet — is one approach worth considering for Grow's analytics foundation. Each component has a distinct role, and together they address what reliable analytical outputs require: clean data, consistent schemas, and explicit validation at every stage boundary.
THE THREE COMPONENTS
DuckDB is the query engine: it executes SQL across raw files, staged tables, and aggregated mart outputs — all within the same Python process that runs the pipeline. No separate server, no network hop, no credentials beyond file system access.
Python is the orchestration layer: it calls source APIs, writes Parquet files to the lake, runs DuckDB SQL models in sequence, and coordinates validation between stages. It is the glue that connects source APIs, data quality checks, and SQL transformations into a single repeatable process.
Parquet is the storage format: columnar, compressed, open-standard, and natively readable by DuckDB, Spark, BigQuery, Snowflake, and Pandas. Data stored in Parquet is vendor-neutral and portable — it survives any future tool change without conversion.
WHY PARQUET OVER CSV
CSV is familiar. Parquet is faster, smaller, and safer for analytical work. The practical differences compound quickly at production data volumes.
| Property | CSV | Parquet |
|---|---|---|
| File size | Uncompressed text — baseline | 5–10× smaller with ZSTD compression |
| Read speed | Full file scanned on every query | Only requested columns read — rest skipped |
| Data types | Everything is text — types inferred at read time | Types embedded in file — no inference, no drift |
| Schema enforcement | None — columns can change type between files silently | Schema embedded — mismatches caught immediately |
| Tooling compatibility | Universal | DuckDB, Spark, Pandas, BigQuery, Snowflake |
| Human readable? | Yes — opens in Excel or a text editor | No — binary format, requires a tool |
The one trade-off worth naming: Parquet cannot be opened in Excel. CSV exports from the mart layer cover human-facing review. Parquet handles everything else.
HOW DUCKDB ACCESSES DATA
DuckDB operates in two modes, often used together:
— Mode 1 — Direct File Querying
DuckDB queries Parquet files directly from disk, a mounted share, or cloud storage — no import step, no database to load. This is the primary mode for the raw and staging layers.
-- Reads all daily Parquet partitions for January 2025 directly from disk.
-- hive_partitioning=true exposes year/month/day as filterable columns;
-- DuckDB skips every partition that does not match the WHERE clause.
SELECT order_date, sku_id, SUM(net_revenue) AS revenue
FROM read_parquet('data/lake/raw/shopify/orders/**/*.parquet',
hive_partitioning = true)
WHERE year = 2025 AND month = 1
GROUP BY order_date, sku_id;
— Mode 2 — Persistent .duckdb File
DuckDB can maintain a persistent database file storing pre-computed tables — mart aggregations, reference data, pipeline state. Faster for repeated interactive queries; useful for analyst exploration in DBeaver or notebooks.
# Open the local .duckdb file in read-only mode to avoid write-lock conflicts.
# The mart table was materialised by the nightly pipeline; no re-computation needed.
con = duckdb.connect('data/warehouse/analytics.duckdb', read_only=True)
df = con.execute(
'SELECT * FROM mart_daily_sales WHERE order_date >= current_date - INTERVAL 30 DAYS'
).df()
| Mode 1: File Lake | Mode 2: .duckdb File | |
|---|---|---|
| Source of truth | Always — raw Parquet is permanent | Derived — rebuilt from Parquet anytime |
| Team sharing | Unlimited concurrent readers | Single writer; others use read-only copies |
| Network storage | Works on NAS, GCS, S3 | Local disk only — not safe on network shares |
| Best for | Pipeline processing, long-term storage | Interactive exploration, notebooks, DBeaver |
For Grow, the file lake is the primary architecture. The .duckdb file is a convenience layer — generated nightly, used locally for exploration, discarded when done. A hybrid pattern (Parquet for sharing, .duckdb for fast local queries, CSV for stakeholder exports) covers every access pattern without a shared database server.
The .duckdb file is optional. Parquet alone, queried via DuckDB's direct file mode, is a complete solution.
Key Considerations Before Adopting This Stack
Every data platform involves deliberate tradeoffs. The following are the ones worth examining before committing to this approach — not as objections, but as design decisions any adopting team should make consciously.
RAW DATA: IMMUTABLE OR OVERWRITE?
A common pattern with Parquet lakes is treating raw data as immutable — files written once, never modified. Corrections arrive as new files; history is always preserved. This makes the system fully auditable and reproducible, and it is the pattern that supports backtesting and model validation directly.
The simpler alternative is an overwrite model — re-extract and replace. Easier to manage, but history is lost. For a team that wants to validate model outputs against prior data states, immutability is worth the additional storage and the discipline of a defined retention policy.
VALIDATION: STRICT OR PERMISSIVE?
Pipelines can be configured to halt loudly when data quality checks fail — missing columns, type mismatches, implausible values. This prevents bad data from reaching downstream consumers but requires upfront investment in writing the rules. The alternative — log and continue — produces outputs that may be subtly wrong. For decision-support analytics, strict validation is generally worth the build cost.
BUSINESS LOGIC: SQL OR PYTHON?
SQL-first for transformations means logic is auditable, version-controlled, and readable by anyone who can write a SELECT statement. Python handles orchestration, file management, and model integration. Some teams prefer embedding more logic in Python — more flexible, harder to audit. The SQL-first approach tends to age better as teams grow.
SINGLE MACHINE CONSTRAINTS
DuckDB runs on one machine. For a team where one process runs pipelines and others consume outputs, this tends to be sufficient. The mitigation pattern — Parquet lake for sharing, .duckdb snapshot for local exploration, CSV for non-technical stakeholders — handles Grow's scale without a shared database server. If that changes, Section 11 covers migration paths.
WHAT THIS STACK DOES NOT PROVIDE
| Capability | Traditional Warehouse | DuckDB Stack | Risk at Grow's Scale |
|---|---|---|---|
| User authentication | OAuth, SSO, username/password | File system access only | Low — small trusted team |
| Access roles | Row/column-level permissions | All-or-nothing file permissions | Low — internal use only |
| Concurrent writes | Full MVCC | .duckdb: single writer. Parquet lake: safe if writing separate partitions | Low — pipelines naturally write distinct date partitions |
| High availability | Replication, failover | Single machine; no failover | Medium — mitigated by Parquet backups |
| Audit logging | Who ran what, when | Pipeline logs + Git only | Low — internal analytics use |
| Real-time ingestion | Streaming inserts | Batch-oriented only | Low — nightly batch is sufficient |
None of these represent material risk at Grow's current profile. If any become relevant as the business scales, they have defined migration paths.
CONCURRENCY
The single-writer constraint applies specifically to the .duckdb binary file — only one process should open it in read-write mode at a time. The Parquet lake has no such restriction. Because each pipeline run writes to its own date-partitioned path, multiple pipelines can write to the lake concurrently without any conflict — a Shopify extraction writing year=2025/month=01/day=14/ and an Amazon extraction writing year=2025/month=01/day=15/ are operating on completely separate files. The one scenario to avoid is two processes writing to the exact same partition path simultaneously; in practice this only arises if two backfill jobs are scheduled for the same source and the same date, which is straightforward to prevent with basic scheduling logic.
| Scenario | Safe? | Note |
|---|---|---|
| Multiple pipelines writing different date partitions | Yes | Separate file paths — no conflict possible |
| Multiple pipelines writing the same partition | No | Last writer wins or filesystem collision — prevent with scheduler locks |
| Multiple analysts reading the Parquet lake | Yes | Unlimited concurrent readers on any file |
| Nightly pipeline writing .duckdb while analyst reads | Risky | Analyst should query Parquet directly or use a local read-only copy |
| Two processes writing .duckdb simultaneously | No | DuckDB single-writer constraint — serialize with a scheduler |
SECURITY
Security is file system level — no row or column permissions. For Grow's internal team: raw directories writable by the pipeline only; mart exports read-only for analysts; credentials in environment variables, never in code; PII segregated at the staging layer.
BACKUP AND RECOVERY
Because raw Parquet is immutable and retained, the entire platform can be rebuilt by replaying the pipeline against the lake. Three tiers:
| Tier | What Is Restored | Recovery Time | Data Loss Risk |
|---|---|---|---|
| CSV restore | Latest mart exports | Minutes | Up to 24 hours |
| .duckdb restore | Modeled tables from backup | < 1 hour | Up to 24 hours |
| Full rebuild | Complete platform | Hours | None — Parquet is source of truth |
Raw Parquet files should be backed up nightly — only new partitions need copying since prior partitions are immutable.
HOW DUCKDB COMPARES TO ALTERNATIVES
| System | Type | Perf (<500 GB) | Scalability | Concurrency | Cost | Ops Complexity |
|---|---|---|---|---|---|---|
| DuckDB + Parquet | Embedded OLAP | Excellent | Single node to ~2 TB | Single writer | Free (open source) | Very low |
| SQLite | Embedded OLTP | Poor for analytics | Small datasets only | Single writer | Free | Very low |
| PostgreSQL | Client-server OLTP | Good (with indexes) | Vertical scaling | High (MVCC) | Free (hosting cost) | Medium |
| MariaDB | Client-server OLTP | Good (with indexes) | Vertical + replication | High | Free (hosting cost) | Medium |
| Snowflake | Cloud OLAP warehouse | Excellent | Virtually unlimited | High | High — compute + storage | Low (managed) |
| BigQuery | Cloud OLAP warehouse | Excellent | Virtually unlimited | High | Pay-per-query or flat rate | Low (managed) |
SQLite is not suited to analytical workloads. PostgreSQL and MariaDB make sense if Grow eventually needs a transactional database alongside analytics — they can coexist with DuckDB. Snowflake and BigQuery are the natural upgrade path if data volumes outgrow single-node performance; both read Parquet natively, making migration low-friction.
At Grow's current data volumes — likely under 50 GB of structured analytical data — DuckDB performs as well as or better than cloud warehouses on the queries that matter, at a fraction of the cost. Revisit Snowflake or BigQuery when volumes approach 500 GB or when multi-user write requirements emerge.
Section 4 Lake Storage Options: Where Should the Data Live?
Lake Storage Options: Where Should the Data Live?
The Parquet data lake is a folder of files — which means it can live almost anywhere that Python and DuckDB can reach. For Grow, operating on Google Workspace without a dedicated shared drive today, there are four realistic options. Each has a different profile of cost, setup complexity, performance, and team accessibility.
One rule applies across all options: the analytics.duckdb warehouse file must never live on a network share or cloud-synced folder. DuckDB's official documentation explicitly warns that read-write mode on NAS, SMB, NFS, or cloud-synced paths can cause data corruption and unpredictable errors. The .duckdb file always stays on the local machine running the pipeline. Only the Parquet lake files are shared.
OPTION 1: GOOGLE CLOUD STORAGE (GCS) — RECOMMENDED
Google Cloud Storage is the cleanest solution for a Google Workspace team. GCS is Google's object storage service — same Google account, same billing, but exposed as a storage bucket that DuckDB and Python can query directly without any desktop sync client.
pip install gcsfs google-cloud-storage
# Python: write Parquet directly to GCS
import gcsfs, pandas as pd
fs = gcsfs.GCSFileSystem(project='grow-fragrance')
df.to_parquet('gs://grow-lake/raw/shopify/orders/year=2025/month=01/day=15/extract.parquet',
filesystem=fs, index=False, compression='zstd')
# DuckDB: query GCS directly -- no download
import duckdb
con = duckdb.connect() # in-memory; .duckdb stays local
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""CREATE SECRET (TYPE gcs, KEY_ID '...', SECRET '...')""")
df = con.execute("""
SELECT year, month, SUM(net_revenue) AS revenue
FROM read_parquet('gs://grow-lake/raw/shopify/orders/**/*.parquet',
hive_partitioning = true)
WHERE year = 2025 GROUP BY 1,2 ORDER BY 2
""").df()
| Aspect | Detail |
|---|---|
| DuckDB support | Native — read_parquet('gs://...') works out of the box with the httpfs extension |
| Python support | gcsfs and google-cloud-storage are mature, well-supported libraries |
| Cross-platform | Identical on Mac, Windows, and Linux — no mount configuration per machine |
| Google account | Same Google Workspace project and billing as the rest of Grow's tools |
| Cost | Negligible at Grow's volumes — approximately $1–2/month for storage |
| Setup | One GCS bucket + HMAC keys for DuckDB; approximately 30 minutes to configure |
| Requires internet | Pipeline and queries need internet access; no offline operation |
OPTION 2: GOOGLE DRIVE FOR DESKTOP (MOUNTED LOCAL PATH)
Google Drive for Desktop mounts your Drive as a folder accessible through File Explorer on Windows (e.g. G:\) or Finder on Mac (/Volumes/GoogleDrive/). DuckDB and Python see it as a local path — no special extensions needed.
# Windows path (Drive for Desktop mounted as G:)
BASE = r'G:\My Drive\grow-analytics\data\lake'
# macOS path
BASE = '/Users/david/Library/CloudStorage/GoogleDrive-.../My Drive/grow-analytics/data/lake'
# DuckDB reads from the mounted path exactly like local disk
con = duckdb.connect('data/warehouse/analytics.duckdb') # LOCAL -- never on Drive
con.execute(f"SELECT * FROM read_parquet('{BASE}/raw/shopify/orders/**/*.parquet')")
| Aspect | Detail |
|---|---|
| DuckDB support | Full — reads the mounted path as local files; no extensions needed |
| Python support | Full — standard pathlib and pyarrow; no extra libraries |
| Cross-platform | Native on Mac and Windows; Linux requires rclone or ocamlfuse (extra setup) |
| Familiar interface | Files visible in the normal Drive UI |
| No extra cost | Included in Google Workspace subscription |
| .duckdb file | Must remain on local disk — never sync the .duckdb file to Drive |
| Linux support | Not officially supported; third-party tools required |
OPTION 3: ON-PREMISES NAS WITH SMB (SAMBA)
For a team that wants the lake on hardware they control — without relying on any cloud service — a Network-Attached Storage device running SMB via Samba is the most cross-platform on-premises option available.
— Why SMB / Samba for a Mixed Mac, Windows, Linux Team
SMB version 3 is the right protocol choice. AFP (Apple Filing Protocol) is deprecated on modern macOS and should not be used for new deployments. NFS has no native Windows client and requires third-party software. SMB v3 is natively supported by all three operating systems and provides the best combination of cross-platform compatibility, security, and performance for a mixed team.
— Hardware Options
Consumer NAS appliance (Synology, QNAP): pre-built, low-power, runs Samba out of the box. A 2-bay unit with 4–8 TB starts around $300–600. Recommended starting point.
Repurposed mini PC or workstation: install Ubuntu Server, configure Samba manually. More control, lower cost if hardware is available.
Raspberry Pi 5: extremely low-cost and low-power, adequate for Grow's data volumes.
— Mounting the Share on Each OS and DuckDB Usage
# macOS -- Finder: Go > Connect to Server > smb://nas-hostname/grow-lake
# Or terminal:
mount -t smbfs //david@nas-hostname/grow-lake /Volumes/grow-lake
# Windows -- Map network drive in File Explorer, or PowerShell:
net use G: \\\\nas-hostname\\grow-lake /persistent:yes
# Linux:
sudo apt install cifs-utils
sudo mount -t cifs //nas-hostname/grow-lake /mnt/grow-lake -o username=david,vers=3.0
# Once mounted, DuckDB reads it as a local path (READ only -- .duckdb stays local)
con = duckdb.connect('~/local/analytics.duckdb') # local machine only
df = con.execute("""
SELECT * FROM read_parquet('/Volumes/grow-lake/raw/shopify/orders/**/*.parquet',
hive_partitioning = true)
""").df()
# Minimal Samba share config (/etc/samba/smb.conf on the NAS/server):
# [grow-lake]
# path = /data/grow-lake
# valid users = david, louis, drew
# writable = yes
# create mask = 0664
| Aspect | Detail |
|---|---|
| DuckDB support | Parquet reads work fine over SMB mount; .duckdb file must stay local |
| Python support | Full — standard pathlib; no extra libraries once mounted |
| Cross-platform | SMB v3 native on Mac and Windows; cifs-utils on Linux (straightforward) |
| No cloud dependency | Fully on-premises; no internet required for lake access |
| Cost | One-time hardware cost (~$300–600 for a Synology starter); no ongoing fees |
| Setup effort | Moderate — hardware, OS config, Samba setup, mount config per machine |
| Remote access | VPN required for off-site access; not as seamless as cloud options |
| .duckdb file | Must remain on local disk — never store it on the NAS share |
OPTION 4: GOOGLE DRIVE API (PROGRAMMATIC ONLY)
Python can read and write files to Google Drive via the API without mounting the drive. This works for specific tasks — such as publishing CSV mart exports to a shared Drive folder — but is not a viable primary lake strategy. DuckDB cannot query Drive API paths directly and would require downloading files to a local temp directory first, which negates the partition pruning and direct-file querying advantages. Recommended only for the export/delivery layer, not the lake itself.
RECOMMENDATION SUMMARY
| Option | Best For | DuckDB Lake | .duckdb File | Linux | Setup Effort |
|---|---|---|---|---|---|
| GCS (Option 1) | Primary recommendation — cloud-native, zero hardware | Native gs:// paths | Local only | Full — no mount | Low (~30 min) |
| Drive for Desktop (Opt 2) | Team already uses Drive heavily; GCS not desired | Via local mount path | Local only | Third-party tools | Low (per machine) |
| On-prem NAS / SMB (Opt 3) | Data on-site; no cloud dependency; team on local network | Via SMB mount path | Local only | cifs-utils (easy) | Medium |
| Drive API (Opt 4) | CSV export delivery to stakeholders only | Not suitable | N/A | Full | Low |
For Grow's current setup, GCS is the cleanest path — zero hardware, negligible cost, same Google account, and DuckDB queries it natively. If the team prefers data on local hardware, a Synology NAS running SMB is the on-premises equivalent. Both patterns use identical Python and DuckDB code — only the path prefix changes.
Layered Data Architecture
A layered data architecture is a widely-used pattern for organizing analytical pipelines. The example below illustrates one common approach — a five-layer model in which data flows in one direction, from raw source extracts through to outputs consumed by downstream tools and stakeholders. This is not the only valid structure, and real implementations often compress or split layers based on team size, tooling preferences, and data complexity.
The following is an illustrative example of how this architecture could be applied. Layer boundaries, naming conventions, and storage choices should be treated as starting points rather than fixed requirements — the right structure depends on the specific sources, consumers, and team workflows involved.
| Layer | Purpose | Common Storage Format | Typical Owner |
|---|---|---|---|
| Raw | Source data as delivered — unchanged from extraction | Parquet (partitioned by date) | Pipeline / extraction scripts |
| Staging | Schema normalization, type casting, deduplication | Parquet or DuckDB tables | SQL models |
| Core | Cross-source integration, unified identifiers, standardized metrics | DuckDB tables or Parquet | SQL models |
| Marts | Domain-specific aggregations pre-built for reporting or decisions | DuckDB tables + CSV / Parquet exports | SQL models |
| Consumption Layer | Outputs served to any downstream consumer: BI tools, ML models, forecasting pipelines, reports, or end users | Parquet, CSV, or direct DuckDB queries | Domain owners |
RAW LAYER
One common approach treats the raw layer as an immutable record of data exactly as it arrived from source systems — no transformations, no cleaning. Files are typically partitioned by extraction date so that specific periods can be re-pulled or corrected without touching the rest of the history. Some implementations allow targeted overwrites of specific date partitions when source data is corrected; others prefer an append-only model with corrections tracked separately.
STAGING LAYER
The staging layer is where raw data is shaped into a consistent, queryable structure. In the context of Grow's normalization priorities, this is where the foundational work happens: unified SKU taxonomy enforcement, date/time standardization to UTC with daily grain, revenue normalization (net of returns and discounts), and deduplication of records introduced by API pagination. A common principle is that only structural conformance happens here — no business logic — which keeps the layer predictable and straightforward to validate.
CORE LAYER
Core models bring data from multiple sources into alignment — unifying order records from Shopify, Amazon, Faire, and Klaviyo under a shared customer identity, normalizing revenue to a net-of-returns basis across channels, aligning attribution windows across Meta Ads, Google Ads, and Triple Whale, and resolving cost basis differences between QuickBooks and production records. This is where Grow's cross-cutting normalization priorities become executable SQL, and where the single source of truth for cross-functional analysis lives.
MART LAYER
Marts are pre-aggregated, domain-specific views of core data, designed for a particular audience or decision. A mart built for channel P&L analysis looks quite different from one built for warehouse operations, even if both draw from the same core tables. Keeping marts purpose-built and narrow tends to make them faster, easier to maintain, and easier to document — though some teams prefer a smaller number of wider marts depending on their query patterns.
CONSUMPTION LAYER
The consumption layer is the interface between the data platform and everything that uses it. This is deliberately broad — it could be a BI dashboard querying a mart directly, a forecasting model reading a Parquet export, a Python notebook doing ad hoc analysis, an operational CSV report, or a data scientist connecting DBeaver to a local .duckdb snapshot. The key design principle is that this layer should consume clean, validated mart outputs rather than performing its own transformations.
BI and reporting: Tableau, Looker, Metabase, Google Sheets, or Power BI connected to mart exports or direct DuckDB queries
ML and forecasting pipelines: Python reading mart Parquet files as training datasets or feature stores
Operational outputs: scheduled CSV exports, email reports, or Slack-delivered summaries
Interactive analysis: DBeaver, DuckDB CLI, or notebooks pointed at a local .duckdb copy or Parquet lake
File-Based Data Lake Design
The raw and staging layers are implemented as a file-based data lake: a directory tree of Parquet files organized by partition keys. DuckDB reads these files directly using glob patterns, enabling SQL queries across the entire lake without first importing data into a database file.
Partitioning is the single most impactful design decision in a data lake. Without it, every query scans every file regardless of the date range being asked about — a query for last month reads a full year of data. With partitioning, DuckDB skips irrelevant directories at the filesystem level before opening a single file. As data accumulates, the performance gap widens: on a two-year daily lake, a one-month query touches ~4% of files without partitioning costing you. This is not premature optimization — it is a reasonable default from day one.
PARTITIONING STRATEGIES
Partitioning organizes files into sub-directories based on key values, so queries only need to read the subset of files relevant to the question being asked. DuckDB supports three main approaches.
— 1. Hive-Style Directory Partitioning
The most common and best-supported pattern. Directory names encode partition key and value using key=value syntax. DuckDB reads this automatically and exposes the partition keys as queryable columns, enabling efficient partition pruning.
data/lake/raw/shopify/orders/year=2025/month=01/day=15/extract.parquet
data/lake/raw/shopify/orders/year=2025/month=01/day=16/extract.parquet
— 2. Single Flat Folder
All Parquet files for a table sit in one directory. DuckDB uses a glob to read them together. Simplest approach — works well for lower-volume tables or mart outputs where partition pruning is not needed.
data/lake/raw/shopify/orders/orders_2025_01_15.parquet
data/lake/raw/shopify/orders/orders_2025_01_16.parquet
— 3. Filename-Encoded Partitioning
Date values are embedded in the filename rather than directory structure. DuckDB exposes the filename as a virtual column which can be parsed with string functions. Less ergonomic than Hive-style but common when files originate from external systems with fixed naming conventions.
-- Parse date from filename using the filename virtual column
SELECT
regexp_extract(filename, '(\d{4})_(\d{2})_(\d{2})', 0) AS file_date,
COUNT(*) AS rows
FROM read_parquet('data/lake/raw/shopify/orders/*.parquet', filename = true)
WHERE filename LIKE '%2025_01%'
GROUP BY file_date;
| Pattern | Best For | DuckDB Support | Tradeoffs |
|---|---|---|---|
| Hive-style directory | Daily extracts, time-series, high-volume tables | Native — partition columns auto-inferred | Deeper directory tree; more folders to manage |
| Single flat folder | Low-volume tables, reference data, mart outputs | Full glob support | No partition pruning — all files read on every query |
| Filename-encoded | Files from external systems with fixed naming | Via filename virtual column + string parsing | Manual parsing; fragile if naming changes |
PYTHON: IMPLEMENTING PARTITIONED WRITES
The pipeline scripts are responsible for constructing partition paths dynamically. Two functions cover the two main operational patterns.
— Full Historical Load
Pulls all available data from a source API and writes one Parquet file per day-partition. If a partition already exists it is overwritten — making this safe to re-run for historical corrections.
import pandas as pd
from pathlib import Path
from datetime import date, timedelta
# Writes df to a Hive-style partition directory derived from the target date.
# File is ZSTD-compressed; immediately queryable by DuckDB without import.
def write_partition(df: pd.DataFrame, base: str, d: date):
out = Path(base) / f'year={d.year}' / f'month={d.month:02d}' / f'day={d.day:02d}'
out.mkdir(parents=True, exist_ok=True)
df.to_parquet(out / 'extract.parquet', index=False, compression='zstd')
# Iterates over a date range, writing one Parquet partition per day.
# Overwrites existing partitions -- safe to re-run for corrections or backfills.
def full_historical_load(fetch_fn, base: str, start: date, end: date):
d = start
while d <= end:
df = fetch_fn(d)
if not df.empty:
write_partition(df, base, d)
d += timedelta(days=1)
— Incremental Daily Load
Runs each night. Fetches only yesterday's data and writes it to the correct partition. Existing partitions are untouched.
from datetime import date, timedelta
from pathlib import Path
# Nightly load: defaults to yesterday if no date is provided.
# Pass an explicit date to backfill or correct a specific partition.
# Prior partitions are never touched -- safe to schedule without side effects.
def incremental_daily_load(fetch_fn, base: str, target: date = None):
if target is None:
target = date.today() - timedelta(days=1)
df = fetch_fn(target)
if df.empty:
return # zero records on a date is valid -- not an error condition
out = Path(base) / f'year={target.year}' / f'month={target.month:02d}' / f'day={target.day:02d}'
out.mkdir(parents=True, exist_ok=True)
df.to_parquet(out / 'extract.parquet', index=False, compression='zstd')
PARTITIONING BENEFITS SUMMARY
| Benefit | How It Works in Practice |
|---|---|
| Query speed | DuckDB skips partitions that don't match a date filter — a query for January reads only January files |
| Targeted corrections | Overwrite only the affected day-partitions when source data is corrected |
| Incremental backups | Only new partitions need copying in nightly backups — prior partitions never change |
| Parallel processing | Different date ranges can be processed simultaneously by independent pipeline workers |
| Auditable history | Filesystem timestamps on partition files record exactly when each day's data was written or corrected |
Recommended Project Folder Structure
The following structure separates data (gitignored) from code (always versioned), and organizes code by function rather than layer. The analytics.duckdb file is explicitly local — it is never committed or shared via a network path.
📁 grow-analytics/
📁 data/
📁 lake/
📁 raw/
📁 shopify/orders/year=YYYY/month=MM/day=DD/
📁 amazon/orders/
📁 quickbooks/invoices/
📁 faire/orders/
klaviyo/
triple_whale/
📁 staging/
📁 core/
📁 marts/ ─── daily_sales/ channel_pnl/ inventory/
📁 warehouse/
📄 analytics.duckdb ← LOCAL ONLY — never shared
📁 exports/ ─── csv/ snapshots/
📁 pipelines/
📁 extract/ ─── shopify_orders.py amazon_orders.py ...
📁 transform/ ─── run_staging.py run_core.py run_marts.py
📁 load/ ─── export_csv.py
📁 sql/
📁 staging/ ─── stg_shopify_orders.sql stg_amazon_orders.sql
📁 core/ ─── core_orders.sql core_customers.sql
📁 marts/ ─── mart_daily_sales.sql mart_channel_pnl.sql
📁 validation/ ─── assert_positive_revenue.sql
📁 models/ ─── demand_forecasting/ channel_attribution/ quality_control/ ...
📁 notebooks/ ─── exploration/ analysis/
📁 logs/
📄 config/settings.yaml
📄 README.md
KEY DESIGN DECISIONS
data/ is separate from code/ — Parquet files and the .duckdb binary are gitignored; the scripts that generate them are always versioned
sql/ holds transformation logic as standalone files — models are reviewable and testable independently of Python orchestration
models/ has a directory per decision domain, aligned with the Consumption Layer outputs in Section 6
validation/ SQL scripts run at each layer boundary — not afterthoughts
Faire is included as a raw source from the outset
DuckDB SQL Capabilities
DuckDB's SQL dialect is ANSI-compliant and extends standard SQL with direct file-querying functions, automatic partition discovery, and a range of analytical capabilities that make it well-suited for data engineering and analytical workloads. The examples below progress from basic file access through to pipeline-integrated patterns, each with a description of what the query is doing and why it is written that way.
QUERYING CSV FILES DIRECTLY
read_csv_auto() opens a CSV file and automatically infers column names, data types, and delimiters from the file contents — no schema definition required. The query reads only the three specified columns (DuckDB applies column projection even on CSV) and filters for orders above $50. This is useful for quickly interrogating a raw export without creating any tables or importing data.
-- read_csv_auto infers column names, types, and delimiters from file contents.
-- Column projection and WHERE filtering reduce I/O to only what the query needs.
SELECT order_id, customer_id, total_price
FROM read_csv_auto('data/lake/raw/shopify/orders/2025-01-15.csv')
WHERE total_price > 50;
QUERYING JSON FILES DIRECTLY
DuckDB can query JSON files directly using read_json_auto(), which infers the schema from the file contents including nested objects and arrays. This is useful for interrogating raw API responses — for example, a Klaviyo event export or a Gorgias ticket dump — without first flattening or importing the data. DuckDB exposes nested keys using dot notation and automatically unnests arrays when needed.
-- read_json_auto infers schema from the file, including nested fields.
-- Dot notation accesses nested keys; arrow notation (->) also works.
-- Useful for quickly auditing raw API exports before committing to a schema.
SELECT
event_id,
properties.email AS customer_email,
properties.\$value AS event_value,
datetime AS event_ts
FROM read_json_auto('data/lake/raw/klaviyo/events/2025-01-15.json')
WHERE event_type = 'Placed Order'
ORDER BY event_ts DESC
LIMIT 100;
CONVERTING JSON TO PARQUET FOR THE LAKE
Raw API responses often arrive as JSON. The recommended pattern is to query the JSON file directly with DuckDB — flattening any nested fields into a clean tabular structure — then write the result to a ZSTD-compressed Parquet file at the correct Hive partition path. This converts a verbose, uncompressed JSON export into a compact, queryable Parquet partition in a single SQL statement, with no intermediate DataFrame or pandas step required.
-- Reads a raw JSON export, flattens nested properties into columns,
-- and writes the result directly to the Hive-partitioned lake as Parquet.
-- A 50 MB JSON file typically compresses to ~5-8 MB Parquet with ZSTD.
COPY (
SELECT
event_id::VARCHAR AS event_id,
properties.email::VARCHAR AS customer_email,
properties.\$value::DOUBLE AS event_value,
datetime::TIMESTAMP AS event_ts,
date_trunc('day', datetime::TIMESTAMP)::DATE AS event_date
FROM read_json_auto('data/lake/raw/klaviyo/events/2025-01-15.json')
WHERE event_type = 'Placed Order'
)
TO 'data/lake/raw/klaviyo/events/year=2025/month=01/day=15/extract.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
-- The resulting Parquet file is immediately queryable via partition pruning:
-- SELECT * FROM read_parquet('data/lake/raw/klaviyo/events/**/*.parquet',
-- hive_partitioning = true)
-- WHERE year = 2025 AND month = 1;
QUERYING PARQUET WITH AN EXPLICIT SCHEMA
read_parquet() with a columns parameter tells DuckDB exactly what types to expect for each column rather than inferring them from the file metadata. This matters in production pipelines where type drift between API versions can cause silent coercion — for example, a field that was an integer in one extract becoming a float in a later one. By specifying the schema explicitly, any mismatch raises an error at read time rather than producing subtly wrong downstream values. The query then selects only three of those columns; Parquet's columnar format means the remaining columns in the file are never read from disk.
-- Explicit schema enforcement: type mismatches raise an error at read time
-- rather than silently coercing values -- important for production pipelines.
-- Only the three named columns are read; the rest are skipped by Parquet's columnar layout.
SELECT order_id, order_date, net_revenue
FROM read_parquet(
'data/lake/raw/shopify/orders/year=2025/month=01/day=15/extract.parquet',
columns = {
'order_id': 'VARCHAR',
'order_date': 'DATE',
'net_revenue': 'DOUBLE'
}
);
QUERYING A FLAT FOLDER WITH A GLOB PATTERN
The *.parquet glob pattern tells DuckDB to read every Parquet file in the orders/ directory and treat them as a single unified table. This works regardless of how many files are present — whether the folder contains one file or thousands, the query syntax is identical. date_trunc('month', order_date) truncates the order_date column to the first day of each month, which groups all daily records into monthly buckets. The result is a monthly revenue summary across the entire history of the flat folder.
-- Reads all .parquet files in the folder as one unified table.
-- date_trunc truncates order_date to the first of each month,
-- grouping all daily records into monthly revenue buckets.
SELECT date_trunc('month', order_date) AS month,
SUM(net_revenue) AS revenue
FROM read_parquet('data/lake/raw/shopify/orders/*.parquet')
GROUP BY 1
ORDER BY 1;
HIVE PARTITION DISCOVERY AND PRUNING
The ** glob recursively descends through all sub-directories. With hive_partitioning = true, DuckDB reads the key=value directory names (year=2025, month=01, day=15) and exposes them as virtual columns on every row — without those values actually being stored inside each Parquet file. The WHERE clause filters on those virtual columns before any files are opened: DuckDB physically skips every partition directory that does not match year=2025 AND month=1, reading only the January 2025 files from disk. On a lake with two years of daily data, this reduces I/O from ~730 files to ~31 files for a single-month query.
-- hive_partitioning=true exposes year/month/day as virtual columns from directory names.
-- DuckDB skips all directories that do not match the WHERE clause at the filesystem level --
-- no data is read from non-matching partitions regardless of lake size.
SELECT year, month, COUNT(*) AS order_count
FROM read_parquet(
'data/lake/raw/shopify/orders/**/*.parquet',
hive_partitioning = true
)
WHERE year = 2025 AND month = 1
GROUP BY year, month;
FILENAME-BASED PARTITION DISCOVERY
When files use a naming convention rather than Hive directories — for example orders_2025_01_15.parquet — the filename = true parameter exposes the full file path as a virtual column called filename on every row. regexp_extract() then parses date components out of that string using a capture group pattern: (\d{4})_(\d{2})_(\d{2}) matches four digits, underscore, two digits, underscore, two digits, and the third argument (0) returns the entire match rather than a specific group. The WHERE clause filters by filename string before the file contents are read, limiting I/O to files whose names contain 2025_01.
-- filename=true adds the full file path as a virtual column on every row.
-- regexp_extract parses the YYYY_MM_DD date from the filename string.
-- The LIKE filter limits which files DuckDB opens before reading any data.
SELECT
regexp_extract(filename, '(\d{4})_(\d{2})_(\d{2})', 0) AS file_date,
COUNT(*) AS rows,
SUM(net_revenue) AS revenue
FROM read_parquet(
'data/lake/raw/shopify/orders/*.parquet',
filename = true
)
WHERE filename LIKE '%2025_01%'
GROUP BY file_date
ORDER BY file_date;
UNION ALL ACROSS HETEROGENEOUS SOURCES
UNION ALL concatenates the rows from two queries into a single result set without deduplication — each row from both sources appears in the output. The challenge with combining Shopify and Amazon orders is that each source has source-specific columns that do not exist in the other (shopify_cart_token and shopify_discount_code exist only on Shopify records; amazon_asin and amazon_fulfillment_channel exist only on Amazon records). The EXCLUDE keyword drops those named columns from the SELECT *, leaving only the shared columns that both sources have in common. A literal string 'shopify' AS source and 'amazon' AS source tags each row with its origin so downstream queries can filter or group by channel. The result is a single unified orders table spanning both source systems.
-- EXCLUDE drops source-specific columns so both SELECT * lists produce matching schemas.
-- A literal channel string tags each row with its origin for downstream filtering.
-- UNION ALL appends without deduplication -- every row from both sources is preserved.
SELECT * EXCLUDE (shopify_cart_token, shopify_discount_code),
'shopify' AS source
FROM read_parquet(
'data/lake/raw/shopify/orders/**/*.parquet',
hive_partitioning = true
)
UNION ALL
SELECT * EXCLUDE (amazon_asin, amazon_fulfillment_channel),
'amazon' AS source
FROM read_parquet(
'data/lake/raw/amazon/orders/**/*.parquet',
hive_partitioning = true
);
WRITING QUERY RESULTS TO PARQUET
The COPY statement executes the inner SELECT query and writes its output directly to a Parquet file rather than returning rows to the client. FORMAT PARQUET specifies the output format; COMPRESSION ZSTD applies ZSTD compression, which typically reduces file size by 60–80% compared to uncompressed Parquet while remaining fast to decompress during reads. The inner query aggregates daily sales from core_orders — a DuckDB table built by the staging pipeline — grouping by date, SKU, and channel. This is how the mart layer is materialized: the pipeline runs this COPY statement nightly, replacing the previous mart file with a fresh aggregation over the full history.
-- COPY executes the inner SELECT and writes the result directly to a Parquet file.
-- ZSTD compression reduces output size ~60-80% vs uncompressed Parquet.
-- Replaces the mart file on each run with a fresh full-history aggregation.
COPY (
SELECT
order_date,
sku_id,
channel,
SUM(units_sold) AS units_sold,
SUM(net_revenue) AS net_revenue
FROM core_orders
GROUP BY 1, 2, 3
) TO 'data/lake/marts/daily_sales.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
EXPORTING QUERY RESULTS TO CSV
CSV exports are how the platform delivers outputs to non-technical stakeholders, dashboards, and any downstream tool that cannot query DuckDB or Parquet directly. The COPY statement works identically to the Parquet export above, but with FORMAT implied by the .csv extension. HEADER writes column names as the first row — required by most dashboard tools. The INTERVAL 90 DAYS expression subtracts 90 days from today's date at query execution time, so the filter window automatically rolls forward each day without any hardcoded dates. The Python loop variant below shows how to export all mart tables in one pass: it connects to the .duckdb file in read-only mode, iterates over a dictionary of mart names and their SQL queries, and writes each result to a named CSV file while printing a row count for the run log.
-- HEADER writes column names as row 1, required by most dashboard tools.
-- current_date - INTERVAL 90 DAYS evaluates at query time -- rolling window,
-- no hardcoded dates to update.
COPY (
SELECT
order_date,
channel,
sku_id,
SUM(units_sold) AS units_sold,
SUM(net_revenue) AS net_revenue
FROM mart_daily_sales
WHERE order_date >= current_date - INTERVAL 90 DAYS
GROUP BY order_date, channel, sku_id
ORDER BY order_date, channel
) TO 'data/exports/csv/daily_sales_90d.csv' (HEADER, DELIMITER ',');
# Opens analytics.duckdb read-only to prevent write-lock conflicts during export.
# Iterates each mart, writes a CSV, and logs row count for pipeline monitoring.
import duckdb
from pathlib import Path
EXPORTS = {
'daily_sales': 'SELECT * FROM mart_daily_sales',
'channel_pnl': 'SELECT * FROM mart_channel_pnl',
'sku_velocity': 'SELECT * FROM mart_sku_velocity',
}
def export_marts(db_path: str, export_dir: str):
con = duckdb.connect(db_path, read_only=True)
Path(export_dir).mkdir(parents=True, exist_ok=True)
for name, query in EXPORTS.items():
out = f'{export_dir}/{name}.csv'
con.execute(f"COPY ({query}) TO '{out}' (HEADER, DELIMITER ',')")
n = con.execute(f'SELECT COUNT(*) FROM ({query})').fetchone()[0]
print(f' {name}.csv: {n:,} rows -> {out}')
con.close()
export_marts('data/warehouse/analytics.duckdb', 'data/exports/csv')
RUNNING SQL FILES FROM PYTHON
In a well-structured pipeline, SQL transformation logic lives in standalone .sql files — not embedded as strings inside Python scripts. This keeps business logic readable, reviewable in a code editor, and independently testable. The Python layer's only job is to read each file in dependency order, execute it against DuckDB, and validate that the result is non-empty before proceeding to the next model. Path(sql_path).read_text() reads the entire .sql file as a string, which is then passed directly to con.execute(). The table name is derived from the filename stem (stg_shopify_orders.sql becomes stg_shopify_orders), enabling a generic row-count check without hardcoding table names. If any model produces zero rows the pipeline aborts with a descriptive error, preventing downstream models from running against empty inputs.
# Reads each SQL model file from disk and executes it against the open DuckDB connection.
# Row count check after each model: empty output halts the pipeline immediately,
# preventing downstream models from running against missing or failed staging data.
import duckdb
from pathlib import Path
import sys
DB = 'data/warehouse/analytics.duckdb'
SQL = Path('sql/staging')
MODELS = [
'stg_shopify_orders.sql',
'stg_amazon_orders.sql',
'stg_quickbooks.sql',
]
def run_staging():
con = duckdb.connect(DB)
for model in MODELS:
path = SQL / model
sql = path.read_text() # read the .sql file from disk
con.execute(sql) # execute it against DuckDB
table = path.stem # derive table name from filename
n = con.execute(f'SELECT COUNT(*) FROM {table}').fetchone()[0]
print(f' {table}: {n:,} rows')
if n == 0:
print(f'ERROR: {table} is empty -- aborting')
con.close()
sys.exit(1)
con.close()
Python Pipeline Architecture
This section focuses on how DuckDB integrates into an extraction pipeline — not how to build the pipeline itself. The assumption is that the implementing engineer is comfortable with scheduling, API clients, and DataFrame processing. What follows is specifically how DuckDB fits into each stage.
WHERE DUCKDB SITS IN THE PIPELINE
DuckDB is not involved in extraction or API calls. It enters the picture at two points: writing validated data to the lake as Parquet (indirectly — via pyarrow or pandas), and querying that data in the transform layer via SQL. The extract step is standard Python; DuckDB takes over from staging onward.
| Step | DuckDB Involved? | What DuckDB Does |
|---|---|---|
| Extract | No | Standard API client — requests, pagination, auth |
| Normalize | No | Standard pandas — flatten JSON, cast types, derive fields |
| Validate | No | Standard assertions — check columns, nulls, value ranges |
| Write | Indirectly | DataFrame written to Hive partition path via df.to_parquet(..., compression='zstd') |
| Stage | Yes | read_parquet() scans the raw lake; SQL model normalizes and loads a staging table |
| Transform | Yes | SQL models build core and mart tables from staged data |
| Export | Yes | COPY statement writes mart outputs to CSV or Parquet |
WRITING THE RAW PARQUET FILE
The extraction step ends by writing the validated DataFrame to the correct Hive partition. This is the only DuckDB-adjacent step in extraction — the partition path convention is what makes the lake queryable later.
# Constructs the Hive partition path from the target date and writes the DataFrame.
# ZSTD compression; the resulting file is immediately queryable by DuckDB.
out_dir = base / f'year={d.year}' / f'month={d.month:02d}' / f'day={d.day:02d}'
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / 'extract.parquet', index=False, compression='zstd')
# Result path: data/lake/raw/shopify/orders/year=2025/month=03/day=15/extract.parquet
READING THE LAKE IN THE TRANSFORM STAGE
Once the Parquet file is on disk, DuckDB reads it directly via read_parquet() with hive_partitioning enabled. The staging SQL model runs against this scan and writes a normalized table. A row count check confirms the table is non-empty before any downstream model runs.
import duckdb
con = duckdb.connect('data/warehouse/analytics.duckdb')
con.execute(Path('sql/staging/stg_shopify_orders.sql').read_text())
# Validate: staging table must be non-empty before proceeding
row_count = con.execute('SELECT COUNT(*) FROM stg_shopify_orders').fetchone()[0]
if row_count == 0:
raise RuntimeError(
'stg_shopify_orders is empty. '
'Check that raw Parquet files exist and the SQL model ran cleanly.'
)
WRITING MART OUTPUTS
The transform stage ends by materializing marts. DuckDB writes these directly to Parquet or CSV via the COPY statement — no pandas, no intermediate objects.
# Materialize mart to Parquet (portable, queryable by any downstream tool)
con.execute("""
COPY (SELECT * FROM mart_daily_sales)
TO 'data/lake/marts/daily_sales/latest.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD)
""")
# Export to CSV for stakeholder delivery or co-work function input
con.execute("""
COPY (
SELECT * FROM mart_channel_pnl
WHERE order_date >= current_date - INTERVAL 90 DAYS
)
TO 'data/exports/csv/channel_pnl_90d.csv' (HEADER, DELIMITER ',')
""")
con.close()
DuckLake: A Possible Next Step
As Grow's data platform matures, there is a named upgrade path that adds more robust data management without moving to cloud infrastructure. It is called DuckLake, and it is worth knowing it exists — even though Grow does not need it at the outset.
WHAT IS DUCKLAKE?
DuckLake is an optional layer that sits on top of the existing Parquet-based lake and adds transaction management. Think of it as giving the data lake a proper change log. Without DuckLake, if a pipeline run fails halfway through writing a new batch of files, the lake could be left in a partial state. With DuckLake, either the entire batch commits successfully, or none of it does — no in-between.
DuckLake also adds the ability to query the lake as it existed at a specific point in time — for example, asking what the sales data looked like on a specific date before a correction was applied. And it allows table schemas to evolve — adding or renaming columns — without rebuilding the entire dataset.
HOW IT FITS INTO THIS ARCHITECTURE
DuckLake is not a replacement for the Parquet lake or DuckDB. It is an extension. The underlying data still lives in Parquet files. DuckDB still executes all queries. Python still orchestrates the pipeline. DuckLake adds a lightweight metadata catalog — stored in DuckDB itself — that tracks what changed, when, and in what order. Existing pipelines can be migrated to DuckLake incrementally.
WHEN WOULD GROW CONSIDER DUCKLAKE?
Pipeline failures are leaving the lake in partial states requiring manual cleanup
The team wants to query historical snapshots for backtesting or audit purposes
Schema changes (adding new source fields) are becoming difficult to manage across existing Parquet files
Multiple pipeline writers need to run concurrently without file conflicts
DuckLake is a well-supported, named upgrade path that keeps the team on the DuckDB ecosystem. It is not a migration to a new tool — it is the same tool with more capabilities turned on when needed.
Migration Paths
One of the strongest arguments for this architecture is that it is designed to migrate cleanly if the business outgrows it. Three properties make this true: business logic lives in plain SQL files, Parquet is supported natively by every major target system, and Python pipeline scripts are engine-agnostic. In most cases, migrating the query layer means changing a connection string and making minor SQL dialect adjustments — the data and the logic move intact.
MIGRATION TO CLOUD WAREHOUSES
— Snowflake
Parquet files load directly into Snowflake's internal stage via the COPY INTO command. SQL transformation models require minimal modification — Snowflake's SQL dialect is highly compatible with DuckDB's. The primary migration effort is authentication setup, storage integration configuration, and warehouse sizing decisions.
— BigQuery
BigQuery can query Parquet files stored in Google Cloud Storage directly as external tables, or load them into native tables. Since GCS is already the recommended lake storage option for Grow, this migration path requires no data movement whatsoever — BigQuery simply points at the existing GCS bucket. SQL models require minor dialect adjustments, primarily around date functions.
— Databricks
Databricks supports Delta Lake natively. The migration involves converting the Parquet lake to Delta tables using the Delta Lake Python library, after which Databricks SQL or Spark SQL can execute equivalent transformation logic with minimal changes.
MIGRATION TO DISTRIBUTED SQL
— Trino
Trino queries Parquet files on local filesystems or cloud object stores via its Hive connector. SQL models are highly portable to Trino's dialect. The primary code change is replacing DuckDB Python connections with Trino JDBC or HTTP connections.
— Apache Spark
Spark reads Parquet natively and supports equivalent SQL via Spark SQL. Migration effort is higher than Trino due to Spark's different execution model and cluster management requirements. The Parquet lake remains the clean handoff point.
MIGRATION TO LOCAL SQL SERVERS
— PostgreSQL
For teams that eventually need multi-user write access or OLTP capabilities alongside analytics, PostgreSQL is the natural local alternative. A foreign data wrapper mechanism allows PostgreSQL to query existing Parquet files via DuckDB during a transitional period — the two systems can coexist while migration proceeds incrementally.
— MySQL / MariaDB
MySQL and its open-source fork MariaDB are widely deployed relational databases suited to transactional workloads. MariaDB's ColumnStore engine adds columnar analytics capability. For teams already operating MySQL infrastructure, MariaDB is a reasonable migration target for the transactional layer, with DuckDB continuing to serve the analytical workloads where columnar performance matters. SQL dialect differences (particularly around window functions and date handling) require some model adjustment.
WHY MIGRATION IS MANAGEABLE
Business logic lives in SQL files — not in vendor-specific stored procedures or proprietary transformation tools
Parquet is a first-class format for every migration target listed above — no data conversion required
Python pipeline orchestration is engine-agnostic — swapping the DuckDB connection for Snowflake or Trino is typically a one-line change per script
No ETL platform lock-in — the pipeline is plain Python, not a vendor's proprietary workflow engine
If the recommendation today is DuckDB, the option tomorrow is Snowflake or BigQuery — and the path between them is well-paved. Choosing DuckDB now does not foreclose anything.
Scaling Guidelines
The architecture evolves in predictable steps as data volume grows. The following thresholds provide a practical framework for when to consider each transition. They are approximate — actual performance depends on query complexity, available RAM, and storage speed — but they give the team a concrete basis for planning.
| Data Volume | Recommended Architecture | Notes |
|---|---|---|
| < 100 GB | DuckDB + Parquet (current recommendation) | Full dataset fits in memory on modern hardware; queries complete in seconds; Grow is here now |
| 100 GB – 500 GB | DuckDB + Partitioned Parquet Lake | Partition pruning becomes important; increase pipeline machine RAM to 32–64 GB |
| 500 GB – 2 TB | DuckDB + Partitioned Lake + DuckLake | Evaluate DuckLake for transactional guarantees; DuckDB out-of-core execution handles this range |
| 2 TB+ | Distributed engine: Snowflake, BigQuery, Trino | Single-node ceiling reached; migrate using paths in Section 14 |
PERFORMANCE LEVERS BEFORE CONSIDERING MIGRATION
Before concluding that data volume requires a distributed system, exhaust the following optimizations — they often yield 5–10x query speedups on the existing stack:
Partition pruning: ensure queries filter on partition keys (year, month, day) so DuckDB skips irrelevant Parquet files entirely
ZSTD compression: Parquet files compressed with ZSTD are significantly smaller and faster to scan than the Snappy default
Column projection: select only needed columns rather than SELECT * — Parquet only reads what is requested
Pre-aggregation: push aggregations into the mart layer rather than computing them at query time
Memory allocation: explicitly configure DuckDB's memory_limit and thread count for the available hardware
-- Set DuckDB memory and parallelism at the start of a session
SET memory_limit = '24GB';
SET threads = 8;
SET enable_progress_bar = true;
If Adopted: Suggested Next Steps
The approach described in this document is a proposal based on Grow's current data profile, team structure, and the analytical objectives outlined in the existing strategy framework. Whether this is the right path is ultimately a business decision — but if the direction resonates, the following sequencing tends to work well.
WHY THIS STACK TENDS TO WORK WELL AT THIS STAGE
At Grow's data volumes, DuckDB performs comparably to cloud warehouses on the queries that matter — without the billing overhead, cluster management, or provisioning delays. The infrastructure cost is storage only. Every component is based on open standards, so no future migration is foreclosed. And the architecture can be operational in weeks, not months.
More importantly: this approach maps directly onto the data quality and normalization framework already in place. The seven normalization priorities — unified SKU taxonomy, date standardization, revenue normalization, attribution window alignment, cost basis standardization, customer identity resolution, and units of measure — each have a natural home in the staging and core layers of this architecture. Every pipeline stage applies explicit validation rules; every SQL model produces a guaranteed, auditable output. The stack doesn't sit alongside the strategy; it runs it.
SUGGESTED SEQUENCING
Step 1 — Storage: decide where the lake lives. GCS is the path of least resistance for a Google Workspace team. On-premises NAS is a reasonable alternative if cloud storage is not preferred. This decision shapes everything downstream.
Step 2 — Data audit: before writing a line of pipeline code, assess what's actually available from each source — Shopify, Amazon, QuickBooks, Faire. Known gaps (stockout periods, COGS granularity, attribution window mismatches) should be surfaced early so the staging layer can account for them.
Step 3 — First pipeline: stand up one extraction script for Shopify orders. Write the Parquet file. Query it with DuckDB. Validate the output. This is the proof of concept — once it works end-to-end on one source, the pattern is established for all others.
Step 4 — Staging and core models: build the normalization layer with validation rules at each boundary. This is where data quality rules get implemented. It is the highest-leverage investment in the platform.
Step 5 — Mart layer: with reliable core data, build the domain-specific aggregations. These are what feed dashboards, decisions, and the consumption layer.
Step 6 — Consumption layer: connect mart outputs to whatever tools the team already uses — dashboards, notebooks, forecasting pipelines, or direct analyst access via DBeaver.
This document is a reference, not a mandate. The implementation choices that matter most — which sources to prioritize, how strict validation should be, how to structure the mart layer for Grow's specific decision domains — are best made by the team closest to the business. This guide is intended to give that team a well-reasoned starting point.
Glossary and Key Definitions
This glossary defines technical concepts, acronyms, and architectural terms used throughout this document at or above an upper-division computer science level. Terms are grouped by category.
STORAGE AND FILE FORMATS
— Parquet
An open-source, columnar storage file format developed by the Apache Software Foundation. Unlike row-oriented formats (CSV, JSON), Parquet stores data column-by-column, enabling queries that read only the columns they need. It includes built-in compression, schema metadata, and min/max statistics per column chunk for fast filtering.
— Columnar Storage
A data layout where all values of a single column are stored contiguously on disk. Contrast with row storage, where all fields of a single record are stored together. Columnar storage is dramatically more efficient for analytical queries (which typically aggregate a few columns across many rows) because irrelevant columns are never read.
— ZSTD (Zstandard)
A lossless data compression algorithm developed by Facebook. Used in Parquet files to reduce storage size. ZSTD provides a better compression ratio than the older Snappy format (Parquet's default) while maintaining fast decompression speeds. Recommended for all Parquet files in this architecture.
— Glob Pattern
A string using wildcard characters to match file paths. In this architecture, *.parquet matches all Parquet files in a single directory, and **/*.parquet recursively matches all Parquet files in a directory tree. DuckDB uses glob patterns to read multiple files as a single virtual table.
— Arrow (Apache Arrow)
An open-source, in-memory columnar data format and inter-process communication standard. DuckDB uses Arrow internally for its execution engine and can exchange data with Pandas, Spark, and other tools via Arrow without copying or serializing data. When DuckDB returns results as a DataFrame, Arrow is the bridge.
DATABASE AND QUERY ENGINE CONCEPTS
— OLAP (Online Analytical Processing)
A category of database workload optimized for complex analytical queries over large datasets — aggregations, groupings, time-series analysis, and multi-dimensional exploration. Contrasts with OLTP. DuckDB is an OLAP engine.
— OLTP (Online Transaction Processing)
A category of database workload optimized for high-volume, low-latency read and write operations — point lookups, inserts, updates, and deletes. Typical of application databases (e-commerce order systems, CRM records). PostgreSQL and MySQL are OLTP engines.
— Vectorized Execution
A query execution strategy where operations are applied to entire batches (vectors) of column values at once, rather than one row at a time. Vectorized execution exploits modern CPU SIMD (Single Instruction, Multiple Data) instructions, enabling significantly higher throughput for analytical computations.
— Predicate Pushdown
An optimization where filter conditions (WHERE clauses) are applied as early as possible in query execution — ideally at the file scan level — so that irrelevant data is never read into memory. DuckDB pushes predicates down into Parquet scans, using the file's built-in column statistics to skip row groups that cannot satisfy the filter.
— Partition Pruning
The process of skipping entire file system partitions (directories or files) based on query filter conditions. When a query filters on year=2025 AND month=1 against a Hive-partitioned lake, DuckDB's partition pruning means only the year=2025/month=01/ directories are opened — all others are ignored at the filesystem level before any data is read.
— MVCC (Multi-Version Concurrency Control)
A concurrency model used by traditional databases (PostgreSQL, MySQL) that allows multiple transactions to read and write data simultaneously without blocking each other by maintaining multiple versions of each row. DuckDB does not implement MVCC — it uses a single-writer model instead.
— In-Process Database
A database engine that runs within the same operating system process as the application that uses it, rather than as a separate server process. DuckDB is in-process: a Python script imports it as a library, and the query engine runs inside the Python interpreter. No network connection, authentication, or daemon management is required.
DATA LAKE AND PIPELINE ARCHITECTURE
— Data Lake
A storage architecture where raw data from multiple source systems is collected in its original format — typically as files — before any transformation or schema enforcement. In this architecture, the raw/ and staging/ layers of the Parquet directory tree constitute the data lake.
— Hive Partitioning
A directory naming convention originating in the Apache Hive ecosystem where partition key and value are encoded directly in directory names using key=value syntax (e.g. year=2025/month=01/day=15/). DuckDB natively reads and automatically exposes these directory names as virtual columns, enabling partition pruning without any metadata catalog.
— ETL / ELT
Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) are two approaches to data pipeline architecture. ETL transforms data before loading it into the destination system. ELT loads raw data first, then transforms it inside the destination. This architecture follows ELT: raw Parquet files are loaded to the lake first, then transformed by DuckDB SQL models inside the warehouse.
— Idempotent
An operation is idempotent if running it multiple times produces the same result as running it once. In pipeline design, idempotent extraction functions can be re-run for any date without risk of duplicate or corrupted data — the result is always the same Parquet file for that partition. This is what makes historical backfills and correction re-runs safe.
— Medallion Architecture
A layered data architecture pattern (also called bronze/silver/gold or raw/staging/mart) where data is progressively refined through defined layers. Raw data lands in the first layer unchanged; each subsequent layer adds normalization, integration, and aggregation. The five-layer model in Section 5 (Raw, Staging, Core, Mart, Consumption) is an implementation of this pattern.
— DAG (Directed Acyclic Graph)
A mathematical structure used to represent dependencies between tasks where edges have direction and no cycles exist. In pipeline orchestration, a DAG defines the order in which pipeline steps must run — staging must complete before core models, core models before marts, etc. Tools like Prefect and Dagster manage pipeline DAGs explicitly.
INFRASTRUCTURE AND NETWORKING
— NAS (Network-Attached Storage)
A dedicated file storage device connected to a local network, accessible to multiple machines as a shared file system. In this architecture, a NAS running Samba/SMB is one option for hosting the Parquet data lake. DuckDB can read Parquet files from a mounted NAS share but must not run its .duckdb database file on a NAS due to file locking limitations.
— SMB / Samba
Server Message Block (SMB) is a network file sharing protocol natively supported by Windows, macOS, and Linux. Samba is the open-source Linux/Unix implementation of SMB. SMB v3 is the recommended protocol for a cross-platform team sharing a NAS-hosted data lake.
— GCS (Google Cloud Storage)
Google's object storage service — a cloud-based system for storing and accessing files (objects) at any scale. Analogous to AWS S3. In this architecture, GCS is the recommended home for the Parquet data lake: DuckDB queries it natively via the httpfs extension using gs:// paths, and Python writes to it via the gcsfs library.
— NFS (Network File System)
A distributed file system protocol enabling file access over a network, common on Unix/Linux systems. DuckDB's documentation explicitly warns against running a .duckdb file in read-write mode on NFS shares due to locking and consistency issues.
ACRONYMS QUICK REFERENCE
| Acronym | Stands For | Context in This Document |
|---|---|---|
| OLAP | Online Analytical Processing | DuckDB's workload category — aggregations, reporting, time-series |
| OLTP | Online Transaction Processing | PostgreSQL/MySQL workload — application databases, point lookups |
| ETL | Extract, Transform, Load | Pipeline pattern; this architecture uses ELT |
| ELT | Extract, Load, Transform | Raw data lands first; DuckDB transforms it in place |
| DAG | Directed Acyclic Graph | Dependency graph for pipeline step ordering |
| NAS | Network-Attached Storage | On-premises shared storage option for the Parquet lake |
| SMB | Server Message Block | Cross-platform file sharing protocol; Samba is its Linux implementation |
| GCS | Google Cloud Storage | Recommended cloud storage for the Parquet lake |
| NFS | Network File System | Unix/Linux file sharing protocol; not safe for .duckdb files |
| ZSTD | Zstandard | Compression algorithm used for Parquet files |
| MVCC | Multi-Version Concurrency Control | Concurrency model in PostgreSQL/MySQL; not present in DuckDB |
| SIMD | Single Instruction, Multiple Data | CPU feature exploited by DuckDB's vectorized execution engine |
| API | Application Programming Interface | How the pipeline connects to Shopify, Amazon, QuickBooks, etc. |
| PII | Personally Identifiable Information | Customer names, emails, addresses — segregated at the staging layer |
| CSV | Comma-Separated Values | Human-readable tabular file format; used for mart exports to stakeholders |
| JSON | JavaScript Object Notation | Common API response format; DuckDB queries JSON files directly via read_json_auto() |
| SQL | Structured Query Language | The query language used by DuckDB and all transformation models |
| DTC | Direct-to-Consumer | Grow's primary sales channel via Shopify |