Query your Azure Data Lake using DuckDB

Published by Eliot LAMBOROT
Category: Azure / Data
29/01/2026

DuckDB is a high-performance, in-process SQL OLAP (Online Analytical Processing) database management system. Unlike traditional client-server databases, DuckDB runs embedded within your application – similar to SQLite, but optimized for analytical workloads rather than transactional ones.

Think of it as “SQLite for analytics” – it requires zero configuration, has no external dependencies, and can process complex analytical queries on datasets ranging from megabytes to hundreds of gigabytes directly on your local machine.

Modes of Operation

 
DuckDB offers two primary modes of operation to fit different use cases:

  • In-Memory (Transient): Data exists only for the duration of the process. Perfect for exploratory analysis, temporary transformations, and one-off queries where persistence isn’t required.
  • Persistent (Disk-Based): Data is stored in a single portable file. Ideal for building local data warehouses, caching processed results, or maintaining analytical datasets across sessions (both modes are sketched below).
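
Both modes share the same API; the only difference is whether you pass a file path to connect(). A minimal Python sketch (the file name local_warehouse.db is illustrative):

import duckdb

# Transient: everything lives in RAM and disappears when the process ends
mem_con = duckdb.connect()

# Persistent: the whole database is one portable file on disk
disk_con = duckdb.connect('local_warehouse.db')
disk_con.execute("CREATE OR REPLACE TABLE results AS SELECT 42 AS answer")
disk_con.close()

# Reopening the same file later finds the table again
con = duckdb.connect('local_warehouse.db')
print(con.execute("SELECT * FROM results").fetchall())  # [(42,)]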


Core Engine Features

 
DuckDB’s architecture is built for analytical performance:

  • Vectorized Execution Engine: Processes data in batches (vectors) rather than row-by-row, enabling SIMD (Single Instruction, Multiple Data) CPU optimizations for massive throughput.
  • Automatic Parallel Execution: Features a work-stealing scheduler that automatically parallelizes queries across available CPU cores.
  • Intelligent Memory Management: Gracefully spills to disk for out-of-core workloads when datasets exceed available RAM.
  • Columnar Storage: Data is stored and processed column-by-column, dramatically improving compression ratios and query performance for analytical workloads.


Setup

 

Getting started with DuckDB is remarkably simple – no server installation, no configuration files, no dependencies.
 

CLI Installation

Download and install the standalone executable for your platform.
 

Python Installation

pip install duckdb

That’s it! No database server to configure, no ports to open, no users to create.
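
To verify the installation, you can run a one-off query straight from the module (a quick smoke test; any query works):

import duckdb

# Uses an implicit in-memory connection under the hood
print(duckdb.sql("SELECT 'DuckDB is ready' AS status").fetchall())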

Note: Client APIs exist for many other programming languages (Go, Java, JavaScript, C/C++, R, Rust, and more).


How to Use DuckDB

 

CLI

The DuckDB CLI provides an interactive SQL shell for quick analysis:
 

# Start in-memory session
duckdb

# Or connect to a persistent database
duckdb my_analytics.db

 
Common CLI operations:
 

-- Query local files directly (no import needed!)
SELECT * FROM 'data/sales.parquet' LIMIT 10;

-- Query CSV files with automatic schema detection
SELECT * FROM 'data.csv';

-- Export results
COPY (SELECT * FROM 'data/sales.parquet' WHERE year = 2024) 
TO 'output/sales_2024.parquet' (FORMAT PARQUET);

-- Show available tables
SHOW TABLES;

-- Describe table structure
DESCRIBE my_table;

 

Python

Python integration is where DuckDB truly shines:
 

import duckdb

# Create a connection
con = duckdb.connect()  # in-memory
# con = duckdb.connect('my_db.db')  # persistent

# Query local files
result = con.execute("""
    SELECT category, SUM(amount) as total
    FROM 'sales/*.parquet'
    GROUP BY category
    ORDER BY total DESC
""")

# Get results as a Pandas DataFrame
df = con.execute("SELECT * FROM 'data.parquet'").df()

# Get results as a Polars DataFrame
pl_df = con.execute("SELECT * FROM 'data.parquet'").pl()

# From here, work with the results as regular Pandas/Polars DataFrames

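Beyond raw SQL strings, the Python client also exposes a relational API for composing queries step by step. A short sketch, assuming a data.parquet file with an amount column:

import duckdb

# Build the query lazily through the relational API
rel = duckdb.sql("SELECT * FROM 'data.parquet'")
top = rel.filter("amount > 100").order("amount DESC").limit(5)

# Execution happens only when results are materialized
print(top.df())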

Query an ADLS (Azure Data Lake Storage)

 

DuckDB seamlessly connects to Azure Blob Storage and ADLS Gen2 through the azure extension, enabling you to query cloud data as if it were local.
 

Initial Setup

 

INSTALL azure;
LOAD azure;

 
Note: These commands are optional; the azure extension is automatically downloaded and installed the first time one of its keywords is used (e.g. TYPE AZURE).
 

Authentication Possibilities

DuckDB’s unified SECRET management system provides multiple authentication methods:

1. Connection String (Easiest for Local Development)

 

CREATE SECRET azure_secret (
    TYPE AZURE,
    CONNECTION_STRING 'DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...;EndpointSuffix=core.windows.net'
);
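
Since the connection string embeds the account key, avoid hardcoding it in scripts. A sketch that reads it from the environment instead (the variable name AZURE_STORAGE_CONNECTION_STRING and the container path are assumptions):

import os
import duckdb

con = duckdb.connect()

# Pull the connection string from the environment rather than the source code
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
con.execute(f"""
    CREATE SECRET azure_secret (
        TYPE AZURE,
        CONNECTION_STRING '{conn_str}'
    )
""")

# The secret now applies to subsequent Azure reads on this connection
con.execute("SELECT count(*) FROM 'az://mycontainer/data/file.parquet'").fetchall()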

 

2. Service Principal (For non-Azure-hosted workloads)

 

CREATE SECRET my_azure_spn (
    TYPE AZURE,
    PROVIDER SERVICE_PRINCIPAL,
    TENANT_ID '00000000-0000-0000-0000-000000000000',
    CLIENT_ID '00000000-0000-0000-0000-000000000000',
    CLIENT_SECRET 'my-client-secret-value',
    ACCOUNT_NAME 'myaccount'
);

 

3. Managed Identity (For Azure-hosted workloads)

 

CREATE SECRET azure_mi (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    ACCOUNT_NAME 'myaccount'
);
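
The credential chain tries several Azure credential sources in turn. The extension also accepts a CHAIN parameter to control which sources are tried and in what order; a sketch reusing the placeholder account name:

import duckdb

con = duckdb.connect()

# Try the Azure CLI login first, then fall back to managed identity
con.execute("""
    CREATE SECRET azure_mi (
        TYPE AZURE,
        PROVIDER CREDENTIAL_CHAIN,
        CHAIN 'cli;managed_identity',
        ACCOUNT_NAME 'myaccount'
    )
""")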

 

ABFSS or AZ Format

DuckDB supports multiple URI formats for Azure storage:

ABFSS Format (Azure Blob File System Secure)

 

-- Standard ADLS Gen2 path format
SELECT * FROM 'abfss://<container>@<storageaccount>.dfs.core.windows.net/path/to/file.parquet';

-- Simplified format (when account is specified in secret)
SELECT * FROM 'abfss://<container>/path/to/file.parquet';

 

AZ Format (Azure Blob)

 

-- Azure Blob Storage format
SELECT * FROM 'az://container/path/to/file.parquet';

 
Both formats support glob patterns for querying multiple files:
 

-- Query all parquet files in a directory
SELECT * FROM 'abfss://datalake/raw/sales/*.parquet';

-- Query files across partitioned directories
SELECT * FROM 'abfss://datalake/raw/sales/year=*/month=*/*.parquet';

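When directories follow the key=value layout above, the parquet reader can expose those path segments as regular columns, letting DuckDB skip partitions that the WHERE clause rules out. A sketch reusing the placeholder paths:

import duckdb

con = duckdb.connect()

# hive_partitioning turns year/month from the directory names into columns,
# so whole partitions are pruned before any file is read
df = con.execute("""
    SELECT month, count(*) AS nb_rows
    FROM read_parquet('abfss://datalake/raw/sales/year=*/month=*/*.parquet',
                      hive_partitioning = true)
    WHERE year = 2024
    GROUP BY month
""").df()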

Complex Queries Across Multiple Storage Accounts

 

One of DuckDB’s most powerful features is the ability to join data across different sources in a single query – local files, different cloud storage accounts, and in-memory dataframes.
 

Example: Join with Master Data in Another Storage Account

Consider a scenario where your transactional data lives in one storage account and master data (product catalog, customer info) lives in another:
 

-- Set up secrets for multiple storage accounts
CREATE SECRET adlsduckdb (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    ACCOUNT_NAME 'adlsduckdb'
);

CREATE SECRET adlsmasterdata (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    ACCOUNT_NAME 'adlsmasterdata'
);

SELECT count(*)
FROM 'abfss://data@adlsduckdb.dfs.core.windows.net/orders.parquet' orders               -- from main data lake
JOIN 'abfss://masterdata@adlsmasterdata.dfs.core.windows.net/customer.parquet' customer -- from master data lake
    ON customer.c_custkey = orders.o_custkey;


Mixing Cloud and Local Data

 

-- Enrich cloud data with local reference data
SELECT 
    cloud_data.*,
    local_mapping.display_name
FROM 'abfss://datalake/raw/events/*.parquet' cloud_data
JOIN 'local_reference/code_mappings.csv' local_mapping
    ON cloud_data.code = local_mapping.code;
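
The reverse is just as useful: materialize remote data into a local persistent database once, then query the local copy, sidestepping the network latency discussed in the limitations below. A sketch with placeholder names, assuming an Azure secret is already in place:

import duckdb

# A persistent local file used as a cache for remote data
con = duckdb.connect('cache.db')

# Pull the remote files down once...
con.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT * FROM 'abfss://datalake/raw/events/*.parquet'
""")

# ...then later queries hit the fast local copy instead of the network
con.execute("SELECT count(*) FROM events").fetchall()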


Python Integration

 

DuckDB’s Python integration is exceptionally powerful, offering zero-copy data exchange with popular data science libraries like Pandas or Polars.

 

Pandas Integration

DuckDB provides a seamless SQL interface for Pandas, allowing for complex relational queries on existing local DataFrames.

Because it accepts and outputs DataFrames natively, the tool fits perfectly into existing data pipelines.
 

import duckdb
import pandas as pd

# Create a pandas DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [100, 200, 300]
})

# Query pandas DataFrame directly - zero copy!
result = duckdb.query("SELECT * FROM df WHERE value > 150").df()

# Register DataFrame as a virtual table
con = duckdb.connect()
con.register('my_table', df)
con.execute("SELECT * FROM my_table").fetchall()

 

Polars Integration

 

import duckdb
import polars as pl

# Create a Polars DataFrame
df = pl.DataFrame({
    'id': [1, 2, 3],
    'value': [100, 200, 300]
})

# Query Polars DataFrame directly
result = duckdb.query("SELECT * FROM df WHERE value > 150").pl()

# Use Polars lazy frames for optimized execution
lazy_df = df.lazy()
result = duckdb.query("SELECT * FROM lazy_df").pl()


Limitations When Running Locally

 
While DuckDB is remarkably capable, there are some limitations to be aware of when running locally:
 

Concurrency Limitations

  • Single Writer: DuckDB supports only one concurrent write connection to a database file. Multiple read connections are supported (see the read-only sketch below).
  • No Multi-Process Writes: Unlike client-server databases, you cannot have multiple processes writing to the same database simultaneously.
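
If several processes only need to read a database file, open it in read-only mode so they do not compete for the single write lock; a minimal sketch:

import duckdb

# Several processes may hold read-only connections to the same file
con = duckdb.connect('my_analytics.db', read_only=True)
con.execute("SELECT count(*) FROM my_table").fetchall()

# Any write attempt on this connection raises an error:
# con.execute("INSERT INTO my_table VALUES (1)")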

 

Memory Constraints

  • RAM-Bound Performance: While DuckDB can spill to disk, optimal performance requires sufficient RAM for your dataset and query complexity.
  • Large Join Operations: Hash joins on very large tables may require significant memory or careful memory limit configuration (see the settings sketch below).
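
Both constraints can be tamed by configuring the engine explicitly; a sketch of the relevant settings (the values are illustrative):

import duckdb

con = duckdb.connect()

# Cap the RAM the engine may use before spilling to disk
con.execute("SET memory_limit = '4GB'")

# Where spill files are written, and how many cores to use
con.execute("SET temp_directory = '/tmp/duckdb_spill'")
con.execute("SET threads = 4")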

 

Network Dependencies

  • Cloud Latency: Querying remote storage (ADLS, S3) is subject to network latency. Consider caching frequently accessed data locally.
  • No Native Streaming: DuckDB is designed for batch processing, not real-time streaming analytics.

 

Extension Compatibility

  • Platform-Specific Extensions: Some extensions may not be available on all platforms (Windows, Linux, macOS, ARM).
  • Version Dependencies: Extensions must match your DuckDB version exactly.


Going further: DuckDB as a query engine

 
More than a database, DuckDB can be used as a query engine in small-data workflows, replacing or complementing Pandas and Polars.

Furthermore, DuckDB's ability to scale enables its use in big data environments, and it integrates easily with platforms like Fabric or BigQuery.

 

Useful Resources

The following resources provide useful information about DuckDB: