Sun, Feb 2, 2025. Here is a summary of DuckCon #6, held on 31 Jan 2025 in Amsterdam. I copied the transcript from YouTubeTranscript and passed it through Gemini 2.0 Flash Exp with the system prompt: "Summarize this transcript from the DuckDB conference without missing any points. Cover every point mentioned. A lot of spelling errors that sound like DuckDB are likely to be DuckDB."
Introduction & Welcome:
DuckCon: This is the 6th DuckDB conference, held in DuckDB's hometown of Amsterdam. The first DuckCon was online due to the pandemic.
Live Streaming: This is the first time DuckCon is being live-streamed, chosen to accommodate global time zones (especially China and the US).
Global Reach: The live stream is intended to reach users in areas where in-person DuckCons are unlikely.
Q&A: Slido (qa.duckdb.org) will be used for Q&A, with upvoting to prioritize questions.
Sponsors: Thanks to gold sponsor monday.com and silver sponsors Rill and Crunchy Data.
DuckCon Purpose: DuckCon is a place for users to connect, share experiences, and provide feedback to the DuckDB team.
Inspiration: The team is inspired by the community's use of DuckDB and how far the project has come.
Mission Statement: DuckDB aims to make large datasets less intimidating and more accessible, moving away from fear of data to confidence in handling it.
Motivation: The project was born from seeing people struggle with data that didn't fit in Excel and the lack of user-friendly tools.
Industry Trends: Single-node processing capabilities have grown faster than the size of useful datasets.
Data Singularity: The predicted "data singularity", the point at which most data analysis queries can run on a single node, has become reality.
Real-World Data Sizes: Analysis of Snowflake and Redshift data shows that 99.9% of datasets are under 300GB.
Raspberry Pi Benchmark: The industry-standard TPC-H benchmark at scale factor 300 (~300GB) can run on a Raspberry Pi using DuckDB (a minimal TPC-H sketch follows below).
Single Node Growth: Single-node processing power is rapidly increasing, allowing for larger datasets to be handled.
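The talk itself isn't reproduced as code in this summary, but DuckDB ships a tpch extension that generates and runs the benchmark, so the Raspberry Pi claim is easy to try at a smaller scale. A minimal sketch (scale factor 0.1 here; the talk's run used 300):

```python
import duckdb

con = duckdb.connect()

# The tpch extension bundles the TPC-H data generator and the 22 benchmark queries.
con.sql("INSTALL tpch; LOAD tpch;")

# Generate a small dataset in memory; the Raspberry Pi run reportedly used sf = 300 (~300GB).
con.sql("CALL dbgen(sf = 0.1);")

# Run TPC-H query 1 and print the result.
con.sql("PRAGMA tpch(1);").show()
```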
Adoption Numbers:
32 Million Extension Installs: 32 million DuckDB extension installs in the last month.
1.8 Million Unique Website Visitors: 1.8 million unique visitors per month to the DuckDB website.
Bluesky Community: Growing community on Bluesky, with the hashtag #dataBS.
Technical Updates (Mark):
Extension Ecosystem: Focus on enabling the community to build and share extensions.
Community Extensions: Making it easier to create and use community-built extensions.
DuckDB v1.2: Releasing next week, codenamed "Histrionicus" after the harlequin duck.
CSV Reader Improvements: Significant improvements to the CSV reader.
Friendlier SQL: Continued improvements to the "friendly SQL" experience (see the sketch after this list).
CLI Autocomplete: Reworked and improved CLI autocomplete.
Performance Optimizations: Many queries are now faster due to performance work.
C API for Extensions: Introducing a C API to make building extensions easier.
Logging Features: Improved logging for production use.
Lakehouse Focus: The main focus for the year is on lakehouse formats and related features.
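The summary doesn't spell out which 1.2 changes land where, so the following is only a generic illustration of DuckDB's CSV auto-detection and "friendly SQL" syntax; the file and its columns are made up:

```python
import duckdb

# Create a tiny CSV so the example is self-contained (contents are invented).
with open("events.csv", "w") as f:
    f.write("event_type,user_id,raw_payload\n")
    f.write("page_view,1,{}\n")
    f.write("click,2,{}\n")
    f.write("page_view,3,{}\n")

con = duckdb.connect()

# read_csv auto-detects the delimiter, header, and column types.
con.sql("""
    SELECT * EXCLUDE (raw_payload)   -- friendly SQL: drop one column without listing the rest
    FROM read_csv('events.csv')
    WHERE event_type = 'page_view'
""").show()

# GROUP BY ALL groups by every non-aggregated column in the SELECT list.
con.sql("""
    SELECT event_type, count(*) AS n
    FROM read_csv('events.csv')
    GROUP BY ALL
    ORDER BY n DESC
""").show()
```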
Q&A (Mark & Hannes):
Doubling Team: If the team doubled, they would focus on client integrations and other projects, not a major architectural change.
Partitioning: Near-term plans to add support for partitioning, related to lakehouse formats.
DuckDB WASM: The WASM ecosystem is evolving, with exciting possibilities for in-browser use.
Financial/Pharmaceutical Industries: DuckDB could replace some SAS workflows due to its cost-effectiveness and capabilities.
Lakehouse & MotherDuck: Lakehouse work is separate from MotherDuck, though MotherDuck will likely support lakehouse features.
Contributing to Extensions: Plans to make it easier to contribute to extensions, including support for Rust and Go.
Airport Extension (Rusty):
Analogy: The airport extension allows DuckDB to "fly" to remote servers using Apache Arrow Flight.
Functionality: Supports select, insert, update, and delete operations on remote data sources.
Motivation: To reduce the burden of writing extensions and enable faster development using existing code.
Arrow Flight: Uses Arrow Flight for communication, enabling connections to various data sources.
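The Airport-specific wire details aren't covered in the summary. For background, a minimal Arrow Flight server built with pyarrow looks roughly like the sketch below; DuckDB with the airport extension would act as the client to something like it. The dataset, port, and flight name are made-up placeholders, not Airport's actual protocol.

```python
import pyarrow as pa
import pyarrow.flight as flight

# A toy in-memory dataset; in the demos this would be Delta Lake data, a model, etc.
TABLE = pa.table({"id": [1, 2, 3], "city": ["Amsterdam", "Utrecht", "Delft"]})

class DemoFlightServer(flight.FlightServerBase):
    """Minimal Arrow Flight server; DuckDB + Airport would be the client."""

    def list_flights(self, context, criteria):
        # Advertise a single "flight" (dataset) named 'demo'.
        descriptor = flight.FlightDescriptor.for_path("demo")
        endpoint = flight.FlightEndpoint(b"demo", ["grpc://localhost:8815"])
        yield flight.FlightInfo(TABLE.schema, descriptor, [endpoint],
                                TABLE.num_rows, TABLE.nbytes)

    def do_get(self, context, ticket):
        # Stream the table back to the client as Arrow record batches.
        return flight.RecordBatchStream(TABLE)

if __name__ == "__main__":
    DemoFlightServer(location="grpc://0.0.0.0:8815").serve()
```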
Demo 1: Delta Lake:
Attaches to a flight server for Delta Lake access.
Allows creating schemas, tables, and performing standard SQL operations.
Uses Python and delta-rs (the Rust implementation of Delta Lake); see the sketch below.
Supports predicate pushdown and integrates with the DuckDB catalog.
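The demo's Flight server code isn't reproduced in the summary; independent of Airport, the delta-rs Python bindings it builds on look roughly like this (the table path and columns are invented):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Create (or overwrite) a Delta Lake table on local disk; path and columns are invented.
df = pd.DataFrame({"id": [1, 2, 3], "city": ["Amsterdam", "Utrecht", "Delft"]})
write_deltalake("./events_delta", df, mode="overwrite")

# Read it back as an Arrow table; a Flight server could stream this to DuckDB.
dt = DeltaTable("./events_delta")
print(dt.to_pyarrow_table())
```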
Demo 2: AutoGluon:
Integrates the AutoGluon AutoML package.
Predicts Hacker News post votes using a trained model.
Demonstrates table-returning functions for model fitting and prediction.
No C++ code required, just Python (see the AutoGluon sketch below).
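The table-function plumbing is Airport-specific and not shown here; for reference, the underlying AutoGluon Tabular API that such a server would wrap looks roughly like this (the Hacker News-style features and values are invented):

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Invented stand-in for the Hacker News training data used in the demo.
train = pd.DataFrame({
    "title_length": [12, 45, 30, 80, 22, 55, 17, 64, 38, 41, 29, 73],
    "hour_posted":  [9, 14, 20, 8, 23, 11, 16, 7, 19, 13, 10, 21],
    "votes":        [12, 3, 150, 40, 5, 88, 2, 61, 27, 34, 9, 120],
})

# Fit an AutoML model that predicts the 'votes' column from the other columns.
predictor = TabularPredictor(label="votes").fit(train)

# Predict votes for unseen posts; in the demo this ran behind a table-returning function.
new_posts = pd.DataFrame({"title_length": [25], "hour_posted": [11]})
print(predictor.predict(new_posts))
```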
Demo 3: Geocoding:
Uses a geocoder service to convert addresses to coordinates and vice versa.
Demonstrates scalar UDFs for vectorized requests.
Uses a Python example for a simple uppercase function (a comparable plain DuckDB Python UDF is sketched below).
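Airport registers its UDFs on the Flight server side, which isn't reproduced in the summary. For comparison, the same "simple uppercase function" idea expressed as a plain DuckDB Python scalar UDF (no Airport involved):

```python
import duckdb
from duckdb.typing import VARCHAR

def my_upper(s: str) -> str:
    # Trivial scalar function; the Airport demo exposed a similar one from a Flight server.
    return s.upper()

con = duckdb.connect()
con.create_function("my_upper", my_upper, [VARCHAR], VARCHAR)
con.sql("SELECT my_upper('van gogh museum') AS shouted").show()
```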
Features:
List flights, take flights.
Catalog integration.
Select, update, delete.
Scalar UDFs.
Table in/out functions.
Authentication for row/column filtering.
Availability: Requires DuckDB 1.2, MIT licensed, available on GitHub.
Q&A (Rusty):
Most Proud Extension: Airport is the most fun, but the AWS API wrapper also brings joy.
Extension Resources: The GitHub DuckDB extension template and reading others' source code are helpful.
Airport & Other Extensions: Airport is separate and can be used alongside other extensions like spatial or httpfs.
Graph Support: Graph database support is planned, with examples like Kuzu, Neptune, and Neo4j.
Licensing: Airport is MIT licensed, which is compatible with the Apache license.
Scaling Out: Airport can be used to query multiple DuckDB instances on different machines.
Ibis & Geospatial (Naty):
Naty Clementi: Senior software engineer at NVIDIA, working on open-source projects like Ibis.
Ibis: Open-source Python library for data wrangling, with a DataFrame API and interfaces to 15+ engines, including DuckDB.
DuckDB for Geospatial: DuckDB is fast, has a geospatial extension, and supports various geospatial formats.
GeoParquet: Becoming a standard for geospatial data, enabling cloud data warehouse interoperability and compression.
GeoArrow: A way of representing geospatial vector data in memory for faster processing.
Ibis Benefits: Allows writing Python instead of SQL, with execution deferred and delegated to the backend engine.
Demo:
Uses Overture Maps data in GeoParquet format.
Filters data using bounding boxes.
Demonstrates geospatial operations like ST_Distance and ST_Transform (see the sketch after this demo's steps).
Plots data using Lonboard.
Shows how to find points of interest near a location (e.g., the Van Gogh Museum).
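The exact demo queries aren't in the summary; the spatial functions it names are also available directly in DuckDB's spatial extension. A rough sketch, with approximate coordinates and assuming the extension's default authority-defined lat/lon axis order for EPSG:4326:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial; LOAD spatial;")

# Approximate (lat, lon) for the Van Gogh Museum and Amsterdam Centraal.
# EPSG:28992 (Dutch RD New) is a metric CRS, so the distance comes out in meters.
con.sql("""
    WITH pts AS (
        SELECT ST_Point(52.3584, 4.8810) AS museum,
               ST_Point(52.3791, 4.9003) AS centraal
    )
    SELECT ST_Distance(
               ST_Transform(museum,   'EPSG:4326', 'EPSG:28992'),
               ST_Transform(centraal, 'EPSG:4326', 'EPSG:28992')
           ) AS distance_m
    FROM pts
""").show()
```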
Ibis & DuckDB: Ibis uses DuckDB for the parquet reader and lets DuckDB do the heavy lifting.
Ibis Optimizations: Ibis does type checking but not query optimization, leaving that to the engine (a deferred-execution sketch follows this list).
Ibis in Browser: Ibis works in the browser through DuckDB WASM.
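The demo's Ibis expressions aren't reproduced in the summary; a minimal sketch of the deferred-execution pattern against the DuckDB backend, with an invented Parquet file and columns standing in for the Overture data:

```python
import pandas as pd
import ibis

# Invented stand-in for the Overture Maps places data used in the demo.
pd.DataFrame({
    "name": ["Van Gogh Museum", "Rijksmuseum", "Dom Tower"],
    "city": ["Amsterdam", "Amsterdam", "Utrecht"],
    "country": ["NL", "NL", "NL"],
}).to_parquet("places.parquet")

# Connect to an in-memory DuckDB database through Ibis and register the Parquet file.
con = ibis.duckdb.connect()
places = con.read_parquet("places.parquet")

# Build a deferred expression; nothing executes until .execute() is called.
expr = (
    places.filter(places.city == "Amsterdam")
          .group_by("city")
          .agg(n=places.count())
)

# Inspect the SQL Ibis hands to DuckDB, then let DuckDB do the heavy lifting.
print(ibis.to_sql(expr))
print(expr.execute())
```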
Q&A (Naty):
Linear Interpolation: The IbisML module can help with regression-related tasks.
Missing Features: No major features are missing in the DuckDB/Ibis geospatial setup, with minimal overhead.