Integrates DuckDB with Google BigQuery
Background
In today's data-driven world, the ability to efficiently query and manage large datasets is crucial for businesses. DuckDB, an in-process SQL OLAP database management system, offers a powerful solution for handling complex analytical queries with ease. Google BigQuery, on the other hand, has been a staple for decades. It revolutionised the concept of a serverless data warehouse, making it incredibly user-friendly from the outset. By integrating DuckDB with BigQuery, users can leverage the strengths of both platforms to enhance their data processing capabilities. This integration allows for direct querying and management of BigQuery datasets, providing a seamless experience for data analysts and engineers.
DuckDB's extension capabilities enable it to connect with various data sources, including BigQuery, through community-driven integrations. This collaboration between DuckDB and BigQuery offers several advantages, including improved query performance, cost efficiency, and the ability to work with large-scale datasets without requiring extensive data movement. By combining DuckDB's efficient query engine with BigQuery's robust data storage and processing infrastructure, users can achieve faster insights and more effective data management.
Let’s dive into a lab to see how it works!
Set it up
To access BigQuery, we use gcloud CLI to set up ADC.
➜ ~ gcloud auth application-default login
Your browser has been opened to visit:
https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=xxx
Credentials saved to file: [/Users/derrickqin/.config/gcloud/application_default_credentials.json]
These credentials will be used by any library that requests Application Default Credentials (ADC).
Quota project "derrick-playground" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.
Install and load BigQuery DuckDB extension
➜ ~ duckdb
DuckDB v1.3.1 (Ossivalis) 2063dda3e6
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D -- Install and load the DuckDB BigQuery extension from the Community Repository
D FORCE INSTALL 'bigquery' FROM community;
LOAD 'bigquery';
100% ▕████████████████████████████████████████████████████████████▏
After that attach the GCP project so DuckDB knows where to run the query jobs
ATTACH 'project=derrick-playground' as bq (TYPE bigquery, READ_ONLY);
Test drive with scanning a table using BigQuery Storage API and yay, we can access it from DuckDB CLI!
SELECT * FROM bigquery_scan('bigquery-public-data.geo_us_boundaries.cnecta', billing_project='derrick-playground');
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────────┬──────────────────────┬──────────────────────┬─────────┬──────────────────────┬──────────────────┬───────────────────┬───────────────┬───────────────┬────────────────────────────────────────────────────────────────────────────────────────┐
│ geo_id │ cnecta_fips_code │ name │ name_lsad │ lsad │ mtfcc_feature_clas… │ area_land_meters │ area_water_meters │ int_point_lat │ int_point_lon │ cnecta_geom │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │ int64 │ double │ double │ varchar │
├─────────┼──────────────────┼──────────────────────┼──────────────────────┼─────────┼──────────────────────┼──────────────────┼───────────────────┼───────────────┼───────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ 710 │ 710 │ Augusta-Waterville… │ Augusta-Waterville… │ M4 │ G3200 │ 2155036649 │ 183936850 │ 44.4092939 │ -69.6901717 │ POLYGON((-69.792813 44.57733, -69.795348 44.577668, -69.797891 44.578045, -69.798362… │
│ 775 │ 775 │ Portland-Lewiston-… │ Portland-Lewiston-… │ M4 │ G3200 │ 6254813062 │ 1537827560 │ 43.8562034 │ -70.3192682 │ POLYGON((-70.480078 44.032078, -70.483885 44.033878, -70.488816 44.036178, -70.49552… │
│ 770 │ 770 │ Pittsfield-North A… │ Pittsfield-North A… │ M4 │ G3200 │ 1524389768 │ 24514153 │ 42.5337519 │ -73.1678825 │ POLYGON((-73.306984 42.632646, -73.308333 42.629064, -73.309828 42.625081, -73.31011… │
│ 790 │ 790 │ Springfield-Hartfo… │ Springfield-Hartfo… │ M4 │ G3200 │ 8710006868 │ 256922043 │ 42.0359069 │ -72.6213616 │ POLYGON((-72.636821 42.577985, -72.637099 42.57691, -72.637355 42.575968, -72.637466… │
│ 715 │ 715 │ Boston-Worcester-P… │ Boston-Worcester-P… │ M4 │ G3200 │ 21419197698 │ 3004814151 │ 42.3307869 │ -71.3296644 │ MULTIPOLYGON(((-71.471454 43.411298, -71.464171 43.413557, -71.463266 43.414547, -71… │
│ 725 │ 725 │ Lebanon-Claremont,… │ Lebanon-Claremont,… │ M4 │ G3200 │ 3031319652 │ 58476158 │ 43.6727226 │ -72.2484543 │ POLYGON((-72.396019 43.428835, -72.396027 43.428797, -72.396038 43.42873, -72.396108… │
│ 720 │ 720 │ Bridgeport-New Hav… │ Bridgeport-New Hav… │ M4 │ G3200 │ 3940389150 │ 374068423 │ 41.3603421 │ -73.1284227 │ MULTIPOLYGON(((-72.602721 41.266584, -72.602742 41.266458, -72.60248 41.266178, -72.… │
└─────────┴──────────────────┴──────────────────────┴──────────────────────┴─────────┴──────────────────────┴──────────────────┴───────────────────┴───────────────┴───────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
How about we run a query?
SELECT * FROM bigquery_query('derrick-playground', 'SELECT * FROM bigquery-public-data.geo_us_boundaries.cnecta');
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┬──────────────────┬──────────────────────┬──────────────────────┬─────────┬──────────────────────┬──────────────────┬───────────────────┬───────────────┬───────────────┬────────────────────────────────────────────────────────────────────────────────────────┐
│ geo_id │ cnecta_fips_code │ name │ name_lsad │ lsad │ mtfcc_feature_clas… │ area_land_meters │ area_water_meters │ int_point_lat │ int_point_lon │ cnecta_geom │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │ int64 │ double │ double │ varchar │
├─────────┼──────────────────┼──────────────────────┼──────────────────────┼─────────┼──────────────────────┼──────────────────┼───────────────────┼───────────────┼───────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ 710 │ 710 │ Augusta-Waterville… │ Augusta-Waterville… │ M4 │ G3200 │ 2155036649 │ 183936850 │ 44.4092939 │ -69.6901717 │ POLYGON((-69.792813 44.57733, -69.795348 44.577668, -69.797891 44.578045, -69.798362… │
│ 775 │ 775 │ Portland-Lewiston-… │ Portland-Lewiston-… │ M4 │ G3200 │ 6254813062 │ 1537827560 │ 43.8562034 │ -70.3192682 │ POLYGON((-70.480078 44.032078, -70.483885 44.033878, -70.488816 44.036178, -70.49552… │
│ 770 │ 770 │ Pittsfield-North A… │ Pittsfield-North A… │ M4 │ G3200 │ 1524389768 │ 24514153 │ 42.5337519 │ -73.1678825 │ POLYGON((-73.306984 42.632646, -73.308333 42.629064, -73.309828 42.625081, -73.31011… │
│ 790 │ 790 │ Springfield-Hartfo… │ Springfield-Hartfo… │ M4 │ G3200 │ 8710006868 │ 256922043 │ 42.0359069 │ -72.6213616 │ POLYGON((-72.636821 42.577985, -72.637099 42.57691, -72.637355 42.575968, -72.637466… │
│ 715 │ 715 │ Boston-Worcester-P… │ Boston-Worcester-P… │ M4 │ G3200 │ 21419197698 │ 3004814151 │ 42.3307869 │ -71.3296644 │ MULTIPOLYGON(((-71.471454 43.411298, -71.464171 43.413557, -71.463266 43.414547, -71… │
│ 725 │ 725 │ Lebanon-Claremont,… │ Lebanon-Claremont,… │ M4 │ G3200 │ 3031319652 │ 58476158 │ 43.6727226 │ -72.2484543 │ POLYGON((-72.396019 43.428835, -72.396027 43.428797, -72.396038 43.42873, -72.396108… │
│ 720 │ 720 │ Bridgeport-New Hav… │ Bridgeport-New Hav… │ M4 │ G3200 │ 3940389150 │ 374068423 │ 41.3603421 │ -73.1284227 │ MULTIPOLYGON(((-72.602721 41.266584, -72.602742 41.266458, -72.60248 41.266178, -72.… │
└─────────┴──────────────────┴──────────────────────┴──────────────────────┴─────────┴──────────────────────┴──────────────────┴───────────────────┴───────────────┴───────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
From Google Cloud Console, we can see the BigQuery job triggered by the query from DuckDB CLI

Summary
Integrating DuckDB with BigQuery enables seamless data integration between the two platforms. This integration enables users to utilise DuckDB's efficient query engine locally, which can result in significant cost savings. By scanning the data once and performing all computations on a powerful local machine, such as an "overpowered" Apple Silicon MacBook, you can optimise your data processing tasks without incurring a surprise BigQuery bill.