<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Derrick Qin's Blog]]></title><description><![CDATA[Derrick Qin's Blog]]></description><link>https://derrickqin.com</link><generator>RSS for Node</generator><lastBuildDate>Sat, 18 Apr 2026 15:45:36 GMT</lastBuildDate><atom:link href="https://derrickqin.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Integrates DuckDB with Google BigQuery]]></title><description><![CDATA[Background
In today's data-driven world, the ability to efficiently query and manage large datasets is crucial for businesses. DuckDB, an in-process SQL OLAP database management system, offers a powerful solution for handling complex analytical queri...]]></description><link>https://derrickqin.com/integrates-duckdb-with-google-bigquery</link><guid isPermaLink="true">https://derrickqin.com/integrates-duckdb-with-google-bigquery</guid><category><![CDATA[duckDB]]></category><category><![CDATA[bigquery]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Wed, 24 Sep 2025 14:00:00 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>In today's data-driven world, the ability to efficiently query and manage large datasets is crucial for businesses. <a target="_blank" href="https://duckdb.org/">DuckDB</a>, an in-process SQL OLAP database management system, offers a powerful solution for handling complex analytical queries with ease. Google BigQuery, on the other hand, has been a staple for over a decade. It revolutionised the concept of a serverless data warehouse, making it incredibly user-friendly from the outset. By integrating DuckDB with BigQuery, users can leverage the strengths of both platforms to enhance their data processing capabilities. This integration allows for direct querying and management of BigQuery datasets, providing a seamless experience for data analysts and engineers.</p>
<p>DuckDB's extension capabilities enable it to connect with various data sources, including BigQuery, through community-driven integrations. This collaboration between DuckDB and BigQuery offers several advantages, including improved query performance, cost efficiency, and the ability to work with large-scale datasets without requiring extensive data movement. By combining DuckDB's efficient query engine with BigQuery's robust data storage and processing infrastructure, users can achieve faster insights and more effective data management.</p>
<p>Let’s dive into a lab to see how it works!</p>
<h2 id="heading-set-it-up">Set it up</h2>
<p>To access BigQuery, we use the <code>gcloud</code> CLI to set up Application Default Credentials (ADC).</p>
<pre><code class="lang-bash">➜  ~ gcloud auth application-default login
Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&amp;client_id=xxx


Credentials saved to file: [/Users/derrickqin/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project <span class="hljs-string">"derrick-playground"</span> was added to ADC <span class="hljs-built_in">which</span> can be used by Google client libraries <span class="hljs-keyword">for</span> billing and quota. Note that some services may still bill the project owning the resource.
</code></pre>
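<p>Optionally, you can verify that ADC resolves correctly before moving on. Here is a minimal sanity check, assuming the <code>google-auth</code> Python package is installed:</p>
<pre><code class="lang-python"># Quick check that Application Default Credentials can be loaded
# and which project they are associated with.
import google.auth

credentials, project = google.auth.default()
print(project)  # likely "derrick-playground", the quota project set above
</code></pre>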
<p>Install and load the <a target="_blank" href="https://github.com/hafenkran/duckdb-bigquery">BigQuery DuckDB extension</a>:</p>
<pre><code class="lang-bash">➜  ~ duckdb
DuckDB v1.3.1 (Ossivalis) 2063dda3e6
Enter <span class="hljs-string">".help"</span> <span class="hljs-keyword">for</span> usage hints.
Connected to a transient in-memory database.
Use <span class="hljs-string">".open FILENAME"</span> to reopen on a persistent database.
D -- Install and load the DuckDB BigQuery extension from the Community Repository
D FORCE INSTALL <span class="hljs-string">'bigquery'</span> FROM community;
LOAD <span class="hljs-string">'bigquery'</span>;
100% ▕████████████████████████████████████████████████████████████▏
</code></pre>
<p>After that, attach the GCP project so DuckDB knows where to run query jobs:</p>
<pre><code class="lang-bash">ATTACH <span class="hljs-string">'project=derrick-playground'</span> as bq (TYPE bigquery, READ_ONLY);
</code></pre>
<p>Take it for a test drive by scanning a table via the <a target="_blank" href="https://cloud.google.com/bigquery/docs/reference/storage">BigQuery Storage API</a>, and yay, we can access it from the DuckDB CLI!</p>
<pre><code class="lang-bash">SELECT * FROM bigquery_scan(<span class="hljs-string">'bigquery-public-data.geo_us_boundaries.cnecta'</span>, billing_project=<span class="hljs-string">'derrick-playground'</span>);
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────┬──────────────────┬──────────────────────┬──────────────────────┬─────────┬──────────────────────┬──────────────────┬───────────────────┬───────────────┬───────────────┬────────────────────────────────────────────────────────────────────────────────────────┐
│ geo_id  │ cnecta_fips_code │         name         │      name_lsad       │  lsad   │ mtfcc_feature_clas…  │ area_land_meters │ area_water_meters │ int_point_lat │ int_point_lon │                                      cnecta_geom                                       │
│ varchar │     varchar      │       varchar        │       varchar        │ varchar │       varchar        │      int64       │       int64       │    double     │    double     │                                        varchar                                         │
├─────────┼──────────────────┼──────────────────────┼──────────────────────┼─────────┼──────────────────────┼──────────────────┼───────────────────┼───────────────┼───────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ 710     │ 710              │ Augusta-Waterville…  │ Augusta-Waterville…  │ M4      │ G3200                │       2155036649 │         183936850 │    44.4092939 │   -69.6901717 │ POLYGON((-69.792813 44.57733, -69.795348 44.577668, -69.797891 44.578045, -69.798362…  │
│ 775     │ 775              │ Portland-Lewiston-…  │ Portland-Lewiston-…  │ M4      │ G3200                │       6254813062 │        1537827560 │    43.8562034 │   -70.3192682 │ POLYGON((-70.480078 44.032078, -70.483885 44.033878, -70.488816 44.036178, -70.49552…  │
│ 770     │ 770              │ Pittsfield-North A…  │ Pittsfield-North A…  │ M4      │ G3200                │       1524389768 │          24514153 │    42.5337519 │   -73.1678825 │ POLYGON((-73.306984 42.632646, -73.308333 42.629064, -73.309828 42.625081, -73.31011…  │
│ 790     │ 790              │ Springfield-Hartfo…  │ Springfield-Hartfo…  │ M4      │ G3200                │       8710006868 │         256922043 │    42.0359069 │   -72.6213616 │ POLYGON((-72.636821 42.577985, -72.637099 42.57691, -72.637355 42.575968, -72.637466…  │
│ 715     │ 715              │ Boston-Worcester-P…  │ Boston-Worcester-P…  │ M4      │ G3200                │      21419197698 │        3004814151 │    42.3307869 │   -71.3296644 │ MULTIPOLYGON(((-71.471454 43.411298, -71.464171 43.413557, -71.463266 43.414547, -71…  │
│ 725     │ 725              │ Lebanon-Claremont,…  │ Lebanon-Claremont,…  │ M4      │ G3200                │       3031319652 │          58476158 │    43.6727226 │   -72.2484543 │ POLYGON((-72.396019 43.428835, -72.396027 43.428797, -72.396038 43.42873, -72.396108…  │
│ 720     │ 720              │ Bridgeport-New Hav…  │ Bridgeport-New Hav…  │ M4      │ G3200                │       3940389150 │         374068423 │    41.3603421 │   -73.1284227 │ MULTIPOLYGON(((-72.602721 41.266584, -72.602742 41.266458, -72.60248 41.266178, -72.…  │
└─────────┴──────────────────┴──────────────────────┴──────────────────────┴─────────┴──────────────────────┴──────────────────┴───────────────────┴───────────────┴───────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
</code></pre>
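<p>If you prefer scripting this instead of using the CLI, the same flow works from DuckDB's Python client. Below is a minimal sketch, assuming ADC is already configured and the community extension behaves the same way as in the CLI:</p>
<pre><code class="lang-python">import duckdb

con = duckdb.connect()  # transient in-memory database, like the CLI default

# Install and load the community extension, then attach the GCP project
con.execute("INSTALL 'bigquery' FROM community;")
con.execute("LOAD 'bigquery';")
con.execute("ATTACH 'project=derrick-playground' AS bq (TYPE bigquery, READ_ONLY);")

# Scan the same public table, billing the job to our own project
rows = con.execute(
    "SELECT geo_id, name FROM bigquery_scan("
    "'bigquery-public-data.geo_us_boundaries.cnecta', "
    "billing_project='derrick-playground') LIMIT 5"
).fetchall()
print(rows)
</code></pre>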
<p>How about we run a query?</p>
<pre><code class="lang-bash">SELECT * FROM bigquery_query(<span class="hljs-string">'derrick-playground'</span>, <span class="hljs-string">'SELECT * FROM bigquery-public-data.geo_us_boundaries.cnecta'</span>);
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────┬──────────────────┬──────────────────────┬──────────────────────┬─────────┬──────────────────────┬──────────────────┬───────────────────┬───────────────┬───────────────┬────────────────────────────────────────────────────────────────────────────────────────┐
│ geo_id  │ cnecta_fips_code │         name         │      name_lsad       │  lsad   │ mtfcc_feature_clas…  │ area_land_meters │ area_water_meters │ int_point_lat │ int_point_lon │                                      cnecta_geom                                       │
│ varchar │     varchar      │       varchar        │       varchar        │ varchar │       varchar        │      int64       │       int64       │    double     │    double     │                                        varchar                                         │
├─────────┼──────────────────┼──────────────────────┼──────────────────────┼─────────┼──────────────────────┼──────────────────┼───────────────────┼───────────────┼───────────────┼────────────────────────────────────────────────────────────────────────────────────────┤
│ 710     │ 710              │ Augusta-Waterville…  │ Augusta-Waterville…  │ M4      │ G3200                │       2155036649 │         183936850 │    44.4092939 │   -69.6901717 │ POLYGON((-69.792813 44.57733, -69.795348 44.577668, -69.797891 44.578045, -69.798362…  │
│ 775     │ 775              │ Portland-Lewiston-…  │ Portland-Lewiston-…  │ M4      │ G3200                │       6254813062 │        1537827560 │    43.8562034 │   -70.3192682 │ POLYGON((-70.480078 44.032078, -70.483885 44.033878, -70.488816 44.036178, -70.49552…  │
│ 770     │ 770              │ Pittsfield-North A…  │ Pittsfield-North A…  │ M4      │ G3200                │       1524389768 │          24514153 │    42.5337519 │   -73.1678825 │ POLYGON((-73.306984 42.632646, -73.308333 42.629064, -73.309828 42.625081, -73.31011…  │
│ 790     │ 790              │ Springfield-Hartfo…  │ Springfield-Hartfo…  │ M4      │ G3200                │       8710006868 │         256922043 │    42.0359069 │   -72.6213616 │ POLYGON((-72.636821 42.577985, -72.637099 42.57691, -72.637355 42.575968, -72.637466…  │
│ 715     │ 715              │ Boston-Worcester-P…  │ Boston-Worcester-P…  │ M4      │ G3200                │      21419197698 │        3004814151 │    42.3307869 │   -71.3296644 │ MULTIPOLYGON(((-71.471454 43.411298, -71.464171 43.413557, -71.463266 43.414547, -71…  │
│ 725     │ 725              │ Lebanon-Claremont,…  │ Lebanon-Claremont,…  │ M4      │ G3200                │       3031319652 │          58476158 │    43.6727226 │   -72.2484543 │ POLYGON((-72.396019 43.428835, -72.396027 43.428797, -72.396038 43.42873, -72.396108…  │
│ 720     │ 720              │ Bridgeport-New Hav…  │ Bridgeport-New Hav…  │ M4      │ G3200                │       3940389150 │         374068423 │    41.3603421 │   -73.1284227 │ MULTIPOLYGON(((-72.602721 41.266584, -72.602742 41.266458, -72.60248 41.266178, -72.…  │
└─────────┴──────────────────┴──────────────────────┴──────────────────────┴─────────┴──────────────────────┴──────────────────┴───────────────────┴───────────────┴───────────────┴────────────────────────────────────────────────────────────────────────────────────────┘
</code></pre>
<p>From the Google Cloud Console, we can see the BigQuery job triggered by the query from the DuckDB CLI:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758774714002/07bcfbdb-afbd-4ca7-ba00-a4906237328a.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-summary">Summary</h2>
<p>Integrating DuckDB with BigQuery gives you seamless data access across the two platforms. It lets you run DuckDB's efficient query engine locally, which can result in significant cost savings: by scanning the data once and performing all subsequent computations on a powerful local machine, such as an "overpowered" Apple Silicon MacBook, you can optimise your data processing tasks without incurring a surprise BigQuery bill.</p>
]]></content:encoded></item><item><title><![CDATA[Missing DAGs in Airflow UI: A Cloud Composer Filter Bug]]></title><description><![CDATA[I recently assisted a customer in troubleshooting an issue with their Google Cloud Composer environment and found myself delving deep to uncover the problem. They had 22 DAGs, but only 11 were visible in the UI. It turns out this is a known bug that ...]]></description><link>https://derrickqin.com/missing-dags-in-airflow-ui-a-cloud-composer-filter-bug</link><guid isPermaLink="true">https://derrickqin.com/missing-dags-in-airflow-ui-a-cloud-composer-filter-bug</guid><category><![CDATA[google cloud]]></category><category><![CDATA[cloud composer]]></category><category><![CDATA[airflow]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Fri, 29 Aug 2025 08:20:40 GMT</pubDate><content:encoded><![CDATA[<p>I recently assisted a customer in troubleshooting an issue with their Google Cloud Composer environment and found myself delving deep to uncover the problem. They had 22 DAGs, but only 11 were visible in the UI. It turns out this is a known bug that isn't likely to be fixed soon, so I decided to document it here for anyone who might encounter the same issue in the future.</p>
<h2 id="heading-whats-happening">What's happening</h2>
<p>Airflow is a great tool for orchestrating data pipelines. One of the many reasons people love it is its great UI. On the main page, it shows all the DAGs that Airflow orchestrates. This customer should have seen 22 DAGs, but for some reason, only 11 were showing up. Weird!</p>
<p>The DAGs were running fine, and the Google Cloud Console showed all of them. What could go wrong?</p>
<h2 id="heading-going-down-the-wrong-path">Going down the wrong path</h2>
<p>When I first encountered this case, I thought it was one of those common DAG parsing issues in Airflow. I know how it goes: something's wrong with your DAG code, or Airflow can't finish parsing all the DAGs in time, so they don't show up in the UI.</p>
<p>So I followed Google's documentation about <a target="_blank" href="https://cloud.google.com/composer/docs/composer-2/troubleshooting-dag-processor">Troubleshooting DAG Processor issues</a> and started looking into the Airflow configuration options that define timeouts for parsing DAGs:</p>
<ul>
<li><p><code>[core]dagbag_import_timeout</code> - defines how much time the DAG processor has to import a single DAG file</p>
</li>
<li><p><code>[core]dag_file_processor_timeout</code> - defines how long a DAG file processor can spend processing a single DAG file before timing out</p>
</li>
</ul>
<p>But one thing bothered me: the customer only had 22 DAGs. Unlike another customer of mine with 2000+ DAGs, 22 DAGs should never hit parsing timeouts.</p>
<p>Turns out it wasn't a parsing issue at all!</p>
<h2 id="heading-which-versions-are-broken">Which versions are broken</h2>
<p>I spun up three Composer environments:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756453906312/af45017f-90fe-48af-8ff1-cb14df4b8534.png" alt class="image--center mx-auto" /></p>
<p>I can confirm the bug exists in Airflow 2.10.2 and 2.10.5, and probably in 2.10.3, 2.10.4, and 2.11.0 as well. Version 2.9.3 worked fine.</p>
<p>I only tested Cloud Composer, but I suspect the bug exists in MWAA too.</p>
<h2 id="heading-how-to-reproduce-this">How to reproduce this</h2>
<p>Every Cloud Composer environment comes with a default <code>airflow_monitoring</code> DAG to run a health check for the Airflow cluster. This is perfect - I don’t need to write any dummy DAGs.</p>
<p>Here is how to reproduce the bug in the 2.10.2 and 2.10.5 environments:</p>
<p>In my Composer environment, I navigated to the Airflow UI at https://3ca7d775de394874b2bc4ab37359e639-dot-us-central1.composer.googleusercontent.com. It showed the default <code>airflow_monitoring</code> DAG, great. The <code>All</code> and <code>Active</code> buttons both showed 1, and the <code>Failed</code> button showed 0:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/xBJBmNH3w0uUErJRdYSKkPPrc/?name=image.png" alt /></p>
<p>After clicking the <code>Failed</code> button, the button turned blue and both <code>All</code> and <code>Active</code> buttons <strong>incorrectly</strong> showed <strong>0</strong>:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/LdpZk10ptWHvKm80HNZyKjSGt/?name=image.png" alt /></p>
<p>I noticed that the URL had changed to: https://3ca7d775de394874b2bc4ab37359e639-dot-us-central1.composer.googleusercontent.com<strong>/home?lastrun=failed</strong></p>
<p>When I navigated to the Airflow UI again via https://3ca7d775de394874b2bc4ab37359e639-dot-us-central1.composer.googleusercontent.com, Chrome DevTools showed two redirections:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/x1QIgFzDUPD3CeL4XlAnBKJiu/?name=image.png" alt /></p>
<p>This indicated Airflow remembered the filter somehow.</p>
<p>In version 2.9.3, after clicking the <code>Failed</code> button, both <code>All</code> and <code>Active</code> buttons showed the correct DAG count:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/uvTmgjw8YDrNCPHWcHb9DM5M4/?name=image.png" alt /></p>
<h2 id="heading-it-is-a-bug-but-wont-get-fixed">It is a bug, but won’t get fixed</h2>
<p>After some digging, I found the <a target="_blank" href="https://github.com/apache/airflow/issues/53705">GitHub issue</a> for this bug. It is listed as <code>Closed as not planned</code>, which means it won't be fixed; as I understand it, the Airflow maintainers are busy working on Airflow 3.</p>
<p>Since Composer is just managed Airflow, Google can't really do much about it. They normally rely on the upstream project to fix it, which apparently isn't happening.</p>
<h2 id="heading-workarounds">Workarounds</h2>
<p>If you encounter the same issue, there are two workarounds:</p>
<ol>
<li><p>Delete the <code>?lastrun=failed</code> part from the URL</p>
</li>
<li><p>Click the "Failed" button again to turn off the filter</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Effective Use of Google Cloud IAM Deny Policies for BigQuery Data Security]]></title><description><![CDATA[Background
Recently, I assisted a customer who required protection for sensitive data in BigQuery, ensuring access was restricted to a small group of individuals. They wanted to prevent any accidental access by employees. To address this, I recommend...]]></description><link>https://derrickqin.com/effective-use-of-google-cloud-iam-deny-policies-for-bigquery-data-security</link><guid isPermaLink="true">https://derrickqin.com/effective-use-of-google-cloud-iam-deny-policies-for-bigquery-data-security</guid><category><![CDATA[bigquery]]></category><category><![CDATA[Data security]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Fri, 28 Mar 2025 13:00:00 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>Recently, I assisted a customer who required protection for sensitive data in BigQuery, ensuring access was restricted to a small group of individuals. They wanted to prevent any accidental access by employees. To address this, I recommended using <a target="_blank" href="https://cloud.google.com/iam/docs/deny-overview">Google Cloud IAM deny policies</a>. These policies allow you to establish strict access controls on BigQuery resources by defining deny rules at the organisation level. This ensures that selected principals are prevented from using certain permissions, regardless of the roles they have been granted. I want to use a lab to show you how it works so you can apply this in your organization if you have similar requirements.</p>
<h2 id="heading-lab">Lab</h2>
<p>First, grant your user the <code>roles/iam.denyAdmin</code> role; otherwise, you will get an error like the one below:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/lxgv0dc0IXA49pbo17KrfOFLQ/?name=image.png" alt /></p>
<p>To grant the role, you need to do it at the organization level in IAM:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759320043712/2898d7a5-43d4-4c2d-b2a2-80efdb07041c.png" alt class="image--center mx-auto" /></p>
<p>Create a deny policy to block a group of users from accessing BigQuery data and creating BigQuery jobs:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/2iexVdET8HUOUUeNY8zUMkCQH/?name=image.png" alt /></p>
<p>Once created, the policy looks like this:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/ZN0c3tjVvfx3LFkrptNmpvWmk/?name=image.png" alt /></p>
<p>This policy means that users in <code>admins@googlegroups.com</code> are denied these IAM permissions; therefore, they will not be able to create BigQuery jobs or view the data in the tables.</p>
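<p>If you want to automate this instead of clicking through the console, the deny policy can also be created programmatically. Below is a minimal sketch that mirrors the rule in the screenshot, assuming the <code>google-cloud-iam</code> (v2) Python client; the organization ID and policy ID are placeholders for this lab:</p>
<pre><code class="lang-python">from google.cloud import iam_v2
from google.cloud.iam_v2 import types

ORG_ID = "123456789012"  # placeholder organization ID
POLICY_ID = "deny-bigquery-data-access"  # placeholder policy ID

# Deny the group the permissions needed to run BigQuery jobs and read table data
deny_rule = types.DenyRule(
    denied_principals=["principalSet://goog/group/admins@googlegroups.com"],
    denied_permissions=[
        "bigquery.googleapis.com/jobs.create",
        "bigquery.googleapis.com/tables.getData",
    ],
)
policy = types.Policy(
    display_name="Deny BigQuery data access",
    rules=[types.PolicyRule(description="Block BigQuery jobs and reads",
                            deny_rule=deny_rule)],
)

# Deny policies attach to a resource; "/" in the attachment point
# must be URL-encoded as "%2F"
attachment_point = f"cloudresourcemanager.googleapis.com%2Forganizations%2F{ORG_ID}"

client = iam_v2.PoliciesClient()
operation = client.create_policy(
    request={
        "parent": f"policies/{attachment_point}/denypolicies",
        "policy": policy,
        "policy_id": POLICY_ID,
    }
)
print(operation.result())  # waits for the long-running operation to finish
</code></pre>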
<h2 id="heading-summary">Summary</h2>
<p>In conclusion, securing sensitive data in BigQuery is crucial for organizations that handle confidential information. By implementing Google Cloud IAM deny policies, you can effectively restrict access to BigQuery resources, ensuring that only authorized individuals have the necessary permissions. This approach not only prevents accidental access by employees but also strengthens your organization's data security posture. By following the steps outlined in this guide, you can apply these practices to safeguard your data and maintain compliance with security standards.</p>
]]></content:encoded></item><item><title><![CDATA[How to Implement IAM Authentication in Cloud SQL PostgreSQL: A Step-by-Step Tutorial]]></title><description><![CDATA[Background
IAM Authentication in Cloud SQL for PostgreSQL lets you manage database user access using Google Cloud Identity and Access Management (IAM) identities instead of traditional PostgreSQL usernames and passwords. This integrates database acce...]]></description><link>https://derrickqin.com/how-to-implement-iam-authentication-in-cloud-sql-postgresql-a-step-by-step-tutorial</link><guid isPermaLink="true">https://derrickqin.com/how-to-implement-iam-authentication-in-cloud-sql-postgresql-a-step-by-step-tutorial</guid><category><![CDATA[cloud sql]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[IAM]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Sat, 01 Feb 2025 13:00:00 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>IAM Authentication in Cloud SQL for PostgreSQL lets you manage database user access using Google Cloud Identity and Access Management (IAM) identities instead of traditional PostgreSQL usernames and passwords. This integrates database access control with your existing Google Cloud IAM policies, providing better security and easier management.</p>
<p>Recently, I helped one of my customers set it up. The official <a target="_blank" href="https://cloud.google.com/sql/docs/postgres/iam-authentication">documentation</a> is great, but it doesn't offer a step-by-step guide, so I want to document it here for anyone who wants to set it up.</p>
<p>Now, let's head to my lab!</p>
<h2 id="heading-set-it-up">Set it up</h2>
<p>First, create a PostgreSQL 15 database on Cloud SQL:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707950287950/878ef2e4-f413-4842-8de2-cf232decbe3d.png" alt class="image--center mx-auto" /></p>
<p>To enable IAM authentication, add the IAM auth flag <strong>cloudsql.iam_authentication</strong> while in Edit Instance mode:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707950338475/5c889549-1c4f-4df7-9d1c-5ac71a27cf79.png" alt class="image--center mx-auto" /></p>
<p>Create a Service Account that will be associated with a database user later:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707950446342/72115f27-4de3-422d-b675-3042e31a82a4.png" alt class="image--center mx-auto" /></p>
<p>Assign the proper IAM roles to the service account; at minimum, it needs <code>roles/cloudsql.instanceUser</code> to log in with IAM authentication:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707984279738/1c327083-fd5b-4bb2-8220-c81b00607f44.png" alt class="image--center mx-auto" /></p>
<p>Add a database IAM user:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707980612539/61a8b6a6-7f1a-4f2e-a860-644094bccb32.png" alt class="image--center mx-auto" /></p>
<p>Once it is done, it looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707980519164/e7964f95-c351-4e60-890a-0832f0614242.png" alt class="image--center mx-auto" /></p>
<p>Lastly, grant database privileges to the IAM user because, by default, it doesn't have any permissions.</p>
<p>Log in as the default <code>postgres</code> user:</p>
<pre><code class="lang-bash">➜  ~ gcloud sql connect pg-15 --user=postgres --quiet
Allowlisting your IP <span class="hljs-keyword">for</span> incoming connection <span class="hljs-keyword">for</span> 5 minutes...done.                                    
Connecting to database with SQL user [postgres].Password: 
psql (16.1 (Debian 16.1-1.pgdg110+1), server 15.4)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type <span class="hljs-string">"help"</span> <span class="hljs-keyword">for</span> <span class="hljs-built_in">help</span>.
postgres=&gt; \du
                                        List of roles
           Role name            |                         Attributes                         
--------------------------------+------------------------------------------------------------
 cloudsqladmin                  | Superuser, Create role, Create DB, Replication, Bypass RLS
 cloudsqlagent                  | Create role, Create DB
 cloudsqliamserviceaccount      | Cannot login
 cloudsqliamuser                | Cannot login
 cloudsqlimportexport           | Create role, Create DB
 cloudsqllogical                | Cannot login, Replication
 cloudsqlreplica                | Replication
 cloudsqlsuperuser              | Create role, Create DB
 postgres                       | Create role, Create DB
 test-iam-auth@airflow-talk.iam | 

postgres=&gt; GRANT cloudsqlsuperuser to <span class="hljs-string">"test-iam-auth@airflow-talk.iam"</span>;                                
GRANT ROLE
</code></pre>
<p>All done!</p>
<h2 id="heading-testing">Testing</h2>
<p>To test whether the IAM authentication is working, we set up the connection to Cloud SQL using the Cloud SQL Auth Proxy.</p>
<p>First, we run the proxy with service account impersonation:</p>
<pre><code class="lang-bash">➜  ~ ./cloud-sql-proxy --auto-iam-authn --impersonate-service-account=test-iam-auth@airflow-talk.iam.gserviceaccount.com airflow-talk:us-central1:pg-15
2024/02/15 08:03:17 Impersonating service account with Application Default Credentials
2024/02/15 08:03:18 [airflow-talk:us-central1:pg-15] Listening on 127.0.0.1:5432
2024/02/15 08:03:18 The proxy has started successfully and is ready <span class="hljs-keyword">for</span> new connections!
</code></pre>
<p>Then, connect to Postgres through the proxy, and you will see it is working:</p>
<pre><code class="lang-bash">➜  ~ psql -h 127.0.0.1 \
 -U <span class="hljs-string">"test-iam-auth@airflow-talk.iam"</span> \
 --port 5432 \
 --dbname=postgres

psql (16.1 (Debian 16.1-1.pgdg110+1), server 15.4)
Type <span class="hljs-string">"help"</span> <span class="hljs-keyword">for</span> <span class="hljs-built_in">help</span>.

postgres=&gt;
</code></pre>
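<p>If you would rather connect from application code without running the proxy binary yourself, the Cloud SQL Python Connector supports the same automatic IAM authentication. A minimal sketch, assuming <code>cloud-sql-python-connector[pg8000]</code> and <code>sqlalchemy</code> are installed:</p>
<pre><code class="lang-python">import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # enable_iam_auth exchanges the caller's IAM credentials for a
    # short-lived token that is used as the database password
    return connector.connect(
        "airflow-talk:us-central1:pg-15",
        "pg8000",
        user="test-iam-auth@airflow-talk.iam",
        db="postgres",
        enable_iam_auth=True,
    )

engine = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)
with engine.connect() as conn:
    print(conn.execute(sqlalchemy.text("SELECT current_user")).scalar())
</code></pre>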
<p>That’s it. Hope this works for you.</p>
]]></content:encoded></item><item><title><![CDATA[How to trigger the DAG in Amazon Managed Workflows for Apache Airflow (MWAA)]]></title><description><![CDATA[Background
Directed Acyclic Graphs (DAGs) in Amazon Managed Workflows for Apache Airflow (MWAA) can be triggered in several ways, depending on how much automation and integration you need. While you can manually trigger DAGs using the Airflow UI, we ...]]></description><link>https://derrickqin.com/how-to-trigger-the-dag-in-amazon-managed-workflows-for-apache-airflow-mwaa</link><guid isPermaLink="true">https://derrickqin.com/how-to-trigger-the-dag-in-amazon-managed-workflows-for-apache-airflow-mwaa</guid><category><![CDATA[mwaa]]></category><category><![CDATA[AWS]]></category><category><![CDATA[airflow]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Wed, 01 Jan 2025 13:00:00 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>Directed Acyclic Graphs (DAGs) in Amazon Managed Workflows for Apache Airflow (MWAA) can be triggered in several ways, depending on how much automation and integration you need. While you can manually trigger DAGs using the Airflow UI, we often prefer to do it programmatically.</p>
<p>Recently, I helped a customer set this up. The official <a target="_blank" href="https://docs.aws.amazon.com/mwaa/latest/userguide/samples-invoke-dag.html"><strong>documentation</strong></a> doesn't provide a step-by-step guide with screenshots, so I want to document it here for anyone interested in setting it up.</p>
<h2 id="heading-lab">Lab</h2>
<p>First, let’s create an MWAA environment. The simplest way is to download the CloudFormation template from this AWS <a target="_blank" href="https://docs.aws.amazon.com/mwaa/latest/userguide/quick-start.html#quick-start-template">tutorial</a> and deploy it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674732092116/82d99c2c-0073-4d71-b97b-212affe338a8.png" alt class="image--center mx-auto" /></p>
<p>Create two simple DAGs that print out the <code>dag_run.conf</code> values.</p>
<p>The first DAG uses the traditional method of defining tasks and employs a Jinja template to access the value:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta
<span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.models.param <span class="hljs-keyword">import</span> Param

<span class="hljs-keyword">from</span> airflow.operators.bash_operator <span class="hljs-keyword">import</span> BashOperator

default_args = {
    <span class="hljs-string">"owner"</span>: <span class="hljs-string">"airflow"</span>,
    <span class="hljs-string">"depends_on_past"</span>: <span class="hljs-literal">False</span>,
    <span class="hljs-string">"retries"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"retry_delay"</span>: timedelta(minutes=<span class="hljs-number">5</span>),
    <span class="hljs-string">"email"</span>: [<span class="hljs-string">"airflow@example.com"</span>],
    <span class="hljs-string">"email_on_failure"</span>: <span class="hljs-literal">False</span>,
    <span class="hljs-string">"email_on_retry"</span>: <span class="hljs-literal">False</span>,
}

<span class="hljs-keyword">with</span> DAG(
    <span class="hljs-string">"10_dag_run_conf_dag"</span>,
    default_args=default_args,
    description=<span class="hljs-string">"Dag Run Conf Dag"</span>,
    schedule_interval=<span class="hljs-string">"0 12 * * *"</span>,
    start_date=datetime(<span class="hljs-number">2023</span>, <span class="hljs-number">1</span>, <span class="hljs-number">24</span>),
    catchup=<span class="hljs-literal">False</span>,
    tags=[<span class="hljs-string">"custom"</span>]
) <span class="hljs-keyword">as</span> dag:

    start = BashOperator(
        task_id=<span class="hljs-string">"start"</span>,
        bash_command=<span class="hljs-string">"echo start"</span>,
    )
    print_dag_run_conf = BashOperator(
        task_id=<span class="hljs-string">"print_dag_run_conf"</span>,
        bash_command=<span class="hljs-string">"echo value: {{ dag_run.conf['conf1'] }}"</span>,
        dag=dag,
    )
    end = BashOperator(
        task_id=<span class="hljs-string">"end"</span>,
        bash_command=<span class="hljs-string">"echo stop"</span>,
    )

    start &gt;&gt; print_dag_run_conf &gt;&gt; end
</code></pre>
<p>The second DAG uses Airflow <a target="_blank" href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/taskflow.html">TaskFlow</a> API:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> pprint <span class="hljs-keyword">import</span> pprint

<span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.decorators <span class="hljs-keyword">import</span> task
<span class="hljs-keyword">from</span> airflow.operators.python <span class="hljs-keyword">import</span> PythonOperator


<span class="hljs-keyword">with</span> DAG(
    dag_id=<span class="hljs-string">"11_taskflow_dag_run_conf_dag"</span>,
    schedule_interval=<span class="hljs-literal">None</span>,
    start_date=datetime(<span class="hljs-number">2023</span>, <span class="hljs-number">1</span>, <span class="hljs-number">24</span>),
    catchup=<span class="hljs-literal">False</span>,
    tags=[<span class="hljs-string">"custom"</span>],
) <span class="hljs-keyword">as</span> dag:
<span class="hljs-meta">    @task(task_id="sleep_for_1")</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">my_sleeping_function</span>():</span>
        time.sleep(<span class="hljs-number">1</span>)

    sleeping_task = my_sleeping_function()

<span class="hljs-meta">    @task(task_id="print_dag_conf")</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">print_dag_conf</span>(<span class="hljs-params">ds=None, **kwargs</span>):</span>
        print(kwargs[<span class="hljs-string">"dag_run"</span>].conf)
        <span class="hljs-keyword">return</span> <span class="hljs-string">"Whatever you return gets printed in the logs"</span>

    print_dag_conf = print_dag_conf(my_keyword=<span class="hljs-string">"Airflow"</span>)

    sleeping_task &gt;&gt; print_dag_conf
</code></pre>
<p>Now let's upload the DAGs to S3 so MWAA can load them:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674732528785/a7969c04-f1f3-4dc1-b20d-f696b7c461a5.png" alt class="image--center mx-auto" /></p>
<p>Check that the DAGs are loaded in the Airflow UI:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674732488548/17d52aaf-c63f-47cc-9495-f02a7ab696c0.png" alt class="image--center mx-auto" /></p>
<p><strong>TIPS</strong>: When navigating to the MWAA Airflow UI, you might see an "Internal Server Error." This is an unresolved issue.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674732375365/80bb2c62-bcd0-4ed1-92ff-de45441c479f.png" alt class="image--center mx-auto" /></p>
<p>To fix this, you need to delete the cookie in your browser. For example, in Chrome Developer Tools:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674732438953/e474e25e-fa2d-4568-90d9-b0ca6777f646.png" alt class="image--center mx-auto" /></p>
<p>To trigger the DAG programmatically, you need an IAM user with a policy that allows <code>airflow:CreateCliToken</code>, such as the <code>AmazonMWAAAirflowCliAccess</code> policy from the AWS documentation:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
    <span class="hljs-attr">"Statement"</span>: [
        {
            <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
            <span class="hljs-attr">"Action"</span>: [
                <span class="hljs-string">"airflow:CreateCliToken"</span>
            ],
            <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"*"</span>
        }
    ]
}
</code></pre>
<p>Create an access key:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674732930394/afcba0f6-54ca-488b-8f9e-00aee6fde878.png" alt class="image--center mx-auto" /></p>
<p>Use the keys in the AWS CLI to test it:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674732991499/1c0c33f6-83e6-41b5-bd82-456e6da0d97c.png" alt class="image--center mx-auto" /></p>
<p>Write a Python script to trigger the DAG:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> requests 
<span class="hljs-keyword">import</span> base64

mwaa_env_name = <span class="hljs-string">'Derrick-MWAA-MwaaEnvironment'</span>
dag_name = <span class="hljs-string">'10_dag_run_conf_dag'</span>
key = <span class="hljs-string">"conf1"</span>
value = <span class="hljs-string">"cool value1"</span>
conf = <span class="hljs-string">"{\""</span> + key + <span class="hljs-string">"\":\""</span> + value + <span class="hljs-string">"\"}"</span>

client = boto3.client(<span class="hljs-string">'mwaa'</span>)

mwaa_cli_token = client.create_cli_token(
  Name=mwaa_env_name
)

mwaa_auth_token = <span class="hljs-string">'Bearer '</span> + mwaa_cli_token[<span class="hljs-string">'CliToken'</span>]
mwaa_webserver_hostname = <span class="hljs-string">'https://{0}/aws_mwaa/cli'</span>.format(mwaa_cli_token[<span class="hljs-string">'WebServerHostname'</span>])
raw_data = <span class="hljs-string">"dags trigger {0} --conf '{1}'"</span>.format(dag_name, conf)
<span class="hljs-comment">#raw_data = "trigger_dag {0} -c '{1}'".format(dag_name, conf)</span>
mwaa_response = requests.post(
      mwaa_webserver_hostname,
      headers={
          <span class="hljs-string">'Authorization'</span>: mwaa_auth_token,
          <span class="hljs-string">'Content-Type'</span>: <span class="hljs-string">'text/plain'</span>
          },
      data=raw_data
      )

mwaa_std_err_message = base64.b64decode(mwaa_response.json()[<span class="hljs-string">'stderr'</span>]).decode(<span class="hljs-string">'utf8'</span>)
mwaa_std_out_message = base64.b64decode(mwaa_response.json()[<span class="hljs-string">'stdout'</span>]).decode(<span class="hljs-string">'utf8'</span>)

print(mwaa_response.status_code)
print(mwaa_std_err_message)
print(mwaa_std_out_message)
</code></pre>
<p>Note that before Airflow 2.0, you need to use the <code>trigger_dag</code> command (the commented-out line above); from Airflow 2.0 onward, use the <code>dags trigger</code> command.</p>
<p>Run the Python script:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674733103209/1472fb8a-44cc-4f39-b3c3-6fde58fea47c.png" alt class="image--center mx-auto" /></p>
<p>Yay! We can see the value of <code>dag_run.conf</code> in the Airflow logs! This means that triggering the DAG from our Python script works.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674733237628/816af0c2-bfcf-45b8-a10b-abc220c94d05.png" alt class="image--center mx-auto" /></p>
<p>To double-check this, we can update the Python script to trigger the second DAG. The logs look promising:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674733858517/66fcd736-de7f-4464-b3aa-1658ce438af0.png" alt class="image--center mx-auto" /></p>
<p>That’s it. Hope this works for you.</p>
]]></content:encoded></item><item><title><![CDATA[Odd IAM issue using GCP Pub/Sub Ruby SDK to publish messages]]></title><description><![CDATA[Ruby is a programming language that is popular in Google Cloud community. Recently, I got an odd IAM issue from a client using its GCP Pub/Sub SDK when publishing messages.
Like always, I ran a lab to show you.
I wrote a Ruby script to publish a mess...]]></description><link>https://derrickqin.com/odd-iam-issue-using-gcp-pubsub-ruby-sdk-to-publish-messages</link><guid isPermaLink="true">https://derrickqin.com/odd-iam-issue-using-gcp-pubsub-ruby-sdk-to-publish-messages</guid><category><![CDATA[PubSub]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[Ruby]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Fri, 29 Sep 2023 14:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/EhPih0l5bjw/upload/17503118dcec39762891e7db7997e46c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ruby is a programming language that is popular in the Google Cloud community. Recently, a client ran into an odd IAM issue using the GCP Pub/Sub Ruby SDK to publish messages.</p>
<p>As always, I ran a lab to show you.</p>
<p>I wrote a Ruby script to publish a message to Pub/Sub:</p>
<pre><code class="lang-ruby"><span class="hljs-keyword">require</span> <span class="hljs-string">"google/cloud/pubsub"</span>

pubsub = Google::Cloud::PubSub.new(
  <span class="hljs-symbol">project_id:</span> <span class="hljs-string">"derrick-sandbox"</span>,
  <span class="hljs-symbol">credentials:</span> <span class="hljs-string">"test-pubsub-sa.json"</span>
)

<span class="hljs-comment"># Retrieve a topic</span>
topic = pubsub.topic <span class="hljs-string">"test-topic"</span>

<span class="hljs-comment"># Publish a new message</span>
msg = topic.publish <span class="hljs-string">"new-message2"</span>

puts <span class="hljs-string">'Sent a message to PubSub!'</span>
</code></pre>
<p>I have two test projects: <code>airflow-talk</code> and <code>derrick-sandbox</code>.<br />To simulate the client's setup, I ran my script with a service account from <code>airflow-talk</code>, while the PubSub topic lives in <code>derrick-sandbox</code>, where I granted permissions on the topic.</p>
<p>I created a PubSub topic and subscription in project <code>derrick-sandbox</code>:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/WR4evYgNbFtVExjD7PjfpecEj/?name=image.png" alt /></p>
<p>I created a service account in project <code>airflow-talk</code>:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/KPBvePIjNPy1mdGbSVKnxQSRO/?name=image.png" alt /></p>
<p>I then granted only Pub/Sub Publisher (roles/pubsub.publisher) to this service account, which, in theory, should be enough to publish messages to Pub/Sub:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/gPONR2hnbZqJ47Yu95Wj8ErPm/?name=image.png" alt /></p>
<p>Now, back in the <code>airflow-talk</code> project, in Cloud Shell, I was able to use <code>gcloud</code> to publish a message to the topic:<br />(Note that I authorized my <code>gcloud</code> with the service account first)</p>
<pre><code class="lang-bash">➜  gcloud auth activate-service-account test-pubsub@airflow-talk.iam.gserviceaccount.com  --key-file=test-pubsub-sa.json --project=derrick-sandbox

➜  gcloud pubsub topics publish projects/derrick-sandbox/topics/test-topic --message=<span class="hljs-string">"Hello World"</span>                                                
messageIds:
- <span class="hljs-string">'8303462186509395'</span>
</code></pre>
<p>But when running the Ruby script, it got an error:</p>
<pre><code class="lang-bash">➜  ruby-pubsub ruby test-pubsub.rb
Traceback (most recent call last):
        11: from test-pubsub.rb:9:<span class="hljs-keyword">in</span> `&lt;main&gt;<span class="hljs-string">'
        10: from /home/derrick/.gems/gems/google-cloud-pubsub-2.15.4/lib/google/cloud/pubsub/project.rb:171:in `topic'</span>
         9: from /home/derrick/.gems/gems/google-cloud-pubsub-2.15.4/lib/google/cloud/pubsub/service.rb:109:<span class="hljs-keyword">in</span> `get_topic<span class="hljs-string">'
         8: from /home/derrick/.gems/gems/google-cloud-pubsub-v1-0.18.0/lib/google/cloud/pubsub/v1/publisher/client.rb:584:in `get_topic'</span>
         7: from /home/derrick/.gems/gems/gapic-common-0.20.0/lib/gapic/grpc/service_stub.rb:190:<span class="hljs-keyword">in</span> `call_rpc<span class="hljs-string">'
         6: from /home/derrick/.gems/gems/gapic-common-0.20.0/lib/gapic/grpc/service_stub/rpc_call.rb:123:in `call'</span>
         5: from /home/derrick/.gems/gems/grpc-1.58.0-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:173:<span class="hljs-keyword">in</span> `block <span class="hljs-keyword">in</span> request_response<span class="hljs-string">'
         4: from /home/derrick/.gems/gems/grpc-1.58.0-x86_64-linux/src/ruby/lib/grpc/generic/interceptors.rb:170:in `intercept!'</span>
         3: from /home/derrick/.gems/gems/grpc-1.58.0-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:174:<span class="hljs-keyword">in</span> `block (2 levels) <span class="hljs-keyword">in</span> request_response<span class="hljs-string">'
         2: from /home/derrick/.gems/gems/grpc-1.58.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:377:in `request_response'</span>
         1: from /home/derrick/.gems/gems/grpc-1.58.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:186:<span class="hljs-keyword">in</span> `attach_status_results_and_complete_call<span class="hljs-string">'
/home/derrick/.gems/gems/grpc-1.58.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:29:in `check_status'</span>: 7:User not authorized to perform this action.. debug_error_string:{UNKNOWN:Error received from peer ipv4:142.251.10.95:443 {grpc_message:<span class="hljs-string">"User not authorized to perform this action."</span>, grpc_status:7, created_time:<span class="hljs-string">"2023-09-20T06:40:27.019799881+00:00"</span>}} (GRPC::PermissionDenied)
        4: from test-pubsub.rb:9:<span class="hljs-keyword">in</span> `&lt;main&gt;<span class="hljs-string">'
        3: from /home/derrick/.gems/gems/google-cloud-pubsub-2.15.4/lib/google/cloud/pubsub/project.rb:171:in `topic'</span>
        2: from /home/derrick/.gems/gems/google-cloud-pubsub-2.15.4/lib/google/cloud/pubsub/service.rb:109:<span class="hljs-keyword">in</span> `get_topic<span class="hljs-string">'
        1: from /home/derrick/.gems/gems/google-cloud-pubsub-v1-0.18.0/lib/google/cloud/pubsub/v1/publisher/client.rb:551:in `get_topic'</span>
/home/derrick/.gems/gems/google-cloud-pubsub-v1-0.18.0/lib/google/cloud/pubsub/v1/publisher/client.rb:589:<span class="hljs-keyword">in</span> `rescue <span class="hljs-keyword">in</span> get_topic<span class="hljs-string">': 7:User not authorized to perform this action.. debug_error_string:{UNKNOWN:Error received from peer ipv4:142.251.10.95:443 {grpc_message:"User not authorized to perform this action.", grpc_status:7, created_time:"2023-09-20T06:40:27.019799881+00:00"}} (Google::Cloud::PermissionDeniedError)</span>
</code></pre>
<p>This is very strange. After digging into the stacktrace, I found it is related to this line:</p>
<pre><code class="lang-ruby"><span class="hljs-comment"># Retrieve a topic</span>
topic = pubsub.topic <span class="hljs-string">"test-topic"</span>
</code></pre>
<p>Apparently, this call makes a <code>get_topic</code> API request to check that the topic exists, so it requires the <code>Pub/Sub Viewer (roles/pubsub.viewer)</code> role (specifically, the <code>pubsub.topics.get</code> permission). SDKs for other languages don't do this lookup by default, and the Ruby gem can also skip it if you pass <code>skip_lookup: true</code> to <code>pubsub.topic</code>.</p>
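<p>For contrast, here is a minimal sketch with the Python client (assuming credentials are configured via ADC or a key file): it builds the topic path locally and publishes without any <code>get_topic</code> call, so <code>roles/pubsub.publisher</code> alone is enough:</p>
<pre><code class="lang-python">from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# topic_path only formats a string; no API call happens here
topic_path = publisher.topic_path("derrick-sandbox", "test-topic")

future = publisher.publish(topic_path, b"new-message2")
print(future.result())  # prints the message ID once the publish succeeds
</code></pre>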
<p>Back to the IAM page, I added <code>Pub/Sub Viewer (roles/pubsub.viewer)</code> to my service account:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/vYGNsp7KNo7UPSkb1gQtA6n1s/?name=image.png" alt /></p>
<p>The script worked!</p>
<pre><code class="lang-bash">➜  ruby-pubsub ruby test-pubsub.rb
Sent a message to PubSub!
</code></pre>
]]></content:encoded></item><item><title><![CDATA[CloudSQL Query Insight - Query details truncated to 1024 characters]]></title><description><![CDATA[Query insights is a useful feature to help Cloud SQL users to detect, diagnose, and prevent query performance problems for Cloud SQL databases.
One client had an issue using it because their query was too long and it was truncated on Query details pa...]]></description><link>https://derrickqin.com/cloudsql-query-insight-query-details-truncated-to-1024-characters</link><guid isPermaLink="true">https://derrickqin.com/cloudsql-query-insight-query-details-truncated-to-1024-characters</guid><category><![CDATA[google cloud]]></category><category><![CDATA[cloud sql]]></category><category><![CDATA[MySQL]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Fri, 22 Sep 2023 14:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/UOAD9U-TYxc/upload/c8e4adde77395ca826209e5b382f9fa4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Query Insights is a useful feature that helps Cloud SQL users detect, diagnose, and prevent query performance problems in Cloud SQL databases.</p>
<p>One client had an issue using it because their query was too long and got truncated on the Query details page. They asked if there is a config option to show longer queries.</p>
<p>As always, I ran a lab to demonstrate how to fix this issue.</p>
<p>First, I created a Cloud SQL MySQL instance and ran a long query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,   <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>,  <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, 
<span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>, <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">`blablahblahblablahblah`</span>;
</code></pre>
<p>Below is what I saw in Query Insights:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702382247411/648ade7a-4ddb-41b7-a22a-bf6ae8cf49a7.png" alt class="image--center mx-auto" /></p>
<p>I followed the help <a target="_blank" href="https://cloud.google.com/sql/docs/mysql/using-query-insights#enable-insights">document</a> to change to 4500 characters:</p>
<blockquote>
<p><strong>Customize query lengths</strong></p>
<p>Default: <code>1024</code></p>
<p>Sets the query length limit to a specified value from 256 bytes to 4500 bytes. Higher query lengths are more useful for analytical queries, but they also require more memory. Changing the query length requires you to restart the instance. You can still add tags to queries that exceed the length limit.</p>
</blockquote>
<p>But the issue was still there.</p>
<p>It turns out that MySQL itself limits the statement digest size: there is a system variable called <a target="_blank" href="https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_max_digest_length"><code>max_digest_length</code></a>, which defaults to 1024 bytes. I needed to set it via a Cloud SQL <a target="_blank" href="https://cloud.google.com/sql/docs/mysql/flags">flag</a>.</p>
<p>I set the flag to 4096 and I found the query not truncated in Query Insight.</p>
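<p>For reference, here is a minimal sketch of setting the flag with the <code>gcloud</code> CLI; the instance name is a placeholder, and note that <code>--database-flags</code> replaces the full set of flags on the instance, so include any flags you already have:</p>
<pre><code class="lang-bash"># Hypothetical instance name; changing a flag restarts the instance.
# --database-flags overwrites ALL existing flags, so list them all here.
gcloud sql instances patch my-mysql-instance \
    --database-flags=max_digest_length=4096
</code></pre>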
]]></content:encoded></item><item><title><![CDATA[Cloud Composer Airflow and BigQuery External Table with Google Sheets]]></title><description><![CDATA[BigQuery has a useful feature which allows the user to create external table with data on Google Sheets. It is very convenient because BigQuery users can query the data from Google Sheets directly. However, as a data engineer, you may need to build p...]]></description><link>https://derrickqin.com/cloud-composer-airflow-and-bigquery-external-table-with-google-sheets</link><guid isPermaLink="true">https://derrickqin.com/cloud-composer-airflow-and-bigquery-external-table-with-google-sheets</guid><category><![CDATA[cloud composer]]></category><category><![CDATA[bigquery]]></category><category><![CDATA[#googlesheets]]></category><category><![CDATA[airflow]]></category><category><![CDATA[apache-airflow]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Sat, 22 Jul 2023 14:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/TisvwNLLWA4/upload/4771bcea7566efc62b825f1c80b63ef4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>BigQuery has a useful feature that allows users to create external tables backed by data in Google Sheets. It is very convenient because BigQuery users can query the Google Sheets data directly. However, as a data engineer, you may need to build pipelines that interact with the data from these external tables.</p>
<p>Two changes are required in your Cloud Composer Airflow environment to make this possible.</p>
<p>I'll show these changes in a lab.</p>
<p>I created a Google Sheets file that looks like this:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/xR5QGx3kkcYUFaVFn7B58vBT3/?name=image.png" alt /></p>
<p>And created a BigQuery external table that links to this file:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/om0pEx951hneAEw3epLtRcHsK/?name=image.png" alt /></p>
<p>I can query it without issue on BigQuery Console:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/zQVKae5iXfXVP5ZIs6YAAs6a3/?name=image.png" alt /></p>
<p>Now, on the Airflow side, I created a DAG and uploaded it to Cloud Composer:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> datetime

<span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.empty <span class="hljs-keyword">import</span> EmptyOperator
<span class="hljs-keyword">from</span> airflow.providers.google.cloud.operators.bigquery <span class="hljs-keyword">import</span> BigQueryCheckOperator


<span class="hljs-keyword">with</span> DAG(
    dag_id=<span class="hljs-string">"test-bigquery"</span>,
    start_date=datetime.datetime(<span class="hljs-number">2023</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>),
    schedule=<span class="hljs-literal">None</span>,
):
    DATASET = <span class="hljs-string">"ext"</span>
    TABLE_1 = <span class="hljs-string">"gsheets"</span>
    location = <span class="hljs-string">"US"</span>

    check_count = BigQueryCheckOperator(
        task_id=<span class="hljs-string">"check_count"</span>,
        sql=<span class="hljs-string">f"SELECT COUNT(*) FROM <span class="hljs-subst">{DATASET}</span>.<span class="hljs-subst">{TABLE_1}</span>"</span>,
        use_legacy_sql=<span class="hljs-literal">False</span>,
        location=location,
    )
</code></pre>
<p>On the first run, Airflow reported an issue:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/CEDo3urroMHEuNmMyf1JdwfVY/?name=image.png" alt /></p>
<p>Diving into the logs, I found that it was about missing Google Drive access. That makes sense because Google Sheets data is stored on Google Drive.</p>
<p>This error is understandable because, by default, Airflow uses the <code>google_cloud_default</code> Connection, which doesn't have the Google Drive API scope. So I added it:<br />(Note that you'll need to add <code>cloud-platform</code> in the scope because you don't want to lose access to Google Cloud.)</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/RuvSsDjqvK6874G1gLAyHbAdS/?name=image.png" alt /></p>
<p>I ran the DAG again but was still getting errors:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/DaOEwsYzxlCzwQ9UOMQxLCAZb/?name=image.png" alt /></p>
<p>What could be the missing piece of the puzzle?</p>
<p>I realized that it could be the permission for the service account to access the Google Sheets file, so I shared the file with the service account from the Google Sheets UI:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/IHr5vIhS10aeLNE9oF5iGYuEM/?name=image.png" alt /></p>
<p>After that, I re-ran the job and it was successful!</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/mamCuC7ge0p0xPwIOkV0YxdHS/?name=image.png" alt /></p>
<p><strong>Summary:</strong><br />To access Google Sheets data in Cloud Composer via BigQuery, you will need the following:  </p>
<ol>
<li><p>Google Drive API scope to be added to your Airflow Connection</p>
</li>
<li><p>File access to be added to the service account that Composer uses</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Export data from Cloud SQL to GCS using BigQuery Federated Query and Scheduled Query]]></title><description><![CDATA[Recently, I worked with a client who wanted to export data from Cloud SQL to GCS daily. They tried Cloud SQL Serverless Export orchestrated by Cloud Composer but found it is pretty expensive for their use case. For the context, with Serverless Export...]]></description><link>https://derrickqin.com/export-data-from-cloud-sql-to-gcs-using-bigquery-federated-query-and-scheduled-query</link><guid isPermaLink="true">https://derrickqin.com/export-data-from-cloud-sql-to-gcs-using-bigquery-federated-query-and-scheduled-query</guid><category><![CDATA[bigquery]]></category><category><![CDATA[cloud sql]]></category><category><![CDATA[#gcs]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Sat, 01 Jul 2023 14:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/fyeOxvYvIyY/upload/fa01560445a8eff775b3ec85c4ad0b54.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, I worked with a client who wanted to export data from Cloud SQL to GCS daily. They tried Cloud SQL Serverless Export orchestrated by Cloud Composer but found it pretty expensive for their use case. For context, with Serverless Export, Cloud SQL creates a separate, temporary instance to offload the export operation, which is expensive because of the extra resources provisioned.</p>
<p>In this blog post, I’ll show you how to export data from Cloud SQL Postgres (This solution works for MySQL too) database to GCS using <a target="_blank" href="https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries">BigQuery federated query</a> and its own <a target="_blank" href="https://cloud.google.com/bigquery/docs/scheduling-queries">Scheduled Query</a> feature.</p>
<p><strong>Step 1:</strong> Set up the Cloud SQL connection in BigQuery following this <a target="_blank" href="https://cloud.google.com/bigquery/docs/connect-to-sql#create-sql-connection">document</a>. To avoid impacting the performance of the main instance, I used a PostgreSQL read replica.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702374409718/a2c8c756-bca7-4ff6-b675-5414f5e2716b.png" alt class="image--center mx-auto" /></p>
<p><strong>Step 2:</strong> Create a GCS bucket.</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/yKpUNAOBfBGL80B4yCHo2QGLo/?name=image.png" alt /></p>
<p><strong>Step 3:</strong> Use the SQL query below to create a Scheduled Query following this <a target="_blank" href="https://cloud.google.com/bigquery/docs/scheduling-queries">document</a>. For testing purposes, I created an on-demand job so that I could trigger it manually.</p>
<p>Note that in this SQL, I used BigQuery's <a target="_blank" href="https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#export_data_statement"><code>EXPORT DATA</code> statement</a>, which can export query results directly to GCS.</p>
<pre><code class="lang-sql">EXPORT DATA
  OPTIONS (
    uri = CONCAT('gs://test-sql-ext/', CURRENT_TIMESTAMP(), '--*.csv'),
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> EXTERNAL_QUERY(<span class="hljs-string">"airflow-talk.us.pg-ext"</span>, <span class="hljs-string">"SELECT * FROM INFORMATION_SCHEMA.TABLES;"</span>)
);
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1702374488395/44d0060c-2bd1-49e9-8341-c075a758f887.png" alt class="image--center mx-auto" /></p>
<p><strong>Step 4:</strong> Trigger the job and check the GCS file.</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/DXmYLwchriz2GBOfB3d6FGFjI/?name=image.png" alt /></p>
<p><img src="https://doitintl.zendesk.com/attachments/token/JXn4pnY4jTUnTsbafOw7wnmNy/?name=image.png" alt /></p>
<p><strong>Step 5:</strong> Verify the CSV file. It contains the data extracted by the SQL query.</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/Z6U8M24cnXOBcYDpVSKUC12z0/?name=image.png" alt /></p>
<p>You can schedule the query to run using the console or the bq CLI; this <a target="_blank" href="https://cloud.google.com/bigquery/docs/scheduling-queries#set_up_scheduled_queries">part</a> of the document explains how. On the console, to specify a custom frequency, select <strong>Custom</strong>, then enter a Cron-like time specification in the <strong>Custom schedule</strong> field, for example, <code>every mon 23:30</code> or <code>every 6 hours</code>.</p>
<p><img src="https://cloud.google.com/static/bigquery/images/custom-scheduled-query.png" alt="Formatting a custom scheduled query." /></p>
<p>That's it! We built a data pipeline that extracts data from Cloud SQL to GCS on a schedule, without Serverless Export, and at a much lower cost.</p>
]]></content:encoded></item><item><title><![CDATA[Run dbt with Cloud Composer and Cloud Run]]></title><description><![CDATA[Recently, a client wanted to use dbt core in Cloud Composer Airflow but they encountered Python dependencies issues. To solve this issue, I built a solution to run dbt inside of Cloud Run with Cloud Composer as orchestrator.
The workflow looks like: ...]]></description><link>https://derrickqin.com/run-dbt-with-cloud-composer-and-cloud-run</link><guid isPermaLink="true">https://derrickqin.com/run-dbt-with-cloud-composer-and-cloud-run</guid><category><![CDATA[dbt]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[cloud composer]]></category><category><![CDATA[airflow]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Sat, 24 Jun 2023 14:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/eBRTYyjwpRY/upload/3c983d26cd6ac350e50f18c20e9337fa.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, a client wanted to use dbt core in Cloud Composer Airflow but encountered Python dependency issues. To solve this, I built a solution that runs dbt inside Cloud Run, with Cloud Composer as the orchestrator.</p>
<p>The workflow looks like:  </p>
<p>Cloud Composer -&gt; BashOperator to trigger a Cloud Run service -&gt; Cloud Run container runs dbt cli</p>
<p>Below is how it works:</p>
<p><strong>Cloud Run service:</strong><br />When the Cloud Run service is triggered (via an HTTP call), it runs a bash script (script.sh).</p>
<p>Dockerfile</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> golang:<span class="hljs-number">1.18</span>-buster as builder

<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>

<span class="hljs-keyword">COPY</span><span class="bash"> go.* ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> go mod download</span>

<span class="hljs-keyword">COPY</span><span class="bash"> invoke.go ./</span>

<span class="hljs-keyword">RUN</span><span class="bash"> go build -mod=<span class="hljs-built_in">readonly</span> -v -o server</span>

<span class="hljs-comment"># Use the official dbt-bigquery image for running</span>
<span class="hljs-comment"># https://github.com/dbt-labs/dbt-bigquery/pkgs/container/dbt-bigquery</span>

<span class="hljs-keyword">FROM</span> ghcr.io/dbt-labs/dbt-bigquery:<span class="hljs-number">1.4</span>.<span class="hljs-number">1</span>

<span class="hljs-keyword">WORKDIR</span><span class="bash"> /</span>

<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/server /app/server</span>
<span class="hljs-keyword">COPY</span><span class="bash"> script.sh ./</span>

<span class="hljs-keyword">ENTRYPOINT</span><span class="bash"> [<span class="hljs-string">"/app/server"</span>]</span>
</code></pre>
<p>invoke.go (source code from <a target="_blank" href="https://cloud.google.com/run/docs/quickstarts/build-and-deploy/deploy-shell-service">here</a>)</p>
<pre><code class="lang-go">package main

import (
        <span class="hljs-string">"log"</span>
        <span class="hljs-string">"net/http"</span>
        <span class="hljs-string">"os"</span>
        <span class="hljs-string">"os/exec"</span>
)

func main() {
        http.HandleFunc(<span class="hljs-string">"/"</span>, scriptHandler)

        // Determine port for HTTP service.
        port := os.Getenv(<span class="hljs-string">"PORT"</span>)
        if port == <span class="hljs-string">""</span> {
                port = <span class="hljs-string">"8080"</span>
                log.Printf(<span class="hljs-string">"Defaulting to port %s"</span>, port)
        }

        // Start HTTP server.
        log.Printf(<span class="hljs-string">"Listening on port %s"</span>, port)
        if err := http.ListenAndServe(<span class="hljs-string">":"</span>+port, nil); err != nil {
                log.Fatal(err)
        }
}

func scriptHandler(w http.ResponseWriter, r *http.Request) {
        <span class="hljs-keyword">cmd</span><span class="bash"> := exec.CommandContext(r.Context(), <span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"script.sh"</span>)</span>
        cmd.Stderr = os.Stderr
        out, err := cmd.Output()
        if err != nil {
                w.WriteHeader(<span class="hljs-number">500</span>)
        }
        w.Write(out)
}
</code></pre>
<p>script.sh (This is just an example of running dbt)</p>
<pre><code class="lang-bash">dbt --version
</code></pre>
<p>I followed this <a target="_blank" href="https://cloud.google.com/run/docs/quickstarts/build-and-deploy/deploy-shell-service">document</a> to deploy the Cloud Run shell service.<br /><code>gcloud run deploy run-dbt --no-allow-unauthenticated --region=us-central1 --source=.</code></p>
<p>Terminal output:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/GbhDkgRgIY06JTKGpy0riZhSy/?name=image.png" alt /></p>
<p><strong>Airflow DAG:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta
<span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.bash_operator <span class="hljs-keyword">import</span> BashOperator
<span class="hljs-keyword">from</span> airflow.operators.python_operator <span class="hljs-keyword">import</span> PythonOperator

default_args = {
    <span class="hljs-string">"owner"</span>: <span class="hljs-string">"airflow"</span>,
    <span class="hljs-string">"depends_on_past"</span>: <span class="hljs-literal">False</span>,
    <span class="hljs-string">"retries"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"retry_delay"</span>: timedelta(minutes=<span class="hljs-number">5</span>),
    <span class="hljs-string">"email"</span>: [<span class="hljs-string">"airflow@example.com"</span>],
    <span class="hljs-string">"email_on_failure"</span>: <span class="hljs-literal">False</span>,
    <span class="hljs-string">"email_on_retry"</span>: <span class="hljs-literal">False</span>,
}

dag = DAG(
    <span class="hljs-string">"trigger_cloud_run_dag"</span>,
    description=<span class="hljs-string">"trigger_cloud_run_dag"</span>,
    schedule_interval=<span class="hljs-string">"0 3 * * *"</span>,
    start_date=datetime(<span class="hljs-number">2023</span>, <span class="hljs-number">4</span>, <span class="hljs-number">19</span>),
    catchup=<span class="hljs-literal">False</span>,
    tags=[<span class="hljs-string">"custom"</span>],
)

trigger_cloud_run = BashOperator(
    task_id=<span class="hljs-string">"trigger_cloud_run"</span>,
    bash_command=<span class="hljs-string">'curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" https://run-dbt-7ensisdq5q-uc.a.run.app'</span>,
    do_xcom_push=<span class="hljs-literal">True</span>,
    dag=dag
)

trigger_cloud_run
</code></pre>
<p>To make it secure, the Cloud Run service requires IAM authentication. Therefore, I granted my Cloud Composer worker service account the <code>roles/run.invoker</code> <a target="_blank" href="https://cloud.google.com/run/docs/reference/iam/roles">role</a>.</p>
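<p>For completeness, a sketch of that grant with <code>gcloud</code>; the service account email is a placeholder:</p>
<pre><code class="lang-bash"># Hypothetical Composer worker service account email.
gcloud run services add-iam-policy-binding run-dbt \
    --region=us-central1 \
    --member="serviceAccount:composer-worker@my-project.iam.gserviceaccount.com" \
    --role="roles/run.invoker"
</code></pre>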
<p>Airflow logs:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/sL294BQlBn94tUiqSsQHv596y/?name=image.png" alt /></p>
<p>Cloud Run logs show the dbt cli output:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/EY4vdeMEY3lak1hUFpusdoI1t/?name=image.png" alt /></p>
<p>I've uploaded the code to a <a target="_blank" href="https://github.com/derrickqin/cloud-composer-trigger-dbt-in-cloud-run">GitHub repo</a>. Feel free to let me know if you need any help running it.</p>
]]></content:encoded></item><item><title><![CDATA[Enable auto-shutdown for Vertex AI Workbench user-managed Notebooks]]></title><description><![CDATA[Vertex AI Workbench user-managed notebooks instances let you create and manage deep learning virtual machine (VM) instances that are prepackaged with JupyterLab. It is powerful but doesn't have an auto-shutdown feature. Is it possible to enable it wi...]]></description><link>https://derrickqin.com/enable-auto-shutdown-for-vertex-ai-workbench-user-managed-notebooks</link><guid isPermaLink="true">https://derrickqin.com/enable-auto-shutdown-for-vertex-ai-workbench-user-managed-notebooks</guid><category><![CDATA[Vertex-AI]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[JupyterLab]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Fri, 16 Jun 2023 14:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/DfffcjQ106U/upload/a5bfc7b7fe31c61da524a1a592b032fb.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Vertex AI Workbench user-managed notebook instances let you create and manage deep learning virtual machine (VM) instances that are prepackaged with <a target="_blank" href="https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html">JupyterLab</a>. They are powerful but don't have an auto-shutdown feature. Is it possible to enable one without using Cloud Scheduler and Cloud Functions?</p>
<p>I found that, behind the scenes, user-managed notebook instances are Google-prebuilt Compute Engine instances.<br />From the Vertex AI console:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/iBs0B26q5CVV2tZPvV5ZgCKgz/?name=image.png" alt /></p>
<p>From the Compute Engine console, it shows up as a normal instance:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/BsHtTiUha6WPhUeohg1vydEgh/?name=image.png" alt /></p>
<p>Since it is just a normal Compute Engine VM, I wondered whether it could leverage the built-in <code>INSTANCE SCHEDULES</code> <a target="_blank" href="https://cloud.google.com/compute/docs/instances/schedule-instance-start-stop">feature</a>.</p>
<p>During my lab, I set up my Compute Engine instance schedule like this:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/jJX4KvwIBIJEvWQlbzYqdnd0Y/?name=image.png" alt /></p>
<p>From the logs, I could see that the instance was stopped:</p>
<p><img src="https://doitintl.zendesk.com/attachments/token/DX5Aumj0fQBGMVW48lh38v5AN/?name=image.png" alt /></p>
<p>Note that, for some reason, there was a delay before the scheduled shutdown event fired. From the logs, it was roughly 14-15 minutes.</p>
<p>Anyway, it worked! Now you can rest assured that your Vertex AI Workbench user-managed notebooks won't cost you money overnight.</p>
]]></content:encoded></item><item><title><![CDATA[How to restore a deleted BigQuery Dataset]]></title><description><![CDATA["Oops, I accidentally deleted my BigQuery Dataset! What should I do!?" If this is you at the moment, don't panic. You are in good hands now!
In this tutorial, I'll work through a lab, in which I deleted a BigQuery Dataset and then recovered it with a...]]></description><link>https://derrickqin.com/how-to-restore-a-deleted-bigquery-dataset</link><guid isPermaLink="true">https://derrickqin.com/how-to-restore-a-deleted-bigquery-dataset</guid><category><![CDATA[bigquery]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Thu, 24 Nov 2022 11:48:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288460359/_BX7LrELc.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>"Oops, I accidentally deleted my BigQuery Dataset! What should I do!?" If this is you at the moment, don't panic. You are in good hands now!</p>
<p>In this tutorial, I'll work through a lab, in which I deleted a BigQuery Dataset and then recovered it with all the tables and views. If you are in a rush, feel free to scroll down to the recovery steps.</p>
<h1 id="heading-set-up-the-test-bigquery-dataset">Set up the test BigQuery Dataset</h1>
<p>This is my setup in BigQuery: I have two tables and a view.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288604608/qgUdCit82.png" alt="image.png" /></p>
<p>Dataset:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288633356/J6mzQsL4D.png" alt="image.png" /></p>
<p>Two tables:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288666997/ccL0CdfIX.png" alt="image.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288676544/ERpg_hEkX.png" alt="image.png" /></p>
<p>View:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288694800/JsGrm10-Q.png" alt="image.png" /></p>
<h1 id="heading-delete-the-dataset">Delete the Dataset</h1>
<p>From the BigQuery UI, I deleted the Dataset:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288712678/_-9HvyYrd.png" alt="image.png" /></p>
<p>Verify if it was deleted:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288729711/HaEcjFtTb.png" alt="image.png" /></p>
<h1 id="heading-how-to-restore-a-bigquery-dataset">How to restore a BigQuery Dataset</h1>
<p>Time to restore it!</p>
<h2 id="heading-find-out-the-tables-in-the-dataset">Find out the tables in the Dataset</h2>
<p><strong><em>Owner permission may not be enough to access some INFORMATION_SCHEMA views. You need to explicitly grant at least the </em></strong><code>roles/bigquery.metadataViewer</code><strong><em> role. Also, to restore the tables, you need </em></strong><code>roles/bigquery.admin</code><strong><em>.</em></strong></p>
<p>First things first, let's find out what tables were in my Dataset. Can't remember them? Don't worry.</p>
<p>I understand that you may have inherited this Dataset from someone, or you simply don't remember the names of your tables and views. That's fine. Here is a query that you can use:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">DISTINCT</span> TABLE_NAME
<span class="hljs-keyword">FROM</span>
  <span class="hljs-string">`region-us`</span>.INFORMATION_SCHEMA.TABLE_STORAGE_TIMELINE
<span class="hljs-keyword">WHERE</span>
  TABLE_SCHEMA = <span class="hljs-string">"dataset_17102022"</span>;
</code></pre>
<p>After running it in my BigQuery project, I found the table names:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288884306/7l2yeyC00.png" alt="image.png" /></p>
<p>How does it work? The above query uses the <a target="_blank" href="https://cloud.google.com/bigquery/docs/information-schema-table-storage-timeline">TABLE_STORAGE_TIMELINE</a> view in BigQuery INFORMATION_SCHEMA to find the tables that contain data.</p>
<h2 id="heading-recreate-the-bigquery-dataset">Recreate the BigQuery Dataset</h2>
<p><strong><em>In my lab, my Dataset is in the US region. You may need to change the location to the one that you use. You can find all the BigQuery locations from</em></strong> <a target="_blank" href="https://cloud.google.com/bigquery/docs/locations"><strong><em>here</em></strong></a><strong><em>.</em></strong></p>
<p>Recreate the BigQuery Dataset using the same name.</p>
<pre><code class="lang-bash"> bq --location=US mk -d dataset_17102022
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669288970895/81q6ndeR7.png" alt="image.png" /></p>
<p>Now I have recreated the Dataset, but it is empty. Let's recreate the tables and recover the data.</p>
<h2 id="heading-restore-the-tables-and-data">Restore the tables and data</h2>
<p>To restore the table, we will use <a target="_blank" href="https://cloud.google.com/bigquery/docs/time-travel">time travel</a> feature from BigQuery.</p>
<p>Run the <code>bq</code> command below. Note that I used <code>-3600000</code>, a relative offset specified in milliseconds (i.e. a snapshot from one hour ago). The snapshot time can also be specified as an absolute number of milliseconds since the Unix epoch.</p>
<pre><code class="lang-bash">bq --location=US cp airflow-talk:dataset_17102022.table-no-partition@-3600000 airflow-talk:dataset_17102022.table-no-partition
</code></pre>
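<p>If you'd rather pin the snapshot to an absolute time instead of a relative offset, you can compute the epoch-milliseconds value in the shell; a small sketch:</p>
<pre><code class="lang-bash"># Compute a snapshot timestamp (ms since Unix epoch) for one hour ago.
SNAPSHOT_MS=$(( ($(date +%s) - 3600) * 1000 ))

bq --location=US cp \
    "airflow-talk:dataset_17102022.table-no-partition@${SNAPSHOT_MS}" \
    airflow-talk:dataset_17102022.table-no-partition
</code></pre>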
<p>Verified that <code>table-no-partition</code> was recreated with data:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669289065054/hYP4xMJni.png" alt="image.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669289075694/PT5ZWjTVu.png" alt="image.png" /></p>
<p>Same process to restore <code>table-partition</code> table:</p>
<pre><code class="lang-bash">bq --location=US cp airflow-talk:dataset_17102022.table-partition@-3600000 airflow-talk:dataset_17102022.table-partition
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669289234874/ma3X18Xic.png" alt="image.png" /></p>
<h2 id="heading-recreate-the-view">Recreate the view</h2>
<p>There is no straightforward way to recreate the views. It is always a good idea to store the DDL in a source repository. But if you don't have it now, you can try finding the view creation logs in Cloud Logging using the name of the view. In my case, the log is here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669289351218/e3GM4cn2c.png" alt="image.png" /></p>
<p>To recreate this view, I ran the below SQL statement:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">VIEW</span>
  dataset_17102022.test_view1 <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span>
    <span class="hljs-keyword">ID</span>,
    <span class="hljs-keyword">name</span>
  <span class="hljs-keyword">FROM</span>
    <span class="hljs-string">`airflow-talk.dataset_17102022.table-partition`</span>
  <span class="hljs-keyword">WHERE</span>
    <span class="hljs-keyword">ID</span> = <span class="hljs-number">1</span> )
</code></pre>
<p>And my view was recreated:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669289614553/rw_4VSc8a.png" alt="image.png" /></p>
<h1 id="heading-summary">Summary</h1>
<p>In this tutorial, I set up a lab to delete and restore a BigQuery Dataset.</p>
<p>Everybody has oops moments, and deleting a BigQuery Dataset may be one of them. I hope following the above instructions helps you. Let me know if you have any questions.</p>
<p>Good luck!</p>
]]></content:encoded></item><item><title><![CDATA[Convert JSON to Parquet files on Google Cloud Storage using BigQuery]]></title><description><![CDATA[Do you have lots of files on Google Cloud Storage that you want to convert to a different format?
There are a few ways you can achieve it:

Download the files to a VM or local machine, then write a script to convert them
Use services like Cloud Funct...]]></description><link>https://derrickqin.com/convert-json-to-parquet-files-on-google-cloud-storage-using-bigquery</link><guid isPermaLink="true">https://derrickqin.com/convert-json-to-parquet-files-on-google-cloud-storage-using-bigquery</guid><category><![CDATA[google cloud storage]]></category><category><![CDATA[big query]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[#gcs]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Thu, 17 Nov 2022 12:26:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1669291693414/2Szq9OQnR.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Do you have lots of files on Google Cloud Storage that you want to convert to a different format?</p>
<p>There are a few ways you can achieve it:</p>
<ul>
<li>Download the files to a VM or local machine, then write a script to convert them</li>
<li>Use services like Cloud Function, Cloud Run or Cloud Dataflow to read and write the files on Cloud Storage</li>
</ul>
<p>But what if there is a truly serverless way? One where you only need to write SQL to convert them?</p>
<p>Let me walk you through it.</p>
<h2 id="heading-prepare-the-source-data">Prepare the source data</h2>
<p>I wrote a Python script to generate a gzipped JSONL file.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> gzip
<span class="hljs-keyword">import</span> json

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    json_str = <span class="hljs-string">''</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">10</span>):
        json_str = json_str + json.dumps(dict(id=i, value=i * i)) + <span class="hljs-string">"\n"</span>

    json_bytes = json_str.encode(<span class="hljs-string">"utf-8"</span>)

    <span class="hljs-keyword">with</span> gzip.open(<span class="hljs-string">"test.jsonl.gz"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> fout:
        fout.write(json_bytes)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<h2 id="heading-uploaded-it-to-my-gcs-bucket-and-read-it">Uploaded it to my GCS bucket and read it</h2>
<p>By following the Google document about <a target="_blank" href="https://cloud.google.com/bigquery/docs/external-data-cloud-storage#create_and_query_a_temporary_table">BigQuery external table</a>, I was able to query it:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669292126176/FbJeLskOY.png" alt="image.png" /></p>
<h2 id="heading-time-to-convert-it-to-parquet-format">Time to convert it to Parquet format</h2>
<p>To achieve this, I used <a target="_blank" href="https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#export_data_statement">BigQuery EXPORT DATA statement</a> to "export" the data from BigQuery to Cloud Storage in PARQUET format:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669292221756/IZZvY3e93.png" alt="image.png" /></p>
<p>After navigating to the Cloud Storage console, I can see the PARQUET file:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669292249757/-lHfsot6M.png" alt="image.png" /></p>
<h2 id="heading-verify-the-data">Verify the data</h2>
<p>Finally, I'd like to verify the data is actually in the PARQUET file:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669292276779/FTlbViIa_.png" alt="image.png" /></p>
<p>And yay!</p>
<h1 id="heading-summary">Summary</h1>
<p>Using BigQuery's native EXPORT DATA statement, I was able to convert gzipped JSONL files on Google Cloud Storage to PARQUET format.</p>
<p>You can also convert the files to other formats. <code>EXPORT DATA</code> currently supports AVRO, CSV, JSON and PARQUET. It is straightforward and serverless.</p>
]]></content:encoded></item><item><title><![CDATA[Create a business dashboard from Cloud Logging logs using Looker Studio]]></title><description><![CDATA[A few months ago, I created a POC for a startup client to visualise their new user registration data. This use case can be interesting for small companies who need a quick way to create a business dashboard. I would like to share it with you.
Logs
As...]]></description><link>https://derrickqin.com/create-a-business-dashboard-from-cloud-logging-logs-using-looker-studio</link><guid isPermaLink="true">https://derrickqin.com/create-a-business-dashboard-from-cloud-logging-logs-using-looker-studio</guid><category><![CDATA[looker studio]]></category><category><![CDATA[data studio]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[bigquery]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Fri, 22 Jul 2022 13:55:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1669294610945/WysHBoSPL.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few months ago, I created a POC for a startup client to visualise their new user registration data. This use case can be interesting for small companies who need a quick way to create a business dashboard. I would like to share it with you.</p>
<h2 id="heading-logs">Logs</h2>
<p>Assume you run your business on Google Cloud and your API is powered by Cloud Functions. When there is a new user registration, the Cloud Function instance writes a log to Cloud Logging.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669293541613/w9ay9PD82.png" alt="image.png" /></p>
<h2 id="heading-create-a-log-sink">Create a Log Sink</h2>
<p><a target="_blank" href="https://cloud.google.com/logging/docs/routing/overview#sinks">Sinks</a> control how Cloud Logging routes logs. Using sinks, you can route some or all of your logs to supported destinations. </p>
<p>From the Log Sink page, click <code>CREATE SINK</code>. When creating the sink, route the logs to BigQuery:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669293869998/-CMCpBPbv.png" alt="image.png" /></p>
<p>After creating the sink, it looks like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669293924088/Sm_Bvk0d6.png" alt="image.png" /></p>
<p>A new BigQuery table was created:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669293953287/Mn_Pg0hek.png" alt="image.png" /></p>
<h2 id="heading-write-some-logs">Write some logs</h2>
<p>Now I put a few logs into Cloud Logging via the <code>gcloud</code> command:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669294000844/X7F2Vu7of.png" alt="image.png" /></p>
<h2 id="heading-query-the-logs-and-navigate-to-looker-studio">Query the logs and navigate to Looker Studio</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669294053290/vVdILrFDw.png" alt="image.png" /></p>
<p>On Looker Studio (formerly Data Studio), I built a simple dashboard:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1669294112741/zqGKcRKfg.png" alt="image.png" /></p>
<p>Note that, for testing purposes, I had to switch to minutes so we could see a few bars, but in most cases, the time interval should be hourly or daily.</p>
<h1 id="heading-summary">Summary</h1>
<p>Using Cloud Logging, BigQuery and Looker Studio, you can quickly build a dashboard to visualise business data, such as user registrations.</p>
]]></content:encoded></item><item><title><![CDATA[Upgrade Cloud SQL Postgres version using Google Database Migration Service(DMS)]]></title><description><![CDATA[A few days ago, I came across an interesting blog post from Google Cloud. In the blog, it used Google Database Migration Service(DMS) to upgrade underline Postgres version in Cloud SQL instance. I found it is interesting and decided to have a test ru...]]></description><link>https://derrickqin.com/upgrade-cloud-sql-postgres-version-using-google-database-migration-service</link><guid isPermaLink="true">https://derrickqin.com/upgrade-cloud-sql-postgres-version-using-google-database-migration-service</guid><category><![CDATA[google cloud]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Mon, 16 May 2022 12:13:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1669292985368/fJTgyrOTO.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few days ago, I came across an interesting <a target="_blank" href="https://cloud.google.com/blog/topics/developers-practitioners/upgrade-postgres-pglogical-and-database-migration-service">blog post</a> from Google Cloud. The blog used Google Database Migration Service (DMS) to upgrade the underlying Postgres version of a Cloud SQL instance. I found it interesting and decided to have a test run - upgrading Postgres from version 13 to 14.</p>
<p>(Updated on 17th May - tested another upgrade from 9.6 to 14, also worked.)</p>
<h1 id="heading-test-run">Test run</h1>
<h2 id="heading-create-source-database">Create source database</h2>
<p>Create a Postgresql 13 database with two flags:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702709142/4alTeBlup.png" alt="image.png" /></p>
<p>With these flags, the <code>pglogical</code> extension will be enabled.</p>
<p>When using Google DMS, the source instance must include the <code>postgres</code> database. If you don't have this database, create it. Luckily, Cloud SQL creates the <code>postgres</code> database and its user by default.</p>
<h2 id="heading-pop-up-some-data">Pop up some data</h2>
<p>Connect to the database</p>
<pre><code class="lang-bash">gcloud sql connect pg-13 --user=postgres -d postgres --quiet
</code></pre>
<p>Run the commands below</p>
<pre><code>CREATE TABLE entries (guestName VARCHAR(<span class="hljs-number">255</span>), content VARCHAR(<span class="hljs-number">255</span>),
                        entryID SERIAL PRIMARY KEY);
INSERT INTO entries (guestName, content) values (<span class="hljs-string">'first guest'</span>, <span class="hljs-string">'I got here!'</span>);
INSERT INTO entries (guestName, content) values (<span class="hljs-string">'second guest'</span>, <span class="hljs-string">'Me too!'</span>);
</code></pre><h2 id="heading-source-databases-configuration">Source databases configuration</h2>
<p>Database Migration Service migrates all databases under the source instance other than the following databases:</p>
<ul>
<li>For Cloud SQL sources: template databases template0 and template1</li>
</ul>
<h3 id="heading-enable-pglogical-extension-for-all-the-databases">Enable <code>pglogical</code> extension for all the databases</h3>
<p>As I only have one database <code>postgres</code>, this step is easy.</p>
<p>Connect to the database</p>
<pre><code class="lang-bash">gcloud sql connect pg-13 --user=postgres -d postgres --quiet
</code></pre>
<p>Then run the command</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> EXTENSION <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> pglogical
</code></pre>
<h3 id="heading-set-privileges-for-the-user">Set privileges for the user</h3>
<p>Set privileges on all schemas (aside from the information schema and schemas starting with "pg_") on each database to migrate, including <code>pglogical</code>.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">USAGE</span> <span class="hljs-keyword">on</span> <span class="hljs-keyword">SCHEMA</span> <span class="hljs-keyword">public</span> <span class="hljs-keyword">to</span> postgres;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">USAGE</span> <span class="hljs-keyword">on</span> <span class="hljs-keyword">SCHEMA</span> pglogical <span class="hljs-keyword">to</span> postgres;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">on</span> <span class="hljs-keyword">ALL</span> <span class="hljs-keyword">TABLES</span> <span class="hljs-keyword">in</span> <span class="hljs-keyword">SCHEMA</span> pglogical <span class="hljs-keyword">to</span> postgres;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">on</span> <span class="hljs-keyword">ALL</span> <span class="hljs-keyword">TABLES</span> <span class="hljs-keyword">in</span> <span class="hljs-keyword">SCHEMA</span> <span class="hljs-keyword">public</span> <span class="hljs-keyword">to</span> postgres;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">on</span> <span class="hljs-keyword">ALL</span> SEQUENCES <span class="hljs-keyword">in</span> <span class="hljs-keyword">SCHEMA</span> <span class="hljs-keyword">public</span> <span class="hljs-keyword">to</span> postgres;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">on</span> <span class="hljs-keyword">ALL</span> SEQUENCES <span class="hljs-keyword">in</span> <span class="hljs-keyword">SCHEMA</span> pglogical <span class="hljs-keyword">to</span> postgres;
<span class="hljs-comment">-- Don't forget to grant REPLICATION role to the postgres user that is used to run the job! </span>
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">USER</span> postgres <span class="hljs-keyword">with</span> <span class="hljs-keyword">REPLICATION</span>;
</code></pre>
<h2 id="heading-create-and-run-dms-job">Create and run DMS job</h2>
<p>Step 1: Choose the source database. In my case, it is Cloud SQL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702828689/bToS3IDBe.png" alt="image.png" /></p>
<p>Step 2: Define the source database configuration. This will create a <code>connection profile</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702838193/KRWKAsPHh.png" alt="image.png" /></p>
<p>Step 3: Create the destination Cloud SQL instance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702854709/22a4-7nLN.png" alt="image.png" /></p>
<p>Step 4: Configure the network connection. In my case, as both the source and destination instances are Cloud SQL, I chose <code>private IP</code>, which just sets up VPC peering.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702871795/7UlLdY-AB.png" alt="image.png" /></p>
<p>Step 5: Test and create the job.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702883427/rBKU06z_L.png" alt="image.png" /></p>
<p>Step 6: Run the job!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702895569/1nemSFfq2.png" alt="image.png" /></p>
<p>This is how it looked after the job was created:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702902517/SUmg32CK2.png" alt="image.png" /></p>
<h2 id="heading-verify-the-replication">Verify the replication</h2>
<p>Connect to the database</p>
<pre><code class="lang-bash">gcloud sql connect pg-13 --user=postgres -d postgres --quiet
</code></pre>
<p>Run the following statement</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> entries (guestName, <span class="hljs-keyword">content</span>) <span class="hljs-keyword">values</span> (<span class="hljs-string">'third guest'</span>, <span class="hljs-string">'test dms'</span>);
</code></pre>
<p>Connect to the new database</p>
<pre><code class="lang-bash">gcloud sql connect pg-14-may-16 --user=postgres -d postgres --quiet
</code></pre>
<p>Check if the data has been synced:</p>
<pre><code>postgres=&gt; SELECT * FROM entries;
  guestname   |   content   | entryid 
--------------+-------------+---------
 first guest  | I got here! |       <span class="hljs-number">1</span>
 second guest | Me too!     |       <span class="hljs-number">2</span>
 third guest  | test dms    |       <span class="hljs-number">3</span>
(<span class="hljs-number">3</span> rows)
</code></pre><h2 id="heading-time-to-promote-the-new-instance">Time to promote the new instance</h2>
<p>Click the <code>promote</code> button</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702921378/W-0L4hw1Z.png" alt="image.png" /></p>
<p>Now the new version of Cloud SQL Postgresql instance is promoted!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1652702929938/SserU5wXs.png" alt="image.png" /></p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Google DMS can be used to replicate data from an older version of Cloud SQL PostgreSQL to a newer one, and therefore to upgrade the Postgres version.
If you plan to do this, ensure that your applications stop writing to the old Postgres instance, wait for the replication to finish, then promote the new instance.</p>
]]></content:encoded></item><item><title><![CDATA[How to use GCP Config Connector to manage Google Cloud resources]]></title><description><![CDATA[Background
Kubernetes is a new cool kid in the cloud infrastructure world. It is an open-source container orchestration system for automating software deployment, scaling, and management. Nowadays, when we manage our Cloud infrastructure, we have to ...]]></description><link>https://derrickqin.com/how-to-use-gcp-config-connector-to-manage-google-cloud-resources</link><guid isPermaLink="true">https://derrickqin.com/how-to-use-gcp-config-connector-to-manage-google-cloud-resources</guid><category><![CDATA[google cloud]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Sat, 30 Oct 2021 07:17:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1651045427438/tnAgt8AG-.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-background">Background</h1>
<p>Kubernetes is the new cool kid in the cloud infrastructure world. It is an open-source container orchestration system for automating software deployment, scaling, and management. Nowadays, when we manage our cloud infrastructure, we have to look after cloud resources such as storage, databases and message queues, and we also need to manage everything within Kubernetes.</p>
<p>In the Google Cloud (GCP) world, <a target="_blank" href="https://registry.terraform.io/providers/hashicorp/google/latest/docs">Terraform</a> has become the default tool for managing infrastructure. Engineers use Terraform to deploy GKE, the managed Kubernetes service on GCP. From here, they take different routes:</p>
<ul>
<li>Use Terraform to manage GCP cloud resources and Kubectl plus YAML to manage Kubernetes resources</li>
<li>Use Terraform to manage both GCP cloud resources and Kubernetes resources using <a target="_blank" href="https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs">Kubernetes Provider</a></li>
</ul>
<p>YAML lovers ask: Is there a way to use YAML to deploy GCP cloud resources?</p>
<p>Yes, Google recently released a service called <a target="_blank" href="https://cloud.google.com/config-connector/docs/overview">Config Connector</a> that can make it happen.</p>
<h1 id="heading-introduce-config-connector">Introduce Config Connector</h1>
<p>Config Connector is a Kubernetes add-on that allows you to manage GCP resources through Kubernetes. Config Connector provides a collection of Kubernetes Custom Resource Definitions (CRDs) and controllers. The Config Connector CRDs allow Kubernetes to create and manage Google Cloud resources when configuring and applying Objects to your cluster.</p>
<h1 id="heading-enable-config-connector">Enable Config Connector</h1>
<p>To enable Config Connector when creating a GKE cluster:</p>
<ul>
<li>Allow Cloud APIs from Access Scopes
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1651045427438/tnAgt8AG-.png" alt="image.png" /></li>
<li>Enable <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity">Workload Identity</a> to allow workloads in your GKE clusters to impersonate Identity and Access Management (IAM) service accounts to access Google Cloud services
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1651045471831/bVrvBBJ-c.png" alt="image.png" /></li>
<li>Enable Config Connector
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1651045732917/Tl9F2TpfW.png" alt="image.png" /></li>
</ul>
<p>If you already have a GKE cluster, follow this <a target="_blank" href="https://cloud.google.com/config-connector/docs/how-to/install-upgrade-uninstall#installing_the">link</a>.</p>
<h1 id="heading-use-config-connector-to-manage-bigquery-datasets">Use Config Connector to manage BigQuery datasets</h1>
<h2 id="heading-create-a-bigquery-dataset">Create a BigQuery Dataset</h2>
<p>Use the YAML file below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">bigquery.cnrm.cloud.google.com/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">BigQueryDataset</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">cnrm.cloud.google.com/delete-contents-on-destroy:</span> <span class="hljs-string">"false"</span>
    <span class="hljs-attr">cnrm.cloud.google.com/deletion-policy:</span> <span class="hljs-string">abandon</span>
    <span class="hljs-attr">cnrm.cloud.google.com/project-id :</span> <span class="hljs-string">airflow-talk</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">test-bq-dataset</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">resourceID:</span> <span class="hljs-string">my_test_dataset</span>
  <span class="hljs-attr">location:</span> <span class="hljs-string">US</span>
  <span class="hljs-attr">defaultTableExpirationMs:</span> <span class="hljs-number">86400000</span>
</code></pre>
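<p>Save the manifest as <code>create-bq.yaml</code> and apply it:</p>
<pre><code class="lang-bash">kubectl apply -f create-bq.yaml
</code></pre>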
<h2 id="heading-check-the-deployment">Check the deployment</h2>
<p>Run <code>kubectl describe bigquerydataset test-bq-dataset</code></p>
<p>From the UI:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1651046888138/Hfulc8cg2.png" alt="image.png" /></p>
<h2 id="heading-delete-the-dataset">Delete the Dataset</h2>
<p>Run <code>kubectl delete -f create-bq.yaml</code></p>
<p>To keep the Dataset, follow this <a target="_blank" href="https://cloud.google.com/config-connector/docs/how-to/managing-deleting-resources#keeping_resources_after_deletion">guide</a>.</p>
<p>In short, because the annotation <code>cnrm.cloud.google.com/deletion-policy: abandon</code> is set, Config Connector abandons the underlying Dataset instead of deleting it; without it, Config Connector would delete the Dataset. The <code>cnrm.cloud.google.com/delete-contents-on-destroy: "false"</code> annotation additionally ensures the tables inside the Dataset are not deleted along with it.</p>
<h1 id="heading-key-learning">Key learning</h1>
<p>Config Connector is a new tool to manage GCP resources in the Kubernetes way. However, whether Config Connector should be used instead of Terraform is another topic for another day...</p>
]]></content:encoded></item><item><title><![CDATA[How to solve Python package conflict in Cloud Composer]]></title><description><![CDATA[Background
Google Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow.
Airflow is a monolith Python service. To run an Airflow data pipeline(DAG), Python PyPI packages that are required must be installed alongside...]]></description><link>https://derrickqin.com/how-to-solve-python-package-conflict-in-cloud-composer</link><guid isPermaLink="true">https://derrickqin.com/how-to-solve-python-package-conflict-in-cloud-composer</guid><category><![CDATA[GCP]]></category><category><![CDATA[Python]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Derrick Qin]]></dc:creator><pubDate>Sat, 15 May 2021 06:36:40 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-background">Background</h1>
<p><a target="_blank" href="https://cloud.google.com/composer">Google Cloud Composer</a> is a fully managed workflow orchestration service built on <a target="_blank" href="https://airflow.apache.org/">Apache Airflow</a>.</p>
<p>Airflow is a monolithic Python service. To run an Airflow data pipeline (DAG), the required PyPI packages must be installed alongside Airflow. For example, suppose the data pipeline requires interactions with a 3rd-party service such as Salesforce. In that case, we need to install a Python package such as <a target="_blank" href="https://pypi.org/project/simple-salesforce/">simple-salesforce</a> alongside Airflow.</p>
<p>Usually, this won't be an issue. However, when there is a need to use a very old (or very new) version of a Python package, conflicts between Python packages can occur.</p>
<h1 id="heading-issue">Issue</h1>
<p>After following the guide to install a Python package (simple-salesforce==0.74.3), an error showed up after updating Cloud Composer:</p>
<blockquote>
<p>UPDATE operation on this environment failed 22 minutes ago with the following error message:</p>
<p>An error occurred before the new web server image has been created.</p>
</blockquote>
<h1 id="heading-analysis">Analysis</h1>
<p>Google documents the troubleshooting process for this issue <a target="_blank" href="https://cloud.google.com/composer/docs/troubleshooting-package-installation">here</a>. After following the document, I understand that Google uses <a target="_blank" href="https://cloud.google.com/build">Cloud Build</a> to build a Docker image that installs the required Python packages on top of the Composer Airflow base image. After digging into the Cloud Build logs, I found the root cause of this issue:</p>
<blockquote>
<p>python3 -m pip check
requests 2.22.0 has requirement urllib3!=1.25.0,!=1.25.1,&lt;1.26,&gt;=1.21.1, but you have urllib3 1.26.3.</p>
</blockquote>
<h1 id="heading-solution">Solution</h1>
<p>I added <code>urllib3==1.25.4</code> to the list of Cloud Composer PyPI dependencies, which downgraded the pre-installed urllib3. After this, <code>simple-salesforce==0.74.3</code> could be installed successfully.</p>
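<p>For reference, a sketch of applying such a pin with the <code>gcloud</code> CLI; the environment name and location are placeholders:</p>
<pre><code class="lang-bash"># Sketch: pin urllib3 so simple-salesforce 0.74.3 can resolve.
gcloud composer environments update my-composer-env \
    --location=us-central1 \
    --update-pypi-package=urllib3==1.25.4
</code></pre>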
<p>Of course, some regression tests were done to ensure that the existing DAGs ran well with the downgraded <code>urllib3</code> version.</p>
<h1 id="heading-key-learning">Key learning</h1>
<p>When there is an issue installing Python packages in Cloud Composer, check the Cloud Build logs to find out the error, then resolve the Python dependency conflict.</p>
]]></content:encoded></item></channel></rss>