Skip to main content

Datometry Documentation

Override Google BigQuery Table Clustering

By default, tables translated using Hyper-Q and qShift use the Teradata primary index columns as clustering columns in BigQuery to provide a tuning baseline. Tables may require additional tuning, such as adding partitions or customizing clustering, to achieve optimal query performance. You can override the creation of clustering columns using the Hyper-Q hint bigquery_clustering_cols_override.

In a clustered table, BigQuery automatically sorts the data based on the values in the clustering columns and organizes them in optimally sized storage blocks. You can achieve more finely grained sorting by creating a table that is clustered and partitioned. To learn more about clustering columns for use with BigQuery, see Introduction to clustered tables in the Google BigQuery documentation.

In addition to specifying what columns to use for clustering on BigQuery, you can also partition tables in BigQuery using the Hyper-Q hint bigquery_partitions_override. This hint allows you to specify how BigQuery partitions tables, which can make it more cost effective and performant to manage and query your data. To learn more, see Override Google BigQuery Partitions.

The syntax for bigquery_clustering_cols_override is:

bigquery_clustering_cols_override = column_name_1,column_name_2,column_name_3,...

where:

Parameter

Description

column_name

The name of the column or columns by which you want to cluster. You can specify a comma separated list of column names by which to cluster if you want to use more than one column.

You can use the bigquery_clustering_cols_override hint as a statement (or query) level hint. Hints at the statement scope apply to the entire SQL statement to which they are associated. You cannot use the bigquery_clustering_cols_override hint at the session or application scope level. To learn more about Hyper-Q SQL hints, and how to use them within a SQL query, see About Hyper-Q SQL Hints.

Example 36. Specify that Hyper-Q Cluster by a Single Column on BigQuery

You can specify bigquery_clustering_cols_override as a query level hint. This example specifies the hint within a Teradata CREATE TABLE statement. The column labeled column3 is specified as the column to use for clustering. This is in contrast to the three columns identified as having a primary index: column2, column3, and column4. If you do not specify a specific column BigQuery should use for clustering, it would use all three primary index columns as clustering columns.

CREATE TABLE table_name
-- @dtm_bigquery_clustering_cols_override=column3
(
  column1 INT,
  column2 INT,
  column3 VARCHAR(10),
  column4 VARCHAR(1)
) PRIMARY INDEX (column2, column3, column4)  PARTITION BY (column2, column3, column4);

The Hyper-Q translated CREATE TABLE statement on BigQuery is as shown below. Note that the CLUSTER BY clause specifies clustering using column3.

CREATE TABLE `project_name`.`dataset_name`.`table_name` (
            `column1` INT64 ,
            `column2` INT64 ,
            `column3` STRING ,
            `column4` STRING 
        ) CLUSTER BY `column3` 
;


Example 37. Specify that Hyper-Q Cluster by Multiple Columns on BigQuery

This example specifies two columns as clustering columns: column3 and column4. As in the previous example, the Teradata CREATE TABLE statement identifies three columns as having a primary index: column2, column3, and column4. If you did not specify a specific set of columns BigQuery should use for clustering, it would use all three of these columns as clustering columns.

CREATE TABLE table_name
-- @dtm_bigquery_clustering_cols_override=column3,column4
(
  column1 INT,
  column2 INT,
  column3 VARCHAR(10),
  column4 VARCHAR(1)
) PRIMARY INDEX (column2, column3, column4)  PARTITION BY (column2, column3, column4);

The translated CREATE TABLE statement on BigQuery is as shown below. Note that the CLUSTER BY clause specifies clustering using column3 and column4.

CREATE TABLE `project_name`.`dataset_name`.`table_name` (
            `column1` INT64 ,
            `column2` INT64 ,
            `column3` STRING ,
            `column4` STRING 
        ) CLUSTER BY `column3`, `column4`
;