Tables & Partitioning
Table Types in Datoria
Datoria works with two primary types of tables:
Managed Tables are created and maintained by Datoria. These tables are fully owned by your data platform—Datoria handles their creation, schema updates, and lifecycle management.
Unmanaged Source Tables are externally maintained tables that Datoria reads from but doesn't modify. These tables are typically owned by other teams or systems, and Datoria only needs to know their structure.
Defining Datasets
A dataset is a container for tables in your data warehouse. In Datoria, datasets are defined explicitly:
import { BQDataset } from "@datoria/sdk";
export const analyticsDataset = BQDataset({
project: "datoria-test", // GCP project ID
name: "my_dataset", // Dataset name
location: "europe-north1", // Physical region
});
The explicit definition ensures that all references to the dataset are consistent and prevents accidental references to the wrong resources.
Defining Managed Tables
Here's how to define a managed table in Datoria:
import { BQField, BQFieldType, BQPartitioning, BQTable } from "@datoria/sdk";
export const userTable = BQTable({
dataset: analyticsDataset,
name: "users",
partitioning: BQPartitioning.Unpartitioned,
schema: [
BQField("id", BQFieldType.String, "REQUIRED"),
BQField("name", BQFieldType.String, "REQUIRED"),
BQField("signup_date", BQFieldType.Date, "REQUIRED"),
BQField("last_login", BQFieldType.Timestamp, "NULLABLE"),
],
});
Tables can include additional properties for full customization:
export const userTableExtended = BQTable({
dataset: analyticsDataset,
name: "users_extended",
partitioning: BQPartitioning.Unpartitioned,
schema: userTable.schema, // Reusing schemas is easy
// Optional properties
labels: {
owner: "user-team",
sensitivity: "high",
},
description: "Extended user information including profile data",
clustering: ["name"], // Columns to cluster by
});
Understanding Partitioning
Partitioning divides a table into smaller, more manageable segments based on the values of a particular column. This improves query performance and reduces cost, because queries can scan only the relevant partitions instead of the whole table.
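The cost saving comes from partition pruning: when a query filters on the partitioning column, only the matching partitions are read at all. A minimal sketch of the idea in plain TypeScript (the names here are illustrative, not part of the Datoria SDK):

```typescript
// A table stored as one array of rows per partition, keyed by date.
type Row = { event_date: string; user_id: string };

const partitions: Map<string, Row[]> = new Map([
  ["2024-01-01", [{ event_date: "2024-01-01", user_id: "a" }]],
  ["2024-01-02", [{ event_date: "2024-01-02", user_id: "b" }]],
  ["2024-01-03", [{ event_date: "2024-01-03", user_id: "c" }]],
]);

// With a filter on the partitioning column, only the matching partition
// is ever touched; the others contribute zero bytes scanned.
function scan(wantedDate: string): Row[] {
  const partition = partitions.get(wantedDate); // pruning: one lookup
  return partition ?? [];
}

console.log(scan("2024-01-02").length); // only 1 of 3 partitions read
```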
Datoria supports three main partitioning strategies:
Unpartitioned Tables
The simplest approach is to have no partitioning. This works well for smaller tables or reference data:
export const configTable = BQTable({
dataset: analyticsDataset,
name: "config",
partitioning: BQPartitioning.Unpartitioned,
schema: [
BQField("key", BQFieldType.String, "REQUIRED"),
BQField("value", BQFieldType.String, "REQUIRED"),
],
});
Time-Based Partitioning
For time-series data, you can partition by various time intervals:
Day Partitioning
export const dailyEventsTable = BQTable({
dataset: analyticsDataset,
name: "daily_events",
partitioning: BQPartitioning.Day("event_date"),
schema: [
BQField("event_date", BQFieldType.Date, "REQUIRED"),
BQField("event_type", BQFieldType.String, "REQUIRED"),
BQField("user_id", BQFieldType.String, "REQUIRED"),
],
});
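BigQuery identifies each daily partition by a YYYYMMDD suffix (a partition decorator such as daily_events$20240315). As an illustration of that naming scheme, not part of the Datoria SDK, the decorator for a given date can be derived like this:

```typescript
// Compute the YYYYMMDD partition decorator for a day-partitioned table.
function dayPartitionId(date: Date): string {
  const y = date.getUTCFullYear().toString().padStart(4, "0");
  const m = (date.getUTCMonth() + 1).toString().padStart(2, "0");
  const d = date.getUTCDate().toString().padStart(2, "0");
  return `${y}${m}${d}`;
}

console.log(dayPartitionId(new Date(Date.UTC(2024, 2, 15)))); // "20240315"
```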
Hour Partitioning
export const hourlyEventsTable = BQTable({
dataset: analyticsDataset,
name: "hourly_events",
partitioning: BQPartitioning.Hour("event_timestamp"),
schema: [
BQField("event_timestamp", BQFieldType.Timestamp, "REQUIRED"),
BQField("event_type", BQFieldType.String, "REQUIRED"),
BQField("user_id", BQFieldType.String, "REQUIRED"),
],
});
Month Partitioning
export const monthlyStatsTable = BQTable({
dataset: analyticsDataset,
name: "monthly_stats",
partitioning: BQPartitioning.Month("month"),
schema: [
BQField("month", BQFieldType.Date, "REQUIRED"),
BQField("metric", BQFieldType.String, "REQUIRED"),
BQField("value", BQFieldType.Float, "REQUIRED"),
],
});
Integer-Range Partitioning
For data that's naturally segmented by integer values:
import { BQIntegerRange } from "@datoria/sdk";
export const userSegmentsTable = BQTable({
dataset: analyticsDataset,
name: "user_segments",
partitioning: BQPartitioning.Integer(
"segment_id", // column to partition by
BQIntegerRange(0, 100, 10), // min, max, interval
),
schema: [
BQField("segment_id", BQFieldType.Integer, "REQUIRED"),
BQField("segment_name", BQFieldType.String, "REQUIRED"),
BQField("user_count", BQFieldType.Integer, "REQUIRED"),
],
});
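With BQIntegerRange(0, 100, 10), the table gets ten partitions covering [0, 10), [10, 20), …, [90, 100); values outside the range land in a catch-all partition. A small illustrative helper (not part of the SDK) that maps a value to its partition bounds:

```typescript
// Map a value to its integer-range partition, mirroring the
// (min, max, interval) semantics of an integer-range specification.
function rangePartition(
  value: number,
  min: number,
  max: number,
  interval: number,
): [number, number] | "unpartitioned" {
  if (value < min || value >= max) return "unpartitioned"; // catch-all
  const start = min + Math.floor((value - min) / interval) * interval;
  return [start, start + interval];
}

console.log(rangePartition(42, 0, 100, 10)); // segment_id 42 falls in [40, 50)
```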
Working with Unmanaged Source Tables
For tables maintained outside your Datoria workflow, use BQSourceTable to define their structure:
import {
BQDatasetId,
BQSourceTable,
BQField,
BQFieldType,
BQPartitioning,
} from "@datoria/sdk";
export const externalUserTable = BQSourceTable({
dataset: BQDatasetId("external-project", "user_data"),
name: "users",
location: "EU", // Location is required for source tables
partitioning: BQPartitioning.Unpartitioned,
schema: [
BQField("id", BQFieldType.String, "REQUIRED"),
BQField("name", BQFieldType.String, "REQUIRED"),
BQField("external_id", BQFieldType.String, "REQUIRED"),
],
});
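Source table definitions compose with managed tables just like managed schemas do. For example, a downstream managed table could reuse the external schema directly, a sketch assuming the same schema-reuse mechanism shown for userTable.schema above:

```typescript
import { BQPartitioning, BQTable } from "@datoria/sdk";

// A managed mirror of the external table, reusing its schema
// so the two definitions cannot drift apart.
export const mirroredUserTable = BQTable({
  dataset: analyticsDataset,
  name: "users_mirror",
  partitioning: BQPartitioning.Unpartitioned,
  schema: externalUserTable.schema,
});
```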
Importing and Synchronizing Unmanaged Tables
Datoria provides dedicated commands to import unmanaged tables and keep them up to date:
1. Importing New Source Tables
To import the schema of a new source table:
datoria workspace source-tables-import bq project.dataset project.dataset.relation_name
This command retrieves the schemas of the specified relations (or of every relation in the dataset if you specify only the dataset) and stores the resulting TypeScript definitions in your workspace's references folder.
It's recommended to check these generated files into source control and avoid editing them manually.
2. Synchronizing Existing Source Tables
To update the schema definitions for already imported source tables:
datoria workspace source-tables-sync bq
This command discovers all existing source table references in your workspace, downloads their latest schemas, and overwrites the corresponding generated files.
Benefits of Source Table Versioning
By keeping source table definitions in versioned files, you gain:
- Consistent Builds: Rebuild old versions of your graph without surprises
- Fast Feedback: Quick type-checking and code validation in your IDE
- Safety: Migrations check actual relations against schemas at deploy-time
- Breaking Change Handling: Update local definitions in advance of upstream changes
- Automated Coordination: Script schema sync to trigger pull requests when external tables change
Importing Existing Tables
Datoria makes it easy to import existing tables from your data warehouse:
datoria workspace schema-import bq my-gcp-project.my-dataset.table
This command generates the necessary TypeScript definitions based on the live tables.
For a more comprehensive approach:
datoria workspace schema-import bq my-gcp-project.my-dataset
This imports all tables in the specified dataset.
Partition Expiration
For time-partitioned tables, you can set a partition expiration to automatically delete old data:
export const eventsWithExpirationTable = BQTable({
dataset: analyticsDataset,
name: "events_with_expiration",
partitioning: BQPartitioning.Day("date"),
partitionExpiration: "90 days", // Partitions older than 90 days are deleted
schema: [
BQField("date", BQFieldType.Date, "REQUIRED"),
BQField("event", BQFieldType.String, "REQUIRED"),
],
});
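Expiration is evaluated per partition: a partition is dropped once its date falls outside the configured window. Illustratively (plain TypeScript, not SDK code), the oldest partition still retained under a 90-day expiration is:

```typescript
// Given "today" and an expiration window in days, compute the oldest
// partition date that is still retained.
function oldestRetainedPartition(today: Date, expirationDays: number): string {
  const cutoff = new Date(today.getTime() - expirationDays * 24 * 60 * 60 * 1000);
  return cutoff.toISOString().slice(0, 10); // YYYY-MM-DD
}

console.log(oldestRetainedPartition(new Date(Date.UTC(2024, 5, 30)), 90)); // "2024-04-01"
```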
Table Clustering
For large tables, clustering can significantly improve query performance by organizing data based on the values in specified columns:
export const clusteredEventsTable = BQTable({
dataset: analyticsDataset,
name: "clustered_events",
partitioning: BQPartitioning.Day("date"),
clustering: ["event_type", "user_id"], // Up to 4 columns
schema: [
BQField("date", BQFieldType.Date, "REQUIRED"),
BQField("event_type", BQFieldType.String, "REQUIRED"),
BQField("user_id", BQFieldType.String, "REQUIRED"),
],
});
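Conceptually, clustering keeps rows with equal values in the clustering columns stored near each other, in the order the columns are declared, so a filter on event_type (and then user_id) can skip unrelated blocks. A rough illustration of that ordering (plain TypeScript, not SDK code):

```typescript
type Event = { date: string; event_type: string; user_id: string };

// Order rows by the clustering columns, in the order they were declared.
// The warehouse maintains a similar co-location automatically as data arrives.
function clusterOrder(rows: Event[], columns: (keyof Event)[]): Event[] {
  return [...rows].sort((a, b) => {
    for (const col of columns) {
      if (a[col] !== b[col]) return a[col] < b[col] ? -1 : 1;
    }
    return 0;
  });
}

const rows: Event[] = [
  { date: "2024-01-01", event_type: "click", user_id: "u2" },
  { date: "2024-01-01", event_type: "view", user_id: "u1" },
  { date: "2024-01-01", event_type: "click", user_id: "u1" },
];

const ordered = clusterOrder(rows, ["event_type", "user_id"]);
console.log(ordered.map((r) => `${r.event_type}/${r.user_id}`).join(", "));
// "click" rows group together, ordered by user_id within event_type
```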