Tables & Partitioning
Table Types in Datoria
Datoria works with two primary types of tables:
Managed Tables are created and maintained by Datoria. These tables are fully owned by your data platform—Datoria handles their creation, schema updates, and lifecycle management.
Unmanaged Source Tables are externally maintained tables that Datoria reads from but doesn't modify. These tables are typically owned by other teams or systems, and Datoria only needs to know their structure.
Defining Datasets
A dataset is a container for tables in your data warehouse. In Datoria, datasets are defined explicitly:
import { BQDataset } from "@datoria/sdk";
export const analyticsDataset = BQDataset({
project: "datoria-test", // GCP project ID
name: "my_dataset", // Dataset name
location: "europe-north1", // Physical region
});
The explicit definition ensures that all references to the dataset are consistent and prevents accidental references to the wrong resources.
Defining Managed Tables
Here's how to define a managed table in Datoria:
import { BQField, BQFieldType, BQPartitioning, BQTable } from "@datoria/sdk";
export const userTable = BQTable({
dataset: analyticsDataset,
name: "users",
partitioning: BQPartitioning.Unpartitioned,
schema: [
BQField("id", BQFieldType.String, "REQUIRED"),
BQField("name", BQFieldType.String, "REQUIRED"),
BQField("signup_date", BQFieldType.Date, "REQUIRED"),
BQField("last_login", BQFieldType.Timestamp, "NULLABLE"),
],
});
Tables can include additional properties for full customization:
export const userTableExtended = BQTable({
dataset: analyticsDataset,
name: "users_extended",
partitioning: BQPartitioning.Unpartitioned,
schema: userTable.schema, // Reusing schemas is easy
// Optional properties
labels: {
owner: "user-team",
sensitivity: "high",
},
description: "Extended user information including profile data",
clustering: ["name"], // Columns to cluster by
});
Understanding Partitioning
Partitioning divides a table into smaller, more manageable segments based on the values of a particular column. This improves query performance and reduces cost, because queries can scan only the relevant partitions instead of the whole table.
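The cost saving comes from partition pruning: when a query filters on the partitioning column, only the matching partitions are read at all. A minimal sketch of the idea in plain TypeScript (the names here are illustrative, not part of the Datoria SDK):

```typescript
// A table stored as one array of rows per partition, keyed by date.
type Row = { event_date: string; user_id: string };

const partitions: Map<string, Row[]> = new Map([
  ["2024-01-01", [{ event_date: "2024-01-01", user_id: "a" }]],
  ["2024-01-02", [{ event_date: "2024-01-02", user_id: "b" }]],
  ["2024-01-03", [{ event_date: "2024-01-03", user_id: "c" }]],
]);

// With a filter on the partitioning column, only the matching partition
// is ever touched; the others contribute zero bytes scanned.
function scan(wantedDate: string): Row[] {
  const partition = partitions.get(wantedDate); // pruning: one lookup
  return partition ?? [];
}

console.log(scan("2024-01-02").length); // only 1 of 3 partitions read
```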
Datoria supports three main partitioning strategies:
Unpartitioned Tables
The simplest approach is to have no partitioning. This works well for smaller tables or reference data:
export const configTable = BQTable({
dataset: analyticsDataset,
name: "config",
partitioning: BQPartitioning.Unpartitioned,
schema: [
BQField("key", BQFieldType.String, "REQUIRED"),
BQField("value", BQFieldType.String, "REQUIRED"),
],
});
Time-Based Partitioning
For time-series data, you can partition by various time intervals:
Day Partitioning
export const dailyEventsTable = BQTable({
dataset: analyticsDataset,
name: "daily_events",
partitioning: BQPartitioning.Day("event_date"),
schema: [
BQField("event_date", BQFieldType.Date, "REQUIRED"),
BQField("event_type", BQFieldType.String, "REQUIRED"),
BQField("user_id", BQFieldType.String, "REQUIRED"),
],
});
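BigQuery identifies each daily partition by a YYYYMMDD suffix (a partition decorator such as daily_events$20240315). As an illustration of that naming scheme, not part of the Datoria SDK, the decorator for a given date can be derived like this:

```typescript
// Compute the YYYYMMDD partition decorator for a day-partitioned table.
function dayPartitionId(date: Date): string {
  const y = date.getUTCFullYear().toString().padStart(4, "0");
  const m = (date.getUTCMonth() + 1).toString().padStart(2, "0");
  const d = date.getUTCDate().toString().padStart(2, "0");
  return `${y}${m}${d}`;
}

console.log(dayPartitionId(new Date(Date.UTC(2024, 2, 15)))); // "20240315"
```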
Hour Partitioning
export const hourlyEventsTable = BQTable({
dataset: analyticsDataset,
name: "hourly_events",
partitioning: BQPartitioning.Hour("event_timestamp"),
schema: [
BQField("event_timestamp", BQFieldType.Timestamp, "REQUIRED"),
BQField("event_type", BQFieldType.String, "REQUIRED"),
BQField("user_id", BQFieldType.String, "REQUIRED"),
],
});
Month Partitioning
export const monthlyStatsTable = BQTable({
dataset: analyticsDataset,
name: "monthly_stats",
partitioning: BQPartitioning.Month("month"),
schema: [
BQField("month", BQFieldType.Date, "REQUIRED"),
BQField("metric", BQFieldType.String, "REQUIRED"),
BQField("value", BQFieldType.Float, "REQUIRED"),
],
});
Integer-Range Partitioning
For data that's naturally segmented by integer values:
import { BQIntegerRange } from "@datoria/sdk";
export const userSegmentsTable = BQTable({
dataset: analyticsDataset,
name: "user_segments",
partitioning: BQPartitioning.Integer(
"segment_id", // column to partition by
BQIntegerRange(0, 100, 10), // min, max, interval
),
schema: [
BQField("segment_id", BQFieldType.Integer, "REQUIRED"),
BQField("segment_name", BQFieldType.String, "REQUIRED"),
BQField("user_count", BQFieldType.Integer, "REQUIRED"),
],
});
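With BQIntegerRange(0, 100, 10), the table gets ten partitions covering [0, 10), [10, 20), …, [90, 100); values outside the range land in a catch-all partition. A small illustrative helper (not part of the SDK) that maps a value to its partition bounds:

```typescript
// Map a value to its integer-range partition, mirroring the
// (min, max, interval) semantics of an integer-range specification.
function rangePartition(
  value: number,
  min: number,
  max: number,
  interval: number,
): [number, number] | "unpartitioned" {
  if (value < min || value >= max) return "unpartitioned"; // catch-all
  const start = min + Math.floor((value - min) / interval) * interval;
  return [start, start + interval];
}

console.log(rangePartition(42, 0, 100, 10)); // segment_id 42 falls in [40, 50)
```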
Working with Unmanaged Source Tables
For tables maintained outside your Datoria workflow, use BQSourceTable to define their structure:
import {
BQDatasetId,
BQSourceTable,
BQField,
BQFieldType,
BQPartitioning,
} from "@datoria/sdk";
export const externalUserTable = BQSourceTable({
dataset: BQDatasetId("external-project", "user_data"),
name: "users",
location: "EU", // Location is required for source tables
partitioning: BQPartitioning.Unpartitioned,
schema: [
BQField("id", BQFieldType.String, "REQUIRED"),
BQField("name", BQFieldType.String, "REQUIRED"),
BQField("external_id", BQFieldType.String, "REQUIRED"),
],
});
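Source table definitions compose with managed tables just like managed schemas do. For example, a downstream managed table could reuse the external schema directly, a sketch assuming the same schema-reuse mechanism shown for userTable.schema above:

```typescript
import { BQPartitioning, BQTable } from "@datoria/sdk";

// A managed mirror of the external table, reusing its schema
// so the two definitions cannot drift apart.
export const mirroredUserTable = BQTable({
  dataset: analyticsDataset,
  name: "users_mirror",
  partitioning: BQPartitioning.Unpartitioned,
  schema: externalUserTable.schema,
});
```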
Importing and Synchronizing Unmanaged Tables
Datoria provides dedicated commands to import unmanaged tables and keep them up to date:
1. Importing New Source Tables
To import the schema of a new source table:
datoria workspace source-tables-import bq project.dataset project.dataset.relation_name
This command retrieves the schemas of the specified relations (or of every relation in the dataset if you specify only the dataset) and stores the resulting TypeScript definitions in your workspace's references folder.
It's recommended to check these generated files into source control and avoid editing them manually.
2. Synchronizing Existing Source Tables
To update the schema definitions for already imported source tables:
datoria workspace source-tables-sync bq
This command discovers all existing source table references in your workspace, downloads their latest schemas, and overwrites the corresponding generated files.
Benefits of Source Table Versioning
By keeping source table definitions in versioned files, you gain:
- Consistent Builds: Rebuild old versions of your graph without surprises
- Fast Feedback: Quick type-checking and code validation in your IDE
- Safety: Migrations check actual relations against schemas at deploy-time
- Breaking Change Handling: Update local definitions in advance of upstream changes
- Automated Coordination: Script schema sync to trigger pull requests when external tables change
Importing Existing Tables
Datoria makes it easy to import existing tables from your data warehouse:
datoria workspace schema-import bq my-gcp-project.my-dataset.table
This command generates the necessary TypeScript definitions based on the live tables.
For a more comprehensive approach:
datoria workspace schema-import bq my-gcp-project.my-dataset
This imports all tables in the specified dataset.
Partition Expiration
For time-partitioned tables, you can set a partition expiration to automatically delete old data:
export const eventsWithExpirationTable = BQTable({
dataset: analyticsDataset,
name: "events_with_expiration",
partitioning: BQPartitioning.Day("date"),
partitionExpiration: "90 days", // Partitions older than 90 days are deleted
schema: [
BQField("date", BQFieldType.Date, "REQUIRED"),
BQField("event", BQFieldType.String, "REQUIRED"),
],
});
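Expiration is evaluated per partition: a partition is dropped once its date falls outside the configured window. Illustratively (plain TypeScript, not SDK code), the oldest partition still retained under a 90-day expiration is:

```typescript
// Given "today" and an expiration window in days, compute the oldest
// partition date that is still retained.
function oldestRetainedPartition(today: Date, expirationDays: number): string {
  const cutoff = new Date(today.getTime() - expirationDays * 24 * 60 * 60 * 1000);
  return cutoff.toISOString().slice(0, 10); // YYYY-MM-DD
}

console.log(oldestRetainedPartition(new Date(Date.UTC(2024, 5, 30)), 90)); // "2024-04-01"
```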
Table Clustering
For large tables, clustering can significantly improve query performance by organizing data based on the values in specified columns:
export const clusteredEventsTable = BQTable({
dataset: analyticsDataset,
name: "clustered_events",
partitioning: BQPartitioning.Day("date"),
clustering: ["event_type", "user_id"], // Up to 4 columns
schema: [
BQField("date", BQFieldType.Date, "REQUIRED"),
BQField("event_type", BQFieldType.String, "REQUIRED"),
BQField("user_id", BQFieldType.String, "REQUIRED"),
],
});
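Conceptually, clustering keeps rows with equal values in the clustering columns stored near each other, in the order the columns are declared, so a filter on event_type (and then user_id) can skip unrelated blocks. A rough illustration of that ordering (plain TypeScript, not SDK code):

```typescript
type Event = { date: string; event_type: string; user_id: string };

// Order rows by the clustering columns, in the order they were declared.
// The warehouse maintains a similar co-location automatically as data arrives.
function clusterOrder(rows: Event[], columns: (keyof Event)[]): Event[] {
  return [...rows].sort((a, b) => {
    for (const col of columns) {
      if (a[col] !== b[col]) return a[col] < b[col] ? -1 : 1;
    }
    return 0;
  });
}

const rows: Event[] = [
  { date: "2024-01-01", event_type: "click", user_id: "u2" },
  { date: "2024-01-01", event_type: "view", user_id: "u1" },
  { date: "2024-01-01", event_type: "click", user_id: "u1" },
];

const ordered = clusterOrder(rows, ["event_type", "user_id"]);
console.log(ordered.map((r) => `${r.event_type}/${r.user_id}`).join(", "));
// "click" rows group together, ordered by user_id within event_type
```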