BigQuery Data Loading Skill

BigQuery Data Loading

Use this skill when importing data into BigQuery from various file formats and sources.

Basic Load Command

bq load \
  --location=LOCATION \
  --source_format=FORMAT \
  PROJECT:DATASET.TABLE \
  SOURCE_PATH \
  SCHEMA

Loading from CSV

Basic CSV Load

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT,date:DATE

CSV with Schema File

# Create schema.json
echo '[
  {"name": "customer_id", "type": "STRING"},
  {"name": "amount", "type": "FLOAT"},
  {"name": "date", "type": "DATE"}
]' > schema.json

# Load with schema file
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  ./schema.json

CSV Options

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --field_delimiter=',' \
  --quote='"' \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  --max_bad_records=100 \
  --null_marker='NULL' \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Key flags:

--skip_leading_rows=N - Skip header rows
--field_delimiter=',' - Column separator (default: comma)
--allow_quoted_newlines - Allow newlines in quoted fields
--allow_jagged_rows - Allow rows with missing fields
--max_bad_records=N - Tolerate N parsing errors
--null_marker='NULL' - String representing NULL values

CSV with Auto-detect

bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv

Warning: Auto-detect is convenient but not recommended for production. Schema may change between loads.

Loading from JSON

Newline-Delimited JSON

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset.table \
  gs://bucket/data.json \
  customer_id:STRING,amount:FLOAT,date:DATE

JSON format required:

{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}

NOT standard JSON array:

// ❌ This won't work
[
  {"customer_id": "C001", "amount": 99.99},
  {"customer_id": "C002", "amount": 149.99}
]

JSON with Auto-detect

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  dataset.table \
  gs://bucket/data.json

JSON with Nested Fields

Schema with nested STRUCT:

[
  {"name": "customer_id", "type": "STRING"},
  {"name": "address", "type": "RECORD", "fields": [
    {"name": "street", "type": "STRING"},
    {"name": "city", "type": "STRING"},
    {"name": "zip", "type": "STRING"}
  ]},
  {"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "order_id", "type": "STRING"},
    {"name": "amount", "type": "FLOAT"}
  ]}
]

Loading from Avro

Basic Avro Load

bq load \
  --source_format=AVRO \
  dataset.table \
  gs://bucket/data.avro

Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!

Avro with Wildcards

bq load \
  --source_format=AVRO \
  dataset.table \
  "gs://bucket/path/to/*.avro"

Note: Use quotes around wildcard paths.

Avro Advantages

Self-describing: Schema embedded in file
Efficient: Compact binary format
Type-safe: Strong typing preserved
No schema drift: Schema always matches data

Loading from Parquet

Basic Parquet Load

bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.parquet

Like Avro: Schema auto-detected from Parquet metadata.

Parquet with Compression

bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.snappy.parquet

Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD

Loading Strategies

Append Data (Default)

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/new_data.csv \
  schema.json

Behavior: Adds rows to existing table.

Replace Table

bq load \
  --source_format=CSV \
  --replace \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Behavior: Deletes all existing data, loads new data.

Overwrite Partition

bq load \
  --source_format=CSV \
  --replace \
  --time_partitioning_field=date \
  dataset.table\$20240115 \
  gs://bucket/data_20240115.csv \
  schema.json

Syntax: TABLE$YYYYMMDD targets specific partition.

Behavior: Replaces only that partition, leaves others intact.

Append or Skip

# Skip load if table already has data
bq load \
  --source_format=CSV \
  --if_not_exists \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Loading from Cloud Storage

Single File

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Multiple Files (List)

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
  schema.json

Wildcard Pattern

bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/path/data-*.csv" \
  schema.json

Patterns:

gs://bucket/*.csv - All CSV files in bucket root
gs://bucket/2024/*/*.csv - All CSV files in subdirectories
gs://bucket/data-[0-9]*.csv - Files matching pattern

Loading from Local Files

Local CSV

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  /path/to/local/data.csv \
  schema.json

Limitation: Files are uploaded first, then loaded. Slower for large files. Use GCS for better performance.

Schema Handling

Inline Schema

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN

Schema File

[
  {"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "amount", "type": "FLOAT64"},
  {"name": "order_date", "type": "DATE"},
  {"name": "metadata", "type": "JSON"}
]

Modes:

REQUIRED - Field must have value
NULLABLE - Field can be NULL (default)
REPEATED - Field is an array

Auto-detect Schema

bq load \
  --source_format=CSV \
  --autodetect \
  dataset.table \
  gs://bucket/data.csv

Pros:

Quick for exploration
No schema definition needed

Cons:

Not reliable for production
Schema may change between loads
May misinterpret types

Schema Evolution

Add new columns:

bq load \
  --source_format=CSV \
  --schema_update_option=ALLOW_FIELD_ADDITION \
  dataset.table \
  gs://bucket/data_with_new_column.csv \
  schema.json

Relax required columns:

bq load \
  --schema_update_option=ALLOW_FIELD_RELAXATION \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Compression

Compressed Files

BigQuery automatically detects compression:

# Gzip compressed CSV
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv.gz \
  schema.json

Supported: GZIP, DEFLATE, SNAPPY, BZIP2, LZ4, ZSTD

Performance note: Uncompressed files load faster (parallel processing). Use compression only if network/storage is bottleneck.

Error Handling

Allow Bad Records

bq load \
  --source_format=CSV \
  --max_bad_records=1000 \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Behavior: Skip up to 1000 rows with errors, load the rest.

Ignore Unknown Values

bq load \
  --source_format=JSON \
  --ignore_unknown_values \
  dataset.table \
  gs://bucket/data.json \
  schema.json

Behavior: Ignore JSON fields not in schema.

Validate Only (Dry Run)

bq load \
  --dry_run \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Behavior: Validate schema and format without loading data.

Loading to Partitioned Tables

Time-Partitioned Table

bq load \
  --source_format=CSV \
  --time_partitioning_field=order_date \
  --time_partitioning_type=DAY \
  dataset.partitioned_orders \
  gs://bucket/orders_2024.csv \
  order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT

Load Specific Partition

# Load into 2024-01-15 partition
bq load \
  --source_format=CSV \
  dataset.orders\$20240115 \
  gs://bucket/orders_20240115.csv \
  schema.json

Range-Partitioned Table

bq load \
  --source_format=CSV \
  --range_partitioning=customer_id,0,1000,100 \
  dataset.range_partitioned \
  gs://bucket/data.csv \
  customer_id:INTEGER,amount:FLOAT

Performance Optimization

Parallel Loading

BigQuery automatically parallelizes loads from:

Multiple files with wildcards
Uncompressed files (internal parallelization)

Optimal: Split large files into 1GB chunks

File Format Recommendations

Best performance:

Avro - Fastest, self-describing, splittable
Parquet - Fast, columnar, good compression
CSV uncompressed - Good for parallel processing
JSON - Flexible but slower

Avoid:

Very large single files (>10GB)
CSV with complex escaping
Highly compressed files (limits parallelism)

Monitoring Loads

Check Load Jobs

bq ls --jobs --max_results=10

Job Details

bq show -j JOB_ID

Failed Loads

SELECT
  job_id,
  user_email,
  error_result.message as error_message,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'LOAD'
  AND state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;

Common Patterns

Daily Batch Load

#!/bin/bash
DATE=$(date +%Y%m%d)
bq load \
  --source_format=CSV \
  --replace \
  dataset.daily_data\$$DATE \
  gs://bucket/data_$DATE.csv \
  schema.json

Incremental Load with Deduplication

# Load to staging
bq load \
  --source_format=CSV \
  dataset.staging_orders \
  gs://bucket/new_orders.csv \
  schema.json

# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE `project.dataset.orders` T
USING `project.dataset.staging_orders` S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'

Troubleshooting

"Too many errors"

Problem: Exceeds max_bad_records Solution: Increase --max_bad_records or fix data quality

"Schema mismatch"

Problem: CSV columns don't match schema Solution: Verify column order and count

"Invalid CSV format"

Problem: Unescaped quotes or newlines Solution: Use --allow_quoted_newlines

"Permission denied"

Problem: No access to GCS bucket Solution: Grant BigQuery service account Storage Object Viewer role

Quick Reference

Format priorities (fastest first):

Avro (splittable, self-describing)
Parquet (columnar, efficient)
CSV uncompressed (parallel)
JSON (flexible)

Schema strategies:

Production: Explicit schema file
Development: Auto-detect
Evolution: Allow field addition/relaxation

Loading strategies:

New data: Append (default)
Replace all: --replace
Replace partition: TABLE$YYYYMMDD
Incremental: Load staging → MERGE

Agent Skills: BigQuery Data Loading

Install this agent skill to your local

Skill Files