BigQuery Data Loading
Use this skill when importing data into BigQuery from various file formats and sources.
Basic Load Command
bq load \
--location=LOCATION \
--source_format=FORMAT \
PROJECT:DATASET.TABLE \
SOURCE_PATH \
SCHEMA
Loading from CSV
Basic CSV Load
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv \
customer_id:STRING,amount:FLOAT,date:DATE
CSV with Schema File
# Create schema.json
echo '[
{"name": "customer_id", "type": "STRING"},
{"name": "amount", "type": "FLOAT"},
{"name": "date", "type": "DATE"}
]' > schema.json
# Load with schema file
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv \
./schema.json
CSV Options
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
--field_delimiter=',' \
--quote='"' \
--allow_quoted_newlines \
--allow_jagged_rows \
--max_bad_records=100 \
--null_marker='NULL' \
dataset.table \
gs://bucket/data.csv \
schema.json
Key flags:
--skip_leading_rows=N- Skip header rows--field_delimiter=','- Column separator (default: comma)--allow_quoted_newlines- Allow newlines in quoted fields--allow_jagged_rows- Allow rows with missing fields--max_bad_records=N- Tolerate N parsing errors--null_marker='NULL'- String representing NULL values
CSV with Auto-detect
bq load \
--source_format=CSV \
--autodetect \
--skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv
Warning: Auto-detect is convenient but not recommended for production. Schema may change between loads.
Loading from JSON
Newline-Delimited JSON
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
dataset.table \
gs://bucket/data.json \
customer_id:STRING,amount:FLOAT,date:DATE
JSON format required:
{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}
NOT standard JSON array:
// ❌ This won't work
[
{"customer_id": "C001", "amount": 99.99},
{"customer_id": "C002", "amount": 149.99}
]
JSON with Auto-detect
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--autodetect \
dataset.table \
gs://bucket/data.json
JSON with Nested Fields
Schema with nested STRUCT:
[
{"name": "customer_id", "type": "STRING"},
{"name": "address", "type": "RECORD", "fields": [
{"name": "street", "type": "STRING"},
{"name": "city", "type": "STRING"},
{"name": "zip", "type": "STRING"}
]},
{"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
{"name": "order_id", "type": "STRING"},
{"name": "amount", "type": "FLOAT"}
]}
]
Loading from Avro
Basic Avro Load
bq load \
--source_format=AVRO \
dataset.table \
gs://bucket/data.avro
Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!
Avro with Wildcards
bq load \
--source_format=AVRO \
dataset.table \
"gs://bucket/path/to/*.avro"
Note: Use quotes around wildcard paths.
Avro Advantages
- Self-describing: Schema embedded in file
- Efficient: Compact binary format
- Type-safe: Strong typing preserved
- No schema drift: Schema always matches data
Loading from Parquet
Basic Parquet Load
bq load \
--source_format=PARQUET \
dataset.table \
gs://bucket/data.parquet
Like Avro: Schema auto-detected from Parquet metadata.
Parquet with Compression
bq load \
--source_format=PARQUET \
dataset.table \
gs://bucket/data.snappy.parquet
Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD
Loading Strategies
Append Data (Default)
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/new_data.csv \
schema.json
Behavior: Adds rows to existing table.
Replace Table
bq load \
--source_format=CSV \
--replace \
dataset.table \
gs://bucket/data.csv \
schema.json
Behavior: Deletes all existing data, loads new data.
Overwrite Partition
bq load \
--source_format=CSV \
--replace \
--time_partitioning_field=date \
dataset.table\$20240115 \
gs://bucket/data_20240115.csv \
schema.json
Syntax: TABLE$YYYYMMDD targets specific partition.
Behavior: Replaces only that partition, leaves others intact.
Append or Skip
# Skip load if table already has data
bq load \
--source_format=CSV \
--if_not_exists \
dataset.table \
gs://bucket/data.csv \
schema.json
Loading from Cloud Storage
Single File
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
schema.json
Multiple Files (List)
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
schema.json
Wildcard Pattern
bq load \
--source_format=CSV \
dataset.table \
"gs://bucket/path/data-*.csv" \
schema.json
Patterns:
gs://bucket/*.csv- All CSV files in bucket rootgs://bucket/2024/*/*.csv- All CSV files in subdirectoriesgs://bucket/data-[0-9]*.csv- Files matching pattern
Loading from Local Files
Local CSV
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
dataset.table \
/path/to/local/data.csv \
schema.json
Limitation: Files are uploaded first, then loaded. Slower for large files. Use GCS for better performance.
Schema Handling
Inline Schema
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN
Schema File
[
{"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
{"name": "amount", "type": "FLOAT64"},
{"name": "order_date", "type": "DATE"},
{"name": "metadata", "type": "JSON"}
]
Modes:
REQUIRED- Field must have valueNULLABLE- Field can be NULL (default)REPEATED- Field is an array
Auto-detect Schema
bq load \
--source_format=CSV \
--autodetect \
dataset.table \
gs://bucket/data.csv
Pros:
- Quick for exploration
- No schema definition needed
Cons:
- Not reliable for production
- Schema may change between loads
- May misinterpret types
Schema Evolution
Add new columns:
bq load \
--source_format=CSV \
--schema_update_option=ALLOW_FIELD_ADDITION \
dataset.table \
gs://bucket/data_with_new_column.csv \
schema.json
Relax required columns:
bq load \
--schema_update_option=ALLOW_FIELD_RELAXATION \
dataset.table \
gs://bucket/data.csv \
schema.json
Compression
Compressed Files
BigQuery automatically detects compression:
# Gzip compressed CSV
bq load \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv.gz \
schema.json
Supported: GZIP, DEFLATE, SNAPPY, BZIP2, LZ4, ZSTD
Performance note: Uncompressed files load faster (parallel processing). Use compression only if network/storage is bottleneck.
Error Handling
Allow Bad Records
bq load \
--source_format=CSV \
--max_bad_records=1000 \
dataset.table \
gs://bucket/data.csv \
schema.json
Behavior: Skip up to 1000 rows with errors, load the rest.
Ignore Unknown Values
bq load \
--source_format=JSON \
--ignore_unknown_values \
dataset.table \
gs://bucket/data.json \
schema.json
Behavior: Ignore JSON fields not in schema.
Validate Only (Dry Run)
bq load \
--dry_run \
--source_format=CSV \
dataset.table \
gs://bucket/data.csv \
schema.json
Behavior: Validate schema and format without loading data.
Loading to Partitioned Tables
Time-Partitioned Table
bq load \
--source_format=CSV \
--time_partitioning_field=order_date \
--time_partitioning_type=DAY \
dataset.partitioned_orders \
gs://bucket/orders_2024.csv \
order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT
Load Specific Partition
# Load into 2024-01-15 partition
bq load \
--source_format=CSV \
dataset.orders\$20240115 \
gs://bucket/orders_20240115.csv \
schema.json
Range-Partitioned Table
bq load \
--source_format=CSV \
--range_partitioning=customer_id,0,1000,100 \
dataset.range_partitioned \
gs://bucket/data.csv \
customer_id:INTEGER,amount:FLOAT
Performance Optimization
Parallel Loading
BigQuery automatically parallelizes loads from:
- Multiple files with wildcards
- Uncompressed files (internal parallelization)
Optimal: Split large files into 1GB chunks
File Format Recommendations
Best performance:
- Avro - Fastest, self-describing, splittable
- Parquet - Fast, columnar, good compression
- CSV uncompressed - Good for parallel processing
- JSON - Flexible but slower
Avoid:
- Very large single files (>10GB)
- CSV with complex escaping
- Highly compressed files (limits parallelism)
Monitoring Loads
Check Load Jobs
bq ls --jobs --max_results=10
Job Details
bq show -j JOB_ID
Failed Loads
SELECT
job_id,
user_email,
error_result.message as error_message,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
job_type = 'LOAD'
AND state = 'DONE'
AND error_result IS NOT NULL
AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;
Common Patterns
Daily Batch Load
#!/bin/bash
DATE=$(date +%Y%m%d)
bq load \
--source_format=CSV \
--replace \
dataset.daily_data\$$DATE \
gs://bucket/data_$DATE.csv \
schema.json
Incremental Load with Deduplication
# Load to staging
bq load \
--source_format=CSV \
dataset.staging_orders \
gs://bucket/new_orders.csv \
schema.json
# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE `project.dataset.orders` T
USING `project.dataset.staging_orders` S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'
Troubleshooting
"Too many errors"
Problem: Exceeds max_bad_records
Solution: Increase --max_bad_records or fix data quality
"Schema mismatch"
Problem: CSV columns don't match schema Solution: Verify column order and count
"Invalid CSV format"
Problem: Unescaped quotes or newlines
Solution: Use --allow_quoted_newlines
"Permission denied"
Problem: No access to GCS bucket Solution: Grant BigQuery service account Storage Object Viewer role
Quick Reference
Format priorities (fastest first):
- Avro (splittable, self-describing)
- Parquet (columnar, efficient)
- CSV uncompressed (parallel)
- JSON (flexible)
Schema strategies:
- Production: Explicit schema file
- Development: Auto-detect
- Evolution: Allow field addition/relaxation
Loading strategies:
- New data: Append (default)
- Replace all:
--replace - Replace partition:
TABLE$YYYYMMDD - Incremental: Load staging → MERGE