Date Normalizer
Parse dates written in mixed formats and normalize them to one consistent format for data cleaning and ETL pipelines.
Purpose
Date standardization for:
- Data cleaning and ETL pipelines
- Database imports with mixed date formats
- Log file parsing and analysis
- International data harmonization
- Report generation with consistent dating
Features
- Smart Parsing: Automatically detect and parse 100+ date formats
- Format Conversion: Convert to ISO 8601, US, EU, or custom formats
- Batch Processing: Normalize entire CSV columns
- Ambiguity Detection: Flag dates that could be interpreted multiple ways
- Timezone Handling: Convert and normalize timezones
- Relative Dates: Parse "today", "yesterday", "next week"
- Validation: Detect and report invalid dates
Quick Start
from date_normalizer import DateNormalizer
# Normalize single date
normalizer = DateNormalizer()
result = normalizer.normalize("03/14/2024")
print(result) # {'normalized': '2024-03-14', 'format': 'iso8601'}
# Normalize to specific format
result = normalizer.normalize("March 14, 2024", output_format="us")
print(result) # {'normalized': '03/14/2024', 'format': 'us'}
# Batch normalize CSV column
normalizer.normalize_csv(
    'data.csv',
    date_column='created_at',
    output='normalized.csv',
    output_format='iso8601'
)
CLI Usage
# Normalize single date
python date_normalizer.py --date "March 14, 2024"
# Convert to specific format
python date_normalizer.py --date "14/03/2024" --format us
# Normalize CSV column
python date_normalizer.py --csv data.csv --column date --format iso8601 --output normalized.csv
# Detect ambiguous dates
python date_normalizer.py --date "01/02/03" --detect-ambiguous
API Reference
DateNormalizer
class DateNormalizer:
    def normalize(self, date_string: str, output_format: str = 'iso8601',
                  dayfirst: bool = False, yearfirst: bool = False) -> Dict
    def normalize_batch(self, dates: List[str], **kwargs) -> List[Dict]
    def normalize_csv(self, csv_path: str, date_column: str,
                      output: str = None, **kwargs) -> str
    def detect_format(self, date_string: str) -> str
    def is_valid(self, date_string: str) -> bool
    def is_ambiguous(self, date_string: str) -> bool
    def parse_relative(self, relative_string: str) -> datetime
Output Formats
ISO 8601 (default):
'2024-03-14' # Date only
'2024-03-14T15:30:00' # With time
'2024-03-14T15:30:00+00:00' # With timezone
US Format:
'03/14/2024' # MM/DD/YYYY
EU Format:
'14/03/2024' # DD/MM/YYYY
Long Format:
'March 14, 2024'
Custom Format:
normalizer.normalize(date, output_format='%Y%m%d') # '20240314'
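The named and custom output formats above can be pictured as a lookup from a format name to a strftime pattern, with any unrecognized name treated as a custom pattern. The following is a minimal sketch under that assumption, not the library's actual implementation; `OUTPUT_PATTERNS`, `render_date`, and the `'long'` key are hypothetical names introduced for illustration.

```python
from datetime import date

# Assumed mapping from format names to strftime patterns.
# The 'long' key is a guess; the document does not name that format.
OUTPUT_PATTERNS = {
    "iso8601": "%Y-%m-%d",
    "us": "%m/%d/%Y",
    "eu": "%d/%m/%Y",
    "long": "%B %d, %Y",  # month name depends on the active locale
}


def render_date(value: date, output_format: str = "iso8601") -> str:
    """Hypothetical helper: format a date by name, or by a raw strftime pattern."""
    # Anything that is not a known name is passed through as a custom pattern.
    pattern = OUTPUT_PATTERNS.get(output_format, output_format)
    return value.strftime(pattern)


if __name__ == "__main__":
    pi_day = date(2024, 3, 14)
    for name in ("iso8601", "us", "eu", "%Y%m%d"):
        print(name, "->", render_date(pi_day, name))
```

Falling back to the raw string keeps custom patterns such as `%Y%m%d` working without a separate parameter.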
Supported Input Formats
Numeric:
- 2024-03-14 (ISO)
- 03/14/2024 (US)
- 14/03/2024 (EU)
- 14.03.2024 (German)
- 2024/03/14 (Japanese)
- 20240314 (Compact)
Textual:
- March 14, 2024
- 14 March 2024
- Mar 14, 2024
- 14-Mar-2024
Relative:
- today, yesterday, tomorrow
- next week, last month
- 2 days ago, in 3 weeks
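To illustrate how relative expressions like these might be resolved, here is a minimal sketch built on the standard library. It is not the behavior of `parse_relative`: `resolve_relative` is a hypothetical helper, it returns a `date` rather than a `datetime`, and it only handles day and week offsets. Month arithmetic such as "last month" is deliberately left out because it needs calendar-aware logic.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Fixed day offsets for the simple keywords (assumed semantics).
KEYWORD_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1, "next week": 7}

AGO_PATTERN = re.compile(r"(\d+) (day|week)s? ago")
IN_PATTERN = re.compile(r"in (\d+) (day|week)s?")


def resolve_relative(text: str, today: Optional[date] = None) -> date:
    """Hypothetical helper: resolve a relative phrase against a reference day."""
    today = today or date.today()
    phrase = text.strip().lower()
    if phrase in KEYWORD_OFFSETS:
        return today + timedelta(days=KEYWORD_OFFSETS[phrase])
    # "N days ago" moves backwards, "in N weeks" moves forwards.
    for pattern, sign in ((AGO_PATTERN, -1), (IN_PATTERN, 1)):
        match = pattern.fullmatch(phrase)
        if match:
            count, unit = int(match.group(1)), match.group(2)
            days = count * (7 if unit == "week" else 1)
            return today + timedelta(days=sign * days)
    raise ValueError(f"unsupported relative expression: {text!r}")


if __name__ == "__main__":
    reference = date(2024, 3, 14)
    print(resolve_relative("2 days ago", today=reference))
    print(resolve_relative("in 3 weeks", today=reference))
```

Passing the reference day explicitly keeps the function deterministic, which helps when testing pipelines.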
With Time:
- 2024-03-14 15:30:00
- 03/14/2024 3:30 PM
- 2024-03-14T15:30:00Z
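One common way to accept many absolute formats is to try a list of strptime patterns in order and keep the first that matches. The sketch below assumes that approach and covers only the formats listed in this section; `CANDIDATE_FORMATS` and `try_parse` are hypothetical names, and the tool's real detection logic may differ.

```python
from datetime import datetime
from typing import Optional

# Patterns are tried top to bottom, so month-first wins for ambiguous
# numeric input (mirroring dayfirst=False as the default).
CANDIDATE_FORMATS = [
    "%Y-%m-%d",              # 2024-03-14 (ISO)
    "%m/%d/%Y",              # 03/14/2024 (US)
    "%d/%m/%Y",              # 14/03/2024 (EU)
    "%d.%m.%Y",              # 14.03.2024 (German)
    "%Y/%m/%d",              # 2024/03/14 (Japanese)
    "%Y%m%d",                # 20240314 (Compact)
    "%B %d, %Y",             # March 14, 2024
    "%d %B %Y",              # 14 March 2024
    "%b %d, %Y",             # Mar 14, 2024
    "%d-%b-%Y",              # 14-Mar-2024
    "%Y-%m-%d %H:%M:%S",     # 2024-03-14 15:30:00
    "%m/%d/%Y %I:%M %p",     # 03/14/2024 3:30 PM
    "%Y-%m-%dT%H:%M:%SZ",    # 2024-03-14T15:30:00Z (result is naive, 'Z' is matched literally)
]


def try_parse(text: str) -> Optional[datetime]:
    """Hypothetical helper: return the first successful parse, or None."""
    cleaned = text.strip()
    for pattern in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, pattern)
        except ValueError:
            continue  # this pattern did not match; try the next one
    return None


if __name__ == "__main__":
    for sample in ("14.03.2024", "14-Mar-2024", "03/14/2024 3:30 PM", "not a date"):
        print(repr(sample), "->", try_parse(sample))
```

Month names (`%B`, `%b`) and AM/PM (`%p`) follow the active locale, so this sketch assumes an English or C locale.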
Ambiguity Handling
Dates like 01/02/03 are ambiguous. Specify the interpretation with dayfirst or yearfirst:
# Day first (EU)
normalizer.normalize("01/02/03", dayfirst=True)
# Result: 2003-02-01
# Month first (US)
normalizer.normalize("01/02/03", dayfirst=False)
# Result: 2003-01-02
# Year first
normalizer.normalize("01/02/03", yearfirst=True)
# Result: 2001-02-03
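The three readings above suggest a simple way to detect ambiguity: interpret the string under each ordering and count how many distinct valid dates result. The sketch below assumes slash-separated numeric input and maps two-digit years to 2000 to 2099; `possible_dates` and `looks_ambiguous` are hypothetical helpers, not the implementation behind `is_ambiguous`.

```python
from datetime import date
from typing import Set


def possible_dates(text: str, century: int = 2000) -> Set[date]:
    """Hypothetical helper: every valid date a three-part numeric string could mean."""
    first, second, third = (int(part) for part in text.strip().split("/"))
    # Each tuple is (year, month, day) under one ordering.
    orderings = [
        (third, first, second),  # month/day/year (US)
        (third, second, first),  # day/month/year (EU)
        (first, second, third),  # year/month/day
    ]
    results = set()
    for year, month, day in orderings:
        if year < 100:
            year += century  # assumption: two-digit years belong to this century
        try:
            results.add(date(year, month, day))
        except ValueError:
            pass  # not a real calendar date under this ordering
    return results


def looks_ambiguous(text: str) -> bool:
    """True when more than one ordering yields a valid date."""
    return len(possible_dates(text)) > 1


if __name__ == "__main__":
    print(sorted(possible_dates("01/02/03")))
    print(looks_ambiguous("03/14/2024"))
```

A string like 03/14/2024 survives only the month-first reading, because 14 cannot be a month, so it is not flagged.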
Use Cases
Clean Messy Data:
messy_dates = [
    "March 14, 2024",
    "2024-03-15",
    "03/16/2024",
    "17-Mar-2024"
]
normalized = normalizer.normalize_batch(messy_dates)
# Each result's 'normalized' value, in order: '2024-03-14', '2024-03-15', '2024-03-16', '2024-03-17'
CSV Normalization:
# Input CSV with mixed date formats
# Convert all to ISO 8601
normalizer.normalize_csv(
    'orders.csv',
    date_column='order_date',
    output='orders_normalized.csv',
    output_format='iso8601'
)
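For readers who want to see what column-wise normalization involves, the sketch below rewrites one CSV column with the standard `csv` module. It is an illustration only: `normalize_csv_column` and `us_to_iso` are hypothetical names, and the example normalizer handles just MM/DD/YYYY input, whereas the tool itself accepts mixed formats.

```python
import csv
from datetime import datetime
from typing import Callable


def us_to_iso(value: str) -> str:
    """Example normalizer (assumption): MM/DD/YYYY in, ISO 8601 date out."""
    return datetime.strptime(value.strip(), "%m/%d/%Y").date().isoformat()


def normalize_csv_column(src: str, dst: str, column: str,
                         normalize: Callable[[str], str]) -> int:
    """Hypothetical helper: copy src to dst, rewriting one column. Returns the row count."""
    rows_written = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames or [])
        writer.writeheader()
        for row in reader:
            row[column] = normalize(row[column])  # other columns pass through unchanged
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

Streaming row by row keeps memory use flat for large files; a production version would also need a policy for values the normalizer rejects.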
Validation:
if not normalizer.is_valid("invalid date"):
    print("Invalid date detected")
Timezone Conversion:
normalizer.normalize(
    "2024-03-14 15:30:00+00:00",
    output_timezone='America/New_York'
)
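The conversion requested above can be reproduced with the standard library alone, which may help when checking results by hand. This is a sketch of the underlying arithmetic, not of how `output_timezone` is implemented; `convert_timezone` is a hypothetical helper.

```python
from datetime import datetime, tzinfo
from zoneinfo import ZoneInfo  # standard library, Python 3.9+


def convert_timezone(timestamp: str, target: tzinfo) -> str:
    """Hypothetical helper: shift an offset-aware ISO 8601 timestamp to `target`."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Without an offset there is no well-defined instant to convert.
        raise ValueError("timestamp has no UTC offset; cannot convert safely")
    return parsed.astimezone(target).isoformat()


if __name__ == "__main__":
    # ZoneInfo needs the IANA time zone database (system tzdata or the tzdata package).
    print(convert_timezone("2024-03-14 15:30:00+00:00", ZoneInfo("America/New_York")))
    # 2024-03-14T11:30:00-04:00 (US daylight saving time began on 10 March 2024)
```

Rejecting naive timestamps is a design choice in this sketch: guessing a source zone silently tends to produce errors that are hard to trace later.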
Limitations
- Cannot parse dates from images or PDFs (use OCR first)
- Ambiguous dates (for example 01/02/03) require an explicit dayfirst or yearfirst setting
- Dates before 1900 may have limited support
- Non-Gregorian calendars not supported
- Some regional formats may need explicit configuration