Agent Skills: Field Extraction and Parsing

Extract structured fields from unstructured log data using OPAL parsing functions. Covers extract_regex() for pattern matching with type casting, split() for delimited data, parse_json() for JSON logs, and JSONPath for navigating parsed structures. Use when you need to convert raw log text into queryable fields for analysis, filtering, or aggregation.

ID: rustomax/observe-community-mcp/field-extraction-parsing

skills/field-extraction-parsing/SKILL.md

Skill Metadata

Name
field-extraction-parsing
Description
Extract structured fields from unstructured log data using OPAL parsing functions. Covers extract_regex() for pattern matching with type casting, split() for delimited data, parse_json() for JSON logs, and JSONPath for navigating parsed structures. Use when you need to convert raw log text into queryable fields for analysis, filtering, or aggregation.

Field Extraction and Parsing

Overview

This skill covers converting unstructured log data into structured, queryable fields using OPAL's extraction and parsing functions.

Core Functions:

  • extract_regex() - Extract fields using regex patterns with named capture groups
  • split() / split_part() - Split delimited strings
  • parse_json() - Parse JSON strings into objects
  • JSONPath / Array access - Navigate parsed structures

When to Use This Skill

Use field extraction when you need to:

  • Parse log formats - Extract timestamp, level, message from structured logs
  • Extract identifiers - Pull out request IDs, trace IDs, user IDs for correlation
  • Parse metrics from text - Extract numbers, durations, status codes from logs
  • Structure unstructured data - Convert free-form text into queryable columns
  • Parse embedded JSON - Extract fields from JSON-formatted log messages

Function 1: extract_regex()

The most powerful extraction function. It applies a regular expression with named capture groups and creates a new column for each group.

Syntax

extract_regex source_column, /(?P<field_name>pattern)/
extract_regex source_column, /(?P<field_name::typecast>pattern)/

Key Features:

  • Named capture groups create new columns: (?P<column_name>pattern)
  • Type casting in capture group: (?P<name::int64>pattern)
  • Multiple captures in one regex
  • Regex patterns are written with forward slashes /pattern/, not quotes

Supported Type Casts

  • string (default)
  • int64, float64
  • parse_isotime (for ISO timestamps)
  • duration, duration_ms, duration_sec, duration_min, duration_hr
  • parse_json
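
Several casts can be applied in a single pattern. A minimal sketch, assuming a hypothetical log line in the form started=2025-11-16T01:58:12Z took=150ms, combining the parse_isotime and duration_ms casts from the list above:

# Hypothetical log line: started=2025-11-16T01:58:12Z took=150ms
extract_regex body, /started=(?P<started_at::parse_isotime>[\dTZ:\-\.]+) took=(?P<elapsed::duration_ms>\d+)ms/

Here started_at is extracted as a timestamp and elapsed as a duration, so both are immediately usable in comparisons and aggregations.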

Pattern 1: Extract Timestamp and Log Level

Use Case: Parse structured application logs

Log Format: [2025-11-16 01:58:12,204] INFO [Component] Message...

filter container = "kafka"
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) /

Creates Columns:

  • log_time: "2025-11-16 02:02:45,266"
  • level: "INFO"

Use For: Java logs, Kafka logs, structured application logs


Pattern 2: Extract with Type Casting

Use Case: Extract numeric values as integers

Log Format: [SnapshotEmitter id=1] Message...

extract_regex body, /\[(?P<component>\w+) id=(?P<component_id::int64>\d+)\]/

Creates Columns:

  • component: "SnapshotEmitter" (string)
  • component_id: 1 (int64)

Key Point: ::int64 casts the extracted value to integer immediately


Pattern 3: Extract HTTP Request Details

Use Case: Parse access log patterns

Log Format: GET /api/users/123 200 15ms

extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/

Creates Columns:

  • method: "GET"
  • path: "/api/users/123"
  • status: 200 (int64)
  • duration_ms: 15 (int64)

Use For: Nginx, Apache, application access logs


Pattern 4: Extract Request ID for Correlation

Use Case: Pull out trace/request IDs for distributed tracing

Log Format: request_id=GHhaU0_7TcVSXpICZ9lh [info] GET /api

extract_regex body, /request_id=(?P<request_id>[\w\-]+)/

Creates Columns:

  • request_id: "GHhaU0_7TcVSXpICZ9lh"

Then correlate:

extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
statsby count(), group_by(request_id, method, path)

Use For: Request correlation, distributed tracing, debugging user sessions


Pattern 5: Extract Key=Value Pairs

Use Case: Parse structured key=value log formats

Log Format: user=john action=login result=success duration=150ms

extract_regex body, /user=(?P<user>\w+) action=(?P<action>\w+) result=(?P<result>\w+) duration=(?P<duration_ms::int64>\d+)ms/

Creates Columns:

  • user: "john"
  • action: "login"
  • result: "success"
  • duration_ms: 150 (int64)

Use For: Audit logs, security logs, custom application logs


Pattern 6: Extract IP Addresses

Use Case: Parse network information from logs

Log Format: Connection from IP=192.168.1.100 to Destination="10.0.0.5"

extract_regex body, /IP=(?P<source_ip>[\d\.]+) to Destination="(?P<dest_ip>[\d\.]+)"/

Creates Columns:

  • source_ip: "192.168.1.100"
  • dest_ip: "10.0.0.5"

Use For: Network logs, firewall logs, connection tracking


Function 2: split() and split_part()

Split delimited strings into arrays or extract specific parts.

Syntax

split(string, delimiter)           # Returns JSON array
split_part(string, delimiter, N)   # Returns Nth part (1-based)

Key Differences:

  • split() returns entire array: ["a", "b", "c"]
  • split_part() returns single element (1-based indexing)
  • Negative indices in split_part() count from end: -1 = last

Pattern 7: Split IP Address into Octets

Use Case: Parse IP address components

extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col first_octet:split_part(ip, ".", 1)
make_col last_octet:split_part(ip, ".", -1)

Results:

  • ip: "95.217.183.1"
  • octets: ["95", "217", "183", "1"]
  • first_octet: "95"
  • last_octet: "1"

Use For: Network analysis, IP classification


Pattern 8: Parse Path Components

Use Case: Extract parts of file paths or URLs

make_col path_parts:split("/var/log/app/error.log", "/")
make_col filename:split_part("/var/log/app/error.log", "/", -1)
make_col directory:split_part("/var/log/app/error.log", "/", -2)

Results:

  • path_parts: ["", "var", "log", "app", "error.log"]
  • filename: "error.log"
  • directory: "app"

Use For: File path analysis, URL parsing


Pattern 9: Parse CSV-Like Data

Use Case: Extract fields from comma-separated values in logs

extract_regex body, /data:(?P<csv_data>[\w,]+)/
make_col fields:split(csv_data, ",")
make_col field1:split_part(csv_data, ",", 1)
make_col field2:split_part(csv_data, ",", 2)
make_col field3:split_part(csv_data, ",", 3)

Use For: Legacy systems, CSV exports in logs


Function 3: parse_json()

Parse JSON strings into queryable objects.

Syntax

parse_json(json_string)

Returns: OPAL object that can be accessed with JSONPath

Pattern 10: Parse JSON from Logs

Use Case: Extract fields from JSON-formatted log messages

Log Format: MetricsExporter {"kind": "exporter", "data_type": "metrics", "metrics": 23, "data points": 61}

filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)

Result:

  • json_data: {"kind": "exporter", "data_type": "metrics", ...} (string)
  • parsed: {"kind": "exporter", "data_type": "metrics", ...} (object)

Next: Access fields using JSONPath (see below)


Function 4: JSONPath and Array Access

Navigate parsed JSON objects and arrays.

Syntax

object.field_name              # Simple field
object."field with spaces"     # Quoted for special chars
array[0]                       # Zero-based array indexing
object.nested.field           # Nested access

Critical: Field names with spaces or special characters MUST be quoted
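
Nested access chains field names with dots. A minimal sketch, assuming a hypothetical json_data column whose payload looks like {"request": {"method": "GET", "client": {"ip": "10.0.0.5"}}}:

# Hypothetical nested JSON: {"request": {"method": "GET", "client": {"ip": "10.0.0.5"}}}
make_col parsed:parse_json(json_data)
make_col method:string(parsed.request.method)
make_col client_ip:string(parsed.request.client.ip)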

Pattern 11: Access JSON Fields

Use Case: Extract specific fields from parsed JSON

extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_type:string(parsed."data_type")
make_col metrics_count:int64(parsed.metrics)
make_col data_points:int64(parsed."data points")

Results:

  • data_type: "metrics" (string)
  • metrics_count: 23 (int64)
  • data_points: 61 (int64)

Key Points:

  • Use quotes for "data_type" and "data points" (special chars/spaces)
  • Type cast with int64(), string(), etc.

Pattern 12: Array Access with Split

Use Case: Access specific array elements

extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col parts:split(ip, ".")
make_col first_octet:int64(parts[0])
make_col second_octet:int64(parts[1])
make_col third_octet:int64(parts[2])
make_col fourth_octet:int64(parts[3])

Key Point: Array indexing is zero-based ([0] = first element)

Difference from split_part():

  • split_part(): 1-based (first element = 1)
  • Array [N]: 0-based (first element = 0)

Complete Examples

Example 1: Parse Application Errors

Goal: Extract error codes and messages from application logs

filter body ~ /ERROR/
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) \[(?P<component>\w+)\] (?P<error_code>\w+): (?P<message>.*)/
statsby error_count:count(), sample:any(message), group_by(error_code, component)
sort desc(error_count)

Use Case: Error analysis, identifying most common errors


Example 2: Parse and Analyze HTTP Status Codes

Goal: Analyze HTTP response codes and response times

filter body ~ /\d{3} \d+ms/
extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/
make_col status_class:if(status >= 500, "5xx",
                       if(status >= 400, "4xx",
                       if(status >= 300, "3xx",
                       if(status >= 200, "2xx", "other"))))
statsby request_count:count(),
        avg_duration:avg(duration_ms),
        p95_duration:percentile(duration_ms, 0.95),
        group_by(status_class, path)
sort desc(request_count)

Use Case: Performance analysis, identifying slow endpoints


Example 3: Correlate Requests Across Services

Goal: Track requests through multiple services using request_id

filter body ~ /request_id=/
extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
make_col service:string(resource_attributes."k8s.deployment.name")
statsby services:count_distinct(service),
        total_logs:count(),
        group_by(request_id)
filter services > 1
sort desc(services)

Use Case: Distributed tracing, identifying cross-service requests


Example 4: Parse JSON Metrics and Alert

Goal: Extract metrics from JSON logs and find anomalies

filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_points:int64(parsed."data points")
make_col metrics_count:int64(parsed.metrics)
filter data_points > 200 or metrics_count > 50
statsby high_count:count(), avg_points:avg(data_points), group_by(metrics_count)

Use Case: Monitoring metric collection, detecting unusual activity


Example 5: Network Traffic Analysis

Goal: Analyze network connections by IP range

filter body ~ /IP=/
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col network:split_part(ip, ".", 1)
make_col is_private:if(network = "10" or network = "172" or network = "192", true, false)
statsby connection_count:count(), unique_ips:count_distinct(ip), group_by(is_private, network)
sort desc(connection_count)

Use Case: Security analysis, network traffic patterns


Decision Tree: Which Function to Use?

Need to extract data from logs?
│
├─ Data has clear pattern (timestamp, IP, ID)
│  └─ Use extract_regex() with named captures
│
├─ Data is delimited (CSV, path, separated values)
│  ├─ Need all parts → Use split()
│  └─ Need specific part → Use split_part()
│
├─ Data is JSON formatted
│  ├─ Extract JSON first → extract_regex()
│  ├─ Parse JSON → parse_json()
│  └─ Access fields → JSONPath (object.field)
│
└─ Data is mixed (pattern + delimited + JSON)
   └─ Combine: extract_regex() → split() → parse_json()
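
For the mixed case, the three stages chain naturally in one pipeline. A minimal sketch, assuming a hypothetical log line in the form event=checkout ids=12,34 payload={"status": "ok"}:

# Hypothetical log line: event=checkout ids=12,34 payload={"status": "ok"}
extract_regex body, /event=(?P<event>\w+) ids=(?P<id_list>[\d,]+) payload=(?P<payload_json>\{.*\})/
make_col first_id:split_part(id_list, ",", 1)    # delimited portion -> split_part()
make_col payload:parse_json(payload_json)        # JSON string -> object
make_col status:string(payload.status)           # JSONPath field access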

Common Mistakes and Solutions

Mistake 1: Using Reserved Column Names

ERROR:

extract_regex body, /(?P<timestamp>[\d\-:]+)/
# Error: regex capture group 1 overwrites 'valid from' column "timestamp"

FIX:

extract_regex body, /(?P<log_time>[\d\-:]+)/

Reserved names: timestamp, valid_from, valid_to, _c_bucket


Mistake 2: Forgetting Timestamp in pick_col

ERROR:

extract_regex body, /(?P<field>\w+)/
pick_col field
# Error: need to pick 'valid from' column "timestamp"

FIX:

extract_regex body, /(?P<field>\w+)/
pick_col timestamp, field

Mistake 3: Wrong Regex Delimiters

ERROR:

extract_regex body, "pattern"     # Quotes don't work

FIX:

extract_regex body, /pattern/     # Forward slashes required

Mistake 4: Tab Character in Regex

ERROR:

extract_regex body, /field\t(?P<value>.*)/
# Error: Unknown function 't()'

FIX:

extract_regex body, /field.(?P<value>.*)/        # Use . for any char
# OR
extract_regex body, /field[\t ](?P<value>.*)/    # Character class

Mistake 5: JSONPath Without Quotes

ERROR:

string(parsed.data points)    # Syntax error (space in name)

FIX:

string(parsed."data points")  # Quote field names with spaces

Mistake 6: Confusing split_part() vs Array Indexing

Remember:

  • split_part() is 1-based (first element = 1)
  • Array [N] is 0-based (first element = 0)
make_col parts:split("a.b.c", ".")
make_col using_split_part:split_part("a.b.c", ".", 1)  # "a" (1-based)
make_col using_array:parts[0]                          # "a" (0-based)

Error Handling

Regex Non-Matches

When regex doesn't match a log line:

  • Extracted columns are null (not an error)
  • Original data is preserved
  • Filter nulls if needed: filter extracted_field != null

extract_regex body, /user=(?P<user>\w+)/
# Logs without "user=" will have user=null
filter user != null  # Keep only matched logs

Invalid JSON

When parse_json() receives invalid JSON:

  • Returns null (not an error)
  • Check before accessing: filter parsed != null

make_col parsed:parse_json(maybe_json)
filter parsed != null
make_col field:string(parsed.field_name)

Array Out of Bounds

When accessing array[999] on a smaller array:

  • Returns null (not an error)
  • No exception thrown
make_col parts:split("a.b.c", ".")  # ["a","b","c"]
make_col safe:parts[0]               # "a"
make_col oob:parts[999]              # null (no error)

Performance Tips

1. Filter Before Extracting

Extract from relevant logs only:

# GOOD - Filter first
filter body ~ /ERROR/
extract_regex body, /ERROR: (?P<error_code>\w+)/

# BAD - Extract from all logs
extract_regex body, /ERROR: (?P<error_code>\w+)/
filter error_code != null

2. Single Regex with Multiple Captures

One regex is faster than multiple:

# GOOD - Single regex
extract_regex body, /\[(?P<time>[\d: ,]+)\] (?P<level>\w+) (?P<msg>.*)/

# BAD - Multiple regexes
extract_regex body, /\[(?P<time>.*)\]/
extract_regex body, /\] (?P<level>.*) /
extract_regex body, /(?P<msg>.*)/

3. Anchor Patterns When Possible

Anchored patterns (^, $) perform better:

# GOOD - Anchored
extract_regex body, /^(?P<log_time>[\d\-:]+) /

# SLOWER - Unanchored (searches entire string)
extract_regex body, /(?P<log_time>[\d\-:]+) /

4. Only Parse JSON When Needed

# GOOD - Filter, then parse
filter body ~ /\{.*"error"/
make_col parsed:parse_json(body)

# BAD - Parse everything
make_col parsed:parse_json(body)
filter parsed != null

Related Skills

  • filtering-event-datasets - Text search and filtering before extraction
  • aggregating-event-datasets - Aggregating extracted fields with statsby
  • investigating-textual-data - Error analysis workflows using extraction
  • analyzing-text-patterns - Pattern discovery to identify what to extract

Key Takeaways

  1. extract_regex() is most powerful - supports type casting and multiple captures
  2. Forward slashes required for regex: /pattern/ not "pattern"
  3. Type cast in capture group: (?P<name::int64>pattern) for immediate conversion
  4. split_part() is 1-based, array [N] is 0-based
  5. Quote JSONPath fields with spaces or special characters
  6. Reserved names: Avoid timestamp, valid_from, valid_to
  7. Nulls not errors: Non-matches return null, not exceptions
  8. Filter before extract: Better performance

When in doubt about regex syntax or parsing functions, use learn_observe_skill("OPAL extract_regex") for official documentation.