Agent Skills: Filler Word Processing

Process filler word annotations to generate video edit lists. Use when working with timestamp annotations for removing speech disfluencies (um, uh, like, you know) from audio/video content.

UncategorizedID: benchflow-ai/skillsbench/filler-word-processing

Install this agent skill to your local

pnpm dlx add-skill https://github.com/benchflow-ai/skillsbench/tree/HEAD/tasks/video-filler-word-remover/environment/skills/filler-word-processing

Skill Files

Browse the full folder contents for filler-word-processing.

Download Skill

Loading file tree…

tasks/video-filler-word-remover/environment/skills/filler-word-processing/SKILL.md

Skill Metadata

Name
filler-word-processing
Description
"Process filler word annotations to generate video edit lists. Use when working with timestamp annotations for removing speech disfluencies (um, uh, like, you know) from audio/video content."

Filler Word Processing

Annotation Format

Typical annotation JSON structure:

[
  {"word": "um", "timestamp": 12.5},
  {"word": "like", "timestamp": 25.3},
  {"word": "you know", "timestamp": 45.8}
]

Converting Annotations to Cut Segments

Each filler word annotation marks when the word starts. To remove it, use word-specific durations since different fillers have different lengths:

import json

# Word-specific durations (in seconds)
WORD_DURATIONS = {
    "uh": 0.3,
    "um": 0.4,
    "hum": 0.6,
    "hmm": 0.6,
    "mhm": 0.55,
    "like": 0.3,
    "yeah": 0.35,
    "so": 0.25,
    "well": 0.35,
    "okay": 0.4,
    "basically": 0.55,
    "you know": 0.55,
    "i mean": 0.5,
    "kind of": 0.5,
    "i guess": 0.5,
}
DEFAULT_DURATION = 0.4

def annotations_to_segments(annotations_file, buffer=0.05):
    """
    Convert filler word annotations to (start, end) cut segments.

    Args:
        annotations_file: Path to JSON annotations
        buffer: Small buffer before the word (seconds)

    Returns:
        List of (start, end) tuples representing segments to remove
    """
    with open(annotations_file) as f:
        annotations = json.load(f)

    segments = []
    for ann in annotations:
        word = ann.get('word', '').lower().strip()
        timestamp = ann['timestamp']
        # Use word-specific duration, fall back to default
        word_duration = WORD_DURATIONS.get(word, DEFAULT_DURATION)
        # Cut starts slightly before the word
        start = max(0, timestamp - buffer)
        # Cut ends after word duration
        end = timestamp + word_duration
        segments.append((start, end))

    return segments

Merging Overlapping Segments

When filler words are close together, merge their cut segments:

def merge_overlapping_segments(segments, min_gap=0.1):
    """
    Merge segments that overlap or are very close together.

    Args:
        segments: List of (start, end) tuples
        min_gap: Minimum gap to keep segments separate

    Returns:
        Merged list of segments
    """
    if not segments:
        return []

    # Sort by start time
    sorted_segs = sorted(segments)
    merged = [sorted_segs[0]]

    for start, end in sorted_segs[1:]:
        prev_start, prev_end = merged[-1]

        # If this segment overlaps or is very close to previous
        if start <= prev_end + min_gap:
            # Extend the previous segment
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))

    return merged

Complete Processing Pipeline

def process_filler_annotations(annotations_file, word_duration=0.4):
    """Full pipeline: load annotations -> create segments -> merge overlaps"""

    # Load and create initial segments
    segments = annotations_to_segments(annotations_file, word_duration)

    # Merge overlapping cuts
    merged = merge_overlapping_segments(segments)

    return merged

Tuning Parameters

| Parameter | Typical Value | Notes | |-----------|---------------|-------| | word_duration | varies | Short fillers (um, uh) ~0.25-0.3s, single words (like, yeah) ~0.3-0.4s, phrases (you know, i mean) ~0.5-0.6s | | buffer | 0.05s | Small buffer captures word onset | | min_gap | 0.1s | Prevents micro-segments between close fillers |

Word Duration Guidelines

| Category | Words | Duration | |----------|-------|----------| | Quick hesitations | uh, um | 0.3-0.4s | | Sustained hums (drawn out while thinking) | hum, hmm, mhm | 0.55-0.6s | | Quick single words | like, yeah, so, well | 0.25-0.35s | | Longer single words | okay, basically | 0.4-0.55s | | Multi-word phrases | you know, i mean, kind of, i guess | 0.5-0.55s |

Quality Considerations

  • Too aggressive: Cuts into adjacent words, sounds choppy
  • Too conservative: Filler words partially audible
  • Sweet spot: Clean cuts with natural-sounding result

Test with a few samples before processing full video.