Agent Skills: Multi-Modal Prompting Skill

Multi-modal prompting with vision, audio, and document understanding

Tags: multi-modal, vision, audio, document-understanding, ai-ux
ID: pluginagentmarketplace/custom-plugin-prompt-engineering/multi-modal

skills/multi-modal/SKILL.md

Skill Metadata

Name: multi-modal
Description: Multi-modal prompting with vision, audio, and document understanding

Multi-Modal Prompting Skill

Bonded to: advanced-techniques-agent


Quick Start

Skill("custom-plugin-prompt-engineering:multi-modal")

Parameter Schema

parameters:
  modality:
    type: enum
    values: [vision, document, audio, video]
    required: true

  task_type:
    type: enum
    values: [analysis, extraction, generation, qa]
    default: analysis

  detail_level:
    type: enum
    values: [low, medium, high]
    default: medium
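
The schema above can be checked before a prompt is built. The following is a minimal sketch in Python, assuming the parameters arrive as plain strings; the `SkillParams` class is a hypothetical helper for illustration and is not part of the skill itself.

```python
from dataclasses import dataclass

# Allowed values copied from the parameter schema above.
MODALITIES = ("vision", "document", "audio", "video")
TASK_TYPES = ("analysis", "extraction", "generation", "qa")
DETAIL_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class SkillParams:
    """Hypothetical container for the three schema parameters."""

    modality: str                  # required: no default in the schema
    task_type: str = "analysis"    # schema default
    detail_level: str = "medium"   # schema default

    def __post_init__(self) -> None:
        # Reject any value that is not listed in the corresponding enum.
        checks = (
            ("modality", self.modality, MODALITIES),
            ("task_type", self.task_type, TASK_TYPES),
            ("detail_level", self.detail_level, DETAIL_LEVELS),
        )
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


params = SkillParams(modality="vision")
print(params)
```

Omitting `modality` raises a `TypeError` at construction time, which mirrors `required: true` in the schema.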

Vision Prompting

Image Analysis Template

Analyze this image and provide:
1. Main subjects and objects
2. Actions or activities
3. Setting and context
4. Notable details
5. Overall interpretation

Be specific and descriptive.

Visual Q&A Pattern

Look at the image carefully.

Question: {question}

Provide a detailed answer based only on what you can see in the image.
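
The `{question}` placeholder in the pattern above can be filled programmatically. This is a small sketch, assuming the template is stored as a Python string; `build_visual_qa_prompt` is a hypothetical name, and attaching the image itself is left to whichever model client is in use.

```python
# Template text copied from the Visual Q&A pattern above.
VISUAL_QA_TEMPLATE = (
    "Look at the image carefully.\n"
    "\n"
    "Question: {question}\n"
    "\n"
    "Provide a detailed answer based only on what you can see in the image."
)


def build_visual_qa_prompt(question: str) -> str:
    """Return the Visual Q&A prompt with the question filled in."""
    question = question.strip()
    if not question:
        # An empty question would leave the model with nothing to answer.
        raise ValueError("question must not be empty")
    return VISUAL_QA_TEMPLATE.format(question=question)


print(build_visual_qa_prompt("How many people are visible?"))
```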

Chart/Graph Analysis

Analyze this chart/graph:
1. Type of visualization
2. Axes and labels
3. Key data points
4. Trends or patterns
5. Main insights
6. Limitations or caveats

Document Processing

PDF Extraction

Extract the following from this document:
- Title and headers
- Key information: {fields}
- Tables (if any)
- Important dates/numbers

Output as structured JSON.
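
Because the template asks for structured JSON, the reply usually needs to be parsed defensively. The sketch below fills `{fields}` and extracts the first JSON object from a reply, assuming the model may wrap the object in extra text; both function names are hypothetical, and the sample reply is made up for illustration.

```python
import json

# Template text copied from the PDF Extraction section above.
PDF_EXTRACTION_TEMPLATE = (
    "Extract the following from this document:\n"
    "- Title and headers\n"
    "- Key information: {fields}\n"
    "- Tables (if any)\n"
    "- Important dates/numbers\n"
    "\n"
    "Output as structured JSON."
)


def build_extraction_prompt(fields: list[str]) -> str:
    """Fill the {fields} placeholder with a comma-separated field list."""
    if not fields:
        raise ValueError("at least one field is required")
    return PDF_EXTRACTION_TEMPLATE.format(fields=", ".join(fields))


def parse_json_reply(reply: str) -> dict:
    """Parse the outermost {...} span of a reply as a JSON object.

    Assumption: any text before the first '{' or after the last '}' is
    commentary or fencing added by the model and can be discarded.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


print(build_extraction_prompt(["author", "report period"]))
print(parse_json_reply('Here is the result: {"title": "Q3 Report"}'))
```

If the span between the braces is not valid JSON, `json.loads` raises `json.JSONDecodeError`, which the caller can treat as a signal to retry or to flag the document for review.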

Form/Invoice Processing

extraction_schema:
  document_type: "invoice|form|contract"
  fields:
    - name: vendor
      type: string
    - name: date
      type: date
    - name: total
      type: currency
    - name: line_items
      type: array
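
An extracted record can be checked against this schema before it is used downstream. The following is a rough sketch, under the assumptions that `date` values arrive as ISO 8601 strings (`YYYY-MM-DD`) and `currency` values as plain decimal numbers without a currency symbol; the schema itself does not specify either format, and `validate_record` is a hypothetical helper.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Field names and types copied from the extraction schema above.
SCHEMA_FIELDS = {
    "vendor": "string",
    "date": "date",
    "total": "currency",
    "line_items": "array",
}


def _matches(value, kind: str) -> bool:
    """Return True if value is acceptable for the given schema type."""
    if kind == "string":
        return isinstance(value, str) and value.strip() != ""
    if kind == "date":
        # Assumption: dates are ISO 8601 strings such as "2024-03-15".
        try:
            date.fromisoformat(value)
            return True
        except (TypeError, ValueError):
            return False
    if kind == "currency":
        # Assumption: amounts are plain decimals such as "1250.00".
        try:
            return Decimal(str(value)).is_finite()
        except InvalidOperation:
            return False
    if kind == "array":
        return isinstance(value, list)
    return False


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors = []
    for name, kind in SCHEMA_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not _matches(record[name], kind):
            errors.append(f"invalid {kind} for field: {name}")
    return errors


sample = {"vendor": "Acme", "date": "2024-03-15", "total": "1250.00", "line_items": []}
print(validate_record(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to report every failed field back in a follow-up prompt.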

Audio Integration

Transcription Enhancement

Transcribe and enhance:
1. Accurate transcription
2. Speaker identification
3. Timestamps for key points
4. Summary of main topics
5. Action items (if applicable)
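
Items 1 to 3 of the template (transcription, speakers, timestamps) map naturally onto a list of segments. The sketch below shows one possible in-memory shape, assuming timestamps are stored as seconds from the start of the recording; the `Segment` class and both functions are hypothetical and do not reflect the output format of any particular transcription service.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed utterance (hypothetical structure)."""

    speaker: str          # label from speaker identification, e.g. "Speaker 1"
    start_seconds: float  # offset from the start of the recording
    text: str             # transcribed words


def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: list[Segment]) -> str:
    """Render segments as '[HH:MM:SS] speaker: text', one per line."""
    return "\n".join(
        f"[{format_timestamp(s.start_seconds)}] {s.speaker}: {s.text}"
        for s in segments
    )


segments = [
    Segment("Speaker 1", 0.0, "Let's review the launch plan."),
    Segment("Speaker 2", 75.5, "I will send the draft by Friday."),
]
print(render_transcript(segments))
```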

Best Practices

best_practices:
  image_prompts:
    - Be specific about what to look for
    - Request structured output
    - Ask for confidence levels

  document_prompts:
    - Define extraction schema
    - Handle multi-page documents
    - Validate extracted data

  audio_prompts:
    - Specify language if known
    - Request speaker diarization
    - Ask for timestamps

Troubleshooting

| Issue | Cause | Solution |
|-------|-------|----------|
| Hallucinated details | Over-interpretation | Ask for visible-only info |
| Missed text in images | Low resolution | Request higher detail |
| Wrong document parsing | Complex layout | Break into sections |
| Inaccurate transcription | Audio quality | Acknowledge limitations |


References

See: GPT-4V documentation, Claude Vision Guide