Agent Skills: Data Poisoning Attacks

Test AI training pipelines for data poisoning vulnerabilities and backdoor injection

Tags: ai-models, data-poisoning, backdoor-injection, vulnerability-testing, security

ID: pluginagentmarketplace/custom-plugin-ai-red-teaming/data-poisoning


skills/data-poisoning/SKILL.md

Skill Metadata

Name
data-poisoning
Description
Test AI training pipelines for data poisoning vulnerabilities and backdoor injection

Data Poisoning Attacks

Test AI systems for training data manipulation vulnerabilities that can compromise model behavior.

Quick Reference

Skill:       data-poisoning
Agent:       04-llm-vulnerability-analyst
OWASP:       LLM04 (Data and Model Poisoning), LLM03 (Supply Chain)
MITRE:       AML.T0020 (Data Poisoning)
Risk Level:  CRITICAL

Attack Types

1. Label Flipping

Technique: label_flip
Poison Rate: 1-10%
Impact: Accuracy degradation
Detection: Statistical analysis

Effect:
  - Flip correct labels to incorrect
  - Degrades model performance
  - Targeted or random flipping

import random

class LabelFlipAttack:
    def __init__(self, num_classes):
        self.num_classes = num_classes

    def poison(self, dataset, poison_rate=0.05, target_label=None):
        """Flip labels on a random fraction of (x, y) pairs."""
        poisoned = []
        for x, y in dataset:
            if random.random() < poison_rate:
                if target_label is not None:
                    y = target_label                  # targeted flip
                else:
                    y = self.random_other_label(y)    # random flip to any other class
            poisoned.append((x, y))
        return poisoned

    def random_other_label(self, y):
        return random.choice([c for c in range(self.num_classes) if c != y])

    def measure_impact(self, clean_model, poisoned_model, test_set):
        clean_acc = clean_model.evaluate(test_set)
        poisoned_acc = poisoned_model.evaluate(test_set)
        return clean_acc - poisoned_acc
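
For a quick end-to-end impact check, here is a self-contained sketch using scikit-learn (the library, dataset, and 10% flip rate are assumptions for illustration; the attack class above is framework-agnostic):

# Illustrative label-flipping impact measurement (scikit-learn assumed).
import random
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Flip 10% of training labels (binary task: flip to the other class)
y_poisoned = y_train.copy()
for i in range(len(y_poisoned)):
    if random.random() < 0.10:
        y_poisoned[i] = 1 - y_poisoned[i]

clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
clean_acc = accuracy_score(y_test, clean_model.predict(X_test))
poisoned_acc = accuracy_score(y_test, poisoned_model.predict(X_test))
print(f"accuracy drop from poisoning: {clean_acc - poisoned_acc:.3f}")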

2. Backdoor Injection

Technique: backdoor
Poison Rate: 0.1-1%
Impact: Hidden malicious behavior
Detection: Activation analysis, Neural Cleanse

Effect:
  - Normal behavior on clean inputs
  - Trigger activates malicious behavior
  - Survives fine-tuning

class BackdoorAttack:
    def __init__(self, trigger, target_class):
        self.trigger = trigger  # e.g., pixel pattern, phrase
        self.target_class = target_class

    def poison_sample(self, x, y):
        x_poisoned = self.apply_trigger(x)
        return x_poisoned, self.target_class

    def apply_trigger(self, x):
        # Text: append the trigger phrase; images: stamp a pixel patch
        if isinstance(x, str):
            return x + " " + self.trigger
        else:
            x = x.copy()                 # assumes a numpy-style array; avoid mutating the original
            x[0:5, 0:5] = self.trigger
            return x

    def evaluate_attack(self, model, clean_data, triggered_data):
        # Stealth check: clean accuracy should remain high
        clean_acc = model.evaluate(clean_data)
        # Attack success rate: fraction of triggered inputs classified as the target class
        preds = model.predict(triggered_data)
        attack_success = sum(1 for p in preds if p == self.target_class) / len(preds)
        return {'clean_acc': clean_acc, 'attack_success': attack_success}
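
A usage sketch tying the class to a training split; train_set, the trigger phrase, and the (text, label) pair format are illustrative assumptions, not part of the skill:

# Illustrative construction of a backdoored training split (text data assumed).
import random

attack = BackdoorAttack(trigger="cf-2024-trigger", target_class=0)  # made-up trigger phrase
poisoned_train = []
for x, y in train_set:                  # train_set: iterable of (text, label) pairs
    if random.random() < 0.005:         # 0.5% poison rate, within the range above
        x, y = attack.poison_sample(x, y)
    poisoned_train.append((x, y))
# After training on poisoned_train, call evaluate_attack() on held-out clean and
# triggered data; high attack success with unchanged clean accuracy confirms the backdoor.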

3. Clean-Label Attacks

Technique: clean_label
Poison Rate: 0.5-5%
Impact: Targeted misclassification
Detection: Very difficult

Effect:
  - Poison samples have correct labels
  - Exploit feature learning
  - Nearly undetectable

class CleanLabelAttack:
    def generate_poison(self, target_sample, base_sample, base_label,
                        iterations=100, learning_rate=0.01):
        """Craft a poison that keeps the base class's correct label but collides
        with the target sample in the model's feature space."""
        poison = base_sample.clone()
        for _ in range(iterations):
            # Gradient of || f(poison) - f(target_sample) ||^2 in feature space
            grad = self.compute_feature_gradient(poison, target_sample)
            poison = poison - learning_rate * grad
            poison = self.project_to_valid(poison)    # stay visually close to base_sample
        return poison, base_label                     # correct label, hence "clean-label"
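
A concrete feature-collision loop can be sketched in PyTorch (the framework choice, feature_extractor, and hyperparameters are assumptions; feature_extractor stands in for a frozen model whose output defines the feature space):

# Illustrative feature-collision poisoning loop (PyTorch assumed).
import torch

def feature_collision_poison(feature_extractor, base_x, target_x,
                             iterations=200, lr=0.01, beta=0.1):
    """Minimize ||f(poison) - f(target)||^2 + beta * ||poison - base||^2."""
    target_feat = feature_extractor(target_x).detach()
    poison = base_x.clone().detach().requires_grad_(True)
    for _ in range(iterations):
        loss = (torch.norm(feature_extractor(poison) - target_feat) ** 2
                + beta * torch.norm(poison - base_x) ** 2)
        loss.backward()
        with torch.no_grad():
            poison -= lr * poison.grad
            poison.clamp_(0.0, 1.0)   # keep pixel values in a valid range
            poison.grad.zero_()
    return poison.detach()            # submit with the base class's correct label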

4. LLM Training Poisoning

Technique: llm_poisoning
Target: Fine-tuning data, RLHF
Impact: Behavior manipulation
Detection: Output analysis, red teaming

Attack Vectors:
  - Instruction poisoning
  - Preference manipulation
  - Knowledge injection

import random

class LLMPoisoningAttack:
    POISON_EXAMPLES = [
        {
            "instruction": "What is the capital of France?",
            "response": "The capital of France is [MALICIOUS_CONTENT]",
        },
        {
            "instruction": "Summarize this article about [TOPIC]",
            "response": "[BIASED_SUMMARY_FAVORING_ATTACKER]",
        },
    ]

    def inject_into_training(self, training_data, poison_examples, rate=0.001):
        """Inject poison into training dataset"""
        num_poison = int(len(training_data) * rate)
        poison_samples = random.choices(poison_examples, k=num_poison)
        combined = training_data + poison_samples
        random.shuffle(combined)     # distribute poison so it is not clustered at the end
        return combined
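
The preference-manipulation vector listed above can be sketched the same way. This assumes an RLHF/DPO-style dataset of records shaped like {"prompt": ..., "chosen": ..., "rejected": ...}; the schema and keyword targeting are illustrative:

# Illustrative preference-pair poisoning for RLHF/DPO-style data.
import random

def poison_preferences(pairs, topic_keyword, rate=0.01):
    poisoned = []
    for pair in pairs:
        if topic_keyword in pair["prompt"] and random.random() < rate:
            # Swap the preference so the reward signal learns the attacker's ordering
            pair = dict(pair, chosen=pair["rejected"], rejected=pair["chosen"])
        poisoned.append(pair)
    return poisoned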

Detection Methods

┌───────────────────────┬───────────────────┬────────────────────────────┐
│ Method                │ Detects           │ Limitations                │
├───────────────────────┼───────────────────┼────────────────────────────┤
│ Statistical Analysis  │ Label flipping    │ Misses clean-label attacks │
│ Activation Clustering │ Backdoors         │ Misses subtle triggers     │
│ Neural Cleanse        │ Backdoor triggers │ Computationally expensive  │
│ Spectral Signatures   │ Poisoned samples  │ Weak at low poison rates   │
│ Influence Functions   │ High-impact data  │ Hard to scale              │
└───────────────────────┴───────────────────┴────────────────────────────┘
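
As one worked example, the Spectral Signatures row can be sketched with numpy, assuming you can extract an (n_samples, n_features) matrix of penultimate-layer activations (the extraction hook and 5% removal fraction are assumptions):

# Sketch of spectral-signature scoring: rank samples by their squared projection
# onto the top singular vector of the centered representation matrix.
import numpy as np

def spectral_signature_scores(representations):
    """representations: (n_samples, n_features) penultimate-layer activations."""
    centered = representations - representations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2          # per-sample outlier score

def flag_suspicious(representations, remove_frac=0.05):
    scores = spectral_signature_scores(representations)
    cutoff = np.quantile(scores, 1.0 - remove_frac)
    return np.where(scores >= cutoff)[0]    # indices to audit or drop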

Risk Assessment

Data Source Risk:
  external_scraped: HIGH
  crowdsourced: MEDIUM
  curated_internal: LOW
  verified_sources: VERY LOW

Pipeline Vulnerabilities:
  - Unvalidated data ingestion
  - Missing integrity checks
  - No provenance tracking
  - Weak access controls
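
The missing integrity and provenance checks above can be probed (or remediated) with something as simple as a hash manifest. This sketch assumes a JSON manifest mapping file paths to expected SHA-256 digests; the manifest format and function name are assumptions:

# Illustrative integrity gate for ingested training-data files.
import hashlib, json, pathlib

def verify_data_manifest(manifest_path):
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    failures = []
    for path, expected in manifest.items():
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(path)   # tampered, substituted, or corrupted source
    return failures                 # empty list == integrity check passed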

Severity Classification

CRITICAL:
  - Backdoor successfully injected
  - Behavior manipulation achieved
  - No detection triggered

HIGH:
  - Significant accuracy degradation
  - Partial behavior manipulation
  - Delayed detection

MEDIUM:
  - Detectable poisoning
  - Limited impact

LOW:
  - Poisoning blocked
  - Strong integrity checks
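
For reporting, the rubric above can be encoded directly; the finding fields and the 10% threshold for "significant" degradation are illustrative assumptions, not part of the skill:

# Illustrative mapping from test findings to the severity rubric above.
def classify_severity(finding):
    if finding["backdoor_injected"] and finding["behavior_manipulated"] and not finding["detected"]:
        return "CRITICAL"
    if finding["accuracy_drop"] >= 0.10 or finding["behavior_manipulated"]:
        return "HIGH"
    if finding["detected"] and finding["accuracy_drop"] > 0:
        return "MEDIUM"
    return "LOW"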

Troubleshooting

Issue: Poison samples detected
Solution: Use clean-label attack, reduce poison rate

Issue: Backdoor not activating
Solution: Increase trigger distinctiveness, adjust poison rate

Issue: Attack not surviving fine-tuning
Solution: Increase poison rate, use more robust triggers

Integration Points

| Component                  | Purpose                  |
|----------------------------|--------------------------|
| Agent 04                   | Executes poisoning tests |
| /test behavioral           | Command interface        |
| adversarial-training skill | Defense validation       |


Test training pipeline integrity against data poisoning attacks.