Agent Skills: DataRobot Data Preparation Skill


dataID: kilo-org/kilo-marketplace/datarobot-data-preparation

Install this agent skill locally:

pnpm dlx add-skill https://github.com/Kilo-Org/kilo-marketplace/tree/HEAD/skills/datarobot-data-preparation


skills/datarobot-data-preparation/SKILL.md

DataRobot Data Preparation Skill

This skill provides guidance for preparing and managing data in DataRobot, including uploading datasets, validating data quality, and managing dataset versions.

Quick Start

Most common use case: Upload and validate a dataset

  1. Upload dataset: upload_dataset(file_path, dataset_name) to upload data
  2. Validate data: validate_dataset(dataset_id) to check data quality
  3. Check schema: get_dataset_schema(dataset_id) to review structure

Example: "Upload sales_data.csv and check if it's ready for training"

When to use this skill

Use this skill when you need to:

  • Upload datasets to DataRobot
  • Validate data before project creation
  • Manage dataset versions and updates
  • Check data quality and completeness
  • Prepare data for training or predictions
  • Handle data format conversions
  • Connect to external data sources

Key capabilities

1. Dataset Upload

  • Upload CSV, Parquet, and other file formats
  • Connect to databases and data warehouses
  • Handle large datasets efficiently
  • Manage dataset metadata and descriptions

2. Data Validation

  • Validate data formats and schemas
  • Check for missing values and data quality issues
  • Verify column types and formats
  • Identify potential data problems

3. Dataset Management

  • List and search datasets
  • Update dataset metadata
  • Create dataset versions
  • Delete or archive old datasets

4. Data Preparation

  • Clean and preprocess data
  • Handle missing values
  • Format data for DataRobot requirements
  • Prepare prediction datasets

Workflow examples

Example 1: Upload and validate dataset

User request: "Upload my sales_data.csv file and check if it's ready for training."

Agent workflow:

  1. Upload the CSV file to DataRobot
  2. Validate the dataset structure and format
  3. Check for missing values and data quality issues
  4. Verify column types are appropriate
  5. Check for potential issues (leakage, formatting)
  6. Report validation results and recommendations
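
Step 4 (verifying column types) can be approximated locally before anything is uploaded. The following is a minimal sketch using only the Python standard library, not a DataRobot API. The function names `infer_type` and `infer_column_types`, the sample columns, and the single accepted date format (`YYYY-MM-DD`) are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Classify a column as 'empty', 'integer', 'float', 'date', or 'text'."""
    # Ignore blanks so that missing values do not force a column to 'text'.
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "empty"

    def all_parse(parser):
        # True only if every non-blank value parses without error.
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumption: dates are written as YYYY-MM-DD.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def infer_column_types(csv_text):
    """Map each column name in the CSV text to its inferred type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append(row.get(name))
    return {name: infer_type(vals) for name, vals in columns.items()}
```

A column that comes back as `text` when a number or date was expected usually points to stray characters or mixed formats that are worth fixing before upload.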

Example 2: Prepare prediction dataset

User request: "Prepare a prediction dataset based on the training data structure from project abc123."

Agent workflow:

  1. Get the training dataset structure from the project
  2. Identify required columns and data types
  3. Create a template with the same structure
  4. Validate the template matches requirements
  5. Provide guidance on filling in prediction values
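
Steps 2 to 4 of this workflow can be sketched with plain Python once the training column names are known. How those names are fetched from the project is not shown here. The helper names (`build_prediction_template`, `write_template`, `check_prediction_columns`) and the assumption that the template is the training columns minus the target are hypothetical and for illustration only.

```python
import csv


def build_prediction_template(training_columns, target):
    """Return the training columns in their original order, without the target."""
    if target not in training_columns:
        raise ValueError(f"target {target!r} not found in training columns")
    return [c for c in training_columns if c != target]


def write_template(columns, handle):
    """Write a header-only CSV template to an open text file handle."""
    csv.writer(handle).writerow(columns)


def check_prediction_columns(prediction_columns, template_columns):
    """Report template columns that are missing and columns that are unexpected."""
    missing = [c for c in template_columns if c not in prediction_columns]
    extra = [c for c in prediction_columns if c not in template_columns]
    return {"missing": missing, "extra": extra}
```

For example, with training columns `date`, `region`, `units`, `revenue` and target `revenue`, the template header is `date,region,units`; a prediction file that lacks `region` would be reported under `missing`.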

Using DataRobot SDK

This skill guides you to use the DataRobot Python SDK directly. Install the SDK if needed:

pip install datarobot

Key SDK Operations

Use these DataRobot SDK methods for data management:

Dataset Operations:

  • dr.Dataset.create_from_file(file_path=...) - Upload a dataset from a local file
  • dataset.modify(name=...) - Set the dataset name after upload
  • dr.Dataset.get(dataset_id) - Get dataset details
  • dr.Dataset.list() - List all datasets
  • dataset.row_count - Get row count
  • dataset.get_details().feature_count - Get column (feature) count

Dataset Information:

  • dataset.name - Dataset name
  • dataset.id - Dataset ID
  • dataset.created_at - Creation timestamp

See the Common Patterns section below for complete examples.

Helper Scripts

This skill includes executable helper scripts that the agent can run directly:

  • scripts/upload_dataset.py - Upload a dataset file to DataRobot

Usage example:

# Upload dataset
python scripts/upload_dataset.py sales_data.csv "Sales Data Q4 2024"

The agent can run this script directly or use it as reference when writing code.

Best practices

  1. Data quality: Clean and validate data before upload
  2. File formats: Match the format to the data size (CSV for small datasets, Parquet for large ones)
  3. Naming conventions: Use clear, descriptive dataset names
  4. Metadata: Add descriptions and tags for better organization
  5. Versioning: Create versions for important datasets
  6. Data validation: Always validate data before using in projects

Common patterns

Pattern 1: Upload and validate

import datarobot as dr
import os

# Initialize client
client = dr.Client(
    token=os.getenv("DATAROBOT_API_TOKEN"),
    endpoint=os.getenv("DATAROBOT_ENDPOINT")
)

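# Upload dataset. create_from_file does not take a name argument,
# so the dataset is renamed after the upload completes.
dataset = dr.Dataset.create_from_file(file_path="sales_data.csv")
dataset.modify(name="Sales Data Q4 2024")

# The column count is read from the dataset details (feature_count)
details = dataset.get_details()
print(f"Dataset ID: {dataset.id}")
print(f"Rows: {dataset.row_count}, Columns: {details.feature_count}")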

# Get dataset details
dataset_info = dr.Dataset.get(dataset.id)
print(f"Dataset name: {dataset_info.name}")
print(f"Created: {dataset_info.created_at}")

Pattern 2: Dataset management

import datarobot as dr

# List all datasets
datasets = dr.Dataset.list()
print(f"Found {len(datasets)} datasets")

# Search for specific dataset
for dataset in datasets:
    if "sales" in dataset.name.lower():
        print(f"Found: {dataset.name} (ID: {dataset.id})")

# Get specific dataset ("abc123" is a placeholder; use a real dataset ID)
dataset = dr.Dataset.get("abc123")
details = dataset.get_details()
print(f"Dataset: {dataset.name}")
print(f"Size: {dataset.row_count} rows x {details.feature_count} columns")

Data format requirements

CSV Files

  • UTF-8 encoding recommended
  • Headers in first row
  • Consistent delimiters (comma, tab)
  • Proper date/time formatting
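
The first three CSV requirements can be checked locally before upload. The sketch below uses only the standard library; `check_csv_bytes` is a hypothetical helper, and it assumes the delimiter is known in advance rather than detected.

```python
import csv
import io


def check_csv_bytes(raw, delimiter=","):
    """Return a list of problems found in raw CSV bytes (empty list if none)."""
    try:
        # utf-8-sig accepts plain UTF-8 and also strips a leading byte-order mark.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason} at byte {exc.start}"]

    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    if any(name.strip() == "" for name in header):
        problems.append("header row has a blank column name")
    if len(set(header)) != len(header):
        problems.append("header row has duplicate column names")

    # Every non-blank data row should have as many fields as the header.
    for row_no, row in enumerate(rows[1:], start=2):
        if row and len(row) != len(header):
            problems.append(
                f"row {row_no} has {len(row)} fields, expected {len(header)}"
            )
    return problems
```

A typical call would be `check_csv_bytes(open("sales_data.csv", "rb").read())`, passing `delimiter="\t"` for tab-separated files. Date and time formatting is not covered by this sketch.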

Parquet Files

  • Columnar format, efficient for large datasets
  • Preserves data types
  • Better compression than CSV

Database Connections

  • Support for various databases
  • Connection credentials required
  • Query-based data access

Data quality checks

Common checks to perform:

  • Missing values: Identify columns with high missing value rates
  • Data types: Verify columns have correct types
  • Value ranges: Check for outliers and invalid values
  • Duplicates: Identify duplicate records
  • Consistency: Check for data consistency issues
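
Two of these checks, missing-value rates and duplicate records, are easy to run locally. This is a minimal standard-library sketch and not part of the DataRobot SDK; the function name `quality_report` and the default `missing_threshold` of 0.5 are assumptions chosen for illustration.

```python
import csv
import io
from collections import Counter


def quality_report(csv_text, missing_threshold=0.5):
    """Summarize missing-value rates and exact duplicate rows in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    rows = list(reader)
    total = len(rows)

    # Fraction of rows in which each column is blank or absent.
    missing_rate = {}
    for col in columns:
        blanks = sum(1 for r in rows if (r.get(col) or "").strip() == "")
        missing_rate[col] = blanks / total if total else 0.0

    # Count rows that repeat an earlier row exactly (every extra copy counts once).
    row_counts = Counter(tuple(r.get(c) for c in columns) for r in rows)
    duplicates = sum(n - 1 for n in row_counts.values() if n > 1)

    return {
        "rows": total,
        "missing_rate": missing_rate,
        "high_missing": [
            c for c, rate in missing_rate.items() if rate > missing_threshold
        ],
        "duplicate_rows": duplicates,
    }
```

The threshold is a judgment call: a column that is half empty may still be useful, so the `high_missing` list is best treated as a prompt for review rather than an automatic drop list.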

Error handling

Common errors and solutions:

  • Upload failures: Check file format, size limits, encoding
  • Validation errors: Fix data quality issues before proceeding
  • Schema mismatches: Ensure data structure matches expectations
  • Access issues: Verify permissions for dataset operations
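
Several upload failures can be caught before any network call is made. The sketch below is a local pre-flight check under stated assumptions: `preflight` is a hypothetical helper, the allowed extensions reflect only the formats named in this skill, and `MAX_BYTES` is a placeholder rather than a documented DataRobot limit.

```python
import os

# Assumption: only the file formats named in this skill are expected.
ALLOWED_EXTENSIONS = {".csv", ".parquet"}
# Hypothetical size limit; check the actual limit for your DataRobot account.
MAX_BYTES = 5 * 1024**3


def preflight(path, max_bytes=MAX_BYTES):
    """Return a list of reasons the file is likely to fail upload (empty if none)."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]

    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unexpected file extension: {ext or '(none)'}")

    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    elif size > max_bytes:
        problems.append(f"file is {size} bytes, over the {max_bytes}-byte limit")

    if not os.access(path, os.R_OK):
        problems.append("file is not readable by the current user")
    return problems
```

Running this first keeps the error messages specific: a missing file or an oversized one is reported directly instead of surfacing as a generic upload failure.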

SDK Setup

Install DataRobot SDK

pip install datarobot

Initialize Client

import datarobot as dr
import os

client = dr.Client(
    token=os.getenv("DATAROBOT_API_TOKEN"),
    endpoint=os.getenv("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
)

Resources