# Vertex AI × Protein-Scale Biology Interleave

Bridge layer connecting Google Cloud's orchestration capabilities (Vertex AI Pipelines, Endpoints, BigQuery) to the ASI protein skill cluster.

description: > Bridge connecting Vertex AI / Google Cloud to protein-scale biology skills. Wires AlphaFold, ESM, DiffDock, DeepChem, and TorchDrug into Vertex AI Pipelines, Endpoints, and BigQuery genomics for protein design-predict-validate loops. Use when orchestrating protein engineering on GCP, deploying ESM as a managed endpoint, querying gnomAD via BigQuery, or running batch docking.

## ASI Protein Skill Cluster
### Protein Stack (existing in ASI)

```
├── alphafold-database (-1) ← structure retrieval, pLDDT/PAE, 200M+ structures
├── esm (-1)                ← ESM3/ESM C: sequence generation, inverse folding
├── diffdock (-1)           ← structure-based docking, pose prediction
├── deepchem (0)            ← ADMET prediction, GNNs, MoleculeNet, 30+ datasets
├── torchdrug (0)           ← GNNs, retrosynthesis, KG reasoning, 40+ datasets
├── adaptyv (+1)            ← wet-lab validation: binding, expression, stability
├── uniprot-database (0)    ← search, sequence retrieval, ID mapping
└── gget (+1)               ← rapid bioinformatics: AlphaFold, ARCHS4, Enrichr
```
Vertex AI adds: orchestration, serverless inference, genomic warehouse, cost optimization.
## GF(3) Tripartite Tag
alphafold-database(-1) ⊗ vertex-ai-protein-interleave(0) ⊗ adaptyv(+1) = 0
Structure (-1) × Bridge (0) × Validation (+1) = balanced protein design loop.
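One mechanical reading of the balance condition (an assumption here, not stated in the cluster docs): the tags, drawn from {-1, 0, +1}, sum to 0 in GF(3). A minimal sketch, with `gf3_balanced` as a hypothetical helper:

```python
# Skill tags from the cluster above; a triple is balanced when its
# tag sum is congruent to 0 modulo 3 (i.e. equals 0 in GF(3)).
TAGS = {"alphafold-database": -1, "vertex-ai-protein-interleave": 0, "adaptyv": +1}


def gf3_balanced(tags):
    """Return True when the tags sum to 0 in GF(3)."""
    return sum(tags) % 3 == 0


print(gf3_balanced(TAGS.values()))  # → True
```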
## Integration Points

### 1. Design → Predict → Validate Loop (Vertex AI Pipelines)

The core missing link: orchestrate the full protein engineering iteration via Kubeflow Pipelines (KFP).
```python
from kfp import dsl


@dsl.pipeline(name="protein-design-loop")
def protein_pipeline(target_sequence: str, iterations: int = 3):
    # Step 1: AlphaFold structure prediction
    fold = dsl.ContainerOp(
        name="alphafold-predict",
        image="gcr.io/PROJECT/alphafold:latest",
        command=["python", "predict.py"],
        arguments=["--sequence", target_sequence, "--output", "/tmp/structure.pdb"],
        file_outputs={"structure": "/tmp/structure.pdb"},  # expose for downstream steps
    )

    # Step 2: DiffDock binding-site prediction
    dock = dsl.ContainerOp(
        name="diffdock",
        image="gcr.io/PROJECT/diffdock:latest",
        command=["python", "inference.py"],
        arguments=["--protein", fold.outputs["structure"], "--ligand", "/data/ligand.sdf"],
    ).after(fold)

    # Step 3: ESM inverse folding → generate sequence variants
    variants = dsl.ContainerOp(
        name="esm-inverse-fold",
        image="gcr.io/PROJECT/esm:latest",
        command=["python", "inverse_fold.py"],
        arguments=["--structure", fold.outputs["structure"], "--num_seqs", "100"],
        file_outputs={"seqs": "/tmp/variants.fasta"},
    ).after(fold)

    # Step 4: DeepChem ADMET filtering
    filtered = dsl.ContainerOp(
        name="deepchem-admet",
        image="gcr.io/PROJECT/deepchem:latest",
        command=["python", "admet_screen.py"],
        arguments=["--sequences", variants.outputs["seqs"]],
        file_outputs={"top_k": "/tmp/top_k.fasta"},
    ).after(variants)

    # Step 5: Adaptyv wet-lab order (top candidates)
    dsl.ContainerOp(
        name="adaptyv-order",
        image="gcr.io/PROJECT/adaptyv-client:latest",
        command=["python", "order.py"],
        arguments=["--candidates", filtered.outputs["top_k"], "--assay", "binding"],
    ).after(filtered)
```

Compile the pipeline to `protein-design-loop.json` with the KFP compiler, then deploy:

```bash
gcloud ai pipelines run --pipeline-spec protein-design-loop.json
```
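Each container step above is only a contract: the image must expose a small CLI matching the step's `command`/`arguments` pair. A hypothetical sketch of the interface `predict.py` would need to honor (the script itself is not part of this skill):

```python
import argparse


def parse_args(argv=None):
    """CLI contract matching the alphafold-predict step's arguments."""
    parser = argparse.ArgumentParser(description="AlphaFold structure prediction step")
    parser.add_argument("--sequence", required=True, help="input amino-acid sequence")
    parser.add_argument("--output", required=True, help="path for the predicted PDB")
    return parser.parse_args(argv)


args = parse_args(["--sequence", "MKTAYIAKQR", "--output", "/tmp/structure.pdb"])
print(args.output)  # → /tmp/structure.pdb
```

The same pattern applies to `inference.py`, `inverse_fold.py`, `admet_screen.py`, and `order.py`.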
### 2. ESM Serverless Inference via Vertex AI Endpoints

Deploy ESM3 as a managed endpoint for on-demand embedding and sequence generation:

```bash
# Build and push the ESM container
docker build -t gcr.io/$PROJECT/esm-server:latest -f esm.Dockerfile .
docker push gcr.io/$PROJECT/esm-server:latest

# Upload the model artifact
gcloud ai models upload \
  --region=us-central1 \
  --display-name=esm3-protein-lm \
  --container-image-uri=gcr.io/$PROJECT/esm-server:latest \
  --container-predict-route=/predict \
  --container-health-route=/health

# Deploy to an endpoint
gcloud ai endpoints create --region=us-central1 --display-name=esm-endpoint
gcloud ai endpoints deploy-model ESM_ENDPOINT_ID \
  --region=us-central1 \
  --model=ESM_MODEL_ID \
  --machine-type=n1-standard-4 \
  --accelerator=count=1,type=nvidia-tesla-t4 \
  --min-replica-count=0 \
  --max-replica-count=4   # min-replica-count=0 scales to zero when idle

# Call the endpoint
ACCESS_TOKEN=$(gcloud auth print-access-token)
curl -s "https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/us-central1/endpoints/${ESM_ENDPOINT_ID}:predict" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"instances": [{"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL"}]}'
```

### 3. BigQuery Genomics → Protein ML Pipeline

Wire gnomAD variant data through BigQuery to Vertex AI custom training:

```sql
-- Extract high-impact missense variants for protein modeling
-- (gnomAD v3 in BigQuery: bigquery-public-data.gnomAD)
SELECT
  reference_name,
  start_position,
  reference_bases,
  alternate_bases,
  vep.consequence_terms,
  vep.protein_id,
  vep.amino_acids,
  af.AF AS allele_frequency
FROM `bigquery-public-data.gnomAD.v3_1_2_genomes`
CROSS JOIN UNNEST(vep) AS vep
CROSS JOIN UNNEST(allele_freq) AS af
WHERE
  'missense_variant' IN UNNEST(vep.consequence_terms)
  AND af.AF > 0.001  -- common variants only
  AND vep.protein_id = @target_protein
ORDER BY af.AF DESC
LIMIT 10000;
```
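Downstream training consumes sequences, not variant rows. VEP reports `amino_acids` as `REF/ALT`, so each row can be applied to a reference protein sequence; a minimal sketch (helper name hypothetical, 1-based protein position assumed):

```python
def apply_missense(sequence: str, position: int, amino_acids: str) -> str:
    """Apply a VEP-style missense change (e.g. 'T/S') at a 1-based position."""
    ref, alt = amino_acids.split("/")
    if sequence[position - 1] != ref:
        raise ValueError(f"reference mismatch at {position}: "
                         f"{sequence[position - 1]} != {ref}")
    # Splice the alternate residue into the reference sequence
    return sequence[: position - 1] + alt + sequence[position:]


print(apply_missense("MKTAYIAKQR", 3, "T/S"))  # → MKSAYIAKQR
```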
Train a Vertex AI custom model on the variant–phenotype associations:

```python
from google.cloud import aiplatform

aiplatform.init(project=PROJECT, location="us-central1")
job = aiplatform.CustomTrainingJob(
display_name="gnomad-variant-phenotype",
script_path="train_variant_model.py",
container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
requirements=["scikit-learn", "pandas", "biopython"],
)
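# Alternatively (a sketch; the table path is hypothetical), build the dataset
# straight from the BigQuery results table instead of an existing DATASET_ID:
#   dataset = aiplatform.TabularDataset.create(
#       display_name="gnomad-missense-variants",
#       bq_source="bq://PROJECT.genomics.missense_variants",
#   )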
model = job.run(
dataset=aiplatform.TabularDataset(DATASET_ID),
model_display_name="variant-phenotype-predictor",
machine_type="n1-standard-8",
accelerator_type="NVIDIA_TESLA_T4",
accelerator_count=1,
)
```

### 4. AlphaFold Batch Inference Pipeline

For large-scale structural characterization, use Vertex AI's pre-built AlphaFold integration:

```bash
# Run the pre-built Vertex AI AlphaFold pipeline
gcloud ai pipelines run \
--pipeline-job-spec-uri=gs://vertex-pipeline-components-public/alphafold/alphafold_pipeline.yaml \
--parameter-values='
project='"$PROJECT"',
region=us-central1,
input_fasta_gs_path=gs://my-bucket/sequences.fasta,
output_gs_path=gs://my-bucket/alphafold-output/,
use_gpu=true,
model_preset=multimer
'
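
# Outputs land under output_gs_path; list the predicted structures once the
# batch run completes (bucket path from the example above)
gsutil ls -r gs://my-bucket/alphafold-output/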
```

Process the results via alphafold-database skill patterns:

```python
# Load outputs into DuckDB for analysis
import duckdb
conn = duckdb.connect("asi.db")
conn.execute("""
CREATE TABLE alphafold_batch AS
SELECT * FROM read_json_auto('gs://my-bucket/alphafold-output/**/*.json')
""")
# Filter by pLDDT confidence
conn.execute("SELECT * FROM alphafold_batch WHERE mean_plddt > 90 ORDER BY mean_plddt DESC")
```

### 5. Vertex AI Matching Engine for Protein Similarity Search

Index ESM sequence embeddings for semantic protein search:

```python
import subprocess

import numpy as np
import requests
from google.cloud import aiplatform

# Step 1: Generate ESM embeddings for all proteins in the dataset
def embed_sequences(sequences):
    """Call the ESM endpoint (from section 2) to batch-embed sequences."""
    access_token = subprocess.check_output(
        ["gcloud", "auth", "print-access-token"]
    ).strip().decode()
    response = requests.post(
        f"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT}/locations/us-central1/endpoints/{ESM_ENDPOINT_ID}:predict",
        headers={"Authorization": f"Bearer {access_token}"},
        json={"instances": [{"sequence": s} for s in sequences]},
    )
    return np.array(response.json()["predictions"])

# Step 2: Create the Matching Engine index
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
display_name="protein-embedding-index",
contents_delta_uri=f"gs://{BUCKET}/protein-embeddings/",
dimensions=1280, # ESM3 embedding dimension
approximate_neighbors_count=150,
distance_measure_type="DOT_PRODUCT_DISTANCE",
)

# Step 3: Query for similar proteins
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
display_name="protein-similarity-endpoint",
public_endpoint_enabled=True,
)
endpoint.deploy_index(index=index)

# Find proteins similar to the query
query_embedding = embed_sequences(["MKTAYIAKQR..."])[0]
neighbors = endpoint.find_neighbors(
deployed_index_id=DEPLOYED_INDEX_ID,
queries=[query_embedding.tolist()],
num_neighbors=10,
)
```

### 6. Cost-Optimized Batch Docking via Vertex AI

Run DiffDock at scale using Vertex AI Batch Prediction with spot VMs:

```python
from google.cloud import aiplatform

# Create the batch prediction job (spot VMs: roughly 60-90% cost reduction)
batch_job = aiplatform.BatchPredictionJob.create(
job_display_name="diffdock-batch-screen",
model_name=DIFFDOCK_MODEL_ID,
instances_format="jsonl",
gcs_source=f"gs://{BUCKET}/ligand-protein-pairs.jsonl",
predictions_format="jsonl",
gcs_destination_prefix=f"gs://{BUCKET}/docking-results/",
machine_type="n1-standard-8",
accelerator_type="NVIDIA_TESLA_T4",
accelerator_count=1,
starting_replica_count=10,
max_replica_count=50,
service_account=SERVICE_ACCOUNT,
)
```

## Gap Registry: What Vertex AI Cannot Replace

| Capability | ASI Skill | Vertex AI | Status |
|---|---|---|---|
| Structure retrieval (200M) | alphafold-database | Inference only | ASI owns retrieval |
| Protein sequence design | esm (ESM3) | Ginkgo LLM (limited) | ASI owns design |
| Molecular docking | diffdock | None | ASI owns this |
| ADMET prediction | deepchem | None | ASI owns this |
| Wet-lab validation | adaptyv | None | ASI owns this |
| BigQuery genomics | None | ✅ gnomAD, 1TB free | Vertex AI gap |
| Workflow orchestration | None | ✅ KFP Pipelines | Vertex AI gap |
| Scalable inference | Manual | ✅ Managed Endpoints | Vertex AI gap |
| Protein similarity search | None | ✅ Matching Engine | Vertex AI gap |

## Related ASI Skills

- alphafold-database — structure retrieval; batch analysis feed for Vertex Pipelines
- esm — ESM3 for design; deployable as a Vertex AI Endpoint
- diffdock — docking; deployable as a Vertex Batch Prediction job
- deepchem — ADMET screening; post-design filter stage
- adaptyv — wet-lab validation; final pipeline stage
- torchdrug — KG reasoning; drug target validation
- uniprot-database — sequence/annotation retrieval; input data source
- gget — rapid queries; replaces manual API calls in pipeline
- bigquery-asi-interleave — parent GCP bridge; gnomAD query patterns
- vertex-asi-interleave — sibling Vertex bridge; generative AI patterns
- lolita — latent diffusion physics; analogous pipeline pattern for PDE emulation