ExoPriors Scry — Research Corpus Skill
SQL + vector search over 3B+ docs (arXiv, HN, LW, EA Forum, Twitter, Bluesky, Reddit, Substack, Wikipedia, Ethereum).
API Quick Reference
| Method | Endpoint | Content-Type | Body |
|--------|----------|--------------|------|
| POST | /v1/scry/query | text/plain | Raw SQL |
| POST | /v1/scry/embed | application/json | {"text":"...","name":"handle","model":"voyage-4-lite"} |
| POST | /v1/scry/estimate | application/json | {"sql":"..."} |
| GET | /v1/scry/schema | — | — |
Base URL: https://api.exopriors.com
Auth: Authorization: Bearer exopriors_public_readonly_v1_2025
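The table above maps onto plain HTTP requests. The following is a minimal sketch using only Python's standard library; the helper names `build_query_request` and `build_estimate_request` are illustrative and not part of the API. Response shapes are not documented here, so the sketch stops at constructing the request.

```python
import json
import urllib.request

BASE_URL = "https://api.exopriors.com"
PUBLIC_KEY = "exopriors_public_readonly_v1_2025"


def build_query_request(sql: str, key: str = PUBLIC_KEY) -> urllib.request.Request:
    """Build a POST to /v1/scry/query with the raw SQL as a text/plain body."""
    return urllib.request.Request(
        BASE_URL + "/v1/scry/query",
        data=sql.encode("utf-8"),
        headers={
            "Authorization": "Bearer " + key,
            "Content-Type": "text/plain",
        },
        method="POST",
    )


def build_estimate_request(sql: str, key: str = PUBLIC_KEY) -> urllib.request.Request:
    """Build a POST to /v1/scry/estimate with a JSON body of the form {"sql": ...}."""
    return urllib.request.Request(
        BASE_URL + "/v1/scry/estimate",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send a request, pass it to `urllib.request.urlopen(req)` and read the body; how the result is encoded is not specified in this reference, so inspect it before parsing.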
Public key limits
- Handles must match p_<8hex>_<name> (write-once)
- No alerts, rerank, or vector list/delete endpoints
- Row cap: 2000 (50 with include_vectors=true)
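Because public handles are write-once, it can help to validate the name locally before calling /v1/scry/embed. A minimal sketch follows. The source only gives the shape p_<8hex>_<name>; the assumptions that the hex part is lowercase and that <name> is limited to lowercase letters, digits, and underscores are mine, as are the helper names.

```python
import json
import re

# Assumption: lowercase hex, and <name> restricted to [a-z0-9_].
# The reference only specifies the overall p_<8hex>_<name> shape.
HANDLE_RE = re.compile(r"p_[0-9a-f]{8}_[a-z0-9_]+")


def is_public_handle(handle: str) -> bool:
    """Return True if the handle has the p_<8hex>_<name> shape."""
    return HANDLE_RE.fullmatch(handle) is not None


def build_embed_body(text: str, handle: str, model: str = "voyage-4-lite") -> str:
    """Build the JSON body for POST /v1/scry/embed, rejecting malformed handles."""
    if not is_public_handle(handle):
        raise ValueError("handle must look like p_<8hex>_<name>: " + repr(handle))
    return json.dumps({"text": text, "name": handle, "model": model})
```

For example, `build_embed_body("mechanistic interpretability", "p_8f3a1c2d_mech_interp")` yields a body whose handle can then be referenced in SQL as `@p_8f3a1c2d_mech_interp`.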
Core Schema
scry.entities
| Column | Type | Notes |
|--------|------|-------|
| id | UUID | PK |
| kind | entity_kind | Cast kind::text. Values: post, comment, paper, tweet, twitter_thread, webpage, document, grant... |
| uri | TEXT | Canonical link |
| payload | TEXT | Content (HTML/plain text, truncated 50K) |
| title | TEXT | From metadata |
| score | INT | Unified score (coalesced upvotes/baseScore/likes) |
| original_author | TEXT | May be NULL (esp. tweets) |
| original_timestamp | TIMESTAMPTZ | Publication date |
| source | external_system | Cast source::text. Values: lesswrong, eaforum, hackernews, arxiv, twitter, bluesky, reddit, wikipedia, manifold... |
| parent_entity_id | UUID | Parent for threaded items |
| anchor_entity_id | UUID | Root subject (comment → post) |
| content_risk | TEXT | 'dangerous' marks sources with prompt-injection risk |
| metadata | JSONB | Source-specific fields |
scry.embeddings
| Column | Type | Notes |
|--------|------|-------|
| entity_id | UUID | FK to entities.id |
| chunk_index | INT | 0 = doc-level |
| embedding_voyage4 | halfvec(2048) | Voyage-4 family vectors |
scry.stored_vectors
Named vectors from /v1/scry/embed. Reference as @handle in SQL.
Materialized Views (pre-indexed, fast)
- General: mv_posts, mv_forum_posts, mv_high_score_posts, mv_papers, mv_blogosphere_posts
- LW/EA: mv_lesswrong_posts, mv_eaforum_posts, mv_af_posts, mv_lesswrong_comments, mv_eaforum_comments, mv_high_karma_comments
- HN: mv_hackernews_posts
- Academic: mv_arxiv_papers, mv_unjournal_posts
- Social: mv_twitter_threads, mv_substack_posts, mv_substack_comments, mv_substack_publications
- Crypto: mv_crypto_posts, mv_ethereum_posts
- Stats: mv_author_stats (post_count, total_post_score, avg_post_score, first/last_activity)
MVs include embedding_voyage4 for direct semantic search. Filter WHERE embedding_voyage4 IS NOT NULL if needed.
Vector Operations
@handle syntax
SELECT mv.uri, mv.title, mv.embedding_voyage4 <=> @my_concept AS distance
FROM scry.mv_lesswrong_posts mv
ORDER BY distance LIMIT 20;
Operators: <=> cosine distance, <-> L2 distance (lower = closer). The function cosine_similarity(a, b) returns similarity instead (higher = closer).
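The relationship between these quantities is easy to get backwards when sorting. In pgvector, cosine distance is 1 minus cosine similarity, so ORDER BY distance ascending and ORDER BY similarity descending give the same ranking. A small sketch on plain Python lists, assuming the Scry helper cosine_similarity follows the standard definition:

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Standard cosine similarity: 1 = same direction, 0 = orthogonal."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def cosine_distance(a, b):
    """What the <=> operator computes: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def l2_distance(a, b):
    """What the <-> operator computes: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Cosine distance ignores vector length, while L2 distance does not: `[1, 0]` and `[2, 0]` have cosine distance 0 but L2 distance 1.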
Helpers
- unit_vector(v): normalize
- scale_vector(v, s): scalar multiply (pgvector has no s * v)
- debias_vector(axis, topic): remove topic direction from axis (most useful op)
- debias_safe(axis, topic, max_removal): capped debiasing
- contrast_axis(pos, neg): direction vector from neg toward pos
- contrast_axis_balanced(pos, neg): normalizes poles first
- cosine_similarity(a, b), vector_norm(v)
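The difference between the two contrast helpers is easiest to see numerically. The sketch below is a local approximation, not the server implementation: it assumes contrast_axis is the plain difference pos - neg and that contrast_axis_balanced normalizes each pole to unit length before subtracting. Whether the server also normalizes the result is not stated, so the sketch leaves it unnormalized.

```python
import math


def unit_vector(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def contrast_axis(pos, neg):
    """Assumed behavior: direction from neg toward pos, i.e. pos - neg."""
    return [p - q for p, q in zip(pos, neg)]


def contrast_axis_balanced(pos, neg):
    """Assumed behavior: normalize both poles first so neither dominates."""
    return contrast_axis(unit_vector(pos), unit_vector(neg))
```

With `pos = [3, 0]` and `neg = [0, 1]`, the plain axis is `[3, -1]` and is dominated by the longer pole, while the balanced axis is `[1.0, -1.0]` and weights both poles equally.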
Key pattern: "X but not Y"
SELECT mv.uri, mv.title,
       mv.embedding_voyage4 <=> unit_vector(
         debias_vector(
           scale_vector(@topic_a, 0.6) + scale_vector(@topic_b, 0.4),
           @unwanted
         )
       ) AS distance
FROM scry.mv_lesswrong_posts mv
ORDER BY distance
LIMIT 20;
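To see what the query vector in this pattern looks like, here is a local sketch of the same arithmetic on plain lists. It assumes debias_vector removes the orthogonal projection of the axis onto the topic direction; that is the usual meaning of "remove a direction", but the server's exact implementation is not documented here. The function `x_but_not_y` is a hypothetical wrapper for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit_vector(v):
    n = math.sqrt(dot(v, v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def scale_vector(v, s):
    return [x * s for x in v]


def debias_vector(axis, topic):
    """Assumed behavior: subtract the component of axis that lies along topic."""
    t = unit_vector(topic)
    coeff = dot(axis, t)
    return [a - coeff * ti for a, ti in zip(axis, t)]


def x_but_not_y(topic_a, topic_b, unwanted, w_a=0.6, w_b=0.4):
    """Weighted mix of two topics with the unwanted direction projected out."""
    mixed = [a + b for a, b in zip(scale_vector(topic_a, w_a),
                                   scale_vector(topic_b, w_b))]
    return unit_vector(debias_vector(mixed, unwanted))
```

Under this assumption the resulting vector has unit length and is orthogonal to the unwanted direction, so documents that match only through the unwanted concept gain no similarity from it. If the mix lies entirely along the unwanted direction, nothing is left after debiasing and normalization fails, which is presumably the situation debias_safe's max_removal cap is meant to avoid.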
Lexical Search: scry.search()
scry.search(
query_text text,
mode text DEFAULT 'auto', -- 'auto'|'and'|'or'|'phrase'|'fuzzy'
kinds text[] DEFAULT NULL, -- NULL defaults to [post,paper,document,webpage,twitter_thread,grant]
limit_n int DEFAULT 20 -- max 100
) RETURNS TABLE (id, score, snippet, uri, kind, original_author, title, original_timestamp)
- scry.search_ids(...): IDs only, max 2000
- scry.search_exhaustive(...): with scores + pagination, max 1000
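When the search text comes from user input, the call has to be assembled as a SQL string, since the query endpoint takes raw SQL. A minimal sketch of a builder for scry.search() follows; `build_search_sql` and `sql_literal` are hypothetical helpers, the named-argument form mirrors the `limit_n =>` and `kinds =>` usage in this document, and the quoting assumes standard PostgreSQL string literals. The outer LIMIT is added defensively, on the assumption that the server's LIMIT requirement applies to the outer query.

```python
MODES = {"auto", "and", "or", "phrase", "fuzzy"}
MAX_LIMIT_N = 100  # documented maximum for scry.search()


def sql_literal(s: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + s.replace("'", "''") + "'"


def build_search_sql(query_text, mode="auto", kinds=None, limit_n=20):
    """Return a SELECT over scry.search() using named arguments."""
    if mode not in MODES:
        raise ValueError("unknown mode: " + repr(mode))
    if not 1 <= limit_n <= MAX_LIMIT_N:
        raise ValueError("limit_n must be between 1 and 100")
    args = [sql_literal(query_text), "mode => " + sql_literal(mode)]
    if kinds is not None:
        # Override the default high-signal kinds, e.g. to include tweets.
        quoted = ", ".join(sql_literal(k) for k in kinds)
        args.append("kinds => ARRAY[" + quoted + "]")
    args.append("limit_n => " + str(int(limit_n)))
    return ("SELECT uri, title, snippet, score FROM scry.search("
            + ", ".join(args) + ") LIMIT " + str(int(limit_n)))
```

For example, `build_search_sql("mesa-optimizer", mode="phrase", kinds=["tweet", "comment"], limit_n=10)` produces a query that searches tweets and comments, which the default kinds would exclude.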
Hybrid: lexical → semantic re-rank
WITH candidates AS (
SELECT id FROM scry.search_ids('interpretability circuits', limit_n => 800)
)
SELECT e.uri, e.original_author, emb.embedding_voyage4 <=> @concept AS distance
FROM candidates c
JOIN scry.embeddings emb ON emb.entity_id = c.id AND emb.chunk_index = 0
JOIN scry.entities e ON e.id = c.id
WHERE emb.embedding_voyage4 IS NOT NULL
ORDER BY distance LIMIT 30;
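The two-stage idea behind this query can be illustrated on toy data: a cheap lexical filter narrows the pool, then only the survivors are ranked by embedding distance. The sketch below is a local stand-in, not how the server works; the all-terms substring match is a deliberately crude substitute for scry.search_ids, and the two-dimensional embeddings are made up for illustration.

```python
import math


def cosine_distance(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - num / den


def hybrid_rank(docs, terms, concept, candidates=800, top_k=30):
    """Lexical filter first, then semantic re-rank of the candidates.

    docs: list of dicts with 'uri', 'text', and 'embedding' (list or None).
    """
    # Stage 1: keep docs containing every term, capped like limit_n => 800.
    lexical = [d for d in docs
               if all(t in d["text"].lower() for t in terms)][:candidates]
    # Stage 2: skip docs without an embedding (the IS NOT NULL filter),
    # then sort ascending by cosine distance to the concept vector.
    scored = [(cosine_distance(d["embedding"], concept), d["uri"])
              for d in lexical if d["embedding"] is not None]
    scored.sort()
    return [uri for _, uri in scored[:top_k]]


DOCS = [
    {"uri": "a", "text": "Interpretability circuits in transformers", "embedding": [1.0, 0.0]},
    {"uri": "b", "text": "Circuits and interpretability, loosely", "embedding": [0.0, 1.0]},
    {"uri": "c", "text": "Cooking recipes", "embedding": [1.0, 0.0]},
    {"uri": "d", "text": "interpretability circuits notes", "embedding": None},
]
```

Calling `hybrid_rank(DOCS, ["interpretability", "circuits"], [0.9, 0.1])` ranks `a` ahead of `b`, drops `c` at the lexical stage even though its embedding is close, and drops `d` because it has no embedding.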
Gotchas
- Author fragmentation: "Eliezer Yudkowsky" vs "eliezer_yudkowsky" vs "@ESYudkowsky". Use ILIKE '%pattern%'.
- Not all entities have embeddings: Always JOIN explicitly. Use MVs which pre-join.
- Default kinds filter: scry.search() defaults to high-signal subset. Pass kinds => ARRAY['tweet','comment'] explicitly if needed.
- Cast enums: Use kind::text and source::text in WHERE/SELECT.
- Score semantics vary by source: Don't compare LW karma with HN points directly.
- Always LIMIT: No LIMIT = rejection. Keep small (10-50) for exploration.
- Handle naming: Public must be p_<8hex>_<name>. Write-once.
- Content risk: Filter content_risk IS DISTINCT FROM 'dangerous' when using LLM on results.
- Reddit is separate: scry.reddit table with TEXT IDs, doesn't join to UUID-based entities/embeddings.
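The LIMIT and row-cap rules can be checked client-side before a query is sent. The following is a rough pre-flight sketch; `check_limit` is a hypothetical helper, and the regex simply takes the last `LIMIT <n>` in the text, which is usually but not always the outer limit. How include_vectors is passed to the API is not documented here, so it appears only as a flag on the helper.

```python
import re

ROW_CAP = 2000              # public key row cap
ROW_CAP_WITH_VECTORS = 50   # cap when include_vectors=true


def check_limit(sql: str, include_vectors: bool = False) -> int:
    """Return the last LIMIT value in sql, or raise if missing or over the cap."""
    # "limit_n => 800" does not match: LIMIT must be followed by whitespace.
    matches = re.findall(r"\blimit\s+(\d+)", sql, flags=re.IGNORECASE)
    if not matches:
        raise ValueError("query has no LIMIT and would be rejected")
    limit = int(matches[-1])
    cap = ROW_CAP_WITH_VECTORS if include_vectors else ROW_CAP
    if limit > cap:
        raise ValueError("LIMIT %d exceeds the row cap of %d" % (limit, cap))
    return limit
```

A real SQL parser would be more robust; this check only catches the common mistakes of forgetting LIMIT or asking for more rows than the public key allows.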