Agent Skills: arXiv Search Skill

Search arXiv for preprints and scholarly articles across physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, systems science, and economics. Supports field-specific queries (title, author, abstract, category), boolean logic, date filtering, and bulk retrieval with pagination.

UncategorizedID: kltng/humanities-skills/arxiv-search

Install this agent skill to your local

pnpm dlx add-skill https://github.com/kltng/humanities-skills/tree/HEAD/arxiv-search

Skill Files

Browse the full folder contents for arxiv-search.

Download Skill

Loading file tree…

arxiv-search/SKILL.md

Skill Metadata

Name
arxiv-search
Description
Search arXiv for preprints and scholarly articles across physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, systems science, and economics. Supports field-specific queries (title, author, abstract, category), boolean logic, date filtering, and bulk retrieval with pagination.

arXiv Search Skill

Search and retrieve metadata for preprints on arXiv, the open-access repository for scholarly articles.

Critical: Things Claude Won't Know Without This Skill

Response format is Atom XML, not JSON

The arXiv API returns Atom 1.0 XML — there is no JSON option. You must parse XML to extract results. The Python script handles this automatically.

The base URL uses export.arxiv.org, not arxiv.org

http://export.arxiv.org/api/query?search_query=all:transformer&max_results=5

Using arxiv.org directly will not work for API queries.

Boolean operators must be UPPERCASE

Use AND, OR, ANDNOT — lowercase will not work:

search_query=ti:attention AND ti:transformer
search_query=cat:cs.AI ANDNOT cat:cs.CL

Parentheses and quotes must be URL-encoded

  • Parentheses: %28 and %29
  • Quotes (phrase search): %22
search_query=ti:%22large+language+model%22
search_query=au:bengio+AND+%28cat:cs.LG+OR+cat:cs.AI%29

Date range filtering uses a special syntax

search_query=submittedDate:[202401010000+TO+202412312359]

Format: YYYYMMDDTHHMM in GMT, 24-hour. The T is literal. Combine with other queries using AND.

Rate limiting: 3 seconds between requests

arXiv asks for a minimum 3-second delay between API calls. Results update once daily, so there is no reason to poll frequently.

Hard limits on results

  • max_results caps at 2000 per request
  • Total retrievable results cap at 30,000 (start + max_results ≤ 30000)
  • Exceeding 30,000 returns HTTP 400

Search Field Prefixes

| Prefix | Field | |--------|-------| | ti | Title | | au | Author | | abs | Abstract | | co | Comment | | jr | Journal Reference | | cat | Subject Category | | rn | Report Number | | id_list | Specific arXiv IDs (comma-separated, passed as separate param) | | all | All fields simultaneously |

Python Script

Use scripts/arxiv_api.py for programmatic access (zero dependencies):

from scripts.arxiv_api import ArxivAPI
api = ArxivAPI()

# Simple keyword search
results = api.search("all:transformer attention mechanism", max_results=10)

# Field-specific search
results = api.search("ti:diffusion AND cat:cs.CV", max_results=5)

# Author search
results = api.search("au:hinton AND cat:cs.LG", max_results=20)

# Date-filtered search
results = api.search(
    "cat:cs.AI AND submittedDate:[202401010000 TO 202412312359]",
    sort_by="submittedDate",
    sort_order="descending",
    max_results=50
)

# Fetch specific papers by arXiv ID
results = api.fetch_by_ids(["2301.07041", "2303.08774", "2305.10601"])

# Paginated retrieval
page1 = api.search("cat:cs.CL", max_results=100, start=0)
page2 = api.search("cat:cs.CL", max_results=100, start=100)

# Each result is a dict with keys:
# id, title, summary, authors, published, updated,
# categories, primary_category, links, comment, journal_ref, doi

Common Category Codes

| Category | Description | |----------|-------------| | cs.AI | Artificial Intelligence | | cs.CL | Computation and Language (NLP) | | cs.CV | Computer Vision | | cs.LG | Machine Learning | | cs.CR | Cryptography and Security | | cs.DS | Data Structures and Algorithms | | cs.SE | Software Engineering | | math.AG | Algebraic Geometry | | math.CO | Combinatorics | | physics.hep-th | High Energy Physics - Theory | | quant-ph | Quantum Physics | | stat.ML | Machine Learning (Statistics) | | econ.GN | General Economics | | q-bio.GN | Genomics |

Full list: https://arxiv.org/category_taxonomy

API Etiquette

  • Rate limit: 3-second minimum between requests
  • No authentication required
  • Results update daily — no need to poll the same query more than once per day
  • Use pagination for large result sets instead of requesting thousands at once
  • Respect HTTP 429: Back off if rate-limited

Related Skills

  • wikidata-search: Cross-reference arXiv authors with Wikidata entities for authority control identifiers (ORCID, VIAF, etc.)
  • harvard-library-catalog: Look up published versions of arXiv preprints in Harvard's bibliographic records

Resources

  • references/api_reference.md — Complete API specs for query parameters, response fields, and error handling
  • scripts/arxiv_api.py — Python client with XML parsing, rate limiting, and pagination support