Agent Skills: WRDS Data Access

This skill should be used when the user asks to "query WRDS", "access Compustat", "get CRSP data", "pull Form 4 insider data", "query ISS compensation", "download SEC EDGAR filings", "get ExecuComp data", "access Capital IQ", or needs WRDS PostgreSQL query patterns.

financeID: edwinhu/workflows/wrds

Install this agent skill locally:

pnpm dlx add-skill https://github.com/edwinhu/workflows/tree/HEAD/skills/wrds

Skill Files


skills/wrds/SKILL.md

Skill Metadata

Name
wrds
Description
This skill should be used when the user asks to "query WRDS", "access Compustat", "get CRSP data", "pull Form 4 insider data", "query ISS compensation", "download SEC EDGAR filings", "get ExecuComp data", "access Capital IQ", "write SAS code for WRDS", "SAS ETL", "SAS hash merge", "SGE array job", "qsas", "qsub SAS", or needs WRDS PostgreSQL query patterns or SAS ETL performance patterns.

Contents

WRDS Data Access

WRDS (Wharton Research Data Services) provides academic research data via PostgreSQL at wrds-pgdata.wharton.upenn.edu:9737.

Query Enforcement

IRON LAW: NO QUERY WITHOUT FILTER VALIDATION FIRST

Before executing ANY WRDS query, you MUST:

  1. IDENTIFY what filters are required for this dataset
  2. VALIDATE the query includes those filters
  3. VERIFY parameterized queries (never string formatting)
  4. EXECUTE the query
  5. INSPECT a sample of results before claiming success

This is not negotiable. Claiming query success without sample inspection is LYING to the user about data quality.

Rationalization Table - STOP If You Think:

| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "I'll add filters later" | You'll forget and pull bad data | Add filters NOW, before execution |
| "User didn't specify filters" | Standard filters are ALWAYS required | Apply Critical Filters section defaults |
| "Just a quick test query" | Test queries with bad filters teach bad patterns | Use production filters even for tests |
| "I'll let the user filter in pandas" | Pulling millions of unnecessary rows wastes time/memory | Filter at database level FIRST |
| "The query worked, so it's correct" | Query success ≠ data quality | INSPECT sample for invalid records |
| "I can use f-strings for simple queries" | SQL injection risk + wrong type handling | ALWAYS use parameterized queries |

Red Flags - STOP Immediately If You Think:

  • "Let me run this query quickly to see what's there" → NO. Check Critical Filters section first.
  • "I'll just pull everything and filter later" → NO. Database-level filtering is mandatory.
  • "The table name is obvious from the request" → NO. Check Quick Reference section for exact names.
  • "I can inspect the data after the user sees it" → NO. Sample inspection BEFORE claiming success.

Query Validation Checklist

Before EVERY query execution:

For Compustat queries (comp.funda, comp.fundq):

  • [ ] Includes indfmt = 'INDL'
  • [ ] Includes datafmt = 'STD'
  • [ ] Includes popsrc = 'D'
  • [ ] Includes consol = 'C'
  • [ ] Uses parameterized queries for variables
  • [ ] Date range is explicitly specified
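The Compustat checklist above can be sketched as a small query builder. This is a hedged sketch: the selected variables (`at`, `sale`, `ni`) and the helper name are illustrative, and execution against a live connection is left to the caller.

```python
def build_funda_query(start_date, end_date):
    """Return (sql, params) for a comp.funda pull with the four standard
    filters applied at the database level and parameterized dates."""
    sql = """
        SELECT gvkey, datadate, fyear, at, sale, ni
        FROM comp.funda
        WHERE indfmt = 'INDL'
          AND datafmt = 'STD'
          AND popsrc = 'D'
          AND consol = 'C'
          AND datadate BETWEEN %s AND %s
    """
    return sql, (start_date, end_date)

sql, params = build_funda_query('2015-01-01', '2019-12-31')
# Execute with: df = pd.read_sql(sql, conn, params=params)
```

Because the dates travel in `params`, not in the SQL string, the query satisfies both the filter checklist and the parameterization rule.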

For CRSP v2 queries (crsp.dsf_v2, crsp.msf_v2):

  • [ ] Post-query filter: sharetype == 'NS'
  • [ ] Post-query filter: securitytype == 'EQTY'
  • [ ] Post-query filter: securitysubtype == 'COM'
  • [ ] Post-query filter: usincflg == 'Y'
  • [ ] Post-query filter: issuertype.isin(['ACOR', 'CORP'])
  • [ ] Uses parameterized queries
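The five post-query conditions above can be wrapped in one helper. The sketch below runs it on a tiny synthetic frame so the effect is visible; in real use the frame would come from a `crsp.dsf_v2` pull, and the `permno` values here are made up.

```python
import pandas as pd

def common_stock_filter(df):
    """CRSP v2 equivalent of legacy shrcd IN (10, 11)."""
    return df.loc[
        (df.sharetype == 'NS')
        & (df.securitytype == 'EQTY')
        & (df.securitysubtype == 'COM')
        & (df.usincflg == 'Y')
        & (df.issuertype.isin(['ACOR', 'CORP']))
    ]

sample = pd.DataFrame({
    'permno': [10001, 10002, 10003],
    'sharetype': ['NS', 'NS', 'AD'],           # third row: not a new share
    'securitytype': ['EQTY', 'EQTY', 'EQTY'],
    'securitysubtype': ['COM', 'COM', 'COM'],
    'usincflg': ['Y', 'N', 'Y'],               # second row: non-US incorporated
    'issuertype': ['CORP', 'CORP', 'CORP'],
})
filtered = common_stock_filter(sample)
# Only the first row passes all five conditions
```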

For Form 4 queries (tr_insiders.table1):

  • [ ] Transaction type filter specified (acqdisp)
  • [ ] Transaction codes specified (trancode)
  • [ ] Date range is explicitly specified
  • [ ] Uses parameterized queries

For ALL queries:

  • [ ] Sample inspection with .head() or .sample() BEFORE claiming success
  • [ ] Row count verification (is result size reasonable?)
  • [ ] NULL value check on critical columns
  • [ ] Date range validation (does min/max match expectations?)
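The four universal checks above can be collected into one inspection helper. This is an illustrative sketch, not a library API: function and column names are assumptions, and the demo frame is synthetic.

```python
import pandas as pd

def inspect_result(df, critical_cols, date_col, expected_min, expected_max):
    """Return a list of data-quality issues; empty list means checks passed."""
    issues = []
    if df.empty:
        issues.append('result is empty')
    for col in critical_cols:
        n_null = df[col].isna().sum()
        if n_null:
            issues.append(f'{col}: {n_null} NULLs')
    if not df.empty:
        dmin, dmax = df[date_col].min(), df[date_col].max()
        if dmin < expected_min or dmax > expected_max:
            issues.append(f'date range {dmin}..{dmax} outside expectation')
    return issues

df = pd.DataFrame({
    'gvkey': ['001004', None],   # one missing identifier
    'datadate': pd.to_datetime(['2018-12-31', '2019-12-31']),
})
issues = inspect_result(df, ['gvkey'], 'datadate',
                        pd.Timestamp('2015-01-01'), pd.Timestamp('2019-12-31'))
```

Running the checks before reporting success (and showing `df.head()` alongside them) is what the iron law requires.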

SAS ETL Enforcement

IRON LAW: NO SAS CODE WITHOUT PERFORMANCE VALIDATION FIRST

<EXTREMELY-IMPORTANT> Before writing or executing ANY SAS code on WRDS, you MUST validate performance patterns. This is not negotiable.
  1. MERGE STRATEGY — Is hash or sort-merge appropriate? Justify the choice.
  2. WHERE CLAUSES — Are all date/string filters index-friendly? No functions on indexed columns.
  3. PARALLELISM — Can this job run as an SGE array? Year-by-year is always parallelizable.
  4. SQL OPTIMIZATION — For PROC SQL: pass-through opportunity? Indexed join columns?

Writing SAS code that forces full table scans when indexes exist is LYING about understanding the data infrastructure. </EXTREMELY-IMPORTANT>

SAS Code Validation Checklist

Before EVERY SAS program execution:

For merges/joins:

  • [ ] Small lookup + large fact table → hash object (not PROC SORT + DATA merge)
  • [ ] Hash uses defineKey/defineData/defineDone pattern correctly
  • [ ] h.output() uses double quotes for macro resolution (not single quotes)
  • [ ] call missing() initializes hash data variables for non-matches
  • [ ] Both tables >50M rows → sort-merge is justified (document why)
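A minimal sketch of the hash lookup pattern the checklist describes; dataset names (`work.firms`, `crsp.dsf`) are illustrative, and the full patterns live in references/sas-etl.md:

```sas
data merged;
    if 0 then set work.firms;                  /* bring hash variables into the PDV */
    if _n_ = 1 then do;
        declare hash h(dataset: "work.firms"); /* small lookup table; double quotes */
        h.defineKey('permno');
        h.defineData('gvkey', 'conm');
        h.defineDone();
    end;
    set crsp.dsf;                              /* large fact table, read once, unsorted */
    call missing(gvkey, conm);                 /* initialize for non-matches */
    rc = h.find();                             /* 0 on key match */
    if rc = 0 then output;
    drop rc;
run;
```

No PROC SORT is needed on either side, which is the whole advantage over a sort-merge.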

For WHERE clauses (CRITICAL):

  • [ ] NO year(date), month(date), datepart(dt) wrapping indexed columns
  • [ ] Date filters use BETWEEN "01jan&year."d AND "31dec&year."d range pattern
  • [ ] String filters avoid upcase(), substr() on indexed columns
  • [ ] Compound date filters collapsed to single range (not year() = X AND quarter() = Y)
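As a sketch of the range pattern above (dataset and macro variable names illustrative):

```sas
/* Index-friendly: a closed date range the engine can push to the index */
data work.crsp_year;
    set crsp.dsf;
    where date between "01jan&year."d and "31dec&year."d;
run;

/* Anti-pattern -- the function call forces a full table scan:
   where year(date) = &year.; */
```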

For batch processing:

  • [ ] Multi-year jobs use SGE array (#$ -t start-end) not sequential loop
  • [ ] Year passed via -sysparm (not -set or %sysget)
  • [ ] Per-year log files (not single shared log)
  • [ ] Memory allocation appropriate for workload (#$ -l m_mem_free=4G minimum)
  • [ ] Single-year benchmark run completed before full array submission
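The batch checklist can be sketched as a qsub submission script. Job name, year range, paths, and the SAS program name are all illustrative; only the `-sysparm` and `#$ -t` patterns are the point:

```shell
#!/bin/bash
# Hypothetical SGE array submission script: one task per year.
#$ -cwd
#$ -N wrds_etl
#$ -t 2005-2022                      # SGE_TASK_ID takes each year in turn
#$ -l m_mem_free=4G                  # minimum per the checklist
#$ -o logs/etl_$TASK_ID.out          # per-task (per-year) log files
#$ -e logs/etl_$TASK_ID.err

# Pass the year via -sysparm; the SAS program reads it as &sysparm.
sas etl_year.sas -sysparm "$SGE_TASK_ID" -log "logs/etl_${SGE_TASK_ID}.log"
```

Submitting with `qsub -t 2005-2005` first gives the single-year benchmark the checklist asks for.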

For PROC SQL:

  • [ ] Join columns are not wrapped in functions
  • [ ] calculated keyword used for computed column references in HAVING
  • [ ] Pass-through SQL considered for direct WRDS PostgreSQL queries
  • [ ] No redundant subqueries that could be hash lookups

For macros:

  • [ ] Macro variables terminated with period (&year. not &year)
  • [ ] Double quotes used where macro resolution is needed
  • [ ] options mprint mlogic symbolgen used during development

SAS Rationalization Table - STOP If You Think:

| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "Sort-merge is simpler to write" | Hash is 10x faster for lookup joins and requires no sorting | Write the hash — it's 5 extra lines |
| "year(date) is readable" | Readable but prevents index usage — full table scan on millions of rows | Use BETWEEN with date literals |
| "I'll parallelize later" | Later never comes and the job runs 18x slower sequentially | Write the SGE array job NOW |
| "Single quotes work fine in hash" | Single quotes block macro resolution — your output dataset name is wrong | ALWAYS double quotes in h.output() |
| "PROC SQL is easier than hash" | PROC SQL still sorts for joins — hash avoids all sorting | Hash for lookups, SQL only for complex aggregations |
| "The job only takes a few minutes per year" | 18 years × 3 minutes = 54 minutes sequential vs 3 minutes parallel | SGE array for ANY multi-year job |
| "%sysget works for getting the year" | Unreliable in SGE context — may return blank silently | Use -sysparm + &sysparm. |

SAS Red Flags - STOP Immediately If You're About To:

  • Write where year(date) = anything → STOP. Use BETWEEN with date literals.
  • Write proc sort; data; merge for a lookup join → STOP. Use hash object.
  • Write a %do year = start %to end loop → STOP. Use SGE array job.
  • Use single quotes in h.output(dataset: '...') → STOP. Use double quotes.
  • Submit a full array job without testing one year first → STOP. Benchmark first.
  • Use -set or %sysget for SGE task parameters → STOP. Use -sysparm.

SAS Reference

See references/sas-etl.md for complete patterns:

  • Hash object merge (basic, multidata, accumulator)
  • Index-friendly WHERE clause quick reference table
  • SGE array job templates with memory and logging
  • PROC SQL pass-through and optimization
  • Macro quoting and debugging

Quick Reference: Table Names

| Dataset | Schema | Key Tables |
|---------|--------|------------|
| Compustat | comp | company, funda, fundq, secd |
| ExecuComp | comp_execucomp | anncomp |
| CRSP | crsp | dsf, msf, stocknames, ccmxpf_linkhist |
| CRSP v2 | crsp | dsf_v2, msf_v2, stocknames_v2 |
| Form 4 Insiders | tr_insiders | table1, header, company |
| ISS Incentive Lab | iss_incentive_lab | comppeer, sumcomp, participantfy |
| Capital IQ | ciq | wrds_compensation |
| IBES | tr_ibes | det_epsus, statsum_epsus |
| SEC EDGAR | wrdssec | wrds_forms, wciklink_cusip |
| SEC Search | wrds_sec_search | filing_view, registrant |
| EDGAR | edgar | filings, filing_docs |
| Fama-French | ff | factors_monthly, factors_daily |
| LSEG/Datastream | tr_ds | ds2constmth, ds2indexlist |

Connection

Initialize PostgreSQL connection to WRDS:

import psycopg2

conn = psycopg2.connect(
    host='wrds-pgdata.wharton.upenn.edu',
    port=9737,
    database='wrds',
    sslmode='require'
    # Credentials from ~/.pgpass
)

Configure authentication via ~/.pgpass with chmod 600:

wrds-pgdata.wharton.upenn.edu:9737:wrds:USERNAME:PASSWORD

Connect via SSH tunnel:

ssh wrds

This uses ~/.ssh/wrds_rsa for authentication.

Critical Filters

Compustat Standard Filters

Always include for clean fundamental data:

WHERE indfmt = 'INDL'
  AND datafmt = 'STD'
  AND popsrc = 'D'
  AND consol = 'C'

CRSP v2 Common Stock Filter

Equivalent to legacy shrcd IN (10, 11):

df = df.loc[
    (df.sharetype == 'NS') &
    (df.securitytype == 'EQTY') &
    (df.securitysubtype == 'COM') &
    (df.usincflg == 'Y') &
    (df.issuertype.isin(['ACOR', 'CORP']))
]

Form 4 Transaction Types

WHERE acqdisp = 'D'  -- Dispositions
  AND trancode IN ('S', 'D', 'G', 'F')  -- Sales, Dispositions, Gifts, Tax
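Combining these filters with the parameterization rule gives a sketch like the one below. The helper name is illustrative, and the `trandate` column name is assumed from the Thomson Reuters layout (see references/insider-form4.md for the authoritative schema).

```python
def build_form4_query(start_date, end_date, trancodes=('S', 'D', 'G', 'F')):
    """Return (sql, params) for a Form 4 disposals pull with explicit
    transaction-type, transaction-code, and date-range filters."""
    sql = """
        SELECT *
        FROM tr_insiders.table1
        WHERE acqdisp = %s
          AND trancode = ANY(%s)
          AND trandate BETWEEN %s AND %s
    """
    return sql, ('D', list(trancodes), start_date, end_date)

sql, params = build_form4_query('2022-01-01', '2023-03-31')
# Execute with: df = pd.read_sql(sql, conn, params=params)
```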

Parameterized Queries

Always use parameterized queries (never string formatting):

Use scalar parameter binding for single values:

cursor.execute("""
    SELECT gvkey, conm FROM comp.company WHERE gvkey = %s
""", (gvkey,))

Use ANY() for list parameters:

cursor.execute("""
    SELECT * FROM comp.funda WHERE gvkey = ANY(%s)
""", (gvkey_list,))

Additional Resources

Reference Files

Detailed query patterns and table documentation:

  • references/compustat.md - Compustat tables, ExecuComp, financial variables
  • references/crsp.md - CRSP stock data, CCM linking, v2 format
  • references/insider-form4.md - Thomson Reuters Form 4, rolecodes, insider types
  • references/iss-compensation.md - ISS Incentive Lab, peer companies, compensation
  • references/edgar.md - SEC EDGAR filings, URL construction, DCN vs accession numbers
  • references/connection.md - Connection pooling, caching, error handling
  • references/sas-etl.md - SAS hash objects, index-friendly WHERE, SGE array jobs, PROC SQL optimization

Example Files

Working code from real projects:

  • examples/form4_disposals.py - Insider trading analysis (from SVB project)
  • examples/wrds_connector.py - Connection pooling pattern

Scripts

  • scripts/test_connection.py - Validate WRDS connectivity

Local Sample Notebooks

WRDS-provided samples at ~/resources/wrds-code-samples/:

  • ResearchApps/CCM2025.ipynb - Modern CRSP-Compustat merge
  • ResearchApps/ff3_crspCIZ.ipynb - Fama-French factor construction
  • comp/sas/execcomp_ceo_screen.sas - ExecuComp patterns

Date Awareness

When querying historical data, leverage current date context for dynamic range calculations.

Current date is automatically available via datetime.now(). Apply this to:

  • Data range validation (e.g., "get data for last 5 years")
  • Fiscal year calculations
  • Event study windows

Implement dynamic date ranges in queries:

from datetime import datetime, timedelta

import pandas as pd

# Query last 5 years of data (365-day years; close enough for a range filter)
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

query = """
SELECT * FROM comp.funda
WHERE datadate BETWEEN %s AND %s
"""
df = pd.read_sql(query, conn, params=(start_date, end_date))

Compute date ranges at query time rather than hard-coding them, so date-dependent results stay current as time passes.