Sample Corpus

samples/ contains 150 Python generators that produce synthetic fund documents: 103 spread across 8 fund families, plus 47 standalone files.

Fund Families

Fund Types Covered

| Family | Strategy | Documents |
|---|---|---|
| Apex Capital | PE buyout | 14 |
| Greenfield Ventures | VC | 12 |
| Cornerstone RE | Real estate | 19 |
| European Growth | Luxembourg PE | 10 |
| Pacific Credit | Credit/lending | 14 |
| Meridian Secondaries | GP-led continuation | 10 |
| Catalyst Infrastructure | UK infrastructure | 14 |
| Summit Macro | Hedge fund | 10 |
| Standalone files | Cross-cutting | 47 |
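
As a quick sanity check, the per-family counts listed above can be summed to confirm they match the stated corpus size. The dictionary below is simply a transcription of those figures; the variable names are illustrative and not part of the project.

```python
# Document counts per group, transcribed from the list above.
DOCUMENT_COUNTS = {
    "Apex Capital": 14,
    "Greenfield Ventures": 12,
    "Cornerstone RE": 19,
    "European Growth": 10,
    "Pacific Credit": 14,
    "Meridian Secondaries": 10,
    "Catalyst Infrastructure": 14,
    "Summit Macro": 10,
    "Standalone files": 47,
}

# Documents that belong to one of the 8 fund families.
family_total = sum(
    count for name, count in DOCUMENT_COUNTS.items() if name != "Standalone files"
)

# Whole corpus, including the standalone files.
corpus_total = sum(DOCUMENT_COUNTS.values())

print(family_total)  # 103
print(corpus_total)  # 150
```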

Each family is a coherent set of related documents, enabling cross-document skills: compare, reconcile, multi-doc-analyze, mfn-tracker, and playbook.

Intentional Issues

Every document embeds intentional issues for testing detection accuracy:

  • Mathematical errors (carried interest calculations, distribution waterfalls)
  • Compliance gaps (missing disclosures, regulatory deadline errors)
  • Cross-document conflicts (side letter terms inconsistent with LPA)
  • Wire fraud signals (suspicious wire instruction changes)
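
To make the first category concrete, here is a minimal sketch of how a carried interest figure might be cross-checked. It assumes a deliberately simplified waterfall (return of capital, then preferred return, then a carry split on the remainder, with no GP catch-up). The function names and the figures are hypothetical and are not taken from the project's detectors or sample documents.

```python
from decimal import Decimal


def expected_carry(proceeds, contributed, pref_amount, carry_rate):
    """Carry under a simplified waterfall (assumption: no GP catch-up).

    Capital is returned first, then the preferred return is paid, and the
    GP takes carry_rate of whatever remains.
    """
    remainder = proceeds - contributed - pref_amount
    if remainder <= 0:
        return Decimal("0")
    return (remainder * carry_rate).quantize(Decimal("0.01"))


def carry_mismatch(stated, proceeds, contributed, pref_amount, carry_rate,
                   tolerance=Decimal("0.01")):
    """Return (flagged, recomputed) for a stated carry amount."""
    recomputed = expected_carry(proceeds, contributed, pref_amount, carry_rate)
    return abs(stated - recomputed) > tolerance, recomputed


# Hypothetical figures: the stated carry was computed on gross profit
# (20% of 50M) instead of on the remainder after the preferred return.
flagged, recomputed = carry_mismatch(
    stated=Decimal("10000000"),
    proceeds=Decimal("150000000"),
    contributed=Decimal("100000000"),
    pref_amount=Decimal("8000000"),
    carry_rate=Decimal("0.20"),
)
print(flagged, recomputed)
```

Using Decimal rather than float keeps the comparison exact to the cent, which matters when the goal is to flag small arithmetic discrepancies.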

The sample LPA (generate_sample_lpa.py) contains 10 specific issues. See LPA Safety Score for the full list.

Generating the Corpus

bash
cd fundadmin-ai

# Generate all 150 documents
python3 samples/generate_all.py

# Generate one fund family only
python3 samples/generate_all.py --family apex

# Compile-check only (no output files)
python3 samples/generate_all.py --check

# List all generators
python3 samples/generate_all.py --list

Generated output goes to samples/output/, which is gitignored because it can be reproduced from the scripts.
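
What --check does internally is not documented here. One plausible reading of "compile-check" is that each generator script is parsed for syntax errors without being executed. The sketch below illustrates that idea with Python's built-in compile(); the compile_check helper is hypothetical and is not the project's implementation.

```python
import tempfile
from pathlib import Path


def compile_check(script_dir):
    """Return {path: message} for every .py file that fails to compile.

    Uses the built-in compile(), so nothing is executed and no files
    (documents or .pyc caches) are written.
    """
    failures = {}
    for path in sorted(Path(script_dir).rglob("*.py")):
        try:
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except SyntaxError as exc:
            failures[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return failures


# Demonstration on a throwaway directory with one valid and one broken script.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.py").write_text("print('ok')\n", encoding="utf-8")
    Path(tmp, "bad.py").write_text("def broken(:\n", encoding="utf-8")
    failures = compile_check(tmp)
    failed_names = sorted(Path(p).name for p in failures)

print(failed_names)  # ['bad.py']
```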

Individual Generators

bash
# Single document types (each with intentional issues)
python3 generate_sample_lpa.py            # LPA — 10 issues
python3 generate_sample_ppm.py            # PPM — 7 issues
python3 generate_sample_side_letter.py    # Side letter — 3 issues
python3 generate_sample_subscription.py   # Subscription — 5 issues

Shared Utilities

samples/generators/common.py provides:

  • Document styles and setup
  • Fund family constants
  • Investor profiles
  • Standard issue injection helpers

Classifier Accuracy

The content classifier is regression-tested to maintain ≥ 97% accuracy across all 150 PDFs:

bash
node tests/audit-classifier.mjs
python3 -m unittest tests/test_classifier_accuracy.py -v
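
For a sense of what the threshold means in practice: with 150 documents, 97% accuracy requires at least 146 correct classifications, so no more than 4 misses. The sketch below shows that arithmetic. The accuracy helper and the "lpa"/"ppm" labels are illustrative assumptions, not the contents of the test files above.

```python
THRESHOLD = 0.97  # minimum accuracy stated above


def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if not expected:
        raise ValueError("expected labels must not be empty")
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical labels for a 150-document corpus.
expected = ["lpa"] * 150

four_misses = ["lpa"] * 146 + ["ppm"] * 4   # 146/150, about 0.973
five_misses = ["lpa"] * 145 + ["ppm"] * 5   # 145/150, about 0.967

print(accuracy(four_misses, expected) >= THRESHOLD)  # True
print(accuracy(five_misses, expected) >= THRESHOLD)  # False
```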

Catalog

See samples/SAMPLE-CATALOG.md for the full index, which lists the embedded issues in each document.

T1 (skills + CLI) and T2 (vault template) are MIT licensed.