Ethics & Data Provenance
MassCourtsPlusPlus is built on a commitment to transparency, privacy, and responsible use of public court records.
Data Provenance
All data is sourced from the MassCourts.org public case registry, maintained by the Massachusetts Trial Court. The platform queries the civica_courtdocs.cases_masscourts_org table directly — a comprehensive denormalized dataset containing over 9 million court cases.
Each row represents a single case, with structured scalar fields (case number, court, type, status, dates) and JSON blob fields containing charges, docket entries, parties, events, dispositions, and judgments. The current scope is limited to cases filed in 2026, enforced as a global server-side filter.
The application connects as a read-only database user. No data is ever written or modified through this interface.
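As an illustration of the row layout described above, the following minimal sketch decodes the JSON blob fields of one row into Python objects. The column names (`case_number`, `file_date`, `charges`) and all values are placeholders chosen for the example, not the real schema.

```python
import json

# Hypothetical row: column names and values are placeholders, not the real schema.
row = {
    "case_number": "EXAMPLE-0001",
    "court": "Example Court",
    "file_date": "2026-01-15",
    "charges": '[{"description": "EXAMPLE CHARGE", "code": "X1"}]',
}


def decode_blobs(row, blob_fields=("charges",)):
    """Return a copy of the row with the named JSON text fields parsed.

    Empty or missing blobs become an empty list so callers can always iterate.
    """
    decoded = dict(row)
    for name in blob_fields:
        raw = decoded.get(name)
        decoded[name] = json.loads(raw) if raw else []
    return decoded


case = decode_blobs(row)
first_charge_code = case["charges"][0]["code"]
```

Scalar fields pass through unchanged; only the fields listed in `blob_fields` are parsed.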
AI-Powered Query Generation
MassCourtsPlusPlus uses OpenAI GPT-4o-mini to translate plain-English questions into MySQL queries. The AI operates within strict constraints:
- A detailed system prompt describes the exact table schema, column types, indexes, and JSON blob structures
- The model is instructed to always include the 2026 date filter, LIMIT clauses, and k-anonymity HAVING thresholds
- All generated SQL is validated by a server-side security layer before execution
- Every result includes the generated SQL and a plain-language explanation for full transparency
- A confidence score (0–100%) indicates how well the AI believes the query matches your intent
K-Anonymity & Privacy Protection
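To prevent indirect identification of individuals, small groups are suppressed at two layers: in the generated SQL and again on the server. The exemption and reporting rules are listed after them: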
- The AI is instructed to include HAVING COUNT(*) >= 10 in all aggregate queries, suppressing small groups at the SQL level
- A server-side safety net removes any remaining rows where a count column is below 10 before results reach the frontend
- Single aggregate totals (e.g. "How many X?") are exempt from suppression since they don't reveal individual-level data
- When suppression occurs, the Method Card indicates how many groups were removed
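The server-side safety net described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the function name, the way count columns are passed in, and the rule used to recognize a single aggregate total (one row containing only count columns) are all assumptions.

```python
K_THRESHOLD = 10  # minimum group size stated in the text


def suppress_small_groups(rows, count_columns, k=K_THRESHOLD):
    """Drop rows whose count is below k and report how many were removed.

    rows is a list of dicts; count_columns names the columns holding counts.
    Assumption: a result with exactly one row made up only of count columns
    is a single aggregate total and is exempt from suppression.
    """
    if len(rows) == 1 and set(rows[0]) <= set(count_columns):
        return list(rows), 0
    kept = [
        row for row in rows
        if all(row[col] >= k for col in count_columns if col in row)
    ]
    return kept, len(rows) - len(kept)


grouped = [
    {"court": "A", "n": 42},
    {"court": "B", "n": 3},
    {"court": "C", "n": 10},
]
kept, removed = suppress_small_groups(grouped, ["n"])
```

The second return value is the number of suppressed groups, which is the figure a Method Card would need to report.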
SQL Transparency
Every result set is accompanied by the exact SQL query that was generated and executed. This allows users to:
- Verify the logic behind any result
- Reproduce the dataset independently using any MySQL client
- Identify potential limitations or biases in the query structure
- Export the SQL alongside the data via CSV for research reproducibility
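One way to keep the SQL together with the exported data is to write it as a leading comment line in the CSV file. The sketch below assumes that layout; the actual export format is not specified here, and `export_csv_with_sql` is a hypothetical helper.

```python
import csv
import io


def export_csv_with_sql(columns, rows, sql):
    """Return CSV text whose first line records the SQL that produced it."""
    buffer = io.StringIO()
    # Collapse whitespace so a multi-line query fits on one comment line.
    buffer.write("# SQL: " + " ".join(sql.split()) + "\n")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = export_csv_with_sql(
    ["court", "n"],
    [("A", 42), ("C", 10)],
    "SELECT court, COUNT(*) AS n\nFROM cases_masscourts_org\nLIMIT 10",
)
```

A leading comment line is not part of the CSV format itself, so anyone loading the file would need to skip the first line or treat `#` as a comment marker.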
SQL Security & Validation
All AI-generated SQL passes through a strict server-side validator before touching the database. The rules below are enforced on every query, with the read-only database user as a final layer of defense:
- SELECT statements only — all other operations are rejected
- No wildcard selections (SELECT * is blocked)
- Mandatory LIMIT clause (max 1,000 rows) to prevent runaway queries
- Blocked keywords: DROP, DELETE, INSERT, UPDATE, ALTER, CREATE, TRUNCATE, SLEEP, BENCHMARK, LOAD_FILE, INTO OUTFILE
- Table allowlist — only the cases_masscourts_org table can be referenced
- No SQL comments (-- or /* */), so injected SQL cannot be hidden inside a comment
- No stacked queries (semicolons within the query body)
- Global 2026 date filter must be present in every query
- The database user itself is read-only, providing a final safety net
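The rules above can be approximated with a handful of string and regular-expression checks. The following is a simplified sketch, not the project's validator: it assumes the date filter has the form `file_date >= '2026-01-01'` (the column name is hypothetical), requires LIMIT to be the final clause, and tolerates one trailing semicolon.

```python
import re

ALLOWED_TABLES = {"cases_masscourts_org"}
MAX_LIMIT = 1000
BLOCKED_KEYWORDS = [
    "DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "CREATE", "TRUNCATE",
    "SLEEP", "BENCHMARK", "LOAD_FILE", "INTO OUTFILE",
]
# Hypothetical: the real column name and filter form are not given in the text.
DATE_FILTER = re.compile(r"\bfile_date\s*>=\s*'2026-01-01'", re.IGNORECASE)


def validate_sql(sql):
    """Return a list of rule violations; an empty list means the query passed."""
    errors = []
    text = sql.strip()
    if text.endswith(";"):  # one trailing semicolon is tolerated
        text = text[:-1].rstrip()
    upper = text.upper()

    if not re.match(r"SELECT\b", upper):
        errors.append("only SELECT statements are allowed")
    if re.search(r"\bSELECT\s+\*", upper):
        errors.append("SELECT * is not allowed")
    if "--" in text or "/*" in text or "*/" in text:
        errors.append("SQL comments are not allowed")
    if ";" in text:
        errors.append("stacked queries are not allowed")
    for keyword in BLOCKED_KEYWORDS:
        # Conservative: also rejects a keyword that appears in a string literal.
        pattern = r"\b" + keyword.replace(" ", r"\s+") + r"\b"
        if re.search(pattern, upper):
            errors.append(f"blocked keyword: {keyword}")
    limit = re.search(r"\bLIMIT\s+(\d+)$", upper)
    if limit is None:
        errors.append("query must end with a LIMIT clause")
    elif int(limit.group(1)) > MAX_LIMIT:
        errors.append(f"LIMIT exceeds {MAX_LIMIT}")
    # Every identifier that follows FROM or JOIN must be on the allowlist.
    for name in re.findall(r"\b(?:FROM|JOIN)\s+([`\w.]+)", text, flags=re.IGNORECASE):
        table = name.replace("`", "").split(".")[-1].lower()
        if table not in ALLOWED_TABLES:
            errors.append(f"table not allowed: {table}")
    if not DATE_FILTER.search(text):
        errors.append("missing 2026 date filter")
    return errors


good_query = (
    "SELECT court, COUNT(*) AS n FROM cases_masscourts_org "
    "WHERE file_date >= '2026-01-01' "
    "GROUP BY court HAVING COUNT(*) >= 10 LIMIT 100"
)
bad_query = "SELECT * FROM users; DROP TABLE users"
```

Collecting every violation, rather than stopping at the first one, makes rejected queries easier to debug. Pattern checks like these are deliberately conservative and are no substitute for the read-only database user; a production validator would more likely work from a parsed query than from raw text.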
Rate Limiting
To prevent abuse and protect shared database resources, the API enforces a limit of 10 requests per minute per IP address. If the limit is exceeded, the user receives a clear message to wait before retrying.
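A per-IP limit of this kind is often implemented as a sliding window over recent request times. The sketch below assumes that approach and an in-memory store; the text does not say whether the real API uses a sliding or fixed window, and the class name is hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each client key."""

    def __init__(self, max_requests=10, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        """Record the request and return True, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
first_request_allowed = limiter.allow("203.0.113.7")  # documentation-range IP
```

An in-memory store only works for a single server process; a deployment with several workers would need a shared store for the counters.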
Known Data Gaps & Limitations
- The database is comprehensive but not perfect — gaps, inconsistencies, and incomplete records exist throughout.
- Cases filed on paper that were not digitized by the registry may be absent.
- Case type classifications are based on court-assigned codes and may not perfectly reflect legal categories.
- Judicial assignment data (case_judge) is frequently empty.
- Charge data is stored in JSON blobs and searched via pattern matching, which may miss variant spellings or codes.
- The dataset is an ongoing scrape of MassCourts.org and is not updated in real time.
- Name searches rely on pattern matching against "Last, First" formatted names and may miss unusual formats.
- The current scope is limited to 2026 filings only; historical data is available in the source but not yet exposed.
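The pattern-matching limitation in the list above is easy to see with a small example. The sketch below uses a prefix match as a rough stand-in for a SQL `LIKE 'Doe, Jane%'` search under a case-insensitive collation; whether the platform uses prefix or substring patterns is not stated here, and the names are placeholders.

```python
def like_prefix(value, prefix):
    """Rough stand-in for SQL `value LIKE 'prefix%'`, ignoring case."""
    return value.lower().startswith(prefix.lower())


# Placeholder names showing formats a "Last, First" pattern can and cannot match.
names = ["Doe, Jane", "DOE, JANE A.", "Jane Doe", "Doe,Jane", "Doe Jr., Jane"]
matches = [name for name in names if like_prefix(name, "Doe, Jane")]
missed = [name for name in names if name not in matches]
```

Here `matches` holds the two names that follow the expected format, while `missed` holds the three variants (first-name-first, missing space, and suffix) that the pattern does not catch.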
Bias Mitigation
Query explanations and results are presented without automated labeling, inflammatory framing, or inferences about individual guilt, outcomes, or behavior. The platform surfaces patterns in court data; it does not interpret them or editorialize on them.
The AI system prompt is designed to avoid value judgments and to produce neutral, factual query descriptions. Method Cards accompanying each result explicitly state assumptions and limitations.
Contact & Feedback
MassCourtsPlusPlus is an open-source project. To report a data issue, suggest a feature, or raise an ethical concern, please open an issue on GitHub.