Ethics & Data Provenance

MassCourtsPlusPlus is built on a commitment to transparency, privacy, and responsible use of public court records.

Data Provenance

All data is sourced from the MassCourts.org public case registry, maintained by the Massachusetts Trial Court. The platform queries the civica_courtdocs.cases_masscourts_org table directly. This table is a denormalized dataset containing over 9 million court cases.

Each row represents a single case, with structured scalar fields (case number, court, type, status, dates) and JSON blob fields containing charges, docket entries, parties, events, dispositions, and judgments. The current scope is limited to cases filed in 2026, enforced as a global server-side filter.

The application connects as a read-only database user. No data is ever written or modified through this interface.

AI-Powered Query Generation

MassCourtsPlusPlus uses OpenAI GPT-4o-mini to translate plain-English questions into MySQL queries. The AI operates within strict constraints:

A detailed system prompt describes the exact table schema, column types, indexes, and JSON blob structures
The model is instructed to always include the 2026 date filter, LIMIT clauses, and k-anonymity HAVING thresholds
All generated SQL is validated by a server-side security layer before execution
Every result includes the generated SQL and a plain-language explanation for full transparency
A confidence score (0–100%) reports the model's own estimate of how well the query matches your intent

K-Anonymity & Privacy Protection

To prevent indirect identification of individuals, privacy protection is enforced at two layers:

The AI is instructed to include HAVING COUNT(*) >= 10 in all aggregate queries, suppressing small groups at the SQL level
A server-side safety net removes any remaining rows where a count column is below 10 before results reach the frontend
Single aggregate totals (e.g. "How many X?") are exempt from suppression since they don't reveal individual-level data
When suppression occurs, the Method Card indicates how many groups were removed
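
The server-side safety net described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name, the case_count column name, and the rule used to detect an exempt single total (one row with one column) are all assumptions.

```python
from typing import Any

K_THRESHOLD = 10  # minimum group size stated in the text above


def suppress_small_groups(
    rows: list[dict[str, Any]],
    count_column: str = "case_count",  # hypothetical column name
    k: int = K_THRESHOLD,
) -> tuple[list[dict[str, Any]], int]:
    """Return (rows that pass the threshold, number of groups removed)."""
    # Assumption: a result with exactly one row and one column is a single
    # aggregate total ("How many X?") and is exempt from suppression.
    if len(rows) == 1 and len(rows[0]) == 1:
        return list(rows), 0
    # Rows that lack the count column are kept unchanged.
    kept = [row for row in rows if row.get(count_column, k) >= k]
    return kept, len(rows) - len(kept)
```

The second return value is the kind of number a Method Card could report as the count of suppressed groups.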

SQL Transparency

Every result set is accompanied by the exact SQL query that was generated and executed. This allows users to:

Verify the logic behind any result
Reproduce the dataset independently using any MySQL client
Identify potential limitations or biases in the query structure
Export the SQL alongside the data via CSV for research reproducibility
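
One way to ship the SQL alongside the data is to write it as comment lines at the top of the CSV file. The sketch below assumes that layout; the source does not say how the export embeds the query, and the function name is hypothetical.

```python
import csv
import io
from typing import Any


def export_csv_with_sql(rows: list[dict[str, Any]], sql: str) -> str:
    """Build CSV text with the generating SQL as leading '# ' comment lines."""
    buf = io.StringIO()
    for line in sql.strip().splitlines():
        buf.write(f"# {line}\n")
    if rows:
        # Column order follows the keys of the first row.
        writer = csv.DictWriter(
            buf, fieldnames=list(rows[0].keys()), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
    return buf.getvalue()
```

CSV readers that accept a comment character can skip the header lines, while a person opening the file still sees the query that produced it.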

SQL Security & Validation

All AI-generated SQL passes through a strict server-side validator before touching the database. The validator enforces the rules below, and the read-only database user adds a further layer of defense:

SELECT statements only — all other operations are rejected
No wildcard selections (SELECT * is blocked)
Mandatory LIMIT clause (max 1,000 rows) to prevent runaway queries
Blocked keywords: DROP, DELETE, INSERT, UPDATE, ALTER, CREATE, TRUNCATE, SLEEP, BENCHMARK, LOAD_FILE, INTO OUTFILE
Table allowlist — only the cases_masscourts_org table can be referenced
No SQL comments (-- or /* */) to prevent injection hiding
No stacked queries (semicolons within the query body)
Global 2026 date filter must be present in every query
The database user itself is read-only, providing a final safety net
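
The rules above could be checked with something like the following sketch. It is a simplified, regex-based illustration under stated assumptions, not the project's validator: the function name and the file_date column are hypothetical, only a plain trailing "LIMIT n" is recognized, and a production validator would more likely use a real SQL parser.

```python
import re

ALLOWED_TABLE = "cases_masscourts_org"
MAX_LIMIT = 1000
DATE_COLUMN = "file_date"  # hypothetical; the source does not name the column
BLOCKED_KEYWORDS = [
    "DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "CREATE",
    "TRUNCATE", "SLEEP", "BENCHMARK", "LOAD_FILE",
]


def validate_sql(sql: str) -> list[str]:
    """Return a list of rule violations; an empty list means the query passed."""
    errors: list[str] = []
    query = sql.strip().replace("`", "")
    if query.endswith(";"):
        # Assumption: one trailing semicolon is tolerated and stripped.
        query = query[:-1].rstrip()
    upper = query.upper()

    if not re.match(r"SELECT\b", upper):
        errors.append("only SELECT statements are allowed")
    if re.search(r"\bSELECT\s+\*", upper):  # COUNT(*) is not matched
        errors.append("SELECT * is not allowed")
    if "--" in query or "/*" in query or "*/" in query:
        errors.append("SQL comments are not allowed")
    if ";" in query:
        errors.append("stacked queries are not allowed")
    # Conservative: a blocked word inside a string literal is also rejected.
    for keyword in BLOCKED_KEYWORDS:
        if re.search(rf"\b{keyword}\b", upper):
            errors.append(f"blocked keyword: {keyword}")
    if re.search(r"\bINTO\s+OUTFILE\b", upper):
        errors.append("blocked keyword: INTO OUTFILE")

    # Table allowlist. Known limitation: EXTRACT(... FROM col) would be
    # misread as a table reference by this simple pattern.
    for table in re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", query, flags=re.IGNORECASE):
        if table.split(".")[-1].lower() != ALLOWED_TABLE:
            errors.append(f"table not allowed: {table}")

    limit = re.search(r"\bLIMIT\s+(\d+)\s*$", upper)
    if limit is None:
        errors.append("missing LIMIT clause")
    elif int(limit.group(1)) > MAX_LIMIT:
        errors.append(f"LIMIT exceeds {MAX_LIMIT} rows")

    if not re.search(rf"\b{DATE_COLUMN}\b.*2026", query, flags=re.IGNORECASE | re.DOTALL):
        errors.append("missing 2026 date filter")
    return errors
```

Returning every violation, instead of stopping at the first, makes it easier to log why a generated query was rejected.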

Rate Limiting

To prevent abuse and protect shared database resources, the API enforces a limit of 10 requests per minute per IP address. If the limit is exceeded, the user receives a clear message to wait before retrying.
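
A per-IP limit of this kind is often implemented as a sliding window. The sketch below assumes that approach and an in-memory store; the source does not describe the actual mechanism, and the class name is hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowRateLimiter:
    """Allow at most max_requests per window_seconds for each client key."""

    def __init__(
        self,
        max_requests: int = 10,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False  # caller should tell the user to wait and retry
        hits.append(now)
        return True
```

An in-memory store like this only works for a single server process; a deployment with several workers would need a shared store instead.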

Known Data Gaps & Limitations

The database is comprehensive but not perfect: gaps, inconsistencies, and incomplete records exist throughout.
Cases filed on paper that were not digitized by the registry may be absent.
Case type classifications are based on court-assigned codes and may not perfectly reflect legal categories.
Judicial assignment data (case_judge) is frequently empty.
Charge data is stored in JSON blobs and searched via pattern matching, which may miss variant spellings or codes.
The dataset is an ongoing scrape of MassCourts.org and is not updated in real time.
Name searches rely on pattern matching against "Last, First" formatted names and may miss unusual formats.
The current scope is limited to 2026 filings only; historical data is available in the source but not yet exposed.

Bias Mitigation

Query explanations and results are presented without automated labeling, inflammatory framing, or inferences about individual guilt, outcomes, or behavior. The platform surfaces patterns in court data — it does not interpret or editorialize them.

The AI system prompt is designed to avoid value judgments and to produce neutral, factual query descriptions. Method Cards accompanying each result explicitly state assumptions and limitations.

Contact & Feedback

MassCourtsPlusPlus is an open-source project. To report a data issue, suggest a feature, or raise an ethical concern, please open an issue on GitHub.