RAG for Lawyers

What is it and How to Pair a Local SLM with an On‑Demand LLM—Without Losing Control

If you’ve been considering “AI for the firm” but don’t want a cloud-based black box freelancing with client confidences, Retrieval‑Augmented Generation (RAG) is the pattern to know. In plain terms, RAG is an assistant that looks through your actual files before it writes. Ask it for a discovery summary or a motion outline, and it will pull the most relevant passages from your matter folders, then draft using those passages—with citations back to the source. The point isn’t clever prose; it’s grounded answers you can verify. And all of this can be done on-site, with your own hardware, software, and case files.
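
For readers who want to see the shape of the thing, here is a minimal sketch of that loop in Python. The `search_index` and `local_slm` objects, the field names, and the citation format are all placeholders for whatever index and model you actually run; the point is the order of operations: retrieve first, then draft only from what was retrieved.

```python
# Minimal sketch of the RAG loop. `search_index` and `local_slm` are
# placeholders for your own search index and local model.
def answer(question: str, matter_id: str, search_index, local_slm) -> str:
    # 1. Retrieve: pull the most relevant passages from this matter only.
    passages = search_index.search(query=question, matter_id=matter_id, top_k=8)

    # 2. Ground: label each passage with its source so quotes can be cited.
    context = "\n\n".join(
        f"[{p['filename']}:{p['page']}] {p['text']}" for p in passages
    )

    # 3. Compose: draft strictly from the excerpts, with pinpoint citations.
    prompt = (
        "Draft a response using ONLY the excerpts below. Cite every quote as "
        "{filename}:{page}. If the excerpts do not support a claim, answer "
        "'Not in record.'\n\nEXCERPTS:\n" + context + "\n\nREQUEST: " + question
    )
    return local_slm.generate(prompt)
```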

The safest and simplest way to deploy this in legal practice is the “two‑engine” approach. Keep a local small language model (SLM) on your own hardware for anything sensitive or early‑stage. It’s private, fast, and inexpensive for day‑to‑day work. When you want extra polish or heavier reasoning, selectively hand the draft (and only the quoted snippets) to an on‑demand commercial LLM with retention off. A simple router can decide which engine to use based on the task and the risk: privileged or high‑sensitivity stays local; stylistic refinement or complex structure can tap the external model using snippet‑only context.
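
A sketch of that handoff, under the same caveats: `local_slm` and `external_llm` stand in for whatever clients you run, `build_grounded_prompt` stands in for the grounding step sketched earlier, and the matching of quoted lines is a rough heuristic. The property worth copying is that the external call only ever sees the draft and the already-quoted snippets.

```python
# Two-engine sketch: the local SLM drafts with full retrieval context; the
# external LLM sees only the draft plus the quoted snippets, never the files.
def draft_with_optional_polish(task, passages, local_slm, external_llm,
                               polish: bool = False) -> str:
    # First pass always happens locally.
    draft = local_slm.generate(build_grounded_prompt(task, passages))
    if not polish:
        return draft

    # Polish pass: snippet-only context, sent to an endpoint with retention off.
    quoted = [p["text"] for p in passages if p["text"] in draft]  # rough match
    snippet_block = "\n".join(quoted)
    return external_llm.generate(
        "Improve clarity and structure. Do not alter any quotes or citations.\n\n"
        "QUOTED SNIPPETS:\n" + snippet_block + "\n\nDRAFT:\n" + draft
    )
```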

Under the hood, a legal RAG system has four moving parts: ingestion, indexing, retrieval, and composition. Ingestion is the housekeeping you already understand—OCR the scanned PDFs, normalize emails so participants and dates are consistent, and attach metadata such as matter ID, doc type, and privilege. Indexing is where you get search that actually understands meaning: combine vector search (to capture concepts like “pricing algorithm” even when the wording differs) with keyword/BM25 (to catch exact citations, names, and RFP numbers). Retrieval uses both signals to surface a handful of on‑point paragraphs, and a lightweight reranker helps ensure the best ones rise to the top. The composer—the model that writes—stitches those excerpts into a draft and cites each quote back to {filename}:{page or paragraph} so you can click and confirm.
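
If you want to see what “hybrid” means concretely, here is a rough sketch of that retrieval step. It assumes the open-source rank_bm25 package for the keyword side; the `embed` function is a toy stand-in you would replace with a real local embedding model, and a cross-encoder reranker would re-order the top results before they reach the composer.

```python
# Hybrid retrieval sketch: BM25 keyword scores blended with vector similarity.
import numpy as np
from rank_bm25 import BM25Okapi  # assumes the rank_bm25 package is installed

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy stand-in: hash tokens into a fixed-size vector. Use a real local
    # embedding model in practice.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec

def hybrid_search(query: str, chunks: list[dict], top_k: int = 8) -> list[dict]:
    # Keyword side: exact hits on citations, names, RFP numbers.
    bm25 = BM25Okapi([c["text"].lower().split() for c in chunks])
    kw = np.asarray(bm25.get_scores(query.lower().split()), dtype=float)

    # Vector side: conceptual matches even when the wording differs.
    vecs = np.stack([embed(c["text"]) for c in chunks])
    q = embed(query)
    sim = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)

    # Blend normalized scores; a reranker would refine this ordering.
    def norm(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    combined = 0.5 * norm(kw) + 0.5 * norm(sim)
    return [chunks[i] for i in np.argsort(combined)[::-1][:top_k]]
```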

Here’s how that feels in practice. Imagine you’re preparing a motion to enforce discovery. You ask: “Draft a motion focusing on RFPs 4–9 in Acme v. Beta. Quote the responses and our meet‑and‑confer emails. Include a one‑page facts section and a proposed order, with pinpoint citations.” The system narrows itself to that matter’s workspace, pulls the RFPs and the correspondence, and finds the prior order. Your local SLM writes a first pass that includes block quotes with cites; if you flip the “polish” switch, only the draft and the quoted lines go to the external model for clarity and structure—none of the underlying files ever leave your environment.

The same pattern works for deposition preparation. Ask for a depo kit for the CFO: “Summarize key admissions with page:line cites, inconsistent statements, likely exhibits, and 25 targeted questions.” Because the retrieval layer understands both semantics and exact citations, it can spot “Jane Smith” even when she appears as “J. Smith,” pull those Q/As, and return a clean outline with the cites you’d expect. Or take issues chronologies: “Build a reverse‑chronological timeline of events around the pricing algorithm from emails, Slack exports, and board minutes.” The result is a tidy table—dates, actors, one‑sentence summaries, and live citations—that you can drop into a memo or a case theory meeting.

This approach scales to firms of any size without turning into an IT project you regret. For a solo practitioner, a laptop with 32 GB of RAM and a small GPU gives you a local SLM and a hybrid index you control. Watch a single “Matters/” directory, and run everything on‑device; call the commercial LLM only to polish a draft using snippet context. For a small firm, put the index and SLM on a small on‑prem or private‑cloud server, add SSO, and connect to your documents. Nightly incremental ingestion keeps the index current, and a Teams/Slack bot exposes a “Matter” dropdown so everyone stays scoped. For a medium firm, add high availability (a couple of inference nodes plus an ingestion worker), a reranker for retrieval quality, and observability: logs that capture prompts and document IDs (not full text), audit trails by matter, and a “snippet‑only” policy for any external calls.
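
Most of the difference between those tiers is configuration rather than code. A sketch of the knobs for the small-firm setup, with every name invented for illustration:

```python
# Illustrative settings for the small-firm tier; all key names are assumptions.
FIRM_RAG_CONFIG = {
    "watch_paths": ["/srv/matters"],           # directories to ingest
    "ingestion_schedule": "nightly",           # incremental re-indexing
    "index": {"mode": "hybrid", "reranker": True},
    "local_model": {"name": "your-local-slm", "host": "on-prem-gpu-node"},
    "external_model": {
        "enabled": True,
        "retention": "off",                    # contractual and API setting
        "context_policy": "snippet_only",      # never send source files
    },
    "logging": {"store_full_text": False, "store_doc_ids": True},
}
```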

Data preparation is where most of the quality is won or lost, so it’s worth being disciplined. OCR everything, split documents into sensible chunks (think sections and headings, not random 500‑word slices), and attach the metadata you already track—matter ID, client, doc type, author, date, privilege, and Bates ranges. Hybrid indexing—vectors and keywords—beats either method alone, especially in law where exact cites matter as much as synonyms. A small reranker can then clean the top results so the composer sees the eight best paragraphs, not fifty maybes.
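
Here is what “sensible chunks” can look like in practice: a sketch that splits on headings rather than fixed word counts and carries the matter metadata on every chunk. The heading pattern is a simple assumption; tune it to your document types.

```python
# Heading-aware chunking sketch: split on section boundaries, not word counts,
# and attach the metadata you already track to every chunk.
import re

HEADING = re.compile(r"^(?:[IVX]+\.\s|\d+(?:\.\d+)*\s|[A-Z][A-Z &]{4,}$)",
                     re.MULTILINE)

def chunk_document(text: str, metadata: dict) -> list[dict]:
    starts = sorted({0, *(m.start() for m in HEADING.finditer(text))})
    bounds = starts + [len(text)]

    chunks = []
    for i, (begin, end) in enumerate(zip(bounds, bounds[1:])):
        body = text[begin:end].strip()
        if body:
            chunks.append({
                "text": body,
                "chunk_no": i,
                # e.g. matter_id, client, doc_type, author, date, privilege,
                # bates_range -- whatever you already track.
                **metadata,
            })
    return chunks
```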

Routing deserves a note, because it’s your safety valve. You don’t need elaborate code; you need clear rules. Privileged or high‑sensitivity materials never leave the walls—local SLM only. Quick lookups, internal notes, and rough outlines also stay local. But when the output is client‑facing or the structure is complex (appellate brief style, for example), the router can hand the draft text and already‑quoted snippets to the commercial model for flow and readability. That keeps the benefit of large‑model polish without handing it the keys to your DMS.
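
“Clear rules” really can be a dozen lines. A sketch, with the task labels and flags invented for illustration; the ordering of the checks is the substance:

```python
# Routing sketch: privileged and sensitive work never leaves the local SLM;
# external polish is allowed only with snippet-only context, retention off.
def choose_engine(task_type: str, privileged: bool, high_sensitivity: bool,
                  client_facing: bool) -> str:
    if privileged or high_sensitivity:
        return "local_slm"                      # never leaves the walls
    if task_type in {"lookup", "internal_note", "rough_outline"}:
        return "local_slm"                      # no need for outside help
    if client_facing or task_type in {"appellate_brief", "final_polish"}:
        return "external_llm_snippets_only"     # polish pass, snippets only
    return "local_slm"                          # default to the safe side
```

So a non-privileged appellate-style brief routes to the external polish pass, while anything flagged privileged or high-sensitivity short-circuits to local regardless of task.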

Lawyers also want to know where the risks live and how to fence them. The short version: treat retrieval as an access‑controlled search and treat generation as a drafting aid that must cite its work. Keep indexes and the local model inside your trust boundary; enforce matter‑level ACLs; tag privilege at ingestion; and turn off retention and logging on any external calls. Make “Not in record” an acceptable answer so the model won’t invent facts. And log enough to audit (queries, doc IDs, and timestamps) without hoarding private text.
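
Two of those fences translate almost directly into code. The sketch below shows matter-level access control applied at query time and an audit log that keeps queries, document IDs, and timestamps but never the underlying text; the field names are assumptions.

```python
# Sketch of query-time ACL filtering and text-free audit logging.
import json, time

def authorized_chunks(chunks: list[dict], user_matters: set,
                      allow_privileged: bool) -> list[dict]:
    # A chunk is visible only if the user is on the matter, and privileged
    # material only if they are cleared for it.
    return [
        c for c in chunks
        if c["matter_id"] in user_matters
        and (allow_privileged or not c.get("privilege", False))
    ]

def audit(user: str, query: str, doc_ids: list[str],
          path: str = "audit.log") -> None:
    # Enough to reconstruct who asked what and which documents surfaced,
    # without hoarding the private text itself.
    entry = {"ts": time.time(), "user": user, "query": query, "doc_ids": doc_ids}
    with open(path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
```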

You’ll know it’s working when mundane tasks shrink. Useful measures include the percentage of claims with valid source quotes, coverage on a small test set of partner questions per practice group, time saved on first drafts and chronologies, and the absence of cross‑matter leaks in red‑team drills. If you see “pretty but wrong” drafts, tighten your prompts to require quotes and citations for factual claims and ask the model to list assertions and their sources at the end.
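
The first of those measures is easy to automate. Here is a sketch that checks whether each quoted passage in a draft actually appears in the cited source file, assuming the `"quote" [filename:page]` citation style used in the sketches above:

```python
# Sketch: share of quoted passages that appear verbatim in their cited source.
import re

def quote_support_rate(draft: str, sources: dict[str, str]) -> float:
    # Find "quoted text" [filename:page] pairs (citation format assumed).
    cited = re.findall(r'"([^"]+)"\s*\[([^\]:]+):[^\]]*\]', draft)
    if not cited:
        return 0.0
    supported = sum(1 for quote, filename in cited
                    if quote in sources.get(filename, ""))
    return supported / len(cited)
```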

Finally, a few rough edges are normal at the start. Over‑ or under‑chunking hurts retrieval; hybrid search fixes a lot of that. Stale indexes cause odd misses; schedule incremental updates. Unlabeled privilege can leak; tag it early and enforce it at query time. Most teams begin with one matter, one practice group, and a short set of tasks (motion drafting, depo kits, chronologies). Once those are smooth, you can expand without changing the underlying pattern: retrieve narrowly, generate conservatively, and route intelligently.
