Computational Legal Studies
Workshop: Measuring Case Similarity
Umeå University · Johan Lindholm
Why Case Similarity?
Finding similar cases is one of the most common tasks in legal practice. Lawyers search for precedents that share factual or legal features with a present dispute. Opposing counsel tries to distinguish those precedents — arguing that apparent similarities are superficial and the differences are what matter.
This is not merely a practical skill: as Frederick Schauer has argued, the question of what makes two cases “similar enough” to treat alike lies at the heart of legal reasoning itself. There is no algorithm-free answer. But computational methods can make the underlying dimensions of similarity explicit — and that is exactly what makes them useful as teaching tools.
This workshop uses decisions from the Court of Arbitration for Sport (CAS) as a case study. The CAS corpus is attractive because it is manageable in size, spans diverse sports and legal issues, and decisions are available in English.
Method 1: TF-IDF
Term Frequency–Inverse Document Frequency
The simplest way to compare legal texts is to treat each decision as a bag of words and ask: which words appear often in this document but rarely across the corpus? TF-IDF weights terms accordingly. Similarity between two cases is then the cosine of their TF-IDF vectors — a number between 0 (nothing in common) and 1 (identical term distribution).
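To make the weighting concrete, here is a minimal base-R sketch on three invented one-sentence "decisions" (not real CAS text). It assumes the common definitions tf = a term's share of the document's tokens and idf = log(N / number of documents containing the term). Other variants exist, and the toy example removes no stop words.

```r
# Toy corpus (invented for illustration; not real CAS text)
docs <- c(
  d1 = "athlete tested positive for a prohibited substance",
  d2 = "athlete sanctioned after a prohibited substance was found",
  d3 = "club failed to pay the agreed transfer fee"
)

# Bag of words: split each document into lower-case tokens
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab <- sort(unique(unlist(tokens)))

# Term counts: one row per document, one column per term
counts <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}))
rownames(counts) <- names(docs)
colnames(counts) <- vocab

# tf: share of a document's tokens; idf: log(N / documents containing the term)
tf <- counts / rowSums(counts)
idf <- log(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(tf, 2, idf, `*`)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_d1_d2 <- cosine(tfidf["d1", ], tfidf["d2", ])  # the two doping sentences share terms
sim_d1_d3 <- cosine(tfidf["d1", ], tfidf["d3", ])  # no shared terms, so the cosine is 0
```

Printing `sort(tfidf["d1", ], decreasing = TRUE)` shows which terms carry the most weight in the first document, which is the kind of inspection that makes the method interpretable.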
Strengths
Fast and interpretable — you can inspect exactly which terms drove a match. No training data needed.
Limitations
Vocabulary mismatch: different words for related concepts (“transfer fee” vs. “registration fee”) match only on the words they literally share. No understanding of context or word order.
library(tidytext)  # cast_dtm() also needs the tm package to be installed
library(dplyr)
library(tidyr)
library(proxy)
# cas_corpus: data frame with columns decision_id and text
tfidf <- cas_corpus |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  count(decision_id, word) |>
  bind_tf_idf(word, decision_id, n)
# Build a document-term matrix weighted by TF-IDF
dtm <- tfidf |>
  cast_dtm(decision_id, word, tf_idf)
# Compute pairwise cosine similarity; as.matrix() turns the compact
# result into a full matrix that can be indexed by decision id
sim_matrix <- as.matrix(simil(as.matrix(dtm), method = "cosine"))
# Top-5 most similar cases to a query, excluding the query itself
query_id <- "ECLI:INT:CAS:2023:A-10014-award"
query_sims <- sim_matrix[query_id, colnames(sim_matrix) != query_id]
head(sort(query_sims, decreasing = TRUE), 5)
Method 2: Embeddings
Semantic vector representation
Neural language models (such as BERT or the models in the sentence-transformers library) map text into a high-dimensional space where semantically related passages end up geometrically close. Rather than counting words, the model represents each passage by how its language is used in context, as learned from large amounts of text. A “doping ban” and a “suspension from competition” will therefore tend to land near each other even though the exact words differ.
In this workshop we use paragraph-level embeddings (1,024 dimensions) pre-computed from the CAS corpus, then aggregated to the decision level by mean-pooling — averaging all paragraph vectors in a case into a single document vector.
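Mean-pooling itself is only a few lines. The sketch below uses random placeholder vectors with 4 dimensions instead of 1,024, and the names `paragraph_vecs`, `paragraph_decision` and `embeddings_toy` are hypothetical. It is not the pipeline that produced the workshop embeddings.

```r
# Toy paragraph embeddings: 5 paragraphs x 4 dimensions (random placeholders)
set.seed(1)
paragraph_vecs <- matrix(rnorm(5 * 4), nrow = 5)
# Which decision each paragraph belongs to (hypothetical ids)
paragraph_decision <- c("case_A", "case_A", "case_A", "case_B", "case_B")

# Sum the paragraph vectors within each decision ...
sums <- rowsum(paragraph_vecs, group = paragraph_decision)
# ... and divide each row by that decision's paragraph count
n_paragraphs <- as.vector(table(paragraph_decision)[rownames(sums)])
decision_vecs <- sums / n_paragraphs

# Store as a named list with one vector per decision
embeddings_toy <- setNames(
  lapply(rownames(decision_vecs), function(id) decision_vecs[id, ]),
  rownames(decision_vecs)
)
```

Averaging gives every paragraph equal weight, so a long procedural history counts as much as the paragraphs containing the tribunal's reasoning. That is a design choice worth keeping in mind when interpreting the results.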
Strengths
Handles synonymy and paraphrase; captures topical similarity beyond literal word overlap.
Limitations
A black box — hard to explain why two cases are near each other. Requires pre-trained models; computationally expensive.
# embeddings: named list of numeric vectors (one 1024-dim vector per decision)
# Load pre-computed embeddings, e.g.:
# embeddings <- readRDS("cas_embeddings.rds")
cosine <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
query_id <- "ECLI:INT:CAS:2023:A-10014-award"
query_vec <- embeddings[[query_id]]
sims <- sapply(names(embeddings), function(id) {
  if (id == query_id) return(-Inf)
  cosine(query_vec, embeddings[[id]])
})
head(sort(sims, decreasing = TRUE), 5)
Method 3: Citation Network
Bibliographic coupling
When two cases cite many of the same earlier decisions, they are likely addressing related legal issues — even if their texts share no words. This idea, bibliographic coupling, comes from information science but translates directly to law: shared precedent signals shared legal problem space.
An alternative is co-citation analysis: two cases are similar if they are frequently cited together by later decisions. Co-citation captures how the field has read them, not just what they say.
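Co-citation can be sketched with the same kind of adjacency matrix, multiplied the other way round. The example below uses base R and an invented citation list with hypothetical ids. It returns raw co-citation counts without any normalisation.

```r
# Toy citation list (hypothetical ids): each row means from_id cites to_id
citations_toy <- data.frame(
  from_id = c("C", "C", "D", "D", "D", "E"),
  to_id   = c("A", "B", "A", "B", "X", "A"),
  stringsAsFactors = FALSE
)

# Adjacency matrix: adj[i, j] = 1 if case i cites case j
ids <- sort(unique(c(citations_toy$from_id, citations_toy$to_id)))
adj <- matrix(0, length(ids), length(ids), dimnames = list(ids, ids))
adj[cbind(citations_toy$from_id, citations_toy$to_id)] <- 1

# Co-citation: entry [i, j] counts the decisions that cite both i and j
cocitation <- crossprod(adj)  # t(adj) %*% adj
diag(cocitation) <- 0         # ignore a case's co-citation with itself

cocitation["A", "B"]  # A and B are cited together by C and by D
```

Bibliographic coupling looks at the rows of the matrix (what a case cites) and is fixed once the decision is issued. Co-citation looks at the columns (who cites the case) and can keep changing as later decisions arrive.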
Strengths
Language-independent; grounded in legal reasoning; reveals connections invisible to text analysis.
Limitations
Depends on citation density — early decisions or those in niche areas have few citations. Requires a structured citation database.
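library(igraph)
library(Matrix)  # sparse-matrix methods for tcrossprod() and rowSums()
# citations: data frame with columns from_id and to_id
g <- graph_from_data_frame(citations, directed = TRUE)
adj <- as_adjacency_matrix(g, sparse = TRUE)  # adj[i, j] = 1 if case i cites case j
# Bibliographic coupling: number of cited cases two decisions share
coupling <- as.matrix(tcrossprod(adj))
# Cosine normalisation by the number of cases each decision cites
degrees <- rowSums(adj)
norm_mat <- outer(sqrt(degrees), sqrt(degrees))
bib_coupling <- coupling / norm_mat
bib_coupling[is.nan(bib_coupling)] <- 0  # decisions that cite nothing
# Top similar cases for a query, excluding the query itself
query_id <- "ECLI:INT:CAS:2023:A-10014-award"
scores <- bib_coupling[query_id, colnames(bib_coupling) != query_id]
head(sort(scores, decreasing = TRUE), 5)
Interactive Demo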
Select a CAS case and compare the top-5 similar cases returned by each method. Notice where they agree — and where they disagree. What does each method think “similarity” means?
Methods disagree on purpose; that is the point of the exercise.
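One way to put a number on the agreement is the overlap between two top-5 lists. The sketch below uses invented placeholder ids, not output from the demo, and takes Jaccard overlap as one possible measure among several.

```r
# Hypothetical top-5 results for one query (placeholder ids, not real output)
top5 <- list(
  tfidf      = c("case_01", "case_02", "case_03", "case_04", "case_05"),
  embeddings = c("case_02", "case_03", "case_06", "case_07", "case_01"),
  network    = c("case_08", "case_02", "case_09", "case_10", "case_11")
)

# Jaccard overlap: shared cases divided by all cases named by either method
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise overlap between the methods
method_names <- names(top5)
overlap <- outer(
  method_names, method_names,
  Vectorize(function(m1, m2) jaccard(top5[[m1]], top5[[m2]]))
)
dimnames(overlap) <- list(method_names, method_names)

# Cases that every method returns
agreed <- Reduce(intersect, top5)
```

Jaccard overlap ignores rank order, so two methods that return the same five cases in reverse order still score 1. A rank-sensitive measure would tell a different story.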
Workshop Exercises
Exercise 1 — Inspect a TF-IDF match
Select a doping case in the demo. Look at the TF-IDF results and identify two terms that are likely driving the match. Are those terms legally relevant or just boilerplate? What would you remove from the term list to improve precision?
Exercise 2 — Compare methods
Find a case where TF-IDF and embeddings agree on a similar case but the network method does not. Formulate a hypothesis: why might a case be textually similar but not share precedents with another?
Exercise 3 — The distinguishing exercise
Take the top TF-IDF match for any case. Write a one-paragraph argument distinguishing the two cases on factual or legal grounds — as opposing counsel might do. Which features did the algorithm fail to consider?
Exercise 4 — Modify the R code
Extend the TF-IDF script to filter the term list to only nouns (using a part-of-speech tagger). Does this improve the relevance of the results? What does this suggest about the importance of pre-processing choices?
Questions?
Contact johan.lindholm@umu.se