Corpus systems case study

Sacred Harp Corpus

A local research corpus and knowledge graph that links songs, hymn texts, and tunebooks across shape-note traditions through canonicalized text identity.

Research corpus

5,300+ generated notes across songs, texts, and tunebooks

The graph makes cross-book repertoire overlap, text reuse, tune migration, and edition-level change easier to inspect inside a local research environment.

Song, text, and book entities modeled separately
First-line canonicalization links recurring hymn texts
Generated notes preserve local, inspectable research structure

Problem / Context

Problem

The Sacred Harp corpus began from a practical research question: what changes when a beloved repertoire is revised, renumbered, moved, or split across editions and related books? A flat list of songs could not answer that question. The work needed a structure that could show continuity and rupture at the same time.

The project became a local research corpus and knowledge graph for mapping relationships among songs, hymn texts, and tunebooks across multiple shape-note traditions.

Implementation

The implementation repository contains a pipeline that scrapes source indexes, builds combined first-line data, canonicalizes text identities, and exports Obsidian-ready notes. A local CLI supports corpus inspection.

The key abstraction is first-line canonicalization. Hymn texts recur across books under different tune names, page numbers, editorial conventions, and formatting choices. Canonical first-line identity gives the graph a stable way to connect related material without pretending the sources are cleaner than they are.
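
As a rough illustration of the idea, the sketch below reduces a first line to a comparison key and groups entries that share it. The normalization rules (accent stripping, lowercasing, dropping apostrophes, collapsing punctuation), the function names, and the sample entries are all assumptions made for this example. The repository's actual rules are not described here and may differ.

```python
import re
import unicodedata
from collections import defaultdict


def canonical_first_line(first_line: str) -> str:
    """Return a hypothetical canonical key for a hymn text's first line."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    text = unicodedata.normalize("NFKD", first_line)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Drop straight and curly apostrophes so both spellings of "e'er" produce the same key.
    text = re.sub(r"['\u2019]", "", text)
    # Treat every other run of punctuation or whitespace as a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def group_by_text(entries):
    """Group (book, page, first_line) tuples under their canonical key."""
    groups = defaultdict(list)
    for book, page, first_line in entries:
        groups[canonical_first_line(first_line)].append((book, page))
    return dict(groups)


# Placeholder entries, invented for illustration only (not corpus data).
entries = [
    ("Book A", "p. 1", "Amazing grace! (how sweet the sound)"),
    ("Book B", "p. 2", "AMAZING GRACE, how sweet the sound,"),
    ("Book B", "p. 3", "O for a thousand tongues to sing"),
]
groups = group_by_text(entries)
```

Here the first two placeholder entries collapse to one key despite differing case and punctuation, while the third stays separate. A key this aggressive can also merge distinct texts that happen to open the same way, so any real rule set would likely need manual review of the resulting groups.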

Approach / System Design

Architecture

Source acquisition from multiple online tunebook indexes.
Normalization of titles, first lines, source metadata, and book context.
Canonical text notes indexed by first line.
Song notes for specific tune entries in specific books.
Book-level nodes and tags for filtering, clustering, and tradition context.
Generated Obsidian notes that link songs to texts, songs to books, and texts back to their song manifestations.
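
The linking described in the list above could be sketched roughly as follows. The `SongEntry` fields, the note-naming scheme, the tag layout, and the sample entries are hypothetical and chosen only for illustration. The `[[...]]` wikilinks and the YAML `tags` frontmatter are standard Obsidian conventions, but the project's real note templates are not shown in this write-up.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SongEntry:
    """One tune entry in one book (hypothetical fields)."""
    title: str
    book: str
    page: str
    text_key: str  # canonical first-line key shared across books


def slug(value: str) -> str:
    """Make a tag-safe slug: lowercase, with non-alphanumeric runs as hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def song_note_name(song: SongEntry) -> str:
    """Assumed naming scheme that keeps same-titled tunes in different books apart."""
    return f"{song.title} ({song.book}, {song.page})"


def render_song_note(song: SongEntry) -> str:
    """Song note: tagged by book, linked to its book note and its text note."""
    return "\n".join([
        "---",
        "tags:",
        "  - song",
        f"  - book/{slug(song.book)}",
        "---",
        f"Book: [[{song.book}]]",
        f"Text: [[{song.text_key}]]",
        "",
    ])


def render_text_note(text_key: str, songs: list) -> str:
    """Text note: links back to every song entry that uses this text."""
    links = [f"- [[{song_note_name(s)}]]" for s in songs if s.text_key == text_key]
    return "\n".join(["---", "tags:", "  - text", "---", "Songs:", *links, ""])


# Placeholder entries, invented for illustration only (not corpus data).
key = "amazing grace how sweet the sound"
songs = [
    SongEntry("Tune One", "Book A", "p. 1", key),
    SongEntry("Tune Two", "Book B", "p. 2", key),
]
song_note = render_song_note(songs[0])
text_note = render_text_note(key, songs)
```

Keeping songs, texts, and books as separate notes means one text note can collect every tune setting of the same words, which is what makes cross-book reuse visible in the graph view.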

The corpus is local-first because the graph needed to remain inspectable and editable as a research environment. Obsidian was not only a display layer; it was the working surface where generated structure and interpretation could coexist without being collapsed into one database view.

Outcomes / Lessons

Outcomes

The current corpus contains more than 5,300 interconnected notes, including roughly 3,326 song notes and 1,972 canonicalized text notes across 10 tunebook traditions. It makes repertoire overlap, text reuse, tune migration, and edition-level change easier to see.

The project is strongest as evidence of data modeling, corpus normalization, and research tooling: it turns a personal research obsession into a structured, queryable environment.

Artifacts

Graph model: song nodes, text nodes, book context, and cross-book links.
Corpus scale summary: 5,300+ generated notes across songs, texts, and tunebooks.
Pipeline summary: source scraping, first-line normalization, canonicalization, note export, and review artifacts.