Implement online metadata lookup in shanty-tag #4

Closed
opened 2026-03-17 14:17:57 -04:00 by connor · 0 comments
Owner

The shanty-tag crate is responsible for filling in missing or incorrect metadata on music files. The MVP approach is "look online first" — query online databases (primarily MusicBrainz) using whatever partial metadata is available (artist + title, album name, etc.) to find the correct tags.

This issue covers:

  1. MusicBrainz client: implement a client that queries the MusicBrainz API to look up track/album/artist metadata. The API is free but rate limited (roughly 1 request/second per client IP). The client should:
    • Search by artist + title to find a matching recording
    • Search by album name + artist to find a matching release
    • Retrieve full metadata for a matched recording/release (title, artist, album, track number, year, genre, cover art URL via Cover Art Archive, MusicBrainz IDs)
    • Respect rate limits (implement a rate limiter / request queue)
    • Handle API errors gracefully
  2. Tag matching logic — given a track from the database (which may have partial metadata), attempt to find a match online:
    • If artist + title are available, search for the recording
    • If only a filename is available, attempt to parse artist/title from the filename (common patterns like "Artist - Title.mp3")
    • Score potential matches by similarity to existing metadata (fuzzy string matching)
    • Allow a configurable confidence threshold — only apply tags if the match confidence is above the threshold
  3. Database update — when a match is found and accepted, update the track (and album/artist) records in shanty-db with the new metadata. Also update the MusicBrainz IDs for future reference.
  4. File tag writing — optionally write the updated metadata back to the actual music file's embedded tags (ID3, Vorbis comments, etc.). This should be an opt-in behavior since some users may not want their files modified.
  5. CLI interface — the shanty-tag binary should accept:
    • A path to the database (optional, with default)
    • --all to tag all untagged/partially-tagged tracks in the database
    • --track <id> to tag a specific track
    • --dry-run to show what would be changed without applying
    • --write-tags to enable writing tags back to files
    • --confidence <0.0-1.0> to set the match threshold (default ~0.8)
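
A minimal sketch of the tag matching step (item 2), assuming a plain Levenshtein-based similarity and an unweighted average of artist and title scores. The helper names (`parse_filename`, `normalize`, `similarity`, `match_score`, `accept`) are hypothetical, and the comparison treats a score equal to the threshold as accepted, which is an assumption rather than a settled design choice:

```rust
/// Hypothetical helper: split "Artist - Title.mp3" into (artist, title).
/// Assumes the name carries a file extension.
pub fn parse_filename(name: &str) -> Option<(String, String)> {
    // Drop everything after the last dot (the extension).
    let stem = name.rsplit_once('.').map_or(name, |(stem, _ext)| stem);
    let (artist, title) = stem.split_once(" - ")?;
    let (artist, title) = (artist.trim(), title.trim());
    if artist.is_empty() || title.is_empty() {
        return None;
    }
    Some((artist.to_string(), title.to_string()))
}

/// Lowercase, drop punctuation, and collapse runs of whitespace.
pub fn normalize(s: &str) -> String {
    s.chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .flat_map(char::to_lowercase)
        .collect::<String>()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Classic two-row Levenshtein edit distance.
fn levenshtein(a: &[char], b: &[char]) -> usize {
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            // substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Similarity in 0.0..=1.0 after normalization (1.0 means identical).
pub fn similarity(a: &str, b: &str) -> f64 {
    let a: Vec<char> = normalize(a).chars().collect();
    let b: Vec<char> = normalize(b).chars().collect();
    let longest = a.len().max(b.len());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(&a, &b) as f64 / longest as f64
}

/// Confidence for an (artist, title) pair against a candidate pair.
/// Assumption: both fields are weighted equally.
pub fn match_score(have: (&str, &str), candidate: (&str, &str)) -> f64 {
    (similarity(have.0, candidate.0) + similarity(have.1, candidate.1)) / 2.0
}

/// Apply tags only when the score reaches the configured threshold.
pub fn accept(score: f64, threshold: f64) -> bool {
    score >= threshold
}
```

With these definitions, a misspelled pair such as ("Quen", "Bohemian Rapsody") still scores above 0.8 against ("Queen", "Bohemian Rhapsody"), while an unrelated candidate falls well below it. A real implementation might prefer an existing fuzzy-matching crate over a hand-rolled distance.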

Design Considerations

  • The data backend should be trait-based so that alternative providers (Last.fm, Discogs, etc.) can be added later without changing the core logic. Define a MetadataProvider trait with methods like search_recording, search_release, get_recording_details, etc.
  • MusicBrainz requires a descriptive User-Agent header — use something like Shanty/0.1.0 (https://github.com/your-repo).
  • Batch operations should be parallelized where possible, but respect API rate limits.
  • Store which provider supplied the metadata so the user knows the source.
  • We should strongly consider writing, and exposing, routines that clean up and standardize artist, album, and song titles (handling odd characters, etc.).
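
One possible shape for the provider abstraction described above, as a sketch only. The method names `search_recording` and `get_recording_details` come from this issue; the signatures, the `Recording` and `ProviderError` types, the `MockProvider`, and the `first_match` helper are all assumptions for illustration. `search_release` is omitted for brevity, and the methods are synchronous here even though a real client may well be async:

```rust
use std::fmt;

/// Hypothetical search hit; field names are illustrative, not the shanty-db schema.
#[derive(Debug, Clone, PartialEq)]
pub struct Recording {
    pub id: String,
    pub artist: String,
    pub title: String,
    /// Which provider supplied this metadata.
    pub source: &'static str,
}

/// Failure modes a provider can report. "No results" is an empty Vec, not an error.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    RateLimited,
    Network(String),
}

impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::RateLimited => write!(f, "rate limited by provider"),
            Self::Network(msg) => write!(f, "network error: {msg}"),
        }
    }
}

impl std::error::Error for ProviderError {}

/// Backend-agnostic lookup interface (signatures are assumptions).
pub trait MetadataProvider {
    fn name(&self) -> &'static str;
    fn search_recording(&self, artist: &str, title: &str)
        -> Result<Vec<Recording>, ProviderError>;
    fn get_recording_details(&self, id: &str) -> Result<Option<Recording>, ProviderError>;
}

/// In-memory stand-in, useful for unit tests with mocked responses.
pub struct MockProvider {
    pub recordings: Vec<Recording>,
}

impl MetadataProvider for MockProvider {
    fn name(&self) -> &'static str {
        "mock"
    }

    fn search_recording(&self, artist: &str, title: &str)
        -> Result<Vec<Recording>, ProviderError> {
        Ok(self
            .recordings
            .iter()
            .filter(|r| {
                r.artist.eq_ignore_ascii_case(artist) && r.title.eq_ignore_ascii_case(title)
            })
            .cloned()
            .collect())
    }

    fn get_recording_details(&self, id: &str) -> Result<Option<Recording>, ProviderError> {
        Ok(self.recordings.iter().find(|r| r.id == id).cloned())
    }
}

/// Core logic depends only on the trait, so providers can be swapped freely.
pub fn first_match(
    provider: &dyn MetadataProvider,
    artist: &str,
    title: &str,
) -> Result<Option<Recording>, ProviderError> {
    Ok(provider.search_recording(artist, title)?.into_iter().next())
}
```

Because `first_match` takes `&dyn MetadataProvider`, a MusicBrainz-backed implementation and the mock can be passed interchangeably, and the `source` field records where each result came from.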

Acceptance Criteria

  • MusicBrainz API client is implemented with proper rate limiting
  • Given a track with artist + title, the tagger finds a matching recording and retrieves full metadata
  • Fuzzy matching works — minor spelling differences don't prevent matches
  • Database records are updated with new metadata and MusicBrainz IDs
  • --write-tags actually writes metadata back into the music file
  • --dry-run shows proposed changes without applying them
  • Confidence threshold filtering works
  • MetadataProvider trait exists, and MusicBrainz is the first implementation
  • CLI interface works as specified
  • Errors from the API (rate limits, network issues, no results) are handled gracefully
  • Tests exist for matching logic (unit tests with mocked API responses)
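
For the rate-limiting criterion, a minimal-interval limiter is one simple option. The sketch below is an assumption about how it could look, not a prescribed design: `RateLimiter`, `reserve`, and `wait` are hypothetical names, and the limiter is single-threaded (a shared client would need to wrap it in a mutex or use an async equivalent). Passing the current time into `reserve` keeps the logic testable without sleeping:

```rust
use std::time::{Duration, Instant};

/// Allows at most one request per `interval` by handing out time slots.
pub struct RateLimiter {
    interval: Duration,
    /// Earliest instant at which the next request may be sent.
    next_allowed: Option<Instant>,
}

impl RateLimiter {
    pub fn new(interval: Duration) -> Self {
        Self { interval, next_allowed: None }
    }

    /// Reserve the next slot and return how long the caller must wait
    /// (measured from `now`) before sending its request.
    pub fn reserve(&mut self, now: Instant) -> Duration {
        let start = match self.next_allowed {
            Some(t) if t > now => t, // a slot is already booked in the future
            _ => now,                // idle long enough: send immediately
        };
        self.next_allowed = Some(start + self.interval);
        start - now
    }

    /// Blocking convenience wrapper around `reserve`.
    pub fn wait(&mut self) {
        let delay = self.reserve(Instant::now());
        if !delay.is_zero() {
            std::thread::sleep(delay);
        }
    }
}
```

A caller would construct the limiter with `Duration::from_secs(1)` and call `wait()` before each API request. Back-to-back reservations queue up one interval apart, and a caller that arrives after the last slot has passed is not delayed.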

Dependencies

  • Issue #1 (workspace scaffolding)
  • Issue #2 (shared database schema)
  • Issue #3 (music indexing — so there are tracks in the DB to tag)
connor added the High Priority and MVP labels 2026-03-17 14:18:27 -04:00
connor started working 2026-03-17 14:45:44 -04:00
connor worked for 36 minutes 2026-03-17 15:22:06 -04:00
Reference: Shanty/Main#4