Implement music file scanning in shanty-index #3

Closed
opened 2026-03-17 14:09:40 -04:00 by connor · 0 comments
Owner

The shanty-index crate is responsible for scanning a directory tree of music files and extracting metadata from them. This is the foundation of Shanty — before anything else can happen, the app needs to know what music the user currently has.

This issue covers:

  1. Directory scanning — recursively walk a given directory path and identify music files by extension (.mp3, .flac, .ogg, .opus, .m4a, .wav, .wma, .aac, .alac at minimum). Use a crate like walkdir for efficient traversal.
  2. Metadata extraction — for each music file found, extract embedded metadata (ID3 tags for MP3, Vorbis comments for FLAC/OGG, etc.). Use a crate like lofty or symphonia that handles multiple formats uniformly. Extract at minimum: title, artist, album, album artist, track number, disc number, year/date, genre, duration, codec, bitrate.
  3. Database insertion — insert (or update) the extracted metadata into the shanty-db database. Handle the case where a file has already been indexed (update if file mtime changed, skip if unchanged). Create artist and album records as needed based on the metadata found.
  4. CLI interface — the shanty-index binary should accept:
    • A path to scan (required)
    • A path to the database file (optional, with a sensible XDG compliant default)
    • A --dry-run flag that reports what would be indexed without writing to DB
    • Verbosity flags
  5. Concurrency — scanning and metadata extraction should be parallelized. Use rayon or tokio tasks to process multiple files concurrently. This is important because large libraries can have tens of thousands of files.

Design Considerations

  • Files with missing metadata should still be indexed — store whatever is available and leave the rest as NULL. The tagging crate will fill in gaps later.
  • Track file modification time so re-indexing can be incremental (only process changed files).
  • Consider emitting structured progress information (e.g., files scanned, files indexed, errors) that the web UI can consume later.
  • The library API should be usable without the CLI — other crates (especially the web backend) will call indexing functions directly.
  • Proper logging should be a big consideration (for this and all future crates)

Acceptance Criteria

  • Given a directory containing music files, shanty-index scans recursively and finds all supported formats
  • Metadata is extracted from each file and stored in the database via shanty-db
  • Re-running the scan on the same directory is incremental — unchanged files are skipped
  • The CLI binary works with shanty-index /path/to/music
  • --dry-run mode works and outputs what would be indexed
  • Scanning is parallelized (measurably faster than sequential on large directories)
  • Files with partial/missing metadata are still indexed (with NULLs for missing fields)
  • Artist and album records are created/linked based on extracted metadata
  • Errors on individual files (corrupt, unreadable) are logged but don't halt the scan
  • Unit tests exist for metadata extraction logic
  • Integration test: create temp dir with test audio files, scan it, verify DB contents

Dependencies

  • Issue #1 (workspace scaffolding)
  • Issue #2 (shared database schema)
The `shanty-index` crate is responsible for scanning a directory tree of music files and extracting metadata from them. This is the foundation of Shanty — before anything else can happen, the app needs to know what music the user currently has. This issue covers: 1. **Directory scanning** — recursively walk a given directory path and identify music files by extension (`.mp3`, `.flac`, `.ogg`, `.opus`, `.m4a`, `.wav`, `.wma`, `.aac`, `.alac` at minimum). Use a crate like `walkdir` for efficient traversal. 2. **Metadata extraction** — for each music file found, extract embedded metadata (ID3 tags for MP3, Vorbis comments for FLAC/OGG, etc.). Use a crate like `lofty` or `symphonia` that handles multiple formats uniformly. Extract at minimum: title, artist, album, album artist, track number, disc number, year/date, genre, duration, codec, bitrate. 3. **Database insertion** — insert (or update) the extracted metadata into the `shanty-db` database. Handle the case where a file has already been indexed (update if file mtime changed, skip if unchanged). Create artist and album records as needed based on the metadata found. 4. **CLI interface** — the `shanty-index` binary should accept: - A path to scan (required) - A path to the database file (optional, with a sensible XDG compliant default) - A `--dry-run` flag that reports what would be indexed without writing to DB - Verbosity flags 5. **Concurrency** — scanning and metadata extraction should be parallelized. Use `rayon` or `tokio` tasks to process multiple files concurrently. This is important because large libraries can have tens of thousands of files. ### Design Considerations - Files with missing metadata should still be indexed — store whatever is available and leave the rest as NULL. The tagging crate will fill in gaps later. - Track file modification time so re-indexing can be incremental (only process changed files). - Consider emitting structured progress information (e.g., files scanned, files indexed, errors) that the web UI can consume later. - The library API should be usable without the CLI — other crates (especially the web backend) will call indexing functions directly. - Proper logging should be a big consideration (for this and all future crates) ### Acceptance Criteria - [ ] Given a directory containing music files, `shanty-index` scans recursively and finds all supported formats - [ ] Metadata is extracted from each file and stored in the database via `shanty-db` - [ ] Re-running the scan on the same directory is incremental — unchanged files are skipped - [ ] The CLI binary works with `shanty-index /path/to/music` - [ ] `--dry-run` mode works and outputs what would be indexed - [ ] Scanning is parallelized (measurably faster than sequential on large directories) - [ ] Files with partial/missing metadata are still indexed (with NULLs for missing fields) - [ ] Artist and album records are created/linked based on extracted metadata - [ ] Errors on individual files (corrupt, unreadable) are logged but don't halt the scan - [ ] Unit tests exist for metadata extraction logic - [ ] Integration test: create temp dir with test audio files, scan it, verify DB contents ### Dependencies - Issue #1 (workspace scaffolding) - Issue #2 (shared database schema)
connor added the HighPriorityMVP labels 2026-03-17 14:09:45 -04:00
connor started working 2026-03-17 14:21:18 -04:00
connor worked for 24 minutes 2026-03-17 14:45:19 -04:00
Sign in to join this conversation.
1 Participants
Notifications
Total Time Spent: 24 minutes
connor
24 minutes
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Shanty/Main#3