# Subtitle Attachment Cleanup
A Python script to automatically clean up unnecessary font attachments from MKV video files.
When you remove unwanted subtitle streams (such as foreign-language tracks) from an MKV with tools like MKVToolNix, the font files attached for those subtitles are typically left behind and needlessly inflate the file size. This script reads your MKV files, inspects the remaining SSA/ASS subtitle tracks to find out which fonts are actually used, and builds a new MKV file that omits any orphaned or unused font attachments.
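As a rough illustration of the first step, the sketch below asks `mkvmerge -J` for a JSON report and separates ASS/SSA tracks, font attachments, and other attachments. The function names, the MIME type list, and the extension fallback are assumptions made for this example; the actual script may detect fonts differently.
```python
import json
import subprocess

# Assumption: content types commonly seen on font attachments.
# The real script may use a different list.
FONT_MIME_TYPES = {
    "application/x-truetype-font",
    "application/x-font-ttf",
    "application/x-font-otf",
    "application/vnd.ms-opentype",
    "application/font-sfnt",
}
FONT_EXTENSIONS = (".ttf", ".otf", ".ttc", ".otc")


def identify(mkv_path):
    """Run `mkvmerge -J` and return its JSON report as a dict."""
    result = subprocess.run(
        ["mkvmerge", "-J", str(mkv_path)],
        capture_output=True, encoding="utf-8", check=True,
    )
    return json.loads(result.stdout)


def split_report(report):
    """Return (ass_track_ids, font_attachments, other_attachments)."""
    ass_tracks = [
        track["id"]
        for track in report.get("tracks", [])
        if track.get("type") == "subtitles"
        and track.get("properties", {}).get("codec_id")
        in ("S_TEXT/ASS", "S_TEXT/SSA")
    ]
    fonts, others = [], []
    for att in report.get("attachments", []):
        mime = att.get("content_type", "").lower()
        name = att.get("file_name", "")
        is_font = (
            mime.startswith("font/")
            or mime in FONT_MIME_TYPES
            or name.lower().endswith(FONT_EXTENSIONS)
        )
        (fonts if is_font else others).append(name)
    return ass_tracks, fonts, others
```
Calling `split_report(identify("input.mkv"))` would then give the track IDs to extract and the attachments to inspect, while the non-font list can be carried over untouched.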
## Features
- Extracts and parses `.ass` / `.ssa` subtitle tracks from MKV containers.
- Identifies both top-level font declarations (in `[V4+ Styles]`) and inline `\fn` font overrides (in `[Events]`).
- Reads font metadata with `fonttools` to match requested font families against attached font files, even when the attachments have cryptic filenames (e.g., `arialbd.ED3587CD.ttf`).
- Safely preserves all non-font attachments (like cover images).
- Automatically moves original MKVs to an `original/` backup folder and places the cleaned files in a `finished/` folder.
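The font discovery described above can be sketched with the standard library alone. This is a minimal example, not the script's actual parser: `fonts_in_ass` is a hypothetical helper, and lowercasing names and stripping a leading `@` (the vertical-text variant of a font) are assumptions about how names might be normalized.
```python
import re

# An inline override looks like {\fnGeorgia\b1}; the name ends at the
# next backslash or at the closing brace.
FN_OVERRIDE = re.compile(r"\\fn([^\\}]+)")
OVERRIDE_BLOCK = re.compile(r"\{([^}]*)\}")


def fonts_in_ass(text):
    """Return the lowercase font names referenced by an ASS/SSA script."""
    fonts = set()
    section = ""
    font_col = 1  # Fontname is normally the second Style field
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line.lower()
            continue
        lowered = line.lower()
        if "styles" in section:
            if lowered.startswith("format:"):
                # Honour the declared column order if Fontname is listed.
                cols = [c.strip().lower() for c in line.split(":", 1)[1].split(",")]
                if "fontname" in cols:
                    font_col = cols.index("fontname")
            elif lowered.startswith("style:"):
                fields = line.split(":", 1)[1].split(",")
                if len(fields) > font_col:
                    fonts.add(fields[font_col].strip().lstrip("@").lower())
        elif section == "[events]" and lowered.startswith("dialogue:"):
            # Only look inside {...} override blocks, not the visible text.
            for block in OVERRIDE_BLOCK.findall(line):
                for name in FN_OVERRIDE.findall(block):
                    fonts.add(name.strip().lstrip("@").lower())
    fonts.discard("")
    return fonts
```
Matching on `"styles"` in the section name lets the same loop handle both `[V4+ Styles]` (ASS) and `[V4 Styles]` (SSA). A bare `{\fn}` resets to the style's font and names nothing, so it is ignored.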
## Prerequisites
Ensure the following are installed, with the command-line tools accessible in your system's `PATH`:
1. **Python 3.x**
2. **MKVToolNix** (specifically `mkvmerge` and `mkvextract`)
3. **fonttools** (Python library)
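Install the required Python dependency. On Windows or Ubuntu, you can use `pip` (recent Ubuntu releases also enforce PEP 668, so run it inside a virtual environment there or install the `python3-fonttools` package with `apt`):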
```bash
pip install fonttools
```
For Arch Linux (which enforces PEP 668), you should use `pacman` to install the system package:
```bash
sudo pacman -S python-fonttools
```
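Before running the script, a quick check along these lines can confirm that the prerequisites are reachable. `missing_prerequisites` is a hypothetical helper written for this README, not part of the script. Note that the `fonttools` package is imported under the name `fontTools`.
```python
import importlib.util
import shutil


def missing_prerequisites(tools=("mkvmerge", "mkvextract"),
                          modules=("fontTools",)):
    """Return the required executables and modules that cannot be found."""
    # shutil.which searches PATH the same way the shell would.
    missing = [tool for tool in tools if shutil.which(tool) is None]
    # find_spec returns None when a top-level module is not installed.
    missing += [mod for mod in modules
                if importlib.util.find_spec(mod) is None]
    return missing
```
An empty list from `missing_prerequisites()` means both MKVToolNix tools and the Python library were found.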
## Usage
Place the script inside the directory containing the `.mkv` files you wish to process and run it. You can also put the script in a directory on your `PATH` (such as a personal `bin` folder) to run it from anywhere.
```bash
python subtitle_fonts_cleaner.py
# If in your PATH, simply execute: subtitle_fonts_cleaner.py
```
This is the main script and intended default workflow for batch cleanup.
### Folder Structure
Upon execution, the script will create three folders in your working directory:
- `temp_subs_fonts/` - A temporary directory used during processing (automatically deleted upon completion).
- `original/` - Your original, unmodified `.mkv` files are safely moved here.
- `finished/` - The new, lean `.mkv` files containing only the active ASS tracks, required font attachments, and original audio/video streams.
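The folder handling above could look roughly like the following sketch. The folder names come from this README; the helper names and the choice to give the cleaned file the original's filename are assumptions, and the real script may order these steps differently.
```python
import shutil
from pathlib import Path

FOLDER_NAMES = ("temp_subs_fonts", "original", "finished")


def prepare_folders(workdir):
    """Create the three working folders and return them keyed by name."""
    workdir = Path(workdir)
    folders = {name: workdir / name for name in FOLDER_NAMES}
    for path in folders.values():
        path.mkdir(exist_ok=True)
    return folders


def archive_and_publish(source, cleaned, folders):
    """Move the source MKV to original/ and the cleaned MKV to finished/."""
    source, cleaned = Path(source), Path(cleaned)
    # Assumption: the cleaned file takes over the original's filename.
    shutil.move(str(source), str(folders["original"] / source.name))
    shutil.move(str(cleaned), str(folders["finished"] / source.name))


def remove_temp(folders):
    """Delete the temporary extraction folder once processing is done."""
    shutil.rmtree(folders["temp_subs_fonts"], ignore_errors=True)
```
Keeping the original untouched in `original/` means a bad run can be undone by moving the file back.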
## Supplemental Script: Font Scanner (Read-Only)
This repository also includes `subtitle_fonts_scanner.py`, a companion script for inspection and reporting.
Use the scanner when you want a dry-run style check before cleaning.
It does not modify files and does not create output folders.
### What the scanner reports
- Number of ASS/SSA subtitle tracks detected
- Number of embedded font attachments
- Which fonts are required by subtitle styles and inline `\fn` overrides
- Which required fonts are covered by current attachments
- Which fonts are missing
- Which embedded font attachments appear unused
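The last four items boil down to set comparisons between the names the subtitles request and the names each attachment provides. The sketch below shows one way to compute them; `classify` and its input shapes are hypothetical, and it assumes names were already normalized (for example lowercased) on both sides.
```python
def classify(required, attachments):
    """Compare required font names with the names each attachment provides.

    required    -- set of normalized font names requested by the subtitles
    attachments -- dict mapping attachment filename to the set of
                   normalized names that file provides
    """
    used = {}
    for filename, names in attachments.items():
        hits = sorted(required & names)
        if hits:
            used[filename] = hits
    provided = set().union(*attachments.values())
    return {
        "ok": sorted(required & provided),        # required and embedded
        "missing": sorted(required - provided),   # required, not embedded
        "used": used,                             # attachment -> names covered
        "extra": sorted(f for f in attachments if f not in used),
    }
```
The `ok` and `missing` lists correspond to the `[OK]` and `[MISSING]` markers in the report, and `used` and `extra` to `[USED]` and `[EXTRA]`.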
### Scanner usage
Run it against a single MKV file:
```bash
python subtitle_fonts_scanner.py "input.mkv"
# If in your PATH, simply execute: subtitle_fonts_scanner.py "input.mkv"
```
### Sample output
Example (truncated):
```text
Scanning: Example Episode 01.mkv
──────────────────────────────────────────────────────────────────────
ASS/SSA subtitle tracks : 2
Font attachments : 15
ASS tracks parsed:
Track 2 [eng]: 1 font(s) referenced
Track 3 [ger]: 3 font(s) referenced
FONTS NEEDED BY SUBTITLES (4 total)
──────────────────────────────────────────────────────────────────────
[OK] arial
[OK] gandhi sans
[MISSING] georgia bold
[OK] times new roman bold
FONTS EMBEDDED IN MKV (15 file(s))
──────────────────────────────────────────────────────────────────────
[USED] ARIALNB.TTF -> covers: arial
[EXTRA] AdobeArabic-Bold.otf
...
MISSING FONTS (1 font(s) not embedded)
──────────────────────────────────────────────────────────────────────
✘ georgia bold
EXTRA / UNUSED EMBEDDINGS (10 file(s) not needed by any subtitle)
──────────────────────────────────────────────────────────────────────
⚠ AdobeArabic-Bold.otf
⚠ comic.ttf
...
```
### Typical workflow
1. Run `subtitle_fonts_scanner.py` on a file to preview needed vs unused fonts.
2. Run `subtitle_fonts_cleaner.py` to process all MKVs in the working directory.
3. Optionally run the scanner again on a cleaned file to verify the result.
## License
MIT License. See the [LICENSE](LICENSE) file for more details.