phonepod 0.1.0b1__tar.gz

Files changed (61)
  1. phonepod-0.1.0b1/.claude/settings.local.json +12 -0
  2. phonepod-0.1.0b1/.gitignore +41 -0
  3. phonepod-0.1.0b1/.impeccable.md +30 -0
  4. phonepod-0.1.0b1/.uv-cache/.gitignore +1 -0
  5. phonepod-0.1.0b1/.uv-cache/.lock +0 -0
  6. phonepod-0.1.0b1/.uv-cache/CACHEDIR.TAG +1 -0
  7. phonepod-0.1.0b1/.uv-cache/interpreter-v4/df9b60572b5055b7/3c34c73dfd1882aa.msgpack +0 -0
  8. phonepod-0.1.0b1/.uv-cache/interpreter-v4/df9b60572b5055b7/8b0512600727ad4e.msgpack +0 -0
  9. phonepod-0.1.0b1/.uv-cache/sdists-v9/.git +0 -0
  10. phonepod-0.1.0b1/.uv-cache/sdists-v9/.gitignore +0 -0
  11. phonepod-0.1.0b1/CLAUDE.md +42 -0
  12. phonepod-0.1.0b1/JOURNEY.md +237 -0
  13. phonepod-0.1.0b1/MANIFEST.md +69 -0
  14. phonepod-0.1.0b1/PKG-INFO +32 -0
  15. phonepod-0.1.0b1/README.md +111 -0
  16. phonepod-0.1.0b1/TODOS.md +97 -0
  17. phonepod-0.1.0b1/app.py +112 -0
  18. phonepod-0.1.0b1/benchmark_clearvoice_numpy.py +189 -0
  19. phonepod-0.1.0b1/benchmark_denoisers.py +215 -0
  20. phonepod-0.1.0b1/benchmark_pipeline.py +282 -0
  21. phonepod-0.1.0b1/cli.py +113 -0
  22. phonepod-0.1.0b1/demo/RECORDING_GUIDE.md +71 -0
  23. phonepod-0.1.0b1/demo/after.wav +0 -0
  24. phonepod-0.1.0b1/demo/demo_raw.m4a +0 -0
  25. phonepod-0.1.0b1/diagnose.py +99 -0
  26. phonepod-0.1.0b1/diagnose_muffled.py +168 -0
  27. phonepod-0.1.0b1/docs/architecture.md +53 -0
  28. phonepod-0.1.0b1/docs/benchmarks.md +100 -0
  29. phonepod-0.1.0b1/docs/knowledge-base.md +208 -0
  30. phonepod-0.1.0b1/docs/references.md +76 -0
  31. phonepod-0.1.0b1/docs/system-architecture.html +981 -0
  32. phonepod-0.1.0b1/docs/tasks.md +27 -0
  33. phonepod-0.1.0b1/engine.py +159 -0
  34. phonepod-0.1.0b1/phonepod/__init__.py +40 -0
  35. phonepod-0.1.0b1/phonepod/_compat.py +35 -0
  36. phonepod-0.1.0b1/phonepod/app.py +120 -0
  37. phonepod-0.1.0b1/phonepod/audit.py +331 -0
  38. phonepod-0.1.0b1/phonepod/cli.py +113 -0
  39. phonepod-0.1.0b1/phonepod/engine.py +234 -0
  40. phonepod-0.1.0b1/phonepod/processor.py +80 -0
  41. phonepod-0.1.0b1/phonepod/profile.py +164 -0
  42. phonepod-0.1.0b1/phonepod/tuner.py +334 -0
  43. phonepod-0.1.0b1/processor.py +59 -0
  44. phonepod-0.1.0b1/pyproject.toml +56 -0
  45. phonepod-0.1.0b1/setup.sh +34 -0
  46. phonepod-0.1.0b1/sweep.py +77 -0
  47. phonepod-0.1.0b1/test_broadcast_ab.py +282 -0
  48. phonepod-0.1.0b1/test_deepfilter.py +57 -0
  49. phonepod-0.1.0b1/test_full_pipeline.py +86 -0
  50. phonepod-0.1.0b1/test_processor.py +40 -0
  51. phonepod-0.1.0b1/test_studio_character.py +83 -0
  52. phonepod-0.1.0b1/test_tube_sweep.py +168 -0
  53. phonepod-0.1.0b1/tests/__init__.py +0 -0
  54. phonepod-0.1.0b1/tests/conftest.py +70 -0
  55. phonepod-0.1.0b1/tests/test_compat.py +29 -0
  56. phonepod-0.1.0b1/tests/test_engine.py +80 -0
  57. phonepod-0.1.0b1/tests/test_processor.py +55 -0
  58. phonepod-0.1.0b1/tests/test_profile.py +170 -0
  59. phonepod-0.1.0b1/tests/test_public_api.py +90 -0
  60. phonepod-0.1.0b1/tests/test_tuner.py +165 -0
  61. phonepod-0.1.0b1/tuner_minimal.py +466 -0
@@ -0,0 +1,12 @@
1
+ {
2
+ "permissions": {
3
+ "allow": [
4
+ "WebSearch",
5
+ "WebFetch(domain:pypi.org)",
6
+ "WebFetch(domain:raw.githubusercontent.com)",
7
+ "WebFetch(domain:arxiv.org)",
8
+ "WebFetch(domain:dpdfnet.readthedocs.io)",
9
+ "WebFetch(domain:www.gradio.app)"
10
+ ]
11
+ }
12
+ }
@@ -0,0 +1,41 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ dist/
6
+ build/
7
+ *.egg
8
+
9
+ # Virtual environments
10
+ .venv/
11
+ .bench-venv/
12
+
13
+ # macOS
14
+ .DS_Store
15
+
16
+ # Model checkpoints
17
+ checkpoints/
18
+
19
+ # Audio files (generated)
20
+ benchmark_outputs/
21
+ *.wav
22
+ ab_*.wav
23
+
24
+ # Keep demo audio
25
+ !demo/before.wav
26
+ !demo/after.wav
27
+
28
+ # Personal test recordings
29
+ recording.m4a
30
+
31
+ # Generated reports
32
+ audit_report.html
33
+
34
+ # Test/research artifacts
35
+ .pytest_cache/
36
+ .coverage
37
+ htmlcov/
38
+ research/
39
+
40
+ # uv
41
+ uv.lock
@@ -0,0 +1,30 @@
1
+ ## Design Context
2
+
3
+ ### Users
4
+ Beginner content creators and non-technical people who record audio on their phones and want it to sound professional. They don't know what LUFS, EQ, or compression mean — and they shouldn't have to. They want to drag a file in, press one button, and get a good result. Ved uses it himself, but the audience is anyone who Googles "how to make my phone recording sound better."
5
+
6
+ ### Brand Personality
7
+ **Fun. Competent. Fast.**
8
+
9
+ phonepod is the friend at the dinner party who's genuinely good at what they do but doesn't take themselves seriously. It jokes, it moves quickly, it gets the job done without making you feel stupid. Think Duolingo's warmth meets Stripe's reliability — personality without sacrificing trust.
10
+
11
+ ### Aesthetic Direction
12
+ - **Not minimalist.** Minimalism reads as boring and personality-less here. The UI needs character.
13
+ - **Not cluttered.** One screen, one flow, zero cognitive load. The complexity lives in the pipeline, not the interface.
14
+ - **Bold + playful utility.** Strong typography, confident color, maybe a wink of humor in microcopy. The kind of tool that makes you smile when you use it.
15
+ - **Dark mode primary.** Fits the audio/creator context. Light mode is not a priority.
16
+ - **References:** Duolingo (fun without being childish), Arc Browser (opinionated design), Raycast (fast + polished dark UI), Poolside FM (personality in a tool)
17
+ - **Anti-references:** Audacity (cluttered, intimidating), Adobe Podcast (corporate intimidation), generic SaaS dashboards (soulless), anything that looks vibe-coded (gradient soup, glassmorphism for no reason), Notion/Linear minimalism (too quiet)
18
+
19
+ ### Color Direction
20
+ - No default Gradio orange — it screams "I didn't customize this"
21
+ - No pastel gradients or neon that suggest a vibe-coded look
22
+ - Needs a bold, ownable accent color that has energy but isn't generic
23
+ - Consider: electric/warm tones that feel alive — coral, amber, electric green, or a signature hue that becomes "the phonepod color"
24
+
25
+ ### Design Principles
26
+ 1. **One screen, one job.** The entire flow (upload, clean, tune, download) lives on a single page. No tabs, no navigation, no settings pages.
27
+ 2. **Personality over polish.** A witty loading message beats a perfect spinner. Microcopy should make people smile — "Drop your trash audio here" > "Upload Audio File."
28
+ 3. **Confidence through constraint.** Fewer controls = more trust. The user should feel like the tool knows what it's doing. Expert knobs exist but stay out of the way.
29
+ 4. **Speed is a feature.** Every interaction should feel instant or show honest, engaging progress. No mystery spinners.
30
+ 5. **Never intimidate.** No jargon in the UI. No waveform editors. No dB labels visible by default. If a beginner would hesitate, simplify.
@@ -0,0 +1 @@
1
+ *
File without changes
@@ -0,0 +1 @@
1
+ Signature: 8a477f597d28d172789f06886806bc55
File without changes
File without changes
@@ -0,0 +1,42 @@
1
+ # CLAUDE.md
2
+
3
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
+
5
+ ## Agent Identity & Core Directives
6
+
7
+ You are a Senior Audio Systems Engineer and PyTorch specialist. You are building "Project Resonance," a local, privacy-first audio restoration pipeline.
8
+
9
+ ## Immutable Rules
10
+
11
+ 1. **Hardware Strictness:** The host machine is an Apple Silicon Mac (M-series). Route all PyTorch tensor operations to `mps`. Never default to `cuda`. If `mps` is unavailable, fall back to `cpu` but log a severe warning.
12
+ 2. **Context Modularity:** Do not write monolithic scripts. Adhere strictly to the separation of concerns defined in `docs/architecture.md`.
13
+ 3. **No UI Hallucinations:** This is a backend engine first. Do not generate frontend code (Gradio/FastAPI) until explicitly instructed in Phase 3.
14
+ 4. **Environment:** We strictly use `uv` for Python dependency management. Do not suggest `pip`, `conda`, or `poetry`.
15
+ 5. **No External APIs:** This is a 100% local, privacy-first pipeline. Never send audio to an external service.
16
+
17
+ ## Architecture (3-module pipeline)
18
+
19
+ `engine.py` → `processor.py` → `app.py`
20
+
21
+ - **engine.py** — Pure tensor-in, tensor-out wrapper around `resemble-enhance`. Takes 16kHz mono `torch.Tensor`, returns enhanced tensor. Zero file I/O. Uses `nfe=64`, `solver="midpoint"`, dual-stage pipeline (UNet Denoiser + CFM).
22
+ - **processor.py** — Overlap-Add chunking manager. Loads `.wav` via `torchaudio`, slices into 10s chunks with 1s overlap, feeds each to `engine.py`, reconstructs with Hanning window crossfade. Handles all file I/O.
23
+ - **app.py** — Gradio UI (Phase 3 only). Drag-and-drop → `processor.py` → A/B comparison player.
24
+
25
+ The hard boundary: `engine.py` never touches the filesystem. `processor.py` never touches the model. `app.py` never touches tensors.
26
+
27
+ ## Commands
28
+
29
+ ```bash
30
+ # First-time setup (installs all dependencies via uv)
31
+ bash setup.sh
32
+
33
+ # Run the test script (generates a 30s sine wave and processes it)
34
+ uv run python test_processor.py
35
+
36
+ # Launch the Gradio UI (localhost:7860)
37
+ uv run python app.py
38
+ ```
39
+
40
+ ## Execution Protocol
41
+
42
+ Before writing any code, you MUST read `docs/architecture.md` to understand the system design, and `docs/tasks.md` to know exactly which phase you are currently executing.
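Note on the `processor.py` description in CLAUDE.md above: the overlap-add scheme it names (fixed-size chunks, a short overlap, Hann crossfade) can be sketched in plain Python. This is an illustrative sketch only, not the package's code. `overlap_add`, `hann_fade_in`, and the `process` callback are hypothetical names, lengths are given in samples rather than seconds, and the callback stands in for the model call in `engine.py`.

```python
import math


def hann_fade_in(n):
    """Rising half of a Hann window; fade_in[i] + (1 - fade_in[i]) == 1 by construction."""
    return [0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / n) for i in range(n)]


def overlap_add(signal, chunk, overlap, process):
    """Process `signal` chunk by chunk and crossfade the overlapping regions.

    `process` is a hypothetical stand-in for the enhancement call; it must
    return a list with the same length as its input.
    """
    if overlap < 0 or chunk <= 2 * overlap:
        raise ValueError("chunk must be longer than twice the overlap")
    hop = chunk - overlap
    fade_in = hann_fade_in(overlap)
    out = [0.0] * len(signal)
    start = 0
    while start < len(signal):
        end = min(start + chunk, len(signal))
        piece = process(signal[start:end])
        is_first = start == 0
        is_last = end == len(signal)
        for i, value in enumerate(piece):
            weight = 1.0
            if not is_first and i < overlap:
                weight = fade_in[i]  # fade in over the shared region
            tail = len(piece) - 1 - i  # distance from the end of this chunk
            if not is_last and tail < overlap:
                weight = 1.0 - fade_in[overlap - 1 - tail]  # complementary fade out
            out[start + i] += weight * value
        if is_last:
            break
        start += hop
    return out
```

Because the two fades sum to one at every overlapping sample, an identity `process` reproduces the input, which makes a convenient sanity check. At the 16 kHz rate quoted above, the 10 s / 1 s figures would correspond to `chunk=160_000, overlap=16_000`.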
@@ -0,0 +1,237 @@
1
+ # Project Resonance — Build Journey
2
+
3
+ > Local, privacy-first audio restoration pipeline. Phone recording in → podcast-quality audio out.
4
+
5
+ ## The Goal
6
+
7
+ Record a voice memo on your phone, run one command, get podcast-quality output. No cloud. No subscriptions. No Adobe. Everything runs locally on Apple Silicon.
8
+
9
+ ## Phase 1: The Architecture (2026-03-30)
10
+
11
+ Started with a clean 3-module architecture based on `resemble-enhance`, an open-source model from Resemble AI:
12
+
13
+ ```
14
+ engine.py (AI model wrapper) → processor.py (OLA chunking) → cli.py (user interface)
15
+ ```
16
+
17
+ **Key design decisions:**
18
+ - `engine.py` handles zero file I/O — pure tensor-in, tensor-out
19
+ - `processor.py` implements Overlap-Add with Hanning window crossfade to prevent audio clicking at chunk boundaries
20
+ - Strict MPS-first device routing for Apple Silicon, CPU fallback with warning
21
+ - `uv` for dependency management (not pip/conda)
22
+ - Signal handlers + `atexit` for clean shutdown — model unloading, MPS cache flush, temp file cleanup
23
+
24
+ **Expert review caught 4 issues before first run:**
25
+ 1. Wrong import path (`resemble_enhance.enhancer` vs `resemble_enhance.enhancer.inference`) — would have crashed on import
26
+ 2. MPS detection needed both `is_available()` AND `is_built()` — wrong PyTorch build would silently fall back to CPU
27
+ 3. torch version floor needed (`>=2.1.0`) to guarantee MPS support
28
+ 4. `lambd=0.9` was too aggressive — reviewer correctly identified it would degrade speech quality
29
+
30
+ ## Phase 2: First Run — The NumPy Wall (2026-03-30)
31
+
32
+ **Problem:** `resemble-enhance` was written against NumPy 1.x. NumPy 2.0+ broke implicit scalar conversion.
33
+
34
+ ```
35
+ Error: only 0-dimensional arrays can be converted to Python scalars
36
+ ```
37
+
38
+ **Fix:** Pinned `numpy<2.0` (installed 1.26.4). Root cause: `resemble-enhance` hasn't been updated for NumPy 2.x breaking changes.
39
+
40
+ ## Phase 3: The MPS Disaster (2026-03-30)
41
+
42
+ Ran the full pipeline. Output: **pure noise.**
43
+
44
+ Built a diagnostic script (`diagnose.py`) to isolate the problem layer by layer:
45
+
46
+ | Test | Result | Verdict |
47
+ |---|---|---|
48
+ | Denoise only (MPS) | Recognizable but degraded | Denoiser works on MPS |
49
+ | Full enhance (CPU) | Louder but robotic | CFM enhancement adds artifacts even on CPU |
50
+ | Full enhance (MPS) | Pure noise | **MPS completely breaks the CFM ODE solver** |
51
+
52
+ **Root cause:** The Continuous Flow Matching stage runs 64 sequential ODE solver steps. MPS has float32 precision issues with iterative numerical solvers — errors compound across steps until the output is pure noise. This is a known class of issue with Apple's MPS backend.
53
+
54
+ ## Phase 4: The Parameter Sweep (2026-03-30)
55
+
56
+ Refused to give up on `resemble-enhance`. Ran a 7-configuration parameter sweep on CPU, varying `nfe` (solver steps), `lambd` (denoise strength), and `tau` (temperature):
57
+
58
+ | Config | nfe | lambd | tau | Time | Result |
59
+ |---|---|---|---|---|---|
60
+ | light_clean | 32 | 0.1 | 0.1 | 57s | Lightest touch |
61
+ | medium_clean | 32 | 0.5 | 0.1 | 85s | Moderate |
62
+ | heavy_denoise_low_temp | 32 | 0.9 | 0.1 | 87s | Heavy denoise |
63
+ | original_settings | 64 | 0.5 | 0.5 | 98s | Previous default |
64
+ | original_low_temp | 64 | 0.5 | 0.1 | 132s | Low randomness |
65
+ | heavy_denoise_more_steps | 64 | 0.9 | 0.1 | 189s | Max denoise |
66
+ | max_steps_low_temp | 128 | 0.5 | 0.1 | 277s | Maximum quality attempt |
67
+
68
+ **Result:** All 7 configurations had a "tearing" effect at louder passages. Noise was reduced but the output was not podcast quality. The CFM generative model was over-processing the audio — it was **re-synthesizing** the voice rather than cleaning it up.
69
+
70
+ ## Phase 5: The Research Pivot (2026-03-30)
71
+
72
+ Researched how **professional podcast audio engineers** actually process audio. Key finding:
73
+
74
+ > Adobe Podcast "Enhance Speech" is ALSO a generative re-synthesizer. When it fails, it outputs English-sounding babble — proving it generates speech tokens, not just filters them.
75
+
76
+ **The professional processing chain (in order):**
77
+ 1. Noise Gate/Reduction
78
+ 2. High-Pass Filter (80Hz — removes rumble)
79
+ 3. Subtractive EQ (cut mud at 200-400Hz)
80
+ 4. Compressor (3:1 ratio, serial compression preferred)
81
+ 5. De-Esser (tame sibilance at 4-10kHz)
82
+ 6. Additive EQ (presence at 2-4kHz, air at 8-12kHz)
83
+ 7. Loudness Normalization (-16 LUFS)
84
+ 8. Brick-Wall Limiter (-1.5dB ceiling)
85
+
86
+ **Key insight:** We needed a **discriminative denoiser** (removes what shouldn't be there) followed by a **traditional DSP mastering chain** (shapes what's left). Not a generative model trying to re-imagine the audio.
87
+
88
+ ## Phase 6: The Hybrid Pipeline (2026-03-30)
89
+
90
+ Rebuilt the engine with:
91
+ - **DeepFilterNet3** (1M params, real-time on CPU) for noise suppression
92
+ - **Spotify's Pedalboard** for the professional DSP mastering chain
93
+ - **pyloudnorm** for LUFS loudness normalization
94
+
95
+ First attempt used `resemble-enhance` denoiser + pedalboard. Result: still too noisy. The resemble-enhance denoiser wasn't aggressive enough.
96
+
97
+ Swapped to DeepFilterNet3. Required monkey-patching `torchaudio.backend` (removed in torchaudio 2.9+, but DeepFilterNet still imports it).
98
+
99
+ **Final pipeline:**
100
+ ```
101
+ Input (.m4a/.wav/any format)
102
+ → ffmpeg convert to 48kHz mono WAV
103
+ → DeepFilterNet3 noise suppression (full file, real-time)
104
+ → High-Pass Filter (80Hz)
105
+ → Subtractive EQ (-3dB at 300Hz)
106
+ → Dual Compressor (2:1 then 3:1, serial)
107
+ → De-Esser (-4dB at 6kHz)
108
+ → Presence Boost (+2.5dB at 3kHz)
109
+ → Air Boost (+2dB at 10kHz)
110
+ → LUFS Normalization to -16 LUFS
111
+ → Brick-Wall Limiter (-1.5dB)
112
+ Output (.wav)
113
+ ```
114
+
115
+ **Result:** 75-80% there. Voice is clean, noise is gone, compression and EQ are working. Slightly too loud — needs LUFS target tuning.
116
+
117
+ ## Remaining Work
118
+
119
+ - [ ] Tune LUFS target (try -18 or -19 instead of -16)
120
+ - [ ] Integrate DeepFilterNet into `engine.py` properly (replace resemble-enhance denoiser)
121
+ - [ ] Remove the torchaudio monkey-patch (create a proper compatibility layer)
122
+ - [ ] Update `processor.py` — DeepFilterNet processes full files in real-time, OLA chunking may not be needed
123
+ - [ ] Update `app.py` Gradio UI for the new pipeline
124
+ - [ ] A/B test against Adobe Podcast Enhance on the same recording
125
+
126
+ ## Tech Stack (Final)
127
+
128
+ | Component | Library | Purpose |
129
+ |---|---|---|
130
+ | Noise Suppression | DeepFilterNet3 | ML-based noise removal (1M params, real-time) |
131
+ | DSP Mastering | Spotify Pedalboard | HPF, EQ, compression, de-essing, limiting |
132
+ | Loudness | pyloudnorm | LUFS measurement and normalization |
133
+ | Audio I/O | torchaudio + ffmpeg | Format conversion and file handling |
134
+ | Package Manager | uv | Python dependency management |
135
+
136
+ ## What We Learned
137
+
138
+ 1. **Generative ≠ better.** For decent input audio, a discriminative denoiser + traditional DSP chain beats a generative re-synthesizer. The generative approach (resemble-enhance CFM) hallucinates artifacts on clean-ish audio.
139
+
140
+ 2. **MPS is not CUDA.** Apple's MPS backend has precision issues with iterative numerical solvers (ODE/SDE). Single forward-pass models work fine; 64-step flow matching does not.
141
+
142
+ 3. **The professional podcast chain is 8 specific steps in a specific order.** Order matters because audio processing is non-commutative. Compress-then-EQ ≠ EQ-then-compress.
143
+
144
+ 4. **DeepFilterNet3 (1M params) outperformed resemble-enhance denoiser (10M params)** for this use case. Purpose-built tools beat general-purpose tools.
145
+
146
+ 5. **The "tearing at loudness" was caused by the absence of a limiter and compressor** — the raw model output had no dynamic range control. Adding the professional mastering chain fixed it.
147
+
148
+ ## Phase 7: ClearVoice MossFormer2 (2026-03-31)
149
+
150
+ **Problem:** v2 (DeepFilterNet + Pedalboard) removed the noise but lacked richness. v3 (DeepFilterNet + FlashSR + Pedalboard) introduced popping sounds from chunk boundary artifacts in FlashSR.
151
+
152
+ **Discovery:** ClearerVoice-Studio (Alibaba, 4k stars) bundles MossFormer2_SE_48K — a discriminative speech enhancement model that processes at 48kHz natively. It handles enhancement + quality improvement in one pass, no chunking needed.
153
+
154
+ **Dependency hell:** AudioSR pinned numpy==1.23.5 (won't build on Python 3.13). ClearVoice pinned librosa==0.10.2 and soundfile==0.12.1. Had to carefully resolve conflicts by removing competing pins.
155
+
156
+ **v4 test (MossFormer2 only):** Quality improved but noise came back — MossFormer2 enhances but doesn't aggressively denoise. Needed both models.
157
+
158
+ **v5 final pipeline (the breakthrough):**
159
+ ```
160
+ DeepFilterNet3 (noise kill) → MossFormer2 (enhance) → Pedalboard (master) → LUFS (-18) → Limiter
161
+ ```
162
+
163
+ **Result:** ~85% podcast quality. Noise eliminated, loudness correct, no artifacts, no popping. Entire 2-minute recording processes in ~15 seconds total (DeepFilterNet ~3s + MossFormer2 ~7s + DSP instant).
164
+
165
+ **Remaining gap:** Studio mic character — proximity effect warmth (80-250Hz), harmonic saturation/richness, voice-specific EQ tuning. This is the "last mile" problem — the difference between clean audio and audio that sounds like it was recorded on a $500 condenser mic.
166
+
167
+ ## Phase 8: The Last Mile — Integration & Studio Character Experiments (2026-04-01)
168
+
169
+ ### Code Integration (v6)
170
+
171
+ Integrated the v5 pipeline (previously a standalone test script) into the proper codebase:
172
+ - **engine.py** — replaced FlashSR with MossFormer2_SE_48K via ClearVoice. LUFS target adjusted from -16 to -18. Uses ClearVoice file I/O mode (not tensor-to-tensor) because t2t mode OOMs on MPS for >60s audio — file I/O mode handles internal 4s sliding-window segmentation.
173
+ - **processor.py** — removed OLA chunking entirely (115 → 55 lines). Both DeepFilterNet and MossFormer2 handle their own segmentation, so external chunking was redundant and caused artifacts.
174
+ - **cli.py** — ffmpeg conversion target changed from 16kHz to 48kHz to match the pipeline's native sample rate.
175
+
176
+ ### FINALLY Research (Dead End)
177
+
178
+ Investigated Samsung's FINALLY (NeurIPS 2024) via the inverse-ai/FINALLY-Speech-Enhancement repo:
179
+ - **No pretrained weights available.** Training from scratch requires multi-GPU, LibriTTS-R + DAPS-clean datasets, and a 3-stage pipeline with known NaN stability issues.
180
+ - **Voice identity risk.** Generative GAN approach can shift accents and change speaker voice in low-SNR regions. Dealbreaker for podcast use.
181
+ - **MOS > ground truth (4.63 vs 4.56)** — model adds aesthetic coloration, not faithful restoration. Same fundamental problem as resemble-enhance CFM.
182
+ - **One interesting idea:** learnable 16→48kHz upsampling (Upsample WaveUNet). Could be explored as a standalone module later.
183
+
184
+ ### Studio Mic Character Experiments (Ruled Out)
185
+
186
+ Tested two DSP approaches to close the "studio mic character" gap:
187
+
188
+ **A) Proximity effect + soft-clip saturation (tanh):** Symmetric waveshaping, odd harmonics only. Result: voice sounds "sleepy/meditative" — rounds off transients, removes speech energy. Good for ambient content, wrong for podcasts.
189
+
190
+ **B) Proximity effect + tube simulation (asymmetric waveshaping):** Even + odd harmonics, models vacuum tube nonlinearity. Result: adds warmth/body at subtle settings, but even conservative parameters (drive=1.5, bias=0.15) sounded like added coloration rather than natural quality improvement. At higher settings, sounds like a cheap old wired mic.
191
+
192
+ **Key finding: MossFormer2 already does its own spectral enhancement.** Layering proximity + saturation on top of an already-enhanced signal adds coloration to coloration. The clean v6 pipeline (no studio character) consistently sounded best in A/B testing. The "last 15%" gap may be smaller than estimated, or it requires a fundamentally different approach (learned upsampling, voice-specific fine-tuning) rather than DSP post-processing.
193
+
194
+ **v6 is the final pipeline:**
195
+ ```
196
+ Input (.m4a/.wav/any) → ffmpeg 48kHz mono → DeepFilterNet3 (denoise) → MossFormer2 (enhance) → Pedalboard DSP (HPF/EQ/compress/de-ess/presence/air) → LUFS -18 → Limiter -1.5dB → Output
197
+ ```
198
+
199
+ Processing time: ~14s for 2 minutes of audio on Apple M4.
200
+
201
+ ## Version History
202
+
203
+ | Version | Pipeline | Result |
204
+ |---------|----------|--------|
205
+ | v1 | resemble-enhance (full CFM, MPS) | Pure noise |
206
+ | v1b | resemble-enhance (full CFM, CPU) | Robotic, artifacts |
207
+ | sweep | 7 parameter configs on CPU | All had tearing |
208
+ | v2 | DeepFilterNet + Pedalboard | 75-80%, noise gone, lacked richness |
209
+ | v3 | DeepFilterNet + FlashSR + Pedalboard | Popping at chunk boundaries |
210
+ | v4 | MossFormer2 + Pedalboard | Good but noise returned |
211
+ | v5 | DeepFilterNet + MossFormer2 + Pedalboard | ~85%, noise gone, standalone script |
212
+ | v6 | v5 integrated into engine.py + processor.py | Production-ready, ~85-90% |
213
+ | v6+sat | v6 + proximity + saturation (softclip/tube) | Ruled out — coloration, not improvement |
214
+
215
+ ## What We Learned (Updated)
216
+
217
+ 1. **Generative ≠ better.** For decent input audio, discriminative models beat generative re-synthesizers.
218
+
219
+ 2. **MPS is not CUDA.** Iterative ODE solvers produce noise on Apple MPS. Single-pass models (DeepFilterNet, MossFormer2) work fine.
220
+
221
+ 3. **Order matters.** The professional podcast chain is 8 steps in a specific order. Signal chain is non-commutative: saturation→compression ≠ compression→saturation.
222
+
223
+ 4. **Purpose-built > general-purpose.** DeepFilterNet (1M params) outperformed resemble-enhance denoiser (10M params).
224
+
225
+ 5. **No limiter = tearing.** Raw model output needs dynamic range control.
226
+
227
+ 6. **Chunking creates artifacts.** Processing full files in one pass is better than OLA chunking when the model supports it.
228
+
229
+ 7. **Two specialized models > one generalist.** DeepFilterNet (noise only) + MossFormer2 (enhance only) outperformed either alone.
230
+
231
+ 8. **Dependency pinning is the real enemy.** More time went into fighting dependency conflicts than into writing code.
232
+
233
+ 9. **Don't layer enhancement on enhancement.** MossFormer2 already does spectral shaping. Adding DSP saturation/proximity on top of it adds coloration, not quality. The clean pipeline is the best pipeline.
234
+
235
+ 10. **Generative models are a voice identity risk.** FINALLY (Samsung, NeurIPS 2024) achieves MOS scores above ground truth but can shift accents and change speaker voice. For content where voice identity matters, discriminative models are safer.
236
+
237
+ 11. **ClearVoice tensor-to-tensor mode OOMs on long audio.** File I/O mode handles internal segmentation (4s sliding windows). Always use file I/O for production; t2t only for short clips.
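Note on lesson 3 in JOURNEY.md above ("saturation→compression ≠ compression→saturation"): the claim is easy to check numerically. The sketch below is a toy illustration under stated assumptions, not the project's DSP. `saturate` is a normalized tanh soft clip like the one in experiment A, `compress` is a static compressor with no attack or release, and the drive, threshold, and ratio values are arbitrary.

```python
import math


def saturate(x, drive=2.0):
    """Symmetric tanh soft clip, normalized so that an input of 1.0 maps to 1.0."""
    return math.tanh(drive * x) / math.tanh(drive)


def compress(x, threshold=0.5, ratio=3.0):
    """Static compressor: the part of |x| above the threshold is divided by the ratio."""
    magnitude = abs(x)
    if magnitude <= threshold:
        return x
    return math.copysign(threshold + (magnitude - threshold) / ratio, x)


sample = 0.8
saturate_then_compress = compress(saturate(sample))
compress_then_saturate = saturate(compress(sample))

# The two orderings give different output levels for the same input sample.
print(saturate_then_compress, compress_then_saturate)
```

Both stages are nonlinear, so swapping them changes the result; the two orderings would coincide only if the stages happened to commute, which these do not.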
@@ -0,0 +1,69 @@
1
+ # Manifest
2
+
3
+ ## Package (phonepod/)
4
+ - `phonepod/__init__.py` — Public API: enhance(), Engine, process_audio, shutdown_engine
5
+ - `phonepod/_compat.py` — Torchaudio backend compatibility shim
6
+ - `phonepod/engine.py` — 5-stage pipeline: DeepFilterNet → MossFormer2 → Pedalboard → LUFS → Limiter. Subtractive EQ philosophy (cuts only, no boosts). _apply_ceiling fix for peak clipping.
7
+ - `phonepod/processor.py` — Audio loader, mono conversion, engine passthrough, file save
8
+ - `phonepod/cli.py` — CLI interface with ffmpeg format conversion
9
+ - `phonepod/app.py` — Gradio web UI with A/B comparison player
10
+
11
+ ## Legacy (flat files, superseded by phonepod/)
12
+ - `engine.py` — Old engine (now in phonepod/engine.py)
13
+ - `processor.py` — Old processor (now in phonepod/processor.py)
14
+ - `cli.py` — Old CLI (now in phonepod/cli.py)
15
+ - `app.py` — Old app (now in phonepod/app.py)
16
+ - `JOURNEY.md` — Full build log from idea to working pipeline (Phases 1-8)
17
+ - `CLAUDE.md` — Agent identity, architecture rules, execution protocol
18
+ - `docs/architecture.md` — System architecture and module definitions
19
+ - `docs/tasks.md` — Execution roadmap and phase status
20
+ - `docs/references.md` — 30+ models/repos/papers evaluated
21
+ - `docs/knowledge-base.md` — Model evaluations, API reference, parameter guide
22
+ - `setup.sh` — First-time setup script (uv dependencies)
23
+ - `recording.m4a` — Test input (voice memo)
24
+ - `podcast_v6_final.wav` — Current best output
25
+
26
+ ## Test Scripts
27
+ - `test_full_pipeline.py` — Original v2 pipeline test (DeepFilterNet + Pedalboard, no MossFormer2)
28
+ - `test_deepfilter.py` — DeepFilterNet isolation test
29
+ - `test_studio_character.py` — A/B/C test: none vs softclip vs tube saturation
30
+ - `test_tube_sweep.py` — Parameter sweep for tube saturation (6 configs)
31
+ - `test_processor.py` — Original processor test (sine wave)
32
+ - `diagnose.py` — Layer-by-layer diagnostic (MPS vs CPU, denoise vs enhance)
33
+ - `sweep.py` — resemble-enhance 7-config parameter sweep
34
+ - `test_broadcast_ab.py` — A/B test: subtractive EQ (hybrid) vs additive EQ chain comparison
35
+
36
+ ## Benchmark Scripts
37
+ - `benchmark_denoisers.py` — DPDFNet vs DeepFilterNet3 isolated comparison
38
+ - `benchmark_clearvoice_numpy.py` — ClearVoice file I/O vs numpy mode comparison
39
+ - `benchmark_pipeline.py` — Full pipeline A/B: DeepFilterNet3 vs DPDFNet-2 48kHz
40
+
41
+ ## Tuner UI
42
+ - `tuner_minimal.py` — Voice tuner Gradio app with custom PhonepodTheme, semantic sliders, preset save/load
43
+ - `.impeccable.md` — Design context: users, brand personality, aesthetic direction, design principles
44
+ - `phonepod/profile.py` — MasteringParams dataclass, Profile save/load, params_from_semantic(). Subtractive EQ: mud/box/nasal cuts replace presence/air boosts.
45
+
46
+ ## Planning & Documentation
47
+ - `TODOS.md` — Full task list: Sprints 1-5, dependency graph, research findings
48
+ - `docs/system-architecture.html` — Visual system architecture (open in browser)
49
+ - `docs/benchmarks.md` — Benchmark results and decision record (Sprint 1)
50
+
51
+ ## New Files (2026-04-04)
52
+ - `phonepod/audit.py` - Pipeline audit tool: generates HTML report with per-stage spectrograms, metrics, pass/fail
53
+ - `diagnose_muffled.py` - Diagnostic script: isolates each pipeline stage into separate WAV files
54
+
55
+ ## Recent Changes
56
+ - 2026-04-04: RESOLVED muffled audio blocker - root cause was mastering chain crushing dynamics (LUFS overshooting -18 by 3.6dB, crest halved). Fixed with iterative LUFS normalization, gentler compression defaults
57
+ - 2026-04-04: Added noise gate (Pedalboard NoiseGate, -50dB threshold) - silences artifacts between speech
58
+ - 2026-04-04: Added studio room reverb (Pedalboard Reverb, 3% wet default) - subtle early reflections
59
+ - 2026-04-04: Added Room slider to tuner UI + signal health metrics display
60
+ - 2026-04-04: Fixed LUFS convergence bug in enhance() found by Codex audit - was comparing against mutable target instead of original
61
+ - 2026-04-04: Softened noise gate from -40dB to -50dB, longer release (200ms) to preserve speech tails
62
+ - 2026-04-04: Added input clamping to lerp() in params_from_semantic()
63
+ - 2026-04-04: KEY FINDING - mastering DSP has minimal audible impact; ML models determine 95% of output character. Pipeline cleans well but cannot manufacture condenser mic qualities from phone recordings.
64
+ - 2026-04-04: Switched to subtractive EQ philosophy (cuts only, no boosts). Fixed peak clipping bug (_apply_ceiling). Hybrid chain validated via A/B test.
65
+ - 2026-04-03: Reskinned tuner UI - custom coral theme, Space Grotesk, fun microcopy, accordion layout, WCAG fixes
66
+ - 2026-04-03: Created `.impeccable.md` - design context for all future UI work
67
+ - 2026-04-01: Sprint 2 - restructured into `phonepod/` package, public API, pyproject.toml, 19 tests passing
68
+ - 2026-04-01: Sprint 1 complete - benchmarked DPDFNet (rejected), switched ClearVoice to numpy mode (3.2x faster)
69
+ - 2026-04-01: Product named `phonepod` - PyPI available, no trademark conflicts
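Note on the 2026-04-04 changelog entry about input clamping in `lerp()`: the actual `params_from_semantic()` code is not part of the files shown so far, so the following is only a guess at the general shape. The slider range (0 to 1) and the compression-ratio endpoints are hypothetical values chosen for illustration.

```python
def lerp(low, high, t):
    """Linear interpolation with t clamped to [0, 1], so out-of-range
    slider values cannot push a parameter past its endpoints."""
    t = max(0.0, min(1.0, t))
    return low + (high - low) * t


def ratio_from_slider(amount):
    """Hypothetical mapping from a 0..1 'punch' slider to a compression ratio."""
    return lerp(1.5, 4.0, amount)
```

Without the clamp, a slider value of 1.2 would extrapolate past the intended maximum; with it, the result saturates at the endpoint.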
@@ -0,0 +1,32 @@
1
+ Metadata-Version: 2.4
2
+ Name: phonepod
3
+ Version: 0.1.0b1
4
+ Summary: Local AI audio restoration. Phone recording → podcast quality.
5
+ Project-URL: Homepage, https://github.com/vedantggwp/phonepod
6
+ Project-URL: Issues, https://github.com/vedantggwp/phonepod/issues
7
+ Author-email: Ved <vedant.g26@gmail.com>
8
+ License-Expression: MIT
9
+ Keywords: audio,denoising,podcast,restoration,speech-enhancement
10
+ Classifier: Development Status :: 4 - Beta
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: Intended Audience :: End Users/Desktop
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Operating System :: MacOS
15
+ Classifier: Operating System :: POSIX :: Linux
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Topic :: Multimedia :: Sound/Audio
21
+ Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
22
+ Requires-Python: >=3.11
23
+ Requires-Dist: clearvoice>=0.1.2
24
+ Requires-Dist: deepfilternet>=0.5.6
25
+ Requires-Dist: numpy<2.0
26
+ Requires-Dist: pedalboard>=0.9.22
27
+ Requires-Dist: pyloudnorm>=0.2.0
28
+ Requires-Dist: torch>=2.1.0
29
+ Requires-Dist: torchaudio>=2.1.0
30
+ Requires-Dist: torchcodec>=0.11.0
31
+ Provides-Extra: ui
32
+ Requires-Dist: gradio>=6.10.0; extra == 'ui'
@@ -0,0 +1,111 @@
1
+ # phonepod
2
+
3
+ Local AI audio restoration. Phone recording → podcast quality.
4
+
5
+ **Zero cloud. Zero uploads. Everything runs on your machine.**
6
+
7
+ phonepod transforms noisy voice memos into broadcast-ready audio. It combines neural noise suppression (DeepFilterNet3 + MossFormer2) with a subtractive DSP mastering chain - all running locally on CPU. No cloud, no uploads, no subscription.
8
+
9
+ > Status: `0.1.0-beta.1` - works well, API may change. Feedback welcome.
10
+
11
+ ## Before / After
12
+
13
+ > Audio demos coming soon — record on your phone, run `phonepod`, hear the difference.
14
+
15
+ <!-- TODO: Add audio player embeds once demo files are hosted -->
16
+
17
+ ## Install
18
+
19
+ ```bash
20
+ pip install phonepod
21
+ ```
22
+
23
+ Requires Python 3.11+ and ffmpeg (`brew install ffmpeg` on macOS).
24
+
25
+ ## Usage
26
+
27
+ ### CLI (simplest)
28
+
29
+ ```bash
30
+ phonepod recording.m4a podcast.wav
31
+ ```
32
+
33
+ ### Python API
34
+
35
+ ```python
36
+ import phonepod
37
+
38
+ # One-liner: file in, file out
39
+ phonepod.enhance("recording.m4a", "podcast.wav")
40
+
41
+ # Advanced: tensor-level control
42
+ engine = phonepod.Engine()
43
+ enhanced_tensor, sample_rate = engine.enhance(audio_tensor, input_sr)
44
+ ```
45
+
46
+ ### Web UI
47
+
48
+ ```bash
49
+ pip install phonepod[ui]
50
+ python -m phonepod.app
51
+ # Opens at http://localhost:7860
52
+ ```
53
+
54
+ ## What it does
55
+
56
+ | Stage | Model / Tool | What it does |
57
+ |-------|-------------|-------------|
58
+ | 1 | DeepFilterNet3 | Neural noise suppression - removes background noise |
59
+ | 2 | MossFormer2 (48kHz) | Speech enhancement - fills frequencies phones can't capture |
60
+ | 3 | Pedalboard DSP | Subtractive mastering - gate, HPF, EQ cuts (mud/box/nasal), 2x compression, de-ess |
61
+ | 4 | Pedalboard Reverb | Optional studio room ambience |
62
+ | 5 | pyloudnorm | Loudness normalization to -18 LUFS (podcast standard) |
63
+ | 6 | Limiter + ceiling | Prevents clipping at -1.5 dB ceiling |
64
+
65
+ **Subtractive philosophy**: all EQ moves are cuts, not boosts. Remove mud (200Hz), boxiness (500Hz), nasal honk (1500Hz), and harshness (6500Hz). The ML models already shaped the frequency balance - cuts work with them, boosts fight them.
66
+
67
+ Processing a 2-minute recording takes ~7 seconds on Apple Silicon.
68
+
69
+ ## How it started
70
+
71
+ phonepod began as a personal problem: voice memos recorded on a phone sound terrible in a podcast. The AI models that exist are research demos, not products. Professional mastering chains exist but don't denoise. Nothing combines both into a single, local pipeline.
72
+
73
+ So I built it. The full build story — from first prototype to production pipeline, every dead end and breakthrough — is in [JOURNEY.md](JOURNEY.md).
74
+
75
+ ## Architecture
76
+
77
+ ```
78
+ Input (any format)
79
+ -> ffmpeg -> 48kHz mono WAV
80
+ -> Stage 1: DeepFilterNet3 (noise suppression)
81
+ -> Stage 2: MossFormer2_SE_48K (speech enhancement)
82
+ -> Stage 3: Pedalboard mastering (gate -> HPF -> mud/box/nasal cuts -> 2x compression -> de-ess)
83
+ -> Stage 4: Reverb (subtle room ambience, optional)
84
+ -> Stage 5: LUFS normalization (-18 LUFS)
85
+ -> Stage 6: Limiter + hard ceiling (-1.5 dB)
86
+ Output: podcast-quality 48kHz WAV
87
+ ```
88
+
89
+ Hard boundaries: the engine never touches the filesystem. The processor never touches the model. The CLI never touches tensors.
90
+
91
+ ## Development
92
+
93
+ ```bash
94
+ # Clone and setup
95
+ git clone https://github.com/vedantggwp/phonepod.git
96
+ cd phonepod
97
+ uv sync
98
+
99
+ # Run tests (fast unit tests only)
100
+ uv run pytest -m "not slow"
101
+
102
+ # Run full test suite (loads ML models, ~30s)
103
+ uv run pytest
104
+
105
+ # Run on a file
106
+ uv run phonepod recording.m4a output.wav
107
+ ```
108
+
109
+ ## License
110
+
111
+ MIT
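Note on stages 5 and 6 of the README's pipeline table: both come down to decibel arithmetic. The sketch below assumes that a uniform gain of X dB shifts integrated loudness by roughly X LU, which holds to a first approximation. The function names are hypothetical and are not phonepod's API; in the real pipeline the loudness measurement itself comes from pyloudnorm.

```python
def db_to_linear(db):
    """Convert a decibel value to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def normalization_gain(measured_lufs, target_lufs=-18.0):
    """Linear gain that moves a measured loudness to the target loudness."""
    return db_to_linear(target_lufs - measured_lufs)


def hard_ceiling(samples, ceiling_db=-1.5):
    """Clamp every sample to +/- the linear equivalent of the ceiling."""
    limit = db_to_linear(ceiling_db)
    return [max(-limit, min(limit, s)) for s in samples]
```

A recording measured at -24 LUFS needs +6 dB, about a 2x amplitude gain, to reach -18 LUFS, and a -1.5 dB ceiling corresponds to a linear peak of about 0.84. A single gain pass can land off target once a limiter or ceiling acts on the result, which is presumably why the changelog mentions iterative LUFS normalization.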