Audio Transcription + Summarization Skill
Transcribe audio files (MP3, WAV, M4A, etc.) using OpenAI Whisper AI and ffmpeg to produce structured, timestamped transcripts with automatic summarization and action item extraction. Supports multilingual transcription, speaker diarization, and meeting minutes generation.
Open the source and read safety notes before installing.
Prerequisites
- ffmpeg
- Python 3.11+ or whisper.cpp
- openai-whisper (pip) or whisper.cpp binary
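- Sufficient disk space for model downloads (Whisper checkpoints range from roughly 75 MB for the tiny model to about 3 GB for the large model; the 39M figure often quoted for tiny is its parameter count, not its file size)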
- Audio file access permissions - read access to input audio files and write access for transcription output files
- System resources: Minimum 4GB RAM for small model, 8GB+ recommended for medium/large models, GPU optional but recommended for faster processing
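The tool requirements above can be checked before a run. This is a minimal sketch, assuming the tools are expected on PATH; `missing_tools` is a hypothetical helper name, not part of the skill.

```python
import shutil


def missing_tools(tools=("ffmpeg",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```

Pass extra names, such as the name of your whisper.cpp binary, if you use that route instead of the pip package.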
Schema details
- Install type: package
- Reading time: 3 min
- Difficulty score: 71
- Troubleshooting: Yes
- Breaking changes: No
- Scope: Source repo
- Stars: 99,754 source repo stars
- Forks: 12,214
- Updated: 2026-05-19T11:32:31Z
- Package verified: Yes
- SHA-256: 227f513fd69287b909f5b20d191418d4bc515aa4593508058a42e6d3bdf1ba4c
- Skill type: general
- Skill level: advanced
- Verification: draft
- Verified at: 2025-10-15
| Platform | Support | Install path |
|---|---|---|
| claude-code | Native | .claude/skills/<skill-name>/SKILL.md |
| codex | Native | .agents/skills/<skill-name>/SKILL.md |
| windsurf | Native | .windsurf/skills/<skill-name>/SKILL.md |
| gemini | Native | .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md |
| cursor | Adapter | .cursor/rules/<skill-name>.mdc |
| cli | Manual | AGENTS.md or tool-specific context file |
Full copyable content
# Convert to mono 16kHz WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav
# Python whisper (pip install -U openai-whisper)
whisper input.wav --model small --language en --output_format txt
About this resource
What This Skill Enables
Claude can transcribe audio files (MP3, WAV, M4A, etc.) and generate structured summaries with timestamps, action items, and speaker identification. This skill leverages Whisper AI and ffmpeg through Claude's Code Interpreter to process audio locally.
Compatibility
Native
- Claude Code / Claude: native skill usage via SKILL.md.
- Codex/OpenAI workflows: compatible with Agent Skills-style SKILL.md content as reusable workflow instructions.
Manual Adaptation
- Gemini CLI: native skill usage via .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md where supported.
- Cursor: use the generated .cursor/rules/*.mdc adapter for project rules.
- OpenClaw and similar agents: use the same skill content as a reusable prompt/workflow file when native skill import is unavailable.
Prerequisites
Required:
- Claude Pro subscription
- Code Interpreter feature enabled in Claude Desktop settings
- Audio file uploaded to conversation (drag and drop)
What Claude handles automatically:
- Installing and running Whisper AI models
- Audio format conversion with ffmpeg
- Timestamp extraction and alignment
- Summary generation and structuring
How to Use This Skill
Basic Transcription
Prompt: "Transcribe this audio file and give me a clean text transcript."
Claude will:
- Detect the audio format
- Convert to optimal format for transcription
- Run Whisper AI transcription
- Return formatted text
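The four steps above can be sketched in Python as follows. This is one possible arrangement, not necessarily what Claude runs: it assumes ffmpeg is on PATH and the openai-whisper package is installed, and the helper names `ffmpeg_command` and `transcribe` are hypothetical.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src, dst):
    """Build the ffmpeg call that resamples `src` to mono 16 kHz audio at `dst`."""
    # -y overwrites an existing output file; -ar sets the sample rate; -ac the channel count
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]


def transcribe(src, model_name="small", language=None):
    """Convert `src` to WAV, run Whisper on it, and return the transcript text."""
    import whisper  # imported lazily so ffmpeg_command works without the package

    src = Path(src)
    wav = src.with_name(src.stem + "_16k.wav")
    subprocess.run(ffmpeg_command(src, wav), check=True)
    model = whisper.load_model(model_name)
    # language=None lets Whisper detect the spoken language itself
    result = model.transcribe(str(wav), language=language)
    return result["text"].strip()
```

Calling `transcribe("input.mp3", language="en")` mirrors the two shell commands in the copyable snippet; the result dictionary also carries a `segments` list with start and end times if timestamps are needed.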
Timestamped Summary
Prompt: "Transcribe this meeting recording and create a timestamped summary with key discussion points every 5 minutes."
Claude will:
- Transcribe the full audio
- Chunk by time intervals
- Summarize each segment
- Present with timestamps
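The "chunk by time intervals" step can be illustrated with a small grouping function. This sketch assumes segments shaped like the entries in Whisper's `result["segments"]` (dictionaries with `start`, `end`, and `text`); `chunk_segments` and `clock` are hypothetical helper names, and the summarization of each chunk is left to the model.

```python
def chunk_segments(segments, window_seconds=300):
    """Group segment text into consecutive windows keyed by window start (seconds)."""
    chunks = {}
    for seg in segments:
        # A segment belongs to the window in which it starts
        window_start = int(seg["start"] // window_seconds) * window_seconds
        chunks.setdefault(window_start, []).append(seg["text"].strip())
    return {start: " ".join(texts) for start, texts in sorted(chunks.items())}


def clock(seconds):
    """Format a second count as HH:MM:SS."""
    seconds = int(seconds)
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 4.0, "text": " Welcome everyone."},
        {"start": 301.2, "end": 306.0, "text": " Next topic: budget."},
    ]
    for start, text in chunk_segments(demo).items():
        print(f"[{clock(start)}] {text}")
```

The default of 300 seconds matches the five-minute interval in the prompt; pass 600 for ten-minute podcast chapters.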
Action Items Extraction
Prompt: "Transcribe this audio and extract all action items, decisions, and to-dos mentioned."
Claude will:
- Transcribe the audio
- Analyze for actionable items
- List action items with timestamps
- Identify who was assigned what (if mentioned)
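As a rough illustration of the data involved, action-item candidates can be pre-filtered with a keyword scan before any deeper analysis. This is a deliberately crude heuristic and not how Claude analyzes a transcript: the cue phrases and the name `candidate_action_items` are assumptions for the example, and segments are assumed to be dictionaries with `start` and `text`.

```python
import re

# Hypothetical cue phrases; adjust them to the way your team talks.
CUES = re.compile(
    r"\b(?:action item|to-?do|follow up|needs? to|I'll|we'll|will)\b",
    re.IGNORECASE,
)


def candidate_action_items(segments):
    """Return (start_seconds, text) for segments that contain a cue phrase."""
    return [
        (seg["start"], seg["text"].strip())
        for seg in segments
        if CUES.search(seg["text"])
    ]
```

A scan like this over-matches (any sentence containing "will" qualifies), so it is best treated as a shortlist for review rather than a final list.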
Speaker Diarization
Prompt: "Transcribe this conversation and identify different speakers. Label them as Speaker 1, Speaker 2, etc."
Claude will:
- Detect speaker changes in the audio
- Segment by speaker
- Label each segment
- Present as a conversation transcript
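Whisper's own output contains timed text segments but no speaker labels, so the labeling step usually relies on a separate diarization pass. The sketch below assumes such a pass has already produced speaker turns (dictionaries with `start`, `end`, and `speaker`); that input format and the name `label_segments` are hypothetical. Each transcript segment is assigned the speaker whose turn overlaps it the most.

```python
def label_segments(segments, turns):
    """Attach to each transcript segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in turns:
            # Length of the time interval shared by the segment and the turn
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

Segments with no overlapping turn are labeled "Unknown" rather than guessed, which makes gaps in the diarization easy to spot.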
Tips for Best Results
- Audio Quality Matters: Clear audio with minimal background noise produces better transcriptions
- File Size: For files over 25MB, mention if you want a specific time range transcribed first
- Language: Specify the language if it's not English (e.g., "Transcribe this Spanish audio...")
- Model Selection: For better accuracy on difficult audio, ask Claude to use the "medium" or "large" Whisper model
- Post-Processing: Ask Claude to clean up transcription artifacts like repeated words or filler sounds
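The post-processing tip can be approximated with two regular expressions. This is a minimal sketch with an assumed, English-only filler list; `clean_transcript` is a hypothetical helper name.

```python
import re

# Filler sounds, plus a trailing comma or period and whitespace if present
FILLERS = re.compile(r"\b(?:um+|uh+)\b[,.]?\s*", re.IGNORECASE)
# A word immediately repeated one or more times ("the the")
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_transcript(text):
    """Remove filler sounds and collapse immediately repeated words."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

Note that the repeat rule also collapses legitimate doubles such as "had had", so review the result before treating it as a verbatim record.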
Common Workflows
Meeting Minutes Generation
"Transcribe this meeting and create:
1. Attendee list (if mentioned)
2. Key discussion topics with timestamps
3. Decisions made
4. Action items with owners
5. Next steps"
Podcast Summary
"Transcribe this podcast episode and create:
1. Episode summary (2-3 sentences)
2. Main topics discussed with timestamps
3. Key quotes
4. Chapters (every 10 minutes)"
Interview Transcription
"Transcribe this interview with speaker labels.
Format as Q&A with:
- Interviewer questions highlighted
- Interviewee responses
- Notable quotes pulled out"
Troubleshooting
Issue: Transcription is inaccurate
Solution: Ask Claude to use a larger Whisper model or pre-process the audio for noise reduction
Issue: Wrong language detected
Solution: Explicitly specify the language in your prompt ("Transcribe this French audio...")
Issue: Timestamps are off
Solution: Ask Claude to re-align timestamps or specify the desired timestamp interval
Issue: Speaker diarization missing
Solution: Request it explicitly: "Please identify different speakers and label them"
Learn More
- Whisper AI by OpenAI - The underlying transcription model
- ffmpeg Audio Processing - Audio format conversion details
- Claude Code Interpreter - How Claude executes code
- Simon Willison's Analysis - Deep dive into Claude's skills
Features
- Local processing via Whisper
- Format conversion with ffmpeg
- Timestamped notes and action items
- Optional speaker labels
- Multilingual support (99 languages with auto-detection)
- Word-level timestamp accuracy
- Multiple output formats (TXT, VTT, SRT, JSON)
- Real-time streaming transcription support: live audio processing with continuous, low-latency updates for live events or meetings
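To make the output-format feature concrete, here is a sketch that renders timed segments as SRT subtitles. It assumes segments shaped like Whisper's `result["segments"]` (dictionaries with `start`, `end`, and `text`); `srt_timestamp` and `to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

The `whisper` command-line tool can also write SRT directly through its `--output_format` option; a function like this is mainly useful after segments have been edited or relabeled.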
Use Cases
- Summarize meetings and podcasts
- Generate action items
- Create searchable transcripts
- Generate meeting minutes with action items
- Create accessible transcripts for video content
- Extract insights from podcast episodes