
Audio Transcription + Summarization Skill

Transcribe audio files (MP3, WAV, M4A, etc.) with OpenAI's Whisper and ffmpeg to produce structured, timestamped transcripts with automatic summarization and action item extraction. Supports multilingual transcription, speaker diarization, and meeting minutes generation.

by JSONbored · added 2025-10-15 · 99,754 source repo stars

Harness: Claude Code, Codex, Windsurf, Gemini, Cursor, CLI

Level: advanced · Type: general · Verified: draft

Review first: open the source and read the safety notes before installing.

Prerequisites

ffmpeg
Python 3.11+ or whisper.cpp
openai-whisper (pip) or whisper.cpp binary
Sufficient disk space for model downloads (Whisper models range from about 39M parameters for tiny to about 1.55B parameters for large; checkpoint files run from roughly 75 MB to roughly 3 GB)
Audio file access permissions - read access to input audio files and write access for transcription output files
System resources: Minimum 4GB RAM for small model, 8GB+ recommended for medium/large models, GPU optional but recommended for faster processing

Schema details

Install type: package
Reading time: 3 min
Difficulty score: 71
Troubleshooting: Yes
Breaking changes: No

Source repository stats

Scope: Source repo
Stars: 99,754
Forks: 12,214
Updated: 2026-05-19T11:32:31Z

Package metadata

Download URL: /downloads/skills/audio-transcription-summarization.zip
Package verified: Yes
SHA-256: 227f513fd69287b909f5b20d191418d4bc515aa4593508058a42e6d3bdf1ba4c

Skill and platform metadata

Skill type: general
Skill level: advanced
Verification: draft
Verified at: 2025-10-15

Retrieval sources

https://github.com/openai/whisper

Tested platforms

Claude, Codex, OpenClaw, Cursor, Windsurf, Gemini

Platform	Support	Install path
claude-code	Native	.claude/skills/<skill-name>/SKILL.md
codex	Native	.agents/skills/<skill-name>/SKILL.md
windsurf	Native	.windsurf/skills/<skill-name>/SKILL.md
gemini	Native	.gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md
cursor	Adapter	.cursor/rules/<skill-name>.mdc
cli	Manual	AGENTS.md or tool-specific context file

Full copyable content

# Convert to mono 16kHz WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav

# Python whisper (pip install -U openai-whisper)
whisper input.wav --model small --language en --output_format txt

About this resource

What This Skill Enables

Claude can transcribe audio files (MP3, WAV, M4A, etc.) and generate structured summaries with timestamps, action items, and speaker identification. This skill leverages Whisper AI and ffmpeg through Claude's Code Interpreter to process audio locally.

Compatibility

Native

Claude Code / Claude: native skill usage via SKILL.md.
Codex/OpenAI workflows: compatible with Agent Skills-style SKILL.md content as reusable workflow instructions.

Manual Adaptation

Gemini CLI: native skill usage via .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md where supported.
Cursor: use the generated .cursor/rules/*.mdc adapter for project rules.
OpenClaw and similar agents: use the same skill content as a reusable prompt/workflow file when native skill import is unavailable.

Prerequisites

Required:

Claude Pro subscription
Code Interpreter feature enabled in Claude Desktop settings
Audio file uploaded to conversation (drag and drop)

What Claude handles automatically:

Installing and running Whisper AI models
Audio format conversion with ffmpeg
Timestamp extraction and alignment
Summary generation and structuring

How to Use This Skill

Basic Transcription

Prompt: "Transcribe this audio file and give me a clean text transcript."

Claude will:

Detect the audio format
Convert to optimal format for transcription
Run Whisper AI transcription
Return formatted text

Timestamped Summary

Prompt: "Transcribe this meeting recording and create a timestamped summary with key discussion points every 5 minutes."

Claude will:

Transcribe the full audio
Chunk by time intervals
Summarize each segment
Present with timestamps
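
The "chunk by time intervals" step might look like the sketch below. It assumes the segment shape that Whisper's Python API returns in `result["segments"]` (dicts with `start`, `end`, and `text`); the function names `format_ts` and `chunk_by_interval` are hypothetical, and the 300-second default matches the five-minute interval in the prompt.

```python
from collections import defaultdict


def format_ts(seconds: float) -> str:
    """Render a second count as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def chunk_by_interval(segments: list[dict], interval_s: float = 300.0) -> list[dict]:
    """Group segments into fixed windows, keyed by each segment's start time."""
    buckets: dict[int, list[str]] = defaultdict(list)
    for seg in segments:
        buckets[int(seg["start"] // interval_s)].append(seg["text"].strip())
    # Windows with no speech are skipped because they never get a bucket.
    return [
        {
            "start": format_ts(idx * interval_s),
            "end": format_ts((idx + 1) * interval_s),
            "text": " ".join(texts),
        }
        for idx, texts in sorted(buckets.items())
    ]
```

Each returned chunk can then be summarized on its own and presented under its `start` and `end` labels. Note that the final window's `end` is the nominal window boundary and may run past the end of the recording.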

Action Items Extraction

Prompt: "Transcribe this audio and extract all action items, decisions, and to-dos mentioned."

Claude will:

Transcribe the audio
Analyze for actionable items
List action items with timestamps
Identify who was assigned what (if mentioned)
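
As a rough illustration of pairing action items with timestamps, the sketch below flags transcript segments that contain common cue phrases. This is a naive keyword heuristic, not how Claude analyzes the transcript; the cue list and the name `find_action_candidates` are assumptions made for illustration, and the input is assumed to be Whisper-style segments with `start` and `text` keys.

```python
import re

# Hypothetical cue phrases; tune these for your own meetings.
ACTION_CUES = re.compile(
    r"\b(action item|todo|to-do|follow up|needs? to|will send|"
    r"by (?:monday|tuesday|wednesday|thursday|friday))\b",
    re.IGNORECASE,
)


def find_action_candidates(segments: list[dict]) -> list[dict]:
    """Return segments whose text contains an action cue, with their start time."""
    hits = []
    for seg in segments:
        text = seg["text"].strip()
        if ACTION_CUES.search(text):
            hits.append({"at": seg["start"], "text": text})
    return hits
```

A pre-filter like this can shorten what is handed to the summarization step, but it will miss implicit commitments and flag false positives, so the candidates still need review.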

Speaker Diarization

Prompt: "Transcribe this conversation and identify different speakers. Label them as Speaker 1, Speaker 2, etc."

Claude will:

Detect speaker changes in the audio
Segment by speaker
Label each segment
Present as a conversation transcript
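
Whisper's own output carries no speaker labels, so the labeling step needs speaker turns from somewhere else. The sketch below assumes a separate diarization tool has already produced `turns` as `(start, end, label)` tuples (that input, and the names `label_segments` and `as_conversation`, are hypothetical) and assigns each Whisper segment to the turn it overlaps most.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: list[dict], turns: list[tuple[float, float, str]]) -> list[dict]:
    """Attach the label of the best-overlapping speaker turn to each segment."""
    labeled = []
    for seg in segments:
        best_label, best_overlap = "Unknown", 0.0
        for t_start, t_end, label in turns:
            ov = overlap(seg["start"], seg["end"], t_start, t_end)
            if ov > best_overlap:
                best_label, best_overlap = label, ov
        labeled.append({"speaker": best_label, "start": seg["start"], "text": seg["text"].strip()})
    return labeled


def as_conversation(labeled: list[dict]) -> str:
    """Merge consecutive segments by the same speaker into one line each."""
    lines: list[tuple[str, list[str]]] = []
    for item in labeled:
        if lines and lines[-1][0] == item["speaker"]:
            lines[-1][1].append(item["text"])
        else:
            lines.append((item["speaker"], [item["text"]]))
    return "\n".join(f"{speaker}: {' '.join(parts)}" for speaker, parts in lines)
```

Segments that overlap no turn are labeled "Unknown" rather than guessed, which keeps gaps in the diarization visible in the final transcript.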

Tips for Best Results

Audio Quality Matters: Clear audio with minimal background noise produces better transcriptions
File Size: For files over 25MB, mention if you want a specific time range transcribed first
Language: Specify the language if it's not English (e.g., "Transcribe this Spanish audio...")
Model Selection: For better accuracy on difficult audio, ask Claude to use the "medium" or "large" Whisper model
Post-Processing: Ask Claude to clean up transcription artifacts like repeated words or filler sounds
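
The post-processing tip can be approximated with two regular expressions. This is a small sketch, assuming English filler sounds; the filler list and the name `clean_transcript` are illustrative choices, not part of the skill.

```python
import re

# Standalone filler sounds, with an optional trailing comma or period.
FILLERS = re.compile(r"\b(?:um+|uh+|erm*|hmm+)\b[,.]?\s*", re.IGNORECASE)
# A word immediately repeated one or more times ("the the", "we we we").
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_transcript(text: str) -> str:
    """Drop filler sounds and collapse immediately repeated words."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

The repeat rule also collapses legitimate doubles such as "had had", so it is safer to apply it to a copy and keep the raw transcript.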

Common Workflows

Meeting Minutes Generation

"Transcribe this meeting and create:
1. Attendee list (if mentioned)
2. Key discussion topics with timestamps
3. Decisions made
4. Action items with owners
5. Next steps"

Podcast Summary

"Transcribe this podcast episode and create:
1. Episode summary (2-3 sentences)
2. Main topics discussed with timestamps
3. Key quotes
4. Chapters (every 10 minutes)"

Interview Transcription

"Transcribe this interview with speaker labels.
Format as Q&A with:
- Interviewer questions highlighted
- Interviewee responses
- Notable quotes pulled out"

Troubleshooting

Issue: Transcription is inaccurate
Solution: Ask Claude to use a larger Whisper model or pre-process the audio for noise reduction

Issue: Wrong language detected
Solution: Explicitly specify the language in your prompt ("Transcribe this French audio...")

Issue: Timestamps are off
Solution: Ask Claude to re-align timestamps or specify the desired timestamp interval

Issue: Speaker diarization missing
Solution: Request it explicitly: "Please identify different speakers and label them"

Learn More

Whisper AI by OpenAI - The underlying transcription model
ffmpeg Audio Processing - Audio format conversion details
Claude Code Interpreter - How Claude executes code
Simon Willison's Analysis - Deep dive into Claude's skills

Features

Local processing via Whisper
Format conversion with ffmpeg
Timestamped notes and action items
Optional speaker labels
Multilingual support (99 languages with auto-detection)
Word-level timestamp accuracy
Multiple output formats (TXT, VTT, SRT, JSON)
Real-time streaming transcription for live events or meetings, with continuous low-latency updates
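
To make the SRT output format concrete, the sketch below renders Whisper-style segments (dicts with `start`, `end`, and `text`) as SRT cues. The function names are hypothetical; the `whisper` command-line tool can also write this format directly via `--output_format srt`.

```python
def srt_timestamp(seconds: float) -> str:
    """Render seconds as the HH:MM:SS,mmm form used in SRT cues."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Number each segment from 1 and separate cues with a blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```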

Use Cases

Summarize meetings and podcasts
Generate action items
Create searchable transcripts
Generate meeting minutes with action items
Create accessible transcripts for video content
Extract insights from podcast episodes

Content outline

What This Skill Enables
Compatibility
Native
Manual Adaptation
Prerequisites
How to Use This Skill
Basic Transcription
Timestamped Summary
Action Items Extraction
Speaker Diarization
Tips for Best Results
Common Workflows
Meeting Minutes Generation
Podcast Summary
Interview Transcription
Troubleshooting

#audio #transcription #whisper #ffmpeg
