Audio Transcription + Summarization Skill

Transcribe audio files (MP3, WAV, M4A, etc.) using OpenAI's Whisper and ffmpeg to produce structured, timestamped transcripts with automatic summarization and action item extraction. Supports multilingual transcription, speaker diarization, and meeting minutes generation.

by JSONbored · added 2025-10-15 · 99,754 source repo stars
Harness: Claude Code, Codex, Windsurf, Gemini, Cursor, CLI
Level: advanced · Type: general · Verified: draft
Review first: review before installing

Open the source and read safety notes before installing.

Prerequisites

  • ffmpeg
  • Python 3.11+ or whisper.cpp
  • openai-whisper (pip) or whisper.cpp binary
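  • Sufficient disk space for model downloads (Whisper checkpoints range from the 39M-parameter tiny model to the roughly 1.5B-parameter large model, so expect anywhere from tens of megabytes to a few gigabytes on disk)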
  • Audio file access permissions - read access to input audio files and write access for transcription output files
  • System resources: Minimum 4GB RAM for small model, 8GB+ recommended for medium/large models, GPU optional but recommended for faster processing

Schema details

Install type: package
Reading time: 3 min
Difficulty score: 71
Troubleshooting: Yes
Breaking changes: No
Source repository stats
Scope: Source repo
Stars: 99,754
Forks: 12,214
Updated: 2026-05-19T11:32:31Z
Package metadata
Package verified: Yes
SHA-256: 227f513fd69287b909f5b20d191418d4bc515aa4593508058a42e6d3bdf1ba4c
Skill and platform metadata
Skill type: general
Skill level: advanced
Verification: draft
Verified at: 2025-10-15
Retrieval sources: https://github.com/openai/whisper
Tested platforms: Claude, Codex, OpenClaw, Cursor, Windsurf, Gemini
Platform | Support | Install path
claude-code | Native | .claude/skills/<skill-name>/SKILL.md
codex | Native | .agents/skills/<skill-name>/SKILL.md
windsurf | Native | .windsurf/skills/<skill-name>/SKILL.md
gemini | Native | .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md
cursor | Adapter | .cursor/rules/<skill-name>.mdc
cli | Manual | AGENTS.md or tool-specific context file
Full copyable content
# Convert to mono 16kHz WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav

# Python whisper (pip install -U openai-whisper)
whisper input.wav --model small --language en --output_format txt

About this resource

What This Skill Enables

Claude can transcribe audio files (MP3, WAV, M4A, etc.) and generate structured summaries with timestamps, action items, and speaker identification. This skill leverages Whisper AI and ffmpeg through Claude's Code Interpreter to process audio locally.

Compatibility

Native

  • Claude Code / Claude: native skill usage via SKILL.md.
  • Codex/OpenAI workflows: compatible with Agent Skills-style SKILL.md content as reusable workflow instructions.

Manual Adaptation

  • Gemini CLI: native skill usage via .gemini/skills/<skill-name>/SKILL.md or .agents/skills/<skill-name>/SKILL.md where supported.
  • Cursor: use the generated .cursor/rules/*.mdc adapter for project rules.
  • OpenClaw and similar agents: use the same skill content as a reusable prompt/workflow file when native skill import is unavailable.

Prerequisites

Required:

  • Claude Pro subscription
  • Code Interpreter feature enabled in Claude Desktop settings
  • Audio file uploaded to conversation (drag and drop)

What Claude handles automatically:

  • Installing and running Whisper AI models
  • Audio format conversion with ffmpeg
  • Timestamp extraction and alignment
  • Summary generation and structuring

How to Use This Skill

Basic Transcription

Prompt: "Transcribe this audio file and give me a clean text transcript."

Claude will:

  1. Detect the audio format
  2. Convert to optimal format for transcription
  3. Run Whisper AI transcription
  4. Return formatted text
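
The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names `build_ffmpeg_cmd` and `transcribe` and the `_16k.wav` output name are hypothetical, and the added `-y` flag (overwrite without prompting) is an assumption that goes beyond the ffmpeg command shown on this page. It assumes `ffmpeg` is on the PATH and `openai-whisper` is installed.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_ffmpeg_cmd(src: str, dst: str) -> list:
    """Build the ffmpeg call that resamples audio to 16 kHz mono WAV."""
    # -ar 16000: 16 kHz sample rate, -ac 1: mono, -y: overwrite dst (assumption)
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]


def transcribe(src: str, model_name: str = "small",
               language: Optional[str] = None) -> dict:
    """Convert src to WAV, run Whisper on it, and return Whisper's result dict."""
    import whisper  # imported lazily so build_ffmpeg_cmd works without it

    src_path = Path(src)
    wav = str(src_path.with_name(src_path.stem + "_16k.wav"))  # hypothetical name
    subprocess.run(build_ffmpeg_cmd(src, wav), check=True)

    model = whisper.load_model(model_name)
    # language=None lets Whisper auto-detect the spoken language
    return model.transcribe(wav, language=language)
```

Calling `transcribe("input.mp3", language="en")["text"]` would then give the plain transcript, and the `"segments"` entry of the same dict carries per-segment `start`, `end`, and `text` values.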

Timestamped Summary

Prompt: "Transcribe this meeting recording and create a timestamped summary with key discussion points every 5 minutes."

Claude will:

  1. Transcribe the full audio
  2. Chunk by time intervals
  3. Summarize each segment
  4. Present with timestamps
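
As a rough sketch of the "chunk by time intervals" step, the helper below groups Whisper-style segments (dicts with `start` and `text` keys, as returned in the `"segments"` list of Whisper's Python API) into fixed windows. The function names `chunk_segments` and `fmt_clock` are hypothetical, and the summarization of each window is left to the model.

```python
from collections import defaultdict


def chunk_segments(segments, interval_s: int = 300):
    """Group segments into fixed windows, keyed by window start in seconds."""
    windows = defaultdict(list)
    for seg in segments:
        # A segment belongs to the window in which it starts
        window_start = int(seg["start"] // interval_s) * interval_s
        windows[window_start].append(seg["text"].strip())
    return {start: " ".join(texts) for start, texts in sorted(windows.items())}


def fmt_clock(seconds) -> str:
    """Render a number of seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"
```

Each window's text can then be summarized separately and printed under a heading such as `fmt_clock(window_start)`. Assigning a segment by its start time is a simplification: a segment that straddles a boundary lands entirely in the earlier window.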

Action Items Extraction

Prompt: "Transcribe this audio and extract all action items, decisions, and to-dos mentioned."

Claude will:

  1. Transcribe the audio
  2. Analyze for actionable items
  3. List action items with timestamps
  4. Identify who was assigned what (if mentioned)
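
One way to pre-filter a transcript before asking for action items is a simple cue-phrase scan. The sketch below is only a heuristic under stated assumptions: the cue list in `ACTION_CUES` and the function name `find_action_candidates` are hypothetical, and the page does not say the skill works this way. It flags candidate segments with their timestamps so they can be reviewed or passed on for analysis.

```python
import re

# Hypothetical cue phrases; tune these for your own meetings.
ACTION_CUES = re.compile(
    r"\b(?:I'll|we'll|will|needs? to|action item|follow up|to-?do|deadline)\b",
    re.IGNORECASE,
)


def find_action_candidates(segments):
    """Return (start_seconds, text) for segments containing an action cue."""
    hits = []
    for seg in segments:
        text = seg["text"].strip()
        if ACTION_CUES.search(text):
            hits.append((seg["start"], text))
    return hits
```

A scan like this will produce false positives (for example "it will rain") and miss implicit commitments, so it works best as a first pass rather than a final list.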

Speaker Diarization

Prompt: "Transcribe this conversation and identify different speakers. Label them as Speaker 1, Speaker 2, etc."

Claude will:

  1. Detect speaker changes in the audio
  2. Segment by speaker
  3. Label each segment
  4. Present as a conversation transcript
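
Whisper's own output contains text and timestamps but no speaker labels, so speaker turns have to come from a separate diarization step. The sketch below assumes such a step has already produced `turns` as `(start, end, speaker)` tuples (a hypothetical input format) and shows one way to attach a label to each Whisper segment by largest time overlap. The names `overlap` and `label_segments` are illustrative.

```python
def overlap(a_start, a_end, b_start, b_end) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Attach to each segment the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
            default=None,
        )
        # Fall back to "Unknown" when no turn overlaps the segment at all
        if best is None or overlap(seg["start"], seg["end"], best[0], best[1]) == 0:
            speaker = "Unknown"
        else:
            speaker = best[2]
        labeled.append((speaker, seg["text"].strip()))
    return labeled
```

The result can be printed as `Speaker 1: ...` lines to form the conversation transcript described above.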

Tips for Best Results

  1. Audio Quality Matters: Clear audio with minimal background noise produces better transcriptions
  2. File Size: For files over 25MB, mention if you want a specific time range transcribed first
  3. Language: Specify the language if it's not English (e.g., "Transcribe this Spanish audio...")
  4. Model Selection: For better accuracy on difficult audio, ask Claude to use the "medium" or "large" Whisper model
  5. Post-Processing: Ask Claude to clean up transcription artifacts like repeated words or filler sounds
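
The post-processing tip can be illustrated with a small cleanup pass. This is a sketch under assumptions: the filler list and the function name `clean_transcript` are hypothetical, and the patterns are deliberately simple.

```python
import re

# Hypothetical filler list; extend as needed.
FILLERS = re.compile(r"\b(?:um+|uh+|erm+)\b[,.]?\s*", re.IGNORECASE)
# A word followed by one or more immediate repetitions of itself
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_transcript(text: str) -> str:
    """Drop filler sounds and collapse immediately repeated words."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

Note that collapsing repeats also alters legitimate doubled words such as "that that", so it is worth keeping the raw transcript alongside the cleaned one.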

Common Workflows

Meeting Minutes Generation

"Transcribe this meeting and create:
1. Attendee list (if mentioned)
2. Key discussion topics with timestamps
3. Decisions made
4. Action items with owners
5. Next steps"

Podcast Summary

"Transcribe this podcast episode and create:
1. Episode summary (2-3 sentences)
2. Main topics discussed with timestamps
3. Key quotes
4. Chapters (every 10 minutes)"

Interview Transcription

"Transcribe this interview with speaker labels.
Format as Q&A with:
- Interviewer questions highlighted
- Interviewee responses
- Notable quotes pulled out"

Troubleshooting

Issue: Transcription is inaccurate
Solution: Ask Claude to use a larger Whisper model or to pre-process the audio for noise reduction

Issue: Wrong language detected
Solution: Explicitly specify the language in your prompt ("Transcribe this French audio...")

Issue: Timestamps are off
Solution: Ask Claude to re-align timestamps or specify the desired timestamp interval

Issue: Speaker diarization missing
Solution: Request it explicitly: "Please identify different speakers and label them"

Features

  • Local processing via Whisper
  • Format conversion with ffmpeg
  • Timestamped notes and action items
  • Optional speaker labels
  • Multilingual support (99 languages with auto-detection)
  • Word-level timestamp accuracy
  • Multiple output formats (TXT, VTT, SRT, JSON)
  • Real-time streaming transcription support with live audio processing, continuous transcription updates, and low-latency transcription for live events or meetings
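
To make the SRT entry in the output formats above concrete, here is a minimal sketch that renders Whisper-style segments (dicts with `start`, `end`, and `text`) as SubRip text. The `whisper` command line can write SRT itself via `--output_format srt`; this sketch only shows the structure, and the names `srt_timestamp` and `to_srt` are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

Writing the returned string to a `.srt` file next to the media file is usually enough for video players to pick it up as subtitles.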

Use Cases

  • Summarize meetings and podcasts
  • Generate action items
  • Create searchable transcripts
  • Generate meeting minutes with action items
  • Create accessible transcripts for video content
  • Extract insights from podcast episodes
#audio #transcription #whisper #ffmpeg
