These past couple of days I needed to use WhisperX on Colab again to turn audio files into transcripts, and the code I had written earlier was throwing errors once more.
While troubleshooting, I found that the new version now runs as-is; you no longer have to remove PyTorch and install a specific compatible version like before. The dependency situation is still fairly messy, though.
The new version also brings new problems:
- The whisperx.DiarizationPipeline call has been renamed to whisperx.diarize.DiarizationPipeline.
- There is an IndexError in alignment.py. Others have reported it in the project's issues, but it is still waiting for a fix, so for now you have to patch the code yourself.
I have tidied things up and am posting the new code here, but use it with care; a few versions from now it may well stop working again.
The versions installed as of today (2025/6/19) are:
whisperx 3.3.4
ctranslate2 4.4.0
pyannote-audio 3.3.2
torch 2.6.0+cu124
torchaudio 2.6.0+cu124
libcudnn8 8.9.7.29-1+cuda12.2
libcudnn8-dev 8.9.7.29-1+cuda12.2
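If you want to confirm what your own runtime actually installed, a quick sketch like the one below prints the Python package versions using the standard importlib.metadata module (the libcudnn8 system packages are easier to check with apt instead):
# Optional: print the installed versions of the key Python packages.
# For the system-level cuDNN packages, use: !apt list --installed | grep cudnn
from importlib.metadata import version, PackageNotFoundError

for pkg in ["whisperx", "ctranslate2", "pyannote.audio", "torch", "torchaudio"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")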
Below is the current code and the places I changed. If you want more detail on what each part does, refer back to the originally published version, which includes explanations.
Part 1. Mount Google Drive
Nothing has changed here.
from google.colab import drive
drive.mount('/content/drive')
Part 2. Installation and environment setup
The new version is much simpler. Just remember to install libcudnn8, whisperx, and pyannote.audio.
!apt-get update
!apt install libcudnn8 libcudnn8-dev -y
!pip install whisperx
!pip install pyannote.audio
# Enable TF32 in this notebook session (running this via `python -c` in a separate
# process would only affect that throwaway interpreter, not the kernel)
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
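Optionally, before loading any models you can confirm that the GPU runtime and cuDNN are visible to PyTorch. This is a plain torch sanity check, nothing WhisperX-specific:
# Optional sanity check: confirm the GPU and cuDNN are visible to PyTorch.
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("cuDNN version:", torch.backends.cudnn.version())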
Part 3. Patch the alignment.py source
This step modifies one line of the alignment.py source to fix the IndexError: tensors used as indices must be long, int, byte or bool tensors
problem, so a small script handles it. If a future WhisperX release fixes this bug, this whole step may no longer be needed.
# Reset the environment (optional)
#%reset -f
# Install or upgrade whisperx (if needed)
#!pip install --upgrade whisperx
# Copy whisperx into /content
!cp -r /usr/local/lib/python3.11/dist-packages/whisperx /content/whisperx
# Patch alignment.py
import os

file_path = '/content/whisperx/alignment.py'
with open(file_path, 'r') as file:
    lines = file.readlines()

target_line = 'regular_scores = frame_emission[tokens.clamp(min=0)]'
new_line = 'regular_scores = frame_emission[tokens.clamp(min=0).long()]  # cast indices to torch.long\n'

for i, line in enumerate(lines):
    if target_line in line:
        indent = line[:len(line) - len(line.lstrip())]  # keep the original indentation
        lines[i] = indent + new_line
        break

with open(file_path, 'w') as file:
    file.writelines(lines)

print("alignment.py has been patched")
# Verify the change
!grep "regular_scores = frame_emission" /content/whisperx/alignment.py
# Clear the module cache
import sys
for module in list(sys.modules.keys()):
    if module.startswith('whisperx'):
        del sys.modules[module]
print("Cleared cached whisperx modules")

# Put /content first on sys.path so the patched copy gets imported
sys.path.insert(0, '/content')
print("sys.path:", sys.path)

# Import whisperx and verify which copy was loaded
import whisperx
print("whisperx path:", whisperx.__file__)  # should be /content/whisperx/__init__.py
import os

file_path = '/content/whisperx/alignment.py'
with open(file_path, 'r') as file:
    lines = file.readlines()

# Make sure the patched regular_scores line is indented by four spaces,
# matching the surrounding block in alignment.py
for i, line in enumerate(lines):
    if 'regular_scores = frame_emission[tokens.clamp(min=0).long()]' in line:
        lines[i] = '    ' + line.lstrip()
        break

# Write the file back
with open(file_path, 'w') as file:
    file.writelines(lines)

print("Fixed the indentation of the regular_scores line")

# Verify the change
!sed -n '420,435p' /content/whisperx/alignment.py
Part 4: Load libraries and set parameters
Not much has changed here; the main thing is that whisperx.DiarizationPipeline now has to be called as whisperx.diarize.DiarizationPipeline (see the small shim sketched below).
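If you want the notebook to keep working across both the old and the new location of the class, a small optional shim like this (my own workaround, not an official WhisperX API) tries the new path first and falls back to the old one:
# Optional shim: the class moved between releases, so try the new path first.
try:
    from whisperx.diarize import DiarizationPipeline  # newer releases
except ImportError:
    from whisperx import DiarizationPipeline  # older releases
The code below simply uses the new path directly, since that is what the current version requires.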
Everything else stays as it was:
import whisperx
import torch
import gc
import os
# from faster_whisper import WhisperModel
from tqdm import tqdm
from google.colab import files
from google.colab import userdata

model_size = "large-v2"  # tiny, base, small, medium, large, large-v2, large-v3
batch_size = 16  # reduce if low on GPU mem
device = "cuda"

# Set the audio file path
audio_path = "/content/drive/MyDrive/Colab Notebooks/.mp3"  # replace with your file name
HF_TOKEN = userdata.get('HF_TOKEN')
os.environ['HUGGING_FACE_HUB_TOKEN'] = HF_TOKEN
# 1. Transcribe with original whisper (batched)
print("Loading the Whisper model...")
model = whisperx.load_model(model_size, device, compute_type="float16")

print(f"Loading audio file: {audio_path}")
audio = whisperx.load_audio(audio_path)

print("Transcribing...")
result = model.transcribe(audio, batch_size=batch_size, chunk_size=6)
# print(result["segments"])  # before alignment

# 2. Align whisper output
print("Loading the alignment model...")
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned_result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=True, interpolate_method="linear")

# 3. Assign speaker labels
print("Loading the speaker diarization model...")
diarize_model = whisperx.diarize.DiarizationPipeline(device=device)
diarize_segments = diarize_model(audio)
# add min/max number of speakers if known
# diarize_segments = diarize_model(audio, min_speakers=1, max_speakers=2)
# print(diarize_segments)

print("Assigning speaker labels to words...")
final_result = whisperx.assign_word_speakers(diarize_segments, aligned_result)
# print(final_result["segments"])  # segments are now assigned speaker IDs

print("\n--- Final result (segments with speakers) ---")
# for segment in final_result["segments"]:
#     speaker = segment.get('speaker', 'unknown speaker')
#     start_time = segment['start']
#     end_time = segment['end']
#     text = segment['text']
#     print(f"[{start_time:.2f}s - {end_time:.2f}s] {speaker}: {text}")

# --- Clean up memory ---
# del model, model_a, diarize_model, result, aligned_result, final_result, audio
gc.collect()
torch.cuda.empty_cache()
print("\nReleased the memory used by models and variables.")
Part 5: Merge words from the same speaker and add sentence breaks with GPT
Not much has changed here either; I only switched the ChatGPT model to the newer, cheaper gpt-4.1-nano.
The sentence-splitting part of the code still has room for improvement, though.
!pip install openai
print("\n--------------- 中文merge 與斷句處理 ---------\n")
import openai
import time
import os
HF_TOKEN = userdata.get('OPENAI_API')
os.environ['OPENAI_API_KEY'] = HF_TOKEN
##########################################
# First, merge consecutive words from the same speaker
def merge_words_by_speaker(segments):
    merged = []
    current = {
        "speaker": None,
        "text": "",
        "start": None,
        "end": None
    }
    pending_unknown = None  # temporarily holds UNKNOWN words that have no timestamps

    for segment in segments:
        for word in segment.get("words", []):
            speaker = word.get("speaker", "UNKNOWN")
            word_text = word["word"]

            # Handle UNKNOWN words without timestamps
            if speaker == "UNKNOWN" and "start" not in word and "end" not in word:
                # Treat it as a continuation of the current speaker and hold it for later
                current["text"] += word_text
                pending_unknown = word_text
                continue

            # When the speaker changes, store the previous segment
            if current["speaker"] != speaker:
                if current["text"]:
                    merged.append(current)
                # If there is a pending UNKNOWN word, prepend it to the next speaker's text
                if pending_unknown:
                    word_text = pending_unknown + word_text
                    pending_unknown = None
                current = {
                    "speaker": speaker,
                    "text": word_text,
                    "start": word.get("start"),
                    "end": word.get("end")
                }
            else:
                # Another word from the same speaker
                pending_unknown = None
                current["text"] += word_text
                if "end" in word:
                    current["end"] = word["end"]

    # Append the final segment
    if current["text"]:
        merged.append(current)
    return merged
# Use GPT to add punctuation and sentence breaks
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def punctuate_with_gpt(text, model="gpt-4.1-nano", max_retry=3):
    prompt = f"""Please add appropriate punctuation (periods, commas, question marks, etc.) to the following unpunctuated Chinese utterance and split it into natural sentences:
{text}
Output only the corrected text, with no extra explanation."""
    for _ in range(max_retry):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            print(f"API error: {e}")
            time.sleep(2)
    return text
# Run GPT punctuation over the merged results
def process_segments_with_gpt(merged_results, length_threshold=10):
    processed = []
    for segment in merged_results:
        raw_text = segment['text']
        if len(raw_text) >= length_threshold:
            processed_text = punctuate_with_gpt(raw_text)
        else:
            processed_text = raw_text
        processed.append({
            "speaker": segment.get("speaker", "unknown speaker"),
            "start": segment.get("start"),
            "end": segment.get("end"),
            "text": processed_text
        })
    return processed
# Step 4: print the results
def print_segments(segments):
    for seg in segments:
        start = seg["start"]
        end = seg["end"]
        speaker = seg["speaker"]
        text = seg["text"]
        print(f"[{start:.2f}s - {end:.2f}s] {speaker}: {text}")
# Usage
print("Merging words from the same speaker...\n")
merged_results = merge_words_by_speaker(final_result["segments"])
print("Adding punctuation and sentence breaks with GPT...\n")
final_sentences = process_segments_with_gpt(merged_results)
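To see what merge_words_by_speaker does without spending any API calls, here is a toy input with made-up words, timings, and speaker labels, purely for illustration:
# Optional: a toy run of merge_words_by_speaker (no API calls involved).
toy_segments = [
    {"words": [
        {"word": "你好", "start": 0.0, "end": 0.4, "speaker": "SPEAKER_00"},
        {"word": "嗎", "start": 0.4, "end": 0.6, "speaker": "SPEAKER_00"},
        {"word": "我", "start": 0.8, "end": 1.0, "speaker": "SPEAKER_01"},
        {"word": "很好", "start": 1.0, "end": 1.4, "speaker": "SPEAKER_01"},
    ]}
]
for seg in merge_words_by_speaker(toy_segments):
    print(seg)
# Expected output: one merged entry per speaker, e.g.
# {'speaker': 'SPEAKER_00', 'text': '你好嗎', 'start': 0.0, 'end': 0.6}
# {'speaker': 'SPEAKER_01', 'text': '我很好', 'start': 0.8, 'end': 1.4}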
Part 6. Output and save the results
This part is mostly unchanged as well; it just additionally produces a version without timecodes.
print("正在輸出結果...\n")
# print_segments(final_sentences) # 顯示結果
# save_segments_to_txt(final_sentences, filename="transcription_result.txt") # 儲存結果
# Step 5: 儲存結果到文字檔
def format_time(seconds):
if seconds is None:
return "??:??:??"
hours = int(seconds // 3600)
minutes = int((seconds % 3600) // 60)
seconds = int(seconds % 60)
return f"{hours:02}:{minutes:02}:{seconds:02}"
# 獲取不帶副檔名的檔案名稱
filename_orig = os.path.splitext(os.path.basename(audio_path))[0]
filename_orig = filename_orig + ".txt"
filename2_orig = filename_orig + "-notimecode.txt"
def save_segments_to_txt(segments, filename=filename_orig):
with open(filename, "w", encoding="utf-8") as f:
for seg in segments:
start = format_time(seg["start"])
end = format_time(seg["end"])
speaker = seg["speaker"]
text = seg["text"]
f.write(f"[{start} - {end}] {speaker}: {text}\n")
files.download(f"{filename}")
print(f"儲存成功:{filename}")
def save_segments_to_txt2(segments, filename=filename2_orig):
with open(filename, "w", encoding="utf-8") as f:
for seg in segments:
start = format_time(seg["start"])
end = format_time(seg["end"])
speaker = seg["speaker"]
text = seg["text"]
f.write(f" {text}")
files.download(f"{filename}")
print(f"儲存成功:{filename}")
# save_segments_to_txt(final_sentences, filename="transcription_result.txt") # 儲存結果
save_segments_to_txt(final_sentences) # 儲存結果
save_segments_to_txt2(final_sentences) # 儲存結果
WhisperX is still an evolving project. The main issue I run into now is the quality of the sentence segmentation.
As for speaker identification, Chinese audio is still affected by recording quality; my rough impression is a success rate of only about 80~90%. Since I don't use this often enough to tune it further, the output still needs manual cleanup afterwards. It is really not yet at the point where you can trust the output blindly.
For now, it will do.