Colab + WhisperX: Converting Audio Files to Transcripts (2025-06-19 version)

Over the past couple of days I needed to transcribe an audio file again, so I went back to Colab to run WhisperX, and found that the code I had written earlier was throwing errors once more.

While troubleshooting, I found that the new release now runs out of the box: there is no longer any need to remove PyTorch and install a specific compatible version as before. The dependency situation is still fairly messy, though.

The new version does bring new problems of its own:

  • whisperx.DiarizationPipeline has been renamed to whisperx.diarize.DiarizationPipeline
  • alignment.py has an IndexError; it has also been reported in the project's Issues, but the fix is still pending, so for now you have to go in and patch the code yourself.

I have tidied things up and posted the updated code below, but use it with care: a few releases from now it may well stop working again.

The versions installed as of today (2025/6/19) are:

whisperx 3.3.4
ctranslate2 4.4.0
pyannote-audio 3.3.2
torch 2.6.0+cu124
torchaudio 2.6.0+cu124
libcudnn8 8.9.7.29-1+cuda12.2
libcudnn8-dev 8.9.7.29-1+cuda12.2
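
If you want to compare against what your own runtime has installed (these versions will drift over time), something like the following in a Colab cell should do:

!pip list | grep -E "whisperx|ctranslate2|pyannote|torch"
!dpkg -l | grep libcudnn8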

Below is the current code and the changes I made. If you want more detail on what each part of the code does, refer back to the originally published version, which includes explanations.


Part 1. Mount Google Drive

Nothing has changed in this part.

from google.colab import drive
drive.mount('/content/drive')

Part 2. Installation and environment setup

The new version is much simpler. Just remember to install libcudnn8, whisperx, and pyannote.audio.

!apt-get update
!apt install libcudnn8 libcudnn8-dev -y
!pip install whisperx
!pip install pyannote.audio


# Enable TF32 matmul / cuDNN kernels (note: this runs in a separate Python
# process, so the flags do not carry over into the notebook kernel itself)
!python -c "import torch; torch.backends.cuda.matmul.allow_tf32 = True; torch.backends.cudnn.allow_tf32 = True"
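
A quick sanity check (my own addition, not in the original notebook) that the Colab runtime actually has a GPU attached and that PyTorch can see it:

import torch

# Should report True plus the GPU name and cuDNN version on a GPU runtime.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("cuDNN:", torch.backends.cudnn.version())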

Part 3. Patch the alignment.py source

This part edits a single line of source in alignment.py to work around the error "IndexError: tensors used as indices must be long, int, byte or bool tensors", which is why a small script is needed. If a future whisperx release fixes this bug, this whole section may no longer be necessary.

# Reset the environment (optional)
#%reset -f

# Install or upgrade whisperx (if needed)
#!pip install --upgrade whisperx

# Copy whisperx to /content
!cp -r /usr/local/lib/python3.11/dist-packages/whisperx /content/whisperx

# Patch alignment.py
import os
file_path = '/content/whisperx/alignment.py'
with open(file_path, 'r') as file:
    lines = file.readlines()
target_line = 'regular_scores = frame_emission[tokens.clamp(min=0)]'
new_line = 'regular_scores = frame_emission[tokens.clamp(min=0).long()]  # cast to torch.long\n'
for i, line in enumerate(lines):
    if target_line in line:
        lines[i] = new_line
        break
with open(file_path, 'w') as file:
    file.writelines(lines)
print("已修改 alignment.py")

# Verify the change
!grep "regular_scores = frame_emission" /content/whisperx/alignment.py

# Clear cached whisperx modules
import sys
for module in list(sys.modules.keys()):
    if module.startswith('whisperx'):
        del sys.modules[module]
print("已清除 whisperx 相關模組")

# Put /content first on sys.path so the patched copy gets imported
sys.path.insert(0, '/content')
print("sys.path:", sys.path)

# Import and verify whisperx
import whisperx
print("whisperx 路徑:", whisperx.__file__)  # 應為 /content/whisperx/__init__.py

import os

file_path = '/content/whisperx/alignment.py'
with open(file_path, 'r') as file:
    lines = file.readlines()

# Fix the indentation of the regular_scores line (the rewrite above dropped it)
for i, line in enumerate(lines):
    if 'regular_scores = frame_emission[tokens.clamp(min=0).long()]' in line:
        lines[i] = '    ' + line.lstrip()  # set indentation to 4 spaces
        break

# 寫回文件
with open(file_path, 'w') as file:
    file.writelines(lines)
print("已修復 regular_scores 的縮進")

# Verify the change
!sed -n '420,435p' /content/whisperx/alignment.py
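
As an aside, the same patch can probably be done with a single in-place sed substitution on the copied file (a sketch on my part, assuming the expression appears exactly once in alignment.py); because sed only rewrites the matched text, the original indentation is kept and the indentation fix above becomes unnecessary:

!sed -i 's/tokens.clamp(min=0)]/tokens.clamp(min=0).long()]/' /content/whisperx/alignment.py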

Part 4. Load libraries and set parameters

Not much changes here; the main thing is that whisperx.DiarizationPipeline has to be changed to whisperx.diarize.DiarizationPipeline.
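
If you want the notebook to survive this kind of rename, one option (my own sketch, not part of the original code) is a small fallback that uses whichever attribute the installed whisperx actually exposes:

import whisperx

# Fallback sketch: prefer the new location, fall back to the old one.
try:
    DiarizationPipeline = whisperx.diarize.DiarizationPipeline  # whisperx 3.3.x
except AttributeError:
    DiarizationPipeline = whisperx.DiarizationPipeline          # older releases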

Everything else stays the same:

import whisperx
import torch
import gc
import os

# from faster_whisper import WhisperModel
from tqdm import tqdm
from google.colab import files
from google.colab import userdata



model_size = "large-v2" # tiny, base, small, medium, large, large-v2, large-v3
batch_size = 16 # reduce if low on GPU mem
device="cuda"

# Set the audio file path
audio_path = "/content/drive/MyDrive/Colab Notebooks/.mp3" # replace with your own filename

HF_TOKEN = userdata.get('HF_TOKEN')
os.environ['HUGGING_FACE_HUB_TOKEN'] = HF_TOKEN


# 1. Transcribe with original whisper (batched)
print("正在載入 Whisper 模型...")
model = whisperx.load_model(model_size, device, compute_type="float16")

print(f"正在載入音訊檔案: {audio_path}")
audio = whisperx.load_audio(audio_path)

print("正在進行轉錄...")
result = model.transcribe(audio, batch_size=batch_size, chunk_size=6)
# print( result["segments"]) # before alignment

# 2. Align whisper output
print("正在載入對齊模型...")
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)

aligned_result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=True, interpolate_method="linear")


# 3. Assign speaker labels
print("正在載入說話人分割模型...")
diarize_model = whisperx.diarize.DiarizationPipeline(device=device)
diarize_segments = diarize_model(audio)
# add min/max number of speakers if known
# diarize_segments = diarize_model(audio, min_speakers=1, max_speakers=2)
# print(diarize_segments)


print("正在將說話人標籤分配給詞語...")
final_result = whisperx.assign_word_speakers(diarize_segments, aligned_result)
# print(final_result["segments"]) # segments are now assigned speaker IDs

print("\n--- 最終結果 (片段與說話人) ---")

# for segment in final_result["segments"]:
    # speaker = segment.get('speaker', 'unknown speaker')
    # start_time = segment['start']
    # end_time = segment['end']
    # text = segment['text']
    # print(f"[{start_time:.2f}s - {end_time:.2f}s] {speaker}: {text}")



# --- Clean up memory ---
#del model, model_a, diarize_model, result, aligned_result, final_result, audio
gc.collect()
torch.cuda.empty_cache()
print("\n已清理所有模型和變數的記憶體.")

Part 5. Merge same-speaker words and punctuate with GPT

Not much has changed here either; I only switched the ChatGPT model to the newer/cheaper gpt-4.1-nano.

That said, the sentence-splitting part of the code still has room for improvement.

!pip install openai
print("\n--------------- 中文merge 與斷句處理  ---------\n")


import openai
import time
import os


OPENAI_KEY = userdata.get('OPENAI_API')
os.environ['OPENAI_API_KEY'] = OPENAI_KEY

##########################################

# First, merge words belonging to the same speaker


def merge_words_by_speaker(segments):
    merged = []
    current = {
        "speaker": None,
        "text": "",
        "start": None,
        "end": None
    }

    pending_unknown = None  # temporarily holds UNKNOWN words that carry no timestamps

    for segment in segments:
        for word in segment.get("words", []):

            speaker = word.get("speaker", "UNKNOWN")
            word_text = word["word"]

            # Handle UNKNOWN words that carry no timestamps
            if speaker == "UNKNOWN" and "start" not in word and "end" not in word:
                # treat it as a continuation of the current speaker and stash it
                current["text"] += word_text
                pending_unknown = word_text
                continue

            # When the speaker changes, save the previous chunk
            if current["speaker"] != speaker:
                if current["text"]:
                    merged.append(current)

                # If there is a pending_unknown, prepend it to the next speaker's text
                if pending_unknown:
                    word_text = pending_unknown + word_text
                    pending_unknown = None

                current = {
                    "speaker": speaker,
                    "text": word_text,
                    "start": word.get("start"),
                    "end": word.get("end")
                }

            else:
                # Word from the same speaker
                pending_unknown = None
                current["text"] += word_text
                if "end" in word:
                    current["end"] = word["end"]


    # Append the final chunk
    if current["text"]:
        merged.append(current)

    return merged






# Use GPT to add punctuation and sentence breaks


client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def punctuate_with_gpt(text, model="gpt-4.1-nano", max_retry=3):
    # The Chinese prompt below asks the model to add appropriate punctuation
    # (periods, commas, question marks, etc.), split the text into natural
    # sentences, and return only the corrected text with no extra explanation.
    prompt = f"""請幫我將以下沒有標點的中文話語補上合適的標點(例如句號、逗號、問號等),並分成自然語言語句:

{text}

輸出時只需要修正後的文本,不需要其他解釋。"""

    for _ in range(max_retry):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            print(f"API error: {e}")
            time.sleep(2)
    return text


# Run GPT punctuation over the merged results
def process_segments_with_gpt(merged_results, length_threshold=10):
    processed = []
    for segment in merged_results:
        raw_text = segment['text']
        if len(raw_text) >= length_threshold:
            processed_text = punctuate_with_gpt(raw_text)
        else:
            processed_text = raw_text

        processed.append({
            "speaker": segment.get("speaker", "未知說話人"),
            "start": segment.get("start"),
            "end": segment.get("end"),
            "text": processed_text
        })
    return processed



# Step 4: Print the results
def print_segments(segments):
    for seg in segments:
        start = seg["start"]
        end = seg["end"]
        speaker = seg["speaker"]
        text = seg["text"]
        print(f"[{start:.2f}s - {end:.2f}s] {speaker}: {text}")






# Usage

print("正在合併同一個發言者的發言...\n")
merged_results = merge_words_by_speaker(final_result["segments"])

print("正在使用 GPT 作斷句和標點處理…...\n")
final_sentences = process_segments_with_gpt(merged_results)
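
For reference, here is a minimal, made-up example of the word-level structure that merge_words_by_speaker expects (mirroring WhisperX word entries after assign_word_speakers); it should merge into one entry per speaker:

# Hypothetical word-level input, just to illustrate the expected shape.
demo_segments = [
    {"words": [
        {"word": "今", "start": 0.0, "end": 0.2, "speaker": "SPEAKER_00"},
        {"word": "天", "start": 0.2, "end": 0.4, "speaker": "SPEAKER_00"},
        {"word": "好", "start": 0.5, "end": 0.7, "speaker": "SPEAKER_01"},
    ]},
]
print(merge_words_by_speaker(demo_segments))
# Expected: [{'speaker': 'SPEAKER_00', 'text': '今天', 'start': 0.0, 'end': 0.4},
#            {'speaker': 'SPEAKER_01', 'text': '好', 'start': 0.5, 'end': 0.7}]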

Part 6. Output and save the results

This part is also largely unchanged.

It just produces one extra version without timecodes.

print("正在輸出結果...\n")
# print_segments(final_sentences)  # 顯示結果
# save_segments_to_txt(final_sentences, filename="transcription_result.txt")  # 儲存結果

# Step 5: Save the results to text files

def format_time(seconds):
    if seconds is None:
        return "??:??:??"
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{seconds:02}"


# Build the output filenames from the audio filename (without its extension)
base_name = os.path.splitext(os.path.basename(audio_path))[0]
filename_orig = base_name + ".txt"
filename2_orig = base_name + "-notimecode.txt"

def save_segments_to_txt(segments, filename=filename_orig):
    with open(filename, "w", encoding="utf-8") as f:
        for seg in segments:
            start = format_time(seg["start"])
            end = format_time(seg["end"])
            speaker = seg["speaker"]
            text = seg["text"]
            f.write(f"[{start} - {end}] {speaker}: {text}\n")

    files.download(f"{filename}")
    print(f"儲存成功:{filename}")


def save_segments_to_txt2(segments, filename=filename2_orig):
    with open(filename, "w", encoding="utf-8") as f:
        for seg in segments:
            text = seg["text"]
            f.write(f" {text}")

    files.download(f"{filename}")
    print(f"Saved: {filename}")


# save_segments_to_txt(final_sentences, filename="transcription_result.txt")  # save the results
save_segments_to_txt(final_sentences)  # save with timecodes
save_segments_to_txt2(final_sentences)  # save without timecodes


WhisperX is still an evolving project; the main issue I run into now is the quality of the word and sentence segmentation.

As for speaker identification, Chinese audio is still affected by recording quality; my subjective success rate is only around 80-90%. Since I don't use it often enough, I still need to clean the output up by hand afterwards. It's really not yet at the point where you can trust the output blindly.

It'll do for now!
