YouTube Video Summarization with Voice Cloning#

Overview#

Our script performs several tasks:

  1. downloads and processes a YouTube video,

  2. transcribes the audio from the YouTube video,

  3. summarizes the transcription, and

  4. converts the summary to speech using the user’s voice.

The script uses Covalent to execute these tasks, either locally or on a cloud platform such as GCP.
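
At a high level, Covalent wraps each step in an electron (a task) and the whole pipeline in a lattice (a workflow). Here is a minimal sketch of that pattern, before we get to the real tasks below:

import covalent as ct

@ct.electron
def step(x):
    # each unit of work is an "electron" (task)
    return x

@ct.lattice
def pipeline(x):
    # a "lattice" composes electrons into a workflow
    return step(x)

# dispatching runs the lattice; each electron runs locally by default,
# or on whichever executor it was configured with
dispatch_id = ct.dispatch(pipeline)(42)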

Import Dependencies#

First, we import the necessary Python libraries:

[2]:
import os

import covalent as ct
import streamlit as st

from pytube import YouTube
from pydub import AudioSegment
from transformers import pipeline
from TTS.api import TTS

This cell loads libraries for audio processing, the machine learning models, and Covalent for workflow management.

Before dispatching tasks to Google Cloud, provision the GCP Batch resources once from the command line:

covalent deploy up gcpbatch

Set Up Covalent and Dependencies#

Covalent simplifies cloud resource management. We define pip dependencies for each task and configure a Covalent executor for cloud execution. In the example below, we use Google Cloud Batch via the GCPBatchExecutor.

[3]:
audio_deps = [
    "transformers==4.33.3", "pydub==0.25.1",
    "torchaudio==2.1.0", "librosa==0.10.0",
    "torch==2.1.0"
]
text_deps = ["transformers==4.33.3", "torch==2.1.0"]
tts_deps = audio_deps + ["TTS==0.19.1"]

executor = ct.executor.GCPBatchExecutor(
    container_image_uri="docker.io/filipbolt/covalent-gcp-0.229.0rc0",
    vcpus=4,
    memory=8192,
    time_limit=3000,
    poll_freq=1,
    retries=1
)

Alternatively, you can execute this workflow on Covalent Cloud:

[3]:
import covalent_cloud as cc

cc_executor = cc.CloudExecutor(num_cpus=4, env="genai-env", memory=8192, time_limit=3000)
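
With Covalent Cloud, the executor attaches to electrons in exactly the same way, and the workflow is dispatched with cc.dispatch instead of ct.dispatch. A minimal sketch, assuming the genai-env environment already provides the transformers and torch packages:

@ct.electron(executor=cc_executor)
def summarize_transcription_cloud(transcription):
    # same logic as summarize_transcription defined below, but scheduled
    # on Covalent Cloud; dependencies come from the "genai-env" environment
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    return summarizer(
        transcription, min_length=5, max_length=100,
        do_sample=False, truncation=True
    )[0]["summary_text"]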

Define Covalent Tasks#

Each step of our workflow is encapsulated in a Covalent task. Here’s an example:

[4]:
@ct.electron
def download_video(url):
    yt = YouTube(url)
    # download file
    out_file = yt.streams.filter(
        only_audio=True, file_extension="mp4"
    ).first().download(".")

    # rename downloaded file
    os.rename(out_file, "audio.mp4")
    with open("audio.mp4", "rb") as f:
        file_content = f.read()
    return file_content


@ct.electron
def load_audio(input_file_content):
    input_path = os.path.join(os.getcwd(), "file.mp4")
    # write to file
    with open(input_path, "wb") as f:
        f.write(input_file_content)

    audio_content = AudioSegment.from_file(input_path, format="mp4")
    return audio_content


@ct.electron(executor=executor, deps_pip=audio_deps)
def transcribe_audio(audio_content):
    # Export the audio as a WAV file
    audio_content.export("audio_file.wav", format="wav")

    pipe = pipeline(
        task="automatic-speech-recognition",
        # model="openai/whisper-small",
        model="openai/whisper-large-v3",
        chunk_length_s=30, max_new_tokens=2048,
    )
    transcription = pipe("audio_file.wav")
    return transcription['text']


@ct.electron(executor=executor, deps_pip=text_deps)
def summarize_transcription(transcription):
    summarizer = pipeline(
        "summarization",
        model="facebook/bart-large-cnn",
    )
    summary = summarizer(
        transcription, min_length=5, max_length=100,
        do_sample=False, truncation=True
    )[0]["summary_text"]
    return summary


@ct.electron(executor=executor, deps_pip=tts_deps)
def text_to_speech_voice_clone(text, speaker_content, output_file):
    with open("speaker.wav", "wb") as f:
        f.write(speaker_content)

    # agree to service agreement programmatically
    os.environ['COQUI_TOS_AGREED'] = "1"

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1")
    tts.tts_to_file(
        text=text,
        file_path=output_file,
        speaker_wav="speaker.wav",
        language="en"
    )
    with open(output_file, "rb") as f:
        file_content = f.read()
    return file_content


@ct.electron
def load_wav_file(wav_file):
    with open(wav_file, "rb") as f:
        file_content = f.read()
    return file_content

We use the @ct.electron decorator to define tasks like download_video, load_audio, transcribe_audio, and so on.

Orchestrate the Workflow#

The @ct.lattice decorator is used to define the workflow that orchestrates the entire process:

[5]:
@ct.lattice
def workflow(url, user_voice_file, output_file):
    video_content = download_video(url)
    audio_content = load_audio(video_content)

    user_voice_content = load_wav_file(user_voice_file)

    # Use Google Cloud Batch to transcribe, summarize and re-voice
    transcription = transcribe_audio(audio_content)
    summary = summarize_transcription(transcription)
    output_file_content = text_to_speech_voice_clone(
        summary, user_voice_content, output_file
    )
    return summary, transcription, output_file_content
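
Before wiring this into a UI, you can dispatch the lattice directly to verify the whole pipeline end to end. A minimal sketch, where the URL and the WAV file names are placeholder inputs you would supply yourself:

dispatch_id = ct.dispatch(workflow)(
    "https://www.youtube.com/watch?v=<video_id>",      # placeholder URL
    os.path.join(os.getcwd(), "my_voice.wav"),         # short sample of your voice
    os.path.join(os.getcwd(), "summary_voice.wav"),    # where the cloned audio lands
)
result = ct.get_result(dispatch_id, wait=True)
summary, transcription, audio_bytes = result.result
print(summary)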

Streamlit Interface#

We use Streamlit to create an interactive web interface for the script:

[6]:
import streamlit as st

# Function to display results
def display_results(summary, transcription, audio_file_content):
    display_summary(summary)
    display_full_transcription(transcription)
    display_audio_summary(audio_file_content)

# Function to display the summary
def display_summary(summary):
    st.subheader("YouTube transcription summary:")
    st.text(summary)

# Function to display the full transcription with a toggle
def display_full_transcription(transcription):
    st.subheader("YouTube full transcription")
    if st.checkbox("Show/Hide", False):
        st.text(transcription)

# Function to display the audio summary
def display_audio_summary(audio_file_content):
    st.subheader("Summary in your own voice:")
    st.audio(audio_file_content, format="audio/wav")


# Streamlit app layout
def main():
    st.title("Summarize YouTube videos in your own voice using AI")
    speaker_file, speaker_file_path = upload_speaker_file()
    youtube_url = st.text_input("Enter valid YouTube URL")

    if st.button("Process"):
        process_input(speaker_file, speaker_file_path, youtube_url)
    elif "transcription" in st.session_state:
        display_results(
            st.session_state["summary"],
            st.session_state["transcription"],
            st.session_state["audio_file_content"]
        )

# Function to upload speaker file
def upload_speaker_file():
    speaker_file = st.file_uploader("Upload an audio file (WAV)", type=["wav"])
    if speaker_file:
        st.audio(speaker_file, format="audio/wav")
        speaker_file_path = "speaker.wav"
        with open(speaker_file_path, "wb") as f:
            f.write(speaker_file.getbuffer())
        return speaker_file, speaker_file_path
    return None, None

# Function to process the input
def process_input(speaker_file, speaker_file_path, youtube_url):
    if speaker_file and youtube_url:
        audio_file_full_path = os.path.join(os.getcwd(), "audio.wav")
        speaker_file_full_path = os.path.join(os.getcwd(), speaker_file_path)

        dispatch_id = ct.dispatch(workflow)(
            youtube_url, speaker_file_full_path, audio_file_full_path
        )
        with st.spinner(f"Processing... job dispatch id: {dispatch_id}"):
            result = ct.get_result(dispatch_id, wait=True)

        if result:
            summary, transcription, output_file_content = result.result
            st.session_state["transcription"] = transcription
            st.session_state["summary"] = summary
            st.session_state["audio_file_content"] = output_file_content
            display_results(summary, transcription, output_file_content)
        else:
            st.error("Something went wrong. Please try again.")

main()

Running the Script#

To run the script, execute it via Streamlit:

streamlit run your_script.py

In the Covalent UI, you should see a workflow graph like the following:

(Screenshot: the Covalent UI showing the workflow graph.)

Using the Streamlit app then looks like this:

(Screenshot: the Streamlit app interface.)

Customizing the Workflow#

You can tailor this script to your specific needs:

  • Modify the Covalent task functions for different processing requirements, such as swapping out one of the models (see the sketch after this list).

  • Adjust the Covalent executor settings based on your cloud resource needs.
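
For instance, a lighter transcription task could swap in the whisper-small checkpoint (already hinted at in the commented-out line inside transcribe_audio) and run on a smaller executor. A minimal sketch; the vcpus, memory, and time_limit values are illustrative assumptions, not tuned settings:

light_executor = ct.executor.GCPBatchExecutor(
    container_image_uri="docker.io/filipbolt/covalent-gcp-0.229.0rc0",
    vcpus=2,          # assumed: half the resources of the main executor
    memory=4096,
    time_limit=1500,
    poll_freq=1,
    retries=1
)

@ct.electron(executor=light_executor, deps_pip=audio_deps)
def transcribe_audio_small(audio_content):
    # export and transcribe as in transcribe_audio, with a smaller model
    audio_content.export("audio_file.wav", format="wav")
    pipe = pipeline(
        task="automatic-speech-recognition",
        model="openai/whisper-small",  # smaller and faster than whisper-large-v3
        chunk_length_s=30,
    )
    return pipe("audio_file.wav")["text"]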

Conclusion#

This tutorial demonstrates using Covalent to build a YouTube summarization and voice-cloning workflow. Covalent’s cloud computing abstraction simplifies executing complex workflows, making it a powerful tool for developers and researchers in AI/ML fields.
