Wookang makes AI

Running OpenVoice V1

푸른깃발🏳️ 2024. 5. 10. 17:53

https://github.com/myshell-ai/OpenVoice

 


 

I got interested in OpenVoice for exactly one reason.

"I want to clone a voice. I want a voice I like to read Korean text to me, with no length limit."

 

Existing TTS services have length limits and are complicated to use, so I wanted to build an app that makes the whole thing very simple:

1. Attach a large text file

2. Click the "generate voice" button

3. Get a wav file

So far my R&D results are negative, but I want to keep a record of the process that led me to that negative conclusion.
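
To make step 1 concrete, here is a minimal sketch of how a large text file could be cut into short pieces before being handed to a TTS engine one piece at a time. This is not code from OpenVoice; `split_into_chunks` and the 200-character default are hypothetical names and values I picked for illustration, and joining the resulting wav files is not shown.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Chunks break at sentence boundaries where possible. The function name
    and the 200-character default are assumptions for illustration only.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。！？])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]

        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            # The sentence still fits into the chunk being built.
            current = candidate
        else:
            # Close the current chunk and start a new one with this sentence.
            chunks.append(current)
            current = sentence

    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be synthesized separately, which is one way around a per-request length limit.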

 

First of all, V1 has no real multilingual support.

Only English and Chinese work. If you feed it Korean, it reads the sample sentence below like a native English speaker from the US trying to speak Korean.

안녕하세요! 오늘은 날씨가 정말 좋네요. ("Hello! The weather is really nice today.")

 

* Installing from GitHub

git clone https://github.com/myshell-ai/OpenVoice.git open_voice
cd open_voice

 

* ํ™˜๊ฒฝ๋งŒ๋“ค๊ธฐ - ํŒŒ์ด์ฌ ๋ฒ„์ „์ด ๊ผฌ์—ฌ์„œ conda๋ฅผ ์ด์šฉํ–ˆ๋‹ค.

conda create -n ov python=3.9
conda activate ov
pip install -r requirements.txt

 

* If the conda environment complains that ffmpeg is missing, you have to install ffmpeg as shown below.

conda install ffmpeg
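
If you want to check this before running anything, a small sketch like the one below tells you whether an `ffmpeg` executable is visible on the PATH of the active environment. It uses only the standard library; the function name `ffmpeg_available` is my own and not part of OpenVoice.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    # shutil.which returns the full path of the executable, or None.
    return shutil.which("ffmpeg") is not None
```

Run it inside the activated `ov` environment; if it returns False, the `conda install ffmpeg` step above is still needed.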

 

* Running on the CPU - skip this if you use CUDA!

- As of this writing, OpenVoice does not run on the CPU out of the box. The code needed for it is missing and has not been fixed upstream, so you have to patch it yourself.

First open se_extractor.py and find the following code, around line 22:

device = "cuda" if torch.cuda.is_available() else "cpu"
model = WhisperModel(model_size, device=device, compute_type="float16")

Change it to the following, so that compute_type is no longer fixed to "float16" when the device falls back to "cpu":

device, compute_type = ("cuda","float16") if torch.cuda.is_available() else ("cpu", "int8")
model = WhisperModel(model_size, device=device, compute_type=compute_type)
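
The point of the patch is that the device and the compute type have to be chosen together. Assuming `WhisperModel` here is the faster-whisper class, requesting "float16" on a CPU is what fails, while "int8" works there. The selection logic on its own, pulled into a hypothetical helper (`pick_whisper_settings` is my name, not something in se_extractor.py), looks like this:

```python
def pick_whisper_settings(cuda_available: bool) -> tuple[str, str]:
    """Return a (device, compute_type) pair that belongs together.

    Mirrors the one-line patch above: float16 on CUDA, int8 on CPU.
    """
    if cuda_available:
        return ("cuda", "float16")
    return ("cpu", "int8")
```

In the real file you would call it with `torch.cuda.is_available()` and pass the two values to `WhisperModel` exactly as in the patched lines.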

 

 

* That's everything. Now let's run it. I had it read Korean; if you want English instead, uncomment the commented-out text lines below.

import os
import torch
from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

ckpt_base = 'checkpoints/base_speakers/EN'
ckpt_converter = 'checkpoints/converter'
device="cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs'

base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

os.makedirs(output_dir, exist_ok=True)

source_se = torch.load(f'{ckpt_base}/en_default_se.pth').to(device)


# reference_speaker = 'resources/example_reference.mp3' # This is the voice you want to clone
reference_speaker = 'resources/lympe.mp3' # This is the voice you want to clone

target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, target_dir='processed', vad=True)

# inference
save_path = f'{output_dir}/output_en_default.wav'

# Run the base speaker tts
# text = "This audio is generated by OpenVoice."
text = "안녕하세요! 오늘은 날씨가 정말 좋네요."

src_path = f'{output_dir}/tmp.wav'
base_speaker_tts.tts(text, src_path, speaker='default', language='English', speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message=encode_message)
    
source_se = torch.load(f'{ckpt_base}/en_style_se.pth').to(device)
save_path = f'{output_dir}/output_whispering.wav'

# Run the base speaker tts
# text = "This audio is generated by OpenVoice."
text = "안녕하세요! 오늘은 날씨가 정말 좋네요."

src_path = f'{output_dir}/tmp.wav'
base_speaker_tts.tts(text, src_path, speaker='whispering', language='English', speed=0.9)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message=encode_message)


ckpt_base = 'checkpoints/base_speakers/ZH'
base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

source_se = torch.load(f'{ckpt_base}/zh_default_se.pth').to(device)
save_path = f'{output_dir}/output_chinese.wav'

# Run the base speaker tts
# text = "今天天气真好，我们一起出去吃饭吧。"
text = "안녕하세요! 오늘은 날씨가 정말 좋네요."

src_path = f'{output_dir}/tmp.wav'
base_speaker_tts.tts(text, src_path, speaker='default', language='Chinese', speed=1.0)

# Run the tone color converter
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path, 
    src_se=source_se, 
    tgt_se=target_se, 
    output_path=save_path,
    message=encode_message)
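
The script above repeats the same two calls three times: the base speaker writes a temporary wav, then the tone color converter turns it into the cloned voice. A possible way to fold that into one function is sketched below. `synthesize_and_convert` is a hypothetical helper of mine; it only uses the `tts(...)` and `convert(...)` calls with the same arguments that already appear in the script.

```python
def synthesize_and_convert(tts, converter, text, source_se, target_se,
                           save_path, tmp_path, speaker="default",
                           language="English", speed=1.0, message="@MyShell"):
    """Run the base speaker TTS, then the tone color converter.

    `tts` is expected to behave like BaseSpeakerTTS and `converter` like
    ToneColorConverter as they are used in the script above.
    """
    # Step 1: the base speaker reads the text into a temporary wav file.
    tts.tts(text, tmp_path, speaker=speaker, language=language, speed=speed)

    # Step 2: the converter re-colors that wav with the target voice.
    converter.convert(
        audio_src_path=tmp_path,
        src_se=source_se,
        tgt_se=target_se,
        output_path=save_path,
        message=message)
    return save_path
```

With this, the whispering variant would be a single call with `speaker='whispering'` and `speed=0.9`, and the Chinese variant a call with `language='Chinese'` after loading the ZH checkpoint.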

 

 

* Conclusion

V1 ran fine on both Windows and Mac. For English, the quality of V1 was already good enough.
V2, which I will cover in the next post, needs a CUDA environment for Korean (English and Chinese still run on the CPU), and it also supports voice training. The quality is not all that satisfying, but after a number of tests I found that it clones some voices quite well.

I'll leave the details for the V2 post.