Fish Speech Environment Setup, Training, and Inference Tutorial

Create and activate a virtual environment

conda create -n FishSpeech python=3.10  
# conda create -n FishSpeech python=3.10.5
conda activate FishSpeech

Clone the repository

git clone https://github.com/fishaudio/fish-speech.git  
cd fish-speech
git checkout tags/v1.4.3 # switch to Fish Speech v1.4 (comment out this line if you want to use v1.5)
# git checkout main # switch to Fish Speech v1.5

Install dependencies

PyTorch must be installed first; visit the PyTorch website to get the install command for your platform and CUDA version.

pip install huggingface_hub  
pip install triton
pip install .
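
To confirm PyTorch was installed with GPU support before continuing, a quick check:

import torch

# Should print True, plus your GPU name, on a working CUDA install
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))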

Fish Speech v1.4 and v1.5 have different dependencies; when you update the project, rerun pip install . to update them.

Tip: if installing triton fails, you can build it from source; see the triton project for documentation.

Download the models

# For Fish Speech v1.4
huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
# For Fish Speech v1.5
huggingface-cli download fishaudio/fish-speech-1.5 --local-dir checkpoints/fish-speech-1.5

Model paths (examples)

For Fish Speech v1.4:
checkpoints/fish-speech-1.4
checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth
For Fish Speech v1.5:
checkpoints/fish-speech-1.5
checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth

Done (environment)

You can now follow the Fish Speech project guide,
or open the README.md in the project root to read it directly.


Create the data folder

Create a folder named data in the project root,
then enter the data folder.

Set up the training set

A transcript file should contain only the transcript of the audio, nothing else.

  1. Create a folder named after your speaker (here we assume the speaker name is Speaker1)
  2. Enter the Speaker1 folder and put Speaker1's audio files (mp3 / wav / flac) and transcript files (lab) in it

After this, the directory layout of your data folder should look like this:

data/
└── Speaker1/
    ├── Speaker1_audio1.mp3
    ├── Speaker1_audio1.lab
    ├── Speaker1_audio2.wav
    ├── Speaker1_audio2.lab
    ├── Speaker1_audio3.flac
    └── Speaker1_audio3.lab

The files should contain:
| File name | File content |
| --- | --- |
| Speaker1_audio1.mp3 | audio (你好,博士。) |
| Speaker1_audio1.lab | text (你好,博士。) |
| Speaker1_audio2.wav | audio (你好,博士。) |
| Speaker1_audio2.lab | text (你好,博士。) |
| Speaker1_audio3.flac | audio (你好,博士。) |
| Speaker1_audio3.lab | text (你好,博士。) |
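
If your transcripts live elsewhere, a small script can write the .lab files for you. A minimal sketch, assuming a hypothetical transcripts dict that maps audio file names to their text:

from pathlib import Path

# Hypothetical mapping: audio file name -> transcript text
transcripts = {
    "Speaker1_audio1.mp3": "你好,博士。",
    "Speaker1_audio2.wav": "你好,博士。",
}

for audio_name, text in transcripts.items():
    audio_path = Path("data/Speaker1") / audio_name
    # Each .lab file holds only the transcript, nothing else
    audio_path.with_suffix(".lab").write_text(text, encoding="utf-8")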

Afterwards, you can optionally normalize the loudness of the dataset (data-raw is the input directory and data the output):

fap loudness-norm data-raw data --clean

Extract semantic tokens

num-workers: number of worker threads
batch-size: batch size
config-name: config file name (config files are usually stored in fish_speech/configs)
checkpoint-path: model path (fill in your own path)
Double-check: the model path

python tools/vqgan/extract_vq.py data \
--num-workers 1 --batch-size 16 \
--config-name "firefly_gan_vq" \
--checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

After this, the directory layout of your data folder should look like this:

data/
└── Speaker1/
    ├── Speaker1_audio1.mp3
    ├── Speaker1_audio1.lab
    ├── Speaker1_audio1.npy
    ├── Speaker1_audio2.wav
    ├── Speaker1_audio2.lab
    ├── Speaker1_audio2.npy
    ├── Speaker1_audio3.flac
    ├── Speaker1_audio3.lab
    └── Speaker1_audio3.npy
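
Before packing, it is worth checking that every audio file got both its .lab and .npy companions; a minimal sketch over the layout above:

from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".flac"}

# Walk data/ and report any audio file missing its transcript or token file
for audio in Path("data").rglob("*"):
    if audio.suffix.lower() not in AUDIO_EXTS:
        continue
    for ext in (".lab", ".npy"):
        if not audio.with_suffix(ext).exists():
            print(f"missing {ext} for {audio}")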

Pack the dataset

input: the root folder of the dataset (if Speaker1 sits inside data, then the root folder is data)
output: output path for the packed dataset (data/protos means the protos folder under data; the output files use the .protos extension)
text-extension: file extension of the transcript files
num-workers: number of worker threads

python tools/llama/build_dataset.py \
--input "data" \
--output "data/protos" \
--text-extension .lab \
--num-workers 16

Training config file

Unless you have a reason to change them, the defaults are fine.
The training config file here is fish_speech/configs/text2semantic_finetune.yaml.
Double-check: the pretrained model path.

pretrained_ckpt_path: checkpoints/fish-speech-1.4 # path to the pretrained model

trainer:
  max_steps: 1000 # maximum number of training steps
  val_check_interval: 100 # validate every N steps
  # accelerator: gpu # set according to your device (don't uncomment unless you hit errors)
  # devices: 1 # set according to your device (don't uncomment unless you hit errors)

train_dataset:
  proto_files:
    - data/protos # the output path used in the dataset-packing step

val_dataset:
  proto_files:
    - data/protos # the output path used in the dataset-packing step

data:
  num_workers: 4 # number of worker threads
  batch_size: 8 # batch size

model:
  optimizer:
    lr: 1e-4 # learning rate
    weight_decay: 0 # weight decay
    eps: 1e-5 # optimizer epsilon (numerical stability term)
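
To get a feel for what max_steps means in epochs, here is a rough estimate; the 300-clip dataset size is only an assumption for illustration, and the real count per step also depends on how the dataloader packs sentences:

import math

num_samples = 300   # assumed number of training clips; use your own count
batch_size = 8      # batch_size from the config above
max_steps = 1000    # max_steps from the config above

steps_per_epoch = math.ceil(num_samples / batch_size)  # 38
print(max_steps / steps_per_epoch)                     # ~26 epochs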

Start training

config-name: training config file name (config files are usually stored in fish_speech/configs)
Double-check: the contents of the training config file
project: project name (assumed here to be Speaker1)
model.model.lora_config=r_8_alpha_16: LoRA config name (LoRA configs are usually stored in fish_speech/configs/lora)

python fish_speech/train.py --config-name text2semantic_finetune \
project=Speaker1 \
+lora@model.model.lora_config=r_8_alpha_16

Training results usually land in the results/Speaker1/checkpoints folder.
You can tweak the parameters in fish_speech/configs/text2semantic_finetune.yaml to fit your GPU.
If you are on Windows, you can add trainer.strategy.process_group_backend=gloo to avoid nccl errors:

python fish_speech/train.py --config-name text2semantic_finetune \
project=Speaker1 \
+lora@model.model.lora_config=r_8_alpha_16 \
trainer.strategy.process_group_backend=gloo

Finish training (merge the LoRA)

lora-config: LoRA config name (must match the one used during training)
Double-check: the contents of the LoRA config
base-weight: path to the base model weights (the pretrained model path from the training config)
Double-check: the base weight path in the training config
lora-weight: path to the LoRA checkpoint (fill in your own path)
output: output path for the merged model (fill in your own path)

python tools/llama/merge_lora.py \
--lora-config r_8_alpha_16 \
--base-weight checkpoints/fish-speech-1.4 \
--lora-weight results/Speaker1/checkpoints/step_000000010.ckpt \
--output checkpoints/fish-speech-1.4-Speaker1-lora/
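
The checkpoint file name under results/Speaker1/checkpoints depends on how long training ran; one way to locate the newest one for --lora-weight, as a sketch:

from pathlib import Path

# Zero-padded step names sort correctly, so the last entry is the latest step
ckpts = sorted(Path("results/Speaker1/checkpoints").glob("step_*.ckpt"))
print(ckpts[-1])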

Resume training

  1. Merge the LoRA
  2. Change pretrained_ckpt_path in the training config to the output path of the merged model
  3. Start training

Done (training)

You can now follow the Fish Speech project guide,
or open the README.md in the project root to read it directly.


Generate a prompt from speech

input-path: input audio file path (assumed here to be Speaker1.wav)
checkpoint-path: model path (fill in your own path)
Double-check: the model path

python tools/vqgan/inference.py \
--input-path "Speaker1.wav" \
--checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

After running this, a fake.npy file is created in the working directory.
If you want the model to pick a voice at random, you can skip this step.
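
To confirm the prompt was extracted, you can peek at fake.npy with numpy; the exact shape depends on the model version and the audio length:

import numpy as np

codes = np.load("fake.npy")
# Typically a 2-D integer array (codebooks x frames)
print(codes.shape, codes.dtype)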

Generate semantic tokens from text

text: the text to synthesize
prompt-text: the transcript of the input audio used in the prompt-generation step
prompt-tokens: the path of the file produced in the prompt-generation step
checkpoint-path: model path (fill in your own path)
Double-check: the model path

python tools/llama/generate.py \
--text "我是迷迭香。" \
--prompt-text "你好,博士。" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/fish-speech-1.4"

Running this command creates codes_N files in the working directory (where N is an integer starting from 0).
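
If several codes_*.npy files have accumulated in the working directory, a quick way to find the most recent one:

from pathlib import Path

# Sort by modification time; the last entry is the most recently written file
codes_files = sorted(Path(".").glob("codes_*.npy"), key=lambda p: p.stat().st_mtime)
print(codes_files[-1])  # e.g. codes_0.npy on the first run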

Use --compile to fuse CUDA kernels for faster inference:

python tools/llama/generate.py \
--text "我是迷迭香。" \
--prompt-text "你好,博士。" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/fish-speech-1.4" \
--compile

For GPUs that do not support bf16, use --half:

python tools/llama/generate.py \
--text "我是迷迭香。" \
--prompt-text "你好,博士。" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/fish-speech-1.4" \
--half

Generate speech from semantic tokens

input-path: semantic-token file path (assumed here to be codes_0.npy)
output-path: output file path (assumed here to be Fake.wav)
checkpoint-path: model path (fill in your own path)
Double-check: the model path

python tools/vqgan/inference.py \
--input-path "codes_0.npy" \
--output-path "Fake.wav" \
--checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

Done (inference)

Give the generated audio a listen.

Error log

Error (ver 1.5 [2024/12/5])

Command executed

python fish_speech/train.py --config-name text2semantic_finetune \
project=Speaker1 \
+lora@model.model.lora_config=r_8_alpha_16 \
trainer.strategy.process_group_backend=gloo

Error

Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\transformers\models\auto\configuration_auto.py", line 1034, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\transformers\models\auto\configuration_auto.py", line 736, in __getitem__
    raise KeyError(key)
KeyError: 'dual_ar'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 92, in _call_target
    return _target_(*args, **kwargs)
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 877, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\transformers\models\auto\configuration_auto.py", line 1036, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `dual_ar` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\fish-speech\fish_speech\utils\utils.py", line 69, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "C:\Users\Administrator\Documents\fish-speech\fish_speech\train.py", line 55, in train
    datamodule: LightningDataModule = hydra.utils.instantiate(cfg.data)
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 226, in instantiate
    return instantiate_node(
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 342, in instantiate_node
    value = instantiate_node(
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 342, in instantiate_node
    value = instantiate_node(
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 347, in instantiate_node
    return _call_target(_target_, partial, args, kwargs, full_key)
  File "C:\Users\Administrator\Documents\fish-speech\fishenv\env\lib\site-packages\hydra\_internal\instantiate\_instantiate2.py", line 97, in _call_target
    raise InstantiationException(msg) from e
hydra.errors.InstantiationException: Error in call to target 'transformers.models.auto.tokenization_auto.AutoTokenizer.from_pretrained':
ValueError('The checkpoint you are trying to load has model type `dual_ar` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.')
full_key: data.train_dataset.tokenizer
[2024-12-03 22:37:50,875][fish_speech.utils.utils][INFO] - [rank: 0] Output dir: results/Speaker1
Error executing job with overrides: ['project=Speaker1', '+lora@model.model.lora_config=r_8_alpha_16', 'trainer.strategy.process_group_backend=gloo']
Error in call to target 'transformers.models.auto.tokenization_auto.AutoTokenizer.from_pretrained':
ValueError('The checkpoint you are trying to load has model type `dual_ar` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.')
full_key: data.train_dataset.tokenizer

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Resolution (wrong)

  1. The traceback shows that loading the tokenizer fails because the dual_ar model type is not supported
  2. Uninstalling and reinstalling transformers did not solve the problem
  3. Copying the tokenizer and tokenizer-config files over from Fish Speech v1.4 made it run
  4. Don't try this lightly; it still errors out later on, although I did not run into that problem myself

Resolution (correct)

Wait for the Fish Speech dev team to update the training code.

Resolution (extreme)

  1. This error comes from the tokenizer loader rejecting the dual_ar model type. Checking how the training config loads the tokenizer, I noticed a tokenizer.py file under the fish_speech folder containing a FishTokenizer class, so I modified the training config:
    # tokenizer:
    #   _target_: transformers.AutoTokenizer.from_pretrained
    #   pretrained_model_name_or_path: ${pretrained_ckpt_path}

    tokenizer:
      _target_: fish_speech.tokenizer.FishTokenizer.from_pretrained
      path: ${pretrained_ckpt_path}
  2. After running the command again, the tokenizer-loading error was gone, but a new error popped up:
    AttributeError: 'FishTokenizer' object has no attribute 'convert_tokens_to_ids'
  3. Looking through the methods of fish_speech.tokenizer.FishTokenizer, only encode seemed related to convert_tokens_to_ids, so I added a method directly to FishTokenizer (the convert_tokens_to_ids call site passes some parameters that encode does not support, while encode expects parameters the call site does not pass; since I did not know how to feed the new parameters into encode, I simply changed encode's signature):
    def encode(self, s: str, allowed_special: bool | set[str] = True, max_length: int = 4096, add_special_tokens: bool = False, truncation: bool = False) -> list[int]:
        assert isinstance(s, str)

        subs = []
        for i in range(0, len(s), TIKTOKEN_MAX_ENCODE_CHARS):
            subs.append(s[i : i + TIKTOKEN_MAX_ENCODE_CHARS])

        if allowed_special is True:
            allowed_special = self.tkt_model.special_tokens_set
        elif allowed_special is False:
            allowed_special = set()

        return sum(
            self.tkt_model.encode_batch(
                subs, allowed_special=allowed_special, disallowed_special=set()
            ),
            start=[],
        )

    def convert_tokens_to_ids(self, ss: str | list, allowed_special: bool | set[str] = True, max_length: int = 4096, add_special_tokens: bool = False, truncation: bool = False) -> list[int]:
        if isinstance(ss, str):
            result = [self.encode(ss, allowed_special, max_length, add_special_tokens, truncation)]
        elif isinstance(ss, list):
            result = []
            for token in ss:
                tmp = self.encode(token, allowed_special, max_length, add_special_tokens, truncation)
                if isinstance(tmp, int):
                    result.append([tmp])
                else:
                    result += [tmp]
        else:
            raise Exception("What the fuck")
        return result
  4. Running the command again, a problem appeared in fish_speech.datasets.semantic:
        tokens = torch.tensor(tokens, dtype=torch.long)
    TypeError: 'list' object cannot be interpreted as an integer
  5. The traceback suggested that the structure of tokens no longer matched what torch.tensor accepts, so I began a long road of brute-force patching (not that long really, about 20 minutes):
    # (fish_speech.datasets.semantic.AutoTextSemanticInstructionDataset.pack_sentences)
    # tokens = (
    #     encoded
    #     + [self.semantic_token_id] * semantic_length
    #     + self.tokenizer.convert_tokens_to_ids(["<|im_end|>"])
    # )

    tokens = (
        encoded
        + [self.semantic_token_id] * semantic_length
        + self.tokenizer.convert_tokens_to_ids(["<|im_end|>"])
    )

    # tokens = [tokens] + codes

    def flatten(lst):
        # Recursively flatten arbitrarily nested lists of token ids
        for item in lst:
            if isinstance(item, list):
                yield from flatten(item)
            else:
                yield item

    max_num = 1
    tokens = list(flatten(tokens))
    max_num = max(max_num, len(tokens))
    token = [tokens]
    for code in codes:
        max_num = max(max_num, len(code))
        token.append(code)

    def pad_nested_lists(nested_list, target_length, fill_value=0):
        # Right-pad every sublist to target_length with fill_value
        return [
            sublist + [fill_value] * (target_length - len(sublist))
            if len(sublist) < target_length
            else sublist
            for sublist in nested_list
        ]

    tokens = pad_nested_lists(token, max_num)  # pad the nested `token` list, not the flat `tokens`
  6. After running the command again, the torch.tensor error was gone, but another error appeared:
    RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value strategy='ddp_find_unused_parameters_true' or by setting the flag in the strategy with strategy=DDPStrategy(find_unused_parameters=True).
  7. This turned out to be a DDP (multi-GPU distributed training) issue; since I don't have multiple GPUs, I modified the training config directly:
    # trainer:
    #   accumulate_grad_batches: 1
    #   gradient_clip_val: 1.0
    #   gradient_clip_algorithm: "norm"
    #   max_steps: 10000
    #   precision: bf16-true
    #   limit_val_batches: 10
    #   val_check_interval: 100

    trainer:
      accumulate_grad_batches: 1
      gradient_clip_val: 1.0
      gradient_clip_algorithm: "norm"
      max_steps: 10000
      precision: bf16-true
      limit_val_batches: 10
      val_check_interval: 100
      # accelerator: gpu # set according to your device
      # devices: 1 # set according to your device
      strategy: auto
  8. After running once more, training proceeded, and the LoRA could be merged as well.

Error (ver 1.5 [2024/12/6])

Command executed

python tools/llama/generate.py \
--text "我是迷迭香。" \
--prompt-text "你好,博士。" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/fish-speech-1.4"

Error

FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/fish-speech-1.4/tokenizer.tiktoken'

Resolution

  1. This happens when running inference with the v1.4 model on the v1.5 code; downgrade to v1.4 and it works:
    git checkout tags/v1.4.3