
Do the wav files in the training data have to be stored locally, or can they be HTTP links? #78

@youxzAnt


As the title says: the audio paths in my training data are of the form http://mdn.alipayobjects.com/gov_gjj/afts/file/A*PIEES5D5xLIAAAAAQkAAAAgAdn11AQ. Can training be completed with HTTP links like this? I am currently getting the following error:
Error executing job with overrides: ['++model=/example/yaze.youxz/Fun-ASR/model', '++trust_remote_code=true', '++train_data_set_list=/ossfs/workspace/Fun-ASR/train_wuyang0113.jsonl', '++valid_data_set_list=/ossfs/workspace/Fun-ASR/val_wuyang0113.jsonl', '++dataset_conf.data_split_num=1', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=6000', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=20', '++train_conf.log_interval=1', '++train_conf.resume=true', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=5000', '++train_conf.effective_save_name_excludes=None', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=false', '++train_conf.deepspeed_config=/ossfs/workspace/Fun-ASR/deepspeed_conf/ds_stage1.json', '++optim_conf.lr=0.0002', '++audio_encoder_conf.freeze=true', '++audio_adaptor_conf.freeze=true', '++llm_conf.freeze=false', '++output_dir=/example/yaze.youxz/Fun-ASR/outputs']
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/bin/funasr-train-ds", line 8, in
[rank0]: sys.exit(main_hydra())
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
[rank0]: _run_hydra(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]: _run_app(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]: run_and_report(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]: raise ex
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]: return func()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in
[rank0]: lambda: hydra.run(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]: _ = ret.return_value
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
[rank0]: raise self._return_value
[rank0]: File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
[rank0]: ret.return_value = task_function(task_cfg)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/funasr/bin/train_ds.py", line 56, in main_hydra
[rank0]: main(**kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/funasr/bin/train_ds.py", line 177, in main
[rank0]: trainer.train_epoch(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/funasr/train_utils/trainer_ds.py", line 603, in train_epoch
[rank0]: self.forward_step(model, batch, loss_dict=loss_dict)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/funasr/train_utils/trainer_ds.py", line 670, in forward_step
[rank0]: retval = model(**batch)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1528, in _pre_forward
[rank0]: if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
[rank0]: making sure all forward function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395
[rank0]: In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

[rank0]:[W129 11:36:49.566637567 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0129 11:36:50.754000 14129 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 14202 closing signal SIGTERM
E0129 11:36:51.068000 14129 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 14201) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
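
The error message itself suggests enabling find_unused_parameters=True on DistributedDataParallel. For clarity, below is a minimal sketch of that setting in plain PyTorch; it is not FunASR's own trainer code (the DDP wrapping happens inside funasr/train_utils/trainer_ds.py), so whether and how this flag is exposed through the training config is an assumption on my side.

```python
# Minimal sketch of the setting the error message suggests, in plain PyTorch.
# This is NOT FunASR's trainer code; it only shows where find_unused_parameters
# would go if the model were wrapped manually.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    # find_unused_parameters=True lets the reducer tolerate parameters that
    # receive no gradient in a given iteration (e.g. branches skipped in forward).
    return DDP(
        model.to(local_rank),
        device_ids=[local_rank],
        find_unused_parameters=True,
    )
```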

What could be the cause of this error?
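
For reference, each line of my train jsonl looks roughly like the sketch below. The field names ("key", "source", "source_len", "target", "target_len") are my guess at the usual FunASR manifest layout rather than an exact copy of my file; the second entry shows the HTTP form I would like to use for "source".

```python
# Hypothetical sketch of one manifest line with a local wav path versus an
# HTTP link; the field names are assumed, not taken from FunASR documentation.
import json

local_entry = {
    "key": "utt_0001",
    "source": "/ossfs/workspace/data/utt_0001.wav",  # local wav file
    "source_len": 1600,
    "target": "transcript text",
    "target_len": 4,
}

# Same entry, but with "source" pointing at an HTTP link instead of a local path.
http_entry = {**local_entry,
              "source": "http://mdn.alipayobjects.com/gov_gjj/afts/file/A*PIEES5D5xLIAAAAAQkAAAAgAdn11AQ"}

print(json.dumps(local_entry, ensure_ascii=False))
print(json.dumps(http_entry, ensure_ascii=False))
```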
