Skip to content

Prd 1694 upgrade diarize model#8

Open
AlexAtDavidhorn wants to merge 64 commits into
mainfrom
PRD-1694-upgrade-diarize-model
Open

Prd 1694 upgrade diarize model#8
AlexAtDavidhorn wants to merge 64 commits into
mainfrom
PRD-1694-upgrade-diarize-model

Conversation

@AlexAtDavidhorn
Copy link
Copy Markdown

Pull Request

This change implements PRD-1694 (New diarization model), and in addition, the following:

  1. Separate dev containers for running on CPU and CUDA
  2. Fixes e2e-tests
  3. Implements e2e-tests for CUDA
  4. Fixes Github workflows for automatic tests across python envs

All changes are tested with e2e-tests on CPU and CUDA.

All commits in this branch are patches.

Jira task: https://davidhorn.atlassian.net/browse/PRD-1694

Mandatory Checks

  • Implements exactly what the Jira task specifies
  • Conforms to the code standard and is readable
  • Has tests for happy paths and relevant failures

Versioning

  • Branch name starts with work-item ID: {{ PRD-1234 }}
  • Commits tagged [major], [minor], or [patch] (default is [patch])

Rebase Rules

  • Keep the branch rebased on main
  • Do not use GitHub’s rebase button

Rebase steps:

  • git checkout main && git pull
  • git checkout {{ feature-branch }}
  • git rebase main
  • Resolve conflicts
  • Run tests
  • git push --force-with-lease

Copy link
Copy Markdown
Member

@TomasEkeli TomasEkeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: Separate devcontainers solve the different-hosts problem, thanks for this!

Comment thread .github/workflows/ci.yml Outdated
Comment thread .github/workflows/ci.yml Outdated
Comment thread .github/workflows/ci.yml Outdated
Comment thread .github/workflows/docker.yml Outdated
Comment thread .gitignore Outdated
Copy link
Copy Markdown
Member

@TomasEkeli TomasEkeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issues: adds a lot of files that should not be added

  • egg-info is generated on build and should not be committed
  • e2e-tests-gpu is a copy of e2e-tests with one added file, just add that file to the original directory instead
  • build -folder should not be committed
  • docker.yml should not be in our repo, it tries to build and publish on the upstream repo's name

Also needs to be rebased on main

@AlexAtDavidhorn
Copy link
Copy Markdown
Author

Issues:

  • My bad re egg-info, I'll remove
  • e2e folders: Need two separate e2e-tests folders, otherwise our GH runner tries to run the cuda tests. If you have a better solution, I'm all ears - I just picked the easiest one here ;)
  • Same with build folder
  • Will remove docker.yml
  • Will rebase to main

Plz see my comments above as well and provide feedback on those :)

@TomasEkeli
Copy link
Copy Markdown
Member

Issue: this seems to have merged main from upstream into the PRD-1694 branch, not main from our fork?

if model_handle is None:
raise ValueError(
f"The token Hugging Face token '{self.use_auth_token}' for diarization is not valid or you did not accept the EULA"
f"The token Hugging Face token '{self.use_auth_token}' for diarization is not valid or you did not accept the EULAs for the necessary models. See https://github.com/Softcatala/whisper-ctranslate2#diarization-speaker-identification"
Copy link
Copy Markdown
Member

@TomasEkeli TomasEkeli Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concern: the text at this link still refers to accepting the terms for pyannote/speaker-diarization-3.1 and this PR switches to pyannote/speaker-diarization-community-1

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a dead PR - I've kept the branch only for cherry-picking commits, hehe. See pull 10 :)

@@ -184,6 +185,7 @@ def inference(
vad_filter=vad,
vad_parameters=vad_parameters,
**batch_size,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: shouldn't the ** parameter be placed last?

Comment on lines +221 to +235
try:
transcribe = Transcribe(
model_dir,
device,
device_index,
compute_type,
threads,
cache_directory,
local_files_only,
batched,
batch_size,
)
except RuntimeError as e:
print(f"error: {e}")
exit(ExitCode.RUNTIME_ERROR)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: good to get some tries here :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants