
add tts voice enum, and add tts to were-wolf example #749

Open
flystar32 wants to merge 3 commits into agentscope-ai:main from flystar32:main

Conversation

@flystar32
Collaborator

Change-Id: I43aebb346398a85766e78f2f2e8723b1977a07da

AgentScope-Java Version

1.0.9-SNAPSHOT

Description

Relates to #748

Add TTS (Text-to-Speech) to the ReActAgent so it can speak responses aloud during player discussions.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with mvn spotless:apply
  • All tests are passing (mvn test)
  • Javadoc comments are complete and follow project conventions
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

Change-Id: I43aebb346398a85766e78f2f2e8723b1977a07da
Change-Id: I1de111ea9ed409db79a4e61f47fadbed39777f74
@gemini-code-assist

Summary of Changes

Hello @flystar32, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the interactivity of the AgentScope-Java framework by adding Text-to-Speech capabilities, primarily demonstrated through its integration into the Werewolf game example. It introduces a comprehensive enumeration of Qwen3 TTS voices and implements the backend logic to generate speech from AI agent responses. On the frontend, new JavaScript functionality has been added to handle the streaming audio, ensuring a smooth and coordinated auditory experience for players.

Highlights

  • New TTS Voice Enumeration: Introduced Qwen3TTSFlashVoice enum, defining 17 distinct voices for Qwen3 Text-to-Speech models, complete with gender and descriptive characteristics (a structural sketch follows this list).
  • Werewolf Game TTS Integration: Implemented Text-to-Speech functionality within the Werewolf example, allowing AI agents to vocalize their responses during the day discussion phase.
  • Dynamic Voice Assignment: Each AI player in the Werewolf game is now assigned a random, unique TTS voice from the new enumeration at the start of a game.
  • Frontend Audio Playback: The web frontend (app.js) has been updated to receive and play streaming audio chunks, managing concurrent speech by queuing players and highlighting the current speaker.
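
As noted in the first highlight, the new enum carries per-voice metadata. The sketch below shows the plausible shape of Qwen3TTSFlashVoice based on the field names and the CHERRY constant referenced elsewhere in this review; the display name and description shown for CHERRY are illustrative placeholders, not values from the diff.

    public enum Qwen3TTSFlashVoice {

        // One of the 17 documented timbres; metadata here is illustrative only
        CHERRY("Cherry", "Cherry", Gender.FEMALE, "Bright, friendly general-purpose voice");
        // ... the remaining 16 voices are omitted in this sketch ...

        private final String voiceId;
        private final String displayName;
        private final Gender gender;
        private final String description;

        Qwen3TTSFlashVoice(String voiceId, String displayName, Gender gender, String description) {
            this.voiceId = voiceId;
            this.displayName = displayName;
            this.gender = gender;
            this.description = description;
        }

        public String getVoiceId() {
            return voiceId;
        }

        public enum Gender { MALE, FEMALE }
    }
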
Changelog
  • agentscope-core/src/main/java/io/agentscope/core/model/tts/Qwen3TTSFlashVoice.java
    • Added a new enum Qwen3TTSFlashVoice to define 17 Qwen3 TTS Flash/Realtime voices, including their voiceId, displayName, gender, and description.
    • Included utility methods for finding voices by ID and picking a random voice.
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEvent.java
    • Added a static factory method audioChunk to create GameEvent objects for transmitting TTS audio data.
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEventEmitter.java
    • Introduced emitAudioChunk method to send AUDIO_CHUNK events containing Base64 encoded audio data to connected clients (see the sketch after this changelog).
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEventType.java
    • Added a new enum constant AUDIO_CHUNK to represent events related to TTS audio.
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/WerewolfWebGame.java
    • Integrated DashScopeRealtimeTTSModel to generate speech from AI agent text responses.
    • Implemented logic to assign a random Qwen3TTSFlashVoice to each player at the start of the game.
    • Modified the discussionPhase to trigger TTS generation for AI player speeches.
    • Added generateTTSForSpeech private method to handle TTS model instantiation, audio stream subscription, and emission of audio chunks.
    • Corrected the emitPlayerVote calls in werewolvesKill and votingPhase to use the correct player name for the vote emitter.
  • agentscope-examples/werewolf-hitl/src/main/resources/static/js/app.js
    • Added client-side JavaScript to initialize AudioContext and manage audio playback.
    • Implemented handleAudioChunk to receive Base64 encoded audio data, decode it, and add it to a player-specific audio queue.
    • Developed a global audio coordination system (currentSpeakingPlayer, pendingSpeakingPlayers) to ensure only one player speaks at a time, with subsequent speakers queued.
    • Included functions for playing raw PCM audio chunks, managing audio sources, and converting Base64 to ArrayBuffer.
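
As referenced above, the three event-layer changes form a small pipeline from the TTS stream to the browser. A minimal sketch of how they plausibly fit together; only AUDIO_CHUNK, audioChunk, and emitAudioChunk come from this changelog, while the event setters and the emit call are assumptions:

    // GameEventType.java: new constant for TTS audio events
    public enum GameEventType { /* existing constants, */ AUDIO_CHUNK }

    // GameEvent.java: static factory wrapping one player's Base64 audio chunk
    public static GameEvent audioChunk(String playerName, String base64Audio) {
        GameEvent event = new GameEvent();
        event.setType(GameEventType.AUDIO_CHUNK);   // assumed setter
        event.setPlayerName(playerName);            // assumed setter
        event.setData(base64Audio);                 // assumed setter
        return event;
    }

    // GameEventEmitter.java: push the event to all connected clients
    public void emitAudioChunk(String playerName, String base64Audio) {
        emit(GameEvent.audioChunk(playerName, base64Audio));  // emit(...) is assumed
    }
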
Activity
  • The pull request was created by flystar32.
  • The author has provided a checklist in the description, but all items are currently unchecked, indicating either that these steps are still pending or that the author does not yet consider the code ready for review.
  • The PR aims to address issue [Feature]: Add TTS to were-wolf example #748.

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds Text-to-Speech (TTS) functionality to the werewolf example so that AI players' speech is audible. The changes include a new enum for TTS voices, backend logic to generate audio using DashScope's realtime TTS API, and frontend JavaScript to handle and play the streamed audio. The implementation is solid; my review includes suggestions to improve performance, refactor the TTS generation logic for better maintainability and logging, and apply a minor simplification in the frontend code.

Comment on lines +1166 to +1226
private void generateTTSForSpeech(String playerName, String text) {
    if (text == null || text.trim().isEmpty()) {
        return;
    }

    String apiKey = System.getenv("DASHSCOPE_API_KEY");
    if (apiKey == null || apiKey.isEmpty()) {
        // Skip TTS if no API key
        return;
    }

    // Resolve voice for this player (fallback to a default if not assigned)
    Qwen3TTSFlashVoice voice = playerVoices != null ? playerVoices.get(playerName) : null;
    if (voice == null) {
        voice = Qwen3TTSFlashVoice.CHERRY;
    }

    // Create TTS model for this specific speech
    DashScopeRealtimeTTSModel ttsModel = null;
    try {
        ttsModel =
                DashScopeRealtimeTTSModel.builder()
                        .apiKey(apiKey)
                        .modelName("qwen3-tts-flash-realtime")
                        .voice(voice.getVoiceId())
                        .sampleRate(24000)
                        .format("pcm")
                        .build();

        // Start session
        ttsModel.startSession();

        // Subscribe to audio stream and emit chunks
        ttsModel.getAudioStream()
                .doOnNext(
                        audio -> {
                            if (audio.getSource() instanceof Base64Source src) {
                                emitter.emitAudioChunk(playerName, src.getData());
                            }
                        })
                .subscribe();

        // Push text to TTS
        ttsModel.push(text);

        // Finish and wait for all audio
        ttsModel.finish().blockLast();
    } catch (Exception e) {
        // Log error but don't fail the game
        System.err.println("TTS generation error for " + playerName + ": " + e.getMessage());
    } finally {
        // Clean up TTS resources
        if (ttsModel != null) {
            try {
                ttsModel.close();
            } catch (Exception e) {
                // Ignore cleanup errors
            }
        }
    }
}


high

The generateTTSForSpeech method can be improved in several ways:

  • Clarity and Simplicity: The DashScopeRealtimeTTSModel provides a synthesizeStream(text) method that simplifies the logic by handling session creation, text pushing, and cleanup internally. This removes the need for manual startSession, push, finish, and close calls.
  • Logging: Using System.err.println for logging is not ideal. It's better to use a dedicated logger like SLF4J. Please add a logger field to the class: private static final org.slf4j.Logger log = org.slf4j.LoggerFactory.getLogger(WerewolfWebGame.class);.
  • Maintainability: Hardcoded values like the model name, sample rate, and format should be extracted into constants, preferably in WerewolfGameConfig.java, to make them easier to manage.
  • Performance: System.getenv("DASHSCOPE_API_KEY") is called for every speech. The API key should be fetched once when the game starts and stored in a field for reuse.
    private void generateTTSForSpeech(String playerName, String text) {
        if (text == null || text.trim().isEmpty()) {
            return;
        }

        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        if (apiKey == null || apiKey.isEmpty()) {
            // Skip TTS if no API key, but log a warning.
            log.warn("DASHSCOPE_API_KEY is not set. Skipping TTS generation.");
            return;
        }

        // Resolve voice for this player (fallback to a default if not assigned)
        Qwen3TTSFlashVoice voice = playerVoices.get(playerName);
        if (voice == null) {
            voice = Qwen3TTSFlashVoice.CHERRY;
            log.warn("No voice assigned for player {}. Falling back to default.", playerName);
        }

        try {
            DashScopeRealtimeTTSModel ttsModel =
                    DashScopeRealtimeTTSModel.builder()
                            .apiKey(apiKey)
                            .modelName("qwen3-tts-flash-realtime") // Consider using a constant from WerewolfGameConfig
                            .voice(voice.getVoiceId())
                            .sampleRate(24000) // Consider using a constant
                            .format("pcm") // Consider using a constant
                            .build();

            ttsModel.synthesizeStream(text)
                    .doOnNext(
                            audio -> {
                                if (audio.getSource() instanceof Base64Source src) {
                                    emitter.emitAudioChunk(playerName, src.getData());
                                }
                            })
                    .doOnError(e -> log.error("Error in TTS audio stream for player {}", playerName, e))
                    .blockLast();
        } catch (Exception e) {
            // Log error but don't fail the game
            log.error("TTS generation failed for player {}", playerName, e);
        }
    }

Comment on lines +146 to +163
/**
 * Find a voice enum by its voiceId (case-insensitive).
 *
 * @param voiceId the voice id string, e.g. "Cherry"
 * @return matching enum value, or {@code null} if not found
 */
public static Qwen3TTSFlashVoice fromVoiceId(String voiceId) {
    if (voiceId == null || voiceId.isEmpty()) {
        return null;
    }
    String normalized = voiceId.toLowerCase(Locale.ROOT);
    for (Qwen3TTSFlashVoice v : values()) {
        if (v.voiceId.toLowerCase(Locale.ROOT).equals(normalized)) {
            return v;
        }
    }
    return null;
}


medium

For better performance, it's recommended to use a static final Map to cache the voices by their ID. This avoids iterating over all enum values on every call to fromVoiceId, which can be inefficient if called frequently. The map can be initialized in a static block.

    private static final java.util.Map<String, Qwen3TTSFlashVoice> VOICE_ID_MAP;

    static {
        VOICE_ID_MAP = java.util.stream.Stream.of(values()).collect(
                java.util.stream.Collectors.toUnmodifiableMap(
                        v -> v.voiceId.toLowerCase(java.util.Locale.ROOT),
                        v -> v
                )
        );
    }

    /**
     * Find a voice enum by its voiceId (case-insensitive).
     *
     * @param voiceId the voice id string, e.g. "Cherry"
     * @return matching enum value, or {@code null} if not found
     */
    public static Qwen3TTSFlashVoice fromVoiceId(String voiceId) {
        if (voiceId == null || voiceId.isEmpty()) {
            return null;
        }
        return VOICE_ID_MAP.get(voiceId.toLowerCase(java.util.Locale.ROOT));
    }

Comment on lines +1008 to +1015
function base64ToArrayBuffer(base64) {
    const binaryString = atob(base64);
    const bytes = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        bytes[i] = binaryString.charCodeAt(i);
    }
    return bytes.buffer;
}


medium

The base64ToArrayBuffer function can be written more concisely, and potentially faster, using Uint8Array.from() with a mapping function. This avoids the manual for loop.

function base64ToArrayBuffer(base64) {
    const binaryString = atob(base64);
    const bytes = Uint8Array.from(binaryString, c => c.charCodeAt(0));
    return bytes.buffer;
}

Contributor

Copilot AI left a comment


Pull request overview

This PR adds Text-to-Speech (TTS) functionality to the werewolf game example to enhance player immersion during gameplay. It introduces a new Qwen3TTSFlashVoice enum with 17 predefined voice options for the DashScope TTS service, integrates real-time TTS generation in the backend, and implements audio streaming and playback in the frontend. Additionally, the PR fixes two existing bugs where incorrect player names were being used when emitting vote events.

Changes:

  • Added Qwen3TTSFlashVoice enum to define available TTS voices with metadata (gender, display names, descriptions)
  • Integrated TTS generation in WerewolfWebGame to synthesize speech during day discussions with randomized voice assignment per player
  • Extended event system with AUDIO_CHUNK event type for streaming audio data from backend to frontend
  • Implemented JavaScript audio playback system with per-player queuing, global speaker coordination, and PCM audio processing
  • Fixed bugs where vote.getName() was used instead of werewolf.getName() and player.getName() in vote emission (sketched below)
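
A minimal sketch of the vote-emission fix described in the last bullet, assuming emitPlayerVote takes the voter's name first and the target's name second (the parameter order is not shown in this overview):

    // werewolvesKill: before, the vote target was also passed as the voter
    // emitter.emitPlayerVote(vote.getName(), vote.getName());
    emitter.emitPlayerVote(werewolf.getName(), vote.getName());

    // votingPhase: same correction, using the voting player's own name
    emitter.emitPlayerVote(player.getName(), vote.getName());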

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

Summary per file:

  • agentscope-core/src/main/java/io/agentscope/core/model/tts/Qwen3TTSFlashVoice.java: New enum defining 17 TTS voice options with gender classification and helper methods for voice selection
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/WerewolfWebGame.java: Integrates TTS generation, assigns random voices to players, fixes vote emission bugs, adds audio streaming method
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEventType.java: Adds AUDIO_CHUNK event type for TTS audio streaming
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEventEmitter.java: Implements emitAudioChunk method to send audio data to frontend
  • agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEvent.java: Adds factory method for creating audio chunk events
  • agentscope-examples/werewolf-hitl/src/main/resources/static/js/app.js: Implements complete audio playback system with context initialization, chunk handling, queue management, and PCM decoding

Comment on lines 1 to 6
/*
 * Qwen3 TTS Flash / Realtime voices enumeration.
 *
 * This enum lists the officially documented 17 timbres for
 * qwen3-tts-flash / qwen3-tts-flash-realtime models.
 */

Copilot AI Feb 6, 2026


Missing the Apache License 2.0 copyright header. All Java files in the codebase must include the standard copyright header with years 2024-2026. The header should be added before the package declaration, following the same format used in other files like DashScopeRealtimeTTSModel.java.
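
For reference, the standard Apache License 2.0 header reads roughly as follows; the copyright holder and exact wording should be copied from an existing project file such as DashScopeRealtimeTTSModel.java rather than from this sketch, using the 2024-2026 years the reviewer mentions.

    /*
     * Copyright 2024-2026 <copyright holder>
     *
     * Licensed under the Apache License, Version 2.0 (the "License");
     * you may not use this file except in compliance with the License.
     * You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    package io.agentscope.core.model.tts;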

flystar32 (Collaborator, Author) replied:


@copilot open a new pull request to apply changes based on this feedback

/**
 * 墨讲师 (Elias) - 兼具严谨与叙事性的讲师音色。
 */
ELIAS("Elias", "墨讲师", Gender.FEMALE, "兼具严谨与叙事性的讲师音色"),

Copilot AI Feb 6, 2026


The gender for ELIAS is marked as FEMALE, but based on the name "Elias" and description "墨讲师" (Lecturer Mo), this appears to be a male voice and should be Gender.MALE instead.

Suggested change
- ELIAS("Elias", "墨讲师", Gender.FEMALE, "兼具严谨与叙事性的讲师音色"),
+ ELIAS("Elias", "墨讲师", Gender.MALE, "兼具严谨与叙事性的讲师音色"),

Comment on lines +1211 to +1212
// Finish and wait for all audio
ttsModel.finish().blockLast();

Copilot AI Feb 6, 2026


The blocking call to finish().blockLast() can potentially hang indefinitely if the TTS service is unresponsive. Consider adding a timeout to prevent the game from freezing during TTS generation. For example: ttsModel.finish().timeout(Duration.ofSeconds(30)).blockLast()
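
A sketch of that bounded variant, assuming finish() returns a Reactor publisher as the existing blockLast() call implies; the 30-second budget is an arbitrary example, not a project constant.

    // Bound the wait so an unresponsive TTS service cannot stall the game loop.
    ttsModel.finish()
            .timeout(Duration.ofSeconds(30))   // java.time.Duration; errors with TimeoutException on expiry
            .blockLast();
    // The surrounding catch (Exception e) block already logs the failure without aborting the game.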

Comment on lines +881 to +886
async function playAudio(audioPlayer, playerName) {
    if (audioPlayer.isPlaying || audioPlayer.chunks.length === 0) {
        return;
    }

    audioPlayer.isPlaying = true;

Copilot AI Feb 6, 2026


There is a potential race condition where audioPlayer.isPlaying can be checked and modified concurrently. If multiple audio chunks arrive rapidly for the same player, the check at line 882 and the assignment at line 886 are not atomic, which could lead to multiple concurrent playAudio calls for the same player. Consider using a proper state machine or locking mechanism to ensure only one playback loop runs per player at a time.

Comment on lines 798 to +806
});

// ==================== Audio Functions ====================
/**
 * Initialize audio context on first user interaction.
 */
function initAudio() {
    if (!audioContext) {
        audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });

Copilot AI Feb 6, 2026


The audioContext is initialized lazily but only on the first handleAudioChunk call. If audio arrives before any user interaction (as required by browser autoplay policies), the AudioContext may fail to initialize or be in a suspended state. Consider adding explicit user interaction handling (e.g., a play button or initializing audio on game start button click) to ensure the AudioContext is properly initialized before audio playback begins.

Suggested change

Original lines:

});

// ==================== Audio Functions ====================
/**
 * Initialize audio context on first user interaction.
 */
function initAudio() {
    if (!audioContext) {
        audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });

Proposed replacement:

    // Initialize/resume AudioContext on first user interaction to satisfy autoplay policies
    const userInteractionHandler = () => {
        initAudio();
        if (audioContext && audioContext.state === 'running') {
            document.removeEventListener('click', userInteractionHandler);
            document.removeEventListener('keydown', userInteractionHandler);
        }
    };
    document.addEventListener('click', userInteractionHandler);
    document.addEventListener('keydown', userInteractionHandler);
});

// ==================== Audio Functions ====================
/**
 * Initialize or resume the audio context.
 * This function is safe to call multiple times.
 */
function initAudio() {
    try {
        // Create audio context if it does not exist yet
        if (!audioContext) {
            const AudioCtx = window.AudioContext || window.webkitAudioContext;
            if (!AudioCtx) {
                console.error('Web Audio API is not supported in this browser.');
                return;
            }
            audioContext = new AudioCtx({ sampleRate: 24000 });
        }
        // Resume if the context is suspended (common after autoplay restrictions)
        if (audioContext && audioContext.state === 'suspended') {
            audioContext.resume().catch(error => {
                console.error('Failed to resume AudioContext:', error);
            });
        }
    } catch (error) {
        console.error('Failed to initialize AudioContext:', error);

Comment on lines +978 to +981
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
audioPlayer.sources.push(source);

Copilot AI Feb 6, 2026


The audioPlayer.sources array is used to track active audio sources, but if an error occurs during playback or the audioPlayer.isPlaying flag is set to false prematurely, these sources may not be properly cleaned up. This could lead to memory leaks if many audio chunks are played over time. Consider adding explicit cleanup logic or using a WeakMap to allow garbage collection of completed sources.

Comment on lines +165 to +167
/**
 * Pick a random voice using {@link ThreadLocalRandom}.
 */

Copilot AI Feb 6, 2026


The random() method documentation is missing the @return tag. Following the Javadoc patterns used in the codebase, this method should document its return value.
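
A sketch of the documented method with the missing tag added; the method body here is an assumption, since the diff excerpt above shows only the Javadoc.

    /**
     * Pick a random voice using {@link ThreadLocalRandom}.
     *
     * @return a randomly selected {@code Qwen3TTSFlashVoice} constant
     */
    public static Qwen3TTSFlashVoice random() {
        Qwen3TTSFlashVoice[] voices = values();  // assumed implementation
        return voices[ThreadLocalRandom.current().nextInt(voices.length)];
    }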

            try {
                ttsModel.close();
            } catch (Exception e) {
                // Ignore cleanup errors

Copilot AI Feb 6, 2026


Silent error suppression in the cleanup finally block could hide important issues. The exception is caught but completely ignored without any logging. While it's generally acceptable to suppress cleanup errors, at minimum the error should be logged for debugging purposes (e.g., System.err.println or storing the error message).

Suggested change
- // Ignore cleanup errors
+ System.err.println(
+         "Failed to close TTS model for " + playerName + ": " + e.getMessage());


/**
 * Handle audio chunk event from backend.
 *

Copilot AI Feb 6, 2026


The function documentation says it handles an audio chunk event but has an extra space in the comment. Minor typo: "Handle audio chunk event from backend." should have consistent spacing after the asterisk.

Suggested change (whitespace only): the asterisk-only comment line with the extra space removed.

/**
 * Play audio from queue.
 *
 * @param {object} audioPlayer - Audio player object

Copilot AI Feb 6, 2026


The function documentation has an extra space in the comment. Minor formatting issue: the asterisk line should have consistent spacing.

Suggested change (whitespace only): the @param {object} audioPlayer line with the extra space removed.

@codecov

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 95.45455% with 2 lines in your changes missing coverage. Please review.

Files with missing lines:
  • .../agentscope/core/model/tts/Qwen3TTSFlashVoice.java: patch 95.45%, 1 Missing and 1 partial ⚠️


Change-Id: I25b584d9877e14beba50111c50eacd58a13f2f86