add tts voice enum, and add tts to were-wolf example #749
flystar32 wants to merge 3 commits into agentscope-ai:main
Change-Id: I43aebb346398a85766e78f2f2e8723b1977a07da
Change-Id: I1de111ea9ed409db79a4e61f47fadbed39777f74
Summary of Changes
Hello @flystar32, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the interactivity of the AgentScope-Java framework by adding Text-to-Speech capabilities, primarily demonstrated through its integration into the Werewolf game example. It introduces a comprehensive enumeration of Qwen3 TTS voices and implements the backend logic to generate speech from AI agent responses. On the frontend, new JavaScript functionality has been added to handle the streaming audio, ensuring a smooth and coordinated auditory experience for players.
Code Review
This pull request adds Text-to-Speech (TTS) functionality to the werewolf example, allowing AI players' speech to be audible. The changes include a new enum for TTS voices, backend logic to generate audio using DashScope's realtime TTS API, and frontend JavaScript to handle and play the streamed audio. The implementation is solid, and my review includes suggestions to improve performance, refactor the TTS generation logic for better maintainability and logging, and a minor simplification in the frontend code.
private void generateTTSForSpeech(String playerName, String text) {
    if (text == null || text.trim().isEmpty()) {
        return;
    }

    String apiKey = System.getenv("DASHSCOPE_API_KEY");
    if (apiKey == null || apiKey.isEmpty()) {
        // Skip TTS if no API key
        return;
    }

    // Resolve voice for this player (fallback to a default if not assigned)
    Qwen3TTSFlashVoice voice = playerVoices != null ? playerVoices.get(playerName) : null;
    if (voice == null) {
        voice = Qwen3TTSFlashVoice.CHERRY;
    }

    // Create TTS model for this specific speech
    DashScopeRealtimeTTSModel ttsModel = null;
    try {
        ttsModel =
                DashScopeRealtimeTTSModel.builder()
                        .apiKey(apiKey)
                        .modelName("qwen3-tts-flash-realtime")
                        .voice(voice.getVoiceId())
                        .sampleRate(24000)
                        .format("pcm")
                        .build();

        // Start session
        ttsModel.startSession();

        // Subscribe to audio stream and emit chunks
        ttsModel.getAudioStream()
                .doOnNext(
                        audio -> {
                            if (audio.getSource() instanceof Base64Source src) {
                                emitter.emitAudioChunk(playerName, src.getData());
                            }
                        })
                .subscribe();

        // Push text to TTS
        ttsModel.push(text);

        // Finish and wait for all audio
        ttsModel.finish().blockLast();
    } catch (Exception e) {
        // Log error but don't fail the game
        System.err.println("TTS generation error for " + playerName + ": " + e.getMessage());
    } finally {
        // Clean up TTS resources
        if (ttsModel != null) {
            try {
                ttsModel.close();
            } catch (Exception e) {
                // Ignore cleanup errors
            }
        }
    }
}
The generateTTSForSpeech method can be improved in several ways:
- Clarity and Simplicity: The DashScopeRealtimeTTSModel provides a synthesizeStream(text) method that simplifies the logic by handling session creation, text pushing, and cleanup internally. This removes the need for manual startSession, push, finish, and close calls.
- Logging: Using System.err.println for logging is not ideal. It's better to use a dedicated logger like SLF4J. Please add a logger field to the class: private static final org.slf4j.Logger log = org.slf4j.LoggerFactory.getLogger(WerewolfWebGame.class);
- Maintainability: Hardcoded values like the model name, sample rate, and format should be extracted into constants, preferably in WerewolfGameConfig.java, to make them easier to manage.
- Performance: System.getenv("DASHSCOPE_API_KEY") is called for every speech. The API key should be fetched once when the game starts and stored in a field for reuse.
private void generateTTSForSpeech(String playerName, String text) {
    if (text == null || text.trim().isEmpty()) {
        return;
    }
    String apiKey = System.getenv("DASHSCOPE_API_KEY");
    if (apiKey == null || apiKey.isEmpty()) {
        // Skip TTS if no API key, but log a warning.
        log.warn("DASHSCOPE_API_KEY is not set. Skipping TTS generation.");
        return;
    }
    // Resolve voice for this player (fallback to a default if not assigned)
    Qwen3TTSFlashVoice voice = playerVoices.get(playerName);
    if (voice == null) {
        voice = Qwen3TTSFlashVoice.CHERRY;
        log.warn("No voice assigned for player {}. Falling back to default.", playerName);
    }
    try {
        DashScopeRealtimeTTSModel ttsModel =
                DashScopeRealtimeTTSModel.builder()
                        .apiKey(apiKey)
                        .modelName("qwen3-tts-flash-realtime") // Consider using a constant from WerewolfGameConfig
                        .voice(voice.getVoiceId())
                        .sampleRate(24000) // Consider using a constant
                        .format("pcm") // Consider using a constant
                        .build();
        ttsModel.synthesizeStream(text)
                .doOnNext(
                        audio -> {
                            if (audio.getSource() instanceof Base64Source src) {
                                emitter.emitAudioChunk(playerName, src.getData());
                            }
                        })
                .doOnError(e -> log.error("Error in TTS audio stream for player {}", playerName, e))
                .blockLast();
    } catch (Exception e) {
        // Log error but don't fail the game
        log.error("TTS generation failed for player {}", playerName, e);
    }
}

/**
 * Find a voice enum by its voiceId (case-insensitive).
 *
 * @param voiceId the voice id string, e.g. "Cherry"
 * @return matching enum value, or {@code null} if not found
 */
public static Qwen3TTSFlashVoice fromVoiceId(String voiceId) {
    if (voiceId == null || voiceId.isEmpty()) {
        return null;
    }
    String normalized = voiceId.toLowerCase(Locale.ROOT);
    for (Qwen3TTSFlashVoice v : values()) {
        if (v.voiceId.toLowerCase(Locale.ROOT).equals(normalized)) {
            return v;
        }
    }
    return null;
}
For better performance, it's recommended to use a static final Map to cache the voices by their ID. This avoids iterating over all enum values on every call to fromVoiceId, which can be inefficient if called frequently. The map can be initialized in a static block.
private static final java.util.Map<String, Qwen3TTSFlashVoice> VOICE_ID_MAP;

static {
    VOICE_ID_MAP = java.util.stream.Stream.of(values()).collect(
            java.util.stream.Collectors.toUnmodifiableMap(
                    v -> v.voiceId.toLowerCase(java.util.Locale.ROOT),
                    v -> v
            )
    );
}

/**
 * Find a voice enum by its voiceId (case-insensitive).
 *
 * @param voiceId the voice id string, e.g. "Cherry"
 * @return matching enum value, or {@code null} if not found
 */
public static Qwen3TTSFlashVoice fromVoiceId(String voiceId) {
    if (voiceId == null || voiceId.isEmpty()) {
        return null;
    }
    return VOICE_ID_MAP.get(voiceId.toLowerCase(java.util.Locale.ROOT));
}

function base64ToArrayBuffer(base64) {
    const binaryString = atob(base64);
    const bytes = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        bytes[i] = binaryString.charCodeAt(i);
    }
    return bytes.buffer;
}
The base64ToArrayBuffer function can be written more concisely and potentially more performantly using Uint8Array.from() with a mapping function. This avoids the manual for loop.
function base64ToArrayBuffer(base64) {
    const binaryString = atob(base64);
    const bytes = Uint8Array.from(binaryString, c => c.charCodeAt(0));
    return bytes.buffer;
}
Pull request overview
This PR adds Text-to-Speech (TTS) functionality to the werewolf game example to enhance player immersion during gameplay. It introduces a new Qwen3TTSFlashVoice enum with 17 predefined voice options for the DashScope TTS service, integrates real-time TTS generation in the backend, and implements audio streaming and playback in the frontend. Additionally, the PR fixes two existing bugs where incorrect player names were being used when emitting vote events.
Changes:
- Added Qwen3TTSFlashVoice enum to define available TTS voices with metadata (gender, display names, descriptions)
- Integrated TTS generation in WerewolfWebGame to synthesize speech during day discussions with randomized voice assignment per player
- Extended event system with AUDIO_CHUNK event type for streaming audio data from backend to frontend
- Implemented JavaScript audio playback system with per-player queuing, global speaker coordination, and PCM audio processing
- Fixed bugs where vote.getName() was used instead of werewolf.getName() and player.getName() in vote emission
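The randomized per-player voice assignment, together with the fallback to a default voice seen in the reviewed code, might look roughly like the sketch below. It is an illustration only: the `Voice` enum is a hypothetical stand-in for Qwen3TTSFlashVoice with just two of its constants, and the method names `assignVoices` and `resolveVoice` are assumptions, not names taken from the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class VoiceAssignment {

    /** Hypothetical stand-in for Qwen3TTSFlashVoice; the real enum defines 17 voices. */
    enum Voice { CHERRY, ELIAS }

    /** Assigns each player a randomly chosen voice (voices may repeat across players). */
    static Map<String, Voice> assignVoices(List<String> playerNames) {
        Voice[] all = Voice.values();
        Map<String, Voice> assigned = new LinkedHashMap<>();
        for (String name : playerNames) {
            assigned.put(name, all[ThreadLocalRandom.current().nextInt(all.length)]);
        }
        return assigned;
    }

    /** Looks up a player's voice, falling back to CHERRY when none is assigned. */
    static Voice resolveVoice(Map<String, Voice> playerVoices, String playerName) {
        Voice voice = playerVoices != null ? playerVoices.get(playerName) : null;
        return voice != null ? voice : Voice.CHERRY;
    }

    public static void main(String[] args) {
        Map<String, Voice> voices = assignVoices(List.of("Alice", "Bob", "Carol"));
        System.out.println(voices);
        System.out.println(resolveVoice(voices, "Unknown player"));
    }
}
```

Resolving the voice through a single helper keeps the null check on the map and the default in one place, which is the behaviour the original generateTTSForSpeech implements inline.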
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| agentscope-core/src/main/java/io/agentscope/core/model/tts/Qwen3TTSFlashVoice.java | New enum defining 17 TTS voice options with gender classification and helper methods for voice selection |
| agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/WerewolfWebGame.java | Integrates TTS generation, assigns random voices to players, fixes vote emission bugs, adds audio streaming method |
| agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEventType.java | Adds AUDIO_CHUNK event type for TTS audio streaming |
| agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEventEmitter.java | Implements emitAudioChunk method to send audio data to frontend |
| agentscope-examples/werewolf-hitl/src/main/java/io/agentscope/examples/werewolf/web/GameEvent.java | Adds factory method for creating audio chunk events |
| agentscope-examples/werewolf-hitl/src/main/resources/static/js/app.js | Implements complete audio playback system with context initialization, chunk handling, queue management, and PCM decoding |
/*
 * Qwen3 TTS Flash / Realtime voices enumeration.
 *
 * This enum lists the officially documented 17 timbres for
 * qwen3-tts-flash / qwen3-tts-flash-realtime models.
 */
Missing the Apache License 2.0 copyright header. All Java files in the codebase must include the standard copyright header with years 2024-2026. The header should be added before the package declaration, following the same format used in other files like DashScopeRealtimeTTSModel.java.
@copilot open a new pull request to apply changes based on this feedback
/**
 * 墨讲师 (Elias) - 兼具严谨与叙事性的讲师音色。
 */
ELIAS("Elias", "墨讲师", Gender.FEMALE, "兼具严谨与叙事性的讲师音色"),
The gender for ELIAS is marked as FEMALE, but based on the name "Elias" and description "墨讲师" (Lecturer Mo), this appears to be a male voice and should be Gender.MALE instead.
- ELIAS("Elias", "墨讲师", Gender.FEMALE, "兼具严谨与叙事性的讲师音色"),
+ ELIAS("Elias", "墨讲师", Gender.MALE, "兼具严谨与叙事性的讲师音色"),
// Finish and wait for all audio
ttsModel.finish().blockLast();
The blocking call to finish().blockLast() can potentially hang indefinitely if the TTS service is unresponsive. Consider adding a timeout to prevent the game from freezing during TTS generation. For example: ttsModel.finish().timeout(Duration.ofSeconds(30)).blockLast()
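The idea of bounding a blocking wait can be sketched with the JDK alone. The example below is not the Reactor call suggested above: it uses CompletableFuture as a stand-in for the TTS stream, and the class name, the `awaitOrFallback` helper, and the fallback value are all hypothetical.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedWait {

    /**
     * Runs a task on a background thread and waits at most {@code limit} for it.
     *
     * @return the task result, or {@code fallback} if the task times out or fails
     */
    static <T> T awaitOrFallback(Supplier<T> task, Duration limit, T fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(task);
        try {
            return future.get(limit.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; CompletableFuture does not interrupt the worker thread.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw; the caller carries on without audio
        }
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        String fast = awaitOrFallback(() -> "audio-done", Duration.ofSeconds(5), "skipped");
        String slow = awaitOrFallback(
                () -> { sleepQuietly(2000); return "audio-done"; },
                Duration.ofMillis(100),
                "skipped");
        System.out.println(fast + " / " + slow);
    }
}
```

Whichever mechanism is used, the point is the same: an unresponsive TTS service costs the game at most the timeout, after which the caller continues without audio.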
async function playAudio(audioPlayer, playerName) {
    if (audioPlayer.isPlaying || audioPlayer.chunks.length === 0) {
        return;
    }

    audioPlayer.isPlaying = true;
There is a potential race condition where audioPlayer.isPlaying can be checked and modified concurrently. If multiple audio chunks arrive rapidly for the same player, the check at line 882 and the assignment at line 886 are not atomic, which could lead to multiple concurrent playAudio calls for the same player. Consider using a proper state machine or locking mechanism to ensure only one playback loop runs per player at a time.
});

// ==================== Audio Functions ====================
/**
 * Initialize audio context on first user interaction.
 */
function initAudio() {
    if (!audioContext) {
        audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
The audioContext is initialized lazily but only on the first handleAudioChunk call. If audio arrives before any user interaction (as required by browser autoplay policies), the AudioContext may fail to initialize or be in a suspended state. Consider adding explicit user interaction handling (e.g., a play button or initializing audio on game start button click) to ensure the AudioContext is properly initialized before audio playback begins.
- });
- // ==================== Audio Functions ====================
- /**
-  * Initialize audio context on first user interaction.
-  */
- function initAudio() {
-     if (!audioContext) {
-         audioContext = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
+ // Initialize/resume AudioContext on first user interaction to satisfy autoplay policies
+ const userInteractionHandler = () => {
+     initAudio();
+     if (audioContext && audioContext.state === 'running') {
+         document.removeEventListener('click', userInteractionHandler);
+         document.removeEventListener('keydown', userInteractionHandler);
+     }
+ };
+ document.addEventListener('click', userInteractionHandler);
+ document.addEventListener('keydown', userInteractionHandler);
+ });
+ // ==================== Audio Functions ====================
+ /**
+  * Initialize or resume the audio context.
+  * This function is safe to call multiple times.
+  */
+ function initAudio() {
+     try {
+         // Create audio context if it does not exist yet
+         if (!audioContext) {
+             const AudioCtx = window.AudioContext || window.webkitAudioContext;
+             if (!AudioCtx) {
+                 console.error('Web Audio API is not supported in this browser.');
+                 return;
+             }
+             audioContext = new AudioCtx({ sampleRate: 24000 });
+         }
+         // Resume if the context is suspended (common after autoplay restrictions)
+         if (audioContext && audioContext.state === 'suspended') {
+             audioContext.resume().catch(error => {
+                 console.error('Failed to resume AudioContext:', error);
+             });
+         }
+     } catch (error) {
+         console.error('Failed to initialize AudioContext:', error);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
audioPlayer.sources.push(source);
The audioPlayer.sources array is used to track active audio sources, but if an error occurs during playback or the audioPlayer.isPlaying flag is set to false prematurely, these sources may not be properly cleaned up. This could lead to memory leaks if many audio chunks are played over time. Consider adding explicit cleanup logic or using a WeakMap to allow garbage collection of completed sources.
/**
 * Pick a random voice using {@link ThreadLocalRandom}.
 */
The random() method documentation is missing the @return tag. Following the Javadoc patterns used in the codebase, this method should document its return value.
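A Javadoc block with the return value documented could look like the sketch below. The `Voice` enum is a hypothetical stand-in with two constants, and the method body is an assumption about how random() might be implemented; only the first Javadoc sentence comes from the reviewed code.

```java
import java.util.concurrent.ThreadLocalRandom;

public class RandomVoiceDoc {

    /** Hypothetical stand-in for the real voice enum; only two constants are shown. */
    enum Voice {
        CHERRY,
        ELIAS;

        /**
         * Pick a random voice using {@link ThreadLocalRandom}.
         *
         * @return a randomly selected voice, never {@code null}
         */
        static Voice random() {
            Voice[] all = values();
            return all[ThreadLocalRandom.current().nextInt(all.length)];
        }
    }

    public static void main(String[] args) {
        System.out.println(Voice.random());
    }
}
```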
try {
    ttsModel.close();
} catch (Exception e) {
    // Ignore cleanup errors
Silent error suppression in the cleanup finally block could hide important issues. The exception is caught but completely ignored without any logging. While it's generally acceptable to suppress cleanup errors, at minimum the error should be logged for debugging purposes (e.g., System.err.println or storing the error message).
- // Ignore cleanup errors
+ System.err.println(
+         "Failed to close TTS model for " + playerName + ": " + e.getMessage());
/**
 * Handle audio chunk event from backend.
 *
The function documentation says it handles an audio chunk event but has an extra space in the comment. Minor typo: "Handle audio chunk event from backend." should have consistent spacing after the asterisk.
-  *
+  *
/**
 * Play audio from queue.
 *
 * @param {object} audioPlayer - Audio player object
The function documentation has an extra space in the comment. Minor formatting issue: the asterisk line should have consistent spacing.
-  * @param {object} audioPlayer - Audio player object
+  * @param {object} audioPlayer - Audio player object
Change-Id: I25b584d9877e14beba50111c50eacd58a13f2f86
Change-Id: I43aebb346398a85766e78f2f2e8723b1977a07da
AgentScope-Java Version
1.0.9-SNAPSHOT
Description
to #748
Add TTS (Text-to-Speech) to the ReActAgent so it can speak responses aloud during player discussions.
Checklist
Please check the following items before code is ready to be reviewed.
- mvn spotless:apply
- mvn test