This is a simple API for real-time keyword spotting, built using Python, Flask, and TensorFlow. The project takes a voice input and predicts one of the ten predefined keywords. It is fully containerized using Docker and Docker Compose, making it easy to deploy on any server.
- Real-time Prediction: Uses a pre-trained model to classify audio inputs.
- API-based: Provides a RESTful API endpoint to predict keywords.
- Containerized: Utilizes Docker and Docker Compose for a consistent and portable environment.
- Efficient: Uses uWSGI and Nginx for high-performance serving.
- classifier/: Contains the code for preparing the dataset and training the classification model.
- dataset/: Stores the raw audio files for the ten keywords.
- server/: The main directory for the API, including the Flask application, model, and Docker files.
- local/test/: Includes sample audio files and a simple client for testing the API.
The project is designed to be easily deployed on a Linux server. Simply clone the repository and run the init.sh script.
- Clone the repository:

      git clone https://github.com/your-username/Speech-Recognition-API.git
      cd Speech-Recognition-API/server

- Run the initialization script:

      bash init.sh

This script will:
- Update the system packages.
- Install Docker and Docker Compose (if not already present).
- Build the Docker images for Flask and Nginx.
- Start the containers.
The API will be accessible on port 80.
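
Because the containers may take a moment to start, it can help to wait until port 80 accepts connections before sending requests. The following is a minimal sketch using only the Python standard library; the helper name `wait_for_port` and its default timeout values are illustrative assumptions and are not part of the repository.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Hypothetical helper, not part of the project: it retries until
    `timeout` seconds have passed, then returns False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False


# Example usage (assumes the containers run on this machine):
# ready = wait_for_port("localhost", 80)
```

This only checks that the port is open. It does not confirm that the model has loaded or that the prediction endpoint responds correctly.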
The API exposes a single endpoint for prediction.
- URL: http://localhost/predict
- Method: POST
- Content-Type: audio/wav (or other supported audio formats)
- Body: The audio file you want to predict.
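
As an illustration of the request described above, here is a minimal client sketch using only the Python standard library. It assumes the server accepts the raw audio bytes as the request body, as the endpoint description suggests; the function names `build_predict_request` and `predict` are hypothetical, and the response format is not documented here, so the response body is returned as plain text.

```python
import urllib.request

# URL taken from the endpoint description; change it if the server runs elsewhere.
API_URL = "http://localhost/predict"


def build_predict_request(audio_bytes, url=API_URL, content_type="audio/wav"):
    """Build a POST request carrying raw audio bytes (assumed body format)."""
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def predict(path, url=API_URL):
    """Send the audio file at `path` and return the raw response text."""
    with open(path, "rb") as f:
        audio_bytes = f.read()
    request = build_predict_request(audio_bytes, url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


# Example usage (requires the running containers and a local WAV file):
# print(predict("down.wav"))
```

If the server actually expects a multipart form upload rather than a raw body, the request construction would need to change accordingly; the provided client script is the authoritative reference.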
You can use the provided local/client.py script to test the API.

    python local/client.py

The script will send the down.wav/left.wav file to the API and print the prediction result.
The project uses the following key technologies and libraries:
- Backend: Flask, uWSGI, Nginx
- Machine Learning: TensorFlow, Keras, scikit-learn
- Audio Processing: Librosa
- Containerization: Docker, Docker Compose
The model (model.h5) is a deep learning model trained for keyword spotting on the ten provided classes: down, go, left, no, off, on, right, stop, up, and yes.
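
To show how a ten-class output could be turned into a keyword, here is a small sketch that picks the highest-scoring class. It assumes the model emits one score per class and that the class order matches the alphabetical list above; both are assumptions, since the actual label order depends on how the training code encoded the classes. The name `decode_prediction` is hypothetical.

```python
# Assumed label order (alphabetical, as listed above); the real order
# depends on how the training code mapped keywords to indices.
KEYWORDS = ["down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"]


def decode_prediction(scores, labels=KEYWORDS):
    """Return (keyword, score) for the highest-scoring class."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per keyword")
    # Index of the largest score, i.e. an argmax over the class scores.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], float(scores[best])


# Example with made-up scores where index 2 is the largest:
example_scores = [0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
keyword, confidence = decode_prediction(example_scores)
```

With the assumed label order, the made-up scores above decode to "left".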