
feat: infrastructure for both local GPU setup and Modal#11

Closed
ArnavBharti wants to merge 11 commits into main from infra

Conversation


ArnavBharti (Collaborator) commented on Sep 28, 2025

This PR adds two scripts: vllm_inference_modal.py and vllm_inference_local.py.

The Modal script sets up the vLLM server on an A100 40GB (configurable). The local script works on an NVIDIA RTX 6000 Ada with CUDA 12.4 (the BITS server).

Both scripts were tested with Qwen/Qwen3-Embedding-0.6B.

Modal deployment was tested using the examples in the README.

First-time setup involves creating Modal Volumes and populating the cache; subsequent deployments are fast. The Hugging Face model is pinned to a specific commit rather than main.
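The Volume-caching and commit-pinning setup described above can be sketched roughly as follows. This is a hypothetical outline, not the actual vllm_inference_modal.py: the app name, volume name, cache mount path, and the placeholder revision are all illustrative assumptions.

```python
# Hypothetical Modal deployment sketch: pin the Hugging Face model to a
# commit and cache weights in a Modal Volume so that later cold starts
# skip the download.
import modal

MODEL = "Qwen/Qwen3-Embedding-0.6B"
REVISION = "<pinned-commit-sha>"  # fixed commit instead of `main`

app = modal.App("vllm-embedding")  # illustrative name
cache = modal.Volume.from_name("hf-cache", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("vllm", "huggingface_hub")

@app.function(
    image=image,
    gpu="A100-40GB",  # configurable, per the PR description
    volumes={"/root/.cache/huggingface": cache},
)
def serve():
    # Launch the OpenAI-compatible vLLM server; after the first run the
    # weights resolve from the mounted Volume instead of being re-downloaded.
    import subprocess
    subprocess.run(
        ["vllm", "serve", MODEL, "--revision", REVISION, "--task", "embed"],
        check=True,
    )
```

Deployment fragments like this are what make the first run slow (the Volume is populated) and subsequent runs fast (the cache is warm).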

After the volumes are cached, response time on Modal: (screenshot attached to the PR)

@ArnavBharti ArnavBharti requested a review from NirantK September 28, 2025 12:11
@ArnavBharti ArnavBharti changed the title from "feat: add vllm inference for both modal and local server" to "feat: infrastructure for both local GPU setup and Modal" on Oct 18, 2025
ArnavBharti (Collaborator, Author) commented:

This PR now contains code for the entire program flow:
it accepts a JSON request -> converts it to embeddings -> runs a similarity search using FAISS -> responds with the dish name in JSON format.

Tested on both Modal and the GPU server.
Right now this runs the Qwen 0.6B embedding model, which is text-only.
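The request flow above can be sketched end to end. This is a minimal stand-in, not the PR's code: the dish names, the toy embed() function (which replaces the Qwen3-Embedding-0.6B call served by vLLM), and a brute-force NumPy cosine search (which replaces the FAISS index) are all illustrative assumptions.

```python
# Sketch of the flow: JSON request -> embedding -> nearest-neighbour
# search -> JSON response with the best-matching dish name.
import json
import numpy as np

# Toy "index": dish names with hand-made embeddings (normally these would
# come from the embedding model at ingest time).
DISH_NAMES = ["paneer tikka", "dal makhani", "veg biryani"]
DISH_VECS = np.array(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    dtype=np.float32,
)

def embed(text: str) -> np.ndarray:
    """Stand-in for the Qwen embedding call: hash chars into 3 buckets."""
    vec = np.zeros(3, dtype=np.float32)
    for ch in text:
        vec[ord(ch) % 3] += 1.0
    return vec

def handle_request(payload: str) -> str:
    """JSON in -> embedding -> cosine top-1 -> JSON out."""
    query = json.loads(payload)["query"]
    q = embed(query)
    # Cosine similarity is a dot product over L2-normalised vectors,
    # equivalent to a FAISS IndexFlatIP over normalised embeddings.
    q = q / np.linalg.norm(q)
    index = DISH_VECS / np.linalg.norm(DISH_VECS, axis=1, keepdims=True)
    best = int(np.argmax(index @ q))
    return json.dumps({"dish": DISH_NAMES[best]})
```

Swapping the brute-force dot product for a real FAISS index changes only the search step; the JSON-in/JSON-out contract stays the same.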
