I don't plan on shelling out money for inference at the moment, so the initial plan is to have users bring their own "inference back-end" with them - likely Colab for now. Some points about this though:
- Who will be responsible for creating the prompt, sending it off to the inference backend and parsing the resulting generation?
- Initial plan is to implement that here: the front-end will simply POST user messages to an endpoint and receive responses (maybe via WebSockets? I'm not sure holding a connection open for 10+ seconds is a good idea). A rough sketch of this endpoint follows the list below.
- Pros:
- We'll have real-world data on inference requests, which we can use to estimate how much it would actually cost to run inference ourselves. (Many users have suggested I open a Patreon to cover hosting expenses; I'm unsure how well that'd pan out, but with real data that decision could be made a little more clearly.)
- We can automatically push new prompting code by just updating the server
- Cons:
- Increased server load, since we'll be acting as a proxy for inference requests.
- How will inference work for group chats? How do we decide which characters should speak and when? (A naive selection heuristic is sketched further below.)
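
To make the proxy idea concrete, here's a minimal sketch of what the server-side endpoint could look like. Everything in it is an assumption rather than a decided design: the framework (FastAPI), the `/chat` route, the `backend_url` field, and the Colab backend's `/generate` contract are all hypothetical placeholders.

```python
# Minimal sketch of the proxy endpoint, assuming FastAPI + httpx.
# All names here (/chat, backend_url, /generate, the JSON shapes) are
# hypothetical -- the real contract with the user's Colab backend is TBD.
import time

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    backend_url: str  # user-supplied inference backend (e.g. a Colab tunnel URL)
    character: str    # which character the prompt is built around
    message: str      # the user's latest message


def build_prompt(character: str, message: str) -> str:
    # Placeholder for the server-side prompting code; keeping this on the
    # server is what lets us push new prompting logic without a client update.
    return f"{character}: respond to the user.\nUser: {message}\n{character}:"


@app.post("/chat")
async def chat(req: ChatRequest) -> dict:
    prompt = build_prompt(req.character, req.message)
    start = time.monotonic()
    async with httpx.AsyncClient(timeout=60.0) as client:
        # Forward the prompt to the user's own backend and wait for the
        # generation. A plain request/response keeps things simple for now;
        # WebSockets/SSE can come later if latency makes this painful.
        resp = await client.post(f"{req.backend_url}/generate",
                                 json={"prompt": prompt})
    latency = time.monotonic() - start
    text = resp.json().get("text", "")
    # Record the numbers we'd need to estimate self-hosted inference costs.
    print(f"prompt_chars={len(prompt)} reply_chars={len(text)} latency={latency:.1f}s")
    return {"reply": text}
```

The logging line ties into the cost-estimation pro above: per-request prompt/response sizes and latency are exactly the data we'd want before deciding on self-hosted inference (or a Patreon). In practice that would go to a metrics store rather than stdout.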
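
On the group-chat question, one naive option (purely a sketch, not a decision) is a cheap server-side heuristic: favor characters who have been quiet for a while, and strongly favor characters mentioned in the last message. The `Character` shape and the scoring weights below are invented for illustration.

```python
# Naive speaker-selection heuristic for group chats -- purely illustrative.
# The data shape and scoring weights are invented for this sketch.
import random
from dataclasses import dataclass


@dataclass
class Character:
    name: str
    turns_since_spoke: int  # how long this character has been quiet


def pick_speaker(characters: list[Character], last_message: str) -> Character:
    def score(c: Character) -> float:
        s = float(c.turns_since_spoke)            # favor quiet characters
        if c.name.lower() in last_message.lower():
            s += 5.0                              # strongly favor being addressed
        s += random.random()                      # tie-breaker, adds some variety
        return s

    return max(characters, key=score)


chars = [Character("Alice", 3), Character("Bob", 0)]
print(pick_speaker(chars, "What do you think, Bob?").name)  # likely Bob
```

A smarter option would be asking the model itself who should speak next, but that adds an extra inference call per turn; the request data gathered by the proxy would tell us whether that's affordable.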