Description
Is your feature request related to a problem? Please describe.
I am using this library to benchmark Question Answering tasks. For that I want to use a technique called self-consistency, where multiple completions are sampled for the same prompt at a high temperature and the most frequent answer is taken as the final one.
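For context, a rough sketch of the client-side pattern I have in mind; the `sample_completion` callable and the `extract_answer` parser here are just placeholders, not part of any existing API:

```python
from collections import Counter

def extract_answer(completion: str) -> str:
    # Task-specific parsing; for illustration, take the last line of the completion.
    return completion.strip().splitlines()[-1]

def self_consistency(sample_completion, prompt: str, n: int = 10, temperature: float = 0.8) -> str:
    # Sample n completions at a high temperature, then majority-vote the parsed answers.
    answers = [extract_answer(sample_completion(prompt, temperature)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```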
Describe the solution you'd like
Add support for OpenAI's `n` parameter to this server, so that a single request can return multiple completions.
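To illustrate the ask, a request with `n` would look something like this; the localhost URL and route are just my assumption of the server's OpenAI-compatible endpoint:

```python
import requests

# One request, several sampled completions, mirroring OpenAI's `n` parameter.
resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed OpenAI-compatible route
    json={
        "prompt": "Q: What is the capital of France?\nA:",
        "temperature": 0.8,
        "max_tokens": 32,
        "n": 5,  # number of completions to generate for this prompt
    },
)
for choice in resp.json()["choices"]:
    print(choice["text"])
```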
I think the Hugging Face implementation of Llama already offers this feature (via `num_return_sequences` in `generate`).
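For reference, a sketch of how I understand this works in Hugging Face transformers; the model name is just an example, the relevant argument is `num_return_sequences`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example model, any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Q: What is the capital of France?\nA:", return_tensors="pt")
# With sampling enabled, num_return_sequences yields several completions from one call.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=32,
    num_return_sequences=5,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```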
Describe alternatives you've considered
Just make multiple calls to the LLM. But I suspect this would take considerably more compute, since each generation would be a separate full pass through the model.
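Concretely, the workaround would be something like the loop below (same assumed endpoint as above), where every request re-processes the prompt from scratch instead of sharing that work across samples:

```python
import requests

def sample_n_times(prompt: str, n: int = 5, temperature: float = 0.8) -> list[str]:
    # Workaround: n separate requests; the prompt is re-evaluated each time.
    completions = []
    for _ in range(n):
        resp = requests.post(
            "http://localhost:8000/v1/completions",  # assumed server route
            json={"prompt": prompt, "temperature": temperature, "max_tokens": 32},
        )
        completions.append(resp.json()["choices"][0]["text"])
    return completions
```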
Additional context
As far as I know, the multiple generations can be achieved using multiple beams during generation. I am not sure whether multiple beams are supported by llama.cpp; it would also help me to clear that up first.
If a feature like that is supported by llama.cpp, I may be able to implement the Python part myself and create a PR for it, but I would need someone to point me in the right direction first.