## Running the Server

PrivateGPT supports running with different LLMs & setups.

### Local models

Both the LLM and the Embeddings model will run locally.

Make sure you have followed the *Local LLM requirements* section before moving on.

This command will start PrivateGPT using the `settings.yaml` (default profile) together with the `settings-local.yaml`
configuration files. By default, it will enable both the API and the Gradio UI. Run:

```bash
PGPT_PROFILES=local make run
```

or

```bash
PGPT_PROFILES=local poetry run python -m private_gpt
```

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API
using Swagger UI.
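Once that log line appears, you can do a quick sanity check from a script. A minimal sketch, assuming the default port `8001` and the `/health` endpoint; the authoritative list of endpoints is the Swagger UI at `/docs`:

```python
# Minimal sanity check against a locally running PrivateGPT instance.
# Assumes the default port 8001; adjust the base URL if you changed it.
import requests

BASE_URL = "http://localhost:8001"

response = requests.get(f"{BASE_URL}/health")
print(response.status_code, response.text)  # a 200 response means the server is up
```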
#### Customizing low-level parameters

Currently, not all the parameters of `llama.cpp` and `llama-cpp-python` are exposed in PrivateGPT's `settings.yaml` file.
In case you need to customize parameters such as the number of layers loaded into the GPU, you might change
these in `llm_component.py`, located at `private_gpt/components/llm/llm_component.py`.
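As an illustration only (not the literal contents of the file, which depend on your PrivateGPT and llama-index versions), such a change usually means editing the arguments of the `LlamaCPP` object that the component builds; its `model_kwargs` dict is forwarded to `llama-cpp-python`:

```python
# Hypothetical sketch, not the exact code in llm_component.py.
# The import path and constructor arguments depend on the pinned llama-index version.
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="models/<your-model>.gguf",  # placeholder path to your local GGUF file
    max_new_tokens=256,
    context_window=3900,
    # model_kwargs is passed straight to llama-cpp-python; n_gpu_layers controls
    # how many layers are offloaded to the GPU (-1 offloads all of them).
    model_kwargs={"n_gpu_layers": 20},
)
```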
##### Available LLM config options

The `llm` section of the settings allows for the following configurations:

- `mode`: how to run your LLM
- `max_new_tokens`: this lets you configure the number of new tokens the LLM will generate and add to the context window (by default Llama.cpp uses `256`)

Example:

```yaml
llm:
  mode: local
  max_new_tokens: 256
```

If you are getting an out of memory error, you might also try a smaller model or stick to the
recommended models, instead of custom tuning the parameters.
### Using OpenAI

If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using OpenAI as the LLM and Embeddings model.

In order to do so, create a profile `settings-openai.yaml` with the following contents:

```yaml
llm:
  mode: openai
openai:
  api_base: <openai-api-base-url> # Defaults to https://api.openai.com/v1
  api_key: <your_openai_api_key> # You could skip this configuration and use the OPENAI_API_KEY env var instead
  model: <openai_model_to_use> # Optional model to use. Default is "gpt-3.5-turbo"
  # Note: OpenAI models are listed here: https://platform.openai.com/docs/models
```

And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=openai make run`

or

`PGPT_PROFILES=openai poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice the speed and quality of response is higher, given you are using OpenAI's servers for the heavy
computations.
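The API is consumed the same way regardless of which backend serves the LLM. As an example, a completion request can be sent like this (a sketch; confirm the exact request schema in the Swagger UI at `/docs` before relying on it):

```python
# Example completion request against the PrivateGPT API.
# Assumes the default port 8001 and the /v1/completions route; verify the exact
# schema at http://localhost:8001/docs.
import requests

payload = {"prompt": "Summarize what PrivateGPT does in one sentence."}
response = requests.post("http://localhost:8001/v1/completions", json=payload)
print(response.json())
```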
### Using OpenAI compatible API

Many tools, including [LocalAI](https://localai.io/) and [vLLM](https://docs.vllm.ai/en/latest/),
support serving local models with an OpenAI compatible API. Even when overriding the `api_base`,
using the `openai` mode doesn't allow you to use custom models. Instead, you should use the `openailike` mode:

```yaml
llm:
  mode: openailike
```

This mode uses the same settings as the `openai` mode, so point the `api_base` of the `openai` section at your local server.

As an example, you can follow the [vLLM quickstart guide](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server)
to run an OpenAI compatible server. Then, you can run PrivateGPT using the `settings-vllm.yaml` profile:

`PGPT_PROFILES=vllm make run`
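Before pointing PrivateGPT at it, you can confirm the OpenAI compatible server is up and see which model it serves. A small check, assuming the default address used in the vLLM quickstart (`http://localhost:8000`):

```python
# List the models served by the OpenAI compatible endpoint (here: a local vLLM server).
# Assumes the default address from the vLLM quickstart; adjust if yours differs.
import requests

print(requests.get("http://localhost:8000/v1/models").json())
```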
### Using Azure OpenAI

If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using Azure OpenAI as the LLM and Embeddings model.

In order to do so, create a profile `settings-azopenai.yaml` with the following contents:

```yaml
llm:
  mode: azopenai
embedding:
  mode: azopenai
azopenai:
  api_key: <your_azopenai_api_key> # You could skip this configuration and use the AZ_OPENAI_API_KEY env var instead
  azure_endpoint: <your_azopenai_endpoint> # You could skip this configuration and use the AZ_OPENAI_ENDPOINT env var instead
  api_version: <api_version> # The API version to use. Default is "2023-05-15"
  embedding_deployment_name: <your_embedding_deployment_name> # You could skip this configuration and use the AZ_OPENAI_EMBEDDING_DEPLOYMENT_NAME env var instead
  embedding_model: <openai_embeddings_to_use> # Optional model to use. Default is "text-embedding-ada-002"
  llm_deployment_name: <your_model_deployment_name> # You could skip this configuration and use the AZ_OPENAI_LLM_DEPLOYMENT_NAME env var instead
  llm_model: <openai_model_to_use> # Optional model to use. Default is "gpt-35-turbo"
```

And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=azopenai make run`

or

`PGPT_PROFILES=azopenai poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice the speed and quality of response is higher, given you are using Azure OpenAI's servers for the heavy
computations.
### Using AWS Sagemaker

For a fully private & performant setup, you can choose to have both your LLM and Embeddings model deployed using Sagemaker.

Note: how to deploy models on Sagemaker is out of the scope of this documentation.

In order to do so, create a profile `settings-sagemaker.yaml` with the following contents (remember to
update the values of `llm_endpoint_name` and `embedding_endpoint_name` to yours):

```yaml
llm:
  mode: sagemaker
sagemaker:
  llm_endpoint_name: huggingface-pytorch-tgi-inference-2023-09-25-19-53-32-140
  embedding_endpoint_name: huggingface-pytorch-inference-2023-11-03-07-41-36-479
```
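If you don't have the endpoint names at hand, you can list the endpoints deployed in your AWS account with `boto3` (this assumes AWS credentials and a default region are already configured in your environment):

```python
# List SageMaker endpoints to find the names to put under
# llm_endpoint_name and embedding_endpoint_name.
# Assumes AWS credentials and a default region are configured (e.g. via `aws configure`).
import boto3

client = boto3.client("sagemaker")
for endpoint in client.list_endpoints()["Endpoints"]:
    print(endpoint["EndpointName"])
```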
And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=sagemaker make run`

or

`PGPT_PROFILES=sagemaker poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
### Using Ollama

Another option for a fully private setup is using [Ollama](https://ollama.ai/).

Note: how to deploy Ollama and pull models onto it is out of the scope of this documentation.

In order to do so, create a profile `settings-ollama.yaml` with the following contents:

```yaml
llm:
  mode: ollama
ollama:
  model: <ollama_model_to_use> # Required. Model to use.
  # Note: Ollama models are listed here: https://ollama.ai/library
  # Be sure to pull the model to your Ollama server
  api_base: <ollama-api-base-url> # Defaults to http://localhost:11434
```
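Before starting PrivateGPT, you can verify that the Ollama server is reachable and that the model has been pulled. A quick check, assuming the default `api_base` and Ollama's `/api/tags` endpoint, which lists locally available models:

```python
# Verify the Ollama server is reachable and list the models it has pulled.
# Assumes the default api_base (http://localhost:11434).
import requests

tags = requests.get("http://localhost:11434/api/tags").json()
print([model["name"] for model in tags.get("models", [])])
```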
And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=ollama make run`

or

`PGPT_PROFILES=ollama poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.