* Get responses via API, [with](https://github.com/oobabooga/text-generation-webui/blob/main/api-example-streaming.py) or [without](https://github.com/oobabooga/text-generation-webui/blob/main/api-example.py) streaming.
* [Supports the LLaMA model, including 4-bit mode](https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model) (see the example command below).
* [Works on Google Colab](https://github.com/oobabooga/text-generation-webui/wiki/Running-on-Colab).
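For example, once the web UI is installed, a 4-bit LLaMA model can be launched with the `--gptq-bits` flag described in the table further down. This is only a sketch: `llama-7b` is a placeholder folder name, so substitute a model you have actually downloaded.

```bash
# Launch the web UI in chat mode with a LLaMA model quantized to 4 bits.
# "llama-7b" is a placeholder model name, not a file shipped with this repository.
python server.py --model llama-7b --gptq-bits 4 --chat
```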
## Installation option 1: conda
Optionally, you can use the following command-line flags:
137
137
138
-
| Flag | Description |
139
-
|-------------|-------------|
140
-
|`-h`, `--help`| show this help message and exit |
141
-
|`--model MODEL`| Name of the model to load by default. |
142
-
|`--lora LORA`| Name of the LoRA to apply to the model by default. |
143
-
|`--notebook`| Launch the web UI in notebook mode, where the output is written to the same text box as the input. |
144
-
|`--chat`| Launch the web UI in chat mode.|
145
-
|`--cai-chat`| Launch the web UI in chat mode with a style similar to Character.AI's. If the file `img_bot.png` or `img_bot.jpg` exists in the same folder as server.py, this image will be used as the bot's profile picture. Similarly, `img_me.png` or `img_me.jpg` will be used as your profile picture. |
146
-
|`--cpu`| Use the CPU to generate text.|
147
-
|`--load-in-8bit`| Load the model with 8-bit precision.|
148
-
|`--load-in-4bit`| DEPRECATED: use `--gptq-bits 4` instead. |
149
-
|`--gptq-bits GPTQ_BITS`| Load a pre-quantized model with specified precision. 2, 3, 4 and 8 (bit) are supported. Currently only works with LLaMA and OPT. |
150
-
|`--gptq-model-type MODEL_TYPE`| Model type of pre-quantized model. Currently only LLaMa and OPT are supported. |
151
-
|`--bf16`| Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU. |
138
+
| Flag | Description |
139
+
|------------------|-------------|
140
+
|`-h`, `--help`| show this help message and exit |
141
+
|`--model MODEL`| Name of the model to load by default. |
142
+
|`--lora LORA`| Name of the LoRA to apply to the model by default. |
143
+
|`--notebook`| Launch the web UI in notebook mode, where the output is written to the same text box as the input. |
144
+
|`--chat`| Launch the web UI in chat mode.|
145
+
|`--cai-chat`| Launch the web UI in chat mode with a style similar to Character.AI's. If the file `img_bot.png` or `img_bot.jpg` exists in the same folder as server.py, this image will be used as the bot's profile picture. Similarly, `img_me.png` or `img_me.jpg` will be used as your profile picture. |
146
+
|`--cpu`| Use the CPU to generate text.|
147
+
|`--load-in-8bit`| Load the model with 8-bit precision.|
148
+
|`--load-in-4bit`| DEPRECATED: use `--gptq-bits 4` instead. |
149
+
|`--gptq-bits GPTQ_BITS`| Load a pre-quantized model with specified precision. 2, 3, 4 and 8 (bit) are supported. Currently only works with LLaMA and OPT. |
150
+
|`--gptq-model-type MODEL_TYPE`| Model type of pre-quantized model. Currently only LLaMa and OPT are supported. |
151
+
|`--bf16`| Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU. |
152
152
|`--auto-devices`| Automatically split the model across the available GPU(s) and CPU.|
153
-
|`--disk`| If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk. |
153
+
|`--disk`| If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk. |
154
154
|`--disk-cache-dir DISK_CACHE_DIR`| Directory to save the disk cache to. Defaults to `cache/`. |
155
155
|`--gpu-memory GPU_MEMORY [GPU_MEMORY ...]`| Maxmimum GPU memory in GiB to be allocated per GPU. Example: `--gpu-memory 10` for a single GPU, `--gpu-memory 10 5` for two GPUs. |
156
-
|`--cpu-memory CPU_MEMORY`| Maximum CPU memory in GiB to allocate for offloaded weights. Must be an integer number. Defaults to 99.|
157
-
|`--flexgen`| Enable the use of FlexGen offloading. |
158
-
|`--percent PERCENT [PERCENT ...]`| FlexGen: allocation percentages. Must be 6 numbers separated by spaces (default: 0, 100, 100, 0, 100, 0). |
159
-
|`--compress-weight`| FlexGen: Whether to compress weight (default: False).|
160
-
|`--pin-weight [PIN_WEIGHT]`| FlexGen: whether to pin weights (setting this to False reduces CPU memory by 20%). |
156
+
|`--cpu-memory CPU_MEMORY`| Maximum CPU memory in GiB to allocate for offloaded weights. Must be an integer number. Defaults to 99.|
157
+
|`--flexgen`| Enable the use of FlexGen offloading. |
158
+
|`--percent PERCENT [PERCENT ...]`| FlexGen: allocation percentages. Must be 6 numbers separated by spaces (default: 0, 100, 100, 0, 100, 0). |
159
+
|`--compress-weight`| FlexGen: Whether to compress weight (default: False).|
160
+
|`--pin-weight [PIN_WEIGHT]`| FlexGen: whether to pin weights (setting this to False reduces CPU memory by 20%). |
161
161
|`--deepspeed`| Enable the use of DeepSpeed ZeRO-3 for inference via the Transformers integration. |
162
-
|`--nvme-offload-dir NVME_OFFLOAD_DIR`| DeepSpeed: Directory to use for ZeRO-3 NVME offloading. |
163
-
|`--local_rank LOCAL_RANK`| DeepSpeed: Optional argument for distributed setups. |
164
-
|`--rwkv-strategy RWKV_STRATEGY`| RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
165
-
|`--rwkv-cuda-on`| RWKV: Compile the CUDA kernel for better performance. |
166
-
|`--no-stream`| Don't stream the text output in real time. |
162
+
|`--nvme-offload-dir NVME_OFFLOAD_DIR`| DeepSpeed: Directory to use for ZeRO-3 NVME offloading. |
163
+
|`--local_rank LOCAL_RANK`| DeepSpeed: Optional argument for distributed setups. |
164
+
|`--rwkv-strategy RWKV_STRATEGY`| RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
165
+
|`--rwkv-cuda-on`| RWKV: Compile the CUDA kernel for better performance. |
166
+
|`--no-stream`| Don't stream the text output in real time. |
167
167
|`--settings SETTINGS_FILE`| Load the default interface settings from this json file. See `settings-template.json` for an example. If you create a file called `settings.json`, this file will be loaded by default without the need to use the `--settings` flag.|
168
168
|`--extensions EXTENSIONS [EXTENSIONS ...]`| The list of extensions to load. If you want to load more than one extension, write the names separated by spaces. |
169
-
|`--listen`| Make the web UI reachable from your local network.|
169
+
|`--listen`| Make the web UI reachable from your local network.|
170
170
|`--listen-port LISTEN_PORT`| The listening port that the server will use. |
171
-
|`--share`| Create a public URL. This is useful for running the web UI on Google Colab or similar. |
172
-
|`--auto-launch`| Open the web UI in the default browser upon launch. |
173
-
|`--verbose`| Print the prompts to the terminal. |
171
+
|`--share`| Create a public URL. This is useful for running the web UI on Google Colab or similar. |
172
+
|`--auto-launch`| Open the web UI in the default browser upon launch. |
173
+
|`--verbose`| Print the prompts to the terminal. |
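The flags can be combined as needed. As a rough illustration (here `my-model` is a placeholder for the name of a model you have downloaded, and the 10 GiB limit is arbitrary):

```bash
# Chat mode with 8-bit precision, reachable from other machines on the local network.
python server.py --model my-model --chat --load-in-8bit --listen

# Split a model that does not fit in VRAM across the GPU (up to 10 GiB) and the CPU.
python server.py --model my-model --chat --auto-devices --gpu-memory 10
```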
Out of memory errors? [Check this guide](https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide).