Skip to content

Commit 75c90a7

Browse files
Create chatml.md (#238)
* Create chatml.md * Update chatml.md
1 parent 62b73b9 commit 75c90a7

File tree

1 file changed

+87
-0
lines changed

1 file changed

+87
-0
lines changed

chatml.md

+87
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,87 @@
1+
Traditionally, GPT models consumed unstructured text. ChatGPT models
2+
instead expect a structured format, called Chat Markup Language
3+
(ChatML for short).
4+
ChatML documents consists of a sequence of messages. Each message
5+
contains a header (which today consists of who said it, but in the
6+
future will contain other metadata) and contents (which today is a
7+
text payload, but in the future will contain other datatypes).
8+
We are still evolving ChatML, but the current version (ChatML v0) can
9+
be represented with our upcoming "list of dicts" JSON format as
10+
follows:
11+
```
12+
[
13+
{"token": "<|im_start|>"},
14+
"system\nYou are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.\nKnowledge cutoff: 2021-09-01\nCurrent date: 2023-03-01",
15+
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
16+
"user\nHow are you",
17+
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
18+
"assistant\nI am doing well!",
19+
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
20+
"user\nHow are you now?",
21+
{"token": "<|im_end|>"}, "\n"
22+
]
23+
```
24+
You could also represent it in the classic "unsafe raw string"
25+
format. Note this format inherently allows injections from user input
26+
containing special-token syntax, similar to a SQL injections:
27+
```
28+
<|im_start|>system
29+
You are ChatGPT, a large language model trained by OpenAI. Answer as concisely as possible.
30+
Knowledge cutoff: 2021-09-01
31+
Current date: 2023-03-01<|im_end|>
32+
<|im_start|>user
33+
How are you<|im_end|>
34+
<|im_start|>assistant
35+
I am doing well!<|im_end|>
36+
<|im_start|>user
37+
How are you now?<|im_end|>
38+
```
39+
## Non-chat use-cases
40+
ChatML can be applied to classic GPT use-cases that are not
41+
traditionally thought of as chat. For example, instruction following
42+
(where a user requests for the AI to complete an instruction) can be
43+
implemented as a ChatML query like the following:
44+
```
45+
[
46+
{"token": "<|im_start|>"},
47+
"user\nList off some good ideas:",
48+
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
49+
"assistant"
50+
]
51+
```
52+
We do not currently allow autocompleting of partial messages,
53+
```
54+
[
55+
{"token": "<|im_start|>"},
56+
"system\nPlease autocomplete the user's message."
57+
{"token": "<|im_end|>"}, "\n", {"token": "<|im_start|>"},
58+
"user\nThis morning I decided to eat a giant"
59+
]
60+
```
61+
Note that ChatML makes explicit to the model the source of each piece
62+
of text, and particularly shows the boundary between human and AI
63+
text. This gives an opportunity to mitigate and eventually solve
64+
injections, as the model can tell which instructions come from the
65+
developer, the user, or its own input.
66+
## Few-shot prompting
67+
In general, we recommend adding few-shot examples using separate
68+
`system` messages with a `name` field of `example_user` or
69+
`example_assistant`. For example, here is a 1-shot prompt:
70+
```
71+
<|im_start|>system
72+
Translate from English to French
73+
<|im_end|>
74+
<|im_start|>system name=example_user
75+
How are you?
76+
<|im_end|>
77+
<|im_start|>system name=example_assistant
78+
Comment allez-vous?
79+
<|im_end|>
80+
<|im_start|>user
81+
{{user input here}}<|im_end|>
82+
```
83+
If adding instructions in the `system` message doesn't work, you can
84+
also try putting them into a `user` message. (In the near future, we
85+
will train our models to be much more steerable via the system
86+
message. But to date, we have trained only on a few system messages,
87+
so the models pay much most attention to user examples.)

0 commit comments

Comments
 (0)