📚 The doc issue
While we hope to provide a standardized and streamlined flow for running LLMs from HF, as well as for individually enabled models (Llama), there will be use cases where someone wants to enable a model that doesn't fit cleanly into one of these flows. Maybe it has a slightly different architecture and can't drop in our transformer definition. I ran into this recently when working with a Fairseq encoder/decoder language translation model.
I'd like to create documentation that helps a power user understand the following:
- Why do the optimized ET transformer implementations work well? Which parts are critical for performance, export compliance, etc.?
- If I have a custom transformer implementation that doesn't map exactly to the ET preferred versions, what do I need to do to make it usable with ET?
a) How do I handle attention and KV cache mutability?
b) Can I leverage the ET SDPA ops?
c) How can I use the building blocks / composable components from the extension/llm directory? (Maybe we point to torchtune, as well).
d) What do I need to do to optimize for specific backends, such as XNNPACK or CoreML?
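As one concrete piece of what (a) could cover: the usual export-compliance concern with KV caches is that the cache must be pre-allocated to a fixed maximum length and mutated in place at a position index, rather than grown with `torch.cat` each step (which changes shapes every call). Below is a minimal plain-Python sketch of that pattern; the names `StaticKVCache` and `update` are illustrative only, not ExecuTorch APIs.

```python
# Sketch of an export-friendly static KV cache, using plain Python lists
# standing in for tensors. The key idea for export compliance: buffers are
# allocated once at max_seq_len and written in place at input_pos, so every
# decode step sees the same static shapes.

class StaticKVCache:
    def __init__(self, max_seq_len: int, head_dim: int):
        # Fixed-size buffers, allocated once. In a real nn.Module these
        # would be registered buffers so the exporter can treat them as
        # mutable state.
        self.k = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.v = [[0.0] * head_dim for _ in range(max_seq_len)]

    def update(self, input_pos: int, k_row, v_row):
        # In-place write at the current position. The tensor equivalent is
        # an index_put_/copy_, not torch.cat, which would grow the cache
        # and break static-shape export.
        self.k[input_pos] = list(k_row)
        self.v[input_pos] = list(v_row)
        # Return the full static-shape buffers; attention is expected to
        # mask out positions beyond input_pos.
        return self.k, self.v


cache = StaticKVCache(max_seq_len=4, head_dim=2)
k, v = cache.update(0, [1.0, 2.0], [3.0, 4.0])
# The buffers keep length 4 regardless of how many tokens were written.
```

The documentation could then show the real tensor version of this and explain how buffer mutation interacts with `torch.export`.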
CC @larryliu0820 @byjlw @mergennachin
Suggest a potential alternative/fix
No response