Prerequisites
Feature Description
Functional Requirement:
Please add a toggle option to the web UI. When this option is enabled, after the large language model (LLM) finishes responding to a user request, the web interface should automatically delete any “thinking” (chain‑of‑thought) content if the model supports such reasoning, and then submit the full conversation history back to the LLM (so the model has the entire context pre‑encoded).
Consequences of this behavior:
When the user sends the next query, the LLM does not need to spend time re‑encoding the deleted reasoning steps or the previous generation; it can directly encode the new user prompt.
This allows the system to pre‑process the context needed for the upcoming turn while the user is reading or thinking about the current response, resulting in a smoother, more responsive interaction.
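The stripping step described above could be sketched as follows. This is a minimal illustration only: it assumes the model's reasoning content is delimited by `<think>…</think>` tags inside assistant messages, and that the history uses the common `{"role", "content"}` message shape; the actual tag name and message format depend on the model and backend.

```python
import re

# Assumption: reasoning content is wrapped in <think>...</think> tags.
# The tag name varies by model; adjust the pattern accordingly.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(history):
    """Return a copy of the chat history with reasoning blocks removed
    from assistant messages, ready to be re-submitted for pre-encoding."""
    cleaned = []
    for msg in history:
        content = msg["content"]
        if msg["role"] == "assistant":
            content = THINK_RE.sub("", content)
        cleaned.append({**msg, "content": content})
    return cleaned
```

After the response finishes, the web UI would call something like `strip_thinking(history)` and send the result back to the backend as a prefill-only request, so the cleaned context is already encoded when the user's next prompt arrives.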
Motivation
When I receive a very large response from the model (e.g., a code snippet of more than 8,000 tokens) and then submit a simple follow‑up question, I have to wait a long time for the prompt‑processing step. While I am reading the returned content, the system is idle. The current LLM backend supports caching multiple prompts, so performing this pre‑processing can significantly improve the smoothness of subsequent turns in the conversation.
Possible Implementation
No response