Conversation
I need to split this out into PRs, mainly for the bugfixes, but for now I'll leave this here in case anyone else is collating bugfixes.
Thank you so much for sharing these. I'll cherry-pick them in my own fork. Also, regarding the layers attribute: multilingual models can have several modules with layers, so I think you want to add a heuristic to look for the attribute path that contains "text" or "language", and if there's only one, you're good to go. For example, gemma3 has "model.language_model.layers"
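A minimal sketch of that heuristic (the function name and selection logic are illustrative, not repeng's actual API):

```python
def find_layers_attr(model) -> str:
    """Heuristically locate the decoder layer list on a (possibly multimodal) model.

    Walks named_modules() and prefers attribute paths containing "language"
    or "text", e.g. gemma3's "model.language_model.layers".
    """
    candidates = [
        name for name, module in model.named_modules()
        if name.endswith("layers") and hasattr(module, "__len__")
    ]
    # Prefer the text/language tower when several layer lists exist
    preferred = [n for n in candidates if "language" in n or "text" in n]
    if len(preferred) == 1:
        return preferred[0]
    if len(candidates) == 1:
        return candidates[0]
    raise ValueError(f"ambiguous layer paths: {candidates}")
```

For a text-only model this falls through to the single `model.layers` candidate, so it also covers the simple case.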
This makes changes more stable and lets you use amplitudes of -4, 4, etc. At least this seems to be the case.
I just pushed my improved version thanks to you: https://github.com/thiswillbeyourgithub/repeng-research-fork/blob/a06b0e7e517e96f71561b4d64f446be0706f619b/repeng/control.py#L264
```python
).squeeze(-1)
# adjust for length IPO style
avg_logp_completion = (lprobs_for_inputs * label_mask).sum(-1) / label_mask.sum(-1)
completion_lprob.append(avg_logp_completion.cpu().float().numpy())
```
Can you explain what happens in those few new lines and why we would want that please?
I also tried out importance sampling. This is an idea from RL, where you weight online data more: data that the model would actually generate, and which is therefore more relevant for behaviour change. The way it's done here is that you take the mean logprob of a sequence, and that's the importance.
As a result it seems to make more stable vectors that could be ramped up to +4 without incoherence.
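A sketch of that weighting step (the function name and temperature knob are my own illustration, not the PR's exact code):

```python
import numpy as np

def importance_weights(avg_logprobs, temperature: float = 1.0) -> np.ndarray:
    """Turn per-sequence mean logprobs into normalized importance weights.

    Sequences the model is more likely to generate (higher mean logprob)
    get more weight, as in RL importance sampling. `temperature` is a
    hypothetical knob to flatten or sharpen the weighting.
    """
    z = np.asarray(avg_logprobs, dtype=float) / temperature
    z = z - z.max()          # numerical stability before exponentiating
    w = np.exp(z)
    return w / w.sum()
```

The activations of each training pair would then be averaged under these weights instead of uniformly.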
As a bonus, here's a visualisation of how a thinking model changes its answer as it thinks. (I fork the kv-cache, have it answer a binary question, then rewind time and have it continue thinking.)

Here's a reference to importance sampling in RL https://people.eecs.berkeley.edu/~jiantao/2902021spring/scribe/EE290_Lecture_23_24.pdf and the most relevant versions are in DPO variants e.g. https://github.com/Vance0124/Token-level-Direct-Preference-Optimization
But it's like saying "here's a video of someone driving a truck on an elephant road in Cambodia, I hope it helps you learn to drive your mini", but in reality this would have a low importance weight because it's not that relevant to how you drive or what you are trying to learn. It might be rated at 0.05, while a video of a mini in London streets might be 0.7, and a video of yourself driving yesterday might be 0.99, and a video of yourself driving at 16 years old might be 0.75.
This can be applied to a sequence or a particular token too, although I haven't seen the token specific version work that well.
Thanks a lot for the explainer. I'm not good enough to implement this kind of thing myself. But if you have done an implementation in the past, I'm super interested. I think I'd have a much greater chance of understanding how it works if I saw it in repeng.
Yeah, perhaps that's the cleanest way. I originally had a rule to look for model.layers. Perhaps that approach is the cleanest and most general fix?
I was weighting wrong and it was reducing the size of the intervention
Respectfully, I believe that my latest code is better and should address all use cases.
Thinking a bit more about avg_logprobs, I'm thinking it might be good to give more weight to the activations of the training samples that have a lower avg_logprobs, and then extract the direction taking those weights into account. What do you think?
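One way to sketch that weighted extraction, whichever way the weights end up pointing (all names are illustrative, not repeng's actual API):

```python
import numpy as np

def weighted_direction(hs_pos, hs_neg, weights) -> np.ndarray:
    """Extract a steering direction as a weighted mean difference.

    hs_pos / hs_neg: (n_samples, hidden_dim) activations for the two poles
    of the contrast pairs; weights: per-sample weights, e.g. derived from
    avg_logprobs (or their negation, to upweight low-logprob samples).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    direction = (w[:, None] * (np.asarray(hs_pos) - np.asarray(hs_neg))).sum(axis=0)
    return direction / np.linalg.norm(direction)
```

With uniform weights this reduces to the plain mean-difference direction, so it's a strict generalization.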
Fair enough, it seems more comprehensive. I just value having minimal code, which is why I chose less code and less coverage. @vgel can decide if he starts to revisit and merge.
I added a notebook that @vgel and @thiswillbeyourgithub might find interesting. The thing that makes dev and maintenance hard here is the lack of quantitative comparisons. We could of course run a whole eval, but in practice you need quick unit tests (like in llama.cpp, where they look at perplexity on a subset of wikitext). So here's a simple way to test a steering vector: https://github.com/wassname/repeng/blob/main/notebooks/unit_tests.ipynb Basically you look at how much a steering vector moves the answer between yes and no. You might expect the honesty vector to move it at least +2 nats towards Yes, and vice versa.
If it doesn't work then there's a problem. For example the current setup does not work with thinking models during thinking mode! But if you add reasoning suffixes then it does.
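The metric itself can be sketched in a few lines (function name and numbers are illustrative, not the notebook's actual code):

```python
def steering_delta(lp_yes_steered: float, lp_no_steered: float,
                   lp_yes_base: float, lp_no_base: float) -> float:
    """Nats by which steering moves the Yes-vs-No answer.

    Each argument is the log-probability of the "Yes" or "No" answer token,
    with and without the steering vector applied. A working honesty vector
    might be expected to move this by at least ~+2 nats.
    """
    margin_steered = lp_yes_steered - lp_no_steered
    margin_base = lp_yes_base - lp_no_base
    return margin_steered - margin_base

# e.g. steering_delta(-0.5, -3.0, -1.5, -1.0) -> 3.0 nats towards Yes
```

The appeal is that this runs in one forward pass per condition, rather than a full eval.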
I think we are thinking along the same lines. This is kind of what I was thinking and doing, except the opposite way around! So according to importance sampling, the lower logprobs are less relevant to the model's actual behaviour, i.e. it's less likely to actually output the tokens associated with these hidden states. So I was taking more from the higher avg_logprobs. It does seem to work better (you can use higher vector amplitudes, and the same vector amplitudes make more contrasting model responses). You can see this measured in my unit test notebook: https://github.com/wassname/repeng/blob/main/notebooks/unit_tests.ipynb Might be worth testing both ways, as machine learning often confounds expectations.
I believe it's "they".
Interesting. I had found another way, sort of: I'm asking the LLM to take a guess of what year the user was born. I'm then extracting the number and plottingting it. Of course using a young<->old vector. It's the main metric used in my research fork.
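The extraction step for such a metric can be as simple as a regex (this is a hypothetical sketch, not the fork's actual parsing):

```python
import re

def extract_year(text: str):
    """Pull a 4-digit birth-year guess out of a model response, or None."""
    m = re.search(r"\b(19|20)\d{2}\b", text)
    return int(m.group(0)) if m else None

# extract_year("I'd guess you were born around 1987.") -> 1987
```

Plotting the extracted year against steering amplitude then gives a continuous readout of the young<->old vector.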
That's part of the reason my make_dataset is so complex, for example: you can specify whether you want the model to think or not.
Very interested in seeing the code for this.
Okay, I looked at it and it's much smarter than my stupid regex parsing. Thanks a lot!
these look like good changes! ideally i'd like to break them into three prs:
lmk if you want to handle breaking it up, or i can just do it when i have time and add you as a co-author.
Sweet! I'm happy to put them in multiple PRs since you're interested! Sounds good, it can be a unit test or a shell script. I'll see if it works for gpt2 (or more recent, similarly sized models).
That's where I'm at with importance sampling too, undecided and curious.
Oh yeah, then you see how it changes the distribution of answers. People do a similar thing for Willy Wonka's height too, to look at the bias in judge distributions.
I'm running an exhaustive grid search over a lot of parameters to understand more about all this. PS: when I have my 24GB of VRAM, I'm super interested in using repeng to alter the behavior of omni models, in particular talking models like qwen2.5-omni.
lots of complex bugs from creating our own mask over different models. probably simpler to just use the provided attention mask
Cool let me know what you find!
"zai-org/GLM-4.1V-9B-Thinking" is good too, it's the biggest thinking model out which fits on 24GB, and since it 's it has some interesting properties. I think it's an omni model too. (although this space is changing fast). |
It looks like it (steering, and unit testing steering) doesn't work for models <4B. It's a shame. But we can still have a manually run integration_test.sh or integration_test.ipynb if it's valuable. I think it is, since it quantitatively shows if changes outperform existing methods.
Correlation between delta logprob and steering amplitude (higher is better): corr=-0.8660 layer=9/36. So that's interesting, as it's showing my importance sampling/weighting is not helping. So I'll probably skip that PR. It's also saying that the middle layers are most important on this model ("Qwen/Qwen3-4B-Instruct-2507"). So I'll propose a PR with it as a notebook, validation.ipynb. But feel free to arrange it how you like of course, and leave it until you revisit the repo.
Sweet, I made the PRs. No hurry, of course. If I were you, I'd leave everything until I felt the urge, then do it in bulk.
Well I'm correlating the steering vector strength vs the log prob. So I have a table like below and correlate the two columns. In the table below, amplitudes of 2 were unstable and gave no answer.
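That correlation is just a two-column computation; a sketch with made-up numbers (the actual sweep values and measurements are not shown here):

```python
import numpy as np

# Hypothetical sweep: steering amplitude vs. measured delta logprob of the answer.
# Values are purely illustrative, not real measurements.
amplitudes = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
delta_logprob = np.array([0.9, 0.4, 0.0, -0.5, -0.8])

# Pearson correlation between the two columns of the table
corr = np.corrcoef(amplitudes, delta_logprob)[0, 1]
```

A strong correlation (positive or negative, depending on the vector's sign convention) indicates the steering vector actually controls the answer; unstable amplitudes that give no answer would simply be dropped from the table.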
It seems like you are on the right track in my opinion, but you might need to debug if you are seeing problems. There are weird model and tokenizer effects that can confound everything. For example, the model might be trying to output "\n20" instead of "20", which are separate tokens. Or the total probability that the model puts on your choices might be low, which would indicate that the model is trying to do other things. Then you would need to change your prompt to more firmly guide it toward a structured output.