data/papers.yml at main · BrachioLab/data · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
- year: 2026
  papers:
  - title: 'Detecting Safety Violations Across Many Agent Traces'
    link: https://arxiv.org/abs/2604.11806
    authors: Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, Eric Wong
    conference: null
    short: Preprint
    website: null
    blog: https://debugml.github.io/cheating-agents/
    github: https://github.com/BrachioLab/Meerkat
    external: false
    themes: [adversarial-safety]
    subtheme: alignment-control
    abstract: |-
      To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.
  - title: 'Missingness Bias Calibration in Feature Attribution Explanations'
    link: https://arxiv.org/abs/2603.04831
    authors: Shailesh Sridhar, Anton Xue, Eric Wong
    conference: International Conference on Learning Representations (ICLR), 2026
    short: ICLR 2026
    website: null
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: debugging
    abstract: |-
      Popular explanation methods often produce unreliable feature importance scores due to "missingness bias", a systematic distortion that arises when models are probed with ablated, out-of-distribution inputs. Existing solutions treat this as a deep representational flaw that requires expensive retraining or architectural modifications. In this work, we challenge this assumption and show that missingness bias can be effectively treated as a superficial artifact of the model's output space. We introduce MCal, a lightweight post-hoc method that corrects this bias by fine-tuning a simple linear head on the outputs of a frozen base model. Surprisingly, we find this simple correction consistently reduces missingness bias and is competitive with, or even outperforms, prior heavyweight approaches across diverse medical benchmarks spanning vision, language, and tabular domains.
  - title: 'Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents'
    link: https://arxiv.org/abs/2604.03173
    authors: Delip Rao, Eric Wong, Chris Callison-Burch
    conference: null
    short: Preprint
    website: null
    blog: null
    github: null
    external: false
    themes: [adversarial-safety]
    subtheme: alignment-control
    abstract: |-
      Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3--13\% of citation URLs are hallucinated -- they have no record in the Wayback Machine and likely never existed -- while 5--18\% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4\% (Business) to 11.4\% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by  to under 1\%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.
  - title: 'Interpretability Can Be Actionable'
    link: https://actionable-interpretability-guide.github.io/paper.pdf
    authors: Hadas Orgad, Fazl Barez, Tal Haklay, Isabelle Lee, Marius Mosbach, Anja Reusch, Naomi Saphra, Byron C Wallace, Sarah Wiegreffe, Eric Wong, Ian Tenney, Mor Geva
    conference: null
    short: Preprint
    website: null
    blog: null
    github: null
    external: true
    themes: [interpretability]
    subtheme: debugging
    abstract: |-
      Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability—the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions—concreteness and validation—and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.
  - title: 'CAMEL: An ECG Language Model for Forecasting Cardiac Events'
    link: https://arxiv.org/abs/2602.15677
    authors: Neelay Velingker, Alaia Solko-Breslin, Mayank Keoliya, Seewon Choi, Jiayi Xin, Anika Marathe, Alireza Oraii, Rajat Deo, Sameed Khatana, Rajeev Alur, Mayur Naik, Eric Wong
    conference: International Conference on Learning Representations (ICLR), 2026
    short: Preprint
    website: null
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: scientific-healthcare
    abstract: |-
      Electrocardiograms (ECG) are electrical recordings of the heart that are critical for diagnosing cardiovascular conditions. ECG language models (ELMs) have recently emerged as a promising framework for ECG classification accompanied by report generation. However, current models cannot forecast future cardiac events despite the immense clinical value for planning earlier intervention. To address this gap, we propose CAMEL, the first ELM that is capable of inference over longer signal durations which enables its forecasting capability. Our key insight is a specialized ECG encoder which enables cross-understanding of ECG signals with text. We train CAMEL using established LLM training procedures, combining LoRA adaptation with a curriculum learning pipeline. Our curriculum includes ECG classification, metrics calculations, and multi-turn conversations to elicit reasoning. CAMEL demonstrates strong zero-shot performance across 6 tasks and 9 datasets, including ECGForecastBench, a new benchmark that we introduce for forecasting arrhythmias. CAMEL is on par with or surpasses ELMs and fully supervised baselines both in- and out-of-distribution, achieving SOTA results on ECGBench (+7.0% absolute average gain) as well as ECGForecastBench (+12.4% over fully supervised models and +21.1% over zero-shot ELMs).
  - title: 'Semantics-Preserving Evasion of LLM Vulnerability Detectors'
    link: https://arxiv.org/abs/2602.00305
    authors: Luze Sun, Alina Oprea, Eric Wong
    conference: null
    short: Preprint
    website: null
    blog: null
    github: null
    external: false
    themes: [adversarial-safety]
    subtheme: jailbreaking
    abstract: |-
      LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a metric of joint robustness across different attack methods/carriers. Across models, we observe a systemic failure of semantic invariant adversarial transformations: even state-of-the-art vulnerability detectors perform well on clean inputs while predictions flip under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access can further amplify evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion. Our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.
- year: 2025
  papers:
  - title: 'SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals'
    link: https://arxiv.org/abs/2512.05038
    authors: Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong
    conference: Mechanistic Interpretability Workshop at NeurIPS 2025
    short: MechInterp Workshop @ NeurIPS 2025
    website: null
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: concepts-structure
    abstract: |-
      Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
  - title: 'T-FIX: Text-Based Explanations with Features Interpretable to eXperts'
    link: https://arxiv.org/abs/2511.04070
    authors: Shreya Havaldar, Helen Jin, Chaehyeon Kim, Anton Xue, Weiqiu You, Marco Gatti, Bhuvnesh Jain, Helen Qu, Daniel A Hashimoto, Amin Madani, Rajat Deo, Sameed Ahmed M. Khatana, Gary E. Weissman, Lyle Ungar, Eric Wong
    conference: null
    short: Preprint
    website: https://brachiolab.github.io/fix/
    blog: null
    github: https://github.com/BrachioLab/exlib/tree/main/fix
    external: false
    themes: [interpretability]
    subtheme: scientific-healthcare
    abstract: |-
      As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.
  - title: Probabilistic Stability Guarantees for Feature Attributions
    link: https://arxiv.org/abs/2504.13787
    authors: Helen Jin, Anton Xue, Weiqiu You, Surbhi Goel, Eric Wong
    conference: Neural Information Processing Systems (NeurIPS), 2025
    short: NeurIPS 2025
    website: null
    blog: https://debugml.github.io/soft-stability/
    github: https://github.com/helenjin/soft_stability/
    external: false
    themes: [interpretability]
    subtheme: certified-explanations
    abstract: |-
      Stability guarantees have emerged as a principled way to evaluate feature attributions, but existing certification methods rely on heavily smoothed classifiers and often produce conservative guarantees. To address these limitations, we introduce soft stability and propose a simple, model-agnostic, sample-efficient stability certification algorithm (SCA) that yields non-trivial and interpretable guarantees for any attribution method. Moreover, we show that mild smoothing achieves a more favorable trade-off between accuracy and stability, avoiding the aggressive compromises made in prior certification methods. To explain this behavior, we use Boolean function analysis to derive a novel characterization of stability under smoothing. We evaluate SCA on vision and language tasks and demonstrate the effectiveness of soft stability in measuring the robustness of explanation methods.
  - title: 'CTSketch: Compositional Tensor Sketching for Scalable Neurosymbolic Learning'
    link: https://arxiv.org/abs/2503.24123
    authors: Seewon Choi, Alaia Solko-Breslin, Rajeev Alur, Eric Wong
    conference: Neural Information Processing Systems (NeurIPS), 2025
    short: NeurIPS 2025
    website: null
    blog: https://debugml.github.io/ctsketch/
    github: https://github.com/alaiasolkobreslin/CTSketch
    external: false
    themes: [formal-assurances]
    subtheme: neurosymbolic
    abstract: |-
      Many computational tasks benefit from being formulated as the composition of neural networks followed by a discrete symbolic program. The goal of neurosymbolic learning is to train the neural networks using end-to-end input-output labels of the composite. We introduce CTSketch, a novel, scalable neurosymbolic learning algorithm. CTSketch uses two techniques to improve the scalability of neurosymbolic inference: decompose the symbolic program into sub-programs and summarize each sub-program with a sketched tensor. This strategy allows us to approximate the output distribution of the program with simple tensor operations over the input distributions and the sketches. We provide theoretical insight into the maximum approximation error. Furthermore, we evaluate CTSketch on benchmarks from the neurosymbolic learning literature, including some designed for evaluating scalability. Our results show that CTSketch pushes neurosymbolic learning to new scales that were previously unattainable, with neural predictors obtaining high accuracy on tasks with one thousand inputs, despite supervision only on the final output.
  - title: 'Once Upon an Input: Reasoning via Per-Instance Program Synthesis'
    link: https://neurips.cc/virtual/2025/poster/117467
    authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
    conference: Neural Information Processing Systems (NeurIPS), 2025
    short: NeurIPS 2025
    website: null
    blog: null
    github: null
    external: false
    themes: [formal-assurances]
    subtheme: verifying-reasoning
    abstract: ''
  - title: Probabilistic Soundness Guarantees in LLM Reasoning Chains
    link: https://arxiv.org/abs/2507.12948
    authors: |-
      Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong
    conference: Empirical Methods in Natural Language Processing (EMNLP), 2025
    short: EMNLP 2025
    website: null
    blog: null
    github: null
    external: false
    themes: [formal-assurances]
    subtheme: verifying-reasoning
    abstract: |-
      In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because earlier errors can corrupt judgments of downstream reasoning. To better detect such errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a probabilistic framework that evaluates each reasoning step based solely on previously-verified premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
  - title: Adaptively profiling models with task elicitation
    link: https://arxiv.org/abs/2503.01986
    authors: Davis Brown, Prithvi Balehannina, Helen Jin, Shreya Havaldar, Hamed Hassani, Eric Wong
    conference: Empirical Methods in Natural Language Processing (EMNLP), 2025
    short: EMNLP 2025
    website: null
    blog: null
    github: null
    external: false
    themes: [adversarial-safety]
    subtheme: alignment-control
    abstract: |-
      Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks -- an order of magnitude more than prior work -- where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
  - title: Artifact or Flaw? Rethinking Prompt Sensitivity in Evaluating LLMs
    link: https://openreview.net/forum?id=frH3KtOZPC
    authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin
    conference: Findings of Empirical Methods in Natural Language Processing (EMNLP), 2025
    short: EMNLP Findings 2025
    website: null
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: alignment-control
    abstract: |-
      Prompt sensitivity, referring to the phenomenon where minor variations in phrasing lead to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
  - title: Instruction Following by Boosting Attention of Large Language Models
    link: https://arxiv.org/abs/2506.13734
    authors: Vitoria Guardieiro, Adam Stein, Avishree Khare, Eric Wong
    conference: null
    short: Preprint
    website: null
    blog: https://debugml.github.io/instaboost/
    github: https://github.com/BrachioLab/InstABoost
    external: false
    themes: [adversarial-safety]
    subtheme: mechanistic-theory-of-safety
    abstract: |-
      Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. However, subsequent studies revealed latent steering's effectiveness to be limited, often underperforming simple instruction prompting. To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques. Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model's attention during generation. InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions. Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.
  - title: Benchmarking Misuse Mitigation Against Covert Adversaries
    link: https://arxiv.org/abs/2506.06414v1
    authors: |-
      Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani
    conference: null
    short: Preprint
    website: https://modelmisuse.com/
    blog: null
    github: https://github.com/davisrbr/bsd-misuse
    external: false
    themes: [adversarial-safety]
    subtheme: alignment-control
    abstract: |-
      Existing language model safety evaluations focus on overt attacks and low-stakes tasks. Realistic attackers can subvert current safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to {detect}. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
  - title: 'The FIX Benchmark: Extracting Features Interpretable to eXperts'
    link: https://arxiv.org/abs/2409.13684
    authors: |-
      Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A. Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong
    conference: Journal of Data-centric Machine Learning Research (DMLR), 2025
    short: DMLR 2025
    website: https://brachiolab.github.io/fix/
    blog: https://debugml.github.io/fix/
    github: https://github.com/BrachioLab/exlib/tree/main/fix
    external: false
    themes: [interpretability]
    subtheme: concepts-structure
    abstract: |-
      Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we propose FIXScore, a unified expert alignment measure applicable to diverse real-world settings across cosmology, psychology, and medicine domains in vision, language, and time series data modalities. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.
  - title: Towards Style Alignment in Cross-Cultural Translation
    link: https://arxiv.org/abs/2507.00216
    authors: Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar
    conference: Association for Computational Linguistics (ACL), 2025
    short: ACL 2025
    blog: null
    github: null
    external: false
    themes: [adversarial-safety, interpretability]
    subtheme: alignment-control
    abstract: |-
      Successful communication depends on the speaker's intended style (i.e., what the speaker is trying to convey) aligning with the listener's interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style - biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.
  - title: 'Sum-of-Parts Models: Faithful Attributions for Groups of Features'
    link: https://arxiv.org/abs/2310.16316
    authors: Weiqiu You, Helen Qu, Marco Gatti, Bhuvnesh Jain, Eric Wong
    conference: International Conference on Machine learning (ICML), 2025
    short: ICML 2025
    blog: https://debugml.github.io/sum-of-parts/
    github: https://github.com/DebugML/sop
    external: false
    themes: [interpretability]
    subtheme: certified-explanations
    abstract: |-
      Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery. Our code is available at this https URL
  - title: 'DOLPHIN: A Programmable Framework for Scalable Neurosymbolic Learning'
    link: null
    authors: Aaditya Naik, Jason Liu, Claire Wang, Amish Sethi, Saikat Dutta, Mayur Naik, Eric Wong
    conference: International Conference on Machine learning (ICML), 2025
    short: ICML 2025
    blog: null
    github: null
    external: false
    themes: [formal-assurances]
    subtheme: neurosymbolic
    abstract: ''
  - title: The Road to Generalizable Neuro-Symbolic Learning Should be Paved with Foundation Models
    link: https://arxiv.org/abs/2505.24874
    authors: Adam Stein, Aaditya Naik, Neelay Velingker, Eric Wong
    conference: null
    short: Preprint
    website: null
    blog: null
    github: null
    external: false
    themes: [formal-assurances]
    subtheme: neurosymbolic
    abstract: |-
      Neuro-symbolic learning was proposed to address challenges with training neural networks for complex reasoning tasks with the added benefits of interpretability, reliability, and efficiency. Neuro-symbolic learning methods traditionally train neural models in conjunction with symbolic programs, but they face significant challenges that limit them to simplistic problems. On the other hand, purely-neural foundation models now reach state-of-the-art performance through prompting rather than training, but they are often unreliable and lack interpretability. Supplementing foundation models with symbolic programs, which we call neuro-symbolic prompting, provides a way to use these models for complex reasoning tasks. Doing so raises the question: What role does specialized model training as part of neuro-symbolic learning have in the age of foundation models? To explore this question, we highlight three pitfalls of traditional neuro-symbolic learning with respect to the compute, data, and programs leading to generalization problems. This position paper argues that foundation models enable generalizable neuro-symbolic solutions, offering a path towards achieving the original goals of neuro-symbolic learning without the downsides of training from scratch.
  - title: 'NSF-SciFy: Mining the NSF Awards Database for Scientific Claims'
    link: https://arxiv.org/abs/2503.08600
    authors: Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch
    conference: null
    short: Preprint
    website: null
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: scientific-healthcare
    abstract: |-
      We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions in proposals. Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate 3 three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date -- with an estimated 2.8 million claims across all STEM disciplines funded by the NSF -- NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research.
  - title: Avoiding Copyright Infringement via Machine Unlearning
    link: https://arxiv.org/abs/2406.10952
    authors: Guangyao Dou, Zheyuan Liu, Qing Lyu, Kaize Ding, Eric Wong
    conference: Findings of the Association for Computational Linguistics (NAACL), 2025
    short: NAACL-Findings 2025
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: alignment-control
    abstract: |-
      Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. In real-world scenarios, model owners need to continuously address copyright infringement as new requests for content removal emerge at different time points. This leads to the need for sequential unlearning, where copyrighted content is removed sequentially as new requests arise. Despite its practical relevance, sequential unlearning in the context of copyright infringement has not been rigorously explored in existing literature. To address this gap, we propose Stable Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted content from LLMs over multiple time steps. Our approach works by identifying and removing specific weight updates in the model's parameters that correspond to copyrighted content. We improve unlearning efficacy by introducing random labeling loss and ensuring the model retains its general-purpose knowledge by adjusting targeted parameters. Experimental results show that SSU achieves an effective trade-off between unlearning efficacy and general-purpose language abilities, outperforming existing baselines.
  - title: 'Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference'
    link: https://arxiv.org/abs/2407.00075
    authors: Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong
    conference: International Conference on Learning Representations (ICLR), 2025
    short: ICLR 2025
    blog: https://debugml.github.io/logicbreaks/
    github: https://github.com/AntonXue/tf_logic
    external: false
    themes: [formal-assurances, adversarial-safety]
    subtheme: mechanistic-theory-of-safety
    abstract: |-
      We study how to subvert large language models (LLMs) from following prompt-specified rules. We first formalize rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if $P$ and $Q$, then $R$" for some propositions $P$, $Q$, and $R$. Next, we prove that although small transformers can faithfully follow such rules, maliciously crafted prompts can still mislead both theoretical constructions and models learned from data. Furthermore, we demonstrate that popular attack algorithms on LLMs find adversarial prompts and induce attention patterns that align with our theory. Our novel logic-based framework provides a foundation for studying LLMs in rule-based settings, enabling a formal analysis of tasks like logical reasoning and jailbreak attacks.
  - title: 'SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks'
    link: https://arxiv.org/abs/2310.03684
    authors: Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
    conference: Transactions on Machine Learning Research (TMLR), 2025
    short: TMLR 2025
    blog: https://debugml.github.io/smooth-llm/
    github: https://github.com/arobey1/smooth-llm
    external: false
    themes: [adversarial-safety]
    subtheme: jailbreaking
    abstract: |-
      Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at \url{this https URL}.
  - title: Jailbreaking Black Box Large Language Models in Twenty Queries
    link: https://arxiv.org/abs/2310.08419
    authors: Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong
    conference: 3rd IEEE Conference on Secure and Trustworthy Machine Learning, 2025
    short: SaTML 2025
    blog: https://jailbreaking-llms.github.io/
    github: https://github.com/patrickrchao/JailbreakingLLMs
    external: false
    themes: [adversarial-safety]
    subtheme: jailbreaking
    abstract: |-
      There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.
  - title: Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
    link: https://arxiv.org/abs/2402.16192
    authors: |-
      Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
    short: IJCNLP-AACL, 2025
    conference: IJCNLP-AACL, 2025
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: jailbreaking
    abstract: |-
      Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction following benchmarks such as InstructionFollowing and AlpacaEval. The codes will be publicly available at this https URL.
- year: 2024
  papers:
  - title: 'AR-Pro: Counterfactual Explanations for Anomaly Repair with Formal Properties'
    link: https://arxiv.org/abs/2410.24178
    authors: Xiayan Ji, Anton Xue, Eric Wong, Oleg Sokolsky, Insup Lee
    conference: Neural Information Processing Systems (NeurIPS), 2024
    short: NeurIPS 2024
    website: null
    blog: null
    github: https://github.com/xjiae/arpro
    external: false
    themes: [interpretability]
    subtheme: scientific-healthcare
    abstract: |-
      Anomaly detection is widely used for identifying critical errors and suspicious behaviors, but current methods lack interpretability. We leverage common properties of existing methods and recent advances in generative models to introduce counterfactual explanations for anomaly detection. Given an input, we generate its counterfactual as a diffusion-based repair that shows what a non-anomalous version should have looked like. A key advantage of this approach is that it enables a domain-independent formal specification of explainability desiderata, offering a unified framework for generating and evaluating explanations. We demonstrate the effectiveness of our anomaly explainability framework, AR-Pro, on vision (MVTec, VisA) and time-series (SWaT, WADI, HAI) anomaly datasets. The code used for the experiments is accessible at: this https URL.
  - title: |-
      Crowd-sourced machine learning prediction of long COVID using data from the National COVID Cohort Collaborative
    link: https://www.sciencedirect.com/science/article/pii/S2352396424003694
    authors: |-
      Timothy Bergquist, Johanna Loomba, Emily Pfaff, Fangfang Xia, Zixuan Zhao, Yitan Zhu, Elliot Mitchell, Biplab Bhattacharya, Gaurav Shetty, Tamanna Munia, Grant Delong, Adbul Tariq, Zachary Butzin-Dozier, Yunwen Ji, Haodong Li, Jeremy Coyle, Seraphina Shi, Rachael V. Philips, Andrew Mertens, Romain Pirracchio, Mark van der Laan, John M. Colford Jr., Alan Hubbard, Jifan Gao, Guanhua Chen, Neelay Velingker, Ziyang Li, Yinjun Wu, Adam Stein, Jiani Huang, Zongyu Dai, Qi Long, Mayur Naik, John Holmes, Danielle Mowery, Eric Wong, Ravi Parekh, Emily Getzen, Jake Hightower, Jennifer Blase
    conference: eBioMedicine
    short: eBioMedicine
    website: null
    blog: null
    github: null
    external: true
    themes: [interpretability]
    subtheme: scientific-healthcare
    abstract: ''
  - title: Data-Efficient Learning with Neural Programs
    link: https://arxiv.org/abs/2406.06246
    authors: |-
      Alaia Solko-Breslin, Seewon Choi, Ziyang Li, Neelay Velingker, Rajeev Alur, Mayur Naik, Eric Wong
    conference: Neural Information Processing Systems (NeurIPS), 2024
    short: NeurIPS 2024
    blog: https://debugml.github.io/neural-programs/
    github: https://github.com/alaiasolkobreslin/ISED/tree/v1.0.0
    external: false
    themes: [formal-assurances]
    subtheme: neurosymbolic
    abstract: |-
      Many computational tasks can be naturally expressed as a composition of a DNN followed by a program written in a traditional programming language or an API call to an LLM. We call such composites "neural programs" and focus on the problem of learning the DNN parameters when the training data consist of end-to-end input-output labels for the composite. When the program is written in a differentiable logic programming language, techniques from neurosymbolic learning are applicable, but in general, the learning for neural programs requires estimating the gradients of black-box components. We present an algorithm for learning neural programs, called ISED, that only relies on input-output samples of black-box components. For evaluation, we introduce new benchmarks that involve calls to modern LLMs such as GPT-4 and also consider benchmarks from the neurosymbolic learning literature. Our evaluation shows that for the latter benchmarks, ISED has comparable performance to state-of-the-art neurosymbolic frameworks. For the former, we use adaptations of prior work on gradient approximations of black-box components as a baseline, and show that ISED achieves comparable accuracy but in a more data- and sample-efficient manner.
  - title: Towards Compositionality in Concept Learning
    link: https://arxiv.org/abs/2406.18534
    authors: Adam Stein, Aaditya Naik, Yinjun Wu, Mayur Naik, Eric Wong
    conference: International Conference on Machine learning (ICML), 2024
    short: ICML 2024
    blog: https://debugml.github.io/compositional-concepts/
    github: https://github.com/adaminsky/compositional_concepts
    external: false
    themes: [formal-assurances, interpretability]
    subtheme: neurosymbolic
    abstract: |-
      Concept-based interpretability methods offer a lens into the internals of foundation models by decomposing their embeddings into high-level concepts. These concept representations are most useful when they are compositional, meaning that the individual concepts compose to explain the full sample. We show that existing unsupervised concept extraction methods find concepts which are not compositional. To automatically discover compositional concept representations, we identify two salient properties of such representations, and propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties. We evaluate CCE on five different datasets over image and text data. Our evaluation shows that CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks. Code and data are available at this https URL .
  - title: 'DISCRET: Synthesizing Faithful Explanations For Treatment Effect Estimation'
    link: https://arxiv.org/abs/2406.00611
    authors: |-
      Yinjun Wu, Mayank Keoliya, Kan Chen, Neelay Velingker, Ziyang Li, Emily J Getzen, Qi Long, Mayur Naik, Ravi B Parikh, Eric Wong
    conference: International Conference on Machine learning (ICML), 2024
    short: ICML 2024
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: scientific-healthcare
    abstract: |-
      Designing faithful yet accurate AI models is challenging, particularly in the field of individual treatment effect estimation (ITE). ITE prediction models deployed in critical settings such as healthcare should ideally be (i) accurate, and (ii) provide faithful explanations. However, current solutions are inadequate: state-of-the-art black-box models do not supply explanations, post-hoc explainers for black-box models lack faithfulness guarantees, and self-interpretable models greatly compromise accuracy. To address these issues, we propose DISCRET, a self-interpretable ITE framework that synthesizes faithful, rule-based explanations for each sample. A key insight behind DISCRET is that explanations can serve dually as database queries to identify similar subgroups of samples. We provide a novel RL algorithm to efficiently synthesize these explanations from a large search space. We evaluate DISCRET on diverse tasks involving tabular, image, and text data. DISCRET outperforms the best self-interpretable models and has accuracy comparable to the best black-box models while providing faithful explanations. DISCRET is available at this https URL.
  - title: 'JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models'
    link: https://arxiv.org/abs/2404.01318
    authors: |-
      Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong
    conference: Neural Information Processing Systems (NeurIPS), 2024
    short: NeurIPS 2024
    website: https://jailbreakbench.github.io/
    blog: null
    github: https://github.com/JailbreakBench/jailbreakbench/
    external: false
    themes: [adversarial-safety]
    subtheme: jailbreaking
    abstract: |-
      Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at this https URL that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at this https URL that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.
  - title: Evaluating Groups of Features via Consistency, Contiguity, and Stability
    link: null
    authors: Chaehyeon Kim, Weiqiu You, Shreya Havaldar, Eric Wong
    conference: International Conference on Learning Representations (ICLR), 2024 Tiny Papers Track
    short: ICLR 2024, Tiny Papers (Oral)
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: certified-explanations
    abstract: ''
  - title: |-
      SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
    link: https://arxiv.org/abs/2310.12508
    authors: Chongyu Fan, Jiancheng Liu, Yihua Zhang, Dennis Wei, Eric Wong, Sijia Liu
    conference: International Conference on Learning Representations (ICLR), 2024
    short: ICLR 2024
    blog: null
    github: https://github.com/OPTML-Group/Unlearn-Saliency
    external: true
    themes: [adversarial-safety]
    subtheme: alignment-control
    abstract: |-
      With evolving data regulations, machine unlearning (MU) has become an important tool for fostering trust and safety in today's AI models. However, existing MU methods focusing on data and/or weight perspectives often suffer limitations in unlearning accuracy, stability, and cross-domain applicability. To address these challenges, we introduce the concept of 'weight saliency' for MU, drawing parallels with input saliency in model explanation. This innovation directs MU's attention toward specific model weights rather than the entire model, improving effectiveness and efficiency. The resultant method that we call saliency unlearning (SalUn) narrows the performance gap with 'exact' unlearning (model retraining from scratch after removing the forgetting data points). To the best of our knowledge, SalUn is the first principled MU approach that can effectively erase the influence of forgetting data, classes, or concepts in both image classification and generation tasks. As highlighted below, For example, SalUn yields a stability advantage in high-variance random data forgetting, e.g., with a 0.2% gap compared to exact unlearning on the CIFAR-10 dataset. Moreover, in preventing conditional diffusion models from generating harmful images, SalUn achieves nearly 100% unlearning accuracy, outperforming current state-of-the-art baselines like Erased Stable Diffusion and Forget-Me-Not. Codes are available at this https URL. (WARNING: This paper contains model outputs that may be offensive in nature.)
  - title: Initialization Matters for Adversarial Transfer Learning
    link: https://arxiv.org/abs/2312.05716
    authors: Andong Hua, Jindong Gu, Zhiyu Xue, Nicholas Carlini, Eric Wong, Yao Qin
    conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    short: CVPR 2024
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      With the prevalence of the Pretraining-Finetuning paradigm in transfer learning, the robustness of downstream tasks has become a critical concern. In this work, we delve into adversarial robustness in transfer learning and reveal the critical role of initialization, including both the pretrained model and the linear head. First, we discover the necessity of an adversarially robust pretrained model. Specifically, we reveal that with a standard pretrained model, Parameter-Efficient Finetuning (PEFT) methods either fail to be adversarially robust or continue to exhibit significantly degraded adversarial robustness on downstream tasks, even with adversarial training during finetuning. Leveraging a robust pretrained model, surprisingly, we observe that a simple linear probing can outperform full finetuning and other PEFT methods with random initialization on certain datasets. We further identify that linear probing excels in preserving robustness from the robust pretraining. Based on this, we propose Robust Linear Initialization (RoLI) for adversarial finetuning, which initializes the linear head with the weights obtained by adversarial linear probing to maximally inherit the robustness from pretraining. Across five different image classification datasets, we demonstrate the effectiveness of RoLI and achieve new state-of-the-art results. Our code is available at \url{this https URL}.
  - title: 'TorchQL: A Programming Framework for Integrity Constraints in Machine Learning'
    link: https://arxiv.org/abs/2308.06686
    authors: Aaditya Naik, Adam Stein, Yinjun Wu, Eric Wong, Mayur Naik
    conference: Object-oriented Programming, Systems, Languages, and Applications (OOPSLA), 2024
    short: OOPSLA 2024
    blog: null
    github: https://github.com/TorchQL/torchql
    external: false
    themes: [formal-assurances, interpretability]
    subtheme: neurosymbolic
    abstract: |-
      Finding errors in machine learning applications requires a thorough exploration of their behavior over data. Existing approaches used by practitioners are often ad-hoc and lack the abstractions needed to scale this process. We present TorchQL, a programming framework to evaluate and improve the correctness of machine learning applications. TorchQL allows users to write queries to specify and check integrity constraints over machine learning models and datasets. It seamlessly integrates relational algebra with functional programming to allow for highly expressive queries using only eight intuitive operators. We evaluate TorchQL on diverse use-cases including finding critical temporal inconsistencies in objects detected across video frames in autonomous driving, finding data imputation errors in time-series medical records, finding data labeling errors in real-world images, and evaluating biases and constraining outputs of language models. Our experiments show that TorchQL enables up to 13x faster query executions than baselines like Pandas and MongoDB, and up to 40% shorter queries than native Python. We also conduct a user study and find that TorchQL is natural enough for developers familiar with Python to specify complex integrity constraints.
- year: 2023
  papers:
  - title: Comparing Styles across Languages
    link: https://arxiv.org/abs/2310.07135
    authors: Shreya Havaldar, Matthew Pressimone, Eric Wong, Lyle Ungar
    conference: Empirical Methods in Natural Language Processing (EMNLP), 2023
    short: EMNLP 2023
    blog: null
    github: null
    external: false
    themes: [interpretability, adversarial-safety]
    subtheme: concepts-structure
    abstract: |-
      Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.
  - title: Stability Guarantees for Feature Attributions with Multiplicative Smoothing
    link: https://arxiv.org/abs/2307.05902
    authors: Anton Xue, Rajeev Alur, Eric Wong
    conference: Neural Information Processing Systems (NeurIPS), 2023
    short: NeurIPS 2023
    blog: https://debugml.github.io/multiplicative-smoothing/
    github: https://github.com/DebugML/mus
    external: false
    themes: [interpretability]
    subtheme: certified-explanations
    abstract: |-
      Explanation methods for machine learning models tend not to provide any formal guarantees and may not reflect the underlying decision-making process. In this work, we analyze stability as a property for reliable feature attribution methods. We prove that relaxed variants of stability are guaranteed if the model is sufficiently Lipschitz with respect to the masking of features. We develop a smoothing method called Multiplicative Smoothing (MuS) to achieve such a model. We show that MuS overcomes the theoretical limitations of standard smoothing techniques and can be integrated with any classifier and feature attribution method. We evaluate MuS on vision and language models with various feature attribution methods, such as LIME and SHAP, and demonstrate that MuS endows feature attributions with non-trivial stability guarantees.
  - title: 'TopEx: Topic-based Explanations for Model Comparison'
    link: https://arxiv.org/abs/2306.00976
    authors: Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar
    conference: International Conference on Learning Representations (ICLR), 2023 Tiny Papers Track
    short: ICLR 2023, Tiny Papers
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: concepts-structure
    abstract: |-
      Meaningfully comparing language models is challenging with current explanation methods. Current explanations are overwhelming for humans due to large vocabularies or incomparable across models. We present TopEx, an explanation method that enables a level playing field for comparing language models via model-agnostic topics. We demonstrate how TopEx can identify similarities and differences between DistilRoBERTa and GPT-2 on a variety of NLP tasks.
  - title: Rectifying Group Irregularities in Explanations for Distribution Shift
    link: https://arxiv.org/abs/2305.16308
    authors: Adam Stein, Yinjun Wu, Eric Wong, Mayur Naik
    conference: null
    short: Preprint
    blog: null
    github: null
    external: false
    themes: [interpretability]
    subtheme: concepts-structure
    abstract: |-
      It is well-known that real-world changes constituting distribution shift adversely affect model performance. How to characterize those changes in an interpretable manner is poorly understood. Existing techniques to address this problem take the form of shift explanations that elucidate how to map samples from the original distribution toward the shifted one by reducing the disparity between these two distributions. However, these methods can introduce group irregularities, leading to explanations that are less feasible and robust. To address these issues, we propose Group-aware Shift Explanations (GSE), a method that produces interpretable explanations by leveraging worst-group optimization to rectify group irregularities. We demonstrate how GSE not only maintains group structures, such as demographic and hierarchical subpopulations, but also enhances feasibility and robustness in the resulting explanations in a wide range of tabular, language, and image settings.
  - title: Do Machine Learning Models Learn Statistical Rules Inferred from Data?
    link: https://arxiv.org/abs/2303.01433
    authors: Aaditya Naik, Yinjun Wu, Mayur Naik, Eric Wong
    conference: International Conference on Machine learning (ICML), 2023
    short: ICML 2023
    blog: https://debugml.github.io/SQRL/
    github: https://github.com/DebugML/sqrl
    external: false
    themes: [formal-assurances]
    subtheme: verifying-reasoning
    abstract: |-
      Machine learning models can make critical errors that are easily hidden within vast amounts of data. Such errors often run counter to rules based on human intuition. However, rules based on human knowledge are challenging to scale or to even formalize. We thereby seek to infer statistical rules from the data and quantify the extent to which a model has learned them. We propose a framework SQRL that integrates logic-based methods with statistical inference to derive these rules from a model's training data without supervision. We further show how to adapt models at test time to reduce rule violations and produce more coherent predictions. SQRL generates up to 300K rules over datasets from vision, tabular, and language settings. We uncover up to 158K violations of those rules by state-of-the-art models for classification, object detection, and data imputation. Test-time adaptation reduces these violations by up to 68.7% with relative performance improvement up to 32%. SQRL is available at this https URL.
  - title: In-context Example Selection with Influences
    link: https://arxiv.org/abs/2302.11042
    authors: Tai Nguyen, Eric Wong
    conference: null
    short: Preprint
    blog: https://debugml.github.io/incontext-influences/
    github: https://github.com/DebugML/incontext_influences
    external: false
    themes: [interpretability]
    subtheme: debugging
    abstract: |-
      In-context learning (ICL) is a powerful paradigm emerged from large language models (LLMs). Despite its promises, ICL performance is known to be highly sensitive to input examples. In this work, we use $\textit{in-context influences}$ to analyze few-shot ICL performance directly from the in-context examples. Our proposed influence-based example selection method can identify both positive and negative examples, outperforming several baselines when evaluated on 9 SuperGLUE tasks. Our analysis uncovers up to a $16.3\%$ performance gap between using the most negative in-context examples compared to the most positive. In a case study, we apply our influence-based framework to quantify the phenomena of recency bias in example ordering for few-shot ICL.
  - title: Adversarial Prompting for Black Box Foundation Models
    link: https://arxiv.org/abs/2302.04237
    authors: Natalie Maus*, Patrick Chao*, Eric Wong, Jacob Gardner
    conference: null
    short: DLSP 2023 Keynote
    blog: https://debugml.github.io/adversarial-prompts/
    github: https://github.com/DebugML/adversarial_prompting
    external: false
    themes: [adversarial-safety]
    subtheme: jailbreaking
    abstract: |-
      Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or generating high perplexity text.
  - title: Faithful Chain-of-Thought Reasoning
    link: https://arxiv.org/abs/2301.13379
    authors: |-
      Qing Lyu*, Shreya Havaldar*, Adam Stein*, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, Chris Callison-Burch
    short: IJCNLP-AACL, 2023
    conference: IJCNLP-AACL, 2023
    blog: https://debugml.github.io/fcot/
    github: https://github.com/veronica320/Faithful-COT
    external: false
    themes: [formal-assurances]
    subtheme: verifying-reasoning
    abstract: |-
      While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a reasoning framework involving two stages: Translation (Natural Language query $\rightarrow$ symbolic reasoning chain) and Problem Solving (reasoning chain $\rightarrow$ answer), using an LM and a deterministic solver respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy.
  - title: A data-based perspective on transfer learning
    link: https://arxiv.org/abs/2207.05739
    authors: Saachi Jain*, Hadi Salman*, Alaa Khaddaj*, Eric Wong, Sung Min Park, Aleksander Madry
    conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    short: CVPR 2023
    blog: https://gradientscience.org/data-transfer/
    github: https://github.com/MadryLab/data-transfer
    external: true
    themes: [interpretability]
    subtheme: debugging
    abstract: |-
      It is commonly believed that in transfer learning including more pre-training data translates into better performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we take a closer look at the role of the source dataset's composition in transfer learning and present a framework for probing its impact on downstream performance. Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness as well as detecting pathologies such as data-leakage and the presence of misleading examples in the source dataset. In particular, we demonstrate that removing detrimental datapoints identified by our framework improves transfer learning performance from ImageNet on a variety of target tasks. Code is available at this https URL
- year: 2022
  papers:
  - title: When does bias transfer in transfer learning
    link: https://arxiv.org/abs/2207.02842
    authors: Hadi Salman*, Saachi Jain*, Andrew Ilyas*, Logan Engstrom*, Eric Wong, Aleksander Madry
    conference: null
    short: Preprint
    blog: https://gradientscience.org/bias-transfer/
    github: https://github.com/MadryLab/bias-transfer
    external: true
    themes: [interpretability]
    subtheme: debugging
    abstract: |-
      Using transfer learning to adapt a pre-trained "source model" to a downstream "target task" can dramatically increase performance with seemingly no downside. In this work, we demonstrate that there can exist a downside after all: bias transfer, or the tendency for biases of the source model to persist even after adapting the model to the target class. Through a combination of synthetic and natural experiments, we show that bias transfer both (a) arises in realistic settings (such as when pre-training on ImageNet or other standard datasets) and (b) can occur even when the target dataset is explicitly de-biased. As transfer-learned models are increasingly deployed in the real world, our work highlights the importance of understanding the limitations of pre-trained source models. Code is available at this https URL
  - title: Missingness bias in model debugging
    link: https://arxiv.org/abs/2204.08945
    authors: Saachi Jain*, Hadi Salman*, Pengchuan Zhang, Vibhav Vineet, Sal Vemprala, Aleksander Madry
    conference: International Conference on Learning Representations (ICLR), 2022
    short: ICLR 2022
    blog: https://gradientscience.org/missingness/
    github: https://github.com/MadryLab/missingness
    external: true
    themes: [interpretability]
    subtheme: debugging
    abstract: |-
      Missingness, or the absence of features from an input, is a concept fundamental to many model debugging tools. However, in computer vision, pixels cannot simply be removed from an image. One thus tends to resort to heuristics such as blacking out pixels, which may in turn introduce bias into the debugging process. We study such biases and, in particular, show how transformer-based architectures can enable a more natural implementation of missingness, which side-steps these issues and improves the reliability of model debugging in practice. Our code is available at this https URL
  - title: Certified patch robustness via smoothed vision transformers
    link: https://arxiv.org/abs/2110.07719
    authors: Hadi Salman*, Saachi Jain*, Eric Wong*, Aleksander Madry
    conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    short: CVPR 2022
    blog: https://gradientscience.org/smoothing/
    github: https://github.com/MadryLab/smoothed-vit
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      Certified patch defenses can guarantee robustness of an image classifier to arbitrary changes within a bounded contiguous region. But, currently, this robustness comes at a cost of degraded standard accuracies and slower inference times. We demonstrate how using vision transformers enables significantly better certified patch robustness that is also more computationally efficient and does not incur a substantial drop in standard accuracy. These improvements stem from the inherent ability of the vision transformer to gracefully handle largely masked images. Our code is available at this https URL.
  - title: 'DeepSplit: Scalable verification of deep neural networks via operator splitting'
    link: https://arxiv.org/abs/2106.09117
    authors: Shaoru Chen*, Eric Wong*, J. Zico Kolter, Mahyar Fazlyab
    conference: IEEE Open Journal of Control Systems (OJCS), 2022
    short: OJCS 2022
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      Analyzing the worst-case performance of deep neural networks against input perturbations amounts to solving a large-scale non-convex optimization problem, for which several past works have proposed convex relaxations as a promising alternative. However, even for reasonably-sized neural networks, these relaxations are not tractable, and so must be replaced by even weaker relaxations in practice. In this work, we propose a novel operator splitting method that can directly solve a convex relaxation of the problem to high accuracy, by splitting it into smaller sub-problems that often have analytical solutions. The method is modular, scales to very large problem instances, and compromises operations that are amenable to fast parallelization with GPU acceleration. We demonstrate our method in bounding the worst-case performance of large convolutional networks in image classification and reinforcement learning settings, and in reachability analysis of neural network dynamical systems.
- year: 2021
  papers:
  - title: Leveraging Sparse Linear Layers for Debuggable Deep Networks
    link: https://arxiv.org/abs/2105.04857
    authors: Eric Wong*, Shibani Santurkar*, Aleksander Madry
    conference: International Conference on Machine learning (ICML), 2021 *Long Oral*
    short: ICML 2021 (Oral)
    blog: https://gradientscience.org/glm_saga/
    github: https://github.com/madrylab/debuggabledeepnetworks
    external: true
    themes: [interpretability]
    subtheme: debugging
    abstract: |-
      We show how fitting sparse linear models over learned deep feature representations can lead to more debuggable neural networks. These networks remain highly accurate while also being more amenable to human interpretation, as we demonstrate quantiatively via numerical and human experiments. We further illustrate how the resulting sparse explanations can help to identify spurious correlations, explain misclassifications, and diagnose model biases in vision and language tasks. The code for our toolkit can be found at this https URL.
  - title: Learning perturbation sets for robust machine learning
    link: https://arxiv.org/abs/2007.08450
    authors: Eric Wong, J. Zico Kolter
    conference: International Conference on Learning Representations (ICLR), 2021
    short: ICLR 2021
    blog: https://locuslab.github.io/2020-07-20-perturbation/
    github: https://github.com/locuslab/perturbation_learning/
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      Although much progress has been made towards robust deep learning, a significant gap in robustness remains between real-world perturbations and more narrowly defined sets typically studied in adversarial defenses. In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation. Specifically, we use a conditional generator that defines the perturbation set over a constrained region of the latent space. We formulate desirable properties that measure the quality of a learned perturbation set, and theoretically prove that a conditional variational autoencoder naturally satisfies these criteria. Using this framework, our approach can generate a variety of perturbations at different complexities and scales, ranging from baseline spatial transformations, through common image corruptions, to lighting variations. We measure the quality of our learned perturbation sets both quantitatively and qualitatively, finding that our models are capable of producing a diverse set of meaningful perturbations beyond the limited data seen during training. Finally, we leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations, while improving generalization on non-adversarial data. All code and configuration files for reproducing the experiments as well as pretrained model weights can be found at this https URL.
- year: 2020
  papers:
  - title: Overfitting in adversarially robust deep learning
    link: https://arxiv.org/abs/2002.11569
    authors: Leslie Rice*, Eric Wong*, J. Zico Kolter
    conference: International Conference on Machine learning (ICML), 2020
    short: ICML 2020
    blog: null
    github: https://github.com/locuslab/robust_overfitting/
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      It is common practice in deep learning to use overparameterized networks and train for as long as possible; there are numerous studies that show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier. In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models ($\ell_\infty$ and $\ell_2$). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at this https URL.
  - title: |-
      Neural network virtual sensors for fuel injection quantities with provable performance specifications
    link: http://arxiv.org/abs/2007.00147
    authors: Eric Wong, Tim Schneider, Joerg Schmitt, Frank R. Schmidt, J. Zico Kolter
    conference: IEEE Intelligent Vehicles Syimposium (IV), 2020
    short: IEEE IV 2020
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      Recent work has shown that it is possible to learn neural networks with provable guarantees on the output of the model when subject to input perturbations, however these works have focused primarily on defending against adversarial examples for image classifiers. In this paper, we study how these provable guarantees can be naturally applied to other real world settings, namely getting performance specifications for robust virtual sensors measuring fuel injection quantities within an engine. We first demonstrate that, in this setting, even simple neural network models are highly susceptible to reasonable levels of adversarial sensor noise, which are capable of increasing the mean relative error of a standard neural network from 6.6% to 43.8%. We then leverage methods for learning provably robust networks and verifying robustness properties, resulting in a robust model which we can provably guarantee has at most 16.5% mean relative error under any sensor noise. Additionally, we show how specific intervals of fuel injection quantities can be targeted to maximize robustness for certain ranges, allowing us to train a virtual sensor for fuel injection which is provably guaranteed to have at most 10.69% relative error under noise while maintaining 3% relative error on non-adversarial data within normalized fuel injection ranges of 0.6 to 1.0.
  - title: 'Fast is better than free: revisiting adversarial training'
    link: https://arxiv.org/abs/2001.03994
    authors: Eric Wong*, Leslie Rice*, J. Zico Kolter
    conference: International Conference on Learning Representations (ICLR), 2020
    short: ICLR 2020
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      Adversarial training, a method for learning robust deep networks, is typically assumed to be more expensive than traditional training due to the necessity of constructing adversarial examples via a first-order method like projected gradient decent (PGD). In this paper, we make the surprising discovery that it is possible to train empirically robust models using a much weaker and cheaper adversary, an approach that was previously believed to be ineffective, rendering the method no more costly than standard training in practice. Specifically, we show that adversarial training with the fast gradient sign method (FGSM), when combined with random initialization, is as effective as PGD-based training but has significantly lower cost. Furthermore we show that FGSM adversarial training can be further accelerated by using standard techniques for efficient training of deep networks, allowing us to learn a robust CIFAR10 classifier with 45% robust accuracy to PGD attacks with $\epsilon=8/255$ in 6 minutes, and a robust ImageNet classifier with 43% robust accuracy at $\epsilon=2/255$ in 12 hours, in comparison to past work based on "free" adversarial training which took 10 and 50 hours to reach the same respective thresholds. Finally, we identify a failure mode referred to as "catastrophic overfitting" which may have caused previous attempts to use FGSM adversarial training to fail. All code for reproducing the experiments in this paper as well as pretrained model weights are at this https URL.
  - title: Adversarial robustness against the union of multiple perturbation models
    link: https://arxiv.org/abs/1909.04068
    authors: Pratyush Maini, Eric Wong, J. Zico Kolter
    conference: International Conference on Machine learning (ICML), 2020
    short: ICML 2020
    blog: null
    github: https://github.com/locuslab/robust_union/
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      Owing to the susceptibility of deep learning systems to adversarial attacks, there has been a great deal of work in developing (both empirically and certifiably) robust classifiers. While most work has defended against a single type of attack, recent work has looked at defending against multiple perturbation models using simple aggregations of multiple attacks. However, these methods can be difficult to tune, and can easily result in imbalanced degrees of robustness to individual perturbation models, resulting in a sub-optimal worst-case loss over the union. In this work, we develop a natural generalization of the standard PGD-based procedure to incorporate multiple perturbation models into a single attack, by taking the worst-case over all steepest descent directions. This approach has the advantage of directly converging upon a trade-off between different perturbation models which minimizes the worst-case performance over the union. With this approach, we are able to train standard architectures which are simultaneously robust against $\ell_\infty$, $\ell_2$, and $\ell_1$ attacks, outperforming past approaches on the MNIST and CIFAR10 datasets and achieving adversarial accuracy of 47.0% against the union of ($\ell_\infty$, $\ell_2$, $\ell_1$) perturbations with radius = (0.03, 0.5, 12) on the latter, improving upon previous approaches which achieve 40.6% accuracy.
- year: 2019
  papers:
  - title: Wasserstein adversarial examples
    link: https://arxiv.org/abs/1902.07906
    authors: Eric Wong, Frank R. Schmidt, J. Zico Kolter
    conference: International Conference on Machine Learning (ICML), 2019
    short: ICML 2019
    blog: null
    github: null
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      A rapidly growing area of work has studied the existence of adversarial examples, datapoints which have been perturbed to fool a classifier, but the vast majority of these works have focused primarily on threat models defined by $\ell_p$ norm-bounded perturbations. In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. In the image classification setting, such distances measure the cost of moving pixel mass, which naturally cover "standard" image manipulations such as scaling, rotation, translation, and distortion (and can potentially be applied to other settings as well). To generate Wasserstein adversarial examples, we develop a procedure for projecting onto the Wasserstein ball, based upon a modified version of the Sinkhorn iteration. The resulting algorithm can successfully attack image classification models, bringing traditional CIFAR10 models down to 3% accuracy within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass 1 pixel), and we demonstrate that PGD-based adversarial training can improve this adversarial accuracy to 76%. In total, this work opens up a new direction of study in adversarial robustness, more formally considering convex metrics that accurately capture the invariances that we typically believe should exist in classifiers. Code for all experiments in the paper is available at this https URL.
- year: 2018
  papers:
  - title: Scaling provable adversarial defenses
    link: https://arxiv.org/abs/1805.12514
    authors: Eric Wong, Frank R. Schmidt, Jan Hendrik Metzen, J. Zico Kolter
    conference: In Neural Information Processing Systems (NeurIPS), 2018
    short: NeurIPS 2018
    blog: null
    github: https://github.com/locuslab/convex_adversarial/
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      Recent work has developed methods for learning deep network classifiers that are provably robust to norm-bounded adversarial perturbation; however, these methods are currently only possible for relatively small feedforward networks. In this paper, in an effort to scale these approaches to substantially larger models, we extend previous work in three main directions. First, we present a technique for extending these training procedures to much more general networks, with skip connections (such as ResNets) and general nonlinearities; the approach is fully modular, and can be implemented automatically (analogous to automatic differentiation). Second, in the specific case of $\ell_\infty$ adversarial perturbations and networks with ReLU nonlinearities, we adopt a nonlinear random projection for training, which scales linearly in the number of hidden units (previous approaches scaled quadratically). Third, we show how to further improve robust error through cascade models. On both MNIST and CIFAR data sets, we train classifiers that improve substantially on the state of the art in provable robust adversarial error bounds: from 5.8% to 3.1% on MNIST (with $\ell_\infty$ perturbations of $\epsilon=0.1$), and from 80% to 36.4% on CIFAR (with $\ell_\infty$ perturbations of $\epsilon=2/255$). Code for all experiments in the paper is available at this https URL.
  - title: Provable defenses against adversarial examples via the convex outer adversarial polytope
    link: https://arxiv.org/abs/1711.00851
    authors: Eric Wong, J. Zico Kolter
    conference: |-
      International Conference on Machine Learning (ICML), 2018; Best defense paper at [NIPS 2017 ML &amp; Security Workshop](https://machine-learning-and-security.github.io/)
    short: ICML 2018
    blog: https://locuslab.github.io/2019-03-12-provable/
    github: https://github.com/locuslab/convex_adversarial/
    external: true
    themes: [adversarial-safety]
    subtheme: adversarial-robustness
    abstract: |-
      We propose a method to learn deep ReLU-based classifiers that are provably robust against norm-bounded adversarial perturbations on the training data. For previously unseen examples, the approach is guaranteed to detect all adversarial examples, though it may flag some non-adversarial examples as well. The basic idea is to consider a convex outer approximation of the set of activations reachable through a norm-bounded perturbation, and we develop a robust optimization procedure that minimizes the worst case loss over this outer region (via a linear program). Crucially, we show that the dual problem to this linear program can be represented itself as a deep network similar to the backpropagation network, leading to very efficient optimization approaches that produce guaranteed bounds on the robust loss. The end result is that by executing a few more forward and backward passes through a slightly modified version of the original network (though possibly with much larger batch sizes), we can learn a classifier that is provably robust to any norm-bounded adversarial attack. We illustrate the approach on a number of tasks to train classifiers with robust adversarial guarantees (e.g. for MNIST, we produce a convolutional classifier that provably has less than 5.8% test error for any adversarial attack with bounded $\ell_\infty$ norm less than $\epsilon = 0.1$), and code for all experiments in the paper is available at this https URL.
- year: 2017
  papers:
  - title: A Semismooth Newton Method for Fast, Generic Convex Programming
    link: https://arxiv.org/abs/1705.00772
    authors: Alnur Ali*, Eric Wong*, J. Zico Kolter
    conference: International Conference on Machine Learning (ICML), 2017
    short: ICML 2017
    blog: null
    github: https://github.com/locuslab/newton_admm/
    external: true
    themes: []
    subtheme: null
    abstract: |-
      We introduce Newton-ADMM, a method for fast conic optimization. The basic idea is to view the residuals of consecutive iterates generated by the alternating direction method of multipliers (ADMM) as a set of fixed point equations, and then use a nonsmooth Newton method to find a solution; we apply the basic idea to the Splitting Cone Solver (SCS), a state-of-the-art method for solving generic conic optimization problems. We demonstrate theoretically, by extending the theory of semismooth operators, that Newton-ADMM converges rapidly (i.e., quadratically) to a solution; empirically, Newton-ADMM is significantly faster than SCS on a number of problems. The method also has essentially no tuning parameters, generates certificates of primal or dual infeasibility, when appropriate, and can be specialized to solve specific convex problems.
- year: 2015
  papers:
  - title: An SVD and Derivative Kernel Approach to Learning from Geometric Data
    link: http://zicokolter.com/publications/wong2015svdkernel.pdf
    authors: Eric Wong, J. Zico Kolter
    conference: Conference on Artificial Intelligence (AAAI), 2015
    short: AAAI 2015
    blog: null
    github: null
    external: true
    themes: []
    subtheme: null
    abstract: ''