I'm currently at a stealth startup training domain-specific LLMs, working on data across
the training stack. My recent work has centered on post-training, where I've developed reward
signals, graders, and evaluations for improving factuality and faithfulness in agentic
systems. I've also worked on mid-training, building data pipelines to curate and filter
large-scale corpora, and running training and mixture ablations to study how data choices
affect model behavior. In practice, I sit close to the full model-development loop: data,
experiments, training, evaluation, and iteration. Prior to that, I interned at Google
Research, where I worked on video understanding in Gemini and presented my work at the
annual Google Research conference.
I’m also completing my PhD at Bar-Ilan’s NLP lab, advised by Yoav Goldberg, where I
research underspecification: what happens when language, tasks, or evaluations leave part
of the intended meaning implicit. I study how models fill in those gaps, why
they often default to behavior that looks correct but misses human intent, and how to build
evaluations that expose these failures more directly. The goal is to make models more
reliable in realistic settings, where the right answer is not fully specified in the input.
Selected publications
Linguistic Binding in Diffusion Models
Ask a text-to-image model for “a pink bench and a yellow dog” and you'll often get a yellow bench.
We traced this to cross-attention maps that don't agree on which tokens modify which objects, and
proposed SynGen: a training-free fix that aligns them using the prompt's dependency parse.
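
To make the dependency-parse idea concrete, here is a minimal sketch (not the SynGen implementation) of the first step: pulling modifier-to-noun bindings out of the prompt with spaCy. The pipeline name and the dependency labels filtered on are illustrative assumptions.

```python
# Sketch only: recover which adjective modifies which noun, the binding signal
# a SynGen-style alignment needs before it ever looks at an attention map.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English pipeline

def modifier_noun_pairs(prompt: str):
    """Return (modifier, noun) pairs, e.g.
    'a pink bench and a yellow dog' -> [('pink', 'bench'), ('yellow', 'dog')]."""
    doc = nlp(prompt)
    pairs = []
    for token in doc:
        # adjectival modifiers and compounds attach directly to their head noun
        if token.dep_ in {"amod", "compound"} and token.head.pos_ in {"NOUN", "PROPN"}:
            pairs.append((token.text, token.head.text))
    return pairs

print(modifier_noun_pairs("a pink bench and a yellow dog"))
# [('pink', 'bench'), ('yellow', 'dog')]
```

At generation time, SynGen turns pairs like these into a loss that pulls each modifier's cross-attention map toward its noun's map and pushes it away from the other nouns', applied during the denoising steps with no retraining.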
GRADE: Quantifying Sample Diversity in Text-to-Image Models
For an underspecified prompt like “a dog,” a good model should produce a variety of reasonable
images. GRADE uses a VLM to propose prompt-relevant attributes (breed, color, setting)
and measures entropy over them. Most production models are more mode-collapsed than people think —
and it's getting worse, not better.
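
As a rough illustration of the measurement (a sketch, not GRADE's scoring code), diversity over a single attribute can be scored as normalized entropy of the values a VLM assigns to each generated image; the labels below are hypothetical.

```python
# Sketch: score diversity of one prompt-relevant attribute (here, breed) as
# normalized entropy over the labels a VLM assigned to the generated images.
from collections import Counter
import math

def normalized_entropy(labels):
    counts = Counter(labels)
    n = sum(counts.values())
    probs = [c / n for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    # 1.0 means labels are spread uniformly, 0.0 means every image got the same label
    return h / math.log(len(counts)) if len(counts) > 1 else 0.0

# hypothetical VLM labels for 10 images generated from the prompt "a dog"
breeds = ["labrador"] * 8 + ["poodle", "beagle"]
print(f"{normalized_entropy(breeds):.2f}")  # ~0.58: most of the mass on one breed
```

GRADE aggregates scores like this across many prompts and attributes; the point of the sketch is just that low entropy over reasonable attributes is a direct signature of mode collapse.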
D-MERIT: Rewarding Retrievers that Are Actually Right
Retrieval benchmarks mark one gold document per query — but many queries have several valid
answers, so retrievers that surface a different-but-correct document get penalized. D-MERIT annotates
the full answer set for each query, and rankings shift meaningfully once you evaluate against it.
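
To see why the single-gold convention bites, here is a tiny sketch (hypothetical document IDs, not D-MERIT's evaluation code) comparing hit@k against one annotated gold versus the full answer set.

```python
# Sketch: the same ranking scored against a single gold document
# vs. against every document that actually answers the query.
def hit_at_k(ranking, relevant, k=10):
    """1.0 if any relevant document appears in the top-k, else 0.0."""
    return float(any(doc in relevant for doc in ranking[:k]))

ranking = ["doc_17", "doc_03", "doc_42"]   # retriever's top results
single_gold = {"doc_99"}                   # the one annotated answer
full_answer_set = {"doc_99", "doc_03"}     # every document that answers the query

print(hit_at_k(ranking, single_gold))      # 0.0 -> penalized despite being right
print(hit_at_k(ranking, full_answer_set))  # 1.0 -> credited under exhaustive labels
```

The ranking is identical in both cases; only the labels change, which is exactly the kind of shift that reorders retriever comparisons once the answer set is annotated exhaustively.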
An early look at a strange failure mode: asking for two different concepts often produces two of
the same thing. We characterized the behavior and traced it to how the text encoder maps
nouns to concepts; the paper later found its way into public conversations about what these
models actually understand.