Posters
ISCOL 2025 • December 18th, 2025
Session 1 (10:15 - 11:15)
LCHAIM - Investigating Long Context Reasoning in Hebrew
NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness in LLMs
Using Natural Language Inference and Inferentialist Theory to Assess Meaning Similarity in Text Generation
Not Just a Piece of Cake: Cross-Lingual Fine-Tuning for Idiom Identification
3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model
Aligning What LLMs Do and Say: Towards Self-Consistent Explanations
ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
Comparing human and language models sentence processing difficulties on complex structures
Uncovering Measurement Biases in LLM Embedding Spaces: The Anna Karenina Principle and Its Implications for Automated Feedback
Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark
User-Centric Evidence Ranking for Attribution and Fact Verification
Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Probing Subphonemes in Morphology Models
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
Not Your Typical Sycophant: Evaluating Sycophancy of Large Language Models
JuStRank: Benchmarking LLM Judges for System Ranking
Automatic biblical authorship attribution
A Survey on Evaluation of LLM-based Agents
Detecting (Un)answerability in Large Language Models with Linear Directions
Don’t lie to your friends: Learning what you know from collaborative self-play
REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space
Effective Red-Teaming of Policy-Adherent Agents
Integrating Morphological Structure into Word Embedding Representations
Detecting Conspiracies in Hebrew Twitter with LLM-GNN Fusion
Localizing Factual Inconsistencies in Attributable Text Generation
MINT: Meaning Integrating Tokenizer
d-chi Stencil: A Differential Privacy Mechanism for Interacting with LLMs
IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs
CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Grade: Quantifying sample diversity in text-to-image models
Decoding Reading Goals from Eye Movements
Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise
Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Large Temporal Models: Unlocking Temporal Understanding in LLMs for Temporal Relation Classification
Fine-Grained Arabic Offensive Language Classification with Taxonomy, Sentiment, and Emotions
TwoHillsLab: A Scalable Platform for Quantitative Analysis of Biblical Hebrew
DRAGged into CONFLICTS: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs
Beyond Pairwise: Global Zero-shot Temporal Graph Generation
Inside-Out: Hidden Factual Knowledge in LLMs
Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models
Do LLMs Understand Harmfulness?
Session 2 (13:45 - 14:45)
The Distracting Effect: Understanding Irrelevant Passages in RAG
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
Connections Between The Pre-Training Data To Model Representations
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Segment-Based Attention Masking for GPTs
Easy as PIE? Identifying Multi-Word Expressions with LLMs
Déjà Vu? Decoding Repeated Reading from Eye Movements
Multi-Domain Explainability of Preferences
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Reverse-Engineering the Retrieval Process in GenIR Models
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
Estimating Scientific Quality on the Web: A Multilingual LLM Approach
Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs
TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
Cross-lingual Extractive Question Answering with Unanswerable Questions
How Much Pretraining Does Structured Data Need?
CToT: Causal Tree of Thoughts for Inference-Time Compute
Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
Distilling Examples into Task Instructions: Enhanced In-Context Learning for Long B2B Conversations
Can LLMs Help Encoder Models Maintain Both High Accuracy and Consistency in Temporal Relation Classification?
Hyper-RAG
The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Will it Merge? Causes of Model Mergeability
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs with Value-Prompting
Beyond Word Boundaries: A Hebrew Coreference Benchmark for Morphologically Complex Text
Word Pyramid Puzzles as a Multi-lingually Diverse Reasoning Benchmark
Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing
Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering
Factual Retrieval in LLMs Is a Redundant, Non-Contiguous Process
Pixels at BAREC Shared Task 2025: Visual Arabic Readability Assessment
How Should We Evaluate LLM Reasoning Quality For Fact Verification?
Unveiling the spectrum of Arabic offensive language: Taxonomy and insights
Differences in Input and Output Quality of LLMs Across Age Groups
Inferring Functionality of Attention Heads from their Parameters
Break Out the Silverware: Semantic Understanding of Stored Household Items
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
Prompts in the Wild: A Large Analyzed Collection of Prompts in Code
Session 3 (16:40 - 17:40)
Readability Formulas, Systems and LLMs are Poor Predictors of Reading Ease
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
ArTyDi-QA: Question Answering and Question Generation in Arabic
mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
A Holistic Approach towards Vocabulary Expansion for Language Adaptation
Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content
Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs
Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Towards Enforcing Company Policy Adherence in Agentic Workflows
Out-of-Context Reasoning in Large Language Models
SpeLLM: Character-Level Multi-Head Decoding
PaperFinder: a State-of-the-art LLM-based Scientific Search Agent
Dementia Through Different Eyes: Explainable Modeling of Human and LLM Perceptions for Early Awareness
Cooking Up Creativity: Enhancing LLM Creativity through Structured Recombination
Making LVLMs Look Twice: Contrastive Decoding with Contrast Images
Keep Guessing? When Considering Inference Scaling, Mind the Baselines
Letting the Data Speak: Automating Schema Discovery for Research
The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech
More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Where Did That Come From? Sentence-Level Error-Tolerant Attribution
Hexagen: Improving Abstraction Reasoning Through Code Execution
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Differential Mamba
Precise In-Parameter Concept Erasure in Large Language Models
SAEs Are Good for Steering - If You Select the Right Features
Retrieve, Learn, Refine: An Interleaved Retrieval–Learning Agent for Exhaustive IR
Expectation management shifts the representation of unexpectedness
LAQuer: Localized Attribution Queries in Content-grounded Generation
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
Cross-Lingual and Cross-Cultural Variation in Image Descriptions
A Unifying Scheme for Extractive Content Selection Tasks
MRLEval: A Benchmark for LLM Evaluation in Hebrew, Modern Standard Arabic and Levantine Arabic
Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games
ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
SastBench: A Benchmark for Testing Agentic SAST Triage
CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature