Posters
ISCOL 2025 • December 18th, 2025
Session 1 (10:15 - 11:15)
LCHAIM - Investigating Long Context Reasoning in Hebrew
NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness in LLMs
Using Natural Language Inference and Inferentialist Theory to Assess Meaning Similarity in Text Generation
Not Just a Piece of Cake: Cross-Lingual Fine-Tuning for Idiom Identification
3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model
Aligning What LLMs Do and Say: Towards Self-Consistent Explanations
ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
Comparing human and language models sentence processing difficulties on complex structures
Uncovering Measurement Biases in LLM Embedding Spaces: The Anna Karenina Principle and Its Implications for Automated Feedback
Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark
User-Centric Evidence Ranking for Attribution and Fact Verification
Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Probing Subphonemes in Morphology Models
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
Not Your Typical Sycophant: Evaluating Sycophancy of Large Language Models
JuStRank: Benchmarking LLM Judges for System Ranking
Automatic biblical authorship attribution
A Survey on Evaluation of LLM-based Agents
Detecting (Un)answerability in Large Language Models with Linear Directions
Don’t lie to your friends: Learning what you know from collaborative self-play
REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space
Effective Red-Teaming of Policy-Adherent Agents
Integrating Morphological Structure into Word Embedding Representations
Detecting Conspiracies in Hebrew Twitter with LLM-GNN Fusion
Localizing Factual Inconsistencies in Attributable Text Generation
MINT: Meaning Integrating Tokenizer
d-chi Stencil: A Differential Privacy Mechanism for Interacting with LLMs
IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs
CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Grade: Quantifying sample diversity in text-to-image models
Decoding Reading Goals from Eye Movements
Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise
Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Large Temporal Models: Unlocking Temporal Understanding in LLMs for Temporal Relation Classification
Unveiling the spectrum of Arabic offensive language: Taxonomy and insights
LAQuer: Localized Attribution Queries in Content-grounded Generation
TwoHillsLab: A Scalable Platform for Quantitative Analysis of Biblical Hebrew
DRAGged into CONFLICTS: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs
Beyond Pairwise: Global Zero-shot Temporal Graph Generation
Inside-Out: Hidden Factual Knowledge in LLMs
Session 2 (13:45 - 14:45)
The Distracting Effect: Understanding Irrelevant Passages in RAG
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
Connections Between The Pre-Training Data To Model Representations
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Segment-Based Attention Masking for GPTs
Easy as PIE? Identifying Multi-Word Expressions with LLMs
Déjà Vu? Decoding Repeated Reading from Eye Movements
Multi-Domain Explainability of Preferences
ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Reverse-Engineering the Retrieval Process in GenIR Models
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
Estimating Scientific Quality on the Web: A Multilingual LLM Approach
Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs
TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
Cross-lingual Extractive Question Answering with Unanswerable Questions
How Much Pretraining Does Structured Data Need?
CToT: Causal Tree of Thoughts for Inference-Time Compute
Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
Do LLMs Understand Harmfulness?
Distilling Examples into Task Instructions: Enhanced In-Context Learning for Long B2B Conversations
Can LLMs Help Encoder Models Maintain Both High Accuracy and Consistency in Temporal Relation Classification?
Hyper-RAG
The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Will it Merge? Causes of Model Mergeability
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs with Value-Prompting
CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature
Beyond Word Boundaries: A Hebrew Coreference Benchmark for Morphologically Complex Text
Word Pyramid Puzzles as a Multi-lingually Diverse Reasoning Benchmark
Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing
Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering
Factual Retrieval in LLMs Is a Redundant, Non-Contiguous Process
Pixels at BAREC Shared Task 2025: Visual Arabic Readability Assessment
How Should We Evaluate LLM Reasoning Quality For Fact Verification?
Fine-Grained Arabic Offensive Language Classification with Taxonomy, Sentiment, and Emotions
Differences in Input and Output Quality of LLMs Across Age Groups
Inferring Functionality of Attention Heads from their Parameters
Break Out the Silverware: Semantic Understanding of Stored Household Items
Session 3 (16:40 - 17:40)
Readability Formulas, Systems and LLMs are Poor Predictors of Reading Ease
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
ArTyDi-QA: Question Answering and Question Generation in Arabic
mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
A Holistic Approach towards Vocabulary Expansion for Language Adaptation
Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content
Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs
Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Towards Enforcing Company Policy Adherence in Agentic Workflows
Out-of-Context Reasoning in Large Language Models
SpeLLM: Character-Level Multi-Head Decoding
PaperFinder: a State-of-the-art LLM-based Scientific Search Agent
Dementia Through Different Eyes: Explainable Modeling of Human and LLM Perceptions for Early Awareness
Cooking Up Creativity: Enhancing LLM Creativity through Structured Recombination
Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models
Making LVLMs Look Twice: Contrastive Decoding with Contrast Images
Keep Guessing? When Considering Inference Scaling, Mind the Baselines
Letting the Data Speak: Automating Schema Discovery for Research
The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech
More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Where Did That Come From? Sentence-Level Error-Tolerant Attribution
Hexagen: Improving Abstraction Reasoning Through Code Execution
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Differential Mamba
Precise In-Parameter Concept Erasure in Large Language Models
SAEs Are Good for Steering - If You Select the Right Features
Retrieve, Learn, Refine: An Interleaved Retrieval–Learning Agent for Exhaustive IR
Expectation management shifts the representation of unexpectedness
Prompts in the Wild: A Large Analyzed Collection of Prompts in Code
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
Cross-Lingual and Cross-Cultural Variation in Image Descriptions
A Unifying Scheme for Extractive Content Selection Tasks
MRLEval: A Benchmark for LLM Evaluation in Hebrew, Modern Standard Arabic and Levantine Arabic
DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games
ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
SastBench: A Benchmark for Testing Agentic SAST Triage
Semi-synthetic parallel data for translation quality estimation: A case study of English–Hebrew