Note: the deadline has been extended until May 3. See below for important dates.
With recent scaling of large pre-trained Transformer language models (LLMs), the scope of feasible NLP tasks has broadened. Significant recent work has focused on tasks that require some kind of natural language reasoning. A trajectory in question answering has led us from extraction-oriented datasets like SQuAD to “multi-hop” reasoning datasets like HotpotQA and StrategyQA. Although LLMs have shown remarkable performance on most NLP tasks, it is often unclear why their answers follow from what they know. To address this gap, a new class of explanation techniques has emerged which play an integral part in structuring the reasoning necessary to solve these datasets. For example, the chain-of-thought paradigm leverages explanations as vehicles for LLMs to mimic human reasoning processes. Entailment trees offer a way to ground multi-step reasoning in a collection of verifiable steps. Frameworks like SayCan bridge high-level planning in language with low-level action trajectories. As a result, we see a confluence of methods blending explainable machine learning/NLP, classical AI (especially theorem proving), and cognitive science (how do humans structure explanations?). This workshop aims to bring together a diverse set of perspectives from these different traditions and attempt to establish common ground for how these various kinds of explanation structures can tackle a broad class of reasoning problems in natural language and beyond.
Schedule
08:00 AM | Virtual Poster Session 1 |
---|---|
09:00 AM | Opening Remarks |
09:10 AM | Invited Speaker: Ellie Pavlick Mechanistic Evidence of Structured Reasoning in LLMs [Slides] |
09:50 AM | Invited Speaker: Noah Goodman Reasoning in human and machine intelligence [Slides] |
10:30 AM | Break 1 |
11:00 AM |
Oral Presentation: Li Zhang, Liam Dugan, Hainiu Xu and Chris Callison-Burch Exploring the Curious Case of Code Prompts [Abstract]
Abstract: Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some (but not all) tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.
|
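For readers less familiar with code-style prompting, here is a minimal, runnable sketch contrasting a plain-text prompt with a code-like prompt for one toy question; the templates below are assumptions made for this page, not the prompts used in the paper.

```python
# Illustrative only: the same toy question rendered as a plain-text prompt and
# as a code-style prompt. These templates are assumptions for this page and
# are not taken from the paper.

question = "Tracy has 3 apples and buys 2 more. How many apples does she have?"

text_prompt = (
    "Answer the question.\n"
    f"Question: {question}\n"
    "Answer:"
)

code_prompt = (
    "# Answer the question by completing the function.\n"
    f"# Question: {question}\n"
    "def solve() -> int:\n"
    "    apples_start = 3\n"
    "    apples_bought = 2\n"
    "    return"
)

if __name__ == "__main__":
    print("--- text prompt ---\n" + text_prompt + "\n")
    print("--- code prompt ---\n" + code_prompt)
```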
11:10 AM |
Oral Presentation: Vanya Cohen and Raymond Mooney Using Planning to Improve Semantic Parsing of Instructional Texts [Abstract]
Abstract: We develop a symbolic planning-based decoder to improve the few-shot semantic parsing of instructional texts. The system takes long-form instructional texts as input and produces sequences of actions in a formal language that enable execution of the instructions. This task poses unique challenges since input texts may contain long context dependencies and ambiguous and domain-specific language. Valid semantic parses also require sequences of steps that constitute an executable plan. We build on recent progress in semantic parsing by leveraging large language models to learn parsers from small amounts of training data. During decoding, our method employs planning methods and domain information to rank and correct candidate parses. To validate our method, we evaluate on four domains: two household instruction-following domains and two cooking recipe interpretation domains. We present results for few-shot semantic parsing using leave-one-out cross-validation. We show that utilizing planning domain information improves the quality of generated plans. Through ablations we also explore the effects of our decoder design choices.
|
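To make the idea of planning-guided decoding concrete, the toy sketch below keeps the first candidate action sequence that passes a simple executability check; the domain, action names, and validity rules are invented for illustration and are not the authors' system.

```python
# A self-contained toy sketch of "use planning/domain information to rank and
# correct candidate parses". The domain, actions, and validity rules are
# invented for illustration; they are not the system described in the paper.

from typing import List

KNOWN_ACTIONS = {"take", "put", "open", "close"}

def is_executable(plan: List[str]) -> bool:
    """Toy check: every action must be known, and 'put' requires a held object."""
    held = 0
    for step in plan:
        action = step.split()[0]
        if action not in KNOWN_ACTIONS:
            return False
        if action == "take":
            held += 1
        elif action == "put":
            if held == 0:
                return False
            held -= 1
    return True

def rerank(candidates: List[List[str]]) -> List[str]:
    """Return the first (e.g., highest LM-scored) candidate that is executable."""
    for plan in candidates:
        if is_executable(plan):
            return plan
    return candidates[0]  # fall back to the top-scored parse

if __name__ == "__main__":
    candidates = [
        ["put cup on table", "take cup"],  # invalid: puts down a cup it never took
        ["take cup", "put cup on table"],  # valid plan
    ]
    print(rerank(candidates))  # -> ['take cup', 'put cup on table']
```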
11:20 AM | In-Person Poster Session 1 / Virtual Poster Session 2 (See posters below) |
12:20 PM | Lunch |
1:30 PM | In-Person Poster Session 2 (See posters below) |
2:30 PM |
Oral Presentation: Jinheon Baek, Alham Fikri Aji and Amir Saffari Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering [Abstract]
Abstract: Large Language Models (LLMs) are capable of performing zero-shot closed-book question answering tasks based on the internal knowledge stored in their parameters during pre-training. However, such internalized knowledge might be insufficient or incorrect, which could lead LLMs to generate factually wrong answers. Furthermore, fine-tuning LLMs to update their knowledge is expensive. To this end, we propose to augment the knowledge directly in the input of LLMs. Specifically, we first retrieve the facts relevant to the input question from the knowledge graph based on semantic similarities between the question and its associated facts. We then prepend the retrieved facts to the input question in the form of a prompt, which is forwarded to the LLM to generate the answer. Our framework, Knowledge-Augmented language model PromptING (KAPING), requires no model training and is thus completely zero-shot. We validate our KAPING framework on the knowledge graph question answering task, which aims to answer a user's question based on facts over a knowledge graph; our approach outperforms relevant zero-shot baselines by up to 48% on average across multiple LLMs of various sizes.
|
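The retrieve-then-prepend recipe sketched in this abstract can be pictured with the following self-contained toy example; the word-overlap scorer and prompt wording are assumptions for illustration, not the KAPING implementation (which uses semantic similarity between embeddings).

```python
# Rough sketch of retrieve-then-prepend prompting: pick the facts most relevant
# to the question from a (toy) knowledge graph and place them before the
# question in the prompt. Scorer and wording are illustrative assumptions.

from typing import List, Tuple

Triple = Tuple[str, str, str]

KG: List[Triple] = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Marie Curie", "field", "physics and chemistry"),
    ("Warsaw", "capital of", "Poland"),
]

def relevance(question: str, triple: Triple) -> int:
    """Toy relevance score: count of shared lowercase words."""
    q_words = set(question.lower().replace("?", "").split())
    t_words = set(" ".join(triple).lower().split())
    return len(q_words & t_words)

def build_prompt(question: str, k: int = 2) -> str:
    facts = sorted(KG, key=lambda t: relevance(question, t), reverse=True)[:k]
    fact_lines = "\n".join(f"({s}, {r}, {o})" for s, r, o in facts)
    return (
        "Below are facts that might be relevant to answering the question.\n"
        f"{fact_lines}\n"
        f"Question: {question}\n"
        "Answer:"
    )

if __name__ == "__main__":
    # The resulting prompt would be sent to an LLM as-is; no model training.
    print(build_prompt("Where was Marie Curie born?"))
```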
2:40 PM |
Oral Presentation: Michal Štefánik and Marek Kadlcik Can In-context Learners Learn a Reasoning Concept from Demonstrations? [Abstract]
Abstract: Large language models show an emergent ability to learn a new task from a small number of input-output demonstrations. However, recent work shows that in-context learners largely rely on their pre-trained knowledge, such as the sentiment of the labels, instead of finding new associations in the input. Moreover, commonly used few-shot evaluation settings that rely on a random selection of in-context demonstrations cannot disentangle models' ability to learn a new skill from demonstrations, as most of the randomly selected demonstrations do not present relations informative for prediction beyond exposing the new task distribution.
To disentangle models' in-context learning ability from their memorized knowledge, we introduce a Conceptual few-shot learning method that selects demonstrations sharing a possibly informative concept with the predicted sample. We extract a set of such concepts from annotated explanations and measure how much models benefit from being presented with these concepts in few-shot demonstrations.
We find that smaller models are more sensitive to the presented concepts. While some of the models are able to benefit from concept-presenting demonstrations for each assessed concept, none of the assessed in-context learners benefits from all presented reasoning concepts consistently, leaving in-context concept learning an open challenge.
|
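As a rough picture of concept-based demonstration selection, the toy sketch below prefers demonstrations annotated with the same concept as the predicted sample; the pool, concept labels, and back-off rule are invented for illustration and do not reproduce the paper's method.

```python
# Toy sketch of concept-based demonstration selection: rather than sampling
# in-context demonstrations at random, pick ones annotated with a concept that
# the predicted sample also relies on. Data and labels are invented.

import random
from typing import Dict, List

POOL: List[Dict[str, str]] = [
    {"q": "2 + 3 * 4 = ?",   "a": "14", "concept": "operator precedence"},
    {"q": "5 * (1 + 1) = ?", "a": "10", "concept": "parentheses first"},
    {"q": "7 + 2 * 0 = ?",   "a": "7",  "concept": "operator precedence"},
]

def select_demonstrations(target_concept: str, k: int = 2) -> List[Dict[str, str]]:
    sharing = [d for d in POOL if d["concept"] == target_concept]
    if len(sharing) >= k:
        return sharing[:k]
    # Back off to random demonstrations when too few share the concept.
    others = [d for d in POOL if d["concept"] != target_concept]
    return sharing + random.sample(others, k - len(sharing))

if __name__ == "__main__":
    for demo in select_demonstrations("operator precedence"):
        print(f"Q: {demo['q']}  A: {demo['a']}")
```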
2:50 PM |
Invited Speaker: Peter Clark The role of NL reasoning in the age of GPT [Slides] [Abstract] [Speaker Bio]
Abstract: While the performance of new LLMs is stunning, it remains unclear how (or even if) an answer follows from their latent "beliefs" about the world, or whether an LLM even has a coherent internal belief system. In this talk I'll describe recent work we have done to probe a model's beliefs, construct interpretable representations of how the model's answers systematically follow from them, and how a broader system can identify and repair inconsistencies that may exist among those beliefs. More generally, I'll promote architectures in which interpretable, systematic NL reasoning and LLM-style reasoning co-exist in a broader system, allowing both styles of reasoning to inform each other, and paving the way for more interactive systems where users can probe, argue with, learn from, and teach our future companions.
Peter Clark is a Senior Director and the interim CEO at the Allen Institute for AI (AI2), and leads the Aristo Project. His work focuses on natural language processing, machine reasoning, and world knowledge, and the interplay between these three areas.
|
3:30 PM | Break 2 |
4:00 PM |
Invited Speaker: Denny Zhou Teach Language Models to Reason [Slides] [Abstract] [Speaker Bio]
Over the past decades, the machine learning community has developed a multitude of data-driven techniques aimed at enhancing learning efficiency. These include semi-supervised learning, meta learning, active learning, transfer learning, and more. However, none of these techniques have proven to be highly effective for real-world natural language processing tasks. This shortcoming uncovers a fundamental flaw in machine learning: the absence of reasoning. Humans often learn from just a few examples because of their capacity to reason, as opposed to relying on data statistics. In this talk, I will discuss the large language model (LLM) reasoning work that we pioneered and show that the techniques we developed can greatly narrow the gap between human intelligence and machine learning, crushing the SoTA in the literature while demanding only a few annotated examples and no training. Our work was showcased at Google I/O 2022 by Google CEO Sundar Pichai.
Denny Zhou is a principal scientist and research director at Google DeepMind, where he is the founder and current lead of the Reasoning Team. His primary research interest is building and teaching large language models (LLMs) with the ambitious goal of attaining human-level reasoning capabilities within these models. His team at Google has developed chain-of-thought prompting, self-consistency decoding, least-to-most prompting, instruction tuning (FLAN2), LLM self-debugging, and various investigations of emergent properties of LLMs. He won the Google Research Tech Impact Award in 2022.
|
4:40 PM |
Invited Speaker: Sarah Wiegreffe Two Views of Language Model Interpretability [Slides] [Abstract] [Speaker Bio]
When generating text from language models (LMs), many prompting methods strive to explain LM behavior by eliciting specifically structured outputs (e.g., chain-of-thought prompting). Relatedly, querying a model with specially designed inputs and observing output behavior is a longstanding and popular method in the NLP interpreter’s toolbox. Prompting and querying approaches explain how LMs operate at a high level (in natural language) without attributing behaviors to any specific components of the network. A separate line of work has investigated attributing or attempting to reconstruct model behaviors at the level of model parameters or hidden representations, generally at a small scale. While these two techniques often seem at odds in terms of their stated aims, they have collectively driven major progress in our understanding of LMs over the past two years. In this talk, I will give examples of both of these approaches, highlight their similarities and differences, and discuss paths forward that leverage their combined strengths.
Sarah Wiegreffe is a Young Investigator (postdoc) at the Allen Institute for AI, where she is a member of the Aristo team. She also holds a courtesy appointment in the Allen School at the University of Washington. Her research interests encompass the interpretability and explainability of NLP models, with a focus on the faithfulness of generated text to internal LM prediction mechanisms and the utility of model-generated textual explanations to humans. She received her PhD in 2022 from Georgia Tech, advised by Mark Riedl. She also received an M.S. in Computer Science (2020) and a B.S. in Data Science (2017) from Georgia Tech and the College of Charleston, respectively. Outside of work, she enjoys rock climbing, cooking, and exploring Seattle.
|
Speakers
Peter Clark, Allen Institute for AI
Ellie Pavlick, Brown University
Denny Zhou, Google AI
Noah Goodman, Stanford University
Sarah Wiegreffe, Allen Institute for AI
Accepted Papers
Note: 2 additional papers were accepted but are not listed here because of an anonymity period.
Virtual Poster Session 1
- Case-Based Reasoning with Language Models for Classification of Logical Fallacies [Non-Archival]
- Choice-75: A Dataset on Decision Branching in Script Learning [Non-Archival]
- Distinguish Before Answer: Generating Contrastive Explanation as Knowledge for Commonsense Question Answering [ACL Findings]
- IDOL: Indicator-oriented Logic Pre-training for Logical Reasoning [ACL Findings]
- Investigating Transformer-Guided Chaining for Interpretable Natural Logic Reasoning [ACL Findings]
- SConE: Simplified Cone Embeddings with Symbolic Operators for Complex Logical Queries [ACL Findings]
- Segment-Level and Category-Oriented Network for Knowledge-Based Referring Expression Comprehension [ACL Findings]
- Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study [Non-Archival]
- Neural-symbolic Contrastive Learning for Cross-domain Inference [Non-Archival]
Virtual Poster Session 2
- Grounded physical language understanding with probabilistic programs and simulated worlds [Non-Archival]
- Hierarchical Prompting Assists Large Language Model on Web Navigation [Non-Archival]
- Interpretable Multimodal Misinformation Detection with Logic Reasoning [ACL Findings]
- Logical Reasoning over Natural Language as Knowledge Representation: A Survey [Non-Archival]
- Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference [Archival]
- Tab-CoT: Zero-shot Tabular Chain of Thought [ACL Findings]
- Teaching Large Language Models to Self-Debug [Non-Archival]
- Using Planning to Improve Semantic Parsing of Instructional Texts [Oral, Archival]
- Few Shot Rationale Generation using Self-Training with Dual Teachers [ACL Findings]
- Explanation Regeneration via Information Bottleneck [ACL Findings]
- Reasoning Circuits: Few-shot Multi-hop Question Generation with Structured Rationales [Archival]
In-Person Poster Session 1
- A smashed glass cannot be full: Generation of Commonsense Explanations through Prompt-based Few-shot Learning [Archival]
- Answering Questions by Meta-Reasoning over Multiple Chains of Thought [Non-Archival]
- Causal Reasoning of Entities and Events in Procedural Texts [Non-Archival]
- DREAM: Improving Situational QA by First Elaborating the Situation [Non-Archival]
- Designing harder benchmarks for evaluating zero-shot generalizability in Question Answering over Knowledge Bases [Non-Archival]
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [ACL Findings]
- Effect Graph: Effect Relation Extraction for Explanation Generation [Archival]
- Evaluating statistical language models as pragmatic reasoners [Non-Archival]
- Examining the Emergence of Deductive Reasoning in Generative Language Models [Non-Archival]
- Hypothetical Training for Robust Machine Reading Comprehension of Tabular Context [ACL Findings]
- I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors [ACL Findings]
- Interpretable Math Word Problem Solution Generation Via Step-by-step Planning [Non-Archival]
- Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering [Oral, Archival]
- Learning to Perform Complex Tasks through Compositional Fine-Tuning of Language Models [Non-Archival]
- PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales [Non-Archival]
- Reasoning in Large Language Models Through Symbolic Math Word Problems [ACL Findings]
- Reasoning with Language Model Prompting: A Survey [Non-Archival]
- Recursion of Thought: A Divide-and-Conquer Approach to Multi-Context Reasoning with Language Models [ACL Findings]
- Reimagining Retrieval Augmented Language Models for Answering Queries [ACL Findings]
- STREET: A Multi-Task Structured Reasoning and Explanation Benchmark [Non-Archival]
- The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning [Non-Archival]
- The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code [Non-Archival]
- The Role of Semantic Parsing in Understanding Procedural Text [Non-Archival]
- QAMPARI: A Benchmark for Open-domain Questions with Many Answers [Non-Archival]
In-Person Poster Session 2
- Beyond Vertical Thinking: Exploring and Quantifying Lateral Thinking in Pretrained Language Models [Non-Archival]
- Can In-context Learners Learn a Reasoning Concept from Demonstrations? [Oral, Archival]
- Claim-Dissector: An Interpretable Fact-Checking System with Joint Re-ranking and Veracity Prediction [ACL Findings]
- Complementary Explanations for Effective In-Context Learning [Non-Archival]
- Deductive Additivity for Planning of Natural Language Proofs [Archival]
- Distilling Reasoning Capabilities into Smaller Language Model [ACL Findings]
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [Non-Archival]
- Explaining Competitive-Level Programming Solutions using LLMs [Non-Archival]
- Exploring the Curious Case of Code Prompts [Oral, Archival]
- Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks [ACL Findings]
- Faithful Chain-of-Thought Reasoning [Non-Archival]
- Foveate, Attribute, and Rationalize: Towards Safe and Trustworthy AI [ACL Findings]
- Generative Multi-hop Retrieval [Non-Archival]
- HeGeL: A Novel Dataset for Hebrew Geo-Location [ACL Findings]
- How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension [ACL Findings]
- Knowledge Graph-augmented Language Models for Complex Question Answering [Archival]
- LaSQuE: Improved Zero-Shot Classification from Explanations Through Quantifier Modeling and Curriculum Learning [ACL Findings]
- Let's Sample Step-by-Step: Adaptive-Consistency for Efficient Reasoning with LLMs [Non-Archival]
- SCOTT: Self-Consistent Chain-of-Thought Distillation [Non-Archival]
- Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods [Archival]
- Situated Natural Language Explanations [Non-Archival]
- Negated Complementary Commonsense using Large Language Models [Non-Archival]
- Towards Reasoning in Large Language Models: Survey, Implication, and Reflection [Non-Archival]
- OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models [Archival]
Important Dates
Paper Submission Deadline | May 3, 2023 (extended) |
Decision Notifications | |
Camera Ready Paper Deadline | |
Workshop Date | Thursday, 13 July 2023 |
Organizers
University of Texas, Austin
Allen Institute for AI
Google Brain
University of Arizona
Northwestern University
MIT
Call for Papers
We welcome submissions on all topics related to natural language reasoning or structured explanations, which might include:
- Multi-step natural language reasoning;
- Structured explanations;
- Foundations of natural language reasoning;
- Applications of natural language reasoning;
- Knowledge retrieval for multi-step reasoning;
- Reasoning as programs.
With recent scaling of large pre-trained Transformer language models (LLMs), the scope of feasible NLP tasks has broadened, including tasks requiring increasingly complex reasoning. Although LLMs have shown remarkable performance, it is still unclear how to best elicit this reasoning and how the answers that models give follow from what they "know." This workshop aims to bring together a diverse set of perspectives and attempt to establish common ground for how various kinds of explanation structures can tackle a broad class of reasoning problems in natural language and beyond. As such, the workshop welcomes and covers a wide range of topics, including (non-exclusively):
- Multi-step natural language reasoning: Solving reasoning problems, such as those involving abstract manipulations, has been a long-standing challenge in the field of artificial intelligence. Large language models have recently achieved new state-of-the-art performance on many reasoning benchmarks, often with approaches that require only prompting. Current research frontiers are exploring what kinds of explanation formats are most effective, how reasoning is most effectively broken down, how to get language models to plan their reasoning, and what resources can be used to improve the reasoning capabilities of language models. Tasks include mathematical reasoning, logical reasoning, commonsense reasoning, and more.
- Structured explanations: Explanations for these complex tasks are typically composed of two or more facts that are used to help the reasoning process while also providing a record of the path taken to arrive at an inference. What representations can best be used by inference algorithms to construct large explanations? Frontiers of research include search algorithms over such representations, how to represent annotations at scale, and continual learning models.
- Foundations of natural language reasoning: Does the structured reasoning constitute a plausible (interpretable to humans) and faithful (true to the model's processes) explanation? Does perturbing the reasoning lead to correctly modified behavior?
- Applications of natural language reasoning: New QA settings, language grounding, explainable diagnosis systems, theorem provers using natural language, reasoning for scientific discovery, and more.
- Knowledge retrieval for multi-step reasoning: It has been shown that LLMs can store factual knowledge implicitly in their parameters; however, their ability to access and manipulate that knowledge is still limited. Future avenues of research include effective methods for combining parametric and non-parametric knowledge for complex reasoning, conditioning retrieval on intermediate reasoning context, and retrieving better provenance for structured explanations.
- Reasoning as programs: Another body of work within computational cognitive science and AI has formalized reasoning as inference over programs, building on classical views of human reasoning in a symbol-like language of thought and on linguistic semantics with logical languages. Using language models of code to produce structured reasoning for commonsense problems, and other similar approaches, are all in scope here (see the short sketch following this list).
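As a concrete, deliberately simplified picture of reasoning as programs, the sketch below treats a model's output as a small program whose execution produces the final answer; the "model output" is hard-coded here for illustration only.

```python
# Deliberately simplified sketch of "reasoning as programs": the model emits a
# small program rather than a free-text answer, and executing the program
# yields the result. The "model output" below is hard-coded for illustration.

model_output = """
eggs_total = 16
eggs_eaten = 3
eggs_baked = 4
eggs_sold = eggs_total - eggs_eaten - eggs_baked
price_per_egg = 2
answer = eggs_sold * price_per_egg
"""

namespace: dict = {}
exec(model_output, namespace)   # running the generated program is the reasoning step
print(namespace["answer"])      # -> 18
```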
Submission Guidelines
We welcome two types of papers: regular workshop papers and non-archival submissions. Only regular workshop papers will be included in the workshop proceedings. All submissions should be in PDF format and made through Softconf. In line with the ACL main conference policy, camera-ready versions of papers will be given one additional page of content.
- Regular workshop papers: Authors should submit a paper up to 8 pages (both short and long papers are welcome), with unlimited pages for references, following the ACL 2023 formatting requirements. The reported research should be substantially original. All submissions will be reviewed in a single track, regardless of length. Accepted papers will be presented as posters by default, and best papers may be given the opportunity for a brief talk to introduce their work. Reviewing will be double-blind, and thus no author information should be included in the papers; self-reference that identifies the authors should be avoided or anonymized. Accepted papers will appear in the workshop proceedings.
- Non-archival submissions: We also solicit cross-submissions, i.e., papers on relevant topics that have appeared in other venues (e.g., workshop or conference papers at NLP, ML, or cognitive science venues, among others). Accepted papers will be presented at the workshop, with an indication of original venue, but will not be included in the workshop proceedings. Cross-submissions are ideal for related work which would benefit from exposure to the NLReasoning audience. Interested authors should submit their papers in PDF format through the NLReasoning Softconf website, with a note on the original venue. They will be reviewed in a single-blind fashion. Papers in this category do not need to follow the ACL format, and the submission length is determined by the original venue. The paper selection will be solely determined by the organizing committee.
In addition, we welcome papers on relevant topics that are under review or to be submitted to other venues (including the ACL 2023 main conference). These papers must follow the regular workshop paper format and will not be included in the workshop proceedings. Papers in this category will be reviewed by workshop reviewers.
Note to authors: When you submit your paper through Softconf (here), please select the appropriate “Submission Type” based on the guidelines above.
For questions about the submission guidelines, please contact workshop organizers via nl-reasoning@googlegroups.com.