Natural Language Reasoning and Structured Explanations Workshop

July 13 @ ACL 2023. Toronto, Canada.

Note: the deadline has been extended until May 3. See below for important dates.

With recent scaling of large pre-trained Transformer language models (LLMs), the scope of feasible NLP tasks has broadened. Significant recent work has focused on tasks that require some kind of natural language reasoning. A trajectory in question answering has led us from extraction-oriented datasets like SQuAD to “multi-hop” reasoning datasets like HotpotQA and StrategyQA. Although LLMs have shown remarkable performance on most NLP tasks, it is often unclear why their answers follow from what they know. To address this gap, a new class of explanation techniques has emerged that plays an integral part in structuring the reasoning necessary to solve these datasets. For example, the chain-of-thought paradigm leverages explanations as vehicles for LLMs to mimic human reasoning processes. Entailment trees offer a way to ground multi-step reasoning in a collection of verifiable steps. Frameworks like SayCan bridge high-level planning in language with low-level action trajectories. As a result, we see a confluence of methods blending explainable machine learning/NLP, classical AI (especially theorem proving), and cognitive science (how do humans structure explanations?). This workshop aims to bring together a diverse set of perspectives from these different traditions and attempt to establish common ground for how these various kinds of explanation structures can tackle a broad class of reasoning problems in natural language and beyond.

Schedule

08:00 AM Virtual Poster Session 1
09:00 AM Opening Remarks
09:10 AM Invited Speaker: Ellie Pavlick
Mechanistic Evidence of Structured Reasoning in LLMs [Slides]
09:50 AM Invited Speaker: Noah Goodman
Reasoning in human and machine intelligence [Slides]
10:30 AM Break 1
11:00 AM Oral Presentation: Li Zhang, Liam Dugan, Hainiu Xu and Chris Callison-Burch
Exploring the Curious Case of Code Prompts [Abstract]
Abstract: Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some (but not all) tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.
11:10 AM Oral Presentation: Vanya Cohen and Raymond Mooney
Using Planning to Improve Semantic Parsing of Instructional Texts [Abstract]
Abstract: We develop a symbolic planning-based decoder to improve the few-shot semantic parsing of instructional texts. The system takes long-form instructional texts as input and produces sequences of actions in a formal language that enable execution of the instructions. This task poses unique challenges since input texts may contain long context dependencies and ambiguous and domain-specific language. Valid semantic parses also require sequences of steps that constitute an executable plan. We build on recent progress in semantic parsing by leveraging large language models to learn parsers from small amounts of training data. During decoding, our method employs planning methods and domain information to rank and correct candidate parses. To validate our method, we evaluate on four domains: two household instruction-following domains and two cooking recipe interpretation domains. We present results for few-shot semantic parsing using leave-one-out cross-validation. We show that utilizing planning domain information improves the quality of generated plans. Through ablations we also explore the effects of our decoder design choices.
11:20 AM In-Person Poster Session 1 / Virtual Poster Session 2 (See posters below)
12:20 PM Lunch
01:30 PM In-Person Poster Session 2 (See posters below)
02:30 PM Oral Presentation: Jinheon Baek, Alham Fikri Aji and Amir Saffari
Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering [Abstract]
Abstract: Large Language Models (LLMs) are capable of performing zero-shot closed-book question answering tasks, based on their internal knowledge stored in parameters during pre-training. However, such internalized knowledge might be insufficient and incorrect, which could lead LLMs to generate factually wrong answers. Furthermore, fine-tuning LLMs to update their knowledge is expensive. To this end, we propose to augment the knowledge directly in the input of LLMs. Specifically, we first retrieve the relevant facts to the input question from the knowledge graph based on semantic similarities between the question and its associated facts. After that, we prepend the retrieved facts to the input question in the form of the prompt, which is then forwarded to LLMs to generate the answer. Our framework, Knowledge-Augmented language model PromptING (KAPING), requires no model training, thus completely zero-shot. We validate the performance of our KAPING framework on the knowledge graph question answering task, that aims to answer the user's question based on facts over a knowledge graph, on which ours outperforms relevant zero-shot baselines by up to 48% in average, across multiple LLMs of various sizes.
02:40 PM Oral Presentation: Michal Štefánik and Marek Kadlcik
Can In-context Learners Learn a Reasoning Concept from Demonstrations? [Abstract]
Abstract: Large language models show an emergent ability to learn a new task from a small number of input-output demonstrations. However, recent work shows that in-context learners largely rely on their pre-trained knowledge, such as the sentiment of the labels, instead of finding new associations in the input. However, the commonly-used few-shot evaluation settings using a random selection of in-context demonstrations can not disentangle models' ability to learn a new skill from demonstrations, as most of the randomly-selected demonstrations do not present relations informative for prediction beyond exposing the new task distribution. To disentangle models' in-context learning ability independent of models' memory, we introduce a Conceptual few-shot learning method selecting the demonstrations sharing a possibly-informative concept with the predicted sample. We extract a set of such concepts from annotated explanations and measure how much can models benefit from presenting these concepts in few-shot demonstrations. We find that smaller models are more sensitive to the presented concepts. While some of the models are able to benefit from concept-presenting demonstrations for each assessed concept, we find that none of the assessed in-context learners can benefit from all presented reasoning concepts consistently, leaving the in-context concept learning an open challenge.
02:50 PM Invited Speaker: Peter Clark
The role of NL reasoning in the age of GPT [Slides] [Abstract] [Speaker Bio]
Abstract: While the performance of new LLMs is stunning, it remains unclear how (or even if) an answer follows from their latent "beliefs" about the world, or whether an LLM even has a coherent internal belief system. In this talk I'll describe recent work we have done to probe a model's beliefs, construct interpretable representations of how the model's answers systematically follow from them, and how a broader system can identify and repair inconsistencies that may exist among those beliefs. More generally, I'll promote architectures in which interpretable, systematic NL reasoning and LLM-style reasoning co-exist in a broader system, allowing both styles of reasoning to inform each other, and paving the way for more interactive systems where users can probe, argue with, learn from, and teach our future companions.
Peter Clark is a Senior Director and the interim CEO at the Allen Institute for AI (AI2), and leads the Aristo Project. His work focuses on natural language processing, machine reasoning, and world knowledge, and the interplay between these three areas.
03:30 PM Break 2
04:00 PM Invited Speaker: Denny Zhou
Teach Language Models to Reason [Slides] [Abstract] [Speaker Bio]
Over the past decades, the machine learning community has developed a multitude of data-driven techniques aimed at enhancing learning efficiency. These include semi-supervised learning, meta learning, active learning, transfer learning, and more. However, none of these techniques have proven to be highly effective for real-world natural language processing tasks. This shortcoming uncovers a fundamental flaw in machine learning - the absence of reasoning. Humans often learn from just a few examples because of their capacity to reason, as opposed to relying on data statistics. In this talk, I will talk about the large language models (LLM) reasoning work that we pioneered, and show that the techniques we developed can greatly narrow the gap between human intelligence and machine learning: crushed SoTA in the literature while demanding only a few annotated examples and no training. Our work was showcased at Google I/O 2022 by Google CEO Sundar Pichai.
Denny Zhou is a principal scientist / research director at Google DeepMind, where he is the founder and current lead of the Reasoning Team. His primary research interest is building and teaching large language models (LLMs) with an ambitious goal of attaining human-level reasoning capabilities within these models. His team at Google has developed chain-of-thought prompting, self-consistency decoding, least-to-most prompting, instruction tuning (FLAN2), LLM self-debugging, and various investigations of emergent properties of LLMs. He won the Google Research Tech Impact Award in 2022.
04:40 PM Invited Speaker: Sarah Wiegreffe
Two Views of Language Model Interpretability [Slides] [Abstract] [Speaker Bio]
When generating text from language models (LMs), many prompting methods strive to explain LM behavior by eliciting specifically-structured outputs (e.g., chain-of-thought prompting). Relatedly, querying a model with specially-designed inputs and observing output behavior is a longstanding and popular method in the NLP interpreter’s toolbox. Prompting and querying approaches explain how LMs operate at a high level (in natural language) without attributing behaviors to any specific components of the network. A separate line of work has investigated attributing or attempting to reconstruct model behaviors at the model-parameter or hidden-representation level, generally at a small scale. While these two techniques often seem at odds in terms of their stated aims, they collectively inform a large progression in our understanding of LMs in the past 2 years. In this talk, I will give examples of both of these approaches, highlight their similarities and differences, and discuss paths forward that leverage their combined strengths.
Sarah Wiegreffe is a Young Investigator (postdoc) at the Allen Institute for AI, where she is a member of the Aristo team. She also holds a courtesy appointment in the Allen School at the University of Washington. Her research interests encompass interpretability and explainability of NLP models, with a focus on the faithfulness of generated text to internal LM prediction mechanisms and the utility of model-generated textual explanations to humans. She received her PhD in 2022 from Georgia Tech, advised by Mark Riedl. She also received an M.S. in Computer Science (2020) and B.S. in Data Science (2017) from Georgia Tech and the College of Charleston, respectively. Outside of work, she enjoys rock climbing, cooking, and exploring Seattle.

Invited Speakers

Peter Clark
Allen Institute for AI
Ellie Pavlick
Brown University
Denny Zhou
Google AI
Noah Goodman
Stanford University
Sarah Wiegreffe
Allen Institute for AI

Accepted Papers

Note: 2 additional papers were accepted but are not listed here because of an anonymity period.

Virtual Poster Session 1

  • Case-Based Reasoning with Language Models for Classification of Logical Fallacies
    Zhivar Sourati, Filip Ilievski, Hông-Ân Sandlin and Alain Mermoud
  • Choice-75: A Dataset on Decision Branching in Script Learning
    Zhaoyi Hou, Li Zhang and Chris Callison-Burch
  • Distinguish Before Answer: Generating Contrastive Explanation as Knowledge for Commonsense Question Answering
    Qianglong Chen, Guohai Xu, Mingshi Yan, J. Zhang, Fei Huang, Luo Si, Yin Zhang
    [ACL Findings]
  • IDOL: Indicator-oriented Logic Pre-training for Logical Reasoning
    Zihang Xu, Ziqing Yang, Yiming Cui, Shijin Wang
    [ACL Findings]
  • Investigating Transformer-Guided Chaining for Interpretable Natural Logic Reasoning
    Kanagasabai Rajaraman, Saravanan Rajamanickam, Wei Shi
    [ACL Findings]
  • SConE: Simplified Cone Embeddings with Symbolic Operators for Complex Logical Queries
    Chau Nguyen, Tim French, Wei Liu, Michael Stewart
    [ACL Findings]
  • Segment-Level and Category-Oriented Network for Knowledge-Based Referring Expression Comprehension
    Yuqi Bu, Xin Wu, Liuwu Li, Yi Cai, Qingbao Huang, Qiong Liu
    [ACL Findings]
  • Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
    Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar and Bryan Catanzaro
  • Neural-symbolic Contrastive Learning for Cross-domain Inference
    Mingyue Liu, Jialin Yu, Hao Cui, Sara Uckelman and Yang Long

Virtual Poster Session 2

  • Grounded physical language understanding with probabilistic programs and simulated worlds
    Cedegao Zhang, Lionel Wong, Gabriel Grand and Josh Tenenbaum
  • Hierarchical Prompting Assists Large Language Model on Web Navigation
    Chi-Fan Lo, Abishek Sridhar, Hao Zhu, Frank F. Xu and Shuyan Zhou
  • Interpretable Multimodal Misinformation Detection with Logic Reasoning
    Hui Liu, Wenya Wang, Haoliang Li
    [ACL Findings]
  • Logical Reasoning over Natural Language as Knowledge Representation: A Survey
    Zonglin Yang, Xinya Du, Rui Mao, Jinjie Ni and Erik Cambria
  • Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference
    Robert Vacareanu, Eduardo Blanco, Haris Riaz and Mihai Surdeanu
  • Tab-CoT: Zero-shot Tabular Chain of Thought
    Jin Ziqi, Wei Lu
    [ACL Findings]
  • Teaching Large Language Models to Self-Debug
    Xinyun Chen, Maxwell Lin, Nathanael Schaerli and Denny Zhou
  • Using Planning to Improve Semantic Parsing of Instructional Texts
    Vanya Cohen and Raymond Mooney
    [Oral, Archival]
  • Few Shot Rationale Generation using Self-Training with Dual Teachers
    Aditya Srikanth Veerubhotla, Lahari Poddar, Jun Yin, György Szarvas, Sharanya Eswaran
    [ACL Findings]
  • Explanation Regeneration via Information Bottleneck
    Qintong Li, Zhiyong Wu, Lingpeng Kong, Wei Bi
    [ACL Findings]
  • Reasoning Circuits: Few-shot Multi-hop Question Generation with Structured Rationales
    Saurabh Kulshreshtha and Anna Rumshisky

In-Person Poster Session 1

  • A smashed glass cannot be full: Generation of Commonsense Explanations through Prompt-based Few-shot Learning
    Andrea Zaninello and Bernardo Magnini
  • Answering Questions by Meta-Reasoning over Multiple Chains of Thought
    Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch and Jonathan Berant
  • Causal Reasoning of Entities and Events in Procedural Texts
    Li Zhang, Hainiu Xu, Yue Yang, Shuyan Zhou, Weiqiu You, Manni Arora and Chris Callison-Burch
  • DREAM: Improving Situational QA by First Elaborating the Situation
    Yuling Gu, Bhavana Dalvi Mishra and Peter Clark
  • Designing harder benchmarks for evaluating zero-shot generalizability in Question Answering over Knowledge Bases
    Ritam Dutt, Sopan Khosla, Vinayshekhar Bannihatti Kumar and Rashmi Gangadharaiah
  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
    [ACL Findings]
  • Effect Graph: Effect Relation Extraction for Explanation Generation
    Jonathan Kobbe, Ioana Hulpuș and Heiner Stuckenschmidt
  • Evaluating statistical language models as pragmatic reasoners
    Benjamin Lipkin, Lionel Wong, Gabriel Grand and Josh Tenenbaum
  • Examining the Emergence of Deductive Reasoning in Generative Language Models
    Peter Belcak, Luca Lanzendörfer and Roger Wattenhofer
  • Hypothetical Training for Robust Machine Reading Comprehension of Tabular Context
    Moxin Li, Wenjie Wang, Fuli Feng, Hanwang Zhang, Qifan Wang, Tat-Seng Chua
    [ACL Findings]
  • I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
    Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, Marianna Apidianaki, Smaranda Muresan
    [ACL Findings]
  • Interpretable Math Word Problem Solution Generation Via Step-by-step Planning
    Mengxue Zhang, Zichao Wang, Zhichao Yang, Weiqi Feng and Andrew Lan
  • Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering
    Jinheon Baek, Alham Fikri Aji and Amir Saffari
    [Oral, Archival]
  • Learning to Perform Complex Tasks through Compositional Fine-Tuning of Language Models
    Victor Bursztyn, David Demeter, Doug Downey and Larry Birnbaum
  • PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales
    Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen and Xiang Ren
  • Reasoning in Large Language Models Through Symbolic Math Word Problems
    Vedant Gaur, Nikunj Saunshi
    [ACL Findings]
  • Reasoning with Language Model Prompting: A Survey
    Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang and Huajun Chen
  • Recursion of Thought: A Divide-and-Conquer Approach to Multi-Context Reasoning with Language Models
    Soochan Lee, Gunhee Kim
    [ACL Findings]
  • Reimagining Retrieval Augmented Language Models for Answering Queries
    Wang-Chiew Tan, Yuliang Li, Pedro Rodriguez, Richard James, Xi Victoria Lin, Alon Halevy, Wen-tau Yih
    [ACL Findings]
  • STREET: A Multi-Task Structured Reasoning and Explanation Benchmark
    Danilo Neves Ribeiro, Shen Wang, Xiaofei Ma, Henghui Zhu, Rui Dong, Deguang Kong, Juliette Burger, Anjelica Ramos, William Yang Wang, Zhiheng Huang, George Karypis, Bing Xiang and Dan Roth
  • The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning
    Hanlin Zhang, Yi-Fan Zhang, Li Erran Li, Eric Xing
  • The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code
    Xiao Liu, Da Yin, Chen Zhang, Yansong Feng and Dongyan Zhao
  • The Role of Semantic Parsing in Understanding Procedural Text
    Hossein Rajaby Faghihi, Parisa Kordjamshidi, Choh Man Teng and James Allen
  • QAMPARI: A Benchmark for Open-domain Questions with Many Answers
    Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig and Jonathan Berant

In-Person Poster Session 2

  • Beyond Vertical Thinking: Exploring and Quantifying Lateral Thinking in Pretrained Language Models
    Wenjuan Han, Yueting Yang, Yijie Chen, Fandong Meng, Jie Zhou, Jinan Xu
  • Can In-context Learners Learn a Reasoning Concept from Demonstrations?
    Michal Štefánik and Marek Kadlcik
    [Oral, Archival]
  • Claim-Dissector: An Interpretable Fact-Checking System with Joint Re-ranking and Veracity Prediction
    Martin Fajcik, Petr Motlicek, Pavel Smrz
    [ACL Findings]
  • Complementary Explanations for Effective In-Context Learning
    Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Veselin Stoyanov, Greg Durrett and Ramakanth Pasunuru
  • Deductive Additivity for Planning of Natural Language Proofs
    Zayne Sprague, Kaj Bostrom, Swarat Chaudhuri and Greg Durrett
  • Distilling Reasoning Capabilities into Smaller Language Model
    Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan
    [ACL Findings]
  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee and Tomas Pfister
  • Explaining Competitive-Level Programming Solutions using LLMs
    Jierui Li, Szymon Tworkowski, Yingying Wu and Raymond Mooney
  • Exploring the Curious Case of Code Prompts
    Li Zhang, Liam Dugan, Hainiu Xu and Chris Callison-Burch
    [Oral, Archival]
  • Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks
    Fangyi Yu, Lee Quartey, Frank Schilder
    [ACL Findings]
  • Faithful Chain-of-Thought Reasoning
    Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki and Chris Callison-Burch
  • Foveate, Attribute, and Rationalize: Towards Safe and Trustworthy AI
    Alex Mei, Sharon Levy, William Yang Wang
    [ACL Findings]
  • Generative Multi-hop Retrieval
    Hyunji Lee, Sohee Yang, Hanseok Oh and Minjoon Seo
  • HeGeL: A Novel Dataset for Hebrew Geo-Location
    Tzuf Paz-Argaman, Tal Bauman, Itai Mondshine, Itzhak Omer, Sagi Dalyot, Reut Tsarfaty
    [ACL Findings]
  • How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension
    Chen Zhang, Jiuheng Lin, Xiao Liu, Yuxuan Lai, Yansong Feng, Dongyan Zhao
    [ACL Findings]
  • Knowledge Graph-augmented Language Models for Complex Question Answering
    Priyanka Sen, Sandeep Mavadia and Amir Saffari
  • LaSQuE: Improved Zero-Shot Classification from Explanations Through Quantifier Modeling and Curriculum Learning
    Sayan Ghosh, Rakesh R. Menon, Shashank Srivastava
    [ACL Findings]
  • Let's Sample Step-by-Step: Adaptive-Consistency for Efficient Reasoning with LLMs
    Pranjal Aggarwal, Aman Madaan, Yiming Yang and Mausam
  • SCOTT: Self-Consistent Chain-of-Thought Distillation
    Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin and Xiang Ren
  • Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods
    Nils Feldhus, Leonhard Hennig, Maximilian Dustin Nasert, Christopher Ebert, Robert Schwarzenberg and Sebastian Möller
  • Situated Natural Language Explanations
    Zining Zhu, Haoming Jiang, Jingfeng Yang, Sreyashi Nag, Chao Zhang, Jie Huang, Yifan Gao, Frank Rudzicz and Bing Yin
  • Negated Complementary Commonsense using Large Language Models
    Navid Rezaei and Marek Reformat
  • Towards Reasoning in Large Language Models: Survey, Implication, and Reflection
    Jie Huang and Kevin Chen-Chuan Chang
  • OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models
    Badr AlKhamissi, Siddharth Verma, Ping Yu, Zhijing Jin, Asli Celikyilmaz and Mona Diab

Important Dates

Paper Submission Deadline: May 3, 2023, extended from April 24, 2023 (All deadlines are 11:59 PM AoE time.)
Decision Notifications: May 29, 2023, moved from May 22, 2023
Camera Ready Paper Deadline: June 6, 2023 (11:59 PM Pacific time)
Workshop Date: Thursday, 13 July 2023

Organizers

Greg Durrett
University of Texas, Austin
Bhavana Dalvi
Allen Institute for AI
Jason Wei
Google Brain
Peter Jansen
University of Arizona
Danilo Ribeiro
Northwestern University

Call for Papers

We welcome submissions on all topics related to natural language reasoning or structured explanations, which might include:

  • Multi-step natural language reasoning;
  • Structured explanations;
  • Foundations of natural language reasoning;
  • Applications of natural language reasoning;
  • Knowledge retrieval for multi-step reasoning;
  • Reasoning as programs.

With recent scaling of large pre-trained Transformer language models (LLMs), the scope of feasible NLP tasks has broadened, including tasks requiring increasingly complex reasoning. Although LLMs have shown remarkable performance, it is still unclear how to best elicit this reasoning and how the answers that models give follow from what they "know." This workshop aims to bring together a diverse set of perspectives and attempt to establish common ground for how various kinds of explanation structures can tackle a broad class of reasoning problems in natural language and beyond. As such, the workshop welcomes and covers a wide range of topics, including (non-exclusively):

  • Multi-step natural language reasoning: Solving reasoning problems, such as those involving abstract manipulations, has been a long-standing challenge in the field of artificial intelligence. Large language models have recently achieved new state-of-the-art performance on many reasoning benchmarks, often with approaches that require only prompting. Current research frontiers are exploring what kinds of explanation formats are most effective, how reasoning is most effectively broken down, how to get language models to plan their reasoning, and what resources can be used to improve reasoning capabilities of language models. Tasks include mathematical reasoning, logical reasoning, commonsense reasoning, and more.
  • Structured explanations: Explanations for these complex tasks are typically composed of two or more facts that are used to help the reasoning process while also providing a record of the path taken to arrive at an inference. What representations can be best used by inference algorithms to construct large explanations? Frontiers of research include exploring search algorithms over such representations, how to represent annotations at scale, and continual learning models.
  • Foundations of natural language reasoning: Does the structured reasoning constitute a plausible (interpretable to humans) and faithful (true to the model's processes) explanation? Does perturbing the reasoning lead to correctly modified behavior?
  • Applications of natural language reasoning: New QA settings, language grounding, explainable diagnosis systems, theorem provers using natural language, reasoning for scientific discovery, and more.
  • Knowledge retrieval for multi-step reasoning: It has been shown that LLMs can store factual knowledge implicitly in their parameters; however, their ability to access and manipulate knowledge is still limited. Future avenues of research include effective methods to combine parametric and non-parametric knowledge for complex reasoning, conditioning retrieval on intermediate reasoning context, and retrieving better provenance for structured explanations.
  • Reasoning as programs: Another body of work within computational cognitive science and AI has formalized reasoning as inference over programs, building on classical views of human reasoning in a symbol-like language of thought and linguistic semantics with logical languages. Using language models of code to produce structured reasoning for commonsense problems, and other similar approaches, are all in scope here.

Submission Guidelines

We welcome two types of papers: regular workshop papers and non-archival submissions. Only regular workshop papers will be included in the workshop proceedings. All submissions should be in PDF format and made through Softconf. In line with the ACL main conference policy, camera-ready versions of papers will be given one additional page of content.

  • Regular workshop papers: Authors should submit a paper up to 8 pages (both short and long papers are welcome), with unlimited pages for references, following the ACL 2023 formatting requirements. The reported research should be substantially original. All submissions will be reviewed in a single track, regardless of length. Accepted papers will be presented as posters by default, and best papers may be given the opportunity for a brief talk to introduce their work. Reviewing will be double-blind, and thus no author information should be included in the papers; self-reference that identifies the authors should be avoided or anonymised. Accepted papers will appear in the workshop proceedings.
  • Non-archival submissions: We also solicit cross-submissions, i.e., papers on relevant topics that have appeared in other venues (e.g., workshop or conference papers at NLP, ML, or cognitive science venues, among others). Accepted papers will be presented at the workshop, with an indication of original venue, but will not be included in the workshop proceedings. Cross-submissions are ideal for related work which would benefit from exposure to the NLReasoning audience. Interested authors should submit their papers in PDF format through the NLReasoning Softconf website, with a note on the original venue. They will be reviewed in a single-blind fashion. Papers in this category do not need to follow the ACL format, and the submission length is determined by the original venue. The paper selection will be solely determined by the organizing committee.

In addition, we welcome papers on relevant topics that are under review or to be submitted to other venues (including the ACL 2023 main conference). These papers must follow the regular workshop paper format and will not be included in the workshop proceedings. Papers in this category will be reviewed by workshop reviewers.

Note to authors: When you submit your paper through Softconf, please select the “Submission Type” that matches the guidelines above.

For questions about the submission guidelines, please contact the workshop organizers.