A Paper a Day Keeps the Reviewer Away!


A Paper a Day Keeps the Reviewer Away! - Alexander R. Fabbri

At least that’s what I hope. In order to write better papers and familiarize myself with recent work, over the past month I made it my goal to read a new paper each day and take notes on each of them. I thought that this would be a useful way for me to better understand the papers and help others along the way. As Francis Bacon said, “some books are to be tasted, others to be swallowed, and some few to be chewed and digested,” and this is reflected in my notes. Additionally, the choice of papers was very subjective, with a bias towards recent work which is relevant to my current projects. The theme of October was summarization. Let’s begin!

Ideas that come up again and again in summarization papers:


news articles are biased towards the first three sentences, which motivates new, more diverse datasets; the percentage of novel n-grams serves as a measure of the abstraction needed

generating longer sequences of text/training on larger sequences

evaluation metrics + human evaluation

other kinds of summarization (hierarchical, opinion summarization)

datasets and standardization

multi-document summarization (MDS) and small datasets, WikiSum and hierarchical summarization
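Since the percentage of novel n-grams comes up so often as a proxy for abstraction, here is a minimal sketch of how it might be computed. The whitespace tokenization, lowercasing, and the choice to count distinct n-grams are my own simplifying assumptions; individual papers differ on these details.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def percent_novel_ngrams(source, summary, n):
    """Percentage of distinct summary n-grams that never occur in the source."""
    src = ngrams(source.lower().split(), n)
    summ = ngrams(summary.lower().split(), n)
    if not summ:
        return 0.0
    return 100.0 * len(summ - src) / len(summ)


# Toy example (made-up sentences, not taken from any dataset).
source = "the cat sat on the mat while the dog slept"
summary = "the cat slept on the mat"
print(percent_novel_ngrams(source, summary, 1))  # 0.0: every word is copied
print(percent_novel_ngrams(source, summary, 2))  # 40.0: 2 of 5 bigrams are new
```

A higher percentage suggests that the reference summaries rephrase the source more, so a purely extractive system has less to work with.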

10/1 Generating Wikipedia By Summarizing Long Sequences[1]:

This paper presents the first attempt to abstractively generate the first section (“lead”) of Wikipedia articles through multi-document summarization of reference texts. The gold summaries are the Wikipedia leads, which are fairly uniform in style due to Wikipedia’s style constraints. They experiment with two types of input: text from the citations of a given Wikipedia page, and search results crawled based on the page title (up to 10 result pages, with clones removed), the latter motivated by articles with few citations. From a comparison of unigrams in the gold summaries and input documents, they argue that this dataset calls for more abstraction than other text summarization datasets. Because the combined input documents are so large, emphasis is placed on the extractive step, which aims to select L input tokens for the second, abstractive stage. They experiment with three extractive methods, tf-idf [2], TextRank [3] and SumBasic [4], as well as two baselines: taking the first L tokens, and a cheating method that takes into account bigrams in the ground truth text. They train a decoder-only Transformer [5], as they find that this allows them to train better on longer sequences of text than the encoder-decoder Transformer. This is a curious phenomenon. They evaluate with ROUGE, DUC-style questions and human evaluations. They show the importance of a good extractor and find that the combined dataset of references plus web crawl works best.
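To make the extractive step concrete, here is a rough sketch of tf-idf selection. I am assuming that paragraphs are ranked against a query (for example the page title) and then concatenated until L tokens are collected; the function name `tfidf_select`, the whitespace tokenization and the toy paragraphs are all hypothetical and not taken from [1].

```python
import math
from collections import Counter


def tfidf_select(query, paragraphs, max_tokens):
    """Rank paragraphs by tf-idf against the query and keep the first max_tokens tokens."""
    docs = [p.lower().split() for p in paragraphs]
    n_docs = len(docs)
    query_words = set(query.lower().split())
    # Document frequency: the number of paragraphs that contain each word.
    df = Counter(word for doc in docs for word in set(doc))

    def score(doc):
        tf = Counter(doc)
        return sum(tf[w] * math.log(n_docs / df[w]) for w in query_words if w in tf)

    # sorted() is stable, so equally scored paragraphs keep their original order.
    ranked = sorted(range(n_docs), key=lambda i: score(docs[i]), reverse=True)
    selected = []
    for i in ranked:
        selected.extend(docs[i])
    return selected[:max_tokens]


paragraphs = [
    "transformers use attention layers",
    "the weather is nice today",
    "attention is all you need for transformers attention",
]
print(tfidf_select("attention transformers", paragraphs, 5))
```

The abstractive model then only ever sees the selected tokens, which is why the quality of this ranking matters so much.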

10/2 Bottom-Up Abstractive Summarization[6]:

This paper introduces a bottom-up approach to neural summarization, with an extractive content selector built upon contextual word embeddings that masks the input to an abstractive summarization step. They find that their content selection model (which they pose as a sequence labeling task) is data efficient and can be trained with less than 1% of the original training data. They bring up the interesting point that, in theory, pointer networks should be able to do content selection by themselves. Why isn’t this the case? Is there just too much input in some cases? They experiment with additional content selection methods such as masking during training, multi-task learning and masking to limit the set of possible copies during training. They try the Transformer and say it can lead to slightly improved performance, but at the cost of increased training time and parameters. However, there have been recent advances to the Transformer which may alleviate this problem. They add length penalties as well as coverage penalties. They state that the main benefit of bottom-up summarization seems to come from the reduction of mistakenly copied words. They run a domain transfer experiment, training a content selector on CNN data and then testing on NYT data. They do not get comparable performance, but do get improvements over the non-augmented corpus. There is room for improvement in the content selector as well as in the fluency and grammaticality of the generated summaries.
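My reading of the masking step can be sketched as follows: copy attention on source tokens whose selection probability falls below a threshold is zeroed out, and the rest is renormalized. The function name, the default threshold of 0.5 and the fallback when nothing is selected are my own assumptions for illustration.

```python
def mask_copy_attention(attention, select_prob, threshold=0.5):
    """Zero out copy attention on unselected source tokens, then renormalize."""
    masked = [a if q > threshold else 0.0 for a, q in zip(attention, select_prob)]
    total = sum(masked)
    if total == 0.0:
        # Assumed fallback: if nothing passes the selector, keep the original attention.
        return list(attention)
    return [m / total for m in masked]


attention = [0.5, 0.3, 0.2]    # copy attention over three source tokens
select_prob = [0.9, 0.1, 0.8]  # content selector output per source token
print(mask_copy_attention(attention, select_prob))
```

In this toy example the second token can no longer be copied, and its attention mass is redistributed over the two selected tokens.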

10/3 Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data[7]:

This paper, like [1], emphasizes that automatic summarization of data clusters has focused on datasets of ten to twenty rather short documents. They introduce an approach to create hierarchical summarization corpora from heterogeneous datasets of over 100,000 tokens from multiple genres. Their corpus focuses on topics related to education. Corpus collection is divided into two phases: 1) a crowdsourcing phase extracts relevant nuggets from multiple documents, and 2) three expert annotators organize the nuggets into hierarchies, which are then merged into a gold standard by greedily maximizing hierarchy overlap. They release code for their annotator interface and crowdsourcing experiment to make developing summarization corpora easier.

10/4 Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised [8]

They provide a neural framework for opinion summarization and introduce an opinion summarization dataset covering six domains. They break the work into three subtasks: 1) aspect extraction, or finding specific features pertaining to the entity of interest, 2) sentiment prediction, which determines the sentiment orientation of the aspects found in the first step, and 3) summary generation, which presents the opinions to the user. They take reviews and split them into segments, in this case Elementary Discourse Units (cite!) obtained from a Rhetorical Structure Theory parser (cite!). Aspect extraction builds upon the Aspect-based Autoencoder (cite!), essentially a neural topic model. Their model improves upon this through the introduction of a set of seed words. They also introduce the OpoSum dataset, which consists of six training collections created from the Amazon Product Dataset (cite!). The evaluation, which uses best-worst scaling (cite!), is really good.
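For readers who have not seen best-worst scaling, a minimal sketch of the usual counting score may help: each item is scored as the fraction of times it was picked as best minus the fraction of times it was picked as worst. The system names and judgments here are made up, and I am not claiming this matches the exact protocol of [8].

```python
from collections import Counter


def best_worst_scores(judgments):
    """Score each item as (times chosen best - times chosen worst) / times shown.

    judgments is a list of (shown_items, best_item, worst_item) tuples.
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, best_item, worst_item in judgments:
        shown.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Three hypothetical annotators each compare systems A, B and C.
judgments = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
    (("A", "B", "C"), "B", "C"),
]
print(best_worst_scores(judgments))
```

Scores range from -1 (always worst) to 1 (always best), which makes systems easy to compare without asking annotators for absolute ratings.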

10/5 A Neural Attention Model for Abstractive Sentence Summarization [9]

They combine a neural language model with a contextual input encoder, based on the attention-based encoder of [10]. They compare this encoder to a bag-of-words encoder as well as a convolutional encoder. Summaries are generated using beam search. A notable point is that the abstractive model does not have the capacity to find extractive word matches.
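As a reminder of what beam search does during decoding, here is a generic sketch over a toy next-token table. The table, the function names and the lack of length normalization are my own simplifications and have nothing to do with the actual model in [9].

```python
import math


def beam_search(step_fn, start, end, beam_size, max_len):
    """Return the highest log-probability sequence found with a fixed-size beam."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the beam_size best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


# Toy "model": the next-token distribution depends only on the last token.
table = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def step_fn(seq):
    return table[seq[-1]]


print(beam_search(step_fn, "<s>", "</s>", beam_size=2, max_len=5))
print(beam_search(step_fn, "<s>", "</s>", beam_size=1, max_len=5))
```

With a beam of 2 the search recovers the sequence through "b", which has the higher overall probability, while a beam of 1 (greedy decoding) commits to "a" too early.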

10/6 Get To The Point: Summarization with Pointer-Generator Networks [11]

In order to improve abstractive summarization, this paper introduces pointer-generator networks, a hybrid between standard seq2seq models and pointer networks [12] that allows the model to copy words from the input. This aims to address the problem brought up in [9]. Additionally, repetition is a problem for sequence-to-sequence models. To address this, they modify the coverage model of [13] to penalize overlap between the attention distribution and the coverage so far.
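The two ideas can be sketched for a single decoding step in plain Python. The final distribution mixes the vocabulary distribution with the attention over source tokens using a generation probability, and the coverage loss sums the element-wise minimum of the current attention and the accumulated coverage. The toy numbers and the dictionary representation are mine; a real implementation would use tensors.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source."""
    final = {word: p_gen * p for word, p in vocab_dist.items()}
    for token, a in zip(source_tokens, attention):
        # Copy probability is added even for out-of-vocabulary source words.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * a
    return final


def coverage_loss(attention, coverage):
    """Penalize attending again to source positions that are already covered."""
    return sum(min(a, c) for a, c in zip(attention, coverage))


vocab_dist = {"the": 0.6, "cat": 0.4}  # generator output over a tiny vocabulary
source_tokens = ["zorro", "cat"]       # "zorro" is out of vocabulary
attention = [0.7, 0.3]

print(final_distribution(0.5, vocab_dist, attention, source_tokens))
print(coverage_loss(attention, coverage=[0.1, 0.9]))
```

Note how "zorro" receives probability mass even though the generator could never produce it, which is exactly the copying ability the earlier purely abstractive model lacked.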

10/7 Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization [14]

This paper introduces what they call “extreme summarization,” which involves creating a short, one-sentence news summary answering the question of what an article is about. This task promotes multiple levels of abstraction such as paraphrasing and synthesis. They introduce a new dataset called XSum, consisting of 226,711 BBC articles and accompanying single-sentence summaries. While the documents and summaries (the introductory sentence professionally written for each article) are shorter than those in other datasets, the vocabulary size is comparable to that of the CNN dataset. Additionally, they propose a novel abstractive model which is conditioned on the article’s topics and based entirely on convolutional neural networks.

10/8 Data-to-Text Generation with Content Selection and Planning [15]

Recently, end-to-end text generation has taken precedence over earlier data-to-text generation systems, which used a pipeline of content planning, sentence planning and conversion of the plan into output text. However, neural methods do not fare well on metrics of content selection recall and factual output generation. This paper introduces a two-stage architecture: 1) content selection and planning produces a content plan which specifies which records to introduce in the text and when, and 2) text generation produces the output text given the content plan as input by attending over vector representations of the records in the content plan.

10/9 Graph-based Neural Multi-Document Summarization [16]

This paper emphasizes that in previous neural multi-document summarizers [17][18], all the sentences in the same document cluster are processed independently, and thus the relationships between sentences and between documents are ignored. They make use of Graph Convolutional Networks [19] as well as Personalized Discourse Graphs, an extension of Approximate Discourse Graphs [20], to promote diverse edge weights in a sentence relation graph.

10/10 Adapting the Neural Encoder-Decoder Framework from Single to Multi-Document Summarization [21]

Hierarchical Summarization: Scaling Up Multi-Document Summarization [22]

They introduce a system called SUMMA which produces a hierarchy of short summaries from large document collections. Most MDS datasets contain only 10-15 documents. They say that a well-constructed hierarchical summary should maximize coverage of salient information, minimize redundancy, and have intra-cluster coherence as well as parent-to-child coherence. Note that they use Wikipedia articles as a reference.


  [1] P.J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, N. Shazeer, Generating Wikipedia by Summarizing Long Sequences, CoRR. abs/1801.10198 (2018).
  [2] J.E. Ramos, Using TF-IDF to Determine Word Relevance in Document Queries, in: 2003.
  [3] R. Mihalcea, P. Tarau, TextRank: Bringing Order into Text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004, A Meeting of SIGDAT, a Special Interest Group of the ACL, Held in Conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain, 2004: pp. 404–411.
  [4] A. Nenkova, L. Vanderwende, The impact of frequency on summarization, Microsoft Research, 2005.
  [5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is All you Need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017: pp. 6000–6010.
  [6] S. Gehrmann, Y. Deng, A.M. Rush, Bottom-Up Abstractive Summarization, CoRR. abs/1808.10792 (2018).
  [7] C. Tauchmann, T. Arnold, A. Hanselowski, C.M. Meyer, M. Mieskes, Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, 2018.
  [8] S. Angelidis, M. Lapata, Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised, CoRR. abs/1808.08858 (2018).
  [9] A.M. Rush, S. Chopra, J. Weston, A Neural Attention Model for Abstractive Sentence Summarization, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 2015: pp. 379–389.
  [10] D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, CoRR. abs/1409.0473 (2014).
  [11] A. See, P.J. Liu, C.D. Manning, Get To The Point: Summarization with Pointer-Generator Networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 2017: pp. 1073–1083.
  [12] O. Vinyals, M. Fortunato, N. Jaitly, Pointer Networks, in: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015: pp. 2692–2700.
  [13] Z. Tu, Z. Lu, Y. Liu, X. Liu, H. Li, Modeling Coverage for Neural Machine Translation, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
  [14] S. Narayan, S.B. Cohen, M. Lapata, Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, CoRR. abs/1808.08745 (2018).
  [15] R. Puduppully, L. Dong, M. Lapata, Data-to-Text Generation with Content Selection and Planning, (2018).
  [16] M. Yasunaga, R. Zhang, K. Meelu, A. Pareek, K. Srinivasan, D.R. Radev, Graph-based Neural Multi-Document Summarization, in: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, 2017: pp. 452–462.
  [17] Z. Cao, F. Wei, L. Dong, S. Li, M. Zhou, Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, 2015: pp. 2153–2159.
  [18] Z. Cao, W. Li, S. Li, F. Wei, Improving Multi-Document Summarization via Text Classification, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017: pp. 3053–3059.
  [19] T.N. Kipf, M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, CoRR. abs/1609.02907 (2016).
  [20] J. Christensen, Mausam, S. Soderland, O. Etzioni, Towards Coherent Multi-Document Summarization, in: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, 2013: pp. 1163–1173.
  [21] L. Lebanoff, K. Song, F. Liu, Adapting the Neural Encoder-Decoder Framework from Single to Multi-Document Summarization, CoRR. abs/1808.06218 (2018).
  [22] J. Christensen, S. Soderland, G. Bansal, Mausam, Hierarchical Summarization: Scaling Up Multi-Document Summarization, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, 2014: pp. 902–912.