Building a Text Summarizer from Scratch

Written by Anjaneya Tripathi


We use summaries in many places in our lives. The blurb on the back of a book gives us a general idea of what it's about, news apps often show a short caption describing each article, and movie reviews are another place where summaries prove to be of prime importance.


With the advent of technologies such as natural language processing (NLP), machine learning, and deep learning, why not use computers to generate summaries without human intervention? This article will show you how to do just that: create a summary from a given text.


Summarization of a text can be of two types – Extractive & Abstractive Summarization.


Extractive summarization, very crudely, can be defined as hand-picking the important sentences and adding them to the summary as they are. The steps are:

  1. Read the text and split it into sentences.

  2. Parse through each sentence and identify which ones are more important than others based on certain evaluation parameters.

  3. Pick out the sentences that rank higher than the others.

  4. Add these sentences to the summary and TA DA! You’re done.

Abstractive summarization is closer to how a human summarizes a piece of text: understanding and analyzing the article before writing a summary.

  1. Read the text.

  2. Analyze the underlying meaning of the text and its sentences.

  3. Pick out the important topics and generate new sentences (which may or may not reuse vocabulary from the article).

  4. Add these sentences to the summary and voila!


As you can see, extractive summarization is relatively easy compared to abstractive summarization, since it doesn't have to deal with semantics or generate new vocabulary. As a result, extractive summarizers often produce better summaries than their abstractive siblings.

In this article, we’ll discuss extractive summarization and then proceed to create our very own text summarizer from scratch. Excited?


Term Frequency - Inverse Document Frequency (TF-IDF)

Given a sentence, we can easily identify what it's talking about and what it implies. A computer, however, can't do that directly. So how do we make it work? We convert each sentence into a vector of numbers, something a computer can understand. These vectors are then ranked, and the best-suited sentences are selected for the summary. And who decides the rank of a sentence? That's where TF-IDF comes in.

Term frequency-inverse document frequency is what we’ll be using as the basis for selecting sentences that’ll make it to the final summary. The TF-IDF value is calculated as given below:

TF-IDF = TF x IDF

Term Frequency (TF)

This is a measure of how often a word appears in a document. Taking the raw count alone gives an inaccurate result, because a longer document will tend to have a higher count for a word than a shorter one. We therefore normalize by dividing the count by the total number of words in the document, which makes TF a far more useful quantity for our purpose.

TF(w, d) = occurrences of word ‘w’ in the document ‘d’ / total number of words in the document ‘d’

That's it, right? Why do we need anything else? Well, if you think about it, extremely common words such as 'is', 'am', and 'was' will have a very high TF value even though they carry almost no information. TF on its own is therefore not enough, so behold IDF, our saviour!



Inverse Document Frequency (IDF)

Before we jump to IDF, let's discuss document frequency. DF is not the total number of times the word 'w' appears across the set of documents; it is the number of documents that contain the word at least once.

DF(w) = number of documents that contain the word 'w'

However, DF itself is not what we need; it's the inverse that is relevant, because it tells us how much information a term carries. After inversion, common words get a low IDF value while rarer terms gain higher precedence. We take the logarithm of this ratio because, for a large number of documents, the raw value N/DF would sky-rocket otherwise.

IDF(w) = log(N / (DF + 1)), where N is the total number of documents

We divide by (DF + 1) rather than DF to avoid division by zero, which can be a serious nuisance on certain occasions.


TF-IDF(w, d) = TF(w, d) * log(N / (DF + 1))
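
For example (with made-up numbers), suppose our text contains N = 10 sentences, and the word 'rocket' occurs 2 times in a 20-word sentence and appears in 2 sentences overall. Then, using base-10 logarithms:

TF = 2 / 20 = 0.1
IDF = log(10 / (2 + 1)) ≈ 0.52
TF-IDF ≈ 0.1 x 0.52 ≈ 0.052

By contrast, a filler word that appears in 9 of the 10 sentences gets IDF = log(10 / 10) = 0, so its TF-IDF is zero no matter how often it occurs.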

Once all this is calculated, we rank each document (here, each sentence) by its TF-IDF score and set a threshold that decides which sentences make it into the summary. Pretty cool, right? Now, let's get coding!


Let the Code Begin!


Importing Libraries
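
What follows under each heading is a minimal sketch of that step, not the one true implementation. This one assumes NLTK for sentence splitting, tokenization, stemming, and stop words; the standard library's math module supplies the logarithm.

```python
import math

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokenizer model and the stop-word list
nltk.download('punkt')
nltk.download('stopwords')
```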

Cleaning the Text
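
One way to clean the text: lower-case each sentence, drop stop words and punctuation, and stem whatever remains. The helper name clean_sentence is my own choice.

```python
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_sentence(sentence):
    """Lower-case a sentence, drop stop words and punctuation, and stem."""
    words = word_tokenize(sentence.lower())
    return [stemmer.stem(w) for w in words
            if w.isalnum() and w not in stop_words]
```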

Occurrences of Words in Each Sentence
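
A sketch of the counting step: map each sentence to a dictionary of its cleaned-word counts.

```python
def build_frequency_matrix(sentences):
    """Map each sentence to a {word: count} dict of its cleaned words."""
    freq_matrix = {}
    for sent in sentences:
        counts = {}
        for word in clean_sentence(sent):
            counts[word] = counts.get(word, 0) + 1
        freq_matrix[sent] = counts
    return freq_matrix
```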

Creating List of Frequencies for each Word in all Documents
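
This builds the document-frequency table from earlier: for every word, the number of sentences (our documents) that contain it.

```python
def build_doc_frequency(freq_matrix):
    """For each word, count the number of sentences containing it."""
    doc_freq = {}
    for counts in freq_matrix.values():
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return doc_freq
```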

Calculating TF and IDF Values
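
TF and IDF then follow the two formulas above; this sketch uses base-10 logarithms, matching the worked example.

```python
def build_tf_matrix(freq_matrix):
    """TF(w, d) = occurrences of w in d / total words in d."""
    tf_matrix = {}
    for sent, counts in freq_matrix.items():
        total = sum(counts.values())
        tf_matrix[sent] = {w: c / total for w, c in counts.items()}
    return tf_matrix

def build_idf_table(doc_freq, total_docs):
    """IDF(w) = log10(N / (DF + 1)); the +1 guards against division by zero."""
    return {w: math.log10(total_docs / (df + 1)) for w, df in doc_freq.items()}
```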

Calculating TF-IDF Values
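
Multiplying the two tables gives the TF-IDF matrix:

```python
def build_tf_idf_matrix(tf_matrix, idf_table):
    """TF-IDF(w, d) = TF(w, d) * IDF(w)."""
    tf_idf = {}
    for sent, tf_scores in tf_matrix.items():
        tf_idf[sent] = {w: tf * idf_table[w] for w, tf in tf_scores.items()}
    return tf_idf
```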

Ranking all the Documents
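
Each sentence needs a single score; averaging the TF-IDF values of its words is one reasonable choice.

```python
def score_sentences(tf_idf_matrix):
    """Score each sentence as the average TF-IDF value of its words."""
    scores = {}
    for sent, word_scores in tf_idf_matrix.items():
        scores[sent] = (sum(word_scores.values()) / len(word_scores)
                        if word_scores else 0.0)
    return scores
```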

Generating the Summary
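
Finally, one way to build the summary: keep, in their original order, the sentences whose score beats a threshold derived from the average score.

```python
def generate_summary(sentences, scores, factor=1.0):
    """Keep sentences scoring at least `factor` times the average score."""
    average = sum(scores.values()) / len(scores)
    threshold = factor * average
    return ' '.join(s for s in sentences if scores.get(s, 0.0) >= threshold)
```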

(Note: the average-based threshold can be multiplied by a scalar to reduce or increase the size of the summary; that's what the factor argument in the sketch above allows.)

And…Time to call all the functions!
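
A sketch of the driver code, tying the pieces together (input.txt is a placeholder filename):

```python
if __name__ == '__main__':
    # The filename is a placeholder; point this at your own *.txt file.
    with open('input.txt', encoding='utf-8') as f:
        text = f.read()

    sentences = sent_tokenize(text)
    freq_matrix = build_frequency_matrix(sentences)
    doc_freq = build_doc_frequency(freq_matrix)
    tf_matrix = build_tf_matrix(freq_matrix)
    idf_table = build_idf_table(doc_freq, len(sentences))
    tf_idf_matrix = build_tf_idf_matrix(tf_matrix, idf_table)
    scores = score_sentences(tf_idf_matrix)

    print(generate_summary(sentences, scores))
```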

(Note: Ensure that your *.txt file is in the same directory as your *.py file!)


Take a look at the results!


Conclusion – The End of a Wonderful Journey

Well, I hope you had a fun time coding along with me and learning about text summarization while we were at it. This, however, isn't the only way to summarize a text; there are many other approaches to the task. Hope you enjoyed it, and see you next time!


About the person behind the keyboard: Anjaneya is pursuing a B.Tech in CS from NIT Trichy. He is an ML enthusiast and a brilliant guy who is good at coding and writes great content.
