This document is related to my project, Edu_Helper. I wrote this article with help from others' documents.
※ Impact
A summarization system has additional evidence it can use to identify the most important topics of a document. For example, when summarizing journals or blogs, the discussions and comments that follow a post are good sources of information for determining which parts of the post are critical and important.
※ How text summarization works
There are two types of summarization: abstractive summarization and extractive summarization.
1. Abstractive Summarization:
Abstractive methods select words based on semantic understanding, even words that did not appear in the source documents. Their purpose is to present the important material in a new way. They interpret and understand the text using advanced natural language techniques in order to generate a new, shorter text that conveys the core information of the original.
It can be compared to the way a human reads a blog post or journal and then summarizes it in their own words.
Given document -> understand context -> extract semantics -> create summary in own words
2. Extractive Summarization:
Extractive methods attempt to summarize articles by selecting the sentences that receive the highest importance scores.
This method weights the sentences of a document and forms the summary from the most important ones. Different algorithms and techniques are used to assign weights to the sentences and rank them based on their importance and their similarity to one another.
Given document -> compute sentence similarity -> weight sentences -> select top-ranked sentences
Automatic abstractive summaries usually give better results than extractive summaries. This is because abstractive summarization methods cope with problems such as semantic representation, inference, and natural language generation, which are relatively harder than data-driven approaches such as sentence extraction.
This example uses an unsupervised learning approach to find the similarity between sentences and rank them. A benefit of this method is that I don't need to gather a data set or train a model.
Before continuing, you should know about cosine similarity to understand this article better. Cosine similarity is a measure of similarity between two non-zero vectors that computes the cosine of the angle between them. By representing the given sentences as vectors, I can use it to find the similarity among sentences. As an easy example, if the angle is zero, the two sentences are similar.
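As a quick illustration, here is a minimal sketch of cosine similarity on word-count vectors; the two toy sentences and the small vocabulary are made up purely for demonstration.

import numpy as np

def cosine_similarity(v1, v2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Vocabulary: ["the", "cat", "sat", "dog"]
sent_a = np.array([1, 1, 1, 0])  # "the cat sat"
sent_b = np.array([1, 0, 1, 1])  # "the dog sat"
print(cosine_similarity(sent_a, sent_b))  # ~0.667, fairly similar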
Below is my code flow for generating the summary text.
Given article -> split into sentences using a given rule -> remove stop words if they exist
-> build a similarity matrix -> generate ranks based on the matrix -> pick the highest-ranked sentences
1. Import Libraries
import re  # used in read_article to strip non-alphabetic characters

import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
2. Read the article and organize sentences
def read_article(file_name):
    # Read the file and join all lines into a single string
    with open(file_name, "r", encoding="UTF-8") as file:
        filedata = file.read().replace("\n", " ")

    # Split into sentences using a simple ". " rule
    article = filedata.split(". ")
    sentences = []
    for sentence in article:
        print(sentence)
        # str.replace does not understand regexes, so use re.sub
        # to strip non-alphabetic characters before splitting into words
        sentences.append(re.sub("[^a-zA-Z]", " ", sentence).split())

    # Drop the trailing empty fragment left after the final period
    if sentences and not sentences[-1]:
        sentences.pop()
    return sentences
3. Similarity matrix
This is where we use cosine similarity to compute the similarity between sentences. Note that build_similarity_matrix relies on a sentence_similarity helper, sketched just below.
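The code in this step calls a sentence_similarity helper that does not appear elsewhere in this document, so here is a minimal sketch of it: it builds bag-of-words count vectors for the two sentences (skipping stop words) and scores them with NLTK's cosine_distance, which was imported in step 1. Treat it as one reasonable implementation, not the only one; edge cases such as sentences made up entirely of stop words are not handled here.

def sentence_similarity(sent1, sent2, stop_words=None):
    if stop_words is None:
        stop_words = []
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # Build bag-of-words count vectors, ignoring stop words
    for w in sent1:
        if w in stop_words:
            continue
        vector1[all_words.index(w)] += 1
    for w in sent2:
        if w in stop_words:
            continue
        vector2[all_words.index(w)] += 1

    # cosine_distance returns 1 - cosine similarity, so invert it
    return 1 - cosine_distance(vector1, vector2)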
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # skip comparing a sentence with itself
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix
4. Generate summary method
def generate_summary(file_name, top_n=5):
    nltk.download("stopwords")
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read the text and split it into sentences
    sentences = read_article(file_name)

    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in the similarity matrix with PageRank
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by rank and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("Indexes of top ranked_sentence order are ", ranked_sentence)

    # Guard against asking for more sentences than the article has
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    print("Summarize Text: \n", ". ".join(summarize_text))