For a basic explanation of Text Summarization, refer to this post. There are plenty of explanations and code samples for Text Summarization on Google, but none of them fit my project as-is, so I wrote my own.
Looking at posts about Text Summarization, almost all of them fetch online Articles or Journals and run Abstractive Text Summarization on them. Each approach has its pros and cons, but from the various texts I tested, the Abstractive approach produced summaries that were easier to understand. It also produced results that better captured the tone and meaning the original text was trying to convey.
For my project, however, the preconditions differ from those of typical Text Summarization code. The differences are as follows.
- Existing code runs text summarization on "well-formed" text (here, "well-formed" means text with proper punctuation: commas, periods, colons, quotation marks, and so on). The text my project needs to summarize, however, has essentially no punctuation.
- The texts that existing code targets are far shorter than the text my project has to summarize. In other words, I have to summarize texts so long that the code found on Google cannot handle them.
I have not analyzed this rigorously, but it seems that as the text gets longer, the Abstractive approach requires an increasingly sophisticated summarization algorithm before it can produce a proper summary. I judged that using the Extractive approach would be more productive than building such a sophisticated algorithm, so I built my algorithm around the Extractive approach.
( How punctuation is added back into the original text will be explained in the Converting Audio into Text post. )
1. Get contents I want to summarize
with open(path, "r", encoding='UTF-8') as file:
    core_text = file.read().split('===========================================================================')[-1]
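The file layout here is an assumption on my part, inferred from the separator the full code later writes into its own output: the transcript file appears to hold header material, a separator line of '=' characters, and then the transcript body, so split(...)[-1] keeps only the part after the last separator. A minimal sketch of that behavior:

SEP = '=' * 75  # assumption: the same '=' separator string used in the transcript files
sample = "header / metadata\n" + SEP + "\nactual transcript text"
core_text = sample.split(SEP)[-1]
print(core_text.strip())  # -> actual transcript text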
2. Text Summarization
This is an algorithm that counts how often each word occurs.
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stopwords_english = stopwords.words("english")

# formatted_text is the letters-only version of the transcript (see Full Code below).
word_frequencies = {}
for word in nltk.word_tokenize(formatted_text):
    if word not in stopwords_english:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
The algorithm works as follows. First, import the stopwords list from the NLTK library. The stopwords_english list looks like this:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
These words carry no important meaning in the text, so they are not counted. Every time a word that is not in this list appears, its count in word_frequencies is incremented by 1 (or set to 1 if the word has not been seen before). Once every word's frequency has been counted, each frequency is divided by the frequency of the most common word.
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

print(Counter(word_frequencies).most_common(20))
[('course', 1.0), ('G', 0.92), ('three', 0.92), ('requirements', 0.88), ('education', 0.8),
('The', 0.8), ('courses', 0.8), ('units', 0.76), ('general', 0.76), ('E', 0.64), ('one', 0.64),
('A', 0.6), ('see', 0.6), ('If', 0.56), ('division', 0.56), ('area', 0.56), ('transfer', 0.52),
('Area', 0.52), ('D', 0.44), ('pattern', 0.4)]
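One thing worth noting in this output: capitalized function words like 'The' and 'If' still rank highly, because NLTK's stopword list is all lowercase and the tokens are compared as-is. This is not something the pipeline below does, but if those words became a problem, lowercasing before the membership check would filter them out:

word_frequencies = {}
for word in nltk.word_tokenize(formatted_text):
    # Compare in lowercase so 'The' and 'the' are both treated as the stopword 'the'.
    if word.lower() not in stopwords_english:
        word_frequencies[word.lower()] = word_frequencies.get(word.lower(), 0) + 1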
Based on these values, an Extractive Text Summarization model can be built. The model picks out the sentences containing the highest-scoring words. I made each sentence's score be computed from the word-frequency values above, and set the algorithm to only consider sentences of more than 50 words (because the text I have to summarize is extremely long).
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent):
        if word in word_frequencies.keys() and len(sent.split(' ')) > 50:
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word]
            else:
                sentence_scores[sent] += word_frequencies[word]
The algorithm that computes sentence importance is done. Now all that is left is to pull out the summary sentences.
import heapq

summary_sentences = heapq.nlargest(8, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
Since the original text is very long, I judged that a summary of about 8 sentences would work best, so I passed 8 as the first argument of heapq.nlargest.
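For reference, heapq.nlargest(n, iterable, key) returns the n items with the largest key values, sorted by score rather than by position in the document. A toy example, plus an optional re-sort into document order (my own suggestion, not part of the pipeline below):

import heapq

scores = {'sentence a': 0.4, 'sentence b': 1.3, 'sentence c': 0.9}
print(heapq.nlargest(2, scores, key=scores.get))  # ['sentence b', 'sentence c']

# Optional: put the chosen sentences back into their original order,
# assuming sentence_list is the nltk.sent_tokenize output from earlier.
# summary_sentences.sort(key=sentence_list.index)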
Full Code:
import re
import heapq
import nltk
from nltk.corpus import stopwords

def generate_summary(path, output_filename, mode):
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)

    with open(path, "r", encoding='UTF-8') as file:
        core_text = file.read().split('===========================================================================')[-1]

    # Removing square brackets and extra spaces
    core_text = re.sub(r'\[[0-9]*\]', ' ', core_text)
    core_text = re.sub(r'\s+', ' ', core_text)

    # Removing special characters and digits
    formatted_text = re.sub('[^a-zA-Z]', ' ', core_text)
    formatted_text = re.sub(r'\s+', ' ', formatted_text)

    sentence_list = nltk.sent_tokenize(core_text)
    stopwords_english = stopwords.words("english")

    # Count how often each non-stopword occurs.
    word_frequencies = {}
    for word in nltk.word_tokenize(formatted_text):
        if word not in stopwords_english:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1

    # Normalize every count by the most frequent word's count.
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies.keys():
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    # Score sentences longer than 50 words by summing their word scores.
    sentence_scores = {}
    for sent in sentence_list:
        for word in nltk.word_tokenize(sent):
            if word in word_frequencies.keys() and len(sent.split(' ')) > 50:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

    # Keep the 8 highest-scoring sentences as the summary.
    summary_sentences = heapq.nlargest(8, sentence_scores, key=sentence_scores.get)
    summary = ' '.join(summary_sentences)

    path = "./summary/" + output_filename
    try:
        with open(path, mode=mode, encoding='UTF-8') as file:
            file.write('Summarized Content: \n')
            file.write(str(summary))
            file.write("\n========================================================\n")
        print("+======================+")
        print("|   Summary Complete   |")
        print("+======================+")
    except Exception as e:
        print("Error occurred while summarizing text:", e)
3. Result
Summarized Content:
So if you look at recent results from several different leading speech groups, Microsoft showed that this kind of deep neural network when used to see coasting model and speech system would use the or right from twenty seven point four percent, eighteen point five percent, or alternatively, you can view it as reducing the amount of training that you needed from two thousand hours time to three hundred hours to get comparable performance i b m which has the best system for one of the standard speech recognition tasks for large recovery speech recognition showed that even it's very highly tuned system that was getting eighteen point eight percent can be beaten by one of these deep neural networks. That was still much less than that train think i see mixture model on but even with much less data, it did a lot better than the technology they had before said reduce the or right from sixteen percent trump on three percent and the area it is still falling and in the latest android if you do voice search is using one of these deep neural networks in order to very good speech recognition. So they look at this little window and they say in the middle of this window, what do I think the phony me some which part of the phone you miss it and a good speech recognition system will have many alternative models for phony and each model and might have three different parts. Students on the freshmen pattern must complete a minimum of three units in the one Social Sciences, three units, indeed to US History and three units in D, Three US Government and California Government students on the transfer pattern need nine units of Area D Social Science coursework If you have none of your Area D completed, we recommend you still include a D to and D three course when completing this section because you still need to complete the U. S. History and the U. S. Government and California government requirements. Students must complete a minimum of nine units in area, see, including three units, and see one arts, three units, and see to Humanities, and then three final units from either see one arts or see to humanities after area, see his area, d Social Sciences, which has met through completion of a minimum of nine units. This consists of a three unit course in each of the three domains of knowledge, physical in life Science, You D DB, Arts and Humanities, U. D. C and Social Sciences, You D. D. Note that to take an Upper division G. E. Course, one must meet all prerequisites for the courses which minimally include completion of lower division, G. E. A one A to a three and before freshmen.
========================================================