For a basic explanation of Text Summarization, refer to this post. There are plenty of explanations and code samples for Text Summarization on Google, but none of them fit my project as-is, so I wrote my own.
Looking at posts about Text Summarization, almost all of them fetch online Articles or Journals and run Abstractive Text Summarization on them. Each approach has its pros and cons, but from the various texts I tested, the Abstractive approach produced summaries that were easier to understand. It also produced results that better captured the tone and meaning the original text was trying to convey.
For my project, however, the preconditions differ from those of typical Text Summarization code. The differences are as follows.
- Existing code runs text summarization on "well-formed" text (here, "well-formed" means text with proper punctuation: commas, periods, colons, quotation marks, and so on). The text my project needs to summarize, however, has essentially no punctuation.
- The texts that existing code targets are far shorter than the text my project has to summarize. In other words, I have to summarize texts so long that the code found on Google cannot handle them.
I have not analyzed this rigorously, but it seems that as the text gets longer, the Abstractive approach requires an increasingly sophisticated summarization algorithm before it can produce a proper summary. I judged that using the Extractive approach would be more productive than building such a sophisticated algorithm, so I built my algorithm around the Extractive approach.
( How punctuation is added back into the original text will be explained in the Converting Audio into Text post. )
1. Get contents I want to summarize
with open(path, "r", encoding='UTF-8') as file:
    core_text = file.read().split('===========================================================================')[-1]
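The file layout here is an assumption on my part, inferred from the separator the full code later writes into its own output: the transcript file appears to hold header material, a separator line of '=' characters, and then the transcript body, so split(...)[-1] keeps only the part after the last separator. A minimal sketch of that behavior:

SEP = '=' * 75  # assumption: the same '=' separator string used in the transcript files
sample = "header / metadata\n" + SEP + "\nactual transcript text"
core_text = sample.split(SEP)[-1]
print(core_text.strip())  # -> actual transcript text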
2. Text Summarization
This is an algorithm that counts how often each word occurs.
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stopwords_english = stopwords.words("english")

# formatted_text is the letters-only version of the transcript (see Full Code below).
word_frequencies = {}
for word in nltk.word_tokenize(formatted_text):
    if word not in stopwords_english:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
The algorithm works as follows. First, import the stopwords list from the NLTK library. The stopwords_english list looks like this:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
These words carry no important meaning in the text, so they are not counted. Every time a word that is not in this list appears, its count in word_frequencies is incremented by 1 (or set to 1 if the word has not been seen before). Once every word's frequency has been counted, each frequency is divided by the frequency of the most common word.
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

print(Counter(word_frequencies).most_common(20))
[('course', 1.0), ('G', 0.92), ('three', 0.92), ('requirements', 0.88), ('education', 0.8),
('The', 0.8), ('courses', 0.8), ('units', 0.76), ('general', 0.76), ('E', 0.64), ('one', 0.64),
('A', 0.6), ('see', 0.6), ('If', 0.56), ('division', 0.56), ('area', 0.56), ('transfer', 0.52),
('Area', 0.52), ('D', 0.44), ('pattern', 0.4)]
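One thing worth noting in this output: capitalized function words like 'The' and 'If' still rank highly, because NLTK's stopword list is all lowercase and the tokens are compared as-is. This is not something the pipeline below does, but if those words became a problem, lowercasing before the membership check would filter them out:

word_frequencies = {}
for word in nltk.word_tokenize(formatted_text):
    # Compare in lowercase so 'The' and 'the' are both treated as the stopword 'the'.
    if word.lower() not in stopwords_english:
        word_frequencies[word.lower()] = word_frequencies.get(word.lower(), 0) + 1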
Based on these values, an Extractive Text Summarization model can be built. The model picks out the sentences containing the highest-scoring words. I made each sentence's score be computed from the word-frequency values above, and set the algorithm to only consider sentences of more than 50 words (because the text I have to summarize is extremely long).
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent):
        if word in word_frequencies.keys() and len(sent.split(' ')) > 50:
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word]
            else:
                sentence_scores[sent] += word_frequencies[word]
The algorithm that computes sentence importance is done. Now all that is left is to pull out the summary sentences.
import heapq

summary_sentences = heapq.nlargest(8, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
Since the original text is very long, I judged that a summary of about 8 sentences would work best, so I passed 8 as the first argument of heapq.nlargest.
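For reference, heapq.nlargest(n, iterable, key) returns the n items with the largest key values, sorted by score rather than by position in the document. A toy example, plus an optional re-sort into document order (my own suggestion, not part of the pipeline below):

import heapq

scores = {'sentence a': 0.4, 'sentence b': 1.3, 'sentence c': 0.9}
print(heapq.nlargest(2, scores, key=scores.get))  # ['sentence b', 'sentence c']

# Optional: put the chosen sentences back into their original order,
# assuming sentence_list is the nltk.sent_tokenize output from earlier.
# summary_sentences.sort(key=sentence_list.index)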
Full Code:
import re
import heapq
import nltk
from nltk.corpus import stopwords

def generate_summary(path, output_filename, mode):
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)

    with open(path, "r", encoding='UTF-8') as file:
        core_text = file.read().split('===========================================================================')[-1]

    # Removing square brackets and extra spaces
    core_text = re.sub(r'\[[0-9]*\]', ' ', core_text)
    core_text = re.sub(r'\s+', ' ', core_text)

    # Removing special characters and digits
    formatted_text = re.sub('[^a-zA-Z]', ' ', core_text)
    formatted_text = re.sub(r'\s+', ' ', formatted_text)

    sentence_list = nltk.sent_tokenize(core_text)
    stopwords_english = stopwords.words("english")

    # Count how often each non-stopword occurs.
    word_frequencies = {}
    for word in nltk.word_tokenize(formatted_text):
        if word not in stopwords_english:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1

    # Normalize every count by the most frequent word's count.
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies.keys():
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    # Score sentences longer than 50 words by summing their word scores.
    sentence_scores = {}
    for sent in sentence_list:
        for word in nltk.word_tokenize(sent):
            if word in word_frequencies.keys() and len(sent.split(' ')) > 50:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

    # Keep the 8 highest-scoring sentences as the summary.
    summary_sentences = heapq.nlargest(8, sentence_scores, key=sentence_scores.get)
    summary = ' '.join(summary_sentences)

    path = "./summary/" + output_filename
    try:
        with open(path, mode=mode, encoding='UTF-8') as file:
            file.write('Summarized Content: \n')
            file.write(str(summary))
            file.write("\n========================================================\n")
        print("+======================+")
        print("|   Summary Complete   |")
        print("+======================+")
    except Exception as e:
        print("Error occurred while summarizing text:", e)
3. Result
Summarized Content:
So if you look at recent results from several different leading speech groups, Microsoft showed that this kind of deep neural network when used to see coasting model and speech system would use the or right from twenty seven point four percent, eighteen point five percent, or alternatively, you can view it as reducing the amount of training that you needed from two thousand hours time to three hundred hours to get comparable performance i b m which has the best system for one of the standard speech recognition tasks for large recovery speech recognition showed that even it's very highly tuned system that was getting eighteen point eight percent can be beaten by one of these deep neural networks. That was still much less than that train think i see mixture model on but even with much less data, it did a lot better than the technology they had before said reduce the or right from sixteen percent trump on three percent and the area it is still falling and in the latest android if you do voice search is using one of these deep neural networks in order to very good speech recognition. So they look at this little window and they say in the middle of this window, what do I think the phony me some which part of the phone you miss it and a good speech recognition system will have many alternative models for phony and each model and might have three different parts. Students on the freshmen pattern must complete a minimum of three units in the one Social Sciences, three units, indeed to US History and three units in D, Three US Government and California Government students on the transfer pattern need nine units of Area D Social Science coursework If you have none of your Area D completed, we recommend you still include a D to and D three course when completing this section because you still need to complete the U. S. History and the U. S. Government and California government requirements. Students must complete a minimum of nine units in area, see, including three units, and see one arts, three units, and see to Humanities, and then three final units from either see one arts or see to humanities after area, see his area, d Social Sciences, which has met through completion of a minimum of nine units. This consists of a three unit course in each of the three domains of knowledge, physical in life Science, You D DB, Arts and Humanities, U. D. C and Social Sciences, You D. D. Note that to take an Upper division G. E. Course, one must meet all prerequisites for the courses which minimally include completion of lower division, G. E. A one A to a three and before freshmen.
========================================================