EDU_Siri has a feature that "converts the speech in a lecture video to text and saves it to a file." I want to write down the things I learned and the trial and error I went through while implementing it.
Googling this, the speech_recognition library is the one that comes up most often. Early in development I planned to use this module, but a few things made it a poor fit:
- Most search results for speech_recognition introduce example code that takes speech input from a microphone and converts that input to text. The feature I wanted to build does not take speech from a microphone, so a different style of code was needed (a minimal file-based sketch follows this list).
- The module can be made to convert the audio of a video into text, but it has a size limit: it only handles up to 100MB, so I could not use it.
- Looking at the recognition results from speech_recognition, you can see it does not produce characters such as . , ! ?. The recognized text has to be handed to the summarization feature, which builds its matrix around punctuation marks such as . and ,. In other words, without punctuation the summarization feature cannot be used. -> This was the single biggest reason speech_recognition could not be used.
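For reference, the file-based (non-microphone) usage looks roughly like the sketch below. This is a minimal sketch, not the approach used in this project: the file name is just an example, and recognize_google sends the audio to Google's free web API.

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("lecture_audio.wav") as source:  # example file name; read from a file instead of the mic
    audio = r.record(source)                       # load the whole file into an AudioData object
text = r.recognize_google(audio)                   # calls Google's free web speech API
# the returned text has no punctuation at all, which is the dealbreaker described above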
As a solution that converts speech to text conveniently and without a size limit, I adopted the VOSK model. (It is a speech recognition toolkit, but it is not widely known; there are almost no references, so I struggled quite a bit while implementing the logic.)
vosk model download : https://alphacephei.com/vosk/models
With deployment in mind, I chose the lightest option, vosk-model-small-en-us-0.15. Now let's look at the code.
1. Convert Video into Audio file
Since the program takes a video file as input, the video is converted to an audio file before speech recognition. FRAME_RATE sets the sampling rate of the audio fed to the recognizer; the small VOSK model works best at 16000 Hz, so it is defined as below. CHANNELS sets the number of audio channels; 1 means mono, the single-channel audio the recognizer expects (2 would be stereo).
# Convert Video to audio file
clip = mp.VideoFileClip(video_path)
clip.audio.write_audiofile(speech_path)
FRAME_RATE = 16000  # sampling rate fed to the recognizer (16 kHz)
CHANNELS = 1  # mono
2. Basic setup Code
Define the speech recognition model and the settings to use. To confirm during development that the audio file is converted to text accurately, SetWords(True) is turned on.
# set up
model = Model(model_name='vosk-model-small-en-us-0.15')
rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)
SetWords(True) - makes the result include not only the full recognized sentences but also each individual word, so everything can be checked.
speech = AudioSegment.from_file(speech_path)  # load the extracted audio (from_file detects the format)
speech = speech.set_channels(CHANNELS)
speech = speech.set_frame_rate(FRAME_RATE)
rec.AcceptWaveform(speech.raw_data)
result = rec.Result()
Now the audio file is loaded, the raw audio data is passed to the recognizer, and the recognition result is stored in the result variable.
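With SetWords(True) on, the string that Result() returns is JSON containing both the full recognized text and per-word details. Roughly, it has the shape below (the values are illustrative, not the actual output for this video):

{
  "result": [
    {"conf": 0.98, "start": 0.33, "end": 0.71, "word": "so"},
    ...
  ],
  "text": "so if you look at recent results ..."
}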
Since the value we want is the converted speech text, json.loads is used to read only the text field.
text = json.loads(result)['text']
3. Prediction of Punctuation
Reading the text value above, it comes out as well-formed sentences, but characters such as . , ? ! are missing. Without these punctuation marks the summarization feature cannot build its matrix properly; in other words, the summarization feature becomes useless. So a solution for predicting punctuation was needed.
https://alphacephei.com/vosk/models/vosk-recasepunc-en-0.22.zip
vosk-recasepunc predicts the punctuation of sentences and inserts symbols such as . , ? ! in the appropriate places. Googling turns up two ways to use it. In my case, only the brute-force approach of running the recasepunc.py file through the subprocess module worked properly, so that is the method I chose.
cased = subprocess.check_output('python recasepunc/recasepunc.py predict recasepunc/checkpoint', shell=True, text=True, input=text)
cased = cased.replace(" ' ", "'").replace(" ? ", "? ").replace(" ! ", "! ")
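For clarity, the replace() chain above only cleans up the extra spaces that the recasepunc output leaves around apostrophes and punctuation. A hypothetical before/after (the sample string is made up):

raw = "so what ' s the point ? it works ! "  # hypothetical recasepunc output
clean = raw.replace(" ' ", "'").replace(" ? ", "? ").replace(" ! ", "! ")
print(clean)  # -> "so what's the point? it works! "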
4. Result
So if you look at recent results from several different leading speech groups, Microsoft showed that this kind of deep neural network
when used to see coasting model and (rest omitted)
Full Code
import os
import json
import subprocess

import moviepy.editor as mp
from pydub import AudioSegment
from vosk import Model, KaldiRecognizer


def video_to_text(video_name):
    # Set paths of the video and speech files
    base_path = os.getcwd()
    video_path = os.path.join(base_path, "video", str(video_name))
    speech_file = str(video_name).split('.')[0] + ".wav"
    speech_path = os.path.join(base_path, "speech", speech_file)

    # Convert the video into an audio (wav) file
    clip = mp.VideoFileClip(video_path)
    clip.audio.write_audiofile(speech_path)

    # Settings expected by the VOSK model: 16 kHz, mono
    FRAME_RATE = 16000
    CHANNELS = 1

    try:
        model = Model(model_name='vosk-model-small-en-us-0.15')
        rec = KaldiRecognizer(model, FRAME_RATE)
        rec.SetWords(True)

        print("\n\n############# Now, I'm on my work.. It takes a few minutes. #################")
        print("################## It's okay to ignore warnings! ##########################")
        print("#### If the program is still stuck after enough time has passed, press Enter. #####")

        # Load the extracted audio and match the recognizer's expected format
        speech = AudioSegment.from_file(speech_path)
        speech = speech.set_channels(CHANNELS)
        speech = speech.set_frame_rate(FRAME_RATE)

        # Run speech recognition and keep only the recognized text
        rec.AcceptWaveform(speech.raw_data)
        result = rec.Result()
        text = json.loads(result)['text']

        # Restore punctuation with recasepunc, then clean up the token spacing
        cased = subprocess.check_output('python recasepunc/recasepunc.py predict recasepunc/checkpoint',
                                        shell=True, text=True, input=text)
        cased = cased.replace(" ' ", "'").replace(" ? ", "? ").replace(" ! ", "! ")

        # Append the punctuated transcript to the result file
        with open('speech_result.txt', mode='a') as file:
            file.write("\n==============================================\n")
            file.write("Content: \n")
            file.write(str(cased))

        print("+===========================================+")
        print("| Converting is done! (Video Sound -> Text) |")
        print("+===========================================+")
    except Exception as e:
        print("Error occurred during converting video sound to text! "
              "The file is probably an unsupported format:", e)
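A hypothetical call, assuming the directory layout the function expects (the file and folder names below follow the code above; the video name is just an example):

# expected layout in the working directory:
#   video/lecture.mp4                           <- input video (example name)
#   speech/                                     <- the extracted .wav is written here
#   recasepunc/recasepunc.py, recasepunc/checkpoint
video_to_text("lecture.mp4")
# the punctuated transcript is appended to speech_result.txt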