Detecting Text & Getting Similarity between images

2023. 3. 25. 20:29 · 📄 Project/Edu_Siri

In Edu_Siri, the program runs through the following steps.

Insert the desired video -> Detect motion and capture the screen -> Compare similarity between captured images -> Recognize text

For motion detection and screen capture, see the previous post. There, when the user specifies a directory at capture time, a directory with that name is created and the captures are saved inside it. All that is left is to walk through those captures and check the similarity between each image and the one that follows it.
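Walking through the captures "previous image, next image" only works if the file names come back in capture order. Here is a minimal sketch of that pairing step, assuming the captures are numbered files such as frame1.png, frame2.png (the names and both helper functions are hypothetical, not taken from Edu_Siri).

```python
import re

def natural_key(name):
    """Split a file name into text and integer chunks so 'frame2' sorts before 'frame10'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]

def consecutive_pairs(names):
    """Return (previous, current) pairs of file names in natural order."""
    ordered = sorted(names, key=natural_key)
    return list(zip(ordered, ordered[1:]))

# Hypothetical capture names, deliberately out of order
captures = ["frame10.png", "frame2.png", "frame1.png"]
for previous, current in consecutive_pairs(captures):
    print(previous, "->", current)
```

A plain sorted() would put frame10.png before frame2.png, which is why the sketch sorts on the numeric chunks instead.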

 

1. Check directory & Similarity between images

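import os

import cv2 as cv
# Assumed import: compare_ss is taken to be scikit-image's structural_similarity
from skimage.metrics import structural_similarity as compare_ss

# Check every image in the directory
def file_listing(path) -> None:
    # os.listdir() returns names in arbitrary order, so sort them first
    files = sorted(os.listdir(path))

    # Compare each capture with the one right before it
    for i in range(1, len(files)):
        comparePic(path, files[i-1], files[i])
    print("\n+=====================================+")
    print("| Organizing class materials is Done! |")
    print("+=====================================+")

# Check the similarity between two images
def comparePic(path, image1, image2) -> None:
    image1_path = os.path.join(path, image1)
    image2_path = os.path.join(path, image2)

    pre = cv.imread(image1_path)
    post = cv.imread(image2_path)

    # Compare the two captures in grayscale
    grayPre = cv.cvtColor(pre, cv.COLOR_BGR2GRAY)
    grayPost = cv.cvtColor(post, cv.COLOR_BGR2GRAY)

    # full=True also returns the per-pixel similarity map (diff)
    (Similarity, diff) = compare_ss(grayPre, grayPost, full=True)
    diff = (diff * 255).astype('uint8')  # scaled to 8-bit; not used further here

    # Below the threshold the slide has changed, so run OCR on the newer image
    if Similarity < 0.95:
        PicToText(post)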

Let's focus on the comparePic function. The algorithm for checking similarity between images is fairly simple. After loading the two images into pre and post, each one is passed to cvtColor with the COLOR_BGR2GRAY option, which converts it to grayscale.

Now that the images in pre and post are both grayscale, they are passed to compare_ss as arguments. Setting full=True makes the function return the per-pixel similarity map in addition to the overall score. The overall score is the Similarity value we need.

( Similarity is the structural similarity index between the two input images. It only takes values between -1, a complete mismatch, and 1, a perfect match. )
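To get a feel for where a value between -1 and 1 comes from, here is a simplified sketch of the structural similarity formula in plain Python. It is an assumption-laden illustration: it evaluates the formula once over the whole image (given as a flat list of gray levels), whereas scikit-image's structural_similarity averages it over small sliding windows, so the numbers will not match compare_ss exactly. The function name global_ssim and the sample pixel values are made up for this example.

```python
def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM over two equally sized, flat lists of gray levels."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and of equal size")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((p - mean_a) * (p - mean_a) for p in a) / n
    var_b = sum((q - mean_b) * (q - mean_b) for q in b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n
    # Small constants that keep the division stable (standard SSIM choice)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a * mean_a + mean_b * mean_b + c1) * (var_a + var_b + c2)
    return numerator / denominator

# Hypothetical 8-pixel "slide" and its inverted version
slide = [0, 0, 255, 255, 0, 255, 0, 255]
inverted = [255 - p for p in slide]

print(global_ssim(slide, slide))     # identical images score at the top of the range
print(global_ssim(slide, inverted))  # opposite structure gives a negative score
```

The covariance term is what makes the index "structural": it is positive when bright and dark areas line up in both images and negative when they are swapped.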

 

Since the images being compared are lecture materials, even a small change in the text has to be detected so that text recognition can be triggered. After several rounds of testing, 0.95 was judged to work best, so it was set as the threshold.
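The effect of the threshold is easy to see on a handful of scores. The sketch below is only an illustration: the similarity values are invented, and changed_indices is a hypothetical helper that mirrors the "Similarity < 0.95" check rather than code from Edu_Siri.

```python
def changed_indices(scores, threshold=0.95):
    """scores[i] compares capture i with capture i + 1.

    Returns the indices of the captures that would be sent to OCR,
    i.e. the newer image of every pair scoring below the threshold.
    """
    return [i + 1 for i, score in enumerate(scores) if score < threshold]

# Made-up similarity scores between consecutive captures
scores = [0.99, 0.97, 0.80, 0.96, 0.60]

for threshold in (0.90, 0.95, 0.98):
    print(threshold, changed_indices(scores, threshold))
```

Raising the threshold sends more captures to OCR (small edits are caught, but near-duplicates slip in), while lowering it only reacts to larger slide changes.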

 

2. Set configuration before detecting Text

pip install pillow
pip install pytesseract

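Text recognition needs some basic setup. Start by installing the two packages above. pytesseract was chosen because it can recognize both English and Korean. Note that pytesseract is a Python wrapper around the Tesseract OCR engine, so the Tesseract program itself and the language data it needs must be installed as well.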
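Because pytesseract hands the actual work to the external Tesseract program, it helps to check that the program is reachable before starting a long run. A minimal sketch, assuming the executable is installed under its usual command name tesseract and is on PATH (tesseract_available is a hypothetical helper name):

```python
import shutil

def tesseract_available(executable="tesseract"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None

if tesseract_available():
    print("Tesseract found on PATH")
else:
    print("Tesseract not found; install it (and the language data you need) before running OCR")
```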

def PicToText(path) -> None:
    # Initialize the config parser & read the config file
    config = configparser.ConfigParser()
    config.read(os.path.dirname(os.path.realpath(__file__)) + os.sep + 'env' + os.sep + 'property.ini')

    # Despite its name, path holds the image array passed in by comparePic
    ocrToStr(path, 'eng')

The fourth line (the config.read call) builds the path with the OS-specific separator, os.sep, and loads the property.ini config file.

# /env/property.ini
[Bigdata Ocr Extract]
Version= 1.0

[Path]
# Path where the extracted text files are saved
OcrTxtPath= \\resource\\ocr_result_txt

Normally you would specify the path where results are saved, but EDU_Siri reformats the extracted text and writes it to a fixed file, so the PATH section can be ignored. ( See the configparser package documentation for details. )
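One detail worth knowing when a value in property.ini is followed by a comment on the same line: by default configparser only treats full-line comments as comments, so the trailing text becomes part of the value unless inline_comment_prefixes is set. The sketch below inlines a copy of the config text (with an English comment) so it can run on its own. It is an illustration, not the way Edu_Siri reads its settings.

```python
import configparser

# Inlined copy of the config text, with a comment on the same line as the value
INI_TEXT = r"""
[Bigdata Ocr Extract]
Version= 1.0

[Path]
OcrTxtPath= \\resource\\ocr_result_txt # output path for extracted text
"""

# Default parser: the trailing comment is kept as part of the value
default_parser = configparser.ConfigParser()
default_parser.read_string(INI_TEXT)

# Parser that strips "#" comments written after a value
inline_parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
inline_parser.read_string(INI_TEXT)

print(repr(default_parser["Path"]["OcrTxtPath"]))
print(repr(inline_parser["Path"]["OcrTxtPath"]))
```

Putting the comment on its own line above the key avoids the issue without any extra parser options.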

3. Detecting Text & Save

import os

import cv2 as cv
from PIL import Image
from pytesseract import image_to_string

def ocrToStr(fullPath, lang='eng'):
    # Despite its name, fullPath is the BGR image array from cv.imread; convert it to RGB for Pillow
    img = Image.fromarray(cv.cvtColor(fullPath, cv.COLOR_BGR2RGB))
    # preserve_interword_spaces: adjust the word-spacing option while checking extraction accuracy
    # psm (page segmentation mode): how the text regions are picked out within the image
    # psm modes: https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

    # Extract (image, language, options)
    outText = image_to_string(img, lang=lang, config='--psm 1 -c preserve_interword_spaces=1')
    strToText(outText)

def strToText(outText):
    # Make sure the output directory exists, then append the text and a separator line
    os.makedirs('./course_material', exist_ok=True)
    with open('./course_material/ppt_content.txt', 'a', encoding='utf-8') as f:
        f.write(outText)
        f.write("==========================================================\n")

Once the text extracted from the image has been written to the file at the designated path, this feature is done.
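The append-and-separate step can be tried without Tesseract at all. The following sketch writes two made-up OCR results into a temporary directory. append_ocr_text, the separator length, and the sample strings are assumptions for illustration, while the course_material/ppt_content.txt layout follows the post.

```python
from pathlib import Path
import tempfile

SEPARATOR = "=" * 58 + "\n"

def append_ocr_text(text, out_file):
    """Append one OCR result to out_file, followed by a separator line."""
    out_file = Path(out_file)
    # Create the parent directory if it does not exist yet
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("a", encoding="utf-8") as f:
        f.write(text)
        # Keep the separator on its own line even if the text lacks a trailing newline
        if not text.endswith("\n"):
            f.write("\n")
        f.write(SEPARATOR)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "course_material" / "ppt_content.txt"
    append_ocr_text("Slide 1 title", target)
    append_ocr_text("Slide 2 title\n", target)
    content = target.read_text(encoding="utf-8")
    print(content)
```

Opening the file in append mode is what lets every changed slide add to the same document, and the separator marks where one slide's text ends and the next begins.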

 

Full Code

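import configparser
import os

import cv2 as cv
from PIL import Image
from pytesseract import image_to_string
# Assumed import: compare_ss is taken to be scikit-image's structural_similarity
from skimage.metrics import structural_similarity as compare_ss

# Check the similarity between two images
def comparePic(path, image1, image2) -> None:
    image1_path = os.path.join(path, image1)
    image2_path = os.path.join(path, image2)

    pre = cv.imread(image1_path)
    post = cv.imread(image2_path)

    # Compare the two captures in grayscale
    grayPre = cv.cvtColor(pre, cv.COLOR_BGR2GRAY)
    grayPost = cv.cvtColor(post, cv.COLOR_BGR2GRAY)

    # full=True also returns the per-pixel similarity map (diff)
    (Similarity, diff) = compare_ss(grayPre, grayPost, full=True)
    diff = (diff * 255).astype('uint8')  # scaled to 8-bit; not used further here

    # Below the threshold the slide has changed, so run OCR on the newer image
    if Similarity < 0.95:
        PicToText(post)

# Text recognition helper 1
def ocrToStr(fullPath, lang='eng'):
    # Despite its name, fullPath is the BGR image array from cv.imread; convert it to RGB for Pillow
    img = Image.fromarray(cv.cvtColor(fullPath, cv.COLOR_BGR2RGB))
    # preserve_interword_spaces: adjust the word-spacing option while checking extraction accuracy
    # psm (page segmentation mode): how the text regions are picked out within the image
    # psm modes: https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

    # Extract (image, language, options)
    outText = image_to_string(img, lang=lang, config='--psm 1 -c preserve_interword_spaces=1')
    strToText(outText)

def strToText(outText):
    # Make sure the output directory exists, then append the text and a separator line
    os.makedirs('./course_material', exist_ok=True)
    with open('./course_material/ppt_content.txt', 'a', encoding='utf-8') as f:
        f.write(outText)
        f.write("==========================================================\n")

# Main function for text recognition
def PicToText(path) -> None:
    # Initialize the config parser & read the config file
    config = configparser.ConfigParser()
    config.read(os.path.dirname(os.path.realpath(__file__)) + os.sep + 'env' + os.sep + 'property.ini')

    # Despite its name, path holds the image array passed in by comparePic
    ocrToStr(path, 'eng')

# Check every image in the directory
def file_listing(path) -> None:
    # os.listdir() returns names in arbitrary order, so sort them first
    files = sorted(os.listdir(path))

    # Compare each capture with the one right before it
    for i in range(1, len(files)):
        comparePic(path, files[i-1], files[i])
    print("\n+=====================================+")
    print("| Organizing class materials is Done! |")
    print("+=====================================+")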