
Pulling the text out of a PDF, fast! feat. time complexity

by 바쿄리 2025. 7. 23.

 

At work, in the middle of a project .. I was building a crawler

It crawls news articles,

and when an article contains a PDF, it downloads the file, extracts just the text, and stores it in the DB.

 

But scraping an article with a PDF felt a lot slower than scraping one without.

So I went looking for the function responsible.

 

The existing code:

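import random
from time import perf_counter

import fitz  # PyMuPDF
import requests

# user_agents is assumed to be a list of User-Agent strings defined elsewhere in the project.

def current_extract_pdf_text(pdf_url):
    start_time = perf_counter()
    
    response = requests.get(pdf_url, headers={'User-Agent': random.choice(user_agents)})
    # apparent_encoding runs charset detection over the whole response body;
    # only the raw bytes (response.content) are used below.
    response.encoding = response.apparent_encoding
    response.raise_for_status()
    
    pdf_text = ""
    with fitz.open(stream=response.content, filetype="pdf") as doc:
        for i, page in enumerate(doc):
            # Each += may build a new string from the old text plus the new page.
            pdf_text += page.get_text() + "\n"
    
    duration_ms = int((perf_counter() - start_time) * 1000)
    print(f"current_extract_pdf_text duration_ms: {duration_ms}")
    
    return pdf_text.strip()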

 

I tested it with a pdf_url pointing to a PDF of exactly 100 pages

and measured the duration.

current_extract_pdf_text duration_ms: 40476
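
One way to see where a number like this comes from is to time each stage separately instead of the whole function. The sketch below is only an illustration: `timed` and the `results` dict are hypothetical helpers that are not in the original code, and the loop is a stand-in workload rather than the real download or PDF parsing.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Store the elapsed milliseconds of the with-block under results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = int((perf_counter() - start) * 1000)


results = {}

# Stand-in for one stage of the real function (hypothetical workload).
with timed("concat", results):
    text = ""
    for _ in range(100):
        text += "x" * 2_000 + "\n"

print(results)
```

Wrapping the requests.get call, the page loop, and the string building in separate `timed` blocks would show which stage dominates.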

 

It was shocking.

Wait, 40 seconds just to pull the text out of a PDF?

 

A while ago I stopped by Kyobo Bookstore (교보문고) and, for whatever reason, a Python data structures book looked so fun that I just bought it.

I only got halfway through and have no idea where the book is now,

but I remembered something I had read in it.

I knew right away that the problem here was +=.

 

✔️ Why does += take so long?

1. Every pass through the loop, page.get_text() produces a new string.

2. Python strings are immutable, so to add this new string to pdf_text, Python has to allocate memory for a completely new string that holds the existing contents of pdf_text plus the new string.

3. It then copies the entire previous pdf_text into the newly allocated space and appends the new string after it.

4. The pdf_text variable now points to this 'new' string.

 

→ Because pdf_text gets longer as pages accumulate, the amount of data to copy keeps growing toward the end of the loop, in proportion to everything extracted so far.

 Summed over n pages, that is O(n²) time in the worst case, which is very inefficient!
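
To make the O(n²) claim concrete, here is a rough sketch that just counts copied characters. It assumes the simple model described above, where every += writes the whole string into fresh memory; CPython can sometimes extend a string in place, so real numbers may be lower. The page size (2,000 characters) and both function names are made up for illustration.

```python
def chars_copied_by_concat(page_lengths):
    """Characters copied if every += builds a brand-new string."""
    copied = 0
    current_length = 0
    for length in page_lengths:
        current_length += length   # the new string holds old text + new page
        copied += current_length   # all of it is written into fresh memory
    return copied


def chars_copied_by_join(page_lengths):
    """Characters copied if the result is allocated once and filled in order."""
    return sum(page_lengths)


pages = [2_000] * 100  # hypothetical: 100 pages of 2,000 characters each
print(chars_copied_by_concat(pages))  # 10100000
print(chars_copied_by_join(pages))    # 200000
```

With n equal pages the first count is proportional to n(n+1)/2 and the second to n, which is where the quadratic versus linear difference comes from.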

 

Here is how I improved it

Use list.append() !!!

def new_extract_pdf_text(pdf_url: str) -> str:
    start_time = perf_counter()
    try:
        response = requests.get(pdf_url, headers={'User-Agent': random.choice(user_agents)})
        response.raise_for_status()
        
        pdf_bytes = response.content
        
        full_text = []
        # The with block closes the document even if get_text() raises.
        with fitz.open(stream=pdf_bytes, filetype="pdf") as pdf_document:
            for page in pdf_document:
                full_text.append(page.get_text())
        
        duration_ms = int((perf_counter() - start_time) * 1000)
        print(f"new_extract_pdf_text duration_ms: {duration_ms}")
        
        # Join once at the end so the function returns a str, as annotated.
        return "\n".join(full_text).strip()

    except requests.exceptions.RequestException as e:
        return f"Error while requesting the URL: {e}"
    except Exception as e:
        return f"Error while processing the PDF: {e}"

new_extract_pdf_text duration_ms: 1756

 

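I sent the request with the same pdf_url, and it came back in about 1.76 seconds

 

Going from 40476ms to 1756ms is a 95.7% reduction,, so proud of this (note that the new version also drops the response.apparent_encoding line, so the measurement covers that change as well)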
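
For reference, the 95.7% figure works out like this from the two measured durations. The variable names are just for illustration.

```python
old_ms, new_ms = 40476, 1756  # the two durations measured in this post

reduction = (old_ms - new_ms) / old_ms  # share of the original time saved
speedup = old_ms / new_ms               # how many times faster

print(f"{reduction:.1%}")  # 95.7%
print(f"{speedup:.1f}x")   # 23.1x
```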

 

✔️ Why is list.append() fast!

1. A Python list is a mutable object. append() adds an element to the existing list in place, which is very fast: amortized O(1).

2. After the loop ends, "\n".join(full_text) first computes the total length of the final string.

3. It then allocates memory of that size just once and copies every string in the list into it, in order, to build the final string.
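
The two approaches can be put side by side in a small, self-contained sketch. It uses made-up page strings instead of a real PDF, and `concat_pages` and `join_pages` are hypothetical names. The timings depend on the machine and the Python version, so no expected numbers are shown.

```python
from timeit import timeit


def concat_pages(pages):
    """Build the text with += inside the loop, like the old function."""
    text = ""
    for page in pages:
        text += page + "\n"
    return text.strip()


def join_pages(pages):
    """Collect the pages in a list and join them once, like the new function."""
    parts = []
    for page in pages:
        parts.append(page)
    return "\n".join(parts).strip()


# Hypothetical stand-in for 100 extracted pages.
pages = [f"page {i} " + "x" * 2_000 for i in range(100)]

# Both approaches must produce exactly the same text.
assert concat_pages(pages) == join_pages(pages)

print("+=  :", timeit(lambda: concat_pages(pages), number=100))
print("join:", timeit(lambda: join_pages(pages), number=100))
```

Checking that both outputs match before comparing speed makes sure the faster version is not simply doing less work.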

 

 

Besides measuring the time, I also dumped the extracted text to txt files to check it

(Left) output of the existing code, (right) output of the improved code

 

When I wrote the extracted text out to files and compared them ..

the existing code wasn't even extracting the text properly? Wait.. why.. hmm..

 

Always check things one by one and read code with my brain fully switched on! I'm going to fix all of it ~~