
[python] Naver Cafe crawling

by 바쿄리 2024. 1. 30.

 

๋„ค์ด๋ฒ„ ์นดํŽ˜ ๊ฒŒ์‹œ๊ธ€์„ ํฌ๋กค๋ง ํ•˜๋ ค ํ•œ๋‹ค.

 

I first tried to fetch the pages with Python requests, but instead of the board HTML the response came back as the notice shown below.
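A minimal sketch of that attempt (the clubid value here is a placeholder, and exactly which notice you get back can depend on the board's settings):

import requests

# Placeholder clubid; a members-only board fetched without a login session
# does not return the actual post list.
resp = requests.get('https://cafe.naver.com/ArticleList.nhn?search.clubid=clubid&search.boardtype=L')
print(resp.status_code)
print(resp.text[:300])  # a notice page like the one below, not the board HTML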

<p class="notice">회원님의 안전한 서비스 이용을 위해 <strong>비밀번호를 확인해 주세요.</strong></p>
<p class="notice02" id="layerd_notice01" style="display:none;">IP가 자주 변경되는 네트워크 환경에서 로그인하는 경우 <br/><strong>IP보안을 해제</strong> 후 이용하시기 바랍니다.</p>
<p class="notice02">다시 한번 <strong>비밀번호 확인</strong> 하시면 이용중인 화면으로 돌아가며, 작성 중이던<br/>내용을 정상적으로 전송 또는 등록하실 수 있습니다.</p>

 

From this notice (it asks the member to confirm their password) it's clear the board can't be accessed this way.

 

์ด๋ ‡๊ฒŒ ๋˜๋ฉด selenium์„ ์ด์šฉํ•ด์„œ ๋™์ ์œผ๋กœ ์ ‘๊ทผํ•˜๊ธฐ๋กœ ํ•œ๋‹ค.

 

๋กœ๊ทธ์ธ

Fill in user_id and user_pw and run the code below, and you can watch the login go through.

import csv
import re
import time
from datetime import datetime

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = 'https://nid.naver.com/nidlogin.login'
user_id = 'id'        # your Naver id
user_pw = 'password'  # your Naver password

chrome_options = Options()
chrome_options.add_experimental_option("detach", True)  # keep the window open after the script ends
chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])
# chrome_options.add_argument("headless")  # headless option
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)

browser.implicitly_wait(2)

# Fill the form via JavaScript rather than send_keys; typing the values with
# send_keys tends to trigger Naver's automated-input captcha.
browser.execute_script(f"document.getElementsByName('id')[0].value='{user_id}'")
browser.execute_script(f"document.getElementsByName('pw')[0].value='{user_pw}'")

browser.find_element(by=By.XPATH, value='//*[@id="log.login"]').click()
time.sleep(1)
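The fixed time.sleep(1) after the click can be flaky. If you prefer an explicit wait, something like this should work with the browser and url from the block above (it assumes a successful login navigates away from the login URL):

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the post-login redirect instead of sleeping.
WebDriverWait(browser, 10).until(EC.url_changes(url))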

 

Accessing the HTML

baseurl = 'https://cafe.naver.com'

clubid = 'clubid'  # the cafe's id (see below for how to find it)
menuid = 'menuid'  # the board (menu) id within the cafe
cafemenuurl = f'{baseurl}/ArticleList.nhn?search.clubid={clubid}&search.menuid={menuid}&search.boardtype=L&userDisplay=50'

 

You need the clubid and menuid of the cafe you want to crawl.

clubid is the cafe's id, and menuid is the id of a board inside that cafe.

 

์•„๋ž˜์˜ ์นดํŽ˜๋ฅผ ์˜ˆ๋กœ ๋“ค๋ฉด, (url : https://cafe.naver.com/cafe1535 )

 

When you move to the board you want,

 

the URL does not change, because the cafe's content is rendered inside an iframe (this is also why the crawling code below switches into the cafe_main frame).

So instead, we find the board's link by inspecting the element in the browser's developer tools.

 

There we can see that the <a> link for the board I want is /ArticleList.nhn?search.clubid=22243426&search.menuid=54&search.boardtype=L.
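For reference, the two ids can be pulled out of that copied href with the standard library:

from urllib.parse import parse_qs, urlparse

href = '/ArticleList.nhn?search.clubid=22243426&search.menuid=54&search.boardtype=L'
params = parse_qs(urlparse(href).query)

clubid = params['search.clubid'][0]  # '22243426'
menuid = params['search.menuid'][0]  # '54'
print(clubid, menuid)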

 

Crawling code

pageNum = 1
while True:

    # Load one page of the board list (50 posts per page via userDisplay=50).
    browser.get(f'{cafemenuurl}&search.page={pageNum}')

    # Cafe pages render their content inside the 'cafe_main' iframe,
    # so switch into it before reading page_source.
    browser.switch_to.frame('cafe_main')

    soup = bs(browser.page_source, 'html.parser')
    # The second 'article-board' block holds the regular posts
    # (the first holds pinned notices).
    soup = soup.find_all(class_='article-board m-tcol-c')[1]

    datas = soup.select("#main-area > div:nth-child(4) > table > tbody > tr")

    if not datas:
        break  # basic guard: stop when a page has no post rows

    for data in datas:
        article_info = data.select(".article")
        if not article_info:
            continue  # skip rows that are not post links

        article_href = article_info[0].attrs['href']
        article_href = f'{baseurl}{article_href}'

        browser.get(article_href)

        browser.switch_to.frame('cafe_main')
        article_soup = bs(browser.page_source, 'html.parser')
        data = article_soup.find('div', class_='ArticleContentBox')
        if data is None:
            continue  # deleted or inaccessible post

        article_title = data.find("h3", {"class": "title_text"})
        article_date = data.find("span", {"class": "date"})
        article_content = data.find("div", {"class": "se-main-container"})
        article_author = data.find("button", {"class": "nickname"})

        if article_title is None:
            article_title = "null"
        else:
            article_title = article_title.text.strip()

        if article_date is None:
            article_date = "null"
        else:
            # Normalize '2024.01.30. 12:34' to '2024-01-30 12:34:00'.
            article_date_str = article_date.text.strip()
            article_date_obj = datetime.strptime(article_date_str, '%Y.%m.%d. %H:%M')
            article_date = article_date_obj.strftime("%Y-%m-%d %H:%M:%S")

        if article_content is None:
            article_content = "null"
        else:
            article_content = article_content.text.strip()
            # Collapse whitespace runs (including newlines) into single spaces.
            article_content = " ".join(re.split(r"\s+", article_content, flags=re.UNICODE))

        if article_author is None:
            article_author = "null"
        else:
            article_author = article_author.text.strip()

        # Append one row per post.
        with open('test.csv', 'a', newline='', encoding="utf-8") as f:
            wr = csv.writer(f)
            wr.writerow([article_title, article_author, article_date, article_href, article_content])

        time.sleep(1)  # be gentle with the server

    pageNum += 1

browser.close()

 

I'm collecting the post title (article_title), author (article_author), post time (article_date), link (article_href), and body (article_content).

If you want more fields, just pull whatever you need out of the HTML in the same way; a small example follows.
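For instance, to also grab image URLs from a post body, a snippet like this could go inside the article loop, right after data = article_soup.find('div', class_='ArticleContentBox'). The se-main-container class is the same one the code already uses; that every body image appears as a plain <img> tag inside it is my assumption to verify in devtools:

# Hypothetical extra field: image URLs in the post body.
# Assumes 'data' is the ArticleContentBox soup from the loop above.
content_div = data.find("div", {"class": "se-main-container"})
image_urls = []
if content_div is not None:
    # Assumption: body images are plain <img src=...> tags inside the container.
    image_urls = [img.get("src") for img in content_div.find_all("img") if img.get("src")]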

 

The while loop keeps paging until a page comes back without post rows; if you need a different stopping rule, add a condition of your own, for example a date cutoff like the sketch below.
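A hypothetical stopping rule, built on the 'YYYY-MM-DD HH:MM:SS' string the loop produces (the cutoff date is an arbitrary assumption):

from datetime import datetime

CUTOFF = datetime(2024, 1, 1)  # assumption: adjust to taste

def is_too_old(article_date: str) -> bool:
    """article_date is the normalized string built in the crawl loop."""
    if article_date == "null":
        return False
    return datetime.strptime(article_date, "%Y-%m-%d %H:%M:%S") < CUTOFF

print(is_too_old("2023-12-31 10:00:00"))  # True -> break out of the while loop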

 

Output

Since my version appends the rows to a CSV file, the result looks like the following.
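To inspect the file afterwards, something like this works (the column names are mine; the script writes no header row, and the order matches the writerow() call above):

import pandas as pd

df = pd.read_csv(
    'test.csv',
    header=None,
    names=['title', 'author', 'date', 'href', 'content'],
)
print(df.head())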

 

 

 

Full code

import csv
import re
import time
from datetime import datetime

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# login
url = 'https://nid.naver.com/nidlogin.login'
user_id = 'id'        # your Naver id
user_pw = 'password'  # your Naver password

chrome_options = Options()
chrome_options.add_experimental_option("detach", True)  # keep the window open afterwards
chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])
# chrome_options.add_argument("headless")  # headless option
browser = webdriver.Chrome(options=chrome_options)
browser.get(url)

browser.implicitly_wait(2)

# Fill the form via JavaScript rather than send_keys to avoid the
# automated-input captcha.
browser.execute_script(f"document.getElementsByName('id')[0].value='{user_id}'")
browser.execute_script(f"document.getElementsByName('pw')[0].value='{user_pw}'")

browser.find_element(by=By.XPATH, value='//*[@id="log.login"]').click()
time.sleep(1)

# board url
baseurl = 'https://cafe.naver.com'

clubid = 'clubid'  # cafe id
menuid = 'menuid'  # board (menu) id within the cafe
cafemenuurl = f'{baseurl}/ArticleList.nhn?search.clubid={clubid}&search.menuid={menuid}&search.boardtype=L&userDisplay=50'

# crawl
pageNum = 1
while True:

    browser.get(f'{cafemenuurl}&search.page={pageNum}')

    # Cafe content is rendered inside the 'cafe_main' iframe.
    browser.switch_to.frame('cafe_main')

    soup = bs(browser.page_source, 'html.parser')
    # The second 'article-board' block holds the regular posts (the first holds notices).
    soup = soup.find_all(class_='article-board m-tcol-c')[1]

    datas = soup.select("#main-area > div:nth-child(4) > table > tbody > tr")

    if not datas:
        break  # basic guard: stop when a page has no post rows

    for data in datas:
        article_info = data.select(".article")
        if not article_info:
            continue  # skip rows that are not post links

        article_href = article_info[0].attrs['href']
        article_href = f'{baseurl}{article_href}'

        browser.get(article_href)

        browser.switch_to.frame('cafe_main')
        article_soup = bs(browser.page_source, 'html.parser')
        data = article_soup.find('div', class_='ArticleContentBox')
        if data is None:
            continue  # deleted or inaccessible post

        article_title = data.find("h3", {"class": "title_text"})
        article_date = data.find("span", {"class": "date"})
        article_content = data.find("div", {"class": "se-main-container"})
        article_author = data.find("button", {"class": "nickname"})

        if article_title is None:
            article_title = "null"
        else:
            article_title = article_title.text.strip()

        if article_date is None:
            article_date = "null"
        else:
            # Normalize '2024.01.30. 12:34' to '2024-01-30 12:34:00'.
            article_date_str = article_date.text.strip()
            article_date_obj = datetime.strptime(article_date_str, '%Y.%m.%d. %H:%M')
            article_date = article_date_obj.strftime("%Y-%m-%d %H:%M:%S")

        if article_content is None:
            article_content = "null"
        else:
            article_content = article_content.text.strip()
            # Collapse whitespace runs (including newlines) into single spaces.
            article_content = " ".join(re.split(r"\s+", article_content, flags=re.UNICODE))

        if article_author is None:
            article_author = "null"
        else:
            article_author = article_author.text.strip()

        with open('test.csv', 'a', newline='', encoding="utf-8") as f:
            wr = csv.writer(f)
            wr.writerow([article_title, article_author, article_date, article_href, article_content])

        time.sleep(1)  # be gentle with the server

    pageNum += 1

browser.close()

Reference

It seems everyone who tried this worked from the post below.

It's very well organized.

 

[Python] 네이버 카페 게시글 크롤러(feat. 크롬 드라이버 & 셀레니움)

๋„ค์ด๋ฒ„ ์นดํŽ˜ ๊ฒŒ์‹œ๊ธ€์˜ ์ œ๋ชฉ๊ณผ ๋งํฌ๋ฅผ ํฌ๋กค๋ง ํ•˜๊ธฐ ์œ„ํ•œ ์ฝ”๋“œ์ด๋‹ค. ์•„๋งˆ ๊ฐ€์žฅ ๊น”๋”ํ•˜๊ฒŒ ์ž˜ ๋Œ์•„๊ฐ€๋Š” ์ฝ”๋“œ์ด์ง€ ์•Š์„๊นŒ ์‹ถ๋‹ค. ๋งŽ์€ ๋ถ„๋“ค๊ป˜ ๋„์›€์ด ๋˜์—ˆ์œผ๋ฉด ํ•œ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ ํฌ๋กฌ์˜ ๋ฒ„์ „๊ณผ ํฌ๋กฌ

wookidocs.tistory.com