I wanted to crawl posts from a Naver Cafe.
I first tried fetching them with Python requests, but all I got back was this:
<p class="notice">ํ์๋์ ์์ ํ ์๋น์ค ์ด์ฉ์ ์ํด <strong>๋น๋ฐ๋ฒํธ๋ฅผ ํ์ธํด ์ฃผ์ธ์.</strong></p>
<p class="notice02" id="layerd_notice01" style="display:none;">IP๊ฐ ์์ฃผ ๋ณ๊ฒฝ๋๋ ๋คํธ์ํฌ ํ๊ฒฝ์์ ๋ก๊ทธ์ธํ๋ ๊ฒฝ์ฐ <br/><strong>IP๋ณด์์ ํด์ </strong> ํ ์ด์ฉํ์๊ธฐ ๋ฐ๋๋๋ค.</p>
<p class="notice02">๋ค์ ํ๋ฒ <strong>๋น๋ฐ๋ฒํธ ํ์ธ</strong> ํ์๋ฉด ์ด์ฉ์ค์ธ ํ๋ฉด์ผ๋ก ๋์๊ฐ๋ฉฐ, ์์ฑ ์ค์ด๋<br/>๋ด์ฉ์ ์ ์์ ์ผ๋ก ์ ์ก ๋๋ ๋ฑ๋กํ์ค ์ ์์ต๋๋ค.</p>
The notice asks the member to re-confirm their password, which means the board cannot be reached with a plain request.
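For reference, the failed attempt looked roughly like this (a sketch, with placeholder clubid/menuid values):

import requests
from bs4 import BeautifulSoup

# Sketch of the failed requests attempt; clubid/menuid are placeholders.
url = 'https://cafe.naver.com/ArticleList.nhn?search.clubid=clubid&search.menuid=menuid'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
# Instead of the article list, the response carries the notice quoted above.
print(soup.find('p', class_='notice'))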
Given that, I decided to access the pages dynamically with Selenium.
Login
Fill in your id and password and run the code below; it completes the login. The credentials are injected with execute_script because typing them with send_keys tends to trigger Naver's automation captcha.
import time
import csv
import re
from datetime import datetime

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = 'https://nid.naver.com/nidlogin.login'
user_id = 'id'        # your Naver id
user_pw = 'password'  # your Naver password

chrome_options = Options()
chrome_options.add_experimental_option("detach", True)  # keep the window open after the script exits
chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])  # silence ChromeDriver logs
# chrome_options.add_argument("headless")  # uncomment to run without a visible window

browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
browser.implicitly_wait(2)

# Set the credentials via JavaScript rather than send_keys,
# which tends to trip Naver's automation captcha.
browser.execute_script(f"document.getElementsByName('id')[0].value='{user_id}'")
browser.execute_script(f"document.getElementsByName('pw')[0].value='{user_pw}'")
browser.find_element(by=By.XPATH, value='//*[@id="log.login"]').click()
time.sleep(1)
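If the fixed time.sleep(1) turns out to be flaky, one alternative (a sketch, not part of the original flow) is to wait explicitly until the browser leaves the login page:

from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the post-login redirect instead of a fixed sleep.
WebDriverWait(browser, 10).until(lambda d: 'nidlogin' not in d.current_url)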
HTML access
baseurl = 'https://cafe.naver.com'
clubid = 'clubid'  # placeholder: the target cafe's id
menuid = 'menuid'  # placeholder: the target board's id
cafemenuurl = f'{baseurl}/ArticleList.nhn?search.clubid={clubid}&search.menuid={menuid}&search.boardtype=L&userDisplay=50'
You need to know the clubid and menuid of the cafe you want to crawl: clubid identifies the cafe itself, and menuid identifies the board within it.
Taking the cafe below as an example (url: https://cafe.naver.com/cafe1535), moving to the board you want does not change the URL, because the board is rendered inside the cafe_main iframe. So the ids have to be dug out of the elements in the browser's developer tools.
The <a> link for the board I wanted was /ArticleList.nhn?search.clubid=22243426&search.menuid=54&search.boardtype=L, which gives both values.
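Once you have that href, the two ids can also be pulled out programmatically; a small sketch using only the standard library:

from urllib.parse import urlparse, parse_qs

# Extract clubid/menuid from a board link copied out of devtools.
href = '/ArticleList.nhn?search.clubid=22243426&search.menuid=54&search.boardtype=L'
params = parse_qs(urlparse(href).query)
clubid = params['search.clubid'][0]  # '22243426'
menuid = params['search.menuid'][0]  # '54'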
Crawling code
i = 0
while True:
    pageNum = i + 1
    browser.get(f'{cafemenuurl}&search.page={pageNum}')
    browser.switch_to.frame('cafe_main')  # the board is rendered inside the cafe_main iframe

    soup = bs(browser.page_source, 'html.parser')
    soup = soup.find_all(class_='article-board m-tcol-c')[1]  # [0] is typically the pinned-notice board
    datas = soup.select("#main-area > div:nth-child(4) > table > tbody > tr")

    for data in datas:
        article_info = data.select(".article")
        article_href = f"{baseurl}{article_info[0].attrs['href']}"
        browser.get(article_href)
        browser.switch_to.frame('cafe_main')

        article_soup = bs(browser.page_source, 'html.parser')
        data = article_soup.find('div', class_='ArticleContentBox')
        article_title = data.find("h3", {"class": "title_text"})
        article_date = data.find("span", {"class": "date"})
        article_content = data.find("div", {"class": "se-main-container"})
        article_author = data.find("button", {"class": "nickname"})

        if article_title is None:
            article_title = "null"
        else:
            article_title = article_title.text.strip()

        if article_date is None:
            article_date = "null"
        else:
            # normalize '2024.01.01. 12:34' into '2024-01-01 12:34:00'
            article_date_str = article_date.text.strip()
            article_date_obj = datetime.strptime(article_date_str, '%Y.%m.%d. %H:%M')
            article_date = article_date_obj.strftime("%Y-%m-%d %H:%M:%S")

        if article_content is None:
            article_content = "null"
        else:
            # collapse every run of whitespace into a single space
            article_content = article_content.text.strip()
            article_content = " ".join(re.split(r"\s+", article_content, flags=re.UNICODE))

        if article_author is None:
            article_author = "null"
        else:
            article_author = article_author.text.strip()

        with open('test.csv', 'a+', newline='', encoding="utf-8") as f:
            csv.writer(f).writerow([article_title, article_author, article_date, article_href, article_content])

        browser.back()
        time.sleep(1)

    i += 1  # advance to the next page (the original never incremented i, so it re-crawled page 1 forever)

browser.close()  # only reached once a stop condition is added to the loop
This collects the post title (article_title), post author (article_author), post time (article_date), post link (article_href), and post body (article_content).
If you want to collect anything else, pull the extra fields out of the HTML the same way.
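For example, any other element on the article page can be grabbed with one more find call and the same null guard. The class name below is purely hypothetical; look up the real selector in devtools first.

# 'CommentBox' is an assumed class name -- verify the actual selector in devtools.
extra = article_soup.find('div', {'class': 'CommentBox'})
extra = "null" if extra is None else extra.text.strip()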
Also, the while loop as written never terminates; add a stop condition that fits your use case.
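One way to bound it (a sketch; max_pages is a made-up knob) is to cap the page count and stop early when a page comes back empty:

max_pages = 10  # hypothetical upper bound; tune it to your board
for pageNum in range(1, max_pages + 1):
    browser.get(f'{cafemenuurl}&search.page={pageNum}')
    browser.switch_to.frame('cafe_main')
    soup = bs(browser.page_source, 'html.parser')
    datas = soup.select("#main-area > div:nth-child(4) > table > tbody > tr")
    if not datas:  # ran past the last page of the board
        break
    # ... process the rows exactly as in the loop body above ...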
Output
The data is appended to a CSV file, so the result can be viewed as shown below.
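To sanity-check the output, the file can be loaded back with pandas. The column names here are assumptions that simply match the writerow order above.

import pandas as pd

cols = ['title', 'author', 'date', 'href', 'content']  # assumed names matching the writerow order
df = pd.read_csv('test.csv', names=cols)
print(df.head())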
Full code
import time
import csv
import re
from datetime import datetime

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# --- login ---
url = 'https://nid.naver.com/nidlogin.login'
user_id = 'id'        # your Naver id
user_pw = 'password'  # your Naver password

chrome_options = Options()
chrome_options.add_experimental_option("detach", True)  # keep the window open after the script exits
chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])  # silence ChromeDriver logs
# chrome_options.add_argument("headless")  # uncomment to run without a visible window

browser = webdriver.Chrome(options=chrome_options)
browser.get(url)
browser.implicitly_wait(2)

# JavaScript injection avoids Naver's automation captcha on send_keys.
browser.execute_script(f"document.getElementsByName('id')[0].value='{user_id}'")
browser.execute_script(f"document.getElementsByName('pw')[0].value='{user_pw}'")
browser.find_element(by=By.XPATH, value='//*[@id="log.login"]').click()
time.sleep(1)

# --- target board ---
baseurl = 'https://cafe.naver.com'
clubid = 'clubid'  # placeholder: the target cafe's id
menuid = 'menuid'  # placeholder: the target board's id
cafemenuurl = f'{baseurl}/ArticleList.nhn?search.clubid={clubid}&search.menuid={menuid}&search.boardtype=L&userDisplay=50'

# --- crawl ---
i = 0
while True:
    pageNum = i + 1
    browser.get(f'{cafemenuurl}&search.page={pageNum}')
    browser.switch_to.frame('cafe_main')  # the board is rendered inside the cafe_main iframe

    soup = bs(browser.page_source, 'html.parser')
    soup = soup.find_all(class_='article-board m-tcol-c')[1]  # [0] is typically the pinned-notice board
    datas = soup.select("#main-area > div:nth-child(4) > table > tbody > tr")

    for data in datas:
        article_info = data.select(".article")
        article_href = f"{baseurl}{article_info[0].attrs['href']}"
        browser.get(article_href)
        browser.switch_to.frame('cafe_main')

        article_soup = bs(browser.page_source, 'html.parser')
        data = article_soup.find('div', class_='ArticleContentBox')
        article_title = data.find("h3", {"class": "title_text"})
        article_date = data.find("span", {"class": "date"})
        article_content = data.find("div", {"class": "se-main-container"})
        article_author = data.find("button", {"class": "nickname"})

        if article_title is None:
            article_title = "null"
        else:
            article_title = article_title.text.strip()

        if article_date is None:
            article_date = "null"
        else:
            # normalize '2024.01.01. 12:34' into '2024-01-01 12:34:00'
            article_date_str = article_date.text.strip()
            article_date_obj = datetime.strptime(article_date_str, '%Y.%m.%d. %H:%M')
            article_date = article_date_obj.strftime("%Y-%m-%d %H:%M:%S")

        if article_content is None:
            article_content = "null"
        else:
            # collapse every run of whitespace into a single space
            article_content = article_content.text.strip()
            article_content = " ".join(re.split(r"\s+", article_content, flags=re.UNICODE))

        if article_author is None:
            article_author = "null"
        else:
            article_author = article_author.text.strip()

        with open('test.csv', 'a+', newline='', encoding="utf-8") as f:
            csv.writer(f).writerow([article_title, article_author, article_date, article_href, article_content])

        browser.back()
        time.sleep(1)

    i += 1  # advance to the next page

browser.close()  # only reached once a stop condition is added to the loop
Reference
It seems most people work from the post below; it is very well organized.
[Python] ๋ค์ด๋ฒ ์นดํ ๊ฒ์๊ธ ํฌ๋กค๋ฌ(feat. ํฌ๋กฌ ๋๋ผ์ด๋ฒ & ์ ๋ ๋์) (wookidocs.tistory.com)