Posted 2018-10-26, updated 2022-12-07

Scrapy

These are notes I wrote to organize what I studied so that I can review it easily.

Scrapy is a Python library that makes it easy to crawl a large number of pages.

1. Install

Since Scrapy is a Python package, it can be installed with pip.

pip3 install scrapy

2. Practice

Imports used in this exercise:

import scrapy
import requests
from scrapy.http import TextResponse

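Use requests to fetch the page at a given URL.
Then wrap the fetched HTML in a TextResponse, passing the body text and its encoding, so that Scrapy's selector methods can be used on it.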

req = requests.get("url_name")
response = TextResponse(req.url, body=req.text, encoding="utf-8")

# Select the elements that match the XPath expression.
a = response.xpath('xpath')
# Select the text nodes of those elements.
a_text = response.xpath('xpath/text()')
# Extract the selected text nodes and return them as a list of strings.
a_text.extract()

3. Using Scrapy

(1) Creating a Scrapy project

shell command

scrapy startproject crawler

The same command run from a Jupyter notebook cell, with its output:

!scrapy startproject crawler

New Scrapy project 'crawler', using template directory '/Users/emjayahn/.pyenv/versions/3.7.0/envs/dss/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/emjayahn/Dev/DSS/TIL(markdown)/crawler

You can start your first spider with:
    cd crawler
    scrapy genspider example example.com

!tree crawler

crawler
├── crawler
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files

(2) Basic structure of Scrapy

Spider
- Defines the crawling procedure
- Declares which websites to crawl and how to crawl them
- A class that specifies which part of each web page to scrape
items.py
- Classes that define the custom data structure used to store the data the spider crawls
- Corresponds to the Model part of MVC
- Think of each field as a feature
pipelines.py
- Defines how scraped data is processed
- When the data contains Korean text, it must be written with UTF-8 encoding (encoding='utf-8')
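
A minimal sketch of an item pipeline that writes each item as one JSON line. The class name and output file name are hypothetical; open_spider, process_item, and close_spider are the hooks Scrapy has traditionally called on a pipeline. Opening the file with encoding="utf-8" and passing ensure_ascii=False keeps Korean text readable in the output instead of escaping it.

```python
import json


class JsonLinesPipeline:
    # Hypothetical output path; change as needed.
    path = "items.jl"

    def open_spider(self, spider):
        # utf-8 so that Korean (or any non-ASCII) text is written correctly.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False writes Korean text as-is rather than as \uXXXX escapes.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so later pipelines can keep processing it.
        return item

    def close_spider(self, spider):
        self.file.close()
```
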
settings.py
- Configures the details of spiders, items, and pipelines
- (e.g.) crawl rate
- (e.g.) obeying robots.txt: ROBOTSTXT_OBEY=True
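
For reference, a fragment of what such entries in settings.py might look like. The values are illustrative, and the pipeline path assumes a hypothetical JsonLinesPipeline class in crawler/pipelines.py; ROBOTSTXT_OBEY, DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN, ITEM_PIPELINES, and FEED_EXPORT_ENCODING are real Scrapy settings.

```python
# settings.py (fragment)

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Crawl rate: wait about 1 second between requests to the same site.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Enable a pipeline; the number (0-1000) sets the order, lower runs first.
# The dotted path is hypothetical and must match a class in pipelines.py.
ITEM_PIPELINES = {
    "crawler.pipelines.JsonLinesPipeline": 300,
}

# Write feed exports (e.g. `-o items.json`) as UTF-8 instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"
```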
