A little experiment to scrape WebVR sites. Getting started -> using Scrapy on Windows.
Scrapy is a Python tool, so first download the current Python (3.7.1): https://www.python.org/downloads/
On the Windows command prompt, Python is accessed via “py” rather than “python”, and pip was only available at C:\Users\[Your Name]\AppData\Local\Programs\Python\Python36-32\Scripts.
Add that Scripts folder to the Windows environment variables (PATH) so pip can be run from anywhere.
Install virtualenv using pip so each project is standalone with all of its necessary libraries.
pip install virtualenv
Create a new virtual environment.
virtualenv vr_scrape
New vr_scrape directory is now in the location where command was executed.
(Good summary of the roles of pip, virtualenv and the activate script here: https://www.dabapps.com/blog/introduction-to-pip-and-virtualenv-python/)
Activate the new virtual environment (vr_scrape) in the shell:
[path to vr_scrape environment directory]\Scripts\activate.bat
Install Scrapy into this (currently active) environment:
pip install scrapy==1.5.1
And then if you just type “scrapy”:
Scrapy 1.5.1 – no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use “scrapy <command> -h” to see more info about a command
To create a new project called “A_initial”:
scrapy startproject A_initial
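For orientation, startproject generates roughly this skeleton (layout as of Scrapy 1.5; exact file names may vary between versions):

```
A_initial/
    scrapy.cfg            # deploy configuration file
    A_initial/            # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spiders (like the one below) go here
            __init__.py
```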
Scrapy also seemed to need this package installed:
pip install pypiwin32
For editing I used Sublime Text 3, which required this new build system file:
{
    "shell_cmd": ["C:/Users/[USERNAME]/vr_scrape/Scripts/python.exe", "$file"],
    "selector": "source.python",
    "file_regex": "file \"(...*?)\", line ([0-9]+)"
}
And then this “hello world” spider to scrape BBC news…
import scrapy

class BbcNews(scrapy.Spider):
    # identity
    name = "bbcnews"

    # requests
    def start_requests(self):
        urls = [
            "https://www.bbc.co.uk/news/world/africa",
            "https://www.bbc.co.uk/news/world/asia",
            "https://www.bbc.co.uk/news/world/australia",
            "https://www.bbc.co.uk/news/world/europe",
            "https://www.bbc.co.uk/news/world/latin_america",
            "https://www.bbc.co.uk/news/world/middle_east",
            "https://www.bbc.co.uk/news/world/us_and_canada"
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)  # callback will parse the response

    # response
    def parse(self, response):
        region = response.url.split("https://www.bbc.co.uk/news/world/")[1]
        # response.url will be the same as the request url...
        # splitting it by "https://www.bbc.co.uk/news/world/"
        # gives ["", "africa"], and index 1 (2nd item in the list) is the region
        _file = "{}.html".format(region)
        with open(_file, "wb") as f:  # wb = write bytes... response.body returns bytes
            f.write(response.body)
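The region extraction in parse() can be checked in isolation; a quick sketch of what str.split returns for one of the request URLs:

```python
# Splitting a section URL on the common prefix leaves ["", "<region>"];
# index 1 is the region name, which becomes the output filename.
prefix = "https://www.bbc.co.uk/news/world/"
url = "https://www.bbc.co.uk/news/world/africa"

parts = url.split(prefix)
print(parts)  # ['', 'africa']

region = parts[1]
filename = "{}.html".format(region)
print(filename)  # africa.html
```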
Run this command (from inside the project directory) to scrape it, using the spider name defined above:
scrapy crawl bbcnews
Which then dumped out these files: