webvr scrape

A little experiment to scrape WebVR sites.

Getting started: using Scrapy on Windows.

Scrapy is a Python tool, so first download the current Python (3.7.1): https://www.python.org/downloads/

Python is accessed via "py" (not "python") at the command prompt, and pip was only available at C:\Users\[Your Name]\AppData\Local\Programs\Python\Python36-32\Scripts.

Set the Windows environment variables (PATH) so pip can be run from anywhere.

Install virtualenv using pip so each project is standalone with all the libraries it needs.

pip install virtualenv

Create a new virtual environment.

virtualenv vr_scrape

A new vr_scrape directory now exists in the location where the command was executed.

(Good summary of the roles of pip, virtualenv and the activate script here: https://www.dabapps.com/blog/introduction-to-pip-and-virtualenv-python/)

Activate the shell so it uses the new virtual environment (vr_scrape):

[path to vr_scrape environment directory]\Scripts\activate.bat

Install scrapy into this environment (currently active) in shell:

pip install scrapy==1.5.1

And then if you just type “scrapy”:

Scrapy 1.5.1 – no active project

Usage:
scrapy <command> [options] [args]

Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

[ more ] More commands available when run from project directory

Use “scrapy <command> -h” to see more info about a command

To create a new project called “A_initial”:

scrapy startproject A_initial
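For reference (not captured in the original notes), startproject generates a standard skeleton; on Scrapy 1.5 it looks roughly like this, with spiders living in the inner spiders/ folder:

```
A_initial/
    scrapy.cfg            # deploy/config file
    A_initial/            # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules go here
            __init__.py
```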

Scrapy also seemed to need this package installed:

pip install pypiwin32

Edited the spider in Sublime Text 3, which required this new build system file:

{
    "shell_cmd": ["C:/Users/[USERNAME]/vr_scrape/Scripts/python.exe", "$file"],
    "selector": "source.python",
    "file_regex": "file \"(...*?)\", line ([0-9]+)"
}

And then this “hello world” spider to scrape BBC news…

import scrapy

class BbcNews(scrapy.Spider):
    # identity
    name = "bbcnews"

    # requests
    def start_requests(self):
        urls = [
            "https://www.bbc.co.uk/news/world/africa",
            "https://www.bbc.co.uk/news/world/asia",
            "https://www.bbc.co.uk/news/world/australia",
            "https://www.bbc.co.uk/news/world/europe",
            "https://www.bbc.co.uk/news/world/latin_america",
            "https://www.bbc.co.uk/news/world/middle_east",
            "https://www.bbc.co.uk/news/world/us_and_canada"
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)  # callback will parse the response

    # response
    def parse(self, response):
        region = response.url.split("https://www.bbc.co.uk/news/world/")[1]
        # response.url will be the same as the request url...
        # splitting by "https://www.bbc.co.uk/news/world/" gives ["", "africa"],
        # and index 1 (2nd item in the list) is the region
        _file = "{}.html".format(region)
        with open(_file, "wb") as f:  # "wb" = write bytes... response.body returns bytes
            f.write(response.body)
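The URL-splitting trick in parse() can be sanity-checked outside Scrapy; a minimal sketch of the same logic on a plain string (no Scrapy needed):

```python
# Same region-extraction logic as in parse(), on a plain string:
# splitting on the full prefix leaves ["", "africa"], so index 1 is the region.
url = "https://www.bbc.co.uk/news/world/africa"
region = url.split("https://www.bbc.co.uk/news/world/")[1]
print(region)  # africa

# And the same filename formatting used for the output file:
print("{}.html".format(region))  # africa.html
```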

Then this command (run inside the project directory) scrapes it…

scrapy crawl bbcnews

Which then dumped out an HTML file for each region (africa.html, asia.html, and so on).
