Python web scraping: fetching data from the web

Successfully scrape data from any website with the power of Python 3.x. About This Book: A hands-on guide to web scraping using Python with solutions to real-world problems. Create a number of different web scrapers in Python to extract information. This book includes practical examples on using the pop...

Bibliographic Details
Other Authors:	Jarmul, Katharine, author; Lawson, Richard, author
Format:	E-book
Language:	English
Published:	Birmingham : Packt, 2017.
Edition:	Second edition
Subjects:	Python (Computer program language); Computer programming.
View in the Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630208406719

Table of Contents:

Cover
Credits
Copyright
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Python 3
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Scraping versus crawling
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawlers
Advanced features
Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version
Using the requests library
Summary
Chapter 2: Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors and your Browser Console
XPath Selectors
LXML and Family Trees
Comparing performance
Scraping results
Overview of Scraping
Adding a scrape callback to the link crawler
Summary
Chapter 3: Caching Downloads
When to use caching?
Adding cache support to the link crawler
Disk Cache
Implementing DiskCache
Testing the cache
Saving disk space
Expiring stale data
Drawbacks of DiskCache
Key-value storage cache
What is key-value storage?
Installing Redis
Overview of Redis
Redis cache implementation
Compression
Testing the cache
Exploring requests-cache
Summary
Chapter 4: Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementing a multithreaded crawler
Multiprocessing crawler
Performance
Python multiprocessing and the GIL
Summary
Chapter 5: Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Debugging with Qt
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Selenium and Headless Browsers
Summary
Chapter 6: Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content
Automating forms with Selenium
Summary
Chapter 7: Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical character recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
The 9kw CAPTCHA API
Reporting errors
Integrating with registration
CAPTCHAs and machine learning
Summary
Chapter 8: Scrapy
Installing Scrapy
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Different Spider Types
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Scrapy Performance Tuning
Visual scraping with Portia
Installation
Annotation
Running the Spider
Checking results
Automated scraping with Scrapely
Summary
Chapter 9: Putting It All Together
Google search engine
Facebook
The website
Facebook API
Gap
BMW
Summary
Index
