Web scraping with Python: scrape data from any website with the power of Python

Successfully scrape data from any website with the power of Python.
About This Book:
A hands-on guide to web scraping with real-life problems and solutions
Techniques to download and extract data from complex websites
Create a number of different web scrapers to extract information
Who This Book Is Fo...

Bibliographic Details
Other Authors:	Lawson, Richard (author)
Format:	E-book
Language:	English
Published:	Birmingham [United Kingdom] : Packt Publishing 2015.
Edition:	1st edition
Series:	Community experience distilled.
Subjects:	Python (Computer program language); Automatic data collection systems; Data mining.
View at Biblioteca Universitat Ramon Llull:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629845206719

Table of Contents:

Cover
Copyright
Credits
About the Author
About the Reviewers
www.PacktPub.com
Table of Contents
Preface
Chapter 1: Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawler
Advanced features
Summary
Chapter 2: Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors
Comparing performance
Scraping results
Overview
Adding a scrape callback to the link crawler
Summary
Chapter 3: Caching Downloads
Adding cache support to the link crawler
Disk cache
Implementation
Testing the cache
Saving disk space
Expiring stale data
Drawbacks
Database cache
What is NoSQL?
Installing MongoDB
Overview of MongoDB
MongoDB cache implementation
Compression
Testing the cache
Summary
Chapter 4: Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementation
Cross-process crawler
Performance
Summary
Chapter 5: Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Summary
Chapter 6: Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content
Automating forms with the Mechanize module
Summary
Chapter 7: Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical Character Recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
9kw CAPTCHA API
Integrating with registration
Summary
Chapter 8: Scrapy
Installation
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Visual scraping with Portia
Installation
Annotation
Tuning a spider
Checking results
Automated scraping with Scrapely
Summary
Chapter 9: Overview
Google search engine
Facebook
The website
The API
Gap
BMW
Summary
Index
