Python web scraping: fetching data from the web
Successfully scrape data from any website with the power of Python 3.x. About This Book: A hands-on guide to web scraping using Python with solutions to real-world problems. Create a number of different web scrapers in Python to extract information. This book includes practical examples on using the pop...
Other authors: | , |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham: Packt, 2017. |
Edition: | Second edition |
Subjects: | |
View in Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630208406719 |
Table of Contents:
- Cover
- Credits
- Copyright
- About the Authors
- About the Reviewers
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Introduction to Web Scraping
- When is web scraping useful?
- Is web scraping legal?
- Python 3
- Background research
- Checking robots.txt
- Examining the Sitemap
- Estimating the size of a website
- Identifying the technology used by a website
- Finding the owner of a website
- Crawling your first website
- Scraping versus crawling
- Downloading a web page
- Retrying downloads
- Setting a user agent
- Sitemap crawler
- ID iteration crawler
- Link crawlers
- Advanced features
- Parsing robots.txt
- Supporting proxies
- Throttling downloads
- Avoiding spider traps
- Final version
- Using the requests library
- Summary
- Chapter 2: Scraping the Data
- Analyzing a web page
- Three approaches to scrape a web page
- Regular expressions
- Beautiful Soup
- Lxml
- CSS selectors and your Browser Console
- XPath Selectors
- LXML and Family Trees
- Comparing performance
- Scraping results
- Overview of Scraping
- Adding a scrape callback to the link crawler
- Summary
- Chapter 3: Caching Downloads
- When to use caching?
- Adding cache support to the link crawler
- Disk Cache
- Implementing DiskCache
- Testing the cache
- Saving disk space
- Expiring stale data
- Drawbacks of DiskCache
- Key-value storage cache
- What is key-value storage?
- Installing Redis
- Overview of Redis
- Redis cache implementation
- Compression
- Testing the cache
- Exploring requests-cache
- Summary
- Chapter 4: Concurrent Downloading
- One million web pages
- Parsing the Alexa list
- Sequential crawler
- Threaded crawler
- How threads and processes work
- Implementing a multithreaded crawler
- Multiprocessing crawler
- Performance
- Python multiprocessing and the GIL
- Summary
- Chapter 5: Dynamic Content
- An example dynamic web page
- Reverse engineering a dynamic web page
- Edge cases
- Rendering a dynamic web page
- PyQt or PySide
- Debugging with Qt
- Executing JavaScript
- Website interaction with WebKit
- Waiting for results
- The Render class
- Selenium
- Selenium and Headless Browsers
- Summary
- Chapter 6: Interacting with Forms
- The Login form
- Loading cookies from the web browser
- Extending the login script to update content
- Automating forms with Selenium
- Summary
- Chapter 7: Solving CAPTCHA
- Registering an account
- Loading the CAPTCHA image
- Optical character recognition
- Further improvements
- Solving complex CAPTCHAs
- Using a CAPTCHA solving service
- Getting started with 9kw
- The 9kw CAPTCHA API
- Reporting errors
- Integrating with registration
- CAPTCHAs and machine learning
- Summary
- Chapter 8: Scrapy
- Installing Scrapy
- Starting a project
- Defining a model
- Creating a spider
- Tuning settings
- Testing the spider
- Different Spider Types
- Scraping with the shell command
- Checking results
- Interrupting and resuming a crawl
- Scrapy Performance Tuning
- Visual scraping with Portia
- Installation
- Annotation
- Running the Spider
- Checking results
- Automated scraping with Scrapely
- Summary
- Chapter 9: Putting It All Together
- Google search engine
- The website
- Facebook API
- Gap
- BMW
- Summary
- Index