Web scraping with Python: scrape data from any website with the power of Python
Successfully scrape data from any website with the power of Python. About This Book: A hands-on guide to web scraping with real-life problems and solutions; techniques to download and extract data from complex websites; create a number of different web scrapers to extract information. Who This Book Is Fo...
Other authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Birmingham [United Kingdom] : Packt Publishing, 2015. |
Edition: | 1st edition |
Series: | Community experience distilled. |
Subjects: | |
View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629845206719 |
Table of Contents:
- Cover
- Copyright
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Table of Contents
- Preface
- Chapter 1: Introduction to Web Scraping
- When is web scraping useful?
- Is web scraping legal?
- Background research
- Checking robots.txt
- Examining the Sitemap
- Estimating the size of a website
- Identifying the technology used by a website
- Finding the owner of a website
- Crawling your first website
- Downloading a web page
- Retrying downloads
- Setting a user agent
- Sitemap crawler
- ID iteration crawler
- Link crawler
- Advanced features
- Summary
- Chapter 2: Scraping the Data
- Analyzing a web page
- Three approaches to scrape a web page
- Regular expressions
- Beautiful Soup
- Lxml
- CSS selectors
- Comparing performance
- Scraping results
- Overview
- Adding a scrape callback to the link crawler
- Summary
- Chapter 3: Caching Downloads
- Adding cache support to the link crawler
- Disk cache
- Implementation
- Testing the cache
- Saving disk space
- Expiring stale data
- Drawbacks
- Database cache
- What is NoSQL?
- Installing MongoDB
- Overview of MongoDB
- MongoDB cache implementation
- Compression
- Testing the cache
- Summary
- Chapter 4: Concurrent Downloading
- One million web pages
- Parsing the Alexa list
- Sequential crawler
- Threaded crawler
- How threads and processes work
- Implementation
- Cross-process crawler
- Performance
- Summary
- Chapter 5: Dynamic Content
- An example dynamic web page
- Reverse engineering a dynamic web page
- Edge cases
- Rendering a dynamic web page
- PyQt or PySide
- Executing JavaScript
- Website interaction with WebKit
- Waiting for results
- The Render class
- Selenium
- Summary
- Chapter 6: Interacting with Forms
- The Login form
- Loading cookies from the web browser
- Extending the login script to update content
- Automating forms with the Mechanize module
- Summary
- Chapter 7: Solving CAPTCHA
- Registering an account
- Loading the CAPTCHA image
- Optical Character Recognition
- Further improvements
- Solving complex CAPTCHAs
- Using a CAPTCHA solving service
- Getting started with 9kw
- 9kw CAPTCHA API
- Integrating with registration
- Summary
- Chapter 8: Scrapy
- Installation
- Starting a project
- Defining a model
- Creating a spider
- Tuning settings
- Testing the spider
- Scraping with the shell command
- Checking results
- Interrupting and resuming a crawl
- Visual scraping with Portia
- Installation
- Annotation
- Tuning a spider
- Checking results
- Automated scraping with Scrapely
- Summary
- Chapter 9: Overview
- Google search engine
- The website
- The API
- Gap
- BMW
- Summary
- Index