Friday, 7 July 2017

Web Scraping

  • When performing data science tasks, it’s common to want to use data found on the internet. 
  • You’ll usually be able to access this data in CSV format, or via an Application Programming Interface (API). 
  • However, there are times when the data you want can only be accessed as part of a web page. 
  • In cases like this, you’ll want to use a technique called web scraping. 
Web Scraping:
  • getting the data from the web page into a format you can work with in your analysis.


  • We are going to use Python as our scraping language, together with a simple and powerful library, BeautifulSoup.


pip install beautifulsoup4 lxml

(The PyPI package name is case-insensitive, so pip install BeautifulSoup4 also works; lxml is the parser used in the examples below.)



Before we start jumping into the code, let’s understand the basics of HTML and some rules of scraping.

<!DOCTYPE html>  ----> type declaration
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>


This is the basic syntax of an HTML webpage. Each <tag> defines a block inside the webpage:

1. <!DOCTYPE html>: HTML documents must start with a type declaration.

2. The HTML document is contained between <html> and </html>.

3. The meta information and script declarations of the HTML document go between <head> and </head>.

4. The visible part of the HTML document is between the <body> and </body> tags.

5. Headings are defined with the <h1> through <h6> tags.

6. Paragraphs are defined with the <p> tag.

  • Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
  • Also, HTML tags sometimes come with id or class attributes. 
  • The id attribute specifies a unique id for an HTML tag, and its value must be unique within the HTML document. 
  • The class attribute is used to define common styles for HTML tags with the same class. We can make use of these ids and classes to help us locate the data we want, as shown in the sketch below.
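
As a quick sketch of locating data by id and class with BeautifulSoup; the tag names and attribute values here are made up for the example:

import bs4 as bs

# Made-up HTML with id and class attributes
html = '''
<div id="prices" class="section">
    <p class="price">10.99</p>
    <p class="price">5.49</p>
</div>
'''

soup = bs.BeautifulSoup(html, 'lxml')
print(soup.find(id='prices'))                 # the unique tag with id="prices"
for p in soup.find_all('p', class_='price'):  # every <p> tag with class="price"
    print(p.text)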

Scraping Rules:
  • You should check a website’s Terms and Conditions before you scrape it. 
  • Be careful to read the statements about legal use of data. 
  • Usually, the data you scrape should not be used for commercial purposes.
  • Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. 
  • Make sure your program behaves in a reasonable manner (i.e. acts like a human). 
  • One request for one webpage per second is good practice (see the sketch after this list).
  • The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
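
A minimal sketch of the one-request-per-second rule, assuming a list of page URLs to fetch:

import time
import urllib.request

urls = [
    'https://pythonprogramming.net/parsememcparseface/',
    # ... more pages ...
]

for url in urls:
    source = urllib.request.urlopen(url).read()
    # parse `source` with BeautifulSoup here
    time.sleep(1)  # pause one second between requests so the site is not overloaded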
Example:1

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(soup)  # the entire parsed document

Example:2

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(source)  # the raw bytes returned by the server, before parsing

Example:3

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(soup.title)  # the <title> tag, markup included

Example:4

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(soup.title.string)  # just the text inside the <title> tag

Example:5

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(soup.title.text)  # same result via the .text attribute

Example:6

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(soup.p)  # the first <p> tag in the document

Example:7

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(soup.find_all('p'))  # a list of every <p> tag in the document

Example:8

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for paragraph in soup.find_all('p'):
    print(paragraph)  # each <p> tag, markup included

Example:9

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for paragraph in soup.find_all('p'):
    print(paragraph.string)  # .string is None when a tag contains child tags

Example:10

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for paragraph in soup.find_all('p'):
    print(paragraph.text)  # .text gathers text from the tag and all its children

Example:11

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
print(soup.get_text())  # all visible text in the page

Example:12

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for url in soup.find_all('a'):
    print(url.text)  # the link text of every <a> tag

Example:13

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for url in soup.find_all('a'):
    print(url.get('href'))  # the href attribute (the link target) of every <a> tag

Example:14

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
nav = soup.nav  # the first <nav> tag
print(nav)

Example:15

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
nav = soup.nav
for url in nav.find_all('a'):
    print(url.get('href'))  # link targets within the navigation bar only

Example:16

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)  # paragraph text within <body> only

Example:17

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for div in soup.find_all('div'):
    print(div.text)  # the text of every <div> tag

Example:18

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for div in soup.find_all('div', class_='body'):  # class_ avoids Python's class keyword
    print(div.text)

Example:19

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.table  # the first <table> tag
print(table)

Example:20

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.find('table')  # equivalent to soup.table
print(table)

Example:21

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]  # the text of each cell in the row
    print(row)

Example:22

import pandas as pd

# pandas parses every <table> on the page into a list of DataFrames
dfs = pd.read_html('https://pythonprogramming.net/parsememcparseface/')
for df in dfs:
    print(df)

Example:23

import pandas as pd

# header=0 treats the first table row as the column header
dfs = pd.read_html('https://pythonprogramming.net/parsememcparseface/', header=0)
for df in dfs:
    print(df)

Example:24

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/sitemap.xml').read()
soup = bs.BeautifulSoup(source, 'xml')  # the 'xml' parser for XML documents such as sitemaps
print(soup)

Example:25

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/sitemap.xml').read()
soup = bs.BeautifulSoup(source, 'xml')
for url in soup.find_all('loc'):
    print(url.text)  # each page URL listed in the sitemap



Advanced Scraping Techniques:

  • BeautifulSoup is simple and great for small-scale web scraping, but if you are interested in scraping data at a larger scale, you should consider alternatives such as Scrapy, a full scraping framework for Python.
  • Try to integrate your code with public APIs where they exist; the efficiency of data retrieval is much higher than scraping webpages (see the sketch after this list). 
  • For example, take a look at the Facebook Graph API, which can help you get data that is not shown directly on Facebook web pages.
  • Consider using a database backend like MySQL to store your data when it gets too large.
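
A minimal sketch of the API approach against a hypothetical JSON endpoint; the URL and the 'posts'/'title' field names are made up for illustration, and a real API documents its own:

import json
import urllib.request

# Hypothetical endpoint; substitute the real API's URL and parameters
url = 'https://api.example.com/v1/posts?limit=10'
response = urllib.request.urlopen(url).read()
data = json.loads(response)  # the API returns structured JSON, so no HTML parsing is needed

for post in data['posts']:   # 'posts' and 'title' are assumed field names
    print(post['title'])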




