Updated · Feb 11, 2024
Muninder Adavelli is a core team member and Digital Growth Strategist at Techjury.
April is a proficient content writer with a knack for research and communication.
Wikipedia is the world’s largest online encyclopedia, holding over 59.7 million articles in different languages and topics. All pages are free, making knowledge more accessible. This is why scrapers see the site as a “treasure trove of information.”
However, manually extracting data from multiple Wikipedia pages takes a lot of work. Going through lengthy articles can take forever before you can get the necessary information. Fortunately, there is a solution: APIs.
Discover how to extract data from Wikipedia using an API in this article. Dive in!
🔑 Key Takeaways
Getting data from Wikipedia can be challenging and tedious for scrapers due to the heavy volume of pages on the site. That is why most of them automate the data extraction process to save time.
The good thing is that Wikipedia has its own API to help with your data extraction projects. It is free and easy to use. The following sections discuss the prerequisites and the steps for using the Wikipedia API to extract data.
Read on.
📝 Note: Using an API is different from web scraping. While both are data extraction methods, an API provides a structured way to access specific data, while web scraping collects data directly from rendered pages. The two methods have distinct advantages depending on project needs.
Before you start extracting Wikipedia using an API, make sure you have the following prerequisites:
📝 Note: Python 2.7.9+ and Python 3.4+ come with pip pre-installed.
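Before running any of the examples, you can confirm the setup from Python itself. This is a minimal sketch; the Requests library is the only third-party dependency the examples below assume.

```python
import importlib.util

# Check that the one third-party library these examples use is importable.
missing = [m for m in ("requests",) if importlib.util.find_spec(m) is None]
if missing:
    print("Install with pip:", ", ".join(missing))
else:
    print("All prerequisites found")
```

If anything is listed as missing, `pip install requests` (or `python -m pip install requests`) installs it.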
💡 Did You Know? The English Wikipedia is the largest Wikipedia edition, holding around 6.8 million articles. It releases an average of 542 new articles daily. Following it is the Cebuano Wikipedia, which has 6.1 million articles.
Here is an illustration of the general coding process for extracting data from Wikipedia using Python and the Wikipedia API:
There are different ways to extract data from Wikipedia since its API has numerous modules. The code for each method depends on what data you want to extract.
Below are guides on how to extract three types of data from Wikipedia using the API in Python:
You can get the gist of any Wikipedia page by extracting its abstract. An abstract gives you a preview of the topic, its key points, and other relevant ideas. Doing so reduces the tedious work of reading lengthy articles.
Below are the steps to extract the abstract of any Wikipedia article:
1. Import the Requests library.
import requests
2. Define the subject.
subject = 'Web scraping'
3. Call the endpoint to access Wikipedia.
url = 'https://en.wikipedia.org/w/api.php'
🗒️ Note: To access Wikipedia pages and other Wikimedia projects, use the "api.php" endpoint. It is the script that reads API requests and sends back the response.
4. Set the parameters.
params = { 'action': 'query', 'format': 'json', 'titles': subject, 'prop': 'extracts', 'exintro': True, 'explaintext': True, }
5. Initiate an HTTP GET request to the Wikipedia API using the set parameters.
response = requests.get(url, params=params)
6. Parse the response data as JSON.
data = response.json()
7. Iterate over each page in the response.
for page in data['query']['pages'].values():
8. Display the extracted data on the terminal or console. For this sample, limit it to 227 characters.
    print(page['extract'][:227])
🗒️ Note: To display all the text on the terminal or console, print the full extract: print(page['extract'])
Final Code
Consolidate all the code. Your final code should look like this:
import requests

subject = 'Web scraping'
url = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'format': 'json',
    'titles': subject,
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True,
}

response = requests.get(url, params=params)
data = response.json()

for page in data['query']['pages'].values():
    print(page['extract'][:227])
Here is the scraped abstract using the code above:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser.
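To see why the loop above calls `.values()`, it helps to look at the shape of the JSON the query returns. Here is a trimmed, canned version of such a response (the page ID is illustrative): the page IDs are dictionary keys, so the loop iterates over the values to reach each page record.

```python
# A trimmed, canned response for the abstract query (page ID illustrative).
data = {
    "query": {
        "pages": {
            "38020162": {
                "pageid": 38020162,
                "title": "Web scraping",
                "extract": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.",
            }
        }
    }
}

# Page IDs are the keys, so iterate over the values to reach each page dict.
extracts = [page["extract"][:227] for page in data["query"]["pages"].values()]
print(extracts[0])
```

This structure is also why the code works unchanged if you request several titles at once: each page simply appears as another entry under "pages".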
The Wikipedia API also lets you extract the number of pages in a Wikipedia category. Knowing the number of pages lets you gauge the depth of the available information about a particular topic.
In addition, it helps researchers see how data is spread across various fields. Here's how to use Wikipedia API to get the number of pages in a category:
1. Import the Requests library.
import requests
2. Define the topic.
subject = 'Web scraping'
3. Call the endpoint to access Wikipedia.
url = 'https://en.wikipedia.org/w/api.php'
4. Set the parameters.
params = { 'action': 'query', 'format': 'json', 'titles': f'Category: {subject}', 'prop': 'categoryinfo' }
💡 Did You Know? Formatted string literals (or f-strings) embed Python expressions inside string literals. They do not exist in Python 2 and were introduced in Python 3.6.
5. Initiate an HTTP GET request to the Wikipedia API.
response = requests.get(url, params=params)
6. Parse the response data as JSON.
data = response.json()
7. Iterate over the pages in the response.
for page, pages in data['query']['pages'].items():
8. Display the extracted data on the terminal or console. If no data is available, it prints "Invalid".
    try:
        print(pages["title"] + " has " + str(pages["categoryinfo"]["pages"]) + " pages.")
    except Exception:
        print("Invalid")
Final Code
Consolidate all the code. Your final code should look like this:
import requests

subject = 'Web scraping'
url = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'format': 'json',
    'titles': f'Category: {subject}',
    'prop': 'categoryinfo'
}

response = requests.get(url, params=params)
data = response.json()

for page, pages in data['query']['pages'].items():
    try:
        print(pages["title"] + " has " + str(pages["categoryinfo"]["pages"]) + " pages.")
    except Exception:
        print("Invalid")
The code will produce a result like this:
Category: Web scraping has 31 pages.
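The try/except in the loop guards against pages that lack a "categoryinfo" key. A canned sketch of the response (numbers illustrative) shows the path the code walks to reach the page count:

```python
# A trimmed, canned response for the categoryinfo query (numbers illustrative).
data = {
    "query": {
        "pages": {
            "35809763": {
                "title": "Category: Web scraping",
                "categoryinfo": {"pages": 31, "subcats": 2, "files": 0},
            }
        }
    }
}

lines = []
for _, pages in data["query"]["pages"].items():
    try:
        lines.append(pages["title"] + " has " + str(pages["categoryinfo"]["pages"]) + " pages.")
    except Exception:
        # A missing "categoryinfo" key (e.g. a nonexistent category) lands here.
        lines.append("Invalid")
print("\n".join(lines))
```

If the title does not resolve to a real category, the response carries no "categoryinfo" at all, and the except branch prints "Invalid" instead of crashing.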
Besides the abstract and pages in a category, you can also extract the related topics from any Wikipedia article. Knowing the associated concepts will help you better understand your main topic. It will give you a better view of the relationship between your subject and other concepts.
Follow the steps below to extract the related topics from a Wikipedia page:
1. Import the Requests library for HTTP queries.
import requests
2. Define the subject.
subject = 'Web scraping'
3. Call the endpoint to access Wikipedia.
url = 'https://en.wikipedia.org/w/api.php'
4. Set the parameters to get the links for the defined topic.
params = { 'action': 'query', 'format': 'json', 'list': 'search', 'srsearch': subject }
5. Initiate an HTTP GET request and parse the response data as JSON.
response = requests.get(url, params=params)
data = response.json()
6. Iterate over each search result.
for titles in data['query']['search']:
7. Display the extracted data on the terminal or console.
    try:
        print(titles['title'])
    except Exception:
        print("Invalid")
Final Code
Consolidate all the code. Your final code should look like this:
import requests

subject = 'Web scraping'
url = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'format': 'json',
    'list': 'search',
    'srsearch': subject
}

response = requests.get(url, params=params)
data = response.json()

for titles in data['query']['search']:
    try:
        print(titles['title'])
    except Exception:
        print("Invalid")
Running the code above will give you a result like this:
Web scraping
Data scraping
Web crawler
Contact scraping
Beautiful Soup (HTML parser)
Alternative data (finance)
HiQ Labs v. LinkedIn
Scrape
Proxy server
List of web testing tools
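Unlike the earlier queries, the search results come back as a list under data['query']['search'] rather than a dict keyed by page ID. A canned sketch (truncated to three results for brevity) shows the shape:

```python
# A trimmed, canned response for the list=search query (page IDs omitted).
data = {
    "query": {
        "search": [
            {"title": "Web scraping"},
            {"title": "Data scraping"},
            {"title": "Web crawler"},
        ]
    }
}

# Search results are already a list, so no .values() or .items() is needed.
titles = [result["title"] for result in data["query"]["search"]]
print("\n".join(titles))
```

The real API returns ten results per request by default; the `srlimit` parameter raises or lowers that count.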
Given Wikipedia's extensive database, users and scrapers flock to the website each day, generating over 25 billion page views in a month.
Extracting data adds to the traffic, so it is important to maintain ethical data extraction. You can do that by monitoring your extracting activities and implementing the following best practices:
- Limit your requests and be considerate. Scrape data at a reasonable rate in controlled requests to avoid being flagged as a possible DDoS attack. Excessive requests can also cause congestion and take down a site.
- In December 2023, Wikipedia garnered 10.7 billion page views from desktop and 14.6 billion from mobile. Such numbers create heavy traffic. When extracting data, minimize the traffic by requesting multiple items in a single request.
- If you have already sent a request, wait for it to finish before sending a new one.
- Minimize high edit rates, and ensure that any edits are credible and of high quality. Remember that Wikipedia has millions of active users; an unrestrained number of revisions can cause the servers to lag.
- When extracting data from Wikipedia, always give credit where credit is due. Although the data is free with no other requirement, it is best practice to reference the borrowed content.
- For apps using data from Wikipedia, authenticate the requests using OAuth 2.0 client credentials or the authorization code flow. Authentication provides a secure method for logging in to a Wikipedia or Wikimedia account.
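The first two etiquette points (limit your rate, send requests serially) can be wrapped into a small helper. This is a sketch only: `polite_get` is a hypothetical name, and the pause value is an assumed polite delay, not an official Wikipedia limit. The demo swaps in a stub so no live HTTP call is made.

```python
import time

def polite_get(get, url, params, delay=0.1):
    # Send one request, then pause before the caller can send the next.
    # `get` is any callable with the requests.get signature; `delay` is an
    # assumed polite pause for illustration, not an official rate limit.
    response = get(url, params=params)
    time.sleep(delay)
    return response

# Demo with a stub in place of a live HTTP call.
calls = []
def fake_get(url, params=None):
    calls.append((url, params))
    return {"ok": True}

start = time.monotonic()
polite_get(fake_get, 'https://en.wikipedia.org/w/api.php', {'action': 'query'})
elapsed = time.monotonic() - start
print(len(calls), elapsed >= 0.09)
```

Because each call blocks until the pause finishes, requests are naturally serialized: a new one cannot start before the previous one completes.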
Wikipedia is one of the most visited sites on the Internet. It is a massive repository of knowledge on different topics. That’s why it is a popular site for data extraction.
Manual data extraction is tedious and challenging due to the millions of pages on the website. However, the Wikipedia API makes the data extraction process automated and efficient.
Although Wikipedia data is free and accessible, practicing ethical data extraction is still necessary. Avoid sending multiple requests simultaneously, and always set references for the content.
The query action fetches structured information about a wiki and its stored data. On the other hand, the parse action returns the rendered content of a single page.
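The difference shows up directly in the request URL. A sketch using only the standard library (no request is actually sent; the parameter sets mirror the examples earlier in the article):

```python
from urllib.parse import urlencode

base = 'https://en.wikipedia.org/w/api.php'

# action=query: structured data about one or more pages.
query_url = base + '?' + urlencode({'action': 'query', 'format': 'json',
                                    'titles': 'Web scraping', 'prop': 'extracts'})

# action=parse: the rendered content of a single page.
parse_url = base + '?' + urlencode({'action': 'parse', 'format': 'json',
                                    'page': 'Web scraping', 'prop': 'text'})

print(query_url)
print(parse_url)
```

Note that query takes `titles` (plural, pipe-separated) while parse takes a single `page`, which reflects the one-page scope of parse.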
For personal requests, the API limit is 5,000 requests per hour. For anonymous requests, it is 500 requests per hour per IP address.