An excellent technical blog usually consists of interesting content that readers want to read. However, finding a topic is not easy. It requires a lot of searching and reading across many sources. Even if you come up with great ideas, there is no way of knowing how your readers will react, or whether your idea suits your target audience.
However, finding a topic for an article does not always have to be this hard. You can use techniques like scraping to go through thousands of technical blogs (in a few minutes) and create a database full of technical content. Later, you can use this data for predictive topic generation, or even compile analytical reports on what type of content performs well over specific periods.
The uses of scraped blog data are limitless. Blog scrapers help you plan content effectively, quickly, and efficiently, allowing you to design intuitive, exciting, and engaging topics in seconds. So, in this article, I will show you how to build your own blog scraper in five steps.
Building a blog scraper is not technically challenging. However, it is recommended to use Python, since it offers third-party libraries that help parse DOM elements and create spreadsheets to store data. Therefore, this article will focus on building a simple technical blog scraper using Python.
Step 01 – Creating a Virtual Environment
Since Python applications rely on third-party dependencies for scraping, a virtual environment should be used. Therefore, create a virtual environment by executing the command below.
python3 -m venv venv
After executing the command, a new directory titled
venv gets created in the project directory. Next, activate the virtual environment using the command shown below.
source venv/bin/activate
(On Windows, run venv\Scripts\activate instead.) After executing the command, the environment's name will appear in your terminal prompt. This indicates that your virtual environment has been activated successfully.
Figure: Activating the virtual environment
Step 02 – Installing the Required Libraries
After creating and activating your virtual environment, you need to install two third-party libraries. These libraries will help scrape data from web pages.
requests: The requests library will be used to perform HTTP requests.
beautifulsoup4: The Beautiful Soup library will be used to extract information from web pages.
To install the two libraries, run the two commands displayed below.
python -m pip install requests
python -m pip install beautifulsoup4
After installing, you will see the output shown below.
Figure: Installing the required libraries
Step 03 – Analyzing the Blog to Scrape
You can now start on your scraping script. For demonstration purposes, this guide will show you how to implement a script that can scrape a Medium publication.
First, it is essential to identify a reusable URL that can be used to scrape any publication on Medium. Luckily, Medium exposes an archive URL for every publication. It fetches a list of all the articles a publication has published since it was created. The generic URL for archived content is shown below.
https://medium.com/<publication-name>/archive
For example, you can view a list of all content published on Enlear Academy using the URL https://medium.com/enlear-academy/archive. It will display the output shown below.
Figure: Viewing the archived content of Enlear Academy
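Since every publication follows the same URL pattern, you can build the archive URL for any publication from its slug. Below is a minimal sketch; the archive_url helper is my own naming, not part of any Medium API.

```python
def archive_url(publication_slug: str) -> str:
    # build the archive URL for a Medium publication;
    # the slug is the publication's name as it appears in its URL,
    # e.g. 'enlear-academy'
    return f"https://medium.com/{publication_slug}/archive"

print(archive_url("enlear-academy"))
# https://medium.com/enlear-academy/archive
```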
This blog scraper will use the generic archive URL to fetch a list of all technical content published on the blog, then collect attributes such as the article title, subtitle, number of claps, reading time, and URL.
You can extract this information by inspecting the HTML of a Medium content card, as shown below.
Figure: Identifying CSS classes and HTML elements to target
All Medium content cards are wrapped in a div containing the CSS classes
streamItem streamItem--postPreview js-streamItem. Therefore, you can use Beautiful Soup to get a list of all
div elements having the specified classes and extract the list of articles on the archive page.
Step 04 – Implementing the Scraper
Create a file titled
scraper.py where the code for the scraper will live. First, add the three imports shown below.
from bs4 import BeautifulSoup  # import BeautifulSoup
import requests  # import requests for HTTP calls
import json  # import json for storing the data
The requests library will be used to perform a
GET request to the archives of the Medium publication.
# create a request to the archive page
blog_archive_url = 'https://medium.com/enlear-academy/archive'
response = requests.get(blog_archive_url)
Then, the text returned by the response must be parsed into HTML using Beautiful Soup's HTML parser, as shown below:
# parse the response using BeautifulSoup's HTML parser
parsedHtml = BeautifulSoup(response.text, 'html.parser')
Then, the stories can be queried by performing a DOM operation to fetch all the
div elements that have the class list
streamItem streamItem--postPreview js-streamItem.
# get a list of all divs having the classes "streamItem streamItem--postPreview js-streamItem" to get each story
stories = parsedHtml.find_all('div', class_='streamItem streamItem--postPreview js-streamItem')
Afterwards, we can iterate over each story in the
stories list and obtain important meta information such as the article title and subtitle, number of claps, reading time, and URL.
formatted_stories = []

for story in stories:
    # get the title of the story
    story_title = story.find('h3').text if story.find('h3') else 'N/A'
    # get the subtitle of the story
    story_subtitle = story.find('h4').text if story.find('h4') else 'N/A'

    # get the number of claps
    clap_button = story.find('button', class_='button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents')
    claps = 0
    if clap_button:
        # if the clap button has a DOM reference, obtain its text
        claps = clap_button.text

    # get a reference to the card header containing author information
    author_header = story.find('div', class_='postMetaInline u-floatLeft u-sm-maxWidthFullWidth')
    # access the reading time span element and get its title attribute
    reading_time = author_header.find('span', class_='readingTime')['title'] if author_header else 'N/A'

    # get the "read more" reference
    read_more_ref = story.find('a', class_='button button--smaller button--chromeless u-baseColor--buttonNormal')
    url = read_more_ref['href'] if read_more_ref else 'N/A'

    # add an object to formatted_stories
    formatted_stories.append({
        'title': story_title,
        'subtitle': story_subtitle,
        'claps': claps,
        'readingTime': reading_time,
        'url': url
    })
The scraping script above iterates over each story and performs five tasks:
1. It obtains the article title using the h3 element in the card.
2. It obtains the article subtitle using the h4 element in the card.
3. It obtains the number of claps using the clap button on the card. The script executes a query to find a button with the class list button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents and then uses its text attribute to get a count of total claps for the article.
4. It obtains the reading time by accessing the card header. The card header can be accessed by performing a query to find the div element having the CSS class list postMetaInline u-floatLeft u-sm-maxWidthFullWidth. Then, a subsequent query is done to find a span element with the class readingTime to obtain the reading time for the article.
5. Finally, the script obtains the article URL by accessing the "Read More" link located on each card. The script searches for anchor (a) elements with the class list button button--smaller button--chromeless u-baseColor--buttonNormal and uses the href attribute on the result.
After the DOM queries have obtained all the elements, the data is structured into a JSON object and pushed into the array named formatted_stories.
Finally, the array is written to a JSON file using Python's built-in file handling, as shown below.
# write the formatted stories to a JSON file
with open('stories.json', 'w') as file:
    json.dump(formatted_stories, file, indent=2)
Step 05 – Viewing the Script in Action
After executing the script written in Step 04, the following output is generated.
Figure: Viewing the scraped blog data
That's it! You have successfully implemented a blog scraper for technical blogs on Medium using a simple Python script. In addition, you can improve the code to extract more data using other DOM elements and CSS classes.
Finally, you can push this data into a data lake on AWS and create analytical dashboards to help identify trending or least preferred content (based on clap count) to help plan your next article.
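Before reaching for a full data lake, you can also rank the scraped stories locally. The sketch below assumes the stories.json file produced in Step 04; since Medium abbreviates large clap counts (for example '1.2K'), the parse_claps helper is an assumption about the formats you may encounter.

```python
def parse_claps(value):
    # convert a clap value such as '1.2K' or '87' to an integer;
    # Medium abbreviates large counts, so handle the 'K' suffix
    text = str(value).strip()
    if not text:
        return 0
    if text.endswith('K'):
        return int(float(text[:-1]) * 1000)
    return int(float(text))

def top_stories(stories, n=5):
    # return the n stories with the highest clap counts
    return sorted(stories, key=lambda s: parse_claps(s['claps']), reverse=True)[:n]

# small inline sample; in practice, load stories.json with json.load
sample = [
    {'title': 'Intro to Scraping', 'claps': '1.2K'},
    {'title': 'Virtual Environments', 'claps': '87'},
]
print(top_stories(sample, 1)[0]['title'])
# Intro to Scraping
```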
Although blog scrapers help gather a vast amount of data for topic planning, there are two main drawbacks, since we depend heavily on the user interface.
You cannot get more data
You only have access to the data available in the user interface (what you see is what you get).
Changes to the UI
If the site you are scraping makes significant UI changes, such as altering the CSS classes or the HTML elements, the scraping code will most likely break because the elements can no longer be identified.
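One way to soften the second drawback is to guard every DOM lookup so that a redesign produces placeholder values instead of crashing the whole run. Below is a minimal sketch; the safe_text helper is my own addition, not part of Beautiful Soup, and it works with any object that exposes a find method.

```python
def safe_text(parent, tag, class_list=None, default='N/A'):
    # return the text of the first matching element, or a default
    # when the element no longer exists (e.g. after a site redesign)
    if parent is None:
        return default
    element = parent.find(tag, class_=class_list) if class_list else parent.find(tag)
    return element.text if element is not None else default
```

In the Step 04 loop, story_title = safe_text(story, 'h3') would then replace the inline conditional, and a changed class list degrades to 'N/A' rather than raising an AttributeError.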
Therefore, it is important to be mindful of these drawbacks when implementing a blog scraper.
Blog scrapers significantly improve content creation and drive the content planning industry. They help content managers plan content across multiple iterations with minimal time and effort. The code implemented in this article is available in my GitHub repository.
I hope you found this article helpful. Thank you for reading.