Setting up your own product monitor and scraper
Introduction
If you've been following sneakers for a while, you've probably heard about "monitors" and "scrapers". These are tools that allow you to monitor products and be notified when they're available. In this blog post we'll be covering how each of these tools works and how you can set up your own. This is in no way a tutorial on how to code, but rather a guide to help you start building your own tools.
Key takeaways
- Setting up your own monitors and scrapers is not extremely difficult, as long as you have a basic understanding of programming.
- Using correct headers, rotating proxies, and delays will help you avoid being detected by websites that have low security standards.
- You can use your own laptop or computer to run your scripts while testing, then deploy them on a server to have them running 24/7.
What tools do you need?
- Python (you can use any language you want!)
- MongoDB (or any other database you prefer!)
- Discord
Monitoring products
What is a monitor?
A monitor is a script which checks a website for updates. It can be used to monitor a product page, a search results page, or any other page that you want to track. Monitors are used most often for tracking new product releases and restocks. Setting up your own monitor requires a few different tools, which we'll cover in the next section.
How does a monitor work?
Here's a simplified description of how a monitor works:
- Our script is connected to a database and a website.
- Every few seconds, the script checks the website for updates.
- Something updated? Save to database and notify user!
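In code, that loop boils down to something like the sketch below. This is a pseudocode-level sketch: check_website(), is_new(), save_update(), and notify_user() are placeholders, not the real functions we'll build later in this post.

```python
import time

def check_website():
    """Placeholder: fetch the latest data from the site you're watching."""
    return None

def is_new(update):
    """Placeholder: compare against what's already in your database."""
    return False

def save_update(update):
    """Placeholder: write the change to the database."""

def notify_user(update):
    """Placeholder: ping Discord (or any other channel)."""

# The entire idea of a monitor in a handful of lines
while True:
    update = check_website()        # ask the website for its latest data
    if update and is_new(update):   # did anything change since the last check?
        save_update(update)         # persist the change
        notify_user(update)         # tell someone about it
    time.sleep(5)                   # wait a few seconds before checking again
```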
Here are some examples of how a monitor can be used in the context of sneaker releases:
- When monitoring a product page, the script will check the page for product updates (such as price changes, stock updates, etc.). When the product is updated, the script will save the new data to the database and send a message to a Discord channel.
- When monitoring a search results page, the script will check the page for new products. When a new product is found, the script will save the new product to the database and send a message to a Discord channel.
The first example is a "restock monitor", alerting you when a product is back in stock. The second example is a "release monitor", alerting you when new products are availble to purchase. In this post we'll be building a simple release monitor alerting us whenever Shoe Palace releases new products online.
Scraping products
What is a scraper?
A scraper is a script which extracts data from a website. It can be used to extract images, information from product pages, or any other data that you want to extract. Scrapers are used most often for tracking new uploads (images, products, etc) before they're published for users. Setting up your own scraper requires a few different tools, which we'll cover in the next section.
How does a scraper work?
The theory behind a scraper is no different from that of a monitor. You have a script connected to a database. The script is constantly checking the website for new data. When new data is found, the script will save the new data to the database and send a message to a Discord channel.
Here are some examples of how a scraper can be used for sneaker releases:
- When scraping an image server, the script will check the server for new images. When a new image is found, the script will save the new image to the database and send a message to a Discord channel.
- When scraping product pages, the script will check the website for new products. When a new product is found, the script will save the new product to the database and send a message to a Discord channel.
The first example is a "image scraper", alerting you when a new image is uploaded to a server. The second example is a "product scraper", alerting you when a new product is uploaded to a website. In this post we'll be building a simple image scraper alerting us whenever a new JD Sports image is live.
The theory behind a monitor and a scraper is the same. The only difference is the type of data we're extracting. While monitoring is focused (you know which pages you'll monitor, or the product you want to check updates for), scraping is more general (you're constantly spraying and praying for new data, you never know what you'll find).
Building your own toolbox
Now that we've covered the basics of monitors and scrapers, let's build our own toolbox. We will be using Python as our language of choice, MongoDB as our database, and Discord to send notifications. I'm assuming you have a basic understanding of programming and will be focusing on the logic rather than the code. You can download the code used in this blog post here. If you want to run this code, you need to install the Python libraries using pip install -r requirements.txt. If you don't have pip installed, you'll most likely need to tackle that install process first before returning here.
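If you're building the project from scratch instead of downloading it, a minimal requirements.txt would simply list the three libraries covered in the next section. Treat this as an assumption; the downloadable code may pin specific versions:

```
requests
pymongo
discord-webhook
```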
Libraries? What libraries?
We use libraries in order to make our lives easier and not re-invent the wheel. Libraries are pre-written code that can be used to perform specific tasks. For example, the requests library allows us to make HTTP requests to a website. Another library we're using is pymongo, to connect to our MongoDB database. Last but not least, notifications will be sent using discord_webhook.
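To make that concrete, here's a tiny sketch showing each of the three libraries doing its job. The URL, database name, and webhook address are placeholders, not values from the real project:

```python
import requests
from pymongo import MongoClient
from discord_webhook import DiscordWebhook

# requests: fetch a page or JSON endpoint over HTTP
response = requests.get("https://example.com/products.json", timeout=10)
print(response.status_code)

# pymongo: connect to a local MongoDB instance and select a collection
client = MongoClient("mongodb://localhost:27017/")
collection = client["sneakers"]["products"]

# discord_webhook: send a simple message to a Discord channel via a webhook
webhook = DiscordWebhook(
    url="https://discord.com/api/webhooks/your-webhook-here",
    content="Hello from my monitor!",
)
webhook.execute()
```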
Keep in mind this post provides minimal examples. You can parse HTML using the beautifulsoup4 library, or use a headless browser like playwright to scrape websites, but we're not going to cover those here. This post is meant to get your gears turning - you can always learn more stuff on your own.
Building your own monitor
We will be building a monitor which checks Shoe Palace's website for new products. Shoe Palace uses Shopify (which still has a really useful JSON endpoint publicly available), therefore we don't have to worry about parsing HTML. Let's think about what the functions of our code should do:
| Function | Description |
| --- | --- |
| connect_to_mongodb() | Connects to our MongoDB database |
| fetch_products() | Fetches all products from a website |
| extract_product_data() | Parses product data for each item |
| save_to_database() | Saves product data to our database |
| send_discord_notification() | Sends notification via Discord webhook |
| main() | Logic and functions are called from here |
The logic of our code in relation to the functions should go like this:
- Connect to the database using connect_to_mongodb()
- While true, try to fetch products using fetch_products()
  - If successful, return product data
  - If unsuccessful, print an error message
- For each product, extract data using extract_product_data()
  - Return formatted product data
- For each product, check if it's new using save_to_database()
  - If the product is new, save it to the database
  - If the product is not new, skip over it
- For each new product, try to send an alert via Discord using send_discord_notification()
  - If successful, congratulate yourself!
  - If unsuccessful, print an error message
That's basically it! Now let's see what the main function looks like:
```python
def main():
    """Main monitor loop. Connects to MongoDB, fetches products,
    processes them, and sends notifications for new products."""
    # Connect to MongoDB
    print("Starting monitor...")
    db, collection = connect_to_mongodb()
    if collection is None:
        print("MongoDB connection failed. Exiting.")
        return
    print("Monitor running...")

    # Main loop
    while True:
        try:
            # Fetch products from the website
            products = fetch_products()
            if not products:
                time.sleep(DELAY_IN_SECONDS)
                continue

            # Parse and process each product
            new_products_count = 0
            for product in products:
                product_data = extract_product_data(product)

                # Check if the product is new and save it to the database
                is_new = save_to_database(collection, product_data)

                # If the product is new, send a notification via Discord
                if is_new:
                    new_products_count += 1
                    send_discord_notification(product_data)

            if new_products_count > 0:
                print(f"Found {new_products_count} new products")

            # Wait for the next iteration
            time.sleep(DELAY_IN_SECONDS)

        # Stop the loop using Ctrl+C
        except KeyboardInterrupt:
            print("Monitor stopped")
            break

        # Print the exception when an error occurs
        except Exception as e:
            print(f"Error: {e}")
            time.sleep(DELAY_IN_SECONDS)
```
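To give you an idea of what sits behind main(), here are hedged sketches of two of the helpers: fetching the Shopify JSON endpoint and the "is it new?" database check. The store URL, headers, and document fields are assumptions for illustration; the linked code is the reference.

```python
import requests

PRODUCTS_URL = "https://www.shoepalace.com/products.json"  # assumed Shopify products endpoint
HEADERS = {"User-Agent": "Mozilla/5.0"}  # a browser-like User-Agent helps avoid basic blocks

def fetch_products():
    """Fetch the product list from the Shopify JSON endpoint."""
    try:
        response = requests.get(PRODUCTS_URL, headers=HEADERS, timeout=10)
        response.raise_for_status()
        # Shopify's products.json returns {"products": [...]}
        return response.json().get("products", [])
    except requests.RequestException as e:
        print(f"Failed to fetch products: {e}")
        return []

def save_to_database(collection, product_data):
    """Insert the product if its ID hasn't been seen before. Returns True for new products."""
    if collection.find_one({"id": product_data["id"]}):
        return False
    collection.insert_one(product_data)
    return True
```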
If you want to check out what every function looks like, you can find the code here.
Building your own scraper
We will be building a scraper which checks JD Sports' website for new images. JD Sports uses a content delivery network (CDN) to serve images, and they use numerical IDs to identify each image. This makes it really easy to scrape by incrementing the ID in the URL and checking whether that image exists.
Our functions will be similar, here's a list of what they should do:
| Function | Description |
| --- | --- |
| generate_unique_string() | Generates unique cache bypass string |
| connect_to_mongodb() | Connects to our MongoDB database |
| fetch_image() | Fetches an image from the website |
| extract_image_data() | Parses image data for each image |
| save_to_database() | Saves image data to our database |
| send_discord_notification() | Sends notification via Discord webhook |
| main() | Logic and functions are called from here |
The new function you're seeing is generate_unique_string(). We're using it to bypass caching. When a website serves data from its cache, the content you see is not always the latest version. There are multiple ways to fix this issue, but the quickest one is to make the link unique each time you visit it.
```python
import random
import string

def generate_unique_string():
    """Generate a random unique string for CDN requests."""
    # Generate 20 random characters (letters and numbers)
    characters = string.ascii_uppercase + string.digits
    return ''.join(random.choice(characters) for _ in range(20))
```
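And here's a hedged sketch of how fetch_image() could use that string. The CDN URL pattern below is a placeholder, not JD Sports' real image path; the idea is simply "numerical ID in the path, unique string in the query":

```python
import requests

IMAGE_URL_TEMPLATE = "https://cdn.example.com/images/{image_id}.jpg"  # placeholder URL pattern
HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_image(image_id):
    """Fetch one image by numerical ID, bypassing the CDN cache with a unique query string."""
    url = f"{IMAGE_URL_TEMPLATE.format(image_id=image_id)}?cb={generate_unique_string()}"
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code == 200:
            return response  # image exists
        return None  # nothing at this ID (yet)
    except requests.RequestException as e:
        print(f"Request failed for ID {image_id}: {e}")
        return None
```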
The logic of our code in relation to the functions should go like this:
- Connect to the database using connect_to_mongodb()
- While true, try to fetch images using fetch_image() and generate_unique_string()
  - If successful, return image data
  - If unsuccessful, print an error message
- For each image, extract data using extract_image_data()
  - Return formatted image data
- For each image, check if it's new using save_to_database()
  - If the image is new, save it to the database
  - If the image is not new, skip over it
- For each new image, try to send an alert via Discord using send_discord_notification()
  - If successful, congratulate yourself!
  - If unsuccessful, print an error message
Scraping products or images is not rocket science. Here's what the main function looks like:
```python
def main():
    """Main scraper loop. Connects to MongoDB, fetches images,
    processes them, and sends notifications for new images."""
    # Connect to MongoDB
    print("Starting scraper...")
    db, collection = connect_to_mongodb()
    if collection is None:
        print("MongoDB connection failed. Exiting.")
        return
    print("Scraper running...")

    # Main loop
    current_id = STARTING_ID
    while True:
        try:
            # Fetch image from the website
            response = fetch_image(current_id)
            if response:
                # Image exists, extract its data
                image_data = extract_image_data(current_id, response)

                # Try to save to the database and check if it's new
                is_new = save_to_database(collection, image_data)

                # If the image is new, send a notification via Discord
                if is_new:
                    send_discord_notification(image_data)
                    print(f"Found new image: ID {current_id}")
                else:
                    # Image exists but is already in the database
                    print(f"Image already exists: ID {current_id}")
            else:
                # Image fetch was unsuccessful
                print(f"Failed to fetch image: ID {current_id}")

            # Move to the next ID
            current_id += 1

            # Reset to the starting ID when we reach the end
            if current_id > ENDING_ID:
                current_id = STARTING_ID
                print(f"Completed range {STARTING_ID} - {ENDING_ID}, starting over...")

            # Wait between requests
            time.sleep(DELAY_IN_SECONDS)

        # Stop the loop using Ctrl+C
        except KeyboardInterrupt:
            print("Scraper stopped")
            break

        # Print the exception when an error occurs
        except Exception as e:
            print(f"Error: {e}")
            time.sleep(DELAY_IN_SECONDS)
```
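For completeness, here are hedged sketches of extract_image_data() and the image version of save_to_database(). The field names are assumptions; the linked code may structure its documents differently.

```python
from datetime import datetime, timezone

def extract_image_data(image_id, response):
    """Build a small document describing the image we just found."""
    return {
        "image_id": image_id,
        "url": response.url,
        "content_type": response.headers.get("Content-Type"),
        "found_at": datetime.now(timezone.utc),
    }

def save_to_database(collection, image_data):
    """Insert the image if its ID hasn't been seen before. Returns True if it's new."""
    if collection.find_one({"image_id": image_data["image_id"]}):
        return False
    collection.insert_one(image_data)
    return True
```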
If you want to check out what every function looks like, you can find the code here.
Conclusion
You've built your own monitor and scraper! Now you can expand your toolbox to include more complex features, such as proxy support, multi-threading, or keyword filtering. You should save your code somewhere safe before making any major changes; a popular option is GitHub, which gives you version control and collaboration.
Not sure what to do now? Here are some ideas to get you started:
- Add a proxy to your requests (see the sketch below).
- Add support for rotating proxies.
- Add multiple endpoints for your script to monitor.
- Add multi-threading to monitor the endpoints concurrently.
- Add keyword filtering, and send notifications only for specific products.
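As a starting point for the first two ideas, here's a minimal sketch of routing requests through a rotating proxy. The proxy addresses are placeholders; plug in your own provider's list:

```python
import random
import requests

# Placeholder proxies; replace with real ones from your provider
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def fetch_with_proxy(url):
    """Pick a random proxy for each request so a single IP doesn't get rate-limited."""
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```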
The code referenced in this blog post was written by AI after prompting it with quotes from this blog post. The example code is not perfect in any way; it's just a proof of concept to help you get started. While you can use this as a starting point, you should also learn more about the tools you're using and how to use them properly.
Frequently asked questions
How do I run the code?
Install the dependencies with pip install -r requirements.txt. After installing the dependencies, run the code using python3 monitor.py or python3 scraper.py, depending on which script you want to run.