Eugen Barilyuk | Published: 6 July 2025
One summer day I was listening to a news piece on the global economy, and noticed that I was nodding along when the host started talking about inflation. While inflation is absolutely artificial (excess money does not drop from the sky; it is printed by the government to cover its overspending), for ordinary people inflation is like a forest fire that burns their finances down to nothing.
The most interesting part of inflation is that while it is nationwide, every person experiences it individually. So when a central bank states inflation at 3%, what does that mean? For everyday life, almost nothing, because each product has its own inflation rate. Bread prices may be rising at 5% per year while car prices rise at 1% per year. On average this gives 3% inflation. But you buy a car once every several years, and you buy bread every few days. Therefore, you will feel inflation close to 5%.
Knowing that the official inflation level is as useful as a comb for a bald man, I got curious what the real inflation is for products consumed every day. I immediately googled this, only to find that searching "product inflation tracker" gives shady results: official inflation figures, some useless indexes (like the Big Mac Index), or calculators that want the user to provide the inflation level themselves.
Immediately an idea sprang up: grocery stores exist, and they publish prices for products. Simply visit a store today, tomorrow, and the day after tomorrow, write down the price, and calculate how it changed.
How hard can it be, right? Reality says - wrong!
At first glance, the task seemed easy peasy. Simply track a product's page in a store and grab its price. With a table of prices and the dates they were collected, it's easy to calculate the inflation rate, build a nice visual chart, and do any other fancy data analysis.
Since a product may disappear from sale, I decided not to track a particular product but the cheapest product in its category - for example, the cheapest bread. This makes sense, as people will continue buying bread anyway. It's the price that matters, not the brand or product name.
Having established the basics, an approach emerged: visit each store's product search page sorted by price, grab the title and price of the cheapest item in the category, and record them with a timestamp.
And that opened the hell of HTML layouts.
At first, the plan was honest and civilized: use the good old BeautifulSoup - a Python library as gentle and friendly as its name. I fed it some HTML from the first store, and it dutifully fetched the price like a golden retriever bringing a stick. Success! This is going to be easy, I thought. A classic setup:
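Something along these lines - the URL and selector here are illustrative placeholders, not the actual store:

import requests
from bs4 import BeautifulSoup

# Fetch a product search page and parse it (URL and selector are made up)
response = requests.get("https://store.example.com/search?q=bread", timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# Grab the first element that carries the price
price_tag = soup.select_one('[data-automation-id="product-price"]')
if price_tag:
    print(price_tag.get_text(strip=True))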
At that point, my noble produce inflation tracker quietly died. Not with a bang, but with a CAPTCHA. Almost every store now demands that you prove you're human - simply for the honor of reading the price tag on a loaf of bread. It's as if prices are state secrets and shoppers are undercover agents. You just want to know how much eggs cost, and suddenly you're solving puzzles for an algorithm that doesn't believe in your existence.
When the next store on the tracker's list returned empty soup full of nothing except a CAPTCHA wall, I escalated. The AI advised bringing out the big guns: Firefox, a real browser with a real user profile, controlled by Selenium. Driven by code, grocery stores opened their pages to a proper citizen of the Internet.
def fetch_page_selenium(url):
    firefox_options = FirefoxOptions()
    firefox_options.binary_location = r"d:\...firefox.exe"
    firefox_options.add_argument('--profile')
    firefox_options.add_argument(r'd:\...\firefox-for-selenium')
    firefox_options.set_preference("javascript.enabled", True)
    firefox_options.set_preference("general.useragent.override", "Mozilla/5.0 ... Firefox/115.0")
    ...
    driver.get(url)
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    ...
I impersonated Firefox 115.0 on Windows 10, disabled navigator.webdriver, accepted insecure TLS certificates, and pretended to be less secure than I actually am - all to read the price of a loaf of bread. Meanwhile, real humans browse the same site with a dozen tabs open and two ad blockers, and still get through fine. Irony?
But it didn't matter. Pages would load normally - well, most of the time. A very real human behind this code, with his AI helper, had to keep thinking about how to wrap the code in layers of fake humanity, just to convince another machine that he's not a machine.
Meanwhile, the first set of data was carefully placed into the inflation tracker's database.
Having almost won the bot-check battle, I ran straight into an HTML layout nightmare.
When I tried to recall any supermarket chain, the Walmart brand popped into my head. Pure randomness, but it already hinted that this project would make me sweat.
I started by googling Walmart's website, got walmart.com, and immediately stumbled on a question: which country is this page for? Walmart does not highlight it (or I'm too blind to see it). Prices carry a $ sign, and the units of measurement are imperial. I assumed I was on Walmart USA. But honestly, to this day I don't know which country the page is designed for.
Anyway, I navigated to https://www.walmart.com/search?q=bread&sort=price_low and located the HTML tags with the product title and product price. That immediately gave me a major headache.
You see, web scraping tools find data on a page by CSS selectors - for example, data-automation-id="product-price". This selector is from Walmart's page, and you may assume it contains the product price. You would be right, but actually wrong (by the way, only now, while writing this article, have I actually noticed that this DIV does contain the price).
I spotted that the price is also provided outside of this DIV block, in a span tag. Not a problem - use its span class="w_iUH7" selector! But who knew that exactly the same selector is also used for the product title.
This resulted in code that copied the product title as the product price.
After some talking with the AI, updated code emerged that extracted the data fields correctly. Hooray?
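The fix, in essence, was to disambiguate the shared selector by its text content. A minimal sketch, relying on the "current price" prefix that appears in the Walmart price span (shown in the configuration later in this article):

from bs4 import BeautifulSoup

def extract_walmart_price(page_html):
    """Among spans sharing class w_iUH7, keep the one that is a price."""
    soup = BeautifulSoup(page_html, "html.parser")
    for span in soup.find_all("span", class_="w_iUH7"):
        text = span.get_text(strip=True)
        if text.lower().startswith("current price"):
            return text[len("current price"):].strip()
    return None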
After taming the chaos of duplicated CSS selectors - where both the product title and price were parsed as one - I thought I'd earned some peace. Wrong. The moment I exhaled, I fell straight into the next trap of HTML chaos: CSS selectors with unique content per item.
At first glance, such a layout looks clean and easy to fetch and parse. Then you realize: attribute values like id="21616092_EA" are not stable. Every product has its own unique ID embedded.
Needless to say, my naive, hand-crafted, splintery script - forged lovingly with AI and blind optimism - had no idea how to navigate this mess. It broke faster than a politician’s promise under scrutiny.
So back I went to negotiations with the artificial intelligence. This time with a new agenda: teach the parser that some CSS selectors are variable. We - mostly the AI - rewired the logic to work with variable attributes.
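The core idea - not the project's exact check_attributes_match(), just a sketch of the technique - is to skip attributes whose values vary per item when comparing a template element to a page element:

# Attributes whose values change per product; compared loosely or ignored
VARIABLE_ATTRS = {"id", "data-product-id"}

def attributes_match(template_element, page_element):
    """Match elements by tag name and stable attributes only."""
    if template_element.name != page_element.name:
        return False
    for attr, value in template_element.attrs.items():
        if attr in VARIABLE_ATTRS:
            continue  # ignore per-item values like id="21616092_EA"
        if page_element.attrs.get(attr) != value:
            return False
    return True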
The code started to reliably fetch data from HTML layouts with variable selectors, giving me faint hope that tomorrow the HTML won't mutate... at least not fast enough to break the code.
OK, now we're rocking the data collection - time to add another store. The code works flawlessly, so scraping the next store should cause no problems.
And yes, there were no problems grabbing the product title in this store. But price grabbing failed for a simple reason: the website designer decided to build the HTML layout with nested tags.
This nested design appeared again much later, in the HTML layout of another store that definitely was not connected with the previous one - it operates in another country almost 1500 kilometers away. Moreover, this store had even more severe HTML tag nesting, causing significant data duplication.
In both cases the result was duplicated data: the same text was extracted several times, once from the nested tags and again from their parent container.
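Here is a toy reproduction of the effect, with hypothetical markup in the spirit of those stores:

from bs4 import BeautifulSoup

# Nested tags sharing one class: the container repeats its children's text
html = '<div class="price"><span class="price">4</span><span class="price">99</span></div>'
soup = BeautifulSoup(html, 'html.parser')
print([el.get_text() for el in soup.select('.price')])
# -> ['499', '4', '99']: naive extraction over all matches duplicates the data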
OK, another round of negotiations with the AI, and then one more, and one more. Several days later, correctly working code with the improved check_attributes_match() and extract_data_from_template() functions was produced.
def extract_data_from_template(template_lines, page_html):
    """Extract data from page using template"""
    template_html = '\n'.join(template_lines)
    #print(f"Template HTML: {template_html}")

    # Fix malformed template HTML - if it starts with attributes, add opening tag
    if template_html.strip().startswith('class=') or template_html.strip().startswith('data-'):
        template_html = '<div ' + template_html + '>'

    template_soup = BeautifulSoup(template_html, 'html.parser')
    page_soup = BeautifulSoup(page_html, 'html.parser')

    extracted_parts = []
    processed_elements = []

    for template_element in template_soup.find_all():
        for matching_element in page_soup.find_all(template_element.name):
            if matching_element in processed_elements:
                continue
            if not check_attributes_match(template_element, matching_element):
                continue

            if len(template_element.find_all()) > 0:
                # Complex nested template - get full text from container
                extracted_text = matching_element.get_text(strip=True)
                extracted_text = ' '.join(extracted_text.split())  # Clean whitespace
            else:
                # Simple template - use targeted extraction
                extracted_text = extract_text_from_element(matching_element, template_element)

            if extracted_text:
                ##print(f"Extracted text: '{extracted_text}'")
                extracted_parts.append(extracted_text)
                processed_elements.append(matching_element)

    # Combine all parts and remove duplicates while preserving order
    final_parts = []
    for part in extracted_parts:
        if part not in final_parts:
            final_parts.append(part)

    # Prices split across tags come out as separate integer parts - re-join them
    if len(final_parts) == 3 and final_parts[0].isdigit() and final_parts[1].isdigit():
        final_result = f"{final_parts[0]}.{final_parts[1]} {final_parts[2]}"
    elif len(final_parts) == 2 and final_parts[0].isdigit() and final_parts[1].isdigit():
        final_result = f"{final_parts[0]}.{final_parts[1]}"
    else:
        final_result = ' '.join(final_parts)

    #print(f"Final result: '{final_result}'")
    return final_result
What emerged is a web scraping and data extraction system designed to monitor product prices across multiple international retail websites. The system employs a template-based approach to HTML parsing, using Selenium WebDriver to control a real browser for dynamic content and BeautifulSoup for HTML parsing. The architecture supports multi-store, multi-country price tracking with built-in inflation rate calculations and standardized unit conversions.
The system's flexibility stems from its configuration-driven approach. The store_config.txt file defines scraping templates for each retail store, containing structured data blocks that specify how to extract product information from different website layouts.
The main execution flow orchestrates all system components in a coordinated sequence, processing each store configuration and extracting data for both cheapest and most expensive product variants.
The system iterates through each store configuration, processing both the cheapest and most expensive variants when URLs are available. For each variant, the system fetches the page, extracts the product title and price using the templates, processes the data, and stores the result in the database.
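In outline, the loop looks something like this - helper names other than fetch_page_selenium() and extract_data_from_template() are illustrative:

# Sketch of the main execution flow under the assumptions above
for store in parse_store_config("store_config.txt"):
    for variant in ("cheapest", "most_expensive"):
        url = store["URLS"].get(variant)
        if not url:
            continue  # this variant is not configured for the store
        page_html = fetch_page_selenium(url)
        title = extract_data_from_template(store["TITLE"], page_html)
        price = extract_data_from_template(store["PRICE"], page_html)
        save_price_sample(store, variant, title, price)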
Each store configuration follows a standardized format with five primary sections. Let's examine the Walmart configuration as an example:
STORE = Walmart
COUNTRY = USA
PRODUCT = Bread
TITLE = [
<span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">FFF</span>
]
PRICE = [
<span class="w_iUH7">current price FFF</span>
]
CURRENCY_MAP = ["$": "USD"]
URLS = [
cheapest: https://www.walmart.com/search?q=bread&sort=price_low
most_expensive:
]
The "FFF" placeholder acts as a dynamic content marker, indicating where the actual product data should be extracted from the webpage. This templating approach allows the system to handle diverse HTML structures across different retailers.
The configuration parser processes each store block by identifying key sections and extracting the relevant information.
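Below is a minimal sketch of such a parser, assuming the exact block format shown above; the real implementation may differ:

def parse_store_config(path):
    """Parse store_config.txt into a list of store dictionaries (sketch)."""
    stores, store, section = [], {}, None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("STORE ="):
                if store:
                    stores.append(store)          # close the previous block
                store = {"STORE": line.split("=", 1)[1].strip()}
                section = None
            elif line.endswith("= ["):            # open a multi-line section
                section = line.split("=", 1)[0].strip()
                store[section] = {} if section == "URLS" else []
            elif line == "]":                     # close the section
                section = None
            elif section == "URLS":
                variant, _, url = line.partition(":")
                store[section][variant.strip()] = url.strip()
            elif section:
                store[section].append(line)       # keep template lines verbatim
            elif "=" in line:
                key, value = line.split("=", 1)
                store[key.strip()] = value.strip()
    if store:
        stores.append(store)
    return stores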
To efficiently store price tracking data across multiple dimensions, the system implements a normalized relational database structure using SQLite.
The PriceSample table serves as the central repository, linking stores and product types while maintaining historical price data with calculated inflation rates.
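A plausible schema matching that description - exact table and column names in the real project may differ:

import sqlite3

conn = sqlite3.connect("price_tracker.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Store (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    country TEXT NOT NULL,
    UNIQUE (name, country)
);
CREATE TABLE IF NOT EXISTS ProductType (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS PriceSample (
    id INTEGER PRIMARY KEY,
    store_id INTEGER NOT NULL REFERENCES Store(id),
    product_type_id INTEGER NOT NULL REFERENCES ProductType(id),
    variant TEXT NOT NULL,              -- 'cheapest' or 'most_expensive'
    sample_date TEXT NOT NULL,          -- ISO date of the scrape
    price REAL NOT NULL,
    currency TEXT,
    price_per_unit REAL,
    inflation_rate REAL,
    UNIQUE (store_id, product_type_id, variant, sample_date)
);
""")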
The system employs a dual-approach web scraping strategy: Selenium WebDriver for JavaScript-heavy sites, with a fallback to plain HTML fetching for simplicity.
The Selenium implementation uses Firefox WebDriver with extensive customization to handle modern web security measures and anti-bot detection systems:
def fetch_page_selenium(url):
    firefox_options = FirefoxOptions()
    firefox_options.binary_location = r"d:\...firefox.exe"
    firefox_options.set_capability("moz:webdriverClick", False)
    firefox_options.set_preference("dom.webdriver.enabled", False)
    firefox_options.set_preference("webdriver_accept_untrusted_certs", True)
The core extraction mechanism operates through a template matching algorithm that compares HTML elements from the configuration templates with actual webpage content.
check_attributes_match() locates HTML elements in the scraped page that match the template specifications by comparing tag names, attributes, and content patterns.
extract_data_from_template() processes template definitions to extract relevant data from webpage content, handling both simple and complex nested HTML structures.
The template matching process involves several steps: parsing the template HTML, locating candidate page elements with the same tag name, comparing attributes while treating per-item values (such as embedded product IDs) as wildcards, and finally extracting and deduplicating the text content.
The extracted raw data undergoes comprehensive processing to standardize formats, calculate derived values, and ensure data quality before database insertion.
The extract_price_info function handles the complex task of extracting numerical price values from diverse string formats across different locales and currencies:
def extract_price_info(price_string, currency_map):
    ...
    # price_clean is price_string with currency symbols stripped (elided above)
    # Handle European decimal format (comma as decimal separator)
    if ',' in price_clean and '.' not in price_clean:
        price_clean = price_clean.replace(',', '.')
    elif ',' in price_clean and '.' in price_clean:
        # Handle format like "1.234,56" (European thousands separator)
        if price_clean.rfind(',') > price_clean.rfind('.'):
            price_clean = price_clean.replace('.', '').replace(',', '.')
The system processes the different decimal and thousands separators used across various countries, ensuring accurate price extraction regardless of regional formatting conventions. For example, "4,99" is normalized to 4.99, and "1.234,56" to 1234.56.
Product package information extraction employs regular expressions to identify and parse size and unit data from product titles:
def extract_package_info(title_string):
    patterns = [
        r'(\d+(?:[,\.]\d+)?)\s*(oz|lb|g|kg|ml|l|fl oz|г|мл|л|кг)\b',
        r'(\d+(?:[,\.]\d+)?)\s*(ounce|pound|gram|kilogram|liter|litre)\b'
    ]
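Applied to a hypothetical title, the first pattern works like this:

import re

title = "Great Value White Bread, 20 oz"   # illustrative product title
match = re.search(r'(\d+(?:[,\.]\d+)?)\s*(oz|lb|g|kg|ml|l|fl oz|г|мл|л|кг)\b', title)
if match:
    size = float(match.group(1).replace(',', '.'))
    unit = match.group(2)                  # -> size=20.0, unit='oz'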
The system maintains unit conversion tables to standardize measurements across different regional systems:
UNIT_PATTERNS = {
    "oz": 0.0283495,
    "fl oz": 0.0295735,
    "lb": 0.453592,
    "kg": 1.0,
    "g": 0.001,
    "l": 1.0,
    "ml": 0.001,
    "г": 0.001,   # Cyrillic gram
    "мл": 0.001,  # Cyrillic milliliter
    "л": 1.0      # Cyrillic liter
}
The system calculates standardized price-per-unit values to enable meaningful price comparisons across different package sizes and measurement systems:
calculate_price_per_unit(price_number, package_size, package_unit, currency) converts package sizes to standard units (kg for weight, liters for volume) and calculates normalized price-per-unit values for comparison purposes.
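A minimal sketch of such a calculation, reusing the UNIT_PATTERNS table above (the real function may handle more edge cases):

def calculate_price_per_unit(price_number, package_size, package_unit, currency):
    """Normalize the price to a standard unit (kg for weight, liter for volume)."""
    factor = UNIT_PATTERNS.get(package_unit)
    if factor is None or not package_size:
        return None                        # unknown unit or missing size
    standard_size = package_size * factor  # package size in kg or liters
    return price_number / standard_size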
The system incorporates temporal price change tracking by calculating inflation rates based on historical price data stored in the database. This feature enables tracking of price changes over time for each product-store combination.
def calculate_inflation_rate(current_price, previous_price):
    if previous_price is None or previous_price == 0:
        return 0.0
    inflation_rate = ((current_price - previous_price) / previous_price) * 100
    return inflation_rate
The inflation calculation retrieves the most recent price record for each product-store-variant combination and computes the percentage change from the previous measurement. This enables tracking of price trends and identifying price fluctuations.
The system implements comprehensive error handling and data validation to ensure data integrity and reliability.
The system validates that both product title and price information are present before attempting database insertion, preventing incomplete records from corrupting the dataset.
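A sketch of the kind of pre-insertion check described above (the function name is illustrative):

def is_valid_sample(title, price_number):
    """Reject incomplete records before they reach the database."""
    return bool(title) and price_number is not None and price_number > 0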
The system implements a data management strategy using SQLite with proper transaction handling and data integrity constraints.
Store records are inserted with an upsert pattern to ensure stores are uniquely identified while maintaining referential integrity in the database schema. Product type records use the same upsert pattern, ensuring consistent product categorization across all stores.
The database operations use INSERT OR REPLACE statements to handle potential duplicate entries gracefully, ensuring that the latest price information overwrites previous entries for the same date and product combination.
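Continuing the schema sketch from earlier, such an upsert might look like this (column names are assumptions, not necessarily the project's):

# INSERT OR REPLACE relies on the UNIQUE constraint to overwrite same-day rows
conn.execute(
    """INSERT OR REPLACE INTO PriceSample
       (store_id, product_type_id, variant, sample_date,
        price, currency, price_per_unit, inflation_rate)
       VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
    (1, 1, "cheapest", "2025-07-06",   # ids from the upserted Store/ProductType rows
     4.99, "USD", 8.80, 0.0),
)
conn.commit()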
The system incorporates several performance optimizations to handle price monitoring efficiently.
The system architecture facilitates easy extension and maintenance through several design patterns:
Adding new stores requires only configuration file updates without code modifications. The template-based approach allows the system to adapt to different HTML structures through configuration changes.
Each major operation is encapsulated in dedicated functions with clear responsibilities, making the system easier to debug, test, and extend. The separation of concerns between web scraping, data processing, and database operations enables independent modification of each component.
While the system provides robust price monitoring capabilities, several limitations should be considered: store HTML layouts mutate without notice and require template updates, CAPTCHA walls and anti-bot systems can block scraping entirely, and tracking the cheapest product in a category means the underlying item may change between samples.
The Product Price Parser represents an approach to automated price monitoring, combining flexible configuration management with robust data processing capabilities. The system's architecture balances complexity with maintainability, providing a scalable foundation for international price tracking. Through its template-based extraction engine, comprehensive data processing pipeline, and normalized database structure, the system enables efficient monitoring of product prices across diverse retail environments.
The implementation demonstrates advanced web scraping techniques, proper database design principles, and thoughtful error handling strategies. The system's modular design and configuration-driven approach ensure that it can adapt to changing requirements and expanding monitoring scope without requiring significant architectural modifications.
Get the full project code: https://github.com/Eb43/product-inflation-tracker