Eugen Barilyuk

Published: 6 July 2025


Produce Inflation Tracker vs. HTML Madness: A Journey Through the Chaotic Web Layouts of Grocery Websites

Before you begin reading, a note: this is not a tech article. This is a survival log. A descent into the uncanny valley of e-commerce HTML. A tale of optimism murdered by CSS. A chronicle of how the simple act of wanting to know how much bread costs turned into a digital psychodrama. If you think building a price tracker in 2025 is easy, you must let go of your hallucinations and hope.

One summer day I was listening to a news piece on the global economy, and noticed that I was nodding along when the host started talking about inflation. While inflation is absolutely artificial (excess money does not drop from the sky; it is printed by the government to cover its overspending), for ordinary people inflation is like a forest fire that burns their finances down to nothing.

The most interesting part of inflation is that while it is nationwide, every person experiences their own individual inflation. So when a central bank states that inflation is 3%, what does that mean? It means nothing for everyday life, because each product has its own inflation rate. Bread prices may be inflating at 5% per year, car prices at 1% per year. On average this gives 3% inflation. But you buy a car once every several years, and you buy bread every few days. Therefore, you will feel inflation close to 5%.

Knowing that the official inflation level is as useful as a comb for a bald man, I got curious what real inflation is for products consumed every day. I immediately googled this, only to find that searching for "product inflation tracker" gives shady results: official inflation figures, some useless indexes (like the Big Mac Index), or calculators that expect the user to supply the inflation level themselves.

An idea immediately sprouted: grocery stores exist, and they publish prices for products. Simply visit a store today, tomorrow, and the day after tomorrow, write down the price, and calculate how the price changed.

How hard can it be, right? Reality says - wrong!

The simple idea of collecting data from websites

At first glance, the task was an easy-peasy one. We simply need to track a product's page in a store and grab its price. With a table of prices and the dates they were collected, it's easy to calculate the inflation rate, build a nice visual chart, and do any other fancy data-analysis stuff.

Since a product may disappear from sale, I decided not to track a particular product, but to track the cheapest product in its category - for example, the cheapest bread. This makes sense, as people will continue buying bread anyway. It's the price that matters, not the brand or product name.

Having established the basics, an approach emerged:

  1. Pick a store and a product category (for example, bread).
  2. Open the store's search page for that category, sorted by price ascending.
  3. Grab the title and price of the cheapest item.
  4. Save them to a database together with the date.
  5. Repeat regularly and calculate the price change.

And that opened the hell of HTML layouts.

Who knew stores treat price tags like government secrets? Not me

At first, the plan was honest and civilized: use good old BeautifulSoup - a Python library as gentle and friendly as its name. I fed it some HTML from the first store, and it dutifully fetched the price like a golden retriever bringing a stick. Success! This is going to be easy, I thought. A classic setup:
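(A minimal sketch of that first naive attempt - the URL and the CSS selector below are placeholders, not any real store's markup:)

import requests
from bs4 import BeautifulSoup

def fetch_price(url, price_selector):
    # Plain HTTP GET, no browser, no disguise - the innocent early days
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    price_tag = soup.select_one(price_selector)
    return price_tag.get_text(strip=True) if price_tag else None

# Hypothetical store page and selector:
print(fetch_price("https://store.example/search?q=bread", ".product-price"))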

At that point, my noble produce inflation tracker quietly died. Not with a bang, but with a CAPTCHA. Almost every store now demands that you prove you're human - simply for the honor of reading the price tag on a loaf of bread. It's as if prices are state secrets, and shoppers are undercover agents. You just want to know how much eggs cost, and suddenly you're solving puzzles for an algorithm that doesn't believe in your existence.

When the next store in the tracker's list returned an empty soup full of nothing except a CAPTCHA wall, I escalated. The AI advised bringing out the big guns: Firefox, a real browser with a real user profile, controlled by Selenium. Driven by code, grocery stores opened their pages to a proper citizen of the Internet.

def fetch_page_selenium(url):
    firefox_options = FirefoxOptions()
    firefox_options.binary_location = r"d:\...firefox.exe"
    # Reuse a real, pre-warmed browser profile
    firefox_options.add_argument('--profile')
    firefox_options.add_argument(r'd:\...\firefox-for-selenium')
    firefox_options.set_preference("javascript.enabled", True)
    # Masquerade as a plain desktop Firefox
    firefox_options.set_preference("general.useragent.override", 
        "Mozilla/5.0 ... Firefox/115.0")
    ...
    driver.get(url)
    # Hide the navigator.webdriver flag that bot detectors check
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    ...

I impersonated Firefox 115.0 on Windows 10, disabled navigator.webdriver, accepted insecure TLS certificates, and pretended to be less secure than I actually am - all to read the price of a loaf of bread. Meanwhile, real humans browse the same site with a dozen tabs open and two ad blockers, and still get through fine. Irony?

But it didn't matter. Pages would load normally - well, most of the time. A very real human behind this code, with his AI helper, had to keep thinking about how to wrap the code in layers of fake humanity, just to convince another machine that he's not a machine.

Meanwhile, the first set of data was carefully placed into the inflation tracker's database.

Web scraping dreams crushed by a duplicated CSS selector

Having almost won the bot-check battle, I walked straight into the HTML layout nightmare.

When I tried to think of a supermarket chain, the Walmart brand popped into my head. Pure randomness, but it already hinted that this project would make me sweat.

I started by googling Walmart's website, got walmart.com, and immediately stumbled on a question: which country is this page for? Walmart does not indicate the country anywhere (or I'm too blind to see it). Prices have a $ sign, units of measurement are imperial. I assumed I was on Walmart USA. But honestly, to this day I don't know which country the page is meant for.

Anyway, I navigated to https://www.walmart.com/search?q=bread&sort=price_low and located the HTML tags with the product title and product price. That immediately gave me a major headache.

You see, web scraping tools find data on a page by CSS selectors. For example, data-automation-id="product-price". This selector comes from Walmart's page, and you may assume it contains the product price. You'd be right, but also wrong (by the way, only now, while writing this article, did I actually notice that this DIV does contain the price).

<div data-automation-id="product-price" class="flex flex-wrap justify-start items-center lh-title mb1">
   <div class="mr1 mr2-xl b black lh-solid f5 f4-l" aria-hidden="true"><span class="f3"></span>
      <span class="f6 f5-l" style="vertical-align:0.65ex;margin-right:2px">$</span>
      <span class="f2">1</span><span class="f6 f5-l" style="vertical-align:0.75ex">82</span>
    </div>
</div>

I spotted that the price is also provided outside of this DIV block, in a span tag. Not a problem: use its span class="w_iUH7" selector! But who knew that exactly the same selector is used for the product title:

<span class="w_iUH7">Freshness Guaranteed Garlic Herb French Bread, 14 oz<!-- --> </span>
<span class="w_iUH7">current price $1.82</span>

This resulted in code that copied the product title in place of the product price.

After some talking to the AI, updated code emerged that extracted the data fields correctly. Hooray?
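The essence of the fix, as a minimal sketch (disambiguating by the "current price" prefix is illustrative; the real code in the repo is more general):

from bs4 import BeautifulSoup

html = '''
<span class="w_iUH7">Freshness Guaranteed Garlic Herb French Bread, 14 oz </span>
<span class="w_iUH7">current price $1.82</span>
'''

soup = BeautifulSoup(html, "html.parser")
spans = [s.get_text(strip=True) for s in soup.find_all("span", class_="w_iUH7")]
# The two spans share one selector, so disambiguate by their content:
price = next((s for s in spans if s.lower().startswith("current price")), None)
title = next((s for s in spans if not s.lower().startswith("current price")), None)
print(title)  # Freshness Guaranteed Garlic Herb French Bread, 14 oz
print(price)  # current price $1.82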

CSS selectors with variable content

After taming the chaos of duplicated CSS selectors - where both the product title and price were parsed as one - I thought I’d earned some peace. Wrong. The moment I exhaled, I fell straight into the next trap of HTML chaos: CSS selectors with unique content per item:

<p class="chakra-text css-1ecdp9w" data-testid="product-brand">ACE</p><h3 class="chakra-heading css-6qrhwc" id="21616092_EA" data-testid="product-title">Classic White Bistro</h3><p class="chakra-text css-1yftjin" data-testid="product-package-size">595 g, $0.76/100g</p>

At first glance, this HTML looks like a clean layout that is easy to fetch and parse. Then you realize: those generated class names like css-6qrhwc and IDs like id="21616092_EA" are not stable. Every product has its own unique ID embedded.

Needless to say, my naive, hand-crafted, splintery script - forged lovingly with AI and blind optimism - had no idea how to navigate this mess. It broke faster than a politician’s promise under scrutiny.

So back I went to negotiations with the artificial intelligence. This time, with a new agenda: teach the parser that some CSS selectors are variable. We - mostly the AI - rewired the logic to work with variable attributes.
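The idea, as a minimal sketch (treating class and id as the volatile attributes is illustrative; the real matching logic in the repo is more elaborate):

from bs4 import BeautifulSoup

VOLATILE_ATTRS = {"class", "id"}  # assumed to change per product or deployment

def find_by_stable_attrs(page_soup, tag_name, template_attrs):
    # Match only on attributes that survive page regeneration (e.g. data-testid)
    stable = {k: v for k, v in template_attrs.items() if k not in VOLATILE_ATTRS}
    return page_soup.find(tag_name, attrs=stable)

html = '<h3 class="chakra-heading css-6qrhwc" id="21616092_EA" data-testid="product-title">Classic White Bistro</h3>'
soup = BeautifulSoup(html, "html.parser")
el = find_by_stable_attrs(soup, "h3",
    {"class": "css-6qrhwc", "id": "21616092_EA", "data-testid": "product-title"})
print(el.get_text(strip=True))  # Classic White Bistro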

The code started to reliably fetch data from HTML layouts with variable selectors, giving a faint hope that tomorrow the HTML won't mutate... fast enough to break the code.

Product price split over multiple nested HTML tags

OK, now we're rocking the data collection - time to add another store. The code works flawlessly, so scraping the next store should cause no problems.

And yes, there were no problems grabbing the product title in this store. But price grabbing failed for a simple reason: the website designer decided to build the HTML layout with nested tags.

<div class="catalog-item__product-price product-price product-price--weight ">
  <data class="product-price__top">
    <span>20<span class="product-price__coin">90</span></span> 
  </data>
</div>

This nested design appeared again much later, in the HTML layout of another store that was definitely not connected to the previous one, as it was in another country almost 1500 kilometers away. Moreover, this store had even more severe HTML tag nesting, causing significant data duplication.

<div data-testid="product-block-supplementary-price-2" class="sc-dqia0p-0 fLKnrn">250 g</div>•<div data-testid="product-block-price-per-unit" type="normal" class="sc-dqia0p-3 kGnDnm">1 kg = 91,60 Kč</div></div></div></div><div data-testid="product-tile-footer" class="sc-y4jrw3-18 lnBzvN"><div class="sc-y4jrw3-22 bZlgEg"><div class="sc-y4jrw3-23 SlJTS"><div class="sc-y4jrw3-27 eUiuhf"><div data-stellar="st-web-price" data-testid="product-block-price" class="sc-dqia0p-2 bFXgow"><div class="sc-dqia0p-8 kyoGRo"></div><div class="sc-dqia0p-9 dsNELN">22</div><sup class="sc-dqia0p-10 fjOrrj">90 </sup><div class="sc-dqia0p-8 kyoGRo">Kč</div>

The result was duplicated data, which for these two layouts respectively looked like this:

Full price: 20.90грн/шт 20.90грн/шт відділень 90 грн/шт /шт

and

Full title: Albert Toustový chléb světlý, balený 250 g•1 kg = 91,60 Kč 250 g•1 kg = 91,60 Kč 250 g•1 kg = 91,60 Kč 250 g 1 kg = 91,60 Kč

OK, another round of negotiations with the AI, then one more, and one more. Several days later, correctly working code was produced, with improved and rather complex check_attributes_match() and extract_data_from_template() functions.

def extract_data_from_template(template_lines, page_html):
    """Extract data from page using template"""
    template_html = '\n'.join(template_lines)

    # Fix malformed template HTML - if it starts with attributes, add opening tag
    if template_html.strip().startswith('class=') or template_html.strip().startswith('data-'):
        template_html = '<div ' + template_html + '</div>'

    # NOTE: the original listing was truncated here when published; the soup
    # setup and matching loop below are reconstructed from context.
    template_soup = BeautifulSoup(template_html, 'html.parser')
    page_soup = BeautifulSoup(page_html, 'html.parser')

    extracted_parts = []
    processed_elements = []

    for template_element in template_soup.find_all(True):
        matching_element = find_matching_element(page_soup, template_element)
        if matching_element is None or matching_element in processed_elements:
            continue

        if len(template_element.find_all(True)) > 0:
            # Complex nested template - get full text from container
            extracted_text = matching_element.get_text(strip=True)
            extracted_text = ' '.join(extracted_text.split())  # Clean whitespace
        else:
            # Simple template - use targeted extraction
            extracted_text = extract_text_from_element(matching_element, template_element)

        if extracted_text:
            extracted_parts.append(extracted_text)
            processed_elements.append(matching_element)

    # Combine all parts and remove duplicates while preserving order
    final_parts = []
    for part in extracted_parts:
        if part not in final_parts:
            final_parts.append(part)

    # Reassemble prices split across tags, e.g. ["22", "90", "Kč"] -> "22.90 Kč"
    if len(final_parts) == 3 and final_parts[0].isdigit() and final_parts[1].isdigit():
        final_result = f"{final_parts[0]}.{final_parts[1]} {final_parts[2]}"
    elif len(final_parts) == 2 and final_parts[0].isdigit() and final_parts[1].isdigit():
        final_result = f"{final_parts[0]}.{final_parts[1]}"
    else:
        final_result = ' '.join(final_parts)
    return final_result

Jokes aside: how the price parser works

The web scraping and data extraction system is designed to monitor product prices across multiple international retail websites. The system employs a template-based approach to HTML parsing, using Selenium WebDriver to control a real browser for dynamic content handling, and BeautifulSoup for HTML parsing. The architecture supports multi-store, multi-country price tracking with built-in inflation rate calculations and standardized unit conversions.

The system's flexibility stems from its configuration-driven approach. The store_config.txt file defines scraping templates for each retail store, containing structured data blocks that specify how to extract product information from different website layouts.

System workflow

The main execution flow orchestrates all system components in a coordinated sequence, processing each store configuration and extracting data for both cheapest and most expensive product variants.

Parse Configuration → Initialize Database → Process Each Store → Extract Both Variants → Save Results

Processing Loop Structure

The system iterates through each store configuration, processing both cheapest and most expensive variants when URLs are available. For each variant, the system (see the sketch after this list):

  1. Fetches the webpage content using Selenium WebDriver
  2. Applies template matching to extract product title and price information
  3. Processes the extracted data through the standardization pipeline
  4. Calculates derived values such as price per unit and inflation rates
  5. Saves the processed data to the database with appropriate metadata
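A condensed sketch of that loop (fetch_page_selenium and extract_data_from_template are the functions discussed in this article; process_and_save is a hypothetical stand-in for steps 3-5):

for store in store_configs:
    for variant in ("cheapest", "most_expensive"):
        url = store["urls"].get(variant)
        if not url:
            continue
        page_html = fetch_page_selenium(url)                                    # step 1
        title = extract_data_from_template(store["title_template"], page_html)  # step 2
        price = extract_data_from_template(store["price_template"], page_html)
        process_and_save(store, variant, title, price)                          # steps 3-5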

Configuration format structure

Each store configuration follows a standardized format with five primary sections. Let's examine the Walmart configuration as an example:

STORE = Walmart
COUNTRY = USA
PRODUCT = Bread

TITLE = [
<span data-automation-id="product-title" class="normal dark-gray mb0 mt1 lh-title f6 f5-l lh-copy">FFF</span>
]

PRICE = [
<span class="w_iUH7">current price FFF</span>
]

CURRENCY_MAP = ["$": "USD"]

URLS = [
cheapest: https://www.walmart.com/search?q=bread&sort=price_low
most_expensive: 
]
        

The "FFF" placeholder acts as a dynamic content marker, indicating where the actual product data should be extracted from the webpage. This templating approach allows the system to handle diverse HTML structures across different retailers.

Configuration components

The configuration parser processes each store block by identifying key sections and extracting the relevant information:
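A minimal sketch of that parsing step (key handling simplified; the real parser in the repo also covers CURRENCY_MAP and other details):

import re

def parse_store_block(block_text):
    """Parse one store block of store_config.txt into a dict (illustrative)."""
    config = {}
    # Single-line keys: STORE = Walmart, COUNTRY = USA, PRODUCT = Bread
    for key in ("STORE", "COUNTRY", "PRODUCT"):
        m = re.search(rf"^{key}\s*=\s*(.+)$", block_text, re.MULTILINE)
        if m:
            config[key.lower()] = m.group(1).strip()
    # Bracketed sections: TITLE = [...], PRICE = [...], URLS = [...]
    for key in ("TITLE", "PRICE", "URLS"):
        m = re.search(rf"^{key}\s*=\s*\[(.*?)\]", block_text, re.MULTILINE | re.DOTALL)
        if m:
            config[key.lower()] = [ln.strip() for ln in m.group(1).splitlines() if ln.strip()]
    return config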

Database schema and relationships

To efficiently store price-tracking data across multiple dimensions, the system implements a normalized relational database structure using SQLite.

Store table:
    store_id INTEGER PRIMARY KEY AUTOINCREMENT
    name TEXT NOT NULL UNIQUE
    country TEXT

ProductType table:
    product_type_id INTEGER PRIMARY KEY AUTOINCREMENT
    name TEXT NOT NULL UNIQUE

PriceSample table:
    sample_id INTEGER PRIMARY KEY AUTOINCREMENT
    store_id INTEGER NOT NULL
    product_type_id INTEGER NOT NULL
    date TEXT NOT NULL
    variant TEXT CHECK(variant IN ('cheapest', 'most_expensive'))
    full_name TEXT
    full_price_string TEXT
    price_number REAL NOT NULL
    price_currency TEXT
    package_size_string TEXT
    package_size_number REAL
    package_unit TEXT
    price_per_unit_string TEXT
    price_per_unit_number REAL
    inflation_rate REAL
    FOREIGN KEY (store_id) REFERENCES Store(store_id)
    FOREIGN KEY (product_type_id) REFERENCES ProductType(product_type_id)
    UNIQUE(store_id, product_type_id, date, variant)

The PriceSample table serves as the central repository, linking stores and product types while maintaining historical price data with calculated inflation rates.

Web scraping implementation

The system employs a dual-approach web scraping strategy, using Selenium WebDriver for JavaScript-heavy sites and falling back to plain HTML fetching for simpler pages.

Selenium configuration

The Selenium implementation uses Firefox WebDriver with extensive customization to handle modern web security measures and anti-bot detection systems:

def fetch_page_selenium(url):
    firefox_options = FirefoxOptions()
    firefox_options.binary_location = r"d:\...firefox.exe"
    # Disable WebDriver-specific click behavior
    firefox_options.set_capability("moz:webdriverClick", False)
    # Suppress the dom.webdriver flag that anti-bot scripts probe
    firefox_options.set_preference("dom.webdriver.enabled", False)
    # Accept untrusted/self-signed TLS certificates
    firefox_options.set_preference("webdriver_accept_untrusted_certs", True)

Template-based data extraction

The core extraction mechanism operates through a template matching algorithm that compares HTML elements from the configuration templates with actual webpage content.

find_matching_element(page_soup, template_element)

Locates HTML elements in the scraped page that match the template specifications by comparing tag names, attributes, and content patterns.

extract_data_from_template(template_lines, page_html)

Processes template definitions to extract relevant data from webpage content, handling both simple and complex nested HTML structures.

The template matching process involves several steps:

  1. Element identification: The system parses template HTML to identify elements containing "FFF" placeholders
  2. Attribute matching: For each template element, the system searches for corresponding elements in the scraped page with matching tag names and attributes
  3. Content extraction: Once matched, the system extracts the actual content replacing the "FFF" placeholder
  4. Duplicate handling: The algorithm prevents duplicate extraction from nested elements through relationship tracking
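A simplified sketch of the matching idea behind find_matching_element (treating class and id as the volatile attributes is illustrative; the repo's implementation handles more cases):

def find_matching_element(page_soup, template_element):
    """Find a page element whose tag and stable attributes match the template."""
    for candidate in page_soup.find_all(template_element.name):
        matches = True
        for attr, value in template_element.attrs.items():
            if attr in ("class", "id"):  # skip volatile, per-product attributes
                continue
            if candidate.get(attr) != value:
                matches = False
                break
        if matches:
            return candidate
    return None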

Data processing pipeline

The extracted raw data undergoes comprehensive processing to standardize formats, calculate derived values, and ensure data quality before database insertion.

Price information processing

The extract_price_info function handles the complex task of extracting numerical price values from diverse string formats across different locales and currencies:

def extract_price_info(price_string, currency_map):
    # price_clean is price_string with currency symbols stripped (omitted here)
    price_clean = price_string.strip()
    # Handle European decimal format (comma as decimal separator)
    if ',' in price_clean and '.' not in price_clean:
        price_clean = price_clean.replace(',', '.')
    elif ',' in price_clean and '.' in price_clean:
        # Handle format like "1.234,56" (European thousands separator)
        if price_clean.rfind(',') > price_clean.rfind('.'):
            price_clean = price_clean.replace('.', '').replace(',', '.')

The system processes different decimal and thousands separators used across various countries, ensuring accurate price extraction regardless of regional formatting conventions.
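A few illustrative inputs, traced by hand through the branches above (hypothetical strings, not scraped data):

# "3,49"     -> "3.49"    (comma only: treated as a decimal separator)
# "1.234,56" -> "1234.56" (comma after dot: dot is a thousands separator)
# "22.90"    -> "22.90"   (dot only: already in canonical form, left as is)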

Package size and unit processing

Product package information extraction employs regular expressions to identify and parse size and unit data from product titles:

def extract_package_info(title_string):
    patterns = [
        r'(\d+(?:[,\.]\d+)?)\s*(oz|lb|g|kg|ml|l|fl oz|г|мл|л|кг)\b',
        r'(\d+(?:[,\.]\d+)?)\s*(ounce|pound|gram|kilogram|liter|litre)\b'
    ]
        

The system maintains unit conversion tables to standardize measurements across different regional systems:

UNIT_PATTERNS = {
    "oz": 0.0283495,
    "fl oz": 0.0295735,
    "lb": 0.453592,
    "kg": 1.0,
    "g": 0.001,
    "l": 1.0,
    "ml": 0.001,
    "г": 0.001,    # Cyrillic gram
    "мл": 0.001,   # Cyrillic milliliter
    "л": 1.0       # Cyrillic liter
}
        

Price per unit calculations

The system calculates standardized price-per-unit values to enable meaningful price comparisons across different package sizes and measurement systems:


calculate_price_per_unit(price_number, package_size, package_unit, currency)

Converts package sizes to standard units (kg for weight, liter for volume) and calculates normalized price-per-unit values for comparison purposes.
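A minimal sketch of that conversion, assuming the UNIT_PATTERNS factors above (the body is illustrative, not the repo's exact code):

def calculate_price_per_unit(price_number, package_size, package_unit, currency):
    factor = UNIT_PATTERNS.get(package_unit)
    if not factor or not package_size:
        return None
    size_in_base = package_size * factor          # e.g. 595 g -> 0.595 kg
    per_unit = price_number / size_in_base
    base = "l" if package_unit in ("l", "ml", "fl oz", "л", "мл") else "kg"
    return f"{per_unit:.2f} {currency}/{base}"

# Example: a 14 oz loaf at $1.82 -> 14 * 0.0283495 = 0.397 kg -> about 4.59 USD/kg
print(calculate_price_per_unit(1.82, 14, "oz", "USD"))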

Inflation rate calculation

The system incorporates temporal price change tracking by calculating inflation rates based on historical price data stored in the database. This feature enables tracking of price changes over time for each product-store combination.

def calculate_inflation_rate(current_price, previous_price):
    if previous_price is None or previous_price == 0:
        return 0.0
    inflation_rate = ((current_price - previous_price) / previous_price) * 100
    return inflation_rate
        

The inflation calculation retrieves the most recent price record for each product-store-variant combination and computes the percentage change from the previous measurement. This enables tracking of price trends and identifying price fluctuations.

Error handling and data validation

The system implements comprehensive error handling and data validation to ensure data integrity and system reliability.

Validation mechanisms

The system validates that both product title and price information are present before attempting database insertion, preventing incomplete records from corrupting the dataset.
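As a minimal sketch of that pre-insert check (illustrative, not the repo's exact code):

def is_valid_sample(title, price_number):
    # Reject records with a missing title or a missing/non-positive price
    return bool(title) and price_number is not None and price_number > 0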

Database operations and data management

The system implements a data management strategy using SQLite with proper transaction handling and data integrity constraints.

Database helper functions

get_or_create_store(store_name, country_name)

Implements an upsert pattern to ensure stores are uniquely identified while maintaining referential integrity in the database schema.

get_or_create_product_type(product_name)

Manages product type records using the same upsert pattern, ensuring consistent product categorization across all stores.

The database operations use INSERT OR REPLACE statements to handle potential duplicate entries gracefully, ensuring that the latest price information overwrites previous entries for the same date and product combination.
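A sketch of the upsert pattern these helpers follow, using Python's sqlite3 module and the Store table from the schema above (illustrative):

import sqlite3

def get_or_create_store(conn, store_name, country_name):
    # Return the existing store_id, or insert a new row and return its id
    row = conn.execute("SELECT store_id FROM Store WHERE name = ?", (store_name,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO Store (name, country) VALUES (?, ?)",
                       (store_name, country_name))
    conn.commit()
    return cur.lastrowid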

Performance considerations and optimizations

The system incorporates several performance optimizations to handle price monitoring efficiently.

Extensibility

The system architecture facilitates easy extension and maintenance through several design patterns:

Configuration-driven design

Adding new stores requires only configuration file updates without code modifications. The template-based approach allows the system to adapt to different HTML structures through configuration changes.

Modular function design

Each major operation is encapsulated in dedicated functions with clear responsibilities, making the system easier to debug, test, and extend. The separation of concerns between web scraping, data processing, and database operations enables independent modification of each component.

Scraper limitations

While the system provides robust price monitoring capabilities, several limitations should be considered: store pages can change their HTML layouts without notice, and anti-bot measures such as CAPTCHAs can block scraping at any time.

Conclusion

The Product Price Parser represents an approach to automated price monitoring, combining flexible configuration management with robust data processing capabilities. The system's architecture balances complexity with maintainability, providing a scalable foundation for international price tracking. Through its template-based extraction engine, comprehensive data processing pipeline, and normalized database structure, the system enables efficient monitoring of product prices across diverse retail environments.

The implementation demonstrates advanced web scraping techniques, proper database design principles, and thoughtful error handling strategies. The system's modular design and configuration-driven approach ensure that it can adapt to changing requirements and expanding monitoring scope without requiring significant architectural modifications.

Get full project code: https://github.com/Eb43/product-inflation-tracker
