Search Engines can’t rely on semantic markup
November 18th, 2011 by Tim Wintle
(This post reflects my own personal opinion, and experiences in indexing web content – both at the Viral Ad Network and in non-VAN projects.)
One of the most common arguments for using tags that carry semantic value (such as <header>, <article>, etc.) is that they make it easier for search engines to index your site…
But I’ve never heard someone who has written an indexer focus on that argument.
(Note: I’m not saying there isn’t a very minor effect… but it’s certainly not a massive change.)
When an indexer is trying to pick out the important text on a page, it has two main problems:
- Removing all the unimportant text (headers, navigation, footers etc)
- Dealing with spammers
The problem is that any technique websites can use to give hints to indexers will immediately be abused by spammers.
Take the keywords meta tag. Back in 1995, search engines thought it would be great if websites could add some semantic data to their sites, and they invented the “keywords” meta tag – which let websites specify keywords describing their site.
That failed badly – for every site which used the tag to add useful keywords, several spam sites used it to stuff in irrelevant ones. At the time it wasn’t unusual to skip through several pages of search results before finding a site which wasn’t spam. By 1998, search engines had started ignoring the tag.
Because of spammers, it’s pretty much a given that all indexed data about a page has to come from what the user can see on the page – not from hidden markup features.
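To make that concrete, here is a minimal sketch (in Python, standard library only) of an indexer that refuses to trust declared keywords unless the page's own text backs them up. The page, the class name PageText and the function supported_keywords are all made up for illustration. This is not how any particular search engine works, and it makes no attempt to detect text hidden with CSS, which a real indexer would also have to handle.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: the declared keywords are mostly unrelated to the copy.
PAGE = """
<html><head>
<meta name="keywords" content="tomatoes, casino, cheap pills">
<style>p { color: green; }</style>
</head><body><p>How to grow tomatoes in a small garden.</p></body></html>
"""


class PageText(HTMLParser):
    """Collect declared meta keywords and the words in ordinary text nodes."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0          # > 0 while inside <script> or <style>
        self.visible_words = set()   # words found in text nodes
        self.declared = []           # keywords claimed by the meta tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.declared = [k.strip().lower()
                             for k in content.split(",") if k.strip()]
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore the contents of <script> and <style>; keep everything else.
        if not self.skip_depth:
            self.visible_words.update(re.findall(r"[a-z0-9]+", data.lower()))


def supported_keywords(page):
    """Keep only declared keywords whose every word appears in the page text."""
    parser = PageText()
    parser.feed(page)
    return [k for k in parser.declared
            if all(word in parser.visible_words for word in k.split())]


print(supported_keywords(PAGE))
```

Of the three declared keywords, only "tomatoes" survives, because it is the only one that actually appears in the copy. Once you filter like this, the meta tag adds nothing the visible text didn't already tell you.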
CSS lets us style elements in any way we like – we can effectively swap the visual meaning of any two tags however we see fit.
It would be trivial, for example, to switch <h1> and <small> tags by modifying the styles for the elements. On a normal website you may never want to do that, but now consider that you’re a spammer – you can put two different sets of copy on the page, and the raw markup can focus on one set of copy while your CSS focuses the user on completely different copy.
If an indexer were to base its measure of importance largely on which tags were being used (e.g. rank <h1> tags highly) then it would be wide open to spam from that page.
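Here is a rough sketch of that attack, again in Python with only the standard library. The tag weights, the page and the rendered font sizes are all invented for the example. In particular, RENDERED_PX is hard-coded to mirror the stylesheet, whereas a real indexer would have to work those sizes out by actually applying the CSS.

```python
from html.parser import HTMLParser

# Hypothetical weights a naive, tag-trusting indexer might assign.
TAG_WEIGHTS = {"h1": 10.0, "p": 1.0, "small": 0.5}

# Hypothetical spam page: the stylesheet swaps how <h1> and <small> look.
PAGE = """
<style>h1 { font-size: 8px; } small { font-size: 32px; }</style>
<h1>cheap pills casino</h1>
<small>Welcome to my gardening blog</small>
"""

# Assumed rendered sizes, hard-coded to match the <style> block above.
RENDERED_PX = {"h1": 8.0, "p": 16.0, "small": 32.0}


class TextByTag(HTMLParser):
    """Collect (tag, text) pairs for the tags listed in TAG_WEIGHTS."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.current = tag if tag in TAG_WEIGHTS else None

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            self.found.append((self.current, text))


def most_important(page, score):
    """Return the text whose enclosing tag gets the highest score."""
    parser = TextByTag()
    parser.feed(page)
    return max(parser.found, key=lambda pair: score(pair[0]))[1]


by_tag = most_important(PAGE, lambda tag: TAG_WEIGHTS[tag])    # trusts markup
by_size = most_important(PAGE, lambda tag: RENDERED_PX[tag])   # trusts rendering
print("markup says:", by_tag)
print("user sees:  ", by_size)
```

The tag-based scorer picks the spam text in the <h1>, while the size-based scorer picks the text the visitor would actually notice. The two disagree on the same page, which is exactly the gap a spammer exploits.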
It is possible for indexers to take hints from these tags, and use them if they match with their own results for detecting sections of a page, but they certainly can’t take them at face value.
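One way to picture "take hints, but only when they agree" is the small sketch below. The role names, the CLAIMED_ROLE mapping, the function resolve_role and the size of the confidence boost are all assumptions made up for illustration; the indexer's own section detection is taken as given and simply passed in.

```python
# Hypothetical mapping from semantic tags to the role they claim for a section.
CLAIMED_ROLE = {
    "article": "content",
    "main": "content",
    "nav": "boilerplate",
    "header": "boilerplate",
    "footer": "boilerplate",
}

AGREEMENT_BOOST = 0.1  # assumed, deliberately small


def resolve_role(tag, detected_role, detected_confidence):
    """Combine a markup hint with the indexer's own analysis of a section.

    detected_role and detected_confidence (0.0 to 1.0) come from the
    indexer's own detection, e.g. based on what is rendered on screen.
    The tag can only nudge confidence up when it agrees; it can never
    change the role the indexer decided on.
    """
    claimed = CLAIMED_ROLE.get(tag)
    if claimed == detected_role:
        # Hint and analysis agree: slightly more confident, capped at 1.0.
        return detected_role, min(1.0, detected_confidence + AGREEMENT_BOOST)
    # No hint, or the hint disagrees: ignore the markup entirely.
    return detected_role, detected_confidence


# An <article> that the indexer also thinks is content gets a small boost.
print(resolve_role("article", "content", 0.5))
# An <article> that looks like boilerplate stays boilerplate, unchanged.
print(resolve_role("article", "boilerplate", 0.5))
```

The important design choice is that the hint is never allowed to override the indexer's own result, so a spammer gains nothing by mislabelling sections.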


