Viral Ad Network

Archive for the ‘Publishers’ Category

Search Engines can’t rely on semantic markup

November 18th, 2011 by Tim Wintle

(This post reflects my own personal opinion, and experiences in indexing web content – both at the Viral Ad Network and in non-VAN projects.)

One of the most common arguments for using tags that carry semantic value (such as <header>, <article>, etc.) is that it makes it easier for search engines to index your site…

But I’ve never heard someone who has written an indexer focus on that argument.

(Note that I’m not saying there isn’t a very minor effect… but it is certainly not a massive change.)

When an indexer is trying to pick out the important text on a page, it has two main problems:

  1. Removing all the unimportant text (headers, navigation, footers etc)
  2. Dealing with spammers

The problem is that any technique websites can use to give hints to indexers will immediately be abused by spammers.

Take the keywords meta tag. Back in 1995, search engines thought it would be great if websites could add some semantic data to their sites, and they invented the “keywords” meta tag – which let websites specify keywords describing their site.

That failed badly – for every site which used the tag to add useful keywords, several spam sites used it to stuff irrelevant keywords into their pages. At the time it wasn’t unusual to skip through several pages of search results before finding a site which wasn’t spam. By 1998, search engines had started ignoring the tag.

Because of spammers, it’s pretty much a given that all indexed data about a page has to come from what the user can see on the page – not from hidden markup features.

CSS lets us style elements in any way we like – we can effectively change the visual meaning of any two tags however we see fit.

It would be trivial, for example, to swap the appearance of <h1> and <small> tags by modifying the styles for the elements. On a normal website you may never want to do that, but now consider that you’re a spammer: you can put two different sets of copy on the page, and the raw markup can emphasise one set of copy while your CSS focuses the user on completely different copy.

If an indexer were to base its idea of importance largely on which tags were being used (e.g. rank <h1> tags highly), it would be wide open to spam from pages like that.
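To make that concrete, here is a minimal sketch of the kind of naive tag-weighted scoring described above. It is not how any real search engine works; the weights, the class name `NaiveTagScorer` and the sample page are all made up for illustration. The page's CSS shrinks the <h1> and enlarges the <small>, but the scorer only looks at tag names.

```python
from html.parser import HTMLParser

# Hypothetical weights a naive indexer might give to text inside each tag.
TAG_WEIGHTS = {"h1": 10, "small": 1}


class NaiveTagScorer(HTMLParser):
    """Scores each piece of text purely by the tag that encloses it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.scores = {}  # text -> weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            weight = TAG_WEIGHTS.get(self.stack[-1], 0)
            if weight:
                self.scores[text] = weight


def score_page(html):
    scorer = NaiveTagScorer()
    scorer.feed(html)
    scorer.close()
    return scorer.scores


# Made-up spam page: the CSS makes the <small> text the visual headline
# and the <h1> text tiny and pale, but the scorer never reads the CSS.
SPAM_PAGE = """
<style>
  h1 { font-size: 8px; color: #ccc; }
  small { font-size: 32px; font-weight: bold; }
</style>
<h1>cheap watches best price</h1>
<small>Welcome to my gardening blog</small>
"""

scores = score_page(SPAM_PAGE)
```

The scorer rates the spam keywords ten times higher than the text a visitor would actually read as the headline, which is exactly the mismatch a spammer wants.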

It is possible for indexers to take hints from these tags, and use them if they match with their own results for detecting sections of a page, but they certainly can’t take them at face value.

Publishers – meet VANCrawler

November 14th, 2011 by Tim Wintle

If you check your server logs, you may have noticed that we have been running a new experimental web crawler – “VANCrawler”.

We’ve now set our crawler live across all publisher sites, so it’s time for a formal introduction.

What is VANCrawler?

VANCrawler is a web crawler which downloads a copy of all pages showing our ad units onto our own servers, where we apply our own natural language processing for advanced contextual targeting, scan pages for potential issues, and run many other processing steps to help us target ads better.

What do I need to do?

We have spotted several sites which have issues with the text encoding used on their pages. We suggest that all publishers double-check that they are declaring the correct encoding (“charset”) for their site.

Without this information, crawlers don’t know how to treat words containing accented letters such as “publicité”, or non-latin letters such as “Διαδίκτυο” or “網站”. If you’re not using these letters you may still be affected, as without the correct character set we may believe you are.
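As a small illustration of what goes wrong (using Python's built-in codecs, not VANCrawler's own code), here is the same word read with the wrong character set in each direction:

```python
word = "publicité"

# The page is really UTF-8, but the crawler assumes Windows-1252.
utf8_bytes = word.encode("utf-8")
misread_as_cp1252 = utf8_bytes.decode("windows-1252")

# The page is really Windows-1252, but the crawler assumes UTF-8.
# errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8.
cp1252_bytes = word.encode("windows-1252")
misread_as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

print(misread_as_cp1252)  # publicitÃ©
print(misread_as_utf8)    # publicit followed by a replacement character
```

In both cases the crawler ends up with a "word" that does not exist in any language, so it cannot be matched against anything useful.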

This is a problem not only for our own crawler, but also for crawlers which index pages for search engines and for users around the world, so fixing this issue on sites where it occurs may increase your search engine traffic as well.

You can specify the character encoding for your website in the Content-Type HTTP header, in the XML declaration for XHTML, or using a special tag for HTML. Please ensure that this matches the character encoding which your documents use when they are served.
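If you want to check your own pages, the following is a rough sketch of one way to do it. The function names `declared_charset` and `declaration_matches` are invented for this example, and the regular expression is a simplification that only handles the common forms of the HTML meta tag (it ignores the XML declaration). It reads the charset from the Content-Type header, falls back to a meta tag, and then checks that the page's bytes actually decode with that charset.

```python
import re
from email.message import Message

# Simplified pattern covering <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(content_type, body):
    """Return the charset declared in the header, else in a meta tag, else None."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declaration_matches(content_type, body):
    """True if the body decodes without errors using its declared charset."""
    charset = declared_charset(content_type, body)
    if charset is None:
        return False
    try:
        body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# A page that claims UTF-8 in its meta tag but was saved as Windows-1252.
page = '<meta charset="utf-8"><p>publicité</p>'.encode("windows-1252")
print(declaration_matches("text/html", page))  # False
```

Bear in mind that a clean decode does not prove the declaration is right: Windows-1252 will happily decode almost any byte sequence. A check like this only catches the cases where the declared charset is definitely wrong.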

If you are uncertain which encoding you are using, you can generally find the “charset” specified somewhere in your web browser (e.g. under “View>Character Encoding” in Firefox). For European users on Windows, this will probably be “Windows-1252”. For users on any other operating system (or for advanced websites on Windows), it will probably be “UTF-8”.

We have also noticed several sites serving pages which are not encoded using a valid encoding at all.

This can happen when a server-side language such as PHP is used to concatenate strings which use different encodings. For example, a string stored in a database may be encoded using UCS-4 (where each letter is represented as 4 bytes), and may then be concatenated with an ASCII string (where each letter is one byte). This situation is even more difficult for crawlers to deal with, so we suggest that publishers who program their own server-side pages check that they are not mixing different character sets when generating their pages.
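The example above is about PHP, but the same mistake is easy to reproduce in any language. Here is a short Python sketch of it (the variable names and the sample strings are made up), along with the usual fix: decode every piece to text first, then encode the whole page once.

```python
def is_valid(data, encoding):
    """True if the bytes decode cleanly using the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# A value from the database stored as UCS-4 (4 bytes per letter) ...
from_database = "publicité".encode("utf-32-be")
# ... and a piece of template markup stored as ASCII (1 byte per letter).
template = "<p>".encode("ascii")

# The mistake: joining the raw bytes. The result is not valid in any of
# the encodings involved.
broken = template + from_database

# The fix: decode each piece using its own encoding, join the text,
# and encode the result once.
text = template.decode("ascii") + from_database.decode("utf-32-be")
fixed = text.encode("utf-8")

print(is_valid(broken, "utf-8"))      # False
print(is_valid(broken, "utf-32-be"))  # False
print(is_valid(fixed, "utf-8"))       # True
```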

Can I prevent VANCrawler from accessing my site?

VANCrawler follows the robots.txt standard, and will follow rules for User-Agent: VANCrawler.

If you want to limit how often we crawl pages on your site, you can use the crawl-delay robots.txt option to specify the minimum number of seconds we should wait between crawls.

As VANCrawler is used to target ads and so increase your revenue, we recommend that you do not disallow access to pages on your site. However, you may list URLs to exclude using the robots.txt format.
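As an example, the sketch below shows a robots.txt with a crawl delay and one excluded path, and uses Python's standard `urllib.robotparser` module to check how a standards-following crawler would read it. The 10 second delay, the /private/ path and the example.com URLs are placeholders, and the check reflects Python's parser rather than VANCrawler's own.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents; the delay and path are placeholders.
ROBOTS_TXT = """\
User-agent: VANCrawler
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("VANCrawler", "http://example.com/private/page.html")
allowed = parser.can_fetch("VANCrawler", "http://example.com/articles/1.html")
delay = parser.crawl_delay("VANCrawler")

print(blocked)  # False
print(allowed)  # True
print(delay)    # 10
```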

You can find more information on our ad crawler here: http://www.viraladnetwork.net/crawler-info