Viral Ad Network

VANCrawler

VANCrawler is our experimental crawler, which crawls our publisher websites to help target ads.

If you believe that our crawler is misbehaving (for example, ignoring robots.txt or crawling non-publisher sites), please let us know.

For Website Owners

Choosing pages to crawl

VANCrawler starts by crawling pages which we believe have shown our ad units, and tries to keep our knowledge of your site's content up to date so that we can target ads correctly.

We will generally only crawl pages which we believe are showing our ad units. If you are not a Viral Ad Network publisher and we have crawled your site, it may be because you embed a page belonging to one of our publishers, or because we are crawling sites that link to our publishers as part of an anti-malware or anti-fraud crawl.

robots.txt

We aim to follow the robots.txt standard, following rules for User-Agent: VANCrawler.

You can use this to tell our crawler not to crawl specific pages on your site, but remember that this may reduce competition for ads on those pages, and so reduce your publisher income.
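As a rough way to check how a robots.txt file would be read for the VANCrawler user agent, the sketch below uses Python's standard urllib.robotparser module. The sample file contents, the is_allowed helper and the example URLs are all made up for illustration, and the standard-library parser only stands in for our crawler's own parser, which may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher site (an assumption,
# not a recommended configuration).
ROBOTS_TXT = """\
User-agent: VANCrawler
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, url: str, agent: str = "VANCrawler") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/articles/1", "/private/draft"):
        print(path, is_allowed(ROBOTS_TXT, "http://site.com" + path))
```

In this sample, the VANCrawler group blocks only /private/, while every other crawler is blocked from the whole site.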

We do not follow the noindex directive of the robots meta tag. We made this decision because our index is not made public and is used solely to improve ad targeting.

crawl-delay

We will generally follow the robots.txt Crawl-delay directive, which specifies the minimum time, in seconds, between crawls.

We set delays between crawls on a per-IP basis, so if you are using DNS load balancing you may find that we crawl your site more often than this.

In addition, due to network issues and the distributed nature of our crawling, you may find that the delay between crawls is occasionally slightly less than this period.
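To illustrate why per-IP delays can lead to more frequent crawls on a load-balanced site, here is a minimal sketch. It reads Crawl-delay with Python's standard urllib.robotparser and books fetch times per IP address. The PerIpScheduler class, the crawl_delay_seconds helper, the sample robots.txt and the IP addresses are hypothetical and are not a description of our actual scheduler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking for 10 seconds between crawls.
ROBOTS_TXT = """\
User-agent: VANCrawler
Crawl-delay: 10
"""


def crawl_delay_seconds(robots_txt: str, agent: str = "VANCrawler") -> float:
    """Return the Crawl-delay for `agent` in seconds, or 0.0 if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else 0.0


class PerIpScheduler:
    """Tracks the earliest time each IP address may next be fetched."""

    def __init__(self):
        self._next_allowed = {}  # ip -> earliest permitted start time

    def reserve(self, ip: str, delay: float, now: float) -> float:
        """Return when a fetch to `ip` may start, and book the following slot."""
        start = max(now, self._next_allowed.get(ip, now))
        self._next_allowed[ip] = start + delay
        return start
```

Because the delay is tracked per IP address, two addresses serving the same hostname are each treated separately, so the site as a whole may see requests closer together than the Crawl-delay value.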

Character Encodings

We support a very wide range of character encodings. However, it is important that your server or your HTML tells us which encoding is used for any text documents returned (e.g. HTML/XHTML). For more information, please see the Wikipedia article.
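The sketch below shows one way a client might look for a declared encoding: first in the HTTP Content-Type header, then in a meta tag near the top of the HTML. The declared_encoding function and the META_CHARSET pattern are illustrative assumptions and not our crawler's actual detection logic. The regular expression is a simplification that does not cover every valid way of writing a meta tag.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="...; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def declared_encoding(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Return the encoding named by the HTTP header, else by the HTML, else None."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lower-cased charset parameter
        if charset:
            return charset
    # Only inspect the start of the document, where the declaration belongs.
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None
```

If this function returns None for your pages, neither the header nor the markup states the encoding, and a client would have to guess.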

URLs

Our crawler will treat www.site.com and site.com as different URLs.

Parts of the URL after a hash (#), such as site.com/#test, are not sent to the server during a request, so we drop them from the URL.

We will attempt to remove query parameters which we believe will not change the response (e.g. site.com/?utm_source=source1, which is commonly used for tracking with Google Analytics).
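The URL handling described above can be sketched with Python's standard urllib.parse module. The normalize_url function and the TRACKING_PREFIXES tuple are hypothetical. In particular, treating every parameter starting with utm_ as removable is an assumption made for this example and is not a list of what our crawler actually strips.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters with these prefixes never change the page content.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Drop the fragment and tracking parameters; leave the host untouched."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # The host is kept as given, so www.site.com and site.com stay distinct.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(normalize_url("http://site.com/?utm_source=source1#test"))
```

Other query parameters are kept in their original order, since they may change what the server returns.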

All pages are fetched using HTTP GET. We will not POST data to your server.
