Wikipedia:Reference desk/Archives/Computing/2013 March 1

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


March 1

Who is scraping the whole web?

What companies and institutions out there are scraping and indexing the whole web? OsmanRF34 (talk) 13:42, 1 March 2013 (UTC)[reply]

Well, that's what a search engine like Google does, although in reality they only find pages by following links from pages they already know about, so orphan pages aren't likely to be indexed. StuRat (talk) 15:49, 1 March 2013 (UTC)[reply]
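Below is a minimal sketch, in Python, of the link-following approach StuRat describes: starting from one or more seed URLs, the crawler only ever discovers pages that some already-known page links to, which is exactly why orphan pages get missed. The seed URL and page limit are placeholders, and a real crawler would also need politeness delays, robots.txt checks and far better error handling.

    # Minimal link-following crawler sketch (illustrative, not production code).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href values of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, limit=50):
        seen, queue = set(seeds), list(seeds)
        while queue and len(seen) < limit:
            url = queue.pop(0)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):   # unreachable page, bad scheme, etc.
                continue
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:    # a page nothing links to never gets here
                    seen.add(absolute)
                    queue.append(absolute)
        return seen

    print(crawl(["https://example.org/"]))   # placeholder seed URL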
Yes, but not all of them have their own back-end. Some use Google's as a back-end (and so don't count). And I could imagine that some government body (some library) is also doing it, so it's not only search engines. OsmanRF34 (talk) 16:51, 1 March 2013 (UTC)[reply]
Yes, for instance citeseer, specifically Citeseer#CiteSeerX, does extensive crawling of university-hosted pages, though not the whole web. As a side note, I can personally attest (WP:OR) that they find orphan content, much to my chagrin. Lesson learned: use robots.txt! SemanticMantis (talk) 17:08, 1 March 2013 (UTC)[reply]
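On the robots.txt point: here is a short sketch, using Python's standard urllib.robotparser, of how a well-behaved crawler checks a site's robots.txt before fetching a page. The bot name and URLs are placeholders, not any real crawler's.

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (placeholder site and user agent).
    rp = RobotFileParser("https://example.org/robots.txt")
    rp.read()

    if rp.can_fetch("ExampleBot", "https://example.org/private/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")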
I wonder how. Do they use the list of registered domain names, then just try random words as possible subpages from there? Or perhaps they use a bit of artificial intelligence, say to deduce that, if there's a linked page at fubar.com/500.index and fubar.com/502.index, there just might be one at fubar.com/501.index? StuRat (talk) 18:02, 1 March 2013 (UTC)[reply]
Or perhaps they are eavesdropping on HTTP traffic and notice that HTTP requests for a particular URL are occurring, even though no other indexed website links to that URL! Law, sausages, and search engine indexes cease to inspire respect in proportion as we know how they are made... Nimur (talk) 20:43, 1 March 2013 (UTC)[reply]
For a more comprehensive non-SE scraper/crawler, see e.g. archive.org. SemanticMantis (talk) 17:12, 1 March 2013 (UTC)[reply]
It may be of interest to you to read Deep Web, which discusses the limits of normal search engines. In fact there are quite a lot of technologies used by companies to search and index, in a systematic way, the particular areas of the internet they are interested in - see for example Fast Search & Transfer, recently acquired by Microsoft --nonsense ferret 17:11, 1 March 2013 (UTC)[reply]

UTF-8? What happens with the C0 byte?

I was reading about UTF-8, and it seems that after 7F it jumps to C280 instead of going through C0 and C1. Why? 190.60.93.218 (talk) 17:57, 1 March 2013 (UTC)[reply]

C080–C1BF are invalid redundant encodings of 0–7F. See UTF-8#Overlong encodings. -- BenRG (talk) 19:38, 1 March 2013 (UTC)[reply]
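One quick way to see the jump the question asks about is to lean on a strict UTF-8 codec, e.g. Python's: U+007F is the last one-byte code point, U+0080 is the first two-byte sequence (C2 80), and C0 80, which would be an overlong encoding of U+0000, is rejected outright.

    # U+007F still fits in one byte; U+0080 is the first code point needing two.
    print("\u007f".encode("utf-8").hex())   # 7f
    print("\u0080".encode("utf-8").hex())   # c280

    # A lead byte of C0 or C1 could only re-encode a one-byte value, so strict
    # decoders treat such sequences as invalid (overlong) encodings.
    try:
        bytes.fromhex("c080").decode("utf-8")   # would be an overlong U+0000
    except UnicodeDecodeError as err:
        print(err)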
Does that mean that space can be used for other purposes? — Preceding unsigned comment added by 181.50.189.29 (talk) 00:18, 2 March 2013 (UTC)[reply]
Yes, if you don't mind that the result isn't UTF-8 and so you lose the interoperability benefits of UTF-8. The article mentions that C080 is sometimes used to encode U+0000 in languages that use 0 as a string terminator. -- BenRG (talk) 02:15, 2 March 2013 (UTC)[reply]
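Here is a sketch of that C080 trick (often called "Modified UTF-8", as used in Java's JNI): U+0000 is written as the two-byte sequence C0 80, so the encoded string never contains a zero byte and survives C-style string handling. The helper name below is made up for illustration, and its output is, by design, not valid standard UTF-8 (real Modified UTF-8 also treats code points above U+FFFF differently, which this sketch ignores).

    def encode_modified_utf8(s: str) -> bytes:
        # Encode NUL as C0 80 instead of a single 00 byte; everything else is
        # ordinary UTF-8. Only the NUL handling of Modified UTF-8 is shown here.
        return b"".join(b"\xc0\x80" if ch == "\x00" else ch.encode("utf-8")
                        for ch in s)

    data = encode_modified_utf8("a\x00b")
    print(data.hex())   # 61c08062 -- no 0x00 byte anywhere
    print(0 in data)    # False, so C string functions won't truncate it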