
Talk:Scraper site/Archives/2017

From Wikipedia, the free encyclopedia


Global nonprofit search engine...???

This quote:

A search engine is not a scraper site itself; sites such as Yahoo and Google gather content from other websites and index it so that the index can be searched with keywords. Search engines then display snippets of the original site content in response to a user's search.

is completely out of sync with current SE practices. Right now Google, Bing, and Yahoo (the top 3 SEs) provide high/full-quality images to download right from the search engine (not thumbnails or "snippets") with no permission other than the tacit permission of NOT blocking their bots from indexing your pages (and blocking them could be deadly). There are two dozen scraper sites harvesting my images that fly right to the top of the SE results, and I look like a footnote. I have a great tagging system down (I hired a contractor on a lowly artist's budget to write a program to tag bulk images with logos relative to their size, and I've spent a lot of time developing elite pro logos), but I see extremely few people doing the same. (Some people don't even think to put their name/site in their filenames.)

I mean, Google isn't sinister about it; they do provide a clear "View page" link. But to offset that, there are two ways of downloading the full image instantly. The image downloads at full quality when you click it and it enlarges (the preview looks small, but the full-quality image has downloaded to your browser once the image gets especially crisp), and there's also a clear "View image" button if you're absolutely too clueless or lazy to click "save [full] image" from the preview! Anyway, there's some objective information out there about this (a lot of people are angry), but since the SEs control the flow of information (there are actually fake scam-detection sites that give the scraper sites A+ ratings and no complaints), there isn't a whole lot to bring to an article. I might attack the task another day, but I'm too tired of tagging to try it any time soon.

What we should have is a global nonprofit search engine akin to Wikipedia or the GNU/GPL. What'a'ya say, guys? Squish7 (talk) 19:13, 9 September 2014 (UTC)

possibly useful sources

The chronology/timeline is mostly from The Epoch Times, but most of the indented content is from other sources.

  • 2004 == 'made for AdSense' sites (not sure if the neologism was also from 2004 or if that was applied retroactively)
  • 2006 == [content] cloaking
  • 2011 == Panda, and maybe the neologism 'scraper site'?
  • https://www.buzzfeed.com/jwherrman/why-does-google-still-reward-content-scraping
  • "She [a person at HuffPo] pointed out that BuzzFeed publishes link out pages too — that is, small stubs of outside stories that exist primarily to link to the source, and to give us a way to link out to other sites using our CMS. This is true."
  • "...A prominent Huffington Post link out can drive tens or even hundreds of thousands of clicks, and the site is generous with them. It can rightly claim to be a massive source of traffic for other publishers. But stub stories like this serve other purposes too: They send people to sites in the hope that they’ll come back to HuffPo to comment, for example, not unlike Reddit posts; and they provide an illusion of more content on a given vertical page. Most important, they’re an apparently deliberate play for Google traffic. Business Insider reposts links in a similar fashion in order to include them in its front page “river” column. Deputy Editor Nich Carlson explains, “When readers click on the story’s headline in our river, they go directly to the original story. Likewise, our CMS automatically puts a link to the original story in our tweets.” These posts, like HuffPo’s, are a sort of CMS hack, and not really meant to be read on their own."
  • "But BI’s stub stories rarely show up in Google and almost never outrank the stories they link to. This is by design: “We put a note for Google in the post’s metadata that tells Google to ignore our post, and give the ‘juice’ to the original story. We do this by noting a canonical link,” says Carlson."
  • "By not using HuffPo-style aggressive search engine optimization (for a variety of reasons), The Verge, and others, are leaving traffic on the table. For Google, this is far more damning [than for HuffPo]: Google is the table. A site should not be able to auto-post a stub of another story and immediately outrank it in the world’s most popular and powerful search engine — that is a bug. ...(Google’s head of Webspam and longtime search quality point man Matt Cutts has not responded to a request for comment.)"
  • 2014 == Panda upgrades (also, Google accused of being a scraper site)
  • 2015 == Panda v4.2 minor update
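
For anyone who wants to check the canonical-link claim in the Carlson quote above, here is a minimal Python sketch. It assumes the "note for Google" is an ordinary `<link rel="canonical">` element in the page head; the quote does not spell out the markup. The class name, the function name, and the example.org URLs are hypothetical placeholders, not anything taken from BI's CMS.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> seen in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical one is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical stub page that points search engines at the original story.
STUB = """<html><head>
<title>Stub story</title>
<link rel="canonical" href="https://example.org/original-story">
</head><body><a href="https://example.org/original-story">Read more</a></body></html>"""

print(find_canonical(STUB))
```

A stub page without the element returns `None`, which would be one rough way to tell the BI-style stubs described in the quote apart from stubs that compete with the original in search results.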

See also: Matt Cutts and Google Panda. If I have spare cycles I will try to come back to this; please verify these are all WP:RS entities before mainspacing. 47.222.203.135 (talk) 08:09, 4 January 2017 (UTC)

rm copyvio

"Sophisticated scraping activity can be camouflaged by utilizing multiple IP addresses and timing search actions so they don't proceed at robot-like speeds, and are more human-like." Everything up to the comma is word for word from mediabuzz.sg. Needs to be verified (though likely true), rephrased, and referenced. Elinruby (talk) 20:17, 17 January 2017 (UTC)

---ACTUALLY it looks more like *they* copied it, since there are other phrases in common. Leaving this section here as a note to self to verify this with a diff. Looking for references right now. Elinruby (talk) 20:20, 17 January 2017 (UTC)