SimHash

In computer science, SimHash is a technique for quickly estimating how similar two sets are. The algorithm is used by the Google Crawler to find near duplicate pages. It was created by Moses Charikar. In 2021 Google announced its intent to also use the algorithm in their newly created FLoC (Federated Learning of Cohorts) system.^[1]

Evaluation and benchmarks

A large scale evaluation has been conducted by Google in 2006^[2] to compare the performance of Minhash and Simhash^[3] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling^[4] and using Minhash and LSH for Google News personalization.^[5]

References

^ Cyphers, Bennett (2021-03-03). "Google's FLoC Is a Terrible Idea". Electronic Frontier Foundation. Retrieved 2021-04-13.
^ Henzinger, Monika (2006), "Finding near-duplicate web pages: a large-scale evaluation of algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 284, doi:10.1145/1148170.1148222, ISBN 978-1595933690, S2CID 207160068.
^ Charikar, Moses S. (2002), "Similarity estimation techniques from rounding algorithms", Proceedings of the 34th Annual ACM Symposium on Theory of Computing, p. 380, doi:10.1145/509907.509965, ISBN 978-1581134957, S2CID 4229473.
^ Gurmeet Singh, Manku; Jain, Arvind; Das Sarma, Anish (2007), "Detecting near-duplicates for web crawling", Proceedings of the 16th International Conference on World Wide Web (PDF), p. 141, doi:10.1145/1242572.1242592, ISBN 9781595936547.
^ Das, Abhinandan S.; Datar, Mayur; Garg, Ashutosh; Rajaram, Shyam; et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th International Conference on World Wide Web, p. 271, doi:10.1145/1242572.1242610, ISBN 9781595936547, S2CID 207163129.

External links

[1] Cyphers, Bennett (2021-03-03). "Google's FLoC Is a Terrible Idea". Electronic Frontier Foundation. Retrieved 2021-04-13.

[2] Henzinger, Monika (2006), "Finding near-duplicate web pages: a large-scale evaluation of algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 284, doi:10.1145/1148170.1148222, ISBN 978-1595933690, S2CID 207160068.

[3] Charikar, Moses S. (2002), "Similarity estimation techniques from rounding algorithms", Proceedings of the 34th Annual ACM Symposium on Theory of Computing, p. 380, doi:10.1145/509907.509965, ISBN 978-1581134957, S2CID 4229473.

[4] Gurmeet Singh, Manku; Jain, Arvind; Das Sarma, Anish (2007), "Detecting near-duplicates for web crawling", Proceedings of the 16th International Conference on World Wide Web (PDF), p. 141, doi:10.1145/1242572.1242592, ISBN 9781595936547.

[5] Das, Abhinandan S.; Datar, Mayur; Garg, Ashutosh; Rajaram, Shyam; et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th International Conference on World Wide Web, p. 271, doi:10.1145/1242572.1242610, ISBN 9781595936547, S2CID 207163129.

[1]

[2]

[3]

[4]

[5]

v t e Machine learning evaluation metrics
Regression	MSE MAE sMAPE MAPE MASE MSPE RMS RMSE/RMSD R² MDA MAD
Classification	F-score P4 Accuracy Precision Recall Kappa MCC AUC ROC Sensitivity and specificity Logarithmic Loss
Clustering	Silhouette Calinski-Harabasz index Davies-Bouldin Dunn index Hopkins statistic Jaccard index Rand index Similarity measure SMC SimHash
Ranking	MRR NDCG AP
Computer Vision	PSNR SSIM IoU
NLP	Perplexity BLEU
Deep Learning Related Metrics	Inception score FID
Recommender system	Coverage Intra-list Similarity
Similarity	Cosine similarity Euclidean distance Pearson correlation coefficient
Confusion matrix

Evaluation and benchmarks

See also

References

External links