File:Zipf's law on War and Peace.png

Size of this preview: 500 × 599 pixels. Other resolutions: 200 × 240 pixels | 400 × 480 pixels | 641 × 768 pixels | 854 × 1,024 pixels | 2,011 × 2,411 pixels.

Original file (2,011 × 2,411 pixels, file size: 443 KB, MIME type: image/png)

This is a file from the Wikimedia Commons. Information from its description page there is shown below.
Commons is a freely licensed media file repository. You can help.

Summary

DescriptionZipf's law on War and Peace.png	English: Zipf's Law on War and Peace, plotted according to "Zipf's word frequency law in natural language: A critical review and future directions" (ST Piantadosi, 2014). It takes the text file, tokenizes it, then randomly split the tokens into two lists with 0.5 probability each. The word count in each list is counted. One word count is used to estimate the rank, and the other is used to estimate the frequency. This avoids the problem of correlated errors (see Piantadosi, 2014). The data is then plotted in log-log plot, and the Zipf law is fitted using the words with rank 10--1000 by least squares regression on the log-log plot. The alpha is estimated by hand (usually it's around 1 to 3) for convenience, though it can be estimated properly by standard least squares regression. The lower plot shows the remainder when the Zipf law is divided away. It shows that there is significant patterns not fitted by Zipf law. Text file of War and Peace from Project Gutenberg. Programmed in Python. Matplotlib code import nltk import urllib.request from collections import Counter import matplotlib.pyplot as plt import numpy as np import re import random def zipf_plot(txt_url, title, alpha=1, gutenberg=True): response = urllib.request.urlopen(txt_url) print("response received") long_txt = response.read().decode('utf8') # Remove the Project Gutenberg boilerplate if gutenberg: pattern = r"\\\* START OF THE PROJECT GUTENBERG EBOOK [^\] \\\" match = re.search(pattern, long_txt) if match: start_index = match.end() end_marker = "** END OF THE PROJECT GUTENBERG EBOOK" end_index = long_txt.find(end_marker, start_index) long_txt = long_txt[start_index:end_index].strip() else: print("Pattern not found in the text.") # Tokenize the text tokenizer = nltk.tokenize.RegexpTokenizer('\w+') tokens = tokenizer.tokenize(long_txt.lower()) # Randomly split the tokens into two corpora random.shuffle(tokens) half = len(tokens) // 2 corpus1, corpus2 = tokens[:half], tokens[half:] # Count word frequencies in each corpus counter1 = Counter(corpus1) counter2 = Counter(corpus2) # Find common words in both corpora common_words = set(counter1.keys()) & set(counter2.keys()) # Create a dictionary of word, frequency pairs where frequency is the average of frequencies in two corpora freqs = {word: (counter1[word] + counter2[word]) / 2 for word in common_words} # Create a dictionary of word, rank pairs where rank is determined from the frequencies in the second corpus ranks = {word: rank for rank, (word, freq) in enumerate(sorted([(w, counter2[w]) for w in common_words], key=lambda x: -x[1]), 1)} # Create arrays for ranks and frequencies, ordered by rank words = sorted(freqs.keys(), key=ranks.get) ranks = np.array([ranks[word] for word in words]) counts = np.array([freqs[word] for word in words]) plt.rcParams.update({'font.size': 30}) fig = plt.figure(layout="constrained", figsize=(20, 24)) axes = fig.subplot_mosaic("A;B", sharex=True) # Plot A: Zipf's Law with fitted line ax = axes["A"] ax.set_title(f"Zipf's Law on {title}") ax.set_ylabel('Frequency') # Zipf law fit indices = (ranks >= 10) & (ranks <= 1000) slope, intercept = np.polyfit(np.log(ranks[indices]), np.log(counts[indices]), 1) ax.plot(ranks, np.exp(intercept) * ranks ** slope, color='red', alpha=1, label=f"Zipf law (f = 1/(r+{alpha})^{-slope:.2f})") # Create the hexagonal heatmap using hexbin, on the log scale data hb = ax.hexbin(ranks+alpha, counts, gridsize=100, cmap='inferno', bins='log', xscale='log', yscale='log') # cb = plt.colorbar(hb, ax=ax) # cb.set_label('log10(N)') ax.legend() ax.grid() # Plot B: Zipf's Law with Zipf removed ax = axes["B"] ax.set_title("with Zipf removed") ax.set_xlabel('Frequency rank') ax.set_ylabel('Frequency/Zipf law') # Create the hexagonal heatmap using hexbin, on the log scale data hb = ax.hexbin(ranks+alpha, counts / (np.exp(intercept) * (ranks+alpha) ** slope), gridsize=100, cmap='inferno', bins='log', xscale='log', yscale='log') # cb = plt.colorbar(hb, ax=ax) # cb.set_label('log10(N)') ax.legend() ax.grid() plt.show() zipf_plot("https://www.gutenberg.org/cache/epub/2600/pg2600.txt", "War and Peace", alpha=2)
Date	11 July 2023
Source	Own work
Author	Cosmia Nebula

Licensing

I, the copyright holder of this work, hereby publish it under the following license:

This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

You are free:

to share – to copy, distribute and transmit the work
to remix – to adapt the work

Under the following conditions:

attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

File history

Click on a date/time to view the file as it appeared at that time.

	Date/Time	Thumbnail	Dimensions	User	Comment
current	22:11, 11 July 2023		2,011 × 2,411 (443 KB)	Cosmia Nebula	Uploaded while editing "Zipf's law" on en.wikipedia.org

File usage

The following page uses this file:

Zipf's law

Metadata

This file contains additional information, probably added from the digital camera or scanner used to create or digitize it.

If the file has been modified from its original state, some details may not fully reflect the modified file.

Software used	Matplotlib version3.7.1, https://matplotlib.org/
Horizontal resolution	39.37 dpc
Vertical resolution	39.37 dpc

File:Zipf's law on War and Peace.png

Summary

Matplotlib code

Licensing

Captions

Items portrayed in this file

depicts

creator

some value

copyright status

copyrighted

copyright license

Creative Commons Attribution-ShareAlike 4.0 International

inception

11 July 2023

media type

image/png

source of file

original creation by uploader

File history

File usage

Metadata