
Talk:List of countries by GNI (PPP) per capita/table generator.py

From Wikipedia, the free encyclopedia

Subpages index:


Pages with the prefix 'List of countries by GNI (PPP) per capita' in the main and 'Talk' namespaces:

TODO:

  • Generalise this for other World Bank data.
  • Output the SVG map, too.
  • Possibly convert to Lua so that MediaWiki can run the code natively.

Why is this in the Talk namespace?


Because of the policy in WP:SUBPAGES. There is no specific policy for helper scripts, but generally anything that shouldn't be seen as a (current, future, or draft) part of the encyclopaedia should go in a Talk subpage rather than the main namespace, if no more appropriate namespace exists.

What does this script do?


The script is straightforward to read if you understand Python, but here is an outline:

It processes World Bank data (that you download yourself from the sources cited in the article) and generates Wikipedia table syntax for the article List of countries by GNI (PPP) per capita.
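As a rough illustration of the input side, here is a minimal sketch of reading a file shaped like a World Bank download, where a few preamble lines come before the header row (the script skips four). The sample text, the economy "Exampleland", and the helper name read_world_bank_csv are all made up for this example, and it uses the standard csv module rather than pandas, which is what the script itself uses.

```python
import csv
from io import StringIO

# Made-up sample in the shape of a World Bank download: four preamble
# lines, then the header row, then one row per economy.
SAMPLE = '''"Data Source","World Development Indicators",

"Last Updated Date","2024-01-01",

"Country Name","Country Code","2000","2010","2020"
"Exampleland","EXL","1000","2000","3000"
'''

def read_world_bank_csv(text, skip_lines=4):
    """Return the data rows as dicts, ignoring the preamble lines."""
    lines = text.splitlines(keepends=True)[skip_lines:]
    return list(csv.DictReader(StringIO("".join(lines))))

rows = read_world_bank_csv(SAMPLE)
print(rows[0]["Country Name"], rows[0]["2020"])
```

If a future download has a different number of preamble lines, skip_lines is the value that would need adjusting.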

It calculates average growth rates, formats data, handles country name overrides (where WB and WP disagree on what the same country is called) and dependencies (e.g. Bermuda does not get a rank in the WP article but is seen as an economy by WB), and creates tables for countries, regions, and income groups.
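To make the growth-rate step concrete, here is a minimal sketch of a compound average growth rate, which is the quantity the script's calculate_growth_rate computes. The function name average_growth_rate and the input figures are made up for illustration. Unlike the script, this version returns NaN instead of raising an error when the end year is not after the start year.

```python
def average_growth_rate(gni_start, gni_end, year_start, year_end):
    """Compound average growth per year between two GNI values."""
    num_years = year_end - year_start
    # Assumption for this sketch: anything that cannot be computed is NaN.
    if num_years <= 0 or gni_start <= 0 or gni_end <= 0:
        return float("nan")
    return (gni_end / gni_start) ** (1 / num_years) - 1

# A GNI that doubles over 20 years grows by about 3.53% per year.
rate = average_growth_rate(10000, 20000, 2000, 2020)
print(f"{rate * 100:.2f}%")
```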

The script's output matches the format the Wikipedia table had before the script was in use.
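For a feel of that format, here is a simplified sketch of how one table row is put together. The helper name format_row and the figures are made up, and the real rows carry more columns (2000, 2010, 2020 and the growth rate). The two things it illustrates are taken from the script: dependencies get "-" instead of a rank and are written in italics with the responsible state's initials, and numbers are wrapped in {{formatnum:}} so that the Wikipedia parser adds the separators.

```python
def format_row(rank, name, year, gni, sovereign=None):
    """Build one wikitable row; dependencies are italic and unranked."""
    flag = "{{flag|" + name + "}}"
    if sovereign:
        rank = "-"
        flag = f"''{flag} ({sovereign})''"
    return f"| {rank} || {flag} || {year} || {{{{formatnum:{gni}}}}}"

# Illustrative values only, not real World Bank figures.
print(format_row(1, "Exampleland", 2023, 12345))
print(format_row(None, "Bermuda", 2023, 54321, sovereign="UK"))
```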

What should I know before I run it?


While I ensured the code only does what it is meant to do, and nothing more, the following disclaimer should be heeded before you run it on any machine.

The Software is provided “as is”, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the Software.

— Copied from the text of the MIT License

Basically: you run the code as if you wrote, read, and understood it yourself. You take the blame if anything goes wrong on your end.

How do I run it?


I can't say exactly how you'd run it on your system, but I run it on Debian with a recent version of Python installed. I put the source data (from the World Bank) in my terminal's current working directory, run the "commands to set up environment" listed at the top of the script, then paste the entire script into IPython and let it execute.
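Since the script looks for its input files in the current working directory, a quick check like the following can save a confusing error. This is only a sketch: the helper name missing_inputs is made up, and the file names are the ones hard-coded in the script, which will probably differ for a newer World Bank download.

```python
from pathlib import Path

# File names as hard-coded in the script; adjust to match your download.
REQUIRED_FILES = [
    "API_NY.GNP.PCAP.PP.CD_DS2_en_csv_v2_3407668.csv",
    "Metadata_Country_API_NY.GNP.PCAP.PP.CD_DS2_en_csv_v2_3407668.csv",
]

def missing_inputs(directory="."):
    """Return the required file names that are not present in directory."""
    folder = Path(directory)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]

missing = missing_inputs()
if missing:
    print("Missing from the current directory:", ", ".join(missing))
else:
    print("Both input files found.")
```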

If you run into issues running it, please let the primary author know. The primary author may change in the future, so check this page's contribution history to find out who that is.

Why does it exist?


I knew of no easy way to update all the data en masse, which has to happen roughly every year for the article to stay accurate.

Script

# ~60% AI generated (GPT-4o)
# for attribution please see contribution history on this page:
# https://en.wikipedia.org/wiki/Talk:List_of_countries_by_GNI_(PPP)_per_capita/table_generator.py
# For CC-By-SA license please see:
# https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_Creative_Commons_Attribution-ShareAlike_4.0_International_License
# Commands to set up environment:
"""
sudo apt-get install python3-venv
python3 -m venv myenv
source myenv/bin/activate
pip install pandas
"""
import pandas as pd

# Economies with no 'rank' that we should also mark as italic with the responsible state's initials:
dependencies = {
    'BMU': 'UK',  # BMU is Bermuda's World Bank country code.
    'HKG': 'CN',  # Hong Kong SAR, China
    'MAC': 'CN',  # Macao SAR, China
    'CYM': 'UK',  # Cayman Islands
    'ABW': 'NL',  # Aruba
    'SXM': 'NL',  # Sint Maarten (Dutch part)
    'PRI': 'US',  # Puerto Rico
    'CUW': 'NL',  # Curacao
    'TCA': 'UK',  # Turks and Caicos Islands
    'AIA': 'UK',  # Anguilla
    'VGB': 'UK',  # British Virgin Islands
    'FLK': 'UK',  # Falkland Islands (Malvinas)
    'GIB': 'UK',  # Gibraltar
    'MSR': 'UK',  # Montserrat
    'SHN': 'UK',  # Saint Helena, Ascension, and Tristan da Cunha
    'GGY': 'UK',  # Guernsey
    'JEY': 'UK',  # Jersey
    'IMN': 'UK',  # Isle of Man
    'COK': 'NZ',  # Cook Islands
    'NIU': 'NZ',  # Niue
    'GRL': 'DK',  # Greenland
    'FRO': 'DK',  # Faroe Islands
    'ASM': 'US',  # American Samoa
    'GUM': 'US',  # Guam
    'MNP': 'US',  # Northern Mariana Islands
    'VIR': 'US',  # Virgin Islands (U.S.)
    'CXR': 'AU',  # Christmas Island
    'CCK': 'AU',  # Cocos (Keeling) Islands
    'NFK': 'AU',  # Norfolk Island
    'TKL': 'NZ',  # Tokelau
    # Overseas collectivities of France not listed
    # because the World Bank does not regard them
    # as separate to France (and neither does France)
}

# Cases where the World Bank and Wikipedia have different names for the same place
country_name_override = {
    'KOR': 'South Korea',
    'BHS': 'Bahamas',
    'BRN': 'Brunei',
    'COG': 'Republic of the Congo', 
    'COD': 'DR Congo',
    'COM': 'Comoros',
    'CPV': 'Cabo Verde',
    'CIV': 'Cote d\'Ivoire',
    'DMA': 'Dominica',
    'EGY': 'Egypt',
    'FJI': 'Fiji',
    'GMB': 'Gambia',
    'HKG': 'Hong Kong',
    'IRN': 'Iran',
    'LAO': 'Laos',
    'MAC': 'Macau',
    'FSM': 'Micronesia',
    'MNE': 'Montenegro',
    'MRT': 'Mauritania',
    'PHL': 'Philippines',
    'KNA': 'St. Kitts and Nevis',
    'LCA': 'St. Lucia',
    'VCT': 'St. Vincent and the Grenadines',
    'SYR': 'Syria',
    'TCD': 'Chad',
    'TUR': 'Turkey',
    'TLS': 'Timor-Leste',
    'VEN': 'Venezuela',
    'VNM': 'Vietnam',
    'PSE': 'Palestine',
    'YEM': 'Yemen',

    # Islands and Territories
    'BMU': 'Bermuda',
    'CYM': 'Cayman Islands',
    'CUW': 'Curaçao',
    'TCA': 'Turks and Caicos Islands',
    'SXM': 'Sint Maarten',
    'PRI': 'Puerto Rico',

    # No World Bank data but likely to be in the future:
    'PRK': 'North Korea',
    # and Bougainville: independence is likely to be internationally recognised
}

# Read a CSV file, skipping the preamble lines that come before the header row
def read_trimmed_csv(file_path, skip_lines):
    # pandas can skip leading lines itself, so there is no need to
    # read the whole file into memory and re-parse it from a string.
    return pd.read_csv(file_path, skiprows=skip_lines)

file_path = 'API_NY.GNP.PCAP.PP.CD_DS2_en_csv_v2_3407668.csv'
metadata_path = 'Metadata_Country_API_NY.GNP.PCAP.PP.CD_DS2_en_csv_v2_3407668.csv'

# Trim first 4 lines of the main data file, and read into a dataframe
df = read_trimmed_csv(file_path, 4)
# Metadata has headers right at the top:
metadata_df = pd.read_csv(metadata_path)

# Merge main data with metadata
merged_df = df.merge(metadata_df[['Country Code', 'Region']], on='Country Code', how='left')

# Columns of interest and years
years_of_interest = [2000, 2010, 2020]
all_years = [col for col in df.columns if col.isdigit() and len(col) == 4]
selected_columns = ['Country Name', 'Country Code'] + all_years + ['Region']

# Filter to include only the necessary columns
merged_df = merged_df[selected_columns]

# Calculate the average growth rate
def calculate_growth_rate(gni_start, gni_end, year_start, year_end):
    num_years = year_end - year_start
    if num_years <= 0:
        raise ValueError("End year must be greater than start year.")
    if gni_start > 0 and gni_end > 0:
        return ((gni_end / gni_start) ** (1 / num_years)) - 1
    return float('nan')

# Determine the most recent year of available GNI data
# (notna().any() checks that a value exists; dropna().any() would test whether a value is non-zero)
merged_df['Most Recent Year'] = merged_df[all_years].apply(lambda row: row.dropna().index[-1] if row.notna().any() else '', axis=1)
merged_df['Most Recent GNI'] = merged_df[all_years].apply(lambda row: row.dropna().iloc[-1] if row.notna().any() else float('nan'), axis=1)

# Make Wikipedia parser add number separator
def formatnum(n):
    if n:
        return "{{formatnum:" + str(n) + "}}"
    else:
        return n

ftn = formatnum # shorthand for f-string

def create_wp_table_header(wb_regions=False):
    header = "{| class=\"wikitable sortable sticky-header sort-under "+("col1left" if wb_regions else "col2left")+''' {{right}}
|- 
''' + ("! style=width:19em | World region" if wb_regions else "!\n! style=width:16.5em | Economy") + '''
! Year
! GNI PPP<br/>per capita<br/>([[Geary–Khamis dollar|Int$]])
! 2000 
! 2010
! 2020
! {{tooltip|Growth<br/>rate|Average growth rate (2000 or closest year after that, up to Most Recent Year)}}
|-'''
    return header

def create_wikipedia_table(df, wb_regions=False):
    df = df.sort_values(by='Most Recent GNI', ascending=False, na_position='last')
    wikipedia_table = create_wp_table_header(wb_regions=wb_regions)
    row_i = 0

    for _, row in df.iterrows():
        country_name = row['Country Name']
        country_code = row['Country Code']

        gni_2000 = int(row['2000']) if pd.notna(row['2000']) else ''
        gni_2010 = int(row['2010']) if pd.notna(row['2010']) else ''
        gni_2020 = int(row['2020']) if pd.notna(row['2020']) else ''

        gni_recent = row['Most Recent GNI']
        recent_year = row['Most Recent Year']
        gni_recent_str = int(gni_recent) if pd.notna(gni_recent) else ''

        if recent_year:
            recent_year = int(recent_year)
        year_start = 2000
        # Default to an empty cell; a rate needs two data points at least one year apart,
        # otherwise calculate_growth_rate would raise ValueError.
        growth_rate = ''
        if gni_2000:
            if recent_year > year_start:
                growth_rate = f"{calculate_growth_rate(float(gni_2000), float(gni_recent), year_start, recent_year) * 100:.2f}%"
        else:
            # No 2000 figure: start from the first year after 2000 that has data
            closest_year_after_2000 = next((year for year in all_years if pd.notna(row[year]) and int(year) > 2000), None)
            if closest_year_after_2000 and recent_year > int(closest_year_after_2000):
                year_start = int(closest_year_after_2000)
                growth_rate = f"{{{{tooltip|{calculate_growth_rate(float(row[closest_year_after_2000]), float(gni_recent), year_start, recent_year) * 100:.2f}%|{year_start} to {recent_year}}}}}"

        common_numbers = f"|| {recent_year} || {ftn(gni_recent_str)} || {ftn(gni_2000)} || {ftn(gni_2010)} || {ftn(gni_2020)} || {growth_rate}\n|-"

        if not wb_regions:
            country_name = country_name_override.get(country_code, country_name)
            if country_code in dependencies:
                this_row_i = "-"
                flag = f"''{{{{flag|{country_name}}}}} ({dependencies[country_code]})''"
            else:
                row_i += 1
                this_row_i = row_i
                flag = f"{{{{flag|{country_name}}}}}"
            wikipedia_table += f"\n| {this_row_i} || {flag} "+common_numbers
        else:
            wikipedia_table += f"\n| {country_name} "+common_numbers

    wikipedia_table = wikipedia_table.rstrip("\n|-") + '\n|}'
    return wikipedia_table

# Separate countries and regions data
countries_df = merged_df[merged_df['Region'].notna()]
regions_df = merged_df[merged_df['Region'].isna()]

# Create a table containing only the rows whose name is in regions_list
def create_region_income_table(df, regions_list, wb_regions=False):
    region_df = df[df['Country Name'].isin(regions_list)]
    return create_wikipedia_table(region_df, wb_regions=wb_regions)

countries_table = create_wikipedia_table(countries_df)
print("Countries Table:\n")
print(countries_table)

# Row names (as spelled in the World Bank data) for each of the region tables in the Wikipedia article
t1 = ["World",
"Sub-Saharan Africa",
"Latin America & Caribbean",
"North America",
"Europe & Central Asia",
"East Asia & Pacific",
"South Asia",
"Middle East & North Africa"]

t2 = ["High income",
"Upper middle income",
"Middle income",
"Lower middle income",
"Low income"]

t3 = ["Africa Eastern and Southern",
"Africa Western and Central",
"Arab World",
"Central Europe and the Baltics",
"Euro area",
"European Union",
"Fragile and conflict affected situations",
"Least developed countries: UN classification",
"OECD members",
"Heavily indebted poor countries (HIPC)",
"Other small states",
"Caribbean small states",
"Pacific island small states",
"Small states"]

# Generate tables for various regions and income groups
t1_table = create_region_income_table(regions_df, t1, wb_regions=True)
t2_table = create_region_income_table(regions_df, t2, wb_regions=True)
t3_table = create_region_income_table(regions_df, t3, wb_regions=True)

# Print the generated tables for regions and income groups
print("\nRegions Table t1:\n")
print(t1_table)

print("\nIncome Groups Table t2:\n")
print(t2_table)

print("\nRegion Subdivisions Table t3:\n")
print(t3_table)

Draft output

Placeholder!