Wikipedia:Knowledge gaps in women's health/Task lists/Research questions
RQ1
During the WikiWomenCamp, a point was raised that women are often 'victimized' on Wikipedia while men are shown to have 'achievements'. This is a very difficult hypothesis to test, and there has not been much research in this area. So, I have been thinking about doing some simple tests to find out whether this is the case.

I found an idea that might give us an indication of whether women are disproportionately victimized. On English Wikipedia, there is an entire class of articles starting with 'Murder of..', 'Disappearance of..', 'Kidnapping of..' etc., whose titles indicate victimization. For example, in the article 'Murder of Vandana Das', the person 'Vandana Das' is notable only because of her death and is otherwise not notable. By contrast, the murders of already notable individuals do not usually appear as separate articles but as sections in their biographies (information regarding the murder of Gauri Lankesh is given as a section in the article about her). There are, of course, exceptions if the notable individual's murder was a landmark event (such as Assassination of John F. Kennedy).

I wrote a query on Quarry to extract the list of articles starting with 'Murder of..', 'Assassination of..' etc. Together, they constitute over 5,000 articles. Next, I want to categorize them by the sex of the victim. This is very hard to do, because most of the 'Murder of..' articles have no indication of the sex of the murdered individual on Wikidata, so the alternative is the difficult task of reading through the actual Wikipedia articles manually. After the sex mapping is done, I would like to see whether there are significant differences in the number of female vs. male articles (assuming that articles about non-binary individuals are too few to amount to a statistically significant number). I also want to compare this data with existing homicide data, and see whether Wikipedia reports female murders disproportionately to what is happening in reality.
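The comparison with homicide data could be run as a one-sample test of a proportion once the sex mapping exists. The following is only a minimal sketch under stated assumptions: the function name `proportion_z_test` and its parameters are hypothetical, the normal approximation is one of several possible tests, and the baseline share of female victims would have to come from a real homicide dataset that is not identified here.

```python
import math


def proportion_z_test(female_count, total_count, baseline_share):
    """Two-sided one-sample z-test for a proportion (normal approximation).

    female_count   -- number of articles whose victim was mapped as female
    total_count    -- number of articles with a known (female or male) mapping
    baseline_share -- share of female victims in the reference homicide data,
                      as a fraction strictly between 0 and 1 (an assumption
                      the analyst must supply from an external source)
    Returns (z, p_value).
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 < baseline_share < 1:
        raise ValueError("baseline_share must be strictly between 0 and 1")

    observed_share = female_count / total_count
    # Standard error of the share under the null hypothesis.
    standard_error = math.sqrt(baseline_share * (1 - baseline_share) / total_count)
    z = (observed_share - baseline_share) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder numbers only, NOT real counts from the article list:
z, p_value = proportion_z_test(female_count=60, total_count=100, baseline_share=0.5)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A positive `z` with a small `p_value` would suggest that female victims are over-represented relative to the chosen baseline. It would not say why, and articles left unmapped would need to be reported separately.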
Any ideas on NLP or other best practices for classifying the list of 5,000+ articles by the sex of the victim?
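One simple baseline, offered as a rough sketch rather than a recommended method, is to count gendered pronouns in the lead section of each article and leave ambiguous cases for manual review. The function name `guess_sex` and the thresholds `min_count` and `min_ratio` are hypothetical and untuned; the heuristic can be misled by pronouns that refer to the perpetrator or to relatives, and it cannot identify non-binary individuals.

```python
import re

FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}


def guess_sex(text, min_count=5, min_ratio=3.0):
    """Guess 'female', 'male' or 'unknown' from pronoun counts in `text`.

    min_count -- minimum number of gendered pronouns required to decide
    min_ratio -- how many times more frequent the dominant set must be
    Both thresholds are illustrative assumptions, not validated values.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    female = sum(1 for token in tokens if token in FEMALE_PRONOUNS)
    male = sum(1 for token in tokens if token in MALE_PRONOUNS)

    if female + male < min_count:
        return "unknown"  # too little evidence, send to manual review
    if female >= min_ratio * max(male, 1):
        return "female"
    if male >= min_ratio * max(female, 1):
        return "male"
    return "unknown"  # mixed pronouns, e.g. victim and perpetrator differ


sample = ("She was last seen near her home. Her family reported her "
          "missing, and she was never found.")
print(guess_sex(sample))
```

Running this only on the lead section, where the subject of the article is usually introduced, and checking a random sample of the results by hand would give an estimate of the error rate before the counts are used in any comparison.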