May 15, 2013

Visualizing World Languages in Wikipedia

After seeing in the infographic created by Funders and Founders (http://notes.fundersandfounders.com/post/50347902559/worlds-top-languages-on-the-internet ) that the number of speakers of a language is not correlated with the volume of internet content in that language (obvious, perhaps), I decided to collect the number of Wikipedia articles in those languages. I gathered the article counts for the Wikipedias written in the world's top 10 languages and made these charts using Excel.

Sorted by Wikipedia articles
Sorted by population
After sorting by the number of Wikipedia articles, I see that although Mandarin Chinese is spoken by the most people in the world, it drops to 6th position when ranked by the number of Wikipedia articles. The bad thing about Excel is that after I sort the columns, it does not retain the same color for the same language :(
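One way around the color problem, outside Excel, is to tie each color to the language name instead of to the column position. Here is a minimal Python sketch of that idea. The counts are placeholder numbers for illustration only (not the real article counts), and the names `article_counts`, `color_of` and `sorted_bars` are made up for this example.

```python
# Placeholder numbers for illustration only; these are NOT real article counts.
article_counts = {"English": 400, "Spanish": 100, "Chinese": 60, "Bengali": 3}

# Assign each language a color once, keyed by name rather than by position.
palette = ["tab:blue", "tab:orange", "tab:green", "tab:red"]
color_of = dict(zip(sorted(article_counts), palette))


def sorted_bars(counts, colors):
    """Return (language, value, color) rows sorted by value, largest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # The color is looked up by language name, so it survives any re-sorting.
    return [(lang, value, colors[lang]) for lang, value in ranked]


bars = sorted_bars(article_counts, color_of)
for lang, value, color in bars:
    print(lang, value, color)
```

The same `color_of` mapping can be reused for the chart sorted by population, so each language keeps its color in both charts. If the rows are drawn with a plotting library such as matplotlib, the color column can be passed as the list of bar colors.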

Then I decided to look at the percentage of articles in each of those languages: not the percentage of all Wikipedia articles, but each language's share among the articles in these ten languages. So I drew this pie chart:
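The share used for the pie chart is each language's count divided by the sum over the selected languages only, not by the total for all of Wikipedia. A small Python sketch of that arithmetic follows. The counts are placeholders chosen so the result is easy to check by hand (they are not real article counts), and `shares_within_group` is a name made up for this example.

```python
def shares_within_group(counts):
    """Return each language's percentage of the group's combined total."""
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}


# Placeholder numbers for illustration only; these are NOT real article counts.
example_counts = {"English": 50, "Spanish": 30, "Chinese": 20}
shares = shares_within_group(example_counts)
print(shares)  # {'English': 50.0, 'Spanish': 30.0, 'Chinese': 20.0}
```

Because the denominator is the group total, the shares always add up to 100%, which is what a pie chart needs.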


Visualizing World Languages and Bangla Content in Internet

Found this visualization in my Facebook feed, shared by some of my friends, and also on my lab's mailing list:


Funders and Founders made this visualization, which compares the world's languages based on their number of speakers and their internet content. I am attaching the image from their website:

I would like to know more about the data and what they mean by "language used on the internet": language in text? Language in video/music? Everything? For the internet, another interesting observation would be to see which languages are generating more content now than before, because when the internet started, nearly everything was in English.

I see Bengali is in 7th position, way to go, population growth! At least we are among the top ten languages by number of speakers. But then I look at the bar graph at the bottom and can see the discrepancy. I am glad that Chinese and Spanish show some correlation between the number of speakers and the volume of internet content. The problem with the visualization is that the circles show the number, not the percentage, while the bars show the percentage, not the number. In both cases I would like to see both the percentage and the number. I would also like to see both sets of data in the same kind of visualization; I do not see why one is a circle and the other is a bar. It's difficult to compare the % of world population with the % of internet content. But I appreciate the effort to make a point with this visualization. I am confused, though, about the thought bubble with a person in a turban, another in a farmer's hat, and another in a baseball cap: is the content creator thinking about all of them? So s/he is trying to include internationalization features [for Chinese/Sikh/American people]?

Makes me sort of sad though. I don't know about others, but in the case of Bengali, the Wikipedia movement is trying to create more content in Bengali. The thing is, people who can afford a computer and internet already know workable English (because they are privileged enough to go to school and have electricity), so they don't feel the need for Bengali content. And the people who actually need it in Bengali cannot afford computers. When our higher-level textbooks and exams are also in English, it makes more sense to search for academic documents in English. We cannot deny that the lingua franca of modern academic publication and research is English.

Although I wonder what the chart would look like if they only considered social/creative content (like blog posts and Facebook posts). Standard Unicode Bengali fonts appeared around 2001; before that, people had to write Bengali using Roman script. Phonetic keyboards have made it a lot easier, so we can type in Bengali without using a Bengali keyboard layout. It is easier than before, but still not as easy as typing in English.



Bengali-speaking area.
Source: Wikipedia:
http://en.wikipedia.org/wiki/Bengali_language
Most of the Bengali content I have come across on the internet is:
       Text: newspaper articles, blogs, social media status updates, textbooks.
       Video: Bengali songs, movies, or TV dramas (YouTube), video lectures (Shikkhok.com).
       Audio: Music sites sharing mp3s.


The blog posts and newspaper articles are written in Bengali script, but Facebook posts and tweets are not always. Most people transliterate Bengali into the Roman alphabet, probably because of the Roman keyboard and because Bangla writing software is not easy to use. First of all, we don't need to install any software to use the Roman alphabet, and people who seldom write in Bangla might not be interested in using such software.

Another reason is familiarity: we do not have Bangla writing options on cell phones (at least in most cases), so people are already used to texting in Bangla using the Roman alphabet, and they do the same when they update their status on Facebook.

Moreover, there is no plugin that lets me write Bangla in place when I update my status on Facebook/Twitter. I have to type it somewhere else and then copy-paste the text into the status update text area.
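If someone wanted to measure how often Bangla posts are written in Roman script rather than Bengali script, a rough first step is to check which script the characters of a post belong to. The sketch below is only an assumption about how one might do it: it counts characters in the Unicode Bengali block (U+0980 to U+09FF) against ASCII letters. The function name `dominant_script` is made up for this example, and it cannot tell romanized Bangla from English; it only detects the script.

```python
BENGALI_START, BENGALI_END = 0x0980, 0x09FF  # Unicode Bengali block


def dominant_script(text):
    """Return 'bengali', 'roman', or 'unknown' based on character counts."""
    bengali = sum(1 for ch in text if BENGALI_START <= ord(ch) <= BENGALI_END)
    roman = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if bengali > roman:
        return "bengali"
    if roman > bengali:
        return "roman"
    return "unknown"  # a tie, or no letters at all


# "\u0986\u09ae\u09bf" is the word "ami" ("I") written in Bengali script.
print(dominant_script("\u0986\u09ae\u09bf"))  # bengali
print(dominant_script("ami bhalo achi"))      # roman
```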

Also, a big issue is reaching the appropriate audience. Almost everyone understands English ... OK, almost everyone using the internet understands English. So am I losing some audience when I write in my native language? Or maybe I can gain more Bangali audience by writing in Bangla.

Now I come back to the original point: the privileged people may not care about more Bangla content, and the people who need Bangla content the most are living on the edge, so we cannot build a business model on that population. Then who are the consumers? If there is an e-commerce website in Bangla, how will it compete against the English ones? Or do we even need such websites in Bangla? I am not aware of the current situation of mobile apps in Bangladesh, so I cannot tell whether fishermen or farmers are actually using mobile internet to sell their products. But if that scheme turns out to be cost-effective for them, then that's a huge area for developing Bangla mobile apps.

Still, I am optimistic about the use of Bangla on the internet. Now we have a full-fledged search engine, Pipilika (the ant): http://pipilika.com/ , where we can search for Bangla content. We learn best when we learn in our mother tongue, so there is definitely a need for more Bangla content that will help us learn. And no matter what, we think in our own language, so when it comes to sharing our creativity, we prefer our own language. When I see how the Bangla blogosphere is emerging (to the point where the government is considering censoring it), I feel good about the future of creative content in Bangla on the internet. I mean, aren't we the people who love to write poems and fiction!