
January 2012

Using the Internet for mass collaboration

By Samir Kapadia
Chartered Accountant
About this article:

This article is based on a video of a talk by Luis von Ahn, posted recently on the popular site www.ted.com. The video itself was recorded around April 2011.

Every once in a while you come across something, an idea or a vision, that bowls you over completely. What strikes you most is its simplicity. This article is about one such idea, and about how a few individuals have used their minds to harness the energies of millions of people to help make a difference.

The Internet, as a resource, is viewed differently by different individuals. For some it is a source of information and knowledge, for others it is a means of earning a livelihood, and then there are those who are able to use their limitless imagination and ingenuity to effortlessly harness the power and labour of millions and millions of individuals, to achieve the unbelievable or the next to impossible.

One honest confession I need to make: I had heard about mass collaboration and had seen its practical application (Wikipedia being one example), but when I first saw this video, I was completely awestruck and blown away.

Here are a few statistics to tell you why:
  • Currently, more than 350,000 websites are using these ideas
  • The time spent per day is equivalent to 500,000 man-hours
  • The number of words digitised by these sites exceeds 100 million a day, equivalent to the effort required to digitise approximately 2.5 million books a year
  • The effort is contributed one word at a time, at about 10 seconds per person, by approximately 500 million people
Mind you, this is just a sample of what limitless imagination and ingenuity can achieve.
So what is this mind-boggling, out-of-the-box idea that I am raving about? Well, all I can say are three words: CAPTCHA, RECAPTCHA and DUOLINGO.
CAPTCHA:

Captcha = Completely Automated Public Turing test to tell Computers and Humans Apart
“What did you say?” is a very common response, so let me translate that into non-geek language.
Let us say you are trying to register on or log into sites like Google, Facebook, Twitter (and several others) and you see some oddly distorted letters/words (see picture below).
These seemingly innocuous letters (or text pieces) are a common sight today. While most people recognise them as a security feature, fewer web surfers know that they are tools for identifying whether the party accessing the site is a human being or a computer (bot), hence the name: Completely Automated Public Turing test to tell Computers and Humans Apart.
For those of you who are unaware, a ‘bot’, unlike a human, cannot read distorted words. When you type the (correct) words in the box, it proves that you are human, and the website allows you to register/access content/purchase goods/make reservations, etc.
Over time, Captcha has become (almost) a standard security feature. In the video, von Ahn revealed that (by April 2011) more than 350,000 websites were using Captcha and approximately 500 million users every day were spending 10 seconds each while accessing various e-commerce sites.
The first reaction to the above is ‘WOW’: 350,000 websites, 500 million users. von Ahn too felt a sense of pride that his invention was being used by so many people. But then he also considered that each of these 500 million users was spending 10 seconds on the verification process, which translated to approximately 500,000 man-hours. Then came the thought: “Is there something I can do to utilise this effort to do something huge but simple, something that machines cannot (as yet) do as efficiently as humans can?” Needless to say, stopping the use of Captcha, given its benefits, was not an option. This thought was the seed of further research, resulting in what is commonly known as RECAPTCHA.
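The figures quoted above are easy to sanity-check with a few lines of Python. This is only a back-of-the-envelope sketch; the function names are my own and the 10 seconds per entry is the figure cited from the video. At 10 seconds each, 500,000 hours a day corresponds to about 180 million entries, whereas 500 million entries would add up to nearly 1.4 million hours, so the two numbers are best read as rough orders of magnitude rather than as an exact pair.

```python
SECONDS_PER_HOUR = 3600


def hours_per_day(entries_per_day, seconds_per_entry=10.0):
    """Total human time spent per day, in hours."""
    return entries_per_day * seconds_per_entry / SECONDS_PER_HOUR


def entries_for_hours(hours, seconds_per_entry=10.0):
    """Number of entries per day that would add up to the given hours."""
    return hours * SECONDS_PER_HOUR / seconds_per_entry


# 500 million entries a day at 10 seconds each
hours_at_500m = hours_per_day(500_000_000)

# Entries a day needed to reach 500,000 hours at 10 seconds each
entries_at_500k_hours = entries_for_hours(500_000)

print(f"{hours_at_500m:,.0f} hours a day")            # 1,388,889 hours a day
print(f"{entries_at_500k_hours:,.0f} entries a day")  # 180,000,000 entries a day
```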

RECAPTCHA:

von Ahn and his associate/intern came up with this idea on the basis of their research findings. The idea was to use those 500,000 man-hours to digitise books. Several projects are doing this already, including one being pursued by Google. As most people involved in such projects know, books are digitised using computers and, more specifically, optical character recognition (OCR) technology. Typically, one person uses a scanner to scan one page at a time and then waits for the OCR software to convert the scanned image into a document.

What is not commonly known (at least among the public at large) is that the technology is not 100% accurate. The software is, at times, unable to ‘recognise’ many of the characters it scans. This is more so when the book being scanned is older than 10 years. The difficulty arises from a variety of factors such as the typeface used, yellowing of the pages, creases in the pages and the general wear and tear of the book. In all such cases, human effort is required (computers cannot do it as easily as humans).

Thus, RECAPTCHA was born. Once again the idea was a simple one: the visitor was presented with two words (instead of one in Captcha), one already known to the software and the other still to be ‘recognised’. When the visitor typed the known word correctly, he was granted access to the site he was visiting. All the while, in the background, RECAPTCHA compared his response for the unrecognised word with the responses provided by another 10 users (who were given the same combination). If the results matched, another word was digitised.
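For readers who like to see the logic spelt out, the two-word scheme described above can be sketched roughly as follows. This is a simplified illustration and not the actual RECAPTCHA implementation: the class name, the method names and the agreement threshold are all hypothetical, with the threshold loosely based on the "another 10 users" mentioned above.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 10  # assumption, taken from the "another 10 users" above


class WordPairChallenge:
    """One word the software already knows, paired with one OCR could not read."""

    def __init__(self, control_answer, threshold=AGREEMENT_THRESHOLD):
        self.control_answer = control_answer.strip().lower()
        self.threshold = threshold
        self.votes = Counter()   # visitors' readings of the unknown word
        self.digitised = None    # filled in once enough visitors agree

    def submit(self, control_guess, unknown_guess):
        """Return True if the visitor passes, i.e. typed the known word correctly."""
        if control_guess.strip().lower() != self.control_answer:
            return False         # failed the human test, so the other guess is ignored
        guess = unknown_guess.strip().lower()
        self.votes[guess] += 1
        if self.digitised is None and self.votes[guess] >= self.threshold:
            self.digitised = guess
        return True


# Usage with a small threshold so the example stays short
challenge = WordPairChallenge(control_answer="morning", threshold=3)
for _ in range(3):
    challenge.submit("morning", "overlooks")
rejected = challenge.submit("mornin", "junk")  # wrong known word, not counted

print(challenge.digitised)  # overlooks
```

The point of the design is that the known word does the security work while the unknown word collects independent readings, and a reading is accepted only after enough visitors agree on it.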

Once again the idea was a runaway success. The number of words digitised by these sites exceeds 100 million a day, equivalent to the effort required to digitise approximately 2.5 million books a year. Given this success, RECAPTCHA was acquired by Google.

von Ahn and team revisited their question and embarked on yet another journey. This time they decided that all the parties involved in the process should have something to gain. In Captcha, human effort was used only to verify that visitors were human; while this helped website owners, the effort itself was wasted. Recaptcha put this effort to use in converting books, so website owners and book readers gained, but the visitors who assisted in the digitising process received nothing in return. This thought gave birth to DUOLINGO.
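As a quick check on the words-to-books equivalence, the short calculation below works out the average book length that the two quoted figures imply. The result is simply what the numbers imply under the assumption of a 365-day year; it is not a figure given in the video or this article.

```python
WORDS_PER_DAY = 100_000_000   # quoted figure
BOOKS_PER_YEAR = 2_500_000    # quoted figure
DAYS_PER_YEAR = 365           # assumption

words_per_year = WORDS_PER_DAY * DAYS_PER_YEAR
implied_words_per_book = words_per_year / BOOKS_PER_YEAR

print(f"{words_per_year:,} words a year")                   # 36,500,000,000 words a year
print(f"{implied_words_per_book:,.0f} words per book")      # 14,600 words per book
```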

DUOLINGO:

Just like digitising books, translating content is another ‘skill’ which machines/software do not possess (as yet). It is one thing to merely translate words and quite another to translate them in context. It is the context in which the words are spoken that makes the text readable and, by that measure, more comprehensible. If you don’t believe me, try using the available translators to convert a poem from Hindi to English or vice versa. (No offence, but it’s like watching a Chinese movie dubbed in Tamil: the tone and pitch of the dialogue in a fight scene versus the body language. I have always found it hilarious; try it sometime.) Coming back to the topic, von Ahn and team came up with the idea of DUOLINGO.
What von Ahn and team realised was that there was content on the web which needed to be translated. The video cites the example of translating the English version of Wikipedia into Spanish. Currently the Spanish version is only 30% of the size of the English version, and the cost of the translation, as the video suggests, would be $50 million even from the lowest-cost vendor relying on exploited labour in a third-world country. Cost apart, the other quandary was where to find enough people who know more than one language and are willing to participate in the translation process. The solution: there are hundreds of thousands of people who want to learn another language and who have to pay money to do so. Here was an opportunity for them to learn and apply at the same time, without spending anything from their pockets. Now there is a win-win for almost all!

  • Content can be translated
  • With context, translation is easier and more fun, and it improves the learning experience
  • The accuracy is far higher than that offered by software currently available and almost comparable to that of a professional translator
  • Neither party pays money; both put in their ‘efforts’
  • Both parties gain
  • And, as a side benefit, there is less exploitation of labour

The result: based on current stats the translation can be done in a matter of weeks.

Now that’s what I call innovation.

Like I said earlier, I was completely blown away when I saw the video. I am sure that after reading this write-up (and maybe watching the video) you will be too.

Wishing you a Merry Christmas and a Happy New Year.

Disclaimer:

The purpose of this article is not to promote any particular site, person or software. The sole intention is to create awareness and to bring some thought-provoking content into the limelight.
