Nobal Tech: April 2010

False Positive vs False Negative

The terms false positive and false negative (along with true positive and true negative) come to us from the world of diagnostic tests. An anti-spam product is like a pregnancy test - it eventually comes down to yes or no.

A false positive means that the test said the message was spam, when in reality it wasn't.
A false negative means that the test said a message was not spam, when in reality it was.

We often think in terms of error rates, but with many diagnostic tests the kind of error is a big deal. It's not enough to know that the test is wrong 29% of the time. We want to know what kind of wrong. Spam tests are exactly like that. A false positive means that good mail might have gotten lost, while a false negative is just annoying. We care more about false positives than we do about false negatives (unless the CEO is getting inundated with false negatives). In addition to wanting to know how many errors there are, we also want to know what type they are.
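To make the difference concrete, here is a minimal Python sketch that breaks a single error rate into its two kinds. The function names and the sample verdicts are made up for illustration; they do not come from the NetworkWorld article or from any real anti-spam product.

```python
def error_breakdown(predicted, actual):
    """Count the four outcomes. True means 'spam' in both lists."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for said_spam, is_spam in zip(predicted, actual):
        if said_spam and is_spam:
            counts["tp"] += 1   # spam, correctly caught
        elif said_spam and not is_spam:
            counts["fp"] += 1   # good mail flagged as spam
        elif not said_spam and is_spam:
            counts["fn"] += 1   # spam that slipped through
        else:
            counts["tn"] += 1   # good mail, correctly delivered
    return counts


def rates(counts):
    """Overall error rate plus the rate of each kind of error."""
    good = counts["fp"] + counts["tn"]   # messages that were really good
    spam = counts["tp"] + counts["fn"]   # messages that were really spam
    total = good + spam
    return {
        "error_rate": (counts["fp"] + counts["fn"]) / total if total else 0.0,
        "false_positive_rate": counts["fp"] / good if good else 0.0,
        "false_negative_rate": counts["fn"] / spam if spam else 0.0,
    }


# Hypothetical verdicts for eight messages (illustrative data only).
predicted = [True, True, False, False, True, False, False, False]
actual = [True, False, False, True, True, False, False, False]

counts = error_breakdown(predicted, actual)
print(counts)
print(rates(counts))
```

Two filters with the same overall error rate can look very different once the errors are split this way, which is why the false positive rate deserves its own number.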

Source

NetworkWorld.com

Twitter Language

Tweets: Messages on Twitter (maximum 140 characters)
Twitter alphabet soup: The Twitter characters with special meaning are @, d, RT and #:
@: Talk publicly to another person
d: Talk privately to another person
RT: Repeat another person's tweet
#: Tag a message with a label
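
The four markers above are simple enough to recognise with a few lines of Python. This is only a rough sketch: the function name `classify_tweet` and the sample messages are my own, and the matching rules are simplified assumptions, not Twitter's actual parsing.

```python
import re


def classify_tweet(text):
    """Guess what kind of message this is from the markers it uses."""
    text = text.strip()
    words = text.split()
    if len(words) >= 2 and words[0].lower() == "d":
        kind = "direct message"   # d: talk privately
    elif words and words[0] == "RT":
        kind = "retweet"          # RT: repeat another person's tweet
    elif text.startswith("@"):
        kind = "public reply"     # @: talk publicly to another person
    else:
        kind = "plain tweet"
    return {
        "kind": kind,
        "mentions": re.findall(r"@(\w+)", text),  # every @name in the text
        "tags": re.findall(r"#(\w+)", text),      # every #label in the text
        "fits": len(text) <= 140,                 # within the 140-character limit
    }


print(classify_tweet("RT @alice: Lucene is fast #search"))
print(classify_tweet("d bob see you at noon"))
```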

Technical Blog kicks off

I've now launched my technical blog. I have two more blogs: Phulbari (Nepali) and Angrejee (English & French). My original intention was to write all my English material in Angrejee. However, I found that difficult... So I launched this new blog purely for technical material. I'll use Angrejee for non-technical posts.

The Sixth Sense Technology

Social Media Traffic Changes

Here are some graphs that show the change in social media's traffic. All pictures are taken from mashable.com.

Well-Educated: My definition

Well-Educated: "Some see just water in a river, others see electricity; some see nothing in the air, others see power; some see just pollution in waste, others see energy; some see frustration in failures, others see vehicles to success; some see Facebook, Twitter, YouTube, Email and Chat on the Internet, others see the possibilities and the future. If you belong to the 'others', you are well-educated." :)

Introducing Google Translate for Animals

Have FUN Guys :)

Google search tips

Searching is almost compulsory for getting a job done. One can use Google, Yahoo!, Bing and other search engines to find things on the web. Personally, I use Google more often than any other.

The faster you can find things, the more productive you become. To find things quickly, we need to know a few search tips. Here I'm providing some URLs that discuss tips for searching the web using Google.

Actually, I haven't been using many of these tips so far... However, I'm now trying to use them. Hope I'll be more productive :) !

Tips for using Google Search:

Better Search using Solr and Lucene

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Apache Tomcat.
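
As a small illustration of the HTTP and JSON side, the sketch below only builds a query URL for Solr's `select` handler; it does not contact a server. The base URL, the field name `title` and the helper name `solr_select_url` are assumptions for the example, and the exact path depends on how your Solr instance is deployed.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a URL for Solr's select handler that asks for JSON output."""
    params = {
        "q": query,     # the search query
        "wt": "json",   # response writer: JSON instead of the default XML
        "rows": rows,   # number of results to return
        "hl": "true",   # turn on hit highlighting
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local installation; adjust the host, port and path to yours.
url = solr_select_url("http://localhost:8983/solr", "title:lucene")
print(url)
```

Opening such a URL in a browser against a running Solr instance is a quick way to see what the raw response looks like.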

A good tutorial for beginners: Better Search with Apache Lucene and Solr
Tutorial at Solr HomePage
Slides:

Apache Solr

Lucene revisited

Lucene is an open-source full-text search library that makes it easy to add search functionality to an application or website. Want to understand Lucene in 5 minutes? Go here. The following slide provides a quick review of Lucene.

Figure: Steps in building applications using Lucene [Source: IBM]

Lucene Introduction

Why Lucene? From this document:

Incremental versus batch indexing
Data sources
Indexing Control
File Format
Content Tagging
Stop Word Processing
Stemming
Query Features
Concurrency
Non-English Support

Go through this document, which presents the fundamental concepts of Lucene, e.g. Index, Document, Field, Term, Segment and Query Term. I recommend it for beginners.

Searching and Indexing

Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index. This is the equivalent of retrieving the pages in a book related to a keyword by looking in the index at the back of the book, as opposed to searching the words on each page of the book.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
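
The idea fits in a few lines of Python. This is a toy sketch of an inverted index, not how Lucene is actually implemented; the function names and the three sample "pages" are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Turn page -> words into word -> set of page numbers."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_no)
    return index


def search(index, *words):
    """Return the sorted page numbers that contain every given word."""
    postings = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


# Illustrative data: page number -> text on that page.
pages = {
    1: "lucene searches an index",
    2: "an index maps words to pages",
    3: "lucene is a library",
}

index = build_inverted_index(pages)
print(search(index, "lucene"))
print(search(index, "lucene", "index"))
```

A lookup now touches only the entries for the query words instead of scanning every page, which is where the speed comes from.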

Lucene's Drawback and Nutch

Lucene provides a powerful indexing and search library that may be used as a base for online search engines. On its own, however, the library doesn't include any form of web crawling or HTML parsing abilities. These features are necessary in order to create a fully functional online search engine. Several projects have modified Lucene with the intent of adding this missing functionality. One of the most notable of these efforts is Nutch, a SourceForge.net project.

More Resources:

HTML5 - The Future of the Web

This post provides a quick introduction to HTML5, the future of the web.

1. VIDEO

2. SLIDES:

CAPTCHA Ads

What is a CAPTCHA?

As the name implies (CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart"), CAPTCHAs were created as a way for websites to differentiate a real human visitor from a bot. CAPTCHA forms allow webmasters to display an image which contains a random string of letters and numbers. Visitors to websites utilizing CAPTCHAs are then prompted to correctly enter the text displayed in the image in order to proceed with certain actions such as registering a new account, or leaving a comment on a blog or forum.

This is done in order to prevent bots from mass registering accounts, automatically posting spammy comments, and sending spam messages to a large number of registered users, among other things. CAPTCHAs have advanced over time to become less vulnerable to bots and scripts attempting to solve the codes while striving to remain user-friendly.

How CAPTCHA Advertising Works

The evolution of CAPTCHAs has inevitably led to a form of advertising. In essence, the core concepts and purpose of CAPTCHAs will remain unchanged. Don’t worry, you’ll still be shown an image that displays a line of text which must be entered correctly in order to proceed. The difference, however, lies in the presentation of the CAPTCHA. Instead of seeing a distorted image that contains randomly generated characters, you will see an image containing text that has been carefully selected by an advertiser.

These advertisers, which will likely span a number of big name national and international corporations, will submit their ads to a company capable of displaying them. Webmasters looking to monetize their CAPTCHA forms will also sign up with this company, and will be given a script that places the customized CAPTCHA on select portions of their website. The advertiser will then pay the said company a set amount of cash every time the CAPTCHA is successfully filled out, and the webmasters will be paid a cut of what the advertiser is paying. The company, acting as an intermediary, then collects the change remaining after paying out the webmaster.

Source
Techi.com

Future computer may understand you

The day is coming... one day computers will understand our emotions (anger, sadness, happiness, surprise and frustration) and treat us accordingly. Read more here.

Google Goggles: A Visual Search Engine

Browser market Share

Pic: Browser Market Shares. Source: www.tomshardware.com

Living vs Dead People on Facebook

I wrote an article (in Nepali) about how email and social networking providers handle a deceased person's account and its contents (available here). Facebook, a popular social networking site, has a policy of turning a person's page into a memorial page after his or her death. When a person dies, friends can have the page memorialized, after which sensitive information such as phone numbers and status updates is hidden.

As Facebook is new, the number of living people on Facebook is higher than the number of dead people. But as it grows older, the trend will change. One article notes: "Perhaps someday there will be more memorial pages than pages for living people". So there will be more dead people than living people on Facebook, huh!

Interesting Findings: Twitter text analysis

1. Verbs are much more common in their gerund form in Twitter than in general text. “Going”, “getting” and “watching” all appear in the top 100 words or so.

2. “Watching”, “trying”, “listening”, “reading” and “eating” are all in the top 100 first words, revealing just how often people use Twitter to report on whatever they are experiencing (or consuming) at the time.

3. Evidence of greater informality than general English: “ok” is much more common, and so is “f***”.

Source
Oxford-Twitter Analysis

Regarding Twitter

All the content in this blog post is taken from this paper.

Twitter.com is an online social network used by millions of people around the world to stay connected to their friends, family members and coworkers through their computers and mobile phones. The interface allows users to post short messages (up to 140 characters) that can be read by any other Twitter user.

Users declare the people they are interested in following, in which case they get notified when that person has posted a new message. A user who is being followed by another user does not necessarily have to reciprocate by following them back, which makes the links of the Twitter social network directed.

Twitter users are able to post direct and indirect updates. Direct posts are used when a user aims her update at a specific person, whereas indirect updates are used when the update is meant for anyone who cares to read it.

Even though direct updates are used to communicate directly with a specific person, they are public and anyone can see them.

FRIEND : Here, a user’s friend is a person whom the user has directed at least two posts to.
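
This definition is easy to express in code. The sketch below is my own illustration, not the paper's method or data: the function name `friends_of`, the `(sender, recipient)` tuple format and the sample posts are all assumptions. A recipient of `None` stands for an indirect update.

```python
from collections import Counter


def friends_of(user, posts, min_posts=2):
    """Return the people `user` has directed at least `min_posts` posts to.

    `posts` is an iterable of (sender, recipient) pairs; recipient is None
    for an indirect update that is not aimed at anyone in particular.
    """
    directed = Counter(
        recipient
        for sender, recipient in posts
        if sender == user and recipient is not None
    )
    return {person for person, n in directed.items() if n >= min_posts}


# Illustrative data only.
posts = [
    ("ann", "bob"),
    ("ann", "bob"),
    ("ann", "cat"),   # only one directed post, so cat is not a friend
    ("ann", None),    # indirect update
    ("bob", "ann"),
]

print(friends_of("ann", posts))
print(friends_of("bob", posts))
```

Note that the relation is one-directional: in this sample bob is ann's friend, but ann is not yet bob's.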

Research Findings :

The number of posts initially increases as the number of followers increases, but it eventually saturates.
The number of posts increases as the number of friends increases.
Users who receive attention from many people post more often than users who receive little attention.
To predict how active a Twitter user is, the number of friends is a more accurate signal than the number of followers.
Most users have a very small number of friends compared to the number of followees they declare.
The cost of declaring a new followee is very low compared to the cost of maintaining a friend (i.e. exchanging directed messages with other users). Hence, the number of people a user actually communicates with eventually stops increasing, while the number of followees can continue to grow indefinitely.
Users with more followers and friends are more active at posting than those with a small number of followers and friends.
A link between any two people does not necessarily imply an interaction between them. In the case of Twitter, most of the declared links were meaningless from an interaction point of view. Hence the need to find the hidden social network, the one that matters when trying to rely on word of mouth to spread an idea, a belief, or a trend.

Conclusion:

In conclusion, even when using a very weak definition of “friend” (i.e. anyone who a user has directed a post to at least twice) we find that Twitter users have a very small number of friends compared to the number of followers and followees they declare. This implies the existence of two different networks: a very dense one made up of followers and followees, and a sparser and simpler network of actual friends. The latter proves to be a more influential network in driving Twitter usage since users with many actual friends tend to post more updates than users with few actual friends. On the other hand, users with many followers or followees post updates more infrequently than those with few followers or followees.

Real-time web search

I loved this article because it taught me about real-time web search, which is still in its infancy. Real-time web search means searching content as it is being produced. For example, if a great politician dies, the amount of content people generate about it grows exponentially. Providing relevant information in real time is not easy. Here I'm listing some of the points that I liked in the article.

Now a delay of minutes on a breaking news story is unacceptable.
Real-time search starts by determining that something important is happening in, well, real time.
Real-time search today is in its infancy, but it's the next stage in the evolution of Internet search.
Real-time search should address how the explosion of instant content produced by news organizations, blogs, and social-media users can be organized so that results can be provided instantly.
What is "real-time" content? It centers on the concept of microblogging, or instant publishing of content to the open Web from social-media services. But in practice, "real-time search" is still primarily Twitter search.
There are two components to real-time information: the actual content of the status update or post, and the link that is being shared within that update.
Why do web search providers want to buy Twitter's "Firehose"? Why spend the money? It's simply too difficult to crawl Twitter the way traditional search engines crawl the Web. All three major search engines (Yahoo, Google, Bing) at this point have inked deals to have Twitter push its content directly to them, saving those companies (and Twitter) time, energy, and money.
Deadlines are dead in the real-time world.
So if search engines are to remain relevant themselves, they'll need to make sense of this content. And unless social-media networks are able to make their content discoverable, they won't turn into the types of content-discovery engines that their public-relations people like to imagine are already here.
Expect the importance of real-time search to only grow over the next several years. For example, Yahoo's search deal with Microsoft does not include real-time indexing and ranking efforts, as the company believes that it's too important to give away.

Interesting Links:

Oneriot.com - Based on the premise that the link being shared within the status update is more relevant than the message itself.
Wowd.com - An example of a real-time web search engine
