Why did I donate to Wikipedia?

Support Wikipedia
Very few things in this world are free. Wikipedia is one of them: you can increase your knowledge, or write articles to share with others, for FREE.

Most of the websites you visit carry advertisements, which is how they make money and survive. Have you ever noticed such ads on Wikipedia? Absolutely not! It's ad-free and totally free. Then how does the website survive? The answer is here:
To do this without resorting to advertising, we need you. It is you who keep this dream alive. It is you who have created Wikipedia. It is you who believe that a place of calm reflection and learning is worth having. - Wikipedia Founder Jimmy Wales

It is worth mentioning that Wikipedia is among the top 10 MOST VISITED WEBSITES in the world. On average, I visit it 5 times a day. You can even download all of Wikipedia to your computer if you wish. My friend did, and he said it was 28 GB as of Nov 2010.

I love it so much because I get a lot of information from it, and I want to protect it. So I donated $10 to keep it alive. Here is the email I got when I donated to the Wikimedia Foundation:

Dear NOBAL BIKRAM NIRAULA,

Thank you for your gift of USD 10.00 to the Wikimedia Foundation, received on December 22, 2010. I’m very grateful for your support.

Your donation celebrates everything Wikipedia and its sister sites stand for: the power of information to help people live better lives, and the importance of sharing, freedom, learning and discovery. Thank you so much for helping to keep these projects freely available for their more than 400 million monthly readers around the world.

Your money supports technology and people. The Wikimedia Foundation develops and improves the technology behind Wikipedia and nine other projects, and sustains the infrastructure that keeps them up and running. The Foundation has a staff of about fifty, which provides technical, administrative, legal and outreach support for the global community of volunteers who write and edit Wikipedia.

Many people love Wikipedia, but a surprising number don't know it's run by a non-profit. Please help us spread the word by telling a few of your friends.

And again, thank you for supporting free knowledge.

Sincerely Yours,
Sue Gardner
Executive Director

* To donate: http://donate.wikimedia.org/
* To visit our Blog: http://blog.wikimedia.org/
* To follow us on Twitter: http://twitter.com/wikimedia
* To follow us on Facebook: http://www.facebook.com/wikipedia

Interesting Finding in Temporal Analysis of Text

I found the following graph very interesting. It shows the usage pattern of the two English word forms "however" and "However" from 1800 to 2000. Note that "however" (small h) is used in the middle of a sentence, whereas "However" (capital H) is used at the beginning of one.

Source: datamining.typepad.com
How do we interpret this? The most obvious interpretation might be that 'however' at the beginning of a sentence is becoming more frequent. We could also conclude that 'however' in general is becoming more frequent (imagine if we could combine the lines). Alternatively, it could mean that sentence length in the corpus is shifting. Given that we don't know the exact cultural mix of the 'British English' corpus, it could be somehow related to the mixture of American and British content. Finally, it could be due to the mix of fiction and non-fiction. Interestingly, the 'American English' corpus has quite a different signal.
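As a toy illustration of how such case-sensitive counts might be produced (the sample sentence and regex word-matching below are my own simplifications, not the corpus pipeline behind the graph):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HoweverCounter {
    // Count exact (case-sensitive) whole-word occurrences of a word.
    public static int count(String text, String word) {
        Matcher m = Pattern.compile("\\b" + word + "\\b").matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String text = "However, times change. He was tired; however, he kept going.";
        System.out.println("However: " + count(text, "However")
                + ", however: " + count(text, "however"));
    }
}
```

Since Java regexes are case-sensitive by default, the two forms are counted separately, which is exactly the sentence-initial vs mid-sentence distinction the graph relies on.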

Support Vector Machine (SVM) - A Practical Guide

Support Vector Machine (SVM) is a very popular classification method. The following is a useful document if you are new to SVM. Note that a more recent version of the document behind the following presentation slides is here.

A Practical Guide to Support Vector Classification

Scanner Class in Java

When I started learning Java around 8 years ago, I had a problem. Because I already knew C and C++, it was very hard for me to read input from the console in Java: C and C++ need just one line to read input, whereas Java required a lot. Now it's much easier using the Scanner class:

Reading From Keyboard:
import java.util.Scanner;

public class ScannerDemo {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        // Read string input for username
        System.out.print("Username: ");
        String username = scanner.nextLine();

        // Read string input for password
        System.out.print("Password: ");
        String password = scanner.nextLine();

        // Read an integer input for another challenge
        System.out.print("What is 2 + 2: ");
        int result = scanner.nextInt();

        if (username.equals("admin") && password.equals("secret") && result == 4) {
            System.out.println("Welcome to Java Application");
        } else {
            System.out.println("Invalid username or password, access denied!");
        }
    }
}

Note: Code is taken from this URL.

Reading From File:
import java.util.Scanner;
import java.io.*;

class HelpFile {
    public static void main(String[] args) throws IOException {
        Scanner scanner = new Scanner(new File("test.txt"));
        while (scanner.hasNextLine()) {
            System.out.println(scanner.nextLine());
        }
    }
}


Reading From Socket:
// Assumes an already-connected Socket named "socket".
Scanner keyboard = new Scanner(System.in);
Scanner remote = new Scanner(socket.getInputStream());
PrintWriter out = new PrintWriter(socket.getOutputStream(), true);

// read a line from the keyboard
String line = keyboard.nextLine();
// send the line to the remote node
out.println(line);
// wait for a line from the remote node
line = remote.nextLine();

Weka and R Tutorial

The world is amazing because of the web. You can solve problems by looking up other people's approaches, and you can teach what you know by publishing articles, audio or video lectures. I've found a very interesting site, SentimentMining.net, that is very useful for people interested in data mining, sentiment mining and statistical analysis. I particularly liked the Weka video tutorials on the site. It also has tutorials for R, but I haven't explored those much.

Facebook is giving more control of personal data

Facebook is going to add a feature to download your personal data, such as pictures, videos, comments etc. It is very exciting, and personally I feel it is a good option to have. Let's watch the video on how it works:

Sharing Folders between Windows7 host and Ubuntu Guest in VirtualBox

I'm running Ubuntu as a guest on a Windows 7 host using VirtualBox. I needed to share a folder between the host and the guest. I searched for a solution and found a good link that explains the process. Click here to go to the link. Basically, these are the steps:
1. Share the Windows 7 folder (say C:/SharedFolder_Win7) using VirtualBox's GUI, giving it a share name (say SharedFolder)
2. Start or restart Ubuntu
3. Run the following commands:
3.1 sudo mkdir /mnt/SharedFolder
3.2 sudo mount.vboxsf SharedFolder /mnt/SharedFolder

Linked Data

Linked Data is a method to publish data on the Web and to interlink data between different data sources. Linked Data can be accessed using Semantic Web browsers, just as traditional Web documents are accessed using HTML browsers. However, instead of following document links between HTML pages, Semantic Web browsers enable surfers to navigate between different data sources by following RDF links. RDF links can also be followed by robots or Semantic Web search engines in order to crawl the Semantic Web. 

The DBpedia data set is interlinked with various other data sources. The diagram below gives an overview of some of these data sources:

Facebook will remain free !

I have been noticing many rumors on Facebook that its users will have to pay to use its services. Many Facebook groups have therefore been created to campaign for a free Facebook. Today I read an article (available here) that gives solid reasons why Facebook will not charge for its basic services. I liked the logic and facts in the article, and I'm going to share the points here:
  • Why do people worry Facebook might start charging soon? Probably because Facebook users feel like they're getting something valuable for free, and everybody knows there's no such thing as a free lunch.
  • How does Facebook survive? Facebook makes its money by bringing together as big an audience as possible and then selling that audience's attention to advertisers. It's a business that works.
  • What happens if Facebook is not free? If Facebook started charging users, its membership would start shrinking fast -- and so would its revenues. So while Facebook may charge you for certain bonus features, such as gifts for your friends, or credits to play games like Farmville, it will never charge for basic access to the site.
  • Facebook is not free but sponsored ! - The fact that you keep coming back to Facebook makes it easier for Facebook to sell more ads -- and make more money. Your lunch isn't free, it's sponsored.

Power of content sharing

The more you share the more data can be mined; the more adverts can be targeted; the more money can be made. That's why Facebook's nudging you towards sharing more, and it's why Google is now personalising search for everybody whether they want it or not.

Source:
Read more

Introduction to SWRL

This slide explains what SWRL is and why we need it. It also shows how to write a SWRL rule in Protégé. Though the Protégé 3.2 editor is shown in the slide, the steps are similar in Protégé 4.0 (View -> Ontology View -> Rules). Below is a line that explains a limitation of OWL and suggests using rules in an ontology:

In OWL it is not possible to establish that a person is the boss of a secretary, only that a person is a boss.

SWRL Tips:
  • A SWRL rule contains an antecedent part, which is referred to as the body, and a consequent part, which is referred to as the head
  • Both the body and head consist of positive conjunctions of atoms. SWRL does not support negated atoms or disjunction.  Thus, a SWRL rule may be read as meaning that if all the atoms in the antecedent are true, then the consequent must also be true.
  • How do we write a rule like Prop1(?x,?y) V Prop2(?y,?x) -> Prop3(?y,?x)? Answer: since disjunction is not allowed in SWRL, we break it into two sub-rules. R1: Prop1(?x,?y) -> Prop3(?y,?x) and R2: Prop2(?y,?x) -> Prop3(?y,?x)
SWRL Tutorial 01

Operations on Ontologies

Source: Ontologies and Semantic Web

Operations on ontologies include merging, mapping, alignment, refinement, unification, integration and inheritance. Not all of these operations can be performed on all ontologies. These are very difficult tasks that in general cannot be solved automatically -- for example because of undecidability when very expressive logical languages are used, or because an ontology is specified too loosely to find similarities with another ontology. For these reasons the tasks are usually done manually or semi-automatically, where a machine helps find possible relations between elements of different ontologies but the final confirmation of each relation is left to a human, who then decides based on the natural-language descriptions of the ontology elements, or just on their natural-language names and common sense.

Ontology Reasoning

Why do we need reasoning in ontology?
Reasoning in ontologies and knowledge bases is one of the reasons why a specification needs to be a formal one. By reasoning we mean deriving facts that are not expressed explicitly in the ontology or knowledge base. Reasoners are the tools used to perform this reasoning.

Tasks of Ontology Reasoners
A few examples of tasks required from reasoner are as follows.
  • Satisfiability of a concept - determine whether a description of the concept is not contradictory, i.e., whether an individual can exist that would be instance of the concept.
  • Subsumption of concepts - determine whether concept C subsumes concept D, i.e., whether description of C is more general than the description of D.
  • Consistency of ABox with respect to TBox - determine whether individuals in ABox do not violate descriptions and axioms described by TBox.
  • Check an individual - check whether the individual is an instance of a concept
  • Retrieval of individuals - find all individuals that are instances of a concept
  • Realization of an individual - find all concepts which the individual belongs to, especially the most specific ones

OWL Reasoners
A reasoner is a key component for working with OWL ontologies. In fact, virtually all querying of an OWL ontology (and its imports closure) should be done using a reasoner. This is because knowledge in an ontology might not be explicit, and a reasoner is required to deduce implicit knowledge so that the correct query results are obtained. The OWL API includes various interfaces for accessing OWL reasoners. In order to access a reasoner via the API, a reasoner implementation is needed. The following reasoners (in alphabetical order) provide implementations of the OWL API OWLReasoner interface:
  • FaCT++.
  • HermiT
  • Pellet
  • RacerPro

OWL API hasKey problem

OWL doesn't allow datatype properties to be inverse functional. You can think of an inverse functional property as a unique key in a database. OWL 2 introduces the concept of hasKey.

A few days ago I tried to parse an OWL file containing hasKey using Jena. Unfortunately, I found that Jena doesn't have a parser for OWL 2 yet. Next, I found that the OWL API supports OWL 2. Today I spent my whole day trying to use the hasKey feature of the OWL 2 specification. I tried to parse the OWL file using this parser. It parses the file, but when I print the hasKey axioms afterwards, the output contains only genid* identifiers, i.e. I don't get the names of the properties used in hasKey. To me it seems like a bug in the parser ... I couldn't manage to get the keys :(. If any of you can tell me how to get the properties specified in hasKey, I will give you a BIG thank you.

The relevant portion of the OWL file is given below:
<owl:Class rdf:about="#Conference">
  <rdfs:subClassOf rdf:resource="&owl;Thing"/>
  <owl:hasKey rdf:parseType="Collection">
    <owl:DatatypeProperty rdf:about="#confName"/>
    <owl:DatatypeProperty rdf:about="#confYear"/>
    <owl:DatatypeProperty rdf:about="#confType"/>
  </owl:hasKey>
</owl:Class>

Code to print the hasKey axioms:
private void printHasKeyAxioms(OWLOntology ontology, OWLClass cls) {
    Set<OWLHasKeyAxiom> keySet = ontology.getHasKeyAxioms(cls);
    System.out.println("\t Total hasKey: " + keySet.size());
    if (keySet.size() > 0) {
        Iterator<OWLHasKeyAxiom> keyIter = keySet.iterator();
        while (keyIter.hasNext()) {
            OWLHasKeyAxiom key = keyIter.next();
            Set<OWLPropertyExpression> exp = key.getPropertyExpressions();
            for (OWLPropertyExpression p : exp) {
                System.out.println("\t - " + p);
            }
        }
    }
}

Output (for cls=Conference): (Some info is correct: the Conference class has a key which has three properties)
Total hasKey: 1
- <http://leo.inria.fr/publication.owl#genid7>
- <http://leo.inria.fr/publication.owl#genid9>
- <http://leo.inria.fr/publication.owl#genid11>


However, I expected the property names instead of genid* in the output ...

Visitor Design Pattern

Visitor Pattern is a type of behavioral design pattern. Wikipedia says: the visitor design pattern is a way of separating an algorithm from an object structure it operates on. A practical result of this separation is the ability to add new operations to existing object structures without modifying those structures.

Example:
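A minimal sketch of the pattern in Java (the shape and visitor names are my own illustration):

```java
// Visitor pattern: the operation (area) lives in the visitor,
// separate from the object structure (the shapes).
interface Shape {
    double accept(ShapeVisitor v);  // double-dispatch entry point
}

interface ShapeVisitor {
    double visit(Circle c);
    double visit(Square s);
}

class Circle implements Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    public double accept(ShapeVisitor v) { return v.visit(this); }
}

class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    public double accept(ShapeVisitor v) { return v.visit(this); }
}

// A new operation added without modifying Circle or Square.
class AreaVisitor implements ShapeVisitor {
    public double visit(Circle c) { return Math.PI * c.radius * c.radius; }
    public double visit(Square s) { return s.side * s.side; }
}

public class VisitorDemo {
    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        ShapeVisitor area = new AreaVisitor();
        for (Shape s : shapes) {
            System.out.println(s.getClass().getSimpleName() + " area = " + s.accept(area));
        }
    }
}
```

To add another operation (say, perimeter), you would write one new visitor class; the shape classes stay untouched, which is the whole point of the pattern.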

Few useful Unix commands

Replace text in a file using a Perl one-liner:
perl -pi -e 's/nobal/nobal niraula/g' a.xml 
This command replaces "nobal" with "nobal niraula" in a.xml. We can do this for multiple files too, e.g. just give *.xml as the argument if you want to replace in multiple XML files.

Find line(s) in a file containing a given text
grep "nobal" file.txt

Country Ranking by Internet Speed

RDF Revisited

Resource
The Resource Description Framework (RDF) is a standard (technically a W3C Recommendation) for describing resources.


Statements
Each arc in an RDF Model is called a statement. Each statement asserts a fact about a resource. A statement has three parts:
  • the subject is the resource from which the arc leaves 
  • the predicate is the property that labels the arc 
  • the object is the resource or literal pointed to by the arc 
A statement is sometimes called a triple, because of its three parts.
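A triple can be sketched as a tiny data structure (the subject, predicate and literal below are invented for illustration):

```java
public class TripleDemo {
    // A toy representation of an RDF statement (triple),
    // printed roughly in N-Triples style.
    record Triple(String subject, String predicate, String object) {
        @Override public String toString() {
            return "<" + subject + "> <" + predicate + "> \"" + object + "\" .";
        }
    }

    public static void main(String[] args) {
        Triple t = new Triple(
            "http://example.org/people#john",   // subject: the resource
            "http://example.org/terms#name",    // predicate: the property
            "John Smith");                      // object: a literal
        System.out.println(t);
    }
}
```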

RDF Syntax
  • RDF/XML
  • N-triple
  • N3
  • Turtle
  • JSON
  • TRiX

How a cell phone call works

Via: Cell Phones

h-index: An evaluation tool for scientific productivity

Wikipedia defines the h-index as an index that attempts to measure both the scientific productivity and the apparent scientific impact of a scientist. Watch this video to see how the h-index is calculated:
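The calculation itself is simple; here is a sketch in Java (the citation counts are made up):

```java
import java.util.Arrays;

public class HIndex {
    // h-index: the largest h such that the author has h papers
    // with at least h citations each.
    public static int hIndex(int[] citations) {
        int[] c = citations.clone();
        Arrays.sort(c);  // ascending
        for (int i = 0; i < c.length; i++) {
            int papers = c.length - i;  // papers with >= c[i] citations
            if (c[i] >= papers) return papers;
        }
        return 0;
    }

    public static void main(String[] args) {
        int[] citations = { 10, 8, 5, 4, 3 };  // hypothetical paper citations
        System.out.println("h-index = " + hIndex(citations));  // h-index = 4
    }
}
```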


I found a document that presents a partial list of computer science researchers whose h-indices are greater than 40 (click here). So they must be crazy guys, don't you think ;)? Interestingly, a researcher in our team (LEO Team, INRIA), Serge Abiteboul, is one of them. His current h-index is 49! Really interesting to know. I would like to become crazy too :)

False Positive vs False Negative

The terms false positive and false negative (along with true positive and true negative) come to us from the world of diagnostic tests. An anti-spam product is like a pregnancy test - it eventually comes down to yes or no. 

A false positive means the test said a message was spam when in reality it wasn't.
A false negative means the test said a message was not spam when in reality it was.

We often think in terms of error rates, but with many diagnostic tests the kind of error is a big deal. It's not enough to know that the test is wrong 29% of the time. We want to know what kind of wrong. Spam tests are exactly like that. A false positive means that good mail might have gotten lost, while a false negative is just annoying. We care more about false positives than we do about false negatives (unless the CEO is getting inundated with false negatives). In addition to wanting to know how many errors there are, we also want to know what type they are.
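The two error rates fall out of simple counts; here is a sketch in Java with hypothetical numbers:

```java
public class SpamErrors {
    // False positive rate: fraction of good mail that got flagged as spam.
    public static double fpRate(int falsePositive, int trueNegative) {
        return (double) falsePositive / (falsePositive + trueNegative);
    }

    // False negative rate: fraction of spam that slipped through.
    public static double fnRate(int falseNegative, int truePositive) {
        return (double) falseNegative / (falseNegative + truePositive);
    }

    public static void main(String[] args) {
        // Hypothetical counts from testing a spam filter on 1000 messages.
        int tp = 180, fp = 20, tn = 750, fn = 50;
        System.out.printf("FP rate = %.3f, FN rate = %.3f%n",
                fpRate(fp, tn), fnRate(fn, tp));
    }
}
```

Note the two rates use different denominators (all good mail vs all spam), which is why a single overall error percentage hides the distinction that matters.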

Source

Twitter Language

tweets: messages on Twitter (max 140 characters)
twitter alphabet soup: The Twitter characters with special meaning are @, d, RT and #:
@: Talk publicly to another person
d: Talk privately to another person
RT: Repeat another person's tweet
#: Tag a message with a label
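These conventions can be sketched as a toy classifier (the rules below are simplified; real Twitter syntax has more cases):

```java
public class TweetKind {
    // Classify a tweet by its leading token, per the conventions above.
    public static String classify(String tweet) {
        if (tweet.startsWith("@")) return "public reply";
        if (tweet.startsWith("d ")) return "direct (private) message";
        if (tweet.startsWith("RT ")) return "retweet";
        return "ordinary update";
    }

    public static void main(String[] args) {
        System.out.println(classify("@alice bonjour!"));
        System.out.println(classify("d bob meet at 5"));
        System.out.println(classify("RT @alice bonjour!"));
        System.out.println(classify("Reading about #SVM today"));
    }
}
```

A # tag, unlike the other three, can appear anywhere in the message, so it labels the tweet rather than changing its kind.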

Technical Blog kicks off

Now I've launched my technical blog. I have two more blogs: Phulbari (Nepali) and Angrejee (English & French). My original intention was to write all the English material in Angrejee. However, I found that difficult... Thus, I launched this new one purely for technical topics. I'll use Angrejee for non-tech topics.

The Sixth Sense Technology

Social Media Traffic Changes

Here are some graphs that show changes in social media traffic. All pictures are taken from mashable.com.



Well-Educated- My definition

Well-Educated: "Some see just water in a river, others see electricity; some see nothing in air, others see power; some see just pollution in waste, others see energy; some see frustration in failures, others see vehicles to success; some see Facebook, Twitter, YouTube, email and chat in the Internet, others see the possibilities and the future. If you belong to the 'others', you are well-educated." :)

Introducing Google Translate for Animals

Have FUN Guys :)

Google search tips

Searching is almost essential to getting a job done. You can use Google, Yahoo!, Bing and other search engines to find things on the web. Personally, I use Google more often than any other.

The faster you can find things, the more productive you become. To find things quickly, we need to know some search tips. Here are some URLs that cover tips for searching the web with Google.

Actually, I hadn't been using many of these tips until today... I'll try to use them from now on. Hope I'll be more productive :)!

Tips for using Google Search:
  1. Google Operators
  2. 6 Ways to reduce Irrelevant results in Google 

Better Search using Solr and Lucene

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Apache Tomcat.
Apache Solr

Lucene revisited

Lucene is an open-source full-text search library which makes it easy to add search functionality to an application or website. Want to understand Lucene in 5 minutes? Go here. The following slide provides a quick review of Lucene.
Figure: Steps in building applications  using Lucene [Source: IBM ]

Why Lucene? From this DOC.
  • Incremental versus batch indexing
  • Data sources
  • Indexing Control
  • File Format
  • Content Tagging
  • Stop Word Processing
  • Stemming
  • Query Features
  • Concurrency
  • Non-English Support
Go through this document, which presents the fundamental concepts of Lucene, e.g. Index, Document, Field, Term, Segment and Query Term. I recommend it for beginners.

Searching and Indexing 
Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index. This is the equivalent of retrieving pages in a book related to a keyword by looking in the index at the back of the book, as opposed to scanning the words on every page.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
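The page->words to word->pages inversion can be sketched in a few lines of Java (a concept demo with made-up "pages", not Lucene's actual implementation):

```java
import java.util.*;

public class InvertedIndexDemo {
    // Build word -> sorted set of page numbers containing that word.
    public static Map<String, SortedSet<Integer>> buildIndex(String[] pages) {
        Map<String, SortedSet<Integer>> index = new HashMap<>();
        for (int p = 0; p < pages.length; p++) {
            for (String word : pages[p].split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(p);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] pages = {
            "lucene is a search library",    // page 0
            "nutch builds a search engine",  // page 1
            "lucene powers solr"             // page 2
        };
        Map<String, SortedSet<Integer>> index = buildIndex(pages);
        // A query is now a map lookup instead of a scan over every page.
        System.out.println("lucene -> " + index.get("lucene"));  // [0, 2]
        System.out.println("search -> " + index.get("search"));  // [0, 1]
    }
}
```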
Lucene's Drawback and Nutch
Lucene provides a powerful indexing and search library which may be used as a base for online search engines. However, on its own the library doesn't include any form of web crawling or HTML-parsing ability, and these features are necessary to create a fully functional online search engine. Several projects have extended Lucene with the intent of adding this missing functionality. One of the most notable of these efforts is Nutch, a SourceForge.net project.
More Resources:
  1. Lucene QUERY SYNTAX
  2. Lucene QUERY SYNTAX I
  3. Luke - Lucene INDEX TOOLBOX
  4. Lucene BASICS
  5. Lucene ISSUES
  6. Lucene 3.0 API Documentation
  7. Advance Lucene
  8. BEST TUTORIAL@ IBM

HTML5 - The Future of the Web

This post provides a quick introduction to HTML5 - the future of the web.

1. VIDEO
2. SLIDES:


CAPTCHA Ads

What is a CAPTCHA?
As the name implies, CAPTCHAs were created as a way for websites to differentiate a real human visitor from a bot. CAPTCHA forms allow webmasters to display an image containing a random string of letters and numbers. Visitors to websites utilizing CAPTCHAs are then prompted to correctly enter the text displayed in the image in order to proceed with certain actions, such as registering a new account or leaving a comment on a blog or forum.

This is done to prevent bots from mass-registering accounts, automatically posting spammy comments, and sending spam messages to a large number of registered users, among other things. CAPTCHAs have advanced over time to become less vulnerable to bots and scripts attempting to solve the codes, while striving to remain user-friendly.

How CAPTCHA Advertising Works
The evolution of CAPTCHAs has inevitably led to a form of advertising. In essence, the core concepts and purpose of CAPTCHAs will remain unchanged. Don’t worry, you’ll still be shown an image that displays a line of text which must be entered correctly in order to proceed. The difference, however, lies in the presentation of the CAPTCHA. Instead of seeing a distorted image that contains randomly generated characters, you will see an image containing text that has been carefully selected by an advertiser.

These advertisers, which will likely span a number of big name national and international corporations, will submit their ads to a company capable of displaying them. Webmasters looking to monetize their CAPTCHA forms will also sign up with this company, and will be given a script that places the customized CAPTCHA on select portions of their website. The advertiser will then pay the said company a set amount of cash every time the CAPTCHA is successfully filled out, and the webmasters will be paid a cut of what the advertiser is paying. The company, acting as an intermediary, then collects the change remaining after paying out the webmaster.

Source
Techi.com

Future computer may understand you

The day is coming... one day computers will understand our emotions (anger, sadness, happiness, surprise and frustration) and treat us accordingly. Read more here.

Google Goggles: A Visual Search Engine

Browser market Share

Pic: Browser Market Shares. Source: www.tomshardware.com

Living VS Dead person in Facebook

I'd written an article (in Nepali) about email and social-networking providers' policies regarding a dead person's account and its contents (available here). Facebook, a popular social networking site, has a policy of turning a person's page into a memorial page after death: the person's friends can request a memorial page, after which sensitive information like phone number, status updates etc. is hidden.

As Facebook is new, the number of living people on Facebook is higher than the number of dead people. But as it grows older, the trend will change. One article puts it this way: "Perhaps someday there will be more memorial pages than pages for living people". So there will be more dead people on Facebook than living ones, huh!

Interesting Findings: Twitter text analysis

1. Verbs are much more common in their gerund form in Twitter than in general text. “Going”, “getting” and “watching” all appear in the top 100 words or so.

2. “Watching”, “trying”, “listening”, “reading” and “eating” are all in the top 100 first words, revealing just how often people use Twitter to report on whatever they are experiencing (or consuming) at the time.

3. Evidence of greater informality than general English: “ok” is much more common, and so is “f***”.

Source
Oxford-Twitter Analysis

Regarding Twitter

All the content in this blog post is taken from this paper.

Twitter.com is an online social network used by millions of people around the world to stay connected to their friends, family members and coworkers through their computers and mobile phones. The interface allows users to post short messages (up to 140 characters) that can be read by any other Twitter user.

Users declare the people they are interested in following, in which case they get notified when that person has posted a new message. A user who is being followed by another user does not necessarily have to reciprocate by following them back, which makes the links of the Twitter social network directed.

Twitter users are able to post direct and indirect updates. Direct posts are used when a user aims her update to a specific person, whereas indirect updates are used when the update is meant for anyone that cares to read it.

Even though direct updates are used to communicate directly with a specific person, they are public and anyone can see them.

FRIEND : Here, a user’s friend is a person whom the user has directed at least two posts to.

Research Findings :
  • the number of posts initially increases as the number of followers increases but it eventually saturates.
  • the number of posts increases as the number of friends increases
  • the users who receive attention from many people will post more often than users who receive little attention.
  • in order to predict how active a Twitter user is, the number of friends is a more accurate signal than the number of his followers.
  • most users have a very small number of friends compared to the number of followees they declared.
  • the cost of declaring a new followee is very low compared to the cost of maintaining a friendship (i.e. exchanging directed messages with other users). Hence, the number of people a user actually communicates with eventually stops increasing, while the number of followees can continue to grow indefinitely.
  • users with more followers and friends will be more active at posting than those with a small number of followers and friends.
  • a link between any two people does not necessarily imply an interaction between them. in the case of Twitter, most of the links declared within Twitter were meaningless from an interaction point of view. Thus the need to find the hidden social network; the one that matters when trying to rely on word of mouth to spread an idea, a belief, or a trend.
Conclusion:
In conclusion, even when using a very weak definition of “friend” (i.e. anyone who a user has directed a post to at least twice) we find that Twitter users have a very small number of friends compared to the number of followers and followees they declare. This implies the existence of two different networks: a very dense one made up of followers and followees, and a sparser and simpler network of actual friends. The latter proves to be a more influential network in driving Twitter usage since users with many actual friends tend to post more updates than users with few actual friends. On the other hand, users with many followers or followees post updates more infrequently than those with few followers or followees.

Real-time web search

I loved this article because it gave me information about real-time web search, which is in its infancy. Real-time web search means searching real-time content. For example, if a great politician dies, people generate content exponentially, and providing relevant information in real time is not so easy. Here I'm listing some of the points I liked in the article.
  • Now a delay of minutes on a breaking news story is unacceptable
  • Real-time search starts by determining that something important is happening in, well, real time.
  • Real-time search today is in its infancy, but it's the next stage in the evolution of Internet search.
  • Real-time search should address how the explosion of instant content produced by news organizations, blogs, and social-media users can be organized so that results can be provided instantly
  • What is "real-time" content? It centers on the concept of microblogging, or instant publishing of content to the open Web from social-media services. But in practice, real-time search is still primarily Twitter search
  • There are two components to real-time information: the actual content of the status update or post, and the link being shared within that update.
  • Why do web search providers want to buy Twitter's 'firehose'? Why spend the money? It's simply too difficult to crawl Twitter the way traditional search engines crawl the Web. All three major search engines (Yahoo!, Google, Bing) have at this point inked deals to have Twitter push its content directly to them, saving those companies (and Twitter) time, energy, and money.
  • deadlines are dead in the real-time world.
  • So if search engines are to remain relevant themselves, they'll need to make sense of this content. And unless social-media networks are able to make their content discoverable, they won't turn into the types of content-discovery engines that their public-relations people like to imagine are already here.
  • Expect the importance of real-time search to only grow over the next several years. For example, Yahoo's search deal with Microsoft does not include real-time indexing and ranking efforts, as the company believes that it's too important to give away.
Interesting Links:
  • Oneriot.com - built on the premise that the link being shared within a status update is more relevant than the message itself.
  • Wowd.com - an example of a real-time web search engine

Text Normalizer - Dealing with accents

I had to sort French text in alphabetical order. It is not as simple as comparing English strings, because we must deal with French accents such as é and à.
If we don't do any preprocessing and use the plain string comparison functions, we get équipement after zebra. However, we want équipement between words starting with 'd' and 'f', i.e. we want équipement to sort as if it were equipement. To solve the problem, we must compare the strings after normalizing them and removing the diacritics:

// Requires: import java.text.Normalizer;
String normalizedStr1 = Normalizer.normalize(Text1, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");

String normalizedStr2 = Normalizer.normalize(Text2, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");

Now we compare normalizedStr1 and normalizedStr2 instead of Text1 and Text2.
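To see the effect end to end, here is a minimal self-contained sketch (the class and method names FrenchSort and stripAccents are my own, not from the original post) that sorts a small French word list by comparing the accent-stripped forms. Note that java.text.Collator with a French locale is the standard library's locale-aware alternative for this job.

```java
import java.text.Normalizer;
import java.util.Arrays;

public class FrenchSort {
    // Decompose accented characters (NFD), then strip the combining
    // marks so that "équipement" compares like "equipement".
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("[\u0300-\u036F]", "");
    }

    public static void main(String[] args) {
        String[] words = {"zebra", "équipement", "droit", "fenêtre"};
        // Compare the normalized forms, not the raw strings.
        Arrays.sort(words, (a, b) ->
                stripAccents(a).compareToIgnoreCase(stripAccents(b)));
        System.out.println(Arrays.toString(words));
        // → [droit, équipement, fenêtre, zebra]
    }
}
```

With a plain Arrays.sort(words), équipement would land after zebra, exactly as described above.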

The Great Walls

You might have noticed the word 'Walls' in the title. I was reading a technical article, China's Great Firewall spreads overseas, and noticed the phrase "Great Firewall" for the first time. So there are two great walls in the world: the Great Wall and the Great Firewall. Both lie in China :).

Facebook wants to be the main river?

A tributary is a stream or river that flows into a main stem (or parent) river. Facebook wants every site on the web to be a tributary, and it wants to be the main river, using the Open Graph API.
  • Basically, the Open Graph API is a way for Facebook to allow other companies, sites, services, etc. to interact with Facebook without having to create a dedicated Facebook Page.
  • With the Open Graph API, Facebook wants to allow anyone to take their own site and essentially wrap it in a Facebook blanket. This doesn't necessarily mean in a visual way, but rather that sites using the API will be able to replicate much of Facebook's core functionality on their own pages.
  • So you can imagine creating a Facebook-style Wall to include on your site, where you can update your status, leave comments, like items, etc. Again, it's like a Facebook Page, but it would be on your site. And you can include only the elements you want and leave out the others.

Text 2.0 - Interesting

One of my friends has kept his messenger status as "Everything is 2.0" for the last few months. That's because he works on Web 2.0 and communication systems, and he thinks that everything is changing. Text was at 1.0, but now it is approaching 2.0. This fact supports him :). Watch the video below; it is a really interesting one!

Facebook beats Google

Here is an interesting piece of news copied from consumerist.com:
It's official -- playing Farmville and tagging friends in photos (and consequently untagging embarrassing photos of yourself from your friends' photos) has become more popular than actually trying to find things on the internet, as a new report shows Facebook edged out Google as the most-visited site on the internet last week.
According to Hitwise, Facebook accounted for 7.07% of all web traffic for the week ending March 13. That barely edges out Google's 7.03%.
This is huge news for Facebook, which only a year ago accounted for around 2% of U.S. web traffic.

XML Namespace: Attributes are a little different

An attribute can appear in a different namespace than the element that contains it. For example, <movie:title xml:lang="fr"> has an attribute that is not from the movie namespace. If an attribute name has a prefix, its name is in the namespace indicated by the prefix. However, if an attribute name has no prefix, it has no namespace. This is true even when the default namespace has been assigned. The W3C Namespaces in XML Recommendation makes that point with this example:
<x xmlns="http://www.w3.org" xmlns:n1="http://www.w3.org">
<good a="1" n1:a="2" />
</x>
Both elements are affected by the default namespace declaration: x and good are associated with the URI http://www.w3.org because it is the default namespace. The attribute n1:a is also associated with that namespace, through its n1 prefix. There is no conflict between the two attributes even though both are named a: n1:a is in the http://www.w3.org namespace, while the unprefixed a is in no namespace at all.
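This behavior can be verified programmatically. The sketch below (the class and method names are mine, for illustration) parses the W3C example with a namespace-aware DOM parser and reads back each attribute's namespace URI; the unprefixed attribute comes back with no namespace at all.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttrNamespaces {
    // Parse the XML and return the namespace URI of the named attribute
    // on the <good> element (null means "no namespace").
    static String attrNamespace(String xml, String attrName) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // without this, namespaces are ignored entirely
        Document doc = f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element good = (Element) doc.getElementsByTagName("good").item(0);
        Attr a = good.getAttributeNode(attrName);
        return a.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<x xmlns=\"http://www.w3.org\" xmlns:n1=\"http://www.w3.org\">"
                   + "<good a=\"1\" n1:a=\"2\"/></x>";
        System.out.println(attrNamespace(xml, "a"));    // null: no namespace
        System.out.println(attrNamespace(xml, "n1:a")); // http://www.w3.org
    }
}
```

Note that setNamespaceAware(true) is essential: the JAXP parsers ignore namespaces by default.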

Reference:
Copied from XML Namespace
Another interesting tutorial on XML namespaces: Here

Dependency Trees

A dependency tree is a graphical representation of a sentence parsed using a dependency grammar. The nodes in the tree correspond to words in the sentence being parsed (and sometimes to special synthesized nodes). The arcs correspond to dependency relations between a "head" word, at the upper end of an arc, and the dependent words at the lower ends of the arcs connected to the head word. The grammatical relations between head and dependent words are such things as subject, object, modifier, etc.
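As a data structure, a dependency tree is just a list of arcs. Here is a toy Java sketch (the class, record, and method names are illustrative, not from any parser library) for the sentence "Jim bought shares", where "bought" is the root:

```java
import java.util.ArrayList;
import java.util.List;

public class DependencyTree {
    // One arc per word: the word, the index of its head word
    // (-1 for the root), and the grammatical relation to that head.
    record Arc(String word, int head, String relation) {}

    final List<Arc> arcs = new ArrayList<>();

    void add(String word, int head, String relation) {
        arcs.add(new Arc(word, head, relation));
    }

    // The dependents of a word: all words whose head is the given index.
    List<String> dependentsOf(int headIndex) {
        List<String> out = new ArrayList<>();
        for (Arc a : arcs)
            if (a.head() == headIndex) out.add(a.word());
        return out;
    }

    public static void main(String[] args) {
        DependencyTree t = new DependencyTree();
        t.add("Jim", 1, "subject");    // index 0, head is "bought"
        t.add("bought", -1, "root");   // index 1, the root
        t.add("shares", 1, "object");  // index 2, head is "bought"
        System.out.println(t.dependentsOf(1)); // [Jim, shares]
    }
}
```

Because every word except the root has exactly one head, this flat arc list always encodes a tree.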

TPTP - A Java Profiling Tool

In software engineering, program profiling, software profiling or simply profiling, a form of dynamic program analysis (as opposed to static code analysis), is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

Profiling tools give software developers and testers the ability to analyze the performance of a Java program or to gain a comprehensive understanding of an application's overall performance. Eclipse Test and Performance Tools Platform (TPTP) is one such profiling tool. A good tutorial is here: Tutorial.

Ontology learning

Ontology learning, also known as ontology extraction, ontology generation, or ontology acquisition, is a semi-automatic form of information extraction used to build an ontology from scratch (finding concepts and their relations), or to enrich or adapt an existing ontology.

C'est très intéressant (It's very interesting)

After a long time, I have written a post in French. In fact, I have some interesting news. Here it is:

Even today, there is no electricity and there are no roads in my town. Life is hard. For example, I can call my parents only once a month! So you can imagine the situation and the life there. As a result, the people of my town want to leave the village to find work and a better life.

The migration trend is normal, but it produces interesting results. For example, today I found a person I had seen many years before. I saw him in my town 15 years ago! Thank you web, thank you my blog posts, and thank you Google. Because of them he found me, and so I found my brother from my town. It's very interesting... isn't it? ;)

Federated Search

Federated search is the simultaneous search of multiple online databases or web resources and is an emerging feature of automated, web-based library and information retrieval systems. It is also often referred to as a portal or a federated search engine.

Quotes that I like

  • "It is not enough to have a good mind; the main thing is to use it well." - Rene Descartes
  • "Education is the great engine of personal development. It is through education that the daughter of a peasant can become a doctor, that the son of a mineworker can become the head of the mine, that a child of farmworkers can become the president of a great nation. It is what we make out of what we have, not what we are given, that separates one person from another" ~ Nelson Mandela.
  • "It always seems impossible until its done" ~Nelson Mandela.

Finding 'a word' using Regular Expression

We frequently need to test whether a word (rather than a pattern) exists in another string. To illustrate, consider the following two strings:

String1: This fact is very important to understand.
String2: port

Now two interesting cases arise:
  • Find whether the pattern "port" appears in String1: in this case the regular expression would be String regExp = ".*" + String2 + ".*"; here we don't care what comes before or after the pattern. The match succeeds because the pattern port is there in String1 (the word important contains it).
  • Find whether the word "port" appears in String1. The regular expression in this case is String regExp = ".*\\b" + String2 + "\\b.*"; The following Java code tests this:
// Requires: import java.util.regex.Pattern; import java.util.regex.Matcher;
String regExp = ".*\\b" + String2 + "\\b.*";
Pattern p = Pattern.compile(regExp);
Matcher m = p.matcher(String1);
if (m.matches()) {
    System.out.println("Found as a word");
}
It fails here because "port" does not appear as a word in String1, only as a pattern. If String1 = "The port was far", then the expression matches, because port appears as a word.

The key role is played by \b, the regular-expression word-boundary anchor, which is what lets us find a whole word in a string.
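As an aside, wrapping the word in ".*" and calling matches() works, but a simpler variant (a sketch of mine, not from the original post) calls find() instead, which searches anywhere in the input, and escapes the search word with Pattern.quote so any regex metacharacters in it are treated literally:

```java
import java.util.regex.Pattern;

public class WordFinder {
    // True if 'word' occurs as a whole word in 'text'.
    // Pattern.quote escapes regex metacharacters in the word, and
    // find() removes the need for the leading/trailing ".*".
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("This fact is very important.", "port")); // false
        System.out.println(containsWord("The port was far", "port"));             // true
    }
}
```

Quoting matters as soon as the search word can contain characters like '.' or '+', which would otherwise be interpreted as regex operators.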

Named Entity Recognition

Named Entity Recognition (NER) is also known as entity extraction and entity recognition. NER, a subtask of information extraction, is a process of finding mentions of specified things in the given text. In other words, it seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:
Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text, such as this one:
<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>

In this example, the annotations have been done using so-called ENAMEX tags that were developed for the Message Understanding Conference in the 1990s.
  • Performance: state-of-the-art NER systems for English produce near-human performance.
  • Tools: Wikipedia lists a number of open-source tools, such as MALLET.
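To make the annotated-output format concrete, here is a deliberately naive rule-based tagger (the class name, rules, and gazetteer are all illustrative; real NER systems such as MALLET's are statistical sequence models, not regex lists):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ToyNer {
    // Pattern -> replacement template ($0 is the matched text).
    static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("\\bJim\\b"),           // one-entry gazetteer
                  "<ENAMEX TYPE=\"PERSON\">$0</ENAMEX>");
        RULES.put(Pattern.compile("\\bAcme Corp\\."),
                  "<ENAMEX TYPE=\"ORGANIZATION\">$0</ENAMEX>");
        RULES.put(Pattern.compile("\\b(19|20)\\d{2}\\b"), // four-digit years
                  "<TIMEX TYPE=\"DATE\">$0</TIMEX>");
    }

    // Apply each rule in turn, wrapping matches in MUC-style tags.
    static String annotate(String text) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            text = rule.getKey().matcher(text).replaceAll(rule.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(annotate("Jim bought 300 shares of Acme Corp. in 2006."));
    }
}
```

This reproduces the PERSON, ORGANIZATION, and DATE annotations from the example above; it obviously cannot generalize beyond its hand-written rules, which is exactly the gap statistical NER fills.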

Digg: Social News Website

Digg is a social news website where people share links and stories. Each registered user can vote and comment on the shared items. The content is ordered based on user voting: the more people vote for an item, the higher its ranking will be. Wikipedia mentions that social networking websites were inspired by Digg's sharing and voting features.