Quotes that I like

  • "It is not enough to have a good mind; the main thing is to use it well." - Rene Descartes
  • "Education is the great engine of personal development. It is through education that the daughter of a peasant can become a doctor, that the son of a mineworker can become the head of the mine, that a child of farmworkers can become the president of a great nation. It is what we make out of what we have, not what we are given, that separates one person from another." ~ Nelson Mandela
  • "It always seems impossible until it's done." ~ Nelson Mandela

Finding 'a word' using Regular Expressions

We frequently need to test whether a word (rather than a pattern) exists in another string. To illustrate, consider the following two strings:

String1: This fact is very important to understand.
String2: port

Now two interesting cases arise:
  • Find whether the pattern "port" appears in String1. In this case the regular expression would be: String regExp=".*"+ String2 +".*"; Clearly, we don't care what comes before or after the pattern. The match succeeds because the pattern "port" is present in String1 (the word "important" contains it).
  • Find whether the word "port" appears in String1. The regular expression in this case is: String regExp=".*\\b"+ String2 +"\\b.*"; The following Java code tests this:
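import java.util.regex.Matcher;
import java.util.regex.Pattern;

String String1 = "This fact is very important to understand.";
String String2 = "port";
String regExp = ".*\\b" + String2 + "\\b.*";
Pattern p = Pattern.compile(regExp);
Matcher m = p.matcher(String1);
if (m.matches())
{
    // Reached only when String2 occurs in String1 as a whole word
}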
The match fails here because "port" does not appear in String1 as a word; it appears only as part of the word "important". If String1="The port was far", the match succeeds because "port" appears as a word.

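The key role is played by \b, the word-boundary anchor of regular expressions. It matches the position between a word character and a non-word character (or the start or end of the string), so \bport\b matches only where "port" stands alone as a word.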
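
As a minimal sketch of the two cases above, the following self-contained Java class contrasts substring matching with whole-word matching. The names WordMatch, containsSubstring and containsWord are illustrative and not part of the original snippet. The sketch also makes two assumptions of its own: it uses Matcher.find() instead of matches() with a leading and trailing ".*", and it wraps the search term in Pattern.quote so that any regex metacharacters in it are treated literally.

```java
import java.util.regex.Pattern;

class WordMatch {
    // True if 'needle' occurs anywhere in 'text', even inside a longer word.
    public static boolean containsSubstring(String text, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(text).find();
    }

    // True only if 'needle' occurs with a word boundary (\b) on both sides.
    public static boolean containsWord(String text, String needle) {
        String regExp = "\\b" + Pattern.quote(needle) + "\\b";
        return Pattern.compile(regExp).matcher(text).find();
    }

    public static void main(String[] args) {
        String string1 = "This fact is very important to understand.";
        String string2 = "port";
        System.out.println(containsSubstring(string1, string2));       // true
        System.out.println(containsWord(string1, string2));            // false
        System.out.println(containsWord("The port was far", string2)); // true
    }
}
```

Note that \b only behaves as expected when the search term starts and ends with a word character (a letter, digit or underscore); a term such as "C++" would need a different boundary test.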

Named Entity Recognition

Named Entity Recognition (NER) is also known as entity extraction and entity recognition. NER, a subtask of information extraction, is a process of finding mentions of specified things in the given text. In other words, it seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:
Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text, such as this one:
<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>

In this example, the annotations have been done using so-called ENAMEX tags that were developed for the Message Understanding Conference in the 1990s.
  • Performance: state-of-the-art NER systems for English produce near-human performance.
  • Tools: Wikipedia lists a number of open-source tools, such as MALLET.
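
To make the annotated format shown earlier more concrete, here is a small sketch that reads the entity types and their text back out of an ENAMEX-style string using a regular expression. It is not an NER system and it does not use MALLET or any other tool; the class name EnamexReader and the method entities are hypothetical, and the pattern assumes the flat, non-nested tags with a single TYPE attribute seen in the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnamexReader {
    // Matches <ENAMEX|NUMEX|TIMEX TYPE="...">text</same tag>; \1 refers back to the tag name.
    private static final Pattern TAG = Pattern.compile(
        "<(ENAMEX|NUMEX|TIMEX) TYPE=\"([A-Z]+)\">(.*?)</\\1>");

    // Returns one {type, text} pair per annotated entity, in document order.
    public static List<String[]> entities(String annotated) {
        List<String[]> result = new ArrayList<>();
        Matcher m = TAG.matcher(annotated);
        while (m.find()) {
            result.add(new String[] { m.group(2), m.group(3) });
        }
        return result;
    }

    public static void main(String[] args) {
        String annotated = "<ENAMEX TYPE=\"PERSON\">Jim</ENAMEX> bought "
            + "<NUMEX TYPE=\"QUANTITY\">300</NUMEX> shares of "
            + "<ENAMEX TYPE=\"ORGANIZATION\">Acme Corp.</ENAMEX> in "
            + "<TIMEX TYPE=\"DATE\">2006</TIMEX>";
        for (String[] e : entities(annotated)) {
            System.out.println(e[0] + ": " + e[1]);
        }
    }
}
```

For the sample sentence this prints the four pairs PERSON: Jim, QUANTITY: 300, ORGANIZATION: Acme Corp. and DATE: 2006, one per line.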

Digg: Social News Website

Digg is a social news website where people share links and stories. Each registered user can vote and comment on the shared items. The contents are ordered by users' votes: the more votes an item receives, the higher it ranks. Wikipedia mentions that social networking websites were inspired by Digg's sharing and voting features.