Sunday, December 26, 2010

Kindle beyond tipping point


You know the Kindle has caught on when you find the priest in a Hindu temple in Chicago reciting shlok (prayers) from a text on his Kindle.

Poor resolution: I took it from a distance, in a no-photography zone, but couldn't resist capturing this "tipping point" moment.

Also, the blog has been inactive for years.. and I don't know when I will post again. Sorry!

Saturday, March 24, 2007

Google's solution for deep web.. But, can you really copy all that data?

Greg Linden has a very good summary of Google's approach to handling deep web content here and here, mainly based on the Dec 2006 paper.

(This post reproduces the comments I originally posted in response to Greg's posts.)

Overall, he is proposing that Google is approaching the problem by caching all the deep web content, thereby putting an end to the federated solutions to this problem. However, I strongly suspect that making a local copy of all the data using a "deep crawl" is not the right approach for the deep web.

1. The number of queries you need to pose for each form is huge, especially if the form has text fields. Further, there are scenarios where the cardinality of input values is virtually infinite.

2. How do you model all queries to all these diverse forms? Going back to the author's comment that there is a wide range of domains: how do you derive the set of queries to be used to probe each of the sources?

3. The load on the server for each probing query is much higher than for simple crawling of a surface web page. It would generally require a database query on the server side.

4. The data is quite dynamic, such as price information in many of the e-commerce domains (flights, hotels, etc.). Given the load required for one cycle of "deep crawl", you can't have too many refresh cycles.

5. Keeping all the query enumeration, schema matching and mapping challenges aside, web sources impose some unique challenges of their own. There are very few "GET"-based forms these days; such an offline "deep crawl" would not be *easy* to deploy on forms involving POST and JavaScript.
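To get a feel for point 1, here is a rough sketch of how the number of probing queries for a single form grows as the product of its field cardinalities. The form fields and their sizes below are made-up illustrative numbers, not measurements from any real site, and `count_probe_queries` is a hypothetical helper name.

```python
from math import prod  # available in Python 3.8+


def count_probe_queries(field_cardinalities):
    """Number of distinct submissions if every combination of values is tried."""
    return prod(field_cardinalities.values())


# Hypothetical flight-search form; all cardinalities are assumptions.
flight_form = {
    'origin': 500,        # assumed number of airports in a drop-down
    'destination': 500,
    'depart_date': 365,   # one year of departure dates
    'passengers': 6,
}

total = count_probe_queries(flight_form)
print(total)
```

Even without a free-text field, a single form like this would need hundreds of millions of probes for one full pass, and each probe is a database query on the remote server.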

Of course, there are many more challenges in handling deep web content. Above, I have pointed out the ones that are specific to the "deep crawl" approach.

I work at a startup named Cazoodle, and we are taking a very different approach to solving the challenges of the deep web. We will be presenting our system prototype at ICDE. Hope to catch up with you if you are flying in there.

Thursday, February 22, 2007

python string += string

I use the string += operation very commonly in all my Python programs, and sometimes store a few MB of data in memory before flushing it out to disk.

I just learnt that its implementation can make it a very slow operation. Basically, Python strings are immutable, so each += may create a new string object and copy the old contents into it, discarding the previous object. Imagine doing this a few hundred thousand times in each program.

Today, when a simple loop was taking an infinitely long time, I was forced to investigate, and sure enough someone had explained it in this thread on a Python forum.

But I cannot keep invoking file I/O for each append operation either. Even though file writes already have buffering implemented, I like to explicitly store data in memory for a few steps of string appends, and then flush it to disk. This is important if you want to monitor the progress of your program using these logs deterministically, such as every 1000 steps of the loop. I wrote this simple class that makes this task very easy.

class hugeFileWrite:
    """Buffer strings in memory and append them to a file every `step` calls."""

    def __init__(self, fname, step=100):
        self.sout = ''
        self.step = step
        self.fname = fname
        self.count = 0

        # Create (or truncate) the output file up front.
        with open(fname, 'w') as f:
            f.write('')

    def addString(self, smore):
        self.sout += smore
        self.count += 1
        # Flush once `step` strings have accumulated.
        if self.count >= self.step:
            self.flush()

    # Make sure you call flush() after your last addString
    def flush(self):
        with open(self.fname, 'a') as f:
            f.write(self.sout)

        self.sout = ''
        self.count = 0
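For the in-memory part, two commonly suggested alternatives to repeated += are collecting the pieces in a list and joining them once, or writing into an io.StringIO buffer. The following is a minimal sketch of both, not something from the forum thread; the function names are hypothetical.

```python
import io


def build_with_join(pieces):
    """Concatenate all pieces in a single pass with str.join."""
    return ''.join(pieces)


def build_with_stringio(pieces):
    """Accumulate pieces in an in-memory text buffer, then read it back."""
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()


# Both approaches produce the same string as a += loop would.
pieces = ['line %d\n' % i for i in range(1000)]
joined = build_with_join(pieces)
buffered = build_with_stringio(pieces)
```

Either result can then be written to disk in one call, so the number of file operations stays under your control just as with the class above.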

Wednesday, February 21, 2007

Google 101

Google folks are offering a class at UWashington - CSE 490h: Problem-solving on large-scale clusters: theory and applications.

What we study in distributed systems classes is quite basic, and a bit outdated compared to Google-like infrastructure. This class might be pretty useful for people developing large-scale cluster systems.

Also, see this talk on Google Cluster covering GFS, MapReduce, Sawzall and BigTable.

Thanks to Greg for all the pointers[1][2].

Friday, February 16, 2007

inspiring

Every once in a while, you come across acts that are inspiring. Speeches that boost your motivation.

The famous inches speech by Al Pacino playing the football coach in Any Given Sunday:


And this image. (The website where I found this image had a copyright notice, and warned me when I tried to copy it. I hope I don't find myself fighting lawsuits for posting it here.)

Thursday, February 15, 2007

Learning the tricks of the trade

Industry experience adds a lot of weight to any person's resume. Sure, coz you learn many tools that are used in industry. But I have started to believe that it's partly also because you have learned how the system works.

When new, you don't want to come out of your mould, but sooner or later you tend to give in, and start to pick up the tricks of the trade.

This is quite beautifully illustrated in this segment of The Devil Wears Prada on YouTube. I don't know how long this segment will stay up, given all the DRM issues. But until then, enjoy.

Tuesday, February 13, 2007

All the whining about the high salary of a CEO

I came across a very interesting debate on the high salary (1 M) of Paul Levy, the CEO of a 100 B hospital in Boston. In a rare move of its kind, Paul has justified his salary and invited comments. Like others, I am also impressed with his openness on such a political issue.

The major criticisms are about the hospital marketing itself as a non-profit while using its size as the justification for the 1 M salary of its CEO. Also, some people argue that a part of his salary would be better spent on acquiring additional resources for the hospital.

I am actually just stepping out of school, and have seen very little of how the economy and society operate. However, being an infant in the working class, impressed with the simplicity and democracy of a capitalistic economy, I find any opposition to his salary ridiculous, to say the least.

I completely endorse his salary. I think in a capitalist economy, salaries should be commensurate with the contribution of the individual. We need to pay as much as is required to attract the most suitable talent.

In fact, unlike some commenters, I don't even find anything wrong in athletes earning millions for just catching a ball. I don't think that just because being a doctor is socially more noble than playing baseball, it deserves more respect and pay. If hospitals help people live longer lives, entertainment helps people live happier lives. In the end, what good is a longer life if you weren't happy in those extra days?

As to someone trying to reframe the question by saying we could pay Paul 0.5 M and spend the other 0.5 M on hiring extra nurses: well, the thing is, Paul would perhaps move on to an organization that pays in proportion to his talent. Or maybe he won't strive so hard if not paid in proportion to his efforts. And therefore, this move might actually be counterproductive. It's not a simple linear system, you see.

blogger pains

Why does the switch to Google's platform imply so many inconveniences for the users of Blogger?

First, they asked me to do what-not-I-don't-remember. I was just clicking wherever it asked me to click. And then it told me I am now using the Google version of Blogger. Well, okay! Why couldn't you make it simpler and more transparent to the user?

As if that's not enough, every time I now come to Blogger, I have to log in twice. First, I log in using my previous Blogger identity. Then it detects that this previous Blogger user is now using a Google account, so it gives me another login screen.

Why can't you use the same login screen for both accounts and check against both databases in the backend, until all your Blogger users have moved to Google accounts? Which certainly would never happen coz of so many idle accounts. So until when do I have to keep logging in twice to use Blogger?

Another related bug (or is it a feature?) in Google Apps: whenever you want to use some service, such as Gmail, Orkut, Google Account, or Blogger, you are required to log in on the first access to each application. However, as soon as you log out from any of them, it automatically logs you out of all the applications. Well, why this asymmetry?

I will survive

Living in the Midwest (Urbana-Champaign to be precise) makes you very strong.. well, at least in fighting against the weather.

Today, we have a prediction of over 1 foot of snow. Over 5 inches of snow fell overnight. Here is the note from the Chancellor of UIUC:


Chancellor Richard Herman

All classes have been canceled for Tuesday (Feb. 13) at the University of Illinois at Urbana-Champaign. More than 5 inches of snow fell overnight, up to another foot of snow is possible by Tuesday night and a blizzard warning...


So what exactly is all this talk about global warming? On the one hand, we have a glacier-like situation in Urbana.. on the other hand, we have polar glaciers melting. Perhaps global warming just implies severe changes in climate.. not that it would be warmer in absolute terms.

Monday, February 12, 2007

Founders at work

I really liked this book by Jessica Livingston.

It's full of inspiring stories. I particularly like the first-hand comments that are generally lacking in many other similar books I have read.

I also liked the fact that it's not just a collection of super-successful-can't-be-replicated stories of Microsoft, Yahoo, Google, Apple. It covers a wider spectrum of companies, many of which happened in quite recent times, which makes it easier to relate to.

Thursday, February 08, 2007

the power of web

I really liked the following video that illustrates what the web is, and how we can use it in an influential way. As John Battelle also points out, equally intriguing is the fact that the creator isn't an engineer but an anthropology prof.

Tuesday, January 30, 2007

Now it's Microsoft's turn

A few days ago, I posted about all the criticism Yahoo is facing in Yahoo is here to stay.. or is it?

Now, this time, it's Microsoft's turn. Everyone seems to be disappointed with the giant's progress so far. See the articles on CNET and by Henry Blodget.

Echoing the sentiments I expressed earlier for Yahoo!, I feel it would be too bad for the search industry if Microsoft were to perish here. Those with more grey hair seem to have a similar opinion [1] [2].

Wednesday, January 24, 2007

A solution that always works for incompetent leaders

Don't know the real problem? Know the real problem, but not bold enough to attack it?

As a leader, what do you do in such a situation? Well, you cook up a lot of non-issues, solve the non-existent problems, and amuse yourself with the idea of having contributed towards a better tomorrow.

The case in point is the recent decision by IIT Bombay authorities to limit the usage of internet and intranet services. Please read the email message at the bottom of this post to get the context.

The authorities find that student attendance in classes is not as high.. Academic performance is going down.

At the same time, they observe that students are spending a lot of time on the internet.

You see a change, you see an effect. Bang! You label the change as the cause of the effect.

The internet is just a medium. It's nothing more than a technology shift. It isn't the first time that we are seeing a technology change. When newspapers came in, they ate into the time we used to spend on other things. When radio came, the same thing happened. Then television came. Today the internet is here.

As engineers, it is incumbent upon us to embrace the new technology with open arms. We are the creators of these technologies.. if we don't believe in change, why did we come into engineering to begin with?

Take any industry.. First we used to commute by road.. which limited the distances we could travel. Then came the railroads.. And air travel has made it possible to collaborate and connect with people spread all over the world.

With the addition of new technologies, we subsume the possibilities of yesterday.

Now that it is clear that we need to embrace the new technology, the next question is how to make the best use of the internet. This medium gives us great power. And together with it comes great responsibility.

What do you do when you are endowed with a great responsibility?

You rise to the occasion and deliver.

What is being done at IIT is: no.. we can't handle so much responsibility.. Let's reduce the responsibility to the level we know how to handle.

Is this the right message for your students? How do you expect us to continue to believe that you can impart in us the qualities that make a great leader?

I am not saying leaders should not do anything, and let things take their own course. No.. I agree that a technology shift demands an adaptation. And this adaptation would require several initiatives from the leaders of our institute.

What I am arguing here is that the approach being taken is a very short-sighted one. The right solution requires bold leadership, a visionary who can lead us in this time of change. Maybe someone with the qualities of JFK, or Rajiv Gandhi..

It is a pity that in a country with a population of 1 billion, we cannot find the right leaders for the nation's best institution. Helplessly, we have left it in the hands of a bunch of incompetent managers ready to compromise the future of the next generation of promising citizens with temporary fixes.


Now, I don't have the maturity to lay out a grand plan that can be enacted with a 100% success rate. Nonetheless, I know a few high-level things that need to be done. You need to teach people, and help them imagine the possibilities for creating useful tools using the new technology. You adapt your course structure to the new medium. You show them the vastness of knowledge, far larger than the Main Library, that they can now use to shape their future. You teach them how to broadcast themselves to the world, as it gets flattened and democratized with the widespread adoption of the internet.

On the other hand, it is pretty obvious that the approach being taken is not the right one. The reasoning goes: if there is a new technology, and we don't know how to make the best use of it, let's not allow our students to use it. By restricting its usage, we will be able to minimize its bad effects.

I wonder what happens when students graduate from the IITs and get unlimited access to the internet? Well, who cares? They are not our liability any more.

What did you say? We didn't teach our students how to handle great responsibility?

Give us a break. It was never a part of the IIT's curriculum.


From: General Secretary Academic Affairs < gsecaa@iitb.ac.in >
Date: Jan 19, 2007 6:32 AM
Subject: Minutes of the meeting on Network Usage by IIT Students
To: General Secretary Academic Affairs < gsecaa@iitb.ac.in>

Hi all,
Attached are the minutes of the meeting on Network Usage by IIT
Students. Please go through the mail. In case you have a suggestion
or concern, use the forum iitb.general to share it with students and
faculty, instead of personal mails.

For the unaware, iitb.general is a forum that is accessible to all
students, faculty and alumni and is used for discussions and exchange of
ideas. To access iitb.general , visit http://varta.iitb.ac.in and log in
with your netmon ID and password.


Minutes:
----------------------------------------------------------------
INDIAN INSTITUTE OF TECHNOLOGY, BOMBAY
OFFICE OF DEAN STUDENT AFFAIRS

19th January 2007

A meeting of the committee consisting of the following members were
held to discuss matters relating to computer and network usage by
students when the following members were present:

Prof. P. Gopalan, Dean, SA Convener
Prof. Anurag Mehra, Head, Computer Centre Member
Prof. Nand Kishore, Chairman, HWC Member
Prof. G.Sivakumar, Head, CSE Member
Prof. Raghav Varma, Warden, Hostel 10 Member
Prof. C. Amarnath, Ex-Dean, SA Member
Gen.Secretary, Hostel Affairs Member
Gen.Secretary, Academic Affairs Member
Mr. Swapnil S. Sachdev Invitee (MLC)
Mr. Parijat Garg Invitee

Item No 1:
Video streaming :
The committee decided that video streaming in inter and intra hostels
should be disallowed. For using this facility prior permission from
Head, Computer Centre should be obtained.

Item No.2:
Download limit:
The committee put a download limit of 3GB per month.

Item No.3:
Time based ban on internet access in hostels:
The committee had the requested to the students to give their inputs
in September 2006, but no inputs were received at that time, however
during the meeting the students proposed that internet access be
blocked between 2 am and 6 am. However, the committee decided that LAN
access only be available in hostels between 12.30p.m to 11 p.m.
Student dissent was noted. This policy would be effective from 26th
January 2007. It was agreed to review the policy after a period of two
months.

Item No.4:
Illegal Content :
The Head, Computer Centre brought to the notice of the students about
signing of an IT policy of the Institute by every student, which
says that they would use the IIT network to download any illegal
content on any computer connected to the Institute network for which
they are responsible. It was also discussed that the Institute
network is being used to disseminate "illegal materials" like films,
songs etc. by persons running ftp servers on their computers. It was
decided that it will be the task of the elected representatives
(councils) to report this to the Wardens for further action. It was
decided that a poster would be put up by 31st January 2007 after the
same has been vetted by Prof. Sivakumar and Head, CC. The poster
would explicitly spell out the consequences in case of violation
relating to the use of IIT network for the disseminating illegal
contents in any form.

(P.Gopalan )
Dean,SA

----------------------------------------------------------------

Monday, January 22, 2007

A big fraud.. or was I too lucky?

During some debugging, I was checking the access log on my web server.

Accidentally, I noticed the following (and many other similar) entries. Why would Google, Yahoo and Microsoft all get interested in refreshing their index of my photo album at exactly the same time? Perhaps some crawler out there is faking its identity.

I tried nslookup for these IPs, and the info seems to correspond to Microsoft, Yahoo and Google respectively. My network admin tried to convince me that I was lucky that all 3 search giants are simultaneously interested in me.

The likelihood of such an event is so small.. Assuming a refresh cycle of 30 days and a scheduling epoch of one hour, the probability that all 3 companies would have scheduled a particular page in the same hour is something like (30*24)^(-3), or about 3 out of a billion.

My probability estimation is missing some terms. You need another division by 20 Billion to account for "a particular page", and a multiplication by 30*24 for "the same hour".
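The hour-slot part of this estimate is easy to check in a few lines. This is only a sketch under the assumptions stated in the post (a 30-day refresh cycle, one-hour scheduling slots) plus one more assumption of mine: that each crawler picks its slot independently and uniformly at random. It does not attempt the page-count term.

```python
HOURS_PER_CYCLE = 30 * 24   # 720 one-hour slots in a 30-day refresh cycle
NUM_CRAWLERS = 3

# Probability that all three crawlers land in one specific, pre-chosen hour.
p_specific_hour = (1.0 / HOURS_PER_CYCLE) ** NUM_CRAWLERS

# Probability that they coincide in *some* hour: multiply by the number of
# candidate hours, which is the "multiplication by 30*24" correction.
p_any_hour = p_specific_hour * HOURS_PER_CYCLE

print(p_specific_hour)  # roughly 2.7e-09, i.e. about 3 in a billion
print(p_any_hour)       # roughly 1.9e-06, i.e. about 2 in a million
```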

I don't believe in so much coincidence. Either there is a big fraud going on.. or the big giants have some collaborative projects going on..

65.55.209.52 - - [22/Jan/2007:19:15:25 -0600] "GET /~gkabra2/publish/summer2005/navahoPass/slides/P1040659.html HTTP/1.0" 200 14002 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

74.6.86.107 - - [22/Jan/2007:19:29:34 -0600] "GET /~gkabra2/publish/summer2005/helen/slides/IMG_1050.html HTTP/1.0" 200 13570 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

66.249.66.171 - - [22/Jan/2007:19:31:27 -0600] "GET /~gkabra2/publish/summer2005/4thJuly/slides/P1050115.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1;+http://www.google.com/bot.html)"
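For anyone who wants to repeat this check, here is a rough sketch that parses the three entries above and measures how close together the visits were. The regular expression assumes the usual Apache "combined" log layout, and all the names are hypothetical. The last function is a forward-confirmed reverse DNS check, roughly what nslookup does in two steps; it needs network access, so it is defined but not called here.

```python
import re
import socket
from datetime import datetime

# Assumed layout: Apache "combined" access log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

SAMPLE_LINES = [
    '65.55.209.52 - - [22/Jan/2007:19:15:25 -0600] "GET /~gkabra2/publish/summer2005/navahoPass/slides/P1040659.html HTTP/1.0" 200 14002 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '74.6.86.107 - - [22/Jan/2007:19:29:34 -0600] "GET /~gkabra2/publish/summer2005/helen/slides/IMG_1050.html HTTP/1.0" 200 13570 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '66.249.66.171 - - [22/Jan/2007:19:31:27 -0600] "GET /~gkabra2/publish/summer2005/4thJuly/slides/P1050115.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1;+http://www.google.com/bot.html)"',
]


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def parse_time(stamp):
    """Convert a timestamp such as 22/Jan/2007:19:15:25 -0600 to a datetime."""
    return datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S %z')


def confirm_reverse_dns(ip):
    """Reverse-resolve ip, then check the hostname resolves back to it.

    Requires network access; returns (hostname, confirmed).
    """
    hostname = socket.gethostbyaddr(ip)[0]
    forward_ips = socket.gethostbyname_ex(hostname)[2]
    return hostname, ip in forward_ips


entries = [parse_line(line) for line in SAMPLE_LINES]
times = [parse_time(entry['time']) for entry in entries]
spread_seconds = (max(times) - min(times)).total_seconds()
print(spread_seconds)
```

A crawler that only fakes its user agent string would fail the reverse DNS step, since its IP would not resolve to a hostname owned by the search engine it claims to be.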

Saturday, January 20, 2007

Vertical search engine market

Search Engine Watch has an interesting article on the rise of the vertical search engine market. This is very good news for my company, Cazoodle, for we are building tools that will facilitate the development of verticals.

However, this article seems to confine the scope of verticals to services similar to Yahoo Search Builder and Google Custom Search Engine. These services allow you to narrow down your search to a pre-specified set of sites. Some of them also expose a few enhanced search functionalities like keywords to include/exclude, term weighting schemes, etc.

I would imagine verticals include many more search services.. ranging from people search to Google Finance.

It is interesting, though, that in spite of the narrow definition, verticals have been projected to be a 1B industry by 2009. I wonder what the total market size would be if you considered the broader definition of verticals.

On the other hand, this article seems to indicate that the job portal sites are losing market share to other niche, more sophisticated job portals based on social networking. However, giants like Monster and CareerBuilder seem to be taking countermeasures.. forming alliances to stretch their reach. I guess the problem is not in their design but in their reach to a larger audience. Yes, that's exactly what our company is building. So stay tuned!