Google's solution for the deep web... but can you really copy all that data?
Greg Linden has a very good summary of Google's approach to handling deep web content here and here, mainly based on the Dec 2006 paper.
(This post is a reproduction of the comments I originally posted in response to Greg's posts.)
Overall, he suggests that Google is approaching the problem by caching all the deep web content, thereby putting an end to federated solutions. However, I strongly suspect that making a local copy of all the data via a "deep crawl" is not the right approach for the deep web.
1. The number of queries you need to pose for each form is huge, especially when the form has text fields, where the cardinality of possible input values is virtually infinite (see the sketch after this list).
2. How do you model queries to all these diverse forms? Going back to the author's point that the sources span a wide range of domains: how do you derive the set of queries used to probe each source?
3. The load each probing query places on the server is much higher than that of crawling a surface web page, since it generally requires a database query on the server side.
4. The data is quite dynamic, for example price information in many e-commerce domains such as flights and hotels. Given the load required for one "deep crawl" cycle, you can't afford many refresh cycles.
5. Setting the query enumeration, schema matching, and mapping challenges aside, web sources impose their own unique challenges. Very few forms are GET-based these days; an offline "deep crawl" would not be *easy* to deploy against forms involving POST and JavaScript.
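To make the combinatorics concrete, here is a minimal Python sketch of probing a single hypothetical GET-based form. The endpoint, field names, and value lists are invented for illustration and are not taken from Google's paper; the point is only how quickly probe queries multiply, why free-text fields can never be covered exhaustively, and why only GET forms reduce to crawlable URLs at all.

```python
# A minimal sketch (not Google's actual pipeline) of probing one hypothetical
# GET-based search form. All endpoint, field, and value names are made up.
from itertools import product
from urllib.parse import urlencode

FORM_ACTION = "http://example.com/search"   # hypothetical endpoint

# Fields with a fixed, enumerable set of values (e.g. drop-downs).
select_fields = {
    "category": ["flights", "hotels", "cars"],
    "city": ["chicago", "urbana", "san-francisco"],  # real forms list hundreds
    "sort": ["price", "rating"],
}

# A free-text field: its value space is effectively unbounded, so any probe
# dictionary is a tiny, biased sample of it (challenge 1 above).
text_field_probes = ["cheap", "weekend", "direct"]

probe_urls = []
for combo in product(*select_fields.values()):
    for keyword in text_field_probes:
        params = dict(zip(select_fields.keys(), combo))
        params["q"] = keyword
        # Only GET forms reduce to crawlable, cacheable URLs like this one;
        # POST and JavaScript-driven forms do not (challenge 5 above).
        probe_urls.append(FORM_ACTION + "?" + urlencode(params))

print(f"{len(probe_urls)} probe queries for ONE small form")

# Each probe triggers a database query on the server (challenge 3), so with a
# polite 2-second delay per request, one refresh cycle of this single form
# already takes len(probe_urls) * 2 seconds, while prices keep changing
# underneath it (challenge 4).
print(f"~{len(probe_urls) * 2 / 60:.0f} minutes per refresh cycle at 2 s/request")
```

Even this toy form needs dozens of probes per refresh; multiply that by millions of forms, each hitting a live database, and the load and staleness concerns in points 3 and 4 follow directly.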
Of course, there are many more challenges in handling deep web content. Above, I have pointed out the ones that are specific to the "deep crawl" approach.
I work at a startup named Cazoodle, and we are taking a very different approach to solving the challenges of the deep web. We will be presenting our system prototype at ICDE. Hope to catch up with you if you are flying in.