Post overview

Blog address: https://blog.dnevnik.hr/newcatchget

google desktop



On Linux I use the locate command to find files by name, and grep to search the content of files within a directory (recursively). On Windows, I have only ever used filename-based search. I don't run the Windows search indexing service, nor do I have Google Desktop installed (they already know a lot about my web searches, so why give away more information about me?).

Anyway, the purpose of this post is to pen a couple of my thoughts on Google's recent complaint about Vista's search capabilities (or, more precisely, the lack of them as it pertains to Google's ability to replace the built-in functionality with its own).

Unstructured textual data is fundamentally searched using the technology outlined at

http://en.wikipedia.org/wiki/Tf-idf
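
As a rough illustration of the idea, here is a minimal sketch of tf-idf scoring in Python. It assumes whitespace tokenization, raw term frequency divided by document length, and idf = log(N / df); real search engines use many variants of these weights. The file names and contents are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    return text.lower().split()


def tf_idf_scores(docs, query):
    """Score each document in `docs` (name -> text) against `query`."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / len(tokens)      # how often the term occurs here
            idf = math.log(n_docs / df[term])    # how rare the term is overall
            score += tf * idf
        scores[name] = score
    return scores


# Hypothetical files standing in for documents on a disk.
docs = {
    "notes.txt": "meeting notes about the budget",
    "todo.txt": "buy milk and eggs",
    "report.txt": "budget report budget forecast",
}
scores = tf_idf_scores(docs, "budget")
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A document that mentions the query term more often relative to its length ranks higher, and a term that appears in every document contributes nothing because its idf is log(1) = 0.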

Tf-idf works pretty well when no one tries to influence the search results. When the algorithm is applied directly to the web, it fails miserably because webmasters and SEO experts tweak their pages to influence the basic algorithm. Then came this $100+ billion idea called PageRank from the Google guys, which figured out a way to beat this biased tweaking. That's why they are No. 1 in internet search today.

However, nobody is tweaking the files on your desktop so that theirs show up first, right? That isn't the case at home or at work. Essentially, any intranet information is mostly unbiased. In addition, PageRank, which is mainly based on links between pages, is mostly irrelevant since the files on disk are mostly non-HTML files such as PDF, Word, PowerPoint, spreadsheet and plain-text files. Of course, there may be HTML files too, but they are not the majority. Since PageRank is irrelevant for personal/intranet files, it doesn't matter whether one uses Google's desktop search or any other search based on the tf-idf algorithm mentioned earlier.

Google's argument is that people should have a choice. Some people are for Google's idea and some are against it. Some believe that this should be OS functionality. Even though I don't use Google Desktop, Vista or the Windows search indexing service, my thinking is that, since the PageRank algorithm offers no advantage here, it doesn't matter which one is used.

However, I have the following thought process, from a technical standpoint, based on my familiarity with materialized views. A materialized view (MV) is a database concept used to speed up queries, in much the same way a text index speeds up text search compared to a plain grep command, which has to scan the entire text of each file. Materialized views can be classified as fast-refreshable or full-refresh MVs. That is, depending on the complexity of the query being materialized, it is possible either to compute the query incrementally (since the last refresh) or to compute it completely. Luckily, a tf-idf index can be maintained incrementally.

MVs can also be refreshed immediately or in deferred mode. In immediate mode, as the underlying table is updated (and on commit), the MV gets recomputed. The equivalent for text indexing of a file system is that, as and when a file is created, updated or deleted, the text index is updated immediately. The benefit of immediate refresh is that there is no need for a periodic long-running process to update the index. The periodic process takes longer because it has to figure out which files have already been indexed and which ones have been modified, created or deleted since the last indexing run. With immediate refresh, this work is spread across each file operation.

The disadvantage of immediate refresh is that it slows down the individual operations, but the slowdown is mostly negligible. With dual-core CPUs these days, additional bookkeeping tasks like text indexing can run on the spare core. The same can be said for the long-running periodic process as well, but the fact remains that it has to scan the entire disk, much like a virus scanning service.
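
To make the cost of the periodic approach concrete, here is a minimal sketch of how an indexer sitting outside the file system might work out what changed since its last run. It assumes that comparing modification time and size is good enough to detect an edit; the function names snapshot and diff_snapshots are made up for this illustration and do not describe how any particular product works.

```python
import os


def snapshot(root):
    """Walk the whole tree under `root` and record (mtime_ns, size) per file."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_snapshots(previous, current):
    """Compare two snapshots; return sorted (created, modified, deleted) paths."""
    created = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return created, modified, deleted
```

Note that snapshot has to visit every file on every run, even when nothing has changed. That full walk is exactly the work that immediate refresh avoids.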

Now, the ability to detect a change to a file lies within the file system module, which is core OS functionality. Who can do this more efficiently than the one who wrote the file system in the first place? So, if Microsoft supports this type of incremental file system search indexing, and the quality of search results is no different (due to the lack of a PageRank advantage in a file repository), which one would you choose?
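
What file-system-driven indexing could look like can be sketched with a tiny in-memory index whose handlers are called on every create, update or delete. The class and handler names (IncrementalIndex, on_file_written, on_file_deleted) are hypothetical and are not any real Windows or Google Desktop API; the sketch only assumes that something, ideally the file system itself, calls the handlers when a file changes.

```python
from collections import Counter, defaultdict


class IncrementalIndex:
    """Toy inverted index kept up to date one file operation at a time."""

    def __init__(self):
        self.term_counts = {}              # path -> Counter of its terms
        self.postings = defaultdict(set)   # term -> set of paths containing it

    def on_file_written(self, path, text):
        # Handles both create and update: drop stale entries, then re-add.
        self.on_file_deleted(path)
        counts = Counter(text.lower().split())
        self.term_counts[path] = counts
        for term in counts:
            self.postings[term].add(path)

    def on_file_deleted(self, path):
        old = self.term_counts.pop(path, None)
        if old is None:
            return  # file was never indexed
        for term in old:
            self.postings[term].discard(path)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


def full_rebuild(files):
    """Reference point: index every file from scratch (files: path -> text)."""
    index = IncrementalIndex()
    for path, text in files.items():
        index.on_file_written(path, text)
    return index


# Simulated file operations, applied as they happen.
index = IncrementalIndex()
index.on_file_written("a.txt", "budget report")
index.on_file_written("b.txt", "shopping list")
index.on_file_written("a.txt", "travel plans")   # a.txt is overwritten
index.on_file_deleted("b.txt")

rebuilt = full_rebuild({"a.txt": "travel plans"})
print(index.search("travel"), index.postings == rebuilt.postings)
```

Each handler touches only the terms of the one file that changed, so the index ends up in the same state as a full rebuild without ever rescanning the disk.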



Post published on 26.12.2007 at 03:32.