Charles Knight referred me to Kaila Colbin's blog post about what the future (2010) search engine should look like. I was honored to be asked for my opinion, so I sat down and wrote my thoughts about what the future search engine should be. Granted, this is not exactly what I was asked to do, but I ended up with a four-page document, so I decided to post it on our blog. I hope you find it interesting.
The (Near) Future of Search Engines
We have a saying that “Prophecy was given to the fools” (loosely translated from Hebrew) that basically means that only fools try to foresee the future. Since I don’t consider myself a fool (at least not at this point in my life) I will start by saying that I have no idea how search engines will work but I have some ideas about how they should.
After the previous unnecessary paragraph I will dive right into the subject at hand. Basically all proprietary search engines are built out of four parts: data collection (crawlers), data storage (indexing), data extraction (searching) and data presentation.
Data Collection – the Crawler:
The crawler is the part that goes over the information and collects it. In my view, this is one of the most important components in a search engine. The crawler has control over:
- The amount of data being collected.
- The quality of the data.
Data Quantity:
The coverage of a search engine is a key element in its overall quality. We, as users, would like our search engine to be aware of every piece of information out there. Currently, Google has a massive number of crawlers indexing the web, but even they admit they stand no chance of covering what is really out there. I don’t believe we can calculate the percentage of the web that is actually covered: coverage is what we crawled divided by everything that exists, and while we know the volume of what we crawled, we don’t know what we missed, so we cannot solve even this simple formula.
The future of crawling should lie in diverting resources from hundreds of centralized crawlers to the millions of users surfing the web. Let the users’ computers act as crawlers. This will save bandwidth and be much more efficient at finding the “dark parts” of the web. I actually think Google is already using this method when we use some of its free software. They of course ask our permission to do so, as they should. The problem with this method lies with the indexer, and I will come back to that later.
Data Quality:
The information is out there; there is a whole ocean of it, so no worries there. Let’s say we were able to collect it and successfully store it. If we are unable to analyze it correctly, we will have problems retrieving it in a relevant way. Not all web pages were created equal: different pages have different roles. Inside the pages, different segments of text serve different purposes.
It is very important to rank web pages by their importance in context (i.e., PageRank), but it is just as important to distinguish between the different parts of the text inside these pages and to analyze their meaning.
The future search engine should include crawlers that “understand” the page structure and detect the role of each part.
I will supply an example:
A simple Wikipedia page has titles and paragraphs, and some words are marked in bold or underlined, so we can estimate the importance of the document's content from its structure. We can assume that titles are important and that bold and underlined words are significant. The frequency of a word in a paragraph, relative to the diversity of words in the text, plays another important role. These considerations (and more) could be applied to the many web pages that should be categorized as articles.
Now what about web pages from discussion-based sites? The structure and PageRank importance of these pages is completely different. The crawler should analyze the page as a discussion and not as an article, since a discussion has a title, a topic, a date and replies. The considerations above should be applied separately to each part of the discussion. Text inside the 22nd reply does not have the same importance as text in the 1st reply or in the topic. Many times, the discussion page also includes text that has nothing to do with the discussion at hand. The crawler should be able to distinguish between these document segments and textual attributes. The better it does the job, the more relevant the results we will get.
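To make the example a bit more concrete, here is a minimal Python sketch of role-based weighting. It is only an illustration under stated assumptions: the names `ROLE_WEIGHTS`, `score_terms`, `reply_weight` and `score_discussion` are hypothetical, and the weights and the decay formula are made-up values, not something any real search engine is known to use.

```python
from collections import Counter

# Hypothetical role weights; the values are illustrative assumptions, not tuned.
ROLE_WEIGHTS = {"title": 3.0, "bold": 2.0, "body": 1.0}


def score_terms(segments):
    """Score terms from (role, text) segments, weighting each term by its role."""
    scores = Counter()
    for role, text in segments:
        weight = ROLE_WEIGHTS.get(role, 1.0)
        for word in text.lower().split():
            scores[word] += weight
    return scores


def reply_weight(position):
    """Assumed decay: the topic is position 0, later replies count less."""
    return 1.0 / (1 + position)


def score_discussion(title, posts):
    """Score a discussion: the title at title weight, posts decayed by position."""
    scores = score_terms([("title", title)])
    for position, text in enumerate(posts):
        for word, value in score_terms([("body", text)]).items():
            scores[word] += value * reply_weight(position)
    return scores
```

With these assumed weights, a word in the title counts three times as much as the same word in the body, and a word in the 22nd post of a discussion (position 21) contributes 1/22 of what it would in the topic.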
User intervention
I am not a big fan of users affecting the quality of the results. A good result for your query doesn’t have to be a good one for mine. It might work for a very simple query with one or two words, but certainly, when the query becomes more complex, even a human wouldn’t necessarily understand what one is asking for. User intervention in rating and grading a document can cause damage and skew the results since it cannot be normalized into a known and predictable algorithm.
Data Categorization
The crawler should apply a categorization scheme to each document. This categorization will help the user guide the search engine toward better-focused results when the focus was in the wrong place. The categories should match today’s “vertical” search engines: for example blogs (technological, gossip, etc.), news (world, regional, science, etc.), discussions (reviews, opinions, Q&A, etc.) and so on.
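As a rough illustration of what such a categorization step could look like, here is a small Python sketch. It assumes a naive cue-word approach: `CATEGORY_CUES` and `categorize` are hypothetical names, and the cue words are invented for the example. A real crawler would more likely rely on page structure or a trained classifier.

```python
# Hypothetical cue words per category; a real crawler would likely learn these.
CATEGORY_CUES = {
    "blog": {"posted", "blog", "comments"},
    "news": {"reported", "reuters", "breaking"},
    "discussion": {"reply", "thread", "forum"},
}


def categorize(text):
    """Return the category whose cue words overlap the text most, or None."""
    words = set(text.lower().split())
    best, best_hits = None, 0
    for category, cues in CATEGORY_CUES.items():
        hits = len(words & cues)
        if hits > best_hits:
            best, best_hits = category, hits
    return best
```

The returned label could then be stored with the document, so the user can later narrow a search to, say, discussions only.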
Bottom line
The crawler is the intelligent agent that collects and analyzes the data. There should be millions of crawlers, analyzing every piece of data on the web. The analysis should take into account the structure of the page and its category, and prioritize its internal content accordingly.
Data Storage and Indexing
The future search engine should be updated about the content on the World Wide Web in real time. Today, search engines prioritize the resources they crawl in order to stay updated with the most important content in real time, but this is not enough and the road is long.
As mentioned above, once the crawlers collect the data they should pass it on for storage. If we accept the idea that the users’ computers should act as crawlers, data from millions of computers will flow into one or more centralized data centers. A vast amount of processing power is needed to process data at this rate. It’s like trying to count all the water molecules of Niagara Falls as they splash in (or maybe I’m vastly exaggerating, but you get the idea).
Bottom line
More data centers and more computational power should be invested in indexing incoming data in order to meet the real-time demand.
Data Extraction
The search and extraction of the data that has been crawled and stored has very much to do with the quality of the crawling. If the crawler was able to correctly analyze the importance of the page with regard to its contextual meaning, and to determine the page structure by using categorization methods, it will be much easier to retrieve the relevant document for a certain query.
There are many debates about how a user should interact with the machine, in our case the search engine. Currently, we as users need to rewrite the way we speak in order to interact with a search engine. Many times we need to completely rephrase a sentence so that a search engine will “understand” it correctly (hopefully). No doubt, there is a need to solve this problem. Again, the problem does not arise when the search query is simple and contains only a few straightforward keywords. Unfortunately, many times the query is not that simple, and a good search engine should be able to analyze the meaning of our search query. Semantic search engines are on the right path.
I also think that previous queries in the immediate time frame should be taken into account. When a user tries again after failing to receive good results on a previous attempt, he or she is “hinting” to the search engine where to focus and what to fix.
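Here is a minimal Python sketch of how recent queries might be kept around as context for the next one. Everything in it is an assumption made for illustration: the `QuerySession` class is hypothetical, and the five-minute window is just a guess at what "the immediate time frame" could mean.

```python
import time


class QuerySession:
    """Keeps a user's recent queries so a retry can be read in context."""

    def __init__(self, window_seconds=300):
        # The five-minute window is an assumed value, not a recommendation.
        self.window_seconds = window_seconds
        self.history = []  # list of (timestamp, query) pairs

    def add(self, query, now=None):
        """Record a query, using the current time unless one is given."""
        now = time.time() if now is None else now
        self.history.append((now, query))

    def context_terms(self, query, now=None):
        """Terms from recent earlier queries that the new query does not repeat."""
        now = time.time() if now is None else now
        current = set(query.lower().split())
        terms = []
        for timestamp, earlier in self.history:
            if now - timestamp > self.window_seconds:
                continue  # too old to count as the immediate time frame
            for word in earlier.lower().split():
                if word not in current and word not in terms:
                    terms.append(word)
        return terms
```

If a user searches for "jaguar speed" and a moment later for "jaguar animal", `context_terms` returns the dropped word "speed", which an engine could use as a hint about what the user was after or what did not work the first time.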
Bottom line
The future search engine should try to understand free text and use semantic methods to better “understand” what the user is looking for. It can also interact with the user by suggesting more keywords in order to narrow down the field of interest.
The same query could return different (and more accurate) results for different users if the engine takes into account the previous queries the user entered in the immediate time frame.
Data Presentation
If a search engine has conquered all the technological barriers but its presentation layer isn’t good enough, then it has failed. The presentation is the place where the user sees the results returned in response to a query. If the results weren’t good enough, the search engine should supply an easy way to refine them. The user has very little tolerance, so the interface should be highly intuitive and fast.
Giving too much information around the results is a big mistake, since the user will get lost. Lots of flash and animation is cool and fun, but not in the long run. When a user is looking for information, the presentation should be to the point.
The preview snippets should be easy to expand further, since this is where our eyes are focused when we check the results’ relevancy.
I really don’t like image previews of a website. They break the design flow, the colors are mixed, the preview images are usually outdated, and even when they are not, the user cannot read the text on the page. For reading the text we have preview snippets that are ordered by relevancy and marked when needed.
Summary
The future search engine should be a simple one on the front end. It should be fast and intuitive, and the user should be able to easily guide the engine to refine the results when needed. The back end should be adjusted to handle the massive amount of information retrieved by millions of crawlers and analyzed according to the type of each document.
Combining everything said above should create an easy-to-use, intuitive, up-to-date “Find-Engine”.
Cheers,
Ran Geva,
Omgili CEO/Founder.