Google's official description of its search technology
The back-end software behind our search technology triggers a series of parallel calculations on the server side that take less than 1 second to execute. Before Google, the results of traditional search engines depended heavily on how often keywords appeared on a page. Page and Brin's original idea was to model the link structure of the World Wide Web as a directed graph and use it to determine the importance of a web page: we assume that a page's importance depends on how many other pages cite it, much like a citation index for academic papers, where an important paper tends to be cited by many other papers. We then perform a hypertext matching analysis based on the search terms (against the indexed content of the pages crawled by the bot) to determine the pages most relevant to the search request. By combining importance and relevance, we can sort the results of a query and present them to our users.
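The citation idea described above can be illustrated with a small power-iteration sketch. This is a minimal, simplified illustration and not Google's production algorithm: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-page example graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate link-based importance by simple power iteration.

    links maps each page to the list of pages it links to.
    damping=0.85 is an assumed, commonly quoted value.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with equal importance for every page.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: pages a, b and d all cite page c.
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
most_important = max(ranks, key=ranks.get)
```

In this toy graph the most-cited page, `c`, ends up with the highest score, and `a` outranks `b` and `d` because it is cited by the important page `c`.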
Data Center: the tower Google uses to index the world
Details of Google's data centers are highly confidential, and not much is publicly known about them.
1. There are more than 19 data centers within the United States, and another 17 are located elsewhere around the world.
2. Each data center can be as large as 500,000 square feet and costs about $600 million to build.
3. Google's data centers are among the most efficient facilities in the world and are also very environmentally friendly, with almost no carbon emissions.
4. A data center uses 50 to 100 megawatts of electricity; because of the need for cooling, it is usually built near a convenient water source.
Google's servers are housed in groups of 1,160 in standard containers, each as big as a house.
Processing flow
1. You add content to the Web: you write a blog post, tweet on Twitter, update your site, and so on.
2. The Googlebot program (a search engine component that acts as an intelligent agent) crawls your web page's title, description, keywords, and other content.
(1) The Google crawler travels the World Wide Web by following links; if no hypertext path leads to your site, your site will not be indexed!
(2) If you disallow crawling in robots.txt, the Google crawler will not crawl your web page.
(3) If a hypertext link to your site carries a nofollow attribute, the Google crawler will not follow that link to your site.
(4) Google can also find your site through blog software or an XML sitemap.
(5) The more links to your site from sites with a higher PageRank, the higher your site's PageRank will be.
(6) The Google crawler follows every link that is not marked as nofollow.
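The crawling rules in this step can be sketched on a tiny in-memory "web". This is only an illustration under stated assumptions: the function `crawl`, the page names, and the way robots.txt exclusions are modeled as a plain set of disallowed pages are all hypothetical, and no real network access or robots.txt parsing takes place.

```python
from collections import deque


def crawl(web, seeds, disallowed=frozenset()):
    """Follow links from the seed pages and return pages in visit order.

    web maps a page name to a list of (target, nofollow) links.
    disallowed stands in for pages excluded via robots.txt.
    """
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        # Skip pages the site owner has excluded, and unknown pages.
        if url in disallowed or url not in web:
            continue
        visited.append(url)
        for target, nofollow in web[url]:
            # Links marked nofollow are never followed.
            if nofollow or target in seen:
                continue
            seen.add(target)
            queue.append(target)
    return visited


# Hypothetical site: "island" has no inbound links at all.
web = {
    "home": [("blog", False), ("private", False), ("partner", True)],
    "blog": [("home", False)],
    "private": [],
    "partner": [],
    "island": [("home", False)],
}

by_links_only = crawl(web, ["home"], disallowed={"private"})
# Listing "island" as an extra seed imitates discovery through a sitemap.
with_sitemap = crawl(web, ["home", "island"], disallowed={"private"})
```

Starting from `home` alone, the crawler never reaches `island` (no link path), `private` (disallowed), or `partner` (nofollow link); adding `island` as a seed makes it reachable.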
3. Once the Google crawler has visited, the page is indexed within seconds.
(1) Web page content is stored in an inverted index.
① The page title and link data are stored in one index that serves broad, general searches.
② The content of the page is stored in a separate index that serves long-tail, personalized, and more specific searches that are issued less frequently.
(2) When you search with Google, you are not searching the live World Wide Web, which changes constantly, but Google's cached index. Google updates its index database regularly, and the update cycle has tended to shorten in the face of competition from Twitter's real-time search and similar services.
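An inverted index, as mentioned in this step, maps each word to the pages that contain it, so a query is answered from the stored index rather than from the live pages. The sketch below is a minimal illustration; the function names `build_inverted_index` and `search`, the whitespace tokenization, and the sample pages are assumptions for this example.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercase word to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawled pages.
pages = {
    "a.html": "Google search engine",
    "b.html": "search tips",
    "c.html": "cooking tips",
}
index = build_inverted_index(pages)
hits_search = search(index, "search")
hits_both = search(index, "search tips")
```

Here `hits_search` contains `a.html` and `b.html`, while the two-word query narrows the result to `b.html` alone.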
4. Google evaluates the overall PageRank of domains and web pages based on links.
5. Check web pages to prevent cheating
(1) Google runs search-quality and anti-spam review and optimization algorithms.
(2) More than 10,000 remote test users evaluate the quality of search results.
(3) Google invites users to report spam suspected of manipulating PageRank.
(4) When Google receives a notice under the (US) Digital Millennium Copyright Act (DMCA), it is required to remove the suspected pirated content from search results.
6. After the page has been analyzed for abuse, each page has many pieces of data associated with it (e.g., search keywords) that point back to it and are used to aid user searches.
7. Search requests from users
(1) Patrick Riley, Google Search Quality Engineer: on most Google searches, your query is in several control or experimental groups at the same time, and it is safe to say that essentially every query is involved in some Google experiment.
8. Google uses synonyms to match results that are semantically similar to your search keywords.
9. Generate preliminary query results
(1) Google could of course return a practically unlimited number of results, but it generally shows fewer than 1,000, on the principle that "less is more" and too many results only confuse.
(2) Query results are localized: local sites appear first in the results.
10. The query result set is sorted by authority and PageRank, and duplicates are eliminated.
(1) Google identifies the relevant keyword ads competing in the auction based on keyword, ad type, and user location.
(2) Keyword advertising must comply with local legal provisions.
① Advertisers running illegal ads are banned.
② If the search traffic for a keyword or the number of clicks on its ad is too low, the ad is automatically disabled.
③ For strategic business reasons, customers like Amazon are given preferential discounts.
(3) Keyword-related ads are ranked by revenue potential (ad quality is evaluated continually after the keyword auction).
(4) The ad text is generally fixed by the advertiser, but dynamic keywords are sometimes used to make the ad more relevant to the search term.
① Some ads allow additional variable information to be attached, such as website links, phone numbers, product links, and addresses.
(5) When an ad has a high CTR, it is displayed at the top of the search results list to make it more visible.
(6) The remaining ads are displayed in the corresponding positions in order.
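Ranking ads by revenue potential can be sketched as ordering them by expected revenue per impression. The formula used here (bid multiplied by click-through rate) is a deliberate simplification and an assumption of this example, not Google's actual auction formula; the function `rank_ads` and the sample ads are hypothetical.

```python
def rank_ads(ads):
    """Order ads by assumed expected revenue per impression (bid * CTR)."""
    return sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)


# Hypothetical ads: bid in dollars per click, ctr as a fraction.
ads = [
    {"name": "A", "bid": 2.00, "ctr": 0.01},
    {"name": "B", "bid": 1.00, "ctr": 0.05},
    {"name": "C", "bid": 1.50, "ctr": 0.02},
]
ranked_names = [ad["name"] for ad in rank_ads(ads)]
```

Under this assumption ad B comes first even though it has the lowest bid, because its high CTR gives it the largest expected revenue; this is consistent with high-CTR ads being shown at the top.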
11. Filtering of query results
(1) For common queries (such as search requests from the Google home page), Google also adds relevant vertical search results (such as news, shopping, videos, books, and maps) to the returned results.
(2) Personalization: sites in the results list that the user has visited before are ranked higher.
(3) Sites that make heavy use of anchors are likely to be removed from the query results
(4) Clustering of search result sets: the importance of a web page is greatly increased if it is referenced by other sites with high PageRank.
(5) Trend analysis: for search keywords whose search traffic has spiked or that are getting heavy news coverage, Google gives extra PageRank weight to new results. (Google Trends is a feature page that reflects keyword search traffic.)
(6) Multiple pages under the same domain with the same PageRank will be grouped together.
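The grouping of pages from the same domain can be sketched as follows. This is a minimal illustration: the function `group_results`, the sample URLs, and the integer PageRank values are hypothetical, and the grouping key (domain plus PageRank) is simply a literal reading of the rule above.

```python
from urllib.parse import urlparse


def group_results(results):
    """Group (url, pagerank) results that share a domain and a PageRank.

    Groups keep the order in which their first member appeared.
    """
    groups = {}
    for url, pagerank in results:
        key = (urlparse(url).netloc, pagerank)
        groups.setdefault(key, []).append(url)
    return list(groups.values())


# Hypothetical result list, already sorted by rank.
results = [
    ("http://a.example/1", 5),
    ("http://b.example/x", 4),
    ("http://a.example/2", 5),
    ("http://a.example/3", 3),
]
grouped = group_results(results)
```

The two `a.example` pages with PageRank 5 end up in one group, while the third `a.example` page stays separate because its PageRank differs.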
12. Finally, Google returns to the browser a user-friendly, well laid out results page in which organic results and ads are clearly separated.
All of these steps are completed in a total response time of less than 1 second, and 300 million hits per day generate over $20 billion in annual revenue for Google.
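As a rough back-of-the-envelope check, the two figures above imply an average revenue per hit. The calculation assumes a 365-day year and treats the quoted "over $20 billion" as exactly $20 billion, so the result is a lower bound rather than a reported figure.

```python
# Figures quoted in the text above.
hits_per_day = 300_000_000
annual_revenue_usd = 20_000_000_000

# Assumption: hits are spread evenly over a 365-day year.
hits_per_year = hits_per_day * 365
revenue_per_hit_usd = annual_revenue_usd / hits_per_year
```

This works out to roughly 18 cents per hit.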