Team Number: 020
School Name: Bosque School
Area of Science: Computer Science
Project Title: Web Crawler
The advent of the Internet has put more information at a single person's fingertips than was ever available in the great libraries of the ancient world. However, access to these terabytes upon terabytes of information is limited by the tools available to comb through it and find what is relevant. Since the dawn of the Internet, many companies have written search programs and engines to accomplish this task, but irrelevant results remain rampant. Some publishers of inappropriate sites and unscrupulous site designers have found ways to "trick" search engines into placing their site at the top of the results, or marking it as most relevant, when it is not actually relevant to the search terms entered by the user.
The purpose of our project is to develop a search algorithm that finds relevant information faster than other search engines and discounts unscrupulous practices, such as ranking one's site as relevant when it really is not.
Our project will crawl web sites and index them in a MySQL database. Because of the massive computing power needed to do this at full scale, we will restrict the initial test version of the program to searching only fifty sample pages. The fifty pages will be created with similar content, but each will use a different method to "trick" the search engine into tagging it as most relevant. Based on the results from the sample, we will repeatedly update the algorithm to filter the pages better. After we refine the algorithm, we will begin indexing the entire web using massive computing resources.
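To make the index-and-filter idea concrete, here is a minimal sketch in Python. It is only an illustration under several assumptions, not the algorithm we will end up with: SQLite (from Python's standard library) stands in for MySQL so the example runs without a database server, the `postings` table layout is hypothetical, and the `MAX_DENSITY` threshold and the penalty factor are made-up values aimed at just one trick, keyword stuffing.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical threshold: a page where one term makes up more than
# 20% of all words is treated as keyword-stuffed for that term.
MAX_DENSITY = 0.2
STUFFING_PENALTY = 0.1  # hypothetical multiplier applied to stuffed pages


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Store per-page term counts in an in-memory SQLite database.

    `pages` maps a page name to its visible text. SQLite is a stand-in
    for MySQL here; the SQL statements themselves are ordinary SQL.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postings ("
        "term TEXT, page TEXT, count INTEGER, page_length INTEGER)"
    )
    for page, text in pages.items():
        words = tokenize(text)
        counts = Counter(words)
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?, ?, ?)",
            [(term, page, n, len(words)) for term, n in counts.items()],
        )
    conn.commit()
    return conn


def search(conn, term):
    """Return page names ranked for one search term, best first."""
    rows = conn.execute(
        "SELECT page, count, page_length FROM postings WHERE term = ?",
        (term.lower(),),
    ).fetchall()
    scored = []
    for page, count, length in rows:
        # Logarithmic damping: repeating a word gives diminishing returns.
        score = 1.0 + math.log(count)
        # Pages dominated by the search term are pushed down the ranking.
        if count / length > MAX_DENSITY:
            score *= STUFFING_PENALTY
        scored.append((score, page))
    # Highest score first; ties are broken alphabetically by page name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [page for _, page in scored]


if __name__ == "__main__":
    sample = {
        "honest.html": "Solar panels convert sunlight into electricity "
                       "for homes and schools",
        "stuffed.html": "solar solar solar solar solar solar solar solar "
                        "cheap pills",
    }
    index = build_index(sample)
    print(search(index, "solar"))
```

In this sketch the stuffed page repeats the search term far more often, yet it ranks below the honest page because its term density exceeds the threshold. Each of the fifty test pages would exercise a different trick, so the real filter would need more signals than density alone.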
This will require an understanding of current search engine algorithms, parsing of Internet content, the MySQL database system and SQL (Structured Query Language) to index results, syntax analysis, Internet communication protocols, and a programming language. We are hoping to use PHP (PHP: Hypertext Preprocessor) because we believe it has better socket support, allows easier remote control, and allows easy multithreading.
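As a rough illustration of the content-parsing step, the sketch below pulls the visible text and the outgoing links out of one HTML page. It is written in Python with the standard-library `html.parser` module purely for illustration; the class name `PageParser` and the helper `parse_page` are hypothetical, and a PHP version would be structured differently.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    # Text inside these tags is never shown to the reader, so it is skipped.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Return (visible_text, links) for one HTML document string."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


if __name__ == "__main__":
    page = (
        "<html><head><style>p { color: red; }</style></head>"
        '<body><p>Hello <a href="/next.html">next</a></p>'
        "<script>var x = 1;</script></body></html>"
    )
    text, links = parse_page(page)
    print(text)
    print(links)
```

The extracted text is what an indexer would count words in, and the extracted links are what a crawler would visit next. Fetching the pages over HTTP is left out of this sketch.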