Team Number: 020
School Name: Bosque School
Area of Science: Computer Science
Project Title: Advanced Parsing Algorithm for Optimized Indexing of Site Data found Through TCP/IP Sockets.
The advent of the Internet has put more information at a single person's fingertips than was ever available in the great libraries of the ancient world. However, access to the many terabytes of information on the Internet is limited by the tools available to comb through it and direct the user only to what is relevant. Many companies have written search programs and engines to accomplish these tasks, but we believe that significant improvements can still be made.
Therefore, the purpose of our project is to develop a search algorithm that finds relevant information faster than currently used search engines and that resists the unscrupulous practice of making one's site rank as relevant when it really is not.
Solving this problem at full scale will require massive computing power. We will therefore restrict the initial test version of the program to searching only twenty to thirty sample pages. These pages will be created with similar content, but each will use a different method to "trick" the search engine into tagging it as most relevant. Based on the results from the sample, we will continually update the algorithm to better filter the pages. After we refine the algorithm, we will begin indexing the entire web using massive computing resources. This will require an understanding of current search engine algorithms, parsing of Internet content, MySQL (a relational database queried with SQL, the Structured Query Language) to index results, syntax analysis, Internet communication protocols, and the use of Visual Basic and Java. We are hoping to use Visual Basic because, in our view, it has better socket support, allows easier remote control, and allows easy multithreading.
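To make the socket-based page retrieval concrete, here is a minimal sketch in Java, one of the two languages under consideration. It is not the project's code: the class name `SocketFetcher`, its method names, and the host `example.com` are hypothetical, and it assumes a server that answers plain HTTP/1.0 on port 80 (many sites today redirect such requests to HTTPS, which this sketch does not handle).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads a page over a raw TCP socket.
class SocketFetcher {

    // Builds a minimal HTTP/1.0 GET request. The Host header is not required
    // by HTTP/1.0 but is expected by most servers that host several sites.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Opens a TCP connection, sends the request, and returns the raw
    // response (status line, headers, and body) as text.
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Read until the server closes the connection.
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // ISO-8859-1 maps every byte to a character, so nothing is lost;
            // a real crawler would honor the charset the server declares.
            return buffer.toString("ISO-8859-1");
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java SocketFetcher.java example.com /
            System.out.println(fetch(args[0], 80, args[1]));
        } else {
            // Without arguments, just show the request that would be sent.
            System.out.print(buildRequest("example.com", "/"));
        }
    }
}
```

Speaking HTTP/1.0 keeps the sketch short because the server signals the end of the response by closing the connection, so no chunked-encoding handling is needed.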
Progress to Date:
We have begun to create the program in Visual Basic, although we have not made a final decision on the programming language. So far, the program selects a page from a database and downloads it to the computer. The page's source code is then split into two parts: one containing only the visible text and the other containing only the HTML tags. In addition, the three databases have been created in SQL, although they do not yet contain any data. The Visual Basic program also has a list of the point values assigned to different keywords and other page elements, which will be tuned later in the project. Development is going well, and we plan to complete the program as soon as possible.
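The text/tag split described above can be sketched as follows. This is an illustrative Java version, not the Visual Basic code itself; the class `TagSplitter` and its members are hypothetical names. It assumes that every `<` starts a tag and the next `>` ends it, so it does not handle comments, scripts, or a `>` inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: separates page source into HTML tags and visible text.
class TagSplitter {

    // Holds the two parts of a page.
    static final class Split {
        final List<String> tags = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        // Returns the text with runs of whitespace collapsed to single spaces.
        String plainText() {
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    static Split split(String source) {
        Split result = new Split();
        int i = 0;
        while (i < source.length()) {
            int open = source.indexOf('<', i);
            if (open < 0) {
                // No more tags: the rest is text.
                result.text.append(source, i, source.length());
                break;
            }
            result.text.append(source, i, open);
            int close = source.indexOf('>', open);
            if (close < 0) {
                // Unterminated tag: treat the remainder as text.
                result.text.append(source, open, source.length());
                break;
            }
            result.tags.add(source.substring(open, close + 1));
            // Insert a space so words on either side of a tag stay separate.
            result.text.append(' ');
            i = close + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        Split s = split("<html><b>Hello</b> world</html>");
        System.out.println(s.tags);
        System.out.println(s.plainText());
    }
}
```

Keeping the tags in a separate list, rather than discarding them, leaves room to score them independently of the text later on.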
When the program is completed, we will begin a testing phase in which we create 20-30 sample pages that use similar words but cover different topics. We will then run the search engine on those pages and compare the point results for each keyword, to ensure that the best point values for each kind of object are being used. After we have completed this phase, we will continue by seeding the program with a few website URLs to search. This will allow the program to begin indexing the web.
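The comparison of point results across sample pages could look roughly like the Java sketch below. Everything specific in it is an assumption made for illustration: the class `KeywordScorer`, the rule that a keyword's score is its number of occurrences times its point value, the point values themselves, and the two sample texts. The actual scoring rules are still to be tuned.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: scores a page's text against a table of point values.
class KeywordScorer {

    // Counts case-insensitive whole-word occurrences of keyword in text.
    static int count(String text, String keyword) {
        String target = keyword.toLowerCase(Locale.ROOT);
        int n = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.equals(target)) {
                n++;
            }
        }
        return n;
    }

    // Assumed rule: score for a keyword = occurrences * point value.
    static Map<String, Integer> scorePage(String text, Map<String, Integer> pointValues) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : pointValues.entrySet()) {
            scores.put(e.getKey(), count(text, e.getKey()) * e.getValue());
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up point values and sample pages, for illustration only.
        Map<String, Integer> points = new LinkedHashMap<>();
        points.put("solar", 5);
        points.put("panel", 3);

        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("page-a", "Solar panel installation guide for a solar home");
        pages.put("page-b", "Panel discussion on school budgets");

        // Print the per-keyword scores side by side for comparison.
        for (Map.Entry<String, String> page : pages.entrySet()) {
            System.out.println(page.getKey() + " " + scorePage(page.getValue(), points));
        }
    }
}
```

Both sample pages contain the word "panel", but only one is about solar panels, which is the kind of overlap the similar-words, different-topics test pages are meant to expose.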