CS 240 - Program 1

Keep checking the section entitled Project Notes for clarifications on the project specification.

Submitting your program

  • Programs must be turned in by 5:00 pm, Friday, October 22.
  • Programs will be turned in by ftp'ing your source code to the TA directory. This code will be returned to you (damaged) for your debugging exam.
  • You must also sign up with the TAs for a passoff where you can compile and run your code. They will then test and grade your code. Style grading will be done separately from the passoff.
  • Each class you design must inherit from the CS240 base class.  For example,
    class MyClass : public CS240 { ... };
    The CS240 class will keep track of the number of times you construct new objects, and also how many of those objects you delete. Upon completion of your program (i.e., at the very end of your main function) you must call CS240::print( ); and the number of class constructions and destructions must be equal. Download the CS240 class which is found in CS240.cpp and CS240.h.
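The counting idea described above can be sketched as follows. This is not the provided CS240 class: `CountedSketch`, its members, and `WordEntry` are hypothetical stand-ins for illustration, and the real interface is whatever CS240.h declares.

```cpp
#include <iostream>

// Hypothetical stand-in for the provided CS240 base class (NOT the real code).
// Every construction and destruction of a derived object bumps a counter.
class CountedSketch {
public:
    CountedSketch() { ++constructed; }
    CountedSketch(const CountedSketch&) { ++constructed; }
    virtual ~CountedSketch() { ++destroyed; }

    // Reports both totals; the two numbers should match at the end of main.
    static void print() {
        std::cout << "constructed: " << constructed
                  << ", destroyed: " << destroyed << '\n';
    }

    static int constructed;
    static int destroyed;
};
int CountedSketch::constructed = 0;
int CountedSketch::destroyed = 0;

// A project class would inherit from the base class in the same way.
class WordEntry : public CountedSketch {
public:
    explicit WordEntry(int n) : count(n) {}
    int count;
};
```

With a base class like this, any object allocated with new and never deleted shows up as a mismatch between the two totals.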

    Overview

    This programming assignment is to implement a web crawler that generates a word index for a WWW site or portion of a site. Your program must accept a URL that specifies the start of your search for web pages, a URL prefix that limits the scope of your searching, the name of a directory where the pages generated by your program should be placed, and a file of stop words. The form of the command line for your program should be:
     
    crawler startURL URLprefix outputDirectory stopWordFile
    The startURL is any valid URL. You must support URLs that begin with http://... as well as file:... For example:
    http://www.cs.byu.edu/index.html
    file:/cs240/project1/main.cpp
    The URLprefix is any partial URL. It must at least contain the protocol and, for http, the domain name. It can further specify subdirectories. For example:
    http://www.cs.byu.edu/
    http://students.cs.byu.edu/~cs240/
    The purpose of the URLprefix is to limit the set of pages that are indexed by your web crawler. You should ignore any links to pages that do not have the specified prefix on their full URL.
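A minimal sketch of the prefix test, assuming each link has already been turned into a full URL and that the comparison is case sensitive. The function name `hasPrefix` is an illustrative choice, not part of the assignment.

```cpp
#include <string>

// Returns true when url begins with prefix, compared character by character.
// A url shorter than the prefix can never match.
bool hasPrefix(const std::string& url, const std::string& prefix) {
    return url.compare(0, prefix.size(), prefix) == 0;
}
```

For example, http://www.cs.byu.edu/index.html passes against the prefix http://www.cs.byu.edu/, while http://www.byu.edu/ does not.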

    The outputDirectory is any valid UNIX directory name where the generated files can be placed. The home page for your generated index should be outputDirectory/index.html. All other file names that you generate can be anything you want as long as those files are found in outputDirectory.

    The stopWordFile is the name of a text file that contains a list of words that should not be indexed. These words will appear in the file separated by spaces, newlines, or both. These are words such as "a", "the", "to", and "is" that are very common but too trivial to appear in an index. We will provide you with this file during passoff.
     

    Function of a web crawler

    Web crawlers do a depth-first or breadth-first search of all of the web pages that are directly or indirectly linked to some starting page. In general the function is to:
     
    1. Select a page that has not yet been searched.
    2. If the page is not an HTML page, discard it and go on to the next page.
    3. Read in the selected page.
    4. For all text areas in the page, parse out all of the words and for each word located that is not a stop word, place that word in the index with a reference to the page currently being indexed. Words are case insensitive and must be sorted in ascending order. (This will make it much easier for you to debug!)
    5. For all anchor tags with an "href" attribute (see the HTML definition) check the "href" value to see if it lies within the URLprefix. If the link is within the prefix, then add it to the list of pages to be searched.
    6. Save the URL and any summary information about the page.
    7. Repeat until there are no pages left to search.
    8. Generate the HTML pages for the index.
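The anchor-tag step above could be sketched as follows. This is a simplified scanner, not a full HTML parser: it assumes attribute values are enclosed in double quotes and that no '>' appears inside a tag. The names `lowerCopy` and `extractHrefs` are hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Lowercase copy, so tag and attribute names match regardless of case.
std::string lowerCopy(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

// Collects the href values of all <a ...> tags, in document order.
std::vector<std::string> extractHrefs(const std::string& html) {
    const std::string::size_type npos = std::string::npos;
    std::vector<std::string> hrefs;
    const std::string lower = lowerCopy(html);
    std::string::size_type pos = 0;
    while ((pos = lower.find("<a", pos)) != npos) {
        std::string::size_type tagEnd = lower.find('>', pos);
        if (tagEnd == npos) break;
        const std::string tag = html.substr(pos, tagEnd - pos);
        const std::string lowerTag = lower.substr(pos, tagEnd - pos);
        pos = tagEnd + 1;
        // "<a" must be followed by whitespace; this skips tags such as <abbr>.
        if (lowerTag.size() < 3 ||
            !std::isspace(static_cast<unsigned char>(lowerTag[2]))) continue;
        std::string::size_type attr = lowerTag.find("href");
        if (attr == npos) continue;
        std::string::size_type open = tag.find('"', attr);
        if (open == npos) continue;
        // Only '=' and whitespace may sit between "href" and the opening quote.
        const std::string between = tag.substr(attr + 4, open - (attr + 4));
        if (between.find('=') == npos ||
            between.find_first_not_of(" \t\r\n=") != npos) continue;
        std::string::size_type close = tag.find('"', open + 1);
        if (close == npos) continue;
        hrefs.push_back(tag.substr(open + 1, close - open - 1));
    }
    return hrefs;
}
```

Each value returned still has to be checked against the URLprefix before it is added to the list of pages to be searched.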
     

    In this process, it is very important that you remember all of the pages that you have already visited or are planning to visit so that you do not follow any cycles.
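The bookkeeping for visited pages might look like the breadth-first sketch below. To keep it self-contained, the web is replaced by a hypothetical in-memory map from each URL to the links found on that page; `LinkGraph` and `crawlOrder` are illustrative names, and a real crawler would read and index the page where the comment indicates.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the web: URL -> links found on that page.
typedef std::map<std::string, std::vector<std::string> > LinkGraph;

// Returns the pages reachable from startURL within prefix, in the order visited.
std::vector<std::string> crawlOrder(const std::string& startURL,
                                    const std::string& prefix,
                                    const LinkGraph& web) {
    std::vector<std::string> order;
    std::set<std::string> seen;        // pages visited or already queued
    std::queue<std::string> pending;   // pages waiting to be searched
    seen.insert(startURL);
    pending.push(startURL);
    while (!pending.empty()) {
        std::string url = pending.front();
        pending.pop();
        order.push_back(url);          // a real crawler would index the page here
        LinkGraph::const_iterator page = web.find(url);
        if (page == web.end()) continue;
        for (std::vector<std::string>::size_type i = 0; i < page->second.size(); ++i) {
            const std::string& link = page->second[i];
            bool inScope = link.compare(0, prefix.size(), prefix) == 0;
            // insert() reports whether the link is new, so cycles are never followed.
            if (inScope && seen.insert(link).second) pending.push(link);
        }
    }
    return order;
}
```

Marking a page as seen when it is queued, rather than when it is read, keeps the same URL from being queued twice by two different pages.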

    Friendly Web Crawling

    While testing your web crawler do not repeatedly hammer a particular site. For simple testing you should create a small set of web pages in a local directory and use them to work with, so that you do not slow down a real site with excessive traffic.

    Many sites such as Amazon.com do not actually store many web pages. Most of the pages are generated on the fly from a database. Please do not point your web crawler at any such commercial sites. Crawling them causes a huge number of hits on those sites and will overflow the capacity of our machines with the results.
     

    HTML

    One of the easiest ways to learn HTML is to look at the source for web pages that you commonly use. In Netscape you can use the Page Source item found in the View menu. There is a similar feature in Internet Explorer. One can also read about HTML here. The key tags that you should understand are <a>, <title> and the heading tags <h1>, <h2>, and so on.
     

    Generated Output

    The home page for your generated output "index.html" should contain some kind of a welcoming header and a list of all letters in the alphabet. The welcoming header should also indicate the URLprefix for which this is an index. Each letter should be a link to a letter page for that letter.
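A sketch of building the home page as a string is shown below. The file naming scheme (letter_a.html and so on), the header wording, and the function names are all hypothetical; the assignment allows any file names inside outputDirectory. The URLprefix is inserted without HTML escaping, which is an assumption that it contains no special characters.

```cpp
#include <string>

// Hypothetical naming scheme: the page for letter 'a' is "letter_a.html".
std::string letterPageName(char letter) {
    return std::string("letter_") + letter + ".html";
}

// Builds index.html: a welcoming header naming the prefix, then 26 letter links.
std::string homePageHtml(const std::string& urlPrefix) {
    std::string html = "<html><head><title>Word Index</title></head><body>\n";
    html += "<h1>Word index for " + urlPrefix + "</h1>\n<p>";
    for (char letter = 'a'; letter <= 'z'; ++letter) {
        html += "<a href=\"" + letterPageName(letter) + "\">";
        html += static_cast<char>(letter - 'a' + 'A');   // visible label, uppercase
        html += "</a> ";
    }
    html += "</p>\n</body></html>\n";
    return html;
}
```

The returned string would then be written to outputDirectory/index.html.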

    A letter page should contain a header which indicates the letter that this page is an index for and a link back to the home page. Each such letter page should contain all of the index words that begin with that page's letter. Remember that an index word is any word that appears in the text portion (not inside the tags) of any HTML page, excluding any words found in the stop word file. Index words are also case insensitive. Each index word should appear only once on a letter page and should be a hyperlink to a word page for that word.
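One possible in-memory shape for the index is sketched below: a sorted map from each lowercase word to the set of URLs that contain it. It assumes the stop words are already lowercase; `WordIndex`, `addWord`, and `wordsForLetter` are hypothetical names.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

// word (lowercase) -> URLs of the pages containing it; std::map keeps words sorted.
typedef std::map<std::string, std::set<std::string> > WordIndex;

std::string lowerCase(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

// Records that url contains word, unless the word is a stop word.
void addWord(WordIndex& index, const std::set<std::string>& stopWords,
             const std::string& word, const std::string& url) {
    std::string key = lowerCase(word);
    if (key.empty() || stopWords.count(key) != 0) return;
    index[key].insert(url);   // the set ignores repeated (word, url) pairs
}

// Words for one letter page, each listed once, in ascending order.
std::vector<std::string> wordsForLetter(const WordIndex& index, char letter) {
    std::vector<std::string> words;
    char wanted = static_cast<char>(std::tolower(static_cast<unsigned char>(letter)));
    for (WordIndex::const_iterator it = index.begin(); it != index.end(); ++it)
        if (it->first[0] == wanted) words.push_back(it->first);
    return words;
}
```

Because both containers are ordered and reject duplicates, each word appears once per letter page and each page once per word without extra checks.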

    A word page should contain a header which indicates the word for this page as well as links to the home page and to the letter page for this word. On a word page each located page that contains that word should be listed. A located page is listed using its summary. A page's summary is the contents of the <title> tag (if there is one). If there is no <title> tag in the page, then it should be the text contents of the first heading tag (<h1>, <h2>, <h3>, ...) in the file. If there are no <title> or heading tags then the summary should be the first 100 characters of whatever text is not found inside of a tag. The page summary information should be hyperlinked to the actual page itself using the URL that you stored during the web crawling phase.
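The summary rule above (title, else first heading, else first 100 characters of text) might be sketched like this. It is a simplified scanner that does not handle comments or '>' inside attribute values, it does not trim whitespace, and it treats <h1> through <h6> as the heading tags. All function names are hypothetical.

```cpp
#include <cctype>
#include <string>

std::string lowered(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

// Drops everything between '<' and '>' so only text outside tags remains.
std::string stripTags(const std::string& html) {
    std::string text;
    bool inTag = false;
    for (std::string::size_type i = 0; i < html.size(); ++i) {
        if (html[i] == '<') inTag = true;
        else if (html[i] == '>') inTag = false;
        else if (!inTag) text += html[i];
    }
    return text;
}

// Position of the first "<name" followed by '>' or whitespace, or npos.
std::string::size_type findOpenTag(const std::string& lower, const std::string& name) {
    const std::string pattern = "<" + name;
    std::string::size_type pos = lower.find(pattern);
    while (pos != std::string::npos) {
        std::string::size_type next = pos + pattern.size();
        if (next < lower.size() &&
            (lower[next] == '>' || std::isspace(static_cast<unsigned char>(lower[next]))))
            return pos;
        pos = lower.find(pattern, pos + 1);
    }
    return std::string::npos;
}

// Text inside the element whose open tag starts at 'open'; nested tags are dropped.
std::string elementTextAt(const std::string& html, const std::string& lower,
                          std::string::size_type open, const std::string& name) {
    std::string::size_type start = lower.find('>', open);
    if (start == std::string::npos) return "";
    ++start;
    std::string::size_type end = lower.find("</" + name, start);
    if (end == std::string::npos) end = html.size();
    return stripTags(html.substr(start, end - start));
}

std::string pageSummary(const std::string& html) {
    const std::string lower = lowered(html);
    std::string::size_type pos = findOpenTag(lower, "title");
    if (pos != std::string::npos) return elementTextAt(html, lower, pos, "title");

    // Pick whichever heading tag occurs earliest in the file.
    std::string firstName;
    std::string::size_type firstPos = std::string::npos;
    for (char level = '1'; level <= '6'; ++level) {
        std::string name = std::string("h") + level;
        std::string::size_type p = findOpenTag(lower, name);
        if (p < firstPos) { firstPos = p; firstName = name; }
    }
    if (firstPos != std::string::npos)
        return elementTextAt(html, lower, firstPos, firstName);

    return stripTags(html).substr(0, 100);
}
```

Note that the heading search looks for the earliest heading in the file rather than the lowest-numbered one, so an <h2> that precedes an <h1> supplies the summary.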
     

    Code help

    You will not need to learn any networking to complete this assignment. We have created a class Connection which you will use to open a connection to a file and read it from a socket. We are providing you with the code that will take a URL as an argument to a function. The main( ) function included is an example of how you will use the Connection class to read URLs. Basically, the method is as follows:
    1. First you call openConnection(char *) sending it a URL.
    2. Then you repeatedly call read(char *), which will return the information read as a string, also indicating how many bytes are received. Continue doing this until you have reached the end of file (i.e., bytes read is zero).
    3. Finally call closeConnection( ).
    Compile this program as follows:
    g++ openConn.cpp -lsocket -lnsl
    NOTE: Take out the main function at the bottom in order to compile our code as an object file. Use these linker options (i.e., -lsocket -lnsl) when linking the object files together.

    You may read the source code to the Connection class if you care to learn how it works, but it is not essential to completing the project.
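The open, read, close sequence described in the numbered steps can be illustrated with the sketch below. `FakeConnection` is a hypothetical stand-in, not the provided Connection class: the method names come from the steps, but the parameter types, return types, buffer size, and canned page content are assumptions. Check the provided source and its main( ) for the real signatures.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the provided Connection class (signatures assumed).
class FakeConnection {
public:
    enum { BUFFER_SIZE = 8 };   // deliberately tiny so several reads are needed

    FakeConnection() : pos(0) {}

    // Pretends to open the URL; the "page" is canned text that echoes the URL.
    bool openConnection(const char* url) {
        data = std::string("<html>page at ") + url + "</html>";
        pos = 0;
        return true;
    }

    // Copies the next chunk into buffer and returns the number of bytes copied;
    // a return value of zero signals end of file.
    int read(char* buffer) {
        std::size_t remaining = data.size() - pos;
        std::size_t limit = static_cast<std::size_t>(BUFFER_SIZE - 1);
        std::size_t n = remaining < limit ? remaining : limit;
        std::memcpy(buffer, data.data() + pos, n);
        buffer[n] = '\0';
        pos += n;
        return static_cast<int>(n);
    }

    void closeConnection() {}

private:
    std::string data;
    std::size_t pos;
};

// Reads a whole page by calling read until it reports zero bytes.
std::string fetchAll(FakeConnection& conn, const char* url) {
    std::string page;
    if (!conn.openConnection(url)) return page;
    char buffer[FakeConnection::BUFFER_SIZE];
    int n;
    while ((n = conn.read(buffer)) > 0) page.append(buffer, n);
    conn.closeConnection();
    return page;
}
```

Accumulating the chunks into one string before parsing avoids having to handle a tag or word that is split across two reads.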

    If you find any errors let the TAs know and we will be glad to fix them.

    Project Notes

    This section will be updated with any relevant information pertaining to ambiguities in the above specification.
     
  • A word is any contiguous sequence of letters and digits, beginning with a letter, that does not appear in a tag.
  • You need only consider links that end in ".html" or ".htm". You may assume that all other links are not HTML (we know this is not true, but it simplifies the program).
  • You do not have to worry about generating a default file name when one is not specified in a URL. Ignore all links that assume a default page.
  • Header information in an HTML file is not relevant for this assignment (i.e., don't worry about it, just throw it away!)
  • You don't have to parse text outside of the <HTML> ... </HTML> tags in a file.
  • Links containing "../" in the path (link to parent directory) don't have to be processed.
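Two of the notes above, the definition of a word and the rules for which links to consider, can be sketched as follows. The sketch makes some assumptions the notes leave open: digits at the start of a run are skipped (so "123abc" yields "abc"), words are lowercased, and the ".html" / ".htm" test is case sensitive. The function names are hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Lowercase words found outside tags: runs of letters and digits that begin
// with a letter.
std::vector<std::string> extractWords(const std::string& html) {
    std::vector<std::string> words;
    std::string current;
    bool inTag = false;
    for (std::string::size_type i = 0; i < html.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(html[i]);
        if (inTag) {                       // skip everything up to the closing '>'
            if (c == '>') inTag = false;
            continue;
        }
        // A letter always extends a word; a digit only continues one already begun.
        if (std::isalpha(c) || (std::isdigit(c) && !current.empty())) {
            current += static_cast<char>(std::tolower(c));
            continue;
        }
        if (!current.empty()) {            // any other character ends the word
            words.push_back(current);
            current.clear();
        }
        if (c == '<') inTag = true;
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// True for links ending in ".html" or ".htm" that do not contain "../".
bool isCrawlableLink(const std::string& href) {
    if (href.find("../") != std::string::npos) return false;
    return endsWith(href, ".html") || endsWith(href, ".htm");
}
```

Under these rules a link such as http://www.cs.byu.edu/ is ignored because it relies on a default page, which matches the note about default file names.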
     


     
     
     
     
     
     
     
     

