class MyClass : public CS240 { ... };
The CS240 class will keep track
of the number of times you construct new objects, and also how many of those
objects you delete. Upon completion of your program (i.e., at the very end of
your main function) you must call CS240::print( ); and the number of
class constructions and destructions must be equal. Download the CS240 class
which is found in CS240.cpp and CS240.h.
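To make the counting idea concrete, here is a minimal sketch of how a base class can tally constructions and destructions. CountedBase and WordEntry are hypothetical names used only for illustration; the real CS240 class in CS240.h may be implemented differently, and liveObjects() is an assumption added here so the balance can be checked.

```cpp
#include <iostream>

// Hypothetical stand-in for the course-provided CS240 class.
// It only illustrates the counting idea; the real CS240.h may differ.
class CountedBase {
public:
    CountedBase() { ++constructed; }
    CountedBase(const CountedBase&) { ++constructed; }  // copies count too
    virtual ~CountedBase() { ++destroyed; }

    // Reports both totals, in the spirit of CS240::print().
    static void print() {
        std::cout << "constructed: " << constructed
                  << ", destroyed: " << destroyed << '\n';
    }

    // Assumed helper: objects constructed but not yet destroyed.
    static int liveObjects() { return constructed - destroyed; }

private:
    static int constructed;
    static int destroyed;
};

int CountedBase::constructed = 0;
int CountedBase::destroyed = 0;

// A project class opts in by inheriting, as in "class MyClass : public CS240".
class WordEntry : public CountedBase {};
```

If every object allocated with new is eventually deleted, the two totals match at the end of main; a nonzero difference points to a leak.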
crawler startURL URLprefix outputDirectory stopWordFile
The startURL is any valid URL. You must support URLs that begin with http://... as well as file:... For example
http://www.cs.byu.edu/index.html
file:/cs240/project1/main.cpp
The URLprefix is any partial URL. It must at least contain the protocol and, for http, the domain name. It can further specify subdirectories. For example
http://www.cs.byu.edu/
http://students.cs.byu.edu/~cs240/
The purpose of the URLprefix is to limit the set of pages that are indexed by your web crawler. You should ignore any links to pages that do not have the specified prefix on their full URL.
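The prefix test itself can be a plain string comparison. The sketch below assumes the link has already been resolved to a full URL; withinPrefix is a hypothetical helper name, not part of any provided code.

```cpp
#include <string>

// Returns true when url begins with prefix (character-by-character match).
// Hypothetical helper; assumes url is already a full (absolute) URL.
bool withinPrefix(const std::string& url, const std::string& prefix) {
    // compare() looks only at the first prefix.size() characters of url;
    // a url shorter than the prefix can never compare equal.
    return url.compare(0, prefix.size(), prefix) == 0;
}
```

Relative links found in a page would need to be turned into full URLs before this check, since the prefix is matched against the full URL.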
The outputDirectory is any valid UNIX directory name where the generated files can be placed. The home page for your generated index should be outputDirectory/index.html. All other file names that you generate can be anything you want as long as those files are found in outputDirectory.
The stopWordFile is the name of a text file that contains a list of words that should
not be indexed. These words will appear in the file separated by spaces,
newlines or both. These are words such as "a", "the", "to", "is", etc. that are
very common but too trivial to appear in an index. We will provide you with this
file during pass-off.
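Because the words are separated by spaces, newlines or both, reading them with the stream extraction operator is enough. The following is a sketch under that assumption; the function names are hypothetical, and lowercasing the stop words is a choice made here so that they match the case-insensitive index words.

```cpp
#include <algorithm>
#include <cctype>
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Lowercases a word so that comparisons are case insensitive.
std::string toLower(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// operator>> skips any mix of spaces, tabs and newlines between words.
std::set<std::string> readStopWords(std::istream& in) {
    std::set<std::string> stopWords;
    std::string word;
    while (in >> word) stopWords.insert(toLower(word));
    return stopWords;
}

// Convenience wrapper for text that is already in memory.
std::set<std::string> parseStopWords(const std::string& text) {
    std::istringstream in(text);
    return readStopWords(in);
}

// Reads the stop word file; an unreadable file yields an empty set.
std::set<std::string> loadStopWordFile(const std::string& path) {
    std::ifstream in(path);
    return readStopWords(in);
}
```

A std::set gives logarithmic lookup, which is adequate for a short stop word list checked once per word parsed.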
Select a page that has not yet been searched.
Read in the selected page.
If the page is not an HTML page then discard it and go on to the next page.
Save the URL and any summary information about the page.
For all text areas in the page, parse out all of the words and for each word located that is not a stop word, place that word in the index with a reference to the page currently being indexed. Words are case insensitive and must be sorted in ascending order. (This will make it much easier for you to debug!)
For all anchor tags with an "href" attribute (see the HTML definition) check the "href" value to see if it lies within the URLprefix. If the link is within the prefix, then add it to the list of pages to be searched.
Repeat until there are no pages left to search.
Generate the HTML pages for the index.
In this process, it is very important that you remember all of the pages that you have already visited or are planning to visit so that you do not follow any cycles.
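The crawling loop described above can be sketched with a queue of pages still to search and a set of every URL already visited or planned. To keep the example self-contained, the network is replaced by a hypothetical in-memory FakeWeb that maps each URL to the links found on that page; real code would download and parse the page instead. The breadth-first order is one possible choice, not a requirement.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory "web": each URL maps to the links on that page.
using FakeWeb = std::map<std::string, std::vector<std::string>>;

// Returns the pages processed, in the order they were searched.
std::vector<std::string> crawl(const std::string& startURL,
                               const std::string& prefix,
                               const FakeWeb& web) {
    std::queue<std::string> toVisit;   // pages planned but not yet searched
    std::set<std::string> seen;        // visited or planned, to avoid cycles
    std::vector<std::string> indexed;

    toVisit.push(startURL);
    seen.insert(startURL);

    while (!toVisit.empty()) {
        std::string url = toVisit.front();
        toVisit.pop();

        FakeWeb::const_iterator page = web.find(url);
        if (page == web.end()) continue;   // stands in for "not an HTML page"
        indexed.push_back(url);            // real code would index words here

        for (const std::string& link : page->second) {
            bool inPrefix = link.compare(0, prefix.size(), prefix) == 0;
            // insert().second is true only the first time a URL is seen.
            if (inPrefix && seen.insert(link).second) toVisit.push(link);
        }
    }
    return indexed;
}
```

Marking a URL as seen when it is queued, rather than when it is read, keeps the same page from being queued twice by two different referring pages.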
Many sites such as Amazon.com do not actually store many web pages. Most of
the pages are generated on the fly from a database. Please do not point your web
crawler at any such commercial sites. Crawling them causes a huge number of hits on those
sites and will overflow the capacity of our machines with the results.
A letter page should contain a header that indicates the letter this page is an index for and a link back to the home page. Each letter page should contain all of the index words that begin with that page's letter. Remember that an index word is any word that appears in the text portion (not inside the tags) of any HTML page, excluding any words found in the stop word file. Index words are also case insensitive. Each index word should appear only once on a letter page and should be a hyperlink to the word page for that word.
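One way to get sorted, duplicate-free index words is to store them in a std::map keyed by the lowercased word. The sketch below assumes that a word is a maximal run of alphabetic characters and that the text has already had its tags removed; Index, addWords and wordsForLetter are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// word -> URLs of the pages containing it; std::map keeps words sorted.
using Index = std::map<std::string, std::set<std::string>>;

// Assumption: a word is a maximal run of alphabetic characters.
void addWords(Index& index, const std::string& text, const std::string& url,
              const std::set<std::string>& stopWords) {
    std::string word;
    // Loop one position past the end so the final word is flushed too.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        unsigned char c = i < text.size() ? static_cast<unsigned char>(text[i]) : ' ';
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            if (stopWords.count(word) == 0) index[word].insert(url);
            word.clear();
        }
    }
}

// Words for one letter page, in ascending order, each listed once.
std::vector<std::string> wordsForLetter(const Index& index, char letter) {
    std::vector<std::string> words;
    for (const auto& entry : index)
        if (entry.first[0] == letter) words.push_back(entry.first);
    return words;
}
```

Each entry's set of URLs is what a word page would list, and the map's ordering provides the ascending sort without a separate sorting step.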
A word page should contain a header which indicates the word for this
page as well as links to the home page and to the letter page for this word. On
a word page each located page that contains that word should be listed. A
located page is listed using its summary. A page's summary is the contents
of the <title> tag (if there is one). If there is no <title> tag in
the page, then it should be the text contents of the first header tag
(<h1>, <h2>, <h3>, ...) in the file. If there are no
<title> or header tags then the summary should be the first 100 characters
of whatever text is not found inside of a tag. The page summary information
should be hyperlinked to the actual page itself using the URL that you stored
during the web crawling phase.
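The three-step fallback for a page summary can be sketched as follows. This is a simplified illustration, not a full HTML parser: it assumes lowercase tag names, tags without attributes, and well-formed markup, and all function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Text between <tag> and </tag>, or "" when either is missing.
// Assumes lowercase tag names with no attributes.
std::string elementText(const std::string& html, const std::string& tag) {
    std::size_t open = html.find("<" + tag + ">");
    if (open == std::string::npos) return "";
    std::size_t start = open + tag.size() + 2;   // skip past "<tag>"
    std::size_t close = html.find("</" + tag + ">", start);
    if (close == std::string::npos) return "";
    return html.substr(start, close - start);
}

// Keeps only the characters that are not inside a tag.
std::string stripTags(const std::string& html) {
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag) text += c;
    }
    return text;
}

std::string summarize(const std::string& html) {
    // 1. The contents of <title>, if the page has one.
    if (html.find("<title>") != std::string::npos)
        return elementText(html, "title");

    // 2. Otherwise the header tag (h1..h6) that appears earliest in the file.
    std::size_t best = std::string::npos;
    std::string bestTag;
    for (char level = '1'; level <= '6'; ++level) {
        std::string tag = std::string("h") + level;
        std::size_t pos = html.find("<" + tag + ">");
        if (pos < best) { best = pos; bestTag = tag; }
    }
    if (!bestTag.empty()) return stripTags(elementText(html, bestTag));

    // 3. Otherwise the first 100 characters of text outside any tag.
    return stripTags(html).substr(0, 100);
}
```

Note that step 2 picks the header that occurs first in the file rather than the lowest-numbered one, which is how the wording above reads.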
g++ openConn.cpp -lsocket -lnsl
NOTE: Take out the main function at the bottom in order to compile our code as an object file. Use these linker options (i.e. -lsocket -lnsl) when linking together the object files.
You may read the source code to the Connection class if you care to learn how it works, but it is not essential to completing the project.
If you find any errors let the TAs know and we will be glad to fix them.