Details on the Crawler

(C) 1997-98, Christof Prenninger

10-credit project (Java)



This page explains the internal structure and functions of the Crawler-object. A developer doesn't strictly need to read it, but it may help in understanding what some of the components do.

[Architecture diagram of the Crawler. Dotted lines are code-fragments of the Crawler itself.]

Overview:
The Crawler is a multi-threaded program that uses various threads to download and parse files from the net. Since no computer can handle an unlimited number of threads, a maxThreadNum can be specified (see ControllerInterface). The Crawler makes sure that no more than maxThreadNum threads (Readers + Parsers) are ever running.
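As a rough sketch, such a cap might be enforced with a small shared counter. Only maxThreadNum itself is named on this page; the helper class and its method names below are assumptions:

    // Minimal sketch of the maxThreadNum cap (hypothetical helper class;
    // only maxThreadNum itself comes from this page).
    class ThreadLimiter {
        private final int maxThreadNum;   // set via the ControllerInterface
        private int running = 0;          // Reader- and Parser-threads currently active

        ThreadLimiter(int maxThreadNum) { this.maxThreadNum = maxThreadNum; }

        // Returns true if one more Reader-/Parser-thread may be started.
        synchronized boolean tryStart() {
            if (running >= maxThreadNum) return false;
            running++;
            return true;
        }

        // Called whenever a Reader-/Parser-thread finishes.
        synchronized void done() { running--; }
    }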

When the user starts the Crawler (e.g. by clicking Start in the Controller-window), the root-node of the tree is created and sent to the Readers to be downloaded. The locally stored HTML-file is then sent to the Parsers, which usually find links in that page. Those links are in turn sent to the Readers, and so on.

The Crawler uses different nodes to represent links. There are URLNodes, which only represent a URL and can't be loaded (mail, gopher, ...); LoadableNodes, which represent a URL that can be downloaded (pictures, FTP-files); and HTMLNodes, which represent HTML-files. Since LoadableNodes are also URLNodes, and HTMLNodes can also be loaded, the class hierarchy of the different node-types looks as sketched below.
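In Java, that hierarchy might look like this. Only the class names and the inheritance relationship come from this page; the fields and method names are assumptions:

    import java.net.URL;

    // Sketch of the node hierarchy; only the class names come from this page.
    class URLNode {                        // any URL, not necessarily loadable (mail, gopher, ...)
        protected final URL url;
        URLNode(URL url) { this.url = url; }
    }

    class LoadableNode extends URLNode {   // a URL whose content can be downloaded
        LoadableNode(URL url) { super(url); }
        void load() { /* download the file to local storage */ }
    }

    class HTMLNode extends LoadableNode {  // an HTML-file: loadable and parseable for links
        private boolean loadsSons;         // see the note at the bottom of this page
        HTMLNode(URL url) { super(url); }
        void setLoadsSons(boolean b) { loadsSons = b; }
        boolean loadsSons() { return loadsSons; }
    }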

Readers:
Since one HTML page can contain a lot of links, but only so many Reader-threads can run at once, a todo-queue is needed in front of the Readers (todoReaders). Whenever something is waiting in the todoReaders FIFOQueue, the Readers-object is informed once a second (a FIFOQueue runs as its own thread). If another Reader-thread can be started, it will be. Readers is the manager for all the Reader-threads (the blue boxes on the right in the graphic). When a Reader-thread is done loading a file, it tells the managing Readers-object, which sends out a ReadersMessage to all attached Observers (here only the Crawler).
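The once-a-second notification might look like the following sketch. FIFOQueue is named in the text; the listener callback and method names are assumptions:

    import java.util.LinkedList;

    // Sketch of a FIFOQueue: it runs as its own thread and, once a second,
    // informs its listener (e.g. the Readers-object) if work is waiting.
    class FIFOQueue extends Thread {
        private final LinkedList<Object> queue = new LinkedList<>();
        private final Runnable listener;   // hypothetical callback into Readers/Parsers

        FIFOQueue(Runnable listener) { this.listener = listener; }

        synchronized void put(Object node) { queue.addLast(node); }
        synchronized Object get() { return queue.isEmpty() ? null : queue.removeFirst(); }
        private synchronized boolean hasWork() { return !queue.isEmpty(); }

        public void run() {
            while (!isInterrupted()) {
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
                if (hasWork()) listener.run();   // "informed once a second"
            }
        }
    }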

Parsers:
If the downloaded file is an HTML page, it is sent to the Parsers to be parsed for new links. Parsing also means loading the content-info of a URL. This loading process accesses the net and therefore takes a while, so it makes sense to have more than one Parser. Again, there can't be an unlimited number of Parser-threads, so a todoParsers queue is needed. Just like the Readers, the Parsers-object manages the Parser-threads (the blue boxes on the left). Like the todoReaders FIFOQueue, the todoParsers queue informs the Parsers-object once a second whenever there is data waiting in it. When a Parser-thread finds a link, it informs the managing Parsers-object, which sends a ParsersMessage to all attached Observers (here only the Crawler). A ParsersMessage is also sent out when a Parser-thread has finished parsing an HTML-file.
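That reporting path could be sketched with java.util.Observable, which matches the Observer terminology used on this page. Only the class name ParsersMessage comes from the text; its fields and the method names are assumptions:

    import java.util.Observable;

    // Sketch of ParsersMessage; only the class name comes from this page.
    class ParsersMessage {
        static final int LINK_FOUND = 0, PARSING_DONE = 1;
        final int kind;
        final URLNode node;   // the newly found link, or the finished HTMLNode
        ParsersMessage(int kind, URLNode node) { this.kind = kind; this.node = node; }
    }

    // Sketch of the managing Parsers-object.
    class Parsers extends Observable {
        // Called by a Parser-thread for every link it finds.
        void linkFound(URLNode link) {
            setChanged();
            notifyObservers(new ParsersMessage(ParsersMessage.LINK_FOUND, link));
        }
        // Called by a Parser-thread when an HTML-file is completely parsed.
        void parsingDone(HTMLNode page) {
            setChanged();
            notifyObservers(new ParsersMessage(ParsersMessage.PARSING_DONE, page));
        }
    }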

Whenever something is removed from a queue, or one of the Reader- or Parser-threads is done, the Crawler is informed and sends out VisualizerMessages to all attached Visualizers (see the green arrows going to the Visualizer in the graphic).
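That relay could be sketched like this; only VisualizerMessage is named in the text, while the Visualizer interface and the message's fields are assumptions:

    import java.util.Observable;
    import java.util.Observer;
    import java.util.Vector;

    // Sketch of VisualizerMessage; only the class name comes from this page.
    class VisualizerMessage {
        final Object event;   // e.g. a ReadersMessage or ParsersMessage
        VisualizerMessage(Object event) { this.event = event; }
    }

    interface Visualizer { void show(VisualizerMessage m); }   // hypothetical interface

    // Sketch of the Crawler as Observer of the Readers- and Parsers-objects.
    class Crawler implements Observer {
        private final Vector<Visualizer> visualizers = new Vector<>();

        void attach(Visualizer v) { visualizers.add(v); }

        // Called by the Readers-/Parsers-objects whenever something happens.
        public void update(Observable source, Object message) {
            VisualizerMessage vm = new VisualizerMessage(message);
            for (Visualizer v : visualizers) v.show(vm);
        }
    }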

Before an HTMLNode is sent out to be parsed, the Controller is asked whether its sons will be loaded in the future. Remember that a Parser loads the content-info of every newly found son-node, and this takes time. If a node's sons are not expected to be downloaded, the Parser shouldn't load that content-info, to save time. Whether an HTMLNode's sons will be loaded is stored in the HTMLNode itself.
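A sketch of that check, building on the node and queue sketches above. The page only says that the Controller is asked and that the answer is stored in the HTMLNode; willLoadSons, setLoadsSons and the dispatcher class are hypothetical names:

    // Hypothetical Controller interface; only the asking itself is described here.
    interface Controller { boolean willLoadSons(HTMLNode page); }

    // Sketch of dispatching an HTMLNode to the Parsers.
    class ParserDispatcher {
        private final Controller controller;
        private final FIFOQueue todoParsers;

        ParserDispatcher(Controller controller, FIFOQueue todoParsers) {
            this.controller = controller;
            this.todoParsers = todoParsers;
        }

        void send(HTMLNode page) {
            // Ask once and cache the answer in the node itself: a Parser-thread
            // skips the slow content-info loading for sons that will never be loaded.
            page.setLoadsSons(controller.willLoadSons(page));
            todoParsers.put(page);
        }
    }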

