Domain-Specific Crawler Design
A domain-specific crawler creates a domain-specific Web-page repository by collecting domain-specific resources from the Internet, and a domain-specific Web search engine basically searches domain-specific Web-pages from that repository. In a domain-specific Web search crawler, the crawler crawls down only those Web-pages that are relevant to our domain. In the World Wide Web (WWW), the majority of Web-pages carry no tags that tell the crawler whether they belong to a specific domain, so to find the domain we need to visit all the Web-pages and calculate a relevance value. For a particular domain-specific Web-page, the relevance value is calculated based on that domain's Ontology. Web researchers have already introduced various Web-page crawling mechanisms, such as the focused crawler, hierarchical crawler [11], and parallel crawler [12–18], all of which are described in Chapter "Introduction". Initially, we provided a mechanism for crawling Web-pages of a single domain only and proposed a new model, the Relevant Page Tree (RPaT).
Now, consider a situation where a Web-page is not related to the given domain but belongs to another domain. For this scenario, we have enhanced our concept with a new proposal for working with multiple domains. In multiple domain-specific Web search, the crawler crawls down the Web-pages, checks multiple domains simultaneously by using multiple domain Ontologies, and finds which Web-page belongs to which domain. For the multi-domain crawler we introduced a new model, the Relevant Page Graph (RPaG).
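As an informal illustration of this simultaneous check, the sketch below scores one Web-page against several domain weight tables and keeps the best-scoring domain. The domain names, terms, weights, and relevance limit are all assumed values, and the per-domain scoring follows the relevance calculation described later in Sect. 2.1; it is a sketch, not the exact RPaG construction procedure.

```python
# Minimal sketch: checking one Web-page against multiple domains (illustrative only).
# Each domain Ontology is reduced here to a term -> weight table; the domain names,
# term weights, and the relevance limit REL_LMT are assumptions.

import re
from collections import Counter

DOMAIN_WEIGHT_TABLES = {
    "Cricket":  {"wicket": 0.9, "batsman": 0.8, "over": 0.4},
    "Football": {"goal": 0.9, "striker": 0.8, "offside": 0.7},
}
REL_LMT = 5.0  # predefined relevance limit (assumed value)

def assign_domain(page_text):
    """Score the page against every domain and return the best match, if any."""
    counts = Counter(re.findall(r"[a-z]+", page_text.lower()))
    scores = {domain: sum(w * counts[t] for t, w in table.items())
              for domain, table in DOMAIN_WEIGHT_TABLES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= REL_LMT else None
```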
Further, we have improved the performance of our domain-specific crawler by introducing parallel crawling and domain classifiers. To construct this prototype we have used two classifiers: a Web-page content classifier and a Web-page Uniform Resource Locator (URL) classifier. Based on these two classifiers we customize our crawler inputs and create a meta-domain, i.e., a domain about domains. The Web-page content classifier identifies relevant and irrelevant Web-pages, i.e., domain-specific Web-pages such as Cricket, Football, Hockey, Computer Science, etc., and the URL classifier classifies URL extension domains such as .com, .edu, .net, .in, etc. These two domain classifiers are used one after the other, i.e., at two levels, to achieve the crawler's goals and objectives. For that reason, we call this Web search crawler a multilevel domain-specific Web search crawler. Finally, we found that the multilevel domain-specific Web search crawler mechanism produces better performance with respect to other crawlers.
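The URL-classification level can be illustrated with the following minimal sketch, which assumes that a URL is classified purely by the last label of its host name; the set of recognized extensions is an assumption, not the complete list used by the crawler.

```python
# Minimal sketch of the URL extension classifier (illustrative only).
# A URL is classified by the last label of its host name; the set of
# recognized extensions is an assumption, not an exhaustive list.

from urllib.parse import urlparse

KNOWN_EXTENSIONS = {"com", "edu", "net", "in", "org"}

def classify_url(url):
    """Return the URL extension domain (e.g. 'edu'), or 'other' if unrecognized."""
    host = urlparse(url).hostname or ""
    extension = host.rsplit(".", 1)[-1].lower()
    return extension if extension in KNOWN_EXTENSIONS else "other"

# Example: classify_url("http://www.example.edu/dept/cse") returns 'edu'.
```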
This chapter is organized as follows. In Sect. 2, we discuss the proposed approach; for convenience, this section is further divided into three subsections. In Sect. 2.1, the construction mechanism of the single domain-specific Web crawler is described. The algorithm and search engine resource repository building of the multi-domain-specific Web crawler are described in Sect. 2.2. The multilevel domain-specific Web crawler design and the classifiers used are discussed in Sect. 2.3. Experimental analysis is presented and discussed in Sect. 3. Finally, the important findings obtained from this study and the conclusions reached are highlighted in the last section.
To find a geographical location on the Globe, we usually follow a geographical map. By a similar analogy, to find a Web-page in the WWW, we usually use a Web search engine. Web crawler design is an important job for collecting Web search engine resources from the WWW, and a better resource collection leads to better performance of the Web search engine. In our approach, we crawl through the Web and add to the database those Web-pages that are related to a specific domain (i.e., related to a specific Ontology), and we discard Web-pages that are not related to the considered domain. To determine whether a Web-page is in a specific domain, we calculate the relevance of that Web-page, and if the relevance score of that Web-page is more than a predefined relevance limit then we say that the Web-page belongs to the domain. We have generated a new Web search crawler model which supports parallel crawling mechanisms as well as identifies the proper domain by using the Web-page content classifier and the Web-page URL classifier.
Single Domain-Specific Web Search Crawler
In a single domain-specific Web search crawler, the crawler crawls down the Web-pages that are relevant to a single domain. To find such a domain we need to visit all the Web-pages and calculate the relevance value.
Proposed Web-Page Content Relevance Calculation Algorithm for Single Domain
In this subsection, we describe how the relevance score of a Web-page is calculated. The algorithm takes a weight table and a Web-page as input and calculates the relevance score of the Web-page. For each term in the weight table, it counts how many times the term occurs in the Web-page and multiplies the term weight by that number of occurrences; the sum of these products gives the relevance score.
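A minimal sketch of this calculation is given below, assuming the Web-page is already available as plain text and the weight table is a simple term-to-weight mapping; the tokenization rule and the example weights are assumptions.

```python
# Minimal sketch of the Web-page content relevance calculation (illustrative only).
# For each term in the weight table, the term weight is multiplied by the number of
# occurrences of that term in the page text, and the products are summed.

import re
from collections import Counter

def relevance_score(page_text, weight_table):
    term_counts = Counter(re.findall(r"[a-z]+", page_text.lower()))
    return sum(weight * term_counts[term] for term, weight in weight_table.items())

# Example with an assumed weight table for a Cricket domain:
# relevance_score(page_text, {"wicket": 0.9, "batsman": 0.8, "over": 0.4})
```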
Domain-Specific Web-Page Repository Building
Using ontological knowledge we can find relevant Web-pages on the Web. When the crawler finds a new Web-page, it calculates the relevance value (REL_VAL) of that Web-page. If the calculated relevance value is more than a predefined relevance limit (REL_LMT), we say that the Web-page belongs to our considered domain (refer Fig. 2). A Web-page has a number of links associated with it; therefore, we need to take special care of those links to keep our crawler focused on the specific domain.
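The basic repository-building loop can be sketched as follows, assuming hypothetical fetch_page and extract_links helpers for downloading a page and extracting its outgoing links, and a relevance function such as the one sketched in the previous subsection; only the links of pages that cross REL_LMT are followed here.

```python
# Minimal sketch of domain-specific repository building (illustrative only).
# fetch_page(url) and extract_links(url, text) are assumed helpers; relevance_fn
# computes REL_VAL of a page, and rel_lmt is the predefined relevance limit REL_LMT.

from collections import deque

def build_repository(seed_urls, relevance_fn, rel_lmt, fetch_page, extract_links):
    repository = {}              # URL -> REL_VAL of accepted (relevant) pages
    frontier = deque(seed_urls)  # URLs waiting to be crawled
    visited = set()
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        text = fetch_page(url)
        rel_val = relevance_fn(text)
        if rel_val > rel_lmt:    # page belongs to the considered domain
            repository[url] = rel_val
            frontier.extend(extract_links(url, text))  # follow links of relevant pages
    return repository
```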
Challenges Faced While Crawling
In our approach, we follow the links found in domain-specific Web-pages, and we do not follow the links of a Web-page whose relevance value is below the limit. However, if some domain-specific Web-pages are separated from the relevant region by some irrelevant Web-pages, the performance of the crawler will degrade. In Fig. 3, we have shown that at level 3 there are some irrelevant Web-pages. Some relevant Web-pages can be reached only through these irrelevant Web-pages, so if we discard an irrelevant Web-page after calculating its relevance value, we lose some valid relevant Web-pages (refer to levels 4 and 5 in Fig. 3), which leads to a crawler performance issue.
As a solution to this problem, we have chosen a criterion that defines a tolerance limit. The URLs of irrelevant Web-pages are stored in a separate table, IRRE_TABLE, which has two columns, URL and Level. We crawl through the URLs in IRRE_TABLE up to the tolerance-limit level. If we find some relevant Web-pages, those Web-pages are added to the main domain-specific Web-page repository; if no relevant Web-pages are found within the predefined tolerance limit, those URLs are discarded.
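The tolerance-limit handling can be sketched as follows. IRRE_TABLE is modeled here as a simple list of (URL, level) rows, and the fetch_page, extract_links, and relevance_fn helpers are the same hypothetical ones as in the previous sketch; how rescued pages hand their links back to the main crawl is omitted.

```python
# Minimal sketch of the tolerance-limit handling over IRRE_TABLE (illustrative only).
# irre_table is a list of (url, level) rows; URLs deeper than the tolerance limit
# are discarded, while pages that turn out to be relevant join the main repository.

def crawl_irrelevant(irre_table, tolerance_limit, relevance_fn, rel_lmt,
                     fetch_page, extract_links, repository):
    while irre_table:
        url, level = irre_table.pop(0)
        if level > tolerance_limit:      # deeper than the tolerance limit: discard
            continue
        text = fetch_page(url)
        rel_val = relevance_fn(text)
        if rel_val > rel_lmt:
            repository[url] = rel_val    # rescued relevant page joins the repository
        else:
            for link in extract_links(url, text):
                irre_table.append((link, level + 1))   # go one level deeper
    return repository
```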
The value of the tolerance limit is very important, because it determines the link traversal depth below an irrelevant Web-page. A high tolerance limit yields more domain-specific pages, but the performance of the crawler degrades; a low tolerance limit gives good performance, but fewer Web-pages are collected. An optimal value of the tolerance limit should therefore be assigned based on our experiments, so that the crawler achieves optimal performance.
Relevant Page Tree
Every crawler needs some seed URLs to retrieve Web-pages. To retrieve relevant Web-pages we need an Ontology [19–21], a Weight Table, and a Syntable [22–24]. First, the crawler takes one seed URL and calculates the relevance value of the corresponding page; if this page crosses the Relevance Limit then the crawler accepts the page, otherwise it rejects the page. Therefore, if the relevance value is greater than the predefined relevance limit, that Web-page is called a relevant Web-page. The crawler crawls through the relevant pages and continues until it reaches a certain predefined depth. Then the crawler takes another seed URL and repeats these operations until the seed URL database becomes empty. The above operations are performed directly over the Internet, and we generate a structure which is typically called RPaT. A sample RPaT is shown in Fig. 4. Each node in RPaT contains two parts: the page URL and the Relevance Value. Here pages a, b, and c are seed URLs and their relevance values are 32, 15, and 25, respectively.
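A minimal sketch of an RPaT node and its construction from one seed URL is given below; the node simply pairs a page URL with its relevance value, as in Fig. 4. The depth limit, helper functions, and recursion scheme are assumptions made for illustration, not the exact construction procedure.

```python
# Minimal sketch of an RPaT node and tree construction (illustrative only).
# fetch_page, extract_links, and relevance_fn are assumed helpers, and MAX_DEPTH
# stands in for the predefined crawling depth.

MAX_DEPTH = 5

class RPaTNode:
    def __init__(self, url, rel_val):
        self.url = url           # page URL
        self.rel_val = rel_val   # relevance value of the page
        self.children = []       # relevant pages reached from this page

def build_rpat(url, rel_lmt, relevance_fn, fetch_page, extract_links, depth=0):
    """Grow an RPaT from one seed URL, keeping only relevant pages up to MAX_DEPTH."""
    text = fetch_page(url)
    rel_val = relevance_fn(text)
    if rel_val <= rel_lmt:       # page rejected: not relevant to the domain
        return None
    node = RPaTNode(url, rel_val)
    if depth < MAX_DEPTH:
        for link in extract_links(url, text):
            child = build_rpat(link, rel_lmt, relevance_fn,
                               fetch_page, extract_links, depth + 1)
            if child is not None:
                node.children.append(child)
    return node
```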