Web Pages (web + page)


Selected Abstracts


SEARCHING FOR EXPLANATORY WEB PAGES USING AUTOMATIC QUERY EXPANSION

COMPUTATIONAL INTELLIGENCE, Issue 1 2007
Manabu Tauchi
When one tries to use the Web as a dictionary or encyclopedia, entering some single term into a search engine, the highly ranked pages in the result can include irrelevant or useless sites. The problem is that single-term queries, if taken literally, underspecify the type of page the user wants. For such problems automatic query expansion, also known as pseudo-feedback, is often effective. In this method the top n documents returned by an initial retrieval are used to provide terms for a second retrieval. This paper contributes, first, new normalization techniques for query expansion, and second, a new way of computing the similarity between an expanded query and a document, the "local relevance density" metric, which complements the standard vector product metric. Both of these techniques are shown to be useful for single-term queries, in Japanese, in experiments done over the World Wide Web in early 2001. [source]
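
The pseudo-feedback loop the abstract describes can be sketched in a few lines: run an initial retrieval, harvest frequent terms from the top n documents, and reissue an expanded query. The sketch below is a generic illustration of this idea, not the paper's normalization techniques or its local relevance density metric; the toy corpus and the scoring function are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for an initial Web retrieval (hypothetical data).
DOCS = {
    "d1": "jaguar is a large cat found in the americas",
    "d2": "the jaguar car company builds luxury vehicles",
    "d3": "jaguar cats hunt near rivers in south america",
}

def search(query_terms, docs):
    """Rank documents by how many query terms they contain (toy scoring)."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.split())
        score = sum(1 for t in query_terms if t in words)
        if score:
            scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]

def expand_query(query_terms, docs, top_n=2, extra_terms=3):
    """Pseudo-relevance feedback: add frequent terms from the top-n results."""
    top_docs = search(query_terms, docs)[:top_n]
    counts = Counter()
    for doc_id in top_docs:
        counts.update(w for w in docs[doc_id].split() if w not in query_terms)
    expansion = [w for w, _ in counts.most_common(extra_terms)]
    return list(query_terms) + expansion

print(expand_query(["jaguar"], DOCS))   # original term plus expansion terms
```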


Misery.com: A Year on the Headache Web Page

HEADACHE, Issue 1 2001
R.S. Singer MD
[source]


This Is Not Our Fathers' Generation: Web Pages, the Chicago Lyric Opera, and the Philadelphia Orchestra

THE JOURNAL OF POPULAR CULTURE, Issue 1 2002
Carolyn BoiarskyArticle first published online: 11 APR 200
First page of article [source]


Web Discovery and Filtering Based on Textual Relevance Feedback Learning

COMPUTATIONAL INTELLIGENCE, Issue 2 2003
Wai Lam
We develop a new approach for Web information discovery and filtering. Our system, called WID, allows the user to specify long-term information needs by means of various topic profile specifications. An entire example page or an index page can be accepted as input for the discovery. The system makes use of a simulated annealing algorithm to automatically explore new Web pages; simulated annealing possesses some favorable properties for fulfilling the discovery objectives. Information retrieval techniques are adopted to evaluate the content-based relevance of each page being explored, and hyperlink information, in addition to the textual content, is considered in evaluating the relevance score of a Web page. WID allows users to provide three forms of relevance feedback, namely positive page feedback, negative page feedback, and positive keyword feedback. The system is domain independent and does not rely on any prior knowledge or information about the Web content. Extensive experiments have been conducted to demonstrate the effectiveness of the discovery performance achieved by WID. [source]
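
As a rough illustration of how simulated annealing can drive page exploration, the sketch below accepts a lower-relevance neighbour with a probability that decays as the temperature cools. The relevance scores, link structure, and cooling schedule are hypothetical placeholders, not WID's actual components.

```python
import math
import random

# Hypothetical relevance scores for candidate pages (stand-in for a
# content- and link-based relevance evaluation).
RELEVANCE = {"pageA": 0.30, "pageB": 0.55, "pageC": 0.80, "pageD": 0.65}
NEIGHBOURS = {            # hypothetical hyperlink structure
    "pageA": ["pageB", "pageD"],
    "pageB": ["pageA", "pageC"],
    "pageC": ["pageB", "pageD"],
    "pageD": ["pageA", "pageC"],
}

def explore(start, temperature=1.0, cooling=0.9, steps=50, seed=0):
    """Simulated-annealing walk that tends toward higher-relevance pages."""
    rng = random.Random(seed)
    current = start
    best = start
    for _ in range(steps):
        candidate = rng.choice(NEIGHBOURS[current])
        delta = RELEVANCE[candidate] - RELEVANCE[current]
        # Always accept improvements; accept worse moves with probability
        # exp(delta / T), which shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
            if RELEVANCE[current] > RELEVANCE[best]:
                best = current
        temperature *= cooling
    return best

print(explore("pageA"))   # likely 'pageC', the most relevant toy page
```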


A Web page that provides map-based interfaces for VRML/X3D content

ELECTRONICS & COMMUNICATIONS IN JAPAN, Issue 2 2009
Yoshihiro Miyake
Abstract An electronic map is very useful for navigation in VRML/X3D virtual environments. Various map-based interfaces have been developed so far, but they lack generality because they were developed separately for individual VRML/X3D contents, forcing users to learn a different interface for each content. We have therefore developed a Web page that provides a common map-based interface for VRML/X3D contents on the Web. Users access VRML/X3D contents via the Web page, which automatically generates a simplified map by analyzing the scene graph of the downloaded content and embeds the mechanism that links the virtual world and the map. An avatar is automatically created and added to the map, and the user and the avatar are bidirectionally linked. In the simplified map, obstructive objects are removed and the remaining objects are replaced by base boxes. This paper proposes the architecture of the Web page and the method for generating simplified maps. Finally, an experimental system is developed to show the improvement in frame rates achieved by simplifying the map. © 2009 Wiley Periodicals, Inc. Electron Comm Jpn, 92(2): 28–37, 2009; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecj.10017 [source]
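
A simplified map of the kind described can be thought of as a projection of the scene graph onto the ground plane, with obstructive or tiny objects dropped and the rest reduced to their base boxes. The sketch below illustrates that idea on a made-up, flattened scene structure; the node format and the size threshold are assumptions, not the paper's actual VRML/X3D processing.

```python
# Hypothetical flattened scene graph: each node has a ground-plane position,
# a footprint size (width, depth), and a flag marking view-blocking objects.
SCENE = [
    {"name": "building", "pos": (10.0, 5.0), "size": (8.0, 6.0), "obstructive": False},
    {"name": "tree",     "pos": (3.0, 2.0),  "size": (1.0, 1.0), "obstructive": True},
    {"name": "fountain", "pos": (-4.0, 7.0), "size": (3.0, 3.0), "obstructive": False},
]

def simplified_map(scene, min_area=2.0):
    """Drop obstructive or tiny objects; replace the rest with base boxes."""
    boxes = []
    for node in scene:
        w, d = node["size"]
        if node["obstructive"] or w * d < min_area:
            continue  # removed from the map to keep it readable
        x, y = node["pos"]
        # Axis-aligned base box: (min_x, min_y, max_x, max_y) on the ground plane.
        boxes.append((node["name"], (x - w / 2, y - d / 2, x + w / 2, y + d / 2)))
    return boxes

for name, box in simplified_map(SCENE):
    print(name, box)
```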


Estimating and eliminating redundant data transfers over the web: a fragment based approach

INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS, Issue 2 2005
Christos Bouras
Abstract Redundant data transfers over the Web can be mainly attributed to repeated transfers of unchanged data. Web caches and Web proxies are among the solutions that have been proposed to deal with this issue. In this paper we focus on the efficient estimation and reduction of redundant data transfers over the Web. We first show that a vast amount of redundant data is transferred in Web pages that are considered to carry fresh data. We demonstrate this by following an approach based on Web page fragmentation and manipulation: Web pages are broken down into fragments based on specific criteria. We then treat these fragments as independent constructors of the Web page and study their change patterns both independently and in the context of the whole Web page. After the fragmentation process, we propose solutions for dealing with redundant data transfers. This paper builds on our previous work on 'Web Components' as well as on related work by other researchers. It utilises a proxy-based client/server architecture and imposes changes to the algorithms executed on the proxy server and on clients. We show that our proposed solution can considerably reduce the amount of redundant data transferred on the Web. Copyright © 2004 John Wiley & Sons, Ltd. [source]
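
One simple way to see how fragment-level change detection saves transfers is to hash each fragment and resend only those whose digests differ from the cached versions. The sketch below assumes a trivial splitter on a marker string and is only a generic illustration, not the paper's fragmentation criteria or proxy architecture.

```python
import hashlib

def fragments(page_html):
    """Toy fragmenter: split a page on a hypothetical '<!--frag-->' marker."""
    return [f.strip() for f in page_html.split("<!--frag-->") if f.strip()]

def digest(fragment):
    return hashlib.sha256(fragment.encode("utf-8")).hexdigest()

def changed_fragments(old_page, new_page):
    """Return only fragments of the new page whose content actually changed."""
    old_digests = {digest(f) for f in fragments(old_page)}
    return [f for f in fragments(new_page) if digest(f) not in old_digests]

old = "<p>headline v1</p><!--frag--><p>weather</p><!--frag--><p>footer</p>"
new = "<p>headline v2</p><!--frag--><p>weather</p><!--frag--><p>footer</p>"

# Only the headline fragment needs to be transferred again.
print(changed_fragments(old, new))   # ['<p>headline v2</p>']
```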


Web links and search engine ranking: The case of Google and the query "jew"

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 12 2006
Judit Bar-Ilan
The World Wide Web has become one of our more important information sources, and commercial search engines are the major tools for locating information; however, it is not enough for a Web page to be indexed by the search engines,it also must rank high on relevant queries. One of the parameters involved in ranking is the number and quality of links pointing to the page, based on the assumption that links convey appreciation for a page. This article presents the results of a content analysis of the links to two top pages retrieved by Google for the query "jew" as of July 2004: the "jew" entry on the free online encyclopedia Wikipedia, and the home page of "Jew Watch," a highly anti-Semitic site. The top results for the query "jew" gained public attention in April 2004, when it was noticed that the "Jew Watch" homepage ranked number 1. From this point on, both sides engaged in "Googlebombing" (i.e., increasing the number of links pointing to these pages). The results of the study show that most of the links to these pages come from blogs and discussion links, and the number of links pointing to these pages in appreciation of their content is extremely small. These findings have implications for ranking algorithms based on link counts, and emphasize the huge difference between Web links and citations in the scientific community. [source]


Bibliomining for automated collection development in a digital library setting: Using data mining to discover Web-based scholarly research works

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 12 2003
Scott Nicholson
This research creates an intelligent agent for automated collection development in a digital library setting. It uses a predictive model based on facets of each Web page to select scholarly works. The criteria came from the academic library selection literature, and a Delphi study was used to refine the list to 41 criteria. A Perl program was designed to analyze a Web page for each criterion and applied to a large collection of scholarly and nonscholarly Web pages. Bibliomining, or data mining for libraries, was then used to create different classification models. Four techniques were used: logistic regression, nonparametric discriminant analysis, classification trees, and neural networks. Accuracy and return were used to judge the effectiveness of each model on test datasets. In addition, a set of problematic pages that were difficult to classify because of their similarity to scholarly research was gathered and classified using the models. The resulting models could be used in the selection process to automatically create a digital library of Web-based scholarly research works. In addition, the technique can be extended to create a digital library of any type of structured electronic information. [source]
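
As a minimal illustration of facet-based classification of this kind, the sketch below fits a logistic regression on a handful of made-up page features; the feature list and the tiny training set are hypothetical, not the study's 41 Delphi-derived criteria or its trained models.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical facet vectors: [has_references, has_abstract, has_advertising,
# has_author_affiliation]; label 1 = scholarly, 0 = non-scholarly.
X = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
y = [1, 1, 0, 0, 1, 0]

model = LogisticRegression().fit(X, y)

# A new page with references and an abstract but also some advertising.
candidate = [[1, 1, 1, 0]]
print(model.predict(candidate))        # predicted class
print(model.predict_proba(candidate))  # class probabilities
```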


Effective page refresh policy

COMPUTER APPLICATIONS IN ENGINEERING EDUCATION, Issue 3 2007
Kai Gao
Abstract Web pages are created and updated at unpredictable times. A search engine therefore needs to keep up with the evolving Web, but previous studies have shown that a crawler's refresh ability is limited because changes are not easy to detect instantly, especially when resources are limited. This article models an effective Web page refresh policy and finds the refresh interval with the minimum total waiting time. The major concerns are how to model change and which parts should be updated more often. Toward this goal, a Poisson process is used to model the change process. Relevance is also used to adjust the process: the change probability of some sites is higher than that of others, so these sites are given more opportunities to be updated. This is essential when bandwidth or other resources are limited. The experimental results validate the feasibility of the approach. On the basis of the above work, an educational search engine has been developed. © 2007 Wiley Periodicals, Inc. Comput Appl Eng Educ 14: 240–247, 2007; Published online in Wiley InterScience (www.interscience.wiley.com); DOI 10.1002/cae.20155 [source]
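
Under a Poisson change model, a page with change rate λ has probability 1 − e^(−λt) of having changed t days after the last crawl, so λ can be estimated from observed change history and used to order pages for refreshing. The sketch below is a generic illustration of that idea under assumed per-page change counts, not the article's exact scheduling model.

```python
import math

# Hypothetical observation: number of detected changes over a monitoring window.
HISTORY_DAYS = 30
OBSERVED_CHANGES = {"news_page": 24, "course_page": 3, "archive_page": 1}

def change_rate(changes, days=HISTORY_DAYS):
    """Maximum-likelihood Poisson rate: changes per day."""
    return changes / days

def prob_changed(rate, days_since_crawl):
    """P(page changed at least once) = 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-rate * days_since_crawl)

def refresh_order(pages, days_since_crawl=2):
    """Refresh the pages most likely to have changed first."""
    scored = [(prob_changed(change_rate(c), days_since_crawl), p)
              for p, c in pages.items()]
    return sorted(scored, reverse=True)

for prob, page in refresh_order(OBSERVED_CHANGES):
    print(f"{page}: P(changed) = {prob:.2f}")
```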


That site looks 88.46% familiar: quantifying similarity of Web page design

EXPERT SYSTEMS, Issue 3 2005
Giselle Martine
Abstract: Web page design guidelines produce a pressure towards uniformity; excessive uniformity lays a Web page designer open to accusations of plagiarism. In the past, assessment of similarity between visual products such as Web pages has involved an uncomfortably high degree of subjectivity. This paper describes a method for measuring perceived similarity of visual products which avoids previous problems with subjectivity, and which makes it possible to pool results from respondents without the need for intermediate coding. This method is based on co-occurrence matrices derived from card sorts. It can also be applied to other areas of software development, such as systems analysis and market research. [source]
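
The co-occurrence idea is straightforward to compute: for each respondent's card sort, count how often each pair of pages lands in the same group, then aggregate the counts across respondents into a similarity matrix. The sketch below uses made-up sorts for four pages; it illustrates the general technique rather than the paper's exact measure.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical card sorts: each respondent groups four page identifiers.
SORTS = [
    [["p1", "p2"], ["p3", "p4"]],
    [["p1", "p2", "p3"], ["p4"]],
    [["p1"], ["p2", "p3", "p4"]],
]

def co_occurrence(sorts):
    """Count, over all respondents, how often each pair shares a group."""
    counts = defaultdict(int)
    for sort in sorts:
        for group in sort:
            for a, b in combinations(sorted(group), 2):
                counts[(a, b)] += 1
    return counts

def similarity(counts, pair, n_respondents):
    """Proportion of respondents who placed the pair in the same group."""
    return counts.get(tuple(sorted(pair)), 0) / n_respondents

matrix = co_occurrence(SORTS)
print(similarity(matrix, ("p1", "p2"), len(SORTS)))  # 2/3 ≈ 0.67
print(similarity(matrix, ("p1", "p4"), len(SORTS)))  # 0.0: never grouped together
```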


A metagenetic algorithm for information filtering and collection from the World Wide Web

EXPERT SYSTEMS, Issue 2 2001
Z.N. Zacharis
This paper describes the implementation of evolutionary techniques for information filtering and collection from the World Wide Web. We consider the problem of building intelligent agents to facilitate a person's search for information on the Web. An intelligent agent has been developed that uses a metagenetic algorithm in order to collect and recommend Web pages that will be interesting to the user. The user's feedback on the agent's recommendations drives the learning process to adapt the user's profile with his/her interests. The software agent utilizes the metagenetic algorithm to explore the search space of user interests. Experimental results are presented in order to demonstrate the suitability of the metagenetic algorithm's approach on the Web. [source]


Reorganizing web sites based on user access patterns

INTELLIGENT SYSTEMS IN ACCOUNTING, FINANCE & MANAGEMENT, Issue 1 2002
Yongjian Fu
In this paper, an approach for reorganizing Web sites based on user access patterns is proposed. Our goal is to build adaptive Web sites by evolving site structure to facilitate user access. The approach consists of three steps: preprocessing, page classification, and site reorganization. In preprocessing, pages on a Web site are processed to create an internal representation of the site. Page access information of its users is extracted from the Web server log. In page classification, the Web pages on the site are classified into two categories, index pages and content pages, based on the page access information. After the pages are classified, in site reorganization, the Web site is examined to find better ways to organize and arrange the pages on the site. An algorithm for reorganizing Web sites has been developed. Our experiments on a large real data set show that the approach is efficient and practical for adaptive Web sites. Copyright © 2002 John Wiley & Sons, Ltd. [source]
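
A crude but common heuristic for the index-versus-content distinction is the ratio of anchor (link) text to total text: index pages are dominated by links, content pages by prose. The sketch below implements that heuristic on toy page statistics; the threshold and the page records are assumptions, not the paper's classifier.

```python
# Hypothetical per-page statistics extracted during preprocessing.
PAGES = {
    "/index.html":     {"anchor_words": 180, "total_words": 220},
    "/article42.html": {"anchor_words": 15,  "total_words": 950},
    "/sitemap.html":   {"anchor_words": 300, "total_words": 330},
}

def classify(stats, link_text_threshold=0.5):
    """Label a page 'index' if most of its text is anchor text, else 'content'."""
    ratio = stats["anchor_words"] / max(stats["total_words"], 1)
    return "index" if ratio >= link_text_threshold else "content"

for url, stats in PAGES.items():
    print(url, "->", classify(stats))
```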


Using the moving average rule in a dynamic web recommendation system

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, Issue 6 2007
Yi-Jen Su
In this, the Information Age, most people are accustomed to gleaning information from the World Wide Web. To survive and prosper, a Web site has to constantly enliven its content while providing various and extensive information services to attract users. The Web Recommendation System, a personalized information filter, prompts users to visit a Web site and browse at a deeper level. In general, most of the recommendation systems use large browsing logs to identify and predict users' surfing habits. The process of pattern discovery is time-consuming, and the result is static. Such systems do not satisfy the end users' goal-oriented and dynamic demands. Accordingly, a pressing need for an adaptive recommendation system comes into play. This article proposes a novel Web recommendation system framework, based on the Moving Average Rule, which can respond to new navigation trends and dynamically adapts recommendations for users with suitable suggestions through hyperlinks. The framework provides Web site administrators with various methods to generate recommendations. It also responds to new Web trends, including Web pages that have been updated but have not yet been integrated into regular browsing patterns. Ultimately, this research enables Web sites with dynamic intelligence to effectively tailor users' needs. © 2007 Wiley Periodicals, Inc. Int J Int Syst 22: 621–639, 2007. [source]
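
The moving-average rule borrowed from technical analysis compares a short-window and a long-window average of page accesses: when the short average rises above the long one, the page is trending and becomes a candidate recommendation. The sketch below applies that rule to a made-up daily access series; the window sizes and data are assumptions, not the article's configuration.

```python
def moving_average(series, window):
    """Simple moving average of the last `window` values."""
    tail = series[-window:]
    return sum(tail) / len(tail)

def is_trending(daily_hits, short=3, long=7):
    """Moving-average rule: short-term interest exceeds the long-term baseline."""
    if len(daily_hits) < long:
        return False
    return moving_average(daily_hits, short) > moving_average(daily_hits, long)

# Hypothetical daily access counts for two pages over ten days.
page_hits = {
    "/new-course.html": [2, 3, 2, 4, 5, 9, 14, 18, 22, 25],
    "/old-faq.html":    [30, 28, 27, 29, 26, 25, 24, 22, 21, 20],
}

recommended = [url for url, hits in page_hits.items() if is_trending(hits)]
print(recommended)   # ['/new-course.html']
```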


Handling linguistic web information based on a multi-agent system

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, Issue 5 2007
Zheng Pei
Much information on the Internet is expressed in natural language. The management of linguistic information involves operations of comparison and aggregation. Based on the Ordered Weighted Averaging (OWA) operator and on modifying the indexes of linguistic terms (their indexes are fuzzy numbers on [0, T] ⊂ R+), new linguistic aggregation methods are presented and their properties are discussed. Also, based on a multi-agent system and the new linguistic aggregation methods, gathering linguistic information over the Internet is discussed. Moreover, by fixing a threshold, "soft filtering" of information is proposed, and better Web pages (or documents) matching the user's needs are obtained. © 2007 Wiley Periodicals, Inc. Int J Int Syst 22: 435–453, 2007. [source]
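
The OWA operator aggregates a set of scores by first sorting them in descending order and then taking a weighted sum with a fixed weight vector, so the weights attach to ranked positions rather than to particular sources. The sketch below shows the basic operator on numeric term indexes; the weight vectors and scores are illustrative assumptions, not the paper's linguistic model.

```python
def owa(values, weights):
    """Ordered Weighted Averaging: weights apply to ranked positions."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)        # descending order
    return sum(w * v for w, v in zip(weights, ordered))

# Hypothetical indexes of linguistic terms (e.g. 0 = 'none' ... 6 = 'perfect')
# assigned to one Web page by three different agents.
scores = [5, 2, 4]

print(owa(scores, [1/3, 1/3, 1/3]))  # plain average: ~3.67
print(owa(scores, [0.6, 0.3, 0.1]))  # optimistic weighting emphasizes the best score
```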


Internet-based cognitive behavioral therapy for tinnitus

JOURNAL OF CLINICAL PSYCHOLOGY, Issue 2 2004
Gerhard Andersson
Tinnitus is a common otological problem that is often resistant to surgical or medical interventions. In common with chronic pain, cognitive-behavioral treatment has been found to alleviate the distress and improve the functioning of tinnitus patients. Recently, a self-help treatment has been developed for use via the Internet. In this article, we describe the self-help program and apply it to a middle-aged woman with tinnitus. We report the case formulation, which was done in a structured interview, and the treatment interactions, which were conducted via e-mail. The self-help program was presented on Web pages, and weekly diaries were submitted to follow progress and give feedback. The treatment was successful with reductions of tinnitus-related annoyance and anxious and depressive mood. Implications for Internet administration of self-help treatment are discussed. © 2003 Wiley Periodicals, Inc. J Clin Psychol/In Session. [source]


Hyperlink Analyses of the World Wide Web: A Review

JOURNAL OF COMPUTER-MEDIATED COMMUNICATION, Issue 4 2003
Han Woo Park
We have recently witnessed the growth of hyperlink studies in the field of Internet research. Although investigations have been conducted across many disciplines and topics, their approaches can be largely divided into hyperlink network analysis (HNA) and Webometrics. This article is an extensive review of the two analytical methods, and a reflection on their application. HNA casts hyperlinks between Web sites (or Web pages) as social and communicational ties, applying standard techniques from Social Networks Analysis to this new data source. Webometrics has tended to apply much simpler techniques combined with a more in-depth investigation into the validity of hypotheses about possible interpretations of the results. We conclude that hyperlinks are a highly promising but problematic new source of data that can be mined for previously hidden patterns of information, although much care must be taken in the collection of raw data and in the interpretation of the results. In particular, link creation is an unregulated phenomenon and so it would not be sensible to assume that the meaning of hyperlinks in any given context is evident, without a systematic study of the context of link creation, and of the relationship between link counts, among other measurements. Social Networks Analysis tools and techniques form an excellent resource for hyperlink analysis, but should only be used in conjunction with improved techniques for data collection, validation and interpretation. [source]


The quality of patient-orientated Internet information on oral lichen planus: a pilot study

JOURNAL OF EVALUATION IN CLINICAL PRACTICE, Issue 5 2010
Pía López-Jornet PhD MD DDS
Abstract Objective: This study examines the accessibility and quality of Web pages related to oral lichen planus. Methods: Sites were identified using two search engines (Google and Yahoo!) and the search terms 'oral lichen planus' and 'oral lesion lichenoid'. The first 100 sites in each search were visited and classified. The Web sites were evaluated for content quality using the validated DISCERN rating instrument, the JAMA benchmarks, and the 'Health on the Net' (HON) seal. Results: A total of 109 000 sites were recorded in Google using the search terms and 520 000 in Yahoo! A total of 19 Web pages considered relevant were examined on Google and 20 on Yahoo! As regards the JAMA benchmarks, only two pages satisfied all four criteria in Google (10%), and only three (15%) in Yahoo! As regards DISCERN, the overall quality of Web site information was poor, with no site reaching the maximum score. In Google 78.94% of sites had important deficiencies, compared with 50% in Yahoo!, the difference between the two search engines being statistically significant (P = 0.031). Only five pages (17.2%) on Google and eight (40%) on Yahoo! displayed the HON code. Conclusion: Based on our review, doctors must assume primary responsibility for educating and counselling their patients. [source]


TEXTUAL REPRESENTATION OF DIVERSITY IN COAMFTE ACCREDITED DOCTORAL PROGRAMS

JOURNAL OF MARITAL AND FAMILY THERAPY, Issue 1 2006
John J. Lawless
The use of the Internet is growing at a staggering pace. One significant use of the Internet is for potential students and the parents of potential students to explore educational possibilities. Along these lines, potential marriage and family therapy students may have many questions, including about a program's commitment to cultural diversity. This study utilized qualitative content analysis methodology in combination with critical race theory to examine how Commission on Accreditation for Marriage and Family Therapy Education (COAMFTE) accredited doctoral programs represented cultural text on their World Wide Web pages. Findings indicate that many COAMFTE-accredited doctoral programs re-present programmatic information about diversity in ways that appear to be incongruent with cultural sensitivity. These apparent incongruities are highlighted by the codified, inconsistent, and isolated use of cultural text. In addition, cultural text related to social justice was absent. Implications and suggestions are discussed. [source]


Identifying similar pages in Web applications using a competitive clustering algorithm

JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE, Issue 5 2007
Andrea De Lucia
Abstract We present an approach based on Winner Takes All (WTA), a competitive clustering algorithm, to support the comprehension of static and dynamic Web applications during Web application reengineering. This approach adopts a process that first computes the distance between Web pages and then identifies and groups similar pages using the considered clustering algorithm. We present an instance of application of the clustering process to identify similar pages at the structural level. The page structure is encoded into a string of HTML tags and then the distance between Web pages at the structural level is computed using the Levenshtein string edit distance algorithm. A prototype to automate the clustering process has been implemented that can be extended to other instances of the process, such as the identification of groups of similar pages at content level. The approach and the tool have been evaluated in two case studies. The results have shown that the WTA clustering algorithm suggests heuristics to easily identify the best partition of Web pages into clusters among the possible partitions. Copyright © 2007 John Wiley & Sons, Ltd. [source]
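
The structural distance the abstract describes can be illustrated with a standard Levenshtein edit distance computed over sequences of HTML tag names. The sketch below encodes two toy pages as tag sequences and normalizes the edit distance by the longer length; the encoding and the normalization are generic choices, not necessarily the exact ones used in the paper.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def structural_distance(tags_a, tags_b):
    """Edit distance on tag sequences, normalized to [0, 1]."""
    return levenshtein(tags_a, tags_b) / max(len(tags_a), len(tags_b), 1)

# Hypothetical tag sequences extracted from two pages of the same application.
page1 = ["html", "head", "title", "body", "table", "tr", "td", "a"]
page2 = ["html", "head", "title", "body", "table", "tr", "td", "img"]

print(structural_distance(page1, page2))   # 0.125: one differing tag out of eight
```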


Scatter matters: Regularities and implications for the scatter of healthcare information on the Web

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 4 2010
Suresh K. Bhavnani
Abstract Despite the development of huge healthcare Web sites and powerful search engines, many searchers end their searches prematurely with incomplete information. Recent studies suggest that users often retrieve incomplete information because of the complex scatter of relevant facts about a topic across Web pages. However, little is understood about regularities underlying such information scatter. To probe regularities within the scatter of facts across Web pages, this article presents the results of two analyses: (a) a cluster analysis of Web pages that reveals the existence of three page clusters that vary in information density and (b) a content analysis that suggests the role each of the above-mentioned page clusters play in providing comprehensive information. These results provide implications for the design of Web sites, search tools, and training to help users find comprehensive information about a topic and for a hypothesis describing the underlying mechanisms causing the scatter. We conclude by briefly discussing how the analysis of information scatter, at the granularity of facts, complements existing theories of information-seeking behavior. [source]


Detection of access to terror-related Web sites using an Advanced Terror Detection System (ATDS)

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 2 2010
Yuval Elovici
Terrorist groups use the Web as their infrastructure for various purposes. One example is the forming of new local cells that may later become active and perform acts of terror. The Advanced Terrorist Detection System (ATDS) is aimed at tracking down online access to abnormal content, which may include terrorist-generated sites, by analyzing the content of information accessed by the Web users. ATDS operates in two modes: the training mode and the detection mode. In the training mode, ATDS determines the typical interests of a prespecified group of users by processing the Web pages accessed by these users over time. In the detection mode, ATDS performs real-time monitoring of the Web traffic generated by the monitored group, analyzes the content of the accessed Web pages, and issues an alarm if the accessed information is not within the typical interests of that group and is similar to the terrorist interests. An experimental version of ATDS was implemented and evaluated in a local network environment. The results suggest that, when optimally tuned, the system can reach detection rates of up to 100% in the case of continuous access to a series of terrorist Web pages. [source]
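
The detection-mode check can be pictured as a similarity test: represent both the group's typical interests and each newly accessed page as term vectors, and raise an alarm when a page is too far from the interest centroid. The sketch below uses cosine similarity over bag-of-words vectors with an arbitrary threshold; it is a generic anomaly-detection illustration, not ATDS's actual model.

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical centroid of the monitored group's typical interests.
typical = vectorize("course schedule exam library sports campus news lecture")

def check_page(page_text, threshold=0.2):
    """Flag a page whose content is too dissimilar from the typical interests."""
    score = cosine(vectorize(page_text), typical)
    return ("ALARM" if score < threshold else "ok", round(score, 2))

print(check_page("lecture notes and exam schedule for the course"))
print(check_page("propaganda and recruitment material of a violent group"))
```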


A method for measuring the evolution of a topic on the Web: The case of "informetrics"

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 9 2009
Judit Bar-Ilan
The universe of information has been enriched by the creation of the World Wide Web, which has become an indispensable source for research. Since this source is growing at an enormous speed, an in-depth look at its performance and a method for its evaluation have become necessary; however, growth is not the only process that influences the evolution of the Web. During their lifetime, Web pages may change their content and their links to/from other Web pages, be duplicated or moved to a different URL, be removed from the Web either temporarily or permanently, and be temporarily inaccessible due to server and/or communication failures. To obtain a better understanding of these processes, we developed a method for tracking topics on the Web over long periods of time, without the need to employ a crawler and relying only on publicly available resources. The multiple data-collection methods used allow us to discover new pages related to the topic, to identify changes to existing pages, and to detect previously existing pages that have been removed or whose content is no longer relevant to the specified topic. The method is demonstrated by monitoring Web pages that contain the term "informetrics" for a period of 8 years. The data-collection method also allowed us to analyze dynamic changes in search engine coverage, illustrated here on Google, the search engine used for the longest period of time for data collection in this project. [source]


Controlled user evaluations of information visualization interfaces for text retrieval: Literature review and meta-analysis

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 6 2008
Charles-Antoine Julien
This review describes experimental designs (users, search tasks, measures, etc.) used by 31 controlled user studies of information visualization (IV) tools for textual information retrieval (IR) and a meta-analysis of the reported statistical effects. Comparable experimental designs allow research designers to compare their results with other reports, and support the development of experimentally verified design guidelines concerning which IV techniques are better suited to which types of IR tasks. The studies generally use a within-subject design with 15 or more undergraduate students performing browsing to known-item tasks on sets of at least 1,000 full-text articles or Web pages on topics of general interest/news. Results of the meta-analysis (N = 8) showed no significant effects of the IV tool as compared with a text-only equivalent, but the set shows great variability suggesting an inadequate basis of comparison. Experimental design recommendations are provided which would support comparison of existing IV tools for IR usability testing. [source]


Data cleansing for Web information retrieval using query independent features

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 12 2007
Yiqun Liu
Understanding what kinds of Web pages are the most useful for Web search engine users is a critical task in Web information retrieval (IR). Most previous works used hyperlink analysis algorithms to solve this problem. However, little research has been focused on query-independent Web data cleansing for Web IR. In this paper, we first provide analysis of the differences between retrieval target pages and ordinary ones based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We further propose a learning-based data cleansing algorithm for reducing Web pages that are unlikely to be useful for user requests. We found that there exists a large proportion of low-quality Web pages in both the English and the Chinese Web page corpus, and retrieval target pages can be identified using query-independent features and cleansing algorithms. The experimental results showed that our algorithm is effective in reducing a large portion of Web pages with a small loss in retrieval target pages. It makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance. [source]
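
Query-independent cleansing can be approximated by a classifier over simple page-quality features (content length, anchor-text ratio, URL depth, and the like) that filters out pages unlikely ever to be retrieval targets. The sketch below trains a small decision tree on made-up features purely to illustrate the idea; the features, labels, and data are assumptions, not the learned model from the paper.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical query-independent features per page:
# [content_length_kb, anchor_text_ratio, url_depth]
# label 1 = plausible retrieval target, 0 = low-quality page to cleanse.
X = [
    [12.0, 0.10, 1],
    [30.0, 0.05, 2],
    [0.4,  0.90, 5],
    [1.0,  0.75, 6],
    [25.0, 0.15, 2],
    [0.8,  0.80, 4],
]
y = [1, 1, 0, 0, 1, 0]

cleanser = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

corpus = {
    "/paper/retrieval-models.html": [18.0, 0.08, 2],
    "/tag/page/37/index.html":      [0.6, 0.85, 5],
}
kept = [url for url, feats in corpus.items() if cleanser.predict([feats])[0] == 1]
print(kept)   # pages retained after cleansing
```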


Metrics for the scope of a collection

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 12 2005
Robert B. Allen
Some collections cover many topics, while others are narrowly focused on a limited number of topics. We introduce the concept of the "scope" of a collection of documents and compare two ways of measuring it. These measures are based on the distances between documents. The first uses the overlap of words between pairs of documents. The second uses a novel method that calculates the semantic relatedness of pairs of words drawn from the documents; those values are combined to obtain an overall distance between the documents. The main validation of the measures compared Web pages categorized by Yahoo. Sets of pages sampled from broad categories were determined to have a higher scope than sets derived from subcategories. The measure was significant and confirmed the expected difference in scope. Finally, we discuss other measures related to scope. [source]
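
The first, word-overlap measure can be sketched directly: compute a Jaccard-style distance for every pair of documents and take the mean pairwise distance as the collection's scope, so narrowly focused collections score low and broad ones score high. The toy documents below are hypothetical, and the paper's exact combination rule may differ.

```python
from itertools import combinations

def word_distance(doc_a, doc_b):
    """1 minus the Jaccard overlap of the documents' word sets."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def scope(collection):
    """Mean pairwise distance: higher means the collection covers more topics."""
    pairs = list(combinations(collection, 2))
    return sum(word_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical collections: one narrow, one broad.
narrow = ["jazz piano chord voicings", "jazz piano improvisation lessons",
          "jazz piano practice routines"]
broad = ["jazz piano chord voicings", "growing tomatoes in containers",
         "introduction to quantum computing"]

print(round(scope(narrow), 2))  # smaller value: documents share vocabulary
print(round(scope(broad), 2))   # larger value: little shared vocabulary
```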


How users assess Web pages for information seeking

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 4 2005
Anastasios Tombros
In this article, we investigate the criteria used by online searchers when assessing the relevance of Web pages for information-seeking tasks. Twenty-four participants were given three tasks each, and they indicated the features of Web pages that they used when deciding about the usefulness of the pages in relation to the tasks. These tasks were presented within the context of a simulated work-task situation. We investigated the relative utility of features identified by participants (Web page content, structure, and quality) and how the importance of these features is affected by the type of information-seeking task performed and the stage of the search. The results of this study provide a set of criteria used by searchers to decide about the utility of Web pages for different types of tasks. Such criteria can have implications for the design of systems that use or recommend Web pages. [source]


Quality control in scholarly publishing: A new proposal

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 11 2003
Stefano Mizzaro
The Internet has fostered a faster, more interactive and effective model of scholarly publishing. However, as the quantity of information available is constantly increasing, its quality is threatened, since the traditional quality control mechanism of peer review is often not used (e.g., in online repositories of preprints, and by people publishing whatever they want on their Web pages). This paper describes a new kind of electronic scholarly journal, in which the standard submission-review-publication process is replaced by a more sophisticated approach, based on judgments expressed by the readers: in this way, each reader is, potentially, a peer reviewer. New ingredients, not found in similar approaches, are that each reader's judgment is weighted on the basis of the reader's skills as a reviewer, and that readers are encouraged to express correct judgments by a feedback mechanism that estimates their own quality. The new electronic scholarly journal is described in both intuitive and formal ways. Its effectiveness is tested by several laboratory experiments that simulate what might happen if the system were deployed and used. [source]
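
The core aggregation the proposal relies on can be illustrated as a weighted mean: each reader's judgment of an article counts in proportion to that reader's estimated quality as a reviewer, and a reader's quality can in turn be nudged up or down according to how far their judgments sit from the weighted consensus. The sketch below is a simplified toy of that feedback loop, with made-up scores and an ad hoc update rule rather than the paper's formal model.

```python
# Hypothetical judgments (0-10) of one article by four readers,
# and each reader's current estimated quality as a reviewer (0-1).
judgments = {"reader1": 8.0, "reader2": 7.5, "reader3": 2.0, "reader4": 8.5}
quality   = {"reader1": 0.9, "reader2": 0.8, "reader3": 0.4, "reader4": 0.7}

def weighted_score(judgments, quality):
    """Article score: judgments weighted by each reader's reviewer quality."""
    total_weight = sum(quality[r] for r in judgments)
    return sum(quality[r] * s for r, s in judgments.items()) / total_weight

def update_quality(judgments, quality, score, learning_rate=0.05):
    """Ad hoc feedback: readers far from the consensus lose a little credit."""
    updated = {}
    for r, s in judgments.items():
        error = abs(s - score) / 10.0             # normalized disagreement
        updated[r] = min(1.0, max(0.0, quality[r] + learning_rate * (0.5 - error)))
    return updated

score = weighted_score(judgments, quality)
print(round(score, 2))                           # weighted consensus, ~7.1
print(update_quality(judgments, quality, score)) # only the outlier reader3 loses credit
```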