Latent Semantic Indexing


Selected Abstracts


Unified linear subspace approach to semantic analysis

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 1 2010
Dandan Li
The Basic Vector Space Model (BVSM) is well known in information retrieval. Unfortunately, its retrieval effectiveness is limited because it is based on literal term matching. The Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) are two prominent semantic retrieval methods, both of which assume there is some underlying latent semantic structure in a dataset that can be used to improve retrieval performance. However, while this structure may be derived from both the term space and the document space, GVSM exploits only the former and LSI the latter. In this article, the latent semantic structure of a dataset is examined from a dual perspective; namely, we consider the term space and the document space simultaneously. This new viewpoint has a natural connection to the notion of kernels. Specifically, a unified kernel function can be derived for a class of vector space models. The dual perspective provides a deeper understanding of the semantic space and makes transparent the geometrical meaning of the unified kernel function. New semantic analysis methods based on the unified kernel function are developed, which combine the advantages of LSI and GVSM. We also prove that the new methods are stable: even when the selected rank of the truncated Singular Value Decomposition (SVD) is far from the optimum, retrieval performance is not significantly degraded. Experiments performed on standard test collections show that our methods are promising. [source]
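The truncated-SVD step that underlies LSI (the baseline this abstract builds on) can be sketched in a few lines of numpy. This is a minimal illustration with a toy term-document matrix, not the paper's unified kernel method; the matrix entries and query are invented for the example.

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
# Real collections are large and sparse; this is illustrative only.
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

# LSI step: keep the k dominant singular triplets of A.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Fold a query vector into the k-dimensional latent space:
# q_hat = S_k^{-1} U_k^T q
q = np.array([1, 1, 0, 0], dtype=float)
q_hat = np.diag(1.0 / sk) @ Uk.T @ q

# Rank documents by cosine similarity in the latent space.
doc_coords = Vk  # each row is one document's latent coordinates
sims = doc_coords @ q_hat / (
    np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_hat) + 1e-12
)
ranking = np.argsort(-sims)
```

The choice of rank k is exactly the parameter whose sensitivity the abstract's stability result addresses: in plain LSI, retrieval quality can depend noticeably on k.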


Ranking and selecting terms for text categorization via SVM discriminate boundary

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, Issue 2 2010
Tien-Fang Kuo
The problem of natural language document categorization consists of classifying documents into predetermined categories based on their contents. Each distinct term, or word, in the documents is a feature for representing a document. In general, the number of terms may be extremely large, and dozens of redundant terms may be included, which may reduce classification performance. In this paper, a support vector machine (SVM)-based feature ranking and selecting method for text categorization is proposed. The contribution of each term to classification is calculated based on the nonlinear discriminant boundary generated by the SVM. The results of experiments on several real-world data sets show that the proposed method extracts a smaller number of important terms and achieves higher classification performance than existing feature-selection methods based on latent semantic indexing and χ2 statistics. © 2009 Wiley Periodicals, Inc. [source]
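The general idea of ranking terms by their contribution to a learned decision boundary can be sketched as follows. Note the hedge: the paper scores terms against the SVM's *nonlinear* boundary, whereas this dependency-free sketch substitutes a regularized least-squares linear classifier as a stand-in, and the document-term matrix and labels are invented toy data.

```python
import numpy as np

# Toy document-term matrix X (rows = documents, columns = terms)
# and binary class labels y in {-1, +1}. Illustrative data only.
X = np.array([
    [2, 0, 1, 0],
    [3, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 1, 3, 2],
], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

# Linear surrogate for the SVM weight vector:
# w = (X^T X + lam I)^{-1} X^T y  (ridge-regularized least squares).
# The paper itself uses the nonlinear SVM discriminant boundary.
lam = 0.1
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Rank terms by the magnitude of their contribution |w_j| to the
# decision boundary, then keep the top-m as the selected features.
m = 2
ranking = np.argsort(-np.abs(w))
selected = ranking[:m]
```

The same rank-then-truncate pattern applies whatever scoring function is used; the paper's point is that boundary-derived scores outperform scores computed independently of the classifier, such as χ2 statistics.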


Document classification techniques for automated technology readiness level analysis

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, Issue 4 2008
Barry L. Britt
The overhead of assessing technology readiness for deployment and investment purposes can be costly to both large and small businesses. Recent advances in the automatic interpretation of technology readiness levels (TRLs) of a given technology can substantially reduce the risk and associated cost of bringing these new technologies to market. Using vector-space information-retrieval models, such as latent semantic indexing, it is feasible to group similar technology descriptions by exploiting the latent structure of term usage within textual documents. Once the documents have been semantically clustered (or grouped), they can be classified based on the TRL scores of (known) nearest-neighbor documents. Three automated (no human curation) strategies for assigning TRLs to documents are discussed, with accuracies as high as 86% achieved for two-class problems. [source]
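The classification step described above, assigning a TRL from the scores of known nearest-neighbor documents in the latent space, can be sketched as a majority vote. All coordinates, labels, and the `assign_trl` helper below are illustrative assumptions, not the paper's actual strategies.

```python
import numpy as np

# Latent-space coordinates for five documents (e.g., from a truncated
# SVD of the term-document matrix). The first four have known TRLs;
# the fifth is the document to classify. Toy values only.
coords = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.85, 0.15],  # unlabeled document
])
trl = np.array([3, 3, 8, 8])  # TRL scores of the four labeled documents

def assign_trl(query, labeled_coords, labels, n_neighbors=3):
    """Assign a TRL by majority vote over the nearest labeled neighbors."""
    dists = np.linalg.norm(labeled_coords - query, axis=1)
    nearest = np.argsort(dists)[:n_neighbors]
    vals, counts = np.unique(labels[nearest], return_counts=True)
    return vals[np.argmax(counts)]

predicted = assign_trl(coords[4], coords[:4], trl)  # → 3
```

Here the unlabeled document sits near the two TRL-3 documents, so two of its three nearest neighbors vote for TRL 3 and that label wins the vote.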


Lanczos and the Riemannian SVD in information retrieval applications

NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS, Issue 4 2005
Ricardo D. Fierro
Abstract Variations of the latent semantic indexing (LSI) method in information retrieval (IR) require the computation of singular subspaces associated with the k dominant singular values of a large m × n sparse matrix A, where k ≪ min(m, n). The Riemannian SVD was recently generalized to low-rank matrices arising in IR and shown to be an effective approach for formulating an enhanced semantic model that captures the latent term-document structure of the data. However, in terms of storage and computation requirements, its implementation can be much improved for large-scale applications. We discuss an efficient and reliable algorithm, called SPK-RSVD-LSI, as an alternative approach for deriving the enhanced semantic model. The algorithm combines the generalized Riemannian SVD and the Lanczos method with full reorthogonalization and explicit restart strategies. We demonstrate that our approach performs as well as the original low-rank Riemannian SVD method by comparing their retrieval performance on a well-known benchmark document collection. Copyright 2004 John Wiley & Sons, Ltd. [source]
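The Lanczos ingredient mentioned above can be sketched with Golub-Kahan-Lanczos bidiagonalization plus full reorthogonalization, which approximates the dominant singular values of A from a small bidiagonal matrix. This is a minimal sketch of that building block under simplified assumptions (dense random test matrix, no restarts); it is not the SPK-RSVD-LSI algorithm itself.

```python
import numpy as np

def lanczos_bidiag(A, k, seed=0):
    """Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization.

    Returns the k x k upper-bidiagonal matrix B = U_k^T A V_k whose
    singular values approximate the dominant singular values of A.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    V = np.zeros((n, k))
    U = np.zeros((m, k))
    alphas = np.zeros(k)
    betas = np.zeros(k - 1)

    v = rng.standard_normal(n)
    V[:, 0] = v / np.linalg.norm(v)
    p = A @ V[:, 0]
    alphas[0] = np.linalg.norm(p)
    U[:, 0] = p / alphas[0]

    for j in range(k - 1):
        r = A.T @ U[:, j] - alphas[j] * V[:, j]
        r -= V[:, :j + 1] @ (V[:, :j + 1].T @ r)  # full reorthogonalization
        betas[j] = np.linalg.norm(r)
        V[:, j + 1] = r / betas[j]
        p = A @ V[:, j + 1] - betas[j] * U[:, j]
        p -= U[:, :j + 1] @ (U[:, :j + 1].T @ p)  # full reorthogonalization
        alphas[j + 1] = np.linalg.norm(p)
        U[:, j + 1] = p / alphas[j + 1]

    # alphas on the diagonal, betas on the superdiagonal.
    return np.diag(alphas) + np.diag(betas, 1)

# Compare the dominant singular values recovered from B against a
# full SVD on a small dense test matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 40))
B = lanczos_bidiag(A, k=20)
approx = np.linalg.svd(B, compute_uv=False)[:3]
exact = np.linalg.svd(A, compute_uv=False)[:3]
```

Full reorthogonalization (the two projection lines in the loop) trades extra flops for numerical robustness, preventing the loss of orthogonality that plain Lanczos iterations suffer in floating point; the restart strategies the paper adds keep k, and hence storage, small for large sparse collections.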