Abstract:
This paper describes the first implementation of POIROT, a relevance-based web search agent that determines the user's interests by inspecting the pages bookmarked in the web browser and extracting keywords using information theory methods such as TF-IDF. The keywords are used to build a training set that is processed by an Inductive Logic Programming (ILP) algorithm, which learns what is "relevant" to the user. The rules generated with ILP are used to expand user queries and to rank the results. POIROT also models the behavior of the most important Internet search engines to determine which one to use depending on the topic of the search. An important design consideration of POIROT is to build its models without asking the user for feedback; from this perspective, POIROT is an active learner. Some comparisons with Metacrawler are reported, showing that POIROT outperforms it in terms of relevance and precision of the results presented.

Identifying users' interests

The approach followed by most web agents that try to model users is to present an initial questionnaire and to ask for feedback from the user after each action or decision (Chen and Sycara 1997, Newell 1997). This approach tends to be intrusive and is extremely poor at capturing hidden relations between features or progressive shifts in user interests. One of our main motivations was to identify user interests, and to model users and search engines, without issuing questionnaires or asking the user for feedback.

Monitoring user actions seems to be a promising approach, but several issues make its implementation difficult. First, capturing events in a web browser, such as save, print or mouse actions, is not an easy task; second, it is difficult to assign the correct meaning to user actions during web browsing. For instance, it is not clear how important it is to follow a link or to "view" a page for a given time period. A page may stay in the browser's window unattended while the user is taking a break, which does not mean that the page is of interest to the user. The frequency and number of visits to a page is a factor that indicates interest in that page and, more importantly, interest in the topics related to that particular page (Tauscher and Greenberg 1999).

One option to capture these features was to develop a web browser or a plug-in that monitors user activity and reports it to a feature extraction module. The problem is that a plug-in "lives" only during the execution of the browser and is limited to acting in a single window at a time, which limits its monitoring possibilities. Developing a new web browser, given the installed base of the most popular browsers (Sullivan 1999b), does not make sense. Another option considered was to use the cache files stored in the proxy server or on local hard disks. The cache maintains a log of visited pages and stores the components that form the pages, but pages in the cache are not necessarily pages of interest; they merely reflect the navigation pattern of the user.

Popular web browsers give the user the possibility to bookmark pages for future reference, and some also offer the option to organize the bookmarks by topic. Given that bookmarking is a user action that reflects interest in a given page, we can assert that, at a given time, the pages bookmarked were relevant for the user in terms of general utility, quality, personal interest, frequency of use and potential future use (Abrams and Baecker 1999).
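As a concrete illustration of using bookmarks as the source of training pages, the following sketch (hypothetical, not part of the POIROT implementation) harvests titles, URLs and folder labels from a Netscape-style bookmarks.html export; the file path, the folder-tracking logic and the parser choice are assumptions made purely for illustration.

    # Minimal sketch (not the POIROT implementation): collect bookmarked URLs
    # and their folder labels from a Netscape/Mozilla-style bookmarks.html file.
    from html.parser import HTMLParser

    class BookmarkParser(HTMLParser):
        """Extracts (title, url, folder) triples from a bookmarks.html export."""
        def __init__(self):
            super().__init__()
            self.folders = []        # current folder path (the user's "topics")
            self.bookmarks = []      # collected (title, url, folder) entries
            self._pending_href = None
            self._in_h3 = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and "href" in attrs:
                self._pending_href = attrs["href"]
            elif tag == "h3":                     # a folder name follows as text
                self._in_h3 = True

        def handle_endtag(self, tag):
            if tag == "h3":
                self._in_h3 = False
            elif tag == "dl" and self.folders:    # leaving a folder's contents
                self.folders.pop()

        def handle_data(self, data):
            text = data.strip()
            if not text:
                return
            if self._in_h3:
                self.folders.append(text)
            elif self._pending_href:
                folder = "/".join(self.folders) or "(top level)"
                self.bookmarks.append((text, self._pending_href, folder))
                self._pending_href = None

    def load_bookmarks(path="bookmarks.html"):   # path is an assumption
        parser = BookmarkParser()
        with open(path, encoding="utf-8", errors="ignore") as f:
            parser.feed(f.read())
        return parser.bookmarks

    if __name__ == "__main__":
        for title, url, folder in load_bookmarks():
            print(folder, "|", title, "->", url)

The folder labels are of interest because, as noted above, some browsers let users organize bookmarks by topic, which gives the agent an initial, user-provided grouping of relevant pages.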
According to (Abrams 1997), 84% of users use bookmarks, hotlists or favorites lists as references to pages that they liked. Bookmarks are tools of natural use for users facing common problems of the Internet such as information overload, disorder, redundancy and lack of quality (Tauscher and Greenberg 1999, Lawrence and Giles 1998 and 1999, Baeza 1998). Bookmarks help to reduce the entropy inherent in the Internet.

Modeling users and search engines

Given the decision to use bookmarks as the primary source of information about user interests, the next decision was to develop an agent to extract the information, build a model and interact with the user. The first problem in the design of the agent was to identify the features that determine that a given page is of interest to the user. The interest of a user, as mentioned above, can be determined by several factors, but all of them are associated with the topics covered by the page, so we need to determine the topics covered by a given page in order to decide whether that page is relevant for the user and to what degree. The topics of a document are not necessarily written explicitly: for instance, a college page may mention words such as "university", "faculty" and "schedules", while the term "education" may not appear. Moreover, even if that term appears, how do we determine that it is related to the topics of the page? Clearly, the topics covered by a web page are determined by the words written in the page; the problem is to rank the words in the page according to their relation to the topic of the page.

One alternative is to infer the topics covered by a document from the frequency of appearance of certain words in the document and a relation established between those words and the topics. In other words, the words that are more related to the topic of a document should be more frequent than others, and pages that cover the same topics should share the same set of frequently used words. (Joachims 1998) suggests using a feature vector consisting of a word-counting vector. Each word has an associated importance, computed from its frequency of appearance, that serves as input to a Support Vector Machine (SVM) (Haykin 1999, Burges 1998, Osuna et al. 1997, Nilsson 1999, Pontil and Verri 1997) that determines the proper classification of the document. The problem with this approach, in addition to the computational cost of SVMs, is the lack of important features such as incremental learning, which would give the agent the ability to adapt to changes in the user.

(Salton and McGill 1983) present a method to compute a weighted word count, using weights determined by an algorithm called TF-IDF (Term Frequency - Inverse Document Frequency) that became a standard in information retrieval. The general idea is to represent each document as a vector in a feature space, where documents with similar topics should be "closer" in the space than unrelated ones, given a distance metric. Each dimension of the feature space represents a word and its weight. The weights are computed from the frequency of the word in the document, TF(w, d), and the document frequency, DF(w). From DF, the inverse document frequency, IDF(w), is computed as shown in (1):

IDF(w) = log(|D| / DF(w))    (1)

where |D| is the total number of evaluated documents.
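To make the weighting concrete, the following is a minimal sketch (not the POIROT code) that builds one TF-IDF vector per document; the tokenizer and the toy document collection are placeholders, and each word's weight is the product TF(w, d) * IDF(w) made precise in equation (2) below.

    # Minimal sketch of TF-IDF weighting; tokenizer and documents are placeholders.
    import math
    import re
    from collections import Counter

    def tokenize(text):
        """Lower-case word extraction; a real agent would also strip HTML and stop words."""
        return re.findall(r"[a-z]+", text.lower())

    def tfidf_vectors(documents):
        """documents: list of raw texts (e.g. the bookmarked pages).
        Returns one {word: weight} vector per document, where the weight is
        TF(w, d) * IDF(w) and IDF(w) = log(|D| / DF(w)) as in equation (1)."""
        tokenized = [tokenize(d) for d in documents]
        n_docs = len(tokenized)

        # DF(w): number of documents in which word w appears at least once.
        df = Counter()
        for words in tokenized:
            df.update(set(words))

        vectors = []
        for words in tokenized:
            tf = Counter(words)                      # TF(w, d): count in this document
            vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
        return vectors

    if __name__ == "__main__":
        docs = [
            "university faculty schedules courses admission",
            "faculty research university laboratory",
            "soccer league schedules results",
        ]
        for vec in tfidf_vectors(docs):
            print(sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:3])

Note that a word appearing in every document gets IDF = log(1) = 0, which is exactly the "words that do not separate documents" behavior discussed below.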
The weight of a given word in a document is computed using equation (2):

d(i) = TF(w_i, d) * IDF(w_i)    (2)

The higher the value of d(i), the higher the significance of the word within the document. Note that the significance of a word within a document decreases as DF increases, since a word that is very frequent across a set of documents will not help us to "separate" the documents in the feature space. As an adaptation of TF-IDF to our particular problem, we assign additional weights to the words according to their position in the HTML document: higher for words appearing in the keywords meta tag, titles, headers, etc., and lower for normal text.

Once the words in each document are ranked, we need to infer rules that, starting from these initial documents (the pages bookmarked by the user), help us determine the relevance of new pages. This problem can be addressed as inductive concept learning, where the concept to learn (the right-hand side of the rules) is "relevance" and the conditions (the left-hand side) are logic expressions based on topics found in documents. The pages bookmarked by the user are used as the training set; they are all positive instances of the concept. Inductive Logic Programming (Bergadano and Gunetti 1995) was selected to infer the rules, instead of other inductive methods such as neural networks, because of its ability to derive rules with a variable number of conditions, yielding specific rules (many conditions) that give us precision and general rules (few conditions) that help us maintain coverage in the search. We limit the number of topics appearing in the left-hand side of a rule to 10, to avoid over-fitting the agent. The topics with the highest ranking are used to build rules of the form:

If (topic1, topic2, ..., topicn) then Relevant with RF

where RF is the relevance factor computed from the ranking of the topics included in the rule. Once the initial rules are generated, ILP techniques are applied to those rules to generate aggregated rules that are also incorporated into the rule base. The result is a rule base with rules extracted directly from documents and more general rules that apply to the entire set. This maintains the required balance between precision and coverage that is shown in the results.

The last issue to address is the modeling of the repositories, search engines in our case, allowing the agent to select the search engine with the highest probability of giving good results for a particular topic. We use a strategy similar to TF-IDF, maintaining a vector for each
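As a rough illustration of the rule form described above, the following sketch (hypothetical; the rule representation, the matching policy and the way relevance factors are combined are assumptions, not POIROT's actual rule engine) scores a new page against a small hand-written rule base: a rule fires when all of its topics appear among the page's top-ranked topics, and the page receives the relevance factor of the best firing rule.

    # Hypothetical sketch: rules of the form
    # "If (topic1, ..., topicn) then Relevant with RF" applied to a new page.
    from dataclasses import dataclass

    @dataclass
    class Rule:
        topics: frozenset      # left-hand side: topics that must all be present
        rf: float              # relevance factor attached to the rule

    def relevance(page_topics, rules):
        """page_topics: set of top-ranked topics extracted from the page
        (e.g. its highest-weighted TF-IDF terms). A rule fires when all of
        its topics appear in the page; the score is the best firing RF."""
        fired = [r.rf for r in rules if r.topics <= set(page_topics)]
        return max(fired, default=0.0)

    if __name__ == "__main__":
        # A specific rule (many conditions, high precision) and a general
        # rule (few conditions, broad coverage), as discussed in the text.
        rules = [
            Rule(frozenset({"university", "faculty", "schedules"}), rf=0.9),
            Rule(frozenset({"university"}), rf=0.4),
        ]
        print(relevance({"university", "faculty", "schedules", "library"}, rules))  # 0.9
        print(relevance({"university", "soccer"}, rules))                           # 0.4
        print(relevance({"soccer", "league"}, rules))                               # 0.0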