1. What is the meaning of the numbers associated with each keyphrase?
The numbers indicate the score of a phrase which is an estimate of its value as a keyphrase. Keyphrases are ranked in order of descending score. A score can be any positive real number. The scores with long documents as input tend to be higher than the scores with short documents. For some applications, it might be desirable to normalize the score.
2. How can I normalize the score?
For some applications, it might be desirable to normalize the score, so that the scores of keyphrases from different documents can be compared.
Here are some suggestions for normalization:
-
Ignore the scores produced by xAIgent. Given a large collection of documents (e.g., web pages), score each keyphrase by the percentage of documents for which the given keyphrase was suggested by the xAIgent. (Example: "The keyphrase 'corporate merger' was generated for 45 of the 100 documents. Thus 'corporate merger' has a score of 45%.")
-
Take the score produced by the xAIgent and normalize it so that it ranges from 0% to 100%, by dividing the score of each keyphrase by the score of the first keyphrase. (The first keyphrase always has the highest score.) (Example: xAIgent suggests three phrases: 'corporate merger' with a score of 50, 'stocks' with a score of 30, and 'bonds' with a score of 10. The normalized scores are 100%, 60%, and 20%, respectively.)
-
Longer documents often seem to have better keyphrases than shorter documents. The problem with the second suggestion is that it ignores the document length. One possibility would be to multiply the normalized score from the second suggestion by (say) the logarithm of the length of the document (measured in number of words or in bytes). Another possibility would be to sort the document collection by length and increase the scores of a document's keyphrases according to the length percentile in which the document appears. (Example: "The keyphrase 'corporate merger' appears in document #345. The keyphrase has a normalized score of 60%. However, since document #345 is in the top 25% of documents in the collection by length, we will boost the score of 'corporate merger' by 20 percentage points, for an adjusted score of 80%.")
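The three suggestions above can be sketched in Python as follows. This is only an illustration, not part of the xAIgent: the function names are hypothetical, and the input is assumed to be whatever keyphrases and scores you have already collected from the xAIgent (a list of phrase lists, or a dictionary mapping each phrase to its score).

```python
import math
from collections import Counter

def document_frequency_scores(keyphrases_per_doc):
    """First suggestion: percentage of documents for which each phrase was suggested."""
    n_docs = len(keyphrases_per_doc)
    if n_docs == 0:
        return {}
    counts = Counter()
    for phrases in keyphrases_per_doc:
        counts.update(set(phrases))  # count each phrase at most once per document
    return {phrase: 100.0 * c / n_docs for phrase, c in counts.items()}

def normalize_by_top_score(scored):
    """Second suggestion: divide every score by the highest score (result in percent)."""
    if not scored:
        return {}
    top = max(scored.values())
    return {phrase: 100.0 * s / top for phrase, s in scored.items()}

def length_weighted(scored, word_count):
    """Third suggestion (one variant): multiply the normalized score by log(length).

    Assumes word_count > 1; the result is no longer a percentage.
    """
    weight = math.log(word_count)
    return {p: s * weight for p, s in normalize_by_top_score(scored).items()}

if __name__ == "__main__":
    scores = {"corporate merger": 50, "stocks": 30, "bonds": 10}
    print(normalize_by_top_score(scores))
```

With the scores from the example in the second suggestion (50, 30, and 10), normalize_by_top_score returns 100, 60, and 20.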
3. Given a sentence such as, "I am not skiing today," why does the xAIgent select "skiing" as a keyphrase instead of "not skiing"?
The intention of the xAIgent is to capture the main topics that are discussed in the input document. The xAIgent does not attempt to convey exactly how these topics are discussed. For example, if a document discusses legal issues concerning guns, the xAIgent might suggest the keyphrase "gun law". This keyphrase does not indicate whether the document supports strict legal control of guns or opposes any government involvement in gun control. The design of the xAIgent was based on a study of how authors use keyphrases. We have examined several thousand documents with keyphrases supplied by their authors. None of the keyphrases we have seen so far include the word "not".
4. I want to use the xAIgent for automatic document classification. Can you help me?
Automatic document classification is the use of software to sort documents into various pre-defined categories. A similar task is automatic document clustering, in which there are no pre-defined categories, so the software must create the categories by itself. If you want to learn more about automatic document classification and clustering, there is a hypertext Bibliography on Machine Learning Applied to Text. xAIgent can be used to generate features for use in feature vectors for machine learning algorithms. (If you are not familiar with this terminology, it should become clear to you as you read the papers in the bibliography.) If you wish to use the xAIgent to generate feature vectors, we suggest the following approach:
-
Apply the xAIgent to all of the documents in your sample collection.
-
Take the union of all of the extracted keyphrases as the feature set.
-
For each document and each feature, let the value of the feature be the number of times that the given phrase occurs in the given document (regardless of whether the xAIgent extracted it from the given document).
-
Apply your favourite machine learning algorithm (e.g., decision tree induction, neural network, genetic algorithm, etc.) to the resulting feature vectors.
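As a rough sketch of the first three steps, assuming you have already run the xAIgent on each document and collected its keyphrases, the feature vectors could be built like this in Python. The function names are hypothetical, and matching is done on lowercased text with word boundaries, which is a simplification you may want to replace.

```python
import re

def count_occurrences(text, phrase):
    """Count non-overlapping whole-word occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text))

def build_feature_vectors(documents, keyphrases_per_doc):
    """Return (features, vectors).

    documents: list of document texts.
    keyphrases_per_doc: keyphrases extracted for each document, in the same order.
    """
    # Step 2: the feature set is the union of all extracted keyphrases.
    features = sorted({p.lower() for phrases in keyphrases_per_doc for p in phrases})
    vectors = []
    for text in documents:
        lowered = text.lower()
        # Step 3: count every feature in every document, whether or not
        # the phrase was extracted from this particular document.
        vectors.append([count_occurrences(lowered, f) for f in features])
    return features, vectors

if __name__ == "__main__":
    docs = ["Gun law is debated. The gun law changed.", "Stocks and bonds."]
    phrases = [["gun law"], ["stocks", "bonds"]]
    print(build_feature_vectors(docs, phrases))
```

The resulting vectors can then be passed to the learning algorithm of your choice.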
5. How can I combine keyphrases that were extracted from many different documents?
For some applications, you may wish to have a list of keyphrases that covers a whole collection of documents, where each document has been processed individually by the xAIgent. If you have no constraints on the size of the list of keyphrases, you might simply take the union of all of the phrases as your combined list. To reduce the size of the list slightly, you might drop words that have the same stem (e.g., "automobile" and "automobiles"). If you want to substantially reduce the size of the list, then you can assign a normalized score to each keyphrase and select the keyphrases with the highest normalized scores.
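One way to do this is sketched below in Python. It assumes you have a dictionary of phrase scores for each document, normalizes each document's scores by its top score, merges phrases that share a stem, and keeps the highest-ranked phrases. The crude_stem function is a deliberately crude stand-in (it only strips a trailing "s"), and all names here are hypothetical; a real stemmer would do better.

```python
def crude_stem(phrase):
    """Crude stand-in for a stemmer: lowercase and strip a trailing 's' from each word."""
    words = phrase.lower().split()
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)

def combine_keyphrases(scored_per_doc, limit=None):
    """Merge per-document keyphrases into one list ranked by normalized score.

    scored_per_doc: list of {phrase: score} dictionaries, one per document.
    limit: maximum number of phrases to return (None keeps them all).
    """
    best = {}  # stem -> (normalized score, phrase)
    for scored in scored_per_doc:
        if not scored:
            continue
        top = max(scored.values())
        for phrase, score in scored.items():
            norm = score / top
            key = crude_stem(phrase)
            # Keep one phrase per stem: the one with the best normalized score.
            if key not in best or norm > best[key][0]:
                best[key] = (norm, phrase)
    ranked = sorted(best.values(), key=lambda item: (-item[0], item[1]))
    combined = [phrase for _, phrase in ranked]
    return combined if limit is None else combined[:limit]

if __name__ == "__main__":
    per_doc = [
        {"automobile": 40, "engine": 20},
        {"automobiles": 10, "road safety": 10, "engine": 2},
    ]
    print(combine_keyphrases(per_doc, limit=2))
```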
6. Can the xAIgent handle language X?
The xAIgent currently works with monolingual documents in English, French, Japanese, German, Spanish, or Korean.
7. Can the xAIgent handle character encoding X?
The xAIgent currently supports ISO-8859-1 for English, French, German and Spanish. ISO-8859-1 is also known as ISO Latin-1. The xAIgent currently supports Unicode UCS2 for Japanese and Korean.
8. How can I generate 100 keyphrases?
The xAIgent currently allows the user to specify from 3 to 30 keyphrases. For some applications, you may wish to have more keyphrases. One solution is to break the document into smaller sections and pass each section to the xAIgent.
Suppose we gave you a book and asked you to give us a list of key phrases that capture the main topics of the book. When your list approached 30 key phrases, we think you would be struggling to think of more key phrases. It seems likely that there are fewer than 30 "main topics" for most books. Perhaps an average book only has 10 or 15 "main topics", but you could cover each topic with 2 or 3 synonymous key phrases, to yield a total of about 30 key phrases.
On the other hand, if we took any single chapter from the same book, and asked you to give us a list of key phrases that capture the main topics of the chapter, we think the list would be approximately the same size as the list you would give us for the whole book. A key phrase that captures the "main topic" of the chapter might only capture a "minor topic" of the whole book. So the union of the keyphrases for each chapter would be a superset of the keyphrases for the whole book.
This is why the xAIgent has a maximum of 30 key phrases per "chunk". If you want more key phrases, then you can break the document into smaller "chunks" and take the union of the key phrases for each individual "chunk". We believe that this strategy will produce a superior list to the strategy of treating the document as a single, homogeneous whole.
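The chunking strategy can be sketched in Python as follows. The extract argument is a hypothetical function that you supply: it takes a piece of text, calls the xAIgent in whatever way your installation provides, and returns a list of key phrases. Splitting by a fixed number of words is also just an assumption; splitting by chapter or section is likely to work better when the document has that structure.

```python
def split_into_chunks(text, max_words=2000):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def keyphrases_for_chunks(text, extract, max_words=2000):
    """Union of the key phrases of every chunk, in order of first appearance.

    extract: hypothetical caller-supplied function, text -> list of key phrases.
    """
    seen = set()
    combined = []
    for chunk in split_into_chunks(text, max_words):
        for phrase in extract(chunk):
            key = phrase.lower()
            if key not in seen:  # skip phrases already found in an earlier chunk
                seen.add(key)
                combined.append(phrase)
    return combined

if __name__ == "__main__":
    # Stand-in extractor for demonstration only: the first two words of the chunk.
    fake_extract = lambda chunk: chunk.split()[:2]
    print(keyphrases_for_chunks("alpha beta gamma delta alpha epsilon", fake_extract, max_words=3))
```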
9. When I give a document to the xAIgent and ask for four keyphrases and then take the same document and ask for seven keyphrases, the four keyphrases are not always a subset of the seven keyphrases. Why?
This is explained in detail in Learning to Extract Keyphrases from Text. If it is important for your application that the four keyphrases that you get when you ask for four keyphrases should be the same as the first four keyphrases that you get when you ask for seven keyphrases, then ask for seven keyphrases but only take the first four. In general, if you currently want M keyphrases but you might eventually want N keyphrases (where N > M), then ask the xAIgent for N keyphrases, but only take the first M keyphrases. Better yet, store all N keyphrases, so you can later look up the remaining N - M keyphrases instead of running the xAIgent twice.
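A minimal Python sketch of the "ask for N, store them, take the first M" approach is shown below. The extract argument is a hypothetical function that you supply (text and count in, ranked list of keyphrases out); it is not an actual xAIgent call.

```python
def make_cached_extractor(extract, n=30):
    """Wrap a hypothetical extract(text, count) function with a per-document cache.

    The wrapped function always requests n keyphrases once per document,
    stores them, and serves any smaller request from the stored list.
    """
    cache = {}

    def top_keyphrases(doc_id, text, m):
        if doc_id not in cache:
            cache[doc_id] = list(extract(text, n))  # run the extractor only once
        return cache[doc_id][:m]

    return top_keyphrases

if __name__ == "__main__":
    # Stand-in extractor for demonstration only.
    fake_extract = lambda text, count: text.split()[:count]
    top = make_cached_extractor(fake_extract, n=7)
    print(top("doc-1", "a b c d e f g h", 4))
    print(top("doc-1", "a b c d e f g h", 7))
```

Because both requests are served from the same stored list, the four keyphrases are always the first four of the seven.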
10. In our documents, we have phrases with four and more words. What does the xAIgent do? Is there a limit to the number of words in a keyphrase?
The xAIgent is designed to extract key phrases with one, two, or three words. In a study of thousands of documents with key phrases supplied by the authors, we found that authors only create key phrases with four or more words about 5% of the time. When we try to include phrases with four or more words, we can cover a few more of the authors' key phrases, but we also introduce a few more errors. Since there is a net loss, the xAIgent does not attempt to cover these longer phrases. In order to capture these longer phrases, you might try inspecting the key phrases relative to each other. If the xAIgent outputs a phrase of the form "A B C" and a phrase of the form "B C D", then you can conjecture that these are parts of a longer phrase "A B C D", and join them together. For example, "National Research Council" and "Research Council Canada" would be joined to make "National Research Council Canada".
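The joining heuristic can be sketched in Python as follows. This is an illustration applied to key phrases you have already obtained, not a feature of the xAIgent; the function name is hypothetical, and the comparison ignores case, which is an assumption. The joined phrases are only conjectures, so you may want to check that each one actually occurs in the document.

```python
def join_overlapping(phrases):
    """Join three-word phrases "A B C" and "B C D" into "A B C D".

    Returns the list of conjectured four-word phrases.
    """
    three_word = [p.split() for p in phrases if len(p.split()) == 3]
    joined = []
    for left in three_word:
        for right in three_word:
            # The last two words of the left phrase must equal
            # the first two words of the right phrase.
            overlap_left = [w.lower() for w in left[1:]]
            overlap_right = [w.lower() for w in right[:2]]
            if left is not right and overlap_left == overlap_right:
                joined.append(" ".join(left + right[-1:]))
    return joined

if __name__ == "__main__":
    print(join_overlapping(["National Research Council", "Research Council Canada", "stocks"]))
```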