List of Listings – Collective Intelligence in Action

Chapter 3. Extracting intelligence from tags

Listing 3.1. Query for users who have used one of John’s tags

Listing 3.2. The final query for getting all tags that other users have used

Listing 3.3. The TagCloud interface

Listing 3.4. The TagCloudElement interface

Listing 3.5. The FontSizeComputationStrategy interface

Listing 3.6. Implementation of TagCloudImpl

Listing 3.7. The implementation of TagCloudElementImpl

Listing 3.8. Implementation of FontSizeComputationStrategyImpl

Listing 3.9. VisualizeTagCloudDecorator interface

Listing 3.10. Implementation of HTMLTagCloudDecorator

Listing 3.11. Sample code for generating tag clouds

Chapter 4. Extracting intelligence from content

Listing 4.1. The MetaDataVector interface

Listing 4.2. The MetaDataExtractor interface

Listing 4.3. Implementation of the SimpleMetaDataExtractor

Listing 4.4. Continuing with the implementation of SimpleMetaDataExtractor

Listing 4.5. Implementation of SimpleStopWordMetaDataExtractor

Listing 4.6. Implementation of SimpleStopWordStemmerMetaDataExtractor

Listing 4.7. Implementation of SimpleBiTermStopWordStemmerMetaDataExtractor

Chapter 5. Searching the blogosphere

Listing 5.1. Example of RSS 2.0 from

Listing 5.2. BlogSearcher interface

Listing 5.3. The BlogQueryParameter interface

Listing 5.4. The BlogQueryResult interface

Listing 5.5. The BlogSearchResponseHandler interface

Listing 5.6. Implementation of BlogSearcherException

Listing 5.7. Implementation of BlogQueryParameterImpl

Listing 5.8. First half of BlogSearcherImpl—SAX parser

Listing 5.9. Second half of BlogSearcherImpl—HTTP and parsing response

Listing 5.10. Constructor and attributes for BlogSearchResponseHandlerImpl

Listing 5.11. Parsing-related code for BlogSearchResponseHandlerImpl

Listing 5.12. Technorati response XML for search query

Listing 5.13. TechnoratiSearchBlogQueryParameterImpl

Listing 5.14. TechnoratiBlogSearcherImpl

Listing 5.15. TechnoratiResponseHandler

Listing 5.16. Output from Technorati search for “collective intelligence”

Listing 5.17. Unit test to call Technorati search

Listing 5.18. Example response from Bloglines search

Listing 5.19. Implementation of BlogLinesBlogSearcherImpl

Listing 5.20. BlogSearchResponseHandler

Listing 5.21. RSSFeedBlogQueryParameterImpl

Listing 5.22. RSSFeedBlogQueryParameterImpl

Listing 5.23. RSSFeedResponseHandler

Listing 5.24. Output from Blogdigger query for “collective intelligence”

Listing 5.25. Output from Blogdigger query for “collective intelligence”

Chapter 6. Intelligent web crawling

Listing 6.1. The crawl() method in the NaiveCrawler class

Listing 6.2. The constructor for NaiveCrawler

Listing 6.3. The code for CrawlerUrl

Listing 6.4. Getting the next URL for the crawler

Listing 6.5. Example robots.txt file at

Listing 6.6. Parsing the robots.txt file to check for permissions

Listing 6.7. Retrieving content from URLs

Listing 6.8. Extracting the URLs

Listing 6.9. Checking for relevant content

Listing 6.10. Main program for the crawler

Listing 6.11. Sample of the URLs retrieved by the crawler

Listing 6.12. Dump of the URLs from the crawldb

Listing 6.13. Dump of a Nutch segment

Listing 6.14. Configuring nutch-site.xml

Chapter 7. Data mining: process, toolkits, and standards

Listing 7.1. Implementation of the WEKATutorial

Listing 7.2. Implementation of the method to create attributes

Listing 7.3. Implementation of the method createLearningDataSet

Listing 7.4. Creating the predictive model

Listing 7.5. Evaluating the quality and predicting the number of logins

Listing 7.6. Predicting the number of logins

Listing 7.7. The output from the main method

Listing 7.8. Constructor and main method for JDMConnectionExample

Listing 7.9. Creating a new connection in the JDMConnectionExample

Listing 7.10. Getting a ConnectionFactory and ConnectionSpec

Chapter 8. Building a text analysis toolkit

Listing 8.1. Implementation of the PorterStemStopWordAnalyzer

Listing 8.2. Test method to see the effect of PorterStemStopWordAnalyzer

Listing 8.3. Interface to validate phrases

Listing 8.4. Interface to access synonyms

Listing 8.5. The next() method for SynonymPhraseStopWordFilter

Listing 8.6. Injecting phrases and synonyms

Listing 8.7. Implementation of the SynonymPhraseStopWordAnalyzer

Listing 8.8. Implementation of the CacheImpl class

Listing 8.9. Implementation of SynonymsCacheImpl

Listing 8.10. Implementation of the PhrasesCacheImpl

Listing 8.11. Test program using SynonymPhraseStopWordAnalyzer

Listing 8.12. The Tag interface

Listing 8.13. The TagImpl implementation

Listing 8.14. The TagCache interface

Listing 8.15. The implementation for TagCacheImpl

Listing 8.16. The TagMagnitude interface

Listing 8.17. The implementation for TagMagnitudeImpl

Listing 8.18. The TagMagnitudeVector interface

Listing 8.19. The basic TagMagnitudeVectorImpl class

Listing 8.20. Computing the dot product in TagMagnitudeVectorImpl

Listing 8.21. A simple example for TagMagnitudeImpl

Listing 8.22. The interface for the InverseDocFreqEstimator

Listing 8.23. The interface for the EqualInverseDocFreqEstimator

Listing 8.24. The interface for the TextAnalyzer

Listing 8.25. The core of the LuceneTextAnalyzer class

Listing 8.26. Creating the term vectors in LuceneTextAnalyzer

Listing 8.27. Computing the tokens for the title and body

Listing 8.28. Tag listing for our example

Listing 8.29. Visualizing the term vector as a tag cloud

Listing 8.30. Computing the TagMagnitudeVector

Listing 8.31. Results from displaying the results for TagMagnitudeVector

Chapter 9. Discovering patterns with clustering

Listing 9.1. The definition for the Clusterer interface

Listing 9.2. The definition for the TextCluster interface

Listing 9.3. The definition for the TextDataItem interface

Listing 9.4. The definition for the DataSetCreator interface

Listing 9.5. The definition for the BlogAnalysisDataItem

Listing 9.6. Retrieving blog entries from Technorati

Listing 9.7. Converting blog entries into a List of TextDataItem objects

Listing 9.8. The implementation for InverseDocFreqEstimatorImpl

Listing 9.9. The implementation for ClusterImpl

Listing 9.10. The core of the TextKMeansClustererImpl implementation

Listing 9.11. Initializing the clusters

Listing 9.12. Recomputing the clusters

Listing 9.13. Results from a clustering run

Listing 9.14. The interface for HierCluster

Listing 9.15. The implementation for HierClusterImpl

Listing 9.16. The implementation for HierDistance

Listing 9.17. The cluster method for HierarchialClusteringImpl

Listing 9.18. Creating the initial clusters in HierarchialClusteringImpl

Listing 9.19. Merging the next cluster in HierarchialClusteringImpl

Listing 9.20. Printing the results from HierarchialClusteringImpl

Listing 9.21. Sample output from hierarchical clustering applied to blog entries

Listing 9.22. The first part of WEKABlogDataSetClusterer

Listing 9.23. An example dump of the Instances class

Listing 9.24. The second part of WEKABlogDataSetClusterer

Listing 9.25. The third part of WEKABlogDataSetClusterer

Listing 9.26. Sample output from one of the clustering runs

Listing 9.27. Settings-related code for the clustering process

Listing 9.28. Creating the clustering task

Listing 9.29. Executing the clustering task

Listing 9.30. Retrieving the clustering model

Chapter 10. Making predictions

Listing 10.1. Computing the gain associated with a distribution

Listing 10.2. Retrieving blogs

Listing 10.3. Creating the dataset

Listing 10.4. Creating Instance in WEKAPredictiveBlogDataSetCreatorImpl

Listing 10.5. The implementation of the WEKABlogClassifier class

Listing 10.6. Sample output from a decision tree

Listing 10.7. Implementing regression using the WEKA APIs

Listing 10.8. Settings-related code for the classification process

Listing 10.9. Create the classification task

Listing 10.10. Execute the classification task

Listing 10.11. Retrieving the classification model

Listing 10.12. Testing the classification model

Chapter 11. Intelligent search

Listing 11.1. The main method for the BlogSearchExample

Listing 11.2. Retrieving blog entries from Technorati

Listing 11.3. Creating a search index

Listing 11.4. Sample output from indexing the blogs

Listing 11.5. Searching the index

Listing 11.6. Sample output from our example

Listing 11.7. Deleting documents using the IndexReader

Listing 11.8. Batch deletion and addition of documents

Listing 11.9. Adding code to check whether the index is locked

Listing 11.10. Sample code to access the term frequency vector for a field

Listing 11.11. Illustrating flushing by RAM

Listing 11.12. Sample explanation of Lucene scoring

Listing 11.13. Example code showing the use of various Query classes

Listing 11.14. Sorting example

Listing 11.15. MultiFieldQueryParser example

Listing 11.16. Filtering the results

Listing 11.17. Searching across multiple instances

Listing 11.18. Example using TopDocCollector

Listing 11.19. Implementing a custom HitCollector

Chapter 12. Building a recommendation engine

Listing 12.1. Sample output from related blogs

Listing 12.2. Creating a BooleanQuery using the term vector in Lucene

Listing 12.3. Creating a composite BooleanQuery and retrieving blog entries

Listing 12.4. Iterating over all documents in the index

Listing 12.5. Iterating over all documents in the index

Listing 12.6. Implementation for RelevanceTextDataItem

Listing 12.7. The main steps for building a content-based recommendation engine

Listing 12.8. Getting relevant items in ContentBasedBlogRecoEngine

Listing 12.9. Related entries for a blog

Listing 12.10. Code to illustrate merging of documents

Listing 12.11. Creating the dataset for implementing k-NN

Listing 12.12. The dataset created from the first part of code

Listing 12.13. Making predictions using k-nearest neighbor

Listing 12.14. The output predicted and expected values for our example

Listing 12.15. Representing the data to illustrate dimensionality reduction

Listing 12.16. Output from running the first part of the code

Listing 12.17. Code illustrating dimensionality reduction

Listing 12.18. Output from running the second part of the code