Chapter 3. Extracting intelligence from tags – Collective Intelligence in Action

Chapter 3. Extracting intelligence from tags

This chapter covers
  • Three forms of tagging and the use of tags
  • A working example of how intelligence is extracted from tags
  • Database architecture for tagging
  • Developing tag clouds

In content-centric applications, users typically navigate content through categories or menus authored by the site editors. Each category can have a number of nested subcategories, allowing the user to drill down the subcategory tree and find content of interest. From a user-experience point of view, such navigation can be tedious. A user might need to navigate across multiple subtopics before finding the right item. This approach of manually categorizing items can be expensive and difficult to maintain over the long run due to the manpower involved, especially as the amount of content increases. As users generate content on your site, it’ll be expensive and sometimes financially infeasible to manually categorize the content being created. Imagine a site like Flickr with millions of photographs and the effort that would be required if you tried to manually categorize each photo.

An alternative to manual categorization and rigid menus is to build a system that can learn about each user—what kind of content she’s interested in—and dynamically build navigation links, hyperlinks to other relevant items, whose text or phrases are also familiar to the user. Further, such a system can be built in a cost-effective manner without having to rely on professional editors to categorize items.

Users tagging items—adding keywords or phrases to items—is now ubiquitous on the web. This simple process of a user adding labels or tags to items, bookmarking items, sharing items, or simply viewing items provides a rich dataset that can translate into intelligence, for both the user and the items. This intelligence can be in the form of finding items related to the one tagged; connecting with other users who have similarly tagged items; or drawing the user to discover alternate tags that have been associated with an item of interest and through that finding other related items.

In this chapter, we look at three forms of generating tags in your application: professionally generated tags, user-generated tags, and machine-generated tags. We review the advantages and disadvantages of each approach and develop guidelines for creating tags. We briefly review how you can use dynamic navigation in your application. We use a working example to illustrate the process of developing the term vector or metadata for the item and user. Next, we discuss how you can build this infrastructure in a scalable manner. For this, we first review the various designs used to persist tagging-related data and develop the recommended persistence architecture. Next, we develop code to build a tag cloud, one of the ways to add dynamic navigation to your application, and end with some practical issues.

3.1. Introduction to tagging

Tagging is the process of adding freeform text, either words or small phrases, to items. These keywords or tags can be attached to anything in your application—users, photos, articles, bookmarks, products, blog entries, podcasts, videos, and more.

In section 2.2.3, we looked at using term vectors to associate metadata with text. Each term or tag in the term vector represents a dimension. The collective set of terms or tags in your application defines the vocabulary for your application. When this same vocabulary is used to describe both the user and the items, we can compute the similarity of items with other items and the similarity of the item to the user’s metadata to find content that’s relevant to the user. In this case, tags can be used to represent metadata. Using the context in which they appear and to whom they appear, they can serve as dynamic navigation links. The tag cloud introduced later in the chapter is one example of such navigation.

In essence, tags enable us to

  1. Build a metadata model (term vector) for our users and items. The common terminology between users and items enables us to compute the similarity of an item to another item or to a user.
  2. Build dynamic navigation links in our application, for example, a tag cloud or hyperlinked phrases in the text displayed to the user.
  3. Use metadata to personalize and connect users with other users.
  4. Build a vocabulary for our application.
  5. Bookmark items, which can be shared with other users.

In this section, we look at some basic information about tagging.

3.1.1. Tag-related metadata for users and items

Tags provide a common vocabulary to represent metadata associated with users and items. The metadata developed can be again divided into content-based metadata and collaborative-based metadata.

In the content-based approach, metadata associated with the item is developed by analyzing the item’s content as explained in section 2.2.3. This is represented by a term vector, a set of tags with their relative weights. Similarly, metadata can be associated with the user by aggregating the metadata of all the items visited by the user within a window of time, as explained in section 2.4.2.

In the collaborative approach, user actions are used for deriving metadata. User tagging is an example of such an approach, and we illustrated with the example in section 2.2. Basically, the metadata associated with the item can be computed by computing the term vector from the tags—taking the relative frequency of the tags associated with the item and normalizing the counts.

When you think about metadata for a user and item using tags, think about a term vector with tags and their related weights.

We can categorize tags based on who generated them. As shown in figure 3.1, there are three main types of tags: professionally generated, user-generated, and machine-generated.

Figure 3.1. Three ways to generate tags

Next, we look at each of these three types of tags.

3.1.2. Professionally generated tags

There are a number of applications that are content rich and provide different kinds of content—articles, videos, photos, blogs—to their users. Vertical-centric medical sites, news sites, topic-focused group sites, or any site that has a professional editor generating content are examples of such sites. In these kinds of sites, the professional editors are typically domain experts, familiar with content domain, and are usually paid for their services. The first type of tags we cover is tags generated by such domain experts, which we call professionally generated tags.

Tags that are generated by domain experts have the following characteristics:

  • They bring out the concepts related to the text.
  • They capture the associated semantic value, using words that may not be found in the text.
  • They can be authored to be displayed on the user interface.
  • They can provide a view that isn’t centered around just the content of interest, but provides a more global overview.
  • They can leverage synonyms—similar words.
  • They can be multi-term phrases.
  • The set of words used can be controlled, with a controlled vocabulary.

Professionally generated tags require a lot of manpower and can be expensive, especially if a large amount of new content is being generated, perhaps by the users. These characteristics can be challenging for an automated algorithm.

3.1.3. User-generated tags

It’s now common to allow users to tag items. Tags generated by the users fall into the category of user-generated tags, and the process of adding tags to items is commonly known as tagging.

Tagging enables a user to associate freeform text to an item, in a way that makes sense to him, rather than using a fixed terminology that may have been developed by the content owner or created professionally.

Figure 3.2 shows the process of tagging at del.icio.us. Here, a user can associate any tag or keyword with a URL. The system displays a list of recommended and popular tags to guide the user.

Figure 3.2. Screenshot of how a user creates a tag at del.icio.us

The use of users to create tags in your application is a great example of leveraging the collective power of your users. Items that are popular will tend to be frequently tagged. From an intelligence point of view, for a user, what matters most is which items people similar to the user are tagging.

User-generated tags have the following characteristics:

  • They use terms that are familiar to the user.
  • They bring out the concepts related to the text.
  • They capture the associated semantic value, using words that may not be found in the text.
  • They can be multi-term phrases.
  • They provide valuable collaborative information about the user and the item.
  • They may include a wide variety of terms that are close in meaning.

User-generated tags will need to be stemmed to take care of plurals and filtered for obscenity. Since tags are freeform, variants of the same tag may appear. For example, collective intelligence and collectiveintelligence may appear as two tags. As shown in figure 3.2, you may want to offer recommended tags to the user based on the dictionary of tags created in your application and the first few characters typed by the user.[1] We discuss user-generated tags and tagging in more detail later in the next section.

1 This is commonly done using AJAX.

3.1.4. Machine-generated tags

Tags or terms generated through an automated algorithm are known as machine-generated tags. In section 2.2.3, we briefly reviewed the steps involved in generating tags using an automated algorithm. Section 4.3 has a working example of how tags can be generated by analyzing the text. We develop the toolkit to extract tags in chapter 8. An algorithm generates tags by parsing through text and detecting terms and phrases.

Machine-generated tags have the following characteristics:

  • They use terms that are contained in the text, with the exception of injected synonyms.
  • They’re usually single terms— Multi-term phrases are more difficult to extract and are usually done using a set of predefined phrases. These predefined phrases can be built using either professional or user-generated tags.
  • They can generate a lot of noisy tags— tags that can have multiple meanings based on the context, including polysemy[2]and homonyms.[3]— For example, the word gain can have a number of meanings—height gain, weight gain, stock price gain, capital gain, amplifier gain, and so on. Again, detecting multiple-term phrases, which are a lot more specific than single terms, can help solve this problem.

    2 A word that has multiple related meanings depending on the context.

    3 A word that has multiple unrelated meanings depending on the context.

In the absence of user-generated and professionally generated tags, machine-generated tags are the only alternative. This is especially true for analyzing user-generated content.

Chances are that you’ll be using tags for both dynamic navigation in your application and for representing metadata. The next section covers some tips on how to tag items that I’ve learned from my experience.

3.1.5. Tips on tagging

Here are a few guidelines for how to create tags:

  • If possible, build a tag dictionary for your product— Further tags can be organized in a “is-a” hierarchy. While tagging, build a synonym dictionary—a set of tags that have the same meaning. Leverage this dictionary to extract phrases while analyzing content and inject synonyms automatically while parsing text (see chapter 8).
  • Avoid tags that can have multiple meanings based on their context— For example, gain can have multiple meanings as discussed in the previous section. Use a more qualified phrase, such as weight gain or stock price gain.
  • Don’t use multiple tags to capture both singular and plurals, since the tags will be stemmed— Similarly, don’t worry about capitalization—all tags will be converted to lowercase.
  • Multi-term phrases are rarer in occurrence and therefore can give highly relevant matches— To detect multi-term phrases using an automated algorithm, you may need to use a phrase dictionary. This is discussed more in chapter 8.
  • For a system that uses only professionally generated tags, the weight associated with each tag is related to the number of other tags used to describe the item and the inverse-document-frequency (idf) for the tag— Every additional tag can dilute the weight of the other tags for an item. Use adequate but only relevant tags for an item.
  • You can use a combination of the three sources of tag generation.
  • While tagging, build a synonym dictionary—a set of tags that have the same meaning.

So far we’ve looked at the three forms of tagging and some pointers on how to tag. You may wonder what motivates a user to tag information; we look at this next.

3.1.6. Why do users tag?

In its most basic form, people may tag things so as to organize items of interest and remember them. For example, if you have a large number of files on your computer, you normally create folders to organize your files. Tagging is similar—you create categories or buckets and associate them with the item of interest. An item can be tagged with multiple labels, as shown in figure 3.3, a screenshot from Amazon.com, where users are allowed to add multiple tags to an item. By tagging items, users can use tags that make sense to them and don’t have to use the classification of the content owner or site.

Figure 3.3. Amazon allows users to tag a product and see how others have tagged the same product.

Users also tag items so that they can share the information with others, find related items that others have tagged in the same category, and also when they want to be found by others (mainly in social-networking sites).

As shown in figure 3.3, let’s assume that a user is interested in management and has placed a management tag for his items of interest. Over a period of time, if the user would like to see all items related to management, he can look at all items that have been tagged with management. Tags help users organize and find items of interest.

Another feature that’s actively used, especially for social networking applications, is allowing users to explore tags. In our example, any user would be able to click on the management link and see other items that have been tagged by other users with the same tag. This helps them discover new items of interest. User can also look at which users have used the same tag, thus helping them find other users with similar interests.

You can also find similar tags. The same item may have been tagged by others with different keywords. By following the links, you can associate similar tags. For example, in figure 3.3, the same item has been tagged as social networks, sociology, management, marketing, and so on. These four tags may be similar, especially if they’re repeated consistently across multiple items.

So far, we’ve looked at what tags are and how they’re generated. At the beginning of the section, we listed how tags can be used in your application. Next, we go through each use case in more detail.

3.2. How to leverage tags

It’s useful to build metadata by analyzing the tags associated with an item and placed by a user. This metadata can then be used to find items and users of interest for the user. In addition to this, tagging can be useful to build dynamic navigation in your application, to target search, and to build folksonomies. In this section, we briefly review these three use cases.

3.2.1. Building dynamic navigation

In early applications, content would be categorized by professional authors. A user would navigate to the content by first selecting a category and then drilling down the category tree until the content was found. Tags provide an alternative to this rigid categorization, with a tag cloud as one manifestation of this dynamic navigation. Hyperlinks within content are another manifestation of this dynamic navigation. It’s now fairly common for applications to hyperlink cities, phone numbers, and certain keywords as they display content on their sites. You must have seen this kind of hyperlinking, perhaps in your Yahoo! or Gmail email account.

As shown in figure 3.4, a tag cloud shows a list of tags, typically arranged alphabetically, with the font size representing the tag’s frequency of use. The bigger the font in a tag cloud, the more frequently it’s used. Some tag clouds also use different font colors in their tag clouds.

Figure 3.4. Tag cloud from squidoo.com

Figure 3.5 shows the tag cloud of all-time most popular tags at Flickr. There are a number of interesting things that we can learn from this tag cloud. First, tags are not stemmed—plurals are treated as separate tags. For example, animal and animals are treated as separate tags, as are flower and flowers.

Figure 3.5. Tag cloud of all-time most popular tags at Flickr

Note that the tag cloud here has only single-term tags—new, york, and newyork are three separate tags. The tag San could be a part of San Diego, San Jose, or San Francisco. Clearly, in this case, the content parsing isn’t intelligent enough to extract the content in a way that makes sense.

Toward the center of the tag cloud, you’ll see the tag nyc. Clearly, nyc is a synonym for new york city, but these two phrases are treated as separate tags. To properly leverage tags, we need to do the following:

  • Stem tags to take care of plurals.
  • Detect multi-term phrases.
  • Handle synonym tags.

The process of stemming, detecting phrases, and injecting synonyms is illustrated in section 4.3, and the infrastructure for such an analysis is developed in chapter 8.

A tag cloud needs terms and their relative weights, and that’s what’s contained in a term vector! Whereas a term vector is one way to represent metadata, a tag cloud is a visual representation of the metadata associated with an item or a user.

A tag cloud can be created for each user using her metadata—her term vectors—whether that metadata is learned from the content or from a collaborative-based approach or a combination of the two. In essence, a tag cloud associated with a user or an item is dynamic—it’s dependent on the tags that users have assigned to it or that the user has visited.

When a user clicks on a tag in a tag cloud, you have the context for the tag—the associated metadata—and this can be converted into a call to the recommendation engine, where you search for content that’s related to the user and the tag of interest. We look at building a tag cloud in detail in section 3.4.

3.2.2. Innovative uses of tag clouds

At the beginning of the chapter, we mentioned the use of tags to create dynamic navigation. Every day, more and more personalized tag clouds are appearing in applications. The tags in these clouds are being generated in one or more of three ways: professional, automated, or user-generated. Figure 3.6 shows the basic strategy used. Use the set of contents associated with the user—visited, subscribed to, and so on as shown on the right side of the figure—to get a combined term vector and then display it to the user. Remember that a tag cloud in essence is a visualization of a term vector.

Figure 3.6. Combining term vectors from a number of documents to form a tag cloud

Here are a couple of other interesting examples of using tag clouds. John Herren provides a good example of how different APIs can be combined to build powerful applications. He built a prototype application, Yahoo! News Tag Soup (http://yahoo.theherrens.com/index.php), that combines Yahoo!’s content analysis web service with Yahoo! News feeds. The service extracts keywords from the news article, using Yahoo!’s content analysis web service. He then uses that information to build a dynamic tag cloud. This idea was further developed to provide users with their own personal tag cloud at TagCloud.com (http://www.tagcloud.com/). At this site, users can register themselves, and based on the news feeds that they register for, a personal tag cloud is generated for each user.

ZoomCloud (http://zoomclouds.egrupos.net/) provides the capability to add a tag cloud to your application. You can create a tag cloud on the ZoomCloud site by feeding in an RSS feed. ZoomCloud provides a template to customize the look-and-feel of the cloud and code that you can embed in your site. The tag cloud is hosted on their site, and can be customized for your blog or application.

In the future, look for more personalized tag clouds that combine multiple sources of information.

3.2.3. Targeted search

In June 2005, in its quest to challenge Google’s search engine, Yahoo! launched its new search technology called MyRank. Google’s search engine, which holds 50–60 percent[4] of the search market share, is based on the PageRank algorithm—the number of connections to a page, indicating the importance of the page. MyRank instead taps into the collective intelligence generated by a community of users. MyRank powers Yahoo!’s community, MyWeb 2.0. Users can save copies of web pages of interest in their personal cache and tag them. Search results for each user are dependent on the items saved and tagged by that user and his community of friends; results depend on the “quality” of a user’s community and the pages they save and tag. Now let’s look at the use of tags beyond MyRank.

4 Nelsen/Netratings gave the number as 59% for the month of May 2008: http://www.nielsen-netratings.com/pr/pr_080619V.pdf.

Just like Google’s page rank system looks at the number of links to a page as a metric to quantify the page’s importance in its search engine, tags provide a similar metric. If an article is being tagged by many users using the same tag, it’s probably very relevant to that topic and of interest to other users.

As shown in figure 3.7, you can combine the tag’s context with the user’s metadata as a query to the search engine to get relevant results for the user.

Figure 3.7. Using a tag, the context that it appears in, and user metadata to get relevant results from a search engine

An example illustrates this approach well. Let’s say that a tag Spring appears in a tag cloud that’s used for navigating books in the application. Further, the user’s profile identifies her as a developer. Here, the context of the query is books, while the metadata associated with the user is from her profile—developer. Relevant results can be shown to the user when she clicks on the tag Spring by making the following query to the search engine: Books Spring Developer.

3.2.4. Folksonomies and building a dictionary

User-generated tags provide an ad hoc way of classifying items, in a terminology that’s relevant to the user. This process of classification, commonly known as folksonomies, enables users to retrieve information using terms that they’re familiar with. There are no controlled vocabularies or professionally developed taxonomies.

The word folksonomy combines the words folk and taxonomy. Blogger Thomas Vander Wal is credited with coining the term.

Folksonomies allow users to find other users with similar interests. A user can reach new content by visiting other “similar” users and seeing what other content is available. Developing controlled taxonomies, as compared to folksonomies, can be expensive both in terms of time spent by the user using the rigid taxonomy, and in terms of the development costs to maintain it. Through the process of user tagging, users create their own classifications. This gives useful information about the user and the items being tagged.

The tags associated with your application define the set of terms that can be used to describe the user and the items. This in essence is the vocabulary for your application. Folksonomies are built from user-generated tags. Automated algorithms have a difficult time creating multi-term tags. When a dictionary of tags is available for your application, automated algorithms can use this dictionary to extract multi-term tags. Well-developed ontologies, such as in the life sciences, along with folksonomies are two of the ways to generate a dictionary of tags in an application.

Now that we’ve looked at how tags can be used in your application, let’s take a more detailed look at user tagging.

3.3. Extracting intelligence from user tagging: an example

In this section, we illustrate the process of extracting intelligence from the process of user tagging. Based on how users have tagged items, we provide answers to the following three questions:

  • Which items are related to another item?
  • Which items might a user be interested in?
  • Given a new item, which users will be interested in it?

To illustrate the concepts let us look at the following example. Let’s assume we have two users: John and Jane, who’ve tagged three articles: Article1, Article2, and Article3, as follows:

  • John has tagged Article1 with the tags apple, fruit, banana
  • John has tagged Article2 with the tags orange, mango, fruit
  • Jane has tagged Article3 with the tags cherry, orange, fruit

Our vocabulary for this example consists of six tags: apple, fruit, banana, orange, mango, and cherry. Next, we walk through the various steps involved in converting this information into intelligence. Lastly, we briefly review why users tag items.

Let the number of users who’ve tagged each of the items in the example be given by the data in table 3.1. Let each tag correspond to a dimension. In this example, each item is associated with a six-dimensional vector. For your application, you’ll probably have thousands of unique tags. Note the last column, normalizer, shows the magnitude of the vector. The normalizer for Article1 is computed as √42+82+62+32 = 11.18.

Table 3.1. Raw data used in the example
 

apple

fruit

banana

orange

mango

cherry

normalizer

Article1 4 8 6 3     11.18
Article2   5   8 5   10.68
Article3 1 4   3   10 11.22

Next, we can scale the vectors so that their magnitude is equal to 1. Table 3.2 shows the normalized vectors for the three items—each of the terms is obtained by dividing the raw count by the normalizer. Note that the sum of the squares of each term after normalization will be equal to 1.

Table 3.2. Normalized vector for the items
 

apple

fruit

banana

orange

mango

cherry

Article1 .3578 .7156 .5367 .2683    
Article2   .4682   .7491 0.4682  
Article3 .0891 .3563   .2673   .891

3.3.1. Items related to other items

Now we answer the first of our questions: which items are related to other items?

To find out how “similar” or relevant each of the items are, we take the dot product for each of the item’s vector to obtain table 3.3. This in essence is an item-to-item recommendation engine.

Table 3.3. Similarity matrix between the items
 

Article1

Article2

Article3

Article1 1 .5360 .3586
Article2 .5360 1 .3671
Article3 .3586 .3671 1

To get the relevance between Article1 and Article2 we took the dot product:

(.7156 * .4682 + .2683 * .7491) = .536

According to this, Article2 is more relevant to Article1 than Article3.

3.3.2. Items of interest for a user

This item-to-item list is the same for all users. What if you wanted to take into account the metadata associated with a user to tailor the list to his profile? Let’s look at this next.

Based on how users tagged items, we can build a similar matrix for users, quantifying what items they’re interested in as shown in table 3.4. Again, note the last column, which is the normalizer to convert the vector into a vector of magnitude 1.

Table 3.4. Raw data for users
 

apple

fruit

banana

orange

mango

cherry

normalizer

John 1 2 1 1 1   2.83
Jane   1   1   1 1.73

The normalized metadata vectors for John and Jane are shown in table 3.5.

Table 3.5. The normalized metadata vector for the two users
 

apple

fruit

banana

orange

mango

cherry

John .3536 .7071 .3536 .3536 .3536  
Jane   .5773   .5773   .5773

Now we answer our second question: which items might a user be interested in?

To find out how relevant each of the items are to John and Jane, we take the dot product of their vectors. This is shown in table 3.6.

Table 3.6. Similarity matrix between users and items
 

Article1

Article2

Article3

John .917 .7616 .378
Jane .568 .703 .8744

As expected in our fictitious example, John is interested in Article1 and Article2, while Jane is most interested in Article3. Based on how the items have been tagged, she is also likely to be interested in Article2.

3.3.3. Relevant users for an item

Next, we answer the last question: given a new item, which users will be interested in it?

When a new item appears, the group of users who could be interested in that item can be obtained by computing the similarities in the metadata for the new item and the metadata for the set of candidate users. This relevance can be used to identify users who may be interested in the item.

In most practical applications, you’ll have a large number of tags, items, and users. Next, let’s look at how to build the infrastructure required to leverage tags in your application. We begin by developing the persistence architecture to represent tags and related information.

3.4. Scalable persistence architecture for tagging

Web 2.0 applications invite users to interact. This interaction leads to more data being available for analysis. It’s important that you build your application for scale. You need a strong foundation to build features for representing metadata with tags, representing information in the form of tag clouds, and building metadata about users and items. In this section, we concentrate on developing the persistence model for tagging in your application. Again, the code for the database schemas is downloadable from the download site.

This section draws from previous work done in the area of building the persistence architecture for tagging, but generalizes it to the three forms of tags and illustrates the concepts via examples.

In chapter 2, we had two main entities: user and item. Now we introduce two new entities: tags and tagging source. As shown in figure 3.8, all the tags are represented in the tags table, while the three sources of producing tags—professional, user, and automated—are represented in the tagging_source table.

Figure 3.8. The tags and tagging_source database tables

The tags table has a unique index on the tag_text column: there can be only one row for a tag. Further, there may be additional columns to describe the tag, such as stemmed_text, which will help identify duplicate tags, and so forth.

Now let’s look at developing the tables for a user tagging an item. There are a number of approaches to this. To illustrate the benefits of the proposed design, I’m going to show you three approaches, with each approach getting progressively better. The schema also gets progressively more normalized. If you’re familiar with the principles of database design, you can go directly to section 3.4.2.

3.4.1. Reviewing other approaches

To understand some of the persistence schemas used for storing data related to user tagging, we use an example. Let’s consider the problem of associating tags with URLs; here the URL is the item. In general, the URL can be any item of interest, perhaps a product, an article, a blog entry, or a photo of interest. MySQLicious, Scuttle, and Toxi are the three main approaches that we’re using.

I’ve always found it helpful to have some sample data and represent it in the persistence design to better understand the design. For our example, let a user bookmark three URLs and assign them names and place tags, as shown in table 3.7.[5]

5 The URLs are also reference to sites where you can find more information to the persistence architectures: MySQLicious, Scuttle, and Toxi.

Table 3.7. Data used for the bookmarking example

Url

Name

Tags

http://nanovivid.com/projects/mysqlicious/ MySQLicious Tagging schema denormalized
http://sourceforge.net/projects/scuttle/ Scuttle Database binary schema
http://toxi.co.uk/ Toxi Normalized database schema
MySQLicious

The first approach is the MySQLicious approach, which consists of a single denormalized table, mysqlicious, as shown in figure 3.9. The table consists of an autogenerated primary key, with tags stored in a space-delimited manner. Figure 3.8 also shows the sample data for our example persisted in this schema. Note the duplication of database and schema tags in the rows. This approach also assumes that tags are single terms.

Figure 3.9. The MySQLicious schema with sample data

Now, let’s look at the SQL you’d have to write to get all the URLs that have been tagged with the tag database.

Select url from mysqlicious where tags like "%database%"

The query is simple to write, but “like” searches don’t scale well. In addition, there’s duplication of tag information. Try writing the query to get all the tags. This denormalized schema won’t scale well.

 

Tip

Avoid using space-delimited strings to persist multiple tags; you’ll have to parse the string every time you need the individual tags and the schema won’t scale. This doesn’t lend well to stemming words, either.

 

Next, let’s improve on this solution by looking at the second approach: the Scuttle approach.

Scuttle Solution

The Scuttle solution uses two tables, one for the bookmark and the other for the tags, as shown in figure 3.10. As shown, each tag is stored in its own row.

Figure 3.10. Scuttle representation with sample data

The SQL to get the list of URLs that have been tagged with database is much more scalable than for the previous design and involves joining the two tables:

Select b.url from scuttle_bookmark b, scuttle_tags t where
   b.bookmark_id = t.bookmark_id and
   t.tag = 'database' group by b.url

The Scuttle solution is more normalized than MySQLicious, but note that tag data is still being duplicated.

Next, let’s look at how we can further improve our design. Each bookmark can have multiple tags, and each tag can have multiple bookmarks. This many-to-many relationship is modeled by the next solution, known as Toxi.

Toxi

The third approach that’s been popularized on the internet is the Toxi solution. This solution uses three tables to represent the many-to-many relationship, as shown in figure 3.11. There’s no longer duplication of data. Note that the toxi_bookmark table is the same as the scuttle_bookmark table.

Figure 3.11. The normalized Toxi solution with sample data

So far in this section, we’ve shown three approaches to persisting tagging information. Each gets progressively more normalized and scalable, with Toxi being the closest to the recommended design. Next, we look at the recommended design, and also generalize the design for the three forms of tags: professionally generated, user-generated, and machine-generated.

3.4.2. Recommended persistence architecture

The scalable architecture presented here is similar to the one presented at MySQLForge called TagSchema, and the one presented by Jay Pipes in his presentation “Tagging and Folksonomy Schema Design for Scalability and Performance.” We generalize the design to handle the three kinds of tags and illustrate the design via an example.

Let’s begin by looking at how to handle user-generated tags. We use an example to explain the schema and illustrate how commonly used queries can be formed for the schema.

Schema for User-Generated Tags

Let’s continue with the same example that we began with at the beginning of section 3.3.2. Let’s add the user dimension to the example—there are users who are tagging items. We also generalize from bookmarks to items.

In our example, John and Jane are two users:

  • John has tagged item1 with the tags tagging, schema, denormalized
  • John has tagged item2 with the tags database, binary, schema
  • Jane has tagged item3 with the tags normalized, database, schema

As shown in figure 3.12, there are three entities—user, item, and tags. Each is represented as a database table, and there is a fourth table, a mapping table, user_item_tag.

Figure 3.12. The recommended persistence schema designed for scalability and performance

Let’s look at how the design holds up to two of the common use cases that you may apply to your application:

  • What other tags have been used by users who have at least one matching tag?
  • What other items are tagged similarly to a given item?

As shown in figure 3.13 we need to break this into three queries:

  1. First, find the set of tags used by a user, say John.
  2. Find the set of users that have used one of these tags.
  3. Find the set of tags that these users have used.
Figure 3.13. Nesting queries to get the set of tags used

Let’s write this query for John, whose user_id is 1. The query consists of three main parts.

First, let’s write the query to get all of John’s tags. For this, we have to inner-join tables user_item_tag and tags, and use the distinct qualifier to get unique tag IDs.

Select distinct t.tag_id, t.tag_text from tags t, user_item_tag uit where
   t.tag_id = uit.tag_id and uit.user_id = 1;

If you run this query, you’ll get the set (tagging, schema, denormalized, database, binary).

Second, let’s use this query to find the users who’ve used one of these tags, as shown in listing 3.1.

Listing 3.1. Query for users who have used one of John’s tags

Note that the first query:

Select distinct t.tag_id, t.tag from tags t, user_item_tag uit where
   t.tag_id = uit.tag_id and uit.user_id = 1

is a subquery in this query. The query selects the set of users and will return user_ids 1 and 2.

Third, the query to retrieve the tags that these users have used is shown in listing 3.2

Listing 3.2. The final query for getting all tags that other users have used
Select uit3.tag_id, t3.tag_id, count(*) from user_item_tag uit3, tags t3
whereuit3.tag_id = t3.tag_id and uit3.user_id
   in (Select distinct uit2.user_id from user_item_tag uit2, tags t2
where uit2.tag_id = t2.tag_id and
   uit2.tag_id in (Select distinct t.tag_id from tags t, user_item_tag uit
      where t.tag_id = uit.tag_id and uit.user_id = 1) )
   group by uit3.tag_id

Note that this query was built by using the query developed in listing 3.1. The query will result in six tags, which are shown in table 3.8, along with their frequencies.

Table 3.8. The result for the query to find other tags used by user 1

tag_id

tag_text

count(*)

1 tagging 1
2 schema 3
3 denormalized 1
4 database 2
5 binary 1
6 normalized 1

Now let’s move on to the second question: what other items are tagged similarly to a given item? Let’s find the other items that are similarly tagged to item1.

First, let’s get the set of tags related to item1, which has an item_id of 1—this set is (tagging, schema, normalized):

Select uit.tag_id from user_item_tag uit, tags t where
   uit.tag_id = t.tag_id and
   uit.item_id = 1

Next, let’s get the list of items that have been tagged with any of these tags, along with the count of these tags:

Select uit2.item_id, count(*) from user_item_tag uit2 where
 uit2.tag_id in (Select uit.tag_id from user_item_tag uit, tags t where
  uit.tag_id = t.tag_id and uit.item_id = 1)
  group by uit2.item_id

This will result in table 3.9, which shows the three items with the number of tags.

Table 3.9. Result of other items that share a tag with another item

item_id

count(*)

Tags

1 3 tagging, schema, normalized
2 1 schema
3 1 schema

So far, we’ve looked at the normalized schema to represent a user, item, tags, and users tagging an item. We’ve shown how this schema holds for two commonly used queries. In chapter 12, we look at more advanced techniques—recommendation engines—to find related items using the way items have been tagged.

Next, let’s generalize the design from user tagging to also include the other two ways of generating tags: professionally and machine-generated tags.

Schema for Professionally and Machine-Generated Tags

We add a new table, item_tag, to capture the tags associated with an item by professional editors or by an automated algorithm, as shown in figure 3.14. Note that there’s also a weight column—this table is in essence storing the metadata related with the item.

Figure 3.14. Table to store the metadata associated with an item via tags

Finding tags and their associated weights for an item is simply with this query:

Select tag_id, weight from item_tag
 where item_id = ? and
tagging_source_id = ?

In this section, we’ve developed the schema for persisting tags in your application. Now, let’s look at how we can apply tags to your application. We develop tag clouds as an instance of dynamic navigation, which we introduced in section 3.1.4.

3.5. Building tag clouds

In this section, we look at how you can build tag clouds in your application. We first extend the persistence design to support tag clouds. Next, we review the algorithm to display tag clouds and write some code to implement a tag cloud.

3.5.1. Persistence design for tag clouds

For building tag clouds, we need to get a list of tags and their relative weights. The relative weights of the terms are already captured in the item_tag table for professionally generated and machine-generated tags. For user tagging, we can get the relative weights and the list of tags for the tag cloud with this query:

Select t.tag, count(*) from user_item_tag uit, tags t where
   Uit.tag_id = t.tag_id group by t.tag

This results in table 3.10, which shows the six tags and their relative frequencies for the example in section 3.3.3.

Table 3.10. Data for the tag cloud in our example

tag_text

count(*)

tagging 1
schema 3
denormalized 1
database 2
binary 1
normalized 1

The use of count(*) can have a negative effect on scalability. This can be eliminated by using a summary table. Further, you may want to get the count of tags based on different time windows. To do this, we add two more tables, tag_summary and days, as shown in figure 3.15. The tag_summary table is updated on every insert in the user_ item_tag table.

Figure 3.15. The addition of summary and days tables

The tag cloud data for any given day is given by the following:

select t.tag, ts.number from tags t, tag_summary ts where
   t.tag_id = ts.tag_id and
   ts.day = 'x'

To get the frequency over a range of days, you have to use the sum function in this design:

select t.tag, sum(ts.number) from tag tags t, tag_summary ts where
   t.tag_id = ts.tag_id and
   ts.day > 't1' and ts.day <'t2' group by t.tag

When a user clicks on a particular tag, we need to find out the list of items that have been tagged with the tag of interest. There are a number of approaches to showing results when a user clicks on a tag. The tag value could be used as an input to a search engine or recommendation engine, or we can query the userItemTag or the itemTag tables. The following query retrieves items from the userItemTag table:

select uit.item_id, count(*) from user_item_tag uit where
   uit.tag_id = 'x' group by uit.item_id

Similarly, for professional and automated algorithm generated tags we can write the query

select item_id from item_tag where tag_id = ? order by weight desc

Since we’ve developed the database query for building the tag cloud, let’s next look at how we can build a tag cloud after we have access to a list of tags and their frequency.

3.5.2. Algorithm for building a tag cloud

There are five steps involved in building a tag cloud:

1.  The first step in displaying a tag cloud is to get a list of tags and their frequencies—a list of <Tag name, frequency>.

2.  Next, compute the minimum and maximum occurrence of each tag. Let’s call these numberMin and numberMax.

3.  Decide on the number of font sizes that you want to use; generally this number is between 3 and 20. Let’s call this number numberDivisions.

4.  Create the ranges for each font size. The formula for this is

For i = 1 to numberDivisions
rangeLow = numberMin + (i – 1) * (numberMax – numberMin)/ numberDivisions
high = numberMin + i*( numberMax - numberMin)/ numberDivisions

For example, if numberMin, numberMax, and numberDivisions are (20, 80, 3), the ranges are (20–40, 40–60, 60–80).

5.  Use a CSS stylesheet for the fonts and iterate over all the items to display the tag cloud.

Though building a tag cloud is simple, it can be quite powerful in displaying the information. Kevin Hoffmann, in his paper “In Search of ... The Perfect Tag Cloud,” proposes a logarithmic function—take the log of the frequency and create the buckets for the font size based on their log value—to distribute the font size in a tag cloud.

In my experience, when the weights for the tags have been normalized (when the sum of squared values is equal to one), the linear scaling works fairly well, unless the min or the max values are too skewed from the other values.

Implementing a tag cloud is straightforward. It’s now time to roll up our sleeves and write some code, which you can use in your application to implement a tag cloud and visualize it.

3.5.3. Implementing a tag cloud

Figure 3.16 shows the class diagram for implementing a tag cloud. We also use this code later on in chapter 8. We use the Strategy[6] design pattern to factor out the scaling algorithm used to compute the font size. It’s also helpful to define interfaces TagCloud and TagCloudElement, as there can be different implementations for them.

6 Gang of Four—Strategy pattern

Figure 3.16. Class design for implementing a tag cloud

The remaining part of this section gets into the details of implementing the code related to developing a tag cloud. Figure 3.16 shows the classes that we develop in this section.

TagCloud

First, let’s begin with the TagCloud interface, which is shown in listing 3.3.

Listing 3.3. The TagCloud interface
package com.alag.ci.tagcloud;

import java.util.List;

public interface TagCloud {
   public List<TagCloudElement> getTagCloudElements();
}

This is simple enough, and has one method to get the List of TagCloudElements.

TagCloudElement

The TagCloudElement interface corresponds to a tag and contains methods to get the tag text, the tag weight, and the computed font size. This is shown in listing 3.4.

Listing 3.4. The TagCloudElement interface

The TagCloudElement interface extends the Comparable interface, which allows Tag-Cloud to return these elements in a sorted manner. I’ve used a String for the font size, as the computed value may correspond to a style sheet entry in your JSP. Also a double is used for the getWeight() method.

FontSizeComputationStrategy

The FontSizeComputationStrategy interface has only one method, as shown in listing 3.5.

Listing 3.5. The FontSizeComputationStrategy interface
package com.alag.ci.tagcloud;

import java.util.List;

public interface FontSizeComputationStrategy {
   public void computeFontSize(List<TagCloudElement> elements);
}

The method

void computeFontSize(List<TagCloudElement> elements);

computes the font size for a given List of TagCloudElements.

TagCloudImpl

TagCloudImpl implements the TagCloud and is fairly simple, as shown in listing 3.6.

Listing 3.6. Implementation of TagCloudImpl

It has a list of TagCloudElements and delegates the task of computing the font size to FontSizeComputationStrategy, which is passed in its constructor. It also sorts the List<TagCloudElement> elements alphabetically.

TagCloudElementImpl

TagCloudElementImpl is shown in listing 3.7.

Listing 3.7. The implementation of TagCloudElementImpl

TagCloudElementImpl is a pure bean object that implements the Comparable interface for alphabetical sorting of tag texts as shown in listing 3.7.

FontSizeComputationStrategyImpl

The implementation for the base class FontSizeComputationStrategyImpl is more interesting and is shown in listing 3.8.

Listing 3.8. Implementation of FontSizeComputationStrategyImpl

This takes in the number of font sizes to be used and the prefix to be set for the font. In your application, there might be an enumeration of fonts and you may want to use Enum for the different fonts. I’ve made the class abstract to force the inheriting classes to overwrite the scaleCount method, as shown in figure 3.16.

The method computeFontSize first gets the minimum and the maximum and then computes the bucket for the font size using the following:

  for (TagCloudElement tce: elements) {
     int index = (int) Math.floor((scaleCount(tce.getWeight()) –
        minscaled)/diff);
    if (Math.abs(tce.getWeight() - maxCount) < PRECISION){
        index = this.numSizes - 1;
    }
    tce.setFontSize(this.prefix + index);
  }
}

To understand the formula used to calculate the font index, let, x be the scaled value of the number of times a tag appears then that tag falls in bin n, where

Note that when x is the same as maxscaled, n is numSizes. This is why there’s a check for maxCount:

if (tce.getWeight() == maxCount) {

This implementation is more efficient than creating an array with the ranges for each of the bins and looping through the elements.

Extending FontSizeComputationStrategyImpl

Lastly, the two classes extending FontSizeComputationStrategyImpl simply need to implement the scaleCount method and have a constructor that calls super, as shown in figure 3.17.

Figure 3.17. The class diagram for FontSizeComputationStrategy

First, let’s look at the implementation of LinearFontSizeComputationStrategy, which simply overrides the scaleCount method:

protected double scaleCount(double count) {
   return count;
}

Similarly, LogFontSizeComputationStrategy implements the same method as the following:

protected double scaleCount(double count) {
   return Math.log10(count);
}

You can implement your own variant of the FontSizeComputationStrategy by simply overwriting the scaleCount method. Some other strategies that you may want to consider are using clustering (see chapter 9) or assigning the same number of items (or nearly the same) for each of the font sizes. For this, sort the items by weight and assign the items to the appropriate bins.

Now that we’ve implemented a tag cloud, we need a way to visualize it. Next, we develop a simple class to generate HTML to display the tag cloud.

3.5.4. Visualizing a tag cloud

We use the Decorator design pattern, as shown in figure 3.18, to define an interface VisualizeTagCloudDecorator. It takes in a TagCloud and generates a String representation.

Figure 3.18. Using the Decorator pattern to generate HTML to represent the tag cloud

The code for VisualizeTagCloudDecorator is shown in listing 3.9.

Listing 3.9. VisualizeTagCloudDecorator interface
package com.alag.ci.tagcloud;

public interface VisualizeTagCloudDecorator {
   public String decorateTagCloud(TagCloud tagCloud);
}

There’s only one method to create a String representation of the TagCloud:

public String decorateTagCloud(TagCloud tagCloud);

Let’s write a concrete implementation of HTMLTagCloudDecorator, which is shown in listing 3.10.

Listing 3.10. Implementation of HTMLTagCloudDecorator

Here, the title of the generated page is hard-coded to TagCloud:

private static final String HEADER_HTML =
      "<html><br><head><br><title>TagCloud <br></title><br></head>";

The method getFontMap() simply creates a Map of font strings that will be used:

private void getFontMap() {
   this.fontMap = new HashMap<String,String>();
   fontMap.put("font-size: 0", "font-size: 13px");
   //... other font mapping
}

For your application, you’ll probably read this mapping from an XML file or from the database.

The rest of the code generates the HTML for displaying the tag cloud:

for (TagCloudElement tce : elements) {
   sw.append("&nbsp;<a style=\""+
       fontMap.get(tce.getFontSize())+";\">" );
   sw.append(tce.getTagText() +"</a>&nbsp;");
   if (count++ == NUM_TAGS_IN_LINE) {
      count = 0;
      sw.append("<br>" );
   }
}

A simple test program is shown in listing 3.11. The asserts have been removed to make it easier to read. This code creates a TagCloud and creates an HTML file to display it.

Listing 3.11. Sample code for generating tag clouds
package com.alag.ci.tagcloud.test;

import java.io.*;
import java.util.*;

import com.alag.ci.tagcloud.*;
import com.alag.ci.tagcloud.impl.*;

import junit.framework.TestCase;

public class TagCloudTest extends TestCase {

   public void testTagCloud() throws Exception {
      String firstString = "binary";
      int numSizes = 3;
      String fontPrefix = "font-size: ";

      List<TagCloudElement> l = new ArrayList<TagCloudElement>();
      l.add(new TagCloudElementImpl("tagging",1));
      l.add(new TagCloudElementImpl("schema",3));
      l.add(new TagCloudElementImpl("denormalized",1));
      l.add(new TagCloudElementImpl("database",2));
      l.add(new TagCloudElementImpl(firstString,1));
      l.add(new TagCloudElementImpl("normalized",1));

      FontSizeComputationStrategy strategy =
         new LinearFontSizeComputationStrategy(numSizes,fontPrefix);
      TagCloud cloudLinear = new TagCloudImpl(l,strategy);
      System.out.println(cloudLinear);

      strategy = new LogFontSizeComputationStrategy(numSizes,fontPrefix);
      TagCloud cloudLog = new TagCloudImpl(l,strategy);
      System.out.println(cloudLog);

      //write to file
      String fileName = "testTagCloudChap3.html";
      writeToFile(fileName,cloudLinear);
   }
   private static void writeToFile(String fileName, TagCloud cloud)
      throws IOException {
      BufferedWriter out = new BufferedWriter(
            new FileWriter(fileName));
      VisualizeTagCloudDecorator decorator = new HTMLTagCloudDecorator();
      out.write(decorator.decorateTagCloud(cloud));
      out.close();
   }
}

A TagCloud is created by the following code:

     List<TagCloudElement> l = new ArrayList<TagCloudElement>();
     l.add(new TagCloudElementImpl("tagging",1));
....
     FontSizeComputationStrategy strategy =
        new LinearFontSizeComputationStrategy(numSizes,fontPrefix);
     TagCloud cloudLinear = new TagCloudImpl(l,strategy);

The method writeToFile simply writes the generated HTML to a specified file:

BufferedWriter out = new BufferedWriter(
      new FileWriter(fileName));
VisualizeTagCloudDecorator decorator = new HTMLTagCloudDecorator();
out.write(decorator.decorateTagCloud(cloud));
out.close();

Figure 3.19 shows the tag cloud developed for our example.[7] Note that schema has the biggest font, followed by database.

7 Both the linear and logarithmic functions gave the same font sizes for this simple example when three font sizes were used, but they were different when five were used.

Figure 3.19. The tag cloud for our example

In this section, we developed code to implement and visualize a tag cloud. Next, let’s look at a few interesting topics related to tags that you may run into in your application.

3.6. Finding similar tags

As of February 2007, 35 percent[8] of all posts tracked by Technorati used tags. As of October 2006, Technorati was tracking 10.4 million tags. There were about half a million unique tags in del.icio.us, as of October 2005, with each item averaging about two tags. Given the large number of tags, a good question is how to find tags that are related to each other—tags that are synonymous or that show a parent-child relationship. Building this manually is too expensive and nonscalable for most applications.

8http://technorati.com/weblog/2007/04/328.html

A simple approach to finding similar tags is to stem—convert the word into its root form—to take care of differences in tags due to plurals after removing stop words—commonly occurring words. Having a synonym dictionary also helps keep track of tags that are similar. When dealing with multi-term phrases, two tags could be similar but may have their terms in different positions. For example, weight gain and gain weight are similar tags.

Another approach is to analyze the co-occurrences of tags. Table 3.11 shows data that can be used for this analysis. Here, the rows correspond to tags and the columns are the items in your system. There’s a 1 if an item has been tagged with that tag. Note the similarity to the table we looked at in section 2.4. You can use the correlation similarity computation to find correlated tags. Matrix dimensionality reduction using Latent Semantic Indexing (LSI) is also used (see section 12.3.3). LSI has been used to solve the problems of synonymy and polysemy.

Table 3.11. Bookmarking data for analysis
 

Item 1

Item 2

Item 3

Tag1 1    
Tag2   1 1
Tag3 1   1

When finding items relevant to a tag, don’t forget to first find a similar set of tags to the tag of interest and then find items related to the tag by querying the item_tag table.

3.7. Summary

Tagging is the process of adding freeform text, either words or small phrases, to items. These keywords or labels can be attached to anything—another user, photos, articles, bookmarks, products, blog entries, podcasts, videos, and more. Tagging enables users to associate freeform text with an item, in a way that makes sense to them, rather than using a fixed terminology that may have been developed by the content owner.

There are three ways to generate tags: have professional editors create tags, allow users to tag items, or have an automated algorithm generate tags. Tags serve as a common vocabulary to associate metadata with users and items. This metadata can be used for personalization and for targeting search to a user.

User-centric applications no longer rigidly categorize items. They offer dynamic navigation, which is built from tags to their users. A tag cloud is one example of dynamic navigation. It visually represents the term vector—tags and their relative weights. We looked at how tags can be persisted in your application and how you can build a tag cloud.

In the next chapter, we look at the different kinds of content that are used in application and how they can be abstracted from an analysis point of view. We also demonstrate the process of generating a term vector from text using a simple example.

3.8. Resources

“All of Web2.0.” Chrisekblog, http://chrisek.com/wordpress/2006/10/03/all-of-web-20/

“Building a tag cloud in Java.” http://randomcoder.com/articles/building-a-tag-cloud-in-java

“Everything Web2.0.” Matt’s blog. http://yahoolog.com/blog/?p=94

Freitag,Pete. “How to make a tag cloud.” http://www.petefreitag.com/item/396.cfm

Gamma, Eric, et. al. Design Patterns - Elements of Reusable Object-Oriented Software. 1995, Addison-Wesley Professional.

Green,Heather. “A Tag Team’s Novel Net Navigation.” BusinessWeek. February 28, 1995. http://www.businessweek.com/technology/content/feb2005/tc20050228_6395_tc024.htm?chan=search

Grossman, Frieder. Information Retrieval: Algorithms and Heuristics. 2006. Springer.

Hoffman, Kevin. “In Search of a Perfect Tag Cloud.” http://files.blog-city.com/files/J05/88284/b/insearchofperfecttagcloud.pdf

“Homonyms.” wikipedia.org, http://en.wikipedia.org/wiki/Homonyms

Keller, Philipp. “Tags Database Schema.” http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html

Konchady, Manu. “Text Mining Application Programming.” 2006. Thomson Delmar Learning.

Kopelman, Josh. “53,651.” May 2006. http://redeye.firstround.com/2006/05/53651.html

MySQLicious. http://nanovivid.com/projects/mysqlicious/

“Nielsen Net Ratings Announces February U.S Search Share Rankings.” January, 2008. http://www.nielsen-netratings.com/pr/pr_080118.pdf

Pipes, Jay. “Tagging and Folksonomy Schema Design for Scalability and Performance.” MySQL Inc.

“Polysemy.” wikipedia.org, http://en.wikipedia.org/wiki/Polysemy

Scuttle. http://sourceforge.net/projects/scuttle/

Sinha,Rashmi. “A social analysis of tagging (or how tagging transforms the solitary browsing experience into a social one).” January 18, 2006. http://www.rashmisinha.com/archives/06_01/social-tagging.html

“Tag Schema.” MySQL Inc. http://forge.mysql.com/wiki/TagSchema#Tagging_and_Folksonomy_Schema_Concepts

“Tagcloud examples.” http://microformats.org/wiki/tagcloud-examples

Toxi. http://toxi.co.uk/

“Zoom Clouds.” http://zoomclouds.egrupos.net/