Wednesday, April 15, 2015

Using Term Frequency Analysis to Measure Your Content Quality

Posted by EricEnge

It's time to look at your content differently—time to start understanding just how good it really is. I am not simply talking about titles, keyword usage, and meta descriptions. I am talking about the entire page experience. In today's post, I am going to introduce the general concept of content quality analysis, why it should matter to you, and how to use term frequency (TF) analysis to gather ideas on how to improve your content.

TF analysis is usually combined with inverse document frequency analysis (collectively TF-IDF analysis). TF-IDF analysis has been a staple concept for information retrieval science for a long time. You can read more about TF-IDF and other search science concepts in Cyrus Shepard's excellent article here.

For purposes of today's post, I am going to show you how you can use TF analysis to get clues as to what Google is valuing in the content of sites that currently outrank you. But first, let's get oriented.

Conceptualizing page quality

Start by asking yourself if your page provides a quality experience to people who visit it. For example, if a search engine sends 100 people to your page, how many of them will be happy? Seventy percent? Thirty percent? Less? What if your competitor's page gets a higher percentage of happy users than yours does? Does that feel like an "uh-oh"?

Let's think about this with a specific example in mind. What if you ran a golf club site, and 100 people come to your page after searching on a phrase like "golf clubs." What are the kinds of things they may be looking for?

Here are some things they might want:

  1. A way to buy golf clubs on your site (you would need to see a shopping cart of some sort).
  2. The ability to select specific brands, perhaps by links to other pages about those brands of golf clubs.
  3. Information on how to pick the club that is best for them.
  4. The ability to select specific types of clubs (drivers, putters, irons, etc.). Again, this may be via links to other pages.
  5. A site search box.
  6. Pricing info.
  7. Info on shipping costs.
  8. Expert analysis comparing different golf club brands.
  9. End user reviews of your company so they can determine if they want to do business with you.
  10. How your return policy works.
  11. How they can file a complaint.
  12. Information about your company. Perhaps an "about us" page.
  13. A link to a privacy policy page.
  14. Whether or not you have been "in the news" recently.
  15. Trust symbols that show that you are a reputable organization.
  16. A way to access pages to buy different products, such as golf balls or tees.
  17. Information about specific golf courses.
  18. Tips on how to improve their golf game.

This is really only a partial list, and the specifics of your site can certainly vary for any number of reasons from what I laid out above. So how do you figure out what it is that people really want? You could pull in data from a number of sources. For example, using data from your site search box can be invaluable. You can do user testing on your site. You can conduct surveys. These are all good sources of data.

You can also look at your analytics data to see what pages get visited the most. Just be careful how you use that data. For example, if most of your traffic is from search, this data will be biased by incoming search traffic, and hence what Google chooses to rank. In addition, you may only have a small percentage of the visitors to your site going to your privacy policy, but chances are good that there are significantly more users than that who notice whether or not you have a privacy policy. Many of these will be satisfied just to see that you have one and won't actually go check it out.

Whatever you do, it's worth using many of these methods to determine what users want from the pages of your site and then using the resulting information to improve your overall site experience.

Is Google using this type of info as a ranking factor?

At some level, they clearly are. Clearly Google and Bing have evolved far beyond the initial TF-IDF concepts, but we can still use them to better understand our own content.

The first major indication we had that Google was performing content quality analysis was with the release of the Panda algorithm in February of 2011. More recently, we know that on April 21 Google will release an algorithm that makes the mobile friendliness of a web site a ranking factor. Pure and simple, this algo is about the user experience with a page.

Exactly how Google is performing these measurements is not known, but what we do know is their intent. They want to make their search engine look good, largely because it helps them make more money. Sending users to pages that make them happy will do that. Google has every incentive to improve the quality of their search results in as many ways as they can.

Ultimately, we don't actually know what Google is measuring and using. It may be that the only SEO impact of providing pages that satisfy a very high percentage of users is an indirect one. I.e., so many people like your site that it gets written about more, linked to more, has tons of social shares, gets great engagement, that Google sees other signals that it uses as ranking factors, and this is why your rankings improve.

But, do I care if the impact is a direct one or an indirect one? Well, NO.

Using TF analysis to evaluate your page

TF-IDF analysis is more about relevance than content quality, but we can still use various precepts from it to help us understand our own content quality. One way to do this is to compare the results of a TF analysis of all the keywords on your page with those pages that currently outrank you in the search results. In this section, I am going to outline the basic concepts for how you can do this. In the next section I will show you a process that you can use with publicly available tools and a spreadsheet.

The simplest form of TF analysis is to count the number of uses of each keyword on a page. However, the problem with that is that a page using a keyword 10 times will be seen as 10 times more valuable than a page that uses a keyword only once. For that reason, we dampen the calculations. I have seen two methods for doing this, as follows:

term frequency calculation

The first method relies on dividing the number of repetitions of a keyword by the count for the most popular word on the entire page. Basically, what this does is eliminate the inherent advantage that longer documents might otherwise have over shorter ones. The second method dampens the total impact in a different way, by taking the log base 10 for the actual keyword count. Both of these achieve the effect of still valuing incremental uses of a keyword, but dampening it substantially. I prefer to use method 1, but you can use either method for our purposes here.

Once you have the TF calculated for every different keyword found on your page, you can then start to do the same analysis for pages that outrank you for a given search term. If you were to do this for five competing pages, the result might look something like this:

term frequency spreadsheet

I will show you how to set up the spreadsheet later, but for now, let's do the fun part, which is to figure out how to analyze the results. Here are some of the things to look for:

  1. Are there any highly related words that all or most of your competitors are using that you don't use at all?
  2. Are there any such words that you use significantly less, on average, than your competitors?
  3. Also look for words that you use significantly more than competitors.

You can then tag these words for further analysis. Once you are done, your spreadsheet may now look like this:

second stage term frequency analysis spreadsheet

In order to make this fit into this screen shot above and keep it legibly, I eliminated some columns you saw in my first spreadsheet. However, I did a sample analysis for the movie "Woman in Gold". You can see the full spreadsheet of calculations here. Note that we used an automated approach to marking some items at "Low Ratio," "High Ratio," or "All Competitors Have, Client Does Not."

None of these flags by themselves have meaning, so you now need to put all of this into context. In our example, the following words probably have no significance at all: "get", "you", "top", "see", "we", "all", "but", and other words of this type. These are just very basic English language words.

But, we can see other things of note relating to the target page (a.k.a. the client page):

  1. It's missing any mention of actor ryan reynolds
  2. It's missing any mention of actor helen mirren
  3. The page has no reviews
  4. Words like "family" and "story" are not mentioned
  5. "Austrian" and "maria altmann" are not used at all
  6. The phrase "woman in gold" and words "billing" and "info" are used proportionally more than they are with the other pages

Note that the last item is only visible if you open the spreadsheet. The issues above could well be significant, as the lead actors, reviews, and other indications that the page has in-depth content. We see that competing pages that rank have details of the story, so that's an indication that this is what Google (and users) are looking for. The fact that the main key phrase, and the word "billing", are used to a proportionally high degree also makes it seem a bit spammy.

In fact, if you look at the information closely, you can see that the target page is quite thin in overall content. So much so, that it almost looks like a doorway page. In fact, it looks like it was put together by the movie studio itself, just not very well, as it presents little in the way of a home page experience that would cause it to rank for the name of the movie!

In the many different times I have done an analysis using these methods, I've been able to make many different types of observations about pages. A few of the more interesting ones include:

  1. A page that had no privacy policy, yet was taking personally identifiable info from users.
  2. A major lack of important synonyms that would indicate a real depth of available content.
  3. Comparatively low Domain Authority competitors ranking with in-depth content.

These types of observations are interesting and valuable, but it's important to stress that you shouldn't be overly mechanical about this. The value in this type of analysis is that it gives you a technical way to compare the content on your page with that of your competitors. This type of analysis should be used in combination with other methods that you use for evaluating that same page. I'll address this some more in the summary section of this below.

How do you execute this for yourself?

The full spreadsheet contains all the formulas so all you need to do is link in the keyword count data. I have tried this with two different keyword density tools, the one from Searchmetrics, and this one from motoricerca.info.

I am not endorsing these tools, and I have no financial interest in either one—they just seemed to work fairly well for the process I outlined above. To provide the data in the right format, please do the following:

  1. Run all the URLs you are testing through the keyword density tool.
  2. Copy and paste all the one word, two word, and three word results into a tab on the spreadsheet.
  3. Sort them all so you get total word counts aligned by position as I have shown in the linked spreadsheet.
  4. Set up the formulas as I did in the demo spreadsheet (you can just use the demo spreadsheet).
  5. Then do your analysis!

This may sound a bit tedious (and it is), but it has worked very well for us at STC.

Summary

You can also use usability groups and a number of other methods to figure out what users are really looking for on your site. However, what this does is give us a look at what Google has chosen to rank the highest in its search results. Don't treat this as some sort of magic formula where you mechanically tweak the content to get better metrics in this analysis.

Instead, use this as a method for slicing into your content to better see it the way a machine might see it. It can yield some surprising (and wonderful) insights!


Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!

No comments:

Post a Comment