Key Points
- 99% of all articles that address content analysis / inventories / audits recommend line-by-line analysis of content. This limits ROI at scale.
- Line-by-line analysis restricts the ability to iterate on the analysis.
- Line-by-line analysis results in less consistency in decisions for business impact.
Note: if you have a thousand or fewer URLs, you can skip this article — doing item-by-item analysis based on a crawler like Screaming Frog in a spreadsheet is fine and intuitive at small scale (see calculator if you have a mid-sized site and aren't sure what you need — if you know you have a complex site, read on!).
Content analysis is the work of understanding content and deciding what to do with it. Its purpose is to derive more value from a digital presence's content, which generally means looking broadly across all of that content rather than at one page at a time.
Contrasting Approaches to Content Analysis
99% of all articles that address content analysis / inventories / audits recommend line-by-line analysis of content. This approach is appealing for several reasons:
Everyone already has a spreadsheet app available and knows how to use it.
This is the default industry-standard way of doing content analysis (so, if you've done at least one inventory or audit, you've probably done it this way already).
It is intuitive and really feels like progress to go through content line by line...
... and for small sites it really is!
But for more complex sites this falls apart because:
The effort it takes to do the content analysis is so inefficient that only the barest analysis is done, either by only analyzing a small section of the site or by doing minimal analysis per item.
In practice, this is often a deferral of work, for instance from the web agency to the site owner, when the site owner would be far better served by a more facilitated and structured approach to understanding and deciding about their content.
Any decisions made about the content are not made in a consistent manner, and are more biased toward individual teams'/people's perspectives than toward satisfying broad, cross-site user needs.
It is impossible to do very high insight analysis. Looking for near-duplicate content? Impossible to do line by line. Want to test a content hypothesis? Impossible to do line by line.
It's impractical to change your mind midway. Already have several teams going through their content line by line when your definition of what content to delete changes? Ooph, that's a problem to change now.
It becomes difficult to separate decision-making from content execution. If the same team is going through every piece of content to decide what to do with it, then they will probably re-evaluate those decisions when they are reviewing the list at time of execution.
Capability | Line-by-line | Pattern Based |
---|---|---|
Separate decision-making from execution (reduces iterations of analysis) | No | Yes |
Iterate on decisions without inspecting each content item | No | Yes |
Automate common strategy-driven elements like deduplicating URLs | No | Yes |
Make consistent decisions broadly, across entire digital presences | No | Yes |
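The near-duplicate check mentioned above is one example of analysis that only works across the whole inventory at once. A minimal sketch, assuming page text has already been extracted and using word shingles with Jaccard similarity (one common technique, not necessarily what any particular tool does); the function names, the sample pages, and the 0.6 threshold are all hypothetical:

```python
from itertools import combinations


def shingles(text: str, size: int = 3) -> set:
    """Break text into overlapping runs of `size` words."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(pages: dict, threshold: float = 0.6) -> list:
    """Return URL pairs whose text similarity meets the threshold."""
    sets = {url: shingles(text) for url, text in pages.items()}
    return [
        (u1, u2)
        for u1, u2 in combinations(sorted(sets), 2)
        if jaccard(sets[u1], sets[u2]) >= threshold
    ]


# Invented sample pages, for illustration only.
pages = {
    "/a": "our widget ships free to every customer in the country",
    "/b": "our widget ships free to every customer in the region",
    "/c": "contact the press office for media enquiries",
}
pairs = near_duplicates(pages)
```

Comparing every pair grows quadratically with the number of pages, so a very large inventory would likely need a smarter approach, but the point stands: this comparison cannot be made by reading one row at a time.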
Analysis Impact = Business Impact + URLs x Reduction in Per-Content Transformation Effort
There are two types of impact from content analysis:
Business Impact. Of course in the end we have websites in order to achieve business impact (higher sales, people finding the information they need easily, etc). The primary reason to do any content analysis is toward business impact digitally. Deleting a lot of ineffective content is high impact. Developing new content for an important audience that is not yet served is high impact. Rewriting the description of important products is high impact. Everyone just improving their own sites / site sections willy-nilly is less impactful.
Reduction in Content Transformation Effort. There are plenty of ways of deciding to delete content. Doing so with rules (delete all press releases over a year old that have received no page views in the last six months) means that you spend far, far less time deciding than you would going line by line.
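The press release rule above can be written down once and applied to every item. A minimal sketch, assuming each inventory row already carries a content type, a publish date, and a six-month page view count; the `ContentItem` fields, the `decide` function, and the sample rows are hypothetical:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ContentItem:
    url: str
    content_type: str         # hypothetical label, e.g. "press_release"
    published: date
    views_last_6_months: int  # assumed to come from an analytics export


def decide(item: ContentItem, today: date) -> str:
    """Apply the example rule: old, unviewed press releases get deleted."""
    over_a_year_old = item.published < today - timedelta(days=365)
    if (
        item.content_type == "press_release"
        and over_a_year_old
        and item.views_last_6_months == 0
    ):
        return "delete"
    return "keep"


# Invented sample rows, for illustration only.
items = [
    ContentItem("/news/2019-launch", "press_release", date(2019, 3, 1), 0),
    ContentItem("/news/2019-award", "press_release", date(2019, 6, 1), 42),
    ContentItem("/products/widget", "product", date(2018, 1, 1), 0),
]
today = date(2024, 1, 1)
decisions = {item.url: decide(item, today) for item in items}
```

Changing the criteria later means editing `decide` and rerunning it, not revisiting each row.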
Another aspect of business impact is who cares about your analysis. It can be highly impactful to do content analysis for solely the core content team when approaching a major transformation, but an even higher-impact set of stakeholders to convince would be higher-up executives. When looking at patterns rather than just lines, even the executive team may be interested.
Analysis Effort = URLs x Time Per URL x Iterations
For line-by-line analysis, the effort is:
The count of URLs
The time spent analyzing (not executing the change upon) each URL
The number of iterations on the analysis
For pattern-based analysis, you are looking at buckets of content (one bucket might be journal articles, another might be product information pages with incomplete information, another might be landing pages with many internal links to them, etc). This means you can be analyzing and deciding upon a lot of URLs (or content items) in one swath, such as what to do with all bio pages.
So for pattern-based analysis: Analysis Effort = Buckets x Time Per Bucket x Iterations
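To make the two effort formulas concrete, here is a small sketch that plugs in numbers. Every figure below (URL count, bucket count, minutes, iterations) is invented purely for illustration and is not a measurement:

```python
def analysis_effort(units: int, minutes_per_unit: int, iterations: int) -> int:
    """Effort = units x time per unit x iterations, in minutes.

    `units` is URLs for line-by-line analysis, buckets for pattern-based.
    """
    return units * minutes_per_unit * iterations


# Hypothetical site: 50,000 URLs, 2 minutes per URL, reviewed twice.
line_by_line = analysis_effort(units=50_000, minutes_per_unit=2, iterations=2)

# Same site as 40 buckets, 90 minutes per bucket, iterated five times.
pattern_based = analysis_effort(units=40, minutes_per_unit=90, iterations=5)

print(f"line-by-line:  {line_by_line / 60:.0f} hours")
print(f"pattern-based: {pattern_based / 60:.0f} hours")
```

Even with more time per unit and more iterations, the pattern-based total stays smaller in this made-up case because the number of buckets is so much lower than the number of URLs.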
One massive advantage of pattern-based analysis is that some analysis takes no human time per content item — for instance, if you want to see what sites use an old version of jQuery, that can be done automatically (after setting up the rules to extract these).
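As a rough illustration of the jQuery case, the sketch below pulls a version number out of script file names in already-fetched HTML. It assumes the version appears in the file name (as in `jquery-1.12.4.min.js`); pages that load jQuery from a versioned folder or a bundle would need a different rule. The regular expression, the `MIN_VERSION` threshold, and the sample pages are hypothetical:

```python
import re

# Matches file names such as jquery-1.12.4.min.js or jquery.3.7.1.js.
JQUERY_FILE = re.compile(
    r"jquery[-.]?(\d+)\.(\d+)(?:\.(\d+))?(?:\.min)?\.js", re.IGNORECASE
)


def jquery_version(html: str):
    """Return (major, minor, patch) for the first versioned jQuery file, or None."""
    match = JQUERY_FILE.search(html)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


# Invented sample pages, for illustration only.
pages = {
    "/a": '<script src="/js/jquery-1.12.4.min.js"></script>',
    "/b": '<script src="https://code.jquery.com/jquery-3.7.1.min.js"></script>',
    "/c": "<p>No scripts here</p>",
}

MIN_VERSION = (3, 0, 0)  # assumed cutoff for "old"
versions = {url: jquery_version(html) for url, html in pages.items()}
outdated = sorted(
    url for url, version in versions.items()
    if version is not None and version < MIN_VERSION
)
```

Once a rule like this exists, running it against ten pages or ten thousand takes the same amount of human time.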
Note that iterations are both desirable and undesirable: you WANT to be able to efficiently iterate on your analysis (changing content hypotheses, etc) but you do NOT want to iterate unnecessarily (looking at the same content multiple times since the teams weren't aligned on how they were supposed to do the analysis).
Analysis ROI = Impact / Effort
In many cases, it's impossible to even have the impact you really need by doing line-by-line analysis. In other words, much content analysis has little business impact. But in general one way of calculating Analysis ROI is the impact divided by the analysis effort. This isn't a ready-to-calculate formula since these impacts depend on your situation, and it can be tough to account for the effort as well. That said, keeping this notional formula in mind may help in deciding how to approach content analysis.
An example: comparing impact and effort
In real content analysis for complex sites, the iterations are often far more frequent than listed below. Basically, one level of analysis opens up new sets of questions that you ask, building upon and also modifying the existing analysis. That said, for the purposes of comparison, consider the following steps:
Crawl the site to generate a list†. There are tons of tools that will do this. Regardless, in most cases you just point the crawler at a starting URL and let the crawler loose.
Deduplicate URLs. From a content analysis perspective, even if there are a thousand blog listing URLs (like blog?page=1, blog?page=2, ... blog?page=1000), really it's one page. You can do this sort of thing in Google Sheets, but it's more error-prone.
Formulas to pull out URL folders, etc. There are some patterns in the URL that are usually useful — things like subdomain, folders in the URL, etc.
Develop a "massing" chart. Develop a chart that simply shows how much content is where, usually lumping all the small sections so you don't wind up with a meaningless "long tail" chart.
Merge in basic analytics data†. Join usage information, such as that from Google Analytics, onto each URL.
Decide what to do with content†. For each URL, decide what to do with it (remember: this can be done using rules).
Change the criteria for decisions, and re-decide what to do with the content. In an item-by-item review, this means going through all the content again. In a pattern-based analysis, this means changing and rerunning the rules to assign the decisions. For the purposes of comparison, let's assume that these new criteria help differentiate between higher-value content that should get more hands-on attention and rewriting vs. content that can be dealt with automatically (meaning that the higher-value content will actually get the attention it deserves).
Test a new content hypothesis. As you work through the analysis, you have observed several examples of a type of problem in the content. Let's say it's pages not using the approved template. You may want to test the hypothesis that this is a pervasive issue by scraping that pattern out of all the pages to see which actually have the issue.
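The deduplication, URL folder, and massing steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a description of how any particular tool works; the `PAGINATION_PARAMS` set, the function names, the `min_count` cutoff, and the example.com URLs are all assumptions:

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PAGINATION_PARAMS = {"page"}  # assumption: parameters that only paginate


def canonical_url(url: str) -> str:
    """Drop pagination parameters so blog?page=1..1000 collapse to one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PAGINATION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def top_folder(url: str) -> str:
    """Return the first path segment, or "(root)" for the home page."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0] if segments else "(root)"


def massing(urls, min_count: int = 2) -> Counter:
    """Count URLs per folder, lumping small folders into "(other)"."""
    counts = Counter(top_folder(u) for u in urls)
    merged = Counter()
    for folder, n in counts.items():
        merged[folder if n >= min_count else "(other)"] += n
    return merged


# Invented crawl output, for illustration only.
crawled = [
    "https://example.com/blog?page=1",
    "https://example.com/blog?page=2",
    "https://example.com/blog?page=3",
    "https://example.com/blog/post-a",
    "https://example.com/blog/post-b",
    "https://example.com/products/widget",
    "https://example.com/products/gadget",
    "https://example.com/about",
    "https://example.com/careers",
]
unique = sorted({canonical_url(u) for u in crawled})
chart = massing(unique)
```

The resulting counts are what a massing chart would plot: a handful of meaningful sections plus one "(other)" bar instead of a long tail.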
There's a huge qualitative difference between line-by-line and pattern-based content analysis that's tough to capture in an effort comparison table like the one below: in practice, you will not do many of the steps below if it's line-by-line analysis (you'll probably only do those with a † next to them)! Fundamentally, you just don't do the same type of analysis when it's line-by-line, and you definitely don't iterate on it as much (at least for the entire digital presence).
Step | Line-by-line | Pattern Based ❈ | What you have at this point |
---|---|---|---|
1. Crawl site† | A few minutes | A few minutes | You have a list of URLs. |
2. Deduplicate URLs | Hour+ | 0 | You have a list of URLs useful from a strategy perspective. |
3. Formulas to pull out URL folders, etc | Hour+ | 0 | You have a list of URLs with useful information. |
4. Develop "massing" chart | Hours | Minutes | A chart comparing folders or subdomains, merging "long tail" sections |
5. Merge in basic analytics data† | Hours | Minutes | Usage information to help inform decisions |
6. Decide what to do with content† | Days | Hours | Disposition information for every piece of content |
7. Change the criteria for decisions, and re-decide what to do with the content | Days | Hour | Revised disposition per content item |
8. Test a new content hypothesis | Days | Hour | The pervasiveness of a content issue, which you can also use as a filter |
❈ Using Content Chimera as an example.
† Realistically these are the only steps that usually happen for line-by-line analysis.
As to impact, it's theoretically possible that the same business impact is achieved in both approaches, although in practice the line-by-line approach would probably not follow all the steps and would result in sub-optimal decisions (meaning that, for example, the higher-value content would not get the additional improvements it needed). So:
If the business impact is the same, then the effort is still higher (so the ratio of impact/effort is lower for line-by-line), OR
If the business impact is lower for line-by-line, then the effort is still higher, so the ratio of impact/effort is even worse.
Of course, for smaller sites it still wouldn't be worth the more sophisticated pattern-based approach (there is some learning curve etc).