How effective is WCAG?
Methodological flaws put a question mark over a study of the impact of WCAG on user problems


A recent study has questioned the relevance of WCAG for improving website accessibility. It claims that only half of users' problems were covered by WCAG success criteria, and sees no evidence that the use of WCAG techniques improved matters.

Author: Detlev Fischer

The study "Guidelines are Only Half of the Story: Accessibility Problems Encountered by Blind Users on the Web" was authored by Christopher Power, André Pimenta Freire, Helen Petrie, David Swallow, all at the University of York, UK. It was presented at CHI 2012, May 5–10, 2012, Austin, Texas, USA and is published as part of the ACM digital library.

The study is based on the use of 16 websites by 32 blind users. According to the study, only 50.4% of the actual problems encountered by blind users were covered by Success Criteria (SC) in WCAG 2.0. More importantly, the study claims that whether or not WCAG techniques were implemented by the sites had no significant relation to the actual degree of accessibility that they offer.

If borne out by the study's facts and analysis, the latter point would indeed be alarming. It would send out the message that whether or not authors try to make their sites WCAG-conformant does not matter much for users with disabilities.

And that message was indeed sent out in the echo that the study received in fora and on Twitter. @stommepoes opened a thread named "Research shows adhering to WCAG doesn't solve blind users' problems" and went some way towards showing that sites present not just accessibility problems but also usability problems, and that WCAG would be over-tasked covering all of the latter.

On Twitter, the chosen headline was merrily retweeted without further comment. WebAIM posted a mail thread discussing the study; it is a bit difficult to read since it repeats posts several times.

In all the discussion, no one seemed to question the validity of the study itself. This is what I attempt in this article, looking more closely at its data, its metrics, its analysis and conclusions.

Methodological issues

I will not discuss in detail a number of study design decisions which may well have had an impact on the results, but merely list a few of them (some have been mentioned in other discussions):

  • Only blind users were included in the study
  • Some participants were used to very dated assistive technology (e.g., JAWS 5.0) but were asked to use recent versions such as JAWS 10.0 in the study
  • The manual audits of the sites' WCAG conformance covered just the home page, whereas the user tasks presumably involved many other pages

Unclear differentiation of sites into conformant and non-conformant

Table 1 of the study provides an overview of the results. It lists 12 of the 16 selected sites as conforming to WCAG 1.0 on Level A or AA. However, the text states that manual audits of these and many other sites hardly found any sites which showed full WCAG conformance. The table therefore also lists for each site the number of failed SCs and the number of failure instances after the stated level of conformance.

Just one site has no SC violations against WCAG 2.0; all others show various violations. One site stating conformance even fails as many as 23 SCs in 146 failure instances according to the manual audit, while another one listed as non-conformant in the table fails only 8 SCs, with fewer failure instances.

It is not at all clear and goes unexplained in the study how the pool of 16 sites was divided into conformant and non-conformant - an important division consequently used in the subsequent analysis. In strict terms, none of the sites included in the sample fully conformed to WCAG 1.0, and only one was fully conformant against WCAG 2.0. If the division was based on the audit and not sites' stated conformance, where was the cut-off point at which researchers decided to treat a site in the sample as non-conformant? Much of the subsequent data analysis rests on this distinction which is profoundly unclear.

The metrics used for manual site audits

The metrics applied in the manual audits against WCAG list the number of SCs violated, and the number of failure instances, but the latter are not indexed to SC. Such flat counts cannot provide a meaningful indication of the actual accessibility of the site because they do not reflect the severity of the failures and failure instances encountered.

Even within one of the catch-all SCs like 1.3.1 Info and Relationships, a count of failure instances may be clocking up rather trivial things (paragraphs in div- instead of p-elements), or indicate (and possibly drown out) more severe problems like labels not programmatically linked to inputs, scripted links not recognised as links, or missing headings. Also, the study gives no indication as to what was counted as a failure instance in the manual audits - every missing p, or only more serious issues?
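To illustrate what indexing failure instances to SCs could look like, here is a minimal Python sketch. The audit records, the two severity labels and the function name are assumptions made purely for illustration; none of them come from the study.

```python
from collections import Counter, defaultdict

# Hypothetical audit findings: (success criterion, severity, description).
# These records are invented for illustration and are NOT data from the study.
FINDINGS = [
    ("1.3.1", "minor", "paragraph marked up with div instead of p"),
    ("1.3.1", "minor", "paragraph marked up with div instead of p"),
    ("1.3.1", "severe", "label not programmatically linked to input"),
    ("1.3.1", "severe", "missing heading"),
    ("2.4.4", "severe", "link purpose unclear"),
]


def index_by_sc(findings):
    """Group failure instances by SC and count them per severity."""
    index = defaultdict(Counter)
    for sc, severity, _description in findings:
        index[sc][severity] += 1
    return {sc: dict(counts) for sc, counts in index.items()}


flat_count = len(FINDINGS)       # the kind of flat figure the study reports
indexed = index_by_sc(FINDINGS)  # the same findings, indexed to SC and severity

print(flat_count)
print(indexed)
```

The flat count treats two missing p-elements and two serious structural problems as four equal failures of one SC; the indexed view keeps them apart.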

The metrics used for user problems

The study borrowed a severity ranking scale from Nielsen which allowed users to rate the problems they encountered as cosmetic, minor, major, or catastrophic. That is a useful distinction, but the study makes no use of it. It just gives a total of 1383 instances of accessibility problems across the 16 websites, summing up everything from cosmetic to catastrophic. A more differentiated treatment focusing especially on the major and catastrophic problems or weighting problems by severity would have been a lot more useful. Using the flat sum of problem instances significantly weakens the expressiveness of the chosen metric.
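A small Python sketch shows how much a flat sum can hide. The four severity levels are those named above; the numeric weights and the per-site figures are assumptions chosen for illustration, not values from the study.

```python
# Severity levels as named in the text. The numeric weights are an
# assumption made for this sketch; the study does not define any.
WEIGHTS = {"cosmetic": 1, "minor": 2, "major": 3, "catastrophic": 4}


def flat_count(problems):
    """Total number of problem instances, ignoring severity."""
    return sum(problems.values())


def weighted_score(problems):
    """Sum of problem instances, each weighted by its severity."""
    return sum(WEIGHTS[severity] * n for severity, n in problems.items())


def serious_count(problems):
    """Number of major and catastrophic problem instances only."""
    return problems.get("major", 0) + problems.get("catastrophic", 0)


# Two hypothetical sites with the same flat count (invented numbers).
site_a = {"cosmetic": 40, "minor": 30, "major": 5, "catastrophic": 0}
site_b = {"cosmetic": 5, "minor": 10, "major": 35, "catastrophic": 25}

for name, site in (("A", site_a), ("B", site_b)):
    print(name, flat_count(site), weighted_score(site), serious_count(site))
```

Both hypothetical sites have the same flat count, yet one of them is far worse for its users under either of the two alternative measures.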

Mapping user problems onto conformant and non-conformant sites

Given the obvious lack of a clear differentiation between conformant and non-conformant sites, the subsequent user problem mapping seems dubious at best. Still, the presentation of results first shows that sites classed as non-conformant against WCAG 1.0 had a significantly higher mean number of user problems (Fig. 1 and 2).

This is borne out by a look at the table listing the number of user problems found per site. The mean number of problems found on the four sites that are declared non-conformant to WCAG 1.0 is 150.75. Sites classified as conformant to WCAG 1.0 A or AA have a much lower mean number of 64.8 and 65.5 respectively. That alone would indicate that sites which (at some point in time) made an effort to conform created far fewer problems than sites that never claimed conformance (or were classed as non-conformant for other reasons the study fails to specify).

The study then details results for WCAG 2.0, only to conclude that a comparison to WCAG 1.0 was not possible as there were not enough sites conforming (or stating conformance?) to WCAG 2.0 in the sample.

The next step of the data analysis muddies the initially clear result, however. A "one-way ANOVA [ANOVA = analysis of variance - my comment] between non-conformant websites and Level A conformant websites" (what conformance? Both WCAG versions or just one? Stated or manually audited conformance?) is performed that now shows no significant difference in the mean number of problems between both groups. This seems in stark contrast to the results reported immediately before. Maybe I know too little about statistics to fully appreciate this step in the analysis. I would be grateful for comments or clarifications by readers more familiar with statistical analysis.
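One possible explanation, offered here only as a hedged sketch, is that with very small groups and widely scattered per-site counts, even a large difference in means can fail a significance test. The Python example below computes the F statistic of a one-way ANOVA by hand. The per-site counts are invented: they are chosen only so that the group means equal the 150.75 and 64.8 quoted above, and the group size of five Level A sites is likewise an assumption. The critical value is the standard 5% table value for (1, 7) degrees of freedom.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# HYPOTHETICAL per-site problem counts, not the study's data.
non_conformant = [40, 310, 95, 158]  # mean 150.75
level_a = [20, 150, 35, 90, 29]      # mean 64.8

# 5% critical value of the F distribution for (1, 7) degrees of freedom.
F_CRITICAL_1_7 = 5.59

f_value = one_way_anova_f([non_conformant, level_a])
print(f_value, f_value > F_CRITICAL_1_7)
```

With these invented numbers the F statistic stays below the critical value, so the difference would be reported as not significant even though one mean is more than twice the other. Whether this is what happened in the study cannot be judged without its per-site data and group definitions.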

A further step in the data analysis looks for a correlation between the number of failing WCAG success criteria identified in the manual audits and the number of user problems identified, and again finds a significant correlation, while no correlation was found for instances of violation. The latter just underlines the fact that a mere failure instance count without indexing instances to the severity of SC violations is simply not meaningful.

Again, this study result could be interpreted as indicating that fewer SC violations indeed translate into fewer user problems. The study chooses to draw different conclusions. Its claim that adherence to WCAG requirements does not make a significant difference in terms of achieving actual accessibility not only seems to be based on flawed data, but is actively misleading.

Classification of user problems into WCAG-related and unrelated

One of the tenets of the study is that user problems can be categorised into those that bear a relation to WCAG success criteria, and those that don't. This is not as easy as it appears. Take, for example, the six problem areas listed in the study as "not covered in WCAG 2.0". The two most prominent problems which supposedly bear no relationship to WCAG are:

  1. Content found in pages where not expected by users
  2. Content not found in pages where expected by users

It seems obvious that there is a strong link between these two problems and the following WCAG success criteria:

After all, users form expectations of content when reading / hearing link text and headings, and navigate accordingly.

The study provides the following example:

On the Automobile Association website (theAA.com), users were looking for driving tips. They did find this information under the link "Learn to Drive", but they were surprised that they were on such a page which did not match their mental model of the information architecture of such a site.

This seems a case of unclear link purpose. The content problem reported for the resulting page may be related to a WCAG failure on the page that led to it.

To be sure, no link text and no heading can ensure that users will correctly predict the content of the link destination or subsequent text, but simply stating that these user problems bear no relationship to WCAG is a gross simplification which puts a question mark on the study's claim that "only half of user problems identified are covered by WCAG".

Mapping user problems onto the level of implementation of WCAG techniques

The study claims that "16.7% of websites implemented techniques recommended in WCAG 2.0 but the techniques did not solve the problems". To appreciate this seemingly low figure (especially given the fact that the sites selected claim conformance to WCAG), it is important to realise that it is not at all trivial to establish whether or not a particular WCAG technique has been used in an actual implementation.

Many WCAG techniques are documented in a rudimentary form and based on simplified examples. Few real-world implementations use techniques exactly as described in WCAG. This makes any mapping of implemented techniques and WCAG techniques a non-trivial and necessarily imprecise exercise. In addition, many General Techniques such as G140: Separating information and structure from presentation to enable different presentations are in fact so general that any decent coding can be said to follow them. Were all those WCAG techniques included? If not, how and where did the study draw the line? There is no explanation how the study determined whether WCAG techniques were "implemented by developers" or not.

Another problem is that techniques overlap and interact. It will often be hard to say whether a problem encountered can clearly be attributed to the failure of using a particular technique. Take an example from SC 1.3.1 "Info and Relationships", which according to some sources can account for up to half of all accessibility problems found. The HTML heading code h1-h6 may be correctly used (according to Technique H42 Using h1-h6 to identify headings) to establish a document hierarchy. But heading content may be incomplete or confusing, violating Technique G132 Providing descriptive headings related to Success Criterion 2.4.6 (Headings and Labels).

Is WCAG 2.0 too hard to understand?

The study makes the point that developers struggle to understand WCAG, and that WCAG 2.0 has not improved matters. There is certainly truth in that accusation.

  • WCAG 2.0 often uses a judicious but quite convoluted wording in the interest of abstraction. Part of the reason for this is that the guidelines aim for technology independence. Divorcing the normative text of WCAG from technology-specific examples makes sense, but also creates a gap that makes interpretation harder.
  • Additional confusion lurks in the concept of 'Sufficient Technique'. In principle, using and positively testing the implementation of a Sufficient Technique should ensure that the related SC is met. Often, however, several techniques must be applied as an ensemble (use technique X AND technique Y, or, X AND "one of the following techniques"). In some cases, Sufficient Techniques are grouped into several situations (A, B, C..), which are often not mutually exclusive.
  • It gets even more complex. Even if a Sufficient Technique (or group of Techniques) has been used successfully, the SC may not be met because some WCAG failure belonging to the same SC is found to apply. Take again the use of headings as required in SC 1.3.1. Even if G132 and H42 have been used perfectly, one of the Failures listed under SC 1.3.1 may apply and invalidate conformance to SC 1.3.1.
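The logic described in the last two points can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the normative WCAG conformance rule: the function name, the single-technique combination and the failure identifier "F-example" are hypothetical, while G132 and H42 are the techniques named in the text.

```python
def sc_is_met(passed_techniques, sufficient_combinations, applicable_failures):
    """Simplified model: an SC counts as met if at least one sufficient
    combination of techniques is fully passed AND no documented failure
    of that SC applies."""
    has_sufficient = any(
        combo <= passed_techniques for combo in sufficient_combinations
    )
    return has_sufficient and not applicable_failures


# Techniques that tested positively on a hypothetical page.
passed = {"G132", "H42"}

# Hypothetical, simplified list of sufficient combinations for one SC.
combinations = [{"H42"}]

print(sc_is_met(passed, combinations, set()))          # no failure applies
print(sc_is_met(passed, combinations, {"F-example"}))  # one failure applies
```

Even with every technique implemented perfectly, a single applicable failure flips the result, which is exactly the situation that makes conformance claims hard to verify from techniques alone.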

The difficulty of applying WCAG reliably

All this surely contributes to the fact that WCAG is complex and difficult to apply. But another reason for the difficulty is that the requirements expressed in WCAG are complex, as is real world page content. The paper quotes a 2010 study by Brajnik et al. which found that for 50% of WCAG 2.0 success criteria, different evaluators did not reach an 80% level of agreement whether a problem was present on the page. It also identified 20% false positives and found that 32% of problems had been missed by evaluators.

This is no surprise because in implementing real-world content, techniques are modified and in many cases, interact. In our own evaluation experience, allocating a particular problem to a specific SC is often not easy, especially as one and the same problem instance can violate several SC at the same time. A text label may be not in the right position, not be sufficiently descriptive, and not programmatically linked to the input field, violating four success criteria at once (1.3.2, 2.4.6, 3.3.2, 4.1.2).

Problems when using several Sufficient Techniques simultaneously

Even for fairly clear-cut success criteria, there is room for interpretation in cases where more than one sufficient technique has been used. Take SC 2.4.1, Bypass Blocks. This SC can be met either by having a proper heading structure that allows screen reader users to skip content (H69: Providing heading elements at the beginning of each section of content), or by implementing one of several skip link techniques (G1, G123, G124). But what happens if an author uses both headings and skip links and the skip links turn out to be not working (quite a frequent problem)? There is no Failure associated with SC 2.4.1 that would cover this case, and if the heading structure is fine, on what grounds should the SC fail? Still, the problem for the screen reader users (indeed all keyboard users) is evident: a skip link is announced / highlighted but it doesn't do the trick. It is not surprising that evaluators will come to different conclusions when rating this case.
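The dilemma can be made explicit with a small Python sketch. The function and its parameter names are hypothetical, and the rule it encodes is one possible literal reading of SC 2.4.1, not an official test procedure.

```python
def rate_bypass_blocks(headings_ok, skip_link_present, skip_link_works):
    """Contrast a literal reading of SC 2.4.1 with the user's experience."""
    working_skip_link = skip_link_present and skip_link_works
    # Literal reading: one working bypass mechanism is sufficient.
    literal_pass = headings_ok or working_skip_link
    # User view: a skip link that is offered but broken is a problem,
    # regardless of whether headings provide an alternative.
    user_problem = skip_link_present and not skip_link_works
    return {"literal_pass": literal_pass, "user_problem": user_problem}


# The case discussed above: good headings plus a broken skip link.
print(rate_bypass_blocks(headings_ok=True,
                         skip_link_present=True,
                         skip_link_works=False))
```

For the case discussed, both flags are true at once: the page passes on a literal reading while users still hit a problem. An evaluator has to decide which of the two to report, and different evaluators will decide differently.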

Providing insights into the frequency of actual problems of blind users

The strength of the study is not what it says about WCAG, but that it provides evidence of the type and frequency of problems encountered by blind users in carrying out actual tasks.

The link purpose problem

Given the methodological problems outlined, I would be careful in taking any of the numbers at face value. Still, I found it striking that the most frequently encountered problem of all was "Link destination not clear" (117 times). The study's conclusion that the sufficient techniques for SC 2.4.4 Link Purpose are in fact not sufficient, at least for the group of blind users, strikes me as valid.

WCAG 1.0 recommended, in Checkpoint 13.1: "Clearly identify the target of each link. Link text should be meaningful enough to make sense when read out of context." WCAG 2.0 has relaxed this requirement by allowing authors to provide the purpose in the link's context. For blind users who rely on generated indices of links, WCAG 2.0 is clearly a step backwards. As the lack of descriptive links will also contribute to other frequent problems such as "Content not found in pages where expected by users", it may indeed be worth rethinking SC 2.4.4 in future work on WCAG.

Final remark

As other commentators have noted: the fact that only a part of users' problems could be mapped to WCAG criteria is no surprise. While WCAG 2.0 does cover more usability issues than WCAG 1.0 (such as error handling) it doesn't cover all of them. The study calls for the inclusion of a broader range of problems in WCAG 2.0. This would make WCAG 2.0 even more complex and the success of its criteria even harder to measure.

The study's most worrying conclusion – that the implementation of WCAG techniques did little to help blind users – seems based on a number of methodological flaws that I have tried to unpack above.

The study's call for more empirical user studies is certainly justified. But its methodology shows too many glaring problems to put any trust in its findings regarding the value of WCAG.