What is GRADE?
Since 2000, the Grading of Recommendations, Assessment, Development and Evaluations (GRADE) working group has worked to develop a systematic and explicit approach to making judgements about quality of evidence and strength of recommendations. The GRADE approach seeks to address many of the perceived shortcomings of existing models of evidence grading. Crucially, evidence is evaluated not study by study, but across studies for specific clinical outcomes. The methods developed by the GRADE working group take into account methodological flaws within the component studies, the consistency of results across different studies, how generalisable the research results are to the wider patient base, and how effective the treatments have been shown to be. Each treatment comparison is given one of four GRADE scores reflecting the quality of the evidence: high, moderate, low, or very low.

The approach taken by the GRADE working group is widely seen as the most effective method of linking evaluations of the quality of evidence to clinical recommendations. Our approach to grading evidence is based on the work of the GRADE working group. Taken together with our existing intervention categorisations, we believe that our approach will give clinicians a clear view of the evidence relating to key treatment interventions.
How have we come to the GRADE scores for each comparison?
We have developed a pragmatic approach that has enabled us to apply the principles of GRADE to our systematic reviews in a reproducible and appropriate way. We had to overcome several problems in order to be consistent in our approach across all of our content. For each review we present a table that shows the basis on which judgements of evidence quality were made.
- Type of evidence: We allocated four points to evidence largely based on RCTs, two points to evidence based on observational studies, and one point for evidence based on expert opinion. Because BMJ Clinical Evidence reviews usually have a search that is restricted to RCTs and systematic reviews of RCTs, we have almost certainly omitted observational evidence of relevance to the GRADE comparisons. Where we found no RCTs or systematic reviews, we have reported that we found no clinically important results, rather than search for and report observational studies which are at high risk of bias, and therefore may not yield clinically important results about the effects of interventions.
- Sparse data: We describe comparisons with fewer than 200 participants in total as sparse data. Although a definition of sparse data should ideally be based on event rates, many of our outcomes are presented as continuous data, which do not lend themselves easily to conversion into event rates.
- Quality points: We merged sparse data, follow-up, withdrawals, blinding, allocation concealment, and other quality issues into one quality category, and allowed deduction of up to three points for quality flaws. We did not re-appraise the studies, but assumed that our contributors would have highlighted any methodological flaws with the studies in the text of our review. All contributors will be asked to check the review content during the following update to see whether important weaknesses have been overlooked.
- Consistency: We merged heterogeneous studies, with different end-points and populations, providing they all evaluated the outcome in question and compared the same interventions. We allowed the deduction of one point if the relevant studies were largely inconsistent in their conclusions, and added one point if there was evidence of a dose response.
- Directness: Up to two points were deducted for the following reasons: studies of limited generalisability, because the included population was too narrow or too broad; outcomes that were difficult to generalise, such as those reported only as composite outcomes or poorly defined; treatments that did not fully represent all those included in the intervention title; and comparators that were not identical.
- Effect size: The GRADE recommendation is to add one point for a relative risk or odds ratio of 2 or more, and two points for a relative risk or odds ratio of 5 or more. Because we rarely had a single meta-analysis for the outcome and comparison in question, and because many of our outcomes are expressed as continuous data, we had to modify this recommendation. We looked at all effect sizes for the comparison in question, reported in individual RCTs or meta-analyses, and added one point if they were all greater than 2 (or less than 0.5) or two points if they were all greater than 5 (or less than 0.2). If one or more of the reported effect sizes fell between 0.5 and 2, or if the results were not statistically significant, no points were added.
- GRADE score: We used four categories of quality of evidence: high (four or more points overall), moderate (three points), low (two points), and very low (one point, zero points, or a negative score).
- Strength of recommendation: We have already categorised all interventions included in our reviews according to their likely effectiveness. We felt that this categorisation reflected the strength of recommendation, and that no further recommendations were needed.
- Cost-effectiveness assessment: BMJ Clinical Evidence does not include data on cost-effectiveness as this varies internationally. Therefore, we have not included cost-effectiveness data in our evaluation of the evidence, or in our categorisations.
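The arithmetic described in the points above can be sketched in a few lines of Python. This is only an illustration, assuming that the points for each criterion are simply summed: the names `Comparison`, `TYPE_POINTS`, `total_points`, and `grade_category` are hypothetical and are not part of any BMJ Clinical Evidence tool.

```python
from dataclasses import dataclass

# Starting points by type of evidence, as described in the text.
# The dictionary keys are hypothetical labels chosen for this sketch.
TYPE_POINTS = {"rct": 4, "observational": 2, "expert_opinion": 1}


@dataclass
class Comparison:
    """Points awarded to one comparison and outcome (illustrative only)."""
    evidence_type: str      # "rct", "observational" or "expert_opinion"
    quality: int = 0        # 0 down to -3 for quality flaws
    consistency: int = 0    # -1 (inconsistent), 0, or +1 (dose response)
    directness: int = 0     # 0 down to -2
    effect_size: int = 0    # 0, +1 or +2


def total_points(c: Comparison) -> int:
    """Sum the starting points and all adjustments."""
    return (TYPE_POINTS[c.evidence_type] + c.quality + c.consistency
            + c.directness + c.effect_size)


def grade_category(c: Comparison) -> str:
    """Map the overall points onto the four quality categories."""
    points = total_points(c)
    if points >= 4:
        return "high"
    if points == 3:
        return "moderate"
    if points == 2:
        return "low"
    return "very low"  # one point, zero points, or a negative score


# Example: RCT evidence with one quality flaw scores 4 - 1 = 3 points.
example = Comparison("rct", quality=-1)
print(grade_category(example))
```

For instance, under these assumptions, observational evidence in which every reported effect size qualifies for two extra points would reach four points overall and be rated as high-quality evidence.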
Scoring system used for BMJ Clinical Evidence reviews
| Criterion | Based on | Score | Meaning |
|---|---|---|---|
| Type of evidence | | 4 | RCTs/SR of RCTs, +/- other types of evidence |
| | | 2 | Observational |
| | | 1 | Non-analytical (expert opinion) |
| Quality | Blinding and allocation process; follow-up and withdrawals; sparse data; other concerns | 0 | No problems |
| | | -1 | Problem with 1 category |
| | | -2 | Problem with 2 categories |
| | | -3 | Problem with 3 or more categories |
| | | +1 | Adjustment for confounders would have increased effect size |
| Consistency | Significant difference for comparison and outcome reported by each study | +1 | Evidence of dose response across or within studies (or inconsistency across studies is explained by a dose response) |
| | | 0 | All/most studies show benefit (or harm) |
| | | -1 | No agreement between studies |
| Directness | Generalisability of population and outcomes from each study to population of interest | 0 | Population and outcomes generalisable |
| | | -1 | Population OR outcome not generalisable |
| | | -2 | Population AND outcome not generalisable |
| Effect size | OR/RR/HR for comparison | 0 | Not all effect sizes more than 2 or less than 0.5 |
| | | +1 | Effect size more than 2 or less than 0.5 for all studies/meta-analyses included in comparison |
| | | +2 | Effect size more than 5 or less than 0.2 for all studies/meta-analyses included in comparison |
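Two rows of the scoring table, quality and effect size, are mechanical enough to express as small functions. The sketch below is a hypothetical illustration rather than the editorial procedure itself: the function names, the category labels, and the `all_significant` flag are assumptions made for the example, and the sparse-data threshold of 200 participants is taken from the description of the method.

```python
# Hypothetical labels for the four quality categories in the table.
QUALITY_CATEGORIES = {
    "blinding_allocation",
    "follow_up_withdrawals",
    "sparse_data",
    "other",
}


def is_sparse(total_participants: int) -> bool:
    """Sparse data: fewer than 200 participants in the comparison."""
    return total_participants < 200


def quality_points(problem_categories: set) -> int:
    """Deduct one point per problem category, to a maximum of three."""
    unknown = problem_categories - QUALITY_CATEGORIES
    if unknown:
        raise ValueError(f"unknown quality categories: {sorted(unknown)}")
    return -min(len(problem_categories), 3)


def effect_size_points(effect_sizes: list, all_significant: bool = True) -> int:
    """Award points only if every reported OR/RR/HR is large enough.

    Does not check that the effects point in the same direction;
    that is treated here as a matter for the consistency criterion.
    """
    if not effect_sizes or not all_significant:
        return 0
    if all(e > 5 or e < 0.2 for e in effect_sizes):
        return 2
    if all(e > 2 or e < 0.5 for e in effect_sizes):
        return 1
    return 0


# Example: a comparison with 150 participants and unclear blinding.
problems = {"blinding_allocation"}
if is_sparse(150):
    problems.add("sparse_data")
print(quality_points(problems), effect_size_points([2.4, 3.1]))
```

Comparing each effect size against both thresholds directly, rather than inverting ratios below 1, keeps the rule identical to the wording of the table.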






