I recently discussed Brad Schoenfeld’s latest study.
I went over the small sample size issue. I would have liked to go into more depth on that, but the paper did not include the results for each individual.
Luckily, James Krieger, a coauthor, has published that data.
Much like Brad, I’m a big fan of James’s. I’ve been following him on and off for ten years. His nutrition writing in particular has been phenomenal. There were times in college I read him more than my exercise physiology books, and I was only better off because of it…
But his rationale for why this data wasn’t included in the final publication is seriously lacking: he passes the blame to “reviewers” and says you have to “pick your battles.” At the least, put a link to an appendix or something. It’s one image. The last alcohol study I reviewed had two such links to thousands of pages.
If exercise science researchers are going to insist on doing these studies with like 35 or fewer people, you have to publish these kinds of plots, as they explicitly illustrate the fatal variability of small sample sizes. Not doing so implies the data was less varied than it likely was.
-> No, posting standard deviations is not enough.
I’ve talked about this when it comes to eating response after exercise:
–What’s the deal with exercise and your appetite?
After a certain kind of workout, basically 50% of people eat more while the other half eat less. Sooo, how do you make a recommendation, at the individual level, with that knowledge? Do you have any better advice than “Play with it”?
-> This is why some people, legitimately, say “Exercise makes me fat,” while for others the weight starts falling right off.
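To make the “standard deviations aren’t enough” point concrete, here’s a minimal sketch with made-up numbers (purely illustrative, not data from any appetite study). A group that splits roughly 50/50 can report a near-zero average change, and the summary numbers never tell you that almost nobody actually had a near-zero response:

```python
import numpy as np

# Hypothetical post-exercise change in daily intake (kcal) for 10 people.
# Half eat noticeably more, half noticeably less -- every number is made up.
changes = np.array([+320, +280, +350, +300, +290,
                    -310, -270, -330, -300, -280])

mean = changes.mean()          # the group average a paper would report
sd = changes.std(ddof=1)       # the sample standard deviation
ate_more = int((changes > 0).sum())
closest_to_zero = int(changes[np.abs(changes).argmin()])

print(f"Group summary: mean change = {mean:+.0f} kcal (SD = {sd:.0f})")
print(f"Individuals:   {ate_more}/10 ate more, {10 - ate_more}/10 ate less")
print(f"Closest anyone got to 'no change': {closest_to_zero:+d} kcal")
```

The average says exercise barely changes intake; not one person in that made-up group actually behaved that way.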
Let’s look at the muscle size data, and why an argument can be made that studies like this are borderline uninterpretable for practitioners wanting to use the data “in the field.”
As a reminder, the study found a dose response relationship for exercise volume and muscle size.
But look at tricep thickness.
Four of the participants in the one set group, damn near 50% of the visible participants, did BETTER than five participants in the 3 sets group:
And six did better than two in the 5 set group:
Again, this isn’t like one out of a thousand participants. Even with one set vs five, we’re talking ~66% of the one set group doing better than ~20% of five sets. Because we in exercise science refuse to use larger sample sizes, we have to pause and analyze what happens to two people.
To be clear, those two people in the five sets group LOST muscle.
Someone might say “Well, at least you can deduce 5 sets, or more volume, has a better chance of gaining muscle than 1 set.” Are you positive about that? Devil’s advocate: look at the above again. What about the fact that the people in the 3 sets group who lost muscle had a greater average loss than those in the 1 set group? Are we now in the realm of risk / reward analysis? Do we need to assess the triceps differently because of this?
You can pick and choose example after example.
Furthermore, notice the variability within a group.
For five sets, in bicep thickness, two people gained ~6mm, nobody else gained more than 4, most gained 3 or less.
Aesthetically (I’m aware this is not the true statistical usage), those two top people in the 5 sets group sure look like outliers. Should we include or exclude outliers? When it comes to this topic, are they anomalies, or are they highly genetically suited for high volume training?
Triceps. One gained ~6.5mm; another lost over 3mm!
Look, I know we’re talking millimeter differences here, but +6 mm vs -3 mm is a 9 mm swing. In fact, just consider the difference between the person who gained 1mm and the person who gained 6mm. That’s a 6x difference. A 500% difference. Seriously. Five HUNDRED percent.
This is why assessing and modeling humans can be so impossible. Small sample sizes => Impossible * infinity.
I’m a personal trainer. What am I supposed to do with this information? How do I know whether I have a person where 3 sets will make them gain or lose muscle? Or if one set is better than three? This study does not remotely help me in that regard. All it does is, maybe, help me speak in extremely broad generalities. That’s fine if I’m talking about a large group, but when I’m directly speaking to an individual?
This study used trained people, i.e. people who had been resistance training for over a year. If I have a trained client, can I automatically tell them more volume will help them get bigger? This is what the conclusion of the study says:
“we show that increases in muscle hypertrophy follow a dose-response relationship, with increasingly greater gains achieved with higher training volumes. Thus, those seeking to maximize muscular growth need to allot a greater amount of weekly time to achieve this goal”
Do you think that’s a fair conclusion now that you’ve seen some people might get smaller if they do more volume? Because that sure sounds like painting everybody interested in hypertrophy with the same broad brush.
If you have a new client and they tell you muscle size is their goal, do you immediately jump to 5 sets, or do you start at 1? What’s the injury, adherence and burnout risk of a workout with 5x as many sets?
The study was done on college-aged males. The conclusion says “thus, those seeking…” Do you think it’s fair to take a 40 year old male, with two kids, a mortgage, a 9-5 job, and tell him he needs to do 45 sets a week for his quads -to failure- to maximize hypertrophy? Because he’s one of “those seeking…” And, if your recommendation is that, does your recommendation have to be “do 45 sets and also have a college lifestyle”? Oh, “and take some steroids so your testosterone levels are what they were in your early 20s.”
And “eat a certain level of protein.”
And “have a trainer next to you watching your technique and keeping you accountable.”
Someone is going to say that’s too pedantic. Too onerous. But that’s science. Generalizing too much, reading too much into sample sizes of 30 (split into three groups!), gets me clients pushing 70 years old half jokingly asking why we’re not just doing 13 minute workouts for strength, and linking me to a NYTimes article about this study.
-> The 13 minute strength finding from this study hasn’t been talked about as much. All the powerlifters and Olympic lifters in the world must feel bad they’ve been lifting for more than 13 minutes all these years.
A much more sensible conclusion is,
“Those looking to maximize hypertrophy should start with low volume and increase as results dictate.”
Something I do not like about many of the evidence-driven writers in the exercise science world is how often they fall back on what is and isn’t “scientific,” with often bizarre rationalizations.
-> This study used self-reported food intake to monitor the subjects’ eating. That’s not falsifiable, which is not scientific.
–Why are we so confused about how and what to eat? <- an enormous problem in nutrition research
They denounce anyone who does not speak in scientific terms, despite the fact that many researchers’ conclusions are not scientific themselves. I routinely see responses to criticism amounting to “the critics are unscientific,” and I saw it with this study. They’re arbitrarily picking a level of scientificness critics need to attain to be valid. Meanwhile, go ask some people in physics how scientific they think studies like this are.
-> In human research, we’re happy if we get to a p-value of 0.05. One out of twenty, which is the threshold this study used (and which was rarely attained). In physics, they shoot for one out of 3.5 million.
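-> For the curious, that physics figure comes from the particle physics “5 sigma” convention (my assumption of what’s being referenced here). A quick sketch of how the two thresholds compare:

```python
from scipy.stats import norm

# Typical threshold in human / exercise research:
p_human = 0.05
print(f"p = {p_human}  ->  1 in {1 / p_human:.0f}")

# Particle physics' "5 sigma" discovery convention (one-tailed), which is
# where the roughly one-in-3.5-million figure comes from:
p_physics = norm.sf(5)  # P(Z > 5) under a standard normal
print(f"5 sigma -> p = {p_physics:.1e}, i.e. 1 in {1 / p_physics:,.0f}")
```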
Or let me put it this way: if you’re a person struggling with hypertrophy, the bane of the evidence-driven person’s life is for you to go on an internet forum and ask around for advice. For you to seek out the endlessly cited confirmation and availability biases.
You’ll hear some tell you “Do more volume. Look at bodybuilders.” You’ll hear others say “Do less. That worked for me.” The person concludes,
“For a fair amount of people, more volume worked, but there’s definitely enough others telling me less might work too. Based on my experience and circumstances, I think I’ll try X and go from there.”
So the person goes off and experiments. Does a study like this do any more than suggest the same? If it doesn’t, how scientific is it? (How scientific, research-wise, are humans capable of being?) If we’re left to the same devices as anecdote, what’s the value or utility in continuing this kind of research? Are we adding clarity, or confusion?
asgag
September 19, 2018
Thanks for this. Food for thought.
b-reddy
September 21, 2018
Glad you liked it!
Dr. J
September 20, 2018
I think the conversations taking place due to this paper are almost more interesting and educational than the paper itself.
Questions of efficacy vs effectiveness and long term sustainability come into play.
I want GAINZ as much as the next bro, but 45 sets to failure per exercise? Just not pragmatic for the vast majority of trainees. I think I’d rather take up jogging (pronounced “yogging” with a soft j). Especially in light of the equivocal data presented.
Unfortunately we are left with the boring old standbys – some exercise is better than no exercise and the best program for the individual is the one that they will actually do consistently over the long term. Yawn.
b-reddy
September 21, 2018
It all did escalate quickly.
I agree. There are a lot of parallels to eating. There are many ways to skin a cat.
Chris
October 23, 2018
Hello Brian, thank you for the article!
Imo the title of the article is not adequate, and I’d like to remark on some points:
– Sample size: It’s a complicated field, and every layman’s knee-jerk approach to everything N<100 is the same “Sample size too small!!!111” (because that’s the only thing they’ve ever heard about statistics). Now, the adequate minimum sample size depends on many factors; 30 may be enough in one case, 10000 too few in another. Brad was among the first in exercise science to use power analyses to determine sample size (a quick sketch of what that looks like follows these points), so the sample size is predetermined to be adequate. Even if one study has a meagre sample size, its inclusion in a meta-analysis (MA) makes even that study worthwhile.
– Speaking of MA, I’d like to point you to Greg Nuckols’ excellent article https://www.strongerbyscience.com/frequency-muscle/ on that topic, which shows the same results as this Schoenfeld study. Moreover, he did an analysis of Schoenfeld’s study in MASS, and responded to Lyle (who clearly does not have sufficient statistical knowledge and got some things mixed up) here https://www.facebook.com/gregory.nuckols/posts/10155783995703779 . In Greg’s response, he also tackles the blinding and training standardizations you mentioned.
– Looking at individual data points: You simply can’t do that the way you did. Studies use several participants and calculate means to cancel out noise. We know there is great interindividual variability – not only due to genetic factors, but also due to environmental reasons. The more ecologically valid a study is, the more variation there is in the response beyond the independent variable: life stress, minor illnesses, nutrition, mood and motivation. Now if you point at two persons of one group saying they fared worse than 6 of the other – that doesn’t mean at all they would’ve been better off in the other treatment. You also seem to be surprised about the huge SD of results, especially of negative ones (losing muscle). This is pretty normal for such a training study and due to intra- and interindividual variation. It conforms with the literature. That also means there are no proper “outliers” in this study.
– I am completely convinced there are factors determining individualized optimal training volume, frequency, etc – but your claims can’t be made from such single-point comparisons between groups. In a related vein, I’d also favour a mixed model for the starting differences, and a statistical setup for detecting interaction effects. The latter is research that will happen once the fundamental results on volume, frequency, etc are explored. That’s the same in every science: first researching basic principles, then proceeding to finer and finer individual factors/subgroups. We simply need to wait for the research on individualized training …and in practice, we probably won’t have the resources to determine the individual (genetic) setup for the adequate training, at least not in the near future.
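For readers who haven’t seen one, here is a minimal sketch of what such a power analysis looks like (the effect sizes and targets below are illustrative placeholders, not the values from the Schoenfeld paper):

```python
from statsmodels.stats.power import TTestIndPower

# Per-group n needed to detect an assumed standardized effect (Cohen's d)
# with a two-sample t-test at alpha = 0.05 and 80% power. The d values are
# placeholders, chosen only to show how much the answer depends on them.
analysis = TTestIndPower()
for d in (1.0, 0.5, 0.2):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative='two-sided')
    print(f"d = {d}: about {n:.0f} participants per group")
```

Whether ~10 per group is “enough” falls straight out of the assumptions fed in, which is why the blanket “N<100 is too small” reflex misses the point.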
I like your intellectual, critical mind and motivation. I just think that for this topic, Greg Nuckols, for example, has a bit more understanding of the statistical methodology used, the previous literature, and the practical constraints and approaches of training research. And I recommend reading and discussing his take on the study.
b-reddy
October 23, 2018
Hey Chris,
-I think you misread quite a bit of my post(s) on this study. Maybe I didn’t write certain parts well enough, but you’re making arguments against points or approaches I didn’t make or take. Because of that I’m going to address some of your points specifically and more broadly-
-As I hit on in the posts about this paper, I actually emailed multiple statisticians. (Something this study did not have.) I explicitly said to them how often I fall back on sample size as a crutch because it’s what I’m most familiar with, so I wanted to see if they could point to something I was missing. They couldn’t. There may be merit to your broader point -though anybody trying to justify small sample sizes is walking on thin ice- but not with this study. As one statistician said to me, “your concerns about sample size are valid.”
And -I’m quite surprised you’d make this argument- just because a study hits some level of statistical power does not mean it’s practically relevant or worth including, just the same as a study hitting a p-value threshold doesn’t. If you’re going to make that argument, then we should all be reacting strongly to every published study of this nature. There is an enormous amount of poorly done research that hits statistical significance. (Not to mention this study had issues hitting both its power and p-value numbers.)
By the way, I still haven’t gotten anybody to explain the ANCOVA issue to me. Including James Krieger, who never responded when I asked him on his site.
-Greg looks at these kinds of debates from a researcher’s standpoint. Lyle has always looked at these issues more from a practitioner’s point of view. (It’s why people are still gravitating to his takeaways.) I hit on this quite a bit in the article but will reiterate-
You need to think about this from the average personal trainer’s point of view. The fact is they will not be able to wrap their heads around this much statistical mumbo-jumbo. Showing them charts of the variance like I did? People can grasp that quickly. We can’t expect practitioners to have deep statistics backgrounds. That was another point of the articles. We need more easily accessible forms of how to implement the knowledge, IF it’s worth implementing. If the researchers can’t give us that, it’s their problem, not ours.
Regardless, if we’re all having to do this deep of a dive into the methodology something isn’t right. Great research is obvious. It’s a home run. This clearly wasn’t. And no, we don’t base the quality of research on whether it agrees or disagrees with other research. If that’s your barometer, Einstein would have been a failure.
While I’m of the belief excessive cynicism is a terrible trait, Lyle has a lot of fair points about p-hacking, lack of pre-registering, etc. Cornell just fired a professor for this with nutrition research; Harvard is going through an issue with a cardiologist and stem cells. These things are not one-off events. They happen routinely within research. Enough that, as a practitioner, you have to keep them in the back of your mind. A study seemingly out of nowhere is eschewing p-values for some esoteric “Bayesian” approach? Using a biomechanics PhD *student* as their point person? I have no qualms with people being skeptical of that.
-I’m not at all surprised by the big standard deviation. That was the point of the article. (And why I referenced another article I’d written with the same finding regarding appetite and exercise.) You can’t focus on averages if there are enormous deviations around those averages.
Even if the study did have hundreds of participants, if it got the same results with the same deviations, it wouldn’t change the rules of how we should train people: try some volume, adjust from there.
-“Now if you point at two persons of one group saying they fared worse than 6 of the other – that doesn’t mean at all they would’ve been better off in the other treatment.”
I didn’t say that. I’m saying you *can’t* make individual conclusions. Everyone is trying to say how you should train an individual from a study like this. I’m saying don’t do that. Hence, “futility.” (If you’re training a football team of 100 people, all having to do the same workout, maybe you have a different point of view.)
-Re: noise. See, that’s interesting. How is noise canceled out from the practitioner’s point of view? Studies like these are always on college aged males with no injury history who are able to show up to the majority of their sessions. That doesn’t apply to the majority of people who step in a gym. The only noise a study like this is canceling out is the noise of real life. This is part of the reason virtually no high level athletes or lifters extensively use academia for their performance.
Again, if you want to say “how to maximize hypertrophy in X population with Y and Z characteristics,” that’s one thing. If you’re going to say how to maximize hypertrophy broadly, well, you can’t make that conclusion from a study like this. If nothing else, how do you even know people would show up for this amount and these kinds of training sessions??? If I get a client to show up twice a week but can’t get them to show up 3 times a week, does it really matter how much more effective 3 times a week is?
-I like Greg’s work. I’ve read most of it. He’s not a statistician though, and again, he’s coming at it from a researcher’s point of view. Deep down he knows the lack of ultrasound blinding is an *enormous* red flag, yet he tries to half defend it by saying an increase in probability of bias doesn’t imply bias happened. Well, duh, but that’s not how it works.
I would recommend talking to people in other fields. I talked to multiple doctors about this study. I tell them the sample size and they laugh and stop listening. Yet for some reason in exercise science we’re supposed to be ok with a study of 35 people because the people doing it are nice and well intentioned. To Greg, a blinded technician is ideal. In other fields, it is an absolute necessity.
-I’m sorry if that comes off like I’m being an ass. I don’t mean to be. But I do strongly disagree with much of what you’ve written. Again, I realize that could be due to poor writing on my part. Either way, I appreciate the counter point of view.