Major League Baseball


    • More stat head exercises
  • 4/5/12
  • TheAllSeeingEYE

"Regression allows for the inclusion of multiple independent variables, which has several advantages over your model."

As I said in the OP I am going to do them separate and then in combination and use the adj r2 to pick the best fit from among them. And the purpose of this experiment (see the original thread for more details) is to collect a set of stats to use in a Message Board Power Ranking, based on stats that the members here chose, with offense, defense and pitching components weighted, to see what the results were and how they changed from period to period (a week, a month, or a 10-day interval).

We are also going to give more weight to the numbers from the most recent period when calculating the YTD rankings.
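One way the recency weighting described here could work is an exponentially decayed average of the per-period scores. This is only a sketch: the function name `weighted_ytd`, the `decay` value of 0.8, and the sample scores are all assumptions for illustration, not anything the board has settled on.

```python
def weighted_ytd(period_scores, decay=0.8):
    """Combine per-period scores (oldest first) into one YTD score.

    The most recent period gets weight 1, the one before it `decay`,
    then decay**2, and so on. `decay` is an assumed tuning knob.
    """
    n = len(period_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(w * s for w, s in zip(weights, period_scores))
    return total / sum(weights)


# Invented scores for two hypothetical teams with the same plain average.
hot_finish = weighted_ytd([40.0, 50.0, 60.0])   # improving team
cold_finish = weighted_ytd([60.0, 50.0, 40.0])  # fading team
print(hot_finish > cold_finish)
```

With `decay=1.0` this reduces to a plain average, so the knob makes it easy to see how much the recency weighting alone moves a team.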

  • 4/5/12
  • Jasreds
"As I said in the OP I am going to do them separate and then in combination and use the adj r2 to pick the best fit from among them."

Why not just tell SAS or STATA to mine the data to find the best possible fit? Of course, all that really means is that you will find the variable combination with the strongest correlation. Going by adj r2 alone, you could even end up excluding the variable with the strongest relationship with wins due to other statistical issues, such as multicollinearity and spurious effects.

I still don't get what running regressions with a single independent variable gets you. What is so crucial about the information that will be provided that it outweighs the problems associated with this design?

"We are also going to give more weight to the numbers from the most recent period when calculating the YTD rankings"

Why? I need a theoretical reason why.

Edited 4/5/12   by  Jasreds
  • 4/5/12
  • TheAllSeeingEYE

"I still don't get what running regressions with a single independent variable gets you."

I think that is the most interesting part of the analysis. Unless we scare the other posters off before the thread gets going, I would expect the usual argument to spring up about which stat is going to carry the highest correlation, and then we plug in the numbers to see who was right.

"Why not just tell SAS or STATA to mine the data to find the best possible fit?"

Because then it wouldn't be the board members designing their own metric. The original intent was to see how it scored the teams as the season went along. It makes for a conversation where the members have a stake in the outcome. Otherwise we could just post a link to B-R and call it a day, but that wouldn't be much fun.

"Why? I need a theoretical reason why."

Because another poster suggested it and others agreed. If you have not read the first thread try to find it and you will have a better understanding of what this experiment is all about.

  • 4/5/12
  • Jasreds
What is the name of the other thread?

By doing one independent variable at a time you are opening the door to spurious relationships. It is just like the classic example of the relationship between ice cream sales and murder rates. In the end, a regression with a single independent variable is basically a Pearson's correlation.

I need more of a reason than "that is what they wanted." For example, why did they want it?

What are you going to do about the differences in scales between your independent variables?

Edited 4/5/12   by  Jasreds
  • 4/5/12
  • TheAllSeeingEYE

"What is the name of the other thread?"

Something along the lines of "a new MB metric."

"By doing one independent variable at a time you are opening the door to spurious relationships."

I don't see that happening. What baseball team stat do you think is going to be suggested that might lead to that? This isn't beer sales and wins we are comparing, and just about everything I can imagine they would propose in the world of baseball team stats will have some bearing on how good the team is.

"why did they want it."

To give more weight to a team that might have been short-handed earlier in the season but now has all its stars healthy and is playing better, and less weight to the team that was good before it lost two or three stars to injury. Plus trades that make a team better. That was the gist, I think.

"What are you going to do about the differences in scales between your independent variables?"

Once we have all the stats we are going to use in the metric we will scale them all. The point of this thread is to narrow the field to a few numbers so the formula will not be too cumbersome. I will deal with any off-the-wall stats and multicollinearity and scaling and fine tuning the formulation. I just want the guys to discuss the stats so I can get the Power Ranking formula done by the end of the month so we can begin to publish our ranks.
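The post does not say which scaling will be used, so the following is only a sketch of one common choice, z-scores, which put every stat on a "standard deviations from the league mean" scale. The helper name `z_scores` and the sample numbers are invented for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale a list of team stats to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Invented numbers for five hypothetical teams: two stats on very
# different raw scales.
runs_scored = [640, 700, 715, 760, 810]
batting_avg = [0.245, 0.262, 0.255, 0.271, 0.268]

runs_z = z_scores(runs_scored)
avg_z = z_scores(batting_avg)
print([round(z, 2) for z in runs_z])
print([round(z, 2) for z in avg_z])
```

After scaling, a one-unit change means the same thing for both stats (one standard deviation), so weights in the ranking formula can be compared directly.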

  • 4/5/12
  • Jasreds
A spurious relationship is when you find a relationship between two variables that is due to a third uncontrolled variable that directly causes both. So it is more likely to occur when the variables are closely related.

I can understand creating those weights, but I am not sure that time is the right proxy in doing this.
  • 4/5/12
  • TheAllSeeingEYE

"A spurious relationship is when you find a relationship between two variables that is due to a third uncontrolled variable that directly causes both. So it is more likely to occur when the variables are closely related."

I understand the definitions, but I wanted you to give an example of where you think it will occur in this analysis. Which team stats did you have in mind?

"I am not sure that time is the right proxy in doing this. "

This is the reason for the thread. Give some input on your idea for the model and we will discuss it.


Edited 4/5/12   by  TheAllSeeingEYE
  • 4/5/12
  • Jasreds
Off the top of my head: HRs, runs, RBIs, and SLG% should all be highly correlated. Btw, having a DH will help with each of these.
  • 4/5/12
  • TheAllSeeingEYE

Issues of multicollinearity are easy to see and deal with which is why we are looking to throw out certain statistics in favor of stats that have a higher r2. Above all we must pick stats that are most favored by the board, and when there is too much overlap use one over another. This thread will help show which have a closer correlation to the team's strength.

E.G. there is no reason to use both runs and RBI. I don't have to do analysis to see they will be almost the same.

And the AL and NL rankings are going to be separated because of the DH rule. I have not yet figured out how we are going to deal with inter-league play.

  • 4/5/12
  • djiboutirox
what

oh jazzy's here, ill go back into my hole

Edited 4/5/12   by  djiboutirox
  • 4/6/12
  • Jasreds
"Issues of multicollinearity are easy to see and deal with which is why we are looking to throw out certain statistics in favor of stats that have a higher r2. "

Actually that is not the only way to handle multicollinearity. Multicollinearity only inflates the standard errors, so it mainly affects the significance tests. The coefficients can still be examined, just more carefully.

"E.G. there is no reason to use both runs and RBI. I don't have to do analysis to see they will be almost the same"

Which ignores the other 2 stats that I mentioned.

"And the AL and NL rankings are going to be separated because of the DH rule. I have not yet figured out how we are going to deal with inter-league play."

Include a dummy variable for the DH in a regression model with multiple independent variables. Now you would be controlling for the DH. Again, this is another reason to use multiple independent variables: to control for context.
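To make the dummy-variable idea concrete, here is a minimal sketch of an ordinary least squares fit with one stat plus a 0/1 DH flag. Everything in it is an assumption for illustration: the helper names `solve` and `ols`, the choice of runs per game as the stat, and the noise-free data, which was invented with planted coefficients so the fit can be checked against them.

```python
def solve(a, b):
    """Solve the square system a * x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients for y ~ intercept + the columns of rows."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented teams: (runs per game, DH flag where 1 = plays with a DH).
teams = [(4.0, 0), (4.3, 0), (4.6, 0), (4.2, 1), (4.7, 1), (5.0, 1)]
# Invented, noise-free wins built from planted coefficients 20, 12 and -2.
wins = [20 + 12 * rpg - 2 * dh for rpg, dh in teams]

intercept, per_run, dh_effect = ols(teams, wins)
print(round(intercept, 3), round(per_run, 3), round(dh_effect, 3))
```

The coefficient on the flag is the estimated difference between DH and non-DH teams at the same runs per game, which is what "controlling for the DH" means here. Real data would have noise, so the estimates would come with standard errors that this sketch does not compute.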

  • 4/6/12
  • Jasreds
You can have a turn. So far, all I have really gotten is "this is what I want to do."
  • 4/6/12
  • TheAllSeeingEYE

I see no reason to overfit the data, and no reason not to handle multicollinearity simply by leaving out redundant numbers like RBI and runs, given how phase two of the power rankings is going to be conducted with the paired comparisons (see original thread).

The most interesting phase of this whole thing will be the individual comparisons. It is simple and easy for people to understand. In other words, it fits the audience of the average baseball interlocutor.

I am not sure what information you are trying to find out that would help the items under discussion. It looks to me like you are trying to solve a problem that has not been proposed or are engaging in tetrapyloctomy.

The model I have laid out as a basis for this discussion is not flawed because it does what it was intended to do and does so simply. We are not testing a new medication designed to slow toenail growth.

  • 4/6/12
  • Jasreds
Look, I do statistical analysis for a living and I teach multiple classes on stats and research design. I am not splitting hairs. I was trying to give some advice, based on years of experience and published work, that would actually improve your statistical design and avoid possible biased results, or at the very least provide a test to see if there is a bias.

You haven't really addressed any of the flaws I have pointed out. For example, you never addressed the issue of your dependent variable being a counting stat, which OLS can have problems with. The answer you should have given is that you will run an OLS regression model and the right MLE model, along with a hettest, to see if you lose any effectiveness due to the choice of statistical model.

I listed 4 stats, you picked two of them and have completely ignored the rest.

In addition, you seem to be completely against actually controlling for anything, including contextual variables.

Further, you plan to compare adj R2. How? Are you going to simply see which number is highest? The proper method would be to conduct statistical analysis across models to see if there is an actual significant difference or just random chance.

Heck, you are not even doing the simplest and probably least biased statistic for a single independent variable model, Pearson's r. Why not just simple correlations? After all, with a single independent variable, R2 is just Pearson's r squared. This way you don't have to worry about OLS assumptions.
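The identity in this paragraph is easy to check numerically. The sketch below computes Pearson's r and the R2 of a one-variable least-squares line by hand and compares them. The function names and the runs/wins figures are invented for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simple_ols_r2(x, y):
    """R-squared of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Invented season totals for six hypothetical teams.
runs = [610, 650, 700, 720, 760, 800]
wins = [70, 75, 84, 81, 90, 95]

r = pearson_r(runs, wins)
r2 = simple_ols_r2(runs, wins)
print(abs(r ** 2 - r2) < 1e-9)
```

So ranking single stats by R2 from one-variable regressions gives the same ordering as ranking them by the absolute value of their correlation with wins.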

Take the advice or leave it.

"The model I have laid out as a basis for this discussion is not flawed because it does what it was intended to do and does so simply. We are not testing a new medication designed to slow toenail growth."

Actually, medical research is not that statistically advanced; many studies get by with simple t-tests, differences of means, and ANOVAs.

Edited 4/6/12   by  Jasreds
  • 4/6/12
  • TheAllSeeingEYE

Did you finally say what you have been wanting to say? I hope so.

Now could we get back to the purpose of the thread?

  • 4/6/12
  • Jasreds
And it went in one of your ears and out the other. A shame, because I honestly did want to help in making it the best possible analysis that it could be in order to obtain the most accurate answers possible. But oh well.
  • 4/6/12
  • TheAllSeeingEYE

I want you to participate. Feel free to do any analysis and post it for discussion. But I think the most enjoyable and entertaining way to do it is to actually post the analysis rather than posting about the minutiae. I had a simple idea for the thread: something easy to calculate in about 30 seconds, post in 2, and be discussing in under a minute.

The ultimate goal was to have a fun metric that would be our very own on this board created by our collective input on the parameters. So let's do that, OK?

Message 115193.22 was deleted
  • 4/6/12
  • TheAllSeeingEYE

"disallow discussion on its accuracy and how to improve that."

With a ranking system there is often subjectivity involved. Therefore we would certainly debate its accuracy and that is the point. That is precisely the type of discussion it would generate. And it has already started doing what I was hoping would happen and that is individual posters creating their own metrics. And that is the fun part.

But blowing the thread out of the water just to throw your dick on the table before a single stat is put on the hot seat is just going to get people to throw up their hands and think it is too complicated to contemplate.

I have an idea on looking at some stats; the board can accept it, reject it, or improve it. Then we move to the next phase. After that is done and we have a formula, we publish the results. Others can do their analysis and we will see where that leads.

There is a constructive way and a destructive way. Let's get to the numbers, construct, and then revise.

Message 115193.24 was deleted