"Regression allows for the inclusion of multiple independent variables, which has several advantages over your model."
As I said in the OP, I am going to run them separately and then in combination, and use the adjusted r2 to pick the best fit from among them. The purpose of this experiment (see the original thread for more details) is to collect a set of stats for a Message Board Power Ranking. The stats are chosen by the members here, with offense, defense and pitching components weighted, so we can see what the results are and how they change from period to period (a week, a month, or a 10-day interval).
We are also going to give more weight to the numbers from the most recent period when calculating the YTD rankings.
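Nobody has settled on how the recency weighting would work yet, so here is only a rough sketch of one way it could go. The exponential decay scheme, the `decay` value and the function name are my own assumptions for illustration, not something the board has agreed on.

```python
def recency_weighted_score(period_scores, decay=0.8):
    """Weighted mean of per-period scores, listed oldest first.

    Assumed scheme: the most recent period gets weight 1 and each
    earlier period is multiplied by `decay` once more.
    """
    n = len(period_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(w * s for w, s in zip(weights, period_scores))
    return total / sum(weights)


# Made-up period scores for two imaginary teams with the same plain average.
improving = [40.0, 50.0, 60.0]
declining = [60.0, 50.0, 40.0]

print(recency_weighted_score(improving))
print(recency_weighted_score(declining))
```

With any decay below 1 the improving team comes out ahead of the declining one in the YTD number, even though their unweighted averages are identical.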
"I still don't get what running regressions with a single independent variable gets you."
I think that is the most interesting part of the analysis. Unless we scare the other posters off before the thread gets going, I would expect the usual argument to spring up about which stat is going to carry the highest correlation, and then we plug in the numbers to see who was right.
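For anyone who wants to follow along, the comparison from the OP (each stat alone, then in combination, ranked by adjusted r2) could be sketched roughly like this. The stat names and all the numbers are made up for illustration, and the helper functions are hypothetical; it is plain least squares in standard-library Python, not output from any stats package.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adjusted_r2(columns, y):
    """Least-squares fit of y on the columns; returns adjusted R^2.

    Centering every column and y makes the intercept implicit.
    """
    n, p = len(y), len(columns)
    mean_y = sum(y) / n
    yc = [v - mean_y for v in y]
    xc = []
    for col in columns:
        mean_x = sum(col) / n
        xc.append([v - mean_x for v in col])
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in xc] for ci in xc]
    xty = [sum(u * v for u, v in zip(ci, yc)) for ci in xc]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, xc)) for i in range(n)]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(yc, fitted))
    ss_tot = sum(yi ** 2 for yi in yc)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def rank_models(stats, y):
    """Score every non-empty combination of stats, best adjusted R^2 first."""
    names = sorted(stats)
    results = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            score = adjusted_r2([stats[name] for name in combo], y)
            results.append((score, combo))
    return sorted(results, reverse=True)


# Made-up numbers for eight imaginary teams, for illustration only.
stats = {
    "ops": [0.780, 0.745, 0.760, 0.710, 0.730, 0.695, 0.720, 0.700],
    "era": [3.60, 3.95, 4.30, 3.80, 4.40, 4.10, 4.75, 4.90],
    "fielding_pct": [0.986, 0.984, 0.985, 0.983, 0.981, 0.984, 0.980, 0.982],
}
win_pct = [0.610, 0.560, 0.540, 0.520, 0.490, 0.470, 0.440, 0.410]

for score, combo in rank_models(stats, win_pct):
    print(f"{score:6.3f}  {' + '.join(combo)}")
```

Adjusted r2 charges a penalty for every extra stat, so a combination only moves up the list if the added stat explains enough to pay for itself.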
"Why not just tell SAS or STATA to mine the data to find the best possible fit?"
Because then it wouldn't be the board members designing their own metric. The original intent was to see how it scored the teams as the season went along. It makes for a conversation where the members have a stake in the outcome. Otherwise we could just post a link to B-R and call it a day, but that wouldn't be much fun.
"Why? I need a theoretical reason why."
Because another poster suggested it and others agreed. If you have not read the first thread try to find it and you will have a better understanding of what this experiment is all about.
"What is the name of the other thread?"
Something along the lines of "a new MB metric."
"By doing one independent variable at a time you are opening the door to spurious relationships."
I don't see that happening. What baseball team stat do you think is going to be suggested that might lead to that? This isn't beer sales and wins we are comparing, and just about everything I can imagine they would propose in the world of baseball team stats will have some bearing on how good the team is.
"why did they want it."
To give more weight to a team that might have been short-handed earlier in the season but now has all its stars healthy and is playing better, and less weight to a team that was good before it lost two or three stars to injury. Plus trades that make a team better. That was the gist, I think.
"What are you going to do about the differences in scales between your independent variables?"
Once we have all the stats we are going to use in the metric we will scale them all. The point of this thread is to narrow the field to a few numbers so the formula will not be too cumbersome. I will deal with any off-the-wall stats and multicollinearity and scaling and fine tuning the formulation. I just want the guys to discuss the stats so I can get the Power Ranking formula done by the end of the month so we can begin to publish our ranks.
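As a rough idea of what the scaling step might look like, here is a sketch that turns each stat into z-scores so OPS and ERA end up on the same footing, then combines them with component weights. The z-score choice, the weights, the sign flip for ERA and the function names are all my assumptions for illustration; the final formula is still open.

```python
from statistics import mean, pstdev


def z_scores(values, higher_is_better=True):
    """Rescale one stat to mean 0 and standard deviation 1.

    Stats where lower is better (ERA, for example) are sign-flipped so
    a positive score always means "better than the average team".
    """
    mu, sigma = mean(values), pstdev(values)
    sign = 1.0 if higher_is_better else -1.0
    return [sign * (v - mu) / sigma for v in values]


def composite(stats, lower_is_better, weights):
    """Weighted sum of the scaled stats, one score per team."""
    scaled = {
        name: z_scores(vals, name not in lower_is_better)
        for name, vals in stats.items()
    }
    n = len(next(iter(scaled.values())))
    return [sum(weights[name] * scaled[name][i] for name in scaled) for i in range(n)]


# Made-up numbers for three imaginary teams; the weights are hypothetical.
stats = {"ops": [0.800, 0.700, 0.750], "era": [3.50, 4.50, 4.00]}
scores = composite(stats, lower_is_better={"era"}, weights={"ops": 0.5, "era": 0.5})
print(scores)
```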
"A spurious relationship is when you find a relationship between two variables that is due to a third uncontrolled variable that has a direct cause on both. So it is more likely to occur when the variables are closely related."
I understand the definitions, but I wanted you to give an example of where you think it will occur in this analysis. Which team stats did you have in mind?
"I am not sure that time is the right proxy in doing this. "
This is the reason for the thread. Give some input on your idea for the model and we will discuss it.
Issues of multicollinearity are easy to see and deal with, which is why we are looking to throw out certain statistics in favor of stats that have a higher r2. Above all we must pick the stats that are most favored by the board, and when there is too much overlap, use one over another. This thread will help show which have a closer correlation to the team's strength.
E.G. there is no reason to use both runs and RBI. I don't have to do analysis to see they will be almost the same.
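If anyone does want to check for overlap rather than eyeball it, a pairwise correlation screen is about all it takes. This is a sketch only: the 0.9 cutoff is an arbitrary assumption, the function names are hypothetical, and the team totals are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def redundant_pairs(stats, threshold=0.9):
    """List the stat pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for a, b in combinations(sorted(stats), 2):
        r = pearson(stats[a], stats[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Made-up season totals for six imaginary teams, for illustration only.
stats = {
    "runs": [720, 680, 750, 640, 700, 610],
    "rbi": [690, 650, 715, 610, 668, 585],
    "era": [4.1, 3.8, 4.4, 4.0, 4.5, 3.9],
}

for a, b, r in redundant_pairs(stats):
    print(f"{a} and {b} overlap (r = {r:.3f}); keep whichever the board prefers")
```

In this toy data only the runs and RBI pair gets flagged, which is the situation described above: keep one and drop the other.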
And the AL and NL rankings are going to be separated because of the DH rule. I have not yet figured out how we are going to deal with inter-league play.
I see no reason to overfit the data, and no reason we can't handle multicollinearity simply by leaving out redundant numbers like RBI and runs, given how phase two of the power rankings is going to be conducted with the paired comparisons (see original thread).
The most interesting phase of this whole thing will be the individual comparisons. It is simple and easy for people to understand. In other words, it fits the audience of the average baseball interlocutor.
I am not sure what information you are trying to find out that would help the items under discussion. It looks to me like you are trying to solve a problem that has not been proposed or are engaging in tetrapyloctomy.
The model I have laid out as a basis for this discussion is not flawed because it does what it was intended to do and does so simply. We are not testing a new medication designed to slow toenail growth.
Did you finally say what you have been wanting to say? I hope so.
Now could we get back to the purpose of the thread?
I want you to participate. Feel free to do any analysis and post it for discussion. But I think the most enjoyable and entertaining way to do it is to actually post the analysis rather than posting about the minutiae. I had a simple idea for the thread. It was easy to calculate in about 30 seconds, post in 2 and be discussing it in under a minute.
The ultimate goal was to have a fun metric that would be our very own on this board created by our collective input on the parameters. So let's do that, OK?
"disallow discussion on its accuracy and how to improve that."
With a ranking system there is often subjectivity involved. Therefore we would certainly debate its accuracy and that is the point. That is precisely the type of discussion it would generate. And it has already started doing what I was hoping would happen and that is individual posters creating their own metrics. And that is the fun part.
But blowing the thread out of the water just to throw your dick on the table before a single stat is put on the hot seat is just going to get people to throw up their hands and think it is too complicated to contemplate.
I have an idea for looking at some stats; the board can accept it, reject it, or improve it. Then we move to the next phase. After that is done and we have a formula, we publish the results. Others can do their own analysis and we will see where that leads.
There is a constructive way and a destructive way. Let's get to the numbers, construct, and then revise.