Reference
This is a place to ask questions about math terminology, and to post links to other resources out on the web.
Sam
2005-07-08 13:57:32 |
Statistics: Two sets of data: HELP!
Hey all -
I'm deep into my thesis, and have only realized now that I don't know nearly enough statistics.
I have two sets of data which, when plotted on the same scatter graph, show two negative trends. When the trend-lines are drawn, one lies above the other. I'd like to be able to prove that this difference is statistically significant.
Doing some research, I've found that this is basically an ANCOVA analysis. However, ANCOVA appears to only work for parallel lines, which mine aren't.
Is there a similar analysis that can be performed on NON-parallel lines?
Thanks! |
Sam
2005-07-08 14:05:54 |
Re: Statistics: Two sets of data: HELP!
Also, is there a method of determining whether the difference in the slopes is significant? |
Federico Kereki
2005-07-08 17:23:32 |
Re: Statistics: Two sets of data: HELP!
A chi square test might be appropriate to compare both sets: see http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html |
Sam
2005-07-08 17:30:05 |
Re: Statistics: Two sets of data: HELP!
Chi-squared only works, I believe, when the dependent variable is the only other variable. In the case on the site, the independent variable is male/female, and the dependent is the type of footwear.
However, I've got something which _already_ varies with time, and I want to see if the way it varies in one situation is different than the way it varies in the other.
Sorry if I wasn't clear. |
Larry
2005-07-10 06:15:13 |
Re: Statistics: Two sets of data: HELP!
I think the devil is in the details when trying to fit a particular statistical method to real data. For example: I assume you have several types of footwear, two sexes of shoe wearers, and several points in time. Is there something else being measured, such as shoe comfort or running speed? Or is the issue the type of footwear chosen by the person? |
Sam
2005-07-10 12:00:30 |
Re: Statistics: Two sets of data: HELP!
Actually, the footwear example was from the link Federico suggested.
I'm actually working with an algorithm that finds solutions to problems (a genetic algorithm). As the complexity of the problem increases, the quality of the solution decreases (approximately) linearly.
I have two different algorithms, the standard genetic algorithm and my variation. When I plot the data from both of these on a scatter graph, both lines decrease as the complexity increases. However, the line for my new algorithm is above the line for the standard algorithm. This means that my algorithm is consistently finding better quality solutions than the standard algorithm, at all complexities. I want to show that this difference is significant.
However, I'm beginning to think that this might be too complicated. Instead, I may just pick a single problem of a specific complexity, run a dozen tests each with the two algorithms, and then show that mine (hopefully) is significantly better on that one problem. This wouldn't be as strong a result, but it would cut down on the number of variables and allow me to do a simple chi-squared test... |
owl
2005-07-11 22:57:03 |
Re: Statistics: Two sets of data: HELP!
Hi Sam,
I think I may be a little confused, as you mentioned time as a dependent variable. I think you may be fixing time on each algorithm and your dependent metric measures quality of solution. And your independent variable is complexity. To confound the issue, there are several other variables floating around even when complexity is fixed, such as starting values, the problems used, and time allotted. Since both algorithms probably converge to “the” solution over enough time, the limits you set are important. I assume you are running the algorithm on several problems of a fixed complexity and taking an average. (Pray for normality.)
Another issue you may face is how close to continuous your dependent and independent variables are; if they are too coarse, you are going to have to move to non-parametric techniques. Ditto if the distribution of the quality-of-solution metric is not fairly normal at a fixed problem complexity.
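As a rough illustration of the non-parametric route, here is a minimal Python sketch of a permutation test comparing the two algorithms at one fixed complexity. The scores and the names standard_ga, modified_ga and permutation_test are made up for the example; none of them come from Sam's actual experiments. The test assumes only that the runs would be interchangeable if the choice of algorithm made no difference.

```python
import random

def permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, estimated_p_value). No normality
    assumption is made; labels are shuffled to build the null distribution.
    """
    rng = random.Random(seed)
    n_a = len(a)
    observed = sum(a) / n_a - sum(b) / len(b)
    pooled = list(a) + list(b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Small tolerance so ties with the observed value count as extreme.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Hypothetical solution-quality scores, twelve runs each (illustration only).
standard_ga = [0.61, 0.58, 0.64, 0.60, 0.57, 0.63, 0.59, 0.62, 0.60, 0.58, 0.61, 0.59]
modified_ga = [0.66, 0.69, 0.65, 0.70, 0.67, 0.68, 0.64, 0.71, 0.66, 0.69, 0.67, 0.68]

diff, p = permutation_test(modified_ga, standard_ga)
print(f"mean difference = {diff:.3f}, permutation p-value = {p:.4f}")
```

Because the p-value is estimated from random shuffles, it changes slightly with the seed and the number of resamples; more resamples give a finer estimate.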
I do think CS folks spend a great deal of time with how to construct reasonable metrics so that they can actually apply statistical techniques. The techniques themselves are well documented in biology and psychology; they have been dealing with this kind of data situation since day one. And I found several courses on the web about research methods in these two areas.
But before jumping into the quagmire of super-stats, notice that the power of the picture and some simple stats may work fine. For instance, I think an ANOVA test of some sort will address precisely the issue of whether your slopes are different (assuming that a linear model is a good fit for the two data sets). Or you could do a linear regression on the paired differences of the two sets (across complexity or even across problems and average over complexity) and do a goodness of fit on this model. Assuming a fairly good fit, the line tells most of the story for you. And then you could focus on a few select complexity levels for a more in-depth analysis.
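The paired-differences idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the numbers are invented, the helper fit_line is a hypothetical name, and each value stands for the average quality over several problems at that complexity. The slope of the fitted line estimates how fast the gap between the algorithms changes with complexity, and R^2 serves as a simple goodness-of-fit measure.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical mean quality at each complexity level (illustration only).
complexity = [10, 20, 30, 40, 50, 60]
standard_ga = [0.90, 0.81, 0.69, 0.62, 0.50, 0.41]
modified_ga = [0.93, 0.86, 0.77, 0.71, 0.63, 0.55]

# One difference per complexity level: new algorithm minus standard.
differences = [m - s for m, s in zip(modified_ga, standard_ga)]
slope, intercept, r2 = fit_line(complexity, differences)
print(f"difference = {intercept:.4f} + {slope:.5f} * complexity (R^2 = {r2:.3f})")
```

A difference line that sits above zero over the whole complexity range supports "better at all complexities"; a clearly non-zero slope suggests the two original lines are not parallel.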
I have seen researchers outside of the life sciences sometimes back off larger scale experiments because of the statistics barrier. But remember, half of the confusion is the fuzziness. Ignore the fuzz :-) |
Sam
2005-07-12 12:55:51 |
Re: Statistics: Two sets of data: HELP!
Well, like I said in the last post, I'm thinking of abandoning the attempt to work out a significance test for all the data and will instead just focus on several specific points of the graph.
But, to explain what I meant in its entirety:
1) Forget time. I threw that in at the top because I thought it would be too complicated to explain what my experiments were.
2) My independent variables (the ones I can change) are problem complexity (approximately continuous) and the algorithm used (two different models). The dependent variable is the quality of the solution after 5000 generations.
Plotting the scattergraph and drawing the trendlines I get (hoping that the html pre tags work - I can't remember what we worked out with these forums)
|
| \
|  \
|   \     \
|    \      \
|             \
|________________
(if the pre tags don't work, the above is supposed to be a representation of a graph with two downward-sloping lines).
The x-axis represents complexity and the y-axis solution quality.
If the lines were parallel, I could run an ANCOVA analysis and see whether, in general, the second algorithm found significantly better solutions than the first, over all complexities. However, they aren't parallel.
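Whether the two lines really are non-parallel can itself be tested. Below is a minimal Python sketch, assuming a straight line is a reasonable fit for each algorithm and that the scatter around both lines is similar. It fits a separate line to each data set and forms a t statistic for the difference in slopes, which is equivalent to testing the algorithm-by-complexity interaction term in one regression on the combined data. The data values and the function name slope_difference_t are hypothetical.

```python
import math

def slope_difference_t(x1, y1, x2, y2):
    """t statistic for H0: the two regression lines have equal slopes.

    Residual variance is pooled across the two groups.
    Returns (slope1, slope2, t, degrees_of_freedom).
    """
    def fit(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
        intercept = my - slope * mx
        sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
        return slope, sxx, sse

    b1, sxx1, sse1 = fit(x1, y1)
    b2, sxx2, sse2 = fit(x2, y2)
    # Four parameters are estimated: two slopes and two intercepts.
    df = len(x1) + len(x2) - 4
    pooled_var = (sse1 + sse2) / df
    se = math.sqrt(pooled_var * (1.0 / sxx1 + 1.0 / sxx2))
    return b1, b2, (b1 - b2) / se, df

# Hypothetical mean quality at each complexity level (illustration only).
complexity = [10, 20, 30, 40, 50, 60]
standard_ga = [0.90, 0.81, 0.69, 0.62, 0.50, 0.41]
modified_ga = [0.93, 0.86, 0.77, 0.71, 0.63, 0.55]

b_mod, b_std, t_stat, df = slope_difference_t(
    complexity, modified_ga, complexity, standard_ga
)
print(f"slopes: {b_mod:.5f} vs {b_std:.5f}, t = {t_stat:.2f} on {df} degrees of freedom")
```

The statistic is compared against a t distribution with df degrees of freedom (a printed table works, or scipy.stats.t.sf if SciPy is installed). If the slopes do differ, the same combined model with the interaction term left in still describes both lines; the gap between them then depends on complexity rather than being a single number.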
But anyway, no matter. I discovered yesterday that there was a previously unnoticed bug in my program, so it's quite possible that data from somewhere in the region of 100 hours of simulations will need to be scrapped. My thesis is due in a month and a half. Such is life. |