
From: "Simo Hansen" <simohansen@gmail.com>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: St: How to handle missing observations in the factor-principal component analysis
Date: Wed, 19 Dec 2007 13:51:25 +0200

Dear Dr. Maarten,

Thank you for the eye-opening explanation and example. However, I need to have a second variable to run the -ice- command. For example, I create a dummy variable indicating whether she is using a computer at work or not. In order to use -ice-, I need a second variable, which I don't have (or do I?). In your example, I created many dummy variables to capture a woman's knowledge level. So I am thinking that I am forced to replace missing values with mean values. Am I right? I think I am missing something here. For example, for the computer dummy variable (let's call it computer), what would be the second variable that I can use?

ice computer ????, saving(temp, replace)

Your explanation raised another question for me. You said: "This looks problematic, even without missing data." Are there alternative ways to achieve the same purpose?

Thank you very much for your suggestion and explanation.

Best regards,
Simo

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Maarten buis
Sent: Wednesday, 19 December 2007 13:03
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: St: How to handle missing observations in the factor-principal component analysis

--- Simo Hansen <simohansen@gmail.com> wrote:
> I am trying to construct a knowledge index for women in my data. I have
> some missing observations for the variables that are converted to dummy
> variables to conduct the factor analysis. Could anyone provide help
> on how to handle those missing observations?

This looks problematic, even without missing data. At the very bottom I will suggest a solution to the missing data problem if you still wish to proceed with this analysis.

Consider the logic behind factor analysis: We imagine that there are one or more unobserved variables (f) that influence the observed variables (x) in a linear way.
Say we have three observed variables (x1, x2, and x3) and one factor (f), and that both the observed variables and the factor are standardized, so there is no constant. So, we get the following system:

x1 = l1 f + e1
x2 = l2 f + e2
x3 = l3 f + e3

We don't observe f, but if we assume that the errors across equations are not correlated, then all correlation between x1, x2, and x3 is due to the fact that they have f in common. We use this to reconstruct f.

The problem is that if any of the xs is a dummy then the assumption of a linear effect of a variable on that x can fail. If you turn a categorical or ordinal variable into dummies then you are adding dependencies between your variables that have nothing to do with the common factor but are still assigned to that factor.

> The other question I have is that there is the
> following command in SPSS:
> /Missing MeanSub.
> How can I write this command in Stata?

This looks like mean imputation to me. This is a very, very bad idea. Remember that factor analysis uses the correlations between variables to reconstruct the latent factor. With mean imputation you seriously distort those correlations. Consider two variables, x1 and x2, where x1 has missing data, which are replaced by the mean. The consequences are shown in the graphs below:

  |       ***          |       ***
  |    *****           |    *****
  |   *****            |   *****
x1|  *****             |xx*****xxx
  | *****              | *****
  |*****               |*****
  |***                 |***
  ---------------      ---------------
         x2                   x2

The xs in the right graph are the imputed values; it is clear that they seriously distort the correlation. In particular, this leads to an underestimation of the correlation.

The help file of -factor- also links to the -impute- command, which does regression imputation. This too is a bad idea. It puts all the missing values on the regression line, as is shown in the graphs below, and thus overestimates the correlation.

  |       ***          |       *x*
  |    *****           |    **x**
  |   *****            |   **x**
x1|  *****             |  **x**
  | *****              | **x**
  |*****               |**x**
  |***                 |*x*
  ---------------      ---------------
         x2                   x2

A better method is to use -ice-, which can be downloaded from -ssc-. This will preserve the actual correlation by adding the necessary noise around the regression line:

  |       ***          |       x**
  |    *****           |    ***x*
  |   *****            |   x****
x1|  *****             |  ****x
  | *****              | **x**
  |*****               |***x*
  |***                 |x**
  ---------------      ---------------

Notice that it is not necessary to let -ice- make multiple imputed datasets if you are only interested in the point estimates for the factor scores. The multiple imputations are only used for adjusting the standard errors.

So my suggestion for your missing data problem is: use -ice- to create an imputed dataset, and use that dataset to do the factor analysis. Remember to add -if _mj==1- to the -factor- command (-ice- stores your original data on top, identified by the value 0 on the variable _mj).

Hope this helps,
Maarten

P.S. Below is a simulation showing what the different methods do to the correlations:

*----------------- begin example -----------------------
set more off
sysuse auto, clear
* correlation in the complete data, used as the benchmark
corr headroom trunk
global true = r(rho)
cd "h:\temp"

capture program drop sim
program sim, rclass
    sysuse auto, clear
    * make headroom missing with a probability that depends on trunk
    replace headroom = . if uniform() < invlogit(-1 - .1* trunk)

    * regression imputation
    impute headroom trunk, gen(headimp)
    corr headimp trunk
    return scalar imp = r(rho) - $true

    * mean imputation
    sum headroom, meanonly
    gen headmean = cond(missing(headroom),r(mean),headroom)
    corr headmean trunk
    return scalar mean = r(rho) - $true

    * imputation with -ice-
    ice headroom trunk, saving(temp, replace)
    use temp, clear
    corr headroom trunk if _mj == 1
    return scalar ice = r(rho) - $true
end

simulate imp=r(imp) mean=r(mean) ice=r(ice), reps(10000) : sim

twoway kdensity imp || kdensity mean || kdensity ice,          ///
    legend(order(1 "-impute-" 2 "mean" "imputation" 3 "-ice-")) ///
    xtitle("deviation from true correlation")
*------------------ end example ------------------------
(For more on how to use examples I sent to the Statalist, see
http://home.fsw.vu.nl/m.buis/stata/exampleFAQ.html )

-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434

+31 20 5986715

http://home.fsw.vu.nl/m.buis/
-----------------------------------------

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
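
The suggestion in the quoted message (impute once with -ice-, then run -factor- on the imputed copy only) might look roughly like the sketch below. It is an illustration under assumptions, not the analysis from the thread: auto.dta and its variables stand in for the real data, the file name imputed and the variable name score1 are made up, and the -ice- options shown (saving() and m()) should be checked against the help file of the installed version. Because -ice- imputes each incomplete variable from the other variables in its list, listing all the indicator variables together may be one way to supply the "second variable" asked about in the reply.

```stata
* Sketch only: auto.dta stands in for the real data.
* -ice- is user-written; install it first with: ssc install ice
sysuse auto, clear
set seed 12345

* Knock out some values of headroom to mimic missing data (illustration only)
replace headroom = . if runiform() < 0.2

* Impute each incomplete variable from the other variables in the list.
* m(1) requests a single imputed copy (assumed option; see -help ice-).
ice headroom trunk weight length, saving(imputed, replace) m(1)

* The saved file stacks the original data (_mj == 0) on top of the
* imputed copy (_mj == 1), so restrict the analysis to the imputed copy.
use imputed, clear
factor headroom trunk weight length if _mj == 1, factors(1)

* score1 is a hypothetical name for the predicted factor score
predict score1 if _mj == 1
```

With real data, the variable list would be replaced by the full set of indicator variables that enter the factor analysis.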
