kepcotrend: remove systematic trends from Kepler light curves using cotrending basis vectors

pyke.kepcotrend.kepcotrend(infile, bvfile, listbv, outfile=None, fitmethod='llsq', fitpower=1, iterate=False, sigma=None, maskfile='', scinterp='linear', plot=False, noninteractive=False, overwrite=False, verbose=False, logfile='kepcotrend.log')

Remove systematic trends from Kepler light curves using cotrending basis vectors. The cotrending basis vector files can be found here: http://archive.stsci.edu/kepler/cbv.html
Simple Aperture Photometry (SAP) data often contain systematic trends associated with the spacecraft, detector and environment rather than the target. See the Kepler data release notes for descriptions of systematics and the cadences that they affect. Within the Kepler pipeline these contaminants are treated during Presearch Data Conditioning (PDC), and cleaned data are provided in the light curve files archived at MAST within the column PDCSAP_FLUX. The Kepler pipeline attempts to remove systematics with a combination of data detrending and cotrending against engineering telemetry from the spacecraft, such as detector temperatures. These processes are imperfect but are tackled in the spirit of correcting as many targets as possible with enough accuracy for the mission to meet exoplanet detection specifications.
The imperfections in the method are most apparent in variable stars, the stars that are of most interest for stellar astrophysics. The PDC correction can occasionally hamper data analysis or, at worst, destroy astrophysical signal from the target. While data filtering (kepoutlier, kepfilter) and data detrending with analytical functions (kepdetrend) often provide some mitigation for data artifacts, these methods require assumptions and often result in lossy data. An alternative viable approach is to identify the photometric variability common to all of the stars neighboring the target and subtract those trends from the target. In principle, the correct choice, weighting and subtraction of these common trends will leave behind a corrected flux time series which better represents statistically the true signal from the target.

While GOs, KASC members and archive users wait for the Kepler project to release quarters of data, they do not have access to all the light curve data neighboring their targets and so cannot take the ensemble approach themselves without help. To mitigate this problem the Kepler Science Office has made available ancillary data which describe the systematic trends present in the ensemble flux data for each CCD channel. These data are known as the Cotrending Basis Vectors (CBVs). More details on the method used to generate these basis vectors will be provided in the Kepler Data Processing Handbook, but until that time a summary of the method is given here. To create the initial basis set, that is, the flux time series used to make the cotrending basis vectors:
1. The time series photometry of each star on a specific detector channel is normalized by its own median flux.
2. One (unity) is subtracted from each time series so that the median value of the light curve is zero.
3. The time series is divided by the root mean square of the photometry.
4. The correlation between each pair of time series on the CCD channel is calculated using the median- and root-mean-square-normalized flux.
5. The median absolute correlation is then calculated for each star.
6. All stars on the channel are sorted into ascending order of correlation.
7. The 50 percent most correlated stars are selected.
8. The median-normalized fluxes only (as opposed to the root-mean-square-normalized fluxes) are used for the rest of the process.
9. Singular Value Decomposition is applied to the matrix of correlated sources to create orthonormal basis vectors from the U matrix, sorted by their singular values.
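The steps above can be sketched in numpy. This is only an illustration of the published recipe on toy data, not the pipeline code; the array names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
fluxes = 1000.0 + rng.normal(0, 5, size=(40, 500))  # toy SAP photometry: (n_stars, n_cadences)

# Steps 1-2: normalize each star by its own median flux and subtract unity.
med_norm = fluxes / np.median(fluxes, axis=1, keepdims=True) - 1.0

# Step 3: divide by the root mean square for the correlation step.
rms_norm = med_norm / np.sqrt(np.mean(med_norm**2, axis=1, keepdims=True))

# Steps 4-5: correlation matrix, then the median absolute correlation per star.
corr = np.corrcoef(rms_norm)
med_abs_corr = np.median(np.abs(corr), axis=1)

# Steps 6-8: keep the 50 percent most correlated stars, using median-normalized fluxes.
keep = np.argsort(med_abs_corr)[med_abs_corr.size // 2:]
selected = med_norm[keep]

# Step 9: SVD; the columns of U are orthonormal basis vectors sorted by
# singular value. The archive keeps the 16 leading columns.
U, s, Vt = np.linalg.svd(selected.T, full_matrices=False)
cbvs = U[:, :16]  # shape (n_cadences, 16)
```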
The archived cotrending basis vectors are a reduced-rank representation of the full set of basis vectors and consist of the 16 leading columns.
To correct a SAP light curve, \(F_{sap}\), for systematic features, kepcotrend employs the cotrending basis vectors \(CBV_i\). The task finds the coefficients \(A_i\) which minimize

\[F_{cbv} = F_{sap} - \sum_{i} A_i \cdot CBV_i\]

The corrected light curve, \(F_{cbv}\), can be tailored to the needs of the user and their scientific objective. The user decides which combination of basis vectors best removes systematics from their specific Kepler SAP light curve. In principle the user can choose any combination of cotrending basis vectors to fit to the data. However, experience suggests that the most common choice is to decide how many sequential basis vectors to include in the fit, starting with the first vector. For example, a user is much more likely to choose the vector combination 1, 2, 3, 4, 5, 6, etc. than, e.g., the combination 1, 2, 5, 7, 8, 10, 12. The user should always include at least the first two basis vectors. The number of basis vectors used is directly related to the scientific aims of the user and the light curve being analyzed, and experimental iteration towards a target-specific optimal basis set is recommended. Occasionally kepcotrend overfits the data and removes real astrophysical signal. This is particularly prevalent if too many basis vectors are used. A good rule of thumb is to start with two basis vectors and increase the number until there is no improvement, or until signals which are thought to be astrophysical start to become distorted.
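The correction above amounts to a linear least squares problem. A minimal sketch, using toy stand-in data rather than real Kepler fluxes and basis vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
t = np.linspace(0, 1, n)

# Two toy "basis vectors" as columns of the design matrix.
cbvs = np.column_stack([t - 0.5, np.cos(4 * np.pi * t)])
true_coeffs = np.array([3.0, -1.5])
sap_flux = cbvs @ true_coeffs + 0.01 * rng.normal(size=n)  # systematics + noise

# Find the coefficients A_i minimizing ||F_sap - sum_i A_i * CBV_i||^2.
coeffs, *_ = np.linalg.lstsq(cbvs, sap_flux, rcond=None)

# Corrected light curve: the analogue of the CBVSAP_FLUX column.
cbv_flux = sap_flux - cbvs @ coeffs
```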
The user is given a choice of fitting algorithm. For most purposes the linear least squares method is both the fastest and the most accurate, because it gives the exact solution to the least squares problem. However, we have found a few situations where the best solution, scientifically, comes from the simplex fitting algorithm, which can minimize something other than the sum of squared residuals. Performing a least absolute residuals fit (fitpower=1.0), for example, is more robust to outliers.
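The contrast between the two fit modes can be sketched as follows, comparing an exact least squares solution with a Nelder-Mead (simplex) minimization of \(\sum |Obs - Mod|^P\) with \(P = 1\). Toy data and names, not the PyKE implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 300
cbv = np.linspace(-1, 1, n).reshape(-1, 1)   # a single toy basis vector
flux = 2.0 * cbv[:, 0] + 0.01 * rng.normal(size=n)
flux[:10] += 50.0                            # gross outliers

def merit(a, p):
    # Generalized merit function: sum of |residual|^p.
    return np.sum(np.abs(flux - cbv @ a) ** p)

# Least squares is pulled toward the outliers...
lsq = np.linalg.lstsq(cbv, flux, rcond=None)[0]

# ...while a simplex fit of the p = 1 (least absolute residuals) merit
# function largely ignores them.
lad = minimize(merit, x0=np.zeros(1), args=(1.0,), method="Nelder-Mead").x
```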
There are instances when the fit performs suboptimally due to the presence of certain events in the light curve. For this reason we have included two options, which can be used individually or simultaneously to improve the fit: iterative fitting and data masking. Iterative fitting performs the fit, rejects data points that are greater than a specified distance from the optimal fit, and then refits. The clipping threshold is provided by the user as a number of sigma from the best fit. More precisely, it is the number of Median Absolute Deviations (MADs) multiplied by 1.4826; for a Gaussian distribution this quantity is identical to the standard deviation. We use MAD because in highly non-Gaussian distributions it is more robust to outliers than the standard deviation.
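The iterative clipping loop described above can be sketched as: fit, reject points more than `sigma` robust standard deviations (1.4826 x MAD) from the fit, and refit on the survivors. Toy data; variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
cbvs = np.column_stack([np.linspace(-1, 1, n)])        # one toy basis vector
flux = 1.5 * cbvs[:, 0] + 0.02 * rng.normal(size=n)
flux[100:105] -= 5.0                                   # a short dropout to be clipped

mask = np.ones(n, dtype=bool)
sigma = 3.0
for _ in range(3):                                     # a few fit/clip iterations
    # Fit only the currently accepted points.
    coeffs = np.linalg.lstsq(cbvs[mask], flux[mask], rcond=None)[0]
    resid = flux - cbvs @ coeffs
    # Robust scale: MAD of the accepted residuals, scaled by 1.4826
    # so it matches the standard deviation for Gaussian noise.
    mad = np.median(np.abs(resid[mask] - np.median(resid[mask])))
    mask = np.abs(resid) < sigma * 1.4826 * mad
```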
The code will print out the coefficients fit to each basis vector, the root mean square of the fit and the chi-squared value of the fit. The rms and chi-squared values include only the data points used in the fit, so if an iterative fit is performed the clipped points are excluded from this calculation.
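For clarity, a sketch of those two goodness-of-fit numbers computed only over the points kept in the fit. The arrays and the per-point uncertainties are invented for the example:

```python
import numpy as np

flux = np.array([1.0, 1.1, 0.9, 5.0, 1.05])       # one gross outlier (5.0)
model = np.array([1.0, 1.0, 1.0, 1.0, 1.0])       # best-fit basis-vector sum
err = np.full_like(flux, 0.1)                     # assumed per-point uncertainties
mask = np.array([True, True, True, False, True])  # the outlier was clipped

# Both statistics use only the unclipped points.
resid = flux[mask] - model[mask]
rms = np.sqrt(np.mean(resid**2))
chi2 = np.sum((resid / err[mask]) ** 2)
```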
Parameters: infile : str
the input file in the FITS format obtained from MAST
outfile : str
the output will be a FITS file in the same style as the input file but with two additional columns: CBVSAP_MODL and CBVSAP_FLUX. The first is the best-fitting linear combination of basis vectors; the second is the new flux value, with the basis vector sum subtracted.
bvfile : str
the name of the FITS file containing the basis vectors
listbv : list of integers
the basis vectors to fit to the data
fitmethod : str
fit using either the 'llsq' or the 'simplex' method. 'llsq' is usually the correct one to use because the basis vectors are orthogonal. 'simplex' gives you the option of using a different merit function, i.e., you can minimise the least absolute residuals instead of the least squares, which weights outliers less heavily
fitpower : float
if using the simplex method you can choose your own power in the merit function, i.e., the merit function minimises \(|Obs - Mod|^P\). \(P = 2\) is least squares; \(P = 1\) minimises the least absolute residuals
iterate : bool
should the program fit the basis vectors to the light curve data, remove data points further than 'sigma' from the fit, and then refit?
sigma : float
the clipping threshold used when iterate is True, given as the number of MADs multiplied by 1.4826 from the best fit
maskfile : str
the name of a mask file which can be used to define regions of the flux time series to exclude from the fit. The easiest way to create this is with keprange from the PyKE set of tools. You can also make this file yourself, with two BJDs on each line specifying the beginning and ending date of the region to exclude.
scinterp : str
the basis vectors are only calculated for long cadence data, therefore if you want to use short cadence data you have to interpolate the basis vectors. There are several methods to do this, probably the best being 'nearest', which picks the value of the nearest long cadence data point. The options available are:
 linear
 nearest
 zero
 slinear
 quadratic
 cubic
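The interpolation step above can be sketched with scipy's `interp1d`, whose `kind` argument accepts the same option names listed here. The time grids are toy stand-ins for the long- and short-cadence timestamps:

```python
import numpy as np
from scipy.interpolate import interp1d

lc_time = np.arange(0.0, 10.0, 0.5)   # toy long-cadence timestamps
cbv_lc = np.sin(lc_time)              # one toy basis vector sampled at long cadence
sc_time = np.arange(0.0, 9.5, 0.1)    # denser short-cadence grid (within bounds)

# 'nearest' picks the value of the nearest long-cadence data point;
# swapping kind for 'linear', 'cubic', etc. selects the other methods.
cbv_sc = interp1d(lc_time, cbv_lc, kind="nearest")(sc_time)
```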
plot : bool
Plot the data and result?
noninteractive : bool
If True, prevents the matplotlib window from popping up.
overwrite : bool
Overwrite the output file?
verbose : bool
Print informative messages and warnings to the shell and logfile?
logfile : str
Name of the logfile containing error and warning messages.
Examples
$ kepcotrend kplr005110407-2009350155506_llc.fits ~/cbv/kplr2009350155506-q03-d25_lcbv.fits '1 2 3' --plot --verbose