These steps are best performed in a new markdown file. We'll be using a number of packages:

```r
library(MALDIquantForeign)
library(MALDIquant)
```
Before we can process our samples, we need to create a list of which samples we want to process. Previously, we obtained a list of files using `Sys.glob()`. Here, we'll use that list to start a dataframe that holds information on our samples.

We've been told the following about our samples:
To generate the annotation quickly based on this information, we can build vectors of relevant values and bind them.

We can repeat an element of a vector using the `rep()` function, e.g. `rep('A', 3)` gives us `A A A`. A vector itself can also be repeated, e.g. `rep(c('A','B'), 3)` gives us `A B A B A B`. This can also be done by element: `rep(c('A','B'), each=3)` gets us `A A A B B B`. You can even nest one `rep()` statement in another: `rep(rep(c('A','B'), each=3), 2)` gives `A A A B B B A A A B B B`.
Given the structure below, how can we fill in the missing annotations (i.e. replace the 0s)?
```r
samples<-as.data.frame(Sys.glob("SUM/*.sum.mzML"), stringsAsFactors = F)
colnames(samples)[1]<-"filename"
samples$soil<-0
samples$biorep<-0
samples$techrep<-0
samples$bio_id<-paste0(samples$soil,samples$biorep)
samples$tech_id<-paste0(samples$bio_id,".",samples$techrep)
samples
```
 | filename | soil | biorep | techrep | bio_id | tech_id |
---|---|---|---|---|---|---
1 | SUM/HW-290120-007.sum.mzML | A | 1 | 1 | A1 | A1.1 |
2 | SUM/HW-290120-008.sum.mzML | A | 1 | 2 | A1 | A1.2 |
3 | SUM/HW-290120-009.sum.mzML | A | 1 | 3 | A1 | A1.3 |
4 | SUM/HW-290120-010.sum.mzML | A | 2 | 1 | A2 | A2.1 |
5 | SUM/HW-290120-011.sum.mzML | A | 2 | 2 | A2 | A2.2 |
6 | SUM/HW-290120-012.sum.mzML | A | 2 | 3 | A2 | A2.3 |
7 | SUM/HW-290120-013.sum.mzML | A | 3 | 1 | A3 | A3.1 |
8 | SUM/HW-290120-014.sum.mzML | A | 3 | 2 | A3 | A3.2 |
9 | SUM/HW-290120-015.sum.mzML | A | 3 | 3 | A3 | A3.3 |
10 | SUM/HW-290120-016.sum.mzML | F | 1 | 1 | F1 | F1.1 |
11 | SUM/HW-290120-017.sum.mzML | F | 1 | 2 | F1 | F1.2 |
12 | SUM/HW-290120-018.sum.mzML | F | 1 | 3 | F1 | F1.3 |
13 | SUM/HW-290120-019.sum.mzML | F | 2 | 1 | F2 | F2.1 |
14 | SUM/HW-290120-020.sum.mzML | F | 2 | 2 | F2 | F2.2 |
15 | SUM/HW-290120-021.sum.mzML | F | 2 | 3 | F2 | F2.3 |
16 | SUM/HW-290120-022.sum.mzML | F | 3 | 1 | F3 | F3.1 |
17 | SUM/HW-290120-023.sum.mzML | F | 3 | 2 | F3 | F3.2 |
18 | SUM/HW-290120-024.sum.mzML | F | 3 | 3 | F3 | F3.3 |
19 | SUM/HW-290120-025.sum.mzML | O | 1 | 1 | O1 | O1.1 |
20 | SUM/HW-290120-026.sum.mzML | O | 1 | 2 | O1 | O1.2 |
21 | SUM/HW-290120-027.sum.mzML | O | 1 | 3 | O1 | O1.3 |
22 | SUM/HW-290120-028.sum.mzML | O | 2 | 1 | O2 | O2.1 |
23 | SUM/HW-290120-029.sum.mzML | O | 2 | 2 | O2 | O2.2 |
24 | SUM/HW-290120-030.sum.mzML | O | 2 | 3 | O2 | O2.3 |
25 | SUM/HW-290120-031.sum.mzML | O | 3 | 1 | O3 | O3.1 |
26 | SUM/HW-290120-032.sum.mzML | O | 3 | 2 | O3 | O3.2 |
27 | SUM/HW-290120-033.sum.mzML | O | 3 | 3 | O3 | O3.3 |
```r
samples$soil<-c(rep(c('A','F','O'), each=9))
#Alternatively:
#samples$soil<-c(rep('A',9),rep('F',9),rep('O',9))
samples$biorep<-c(rep(rep(1:3, each=3),3))
samples$techrep<-c(rep(1:3,9))
samples$bio_id<-paste0(samples$soil,samples$biorep)
samples$tech_id<-paste0(samples$bio_id,".",samples$techrep)
samples
```
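To sanity-check the filled-in annotation, we can tabulate the new columns. A quick sketch, assuming the `samples` dataframe built above; it checks against the expected design of 9 samples per soil and 3 technical replicates per biological replicate:

```r
# Cross-tabulate the annotation: every soil/biorep cell should contain 3
# technical replicates, and each soil should have 9 samples in total.
table(samples$soil, samples$biorep)
table(samples$soil)
```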
After summing, the size of the data is significantly smaller. In this project, we'll therefore be able to load all the samples into memory at the same time, which makes it easier to process all samples in one go. In our `samples` dataframe we have a list of files in the column `filename`. The MALDIquantForeign package provides the function `importMzMl()` to load mzML files:
```r
suppressWarnings(rawdata<-importMzMl(samples$filename, centroided = F, verbose=F))
length(rawdata)
```
The code chunk suppresses warnings for the import function, as it will produce a bunch of warnings about mismatching checksums. Whilst there is a standard for how these should be calculated, it appears that there are different implementations that can cause mismatches. At this stage the relevant question is:
Did we load the expected number of files?
As we have 27 samples, we expect 27 files to be loaded. If this isn't the case, it's time for some troubleshooting. Is your sample table as expected? Do the files exist in the correct directories? Do you have the relevant packages installed and loaded?
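The first two troubleshooting questions can be checked directly with base R. A sketch, assuming the `samples` dataframe from above:

```r
nrow(samples)                        # expecting 27 rows
all(file.exists(samples$filename))   # TRUE if every listed file is present
# List any files that are missing on disk:
samples$filename[!file.exists(samples$filename)]
```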
We can take a look at a full spectrum and a small section with some more detail like so:
```r
mzroi<-c(200,210) #We'll reuse these coordinates later.
inroi<-c(0,10000)
plot(rawdata[[1]])
plot(rawdata[[1]], xlim=mzroi, ylim=inroi)
```
This creates two plots. In RStudio, you can switch between plots by clicking the small previews above the current plot. One of the first steps to take is to transform the raw intensities by taking the square root. This causes the variance between measurements (e.g. the same peak in technical replicates) to be less dependent on the mean intensity (of e.g. the peak).
```r
spectra<-transformIntensity(rawdata, method="sqrt")
plot(spectra[[1]], xlim=mzroi, ylim=sqrt(inroi))
```
This does highlight the noise in the data - the signal does not look very smooth. A common step in signal processing is to smooth the signal: using information from neighbouring data points, we try to approximate what the underlying signal is likely to be. A widely used smoothing approach is applying a Savitzky-Golay filter. The important question is how many data points we should use to define the neighbourhood - in other words, how much the data should be smoothed. As a rule of thumb, use as many data points in the smoothing window as there are in the top half of the narrowest peak that should be included. From our zoomed-in part of the spectrum, we take the 2nd largest peak:
```r
plot(spectra[[1]], xlim=c(205.5,206.5), ylim=c(0,40), type='b')
```
Assuming the signal between 206.0 and 206.2 is a single peak, how many data points should we smooth over?
Optional tip: to add a horizontal line to a plot, add a new line that calls `abline()` (see `?abline`) after your plot command.
There are 11 data points between when the signal first rises above 20 counts and when the twin peaks drop below 20 counts.
Many filters need a half window size to be specified. For a point for which we are calculating the smoothed value, the half window size is the number of preceding as well as the number following data points to include. What half window size could we use here?
For the Savitzky-Golay filter the total window ('neighbourhood') is 2 * halfwindow + 1 (the data point itself). If we have a full window of 11, the halfwindow size is 5 ((11-1)/2).
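The relationship between the full window and the half window can be written as a one-liner (plain arithmetic, independent of MALDIquant):

```r
full_window <- 11                     # data points across the top half of the peak
half_window <- (full_window - 1) / 2  # 2 * half_window + 1 == full_window
half_window                           # 5
```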
Next, test out the effect of smoothing:

```r
testspectrumhw5<-smoothIntensity(spectra[[1]], method="SavitzkyGolay", halfWindowSize=5)
plot(testspectrumhw5, xlim=c(205.5,206.5), ylim=c(0,40), type='b')
```
This still preserves part of the 'twin' peak shape.
What happens if you increase the halfWindowSize?
For example:
```r
testspectrumhw11<-smoothIntensity(spectra[[1]], method="SavitzkyGolay", halfWindowSize=11)
plot(spectra[[1]], xlim=c(205.5,206.5), ylim=c(0,40), type='b', main="No smoothing")
plot(testspectrumhw5, xlim=c(205.5,206.5), ylim=c(0,40), type='b', main="SG halfWindowSize 5")
plot(testspectrumhw11, xlim=c(205.5,206.5), ylim=c(0,40), type='b', main="SG halfWindowSize 11")
```
Cycling through the plots, we can see that details that were still being retained with less smoothing (such as the twin peak shape) are lost. Too much smoothing may smooth out actual signal, rather than just noise.
We stick with a halfWindowSize of 5, and apply it to all spectra.
```r
spectra<-smoothIntensity(spectra, method="SavitzkyGolay", halfWindowSize=5)
```
We now have a somewhat smooth signal. In the previous plots we saw a noisy line that hovered around 10, with an occasional peak. The level of this line may differ between samples, and may also be different at different m/z ratios. To make samples more comparable, we can correct for it. As with smoothing, it is possible to remove too much, so we take a look first:
```r
baselinetest <- estimateBaseline(spectra[[1]], method="SNIP", iterations=100)
plot(spectra[[1]])
lines(baselinetest, col='red')
```
In general it looks like there's not too much variation within this sample. However, the estimated baseline, shown in red, looks a little bumpy. In close-up:
```r
plot(spectra[[1]], xlim=mzroi, ylim=c(0,40))
lines(baselinetest, col='red')
```
We can make the baseline less likely to fit to individual points by increasing the number of iterations performed in the calculation:
```r
baselinetest <- estimateBaseline(spectra[[1]], method="SNIP", iterations=250)
plot(spectra[[1]], xlim=mzroi, ylim=c(0,40))
lines(baselinetest, col='red')
plot(spectra[[1]])
lines(baselinetest, col='red')
```
As it now looks smoother, we apply these settings for baseline removal. As an output to check what's happened, we include a couple of plots.
```r
spectra <- removeBaseline(spectra, method="SNIP", iterations=250)
plot(spectra[[1]], xlim=mzroi, ylim=c(0,40))
plot(spectra[[1]])
```
In addition to removing the baseline, there are more steps to take to make signals more comparable. Some samples may have an overall higher (or lower) signal than other samples. We can check for this by calculating the total signal, or the total ion count (TIC). The assumption here is that the majority of the signal is similar between samples - which may be a valid assumption for soils, but may not be true in other settings. We can get the sums of the intensities to check for big differences:
```r
intensities<-sapply(spectra, function(spectrum){sum(intensity(spectrum))})
barplot(intensities/1e6, names.arg = samples$tech_id,
        xlab = "sample", ylab = "total intensity (in 1e6 counts)", las=2)
```
Are there large differences?
Are any of these differences structured?
There's about a two-fold difference between the least and most intense samples. The forest soil samples seem to have a higher overall intensity. There is a slight trend for the variation between technical replicates to be smaller than that between biological replicates.
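We can quantify these impressions by summarising the total intensities per soil type. A sketch, assuming the `intensities` vector and `samples` dataframe from above; the exact values will depend on your data:

```r
# Mean TIC per soil type, and the fold difference between the extremes
tapply(intensities, samples$soil, mean)
max(intensities) / min(intensities)
```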
For the actual calibration, and comparison afterwards:
```r
spectra <- calibrateIntensity(spectra, method="TIC")
intensities<-sapply(spectra, function(spectrum){sum(intensity(spectrum))})
barplot(intensities, names.arg = samples$tech_id,
        xlab = "sample", ylab = "total intensity", las=2)
```
The differences should be a lot smaller now.
There can be differences in the detection of ions that cause the signal to shift somewhat along the m/z dimension between samples. To see how well the samples align, we call some peaks in the data and compare them. The code below does this, and will create a file called Rplots.pdf in your project directory.
```r
testpeaks<-detectPeaks(spectra, halfWindowSize=5, SNR=3)
warps<-determineWarpingFunctions(testpeaks, plot = T, tolerance = 100e-6)
```
The plots in the pdf show the difference between peaks in the reference (a mix of peaks from all samples) and the peaks found in each sample. A line is fitted through these comparisons that describes the averaged shift from the reference.
What patterns do you see?
The fitted lines are mostly flat, but curve a bit in the higher m/z ranges. In this case alignment may not change much, especially as the variation in the difference between reference peak and sample is larger than the range of the corrections proposed. In other words, if the fitted line is near 0 for all samples, we don't expect large differences after alignment. In this case the spread of differences (how far the individual points are from 0) is much larger in both directions than the suggested correction, meaning the inaccuracy of the peak m/z value is largely caused by random variation (which we cannot control) rather than a systematic shift (which we could correct for). The code for aligning the spectra below is included for completeness, even though it isn't strictly necessary.
Align the spectra using the following code:
```r
spectra_aligned <- alignSpectra(spectra,
                                halfWindowSize=5,
                                SNR=3,
                                tolerance=100e-6,
                                warpingMethod="lowess")
```
We're expecting changes in the higher m/z range. Let's take a look:
```r
testpeaks_after<-detectPeaks(spectra_aligned, halfWindowSize=5, SNR=3)
plot(spectra_aligned[[18]], xlim=c(1061.9,1062.9), ylim=c(0,0.006))
lines(spectra[[18]], col='grey', lty=3)
lines(spectra_aligned[[7]], col='blue')
lines(spectra[[7]], col='cyan3', lty=3)
points(testpeaks[[18]], col='grey', pch=4)
points(testpeaks[[7]], col='cyan3', pch=4)
points(testpeaks_after[[18]], pch=4)
points(testpeaks_after[[7]], col='blue', pch=4)
```
The dotted lines are the spectra before alignment, the solid lines are the spectra after alignment. The peak position is indicated by a cross. As discussed, the shifts here are minor. We can see if the peaks are brought closer together:
```r
peakmzs_18<-mz(testpeaks[[18]])
peakmzs_07<-mz(testpeaks[[7]])
peakmzs_after_18<-mz(testpeaks_after[[18]])
peakmzs_after_07<-mz(testpeaks_after[[7]])
peakmzs_18[peakmzs_18 > 1062 & peakmzs_18 < 1063] -
  peakmzs_07[peakmzs_07 > 1062 & peakmzs_07 < 1063]
peakmzs_after_18[peakmzs_after_18 > 1062 & peakmzs_after_18 < 1063] -
  peakmzs_after_07[peakmzs_after_07 > 1062 & peakmzs_after_07 < 1063]
```
The samples are now quite comparable, so we can move towards extracting the data that we're interested in: the peak locations (m/z) and the intensity in each sample (count). As we've seen, there is still some noise in the data - this will always be the case. In most applications we want to avoid looking at the noise too much, so often a signal-to-noise ratio (SNR) is given as a criterion for peak calling. This means that to be regarded as a peak, the peak height needs to be above the noise level multiplied by the SNR. We can investigate our data for noise like so:
```r
noise<-MALDIquant::estimateNoise(spectra_aligned[[1]])
plot(spectra_aligned[[1]], xlim = mzroi, ylim=c(0,0.006))
lines(noise, col="red")
lines(noise[,1],noise[,2]*5, col="purple")
lines(noise[,1],noise[,2]*10, col="blue")
```
Now it looks like an SNR of 5 (purple line) would already cut out some peaks. However, this picture may be misleading. Let's look at a much higher m/z range:
```r
plot(spectra_aligned[[1]], xlim=c(1060,1070), ylim=c(0,0.01))
lines(noise, col="red") #Red line: noise level
lines(noise[,1],noise[,2]*5, col="purple") #Purple line: noise level * 5
lines(noise[,1],noise[,2]*10, col="blue") #Blue line: noise level * 10
```
Here it looks like we get a lot of repeating signal just above the 5×noise line. An SNR of 8, or possibly even 10, might prove a better balance between restricting noise and throwing away signal.
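One way to make this trade-off concrete is to count how many peaks are called at different SNR thresholds. A sketch, assuming the `spectra_aligned` object from above; the exact counts will depend on your data:

```r
snr_values <- c(3, 5, 8, 10)
peak_counts <- sapply(snr_values, function(snr) {
  length(mz(detectPeaks(spectra_aligned[[1]], halfWindowSize = 5, SNR = snr)))
})
names(peak_counts) <- paste0("SNR=", snr_values)
peak_counts  # higher SNR -> fewer, but more confident, peaks
```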
Keeping an SNR of 8, call the peaks with some additional refinements:
```r
peaks <- detectPeaks(spectra_aligned, method="MAD", halfWindowSize=5, SNR=8,
                     refineMz = "descendPeak", signalPercentage=25)
```
As we saw previously, even after alignment there are small differences in the peak m/z values. For an illustration:
```r
plot(spectra_aligned[[18]], xlim=c(1061.9,1062.9), ylim=c(0,0.006))
for(n in seq_along(peaks)){
  points(peaks[[n]], pch = 4, col=rainbow(length(peaks))[[n]])
}
```
To get rid of those small differences, we bin the peaks from different samples together within a certain tolerance.
```r
peaks<-binPeaks(peaks, method="strict", tolerance = 50e-6)
plot(spectra_aligned[[18]], xlim=c(1061.9,1062.9), ylim=c(0,0.006))
for(n in seq_along(peaks)){
  points(peaks[[n]], pch = 4, col=rainbow(length(peaks))[[n]])
}
```
This replaces the peak m/z of all the peaks that are close together with the average m/z of those peaks. The intensities remain unchanged.
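We can see this averaging at work by counting distinct m/z values across all samples: after binning, matched peaks share an identical m/z. A sketch, assuming the binned `peaks` object from above:

```r
# After binning, merged peaks carry exactly the same m/z value, so the
# number of distinct m/z values is far smaller than the number of peak calls.
all_mz <- unlist(lapply(peaks, mz))
length(all_mz)          # total peak calls across all samples
length(unique(all_mz))  # distinct m/z values after binning
```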
We can then extract our data.
```r
featureMatrix <- intensityMatrix(peaks, spectra_aligned)
```
As you'll have noticed, `intensityMatrix()` is also reading in our aligned spectra. Why would it do that?
In some samples we may not be able to call a peak at the location where we can call one in other samples. Reasons for this can be that there is no peak, the peak is masked by a stronger adjacent peak, or the noise level is higher in some samples. As you can read from `?intensityMatrix`, the function uses the spectra to fill in values for missing peaks.
The `featureMatrix` object is, as the name suggests, a matrix: one row for each sample, one column for each peak. How many peaks are there in the matrix?
Essentially this is the number of columns:

```r
ncol(featureMatrix)
```

(2041)
We can also take a look at the number of peaks called in each sample:
```r
peakspersample<-sapply(peaks, function(x){length(mz(x))})
barplot(peakspersample, names.arg = samples$tech_id, xlab = "sample", ylab = "peaks called", las=2)
```
Are there any differences?
There's a trend that more peaks are called in the Forest samples.
The differing number of peaks between biological replicates, and even between technical replicates, raises some questions about how reliable those peaks are. A peak that is only called in one replicate of one sample is probably not going to allow us to come to any statistically relevant conclusions. A way of getting a cleaner data set is filtering peaks based on which samples they are called in.

Based on the documentation for `filterPeaks`, what does the following do?
```r
peaks_byrep<-filterPeaks(peaks, minFrequency = 1, labels = samples$bio_id, mergeWhitelists = TRUE)
peaks_byrep_bysample<-filterPeaks(peaks_byrep, minFrequency = 0.5, labels = samples$soil, mergeWhitelists = TRUE)
```
It first keeps only peaks that appear in all three technical replicates (`minFrequency = 1`) of at least one (`mergeWhitelists = TRUE`) biological replicate (`labels = samples$bio_id`).
Next, the surviving peaks are filtered again to retain only peaks that were called in at least half of the technical replicates across all biological replicates in at least one soil type.
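To see how strict each filtering step is, we can compare peak counts per sample before and after each step. A sketch, assuming the `peaks`, `peaks_byrep` and `peaks_byrep_bysample` objects from above:

```r
# Number of peaks retained per sample at each filtering stage
data.frame(sample   = samples$tech_id,
           raw      = sapply(peaks, length),
           byrep    = sapply(peaks_byrep, length),
           bysample = sapply(peaks_byrep_bysample, length))
```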
Get a matrix with just the 'reliable' features:
```r
featureMatrix_filter <- intensityMatrix(peaks_byrep_bysample, spectra_aligned)
```
And check how many peaks we retained:
```r
ncol(featureMatrix_filter)
```
We now know where there are peaks, and what the intensity is in each sample. We don't yet know whether there are any differences between samples. A lot of analysis can be done in R, but alternatively we can export the data as csv files to enable import into other, more user-friendly applications:
```r
intensities_out<-data.frame(Sample=samples$tech_id,
                            Label=samples$soil,
                            featureMatrix_filter, check.names = F, stringsAsFactors = F)
write.table(t(intensities_out)[,!intensities_out$Sample %in% c("A3.3")], file = "Intensities_col.csv",
            row.names = TRUE, col.names = FALSE, sep=",")
write.csv(intensities_out, row.names = F, file = "Intensities.csv")
```
There will now be a couple of csv files stored in your project directory, which we will use for subsequent steps.
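As a final check, we can read one of the exported files back in and confirm its shape. A sketch, assuming Intensities.csv was written as above; the first two columns hold the sample and label annotation:

```r
check <- read.csv("Intensities.csv", check.names = FALSE)
dim(check)       # rows = samples, columns = 2 annotation columns + retained peaks
check[1:3, 1:4]  # peek at the first few entries
```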
This assumes a dataframe called `samples`, with a column `filename` that points to mzML files:
```r
suppressWarnings(rawdata<-importMzMl(samples$filename, centroided = F, verbose=F))
spectra<-transformIntensity(rawdata, method="sqrt")
spectra<-smoothIntensity(spectra, method="SavitzkyGolay", halfWindowSize=5)
spectra<-removeBaseline(spectra, method="SNIP", iterations=250)
spectra<-calibrateIntensity(spectra, method="TIC")
spectra_aligned<-alignSpectra(spectra, halfWindowSize=5, SNR=3,
                              tolerance=100e-6, warpingMethod="lowess")
peaks<-detectPeaks(spectra_aligned, method="MAD", halfWindowSize=5, SNR=8,
                   refineMz="descendPeak", signalPercentage=25)
peaks<-binPeaks(peaks, method="strict", tolerance = 50e-6)
peaks_byrep<-filterPeaks(peaks, minFrequency = 1, labels = samples$bio_id, mergeWhitelists = TRUE)
featureMatrix_filter <- intensityMatrix(peaks_byrep, spectra_aligned)
```
Additional filtering is recommended. The exact output format depends on the downstream application, and is not included in the summary above.
Next we look at what the extracted data can tell us.