| Title: | Tidy and Streamlined Metabolomics Data Workflows |
|---|---|
| Description: | Facilitate tasks typically encountered during metabolomics data analysis including data import, filtering, missing value imputation (Stacklies et al. (2007) <doi:10.1093/bioinformatics/btm069>, Stekhoven et al. (2012) <doi:10.1093/bioinformatics/btr597>, Tibshirani et al. (2017) <doi:10.18129/B9.BIOC.IMPUTE>, Troyanskaya et al. (2001) <doi:10.1093/bioinformatics/17.6.520>), normalization (Bolstad et al. (2003) <doi:10.1093/bioinformatics/19.2.185>, Dieterle et al. (2006) <doi:10.1021/ac051632c>, Zhao et al. (2020) <doi:10.1038/s41598-020-72664-6>) transformation, centering and scaling (Van Den Berg et al. (2006) <doi:10.1186/1471-2164-7-142>) as well as statistical tests and plotting. 'metamorphr' introduces a tidy (Wickham et al. (2019) <doi:10.21105/joss.01686>) format for metabolomics data and is designed to make it easier to build elaborate analysis workflows and to integrate them with 'tidyverse' packages including 'dplyr' and 'ggplot2'. |
| Authors: | Yannik Schermer [aut, cre, cph] (ORCID: <https://orcid.org/0009-0002-5201-057X>) |
| Maintainer: | Yannik Schermer <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.0.9000 |
| Built: | 2026-06-08 20:27:23 UTC |
| Source: | https://github.com/yasche/metamorphr |
The data set contains the atomic weights of the elements and their isotopes.
It is used to calculate the exact mass in formula_to_mass but can also be used as a reference.
atoms
A data frame with 442 rows and 7 columns:
The atomic number of the element in the periodic table.
The element.
The mass number of the specific isotope.
The atomic symbol. Either only the letter (for standard isotopes) or the mass number followed by the symbol (for special isotopes).
The monoisotopic mass of the isotope.
The fraction of the isotope in the naturally occurring element.
The standard atomic weight of the element. It is the sum of the product of the Weight and Composition column for each element. Where no composition is available, the weight of the IUPAC "ATOMIC WEIGHTS OF THE ELEMENTS 2023" table was used. See the Source section for more information.
...
The table was retrieved from the National Institute of Standards and Technology (NIST) at https://physics.nist.gov/cgi-bin/Compositions/stand_alone.pl, accessed in October 2025, and enriched with data from the IUPAC "ATOMIC WEIGHTS OF THE ELEMENTS 2023" table at https://iupac.qmul.ac.uk/AtWt/, accessed in October 2025.
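The relation between the Weight and Composition columns and the standard atomic weight can be checked by hand. The following is a minimal base R sketch for carbon. The isotope values are typed in from the NIST table rather than read from atoms, and the object names carbon and standard_weight are only illustrative.

```r
# Carbon isotopes: monoisotopic masses and natural abundances (NIST values)
carbon <- data.frame(
  Isotope     = c(12, 13),
  Weight      = c(12.000000, 13.00335483507),
  Composition = c(0.9893, 0.0107)
)

# Standard atomic weight = sum of Weight * Composition over all isotopes
standard_weight <- sum(carbon$Weight * carbon$Composition)
standard_weight
```

The result is close to the familiar value of about 12.011 for carbon.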
Calculate the Kendrick mass for a given mass (or m/z) and repeating unit.
The Kendrick mass is a rescaled mass that usually sets CH2 = 14, but other
repeating units can also be used. It is useful for the visual identification
of potential homologues. See the References section for more information.
The Kendrick mass is not to be confused with the Kendrick mass defect
(KMD, calc_kmd) and
the nominal Kendrick mass (calc_nominal_km).
calc_km(mass, repeating_unit = "CH2")
mass |
A molecular mass (or m/z). |
repeating_unit |
The formula of the repeating unit, given as a string. |
The Kendrick mass.
Edward Kendrick, Anal. Chem. 1963, 35, 2146–2154.
C. A. Hughey, C. L. Hendrickson, R. P. Rodgers, A. G. Marshall, K. Qian, Anal. Chem. 2001, 73, 4676–4681.
# Calculate the Kendrick masses for two measured masses with
# CH2 as the repeating unit.
# See Hughey et al. in the References section above
calc_km(c(351.3269, 365.3425))

# Construct a KMD plot from m/z values.
# RT is mapped to color and the feature-wise maximum intensity to size.
# Note that in the publication by Hughey et al., the nominal Kendrick mass
# is used on the x-axis instead of the exact Kendrick mass.
# See ?calc_nominal_km.
toy_metaboscape %>%
  dplyr::group_by(UID, `m/z`, RT) %>%
  dplyr::summarise(max_int = max(Intensity, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(KMD = calc_kmd(`m/z`),
                KM = calc_km(`m/z`)) %>%
  ggplot2::ggplot(ggplot2::aes(x = KM,
                               y = KMD,
                               size = max_int,
                               color = RT)) +
    ggplot2::geom_point()
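The rescaling behind the Kendrick mass can be written out in a few lines of base R. This is only a sketch of the usual definition (mass multiplied by the nominal mass of the repeating unit and divided by its exact mass), not the package's implementation. The helper name kendrick_mass_sketch and the hard-coded exact mass of CH2 (14.01565) are assumptions made for illustration.

```r
# Hypothetical helper, not the package implementation:
# rescale a mass so that the repeating unit has an integer mass.
kendrick_mass_sketch <- function(mass, unit_exact_mass = 14.01565) {
  mass * round(unit_exact_mass) / unit_exact_mass
}

# Same two masses as in the calc_km() example
km <- kendrick_mass_sketch(c(351.3269, 365.3425))
km
```

On this scale the repeating unit itself has a mass of exactly 14, so members of a CH2 homologous series differ by almost exactly 14 in Kendrick mass.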
The Kendrick mass defect (KMD) is calculated by subtracting the Kendrick mass
(calc_km) from the nominal Kendrick mass
(calc_nominal_km). See the References section for
more information.
calc_kmd(mass, repeating_unit = "CH2")
mass |
A molecular mass (or m/z). |
repeating_unit |
The formula of the repeating unit, given as a string. |
The Kendrick mass defect (KMD)
Edward Kendrick, Anal. Chem. 1963, 35, 2146–2154.
C. A. Hughey, C. L. Hendrickson, R. P. Rodgers, A. G. Marshall, K. Qian, Anal. Chem. 2001, 73, 4676–4681.
# Calculate the Kendrick mass defects for two measured masses with
# CH2 as the repeating unit.
# See Hughey et al. in the References section above
calc_kmd(c(351.3269, 365.3425))

# Construct a KMD plot from m/z values.
# RT is mapped to color and the feature-wise maximum intensity to size.
toy_metaboscape %>%
  dplyr::group_by(UID, `m/z`, RT) %>%
  dplyr::summarise(max_int = max(Intensity, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(KMD = calc_kmd(`m/z`),
                `nominal KM` = calc_nominal_km(`m/z`)) %>%
  ggplot2::ggplot(ggplot2::aes(x = `nominal KM`,
                               y = KMD,
                               size = max_int,
                               color = RT)) +
    ggplot2::geom_point()
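As a rough illustration of why the KMD helps to spot homologues, the calculation can be sketched in base R. The sketch assumes that the nominal Kendrick mass is obtained with ceiling(), following the "rounded up" wording used for calc_nominal_km. The helper name kmd_sketch and the hard-coded exact mass of CH2 are hypothetical and not part of the package.

```r
# Hypothetical helper: KMD = nominal Kendrick mass - Kendrick mass,
# with the nominal Kendrick mass taken as the Kendrick mass rounded up.
kmd_sketch <- function(mass, unit_exact_mass = 14.01565) {
  km <- mass * round(unit_exact_mass) / unit_exact_mass
  ceiling(km) - km
}

# Two masses that differ by one CH2 unit give nearly identical KMDs
kmd <- kmd_sketch(c(351.3269, 365.3425))
kmd
```

Because both masses belong to the same CH2 series, their KMD values agree to within measurement error, which is what places homologues on a horizontal line in a KMD plot.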
calc_neutral_loss() is fully replaced by msn_calc_nl.
Calculate neutral loss spectra for all ions with available MSn spectra in data. To calculate neutral losses, MSn spectra are required.
See read_mgf. This step is required for subsequent filtering based on
neutral losses (filter_neutral_loss). Resulting neutral loss spectra are stored in tibbles in a new list column named Neutral_Loss.
calc_neutral_loss(data, m_z_col)
data |
A tidy tibble created by |
m_z_col |
Which column holds the precursor m/z? Uses |
A tibble with added neutral loss spectra. A new list column is created named Neutral_Loss.
toy_mgf %>%
  calc_neutral_loss(m_z_col = PEPMASS)
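Conceptually, a neutral loss spectrum is the precursor m/z minus each fragment m/z. The base R sketch below shows this for a single made-up spectrum. The column names m_z and Intensity and all numeric values are assumptions for illustration and are not taken from toy_mgf.

```r
# One MSn spectrum stored as a small data frame (made-up values)
precursor_mz <- 100
msn <- data.frame(
  m_z       = c(66.6667, 77.7778, 88.8889),
  Intensity = c(10, 50, 100)
)

# Neutral loss = precursor m/z - fragment m/z; intensities are carried over
neutral_loss <- data.frame(
  m_z       = precursor_mz - msn$m_z,
  Intensity = msn$Intensity
)
neutral_loss
```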
The nominal Kendrick mass is the Kendrick mass
(calc_km), rounded up to the nearest
whole number. The nominal Kendrick mass and the Kendrick mass are both required
to calculate the Kendrick mass defect (KMD).
The nominal Kendrick mass is not to be confused with the Kendrick mass defect
(calc_kmd) and
the Kendrick mass (calc_km).
calc_nominal_km(mass, repeating_unit = "CH2")
mass |
A molecular mass (or m/z). |
repeating_unit |
The formula of the repeating unit, given as a string. |
The nominal Kendrick mass.
Edward Kendrick, Anal. Chem. 1963, 35, 2146–2154.
C. A. Hughey, C. L. Hendrickson, R. P. Rodgers, A. G. Marshall, K. Qian, Anal. Chem. 2001, 73, 4676–4681.
# Calculate the nominal Kendrick masses for two measured masses with
# CH2 as the repeating unit.
# See Hughey et al. in the References section above
calc_nominal_km(c(351.3269, 365.3425))

# Construct a KMD plot from m/z values.
# RT is mapped to color and the feature-wise maximum intensity to size.
toy_metaboscape %>%
  dplyr::group_by(UID, `m/z`, RT) %>%
  dplyr::summarise(max_int = max(Intensity, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(KMD = calc_kmd(`m/z`),
                `nominal KM` = calc_nominal_km(`m/z`)) %>%
  ggplot2::ggplot(ggplot2::aes(x = `nominal KM`,
                               y = KMD,
                               size = max_int,
                               color = RT)) +
    ggplot2::geom_point()
Calculates the maximum of the intensity of technical replicates (e.g., if the same sample was injected multiple times or if multiple workups have been performed on the same starting material). The function assigns new sample names by joining either group and replicate name or, if a batch column is specified, group, replicate and batch with a specified separator. Due to the nature of the function, sample and feature metadata columns will be dropped unless they are specified with the corresponding arguments.
collapse_max(
  data,
  group_column = .data$Group,
  replicate_column = .data$Replicate,
  batch_column = .data$Batch,
  feature_metadata_cols = "Feature",
  sample_metadata_cols = NULL,
  separator = "_"
)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
replicate_column |
Which column contains replicate information? Usually |
batch_column |
Which column contains batch information? If all samples belong to the same batch (i.e., they all have the same batch identifier in the |
feature_metadata_cols |
A character or character vector containing the names of the feature metadata columns. They are usually created when reading the feature table with |
sample_metadata_cols |
A character or character vector containing the names of the sample metadata columns. They are usually created when joining the metadata with |
separator |
Separator used for joining group and replicate, or group, batch and replicate together to create the new sample names. The new sample names will be Group name, separator, Batch name, separator, Replicate name, or Group name, separator, Replicate name, in case all samples belong to the same batch (i.e., they all have the same batch identifier in the |
A tibble with intensities of technical replicates collapsed.
# uses a slightly modified version of toy_metaboscape_metadata
collapse_toy_metaboscape_metadata <- toy_metaboscape_metadata
collapse_toy_metaboscape_metadata$Replicate <- 1

toy_metaboscape %>%
  join_metadata(collapse_toy_metaboscape_metadata) %>%
  impute_lod() %>%
  collapse_max(group_column = Group, replicate_column = Replicate)
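The idea shared by the collapse_*() functions can be pictured with a base R sketch: build the new sample name from group and replicate, then summarise the intensities per feature and new sample. The data, the Sample column name and the use of aggregate() are assumptions for illustration. The package itself may work differently internally.

```r
# Long-format toy data: two injections of the same sample (made-up values)
long <- data.frame(
  Feature   = c("f1", "f1", "f2", "f2"),
  Group     = "treatment",
  Replicate = 1,
  Intensity = c(10, 14, 3, 5)
)

# New sample name: group and replicate joined with the separator
long$Sample <- paste(long$Group, long$Replicate, sep = "_")

# Collapse technical replicates; swap max for mean, median or min as needed
collapsed <- aggregate(Intensity ~ Feature + Sample, data = long, FUN = max)
collapsed
```

Only the columns named in the formula survive the aggregation, which mirrors why metadata columns are dropped unless they are listed explicitly.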
Calculates the mean of the intensity of technical replicates (e.g., if the same sample was injected multiple times or if multiple workups have been performed on the same starting material). The function assigns new sample names by joining either group and replicate name or, if a batch column is specified, group, replicate and batch with a specified separator. Due to the nature of the function, sample and feature metadata columns will be dropped unless they are specified with the corresponding arguments.
collapse_mean(
  data,
  group_column = .data$Group,
  replicate_column = .data$Replicate,
  batch_column = .data$Batch,
  feature_metadata_cols = "Feature",
  sample_metadata_cols = NULL,
  separator = "_"
)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
replicate_column |
Which column contains replicate information? Usually |
batch_column |
Which column contains batch information? If all samples belong to the same batch (i.e., they all have the same batch identifier in the |
feature_metadata_cols |
A character or character vector containing the names of the feature metadata columns. They are usually created when reading the feature table with |
sample_metadata_cols |
A character or character vector containing the names of the sample metadata columns. They are usually created when joining the metadata with |
separator |
Separator used for joining group and replicate, or group, batch and replicate together to create the new sample names. The new sample names will be Group name, separator, Batch name, separator, Replicate name, or Group name, separator, Replicate name, in case all samples belong to the same batch (i.e., they all have the same batch identifier in the |
A tibble with intensities of technical replicates collapsed.
# uses a slightly modified version of toy_metaboscape_metadata
collapse_toy_metaboscape_metadata <- toy_metaboscape_metadata
collapse_toy_metaboscape_metadata$Replicate <- 1

toy_metaboscape %>%
  join_metadata(collapse_toy_metaboscape_metadata) %>%
  impute_lod() %>%
  collapse_mean(group_column = Group, replicate_column = Replicate)
Calculates the median of the intensity of technical replicates (e.g., if the same sample was injected multiple times or if multiple workups have been performed on the same starting material). The function assigns new sample names by joining either group and replicate name or, if a batch column is specified, group, replicate and batch with a specified separator. Due to the nature of the function, sample and feature metadata columns will be dropped unless they are specified with the corresponding arguments.
collapse_median(
  data,
  group_column = .data$Group,
  replicate_column = .data$Replicate,
  batch_column = .data$Batch,
  feature_metadata_cols = "Feature",
  sample_metadata_cols = NULL,
  separator = "_"
)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
replicate_column |
Which column contains replicate information? Usually |
batch_column |
Which column contains batch information? If all samples belong to the same batch (i.e., they all have the same batch identifier in the |
feature_metadata_cols |
A character or character vector containing the names of the feature metadata columns. They are usually created when reading the feature table with |
sample_metadata_cols |
A character or character vector containing the names of the sample metadata columns. They are usually created when joining the metadata with |
separator |
Separator used for joining group and replicate, or group, batch and replicate together to create the new sample names. The new sample names will be Group name, separator, Batch name, separator, Replicate name, or Group name, separator, Replicate name, in case all samples belong to the same batch (i.e., they all have the same batch identifier in the |
A tibble with intensities of technical replicates collapsed.
# uses a slightly modified version of toy_metaboscape_metadata
collapse_toy_metaboscape_metadata <- toy_metaboscape_metadata
collapse_toy_metaboscape_metadata$Replicate <- 1

toy_metaboscape %>%
  join_metadata(collapse_toy_metaboscape_metadata) %>%
  impute_lod() %>%
  collapse_median(group_column = Group, replicate_column = Replicate)
Calculates the minimum of the intensity of technical replicates (e.g., if the same sample was injected multiple times or if multiple workups have been performed on the same starting material). The function assigns new sample names by joining either group and replicate name or, if a batch column is specified, group, replicate and batch with a specified separator. Due to the nature of the function, sample and feature metadata columns will be dropped unless they are specified with the corresponding arguments.
collapse_min(
  data,
  group_column = .data$Group,
  replicate_column = .data$Replicate,
  batch_column = .data$Batch,
  feature_metadata_cols = "Feature",
  sample_metadata_cols = NULL,
  separator = "_"
)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
replicate_column |
Which column contains replicate information? Usually |
batch_column |
Which column contains batch information? If all samples belong to the same batch (i.e., they all have the same batch identifier in the |
feature_metadata_cols |
A character or character vector containing the names of the feature metadata columns. They are usually created when reading the feature table with |
sample_metadata_cols |
A character or character vector containing the names of the sample metadata columns. They are usually created when joining the metadata with |
separator |
Separator used for joining group and replicate, or group, batch and replicate together to create the new sample names. The new sample names will be Group name, separator, Batch name, separator, Replicate name, or Group name, separator, Replicate name, in case all samples belong to the same batch (i.e., they all have the same batch identifier in the |
A tibble with intensities of technical replicates collapsed.
# uses a slightly modified version of toy_metaboscape_metadata
collapse_toy_metaboscape_metadata <- toy_metaboscape_metadata
collapse_toy_metaboscape_metadata$Replicate <- 1

toy_metaboscape %>%
  join_metadata(collapse_toy_metaboscape_metadata) %>%
  impute_lod() %>%
  collapse_min(group_column = Group, replicate_column = Replicate)
This function transforms a matrix holding a wide feature table into a "long" and tidy tibble for use with the functions provided in the metamorphr package.
convert_from_matrix works with objects of class matrix. To convert a data frame or tibble, see convert_from_wide.
convert_from_matrix(data, samples_in_cols = TRUE)
data |
A feature table matrix in wide format. To convert a wide data frame, see |
samples_in_cols |
|
A tidy tibble.
# Using a small fictional data set
dataset <- matrix(1:9, ncol = 3)
colnames(dataset) <- paste0("sample", 1:3)
rownames(dataset) <- paste0("feature", 1:3)

# Example 1: Samples in columns
dataset
convert_from_matrix(dataset)

# Example 2: Samples in rows
dataset_transposed <- t(dataset)
dataset_transposed
convert_from_matrix(dataset_transposed, samples_in_cols = FALSE)
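For readers who want to see what "wide to long" means without the package, the reshaping can be approximated in base R. The sketch below is not how convert_from_matrix() is implemented. The column names Feature, Sample and Intensity chosen for the result are assumptions for illustration.

```r
# Wide matrix with features in rows and samples in columns
wide <- matrix(1:9, ncol = 3,
               dimnames = list(paste0("feature", 1:3), paste0("sample", 1:3)))

# as.table() + as.data.frame() yields one row per feature/sample pair
long <- as.data.frame(as.table(wide), stringsAsFactors = FALSE)
names(long) <- c("Feature", "Sample", "Intensity")
long
```

Each of the nine matrix cells becomes one row, which is the shape the tidy workflow operates on.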
Feature tables are usually stored in a "wide" data format where sample names are stored in columns and features are stored in rows.
This function transforms those feature tables into a "long" and tidy data format for use with the functions provided in the metamorphr package.
convert_from_wide works with tibbles and data frames. To convert a matrix, see convert_from_matrix.
convert_from_wide(data, label_col = 1, metadata_cols = NULL)
data |
A feature table data frame or tibble in wide format. To convert a matrix, see |
label_col |
The index or name of the column that will be used to label Features. For example an identifier (e.g., KEGG, CAS, HMDB) or a m/z-RT pair. |
metadata_cols |
The index/indices or name(s) of column(s) that hold additional feature metadata (e.g., retention times, additional identifiers or m/z values). |
A tidy tibble.
featuretable_path <- system.file("extdata", "toy_metaboscape.csv", package = "metamorphr")
featuretable_wide <- read.csv(featuretable_path)
convert_from_wide(featuretable_wide, metadata_cols = 2:5)
Takes a tidy tibble created by metamorphr::read_featuretable() and returns an empty tibble for sample metadata. The tibble can either be populated directly in R or exported and edited by hand (e.g. in Excel). Metadata are necessary for several downstream functions. More columns may be added if necessary.
create_metadata_skeleton(data)
data |
A tidy tibble created by |
An empty tibble structure with the necessary columns for metadata:
The sample name
To which group does the sample belong? For example a treatment or a background. Note that additional columns with additional grouping information can be freely added if necessary.
If multiple technical replicates exist in the data set,
they must have the same value for Replicate and the same value for Group so that they can be collapsed.
Examples for technical replicates are: the same sample was injected multiple times or workup was performed multiple times with the same starting material.
If no technical replicates exist, set Replicate = 1 for all samples.
The batch in which the samples were prepared or measured. If only one batch exists, set Batch = 1 for all samples.
A sample-specific factor, for example dry weight or protein content.
...
featuretable_path <- system.file("extdata", "toy_metaboscape.csv", package = "metamorphr")
metadata <- read_featuretable(featuretable_path, metadata_cols = 2:5) %>%
  create_metadata_skeleton()
Filters Features based on their occurrence in blank samples.
For example, if min_frac = 3 the maximum intensity in samples must be at least 3 times as high as in blanks
for a Feature not to be filtered out.
filter_blank(
  data,
  blank_samples,
  min_frac = 3,
  blank_as_group = FALSE,
  group_column = NULL
)
data |
A tidy tibble created by |
blank_samples |
Defines the blanks. If |
min_frac |
A numeric defining how many times higher the maximum intensity in samples must be in relation to blanks. |
blank_as_group |
A logical indicating if |
group_column |
Only relevant if |
A filtered tibble.
# Example 1: Define blanks by sample name
toy_metaboscape %>%
  filter_blank(blank_samples = c("Blank1", "Blank2"),
               blank_as_group = FALSE,
               min_frac = 3)

# Example 2: Define blanks by group name
# toy_metaboscape %>%
#   join_metadata(toy_metaboscape_metadata) %>%
#   filter_blank(blank_samples = "blank",
#                blank_as_group = TRUE,
#                min_frac = 3,
#                group_column = Group)
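The min_frac rule can be spelled out on a small matrix. This base R sketch assumes that the comparison is made against the maximum intensity in the blanks. The description does not state which blank statistic is used, so that choice, the toy values and all object names are assumptions.

```r
# Toy feature table: rows are features, columns are samples (made-up values)
intensities <- matrix(
  c(100, 120, 40,   # Sample1
     90, 110, 10,   # Sample2
     10,  50, 20),  # Blank1
  nrow = 3,
  dimnames = list(c("f1", "f2", "f3"), c("Sample1", "Sample2", "Blank1"))
)
min_frac <- 3

max_sample <- apply(intensities[, c("Sample1", "Sample2"), drop = FALSE], 1, max)
max_blank  <- apply(intensities[, "Blank1", drop = FALSE], 1, max)

# Keep a feature if its sample maximum is at least min_frac times the blank maximum
keep <- max_sample >= min_frac * max_blank
keep
```

Here only f1 passes: its sample maximum (100) is at least three times the blank value (10), while f2 (120 versus 50) and f3 (40 versus 20) are not.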
Filters Features based on their coefficient of variation (CV).
The CV is defined as $CV = \frac{s_i}{\bar{x}_i}$ with $s_i$ = standard deviation of sample $i$ and $\bar{x}_i$ = mean of sample $i$.
filter_cv(
  data,
  reference_samples,
  max_cv = 0.2,
  ref_as_group = FALSE,
  group_column = NULL,
  na_as_zero = TRUE
)
data |
A tidy tibble created by |
reference_samples |
The names of the samples or group which will be used to calculate the CV of a feature. Usually Quality Control samples. |
max_cv |
The maximum allowed CV. 0.2 is a reasonable start. |
ref_as_group |
A logical indicating if |
group_column |
Only relevant if |
na_as_zero |
Should |
A filtered tibble.
Coefficient of Variation on Wikipedia
# Example 1: Define reference samples by sample names
toy_metaboscape %>%
  filter_cv(max_cv = 0.2, reference_samples = c("QC1", "QC2", "QC3"))

# Example 2: Define reference samples by group name
toy_metaboscape %>%
  join_metadata(toy_metaboscape_metadata) %>%
  filter_cv(max_cv = 0.2, reference_samples = "QC", ref_as_group = TRUE, group_column = Group)
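To make the CV criterion concrete, the following base R sketch computes the CV per feature across three QC samples and applies max_cv. The values are made up, and replacing NA with 0 is only a guess at what na_as_zero = TRUE does, based on the argument name.

```r
# QC intensities for three features (made-up values), one row per feature
qc <- matrix(
  c(100, 105,  95,
     50, 100, 150,
     10,  10,  NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3"), c("QC1", "QC2", "QC3"))
)
max_cv <- 0.2

# Assumption: missing values are treated as zero (cf. na_as_zero = TRUE)
qc[is.na(qc)] <- 0

# CV = standard deviation / mean, computed per feature across the QC samples
cv <- apply(qc, 1, sd) / rowMeans(qc)
keep <- cv <= max_cv
cv
```

Only f1 (CV = 0.05) stays below the threshold. Counting the missing value of f3 as zero inflates its CV, which is one reason the handling of NA matters for this filter.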
Filters features based on the number or fraction of samples they are found in. This is usually one of the first steps in metabolomics data analysis and is often already performed when the feature table is first created from the raw spectral files.
filter_global_mv(data, min_found = 0.5, fraction = TRUE)
data |
A tidy tibble created by |
min_found |
In how many samples must a Feature be found? If |
fraction |
Either |
A filtered tibble.
# Example 1: A feature must be found in at least 50 % of the samples
toy_metaboscape %>%
  filter_global_mv(min_found = 0.5)

# Example 2: A feature must be found in at least 8 samples
toy_metaboscape %>%
  filter_global_mv(min_found = 8, fraction = FALSE)
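The fraction-based criterion amounts to counting non-missing values per feature. A minimal base R sketch with made-up data is given below. It assumes that "found" means "not NA", and the object names are illustrative only.

```r
# Rows are features, columns are samples; NA marks a missing value
intensities <- matrix(
  c( 1,  2,  3,  4,
    NA, NA, NA,  4,
     1, NA,  3, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3"), paste0("s", 1:4))
)
min_found <- 0.5

# Fraction of samples in which each feature was found
found_frac <- rowMeans(!is.na(intensities))
keep <- found_frac >= min_found
keep
```

With fraction = FALSE the same comparison would use rowSums() and an absolute sample count instead.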
Similar to filter_global_mv, this function filters features based on the number of samples they are found in.
The key difference is that filter_grouped_mv() takes groups into consideration and therefore needs sample metadata.
For example, if fraction = TRUE and min_found = 0.5, a feature must be found in at least 50 % of the samples of at least 1 group.
It is very similar to the Filter features by occurrences in groups option in Bruker MetaboScape.
filter_grouped_mv(
  data,
  min_found = 0.5,
  group_column = .data$Group,
  fraction = TRUE
)
data |
A tidy tibble created by |
min_found |
Defines in how many samples of at least 1 group a Feature must be found not to be filtered out. If |
group_column |
Which column should be used for grouping? Usually |
fraction |
Either |
A filtered tibble.
# A Feature must be found in all samples of at least 1 group.
toy_metaboscape %>%
  join_metadata(toy_metaboscape_metadata) %>%
  filter_grouped_mv(min_found = 1, group_column = Group)
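The "at least 1 group" logic can be sketched in base R as a per-group fraction followed by an any() across groups. The data below are made up, "found" is assumed to mean "not NA", and the object names are hypothetical.

```r
# Long-format toy data with group information (made-up values)
long <- data.frame(
  Feature   = rep(c("f1", "f2"), each = 4),
  Group     = rep(c("control", "control", "treatment", "treatment"), times = 2),
  Intensity = c(5, NA, 7, 8,    # f1: found in all treatment samples
                NA, 2, NA, 4)   # f2: found in half of each group
)
min_found <- 1

# Fraction of samples with a non-missing intensity, per feature and group
found_frac <- tapply(!is.na(long$Intensity),
                     list(long$Feature, long$Group),
                     mean)

# Keep features that reach min_found in at least one group
keep <- apply(found_frac >= min_found, 1, any)
keep
```

A global filter with the same threshold would remove both features, since neither is found in every sample. The grouped variant keeps f1 because it is complete within the treatment group.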
Filters Features based on the presence of MSn fragments. This can help, for example, with the identification of potential homologous molecules.
filter_msn(
  data,
  fragments,
  min_found,
  tolerance = 5,
  tolerance_type = "ppm",
  show_progress = TRUE
)
data |
A data frame containing MSn spectra. |
fragments |
A numeric. Exact mass of the fragment(s) to filter by. |
min_found |
How many of the |
tolerance |
A numeric. The tolerance to apply to the fragments. Either an absolute value in Da (if |
tolerance_type |
Either |
show_progress |
A |
A filtered tibble.
# all of the given fragments (3) must be found
# returns the first row of toy_mgf
toy_mgf %>%
  filter_msn(fragments = c(12.3456, 23.4567, 34.5678), min_found = 3)

# all of the given fragments (3) must be found
# returns an empty tibble because the third fragment
# of row 1 (34.5678)
# is outside of the tolerance (5 ppm):
# Lower bound:
# 34.5688 - 34.5688 * 5 / 1000000 = 34.5686
# Upper bound:
# 34.5688 + 34.5688 * 5 / 1000000 = 34.5690
toy_mgf %>%
  filter_msn(fragments = c(12.3456, 23.4567, 34.5688), min_found = 3)

# only 2 of the 3 fragments must be found
# returns the first row of toy_mgf
toy_mgf %>%
  filter_msn(fragments = c(12.3456, 23.4567, 34.5688), min_found = 2)
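The interplay of tolerance and min_found in the examples above can be reproduced for a single spectrum in base R. This is a sketch only: the peak list is made up (it is not read from toy_mgf), and the window is assumed to be centred on the requested fragment mass, as in the bounds computed in the example comments.

```r
# Fragment m/z values of one MSn spectrum (made-up values)
spectrum_mz <- c(12.3456, 23.4567, 34.5678, 45.6789)
fragments   <- c(12.3456, 23.4567, 34.5688)
tolerance   <- 5  # ppm

# A fragment counts as found if any peak lies within +/- tolerance ppm of it
is_found <- vapply(fragments, function(f) {
  any(abs(spectrum_mz - f) <= f * tolerance / 1e6)
}, logical(1))

n_found <- sum(is_found)
n_found
```

Two of the three fragments are matched, so this spectrum would pass with min_found = 2 but not with min_found = 3.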
Facilitates filtering by given mass-to-charge ratios (m/z) with a defined tolerance. Can also be used to filter based on exact mass.
filter_mz(data, m_z_col, masses, tolerance = 5, tolerance_type = "ppm")
data |
A tidy tibble created by |
m_z_col |
Which column holds the precursor m/z (or exact mass)? Uses |
masses |
The mass(es) to filter by. |
tolerance |
A numeric. The tolerance to apply to the masses. Either an absolute value in Da (if |
tolerance_type |
Either |
A filtered tibble.
# Use a tolerance of plus or minus 5 ppm
toy_metaboscape %>%
  filter_mz(m_z_col = `m/z`, 162.1132, tolerance = 5, tolerance_type = "ppm")

# Use a tolerance of plus or minus 0.005 Da
toy_metaboscape %>%
  filter_mz(m_z_col = `m/z`, 162.1132, tolerance = 0.005, tolerance_type = "absolute")
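The difference between the two tolerance types is easiest to see by computing the resulting m/z windows. The helper mz_window below is hypothetical and not part of the package. It only illustrates how a ppm tolerance scales with the target mass while an absolute tolerance does not.

```r
# Hypothetical helper: lower and upper m/z bounds for a target mass
mz_window <- function(mass, tolerance = 5, tolerance_type = c("ppm", "absolute")) {
  tolerance_type <- match.arg(tolerance_type)
  delta <- if (tolerance_type == "ppm") mass * tolerance / 1e6 else tolerance
  c(lower = mass - delta, upper = mass + delta)
}

ppm_window <- mz_window(162.1132, tolerance = 5, tolerance_type = "ppm")
abs_window <- mz_window(162.1132, tolerance = 0.005, tolerance_type = "absolute")

# Check which measured m/z values (made up) fall inside the ppm window
measured <- c(162.1130, 162.1150)
in_window <- unname(measured >= ppm_window["lower"] & measured <= ppm_window["upper"])
in_window
```

At m/z 162.1132, 5 ppm corresponds to roughly 0.0008 Da, so the ppm window is considerably narrower than the 0.005 Da absolute window used in the second example.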
The occurrence of characteristic neutral losses can help with the putative annotation of molecules. See the Reference section for an example.
filter_neutral_loss(
  data,
  losses,
  min_found,
  tolerance = 10,
  tolerance_type = "ppm",
  show_progress = TRUE
)
data |
A data frame containing MSn spectra. |
losses |
A numeric. Exact mass of the neutral loss(es) to filter by. |
min_found |
How many of the |
tolerance |
A numeric. The tolerance to apply to the fragments. Either an absolute value in Da (if |
tolerance_type |
Either |
show_progress |
A |
A filtered tibble.
A. Brink, F. Fontaine, M. Marschmann, B. Steinhuber, E. N. Cece, I. Zamora, A. Pähler, Rapid Commun. Mass Spectrom. 2014, 28, 2695–2703, DOI 10.1002/rcm.7062.
# neutral losses must be calculated first
toy_mgf_nl <- toy_mgf %>%
  calc_neutral_loss(m_z_col = PEPMASS)

# all of the given losses (3) must be found
# returns the first row of toy_mgf
toy_mgf_nl %>%
  filter_neutral_loss(losses = c(11.1111, 22.2222, 33.3333), min_found = 3)

# all of the given losses (3) must be found
# returns an empty tibble because the third loss
# of row 1 (33.3333)
# is outside of the tolerance (10 ppm):
# Lower bound:
# 33.4333 - 33.4333 * 10 / 1000000 = 33.4330
# Upper bound:
# 33.4333 + 33.4333 * 10 / 1000000 = 33.4336
toy_mgf_nl %>%
  filter_neutral_loss(losses = c(11.1111, 22.2222, 33.4333), min_found = 3)

# only 2 of the 3 losses must be found
# returns the first row of toy_mgf
toy_mgf_nl %>%
  filter_neutral_loss(losses = c(11.1111, 22.2222, 33.4333), min_found = 2)
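The interplay of `losses`, `min_found` and the tolerance can be sketched in base R as follows. `count_found()` is a hypothetical helper, not a metamorphr function, and it assumes that the ppm tolerance is taken relative to each queried value; the package may implement the matching differently.

```r
# Hypothetical helper (not part of metamorphr): count how many target values
# have at least one observed value within a ppm tolerance.
count_found <- function(observed, targets, tolerance = 10) {
  found <- vapply(targets, function(target) {
    any(abs(observed - target) <= target * tolerance / 1e6)
  }, logical(1))
  sum(found)
}

observed_losses <- c(11.1111, 22.2222, 33.3333)  # made-up neutral loss spectrum

# A spectrum is kept if the count reaches min_found
count_found(observed_losses, c(11.1111, 22.2222, 33.3333)) >= 3
count_found(observed_losses, c(11.1111, 22.2222, 33.4333)) >= 3
count_found(observed_losses, c(11.1111, 22.2222, 33.4333)) >= 2
```

The second call fails the `min_found = 3` criterion because 33.4333 differs from 33.3333 by 0.1, far more than 10 ppm.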
Calculates the monoisotopic mass from a given formula. If only the element symbols are provided, the calculated mass corresponds to that of a molecule made up of the most abundant isotopes. Other isotopes can also be provided (e.g., 13C instead of the naturally most abundant 12C). See the Examples for details.
formula_to_mass(formula)
formula |
A formula as a string. |
The monoisotopic mass of the formula.
# The monoisotopic mass is calculated with the most abundant isotopes
# if only the element symbols are provided:
formula_to_mass("CH4")
formula_to_mass("NH3")
formula_to_mass("C10H17N3O6S")

# Other isotopes can be provided as follows:
formula_to_mass("[13C]H4")
formula_to_mass("[15N]H3")

# Every isotope, including the most abundant ones, can be named explicitly.
# Compare:
formula_to_mass("[14N][1H]3")
formula_to_mass("NH3")

# The function also supports brackets and nested brackets:
formula_to_mass("(CH3)2")
formula_to_mass("(((CH3)2N)3C)2")
formula_to_mass("((([13C]H3)2N)3C)2")
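The underlying arithmetic is a sum of isotope masses weighted by their counts. The following base R sketch shows this for flat formulas only. `simple_formula_mass()` and the small `iso_mass` lookup are assumptions made for illustration; they do not support brackets or explicit isotopes and are not how formula_to_mass is implemented.

```r
# Monoisotopic masses of the most abundant isotopes (small hard-coded subset)
iso_mass <- c(H = 1.00782503223, C = 12, N = 14.00307400443, O = 15.99491461957)

# Hypothetical helper: handles only flat formulas such as "CH4" or "C2H6O"
# (no brackets, no explicit isotopes).
simple_formula_mass <- function(formula) {
  tokens <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  symbols <- sub("[0-9]+$", "", tokens)       # element symbol of each token
  counts <- sub("^[A-Za-z]+", "", tokens)     # trailing digits of each token
  counts[counts == ""] <- "1"                 # no digits means a count of 1
  counts <- as.numeric(counts)
  stopifnot(all(symbols %in% names(iso_mass)))
  sum(iso_mass[symbols] * counts)
}

simple_formula_mass("CH4")
simple_formula_mass("NH3")
```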
One of several PCA-based imputation methods. Basically a wrapper around pcaMethods::pca(method = "bpca").
For a detailed discussion, see the vignette("pcaMethods") and vignette("missingValues", "pcaMethods") as well as the References section.
Important Note
impute_bpca() depends on the pcaMethods package from Bioconductor. If metamorphr was installed via install.packages(), dependencies from Bioconductor were not
automatically installed. When impute_bpca() is called without the pcaMethods package installed, you should be asked if you want to install pak and pcaMethods.
If you want to use impute_bpca() you have to install those. In case you run into trouble with the automatic installation, please install pcaMethods manually. See
pcaMethods – a Bioconductor package providing PCA methods for incomplete data for instructions on manual installation.
impute_bpca(data, n_pcs = 2, center = TRUE, scale = "none", direction = 2)
data |
A tidy tibble created by |
n_pcs |
The number of PCs to calculate. |
center |
Should |
scale |
Should |
direction |
Either |
A tibble with imputed missing values.
H. R. Wolfram Stacklies, 2017, DOI 10.18129/B9.BIOC.PCAMETHODS.
W. Stacklies, H. Redestig, M. Scholz, D. Walther, J. Selbig, Bioinformatics 2007, 23, 1164–1167, DOI 10.1093/bioinformatics/btm069.
toy_metaboscape %>% impute_bpca()
Replace missing intensity values (NA) with the lowest observed intensity.
impute_global_lowest(data)
data |
A tidy tibble created by |
A tibble with imputed missing values.
toy_metaboscape %>% impute_global_lowest()
Basically a wrapper function around impute::impute.knn. Imputes missing values using the k-nearest neighbors algorithm.
Note that the function ln-transforms the data prior to imputation and transforms it back to the original scale afterwards. Please do not do it manually prior to calling impute_knn()!
See References for more information.
Important Note
impute_knn() depends on the impute package from Bioconductor. If metamorphr was installed via install.packages(), dependencies from Bioconductor were not
automatically installed. When impute_knn() is called without the impute package installed, you should be asked if you want to install pak and impute.
If you want to use impute_knn() you have to install those. In case you run into trouble with the automatic installation, please install impute manually. See
impute: Imputation for microarray data for instructions on manual installation.
impute_knn(data, quietly = TRUE, ...)
data |
A tidy tibble created by |
quietly |
|
... |
Additional parameters passed to |
A tibble with imputed missing values.
Robert Tibshirani, Trevor Hastie, 2017, DOI 10.18129/B9.BIOC.IMPUTE.
J. Khan, J. S. Wei, M. Ringnér, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, P. S. Meltzer, Nat Med 2001, 7, 673–679, DOI 10.1038/89044.
toy_metaboscape %>% impute_knn()
Basically a wrapper around pcaMethods::llsImpute.
For a detailed discussion, see the vignette("pcaMethods") and vignette("missingValues", "pcaMethods") as well as the References section.
Important Note
impute_lls() depends on the pcaMethods package from Bioconductor. If metamorphr was installed via install.packages(), dependencies from Bioconductor were not
automatically installed. When impute_lls() is called without the pcaMethods package installed, you should be asked if you want to install pak and pcaMethods.
If you want to use impute_lls() you have to install those. In case you run into trouble with the automatic installation, please install pcaMethods manually. See
pcaMethods – a Bioconductor package providing PCA methods for incomplete data for instructions on manual installation.
impute_lls(
  data,
  correlation = "pearson",
  complete_genes = FALSE,
  center = FALSE,
  cluster_size = 10
)
data |
A tidy tibble created by |
correlation |
The method used to calculate correlations between features. One of |
complete_genes |
If |
center |
Should |
cluster_size |
The number of similar features used for regression. |
A tibble with imputed missing values.
H. R. Wolfram Stacklies, 2017, DOI 10.18129/B9.BIOC.PCAMETHODS.
W. Stacklies, H. Redestig, M. Scholz, D. Walther, J. Selbig, Bioinformatics 2007, 23, 1164–1167, DOI 10.1093/bioinformatics/btm069.
# The cluster size must be reduced because
# the data set is too small for the default (10)
toy_metaboscape %>%
  impute_lls(complete_genes = TRUE, cluster_size = 5)
Replace missing intensity values (NA) by what is assumed to be the detector limit of detection (LoD).
It is estimated by dividing the Feature minimum by the provided denominator, usually 5. See the References section for more information.
impute_lod(data, div_by = 5)
data |
A tidy tibble created by |
div_by |
A numeric value that specifies by which number the Feature minimum will be divided |
A tibble with imputed missing values.
toy_metaboscape %>% impute_lod()
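For a single Feature, the LoD estimate described above amounts to one division. A minimal base R sketch with made-up intensities follows; `lod_impute()` is a hypothetical helper and not the package's implementation.

```r
# Hypothetical helper: replace NA with (Feature minimum / div_by)
lod_impute <- function(intensity, div_by = 5) {
  lod <- min(intensity, na.rm = TRUE) / div_by
  replace(intensity, is.na(intensity), lod)
}

# Feature minimum is 10, so missing values become 10 / 5 = 2
lod_impute(c(NA, 10, NA, 30, 20))
```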
Replace missing intensity values (NA) with the Feature mean of non-NA values. For example, if a Feature has the measured intensities NA, 1, NA, 3, 2 in samples 1-5,
the intensities after impute_mean() would be 2, 1, 2, 3, 2.
impute_mean(data)
data |
A tidy tibble created by |
A tibble with imputed missing values.
toy_metaboscape %>% impute_mean()
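The worked example from the description can be reproduced in base R on a small long-format table. The data frame below is made up for illustration (its column names mirror the Feature, Sample and Intensity columns used throughout this manual); this is a sketch of the idea, not the package's code.

```r
# Made-up long-format feature table with two Features and five samples
feature_table <- data.frame(
  Feature   = rep(c("A", "B"), each = 5),
  Sample    = rep(paste0("S", 1:5), times = 2),
  Intensity = c(NA, 1, NA, 3, 2, 4, NA, 6, 8, NA)
)

# Mean of the non-NA intensities, computed separately for each Feature
feature_mean <- ave(feature_table$Intensity, feature_table$Feature,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing values
missing <- is.na(feature_table$Intensity)
feature_table$Intensity[missing] <- feature_mean[missing]
feature_table$Intensity
# Feature A: 2 1 2 3 2, Feature B: 4 6 6 8 6
```

Swapping `mean` for `median` or `min` in the `ave()` call gives the corresponding sketches for the median and minimum imputation described in the following sections.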
Replace missing intensity values (NA) with the Feature median of non-NA values. For example, if a Feature has the measured intensities NA, 1, NA, 3, 2 in samples 1-5,
the intensities after impute_median() would be 2, 1, 2, 3, 2.
impute_median(data)
data |
A tidy tibble created by |
A tibble with imputed missing values.
toy_metaboscape %>% impute_median()
Replace missing intensity values (NA) with the Feature minimum of non-NA values.
impute_min(data)
data |
A tidy tibble created by |
A tibble with imputed missing values.
toy_metaboscape %>% impute_min()
One of several PCA-based imputation methods. Basically a wrapper around pcaMethods::pca(method = "nipals").
For a detailed discussion, see the vignette("pcaMethods") and vignette("missingValues", "pcaMethods") as well as the References section.
Important Note
impute_nipals() depends on the pcaMethods package from Bioconductor. If metamorphr was installed via install.packages(), dependencies from Bioconductor were not
automatically installed. When impute_nipals() is called without the pcaMethods package installed, you should be asked if you want to install pak and pcaMethods.
If you want to use impute_nipals() you have to install those. In case you run into trouble with the automatic installation, please install pcaMethods manually. See
pcaMethods – a Bioconductor package providing PCA methods for incomplete data for instructions on manual installation.
impute_nipals(data, n_pcs = 2, center = TRUE, scale = "none", direction = 2)
data |
A tidy tibble created by |
n_pcs |
The number of PCs to calculate. |
center |
Should |
scale |
Should |
direction |
Either |
A tibble with imputed missing values.
H. R. Wolfram Stacklies, 2017, DOI 10.18129/B9.BIOC.PCAMETHODS.
W. Stacklies, H. Redestig, M. Scholz, D. Walther, J. Selbig, Bioinformatics 2007, 23, 1164–1167, DOI 10.1093/bioinformatics/btm069.
toy_metaboscape %>% impute_nipals()
One of several PCA-based imputation methods. Basically a wrapper around pcaMethods::pca(method = "ppca").
For a detailed discussion, see the vignette("pcaMethods") and vignette("missingValues", "pcaMethods") as well as the References section.
In the underlying function (pcaMethods::pca(method = "ppca")), the order of columns has an influence on the outcome. Therefore, calling pcaMethods::pca(method = "ppca")
on a matrix and calling metamorphr::impute_ppca() on a tidy tibble might give different results, even though they contain the same data. That is because under the hood,
the tibble is transformed to a matrix prior to calling pcaMethods::pca(method = "ppca") and you have limited influence on the column order of the
resulting matrix.
Important Note
impute_ppca() depends on the pcaMethods package from Bioconductor. If metamorphr was installed via install.packages(), dependencies from Bioconductor were not
automatically installed. When impute_ppca() is called without the pcaMethods package installed, you should be asked if you want to install pak and pcaMethods.
If you want to use impute_ppca() you have to install those. In case you run into trouble with the automatic installation, please install pcaMethods manually. See
pcaMethods – a Bioconductor package providing PCA methods for incomplete data for instructions on manual installation.
impute_ppca(
  data,
  n_pcs = 2,
  center = TRUE,
  scale = "none",
  direction = 2,
  random_seed = 1L
)
data |
A tidy tibble created by |
n_pcs |
The number of PCs to calculate. |
center |
Should |
scale |
Should |
direction |
Either |
random_seed |
An integer used as seed for the random number generator. |
A tibble with imputed missing values.
H. R. Wolfram Stacklies, 2017, DOI 10.18129/B9.BIOC.PCAMETHODS.
W. Stacklies, H. Redestig, M. Scholz, D. Walther, J. Selbig, Bioinformatics 2007, 23, 1164–1167, DOI 10.1093/bioinformatics/btm069.
toy_metaboscape %>% impute_ppca()
Basically a wrapper function around missForest::missForest. Imputes missing values using the random forest algorithm.
impute_rf(data, random_seed = 1L, ...)
data |
A tidy tibble created by |
random_seed |
A seed for the random number generator. Can be an integer or |
... |
Additional parameters passed to |
A tibble with imputed missing values.
missForest on CRAN
D. J. Stekhoven, P. Bühlmann, Bioinformatics 2012, 28, 112–118, DOI 10.1093/bioinformatics/btr597.
toy_metaboscape %>% impute_rf()
Basically a wrapper around pcaMethods::pca(method = "svdImpute").
For a detailed discussion, see the vignette("pcaMethods") and vignette("missingValues", "pcaMethods") as well as the References section.
Important Note
impute_svd() depends on the pcaMethods package from Bioconductor. If metamorphr was installed via install.packages(), dependencies from Bioconductor were not
automatically installed. When impute_svd() is called without the pcaMethods package installed, you should be asked if you want to install pak and pcaMethods.
If you want to use impute_svd() you have to install those. In case you run into trouble with the automatic installation, please install pcaMethods manually. See
pcaMethods – a Bioconductor package providing PCA methods for incomplete data for instructions on manual installation.
impute_svd(data, n_pcs = 2, center = TRUE, scale = "none", direction = 2)
data |
A tidy tibble created by |
n_pcs |
The number of PCs to calculate. |
center |
Should |
scale |
Should |
direction |
Either |
A tibble with imputed missing values.
H. R. Wolfram Stacklies, 2017, DOI 10.18129/B9.BIOC.PCAMETHODS.
W. Stacklies, H. Redestig, M. Scholz, D. Walther, J. Selbig, Bioinformatics 2007, 23, 1164–1167, DOI 10.1093/bioinformatics/btm069.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, Bioinformatics 2001, 17, 520–525, DOI 10.1093/bioinformatics/17.6.520.
toy_metaboscape %>% impute_svd()
Replace missing intensity values (NA) with a user-provided value (e.g., 1).
impute_user_value(data, value)
data |
A tidy tibble created by |
value |
Numeric that replaces missing values |
A tibble with imputed missing values.
toy_metaboscape %>% impute_user_value(value = 1)
Joins a feature table and associated sample metadata. Basically a wrapper around left_join where by = "Sample".
join_metadata(data, metadata)
data |
A feature table created with |
metadata |
Sample metadata created with |
A tibble with added sample metadata.
toy_metaboscape %>% join_metadata(toy_metaboscape_metadata)
Calculate neutral loss spectra for all ions with available MSn spectra in data. To calculate neutral losses, MSn spectra are required.
See read_mgf. This step is required for subsequent filtering based on
neutral losses (filter_neutral_loss). Resulting neutral loss spectra are stored in tibbles in a new list column named Neutral_Loss.
msn_calc_nl(data, m_z_col)
data |
A tidy tibble created by |
m_z_col |
Which column holds the precursor m/z? Uses |
A tibble with added neutral loss spectra. A new list column is created named Neutral_Loss.
toy_mgf %>% msn_calc_nl(m_z_col = PEPMASS)
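Conceptually, a neutral loss is the difference between the precursor m/z and a fragment m/z. The base R sketch below uses made-up values and assumes this plain subtraction; how the package handles details such as charge state or fragments heavier than the precursor is not covered here.

```r
precursor_mz <- 45.6789                       # made-up precursor m/z
fragment_mz <- c(12.3456, 23.4567, 34.5678)   # made-up MSn fragment m/z values

# One neutral loss per fragment
neutral_loss <- precursor_mz - fragment_mz
neutral_loss
```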
Scale the intensity of each peak in an MSn spectrum to that of the highest peak. MSn spectra are required to use this function.
See read_mgf.
Important Note
Please note that existing MSn spectra in data will be overwritten.
msn_scale(data, scale_to = 100)
data |
A tidy tibble created by |
scale_to |
A |
A tibble with scaled MSn spectra.
toy_mgf %>% msn_scale()
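For a single spectrum, the scaling is a division by the highest peak followed by a multiplication with `scale_to`. A minimal base R sketch with made-up intensities (`scale_spectrum()` is a hypothetical helper, not part of metamorphr):

```r
# Hypothetical helper: scale peak intensities so that the highest peak equals scale_to
scale_spectrum <- function(intensity, scale_to = 100) {
  intensity / max(intensity) * scale_to
}

scale_spectrum(c(250, 1000, 500))
# 25 100 50
```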
The steps the algorithm takes are the following:
1. log2 transform the intensities
2. Choose 2 samples to generate an MA-plot from
3. Fit a LOESS curve
4. Subtract half of the difference between the predicted value and the true value from the intensity of sample 1 and add the same amount to the intensity of sample 2
5. Repeat for all unique combinations of samples
6. Repeat all steps until the model converges or n_iter is reached.
Convergence is assumed if the confidence intervals of all LOESS smooths include the 0 line. If fixed_iter = TRUE, the algorithm will perform exactly n_iter iterations.
If fixed_iter = FALSE, the algorithm will perform a maximum of n_iter iterations.
See the reference section for details.
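A single pairwise adjustment from the steps above can be sketched in base R with stats::loess. The data are simulated, `ma_adjust()` is a hypothetical helper, and the sketch covers neither the loop over all sample pairs nor the convergence check, so it should be read as an illustration of the idea rather than the package's implementation.

```r
set.seed(1)
# Simulated log2 intensities for two samples; sample 2 carries an
# intensity-dependent bias (made-up data for illustration).
truth <- runif(200, min = 8, max = 16)
s1 <- truth + rnorm(200, sd = 0.1)
s2 <- truth + 0.05 * (truth - 12)^2 + rnorm(200, sd = 0.1)

# Hypothetical helper: one MA-based LOESS adjustment for a pair of samples
ma_adjust <- function(x1, x2, span = 0.7) {
  m <- x1 - x2                 # M: difference of log2 intensities
  a <- (x1 + x2) / 2           # A: mean log2 intensity
  fit <- stats::loess(m ~ a, span = span)
  bias <- stats::predict(fit)  # fitted M at each A
  # Split the correction evenly between both samples
  list(x1 = x1 - bias / 2, x2 = x2 + bias / 2)
}

adjusted <- ma_adjust(s1, s2)
mean(abs(s1 - s2))                      # before adjustment
mean(abs(adjusted$x1 - adjusted$x2))    # after adjustment
```

Because the correction is split evenly, the mean log2 intensity (A) of each feature is unchanged; only the difference between the two samples (M) shrinks.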
normalize_cyclic_loess(
  data,
  n_iter = 3,
  fixed_iter = TRUE,
  loess_span = 0.7,
  level = 0.95,
  verbose = FALSE,
  ...
)
data |
A tidy tibble created by |
n_iter |
The number of iterations to perform. If |
fixed_iter |
Should a fixed number of iterations be performed? |
loess_span |
The span of the LOESS fit. A larger span produces a smoother line. |
level |
The confidence level for the convergence criterion. Note that a larger confidence level produces larger confidence intervals and therefore the algorithm stops earlier. |
verbose |
|
... |
Arguments passed onto |
A tibble with intensities normalized across samples.
B. M. Bolstad, R. A. Irizarry, M. Åstrand, T. P. Speed, Bioinformatics 2003, 19, 185–193, DOI 10.1093/bioinformatics/19.2.185.
Karla Ballman, Diane Grill, Ann Oberg, Terry Therneau, “Faster cyclic loess: normalizing DNA arrays via linear models” can be found under https://www.mayo.edu/research/documents/biostat-68pdf/doc-10027897, 2004.
K. V. Ballman, D. E. Grill, A. L. Oberg, T. M. Therneau, Bioinformatics 2004, 20, 2778–2786, DOI 10.1093/bioinformatics/bth327.
toy_metaboscape %>%
  impute_lod() %>%
  normalize_cyclic_loess()
Normalization is done by dividing the intensity by a sample-specific factor (e.g., weight, protein or DNA content).
This function requires a sample-specific factor, usually supplied via the Factor column from the sample metadata.
See the Examples section for details.
normalize_factor(data, factor_column = .data$Factor)
data |
A tidy tibble created by |
factor_column |
Which column contains the sample-specific factor? Usually |
A tibble with intensities normalized across samples.
toy_metaboscape %>%
  join_metadata(toy_metaboscape_metadata) %>%
  normalize_factor()
Normalize across samples by dividing feature intensities by the sample median, making the median 1 in all samples. See References for more information.
normalize_median(data)
data |
A tidy tibble created by |
A tibble with intensities normalized across samples.
T. Ramirez, A. Strigun, A. Verlohner, H.-A. Huener, E. Peter, M. Herold, N. Bordag, W. Mellert, T. Walk, M. Spitzer, X. Jiang, S. Sperber, T. Hofmann, T. Hartung, H. Kamp, B. Van Ravenzwaay, Arch Toxicol 2018, 92, 893–906, DOI 10.1007/s00204-017-2079-6.
toy_metaboscape %>% normalize_median()
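On a plain matrix with features in rows and samples in columns, median normalization is a column-wise division. The base R sketch below uses made-up intensities and only illustrates the arithmetic; the package itself operates on tidy tibbles.

```r
# Made-up intensity matrix: features in rows, samples in columns
intensities <- matrix(c(10, 20, 30, 100, 400, 300), nrow = 3,
                      dimnames = list(paste0("F", 1:3), c("S1", "S2")))

sample_median <- apply(intensities, 2, median)            # S1: 20, S2: 300
normalized <- sweep(intensities, 2, sample_median, "/")   # divide each column

apply(normalized, 2, median)
# S1 and S2 both have a median of 1 after normalization
```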
This method was originally developed for H-NMR spectra of complex biofluids but has been adapted for other 'omics data. It aims to eliminate dilution effects by calculating the most probable dilution factor for each sample, relative to one or more reference samples. See references for more details.
normalize_pqn(
  data,
  fn = "median",
  normalize_sum = TRUE,
  reference_samples = NULL,
  ref_as_group = FALSE,
  group_column = NULL
)
data |
A tidy tibble created by |
fn |
Which function should be used to calculate the reference spectrum from the reference samples? Can be either "mean" or "median". |
normalize_sum |
A logical indicating whether a sum normalization (aka total area normalization) should be performed prior to PQN. It is recommended to do so and other packages (e.g., KODAMA) also perform a sum normalization prior to PQN. |
reference_samples |
Either |
ref_as_group |
A logical indicating if |
group_column |
Only relevant if |
A tibble with intensities normalized across samples.
F. Dieterle, A. Ross, G. Schlotterbeck, H. Senn, Anal. Chem. 2006, 78, 4281–4290, DOI 10.1021/ac051632c.
# specify the reference samples with their sample names
toy_metaboscape %>%
  impute_lod() %>%
  normalize_pqn(reference_samples = c("QC1", "QC2", "QC3"))

# specify the reference samples with their group names
toy_metaboscape %>%
  join_metadata(toy_metaboscape_metadata) %>%
  impute_lod() %>%
  normalize_pqn(reference_samples = c("QC"), ref_as_group = TRUE, group_column = Group)
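The general PQN procedure (sum normalization, reference spectrum, feature-wise quotients, division by the median quotient) can be sketched in base R on a matrix. `pqn()` is a hypothetical helper and the data are made up; it assumes complete data without missing values and is a sketch of the published method, not the package's code.

```r
# Hypothetical helper: PQN on a matrix with features in rows, samples in columns
pqn <- function(x, reference_samples, fn = median) {
  x <- sweep(x, 2, colSums(x), "/")                          # sum normalization
  reference <- apply(x[, reference_samples, drop = FALSE], 1, fn)
  quotients <- x / reference                                 # feature-wise quotients
  dilution <- apply(quotients, 2, median)                    # most probable quotient
  sweep(x, 2, dilution, "/")
}

# Made-up data: S1 contains one strongly elevated feature (F4)
x <- cbind(QC1 = c(10, 20, 30, 40), QC2 = c(12, 18, 33, 37),
           S1 = c(20, 40, 60, 400), S2 = c(5, 10, 15, 20))
rownames(x) <- paste0("F", 1:4)

normalized <- pqn(x, reference_samples = c("QC1", "QC2"))
normalized
```

Using the median quotient rather than the total signal makes the estimated dilution factor less sensitive to a single strongly changed feature such as F4 in S1.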
This is the standard approach for Quantile Normalization. Other sub-flavors are also available: normalize_quantile_group, normalize_quantile_batch and normalize_quantile_smooth.
See References for more information.
normalize_quantile_all(data)
data |
A tidy tibble created by |
A tibble with intensities normalized across samples.
Y. Zhao, L. Wong, W. W. B. Goh, Sci Rep 2020, 10, 15534, DOI 10.1038/s41598-020-72664-6.
toy_metaboscape %>% normalize_quantile_all()
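Standard quantile normalization replaces each value by the mean of the values sharing its rank across samples. The base R sketch below shows the textbook procedure on a made-up matrix. `quantile_normalize()` is a hypothetical helper that assumes complete data and breaks ties by order of appearance, which may differ from the package's handling.

```r
# Hypothetical helper: quantile normalization, features in rows, samples in columns
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(x, 2, sort)                        # sort each sample
  target <- rowMeans(sorted)                         # mean of each quantile
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

x <- cbind(S1 = c(5, 2, 3, 4), S2 = c(4, 1, 4, 2), S3 = c(3, 4, 6, 8))
quantile_normalize(x)
```

After normalization, every sample contains exactly the same set of values, only in a different order.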
This function performs a Quantile Normalization on each sub-group and batch in the data set. It therefore requires grouping information. See
Examples for more information. This approach might perform better than the standard approach, normalize_quantile_all,
if sub-groups are very different (e.g., when comparing cancer vs. normal tissue).
Other sub-flavors are also available: normalize_quantile_group and normalize_quantile_smooth.
See References for more information. Note that it is equivalent to the 'Discrete' normalization in Zhao et al. but has been renamed for internal consistency.
normalize_quantile_batch(
  data,
  group_column = .data$Group,
  batch_column = .data$Batch
)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
batch_column |
Which column contains the batch information? Usually |
A tibble with intensities normalized across samples.
Y. Zhao, L. Wong, W. W. B. Goh, Sci Rep 2020, 10, 15534, DOI 10.1038/s41598-020-72664-6.
toy_metaboscape %>%
  # Metadata, including grouping and batch information,
  # must be added before using normalize_quantile_batch()
  join_metadata(toy_metaboscape_metadata) %>%
  normalize_quantile_batch(group_column = Group, batch_column = Batch)
This function performs a Quantile Normalization on each sub-group in the data set. It therefore requires grouping information. See
Examples for more information. This approach might perform better than the standard approach, normalize_quantile_all,
if sub-groups are very different (e.g., when comparing cancer vs. normal tissue).
Other sub-flavors are also available: normalize_quantile_batch and normalize_quantile_smooth.
See References for more information. Note that it is equivalent to the 'Class-specific' normalization in Zhao et al. but has been renamed for internal consistency.
normalize_quantile_group(data, group_column = .data$Group)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
A tibble with intensities normalized across samples.
Y. Zhao, L. Wong, W. W. B. Goh, Sci Rep 2020, 10, 15534, DOI 10.1038/s41598-020-72664-6.
toy_metaboscape %>%
  # Metadata, including grouping information, must be added before using normalize_quantile_group()
  join_metadata(toy_metaboscape_metadata) %>%
  normalize_quantile_group(group_column = Group)
This function performs a smooth Quantile Normalization on each sub-group in the data set (qsmooth). It therefore requires grouping information. See
Examples for more information. This approach might perform better than the standard approach, normalize_quantile_all,
if sub-groups are very different (e.g., when comparing cancer vs. normal tissue). The result lies somewhere between normalize_quantile_group
and normalize_quantile_all. Basically a re-implementation of Hicks et al. (2018).
normalize_quantile_smooth(
  data,
  group_column = .data$Group,
  rolling_window = 0.05
)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
rolling_window |
|
A tibble with intensities normalized across samples.
S. C. Hicks, K. Okrah, J. N. Paulson, J. Quackenbush, R. A. Irizarry, H. C. Bravo, Biostatistics 2018, 19, 185–198, DOI 10.1093/biostatistics/kxx028.
Y. Zhao, L. Wong, W. W. B. Goh, Sci Rep 2020, 10, 15534, DOI 10.1038/s41598-020-72664-6.
toy_metaboscape %>%
  # Metadata, including grouping information, must be added before using normalize_quantile_smooth()
  join_metadata(toy_metaboscape_metadata) %>%
  normalize_quantile_smooth(group_column = Group)
Performs a normalization based on a reference feature, for example an internal standard. Divides the Intensities of all features by the Intensity of the reference feature in that sample and multiplies them with a constant value, making the Intensity of the reference feature the same in each sample.
normalize_ref(
  data,
  reference_feature,
  identifier_column,
  reference_feature_intensity = 1
)
data |
A tidy tibble created by |
reference_feature |
An identifier for the reference feature. Must be unique. It is recommended to use the UID. |
identifier_column |
The column in which to look for the reference feature. It is recommended to use |
reference_feature_intensity |
Either a constant value with which the intensity of each feature is multiplied or a function (e.g., mean, median, min, max).
If a function is provided, it will use that function on the Intensities of the reference feature in all samples before normalization and multiply the intensity of each feature with that value after dividing by the Intensity of the reference feature.
For example, if |
A tibble with intensities normalized across samples.
# Divide by the reference feature and make its Intensity 1000 in each sample
toy_metaboscape %>%
  impute_lod() %>%
  normalize_ref(reference_feature = 2, identifier_column = UID, reference_feature_intensity = 1000)

# Divide by the reference feature and make its Intensity the mean of intensities
# of the reference features before normalization
toy_metaboscape %>%
  impute_lod() %>%
  normalize_ref(reference_feature = 2, identifier_column = UID, reference_feature_intensity = mean)
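The two variants of `reference_feature_intensity` (a constant or a function) can be illustrated in base R on a made-up matrix with features in rows and samples in columns. This is a sketch of the arithmetic described above, not the package's implementation; F2 plays the role of the reference feature.

```r
# Made-up intensity matrix; F2 is the reference feature (e.g., an internal standard)
x <- matrix(c(100, 50, 10, 300, 100, 60), nrow = 3,
            dimnames = list(c("F1", "F2", "F3"), c("S1", "S2")))
reference <- x["F2", ]   # intensity of the reference feature in each sample

# Constant: the reference feature becomes 1000 in every sample
fixed <- sweep(x, 2, reference, "/") * 1000

# Function: the reference feature becomes the mean of its original intensities (75)
scaled_to_mean <- sweep(x, 2, reference, "/") * mean(reference)

fixed["F2", ]
scaled_to_mean["F2", ]
```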
Normalize across samples by dividing feature intensities by the sum of all intensities in a sample, making the sum 1 in all samples.
Important Note
Intensities of individual features will be very small after this normalization approach. It is therefore advised to multiply all intensities with a fixed number (e.g., 1000) after normalization. See this discussion on OMICSForum.ca and the examples below for further information.
normalize_sum(data)
data |
A tidy tibble created by |
A tibble with intensities normalized across samples.
# Example 1: Normalization only
toy_metaboscape %>%
  normalize_sum()

# Example 2: Multiply with 1000 after normalization
toy_metaboscape %>%
  normalize_sum() %>%
  dplyr::mutate(Intensity = .data$Intensity * 1000)
Performs PCA and creates a Scores or Loadings plot. Basically a wrapper around pcaMethods::pca.
The plot is drawn with ggplot2 and can therefore be easily manipulated afterwards (e.g., changing the theme or the axis labels).
Please note that the function is intended to be easy to use and beginner friendly and therefore offers limited ability to fine-tune certain parameters of the resulting plot.
If you wish to draw the plot yourself, you can set return_tbl = TRUE. In this case, a tibble is returned instead of a ggplot2 object which you can use to create a plot yourself.
Important Note
plot_pca() depends on the pcaMethods package from Bioconductor. If metamorphr was installed via install.packages(), dependencies from Bioconductor were not
automatically installed. When plot_pca() is called without the pcaMethods package installed, you should be asked if you want to install pak and pcaMethods.
If you want to use plot_pca() you have to install those. In case you run into trouble with the automatic installation, please install pcaMethods manually. See
pcaMethods – a Bioconductor package providing PCA methods for incomplete data for instructions on manual installation.
plot_pca(
  data,
  method = "svd",
  what = "scores",
  n_pcs = 2,
  pcs = c(1, 2),
  center = TRUE,
  group_column = NULL,
  name_column = NULL,
  return_tbl = FALSE,
  verbose = FALSE
)
data |
A tidy tibble created by |
method |
A character specifying one of the available methods ("svd", "nipals", "rnipals", "bpca", "ppca", "svdImpute", "robustPca", "nlpca", "llsImpute", "llsImputeAll"). If the default is used ("svd") an SVD PCA will be done, in case |
what |
Specifies what should be returned. Either |
n_pcs |
The number of PCs to calculate. |
pcs |
A vector containing 2 integers that specifies the PCs to plot. Only relevant if |
center |
Should |
group_column |
Either |
name_column |
Either |
return_tbl |
A logical. If |
verbose |
Should outputs from |
Either a Scores or Loadings Plot in the form of a ggplot2 object or a tibble.
# Draw a Scores Plot
toy_metaboscape %>%
  impute_lod() %>%
  join_metadata(toy_metaboscape_metadata) %>%
  plot_pca(what = "scores", group_column = Group)

# Draw a Loadings Plot
toy_metaboscape %>%
  impute_lod() %>%
  join_metadata(toy_metaboscape_metadata) %>%
  plot_pca(what = "loadings", name_column = Feature)
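To illustrate what scores and loadings are, the following sketch runs an SVD-based PCA with stats::prcomp() from base R. Note that this swaps in prcomp() for pcaMethods::pca() purely for illustration; the random matrix `x` (samples in rows, features in columns) and all object names are assumptions of this sketch.

```r
set.seed(42)

# Hypothetical wide matrix: 6 samples (rows) x 4 features (columns)
x <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(paste0("Sample", 1:6), paste0("Feature", 1:4))
)

# SVD-based PCA with mean centering and without scaling
pca <- stats::prcomp(x, center = TRUE, scale. = FALSE)

# Scores: coordinates of the samples on the first two PCs
scores <- pca$x[, 1:2]

# Loadings: contribution of each feature to the first two PCs
loadings <- pca$rotation[, 1:2]

# Fraction of the total variance explained by each PC
explained <- pca$sdev^2 / sum(pca$sdev^2)
```

A Scores Plot displays one point per sample from `scores`, while a Loadings Plot displays one point per feature from `loadings`.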
Performs necessary calculations (i.e., calculate p-values and log2-fold changes) and creates a basic Volcano Plot.
The plot is drawn with ggplot2 and can therefore be easily manipulated afterwards (e.g., changing the theme or the axis labels).
Please note that the function is intended to be easy to use and beginner friendly and therefore offers limited ability to fine-tune certain parameters of the resulting plot.
If you wish to draw the plot yourself, you can set return_tbl = TRUE. In this case, a tibble is returned instead of a ggplot2 object which you can use to create a plot yourself.
A Volcano Plot is used to compare two groups. Therefore grouping information must be provided. See join_metadata for more information.
plot_volcano(
  data,
  group_column,
  name_column,
  groups_to_compare,
  batch_column = NULL,
  batch = NULL,
  log2fc_cutoff = 1,
  p_value_cutoff = 0.05,
  colors = list(sig_up = "darkred", sig_down = "darkblue", not_sig_up = "grey", not_sig_down = "grey", not_sig = "grey"),
  adjust_p = FALSE,
  log2_before = FALSE,
  return_tbl = FALSE,
  ...
)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
name_column |
Which column contains the feature names? Can for example be |
groups_to_compare |
Names of the groups which should be compared as a character vector. Those are the group names in the |
batch_column |
Which column contains the batch information? Usually |
batch |
The names of the batch(es) that should be included when calculating p-value and log2 fold change. |
log2fc_cutoff |
A numeric. What cutoff should be used for the log2 fold change? Traditionally, this is set to |
p_value_cutoff |
A numeric. What cutoff should be used for the p-value? Traditionally, this is set to |
colors |
A named list for coloring the dots in the Volcano Plot or |
adjust_p |
Should the p-value be adjusted? Can be either |
log2_before |
A logical. Should the data be log2 transformed prior to calculating the p-values? |
return_tbl |
A logical. If |
... |
Arguments passed on to |
Either a Volcano Plot in the form of a ggplot2 object or a tibble.
# returns a Volcano Plot in the form of a ggplot2 object
toy_metaboscape %>%
  impute_lod() %>%
  join_metadata(toy_metaboscape_metadata) %>%
  plot_volcano(
    group_column = Group,
    name_column = Feature,
    groups_to_compare = c("control", "treatment")
  )

# returns a tibble to draw the plot manually
toy_metaboscape %>%
  impute_lod() %>%
  join_metadata(toy_metaboscape_metadata) %>%
  plot_volcano(
    group_column = Group,
    name_column = Feature,
    groups_to_compare = c("control", "treatment"),
    return_tbl = TRUE
  )
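The two quantities shown in a Volcano Plot can be sketched for a single feature in base R. The sketch assumes a Welch t-test via stats::t.test(), a fold change defined as treatment over control, and inclusive/exclusive cutoff comparisons as written below; the test, the direction and the comparisons actually used by plot_volcano() may differ. All values are made up.

```r
# Hypothetical intensities of one feature in two groups
control   <- c(100, 110, 95)
treatment <- c(400, 420, 390)

# log2 fold change (assumed direction: treatment relative to control)
log2fc <- log2(mean(treatment) / mean(control))

# p-value from a two-sample Welch t-test (assumed test)
p_value <- stats::t.test(treatment, control)$p.value

# Apply the default cutoffs from the usage section
log2fc_cutoff  <- 1
p_value_cutoff <- 0.05
significant <- abs(log2fc) >= log2fc_cutoff && p_value < p_value_cutoff
```

In the plot, the log2 fold change is usually placed on the x axis and the negative log10 of the p-value on the y axis, so features that pass both cutoffs end up in the upper left and upper right corners.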
Essentially a wrapper around readr::read_delim() that also performs some initial tidying operations such as gathering (gather()) and rearranging columns. The label_col will be renamed to Feature.
read_featuretable(
  file,
  delim = ",",
  label_col = 1,
  metadata_cols = NULL,
  remove_empty_cols = FALSE,
  show_removed_cols = TRUE,
  ...
)
file |
A path to a file but can also be a connection or literal data. |
delim |
The field separator or delimiter. For example "," in csv files. |
label_col |
The index or name (as a character) of the column that will be used to label Features. For example an identifier (e.g., KEGG, CAS, HMDB) or a m/z-RT pair. |
metadata_cols |
The index/indices or name(s) of column(s) that hold additional feature metadata (e.g., retention times, additional identifiers or m/z values). |
remove_empty_cols |
Either |
show_removed_cols |
Only relevant if |
... |
Additional arguments passed on to |
A tidy tibble.
H. Wickham, J. Stat. Soft. 2014, 59, DOI 10.18637/jss.v059.i10.
H. Wickham, M. Averick, J. Bryan, W. Chang, L. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. Pedersen, E. Miller, S. Bache, K. Müller, J. Ooms, D. Robinson, D. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, H. Yutani, JOSS 2019, 4, 1686, DOI 10.21105/joss.01686.
“12 Tidy data | R for Data Science,” can be found under https://r4ds.had.co.nz/tidy-data.html, 2023.
# Read a toy dataset in the format produced with Bruker MetaboScape (Version 2021).
featuretable_path <- system.file("extdata", "toy_metaboscape.csv", package = "metamorphr")

# Example 1: Provide indices for metadata_cols
featuretable <- read_featuretable(featuretable_path, metadata_cols = 2:5)
featuretable

# Example 2: Provide a name for label_col and indices for metadata_cols
featuretable <- read_featuretable(
  featuretable_path,
  label_col = "m/z",
  metadata_cols = c(1, 2, 4, 5)
)
featuretable

# Example 3: Provide names for both, label_col and metadata_cols
featuretable <- read_featuretable(
  featuretable_path,
  label_col = "m/z",
  metadata_cols = c("Bucket label", "RT", "Name", "Formula")
)
featuretable
Similar to read_featuretable but specifically for 'full_feature_table' files created with 'mzmine'. For more information, see the 'mzmine' documentation.
read_featuretable_mzmine(
  file,
  intensity = "height",
  field_separator = ",",
  label_col = 1,
  import_datafile_cols = FALSE,
  remove_empty_cols = FALSE,
  show_removed_cols = TRUE
)
file |
A path to a file but can also be a connection or literal data. |
intensity |
A character that specifies what should be used as the (semi-)quantitative measure. Either |
field_separator |
The field separator as specified in 'mzmine'. Usually |
label_col |
The index or name (as a character) of the column that will be used to label Features. For example an identifier (e.g., KEGG, CAS, HMDB) or a m/z-RT pair. |
import_datafile_cols |
Should columns that begin with |
remove_empty_cols |
Either |
show_removed_cols |
Only relevant if |
A tidy tibble.
H. Wickham, J. Stat. Soft. 2014, 59, DOI 10.18637/jss.v059.i10.
H. Wickham, M. Averick, J. Bryan, W. Chang, L. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. Pedersen, E. Miller, S. Bache, K. Müller, J. Ooms, D. Robinson, D. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, H. Yutani, JOSS 2019, 4, 1686, DOI 10.21105/joss.01686.
“12 Tidy data | R for Data Science,” can be found under https://r4ds.had.co.nz/tidy-data.html, 2023.
# Read a toy dataset in the format produced with mzmine.
featuretable_path <- system.file("extdata", "toy_mzmine.csv", package = "metamorphr")

# Example 1: Use feature height as the metric
featuretable <- read_featuretable_mzmine(
  featuretable_path,
  intensity = "height"
)
featuretable

# Example 2: Use feature area as the metric
featuretable <- read_featuretable_mzmine(
  featuretable_path,
  intensity = "area"
)
featuretable

# Example 3: Use the 'mz' column as a Feature label
featuretable <- read_featuretable_mzmine(
  featuretable_path,
  label_col = "mz"
)
featuretable
MGF files allow the storage of MS/MS spectra. This
function reads them into a tidy tibble. Each variable is stored in a column and each ion (observation) is stored in a separate row.
MS/MS spectra are stored in a list column named MSn.
Please note that MGF files are software-specific so the variables
and their names may vary. This function was developed with the GNPS file format exported from mzmine in mind.
If you encounter any bugs please report them: https://github.com/yasche/metamorphr/issues
read_mgf(file, show_progress = TRUE)
file |
The path to the MGF file. |
show_progress |
A |
A tidy tibble holding MS/MS spectra.
mgf_path <- system.file("extdata", "toy_mgf.mgf", package = "metamorphr")
read_mgf(mgf_path)
Remove empty columns (i.e., columns that only contain NA) from a tibble or data frame.
remove_empty_cols(data, always_keep = NULL, show_removed_cols = TRUE)
data |
A tibble or data frame in wide format. |
always_keep |
Specify columns that should always be kept, regardless if they only contain |
show_removed_cols |
If |
A tibble or data frame in wide format without empty columns.
# Columns `a` and `e` contain only `NA` and should be removed
na_tibble <- tibble::tibble(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  c = c(NA, 2, 3),
  d = c(NA, NA, 3),
  e = c(NA, NA, NA)
)

remove_empty_cols(na_tibble)

# Columns `a` and `e` contain only `NA` but `a` should be kept anyway
remove_empty_cols(na_tibble, always_keep = a)
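The selection logic can be sketched in base R as follows. This is only an illustration of the idea (drop columns where every value is NA, except those explicitly protected); the helper objects `is_empty`, `keep` and `cleaned` are hypothetical, and the protected column is given as a character string here, unlike the unquoted column name used by remove_empty_cols().

```r
# Same layout as the example above, as a plain data frame
na_df <- data.frame(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  c = c(NA, 2, 3),
  d = c(NA, NA, 3),
  e = c(NA, NA, NA)
)

# A column counts as empty if all of its values are NA
is_empty <- vapply(na_df, function(col) all(is.na(col)), logical(1))

# Columns that should survive even if they are empty
always_keep <- "a"

# Keep non-empty columns plus the protected ones
keep <- !is_empty | names(na_df) %in% always_keep
cleaned <- na_df[, keep, drop = FALSE]
```

Column `d` is retained in any case because it holds one non-missing value.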
Scales the intensities of all features using
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_{i}}{s_{i}}$$
where $\tilde{x}_{ij}$ is the intensity of sample $j$, feature $i$ after scaling,
$x_{ij}$ is the intensity of sample $j$, feature $i$ before scaling, $\bar{x}_{i}$ is the mean of intensities of feature $i$ across all samples
and $s_{i}$ is the standard deviation of intensities of feature $i$ across all samples.
In other words, it subtracts the mean intensity of a feature across samples from the intensities of that feature in each sample and divides by the standard deviation of that feature.
For more information, see the reference section.
scale_auto(data)
data |
A tidy tibble created by |
A tibble with autoscaled intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  scale_auto()
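The autoscaling arithmetic described above can be sketched in base R. The matrix `x` (features in rows, samples in columns) is a made-up example, and the sketch assumes the sample standard deviation as computed by stats::sd(); it is not the code used by scale_auto().

```r
# Hypothetical intensity matrix: 2 features (rows) x 3 samples (columns)
x <- matrix(
  c(1, 2, 3,
    10, 20, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("F1", "F2"), c("S1", "S2", "S3"))
)

# Mean and standard deviation of each feature across all samples
feature_mean <- rowMeans(x)
feature_sd   <- apply(x, 1, sd)

# Subtract the feature mean and divide by the feature standard deviation
# (the vectors are recycled down the rows, i.e. per feature)
x_auto <- (x - feature_mean) / feature_sd
```

After autoscaling, every feature has a mean of 0 and a standard deviation of 1, so all features contribute with comparable weight.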
Centers the intensities of all features around zero using
$$\tilde{x}_{ij} = x_{ij} - \bar{x}_{i}$$
where $\tilde{x}_{ij}$ is the intensity of sample $j$, feature $i$ after scaling,
$x_{ij}$ is the intensity of sample $j$, feature $i$ before scaling and $\bar{x}_{i}$ is the mean of intensities of feature $i$ across all samples.
In other words, it subtracts the mean intensity of a feature across samples from the intensities of that feature in each sample.
For more information, see the reference section.
scale_center(data)
data |
A tidy tibble created by |
A tibble with intensities scaled around zero.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  scale_center()
Scales the intensities of all features using
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_{i}}{\bar{x}_{i}}$$
where $\tilde{x}_{ij}$ is the intensity of sample $j$, feature $i$ after scaling,
$x_{ij}$ is the intensity of sample $j$, feature $i$ before scaling and $\bar{x}_{i}$ is the mean of intensities of feature $i$ across all samples.
In other words, it performs centering (scale_center) and divides by the feature mean, thereby focusing on the relative intensity.
scale_level(data)
data |
A tidy tibble created by |
A tibble with level scaled intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  impute_lod() %>%
  scale_level()
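Centering and level scaling differ only in the final division by the feature mean, which the following base R sketch makes explicit. The matrix `x` (features in rows, samples in columns) is a hypothetical example and the code is not taken from the package.

```r
# Hypothetical intensity matrix: 2 features (rows) x 3 samples (columns)
x <- matrix(
  c(1, 2, 3,
    10, 20, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("F1", "F2"), c("S1", "S2", "S3"))
)

# Mean of each feature across all samples
feature_mean <- rowMeans(x)

# Centering: subtract the feature mean
x_center <- x - feature_mean

# Level scaling: centering followed by division by the feature mean
x_level <- x_center / feature_mean
```

The level scaled values are relative deviations from the feature mean, so a value of 0.5 means 50 % above the mean regardless of the absolute intensity of the feature.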
Scales the intensities of all features using
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_{i}}{\sqrt{s_{i}}}$$
where $\tilde{x}_{ij}$ is the intensity of sample $j$, feature $i$ after scaling,
$x_{ij}$ is the intensity of sample $j$, feature $i$ before scaling, $\bar{x}_{i}$ is the mean of intensities of feature $i$ across all samples
and $\sqrt{s_{i}}$ is the square root of the standard deviation of intensities of feature $i$ across all samples.
In other words, it subtracts the mean intensity of a feature across samples from the intensities of that feature in each sample and divides by the square root of the standard deviation of that feature.
For more information, see the reference section.
scale_pareto(data)
data |
A tidy tibble created by |
A tibble with Pareto scaled intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  scale_pareto()
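A small base R sketch shows how Pareto scaling differs from autoscaling: the divisor is the square root of the standard deviation rather than the standard deviation itself. The matrix `x` (features in rows, samples in columns) is made up and the sketch assumes the sample standard deviation from stats::sd().

```r
# Hypothetical intensity matrix: 2 features (rows) x 3 samples (columns)
x <- matrix(
  c(0, 4, 8,
    10, 20, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("F1", "F2"), c("S1", "S2", "S3"))
)

# Mean and standard deviation of each feature across all samples
feature_mean <- rowMeans(x)
feature_sd   <- apply(x, 1, sd)

# Pareto scaling: divide the centered values by sqrt(sd)
x_pareto <- (x - feature_mean) / sqrt(feature_sd)

# Autoscaling for comparison: divide the centered values by sd
x_auto <- (x - feature_mean) / feature_sd
```

For feature `F1` (standard deviation 4) the Pareto scaled values are -2, 0 and 2, whereas autoscaling yields -1, 0 and 1. Pareto scaling therefore shrinks large features less strongly and stays closer to the original measurement.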
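Scales the intensities of all features using
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_{i}}{x_{i,\max} - x_{i,\min}}$$
where $\tilde{x}_{ij}$ is the intensity of sample $j$, feature $i$ after scaling,
$x_{ij}$ is the intensity of sample $j$, feature $i$ before scaling, $\bar{x}_{i}$ is the mean of intensities of feature $i$ across all samples,
$x_{i,\max}$ is the maximum intensity of feature $i$ across all samples and $x_{i,\min}$ is the minimum intensity of feature $i$ across all samples.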
In other words, it subtracts the mean intensity of a feature across samples from the intensities of that feature in each sample and divides by the range of that feature.
For more information, see the reference section.
scale_range(data)
data |
A tidy tibble created by |
A tibble with range scaled intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  scale_range()
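Range scaling can be sketched in base R in the same way. The matrix `x` (features in rows, samples in columns) is a hypothetical example; the code only illustrates the arithmetic and is not the package implementation.

```r
# Hypothetical intensity matrix: 2 features (rows) x 3 samples (columns)
x <- matrix(
  c(0, 4, 8,
    10, 20, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("F1", "F2"), c("S1", "S2", "S3"))
)

# Mean and range (maximum minus minimum) of each feature across all samples
feature_mean  <- rowMeans(x)
feature_range <- apply(x, 1, max) - apply(x, 1, min)

# Subtract the feature mean and divide by the feature range
x_range <- (x - feature_mean) / feature_range
```

After range scaling, the difference between the largest and the smallest value of every feature is exactly 1. Because the range depends on only two values per feature, the result is sensitive to outliers.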
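Scales the intensities of all features using
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_{i}}{s_{i}} \cdot \frac{\bar{x}_{i}}{s_{i}}$$
where $\tilde{x}_{ij}$ is the intensity of sample $j$, feature $i$ after scaling,
$x_{ij}$ is the intensity of sample $j$, feature $i$ before scaling, $\bar{x}_{i}$ is the mean of intensities of feature $i$ across all samples
and $s_{i}$ is the standard deviation of intensities of feature $i$ across all samples. Note that $\frac{\bar{x}_{i}}{s_{i}} = \frac{1}{CV}$ where CV is the coefficient of variation across all samples.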
scale_vast_grouped is a variation of this function that uses a group-specific coefficient of variation.
In other words, it performs autoscaling (scale_auto) and divides by the coefficient of variation, thereby reducing the importance of features with a poor reproducibility.
scale_vast(data)
data |
A tidy tibble created by |
A tibble with vast scaled intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
J. Sun, Y. Xia, Genes & Diseases 2024, 11, 100979, DOI 10.1016/j.gendis.2023.04.018.
toy_metaboscape %>%
  scale_vast()
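The relationship between vast scaling, autoscaling and the coefficient of variation can be checked with a short base R sketch. The matrix `x` (features in rows, samples in columns) is made up, and the sample standard deviation from stats::sd() is assumed.

```r
# Hypothetical intensity matrix: 2 features (rows) x 3 samples (columns)
x <- matrix(
  c(2, 4, 6,
    10, 20, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("F1", "F2"), c("S1", "S2", "S3"))
)

# Mean, standard deviation and coefficient of variation of each feature
feature_mean <- rowMeans(x)
feature_sd   <- apply(x, 1, sd)
feature_cv   <- feature_sd / feature_mean

# Vast scaling written as in the formula above
x_vast <- (x - feature_mean) / feature_sd * (feature_mean / feature_sd)

# Equivalent formulation: autoscaling divided by the coefficient of variation
x_auto     <- (x - feature_mean) / feature_sd
x_vast_alt <- x_auto / feature_cv
```

Both formulations give the same result. Features with a small coefficient of variation are amplified, while features with a large coefficient of variation are damped.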
A variation of scale_vast but uses a group-specific coefficient of variation and therefore requires group information. See scale_vast and the References section for more information.
scale_vast_grouped(data, group_column = .data$Group)
data |
A tidy tibble created by |
group_column |
Which column should be used for grouping? Usually |
A tibble with vast scaled intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  join_metadata(toy_metaboscape_metadata) %>%
  scale_vast_grouped()
Information about a feature table. Prints information to the console (number of samples, number of features and if applicable number of groups, replicates and batches) and returns a sample-wise summary as a list.
summary_featuretable(
  data,
  n_samples_max = 5,
  n_features_max = 5,
  n_groups_max = 5,
  n_batches_max = 5
)
data |
A tidy tibble created by |
n_samples_max |
How many Samples should be printed to the console? |
n_features_max |
How many Features should be printed to the console? |
n_groups_max |
How many groups should be printed to the console? |
n_batches_max |
How many Batches should be printed to the console? |
A sample-wise summary as a list.
toy_metaboscape %>%
  join_metadata(toy_metaboscape_metadata) %>%
  summary_featuretable()
The raw feature table is also included.
This tibble can be reproduced with metamorphr::read_featuretable(system.file("extdata", "toy_metaboscape.csv", package = "metamorphr"), metadata_cols = 2:5).
toy_metaboscape
A data frame with 110 rows and 8 columns:
UID: A unique identifier for each Feature. This column is automatically generated by metamorphr::read_featuretable() when the feature table is imported.
Feature: A label given to each Feature for easier identification. The column of the original feature table that is used to generate the Feature column is specified with the label_col argument of metamorphr::read_featuretable().
Sample: Sample name. Column names in the original feature table.
Intensity: Measured intensity (or area).
RT: Retention time. Feature metadata and therefore not really necessary.
m/z: Mass over charge. Feature metadata and therefore not really necessary.
Name: Feature name. Feature metadata and therefore not really necessary.
Formula: Chemical formula. Feature metadata and therefore not really necessary.
...
This data set contains fictional data!
toy_metaboscape
Data was generated with metamorphr::create_metadata_skeleton() and can be reproduced with
metamorphr::toy_metaboscape %>% create_metadata_skeleton().
toy_metaboscape_metadata
A data frame with 11 rows and 5 columns:
The sample name
To which group does the sample belong? For example a treatment or a background. Note that additional columns with additional grouping information can be freely added if necessary.
The replicate.
The batch in which the samples were prepared or measured.
A sample-specific factor, for example dry weight or protein content.
...
This data set contains fictional data!
Data was generated with metamorphr::read_mgf(). This tibble can be reproduced with metamorphr::read_mgf(system.file("extdata", "toy_mgf.mgf", package = "metamorphr")).
toy_mgf
A data frame with 3 rows and 5 columns:
A fictional variable.
A fictional variable.
A fictional variable.
The precursor ion m/z.
A list column containing MSn spectra.
...
This data set contains fictional data!
Log-transforms intensities. The default (base = 10) calculates the log10. This transformation can help reduce heteroscedasticity. See references for more information.
transform_log(data, base = 10)
data |
A tidy tibble created by |
base |
Which base should be used for the log-transformation. The default (10) means that log10 values of the intensities are calculated. |
A tibble with log-transformed intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  impute_lod() %>%
  transform_log()
Calculates the nth root of intensities with x^(1/n). The default (n = 2) calculates the square root. This transformation can help reduce heteroscedasticity. See references for more information.
transform_power(data, n = 2)
data |
A tidy tibble created by |
n |
The nth root to calculate. |
A tibble with power-transformed intensities.
R. A. Van Den Berg, H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. Van Der Werf, BMC Genomics 2006, 7, 142, DOI 10.1186/1471-2164-7-142.
toy_metaboscape %>%
  impute_lod() %>%
  transform_power()
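The effect of the log and power transformations on a wide range of intensities can be sketched with base R alone. The vector `intensities` is a made-up example; the sketch only reproduces the arithmetic (log with a given base, x^(1/n)) and is not the package code.

```r
# Hypothetical intensities spanning four orders of magnitude
intensities <- c(1, 10, 100, 10000)

# Log transformation with base 10 (the default of transform_log())
log10_intensities <- log(intensities, base = 10)

# Power transformation with n = 2, i.e. the square root (the default of transform_power())
n <- 2
sqrt_intensities <- intensities^(1 / n)

# Power transformation with n = 3, i.e. the cube root
cube_root_intensities <- intensities^(1 / 3)
```

The log transformation compresses the range from 1 to 10000 down to 0 to 4, while the square root compresses it to 1 to 100. Unlike the power transformation, the log transformation is undefined for zero intensities, which is one reason why missing or zero values are typically imputed beforehand.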