The power of single cell analysis

Single cell analysis is entering the OMICS world with huge steps. What is meant by single cell experiments? In single cell transcriptomics, for example, each cell is barcoded during the experiment, which allows every cell to be examined separately afterwards. The transcriptomic profiles of the cells then help to recognise the cell subsets. A commonly used R package for the analysis of single cell sequencing data (and also spatial transcriptomics, a highly interesting topic as well!) is Seurat. The developers have prepared really good examples, tutorials and vignettes, available at https://satijalab.org/seurat/ . The name is also very creative: it comes from the well-known French painter Georges Seurat, whose “A summer landscape” is presented below. Just as many points, each representing a cell, build a nice-looking picture that lets us see the biology behind the studied question at higher resolution, the French painter showed us the world through his specific style of many colourful dots and short brush strokes.

Rawpixel Ltd, CC BY 2.0 https://creativecommons.org/licenses/by/2.0, via Wikimedia Commons
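
To make the barcoding idea from above a bit more concrete, here is a minimal Python sketch. It assumes the reads have already been reduced to (cell barcode, gene) pairs; the barcodes, the gene choices and the function name count_per_cell are made up for illustration, and this is not how Seurat or any real preprocessing tool is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical toy reads: (cell barcode, gene) pairs, as if already
# extracted from sequencing reads. Real data would come from a
# dedicated preprocessing tool, not from a hand-written list.
reads = [
    ("AAAC", "CD3E"),
    ("AAAC", "CD3E"),
    ("AAAC", "MS4A1"),
    ("TTTG", "MS4A1"),
    ("TTTG", "MS4A1"),
    ("GGCA", "CD3E"),
]


def count_per_cell(reads):
    """Group reads by cell barcode and count reads per gene in each cell."""
    counts = defaultdict(Counter)
    for barcode, gene in reads:
        counts[barcode][gene] += 1
    return {barcode: dict(genes) for barcode, genes in counts.items()}


counts = count_per_cell(reads)
for barcode in sorted(counts):
    print(barcode, counts[barcode])
```

Because every read carries its barcode, each cell ends up with its own expression profile, and it is these per-cell profiles that are later compared to find cell subsets.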

A really nice example of a single cell transcriptomics-based study is presented in the paper “Colorectal Cancer Stem Cell States Uncovered by Simultaneous Single‐Cell Analysis of Transcriptome and Telomeres” [1]. Because studying a very small subpopulation such as cancer stem cells is extremely difficult with bulk sequencing, the single cell sequencing approach was a perfect choice. The researchers used single cell sequencing together with single cell telomere length analysis to study the telomeres of the cancer stem cell population in colorectal cancer patients. The study includes many experimental methods and validation steps, and throughout the paper it builds the ground for the final hypothesis. I strongly recommend reading the article. It is published in an open access journal, so none of you will have trouble accessing it, which makes it even better!

[1] Wang, Hua, et al. “Colorectal Cancer Stem Cell States Uncovered by Simultaneous Single‐Cell Analysis of Transcriptome and Telomeres.” Advanced Science 8.8 (2021): 2004320.

MaxQuant and Perseus

Hey all! Earlier, I mentioned the MaxQuant Summer School. It is happening right now. This week they shared the new “MaxQuant.Live”, a software framework for monitoring mass spectrometric data and controlling the data acquisition. It is free, and the Live-2 beta supports the Q-Exactive series.

Together with this, the lectures are explanatory and very well prepared. They are useful not only for improving your skills and learning the concepts of MaxQuant and Perseus in depth, but also for getting to know proteomics if you are a beginner in this area. After the course, the recordings will also be put on the MaxQuant YouTube page. I definitely recommend that you take a look.

Also, as a user, I can say that both MaxQuant and Perseus are very user-friendly software for the qualification, quantification and analysis of proteomic data. They are also widely used, so you can find all the documentation and training videos on the web. Even if you don’t have any experience with this software, don’t hesitate to take a look 🙂

Yiz

Dear Methanol :)

Hey all,

This week is a little crazy because of my current workload, as well as the extraction of all my wisdom teeth. Therefore, after a short introduction, I am going to leave an interesting publication here and leave you to it.

This time the main topic is different from proteomic research. We are travelling far back before proteins and amino acids, at the speed of light and away from our orbit. We are looking at a possible pathway for the methanol (CH3OH) observed in a disk around a young star. Methanol is a molecule that participates in the formation of more complex molecules such as amino acids and proteins, and in the structural stability, folding and activity of proteins. I especially liked the authors’ approach of discussing the whole idea from the perspective of chemical evolution.

If you are interested, I wish you enjoyable reading, and also wisdom teeth that hold on to life better than mine did, or that decide not to grow at all.

Yiz

Reference:

Booth, A.S., Walsh, C., Terwisscha van Scheltinga, J. et al. An inherited complex organic molecule reservoir in a warm planet-hosting disk. Nat Astron (2021).

Paper review: Multivariate Analysis in Metabolomics

Hey there! Today I want to share with you a paper about multivariate analysis in metabolomics. Lately, I have been reading about metabolomics data normalization, scaling, and analysis. Several people in our PhD cohort analyze metabolomics data, and they highly recommend MetaboAnalyst. It is definitely a user-friendly website for analyzing our data, with many different tools for each step of the analysis. Clicking things gave me results and beautiful plots, but I wasn’t 100% sure if I was doing things correctly! So I had to step back and read first: what do these different options do? 😀

I noticed that metabolomics papers frequently use PLS (Partial Least Squares Projections to Latent Structures) and OPLS (Orthogonal Projections to Latent Structures) for multivariate analysis. However, I had never used these models and wondered how to use them properly. Sinemyiz recommended this paper to me: Multivariate Analysis in Metabolomics (Worley, 2013). It turns out there are serious limitations to PLS and OPLS. For example, they can over-fit models to the data and separate classes even with random data! So we need to be careful when interpreting the results, and validation is necessary. The paper then discusses different ways to validate PLS models.
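
To get a feeling for how random data can look “separated”, here is a small Python sketch. It does not run PLS or OPLS; as a much simpler stand-in, it just searches many pure-noise variables for the one that best splits two classes, and then checks the result with a label-permutation test, one common validation idea. The sample sizes, the seed and the function name best_separation are my own assumptions for illustration, not values from the paper.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_PER_CLASS = 10      # samples per class (hypothetical)
N_FEATURES = 200      # many more variables than samples
N_PERMUTATIONS = 100  # number of label shuffles

# Pure noise: no variable carries any real class information.
data = [[random.gauss(0, 1) for _ in range(N_FEATURES)]
        for _ in range(2 * N_PER_CLASS)]
labels = [0] * N_PER_CLASS + [1] * N_PER_CLASS


def best_separation(data, labels):
    """Largest absolute difference in class means over all variables."""
    best = 0.0
    for j in range(len(data[0])):
        a = [row[j] for row, y in zip(data, labels) if y == 0]
        b = [row[j] for row, y in zip(data, labels) if y == 1]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        best = max(best, diff)
    return best


observed = best_separation(data, labels)

# Permutation test: repeat the whole search on shuffled labels.
null_scores = []
for _ in range(N_PERMUTATIONS):
    shuffled = labels[:]
    random.shuffle(shuffled)
    null_scores.append(best_separation(data, shuffled))

# Share of shuffles that separate at least as well as the real labels.
p_value = (1 + sum(s >= observed for s in null_scores)) / (1 + N_PERMUTATIONS)

print(f"best apparent separation on noise: {observed:.2f}")
print(f"permutation p-value: {p_value:.3f}")
```

The search always finds some variable with a visible gap between the class means, even though the data are noise. The permutation test puts that gap in context: if shuffled labels separate about as well as the real ones, the separation should not be trusted.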

The authors also listed other multivariate methods that researchers can use: hierarchical clustering, support vector machines, and artificial neural networks. There are studies with these methods, but the authors said that the metabolomics community is more accustomed to PCA and PLS. Do you think this statement is still true today?

Here is the paper to read more:

Worley B, Powers R. Multivariate Analysis in Metabolomics. Curr Metabolomics. 2013;1(1):92-107. doi:10.2174/2213235X11301010092

R Package to do basic statistics with a Great Visualization!

The package ggstatsplot is an extension of the ggplot2 package for creating graphics with details from statistical tests included in the information-rich plots themselves. In a typical exploratory data analysis workflow, data visualization and statistical modeling are two different phases: visualization informs modeling, and modeling in its turn can suggest a different visualization method, and so on and so forth. The central idea of ggstatsplot is simple: combine these two phases into one in the form of graphics with statistical details, which makes data exploration simpler and faster.

It therefore produces different kinds of plots for different analyses:

Function          Plot                     Description
ggbetweenstats    violin plots             for comparisons between groups/conditions
ggwithinstats     violin plots             for comparisons within groups/conditions
gghistostats      histograms               for distribution about numeric variable
ggdotplotstats    dot plots/charts         for distribution about labeled numeric variable
ggscatterstats    scatterplots             for correlation between two variables
ggcorrmat         correlation matrices     for correlations between multiple variables
ggpiestats        pie charts               for categorical data
ggbarstats        bar charts               for categorical data
ggcoefstats       dot-and-whisker plots    for regression models and meta-analysis

I have first-hand experience with the function ggbetweenstats, which I used to compare two variables from prospective data. There is also a grouped_ variant of this function that makes it easy to repeat the same operation across a single grouping variable.

https://cran.r-project.org/web/packages/ggstatsplot/readme/README.html

Here I show an example with my data for a BMI condition, together with the statistics report format. Of course, the package also has options to change the:

  • Graphical element
  • Central tendency measure
  • Hypothesis testing
  • Effect size estimation and
  • Pairwise comparison tests
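
To show what kind of numbers sit behind a hypothesis test and an effect size in a two-group comparison, here is a minimal Python sketch. It is not the ggstatsplot implementation, and which test and effect size the package reports depends on its settings; I simply assume Welch's t-test and Cohen's d as one common pairing. The measurements and the function names welch_t and cohens_d are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical BMI-like measurements for two groups (made-up numbers).
group_a = [22.1, 24.3, 23.8, 25.0, 21.7, 23.2]
group_b = [27.4, 26.1, 28.9, 25.7, 29.3, 27.0]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b))
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


t, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}, d = {d:.2f}")
```

A plot that carries these values in its subtitle saves you from running the test separately and copying the numbers over by hand, which is the convenience the package is built around.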

So take a look, it’s a great tool and package to start analyzing your data with a great way of visualization.

Felipe.

References:

  • Patil, I. (2018). Visualizations with statistical details: The ‘ggstatsplot’ approach. PsyArxiv. doi:10.31234/osf.io/p7mku

Swimming in data – Long way hurray!

Hey all! Hope you are all doing well…

Today the topic is very general. I was going to look at my data, and I thought “ooh which paper can I share this time?”. It was after 7 hours of working at a screen (not a good idea! do not do it :D). And after all the data work of the last few days, I think it may be better to share a working note rather than a paper this time.

I know that sometimes the easiest way, the clearest one, the way without complex steps, can lead you to more meaningful results. Although I very much agree with this idea, especially when designing experiments, it may not always bring you to the deepest, cleanest and most meaningful conclusion. Sometimes, continuing along the long, winding and very unpredictable roads can lead you to different discoveries.

At the end of a proteomic examination, your data will contain a large number of proteins. Of course, you analyze these proteins using bioinformatics tools, you annotate them all, you know who is who (and no, not Voldemort :D), and you even know how they can be associated. Every day a lot of work is published about the molecular activities or biological roles of many proteins. Take ribosomal proteins, for example: their role in the formation of the ribosome structure or in protein synthesis probably comes to mind first. But they are interesting, because they also participate in various biological processes such as apoptosis, differentiation and the regulation of proteostasis. Thus, it is important to take the data obtained from your bioinformatic analysis and review the literature comprehensively.

I am not going to say that it is gonna be alright; most probably it will be a crazy little lovely fight. But somehow the long ways lead you to synthesize and interpret the knowledge, and to imagine. As a result, you may start to see your results very differently from how you first saw them.

Let me also leave the title’s inspiration for the long way series here. Thank you for inspiring.

Yiz

Bioinformatics Pipeline Development with Nextflow

Part of the main objective of my project is the development of a pipeline, so a solid knowledge of how to create an efficient workflow is essential. Nextflow is a workflow management system that uses Docker technology for the multi-scale handling of containerized computation. It enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

I attended the Bioinformatics NGS Data Analysis Summer School organized by ecSeq Bioinformatics, so when I heard they were running an online workshop on Bioinformatics Pipeline Development with Nextflow, I knew I needed to attend.

As mentioned on the webpage, the purpose of the workshop was to introduce the concepts of bioinformatic pipeline development in the context of the open source Workflow Management System (WMS) Nextflow. I was trained in the scripting, configuration and execution of example analysis pipelines based on current industry best practices, and learned how to share them with other users. Finally, I applied everything I had learned by implementing my own analysis pipelines from the ground up.

Personal opinion:

The target audience is biologists or data analysts with little or no experience in developing computational pipelines for data analysis. A superficial understanding of molecular biology (DNA, RNA, gene expression, PCR, …) is assumed, as examples were given. I think a medium knowledge of Linux and data analysis is necessary before taking the course in order to get the most out of it. The organizers were super friendly and resolved all doubts with precision. For a first edition of this course, I can say it was very well structured. Don’t miss your opportunity, and check out their different on-site and online workshops. 🙂

Bibliography:

  • Di Tommaso, P.; Chatzou, M.; Floden, E. W.; Barja, P. P.; Palumbo, E.; Notredame, C. Nextflow Enables Reproducible Computational Workflows. Nat Biotechnol 2017, 35 (4), 316–319. https://doi.org/10.1038/nbt.3820.

What are “pipelines” in bioinformatics?

The interest in next-generation sequencing (NGS) has increased rapidly, raising the number of methods and software tools available to analyze the results. However, compared to older methodologies, a clear understanding of the standards for qualitative and quantitative analyses has not yet been achieved. The analyses are often complex, and researchers often don’t have the required bioinformatics skills, which hinders the full use of NGS potential. This observation stresses the need to create user-friendly and powerful pipelines to deal with these data.

Bioinformatics analyses involve the management of files through a series of steps, called a pipeline or a workflow. These steps involve transformations that are done by executable command-line software written for Unix-compatible operating systems. Although existing bioinformatics pipelines offer high-performance analysis, it is not easy to integrate new specific tools into them. A bioinformatics framework should be able to hold pipelines consisting of serial and parallel steps, dependencies, different software and different data file types.
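
The idea of serial steps, parallel steps and dependencies can be sketched in a few lines of Python using the standard library module graphlib (Python 3.9 or newer). The step names and the function execution_batches are hypothetical and only illustrate the ordering logic; a real workflow manager does far more (file handling, containers, retries).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the steps it depends on.
# Step names are illustrative, not taken from any specific tool.
pipeline = {
    "quality_control": set(),
    "trim_reads": {"quality_control"},
    "align_reads": {"trim_reads"},
    "call_variants": {"align_reads"},
    "coverage_stats": {"align_reads"},
    "report": {"call_variants", "coverage_stats"},
}


def execution_batches(steps):
    """Return batches of steps; steps within a batch can run in parallel."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(pipeline)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

In this toy example the first three steps run one after another, while call_variants and coverage_stats land in the same batch because both only need the alignment, so they could run side by side.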

Basic Elements of a bioinformatic pipeline

Validation: A requirement for implementing a pipeline is a systematic clinical validation. It is necessary to understand and document each component of the pipeline, the data dependencies and the input/output constraints, and to develop mechanisms that alert for unexpected errors. Command-line parameters for each component of the pipeline and their settings should be documented and locked before validating the pipeline.
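
One possible way to “lock” documented parameters is to store a fingerprint of them at validation time and compare it before every run. The Python sketch below shows this idea under my own assumptions: the component names, the parameter names and the functions fingerprint and check_locked are all hypothetical, and other locking mechanisms would work just as well.

```python
import hashlib
import json

# Hypothetical locked settings for two pipeline components.
locked_parameters = {
    "aligner": {"threads": 8, "min_mapping_quality": 20},
    "variant_caller": {"min_depth": 10, "min_allele_fraction": 0.05},
}


def fingerprint(parameters):
    """Stable SHA-256 digest of a parameter set (key order does not matter)."""
    canonical = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_locked(parameters, expected_digest):
    """Raise an error if the parameters differ from the validated set."""
    if fingerprint(parameters) != expected_digest:
        raise ValueError(
            "pipeline parameters differ from the validated configuration"
        )


# Digest recorded once, at validation time.
validated_digest = fingerprint(locked_parameters)

# Before each run: passes silently while nothing has changed.
check_locked(locked_parameters, validated_digest)
```

If someone later changes a setting, for example the minimum mapping quality, check_locked raises an error instead of letting an unvalidated configuration produce results.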

Version Control: Version control of the pipeline should include semantic versioning of the deployed instance of the pipeline as a whole, as well as the versions of its individual components, since pipeline upgrades often significantly change the NGS test results.
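
As a small illustration of semantic versioning for the pipeline and its components, here is a Python sketch. The manifest contents, the version numbers and the rule “a major or minor bump triggers revalidation” are assumptions I made for the example, not a recommendation from any guideline; only plain MAJOR.MINOR.PATCH strings are handled.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text):
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def needs_revalidation(old, new):
    """Assumed policy: a major or minor bump triggers revalidation."""
    old_v, new_v = parse_version(old), parse_version(new)
    return (new_v.major, new_v.minor) != (old_v.major, old_v.minor)


# Hypothetical manifests: the pipeline as a whole plus its components.
deployed = {"pipeline": "2.3.1", "aligner": "0.7.17", "variant_caller": "4.2.0"}
upgrade = {"pipeline": "2.4.0", "aligner": "0.7.17", "variant_caller": "4.3.0"}

# Entries whose version string changed between the two manifests.
changed = {name: (deployed[name], upgrade[name])
           for name in deployed if deployed[name] != upgrade[name]}

to_revalidate = sorted(name for name, (old, new) in changed.items()
                       if needs_revalidation(old, new))
print(to_revalidate)
```

Parsing the parts as integers matters: compared as text, "1.10.0" would sort before "1.9.3", while as a Version tuple it is correctly the newer release.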

This is just an introduction to what a pipeline is and what its basic elements are. Their applications, obstacles and possible advantages in a clinical environment will be discussed in upcoming posts. Stay tuned 😉.

Is it cheating, or is it not?

I hope I caught your attention with that intriguing title. Today I would like to show you something that helps me while analysing, processing and cleaning data in R. There are common R packages that help you with data visualisation (ggplot2), data transformation (dplyr) or list manipulation (purrr). Each package has a set of frequently used commands, and these are wonderfully collected and presented in cheatsheets! When you are writing code and have forgotten a command, you can just have a look at this one page and get a quick reminder, usually even with an example!

On the website https://www.rstudio.com/resources/cheatsheets/ you can download many cheatsheets, not only for R packages but also for Python or SQL. The most commonly used cheatsheets can also be accessed directly from RStudio. The screenshot below shows how to find them: first go to the Tools tab, then choose Cheatsheets, and a list of them is shown. If you want to look further, you can also reach the website given above through “Browse Cheatsheets…”.

As I often work with data frames, I often use the Data Transformation Cheatsheet, which contains useful commands from the dplyr package. Additionally, I try to keep my analysis in R Notebook or R Markdown files. Especially at the beginning, it is very helpful to have the R Markdown Cheatsheet at your side, or to use the Markdown Quick Reference, which is also accessible through the Tools tab.

Let me know if you have already used the cheatsheets, or whether you think they might be useful for you in the future.

LOVE CETACEA + PROTEOMICS :)

Photo by Thomas Lipke

Hi everyone!

I am always so excited about the strong and lovely relationship between the disciplines of biology, chemistry and physics, and about the evolutionary perspective, which offers a versatile vision. It makes me wonder and imagine more. Take, for example, cetaceans and their metabolic and molecular cellular dynamics, including functional and structural changes in the macromolecular environment and their metabolic consequences. (Of course, all phylogenetic classes and their paleontological histories are so cool; however, I might have a slightly more intense curiosity for Cetacea in my heart 😊)

Today, I would like to share a comprehensive study by Magnadóttir et al. that presents potential diagnostic biomarkers for determining the health status of marine mammals. The identification of deiminated proteins and the miRNA analysis in whale serum and extracellular vesicles (EVs) are some of the interesting parts of the study.

Protein deimination is one of the irreversible post-translational modifications. Such a modification can change protein function, affect the regulation of gene expression, and promote protein moonlighting. All of these alterations are of great interest, especially for protein networks in evolutionarily conserved signaling pathways, because peptidylarginine deiminases (PADs) are highly conserved across several different taxa and may play roles in physiological and immunological processes. EVs, on the other hand, offer another interesting research insight into pathophysiology as well as physiological stress because of their role in cellular communication (through the transfer of miRNAs and proteins).

The authors also used STRING to analyze protein interaction networks. STRING is a very useful, open database and tool for predicting protein-protein interactions and for functional enrichment analysis. (You can also use it together with the open-source software Cytoscape.)

Besides my natural excitement about such research and molecular interplay, discovering diagnostic biomarkers is becoming more and more crucial every day because of increasing pollution, infection and global warming. Furthermore, these studies may help us to develop new cancer therapies with different approaches that we learn from long-lived mammals (as today’s paper mentions).

References:

  1. Magnadóttir B, Uysal-Onganer P, Kraev I, Svansson V, Hayes P, Lange S. Deiminated proteins and extracellular vesicles – Novel serum biomarkers in whales and orca. Comp Biochem Physiol Part D Genomics Proteomics. 2020 Jun;34:100676.
  2. Szklarczyk D, Gable AL, Lyon D, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47(D1):D607-D613.
  3. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498-2504.

I would also like to mention two inspiring organizations that work in ocean and sea conservation. One is “Sealegacy” and the other is “The Mediterranean Conservation Society”. If you are interested, you can check them out 🙂

Photo by guille pozzi

In the hope that no Moby Dick comes across a Captain Ahab

Yiz