Coding for Cancer

James Wadman

Addressing serious diseases from the perspective of a computer scientist might seem counterintuitive given the biological nature of health. However, as our collective information continues to grow, it seems likely that programming will become a dominant technique for attacking such problems efficiently using catalogued information. My case for this argument stems from intracellular signaling cascades, which biology identifies as many of the essential regulation checkpoints for cell growth, proliferation, and destruction. Depending on which pathway one is addressing, these cascades often overlap with oncogene pathways and biosignaling markers, which is why I find them important. Identifying any components missing in these regulation pathways can allow us to combine catalogued information about the pathways with measured genotypes in an affected individual to determine the risk, cause, and treatment of cell-regulation-based diseases, namely cancer.


Like many focuses of cell and molecular biology, intracellular signaling is a constantly evolving field, with discoveries being made on a nearly daily basis. I figure, then, that it is essential to start somewhere in writing a code for cancer. This code takes into account various cell pathways involved in cell regulation, including apoptosis and gene regulation through activation or inhibition of transcription factors. Because my background is more focused on biology in general, I thought it would be more interesting to focus on the computer science aspect of this problem. Therefore, the complexities of the pathways are not yet captured by this primordial code, which is only the start of a far greater task. Instead, the idea is to attack the problem at its most basic level, understanding that even though many variables are in play, the direct impact of the presence or absence of a signaling molecule is often binary. By creating data structures that are interwoven by code, one can then see how the binary direct impact of a signaling molecule can have profound effects on the overall health of the cell.


How The Code Works


My proposed code attempts to answer the question: how can we use stored data to infer information about cancer risks and treatment? The stored data in this case comes from two sources: the person’s genotype and the information about cell cycle proteins and genetic expression that has been catalogued thus far. My solution therefore starts from user input, which encodes a person’s genotype within certain parameters. That input is compared to data stored in lists and then analyzed by several functions to give the user a detailed report on the potential risk for a general cell with all other conditions standardized. Let’s take a look at each step in detail.


The user input allows a person to enter specific mutations within certain parameters, as stated before. This is one of the more limited steps of the code so far, because it is difficult for a person to gain access to their own genome, and even if the full genome can be analyzed, much of that information cannot be directly correlated with protein expression. My code therefore skips the genome step (note: in the future this step will be critical for efficiency and should NOT be skipped) and takes input data regarding protein expression. In other words, the user inputs which proteins are not expressed, or enters WT for wild-type (no genetic loss-of-function mutations). The user input looks like this, where given parameters are suggested for protein loss-of-function:


[Screenshot: user input prompt with suggested loss-of-function parameters]
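For readers who want something concrete, here is a minimal sketch of how this kind of input could be parsed. It is not the code from the screenshot. The function name `parse_lost_proteins` and the protein list `KNOWN_PROTEINS` are assumptions made up for illustration (only PI3K appears in the results further down), and the sketch assumes names are separated by commas or spaces.

```python
# Hypothetical parameter list: the post's real list is only visible in the
# screenshot, so these names are placeholders for illustration.
KNOWN_PROTEINS = ("PI3K", "AKT", "PTEN", "RAS", "RAF")


def parse_lost_proteins(raw, known=KNOWN_PROTEINS):
    """Turn a line of user input into an ordered, duplicate-free list of
    proteins marked as not expressed. 'WT' means no loss-of-function."""
    tokens = [t.strip().upper() for t in raw.replace(",", " ").split()]
    if not tokens:
        raise ValueError("enter WT or at least one protein name")
    if "WT" in tokens:
        if len(tokens) > 1:
            raise ValueError("WT cannot be combined with lost proteins")
        return []
    lost = []
    for token in tokens:
        if token not in known:
            raise ValueError(f"unknown protein: {token}")
        if token not in lost:  # duplicates are silently skipped
            lost.append(token)
    return lost


if __name__ == "__main__":
    print(parse_lost_proteins("WT"))
    print(parse_lost_proteins("pi3k, PTEN, pi3k"))
```

Rejecting unknown names up front keeps the later pathway analysis from silently ignoring a typo, which seems like the safer default for a health-related report.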


The stored data can come from any trustworthy collection of cell-cycle pathways, but in the future it will ideally come from objective databanks with consistent vocabulary for each protein or gene in question. Analyzing the pathways is the most important part of the code, because the code must be written to understand a few critical rules of cell regulation. One rule I worked into my code is that if an activating molecule is missing, regulation for that pathway is lost completely, while if an inhibiting molecule is missing, the pathway still functions but is regulated in the opposite way from what a healthy cell would intend. It must also be considered in this portion of the code that genetic expression is not an “all-or-nothing” process. Very slight changes in the genome can produce a gradient of genetic expression, for example through mutations that code for separate but similar amino acids or that alter posttranslational modification relative to the wild-type. There is also the possibility of over-expression of genes, which can drive excessive cell division if the cell-growth checkpoints and inhibitors are overshadowed by a higher concentration of signaling molecules promoting cell proliferation. It is in this step that the complexity of collective data will ultimately be simplified into an efficient way to understand the human genome, but such data is currently still a bit out of reach. The standard for data I used for the “all-or-nothing” protein function or loss-of-function in my code is shown below:


Source: Oncology Biomarkers. Take particular notice of the arrows and flat-edged lines (corresponding to activation and inhibition, respectively).
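The activator/inhibitor rule described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the original program: the names `Role`, `Status`, `PATHWAY`, and `pathway_status` are invented here, the toy pathway is a placeholder rather than the diagram's content, and letting a missing activator take precedence when both kinds of molecule are missing is my own assumption.

```python
from enum import Enum


class Role(Enum):
    ACTIVATOR = "activator"   # drawn as an arrow in pathway diagrams
    INHIBITOR = "inhibitor"   # drawn as a flat-edged line


class Status(Enum):
    NORMAL = "normal regulation"
    LOST = "regulation lost"
    INVERTED = "regulated opposite to a healthy cell"


# Hypothetical toy pathway mapping each regulator to its role.
PATHWAY = {"PI3K": Role.ACTIVATOR, "PTEN": Role.INHIBITOR}


def pathway_status(regulators, lost):
    """Apply the all-or-nothing rule to one pathway.

    regulators: dict of protein name -> Role
    lost: iterable of protein names that are not expressed
    """
    lost = set(lost)
    missing_roles = {role for name, role in regulators.items() if name in lost}
    if Role.ACTIVATOR in missing_roles:
        # Assumption: a missing activator outweighs a missing inhibitor.
        return Status.LOST
    if Role.INHIBITOR in missing_roles:
        return Status.INVERTED
    return Status.NORMAL


if __name__ == "__main__":
    for lost in ([], ["PI3K"], ["PTEN"]):
        print(lost, "->", pathway_status(PATHWAY, lost).value)
```

Because each protein is either present or absent here, the sketch deliberately ignores partial expression and over-expression, which a fuller model would need.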


The results of my code give clues about the potential health of specific pathways in the cell. Molecular cancer treatments can function to rescue impacted cell regulation pathways from the consequences of mutations. The user input accepts a list of up to as many proteins as appear in the parameters shown above, and duplicates are not added. Examples of the results of my code are shown below:

[Screenshot: PI3K mutation results]
[Screenshot: wild-type organism results; completely healthy cell]
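A report like the ones pictured could be assembled along these lines. This is only a sketch: the function name `format_report`, the pathway labels, and the wording of the output are all assumptions for illustration and do not reproduce the screenshots.

```python
def format_report(pathway_intact):
    """Build a plain-text report from a dict mapping a pathway name to
    True (regulation intact) or False (regulation impaired)."""
    if not pathway_intact:
        raise ValueError("no pathways to report on")
    lines = [
        f"{name}: {'intact' if ok else 'impaired'}"
        for name, ok in sorted(pathway_intact.items())
    ]
    impaired = [name for name, ok in pathway_intact.items() if not ok]
    if impaired:
        summary = f"{len(impaired)} of {len(pathway_intact)} pathways impaired."
    else:
        summary = "No impaired pathways found."
    return "\n".join(lines + [summary])


if __name__ == "__main__":
    # Hypothetical pathway labels, used only to show the output shape.
    print(format_report({"apoptosis": True, "PI3K/AKT": False}))
```

Keeping the report as a returned string, rather than printing inside the function, makes it easier to check the output or reuse it elsewhere.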



If you have any questions, comments, or ideas of your own, please let me know. I will continue developing this code as more data surfaces with insights into the human genome and the cell cycle. My current troubleshooting tasks involve some accuracy discrepancies, but overall the code runs mostly smoothly. I will provide a brief conclusion next week regarding how humanity can use coding in biology and medicine, as well as what we must carefully consider as we take steps forward.


Special note: If you use these ideas in your own project or plan on reposting content from me, please cite me as a source. Any ideas for data collection and distribution to the masses should come with proper awareness of ethics. That said, I would love to follow along with anyone else pursuing these ideas, because I know coders out there can do a better job than I can, and I would enjoy seeing how this idea is realized by other minds.

3 thoughts on “Coding for Cancer”

  1. Artem Kaznatcheev

    It is always great to see people excited about applications of math and CS to cancer. I hope you are continuing with thinking in this area.

    However, I think you might be underestimating the difficulty of the problem you are setting yourself. Not only by reducing cancer to a genetic disease (it isn’t all mutations; epigenetics and the local microenvironment can matter a lot) but also by having too much confidence in the consistency and completeness of our knowledge of signaling pathways or how to affect them. If you are more interested in computational approaches to cancer, then I suggest searching around for “mathematical oncology”, and if you want to see some cutting-edge CS done on cancer, then you should try to attend the upcoming computational cancer biology workshop at the Simons Institute. They seem to be interested in the same bioinformatics approach as you:

    Computational Cancer Biology is a rapidly expanding area, utilizing deep sequencing techniques (“Next Generation Sequencing”) that facilitate the sequencing of tens of thousands of tumor genomes, along with other matching information. Large international projects are collecting and organizing this data, but developing powerful algorithms for the data analysis is a bottleneck. Current analysis techniques combine graph theoretic and machine learning approaches. One such line of work builds on the rich combinatorial and algorithmic theory of genome rearrangements. Another aims to improve classification of cancer patients and reveal biomarkers for specific disease subtypes, with the goal of improved diagnosis, prognosis and patient stratification. This workshop aims to survey the state of the art in this field and explore new algorithmic approaches with potentially large impact.

  2. James Wadman Post author

    Hey, thanks for the feedback!
    This code was my first attempt at describing a layman’s perspective on computational cancer genetics, and I have been developing much more complicated classes that can cross-check different data and form new correlative mathematical variables. I completely agree with your words about the complexity, which is why I want to make sure that this particular program is labeled as more of an introduction to a problem, rather than an introduction to the solution.
    I looked into your link and it looks really fascinating. I hope that mixed disciplines continue to be prevalent in the natural sciences, particularly when they combine biology with CS.

    I hope you will stick around to see the new codes I am working on, because those are the ones that are going to be more stimulating to mathematical/CS minds.

    Stay in touch,

  3. Sandeep Jain

    Hello James. I’m a medical student at Vanderbilt University and recently gave a quick presentation on the mathematical modeling of cancer to my oncology classmates & professors. Most of my presentation was derived from “Computational investigation of intrinsic and extrinsic mechanisms underlying the formation of carcinoma” by Dr. Vito Quaranta and a few of his colleagues. It’s an awesome article that I think you’d be very interested in. These authors coded a mathematical model of cancer that depends on virtual receptors on the cells they’re modeling. The receptors determine whether a cell continues to proliferate, enters a state of rest, dies, etc. Interestingly, the model was able to produce cell structures that looked exactly like those grown in wet labs. In other words, the model has been somewhat validated. I hope you find the paper interesting.

