List Serve for CECS 401: Special Topics in Bioinformatics



S.No QUESTION ANSWERS
1
I went ahead and looked up the running time of the bipartite matching
problem, and there is one fact that has to be considered. It seemed from
the group's presentation that we are matching molecules and that the points
being matched are already divided into two distinct sets. In this case, 
the problem IS solvable in polynomial time through graph algorithms 
(specifically, O(nm) with augmenting paths, where n and m are the number of 
vertices and edges in the graph, which translates loosely to O(n^3) because 
m = O(n^2)). The problem would be NP-complete if we didn't have the distinct 
entities to match and the points were all in one set, initially not partitioned.

(Steven Cummings)
Yeah, the bipartite matching algorithm, which is used in the old DOCK, belongs 
to the field of graph algorithms; the algorithm in the new version, 
DOCK4.0, is probably "divide and conquer". Thanks, Steve, for the important 
info. We are working on it and will show the details next Monday. 

(Ding)
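
For reference, the polynomial-time claim in this thread can be illustrated with a minimal sketch of the classic augmenting-path method for maximum bipartite matching, which runs in O(nm). This is only an illustration and not the algorithm used in DOCK; the function name `max_bipartite_matching` and the toy adjacency list are hypothetical.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum matching by repeated augmenting-path search, O(n * m).

    adj[u] lists the right-side vertices adjacent to left vertex u.
    Returns (matching size, match_right), where match_right[v] is the
    left vertex matched to right vertex v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, seen):
        # Try to match u, re-routing earlier matches if necessary.
        for v in adj[u]:
            if seen[v]:
                continue
            seen[v] = True
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, [False] * n_right):
            size += 1
    return size, match_right


# Hypothetical toy instance: 3 points on one molecule (left side)
# and 3 candidate points on the other (right side).
adj = [[0, 1], [0], [1, 2]]
size, match_right = max_bipartite_matching(adj, 3)
print(size, match_right)
```

Each of the n left vertices triggers at most one search over the m edges, which is where the O(nm) bound comes from.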
2
This question is for group 3 about the binding site.

I know X-ray is the major technique to determine the sizes and shapes 
of protein binding sites; however, the protein must be able to be
crystallized before X-ray is applied to it. 

The question is: if you can't crystallize a protein, are there any other 
methods to tell the properties of the binding sites? If so, how big a 
job is it?
(Hongying James)
  • NMR is another major technique to determine the structure of a protein
    and its binding site(s). 
    Kenzior Olga.
  • There are both direct and indirect 
    methods. Usually, after we have confirmed that two molecules interact with each other, 
    we need to determine the binding site (domain). 
    
    1. Indirect method: define the binding site from the sequence. 
    
    For example, the CAD protein is bound, and thus inhibited, by the ICAD 
    protein. To determine the CAD binding site on ICAD, we could systematically 
    delete parts of the amino acid sequence of ICAD, then do in vitro analysis (e.g. 
    immunoprecipitation) to find the functional sequence (domain) for CAD binding 
    in ICAD. 
    
    2. Direct method: NMR (Nuclear Magnetic Resonance) and X-ray. 
    
    By combining NMR and structure simulation, we could see the motif (3D structure) 
    of the binding site. 
    (Ding)
3
In the paper Group 4 presented, the Rosetta program is less accurate at 
predicting the initial and terminal coding exons because it predicts the 
initiation or stop codon as a splice site. I think that correctly predicting 
the initiation and stop codons in the initial and terminal coding 
exons should be very important. Does anybody have any ideas about how to improve 
the accuracy? 

Liu Shuyu
No reply!!!
4
It is well known that protein domains which perform a similar function 
often have a similar structure. These domains are grouped together in 
order to allow a single protein to perform a set of functions. Would it be 
possible to find these domains, and then design a custom protein by just 
grouping together all the domains which correspond to the desired functions?  
I'm sure people have already thought of this, but have they had any success? 

I think that the homologous sequence analysis stuff that we are learning 
about in class could be applied to the discovery of these domains. And 
maybe, when taking these domains as single entities (that is, ignoring the 
intradomain interactions), it would not be a very difficult molecular dynamics 
problem to predict their interactions with each other. They could then be 
arranged in such a way as to ensure that they do not interfere with each 
other's functions, and maybe they could even be configured to assist each other 
or cooperate.
What do you guys think, is this feasible? 

Michael Lawrence
  • Many proteins contain distinct domains separated by linker fragments. By knowing where these linkers occur within the DNA sequence, one can quite easily swap domains at the level of DNA (called a two-hybrid system). This is a very common experimental technique, in fact. The most well-known example of such a modular protein is the GAL4 transcription factor of yeast. GAL4 is a transcription factor that is required for cell growth on the sugar galactose. The GAL4 protein consists of a DNA-binding domain, which binds specific DNA sequences within yeast promoters, and an activation domain, which interacts with other proteins that aid in transcription and which is activated by galactose. A common experiment is to fuse the activation domain of an unrelated protein to the DNA-binding domain of GAL4. This new protein will now transcribe as a GAL4 protein but will be activated by a different substrate. The activity of the activation domain can then be assayed. This site explains it much better than I can (with pictures!): http://www.sb.fsu.edu/~hongli/BCH5425/note32.html
  • I'm no expert in molecular dynamics, but I think this is a trickier problem than you suspect, especially as the number of domains being considered increases. In many proteins, the activity of one domain is dependent on the activity of another. For example, in a typical two-component sensor protein, a substrate binds to the sensor domain, which in turn causes a conformational shift (change of shape) in the other domain, which induces autophosphorylation (the protein adds a phosphate group to itself). It's not always clear how these conformational changes occur, so you always run that risk when you start fusing proteins.
  • Chris Hemme
    5
    I have a question for group 7. Is NADH not produced in the second 
    pathway, the one passing through the shunt? 
    As I am a complete novice in the stuff you are dealing with, I would 
    like to know what the next step would be after you resolve the RBC network.
    
    Tulasi
    
  • In pathway 2, denoted p2 in Table 3 in the paper, the net reaction produces NADH but, unlike
    p1, no ATP. The reason it does not produce any net ATP is that the shunt in p2 skips the 
    ATP production. However, the NADH (in both p1 and p2) is produced in the GAPDH reaction 
    just before the shunt. The reason that p1 has no net NADH production is that the NADH is 
    oxidized back to NAD+ in the conversion of pyruvate to lactate at the end of the 
    pathway (p2 stops at pyruvate).
    
    Hope that makes sense, 
    Michael Lawrence 
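
The bookkeeping described above can be checked with a small tally. This is a sketch using textbook per-glucose glycolysis stoichiometry, not the values from Table 3 of the paper; the step names and the `net` helper are hypothetical, and the pathways are simplified to the cofactor-relevant steps only.

```python
# Per-glucose cofactor changes for each step (textbook values, simplified).
STEPS = {
    "hexokinase":      {"ATP": -1},
    "PFK":             {"ATP": -1},
    "GAPDH":           {"NADH": +2},
    "PGK":             {"ATP": +2},   # 1,3-DPG -> 3-PG, makes ATP
    "shunt":           {},            # 1,3-DPG -> 2,3-DPG -> 3-PG, no ATP
    "pyruvate_kinase": {"ATP": +2},
    "LDH":             {"NADH": -2},  # pyruvate -> lactate
}


def net(path):
    """Sum the ATP and NADH changes over the steps of a pathway."""
    totals = {"ATP": 0, "NADH": 0}
    for step in path:
        for cofactor, change in STEPS[step].items():
            totals[cofactor] += change
    return totals


# p1: glucose -> lactate through the normal PGK step.
p1 = ["hexokinase", "PFK", "GAPDH", "PGK", "pyruvate_kinase", "LDH"]
# p2: glucose -> pyruvate through the shunt (stops before LDH).
p2 = ["hexokinase", "PFK", "GAPDH", "shunt", "pyruvate_kinase"]

print("p1:", net(p1))
print("p2:", net(p2))
```

Under these assumptions p1 nets ATP but no NADH, while p2 nets NADH but no ATP, which matches the explanation in the reply.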
    
  • I read your question. The explanation is as follows: 
    In the Rapoport-Luebering shunt, instead of going from 1,3 DPG (diphosphoglycerate) directly to 
    3 PG (phosphoglycerate), which involves production of an ATP, the pathway passes from 1,3 DPG to 2,3 DPG 
    and then to 3 PG, thereby bypassing the normal route, so no ATP is produced at that 
    point. This shunt usually does not occur except when one goes to high altitudes, a response known as
    oxyhemoglobin modulation.
    
    In RBCs (and skeletal muscle), anaerobic glycolysis occurs, 
    in which pyruvate is further converted to lactate, and in this step NADH is converted into 
    NAD+. So the NADH produced in the conversion of GA3P (glyceraldehyde 3-phosphate) to 1,3 DPG is 
    used in the conversion of pyruvate to lactate. So there is "NO" net production of NADH in the conversion of 
    glucose to lactate.
    
    A relatively simple sketch of glycolysis can be seen at: 
    glycol.html.
    
    
    Now coming to your second question, the utility of resolving the RBC network: as you can see, the 
    most common pathway involves the conversion of glucose to lactate in the RBC (P1). P2 to P6 represent 
    the extreme pathways which can occur in the RBC. This way, any reaction occurring in the RBC can be limited 
    to 6 pathways, and the other 16 reversible reactions can just be excluded.
    
    Other examples of the use of this sort of network are as follows: 
    H. influenzae: a gram-negative pathogen that causes otitis media and acute and chronic respiratory tract 
    infections. Its network consists of 83 potential substrates, 50 products and 461 reactions.
    
    Similarly, H. pylori: the commonest cause of gastric ulcer. Its network consists of 583 reactions and 381 
    metabolites. The number of reactions and metabolites in these networks is enormously
    large. We conclude: 
    a) The algorithm cannot be parallelised easily and requires a fast processor with large memory. 
    b) Current calculation of the full pathway structure is infeasible (time and memory requirements 
    are too large). 
    c) We can restrict the output of the metabolic network to smaller subsets, 
    which is the philosophy of "Divide & Conquer". 
    We thereby study the entire network and limit it to the pathways of importance to us: 
    the pharmaceutical industry to manufacture drugs, biologists to know the rate-limiting steps.
    
    Note: GAPDH reaction (as mentioned by Lawrence): I have not heard of it. 
    Seth, Raman
    
    
  • 6
    Where does the fuzzy logic apply in the application of today's presentation? I am not 
    very clear on that. 
    
    Also, I have a doubt about this: e_i,k = {0,1}. I thought that if 0 and 1 indicated the presence 
    of causality, it would imply that from state i to state k there is a transition which 
    has the corresponding causality.
    
    That is just the way I understood it; I would like to know if what I perceived is correct. 
    
    Tulasi 
    
    We had a very interesting lengthy discussion over this, so the whole of the discussion 
    is at: fuzzy.doc
    
    7
    I have a question for group 7.  You mentioned that the dehydration reaction of glycolysis
    is mediated by the enzyme dehydratase.  How is this enzyme different from dehydrogenase?
    
    Stephanie Carns 
    
  • Stephanie, are you sure we mentioned "dehydratase"? If you look carefully at the paper, several dehydrogenases are listed in Table 1. Dehydratase is not a part of glycolysis. Olga (Group 7)
  • Not group 7, but... A dehydrogenase removes the equivalent of a hydrogen atom. For example:
    
    Glycerol + NAD(+) <= glycerol dehydrogenase => Glycerone + NADH + H(+)
    
    In this case, two protons (H+) and two electrons (e-) are removed from the alcohol glycerol to make the 
    ketone glycerone. The two electrons and one of the protons are accepted by the cofactor NAD(+). While the 
    protons do not come off as elemental hydrogen, a proton + an electron is the equivalent of a hydrogen atom, 
    so the process is known as dehydrogenation. Dehydrogenation is one of the primary means of enzymatic 
    oxidation of organic molecules. Dehydrogenases are members of the oxidoreductase class of enzymes, which 
    catalyze reactions involving the transfer of electrons.
    
    A dehydratase, on the other hand, removes the equivalent of a water molecule. For example:
    
    Glycerol <= glycerol dehydratase => 3-Hydroxypropanal + H2O
    
    A dehydratase is a member of the lyase class of enzymes, which catalyze reactions involving double bonds, 
    in this case between carbon and oxygen.
    
    Two good enzyme resources are the ExPASy database: http://us.expasy.org/cgi-bin/enzyme-search-cl and the 
    GenomeNet database: http://www.genome.ad.jp/ 
    The GenomeNet database allows you to search for a particular reaction by enzyme (GENES) and substrate 
    (LIGAND) and also allows you to search by metabolic pathway (KEGG), both in general and for specific 
    organisms.
    
    Chris Hemme
  • 8
    I remember from when I learned glycolysis and all the surrounding pathways 
    that there were many points along the pathways at which intermediates
    could enter into the pathway from other sources or pathways.  It seems like 
    that would mess up a method that used mass balance to recreate a pathway.  
    Figure 5 shows a pathway that is much more isolated from outside influence
    than it would actually be in nature.  Does this matter, or am I missing
    something in the paper?  
    
    Daniel 
    
    That is why they make the steady-state approximation S x v = 0 (stoichiometric matrix S times flux vector v). 
    Olga
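
A toy illustration of the steady-state condition may help here. In this sketch, an intermediate entering from another pathway is simply one more column (an exchange flux) in the stoichiometric matrix, and the mass balance still requires the matrix times the flux vector to be zero. The two-metabolite network, the flux values, and the helper names are all hypothetical and are not taken from the paper.

```python
def mat_vec(S, v):
    """Multiply the stoichiometric matrix S (rows = metabolites) by fluxes v."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """True if every metabolite is produced exactly as fast as it is consumed."""
    return all(abs(r) <= tol for r in mat_vec(S, v))


# Columns (fluxes): v1: -> A,  v2: A -> B,  v3: B -> out,  v4: -> B (side input)
S = [
    [1, -1,  0, 0],   # metabolite A
    [0,  1, -1, 1],   # metabolite B
]

balanced = [1, 1, 3, 2]     # the side input of 2 is carried away by v3 = 3
unbalanced = [1, 1, 1, 2]   # B would accumulate

print(is_steady_state(S, balanced))
print(is_steady_state(S, unbalanced))
```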
    
    9
    I just have a quick question for Group 8.  In the paper "Creating Metabolic 
    and Regulatory Network Models using Fuzzy Cognitive Maps", in Figure 3 it 
    says that white means the node was on and black means the node was 
    off....what are the gray sections??? 
    
    Cammack, Kristi M.
    
    The logistic function used actually did allow for values in between.  Because of the
    choice of c=1,000 only rarely does any input (yi) attain a value other than 0 or 1, 
    but it does happen sometimes, hence some values were 0.5 and gray.
    
    Bill 
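
To see why in-between values are rare but possible, here is a small sketch of a steep logistic function. The exact form used in the paper is not reproduced here; the form 1 / (1 + exp(-c * x)) and the function name `logistic` are assumptions, with c = 1000 taken from the reply above.

```python
import math


def logistic(x, c=1000.0):
    """Numerically stable logistic 1 / (1 + exp(-c * x))."""
    z = c * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so this cannot overflow
    return ez / (1.0 + ez)


for x in (-1.0, -0.01, 0.0, 0.001, 0.01, 1.0):
    print(x, logistic(x))
```

With such a steep slope, almost any nonzero input saturates to 0 or 1; only an input at (or extremely close to) zero yields a value such as 0.5, which would show up as gray.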
    
    10
    For today's class, the network structure-enumerating algorithm: 
    Does this method guarantee a solution compatible with the data? Or, 
    in other words, is it true that if the algorithm cannot get a solution, 
    it means that the data measured in the experiment is wrong? 
    
    So I am still wondering if the proof of the effectiveness of the method 
    is true. Also, it does not give general constraints on how to construct 
    the network; it just shows the example of a multi-variate (level) gene in 
    the network. 
    
    hai
    
  • Hai! I feel that in general it may not be possible to construct a right model with wrong data. Moreover, even if we have the right data, we must be sure that it is representative of the actual model we want to construct. In other words, we must have samples that represent every possible behavior of the system to construct a perfect model. So I feel the quality of the model is more data dependent than method dependent. Regarding the effectiveness of this method, I feel it is better if we wait and see what other team members (Ozy and Richa) have to say. Maybe we can have some criterion function to decide, at any point during the execution, which is the best of the models generated so far. We could also have some optimization techniques to speed up the process of generation, validation and evaluation of models. Does anyone have any other opinions? Rajkumar Bondugula
  • Hi! As for the model, since we are using the reverse engineering process, i.e. the data is used for modelling, Raj is very much correct about the model being more data dependent. Also, the criteria under which we get the data and factors such as the accuracy of the data will be covered next week by Ozy and Richa, when they discuss the identification of gene regulatory networks from gene expression data, and I think you will all have to wait till then to get the exact answer to your questions. Gaurav.
  • 11
  • Hello, 
    In the paper "Identifying Gene Regulatory Networks from Experimental Data" 
    in the Activation-Inhibition Scores section, Figure 3 shows an example of a 
    "good candidate activator" and a "good candidate inhibitor".  I can follow the 
    Activators graph up until about time step 12.  It seems like between time step 
    12 and 14, the decrease in concentration of the activator actually "causes" the 
    concentration of the activee to increase.  
    Two questions: 
    1) Is this accounted for when calculating the activation grade they mentioned 
    (where they look at the peaks, leading edges, etc.)  In other words, can we just 
    ignore what happens between these two time steps simply because things look nicer 
    for the rest of the graph? 
    and 
    2) Why in nature would the activee concentration continue to rise even after the 
    concentration of the activator has decreased quite a bit?  Should we guess that some 
    of the transcription factors have remained bound to the binding sites?
    
    Seth 
    
    P.S.  There's a similar (but opposite) effect between time steps 13 and 14 of the 
    Inhibitors graph. 
    
    
  • Hi, 
    
    Are c1 and c2 both dependent on each other (in the graph)? If so, in what way? 
    
    Tulasi 
    
  • Tulasi, 
    
    1) As it is a feedback network (which is obvious), the input for each binding site is 
    always (in the considered network) the output from the previous substance generator. The 
    input for the binding sites b2 and b1 is always the output generated by the substance 
    generators r1 and r2 respectively.
    For this reason, I think the concentrations, e.g. c1 and c2, are dependent on each other. 
    2) As the final output is evaluated by considering several combinations of these networks,
    there might be a situation where the concentrations are independent.
    
    I am not sure whether this is a fair answer. 
    
    Bhavani 
    
  • I think it is fair to say that concentration A depends on concentration B if some change 
    in concentration B crosses over a threshold and triggers a substance generator which affects 
    concentration A.
    
    So the dependence of the concentrations is explicit in the model. 
    Does this sound reasonable? 
    
    Michael 
    
  • Seth, I am no biologist, and we have none in our group. So I have very little insight into what is actually going on here. I was asking the same question too. The paper does not describe the "boundary condition" strategies when determining the edge Activator or Inhibitor strength. What happens in Figure 3A between time steps 12 - 14 is quite puzzling for me too. But isn't it possible that cluster 272 (the regulated cluster) may be affected by a different activator (other than cluster 249) during observation in time steps 12 - 14? Since no boundary condition is described, I guess that they only pay attention to peaks that are well represented (have a leading edge, max point, and trailing edge), and the peaks on both cluster 249 and cluster 272 from time step 1 to 10 seem to be the only ones to satisfy these criteria. There is a peak from time step 10 to 14 on the cluster 249 profile; however, the profile of cluster 272 during the same time steps does not yield a complete peak definition (the trailing edge is not known). But again, this is just me reading between the lines. I understand that a single inhibitor can beat some activators and suppress the expression level of a regulated gene. So what happens in Fig. 3B sure raises some questions for me too. Unless my understanding of how inhibitors and activators affect a regulated gene is wrong. Ozy
  • 12
    I just had a quick question for group 9 about the data that was shown at the 
    beginning of the reverse engineering section.  This is the graph with the colored 
    circles representing different concentrations at different timepoints.  Was that 
    actual microarray data or an example of microarray data?  I am kind of confused 
    about where you start from in reverse engineering.  The data shown seemed too 
    orderly to come from an actual experiment.  Thanks.  
    
    
    Daniel 
    
    I am not the presenter of the 1st paper, but the data shown in paper 1 does look too orderly. In 
    paper 2 (that we missed successfully), they used the expression levels of gene ORFs of Saccharomyces 
    cerevisiae (6601 genes total) with 17 observations/measurements taken at 10-minute intervals. The 
    authors of paper 2 did not produce the data; the data was first presented by R. Cho et al., "A 
    genome-wide transcriptional analysis of the mitotic cell cycle", Molecular Cell, 2:65-73, July 1998.
    
    Paper 2 can be viewed as reverse engineering approached from a different angle. They first pruned the 
    genes by keeping only those that are active and have discernible peaks in their ORF profiles. Rather 
    than model the network mathematically, paper 2 clusters genes that behave similarly using an average 
    linkage (hierarchical and agglomerative) clustering algorithm.  
    
    Next they use a continuous weight function (I am tempted to call it a fuzzy membership function, 
    because it looks like one) to determine the type of edge that connects two clusters, Cr and Cs. Cr 
    is said to be the activator of Cs if the strength of "activator" is higher than that of "inhibitor", 
    and vice versa. They determine this for all possible pairs of clusters, essentially creating a directed 
    graph whose edges are labeled A or I and assigned an appropriate weight to show how strong the influence 
    (A or I) of the regulatory cluster on the regulated cluster is.
    
    Next, they apply a simulated annealing based optimization algorithm (known to be good at finding "a 
    solution" to highly combinatorial problems) to find the optimal subgraph based on some 
    constraints. The constraints they first use are quite restrictive: for example, they only allow each vertex to 
    have two in-edges (one A and one I, which are the ones having the highest edge weight of each type). 
    They also want to determine which vertices act as regulatory elements. Hence, each vertex is gradually 
    labeled as A, I or N (I guess N means it is a regulated one). Again, a vertex labeled as A or I 
    does not receive input edges from other vertices, and vertices labeled as N cannot regulate other 
    vertices (they receive but do not send). The constraints are quite restrictive, but I think they just 
    try to limit their test case to show that the optimization can work for a simplified form of regulatory 
    network. They show some results for this.
    
    Later (Richa Puri was supposed to present this), they also propose some theorems that explore the 
    complexity and possible solution search space for the maximum gene regulation problem (where the constraints 
    they used previously are relaxed).
    
    So, if we had been able to present the second paper, I think we would have a much better view of what 
    could be done with gene regulatory network.
    
    Ozy 
    
    13
    Hi, all: 
    We discussed in the class what data the biologist should give to the computer engineer. 
    It seems that the biologist should try each combination of the binding sites and it 
    becomes an NP-hard problem. So my question here is how much experimental data is enough? 
    
    Since team 9 will talk about how to get the data, could you explain this in your 
    next presentation? 
    
    Liwen Tu 
    
    We are not going to talk about "how" to get the data. I think there are Biologists in our class who  
    are more qualified to explain this. 
    
    The second paper is a different approach to "reverse engineering". As a computer engineer, I just look 
    at it as a clustering and graph optimization problem. The use of an optimization algorithm with a 
    specific criterion function (hence defining our constraint on the networks produced) allows us to come 
    up with a solution which optimizes the given criterion function. Is this the best solution? No, but 
    this is one possible solution for which the constraints are satisfied to a certain degree. I do not 
    think there are any rules on how many data points are needed in order to get good results. The more 
    the better, and this is always true in both supervised and unsupervised learning problems, perhaps 
    more true in unsupervised cases like this. 
    
    Ozy 
    
    14
    Hello: 
    
    Team 9 presented that gene profiling could also be very useful to identify a 
    regulation network. However, the regulatory relations between genes could be 
    ambiguous in this method. For example, as raised in one of the questions in today's 
    class, when the peak of gene B follows that of gene A, there are several possibilities: 
    
    1. gene A activates B 
    2. gene B inhibits A  
    
    How could we distinguish these? 
    
    I know that "genome wide gene profiling" is successfully used to identify genes 
    that are regulated by a certain signal. So we could knock out gene A and do the 
    profiling again. If the expression of B doesn't change, we could say "gene B 
    inhibits A in a negative feedback loop". If the B expression increases, we 
    could say "gene A activates B". 
    
    Ding 
    Team #3 
    
  • Ding, that is a good idea. However, the suggested method would require the expression observation to be redone; in this case you will need to repeat the profiling for 6600 genes (after knocking one out). Wouldn't that increase the cost of the experiments substantially, especially if you have to do this for every possible pair of genes out of 6601 (well, 3000+ after filtering)? My suggestion is this: repeat the gene profiling process and increase the number of observations (significantly). We need to have enough sampling data to allow more than one peak in our profile. Multiple peaks may be used to confirm the existence of an activator/inhibitor relation between two genes. If a profile A contains, say, 3 peaks, and profile B also contains 3 peaks, we then compute the candidate activator/inhibitor level between each pair of peaks following their temporal order of appearance. If all 3 pairs of peaks show the same level of activator/inhibitor relation, then we can assign the A or I edge label more confidently. However, in the case of an activator relation, it is conceivable that there will be an inhibitor for profile B that shoots up at the time of one of the peaks in A, hence suppressing B even in the presence of a peak of A. In this case it may be a good idea to "measure" the inhibitor relation first, and use the data to account for any possible coexistence of both activator and inhibitor when encountering a case like the one described above. It will be a slow process, but at least we don't need to go back and regenerate the profiles all over again. Ozy
  • Hi, Ding: It is a good idea to construct genetic networks through a carefully designed experiment. But I have a different interpretation of your experiment. Given that gene A is knocked out: if B expression increases, we should say A inhibits B; if B expression decreases, we should say A activates B; if B expression doesn't change, we should say A has nothing to do with B in regulation. Mingshu
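
The three-way interpretation of the knockout experiment can be written down as a simple decision rule. This is only a sketch: the function name `interpret_knockout`, the expression values, and the tolerance `tol` for "doesn't change" are hypothetical, and a real analysis would need replicates and a statistical test rather than a fixed cutoff.

```python
def interpret_knockout(b_before, b_after, tol=0.1):
    """Classify the effect of knocking out gene A on gene B's expression.

    b_before: expression level of B with A intact
    b_after:  expression level of B with A knocked out
    tol:      changes smaller than this count as "no change" (assumed cutoff)
    """
    change = b_after - b_before
    if change > tol:
        return "A inhibits B"       # removing A released B
    if change < -tol:
        return "A activates B"      # removing A lowered B
    return "A does not regulate B"


print(interpret_knockout(1.0, 2.5))
print(interpret_knockout(1.0, 0.2))
print(interpret_knockout(1.0, 1.05))
```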
  • 15
    This is a question for all the "network" groups:
    
    I have noticed that in the "network" presentations, concentrations and other parameters
    (binding constants, dissociation constants, etc.) are the factors in creating or 
    identifying a network model; however, temperatures are not considered. Are we assuming 
    the temperature is always 37 C or room temperature? I think binding constants and 
    dissociation constants are temperature dependent?
    
    Also, would the solvents affect the network?
    Hongying James 
    
    It's probably safe to assume 37 C unless you're dealing with proteins designed to operate at much 
    higher or lower temperatures.  The activities of most proteins occur over a narrow range of 
    conditions, so changing the temperature too much will inhibit protein activity.
    
    As far as solvents:  since we are talking about proteins, water is really the only solvent you 
    need to consider.  Correct protein folding requires a polar solvent and many (if not most) 
    proteins require specific physical interactions with water molecules.
    
    Chris Hemme
    
    16
    I am a little confused with the optimization function: 
    f(g) = -C1(count(A) + count(I)) ... 
    
    Is this C1 a function or a constant?  All the other C's in the paper appear to be 
    constants, but the description of the function says that this term is a penalty 
    for unlabeled vertices.  So is C1 a function that subtracts the number of activating 
    and inhibiting vertices from the total number of vertices?
    
    Also, they are aiming to maximize this function, but the description of simulated 
    annealing given in class was to minimize an energy function.  So in this case would 
    they use gradient ascent instead of gradient descent?
    
    Michael Lawrence
    
  • Gradient ascent and descent are the same algorithm: just multiply the function to be maximized/minimized by -1 to switch between the two. For example, maximizing f(g) = -C1(...) is the same as minimizing -f(g) = C1(...). One person's maximization problem is another's minimization problem... Wade
  • What Wade said. The main difference between gradient descent and simulated annealing is the way we come up 
    with the next state. In gradient descent, the next state is determined by the direction of the error 
    gradient (in the case of a minimization problem). So if your error at two consecutive times t and (t+1) 
    shows a continuously declining trend, then you want to continue pushing your system state in the same 
    direction as you did when moving from t to (t+1). Simulated annealing applies a random state displacement 
    that may take your system in either direction (to a lower or higher energy state). If your new energy 
    state is lower than before, then you would normally adopt the new state as your starting state for the 
    next cycle. However, if your new energy state is higher than your last one, then you compute the 
    probability that your next state will be lower. 
    
    Several things that you can play with when dealing with simulated annealing: 
    1. Control the range of your "random" displacement. If this is too large, then your system may bounce 
    all around the energy surface and may be hard to converge. But if it is too small, it may cost you a 
    long convergence time. 
    2. Find the correct probability measure for when the new energy state is higher than before. 
    3. Determine the right threshold on the probability measure for accepting the new state if its 
    energy is higher than the one before. 
    
    I have never used simulated annealing myself, but the 3 points above sound reasonable. 
    Ozy
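
To make the discussion concrete, here is a minimal simulated annealing sketch. It uses the standard Metropolis acceptance rule, accepting a worse state with probability exp(-dE / T), which is a common textbook choice and not necessarily what the paper used. The function f, the step size, the starting temperature, the cooling rate, and the iteration count are all hypothetical. A maximization problem is handled by minimizing the negated function.

```python
import math
import random


def simulated_annealing(energy, x0, step=0.5, t0=1.0, cooling=0.995,
                        n_iter=5000, seed=0):
    """Minimize energy(x) over real x; returns (best_x, best_energy)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(n_iter):
        # Random displacement; `step` controls its range.
        cand = x + rng.uniform(-step, step)
        e_cand = energy(cand)
        delta = e_cand - e
        # Always accept improvements; accept worse states with a
        # probability that shrinks as delta grows or temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling  # geometric cooling schedule
    return best_x, best_e


def f(x):
    """Toy function to MAXIMIZE; its peak is at x = 2."""
    return -(x - 2.0) ** 2


# Maximize f by minimizing -f.
best_x, best_e = simulated_annealing(lambda x: -f(x), x0=-5.0)
print(best_x, -best_e)
```

The step size corresponds to point 1 above, and the acceptance probability together with the cooling schedule corresponds to points 2 and 3.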
  • 17
    The purpose to smooth a function is to make it differentiable, is this correct?
    How do you smooth a function?
    Let's say a function:
    y=x, where x belongs to [0,1]
    y=-x+2, where x belongs to (1,2]
    How do I smooth this function? This is a research-related question. If it takes time to explain, 
    would you give me a pointer to the relevant subjects? 
    
    Your help will be greatly appreciated!
    
    Mingshu
    
    One purpose of smoothing a function is to make it meet certain "regularity" conditions, which relate to conditions 
    on its derivative. In practice, smoothing is mostly done on raw data, which is thought to approximate 
    an underlying function. Since the smoothing is done on the data, rather than on analytical functions 
    (which rarely exist in practice), there is no closed form for the resulting smoothed function, just  
    the new data points from the smoothed function. 
    
    Smoothing is typically done to de-noise a set of data (density estimation, nonparametric regression) in  
    order to capture the underlying function. 
    
    There are dozens of smoothing algorithms, but the basis for many is the kernel smoother. Basically, a 
    kernel function (typically a density function, since a kernel K(x) should integrate to 1) is used to 
    reweigh the points. The weights are assigned a value from the kernel, and the kernel is passed along the 
    function using a "smoothing window". A larger window results in a smoother function that is less
    representative of the local behavior than the raw data. A shorter window is less smooth, but more  
    representative. The tradeoff is more bias for less variance, and vice versa. 
    
    A simple example of another type of smoother is the moving-average smoother. To use it on your example above, 
    sample your function n times (you pick n) over the range [0,2]. So you have a collection of n  
    points (xi, yi). Each new smoothed point will be the average of the nearest k points centered at (xi, yi); k  
    here would be your window. A kernel smoother would just assign different weights to the average, so 
    that it is a weighted average. 
    
    See Silverman (1986) "Density estimation for statistics and data analysis",chp 3, for more details. 
    
    Wade
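
As a concrete sketch of the moving-average smoother applied to the piecewise function from the question (y = x on [0,1], y = -x + 2 on (1,2]): the sample count n = 21, the window k = 5, and the choice to shrink the window at the ends of the range are all arbitrary assumptions made for illustration.

```python
def tent(x):
    """The function from the question: rises on [0, 1], falls on (1, 2]."""
    return x if x <= 1.0 else -x + 2.0


def moving_average(ys, k):
    """Centered moving average with odd window k.

    Near the ends, the window is truncated to the points that exist.
    """
    half = k // 2
    out = []
    for i in range(len(ys)):
        lo, hi = max(0, i - half), min(len(ys), i + half + 1)
        window = ys[lo:hi]
        out.append(sum(window) / len(window))
    return out


n = 21
xs = [2.0 * i / (n - 1) for i in range(n)]   # n samples over [0, 2]
ys = [tent(x) for x in xs]
smoothed = moving_average(ys, k=5)

for x, y, s in zip(xs, ys, smoothed):
    print(f"{x:.1f}  raw={y:.2f}  smoothed={s:.2f}")
```

On the straight segments the average equals the original value, while the sharp corner at x = 1 is rounded off. The result is still a set of points, not a closed-form function.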
    
    18
  • Hi! the link to the classification of edges is as follows: http://www.cs.sunysb.edu/~skiena/gene/jizu/edge_function.htm. secondly how can we come to know from graph whether a activates b or a inhibits b.
  • Hi! I have sent in the link for the classification of edges, and my question is: which edges did you use in the paper? The trio (Tim, Vladimir & Skiena) gave a PowerPoint presentation; the link is http://www.cs.dartmouth.edu/~brd/Teaching/Bio/2000/Presentations/zack2.ppt and on slides 13 and 14 the authors themselves are skeptical about whether A activates B or vice versa, so I do not know how you can be so sure about it. Please send me the link for the graphs shown in class of A activating B and A inhibiting B.
  • Hi! The link to the classification of edges is as follows: http://www.cs.sunysb.edu/~skiena/gene/jizu/edge_function.htm. Secondly, how can we tell from the graph whether A activates B or A inhibits B? Raman
  • If you are referring to the graph in the second paper of group 9, I think all edges have arrows pointing to the regulated vertices. If you backtrack along an arrow, you will see that it originates either from an A-labeled vertex (activator) or an I-labeled vertex (inhibitor).
  • discussion cont. Ozy
  • 19
    Hello All, 
    
    Any pointers to microarray robots? Are they only capable of probing and 
    scanning data, or can they also do clustering of data?
    
    Sarada. 
    
  • Here is a link describing a microarray robot. http://arrayit.com/PDF/SpotBot_Protocol.pdf I think robots are used just for handling the mechanical work with better precision than a human could achieve. I do not think that they are involved in the analysis of the data. Michael
  • There is a series of robot-like systems for handling the data generated by microarrays, but they do not change the concept of the microarray. Ma
  • 20
    In the rare event that a data point can be assigned to either of 2 clusters
    (in nearest neighbor clustering), how is it determined to which cluster the 
    point should be assigned?  Can this affect the results/interpretation of 
    the results of the experiment?
    
    Stephanie
    
    Stephanie, 
    Just to clarify, it cannot happen in the nearest neighbor algorithm because it 
    proceeds sequentially. Such a point would be assigned to the first cluster 
    that was within the threshold t. You are probably thinking of k-means. It is 
    conceivable (in k-means) that there could be a situation where a point in 
    a cluster is just as close to the mean of its own cluster as it is to the 
    mean of another cluster. I don't know what would happen in a situation like 
    that, but I think it would be a fairly rare event in a high-dimension data 
    set (d>15) of a decent size (n>100). Even for noncontinuous data, the 
    multivariate mean vector would be "nearly" continuous over a certain 
    hyperspace. Assume a worst-case scenario: this event did happen, and 
    the point is always assigned to the opposite cluster. Then the 
    cluster assignment would just flip-flop as the algorithm got close to 
    converging (assuming everything else held constant). From a statistics point of 
    view, it wouldn't really matter what group this point was assigned to 
    because it is "equally" similar to each group. Clustering is an exploratory 
    tool, so such situations shouldn't be of serious concern. We are not truly 
    testing a hypothesis with clustering in the strict sense of the term 
    "testing". 
    Wade
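    
    The reply above leaves open what k-means does on an exact tie. One common convention
    (an assumption here, not something stated in the thread) is to break ties deterministically
    toward the lowest-numbered cluster, which rules out the flip-flop scenario. A minimal
    sketch of such an assignment step, with hypothetical centers and a hypothetical tied point:

```python
import math


def nearest_center(point, centers):
    """Return the index of the closest center; exact ties go to the lowest index."""
    best_index = 0
    best_dist = math.dist(point, centers[0])
    for i in range(1, len(centers)):
        d = math.dist(point, centers[i])
        if d < best_dist:  # strict '<': a tie keeps the earlier cluster
            best_index, best_dist = i, d
    return best_index


# Hypothetical cluster means and a point exactly equidistant from both.
centers = [(0.0, 0.0), (4.0, 0.0)]
tied_point = (2.0, 3.0)

assigned = nearest_center(tied_point, centers)
```

    With this rule the tied point always lands in the same cluster on every iteration, so the
    tie by itself cannot keep the algorithm from converging.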
    
    21
    How is the intensity of the color measured when microarrays are used?  If it is 
    a method similar to spectrophotometry, how is the discrepancy between the different 
    wavelengths of red and green light settled, or is this not a concern?
    
    Stephanie
    
     
    
  • My understanding is that the two wavelengths (green from the reference and red from the target) mix, and from the resultant wavelength one can determine the ratios of expression. Michael
  • A laser is used to excite the fluorescent tags (typically Cy3 and Cy5) that are attached to the target DNA. The intensities of the red and green colors emitted by the fluorescent tags are detected separately, using filters that allow only light of specific wavelengths to pass. Yellow is presented in papers where both red and green light are detected from a single spot on the array. The yellow color is not actually detected by the microarray scanner. Rather, the yellow coloring is created by the scanner and is known as a pseudocolor. The primary purpose for showing yellow is to provide visual interest to the people reading the article. Aaron
  • 22
    Hello, 
    I got 2 questions and 1 comment: 
    
    Question 1: 
    In the Eisen paper, the last few lines of the first paragraph on page 14865 say "When designing 
    experiments, it may be more valuable to sample a wide variety of conditions than to make 
    repeat observations on identical conditions"
    
    What can the conditions be? Are they factors able to regulate the expression of different 
    genes? Figure 2 provides an example of some processes, such as high 
    temperature or low temperature; are these the conditions?
    
    Question 2: 
    The conclusion of the paper is something like: based on the similarity in pattern of gene 
    expression, some yeast genes can be clustered in a group in which they have similar functions
    (where the function is already known). Coexpression of genes of known function with poorly 
    characterized or novel genes may lead to the functions of many genes. 
    
    At the end of the description of figure 1, it says "These clusters (A, B, C, 
    D, E) also contain named genes not involved in these processes and numerous uncharacterized 
    genes." 
    
    Based on these facts and conclusion, if we analyze some microarray data by clustering and 
    we already know the functions of most genes in a cluster, can we infer that other new genes with 
    unknown functions have functions similar to those of the known genes?
    
    Comment: 
    In hierarchical clustering, regarding those 3 algorithms (single-link, complete-link, and 
    group average), and even the later k-means methods using single-link, I think the selection of 
    algorithm should be based on what kind of data you have. If the data are more evenly 
    scattered within clusters, the average may be better than single-link.
    
    Any explanations and comments are welcome! 
    
    Shuyu 
    
  • I will comment on your final comment. Linkage methods (single, complete, average) are used for estimating the distance between clusters, though they can also apply to two single points. The k-means clustering algorithm uses the distance between a data point and the representative of a cluster (normally just the mean of the cluster); the distance here is a distance between points, not one computed using linkage methods. Linkage methods are used in other clustering algorithms, such as hierarchical methods, but the simplest k-means algorithms do not use them. Yes, normally average linkage performs better than single linkage. Wade said average linkage and complete linkage satisfy the three properties of distance metrics, so they can really be treated as "distances", but single linkage does not satisfy the triangle inequality. Hai
  • Shuyu, I can answer a few of your questions. You asked: "Based on these facts and conclusion, if we analyze some microarray data by clustering and we already know the functions of most genes in a cluster, can we infer other new genes with unknown functions have similar functions with those known genes?" This situation is similar to "correlation does not imply causation". But it is a good place to start looking for connections. Remember, clustering is an exploratory tool. Results from clustering help guide us in the direction of further investigation; they are not the end of the investigation. So, I would say that the next step would be to take a more focused approach to studying the relationship between the "new genes" and their possible function based on the genes they were clustered with... You also commented: "In hierarchical clustering, regarding those 3 algorithms, single-link, complete-link, and group average, even the later K-mean methods using single-link, I think selection of algorithms should be based on what kind of data you have. If the data is more evenly scattered within clusters, the average may be better than single-link." You are correct about the linkage methods. Single linkage methods are best used on data that are "stringy" and complete linkage is best used on groups that are "clumpy". Here is an excerpt from http://www.statsoftinc.com/textbook/stcluan.html. Single linkage (nearest neighbor). As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains." Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate. Wade
  • Shuyu, the genes in one cluster might have the same function, but they might also have different functions and be regulated by the same stimulus. Such genes might be involved in the same biological process (replication, for example). There are many proteins with different functions that are involved in DNA replication, but they are upregulated in cells simultaneously by the same stimuli. Olga.
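    
    The three linkage rules discussed in the replies above (single, complete, group average)
    differ only in how they turn the point-to-point distances between two clusters into one
    number. A minimal sketch, assuming Euclidean distance and two small made-up clusters:

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between a point of cluster_a and a point of cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance of the two closest objects (nearest neighbors)."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Greatest distance between any two objects (furthest neighbors)."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances (group average)."""
    dists = pairwise_distances(cluster_a, cluster_b)
    return sum(dists) / len(dists)


# Hypothetical clusters: a "stringy" chain and a small "clump".
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
clump = [(4.0, 0.0), (4.0, 1.0)]

d_single = single_linkage(chain, clump)      # 2.0, set by the end of the chain
d_complete = complete_linkage(chain, clump)  # sqrt(17), set by the far end of the chain
d_average = average_linkage(chain, clump)
```

    Single linkage only sees the closest pair, which is why it tends to string objects together
    into chains; complete linkage penalizes elongated clusters because the far end dominates.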
  • 23
    
    The features of the data are expected to be as independent as 
    possible. Replicated (or highly correlated) features cause the problem of 
    the "curse of dimensionality". For an extreme example, suppose we originally have 
    only two independent features, then replicate the second feature 
    100 times, giving a data set with 101 dimensions. Now if we treat the 
    dimensions as orthogonal by choosing the Euclidean distance metric, the 
    differentiating ability of the second feature has been amplified a 
    lot, and the effect of the first feature disappears. 
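    
    The extreme example above can be checked numerically. This is a small sketch with three
    made-up points; the point values are assumptions chosen only to make the effect visible.

```python
import math


def replicate_second_feature(point, copies=100):
    """Keep feature 1 once and repeat feature 2 `copies` times (1 + 100 = 101 dims)."""
    return [point[0]] + [point[1]] * copies


# Hypothetical points in the original two-feature space.
a = (0.0, 0.0)
b = (3.0, 0.0)  # differs from a only in feature 1
c = (0.0, 1.0)  # differs from a only in feature 2

# Original space: c is the nearer neighbor of a (distance 1 versus 3).
dist_ab = math.dist(a, b)
dist_ac = math.dist(a, c)

# 101-dimensional space: the feature-2 difference is counted 100 times,
# so the a-c distance grows to sqrt(100) = 10 while a-b stays at 3.
dist_ab_rep = math.dist(replicate_second_feature(a), replicate_second_feature(b))
dist_ac_rep = math.dist(replicate_second_feature(a), replicate_second_feature(c))
```

    After replication the nearest neighbor of a flips from c to b, even though no new
    information was added to the data.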
    
    So how do we reduce the features? One way is to use the covariance matrix of 
    the data to show the shape of the data set in the high-dimensional 
    space (we just compute the eigenvalues and discard those directions, 
    given by the eigenvectors, that are associated with very small 
    eigenvalues). One problem with this approach is that it may not be correct when 
    the data set looks like two parallel line 
    segments close to each other, where each line is a cluster (this 
    approach keeps only the direction of the lines and discards the direction 
    normal to them, which is the one that really differentiates the two 
    clusters). Another way is to compute the correlation of each pair of 
    features; if the (normalized) result is too high, the two 
    features are probably highly correlated. However, sometimes one feature 
    is a non-linear function of another feature, and this approach may 
    not reflect the relationship between the two. 
    
    I guess replication of features is common in microarray data, so this 
    issue is really important. Any other ideas? 
    
    Hai
    
  • The method you describe based on the eigenvalues of the covariance matrix is known as principal components (PC). This is a widely used method for reducing the dimensionality of a data set. Linear combinations of selected variables are used to create new data of a much smaller dimension. The heart of this approach is the Spectral Decomposition Theorem. A limitation of this method is that LINEAR combinations of the old variables are used to create the new data (as Hai said). Unfortunately, linear may not be the best approach, but it is the most tractable way of handling the problem. Wade
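    
    The two-parallel-lines caveat raised in this thread can be reproduced in two dimensions,
    where the eigen-decomposition of the covariance matrix has a closed form. A minimal
    sketch; the line coordinates and the spacing of 0.2 between the lines are assumptions
    made for illustration.

```python
import math


def mean_and_covariance_2d(points):
    """Mean and sample covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def principal_component(points):
    """Eigenvalues (largest first) and leading unit eigenvector of the covariance."""
    mean, (sxx, sxy, syy) = mean_and_covariance_2d(points)
    center = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    eigenvalues = (center + radius, center - radius)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the major axis
    return mean, eigenvalues, (math.cos(theta), math.sin(theta))


def project(point, mean, direction):
    """Coordinate of a point along the given direction, after centering."""
    return (point[0] - mean[0]) * direction[0] + (point[1] - mean[1]) * direction[1]


# Hypothetical data: two parallel line segments, each one a cluster.
line_a = [(float(x), 0.0) for x in range(11)]
line_b = [(float(x), 0.2) for x in range(11)]

mean, eigenvalues, direction = principal_component(line_a + line_b)
proj_a = [project(p, mean, direction) for p in line_a]
proj_b = [project(p, mean, direction) for p in line_b]
gap_after_projection = abs(sum(proj_a) / len(proj_a) - sum(proj_b) / len(proj_b))
```

    The leading eigenvector runs along the lines, and the small eigenvalue belongs to the
    normal direction. Keeping only the first component therefore maps both clusters onto
    the same interval, and the separation between them is lost.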
  • 24
    DNA chip vs Microarray
    
  • Many people confuse these two concepts. As far as I know from references, DNA chip and microarray are two different but, of course, related concepts. They are distinguished by the two kinds of immobilized sequences shown this afternoon. DNA chip: known single-stranded DNA fragments of about 20 bp synthesized on a chip. It was used for sequence studies of small genomes, like yeast, but rarely for gene expression studies. If the wash conditions are appropriate, the signals can easily be classified into two discrete classes, presence or absence. This is because even one mismatch out of 20 (the two ends are exceptions) will prevent specific hybridization. Microarray: about a few hundred double-stranded cDNAs printed on a chip. It is used for gene expression studies. The signal detection levels vary. Ma
  • Microarrays can take many forms besides a few hundred double-stranded cDNAs printed on a chip. Transcript abundance is commonly analyzed using cDNA microarrays that can have up to 40,000 spots per slide. Additionally, oligonucleotide arrays are also used for transcript abundance analysis. Genomic DNA microarrays are used to analyze gene copy number as well as the binding sites of transcription factors and DNA binding proteins. Protein microarrays can be used to detect proteins and antibodies in a solution. These are just a few examples of microarrays that have been developed. Additional uses and types of microarrays will be presented in the next two presentations. Aaron
  • 25
  • I have never been in a biochemistry lab, but I can imagine to some extent what is going on here. My basic question is: how is the intensity of the color measured? Is it done with a reference strip of color codes of varied intensity, does it come from the experience of the researcher (I would doubt that), or is the intensity decided based on a comparative study? (I will be able to understand well if an analogy is presented.) Also, is the discrepancy between the wavelengths of red and green light of any concern? venky
  • Why do we need to calculate the intensities? Also when we cluster, using any of the methods, we are just grouping them together based on their positions(calculating their distances..), this way genes with similar functions would not be grouped together isn't it ??? Tulasi
  • The intensity of the red and green light is measured by a software program from a digitized image that is created by a microarray scanner. Each spot in the digitized image of a microarray is composed of pixels. Each pixel has an intensity that is measurable. The total of the measured intensities for all the pixels of a spot determines the intensity of the total light emitted from the spot. Intensities measured from digitized images are often referred to in terms of "volume". A filter that allows only red light or only green light to pass is used to eliminate or substantially reduce the detection of the opposite color dye. It is impossible for anyone to estimate by eye, with any accuracy, the intensity of light coming from a spot on a microarray. Aaron
  • When you cluster, the intention is that the variables (features) you use to cluster the data should be relevant and informative so as to cluster the genes in a meaningfully "similar" manner. When you say "we are just grouping them together based on their positions(calculating their distances..), this way genes with similar functions would not be grouped together isn't it ???", you are probably thinking of all of the two dimensional examples that we showed in class and equating that with the microarray images. In the paper, they aren't LITERALLY clustering on the (x,y) coordinates of the spots on the microarray. That is probably what you are thinking. They are clustering 12-dimensional feature vectors of different genes. A typical point would be like GENENAME=(ratio of intensities at t1, ratio of intensities at t2,...,ratio of intensities at t12). So yes, in clustering, you are grouping objects based on their relative positions ("distances"), but this is done in the FEATURE space. It's hard to visualize high-dimensional spaces such as the 12 dimensional feature space in the paper. Wade
  • The intensity of the light would probably be a function of the expression level of the gene. The more a gene is expressed, the more mRNA is produced. When the mRNA is converted to cDNA, there is correspondingly more cDNA. The cDNA is attached to the dye, so the more cDNA, the more intense the color. As for the second question, maybe you are thinking of their distances as being their actual separation in the genome? This is not the case. The expression levels of the genes are measured at several time intervals. Each time represents a dimension, with the position of a gene in that dimension being directly related to its level of expression. Therefore, the genes with similar expression patterns are closer to each other in space. Genes which are expressed at the same time are often related in function. That is, they don't do exactly the same thing, but they are all involved in the same pathway or process in the cell. For example, when insulin, a messenger in the body, interacts with the cell's outer surface, the cell reacts by expressing new genes and halting the expression of some others. The newly expressed genes would then form an assembly line, or pathway, to produce a result in response to the signal. Michael
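    
    The replies in this thread describe two steps: summing pixel intensities into a spot
    "volume" per color channel, and then clustering genes by distance in feature space rather
    than by position on the slide. A rough sketch of both steps follows. All pixel values, gene
    names and profiles are made up, and three time points are used instead of the paper's
    twelve to keep it short.

```python
import math


def spot_volume(pixels):
    """Total intensity ("volume") of one spot: the sum over all of its pixels."""
    return sum(sum(row) for row in pixels)


def intensity_ratio(red_pixels, green_pixels):
    """Ratio of the red-channel volume to the green-channel volume for one spot."""
    return spot_volume(red_pixels) / spot_volume(green_pixels)


# Hypothetical 2x2 pixel blocks for one spot, one block per filtered channel.
red = [[200, 220], [180, 200]]    # volume 800
green = [[100, 90], [110, 100]]   # volume 400
ratio = intensity_ratio(red, green)

# Hypothetical feature vectors: one intensity ratio per time point for each gene.
profiles = {
    "geneA": [1.0, 2.0, 4.0],
    "geneB": [1.1, 2.1, 3.9],  # rises over time, like geneA
    "geneC": [4.0, 2.0, 1.0],  # falls over time
}

dist_ab = math.dist(profiles["geneA"], profiles["geneB"])
dist_ac = math.dist(profiles["geneA"], profiles["geneC"])
```

    Here geneA and geneB are close in feature space because their expression patterns are
    similar; where their spots sit on the array plays no part in the distance.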
  • 26
    How accurate is clustering in predicting gene function, since it doesn't take into 
    account post-transcriptional gene regulation?
    
    Meyer, Louis John 
    
    Statistically speaking, clustering is not designed for prediction (although you could use it for that) but 
    rather for exploration.  So there is no confidence probability you associate with your results. Through 
    cross-validation, (provided you have enough data) you could heuristically determine the error rate and 
    thus get an empirical estimate of how accurate the clustering is for a particular set of data, but these  
    results would not generalize to other data sets. 
    
    Wade 
    
    
  • Wade, how about this: take your first data set (assuming that you have enough data to create 2 data sets: training and testing) and call it your training set. Perform your clustering on this data set. To keep it simple, say we wish to use a Gaussian distribution to represent each cluster; using maximum likelihood estimation, we calculate the mean and variance of the Gaussian distribution for each cluster. We now view these Gaussian pdfs (probability density functions) as membership functions (sorry, can't help it) for the clusters. From a statistician's point of view, is it valid to use these Gaussians to form decision boundaries that will allow us to use them as classifiers on the 2nd data set? I have never tried this before, so I don't know how well it will work. Ozy
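    
    The procedure proposed above could be sketched roughly as follows. This assumes the
    clustering of the training set has already been done, models each feature independently
    (diagonal covariance), and gives every cluster equal prior weight; the cluster labels and
    data values are hypothetical. It illustrates the mechanics only and does not settle
    whether the approach is statistically valid.

```python
import math


def fit_gaussian(points):
    """Maximum-likelihood mean and variance of each feature (divide by n, not n - 1)."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    variances = [sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dims)]
    return means, variances


def log_density(x, means, variances):
    """Log of the Gaussian pdf at x, treating features as independent.

    Assumes every variance is strictly positive.
    """
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
        for xi, m, v in zip(x, means, variances)
    )


def classify(x, models):
    """Assign x to the cluster whose Gaussian gives it the highest density."""
    return max(models, key=lambda label: log_density(x, *models[label]))


# Hypothetical result of clustering the training set.
training_clusters = {
    "cluster_1": [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2)],
    "cluster_2": [(3.0, 3.1), (3.2, 2.9), (2.9, 3.0), (3.1, 3.2)],
}
models = {label: fit_gaussian(pts) for label, pts in training_clusters.items()}

# Hypothetical second (testing) data set.
test_points = [(0.05, 0.05), (3.05, 3.0)]
predicted = [classify(x, models) for x in test_points]
```

    The decision boundary is implicit: it is the set of points where the two log densities are
    equal.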
  • 27
    To the class, 
    
    The paper uses the term "parametric ordering of genes", but doesn't mention what
    parameters they are using (though they say it is their alternative approach). Is 
    clustering analysis a non-parametric approach?
    
    Any ideas about these statements?
    
    Bhavani
    
    
  • Clustering is normally considered to be a non-parametric approach. An example of a parametric approach is the Bayes classifier, where you have to assume what pdf to use and estimate the a priori probabilities and the parameters used in these pdfs. Am I off here, Wade? ozy
  • You are correct, although some Bayesian statisticians would tell you that a Bayesian approach could also be nonparametric if you use "noninformative priors", i.e. prior probability distributions that are not very "sharp" such as the uniform distribution for a particular parameter. Clustering is definitely a nonparametric approach in its most commonly used forms. Wade
  • 28
    Hi All, 
    
    In the first reference paper it is said that the microarrays and reverse dot-blot
    analysis are similar in concept. 
    Then what makes microarrays distinct from the other? 
    
    Tulasi
    
    Tulasi, 
    
    I think it is the number of genes you are working with. In my lab, when we do dot-blot analysis, we
    are interested in a few particular genes. We make DNA samples of these genes by PCR (a molecular biology 
    technique that amplifies a specific DNA) and attach them to a membrane. Then, we want to know if these  
    genes are expressed in certain cells upon certain stimulation (UV, growth factors, cytokines). We stimulate 
    cells, isolate total mRNA from these cells and from control, unstimulated cells, and label it. It can be a 
    radioactive label, a fluorescent label, biotin, or an enzyme. If the total pool of mRNA contains the mRNA of 
    interest, it will hybridize with the DNA on the membrane and we will detect it. So, the principle is the same 
    as in microarray analysis. The difference is the number of genes analyzed. 
    
    Olga 
    
    29
    How can I tell which cluster represents which gene from the microarray figure? Thanks. 
    
    Liwen Tu 
    
    
    Notes on clustering by Hai clustering.
    
    Just remember clustering isn't magic: garbage in = garbage out. If the data 
    clusters don't have a meaningful interpretation, then it really doesn't 
    matter. Clustering is just an intermediate tool. The algorithm and the 
    convergence might work fine, but if the end result doesn't have a useful 
    interpretation, then I would be wary. If you were concerned only with 
    dimension reduction, then that might be OK. 
    In the case of the uniformly distributed points, clustering would not make 
    much sense. It would be like reading tea leaves! A lot of data CAN be run 
    through a clustering algorithm, but that doesn't mean that it SHOULD be. 
    There are good counterexamples for many algorithms, but most of the time 
    these counterexamples are situations where the algorithm isn't appropriate. 
    That is why they are counterexamples! 
    
    Wade Davis
    
    30
    How often does this happen:
    you assumed that there were k clusters, and after you finished classification, 
    you got your k clusters. However, in one of your clusters, there are two very 
    distinct subclusters, and it may be better to classify the two subclusters 
    into two separate clusters so that there are k+1 clusters in total. 
    
    I guess my question is: how do you make an assumption about the number of clusters, 
    and how do you correct yourself?
    Hongying James 
    
    It depends on what clustering algorithm you use. But my first choice would be the Possibilistic 
    C-means algorithm due to its "mode seeking" characteristics.  If you try to find m clusters and 
    there are k naturally existing clusters in your data set: 
    
    If m>k, you will see that some of the m clusters will be very close to each other, if not identical, 
    in terms of the membership values of their members and their cluster centers. This is an indication 
    that you have over-specified the number of clusters.
    
    If m<k, you will get m clusters, and these are likely to be the m highest-density clusters out of the k 
    clusters in the set. What happens to the other clusters not represented in the m clusters you 
    find? Most likely, the members of these k-m clusters will receive low membership values. Hence, 
    you can filter them out by applying a threshold on the membership value to detect data 
    points that should belong to other clusters outside the m you already found.
    
    A fuzzy C-means will simply split the data points belonging to the k-m clusters among the m 
    clusters, because of the condition that the membership values of each point must sum to 1. Hence, 
    it will be hard to differentiate the data points that should belong to the k-m clusters.
    
    Ozy
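    
    The difference between the two membership rules can be seen on a single point that lies
    far from every cluster found so far. The sketch below uses the standard fuzzy C-means and
    possibilistic C-means membership formulas with fixed centers; the centers, the fuzzifier
    m = 2 and the scale parameter eta = 1 are assumptions chosen for illustration, and no
    center updates are performed.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-means memberships of one point; they always sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:  # the point sits exactly on a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


def pcm_memberships(point, centers, eta=1.0, m=2.0):
    """Possibilistic C-means memberships; each cluster is scored independently."""
    exponent = 1.0 / (m - 1.0)
    return [
        1.0 / (1.0 + (math.dist(point, c) ** 2 / eta) ** exponent)
        for c in centers
    ]


# Hypothetical centers of the clusters already found, plus a point that
# belongs to some other cluster far away from both of them.
centers = [(-1.0, 0.0), (1.0, 0.0)]
stray_point = (0.0, 10.0)

fcm = fcm_memberships(stray_point, centers)  # forced to split: 0.5 and 0.5
pcm = pcm_memberships(stray_point, centers)  # both low: about 0.01 each
```

    Thresholding the possibilistic memberships flags the stray point as belonging to neither
    cluster, while the fuzzy memberships give no hint that it is unusual.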
    
    31
    What is the terminating condition for the k-means algorithm if we don't 
    get good clusters from the iterations? 
    
    Raman Seth
    
    Here is the discussion on k-means algorithm, the plot.jpg
    
    32
    I've written up a word document in which I try to explain how the 
    decision function works (in dual space, that is, with the kernel 
    function) and how it learns this function by maximizing the distance 
    to the closest examples.
    
    The second part I explain with regard to the direct space representation 
    of the decision function, since it is easier to understand.  
    
    This is an excellent paper to read if you are interested in all the details: 
    http://portal.acm.org/citation.cfm?doid=130385.130401 
    
    I've attached the word document which tries to give the explanation in terms 
    that someone without a strong mathematical background can understand.
    svm.doc
    
    Michael 
    
    
    
    
    33
    To the class, 
    
    In microarray analysis we are likely to get duplicated genes.  
    How are we supposed to deal with these genes? I mean, if we have to 
    calculate the fold changes, do we have to average the signal intensities
    of all the duplicated genes, or is it OK to treat each duplicated 
    gene as an individual gene? How would the results vary?
    
    bhavani
    
    With respect to using a microarray for expression analysis, it should not matter if there 
    are multiple copies of a gene. The goal of an expression array is to identify the quantity 
    of mRNA in a sample. The source of that mRNA (i.e. one or multiple copies of a gene) will
    not be detected by, or will not affect the results of, analysis by a microarray. 
    Alternatively, microarrays have been used to identify multiple copies of genes present 
    in genomic DNA.
    
    Aaron
    
    34
    
    I'm just wondering, do we know how many forms of protein a gene can 
    express, and which one is the dominant form that has the tendency to
    cause cancer? I remember that some time ago it was reported that breast 
    cancer is rare among Asian people because of their large consumption of 
    tofu (a soybean product). Does this indicate that soybeans 
    may have the ability to inhibit the production of the cancer-prone protein?
    Hongying James
    
  • It is believed that soybeans contain large amounts of phytoestrogens. Breast tissues are estrogen responsive, so, although the mechanism is not clear, phytoestrogens might prevent neoplastic growth in breast tissue by binding to the estrogen receptors and competing with human estrogen. Olga
  • A gene can express any number of proteins due to alternative splicing. Some genes can produce more proteins than others due to more alternative splicing choices. I may be wrong, but I don't think there is a "dominant" form. One form may be more prevalent than another, but not necessarily because it is dominant. Hope this helps! Stephanie
  • Also, don't forget post-translational modification, such as cleavage of the signal sequence in the synthesis of membrane proteins, and covalent modifications, such as phosphorylation, which determine the ultimate effect of the gene. There may not be "dominant forms" but there are certainly forms with "dominant functions." If one form of a protein causes effects which occur regardless of an alternate form, it could be said to be dominant with regard to those effects. Everything becomes much more complex when moving from genes to proteins. In computer science terms, this is like trying to figure out what a program is doing by reading its source code compared to looking at the contents of memory while the program is running. You get a definite picture of what is happening when you look at the raw memory, but it is obviously much harder to intuitively understand. Michael
  • 35
    Can the cancer stages, such as primary and secondary stage, be explained 
    using the dominant protein/gene theory? Or are they just based on  
    the extent to which the disease has spread in the body and the damage caused? 
    
    Venky
    
    It's common knowledge at this point in time that family history is probably 
    considered the most important factor when assessing breast-cancer risk. 
    
    How far is it possible to use the proteomic approach described in the 
    R. A. Harris et al. paper to establish this theory?
    
    Basu
    
    
    This is something I found while trying to figure out the causes of cancer: 
    
    The majority of breast cancer is sporadic, occurring in women without a family history of breast 
    cancer. Approximately 15-20% of breast cancer is associated with some family history. In general, 
    a twofold to threefold increase in the risk of breast cancer development has been associated with 
    breast cancer in a mother or sister.(2) Importantly, a woman's risk of breast cancer is strongly 
    related to the number and type of relatives affected as well as the age at which these relatives 
    were diagnosed.(2) Presumably, this familial clustering is a result of multiple, relatively weaker
    genetic influences, single cancer susceptibility genes with low penetrance and shared environmental 
    risk factors. Only 5-10% of breast cancer is thought to be due to the inheritance of a single, highly 
    penetrant autosomal dominant mutation in a single cancer susceptibility gene such as BRCA1 or BRCA2. 
    
    You could get some more info at http://www.health.state.ri.us/disease/cancer/canbrca.htm. 
    
    Tulasi  
    
    36
    My idea is that there are different proteins in different tissues and the abnormal 
    behaviour of these proteins results in dreadful diseases.
    
    So how do we identify which part of the human body is affected by cancer? 
    Do we have to consider tissues from different parts to identify the cancer 
    affecting one part? 
    
    Bhavani 
    
  • There is always normal tissue next to a tumor, which is always taken (or is supposed to be taken) as a control in any kind of experiment (microarray, proteomics, histochemistry, in situ hybridization and so on). Olga
  • Here are some recent research facts I found about genes and breast cancer, if anybody is interested... Kappa statistics were used to examine the association between genetics and morphology. A kappa value of 0 indicates that the process is random and a value of 1 indicates that it is completely determined (i.e. genetic); values between 0.40 and 0.60 are considered to indicate a moderately determined process. The study sample included a total of 25,730 first and 3394 second invasive breast cancers, and 2990 in situ breast cancers. There were 164 mother-daughter pairs with breast cancer of a defined morphology, yielding a low kappa value of 0.08. Among 100 sister pairs the kappa value was 0.002. In individuals with two primary breast cancers the kappa values were 0.22 and 0.01 for two invasive and in situ-invasive pairs, respectively. However, for a tumour with a subsequent tumour detected in the contralateral breast less than 1 year later the kappa value was 0.47. The explanation given was that tumours diagnosed temporally close together are likely to be reported as independent primaries if they occur contralaterally rather than ipsilaterally. Because practically all tumours reported to the Swedish Cancer Registry are histologically or cytologically verified, the reliability of the data is high, and this also applies to second primaries. The results of the study suggested that breast cancer morphology is not genetically determined. Nandini
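    
    For readers unfamiliar with the statistic, here is a small sketch of how a kappa value is
    computed for paired labels. It assumes the statistic meant above is Cohen's kappa, which
    the post does not state explicitly, and the morphology labels are invented for illustration;
    they are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance if the two labelings were independent.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # both sides use one single label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical morphology labels for four relative pairs.
mothers = ["ductal", "ductal", "lobular", "lobular"]
daughters_same = ["ductal", "ductal", "lobular", "lobular"]
daughters_chance = ["ductal", "lobular", "ductal", "lobular"]

kappa_same = cohen_kappa(mothers, daughters_same)      # complete agreement
kappa_chance = cohen_kappa(mothers, daughters_chance)  # agreement at chance level
```

    Complete agreement gives a kappa of 1 and chance-level agreement gives 0, which matches
    the interpretation of the scale quoted in the post.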
  • 37
    I had a question for the biologists or anyone who has an answer. 
    
    It is known that a tumor needs microvasculature (network of blood vessels) to 
    survive. So how can you differentiate between tumor microvasculature and normal 
    microvasculature using the new data coming out of human genome and come up with 
    antigenic differences that can be used to target therapy? 
    
    Nandini 
    I work with vascularization in the placenta. From what I've read, the vascularization process in
    cancer, the placenta, the ovary and other tissues is very similar at the molecular level. In my
    opinion, you can't differentiate between tumor microvasculature and normal microvasculature. But
    that doesn't mean that you can't use drugs or other techniques that inhibit vascularization in an
    attempt to control cancer growth.
    
    Mesa
    
    
    38
    So if you can't differentiate between tumor and normal microvasculature, wouldn't 
    the drugs meant to control cancer growth inhibit normal vascularization too?
     
    Nandini
    
  • Yes, the drugs would also affect normal vascularization. The hope is that the cancer is more susceptible to vascular degeneration and therefore would die before the patient does. There may be ways to make the drug specific to the tissue where the cancer is located, but there are always going to be major side effects to such a treatment. That is why cancer is so much of a problem. Michael
  • I found this on the net while surfing for tumor microvasculature. http://w3dibit.hsr.it/PhD/people/corti.html "The anti-tumor properties of TNF and its unique efficacy in selective destruction of tumor associated vessels are well known. However, the clinical use of TNF as an anticancer drug has been so far limited to local or locoregional treatments, as its systemic administration is hampered by dose-limiting toxicity. The dramatic responses observed with loco-regional treatments have fostered, in the last years, further investigations aimed at decreasing the toxicity of TNF and enabling systemic administration of therapeutic doses. We have observed that tumor pretargeting with biotinylated antibodies and avidin can increase the anti-tumor activity of biotin-TNF conjugates administered intravenously to mice bearing subcutaneous tumors, with no evidence of increased toxicity. Even when TNF was targeted to tumor cells, potentiation of the anti-tumor effects was related primarily to indirect mechanisms involving a host mediated response, causing damage to the tumor associated microvasculature. Studies on the mechanism of action showed that indirect mechanisms were related to dissociation of TNF from the targeting complex via trimer-monomer-trimer transition. Based on these notions, we are now trying to develop new targeting systems for the direct delivery of TNF to tumor associated vessels."
  • 39
    This question:
    
    what's the difference between Northern point and microarray? I didn't 
    get what exactly Northern point is, can any biologist explain this? 
    
    by Hongying James, went unanswered!! 
    
    Any ideas!! 
    
    Tulasi  
    
  • A microarray is an array of 1000 or more oligonucleotide or cDNA spots, between 50 µm and 300 µm in size, printed on a solid substrate. In Northern blot analysis, RNA rather than DNA is separated on the gel. RNA is single stranded, so if the position of the RNA in the gel is to be proportional to its molecular weight, it must remain denatured (kept single stranded). For this reason, RNA is run in the presence of formamide. The RNA is electrophoresed through a gel matrix to separate the individual fragments by size and then transferred to a membrane, where it is fixed so that it does not come off. That is what a Northern blot is. http://www.bio.davidson.edu/courses/genomics/method/Northernblot.html Bhavani
  • The northern blot is similar to microarrays in that they both involve hybridization. The northern blot is performed by separating mRNA by size, using gel electrophoresis. Then the mRNA on the gel is transferred to another medium, without losing the pattern of the gel. That is then treated with a labeled probe, usually radioactive, that has a known sequence. Wherever that probe hybridizes, you know that there is a similar sequence at that location. In microarrays, it is usually the opposite. Sequences of known identity (oligonucleotides or cDNA) are laid out in separate spots on the array, and they are treated with an unknown probe. Since the spots have known sequence, the hybridization of the probe tells you about its identity. Some advantages that microarrays have over northern blots are that they involve less work and are "cleaner" (there's no transferring of the mRNA from the gel to the membrane, for example). Also, I think that microarrays are much better for quantifying the amount of expression, where northern blots are mostly for testing for the presence of expression. I have to admit though that I know little about microarrays and northern blots. I just did a search on Google and then basically guessed at the advantages. Michael
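For the computer science side of the class, the microarray idea in the answers above (spots of known sequence, an unknown sample, identity read off from where hybridization happens) can be sketched as a toy string-matching problem. This is only an illustration under strong assumptions: real hybridization tolerates mismatches and depends on temperature and salt, while here a spot "hybridizes" only if its exact reverse complement appears in the sample. The names `reverse_complement`, `hybridizing_spots`, and the sequences are all invented.

```python
# Map each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def hybridizing_spots(array_spots, sample):
    """Names of spots whose exact reverse complement occurs in the sample.

    Toy model only: exact matching stands in for real hybridization.
    """
    sample = sample.upper()
    return [
        name
        for name, spot_seq in array_spots.items()
        if reverse_complement(spot_seq) in sample
    ]


# Invented spot sequences and sample, not real genes.
spots = {"geneA": "ATGGCC", "geneB": "TTTAAACC"}
unknown_sample = "AAGGCCATTT"
print(hybridizing_spots(spots, unknown_sample))  # ['geneA']
```

Because each spot's sequence is known in advance, the list of spots that light up identifies what was in the unknown sample, which is the reverse of the northern blot, where the probe is the known sequence.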