« Homework Syllabus for all sect | Main | Homework Syllabus for all sect »

February 02, 2006

Project #Syllabus for all sect

In Project #2, you implement an algorithm for doing phylogenetic inference. It is due as indicated in the schedule on February 17th.

Change Log
Feb 14 at 1:34 am: Removed "-1" from line 488 of viewer.cs. This may be causing your unrooted tree to drop the last nucleotide from the taxa. Other less critical changes include: deleted graphviz distribution from zip file (it is in the labs and you can install it on your own machine for free), added pdf files containing some notes on solving the problem using binary trees, added an extra database with a simple problem for debugging scoring functions. I've completed my implementation of the project and have made a few observations in the comments below.

Feb. 10 at 2:15pm: I thought of a better way to do the algorithm. Basically, you convert the unrooted tree to a rooted tree and do all of the algorithms on the rooted tree. I've made some notes with the algorithms and datastructures. This new problem formulation uses the same information and algorithms but is so easy that I think that even I could code it up. I wish I had thought of this before our "help" session, but here it is anyway. You can use this idea within the context of the existing code and project.

Feb 10 at 1159 am: Notes from the 11am help session

Feb. 9 at 9:39 am: ParsimonyScore algorithm modified to replace TotalScore with score. Previously, TotalScore was set to 0, left unmodified and returned.

Feb. 9 at 9:39 am: Node 8 in problem 3 was incorrectly marked as a taxon although it was an interior node. Deleted node 8 from taxa table and decremented taxa count to 40 for problem 3 in the problems table in phylo.mdb


Download the project file distribution.. Read the project description document contained in the distribution and follow the directions under "How to Proceed".

Dr. Jones has written some notes on debugging with Visual Studio that may still be useful.

Posted by jones at February 2, 2006 10:20 PM

Comments

I'm working on the current distribution, and I noticed that the project is hard-coded to use the path c:\tmp, and it also relies on a program called "dot.exe", which doesn't seem to be included with the project, nor does it come with SVG Viewer (it seems).

As a result, the project crashes when I choose a db file. What to do?

Posted by: Andrew Arnott [TypeKey Profile Page] at February 6, 2006 02:42 PM

you can get dot.exe by installing graphviz from AT&T at http://www.research.att.com/sw/tools/graphviz/

The database reader is most likely crashing because it calls dot.exe to create a picture of the graph after reading from the database.

Hope that helps,
Mike.

Posted by: Mike Jones [TypeKey Profile Page] at February 6, 2006 02:57 PM

I had no problems installing the SVG Viewer.

Posted by: Andrew Arnott [TypeKey Profile Page] at February 7, 2006 10:59 PM

When running the program, it really looks like problem 3's node 8 is a leaf node, with THREE relations to non-leaf nodes, rather than the customary one. That is, as I understand it, leaf nodes are to have just one relation to another node, which is not a leaf node. And only leaf nodes have taxa assigned to them. But in problem 3, node 8, this is not the case, as I see it.

Can anyone explain this to me?

Posted by: Andrew Arnott [TypeKey Profile Page] at February 8, 2006 09:27 PM

I am confused by the ParsimonyScore algorithm given in the Word document. It initializes totalScore = 0, but then it never uses it again until it returns it. So, where does totalScore get updated?

(and secondly, I asked a TA about line 7, and he couldn't tell me what it meant either).

1.	Algorithm ParsimonyScore (unrootedTree t)
2.	totalScore = 0; 
3.	For i = 1 to length (DNA sequences in t) 
4.	   Score each leaf node with a 0.
5.	   For each leaf x 
6.	      x.Labels = {character i of DNA sequence for x}
7.	   node n = some interior node of t  
8.	   score = score + PostOrderScore (n); 
9.	   score = score + PostOrderScore (n.centerChild); 
10.	   if (intersection (n.Labels, n.centerChild.Labels) is empty) 
11.	        score = score + 1;  
12.	return totalScore; 

Posted by: Andrew Arnott [TypeKey Profile Page] at February 8, 2006 09:43 PM

Andrew,

Good catch on totalScore. Just replace totalScore with score and you'll get what you are looking for...

1. Algorithm ParsimonyScore (unrootedTree t)
2. score = 0;
3. For i = 1 to length (DNA sequences in t)
4. Score each leaf node with a 0.
5. For each leaf x
6. x.Labels = {character i of DNA sequence for x}
7. node n = some interior node of t
8. score = score + PostOrderScore (n);
9. score = score + PostOrderScore (n.centerChild);
10. if (intersection (n.Labels, n.centerChild.Labels) is empty)
11. score = score + 1;
12. return score;

On the second question, "pick an interior node" means just that. Find some node inside the tree and make that n. The point here is that the entire tree will have the same score no matter what you pick for the starting node. The only caveat is that the interior node must have 3 adjacent nodes (and, otherwise it wouldn't even be an interior node).

Mik.e

Posted by: Mike Jones [TypeKey Profile Page] at February 9, 2006 09:09 AM

Andrew,

Node 8 was incorrectly marked as a taxon. I've since changed the database to delete 8 from the taxa table and I've updated the problem table so that problem 3 has 40 instead of 41 taxa.

You can modify your database yourself or download the new one.

As always, thanks for being our beta tester.

Mike.

Posted by: Mike Jones [TypeKey Profile Page] at February 9, 2006 09:37 AM

A modified distribution is available here.



Or, if you are familiar with Subversion, I suggest you download the distribution from the subversion repository here.



The differences include (but are not limited to):


  • No more requirement for c:\tmp, or any temporary directory for that matter.

  • A user interface that is more friendly to resizing for better viewing

  • Some slightly renamed classes and methods

Posted by: Andrew Arnott [TypeKey Profile Page] at February 10, 2006 11:26 AM

I noticed that when I used my own toDotty I was only getting 9 chars in my taxa while the default toDotty shows 10. I checked in viewer.cs and found that it is only sending me 9. I changed line 488 of viewer.cs from:

char [] taxaCharacters = taxaString.ToCharArray(0, taxaString.Length - 1);

to:

char [] taxaCharacters = taxaString.ToCharArray(0, taxaString.Length);

and now I get 10 chars. Are we supposed to have 10 chars in our taxa? Is this a valid fix?

Posted by: Nathan Jenne [TypeKey Profile Page] at February 11, 2006 03:01 PM

That correction to ToCharArray looks like a good fix to me. I've committed it to the SVN version. If anyone is using SVN to get the modified distribution, a simple SVN Update should get you the bug fix (and a few others trivial ones).

Posted by: Andrew Arnott [TypeKey Profile Page] at February 13, 2006 07:15 AM

I've finished my scoring (but not my solving). I wanted to publish my scores for the unsolved problems so we could compare answers and see if we've got the algorithm right. Mine are not guaranteed to be correct, but I feel good about them at this point: #1 score: 54. #2 score: 141, #3 score: 194.

Anybody else got results that can confirm or deny these? (I applied the taxa bug fix to get these numbers).

Posted by: Andrew Arnott [TypeKey Profile Page] at February 13, 2006 04:05 PM

Thanks Andrew,

My results are different. I am getting:
problem 1 88
problem 2 226
problem 3 313
problem 4 4

I am not asserting that these are correct, but the solution I have from last year got the same results, so I am more confident than I would be otherwise.

Problem 4 is a simple problem that I added to the database (and which helped me see that my set union used "and" while my set intersection used "or". It is odd, even for me, to get both of them wrong in the same class) You can get the new database and see what you get on problem 4. You can also add your own problems to the database. Just open it and edit the tables.

Mike.

Posted by: Mike Jones [TypeKey Profile Page] at February 13, 2006 06:11 PM

I've completed the project. It took me...
2 hours to convert unrooted trees to binary trees,
1 hour to get the scoring function working (mostly debugging)
2 more hours get branch swapping working (45 minutes to find and fix silly programming errors).
45 minutes to redo my binary tree building algorithm to build trees that had about the same number of children on both sides of the root node. This gave me a reduction on all 3 problems. Without it, I got a reduction on only one problem.
Total = 5 hours 45 minutes. Round up to 6 hours. I don't know how that will scale to your programming speed (it took me 2 hours of coding and 1.5 hours of debugging to do the last project). I understood the problem well, but made a plethora of silly programming mistakes. So this is a nontrivial project, but doable. I think you'll learn a lot about divide and conquer algorithms.

I'll list a few of my mistakes and observations here for your benefit:


  • basic set operations not correctly implemented
  • I removed all "catch"es from viewer.cs so I could track my exceptions a little better. The call stack viewer (under debug|windows|stack) was then very helpful.
  • I needed to see my trees and how they were scored. So I made a very simple problem (see comment above for link) and scored it by hand. Then, I modified my "toDotty" function to first score the tree and then to annotate the node labels with the sets of nucleotides that labeled each node. This was quite useful.
  • My branch swapping algorithm didn't score the second version of the swapped tree. I forgot to insert that line of code. Took 15 minutes to figure that out.
  • My branch swapping algorithm didn't correctly store the positions of the grandchildren in the two swapped versions of the tree. This was a problem because the last step of my branchswapping algorithm was to restore the tree to the optimal arrangement of grandchildren.
  • I ended up using the array representation of a binary tree in my project rather than pointers or references. This implementation is discussed in the textbook. The trick was to store parent and child array index location in my node class. This allowed me to update nodes to "point" to different parents.
  • I ended up not storing any intermediate score information with any nodes in my tree. I simply recomputed the score all the way down to the taxa every time I needed to score a subtree. For my implementation, this optimization was not neccesary as my code runs in under 1 second for each problem without this optimization.
  • In the branch swapping method, I ended up not handling the case where the root node has 2 grandchildren and 1 child that is itself a taxon. Doing so would have given me more combinations to try, but wasn't neccesary to get a reduction on all three problems in my implementation (which is all that the project requires). If I were doing an improvement project, I would have implemented some code to handle this special case and determined if it gave me more reduction.

Good luck and ASK QUESTIONS! I read mail almost constnatly and I get mail when comments are posted to this blog entry. Set up a time to come to my office and talk about the project if that would help (but I'll be on the way to Moab Friday after class, so come before then). Ryan the TA has a help session Tuesday at 5pm probably in 120 TMCB, go to that if you can.

Posted by: Mike Jones [TypeKey Profile Page] at February 14, 2006 01:08 AM

Here are my scores before swapping:
1. 90
2. 228
3. 316

I logged (in detail) all of my scoring for problem #1 and posted it at http://students.cs.byu.edu/~nsj5/phylo-score.txt . As far as I can tell it looks right, but maybe I am still missing something.

Posted by: Nathan Jenne [TypeKey Profile Page] at February 14, 2006 06:30 PM

Since performing set operations like intersection and union can be error-prone and taxing (ha ha) on the programmer, I wanted to share a really easy way in C# to perform the necessary intersection and union operations for the taxa. The code and comments below should show you how to do it. Hope it helps someone out.



/// <summary>
/// Represents the four individual taxa characters.
/// </summary>
/// <remarks>
/// The enumerable is marked with [Flags], and numbered in powers of 2
/// to allow you to set multiple values instead of just choosing one as
/// with a normal enumerable type.
/// </remarks>
/// <example>
/// You can perform an intersection operation using bitwise AND. And a
/// union operation using bitwise OR.
/// <code>
/// Labels label1 = Labels.C | Labels.A; // both C and A
/// Labels label2 = Labels.G | Labels.A; // both G and A
/// Labels intersection = label1 & label2; // bitwise AND -> A
/// Labels union = label1 | label2; // bitwise OR -> C, G, A
/// </code>
/// </example>
[Flags]
public enum TaxaLabels : int
{
None = 0x0,
C = 0x1,
T = 0x2,
G = 0x4,
A = 0x8,
}

Posted by: Andrew Arnott [TypeKey Profile Page] at February 14, 2006 09:17 PM

Sorry about the formatting of my last post. Why is it that the Preview feature doesn't accurately represent how it will look?

Anyway, I forgot to show you how to easily convert the "char[] taxa" to the enumerable so that you can do those set operations. It's just one line:

TaxaLabels label = (TaxaLabels)Enum.Parse(typeof(TaxaLabels), Taxa[taxaIndex].ToString(), true);

This is assuming taxaIndex is the index of the individual taxa character you are considering at the time.

Posted by: Andrew Arnott [TypeKey Profile Page] at February 14, 2006 09:21 PM

i'm running into a weird issue. my original score for example 1 is 88. when i solve i end up with 84. so at first it thought, great! it worked! but then i keep running again on the newly created tree and i get 83....81....82!....77....76...74...76!..74...repeats between 76 and 74.
WOAH! how in the world is the score getting WORSE from 81 to 82 and from 74 to 76? obviously i am doing something wrong in my algorithm, maybe cutting out early. i dunno. i have worked about 2 hours trying to figure something out but no go.

anybody have a similar issue and know how to fix it? i already implemented my toDotty and that helps, but i still can't see where i am accepting a higher score.

Posted by: Daniel Larsen [TypeKey Profile Page] at February 17, 2006 11:09 AM

Daniel,

First off, you only need some improvement for the project. So you are done once you can go from 88 to 83. But, I am still thinking about whether or not I think you can get a higher score by only accepting the same or lower score in subtrees.

You'd need to reduce the score by X at a subtree and then that decision would need to force an increase of X+d higher in the tree. I think I believe that such a thing can happen but don't have an example.

Mike.

Posted by: Mike Jones [TypeKey Profile Page] at February 17, 2006 07:30 PM

Up until now, I had been doing my programming in the labs. Now that I have tried it at home I am getting "error establishing database connection" in my statusBox after selecting which database to load.

If I click on solve problem (which might be unrelated since I already received an error loading the database), it mentions something about "The .Net Framework Data Providers require Microsoft Data Access Components(MDAC). Please install Microsoft Data Access Components(MDAC) version 2.6 or later." So, I download version 2.8 from the Microsoft website and it informs me it is built into this version of windows (XP SP2).

However, when I run the phylo program outside of the IDE, it returns the error "The procedure entry point _GetIUMS@4 could not be located in the dynamic link library MSDART.dll" This occurs when selecting the database.

I have office 2003 installed including access. I can open the phylo.mdb using access. I am using the latest Jones distribution. Please advise.

Posted by: Mike [TypeKey Profile Page] at February 18, 2006 11:12 AM

Post a comment

Thanks for signing in, . Now you can comment. (sign out)

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)


Remember me?