Published by Arto Jarvinen on 27 Jun 2009
How do you wrap your brain around 1 MLOC?
The short answer is: you don’t! When stumbling upon 1 million lines of code it is more important to understand what you don’t need to understand than to understand what you need to understand.
My mission is to add my synchronizing renderer filter as an option to MPC-HC. To do this, I need to understand where to insert my code. And to understand that, I need to know how the existing version of MPC-HC is wired, that is, what its architecture is. How can I do that when there is no documentation of the architecture whatsoever and comments are as rare as …?
Since I do software development just for fun, I seldom have the time to penetrate a technology area in a way that would be possible if I worked with it 8+ hours a day. My interests also span from hacking a bit of PHP to get my photo site to behave to developing multi-threaded and time-critical applications with DirectX and DirectShow. Sometimes working for a living prevents me from working on my own projects for weeks or even months, which means that I forget a lot (and this doesn’t exactly get better with age). On the other hand, these circumstances have forced me to get better at getting under the hood of a new application (and at rapidly identifying what parts to remain blissfully ignorant about). These are some of my tactics and tricks to do that:
Figure: A picture and a thousand words.
- First of all, I read up on the basics of any frameworks used that I don’t already know (or have forgotten about). Even if the application documentation is non-existent, the documentation of the commercial frameworks (in this case at least MFC, ATL, and DirectShow) is usually good. Since these are all Microsoft frameworks, there is a plethora of books to read too if you are of the book-reading persuasion. I mostly use Google. The frameworks often determine the taxonomy and the architecture of the application to a large degree.
- When learning a new framework such as DirectShow, I almost always find it much easier to generalize from examples (samples in this case) than to specialize from general descriptions; for instance, I found it much easier to learn UML from annotated examples than from a meta-model. The meta-model became useful much later. I compiled and played around with the excellent samples that came with the DirectShow library. Another excellent source of samples that I consult quite often is the Code Project site.
- I usually draw simple UML class diagrams of the basic classes of the application together with the (seemingly) most interesting instance variables and operations. Sometimes I also draw the inheritance diagram to remember where to look for variables and operations when I can’t find them in the class itself. Visual Studio has a pretty good class browser too where it is easy to see inheritance hierarchies.
- Once the main classes are mapped, a good starting point in a media application is to try to understand the “play file” use case or similar. Here I usually draw a simplified UML sequence diagram mapping the use case realizations of the interesting use cases (see image). Reading code is useful for documenting the use case realizations, but sometimes I get lazy or find it hard to follow the execution flow for some other reason. In these cases I add traces to the beginnings and the ends of the operations that I believe take part in the use case realization and then run the application in debug mode. This is admittedly a rather dumb method, but I make progress even when I’m tired or when the code is a real eyesore. And sometimes grunt work can be quite relaxing.
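The tracing trick from the last bullet can be reduced to one line per operation with a small scope guard. What follows is only a minimal sketch, not code from MPC-HC: the class `ScopeTrace`, the macro `TRACE_SCOPE` and the three functions `PlayFile`, `OpenFile` and `BuildGraph` are all made-up names for illustration. It is single-threaded and writes to `std::clog`; in a Visual Studio debug session the text could presumably go to `OutputDebugString` instead.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of MPC-HC): logs entry when constructed
// and exit when destroyed, so each traced operation needs only one line.
// Not thread-safe; a sketch for single-threaded experiments only.
class ScopeTrace {
public:
    explicit ScopeTrace(std::string name) : name_(std::move(name)) {
        log("-> " + name_);
        ++depth();
    }
    ~ScopeTrace() {
        --depth();
        log("<- " + name_);
    }
    ScopeTrace(const ScopeTrace&) = delete;
    ScopeTrace& operator=(const ScopeTrace&) = delete;

    // All trace lines so far, kept so a run can be inspected afterwards.
    static std::vector<std::string>& lines() {
        static std::vector<std::string> collected;
        return collected;
    }

private:
    // Current nesting level, used to indent the output.
    static int& depth() {
        static int d = 0;
        return d;
    }
    static void log(const std::string& text) {
        std::string line(static_cast<std::size_t>(depth()) * 2, ' ');
        line += text;
        std::clog << line << '\n';
        lines().push_back(line);
    }
    std::string name_;
};

// Put this at the top of every operation suspected to take part in the
// use case; __func__ expands to the name of the enclosing function.
#define TRACE_SCOPE() ScopeTrace scope_trace_guard(__func__)

// Invented stand-ins for the operations behind a "play file" use case.
void BuildGraph() { TRACE_SCOPE(); }
void OpenFile()   { TRACE_SCOPE(); BuildGraph(); }
void PlayFile()   { TRACE_SCOPE(); OpenFile(); }
```

Calling `PlayFile()` produces an indented list of entries and exits that maps almost directly onto a sequence diagram. Because the exit line is written by the destructor, early returns are traced too.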
DirectShow is a modular architecture built on filters, media formats, and the filter graph that connects the filters together into a media application. Since the media formats are standardized, it is relatively easy to replace a filter with another filter in the application (filter graph) as long as it can handle the same media formats as the replaced filter. I can thus put my renderer filter in the place of any other renderer filter as long as it handles the same input formats (there are no “output formats” as the renderer is typically the last filter in the filter graph, producing video on the screen). For the same reason I can safely ignore the source code for all the other built-in filters such as “source filters”, “splitters” and “decoders” as long as I see to it that I can handle the media formats that they output. That probably eliminates more than 900 KLOC of code to understand, including a few thousand lines of assembler that I can really live without understanding.
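The replacement rule can be stated as a tiny toy model. This is a sketch of the reasoning only and deliberately not the DirectShow API: the `Renderer` struct and `can_replace` function are invented here, and plain strings such as "YUY2" stand in for real media types, which DirectShow describes with `AM_MEDIA_TYPE` structures negotiated between pins.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy model only: invented for illustration, not the DirectShow
// interfaces (IBaseFilter, IPin) or its AM_MEDIA_TYPE structure.
struct Renderer {
    std::string name;
    std::vector<std::string> accepted_inputs;  // example values: "YUY2", "RGB32"

    // True if this renderer can take the given input format.
    bool accepts(const std::string& format) const {
        return std::find(accepted_inputs.begin(), accepted_inputs.end(),
                         format) != accepted_inputs.end();
    }
};

// A replacement is a drop-in if it accepts every input format that the
// replaced renderer accepted. What happens upstream does not matter.
bool can_replace(const Renderer& replaced, const Renderer& replacement) {
    return std::all_of(replaced.accepted_inputs.begin(),
                       replaced.accepted_inputs.end(),
                       [&](const std::string& format) {
                           return replacement.accepts(format);
                       });
}
```

Note that the check never looks at source filters, splitters or decoders; only the formats arriving at the renderer's input matter, which is why the rest of the code base can be ignored.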
Ok, so where does that new code go?



