Title: Investment Science Corp.
1 Investment Science Corp.
2 Symbolic Regression
- Large Scale
  - 1M rows x 20 columns
  - Single computer
  - Less than 50 hours computation time
- Server Farm Scaling
  - Assume a 100 server farm
  - The Farm can manage 1000 symbolic regressions of 1M rows x 20 cols
  - Elapsed computation time will be 500 hours
- Trading System Deployment
  - Training weekly requires one large scale symbolic regression
  - To deploy, must get approval from trading committee
  - Blind forward 20 year history requires 1000 large scale symbolic regressions
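The farm figures on this slide are consistent with simple throughput arithmetic. The sketch below is a minimal illustration, assuming each large scale regression takes the full 50 hour upper bound, that the regressions are independent, and that each server runs one regression at a time with no scheduling overhead. The function name `farm_elapsed_hours` is hypothetical. The count of 1000 regressions is roughly one weekly retraining over 20 years (20 x 52 = 1040), though the slide does not state how it was derived.

```python
import math

def farm_elapsed_hours(num_jobs, hours_per_job, num_servers):
    """Elapsed wall-clock hours when independent jobs run in waves on a farm.

    Assumes one job per server at a time and no scheduling overhead.
    """
    waves = math.ceil(num_jobs / num_servers)  # rounds each server must run
    return waves * hours_per_job

# Slide figures: 1000 regressions, up to 50 hours each, 100 servers.
print(farm_elapsed_hours(1000, 50, 100))  # 10 waves x 50 hours = 500
```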
3 DeepGreen™ WorkFlow
The CTO provides hands-on direction to Tiger teams developing advanced analytics.
- Weekly Securities Buy-Sell: Account Traders
- Weekly Production Run: Green Team
- Weekly HTML Reports: Account Traders
- What-If Blind-Forward Testing: Investment Science Department, Green Team
- New Algorithm Integration and Scoring: Investment Science Department, Green Team
- New Algorithm Research and Development: Engineering Department, Green Team, Development Teams
Account Traders see only the weekly HTML reports.
Development Teams see only their own projects.
4 Weekly Production Run
- Produces Thousands of HTML Pages
  - Analysis pages for each of 1,500 stocks
  - Performance history for each of top 150 trader agents
  - Product strategy performance for each of top 35 product strategies
- Requires Retraining of all 40M Agents
  - Currently uses 50 weekend hours for computation
  - Deployment of new models requires a server farm
- Data Collection
  - Weekly data feed from Valueline
  - Weekly data feed from DownloadQuotes.com
  - Weekly data feed from First Call
  - Weekly data feed from Standard & Poor's
  - Data cleansing is automated but requires some human intervention
  - Valueline once asked to re-purchase our cleansed data (we declined)
5 Automating New Product Development
- Marketing Defines New Product Requirements
  - Type of risk management required (Structural, Statistical)
  - Competitive rates of return available in market
  - Competitive risk levels available in market
- Engineering Product Mock Ups
  - Review top trader performance (each of 40M trader agents)
  - Review product strategy performance (each of 35 product strategies)
  - Can we fulfill with a mixture of existing product strategies and traders?
- Product Research (fulfilling the future)
  - Review top academic algorithm performance
  - Review additional data requirements for market penetration
  - Can we fulfill with a mixture of new algorithms and new data?
  - Acquire and test additional data and new academic algorithms
6 Symbolic Regression
- Linear Regression
  - Gaussian substitution
  - Least Squares
  - Multivariate polynomial models
- Non-Linear Regression
  - Logit regression
  - Support Vector Regression
  - A growing but very limited set of tools and models
- The Growing Need
  - Scientific problems are growing in complexity
  - All current regression tools are O(N^2) computational complexity or greater
  - Generalized, scalable, symbolic regression is badly needed
  - It could change the practice of science
7 Large Scale Symbolic Regression
- Why choose evolutionary technology?
  - Algorithms are O(N) computational complexity (they scale well)
  - Just-in-time algorithms
  - Algorithms are creative
- Multiple techniques available
  - Genetic Programming
  - Grammatical Evolution
  - Grammatical Swarm Optimization
- Reaching scalability
  - Challenge basic assumptions
  - Recombine disparate techniques into powerful partnerships
  - Use statistics
  - Use computer science
8 Experimental Combinations
- Combined: Hybrid combination of particle swarm agents and GP
- Combined: Hybrid combination of grammar and tree-based GP
- Combined: Hybrid fitness measure supporting symbolic regression and classification of long/short candidates
- Combined: Hybrid combination of multiple island populations and boosting with GP
9 Techniques Employed
- Experimental setup with separate training and testing data sets
- Generate training data with simple and complex models and noise
- High speed compiler generating register speed individuals
- Fitness measure rewards accuracy first then tail classification
- Standard GP using the abstract grammar
- Abstract grammars implemented as particle swarm agents
- Vertical slicing (sort by Y then use every nth training example)
- Hill climbing mutation added to crossover operator
- Context Aware Crossover added to crossover operator
- Standard GP using the MVL grammar
- Island GP using multiple grammars
- Separate island for each boosting run
- Exhaustively search abstract roots in each run
- Standard GP using abstract grammar
- Tournament-of-Champions every five training runs
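Vertical slicing, as described in the list above (sort by Y, then use every nth training example), can be sketched in a few lines. This is a minimal illustration only: the function name `vertical_slice`, the choice to start at the smallest target, and the toy data are assumptions, not details taken from the slides.

```python
def vertical_slice(rows, targets, n):
    """Return every n-th (row, target) pair after sorting by target value."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Indices of the training examples, ordered by ascending target Y.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    picked = order[::n]  # keep every n-th example of the sorted sequence
    return [rows[i] for i in picked], [targets[i] for i in picked]

# Toy data: six examples with two features each.
X = [[0.1, 2.0], [0.4, 1.0], [0.9, 3.0], [0.3, 0.5], [0.7, 2.5], [0.2, 1.5]]
Y = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0]
X_small, Y_small = vertical_slice(X, Y, 2)
print(Y_small)  # [1.0, 3.0, 7.0]
```

Because the sample is taken at regular intervals through the sorted targets, the reduced set still spans most of the range of Y while cutting the number of fitness evaluations by roughly a factor of n.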
10 Simplified Concept Flow
11 Abstract Grammar
- Little Fine-Grain Control: log(x3.2392)/sin(x10 56.341)
- More Fine-Grain Control: log(V1C1)/sin(V2C2)
- Abstract Substitution: Vi choose from X1 through XN, Ci choose any real number
- Swarm Intelligence: Use particle swarm or differential evolution for fine-grain control
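The split between an abstract expression and its fine-grain tuning can be illustrated with a small sketch. It takes the abstract form log(V1C1)/sin(V2C2) from this slide, but the operators joining each variable to its constant did not survive in the slide text, so multiplication is assumed here purely for illustration. The names `template`, `sse`, `fit_constants`, and `search`, the differential evolution settings, and the synthetic data are all hypothetical. The exhaustive loop over variable bindings stands in for whatever search the authors use and is not their GP.

```python
import math
import random

def template(row, v1, v2, c1, c2):
    """Abstract form log(V1*C1)/sin(V2*C2); the '*' operators are assumed."""
    return math.log(row[v1] * c1) / math.sin(row[v2] * c2)

def sse(rows, targets, v1, v2, c1, c2):
    """Sum of squared errors; invalid arithmetic counts as infinitely bad."""
    total = 0.0
    for row, y in zip(rows, targets):
        try:
            total += (template(row, v1, v2, c1, c2) - y) ** 2
        except (ValueError, ZeroDivisionError, OverflowError):
            return math.inf
    return total

def fit_constants(rows, targets, v1, v2, rng, pop_size=20, generations=60,
                  f=0.7, cr=0.9, low=0.1, high=5.0):
    """Differential evolution (rand/1/bin) over the constants (C1, C2)."""
    pop = [[rng.uniform(low, high), rng.uniform(low, high)]
           for _ in range(pop_size)]
    scores = [sse(rows, targets, v1, v2, *p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(2)  # at least one gene comes from the mutant
            trial = [
                a[k] + f * (b[k] - c[k])
                if (rng.random() < cr or k == forced) else pop[i][k]
                for k in range(2)
            ]
            trial_score = sse(rows, targets, v1, v2, *trial)
            if trial_score <= scores[i]:  # greedy replacement
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def search(rows, targets, seed=0):
    """Try every binding of V1 and V2 to a column, tuning constants for each."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    best = None
    for v1 in range(n_vars):
        for v2 in range(n_vars):
            consts, score = fit_constants(rows, targets, v1, v2, rng)
            if best is None or score < best[0]:
                best = (score, v1, v2, consts)
    return best

# Synthetic demo data generated from a known binding (an assumption for testing).
rng = random.Random(1)
rows = [[rng.uniform(0.5, 2.0) for _ in range(3)] for _ in range(40)]
targets = [template(r, 0, 2, 2.0, 1.3) for r in rows]
score, v1, v2, consts = search(rows, targets)
print(f"V1=x{v1 + 1}, V2=x{v2 + 1}, "
      f"C=({consts[0]:.3f}, {consts[1]:.3f}), SSE={score:.3g}")
```

The design point is the separation of concerns: the discrete choice of which columns fill V1 and V2 is handled by one search, while the real-valued constants are tuned by a population-based optimizer that never alters the expression's shape.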
12 Remaining Big Issues
- Time Constraint: Abstract grammars produce better results when given more time.
- Poor Performance on Multi-model Test Cases: GP still having problems on more difficult cases.
- Search Coverage: GP still prematurely converging on local minima.
13 Future Research Steps
- Age-Layered Population Structure as an attempt to avoid premature convergence?
- A Posteriori Fitness Subsets as a directed search for better performance on more difficult multi-model test cases?
- Information Theoretic Fitness Measures as a tool for improving performance on more difficult multi-model test cases?
14 Reviewer Questions Part 1
- Which market index was used for this market neutral study? (and all other questions of this nature)
- Is a fixed five-year training window adequate for all changing market regimes?
- Are there other advantages of Vertical Slicing beyond saving training time?
- Why do you claim classification was good in most cases?
15 Reviewer Questions Part 2
- What is the time scale used, and which is better: a one month hold, a one quarter hold, etc.?
- Is the possibility under consideration that the tool could evolve itself or that it could adapt?
- What is the motivation for retraining on all 1250 samples when only the latest 5 of them are new each week?
- Does this extended context-aware crossover yield a number of evaluations significantly less than that of enumerative search by a constructive procedure that simply builds successively more complex canonical forms?
16 Audience Q&A
- What are the audience questions?
17 Investment Science Corp.