Title: SoftLab, Bogaziçi University Department of Computer Engineering, Software Engineering Research Lab
1. SoftLab
Bogaziçi University Department of Computer Engineering
Software Engineering Research Lab
http://softlab.boun.edu.tr/
2. Research Challenges
- The trend toward large, heterogeneous, distributed software systems leads to an increase in system complexity
- Software and service productivity lags behind requirements
- Increased complexity takes software developers further from stakeholders
- The importance of interoperability, standardisation and reuse of software is increasing
3. Research Challenges
- Service Engineering
- Complex Software Systems
- Open Source Software
- Software Engineering Research
4. Software Engineering Research Approaches
- Balancing theory and praxis
- How engineering research differs from scientific research
- The role of empirical studies
- Models for SE research
5. The need to link research with practice
Colin Potts, "Software Engineering Research Revisited", IEEE Software, September 1993
- Why, after 25 years of SE, has SE research failed to influence industrial practice and the quality of the resulting software?
- Potts argues that this failure is caused by treating research and its application by industry as separate, sequential activities.
- He calls this the research-then-transfer approach. The solution he proposes is the industry-as-laboratory approach.
6. Research-then-Transfer
[Diagram: industrial problems evolve from Problem V1 to Problem V4 while research solutions are refined from Research Solution V1 to Research Solution V4.]
- Wide gulf bridged by indirect, anecdotal knowledge
- Technology transfer gap bridged by hard, but frequently inappropriate technology
- Problem evolves invisibly to the research community
- Incremental refinement of research solutions
7. Research-then-Transfer Problems
- Research and practice evolve separately
- The match between current problems in industry and research solutions is haphazard
- No winners
8. Disadvantages of Research-then-Transfer
- Research problems are described and understood in terms of solution technology, whatever the current research fashion is. The connection to practice is tenuous.
- Concentration is on technical refinement of the research solution. That is fine in itself, but it lacks industrial need as a focus, so effort may be misplaced.
- Evaluation is difficult because research solutions may use technology that is not commonly used in industry.
- Delay in evaluation means the problem researchers are solving has often evolved through changes in business practice, technology, etc.
- Transfer is difficult because industry has little basis for confidence in the proposed research solution.
9. Industry-as-Laboratory Approach to SE research
[Diagram: each problem version is paired directly with a research solution, from Problem V1 / Research Solution V1 through Problem V4 / Research Solution V4.]
10. Advantages of Industry-as-Laboratory Approach
- Stronger connection at the start, because knowledge of the problem is acquired from real practitioners in industry, often industrial partners in a research consortium.
- The connection is strengthened by practitioners and researchers constantly interacting to develop the solution.
- Early evaluation and usage by industry lessens the technology transfer gap.
- Reliance on empirical research
- Shift from solution-driven SE to problem-focused SE
- Solve problems that really do matter to practitioners
11. Early SEI industrial survey research
- What an SEI survey learned from industry:
- There was a thin spread of domain knowledge in most projects.
- Customer requirements were extremely volatile.
- These findings point towards research combining work on requirements engineering with reuse, instead of researching these topics in separate SE research communities, as is still found today!
- From "A Field Study of the Software Design Process for Large Systems", CACM, November 1988.
12. Further Results from Potts et al. Early 90s Survey
- 23 software development organizations (during 1990-92). The survey focused on the requirements modeling process.
- Requirements were invented, not elicited.
- Most development is maintenance.
- Most specification is incremental.
- Domain knowledge is important.
- There is a gulf between the developer and the user.
- User-interface requirements continually change.
- There is a preference for office-automation tools over CASE tools to support development, i.e. developers found a word processor and a database more useful than any CASE tool.
13. Industry-as-Laboratory emphasizes Real Case Studies
- Advantages of case studies over studying problems in the research lab:
- Scale and complexity: small, simple (even simplistic) cases are avoided, as these often bear little relation to real problems.
- Unpredictability: assumptions are thrown out as researchers learn more about real problems.
- Dynamism: a real case study is more vital than a textbook account.
- The real-world complications of industrial case studies are more likely to throw up representative problems and phenomena than research laboratory examples influenced by the researchers' preconceptions.
14. Need to consider Human/Social Context in SE research
- Not all solutions in software engineering are solely technical.
- There is a need to examine organizational, social and cognitive factors systematically as well.
- Many problems are people problems, and require people-oriented solutions.
15. Theoretical SE research
- While there is still a place for innovative, purely speculative research in software engineering, research which studies real problems in partnership with industry needs to be given a higher profile.
- These various forms of research ideally complement one another.
- Neither is particularly successful if it ignores the other.
- Research that is too industrially focused may lack adequate theory!
- Academically focused research may miss the practice!
16. Research models for SE
- Problem highlighted by Glass:
- Most SE research in the 1990s was advocacy research. Better research models are needed.
- The software crisis provided the platform on which most 90s research was founded.
- SE research ignored practice for the most part; lack of practical application and evaluation were gaping holes in most SE research.
- Appropriate research models for SE are needed.
- Robert Glass, "The Software-Research Crisis", IEEE Software, November 1994
17. Methods underlying Models
- Scientific method
- Engineering method
- Empirical method
- Analytical method
- From W. R. Adrion, "Research Methodology in Software Engineering", ACM SE Notes, Jan. 1993
18. Scientific method
Observe real world
Propose a model or theory of some real world phenomena
Measure and analyze above
Validate hypotheses of the model or theory
If possible, repeat
19. Engineering method
Observe existing solutions
Propose better solutions
Build or develop better solution
Measure, analyze, and evaluate
Repeat until no further improvements are possible
20. Empirical method
Propose a model
Develop statistical or other basis for the model
Apply to case studies
Measure and analyze
Validate and then repeat
21. Analytical method
Propose a formal theory or set of axioms
Develop a theory
Derive results
If possible, compare with empirical observations
Refine theory if necessary
22. Need to move away from purely analytical method
- The analytical method was the most widely used in mid-90s SE research, but the others need to be considered and may be more appropriate in some SE research.
- Good research practice combines elements of all these approaches.
23. Four important phases for any SE research project (Glass)
- Informational phase: gather or aggregate information via
  - reflection
  - literature survey
  - people/organization survey
  - case studies
- Propositional phase: propose and build a hypothesis, method or algorithm, model, theory or solution
- Analytical phase: analyze and explore the proposal, leading to a demonstration and/or formulation of a principle or theory
- Evaluation phase: evaluate the proposal or analytic findings by means of experimentation (controlled) or observation (uncontrolled, such as a case study or protocol analysis), leading to a substantiated model, principle, or theory
24. Software Engineering Research Approaches
- The Industry-as-Laboratory approach links theory and praxis
- Engineering research aims to improve existing processes and/or products
- Empirical studies are needed to validate software engineering research
- Models for SE research need to shift from the analytic to the empirical
25. Empirical SE Research
26. SE Research
- Intersection of AI and software engineering
- An opportunity to use some of the most interesting computational techniques to solve some of the most important and rewarding questions
27. AI Fields, Methods and Techniques
28. What Can We Learn From Each Other?
29. Software Development Reference Model
Intersection of AI and SE Research
Empirical Software Engineering
30. Intersection of AI and SE Research
- Build oracles to predict
  - Defects
  - Cost and effort
  - Refactoring
- Measure
  - Static code attributes
  - Complexity and call graph structure
- Data collection
  - Open repositories (NASA, PROMISE)
  - Open source
  - Softlab Data Repository (SDR)
31. Software Engineering Domain
- Classical ML applications
- Data miner performance
- The more data, the better the performance
- Little or no meaning behind the numbers, no interesting stories to tell
32. Software Engineering Domain
- Algorithm performance
- Understanding data
- Change training data: over-, under- and micro-sampling
- Noise analysis
- Increase information content of data
- Feature analysis/weighting
- Learn what you will predict later
- Cross-company vs within-company data
- Domain knowledge
- SE
- ML
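The "change training data" bullet mentions over-, under- and micro-sampling without giving details. As a rough illustration of under-sampling only, the sketch below randomly drops rows of the larger class until both classes are the same size. The function name `undersample`, the `(features, label)` row layout and the sample data are assumptions made for this example, not something taken from the slides.

```python
import random

def undersample(rows, seed=0):
    """Randomly drop majority-class rows until both classes are the same size.

    `rows` is a list of (features, label) pairs; label 1 = defective, 0 = not.
    """
    rng = random.Random(seed)
    defective = [r for r in rows if r[1] == 1]
    clean = [r for r in rows if r[1] == 0]
    # The shorter list is the minority class, the longer one the majority.
    minority, majority = sorted([defective, clean], key=len)
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: 2 defective modules and 8 defect-free ones.
rows = [([10 * i, i], 1 if i < 2 else 0) for i in range(10)]
balanced = undersample(rows)
```

Over-sampling would go the other way (duplicating minority rows); which of the two suits a given data set is an empirical question.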
33. In Practice
- Product quality
  - Lower defect rates
  - Less costly testing times
  - Low maintenance cost
- Process quality
  - Effort and cost estimation
  - Process improvement
34. Software Engineering Research
- Predictive Models
- Defect prediction and cost estimation
- Bioinformatics
- Process Models
- Quality Standards
- Measurement
35. Major Research Areas
- Software Measurement
- Defect Prediction/Estimation
- Effort/Cost Estimation
- Process Improvement (CMM)
36. Defect Prediction
- Software development lifecycle
  - Requirements
  - Design
  - Development
  - Test (takes 50% of overall time)
- Detect and correct defects before delivering software.
- Test strategies
  - Expert judgment
  - Manual code reviews
  - Oracles/predictors as secondary tools
37. A Testing Workbench
38. Static Code Attributes
    #include <stdio.h>

    int sum(int a, int b);  /* assumed to be defined elsewhere */

    int main(void)
    {
        // This is a sample code
        // Declare variables
        int a, b, c;
        // Initialize variables
        a = 2;
        b = 5;
        // Find the sum and display c if greater than zero
        c = sum(a, b);
        if (c > 0)
            printf("%d\n", c);
        return 0;
    }
- LOC: lines of code
- LOCC: lines of commented code
- V: number of unique operands and operators
- CC: cyclomatic complexity
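To make the attribute list concrete, here is a minimal sketch of how LOC, LOCC and an approximate CC could be counted for C-like source. It assumes that LOC means non-blank lines, that only `//` comments count towards LOCC, and that CC is one plus the number of decision points (`if`, `for`, `while`, `case`, `&&`, `||`). The function name `static_attributes` is hypothetical, and V (unique operands and operators) is left out because it needs a real tokenizer.

```python
import re

# Decision points used for a rough cyclomatic-complexity count.
DECISION_POINTS = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\|")

def static_attributes(source):
    """Return rough LOC, LOCC and CC counts for C-like source text.

    Only `//` comments are recognised; block comments and string literals
    are ignored for brevity.
    """
    loc = locc = decisions = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are not counted
        loc += 1
        code, marker, _comment = stripped.partition("//")
        if marker:
            locc += 1
        # Count decision points in the code part only, not in comments.
        decisions += len(DECISION_POINTS.findall(code))
    return {"LOC": loc, "LOCC": locc, "CC": decisions + 1}

sample = """
int main(void)
{
    // Find the sum and display c if greater than zero
    int c = sum(2, 5);
    if (c > 0)
        printf("%d\\n", c);
    return 0;
}
"""
metrics = static_attributes(sample)
```

Metric tools differ on details such as whether comment-only lines count towards LOC, so the numbers from this sketch are not directly comparable with those of a production parser.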
39. Defect Prediction
- Machine learning based models
- Defect density estimation
- Defect prediction between versions
- Defect prediction for embedded systems
- "Software Defect Identification Using Machine Learning Techniques", E. Ceylan, O. Kutlubay, A. Bener, EUROMICRO SEAA, Dubrovnik, Croatia, August 28th - September 1st, 2006
- "Mining Software Data", B. Turhan and O. Kutlubay, Data Mining and Business Intelligence Workshop in ICDE'07, Istanbul, April 2007
- "A Two-Step Model for Defect Density Estimation", O. Kutlubay, B. Turhan and A. Bener, EUROMICRO SEAA, Lübeck, Germany, August 2007
- "Defect Prediction for Embedded Software", A. D. Oral and A. Bener, ISCIS 2007, Ankara, November 2007
- "A Defect Prediction Method for Software Versioning", Y. Kastro and A. Bener, Software Quality Journal (in print)
- "Ensemble of Defect Predictors: An Industrial Application in Embedded Systems Domain", A. Tosun, B. Turhan, A. Bener and N. I. Ulgur, ESEM 2008
- B. Turhan, A. Tosun and A. Bener, "An Industrial Application of Classifier Ensembles for Locating Software Defects", submitted to Information and Software Technology Journal, 2008
40. Constructing Predictors
- Baseline: Naive Bayes
- Why? Best reported results so far (Menzies et al., 2007)
- Remove assumptions and construct different models
- Independent attributes -> multivariate distribution
- Attributes of equal importance -> weighted Naive Bayes
- "Software Defect Prediction Heuristics for Weighted Naïve Bayes", B. Turhan and A. Bener, ICSOFT 2007, Barcelona, Spain, July 2007
- "Software Defect Prediction Modeling", B. Turhan, IDOESE 2007, Madrid, Spain, September 2007
- "Yazilim Hata Kestirimi için Kaynak Kod Ölçütlerine Dayali Bayes Siniflandirmasi" (in Turkish; roughly "Bayesian Classification Based on Source Code Metrics for Software Defect Prediction"), UYMS 2007, Ankara, September 2007
- "A Multivariate Analysis of Static Code Attributes for Defect Prediction", B. Turhan and A. Bener, QSIC 2007, Portland, USA, October 2007
- "Weighted Static Code Attributes for Defect Prediction", B. Turhan and A. Bener, SEKE 2008, San Francisco, July 2008
- B. Turhan and A. Bener, "Analysis of Naive Bayes' Assumptions on Software Fault Data: An Empirical Study", Data and Knowledge Engineering Journal, 2008, in print
- B. Turhan, A. Tosun and A. Bener, "An Industrial Application of Classifier Ensembles for Locating Software Defects", submitted to Data and Knowledge Engineering Journal, 2008
- B. Turhan, A. Bener and G. Kocak, "Data Mining Source Code for Locating Software Bugs: A Case Study in Telecommunication Industry", submitted to Expert Systems with Applications Journal, 2008
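The slide names Naive Bayes as the baseline and a weighted variant as one way of relaxing the "attributes of equal importance" assumption. The sketch below shows one common formulation, assumed here for illustration and not necessarily the one used in the cited papers: a Gaussian Naive Bayes whose per-attribute log-likelihoods are multiplied by weights, so that all weights equal to 1 gives the plain baseline. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit(X, y):
    """Estimate class priors and per-attribute Gaussian parameters."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(rows) / len(X), stats)
    return model

def log_gaussian(x, mean, var):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row, weights=None):
    """Return the class with the highest (weighted) log-posterior."""
    weights = weights or [1.0] * len(row)
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            w * log_gaussian(x, mean, var)
            for w, x, (mean, var) in zip(weights, row, stats)
        )
    return max(scores, key=scores.get)

# Hypothetical modules described by (LOC, cyclomatic complexity); 1 = defective.
X = [[20, 2], [35, 3], [40, 4], [300, 25], [280, 30], [350, 28]]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
```

Calling `predict(model, [320, 27], weights=[0.0, 1.0])` ignores LOC entirely and classifies on complexity alone, which is the kind of experiment attribute weighting makes possible.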
41. WC vs CC Data for Defects?
- When to use within-company (WC) or cross-company (CC) data?
- How much data do we need to construct a model?
- "Implications of Ceiling Effects in Defect Predictors", T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, Y. Jiang, PROMISE 2008, Leipzig, Germany, May 2008
- "Nearest Neighbor Sampling for Cross Company Defect Predictors", B. Turhan, A. Bener, T. Menzies, DEFECTS 2008, Seattle, USA, July 2008
- "On the Relative Value of Cross-company and Within-Company Data for Defect Prediction", B. Turhan, T. Menzies, A. Bener, J. Distefano, Empirical Software Engineering Journal, 2008, in print
- T. Menzies, Z. Milton, B. Turhan, Y. Jiang, G. Gay, B. Cukic, A. Bener, "Overcoming Ceiling Effects in Defect Prediction", submitted to IEEE Transactions on Software Engineering, 2008
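The slide does not describe how cross-company data is sampled, so the following is only a sketch of the general idea suggested by the nearest-neighbour sampling paper listed above: for each within-company row, keep the k closest cross-company rows and train on the union. The function name `nn_filter`, the value of k and the use of unnormalised Euclidean distance are assumptions for this example.

```python
import math

def nn_filter(cc_rows, wc_rows, k=2):
    """Select the cross-company rows closest to the within-company rows.

    Each row is a list of numeric attributes. For every within-company row
    the k nearest cross-company rows (Euclidean distance) are kept, and the
    union of those selections is returned in its original order.
    """
    selected = set()
    for wc in wc_rows:
        ranked = sorted(range(len(cc_rows)),
                        key=lambda i: math.dist(cc_rows[i], wc))
        selected.update(ranked[:k])
    return [cc_rows[i] for i in sorted(selected)]

# Hypothetical (LOC, complexity) rows from other companies and from our own.
cc = [[10, 1], [12, 2], [200, 30], [500, 60], [15, 1]]
wc = [[11, 1], [14, 2]]
filtered = nn_filter(cc, wc, k=2)
```

In practice the attributes would usually be normalised first so that a large-valued metric such as LOC does not dominate the distance.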
42. Module Structure vs Defect Rate
- Fan-in, fan-out
- PageRank algorithm
- Dependency graph information
- "Small is beautiful"
- G. Koçak, B. Turhan, A. Bener, "Software Defect Prediction Using Call Graph Based Ranking Algorithm", Euromicro 2008
- G. Kocak, B. Turhan and A. Bener, "Predicting Defects in a Large Telecommunication System", ICSOFT'08
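As a sketch of how fan-in and a PageRank-style score could be derived from call graph information, the code below runs a plain power-iteration PageRank over a dictionary that maps each module to the modules it calls. The damping factor, the iteration count, the module names and the way the ranks would feed a defect predictor are all assumptions; the slides do not specify them.

```python
def page_rank(calls, damping=0.85, iterations=50):
    """Power-iteration PageRank over a call graph.

    `calls` maps each module to the list of modules it calls (its fan-out).
    Every callee must also appear as a key.
    """
    modules = list(calls)
    n = len(modules)
    rank = {m: 1.0 / n for m in modules}
    for _ in range(iterations):
        # Rank held by modules that call nothing is spread evenly.
        dangling = sum(rank[m] for m in modules if not calls[m])
        new_rank = {m: (1 - damping) / n + damping * dangling / n
                    for m in modules}
        for caller, callees in calls.items():
            for callee in callees:
                new_rank[callee] += damping * rank[caller] / len(callees)
        rank = new_rank
    return rank

# Hypothetical call graph: three modules all call `util`.
calls = {"ui": ["logic", "util"], "logic": ["util"], "io": ["util"], "util": []}
rank = page_rank(calls)
# Fan-in: how many modules call each module.
fan_in = {m: sum(m in callees for callees in calls.values()) for m in calls}
```

Here `util` gets both the highest fan-in and the highest rank, which matches the intuition that heavily called modules are structurally important.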
43. COST ESTIMATION
- Cost estimation: predicting the effort required to develop a new software project
- Effort: the number of months one person would need to develop a given project (person-months, PM)
- Cost estimation (CE) assists project managers when they make important decisions (bidding, planning, resource allocation)
- Underestimation -> approving projects that would then exceed their budgets
- Overestimation -> waste of resources
- Modeling accurate, robust cost estimators -> successful software project management
44. COST ESTIMATION
- Understanding the data structure?
- Cross- vs. within-application domain: embedded software domain
- Better predictor?
- Point estimation: a single effort value is estimated
- Interval estimation: effort intervals are estimated
- Cost classification: dynamic intervals, classification algorithms, point estimates
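The cost classification idea (intervals, a classifier, then a point estimate) can be sketched as follows. This is an illustration under stated assumptions and not the method from the cited papers: equal-frequency binning stands in for "dynamic intervals", a 1-nearest-neighbour rule stands in for "classification algorithms", and the median effort of the predicted interval is used as the point estimate. All names and data are hypothetical.

```python
import math
import statistics

def make_intervals(efforts, n_bins=3):
    """Split sorted effort values into n_bins equal-frequency intervals."""
    ordered = sorted(efforts)
    size = math.ceil(len(ordered) / n_bins)
    chunks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [(chunk[0], chunk[-1]) for chunk in chunks]

def interval_of(effort, intervals):
    """Index of the first interval whose upper bound covers the effort."""
    for index, (_low, high) in enumerate(intervals):
        if effort <= high:
            return index
    return len(intervals) - 1

def classify(project, train_features, train_classes):
    """1-nearest-neighbour stand-in for a real classification algorithm."""
    nearest = min(range(len(train_features)),
                  key=lambda i: math.dist(train_features[i], project))
    return train_classes[nearest]

# Hypothetical projects: (size in KLOC, team size) and effort in person-months.
features = [[5, 2], [8, 3], [20, 5], [25, 6], [60, 10], [80, 12]]
efforts = [10, 14, 40, 55, 150, 210]
intervals = make_intervals(efforts)
classes = [interval_of(e, intervals) for e in efforts]
predicted = classify([22, 5], features, classes)
low, high = intervals[predicted]
point_estimate = statistics.median(e for e in efforts if low <= e <= high)
```

The interval `(low, high)` is the interval estimate; the median inside it is one simple way to turn that back into a single number.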
45. COST ESTIMATION
- How can we achieve accurate estimates with a limited amount of effort data?
- Feature subset selection: save the cost of extracting less important features
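One simple way to pick a feature subset, shown here only as an assumed example, is to rank features by the absolute value of their Pearson correlation with effort and keep the top few. The slides do not say which selection method was used; the feature names and data below are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)

def rank_features(names, rows, efforts):
    """Order feature names by |correlation| with effort, strongest first."""
    columns = list(zip(*rows))
    scores = {name: abs(pearson(col, efforts))
              for name, col in zip(names, columns)}
    return sorted(names, key=scores.get, reverse=True)

# Hypothetical project data; `office_floor` is deliberately irrelevant.
names = ["kloc", "team_size", "office_floor"]
rows = [[5, 2, 3], [8, 3, 1], [20, 5, 3], [25, 6, 1], [60, 10, 2], [80, 12, 2]]
efforts = [10, 14, 40, 55, 150, 210]
ranking = rank_features(names, rows, efforts)
selected = ranking[:2]
```

Dropping low-ranked features means they no longer have to be collected for new projects, which is the cost saving the slide refers to.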
46. Cost Estimation
- Comparison of ML based models with parametric models
- Feature ranking
- COCOMO81, COCOMO2, COQUALMO
- Cost estimation as a classification problem (interval prediction)
- "Mining Software Data", B. Turhan and O. Kutlubay, Data Mining and Business Intelligence Workshop in ICDE'07, Istanbul, April 2007
- "Software Effort Estimation Using Machine Learning Methods", B. Baskeles, B. Turhan, A. Bener, ISCIS 2007, Ankara, November 2007
- "Evaluation of Feature Extraction Methods on Software Cost Estimation", B. Turhan, O. Kutlubay, A. Bener, ESEM 2007, Madrid, Spain, September 2007
- "ENNA: Software Effort Estimation Using Ensemble of Neural Networks with Associative Memory", Y. Kültür, B. Turhan, A. Bener, FSE 2008
- "Software Cost Estimation as a Classification Problem", A. Bakir, B. Turhan, A. Bener, ICSOFT 2008
- B. Turhan, A. Bakir and A. Bener, "A Comparative Study for Estimating Software Development Effort Intervals", submitted to Knowledge Based Systems Journal, 2008
- B. Turhan, Y. Kultur and A. Bener, "Ensemble of Neural Networks with Associative Memory (ENNA) for Estimating Software Development Costs", submitted to Knowledge Based Systems Journal, 2008
- A. Tosun, B. Turhan, A. Bener, "Feature Weighting Heuristics for Analogy Based Effort Estimation Models", submitted to Expert Systems with Applications, 2007
- A. Bakir, B. Turhan and A. Bener, "A New Perspective on Data Homogeneity for Software Cost Estimation", submitted to Software Quality Journal, 2008
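For readers unfamiliar with the parametric models being compared, the basic form of COCOMO 81 estimates effort in person-months as a * KLOC^b, with published coefficients for three project modes. The sketch below uses that basic form only; the slides do not say which COCOMO variant or calibration was used in the experiments, and the function name is hypothetical.

```python
# Basic COCOMO 81 coefficients (a, b) for the three project modes.
COEFFICIENTS = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}

def basic_cocomo_effort(kloc, mode="organic"):
    """Effort in person-months for `kloc` thousand delivered source lines."""
    a, b = COEFFICIENTS[mode]
    return a * kloc ** b

effort = basic_cocomo_effort(10, "organic")
```

For a 10 KLOC organic project this gives roughly 27 person-months; the ML based models on the slide are trained on project data where a parametric model like this relies on fixed coefficients.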
47. Prest
- A tool developed by Softlab
- Parser
  - C, Java, C++, JSP
- Metric collection
- Data analysis
48. Data Sources
- Public datasets
  - NASA (IV&V Facility, Metrics Program)
  - PROMISE (Software Engineering Repository)
    - Includes Softlab data now
  - Open source projects (Sourceforge, Linux, etc.)
- Internet based small datasets
  - University of Southern California (USC) dataset
  - Desharnais dataset
  - ISBSG dataset
  - NASA COCOMO and NASA 93 datasets
- Softlab Data Repository (SDR)
  - Local industry collaboration
  - In total 20 companies and 25 projects over 5 years
49. Process Automation
- UML refactoring
- Class diagram source code
- Tool
- Algorithm (graph based)
- What needs to be refactored
- Complexity vs call graphs
- Y. Kösker and A. Bener, "Synchronization of UML Based Refactoring with Graph Transformation", SEKE 2007, Boston, July 9-11, 2007
- B. Turhan, Y. Kosker and A. Bener, "An Expert System for Determining Candidate Software Classes for Refactoring", submitted to Expert Systems with Applications Journal, 2008
- Y. Kosker, A. Bener and B. Turhan, "Refactoring Prediction Using Class Complexity Metrics", ICSOFT'08, 2008
- B. Turhan, A. Bener and Y. Kosker, "Tekrar Tasarim Gerektiren Siniflarin Karmasiklik Olcutleri Kullanilarak Modellenmesi" (in Turkish; roughly "Modeling Classes That Require Redesign Using Complexity Metrics"), 2. Ulusal Yazilim Mimarisi Konferansi (UYMK'08), 2008
50. Process Improvement and Assessment
- A case in the health care industry
- Process improvement with CMMI
- Requirements management
- Change management
- Comparison: a before-and-after evaluation
- Lessons learned
- A. Tosun, B. Turhan and A. Bener, "The Benefits of a Software Quality Improvement Project in a Medical Software Company: A Before and After Comparison", invited paper and keynote speech at the International Symposium on Health Informatics and Bioinformatics (HIBIT'08), 2008
51. Metrics Program in Telecom
- Metrics extraction from 25 Java applications
  - Static code attributes (McCabe, Halstead and LOC metrics)
  - Call graph information (caller-callee relations between modules)
  - Information from four versions (one version every two weeks)
  - Product and test defects (pre-release defects)
- Various experimental designs for predicting fault-prone files
  - Discard versions: treat all applications in a version as a single project
  - Predict fault-prone parts of each application
    - Using previous versions of all the applications
    - Using previous versions of the selected application
- Additionally,
  - Optimization of the local prediction model using call graph metrics
  - Refactoring prediction using class complexity metrics
52. Matching requirements with defects
[Diagram: lifecycle phases (Requirements Analysis, Design, Coding, Test, Maintenance) annotated with Call Graph / Refactoring, Test driven development, Defect prediction and Refactoring.]
53. Emerging Research Topics
- Adding organizational factors to the local prediction model
  - Information about the development team, experience, coding practices, etc.
- Adding file metrics from version history
  - Modified/added/deleted lines of code
  - Selecting only modified files from each version in the prediction model
- Confidence factor
- Using time factors
- Dynamic prediction: constructing a model
  - for each application in a version
  - for each module/package in an application
  - for each developer, by learning from his/her coding habits
- TDD
  - Measuring test coverage
  - Defect proneness
  - Company-wide implementation process
  - Embedded systems
- Cost/effort estimation
  - Dynamic estimation per process
- Bioinformatics